Skip to content

[C++] KeyValueMetadata accepts duplicate keys #50348

Description

@OmBiradar

I belive the KeyValueMetadata should not keep store of duplicate values of a key.
In case a user passes a key value pair such that the key already exists in the metadata, then it should update the existing key's value to the new value, thus loosing the previous value.

Currently when we pass a key value pair - (k, v2) which already exists in the metadata as - (k, v1), it stores it as it is, along with the old key value pair i.e. two paris - (k, v1), (k, v2), and while retrival, we always get the old value of v1 rather than the newest value v2. This also causes the metadata to bloat.

Another consequence of accepting duplicate key pairs is the Equal not working properly.
If suppose we had 2 metadatas - meta1 and meta2,
we insert into them the pairs in order:
meta1 -> (k1, v1) then (k1, v2)
meta2 -> (k1, v2) then (k1, v1)

They are actually equal, but meta1.Equals(meta2) always returns false. This is due to the logic through which the Equals works. It uses ArgSort to only sort the keys, thus the values remain unsorted and cause invalid comparisons. By removing the possibility of any duplicate keys being present, we can avoid this.

Another consequence of this is that the Set and FindKey functions only set/find the first key match, and ignore any duplicate keys produced later.

The script I used to test this:

#include <arrow/util/key_value_metadata.h>
#include <iostream>

using arrow::Status;
using std::cin;
using std::cout;
using std::endl;

void reproduce_duplicate_keys() {
  arrow::KeyValueMetadata meta1, meta2;
  meta1.Append("k1", "v1");
  meta1.Append("k1", "v2");
  meta2.Append("k1", "v2");
  meta2.Append("k1", "v1");
  cout << meta1.ToString() << endl;
  cout << meta2.ToString() << endl;
  cout << endl;
  cout << (meta1.Equals(meta2) ? "true" : "false") << endl;
  // returns false as the keys are already "sorted", but the values are
  // different for the duplicate keys

  std::unordered_map<std::string, std::string> map;
  meta1.ToUnorderedMap(&map);

  // Here, the unordered map completely misses the duplicate values
  // which suggests that the KeyValueMetadata should only have unique keys
  for (auto [k, v] : map) {
    cout << k << ": " << v << endl;
  }

  // also if find key is used, it only returns the first key found
  // and not an array of keys found
  // there is NO WAY TO FIND THE OTHER DUPLICATE KEY
  cout << "`k1` only found at : " << meta1.FindKey("k1") << endl;

  // even the `Set` fn to change the value of a key
  // only can ever change 1 key (which ever is first in order)
  meta1.Set("k1", "v3");

  cout << meta1.ToString() << meta2.ToString() << endl;

  // even the delete depends on what metadata comes first
  meta1.Delete("k1");
  meta2.Delete("k1");

  cout << meta1.ToString() << endl;
  cout << meta2.ToString() << endl;
}

void DuplicateKeysAndConsequences() {
  // The Metadata accepts duplicate keys, but when comparing it gives wrong
  // results. The comparision always using ArgSort to sort the keys, but never
  // the values. Thus as it accepts duplicate keys, it also messes up much of
  // the logic in it's functions
  reproduce_duplicate_keys();
}

int main() {
  cout << "Environment to test out bugs related to the KeyValueMetadata class"
       << endl;

  DuplicateKeysAndConsequences();
}

Output:

-- metadata --
k1: v1
k1: v2

-- metadata --
k1: v2
k1: v1

false
k1: v1
`k1` only found at : 0

-- metadata --
k1: v3
k1: v2
-- metadata --
k1: v2
k1: v1

-- metadata --
k1: v2

-- metadata --
k1: v1

Component(s)

C++

Metadata

Metadata

Assignees

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions