Skip to content

GH-50348: [C++] Fix duplicate keys issue in KeyValueMetadata#50349

Open
OmBiradar wants to merge 2 commits into
apache:mainfrom
OmBiradar:make-keyvaluemetadata-efficient
Open

GH-50348: [C++] Fix duplicate keys issue in KeyValueMetadata#50349
OmBiradar wants to merge 2 commits into
apache:mainfrom
OmBiradar:make-keyvaluemetadata-efficient

Conversation

@OmBiradar

@OmBiradar OmBiradar commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

KeyValueMetadata uses 2 vectors internally to manage the keys and the values. This causes it to accept duplicate keys. When new key-value pairs are appended such that the key already exists in the metadata, the metadata just accepts it as a duplicate, but while finding the key, it returns the first inserted value, not the last updated value. This causes the updated key added using the Append function to become "virtually non-existant". I use an hashmap - unordered_map to store pairs of (key, index) where index is the index of the key in the vector keys_. This enables significant performance gains also in some methods of KeyValueMetadata.

What changes are included in this PR?

  • I added an unordered_map along with the existing vectors to keep track of the key value pairs in the metadata.
  • Logic to avoid duplicate keys is added.
  • If a user adds an already existing key, then the existing key's value gets update rather than making a duplicate entry.
  • Reduces the time complexity of FindKey from O(n) to O(1).
  • Equals works properly with the new logic

Are these changes tested?

I have added tests in key_value_metadata_test.cc testing the duplicate keys scenario.
I tested them locally, all tests pass.

Are there any user-facing changes?

No, I tried to keep the API as it is.

Signed-off-by: OmBiradar <ombiradar04@gmail.com>
@OmBiradar OmBiradar requested a review from pitrou as a code owner July 3, 2026 00:50
Copilot AI review requested due to automatic review settings July 3, 2026 00:50
@OmBiradar OmBiradar changed the title Fix duplicate keys issue in KeyValueMetadata GH-50348: [C++] Fix duplicate keys issue in KeyValueMetadata Jul 3, 2026
@github-actions github-actions Bot added the awaiting review Awaiting review label Jul 3, 2026
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown

⚠️ GitHub issue #50348 has been automatically assigned in GitHub to PR creator.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make arrow::KeyValueMetadata enforce unique keys (updating values on re-insert) and speed up key lookups by adding an internal unordered_map index, addressing duplicate-key bugs described in GH-50348.

Changes:

  • Add an internal unordered_map index (map_) to provide O(1) key lookup and enable in-place value updates for duplicate inserts.
  • Update Append, FindKey, and deletion logic to use/maintain the index.
  • Add a unit test covering duplicate-key append semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
cpp/src/arrow/util/key_value_metadata.h Adds map_ member to index key → vector position.
cpp/src/arrow/util/key_value_metadata.cc Implements map-backed append/find and attempts to keep the index in sync across mutations.
cpp/src/arrow/util/key_value_metadata_test.cc Adds test coverage for duplicate-key behavior (and suggests extending coverage for bulk deletions + vector-construction).
Comments suppressed due to low confidence (1)

cpp/src/arrow/util/key_value_metadata.cc:142

  • DeleteMany mutates keys_/values_ but doesn’t rebuild map_. After this PR, lookups (FindKey, Contains, Get, Set, Append) depend on map_, so DeleteMany can return stale indices or “find” deleted keys unless map_ is updated.
  }
  return Status::OK();
}

Status KeyValueMetadata::DeleteMany(std::vector<int64_t> indices) {

Comment on lines +64 to +76
ARROW_CHECK_EQ(keys_.size(), values_.size());
std::vector<std::string> unique_keys, unique_values;
for (size_t i = 0; i < keys_.size(); i++) {
if (map_.find(keys_[i]) == map_.end()) {
map_[keys_[i]] = i;
unique_keys.push_back(keys_[i]);
unique_values.push_back(values_[i]);
} else {
unique_values[map_[keys_[i]]] = values_[i];
}
}
keys_ = std::move(unique_keys);
values_ = std::move(unique_values);
Comment on lines 79 to 86
KeyValueMetadata::KeyValueMetadata(std::vector<std::string> keys,
std::vector<std::string> values)
: keys_(std::move(keys)), values_(std::move(values)) {
ARROW_CHECK_EQ(keys.size(), values.size());
for (size_t i = 0; i < keys_.size(); i++) {
map_[keys_[i]] = i;
}
}
"of size ",
keys_.size());
}
map_.erase(map_.find(keys_[index]));
Comment on lines +243 to +247
KeyValueMetadata metadata2;
metadata2.Append("k", "v2");

ASSERT_TRUE(metadata1.Equals(metadata2));
}
@OmBiradar

OmBiradar commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

I am still refactoring and adding a comprehensive test suite

Signed-off-by: OmBiradar <ombiradar04@gmail.com>
@OmBiradar

OmBiradar commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

I see tests failing due to pyarrow having duplicate keys support.

        # test duplicate key support
        md = pa.KeyValueMetadata([
            ('a', 'alpha'),
            ('b', 'beta'),
            ('a', 'Alpha'),
            ('a', 'ALPHA'),
        ])
    
        expected = [
            (b'a', b'alpha'),
            (b'b', b'beta'),
            (b'a', b'Alpha'),
            (b'a', b'ALPHA')
        ]
>       assert len(md) == 4
E       assert 2 == 4
E        +  where 2 = len(\n-- metadata --\na: ALPHA\nb: beta)

Before opening the issue #50348, I had put up the idea to eliminate the duplicate keys support in KeyValueMetadata in arrow community meeting with @ianmcook, as it has problems with value retrieval and locating keys.

I received positive feedback over his, and so I decided to approach this.

@pitrou it would be really nice to have your feedback on this matter. My main question is should duplicates be allowed in KeyValueMetadata?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants