GH-50348: [C++] Fix duplicate keys issue in KeyValueMetadata#50349
GH-50348: [C++] Fix duplicate keys issue in KeyValueMetadata#50349OmBiradar wants to merge 2 commits into
KeyValueMetadata#50349Conversation
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
KeyValueMetadataKeyValueMetadata
|
|
There was a problem hiding this comment.
Pull request overview
This PR aims to make arrow::KeyValueMetadata enforce unique keys (updating values on re-insert) and speed up key lookups by adding an internal unordered_map index, addressing duplicate-key bugs described in GH-50348.
Changes:
- Add an internal
unordered_mapindex (map_) to provide O(1) key lookup and enable in-place value updates for duplicate inserts. - Update
Append,FindKey, and deletion logic to use/maintain the index. - Add a unit test covering duplicate-key append semantics.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| cpp/src/arrow/util/key_value_metadata.h | Adds map_ member to index key → vector position. |
| cpp/src/arrow/util/key_value_metadata.cc | Implements map-backed append/find and attempts to keep the index in sync across mutations. |
| cpp/src/arrow/util/key_value_metadata_test.cc | Adds test coverage for duplicate-key behavior (and suggests extending coverage for bulk deletions + vector-construction). |
Comments suppressed due to low confidence (1)
cpp/src/arrow/util/key_value_metadata.cc:142
DeleteManymutateskeys_/values_but doesn’t rebuildmap_. After this PR, lookups (FindKey,Contains,Get,Set,Append) depend onmap_, soDeleteManycan return stale indices or “find” deleted keys unlessmap_is updated.
}
return Status::OK();
}
Status KeyValueMetadata::DeleteMany(std::vector<int64_t> indices) {
| ARROW_CHECK_EQ(keys_.size(), values_.size()); | ||
| std::vector<std::string> unique_keys, unique_values; | ||
| for (size_t i = 0; i < keys_.size(); i++) { | ||
| if (map_.find(keys_[i]) == map_.end()) { | ||
| map_[keys_[i]] = i; | ||
| unique_keys.push_back(keys_[i]); | ||
| unique_values.push_back(values_[i]); | ||
| } else { | ||
| unique_values[map_[keys_[i]]] = values_[i]; | ||
| } | ||
| } | ||
| keys_ = std::move(unique_keys); | ||
| values_ = std::move(unique_values); |
| KeyValueMetadata::KeyValueMetadata(std::vector<std::string> keys, | ||
| std::vector<std::string> values) | ||
| : keys_(std::move(keys)), values_(std::move(values)) { | ||
| ARROW_CHECK_EQ(keys.size(), values.size()); | ||
| for (size_t i = 0; i < keys_.size(); i++) { | ||
| map_[keys_[i]] = i; | ||
| } | ||
| } |
| "of size ", | ||
| keys_.size()); | ||
| } | ||
| map_.erase(map_.find(keys_[index])); |
| KeyValueMetadata metadata2; | ||
| metadata2.Append("k", "v2"); | ||
|
|
||
| ASSERT_TRUE(metadata1.Equals(metadata2)); | ||
| } |
|
I am still refactoring and adding a comprehensive test suite |
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
|
I see tests failing due to pyarrow having duplicate keys support. # test duplicate key support
md = pa.KeyValueMetadata([
('a', 'alpha'),
('b', 'beta'),
('a', 'Alpha'),
('a', 'ALPHA'),
])
expected = [
(b'a', b'alpha'),
(b'b', b'beta'),
(b'a', b'Alpha'),
(b'a', b'ALPHA')
]
> assert len(md) == 4
E assert 2 == 4
E + where 2 = len(\n-- metadata --\na: ALPHA\nb: beta)Before opening the issue #50348, I had put up the idea to eliminate the duplicate keys support in I received positive feedback over his, and so I decided to approach this. @pitrou it would be really nice to have your feedback on this matter. My main question is should duplicates be allowed in |
Rationale for this change
KeyValueMetadatauses 2 vectors internally to manage the keys and the values. This causes it to accept duplicate keys. When new key-value pairs are appended such that the key already exists in the metadata, the metadata just accepts it as a duplicate, but while finding the key, it returns the first inserted value, not the last updated value. This causes the updated key added using theAppendfunction to become "virtually non-existant". I use an hashmap - unordered_map to store pairs of (key, index) where index is the index of the key in the vectorkeys_. This enables significant performance gains also in some methods ofKeyValueMetadata.What changes are included in this PR?
unordered_mapalong with the existing vectors to keep track of the key value pairs in the metadata.FindKeyfromO(n)toO(1).Equalsworks properly with the new logicAre these changes tested?
I have added tests in
key_value_metadata_test.cctesting the duplicate keys scenario.I tested them locally, all tests pass.
Are there any user-facing changes?
No, I tried to keep the API as it is.