fix: handle non-contiguous RowRanges when resolving global row IDs by zhf999 · Pull Request #383 · alibaba/paimon-cpp

zhf999 · 2026-06-26T02:55:49Z

Purpose

Fix a correctness bug: upper layers assumed that the rows of a returned batch are contiguous in the file and derived file row IDs as previous_batch_start + offset.
Introduce unified usage of GetPreviousBatchFileRowId(batch_row_id) to resolve the file row ID for a row index inside the current batch.
In PrefetchFileBatchReaderImpl, cache the actual file row IDs for each returned batch and keep row-id mapping aligned when a batch is sliced by read_range.
For Parquet, add target row-range union and per-batch row mapping logic:
- Merge filtered target ranges into a file target row set.
- Build batch_row_id -> file_row_id mapping in NextBatch().
- Keep GetPreviousBatchFileRowId() correct under non-contiguous rows caused by predicate + bitmap filtering.
Update upper-layer consumers to query per-row file IDs directly (deletion vectors, bitmap index filtering, _ROW_ID field conversion, KeyValue iteration positions).
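To make the Parquet part concrete, here is a minimal sketch of the two steps described above: merging target ranges into a disjoint set, then expanding the next batch into a batch_row_id -> file_row_id table. It is an illustration under stated assumptions, not the PR's code: `RowRange`, `MergeRanges`, and `BuildBatchRowMapping` are hypothetical names, and ranges are assumed to be inclusive file row IDs.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in: an inclusive range [first, second] of file row IDs.
using RowRange = std::pair<uint64_t, uint64_t>;

// Merge overlapping or adjacent inclusive ranges into a sorted, disjoint set.
std::vector<RowRange> MergeRanges(std::vector<RowRange> ranges) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<RowRange> merged;
    for (const RowRange& r : ranges) {
        // After sorting, r.first >= merged.back().first, so the subtraction
        // below is only evaluated when r.first > merged.back().second.
        if (!merged.empty() && (r.first <= merged.back().second ||
                                r.first - merged.back().second == 1)) {
            merged.back().second = std::max(merged.back().second, r.second);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}

// Skip the first `consumed` target rows, then collect up to `batch_size`
// file row IDs. Index = row index inside the batch, value = file row ID.
std::vector<uint64_t> BuildBatchRowMapping(const std::vector<RowRange>& merged,
                                           uint64_t consumed, uint64_t batch_size) {
    std::vector<uint64_t> mapping;
    uint64_t skipped = 0;
    for (const RowRange& r : merged) {
        for (uint64_t row = r.first;; ++row) {
            if (skipped < consumed) {
                ++skipped;
            } else if (mapping.size() < batch_size) {
                mapping.push_back(row);
            } else {
                return mapping;
            }
            if (row == r.second) break;  // inclusive end, avoids overflow on ++row
        }
    }
    return mapping;
}
```

With merged ranges {0..3}, {10..12}, {20..21} and a batch size of 5, the first batch maps to file rows 0, 1, 2, 3, 10, so the row at batch index 4 is file row 10 rather than 4.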

Tests

Updated and adapted reader tests to the new interface and semantics:
- src/paimon/format/avro/avro_file_batch_reader_test.cpp
- src/paimon/format/blob/blob_file_batch_reader_test.cpp
- src/paimon/format/lance/lance_format_reader_writer_test.cpp
- src/paimon/format/orc/orc_file_batch_reader_test.cpp
- src/paimon/format/parquet/parquet_file_batch_reader_test.cpp
- src/paimon/common/reader/prefetch_file_batch_reader_impl_test.cpp
Added/extended coverage in Parquet with TestRowMapping to validate file row mapping across non-contiguous ranges.
Remaining changes are interface migration plus consistency updates of call sites and assertions.

API and Format

Public reader API change: FileBatchReader and implementations now use GetPreviousBatchFileRowId(uint64_t batch_row_id).
Semantic change: returns the file row ID for the batch_row_id inside the current batch (instead of deriving by batch start + offset under contiguous assumptions).
Storage format and on-disk protocol are unchanged.

Documentation

No.

Generative AI tooling

gpt-5.3-codex

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

lxy-9602 · 2026-06-27T03:19:24Z

-Result<uint64_t> PrefetchFileBatchReaderImpl::GetPreviousBatchFirstRowNumber() const {
-    return previous_batch_first_row_num_;
+Result<uint64_t> PrefetchFileBatchReaderImpl::GetPreviousBatchGlobalRowId(
+    uint64_t batch_row_id) const {


Why can’t we just return previous_batch_first_row_num_ + batch_row_id directly?

The PrefetchFileBatchReaderImpl may hold a ParquetFileBatchReader, which may return the concatenation of two discontinuous batches. Should we fall back to having PrefetchFileBatchReaderImpl::GetPreviousBatchGlobalRowId simply return previous_batch_first_row_num_ + batch_row_id, like LanceFileBatchReader or BlobFileBatchReader?
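The exchange above can be illustrated with a small sketch that contrasts the contiguous formula with a cached per-row lookup. Everything here is hypothetical: `BatchRowIdCache`, `ContiguousGuess`, and the half-open [begin, end) meaning of the slice range are assumptions made for the example, not the actual PrefetchFileBatchReaderImpl code.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// The contiguous assumption questioned in the review: first row + offset.
uint64_t ContiguousGuess(uint64_t batch_first_row, uint64_t batch_row_id) {
    return batch_first_row + batch_row_id;
}

// Hypothetical cache of file row IDs for the most recently returned batch.
class BatchRowIdCache {
 public:
    // Record the file row IDs of a freshly read batch, in batch order.
    void Reset(std::vector<uint64_t> file_row_ids) { ids_ = std::move(file_row_ids); }

    // Keep only rows whose file row ID lies in [begin, end), so the cache
    // stays aligned with a batch that was sliced to the same range.
    void SliceToRange(uint64_t begin, uint64_t end) {
        ids_.erase(std::remove_if(ids_.begin(), ids_.end(),
                                  [=](uint64_t id) { return id < begin || id >= end; }),
                   ids_.end());
    }

    // File row ID for a row index inside the current batch, if in range.
    std::optional<uint64_t> FileRowId(uint64_t batch_row_id) const {
        if (batch_row_id >= ids_.size()) return std::nullopt;
        return ids_[batch_row_id];
    }

 private:
    std::vector<uint64_t> ids_;
};
```

For a batch built from file rows 5, 6, 7, 40, 41, the row at batch index 3 is file row 40, while `ContiguousGuess(5, 3)` yields 8. The formula is only safe for readers that never return discontinuous batches.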

lxy-9602 · 2026-06-27T03:50:16Z

+
+    static Result<std::shared_ptr<arrow::ChunkedArray>> CollectResultOneBatch(
+        BatchReader* batch_reader, int64_t max_data_processing_time_in_us) {
+        int64_t seed = DateTimeUtils::GetCurrentUTCTimeUs();


Can CollectResultOneBatch just return an Arrow array directly? It doesn’t look like we need a ChunkedArray here.

CollectResultOneBatch is designed to align with CollectResult. Should we implement CollectResultOneBatch with a different return value?

zjw1111 · 2026-06-29T08:53:08Z

            PrepareOrcFileBatchReader(file_name, &read_schema, batch_size, natural_read_size);
-        ASSERT_EQ(std::numeric_limits<uint64_t>::max(),
-                  orc_batch_reader->GetPreviousBatchFirstRowNumber().value());
+        ASSERT_EQ(orc_batch_reader->GetPreviousBatchFileRowId(0).value(), -1);


Why -1 here? Should this be Status::Invalid instead?

zjw1111 · 2026-06-29T09:12:25Z

+        return CollectResultOneBatch(batch_reader, /*max_simulated_data_processing_time*/ 0);
+    }
+
+    static Result<std::shared_ptr<arrow::ChunkedArray>> CollectResultOneBatch(


This is very similar to CollectResult. Can you refactor to extract the common parts?

zjw1111 · 2026-06-29T09:40:31Z

-    Result<uint64_t> GetPreviousBatchFirstRowNumber() const override {
-        assert(reader_);
-        return reader_->GetPreviousBatchFirstRowNumber();
+    Result<uint64_t> GetPreviousBatchFileRowId(uint64_t batch_row_id) const override {


Change the interface in FileReaderWrapper as well.

zjw1111 · 2026-06-29T09:42:42Z

+                target_row_groups.emplace_back(
+                    /*rg_index=*/rg_id,
+                    /*is_partially_matched=*/false,
+                    /*ranges=*/
+                    RowRanges(Range(0, reader_->GetAllRowGroupRanges()[rg_id].second -
+                                           reader_->GetAllRowGroupRanges()[rg_id].first - 1)));


Format it.

Suggested change

-                target_row_groups.emplace_back(
-                    /*rg_index=*/rg_id,
-                    /*is_partially_matched=*/false,
-                    /*ranges=*/
-                    RowRanges(Range(0, reader_->GetAllRowGroupRanges()[rg_id].second -
-                                           reader_->GetAllRowGroupRanges()[rg_id].first - 1)));
+                target_row_groups.emplace_back(
+                    /*rg_index=*/rg_id, /*is_partially_matched=*/false, /*ranges=*/
+                    RowRanges(Range(0, reader_->GetAllRowGroupRanges()[rg_id].second -
+                                           reader_->GetAllRowGroupRanges()[rg_id].first - 1)));

zjw1111 · 2026-06-29T09:58:44Z

                         ReadResultCollector::CollectResult(
                             reader.get(), /*max simulated data processing time*/ 100));
-    ASSERT_EQ(reader->GetPreviousBatchFirstRowNumber().value(), 101);
+    ASSERT_NOK(reader->GetPreviousBatchFileRowId(0));


Now that the interface has been modified, these calls will always trigger ASSERT_NOK, so there's no point in testing them anymore, right? There seem to be similar issues in other test files as well.

zjw1111 · 2026-06-29T10:04:02Z

-    /// Get the row number of the first row in the previously read batch.
-    virtual Result<uint64_t> GetPreviousBatchFirstRowNumber() const = 0;
+    /// Get the global row number of the row in the previously read batch.
+    virtual Result<uint64_t> GetPreviousBatchFileRowId(uint64_t batch_row_id) const = 0;


Are there any explicit semantic constraints on this interface before reading begins and after EOF is reached? As per previous discussions, it returns Status::Invalid before reading starts. However, after reaching EOF, the behavior currently varies wildly—some continue to accumulate, while others return errors. Should we impose some constraints on this? cc @lxy-9602
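One way to pin down the lifecycle question is to write the contract as a small state machine. The sketch below is only one possible convention, not a decision taken in this thread: it returns an error before the first batch, after EOF, and for an out-of-range index. `RowIdResult` is a simplified, hypothetical stand-in for the project's Result<uint64_t>, and `RowIdTracker` is an invented name.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for Result<uint64_t>.
struct RowIdResult {
    bool ok;
    uint64_t value;
    std::string error;
};

// Tracks the file row IDs of the last batch and the reader's lifecycle state.
class RowIdTracker {
 public:
    // Call after each successfully returned batch.
    void OnBatch(std::vector<uint64_t> file_row_ids) {
        state_ = State::kHasBatch;
        ids_ = std::move(file_row_ids);
    }

    // Call once the underlying reader reports EOF.
    void OnEof() {
        state_ = State::kEof;
        ids_.clear();
    }

    // Assumed contract: valid only while a batch is current and the index is in range.
    RowIdResult GetPreviousBatchFileRowId(uint64_t batch_row_id) const {
        if (state_ == State::kNotStarted) return {false, 0, "no batch has been read yet"};
        if (state_ == State::kEof) return {false, 0, "reader has reached EOF"};
        if (batch_row_id >= ids_.size()) return {false, 0, "batch_row_id out of range"};
        return {true, ids_[batch_row_id], ""};
    }

 private:
    enum class State { kNotStarted, kHasBatch, kEof };
    State state_ = State::kNotStarted;
    std::vector<uint64_t> ids_;
};
```

Returning an error after EOF is a design choice made here for symmetry with the pre-read state. Keeping the last batch's IDs queryable after EOF would be an equally consistent rule, as long as every reader implementation follows the same one.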

zhf999 added 4 commits June 26, 2026 10:36

fix: FileBatchReader returns discontinuous batch, and change GetPreviousBatchFirstRowId to GetGlobalRowId

5d534db

style: change interface name

ad03498

update header files

cd7bd44

Merge branch 'main' into fix-rowid

f9721a8

Copilot AI review requested due to automatic review settings June 26, 2026 02:55

Copilot AI reviewed Jun 26, 2026

zhf999 added 7 commits June 26, 2026 14:06

fix: return Status::Invalid intead of returning max value

41f932d

fix: lance and blob return NotImplemented

6bd98d8

fix: add inclusive extend for fully matched rowgroup in SetReadSchema

a694567

fix: calling SetReadSchema many time do not clear row_mapping

eb48e42

test: add test for PrefetchFileBatchReaderImpl

7b00794

Merge branch 'main' into fix-rowid

5b28d56

style: replace auto in assigning macro with explicit type

0d9a174

lxy-9602 reviewed Jun 27, 2026

zhf999 added 12 commits June 29, 2026 10:12

style: rename interfaces and parameters

1336922

fix: use a more efficient way to apply bitmap

8f82f44

update headers

d3b73e1

fix: use iterator to apply bitmap

fcc1ac8

test: add assertion

0fb2dec

test: use '.value()' directly to validate the result.

a3e37bd

update comments

8820e7c

style: change method name

376a312

fix: small fixes

f1c02db

fix: blob test

ad66b22

fix: blob and lance tests

25ef3d0

fix: blob

8482e90

zjw1111 reviewed Jun 29, 2026

4 participants