feat: parallelize update and scan processing #770

Merged: wgtmac merged 2 commits into apache:main from HuaHuaY:parallel_optimize on Jun 24, 2026

@HuaHuaY HuaHuaY commented Jun 22, 2026

Contributor

No description provided.

@HuaHuaY HuaHuaY force-pushed the parallel_optimize branch from dbe9af7 to ac14be2 on June 23, 2026 07:51
Comment thread src/iceberg/update/snapshot_update.cc
Comment thread src/iceberg/update/snapshot_update.cc
@HuaHuaY HuaHuaY force-pushed the parallel_optimize branch from ac14be2 to e5da646 on June 24, 2026 06:27
    plan_executor_, manifests_to_delete,
    [this, &metadata](
        const ManifestFile& manifest) -> Result<std::unordered_set<std::string>> {
      auto result = ReadLiveDataFilePaths(metadata, manifest);

Member

Can we handle delete manifests here too? ReadLiveDataFilePaths rejects them, so reachable cleanup removes the delete manifest but leaves the delete file behind. Java uses ManifestFiles.readPaths, which reads both data and delete manifests.

Contributor Author

This isn't a problem introduced by this PR. It can be fixed in a future PR.
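To make the reviewer's point concrete, here is a minimal sketch of a path reader that accepts both data and delete manifests and fans the reads out in parallel. Everything in it is an assumption for illustration: `Manifest`, `ManifestEntry`, `ReadLivePaths`, and `CollectLivePaths` are hypothetical stand-ins rather than iceberg-cpp types, and `std::async` substitutes for the PR's `Executor`.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for illustration; not the iceberg-cpp types.
enum class ManifestContent { kData, kDeletes };

struct ManifestEntry {
  std::string file_path;
  bool live;  // false for entries already marked as deleted
};

struct Manifest {
  ManifestContent content;
  std::vector<ManifestEntry> entries;
};

// Collects live file paths from one manifest. It deliberately does not
// branch on manifest.content, so delete manifests are read the same way
// as data manifests instead of being rejected.
std::unordered_set<std::string> ReadLivePaths(const Manifest& manifest) {
  std::unordered_set<std::string> paths;
  for (const auto& entry : manifest.entries) {
    if (entry.live) paths.insert(entry.file_path);
  }
  return paths;
}

// Reads every manifest on its own task, then merges the per-manifest sets.
// std::async is an assumption standing in for the PR's Executor.
std::unordered_set<std::string> CollectLivePaths(
    const std::vector<Manifest>& manifests) {
  std::vector<std::future<std::unordered_set<std::string>>> tasks;
  tasks.reserve(manifests.size());
  for (const auto& manifest : manifests) {
    tasks.push_back(std::async(std::launch::async, [&manifest] {
      return ReadLivePaths(manifest);
    }));
  }
  std::unordered_set<std::string> all_paths;
  for (auto& task : tasks) {
    auto paths = task.get();
    all_paths.insert(paths.begin(), paths.end());
  }
  return all_paths;
}
```

With a reader of this shape, files referenced only by delete manifests end up in the same set as data files, so cleanup can consider them as well.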

///
/// \param executor Executor to use while planning expired snapshot metadata.
/// \return Reference to this for method chaining.
ExpireSnapshots& PlanWith(Executor& executor);

Member

Please add the same lifetime note as ExecuteDeleteWith. This stores only a reference, and planning can run later during Apply() or Finalize().

Contributor Author

I removed the lifetime note from ExecuteDeleteWith instead. That the executor must stay alive is self-evident; the caller is responsible for guaranteeing it.
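For readers weighing the two positions in this thread, the following is a minimal sketch of what the requested lifetime note could look like, and of why the reference is used later than the call that sets it. The `Executor` and `ExpireSnapshotsSketch` classes here are hypothetical and simplified (tasks run inline); they are not the iceberg-cpp implementations.

```cpp
#include <cassert>
#include <functional>

// Hypothetical minimal executor: runs each task inline and counts them.
class Executor {
 public:
  void Submit(const std::function<void()>& task) {
    task();
    ++submitted_;
  }
  int submitted() const { return submitted_; }

 private:
  int submitted_ = 0;
};

// Hypothetical update object; only the lifetime aspect is modeled.
class ExpireSnapshotsSketch {
 public:
  /// \brief Set the executor used while planning.
  ///
  /// \note Only a reference is stored. The executor must remain valid
  /// until Apply() has returned.
  ///
  /// \param executor Executor to use while planning.
  /// \return Reference to this for method chaining.
  ExpireSnapshotsSketch& PlanWith(Executor& executor) {
    plan_executor_ = &executor;
    return *this;
  }

  // Planning runs here, possibly long after PlanWith() was called.
  // Returns the number of planning tasks that ran.
  int Apply() {
    int planned = 0;
    for (int i = 0; i < 3; ++i) {
      auto task = [&planned] { ++planned; };
      if (plan_executor_ != nullptr) {
        plan_executor_->Submit(task);
      } else {
        task();  // serial fallback when no executor was set
      }
    }
    return planned;
  }

 private:
  Executor* plan_executor_ = nullptr;  // non-owning
};
```

Nothing touches the executor at `PlanWith()` time; it is first used inside `Apply()`, which is the gap the reviewer wants documented.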

Comment thread src/iceberg/table_scan.h
///
/// \param executor Executor to use while planning manifests.
/// \return Reference to this for method chaining.
TableScanBuilder& PlanWith(Executor& executor);

Member

Please document the executor lifetime here too. The built scan stores this by reference and may use it later in PlanFiles().

///
/// \param executor Executor to use while planning manifests.
/// \return Reference to this for method chaining.
auto& ScanManifestsWith(this auto& self, Executor& executor) {

Member

Please document the executor lifetime here too. This stores only a reference and Apply() may use it later.

///
/// \param executor Executor to use, or std::nullopt to read manifests serially.
/// \return Reference to this for method chaining.
Builder& PlanWith(OptionalExecutor executor);

Member

Can this public API also take Executor&? If OptionalExecutor is only internal plumbing, it should live in an _internal.h header so users do not depend on it.

Contributor Author

We can rename the file in a future PR.
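One way the reviewer's suggestion could be realized is sketched below: the public setter takes `Executor&`, while the optional wrapper stays an internal detail. The definition of `OptionalExecutor` is not shown in this thread, so the alias used here (`std::optional<std::reference_wrapper<Executor>>`) is an assumption, as are the `internal` namespace and the `IsParallel()` and `executor()` accessors.

```cpp
#include <cassert>
#include <functional>
#include <optional>

class Executor {};  // hypothetical placeholder for the real executor

namespace internal {
// Assumed shape of OptionalExecutor; the PR's actual definition may differ.
using OptionalExecutor = std::optional<std::reference_wrapper<Executor>>;
}  // namespace internal

class Builder {
 public:
  /// \param executor Executor to use while reading manifests.
  /// \return Reference to this for method chaining.
  Builder& PlanWith(Executor& executor) {
    executor_ = std::ref(executor);  // wrapped internally, never exposed
    return *this;
  }

  // True once an executor has been set; otherwise manifests are read serially.
  bool IsParallel() const { return executor_.has_value(); }

  // Returns the configured executor, or nullptr when none was set.
  const Executor* executor() const {
    return executor_ ? &executor_->get() : nullptr;
  }

 private:
  internal::OptionalExecutor executor_;  // empty means serial
};
```

Callers then depend only on `Executor&`, and the serial default is expressed by simply not calling `PlanWith()`.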

@HuaHuaY HuaHuaY requested a review from wgtmac June 24, 2026 08:56
@wgtmac wgtmac changed the title from "feat: optimize some parallel comments except manifest" to "feat: parallelize update and scan processing" on Jun 24, 2026
@wgtmac wgtmac merged commit 988f363 into apache:main Jun 24, 2026
21 checks passed
@HuaHuaY HuaHuaY deleted the parallel_optimize branch June 24, 2026 09:18