Skip to content

[python] Roll manifest files by target size during commit and compaction#8373

Open
XiaoHongbo-Hope wants to merge 26 commits into
apache:masterfrom
XiaoHongbo-Hope:manifest_write_fix
Open

[python] Roll manifest files by target size during commit and compaction#8373
XiaoHongbo-Hope wants to merge 26 commits into
apache:masterfrom
XiaoHongbo-Hope:manifest_write_fix

Conversation

@XiaoHongbo-Hope

@XiaoHongbo-Hope XiaoHongbo-Hope commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Purposes

Commit writes all manifest entries into a single file without rolling, producing oversized manifest files. These oversized files slow every subsequent commit that scans them. This PR fixes the above issue by rolling manifest files by target size

Tests

ManifestFileManagerTest.test_commit_manifest_exceeds_target_size

…tion

Manifest files written by pypaimon commit and minor compaction were
not split by manifest-target-file-size. A single OVERWRITE commit
writing 400 partitions produced a 138 MiB manifest (17x the 8 MiB
default target), slowing every subsequent commit that reads it.

Add ManifestFileManager.rolling_write() that estimates the total
serialized size, then splits entries across multiple files so each
stays near the target. Apply it in FileStoreCommit delta/changelog
manifest writing and ManifestFileMerger minor compaction.
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Split manifest files by target size during commit and compaction [python] Roll manifest files by target size during commit and compaction Jun 28, 2026
…e serialization

- Add try/except around chunk loop to delete already-written files on failure
- Estimate entries_per_file from avg entry size instead of serializing twice
- Extract _flush() and simplify write() to reuse it
…plitting

When entry sizes are skewed, avg-based splitting can produce manifest
files that exceed 2x the target. Switch to adaptive rolling: serialize
each chunk, check actual size, shrink and re-serialize if overshooting,
then adjust entries_per_chunk for the next iteration based on actual
size. Add a skewed-entry test to cover this.
…eter

rolling_write now takes name_prefix directly and generates file names
as {prefix}-0, {prefix}-1, etc. Callers no longer append -0 suffix.
Replace full Avro serialization for size estimation with a 64-entry
sample. Only serialize all entries when the sample suggests they fit
in a single file. For multi-file rolling, the per-chunk adaptive
logic handles estimation drift.
Record delta, changelog, and merge manifest file metas into
new_manifest_files_for_abort so _cleanup_preparation_failure can
delete them if a later step (e.g. manifest list write) fails.
Previously the variable was declared but only populated by the
merger, missing the delta/changelog files from rolling_write.
Replace sample-based estimation with fastavro.Writer streaming API:
write entries one by one, flush every 100 records to check buf size,
roll to a new file when size >= target. This matches Java's
RollingFileWriterImpl behavior (check every N records, roll on
threshold).
Replace fixed 100-record flush interval with fastavro sync_interval
so buf.tell() updates after each record. Check size after every write
instead of batched flushes, giving tighter control near the target.
Use flush() instead of dump() to avoid writing empty trailing blocks.
Previously _merge_candidates skipped single-file groups unconditionally.
Now it only skips when the file is within the target size. An oversized
single manifest is re-read and re-written through rolling_write,
splitting it into multiple target-sized files.
Java's mergeCandidates unconditionally skips single-file groups.
Revert the Python-only divergence to keep consistency.
…preparation failure

When manifest list write fails after rolling_write succeeds, the
except block now deletes the already-written manifest files before
calling _cleanup_preparation_failure, preventing orphan files.
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review June 28, 2026 07:54
…uests

rolling_write already has the byte buffer, so pass len(avro_bytes)
directly instead of calling file_io.get_file_size() which issues
a remote HEAD request per file.
Use written_files list instead of result to track files that need
cleanup on failure. Covers the gap where _flush succeeds but
_build_meta fails, leaving the current chunk's file as an orphan.
Verify each rolling manifest file is readable and entry counts match
the metadata, ensuring the streaming Writer produces valid Avro.
Replace _cleanup_preparation_failure with two methods mirroring Java:
- _clean_up_reuse_tmp_manifests: read delta/changelog manifest lists
  to find and delete their manifest files, then delete the lists
- _clean_up_no_reuse_tmp_manifests: delete base manifest list, then
  only delete manifests in merge_after not present in merge_before
When manifest list write fails after rolling_write succeeds, reading
the manifest list back for cleanup also fails. Fall back to the
known metas already in hand to delete orphan manifest files.
…ove dead code

- Wrap cleanup calls in try/except so cleanup errors don't mask the
  original preparation failure
- Pass merge new_files directly instead of recomputing set difference
- Remove unused write_with_meta method
Java's cleanUpReuseTmpManifests only reads manifest list to find
files to delete. If manifest list write failed, it skips cleanup.
Align Python to the same behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant