[python] Roll manifest files by target size during commit and compaction#8373
Open
XiaoHongbo-Hope wants to merge 26 commits into
Open
[python] Roll manifest files by target size during commit and compaction#8373XiaoHongbo-Hope wants to merge 26 commits into
XiaoHongbo-Hope wants to merge 26 commits into
Conversation
…tion Manifest files written by pypaimon commit and minor compaction were not split by manifest-target-file-size. A single OVERWRITE commit writing 400 partitions produced a 138 MiB manifest (17x the 8 MiB default target), slowing every subsequent commit that reads it. Add ManifestFileManager.rolling_write() that estimates the total serialized size, then splits entries across multiple files so each stays near the target. Apply it in FileStoreCommit delta/changelog manifest writing and ManifestFileMerger minor compaction.
…e serialization - Add try/except around chunk loop to delete already-written files on failure - Estimate entries_per_file from avg entry size instead of serializing twice - Extract _flush() and simplify write() to reuse it
…plitting When entry sizes are skewed, avg-based splitting can produce manifest files that exceed 2x the target. Switch to adaptive rolling: serialize each chunk, check actual size, shrink and re-serialize if overshooting, then adjust entries_per_chunk for the next iteration based on actual size. Add a skewed-entry test to cover this.
…eter
rolling_write now takes name_prefix directly and generates file names
as {prefix}-0, {prefix}-1, etc. Callers no longer append -0 suffix.
Replace full Avro serialization for size estimation with a 64-entry sample. Only serialize all entries when the sample suggests they fit in a single file. For multi-file rolling, the per-chunk adaptive logic handles estimation drift.
Record delta, changelog, and merge manifest file metas into new_manifest_files_for_abort so _cleanup_preparation_failure can delete them if a later step (e.g. manifest list write) fails. Previously the variable was declared but only populated by the merger, missing the delta/changelog files from rolling_write.
Replace sample-based estimation with fastavro.Writer streaming API: write entries one by one, flush every 100 records to check buf size, roll to a new file when size >= target. This matches Java's RollingFileWriterImpl behavior (check every N records, roll on threshold).
Replace fixed 100-record flush interval with fastavro sync_interval so buf.tell() updates after each record. Check size after every write instead of batched flushes, giving tighter control near the target. Use flush() instead of dump() to avoid writing empty trailing blocks.
Previously _merge_candidates skipped single-file groups unconditionally. Now it only skips when the file is within the target size. An oversized single manifest is re-read and re-written through rolling_write, splitting it into multiple target-sized files.
Java's mergeCandidates unconditionally skips single-file groups. Revert the Python-only divergence to keep consistency.
…preparation failure When manifest list write fails after rolling_write succeeds, the except block now deletes the already-written manifest files before calling _cleanup_preparation_failure, preventing orphan files.
…uests rolling_write already has the byte buffer, so pass len(avro_bytes) directly instead of calling file_io.get_file_size() which issues a remote HEAD request per file.
Use written_files list instead of result to track files that need cleanup on failure. Covers the gap where _flush succeeds but _build_meta fails, leaving the current chunk's file as an orphan.
Verify each rolling manifest file is readable and entry counts match the metadata, ensuring the streaming Writer produces valid Avro.
Replace _cleanup_preparation_failure with two methods mirroring Java: - _clean_up_reuse_tmp_manifests: read delta/changelog manifest lists to find and delete their manifest files, then delete the lists - _clean_up_no_reuse_tmp_manifests: delete base manifest list, then only delete manifests in merge_after not present in merge_before
When manifest list write fails after rolling_write succeeds, reading the manifest list back for cleanup also fails. Fall back to the known metas already in hand to delete orphan manifest files.
…ove dead code - Wrap cleanup calls in try/except so cleanup errors don't mask the original preparation failure - Pass merge new_files directly instead of recomputing set difference - Remove unused write_with_meta method
Java's cleanUpReuseTmpManifests only reads manifest list to find files to delete. If manifest list write failed, it skips cleanup. Align Python to the same behavior.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purposes
Commit writes all manifest entries into a single file without rolling, producing oversized manifest files. These oversized files slow every subsequent commit that scans them. This PR fixes the above issue by rolling manifest files by target size
Tests
ManifestFileManagerTest.test_commit_manifest_exceeds_target_size