Avoid mmap during predownload local segment check to reduce page faults#18467

Open
rohityadav1993 wants to merge 1 commit into apache:master from rohityadav1993:predownload-page-fault

Conversation

rohityadav1993 (Contributor) commented May 11, 2026

performance

Summary

Eliminates page faults during segment predownload before server startup. PredownloadScheduler currently does a full segment load:

  1. to read the CRC and compare it with ZK metadata, checking whether the local copy is stale;
  2. to read the directory size, which is logged and emitted as a metric to report download progress.

Both trigger major page faults: in practice, for 6000 segments and ~5TB of data to download, this led to 10K+ major page faults per second and ~30% iowait.

Note: using stream+untar mode helps with lowering the fault rate (see S1 below), but the unnecessary mmap reads still occur.

Details

Problem

Phase 1
When PredownloadScheduler checks if segments already exist locally (Phase 1), it opens each segment directory via SegmentLocalFSDirectory(dir, ReadMode.mmap). This triggers SingleFileIndexDirectory.mapBufferEntries() which mmaps the entire index file and validates magic markers at every chunk boundary — causing a major page fault per index entry. With concurrent download threads across thousands of segments, the aggregate fault rate becomes very high.

If a segment's CRC matches but the local segment directory is corrupted (e.g. truncated or missing index files), the server startup path handles recovery: BaseTableDataManager.addNewOnlineSegment calls tryLoadExistingSegment, which performs a full segment load. If loading fails, it falls back to downloadAndLoadSegment to re-fetch the segment from deep store or peers.
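The load-or-fallback shape of that recovery path can be sketched generically. This is a minimal, self-contained stand-in, not Pinot's actual code: the class name `LoadOrDownload` and the lambda bodies are hypothetical; in Pinot the two arms correspond to `tryLoadExistingSegment` and `downloadAndLoadSegment`.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

public class LoadOrDownload {
  // Generic fallback: attempt the cheap local load; if it throws
  // (e.g. corrupted or truncated index files), fall back to the
  // expensive re-download path.
  static <T> T loadOrFallback(Callable<T> tryLoad, Supplier<T> download) {
    try {
      return tryLoad.call();
    } catch (Exception e) {
      return download.get();
    }
  }

  public static void main(String[] args) {
    String result = loadOrFallback(
        () -> { throw new IllegalStateException("corrupted local segment"); },
        () -> "segment from deep store");
    System.out.println(result); // prints "segment from deep store"
  }
}
```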

Phase 2
After downloading segments (Phase 2), the scheduler calls loadSegmentFromLocal() again solely to populate the local size metric — mmapping a freshly-written cold segment and causing another round of major page faults per segment.

Fix

Phase 1 — CRC check: the CRC is only needed to check whether a segment was already downloaded to local disk. Replace the SegmentDirectory open with a direct 8-byte read from creation.meta.
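A minimal sketch of that read, assuming creation.meta is laid out as two big-endian longs (CRC, then creation time) as described in the commit message; the class name `CreationMetaCrc` is hypothetical and the file is fabricated in `main` purely for illustration:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class CreationMetaCrc {
  // Read only the first 8 bytes (the CRC) of creation.meta —
  // no SegmentDirectory open, no mmap of index files.
  static long readCrc(File creationMeta) throws IOException {
    try (DataInputStream in = new DataInputStream(new FileInputStream(creationMeta))) {
      return in.readLong();
    }
  }

  public static void main(String[] args) throws IOException {
    // Write a stand-in creation.meta: CRC followed by creation time.
    File f = File.createTempFile("creation", ".meta");
    f.deleteOnExit();
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
      out.writeLong(123456789L);                 // CRC
      out.writeLong(System.currentTimeMillis()); // creation time
    }
    System.out.println(readCrc(f)); // prints 123456789
  }
}
```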

Phase 2 — post-download size: replace getDiskSizeBytes() with FileUtils.sizeOfDirectory(). No mmap is needed for a directory size walk.
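The PR uses commons-io `FileUtils.sizeOfDirectory()`; a stdlib-only equivalent makes the point that this is a plain metadata walk, touching no file contents. The class name `DirSize` is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class DirSize {
  // Sum of regular-file sizes under a directory: reads only file
  // metadata (stat), never maps or reads index file contents.
  static long sizeOfDirectory(Path dir) throws IOException {
    try (Stream<Path> files = Files.walk(dir)) {
      return files.filter(Files::isRegularFile)
          .mapToLong(p -> p.toFile().length())
          .sum();
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("segment");
    Files.write(dir.resolve("columns.psf"), new byte[1024]);
    Files.write(dir.resolve("creation.meta"), new byte[16]);
    System.out.println(sizeOfDirectory(dir)); // prints 1040
  }
}
```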

Cluster Testing Results

Tested on hosts with ~5TB of segment data (thousands of segments), 48 cores, 266GB memory, SSD.

Test Setups

| Setup | Parallelism | Stream+Untar | Pagefault Fix |
|-------|-------------|--------------|---------------|
| S1    | 48          | Yes          | No            |
| S3    | 48          | No           | No            |
| S4    | 48          | No           | Yes           |
| S5    | 48          | Yes          | Yes           |

Predownload Duration

| Setup | Duration |
|-------|----------|
| S1 — stream+untar, no fix | 33 min |
| S3 — no stream+untar, no fix | 79 min |
| S4 — no stream+untar, with fix | 63 min |
| S5 — stream+untar, with fix | 31 min |

Host Metrics During Predownload

| Metric | S1 (no fix, stream) | S3 (no fix, no stream) | S4 (fix, no stream) | S5 (fix, stream) |
|--------|---------------------|------------------------|---------------------|------------------|
| Major faults delta | Near-zero (2.6/s) | +∞ (peak 10K/s) | −10% (flat) | −3% (flat) |
| Page-in (pgpgin) delta | +404% | +4350% | +3% (flat) | −2% (flat) |
| iowait | 0.05% | 29% | 8.0% | 0.2% |
| Dirty pages avg | 0.1 GB | 132 GB | 113 GB | ~0 GB |

Key Takeaways

  • Best result: S5 (stream+untar + pagefault fix) — 31 min, ~0% iowait, zero dirty pages, major faults flat during predownload.
  • Worst result: S3 (no stream+untar, no pagefault fix) — 79 min, 29% iowait, 132 GB dirty pages, major fault spikes to 10K/s.
  • S1 note: S1 did not show elevated major faults despite lacking the pagefault fix. However, pgpgin +404% shows it was still performing unnecessary disk reads via mmap. The favorable host conditions (near-zero baseline faults, low memory pressure) masked the impact. On production hosts with higher memory contention, those same reads would cause major fault spikes (as seen in S3).

Test plan

  • Unit tests for PredownloadSegmentInfoTest — updated to verify Preconditions throws on non-directory paths and regular files
  • Unit tests for PredownloadTableInfoTest — existing tests cover CRC match/mismatch and missing segment scenarios
  • All 7 tests pass: `./mvnw -pl pinot-server -am -Dtest=PredownloadSegmentInfoTest,PredownloadTableInfoTest test`
  • Cluster testing on 5TB hosts validates page fault elimination

🤖 Generated with Claude Code

codecov-commenter commented May 11, 2026

Codecov Report

❌ Patch coverage is 85.18519% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.67%. Comparing base (5ebd445) to head (d10cfec).
⚠️ Report is 9 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...not/server/predownload/PredownloadSegmentInfo.java | 69.23% | 3 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@              Coverage Diff              @@
##             master   #18467       +/-   ##
=============================================
+ Coverage     34.99%   63.67%   +28.68%
- Complexity      869     1684      +815
=============================================
  Files          3256     3262        +6
  Lines        199548   199826      +278
  Branches      30986    31031       +45
=============================================
+ Hits          69840   127248    +57408
+ Misses       123570    62428    -61142
- Partials       6138    10150     +4012
```
| Flag | Coverage Δ |
|------|------------|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 63.67% <85.18%> (+28.68%) ⬆️ |
| temurin | 63.67% <85.18%> (+28.68%) ⬆️ |
| unittests | 63.67% <85.18%> (+28.68%) ⬆️ |
| unittests1 | 55.79% <ø> (?) |
| unittests2 | 34.94% <85.18%> (−0.06%) ⬇️ |

Flags with carried forward coverage won't be shown.


rohityadav1993 force-pushed the predownload-page-fault branch from cd2acab to d5473a3 on May 12, 2026 04:26
PredownloadScheduler.loadSegmentsFromLocal() previously opened a full
SegmentLocalFSDirectory(ReadMode.mmap) for each segment to read its CRC,
causing ~C×5 major page faults per segment (one per index chunk boundary
validated by SingleFileIndexDirectory.mapBufferEntries). On hosts with
thousands of segments this creates a page fault storm at startup.

The CRC is pre-written to creation.meta (8 bytes) during segment build
and does not require mmapping index files. Replace the SegmentDirectory
open with a direct read of creation.meta via DataInputStream, and replace
getDiskSizeBytes() with FileUtils.sizeOfDirectory() for the post-download
size metric. Both Phase 1 (local check) and Phase 2 (post-download size
reporting) now use this optimised path through loadSegmentFromLocal.

If a segment CRC matches but the local directory is corrupted, the server
startup path handles recovery: BaseTableDataManager.tryLoadExistingSegment
performs a full segment load, falling back to downloadAndLoadSegment to
re-fetch from deep store or peers.
rohityadav1993 force-pushed the predownload-page-fault branch from d5473a3 to d10cfec on May 12, 2026 04:32
rohityadav1993 marked this pull request as ready for review on May 12, 2026 05:28
rohityadav1993 marked this pull request as draft on May 12, 2026 12:33
rohityadav1993 marked this pull request as ready for review on May 14, 2026 04:23