Skip to content

Prefetch hints for arrangement reads (joins, reduces) #716

@antiguru

Description

@antiguru

Motivation

Arrangements backed by large batches can incur cache misses or page faults when cursors seek through cold data. With a custom allocator that supports prefetching (e.g., madvise(MADV_WILLNEED) or explicit page-in), we could hide this latency — if operators can express what data they'll need next.

This issue explores where prefetch hints could be injected and what API shapes might work.

Current access pattern

The join loop (join.rs:367–396) is a synchronized merge of two cursors:

while let (Some(batch_key), Some(trace_key), ..) = (..) {
    match trace_key.cmp(&batch_key) {
        Less    => trace.seek_key(trace_storage, batch_key),
        Greater => batch.seek_key(batch_storage, trace_key),
        Equal   => { /* load vals/times, compute, step both */ }
    }
}

Key observations:

  • The batch cursor drives the access pattern. Its keys are known upfront (small, new batch). The trace cursor seeks to match.
  • CursorList::seek_key fans out to every inner cursor (cursor_list.rs:151–155), each doing exponential+binary search in its keys container.
  • After a key match, all vals and (time, diff) pairs are loadedValueHistory::load walks the full value/time structure for that key.
  • No lookahead exists today. The join processes one key at a time with zero knowledge of the next.

Reduce is better positioned: it maintains a pending_keys list and could prefetch the full batch of upcoming keys.

Where prefetch helps

  1. Key lookup across batches — when seeking k in the trace, prefetch the approximate landing position for the next batch key in each spine batch.
  2. Value/update loading — once we know key_cursor, prefetch the vals and upds ranges before starting the Cartesian product for the previous key.
  3. Parallel prefetch across CursorList — issue prefetches to all batches before performing the actual seeks.

Possible API shapes

A) Cursor-level hint (operator-facing, minimal)

trait Cursor {
    /// Hint that the cursor will soon seek to `key`. Default: no-op.
    fn prefetch_key(&self, _storage: &Self::Storage, _key: Self::Key<'_>) {}
}

The join peeks at the next batch key and calls trace.prefetch_key(storage, next_key) one iteration ahead. CursorList fans out to all inner cursors.

Pros: Simple, opt-in, no breaking changes.
Cons: Only one key ahead; the operator must drive it.

B) Container-level hint (storage-facing, transparent)

trait BatchContainer {
    /// Hint that indices in [start, end) will be accessed soon. Default: no-op.
    fn prefetch(&self, _start: usize, _end: usize) {}
}

OrdValCursor::seek_key could, after computing the destination index, prefetch the associated value/update offset and data ranges. Invisible to operators.

Pros: Transparent; also benefits merge cursors.
Cons: Container must know its memory layout to compute addresses.

C) Trace-level batch hint (maximum information)

trait TraceReader {
    /// Hint that these keys will be looked up soon. Default: no-op.
    fn prefetch_keys<I: IntoIterator<Item = ...>>(&self, keys: I) {}
}

The operator extracts the next N keys from the batch cursor and passes them all at once. The spine fans out to each batch; the allocator gets a batch of prefetch requests.

Pros: Maximum information for the allocator; amortizable.
Cons: More invasive; trace must expose inner structure.

Pieces not yet in place

  • Allocator ↔ container connection. BatchContainer / Layout have no reference to an allocator. Containers would need either an allocator handle or a global/thread-local prefetch mechanism.
  • Address computation. Translating "key index N" to a byte range is trivial for Vec<T> (ptr + N * size_of::<T>()) but encoding-dependent for columnar containers.
  • Peek-ahead in cursors. The Cursor trait has no peek_next_key(). The join could read ahead in the batch directly (it owns the storage), or a small API addition could formalize this.

Suggested starting point

Combine A + B: add prefetch_key on Cursor (default no-op) and prefetch on BatchContainer (default no-op). In OrdValCursor::prefetch_key, estimate the key's landing position and prefetch the key/val/update byte ranges. In the join loop, peek at the next batch key and call the hint one iteration ahead. All zero-cost until a container opts in.

For reduce, prefetch the full pending_keys list upfront since it's known ahead of time.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions