Prefetch hints for arrangement reads (joins, reduces)

## Motivation

Arrangements backed by large batches can incur cache misses or page faults when cursors seek through cold data. With a custom allocator that supports prefetching (e.g., `madvise(MADV_WILLNEED)` or explicit page-in), we could hide this latency — if operators can express what data they'll need next.

This issue explores where prefetch hints could be injected and what API shapes might work.

## Current access pattern

The join loop (`join.rs:367–396`) is a synchronized merge of two cursors:

```rust
while let (Some(batch_key), Some(trace_key), ..) = (..) {
    match trace_key.cmp(&batch_key) {
        Less    => trace.seek_key(trace_storage, batch_key),
        Greater => batch.seek_key(batch_storage, trace_key),
        Equal   => { /* load vals/times, compute, step both */ }
    }
}
```

Key observations:

- **The batch cursor drives the access pattern.** Its keys are known upfront (small, new batch). The trace cursor seeks to match.
- **`CursorList::seek_key` fans out to every inner cursor** (`cursor_list.rs:151–155`), each doing exponential+binary search in its `keys` container.
- **After a key match, all vals and (time, diff) pairs are loaded** — `ValueHistory::load` walks the full value/time structure for that key.
- **No lookahead exists today.** The join processes one key at a time with zero knowledge of the next.

Reduce is better positioned: it maintains a `pending_keys` list and could prefetch the full batch of upcoming keys.

## Where prefetch helps

1. **Key lookup across batches** — when seeking `k` in the trace, prefetch the approximate landing position for the *next* batch key in each spine batch.
2. **Value/update loading** — once we know `key_cursor`, prefetch the `vals` and `upds` ranges before starting the Cartesian product for the *previous* key.
3. **Parallel prefetch across CursorList** — issue prefetches to all batches before performing the actual seeks.

## Possible API shapes

### A) Cursor-level hint (operator-facing, minimal)

```rust
trait Cursor {
    /// Hint that the cursor will soon seek to `key`. Default: no-op.
    fn prefetch_key(&self, _storage: &Self::Storage, _key: Self::Key<'_>) {}
}
```

The join peeks at the next batch key and calls `trace.prefetch_key(storage, next_key)` one iteration ahead. `CursorList` fans out to all inner cursors.

**Pros:** Simple, opt-in, no breaking changes.
**Cons:** Only one key ahead; the operator must drive it.

### B) Container-level hint (storage-facing, transparent)

```rust
trait BatchContainer {
    /// Hint that indices in [start, end) will be accessed soon. Default: no-op.
    fn prefetch(&self, _start: usize, _end: usize) {}
}
```

`OrdValCursor::seek_key` could, after computing the destination index, prefetch the associated value/update offset and data ranges. Invisible to operators.

**Pros:** Transparent; also benefits merge cursors.
**Cons:** Container must know its memory layout to compute addresses.

### C) Trace-level batch hint (maximum information)

```rust
trait TraceReader {
    /// Hint that these keys will be looked up soon. Default: no-op.
    fn prefetch_keys<I: IntoIterator<Item = ...>>(&self, keys: I) {}
}
```

The operator extracts the next N keys from the batch cursor and passes them all at once. The spine fans out to each batch; the allocator gets a batch of prefetch requests.

**Pros:** Maximum information for the allocator; amortizable.
**Cons:** More invasive; trace must expose inner structure.

## Pieces not yet in place

- **Allocator ↔ container connection.** `BatchContainer` / `Layout` have no reference to an allocator. Containers would need either an allocator handle or a global/thread-local prefetch mechanism.
- **Address computation.** Translating "key index N" to a byte range is trivial for `Vec<T>` (`ptr + N * size_of::<T>()`) but encoding-dependent for columnar containers.
- **Peek-ahead in cursors.** The `Cursor` trait has no `peek_next_key()`. The join could read ahead in the batch directly (it owns the storage), or a small API addition could formalize this.

## Suggested starting point

Combine **A + B**: add `prefetch_key` on `Cursor` (default no-op) and `prefetch` on `BatchContainer` (default no-op). In `OrdValCursor::prefetch_key`, estimate the key's landing position and prefetch the key/val/update byte ranges. In the join loop, peek at the next batch key and call the hint one iteration ahead. All zero-cost until a container opts in.

For reduce, prefetch the full `pending_keys` list upfront since it's known ahead of time.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prefetch hints for arrangement reads (joins, reduces) #716

Motivation

Current access pattern

Where prefetch helps

Possible API shapes

A) Cursor-level hint (operator-facing, minimal)

B) Container-level hint (storage-facing, transparent)

C) Trace-level batch hint (maximum information)

Pieces not yet in place

Suggested starting point

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Prefetch hints for arrangement reads (joins, reduces) #716

Description

Motivation

Current access pattern

Where prefetch helps

Possible API shapes

A) Cursor-level hint (operator-facing, minimal)

B) Container-level hint (storage-facing, transparent)

C) Trace-level batch hint (maximum information)

Pieces not yet in place

Suggested starting point

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions