FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving

April 2026

tl;dr: Accelerating VLA model in streaming fashion.

The paper broke down the inference pipeline into four parts: vision ecnoder, prefill, decode and trajectory decode.

Q: “If the old frame is the same old frame, why can’t I just reuse its K/V exactly?”
A: In a sliding window session, although the input frame stay the same, the tranformer context has changed as the older frames have been dropped.
- Example: Old window: [A, B, C, D], Next window: [B, C, D, E]
- Cached D still carries some influence from A, fresh D would not. So reused K/V is stale.
- This would lead to a train/inference mismatch.
streaming fine-tuning means:
- “Train the action head on these approximate caches until it becomes robust to them.”

Provide feedback