April 2026

tl;dr: Accelerating VLA model inference in a streaming fashion.

Overall impression

The paper breaks the inference pipeline down into four parts: vision encoder, prefill, decode, and trajectory decode.

Key ideas

  • Q: “If the old frame is still the same frame, why can’t I just reuse its K/V exactly?”
  • A: In a sliding-window session, the input frame stays the same, but the transformer context has changed because the oldest frames have been dropped.
    • Example: Old window: [A, B, C, D], Next window: [B, C, D, E]
    • The cached D still carries some influence from A, whereas a freshly computed D would not, so the reused K/V is stale.
    • This would lead to a train/inference mismatch.
  • Streaming fine-tuning means:
    • “Train the action head on these approximate caches until it becomes robust to them.”
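
The staleness argument above can be checked numerically. What follows is a minimal sketch, not the paper's model: it assumes a toy two-layer setup with causal attention, random weights, and no positional encoding, and every name in it (`causal_self_attention`, `layer2_keys`, `w_k2`) is hypothetical. It compares the second-layer key of frame D computed in the old window [A, B, C, D] against the same key computed in the new window [B, C, D, E].

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal attention over a (tokens, dim) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out future positions so each token only sees itself and the past.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
dim = 8
# One toy embedding per frame (assumption: one token per frame).
frames = {name: rng.normal(size=dim) for name in "ABCDE"}
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
w_k2 = rng.normal(size=(dim, dim))  # key projection of a second layer

def layer2_keys(window):
    """Second-layer keys for every frame in the given window."""
    x = np.stack([frames[name] for name in window])
    hidden = x + causal_self_attention(x, w_q, w_k, w_v)  # residual connection
    return hidden @ w_k2

cached_d = layer2_keys("ABCD")[3]  # D computed while A was still in context
fresh_d = layer2_keys("BCDE")[2]   # D recomputed after A was dropped

stale_gap = np.abs(cached_d - fresh_d).max()
print(f"max |cached D - fresh D| = {stale_gap:.4f}")
```

In this toy setup the first-layer key of D depends only on D's own embedding, so it could be reused exactly; the gap appears from the second layer onward, where D's hidden state has already mixed in A. That gap is the approximation the action head is fine-tuned to tolerate.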

Technical details

Notes