
[WIP] Feat/prefill decode #102

Draft
orionpapadakis wants to merge 6 commits into beehive-lab:main from orionpapadakis:feat/prefill-decode

Conversation

@orionpapadakis
Collaborator

This WIP PR implements the prefill-decode concept in GPULlama3.java:

  1. Prefill (or ingestion) is the inference pass over the prompt (input) tokens. Currently they are ingested sequentially (one by one). However, all prompt tokens are known up front, so they can be processed together; sequential ingestion is therefore sub-optimal. A well-known practice is to ingest prompt tokens in batches (e.g., of 32) instead of one by one. In addition, logits computation can be skipped for every prompt token except the last, since those logits are never used. The logits of the last prompt token yield the first generated token, which in turn feeds the decode phase.
  2. Decode is the inference pass for each generated token. In contrast to prefill, this phase remains token-sequential (as in the current implementation) because each generated token depends on the previous one.
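The prefill/decode split can be sketched as follows. This is a minimal, model-free sketch: `forward`, `argmax`, and the toy logits are hypothetical stand-ins, not GPULlama3.java's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefillDecodeSketch {

    static final int VOCAB = 4; // toy vocabulary size

    // Hypothetical stand-in for one transformer forward pass at position `pos`.
    // `computeLogits` lets prefill skip the logits projection for all but the last prompt token.
    static float[] forward(int token, int pos, boolean computeLogits) {
        if (!computeLogits) {
            return null; // in a real model, only the KV-cache update would happen here
        }
        float[] logits = new float[VOCAB];
        logits[(token + pos) % VOCAB] = 1.0f; // deterministic toy logits
        return logits;
    }

    static int argmax(float[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return best;
    }

    public static List<Integer> generate(int[] promptTokens, int maxNewTokens) {
        int pos = 0;
        float[] logits = null;
        // Prefill: ingest every prompt token, but compute logits only for the last one.
        for (int i = 0; i < promptTokens.length; i++) {
            boolean last = (i == promptTokens.length - 1);
            logits = forward(promptTokens[i], pos++, last);
        }
        List<Integer> generated = new ArrayList<>();
        int token = argmax(logits); // first generated token comes from the last prompt token's logits
        generated.add(token);
        // Decode: strictly sequential, since each step consumes the previously generated token.
        for (int t = 1; t < maxNewTokens; t++) {
            logits = forward(token, pos++, true);
            token = argmax(logits);
            generated.add(token);
        }
        return generated;
    }
}
```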

Based on the above, this PR breaks down the prefill-decode feature implementation into four discrete phases:

  • Phase 1: splits CPU inference to prefill + decode (no batching, skip logits during prefill)

  • Phase 2: splits GPU inference to prefill + decode (no batching, skip logits during prefill)

  • Phase 3: adds batching to CPU prefill

  • Phase 4: adds batching to GPU prefill
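Batched prefill (Phases 3–4) then amounts to walking the prompt in fixed-size chunks instead of single tokens. A hedged sketch, where `forwardBatch` is a hypothetical batched forward pass (not the PR's actual `InferenceEngineWithPrefillDecode` API):

```java
public class BatchedPrefillSketch {

    static final int BATCH_SIZE = 32; // the batch size used in the PR's measurements

    // Hypothetical batched forward pass over tokens[start .. start+len).
    // When `computeLogits` is false only the KV cache would be updated in a real model;
    // here it just returns the last token of the slice as a toy "logit".
    static float[] forwardBatch(int[] tokens, int start, int len, boolean computeLogits) {
        return computeLogits ? new float[] { tokens[start + len - 1] } : null;
    }

    // Walks the prompt in BATCH_SIZE chunks; only the final chunk computes logits,
    // since only the last prompt token's logits feed the decode phase.
    public static float[] prefill(int[] promptTokens) {
        float[] lastLogits = null;
        for (int start = 0; start < promptTokens.length; start += BATCH_SIZE) {
            int len = Math.min(BATCH_SIZE, promptTokens.length - start);
            boolean lastBatch = (start + len == promptTokens.length);
            lastLogits = forwardBatch(promptTokens, start, len, lastBatch);
        }
        return lastLogits;
    }
}
```

A 70-token prompt, for example, becomes three passes (32 + 32 + 6 tokens) instead of 70.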

Current Functional and Performance State

The implementation is progressing in stages, with varying levels of completeness across phases and configurations.

Functional Status

  • Phases 1–3 are fully implemented and functional for:

    • Model: LLaMA
    • Precision: FP16

    Remaining work:

    • Extension to additional models: Mistral, Qwen2, Qwen3, DeepSeek, Phi-3, Granite
    • Support for q8_0
    • General cleanup and refactoring of duplicated logic across CPU/GPU paths
  • Phase 4 (GPU prefill batching):

    • Functionally implemented but lags behind Phases 1–3
    • Currently works only with CUDA Graphs enabled
    • Requires:
      • Bug fixes to support non-CUDA-Graph execution
      • The same remaining work as Phases 1–3 (model coverage, q8_0, cleanup)

Performance State (RTX 5090 ROG Laptop, TornadoVM PTX Backend)

The following results were collected using LLaMA 1B FP16 and batch size 32:

| Configuration | Short Prompt (tok/s) | Speedup | Long Prompt (tok/s) | Speedup |
|---|---|---|---|---|
| CPU | 22.28 | 1.00x | 25.76 | 1.00x |
| CPU + batched-prefill | 26.80 | 1.20x | 42.25 | 1.64x |
| GPU | 64.13 | 2.88x | 57.18 | 2.22x |
| GPU + batched-prefill | 71.09 | 3.19x | 88.47 | 3.43x |
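As a sanity check, each Speedup column is simply throughput divided by the CPU baseline for the matching prompt length (e.g., 42.25 / 25.76 ≈ 1.64):

```java
public class SpeedupCheck {
    // Speedup = throughput / CPU baseline, rounded to two decimals as in the table above.
    public static double speedup(double tokPerSec, double cpuBaseline) {
        return Math.round(tokPerSec / cpuBaseline * 100.0) / 100.0;
    }
}
```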

Key Observations

  • Prefill batching is already beneficial for both CPU and GPU (Phases 1–4):

    • ~+20% (CPU) and ~+11% (GPU) improvement for short prompts
    • Up to +64% (CPU) and +55% (GPU) for long prompts
  • Long prompts benefit the most, confirming that prefill dominates runtime as input grows

  • GPU + batched prefill (Phase 4) delivers the highest throughput:

    • Up to 3.43× speedup over CPU baseline
    • However, results currently depend on CUDA Graphs

Summary

  • Functional coverage is complete for LLaMA FP16 (Phases 1–3)
  • Phase 4 is operational but incomplete
  • Performance results already validate the effectiveness of prefill batching, especially for long prompts
  • Further gains and broader applicability are expected after:
    • Model generalization
    • q8_0 support
    • Phase 4 stabilization and cleanup
    • Batch size experimentation

…ith InferenceCoreWithPrefillDecode and InferenceEngineWithPrefillDecode
… Implements `InferenceEngineWithPrefillDecode` and `TornadoVMMasterPlanWithPrefillDecode` for batched token generation. Refactor `Llama` to support the batched prefill flag.
@orionpapadakis orionpapadakis added the enhancement New feature or request label Apr 2, 2026
