
[WIP] Feat/prefill decode #102

Draft
orionpapadakis wants to merge 6 commits into beehive-lab:main from orionpapadakis:feat/prefill-decode

Conversation

@orionpapadakis
Collaborator

This WIP PR implements the prefill-decode concept in GPULlama3.java:

  1. Prefill (or ingestion) is the inference pass over the prompt (input) tokens. Currently they are ingested sequentially (one by one). However, all prompt tokens are known up front, so they can be processed together; sequential ingestion is therefore sub-optimal. A well-known practice is to ingest prompt tokens in batches (e.g., of 32) instead of one by one. In addition, logits computation can be skipped for every prompt token except the last, since those logits are never used. The logits of the last prompt token yield the first generated token, which in turn feeds the decode phase.
  2. Decode is the inference pass for each generated token. In contrast to prefill, this phase remains token-sequential (as in the current implementation) because each generated token depends on the previous one.
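The prefill/decode split can be sketched as follows. This is a minimal, model-free sketch: `forward`, `argmax`, and the toy logits are hypothetical stand-ins, not GPULlama3.java's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefillDecodeSketch {

    static final int VOCAB = 4; // toy vocabulary size

    // Hypothetical stand-in for one transformer forward pass at position `pos`.
    // `computeLogits` lets prefill skip the logits projection for all but the last prompt token.
    static float[] forward(int token, int pos, boolean computeLogits) {
        if (!computeLogits) {
            return null; // in a real model, only the KV-cache update would happen here
        }
        float[] logits = new float[VOCAB];
        logits[(token + pos) % VOCAB] = 1.0f; // deterministic toy logits
        return logits;
    }

    static int argmax(float[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return best;
    }

    public static List<Integer> generate(int[] promptTokens, int maxNewTokens) {
        int pos = 0;
        float[] logits = null;
        // Prefill: ingest every prompt token, but compute logits only for the last one.
        for (int i = 0; i < promptTokens.length; i++) {
            boolean last = (i == promptTokens.length - 1);
            logits = forward(promptTokens[i], pos++, last);
        }
        List<Integer> generated = new ArrayList<>();
        int token = argmax(logits); // first generated token comes from the last prompt token's logits
        generated.add(token);
        // Decode: strictly sequential, since each step consumes the previously generated token.
        for (int t = 1; t < maxNewTokens; t++) {
            logits = forward(token, pos++, true);
            token = argmax(logits);
            generated.add(token);
        }
        return generated;
    }
}
```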

Based on the above, this PR breaks down the prefill-decode feature implementation into four discrete phases:

  • Phase 1: splits CPU inference to prefill + decode (no batching, skip logits during prefill)

  • Phase 2: splits GPU inference to prefill + decode (no batching, skip logits during prefill)

  • Phase 3: adds batching to CPU prefill

  • Phase 4: adds batching to GPU prefill
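Batched prefill (Phases 3–4) then amounts to walking the prompt in fixed-size chunks instead of single tokens. A hedged sketch, where `forwardBatch` is a hypothetical batched forward pass (not the PR's actual `InferenceEngineWithPrefillDecode` API):

```java
public class BatchedPrefillSketch {

    static final int BATCH_SIZE = 32; // the batch size used in the PR's measurements

    // Hypothetical batched forward pass over tokens[start .. start+len).
    // When `computeLogits` is false only the KV cache would be updated in a real model;
    // here it just returns the last token of the slice as a toy "logit".
    static float[] forwardBatch(int[] tokens, int start, int len, boolean computeLogits) {
        return computeLogits ? new float[] { tokens[start + len - 1] } : null;
    }

    // Walks the prompt in BATCH_SIZE chunks; only the final chunk computes logits,
    // since only the last prompt token's logits feed the decode phase.
    public static float[] prefill(int[] promptTokens) {
        float[] lastLogits = null;
        for (int start = 0; start < promptTokens.length; start += BATCH_SIZE) {
            int len = Math.min(BATCH_SIZE, promptTokens.length - start);
            boolean lastBatch = (start + len == promptTokens.length);
            lastLogits = forwardBatch(promptTokens, start, len, lastBatch);
        }
        return lastLogits;
    }
}
```

A 70-token prompt, for example, becomes three passes (32 + 32 + 6 tokens) instead of 70.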

Current Functional and Performance State

The implementation is progressing in stages, with varying levels of completeness across phases and configurations.

Functional Status

  • Phases 1–3 are fully implemented and functional for:

    • Model: LLaMA
    • Precision: FP16

    Remaining work:

    • Extension to additional models: Mistral, Qwen2, Qwen3, DeepSeek, Phi-3, Granite
    • Support for q8_0
    • General cleanup and refactoring of duplicated logic across CPU/GPU paths
  • Phase 4 (GPU prefill batching):

    • Functionally implemented but lags behind Phases 1–3
    • Currently works only with CUDA Graphs enabled
    • Requires:
      • Bug fixes to support non-CUDA-Graph execution
      • The same remaining work as Phases 1–3 (model coverage, q8_0, cleanup)

Performance State (RTX 5090 ROG Laptop, TornadoVM PTX Backend)

The following results were collected using LLaMA 1B FP16 and batch size 32:

| Configuration | Short Prompt (tok/s) | Speedup | Long Prompt (tok/s) | Speedup |
|---|---|---|---|---|
| CPU | 22.28 | 1.00x | 25.76 | 1.00x |
| CPU + batched-prefill | 26.80 | 1.20x | 42.25 | 1.64x |
| GPU | 64.13 | 2.88x | 57.18 | 2.22x |
| GPU + batched-prefill | 71.09 | 3.19x | 88.47 | 3.43x |
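As a sanity check, each Speedup column is simply throughput divided by the CPU baseline for the matching prompt length (e.g., 42.25 / 25.76 ≈ 1.64):

```java
public class SpeedupCheck {
    // Speedup = throughput / CPU baseline, rounded to two decimals as in the table above.
    public static double speedup(double tokPerSec, double cpuBaseline) {
        return Math.round(tokPerSec / cpuBaseline * 100.0) / 100.0;
    }
}
```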

Key Observations

  • Prefill batching is already beneficial for both CPU and GPU (Phases 1–4):

    • ~+20% (CPU) and ~+11% (GPU) improvement for short prompts
    • Up to +64% (CPU) and +55% (GPU) for long prompts
  • Long prompts benefit the most, confirming that prefill dominates runtime as input grows

  • GPU + batched prefill (Phase 4) delivers the highest throughput:

    • Up to 3.43× speedup over CPU baseline
    • However, results currently depend on CUDA Graphs

Summary

  • Functional coverage is complete for LLaMA FP16 (Phases 1–3)
  • Phase 4 is operational but incomplete
  • Performance results already validate the effectiveness of prefill batching, especially for long prompts
  • Further gains and broader applicability are expected after:
    • Model generalization
    • q8_0 support
    • Phase 4 stabilization and cleanup
    • Batch size experimentation

…ith InferenceCoreWithPrefillDecode and InferenceEngineWithPrefillDecode
… Implements `InferenceEngineWithPrefillDecode` and `TornadoVMMasterPlanWithPrefillDecode` for batched token generation. Refactor `Llama` to support the batched prefill flag.
@orionpapadakis orionpapadakis added the enhancement New feature or request label Apr 2, 2026
