🎯 Goal (What & Why)
During the inference forward pass, Fast-LLM's memory footprint is currently approximately 2.5× higher than Hugging Face Transformers' for the same model and batch size.
🧠 Qwen2 1.5B — Batch Size: 16×4096, Flash Attention, bfloat16, H100 GPU
| Test | Peak GPU Memory Usage (MB) |
| --- | --- |
| HF (no loss calculation) | 22,162.28 |
| HF (with loss calculation) | 40,962.78 |
| Fast-LLM (no loss calculation) | 59,013.70 |
| Fast-LLM (with loss calculation) | OOM |
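The benchmark script itself is not attached to this issue, so the sketch below is only an assumed way the Hugging Face numbers above could be reproduced: model checkpoint name, batch shape, and the `labels` toggle for the "with loss" rows are illustrative assumptions, not the exact setup used for the table.

```python
# Minimal sketch (assumed setup) for measuring peak GPU memory of the HF forward pass.
import torch
from transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen2-1.5B"  # assumed checkpoint matching the table above
batch_size, seq_len = 16, 4096

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()
model.eval()

input_ids = torch.randint(
    0, model.config.vocab_size, (batch_size, seq_len), device="cuda"
)

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    # Passing `labels` triggers the loss calculation ("with loss" rows);
    # drop the argument to reproduce the "no loss" rows.
    model(input_ids, labels=input_ids)
torch.cuda.synchronize()

peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"Peak GPU memory: {peak_mb:,.2f} MB")
```

A comparable measurement on the Fast-LLM side would wrap its inference entry point with the same `reset_peak_memory_stats` / `max_memory_allocated` bookkeeping, so the two columns are measured identically.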
What is a reasonable target for reducing Fast-LLM's memory usage?
🚀 Execution Plan
(This section may start as an incomplete draft but must be defined before implementation begins.)
Step 1: What is the smallest working version?
(Describe the simplest way to implement this feature with minimal effort.)
Step 2: What additional optimizations are possible (but optional)?
(List potential refinements that can be added in later PRs if needed.)
📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).