🎯 Goal (What & Why)
During the inference forward pass, Fast-LLM's memory footprint is currently approximately 2.5× higher than Hugging Face Transformers' for the same model and batch size.
🧠 Qwen2 1.5B — Batch Size: 16×4096, Flash Attention, bfloat16, H100 GPU
| Test | Peak GPU Memory Usage (MB) |
| --- | --- |
| HF (no loss calculation) | 22,162.28 |
| HF (with loss calculation) | 40,962.78 |
| Fast-LLM (no loss calculation) | 59,013.70 |
| Fast-LLM (with loss calculation) | OOM |
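The benchmark script itself is not attached to this issue, so the sketch below is only an assumed way the Hugging Face numbers above could be reproduced: model checkpoint name, batch shape, and the `labels` toggle for the "with loss" rows are illustrative assumptions, not the exact setup used for the table.

```python
# Minimal sketch (assumed setup) for measuring peak GPU memory of the HF forward pass.
import torch
from transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen2-1.5B"  # assumed checkpoint matching the table above
batch_size, seq_len = 16, 4096

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()
model.eval()

input_ids = torch.randint(
    0, model.config.vocab_size, (batch_size, seq_len), device="cuda"
)

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    # Passing `labels` triggers the loss calculation ("with loss" rows);
    # drop the argument to reproduce the "no loss" rows.
    model(input_ids, labels=input_ids)
torch.cuda.synchronize()

peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"Peak GPU memory: {peak_mb:,.2f} MB")
```

A comparable measurement on the Fast-LLM side would wrap its inference entry point with the same `reset_peak_memory_stats` / `max_memory_allocated` bookkeeping, so the two columns are measured identically.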
What is a reasonable target for reducing Fast-LLM's memory usage?
🚀 Execution Plan
(This section may start as an incomplete draft but must be defined before implementation begins.)
Step 1: What is the smallest working version?
(Describe the simplest way to implement this feature with minimal effort.)
Step 2: What additional optimizations are possible (but optional)?
(List potential refinements that can be added in later PRs if needed.)
📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- Set the `Estimate` field (in days) in the GitHub project.
- Use the `Size` field to categorize the PR size (Small/Medium/Large).