[Feature] Integrate TurboQuant KV Cache compression by sunghajung6688 · Pull Request #6 · hw-native-sys/pypto-serving

sunghajung6688 · 2026-05-19T07:08:14Z

Integrates TurboQuant V3 (pure MSE mode) into the KV Cache management of pypto-serving. It supports online compression and decompression of historical tokens to reduce KV Cache memory usage.

gemini-code-assist

Code Review

This pull request introduces TurboQuant KV cache quantization to optimize memory usage during prefill and decode phases. The implementation includes a new turboquant module featuring MSE-optimal compression using random rotation and Lloyd-Max quantization. Feedback focuses on critical performance bottlenecks: the current compression logic re-processes the entire sequence prefix in each step, leading to $O(N^2)$ complexity; the manual Python loops used for writing to paged cache should be replaced with vectorized advanced indexing; and several constant tensor allocations and PDF calculations should be pre-computed or vectorized to reduce overhead.

gemini-code-assist · 2026-05-19T07:10:27Z

+        if alloc.request_id not in pool.compressed_segments:
+            pool.compressed_segments[alloc.request_id] = {}
+
+        for layer_idx in range(pool.num_layers):


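The current implementation re-compresses the entire sequence prefix (from token 0 to tokens_used - residual_window) on every call. When it is called inside the decode loop, this leads to $O(N^2)$ complexity in the sequence length, and performance degrades severely as the context grows. Consider implementing incremental compression that processes only the new tokens that have just moved past the residual_window. The read_context method called here has a similar per-token read-loop bottleneck and should be optimized at the same time.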
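The incremental scheme described in this comment can be sketched in plain Python. This is a minimal illustration and not the PR's code: `IncrementalCompressor`, `compressed_upto`, `compress_fn`, and `count_work` are hypothetical names, and only `tokens_used` and `residual_window` come from the review comment.

```python
from typing import Callable, Dict, List, Tuple


class IncrementalCompressor:
    """Tracks, per request, how many tokens have already been compressed.

    Hypothetical helper: the names are illustrative, not from pypto-serving.
    """

    def __init__(self, residual_window: int, compress_fn: Callable[[int, int], None]):
        self.residual_window = residual_window
        self.compress_fn = compress_fn  # compresses tokens in [start, end)
        self.compressed_upto: Dict[str, int] = {}

    def step(self, request_id: str, tokens_used: int) -> Tuple[int, int]:
        """Compress only the tokens that newly left the residual window."""
        start = self.compressed_upto.get(request_id, 0)
        end = max(tokens_used - self.residual_window, 0)
        if end > start:
            self.compress_fn(start, end)
            self.compressed_upto[request_id] = end
        return start, max(end, start)


def count_work(total_tokens: int, residual_window: int) -> Tuple[int, int]:
    """Return (tokens processed with full re-compression, with incremental)."""
    full = 0
    incremental: List[int] = []
    comp = IncrementalCompressor(residual_window, lambda s, e: incremental.append(e - s))
    for tokens_used in range(1, total_tokens + 1):
        full += max(tokens_used - residual_window, 0)  # whole prefix every step
        comp.step("req-0", tokens_used)
    return full, sum(incremental)
```

For a 1000-token sequence with a 128-token residual window, `count_work(1000, 128)` returns `(380628, 872)`: the full re-compression touches a quadratic number of tokens, while the incremental version touches each token once.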

gemini-code-assist · 2026-05-19T07:10:27Z

+        for row in range(keys.shape[0]):
+            token_index = start_token_index + row
+            page_idx = token_index // pool.page_size
+            offset = token_index % pool.page_size
+            physical_page = alloc.page_ids[page_idx]
+            pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype)
+            pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype)


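Iterating over tokens in a Python loop and writing them into the paged cache one at a time is very inefficient, especially for device tensors (NPU/GPU). Use PyTorch advanced indexing to vectorize the write so the update completes in a single operation.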

Suggested change

-        for row in range(keys.shape[0]):
-            token_index = start_token_index + row
-            page_idx = token_index // pool.page_size
-            offset = token_index % pool.page_size
-            physical_page = alloc.page_ids[page_idx]
-            pool.key_pages[layer_idx, physical_page, :, offset, :] = keys[row].to(cache_dtype)
-            pool.value_pages[layer_idx, physical_page, :, offset, :] = values[row].to(cache_dtype)
+        token_indices = torch.arange(start_token_index, start_token_index + keys.shape[0], device=keys.device)
+        page_indices = token_indices // pool.page_size
+        offsets = token_indices % pool.page_size
+        physical_pages = torch.tensor(alloc.page_ids, device=keys.device)[page_indices]
+        pool.key_pages[layer_idx, physical_pages, :, offsets, :] = keys.to(cache_dtype)
+        pool.value_pages[layer_idx, physical_pages, :, offsets, :] = values.to(cache_dtype)
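To check that the vectorized write matches the per-token loop, the following sketch runs both on a small CPU tensor. It assumes a cache layout of `[num_layers, num_pages, num_heads, page_size, head_dim]` and keys shaped `[num_tokens, num_heads, head_dim]`, which is what the indexing in the diff implies. The function names and the toy sizes are made up for illustration.

```python
import torch


def write_loop(key_pages, keys, page_ids, start, page_size, layer_idx):
    """Reference: one write per token, as in the original diff."""
    for row in range(keys.shape[0]):
        token_index = start + row
        physical_page = page_ids[token_index // page_size]
        offset = token_index % page_size
        key_pages[layer_idx, physical_page, :, offset, :] = keys[row]


def write_vectorized(key_pages, keys, page_ids, start, page_size, layer_idx):
    """Single advanced-indexing write covering all tokens."""
    token_indices = torch.arange(start, start + keys.shape[0], device=keys.device)
    page_indices = token_indices // page_size
    offsets = token_indices % page_size
    physical_pages = torch.tensor(page_ids, device=keys.device)[page_indices]
    # The two index tensors are separated by a slice, so the indexed region
    # has shape [num_tokens, num_heads, head_dim], matching `keys`.
    key_pages[layer_idx, physical_pages, :, offsets, :] = keys


# Assumed layout: [num_layers, num_pages, num_heads, page_size, head_dim]
num_layers, num_pages, num_heads, page_size, head_dim = 2, 6, 3, 4, 5
page_ids = [4, 1, 3]  # logical page -> physical page (toy values)
keys = torch.randn(7, num_heads, head_dim)

a = torch.zeros(num_layers, num_pages, num_heads, page_size, head_dim)
b = torch.zeros_like(a)
write_loop(a, keys, page_ids, start=2, page_size=page_size, layer_idx=1)
write_vectorized(b, keys, page_ids, start=2, page_size=page_size, layer_idx=1)
same = torch.equal(a, b)
```

The equivalence relies on every token mapping to a distinct (page, offset) slot, which holds for a contiguous token range within one allocation.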

gemini-code-assist · 2026-05-19T07:10:28Z

+        idx_powers = torch.tensor(
+            [2 ** (self.bits * i) for i in range(indices_per_byte - 1, -1, -1)],
+            dtype=torch.long,
+            device=idx_flat.device,
+        )


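idx_powers is a constant tensor. Creating it on every compress call is inefficient. Consider precomputing it from bits in __init__ and storing it as a buffer to avoid redundant allocations and host-to-device (H2D) copies.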
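One way to apply this suggestion is to register the constant as a module buffer. The sketch below is an assumption about how `idx_powers` is used: the `IndexPacker` class and its `pack` method are hypothetical, and the packing step (multiply by powers and sum per byte) is inferred from the diff rather than copied from the PR. It also assumes `bits` divides 8 and that the number of indices is a multiple of `indices_per_byte`.

```python
import torch
from torch import nn


class IndexPacker(nn.Module):
    """Packs `bits`-wide indices into bytes (hypothetical helper class)."""

    def __init__(self, bits: int):
        super().__init__()
        assert 8 % bits == 0, "sketch assumes bits divides 8"
        self.bits = bits
        self.indices_per_byte = 8 // bits
        powers = [2 ** (bits * i) for i in range(self.indices_per_byte - 1, -1, -1)]
        # Built once. Non-persistent keeps it out of the state_dict, while the
        # buffer still follows the module across .to(device) calls.
        self.register_buffer(
            "idx_powers", torch.tensor(powers, dtype=torch.long), persistent=False
        )

    def pack(self, idx_flat: torch.Tensor) -> torch.Tensor:
        # Group indices per output byte, weight each by its bit position, sum.
        groups = idx_flat.long().view(-1, self.indices_per_byte)
        return (groups * self.idx_powers).sum(dim=-1).to(torch.uint8)


packer = IndexPacker(bits=2)
packed = packer.pack(torch.tensor([3, 0, 1, 2, 0, 0, 0, 1]))
```

With `bits=2` the buffer holds `[64, 16, 4, 1]`, so the eight indices above pack into the two bytes `[198, 1]`.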

gemini-code-assist · 2026-05-19T07:10:28Z

+            pdf_vals = torch.tensor([pdf(x) for x in xs])
+            weighted = xs * pdf_vals


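The Gaussian PDF is currently computed in a Python list comprehension, which is slow for 2048 samples. It can easily be vectorized with PyTorch operations to speed up initialization.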

    xs = torch.linspace(a, b, n_samples)
    pdf_vals = (1.0 / (math.sqrt(2 * math.pi) * sigma)) * torch.exp(-xs**2 / (2 * sigma**2))
    weighted = xs * pdf_vals
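A quick way to confirm that the vectorized form matches the per-sample loop is to evaluate both on the same grid. The sketch assumes a zero-mean Gaussian with standard deviation `sigma`, as in the suggestion above. The PR's own `pdf` function is not shown here, so the loop version is a stand-in, and the interval and `sigma` value are arbitrary.

```python
import math
import torch


def gaussian_pdf_loop(xs: torch.Tensor, sigma: float) -> torch.Tensor:
    """Per-element evaluation, mirroring the list comprehension in the diff."""
    def pdf(x: float) -> float:
        return math.exp(-x * x / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return torch.tensor([pdf(float(x)) for x in xs], dtype=xs.dtype)


def gaussian_pdf_vectorized(xs: torch.Tensor, sigma: float) -> torch.Tensor:
    """Same density evaluated with tensor operations."""
    return torch.exp(-xs ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


sigma, n_samples = 0.5, 2048
xs = torch.linspace(-2.0, 2.0, n_samples, dtype=torch.float64)
pdf_vals = gaussian_pdf_vectorized(xs, sigma)
weighted = xs * pdf_vals
# Largest absolute difference between the two evaluations.
max_diff = (pdf_vals - gaussian_pdf_loop(xs, sigma)).abs().max().item()
```

Using `float64` here keeps the comparison tight. The vectorized version works the same way in `float32` with a correspondingly looser tolerance.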

[Feature] Add TurboQuant

2837700

gemini-code-assist Bot reviewed May 19, 2026

Add CLI arguments for TurboQuant KV cache compression

4b65bf3

sunghajung6688 force-pushed the turboquant branch from ee84d61 to 4b65bf3 Compare May 19, 2026 07:42

huangzhuo added 5 commits May 19, 2026 16:48

Add debug prints for TurboQuant initialization and compression

f201383

Add debug prints to trace startup progress

cb3ef61

Use uniform quantization for 8-bit, skip Lloyd-Max

de37af7

Add KV cache memory comparison print

fcaefe8

Add KV cache capacity test script

0c160bf

sunghajung6688 force-pushed the turboquant branch from 63f8ac7 to 0c160bf Compare May 22, 2026 09:28

sunghajung6688 wants to merge 7 commits into hw-native-sys:main from sunghajung6688:turboquant
