From dc6c25fe3d6081bed15f0b2fb41b094cef17662c Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Sat, 4 Jul 2026 02:48:00 -0400
Subject: [PATCH 1/3] feat(i18n): add Simplified Chinese (/zh) pages across the
 whole site
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Every indexable page, apart from the exceptions listed under "Not
mirrored" below, now has a hand-authored Simplified Chinese sibling
under /zh so the site can be crawled and indexed in Chinese as well as
English:

- /zh landing + 8 dashboard tabs (Chinese metadata + server-rendered SEO
  intro above the charts), about (incl. Chinese FAQ + FAQPage JSON-LD),
  quotes (all 45 supporter quotes translated), land-acknowledgement, and
  the compare / compare-per-dollar index pages
- Chinese blog: content/blog/zh/<slug>.mdx pairs by filename; all 14
  posts translated per the AGENTS.md quality bar; /zh/blog list +
  /zh/blog/[slug] pages; visibility gating derives from the English post
- Bidirectional hreflang (en / zh-CN / x-default) on both trees via new
  helpers in src/lib/i18n.ts; sitemap emits EN+zh pairs; OG locale zh_CN
- Locale-aware chrome: header nav + EN<->中文 toggle, dashboard TabNav
  labels/links, footer 中文版 link, zh 404 page, document lang fixup
- Blog lib: CJK-aware reading time (400 chars/min) and Han-preserving
  slugify so Chinese headings get real anchor ids
- AGENTS.md: new mandatory rule — every new page/tab/post ships its /zh
  version in the same PR; docs/i18n.md records the design rationale
- Tests: i18n + tab-meta-zh unit tests, zh blog lib tests, zh-pages
  Cypress spec (10 tests)
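
As an illustration of the bidirectional hreflang bullet above, here is
a minimal TypeScript sketch of how two helpers could emit the same
`languages` map from both trees. The function bodies, the `*Sketch`
names, and the `Alternates` type are assumptions made for illustration;
the helpers actually added in src/lib/i18n.ts may differ (for example,
they may special-case routes such as /zh/inference).

```typescript
// Illustrative sketch only: names ending in "Sketch" are hypothetical and
// are not the helpers shipped in src/lib/i18n.ts.
type Alternates = {
  canonical: string;
  languages: Record<string, string>;
};

// Map an English path to its /zh sibling ("/" -> "/zh", "/about" -> "/zh/about").
export function zhPathSketch(enPath: string): string {
  return enPath === '/' ? '/zh' : `/zh${enPath}`;
}

// One shared map, so the EN and zh pages advertise identical alternates.
function languagesSketch(enPath: string): Record<string, string> {
  return {
    en: enPath,
    'zh-CN': zhPathSketch(enPath),
    'x-default': enPath, // x-default points at the English page
  };
}

// Alternates for the English page: canonical is the English path itself.
export function enAlternatesSketch(enPath: string): Alternates {
  return { canonical: enPath, languages: languagesSketch(enPath) };
}

// Alternates for the zh page: canonical is the /zh path, same languages map.
export function zhAlternatesSketch(enPath: string): Alternates {
  return { canonical: zhPathSketch(enPath), languages: languagesSketch(enPath) };
}
```

Building both results from one shared map is what keeps the pairing
bidirectional: neither page can list an alternate the other omits.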

Not mirrored (documented in AGENTS.md and docs/i18n.md): per-slug
compare pages, /datasets, feature-gated tabs, and feed.xml/llms.txt.
zh posts reuse the English OG image because the default Satori font has
no CJK glyphs.

中文：为全站新增简体中文 /zh 页面——首页、8 个仪表板标签页（中文元数据
与图表上方的服务器端中文简介）、关于页（含中文 FAQ 与 FAQPage 结构化
数据）、支持者页（45 条支持者评价全部翻译）、土地致谢页、GPU 对比索引
页，以及全部 14 篇博客文章翻译（content/blog/zh/，与英文原文按文件名
配对）。通过新的 i18n 辅助函数实现双向 hreflang，站点地图成对输出中英
URL；页眉新增 EN↔中文 切换，仪表板标签栏与页脚支持中文；博客库支持
CJK 阅读时长与中文标题锚点。AGENTS.md 新增强制规则：所有新页面、标签
页与博客文章必须在同一 PR 中提供中文版本；设计说明见 docs/i18n.md。

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
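
The "Blog lib" changes described in the commit message (CJK-aware
reading time and Han-preserving slugify) can be sketched roughly as
follows. This is a hedged illustration, not the code in
packages/app/src/lib/blog.ts: the function names, the rounding policy,
and the exact regular expressions are assumptions. Only the two rates
(400 Han chars/min, 265 Latin wpm) come from this patch.

```typescript
// Hypothetical sketch; the real getReadingTime and slugify may differ.
const HAN_CHARS_PER_MINUTE = 400; // rate stated in the commit message
const LATIN_WORDS_PER_MINUTE = 265; // rate stated in docs/i18n.md

export function estimateReadingMinutes(text: string): number {
  // Count each Han ideograph individually.
  const hanCount = (text.match(/\p{Script=Han}/gu) ?? []).length;
  // Blank out Han characters so they do not glue Latin words together.
  const latinOnly = text.replace(/\p{Script=Han}/gu, ' ');
  // Keep only tokens with a letter or digit, so stray CJK punctuation
  // left behind after blanking is not counted as a word.
  const wordCount = latinOnly
    .split(/\s+/)
    .filter((token) => /[\p{L}\p{N}]/u.test(token)).length;
  const minutes =
    hanCount / HAN_CHARS_PER_MINUTE + wordCount / LATIN_WORDS_PER_MINUTE;
  // Assumed policy: round up and never report less than one minute.
  return Math.max(1, Math.ceil(minutes));
}

export function slugifyKeepingHan(heading: string): string {
  return heading
    .toLowerCase()
    .trim()
    // Collapse every run of characters that is neither Han nor [a-z0-9].
    .replace(/[^\p{Script=Han}a-z0-9]+/gu, '-')
    // Drop leading and trailing hyphens.
    .replace(/^-+|-+$/g, '');
}
```

With this sketch, a heading such as `纸面规格` keeps its characters as the
anchor id instead of collapsing to an empty string, and an 800-character
Chinese passage is estimated at two minutes rather than one "word".
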
 AGENTS.md                                     |  18 +-
 docs/i18n.md                                  |  29 +
 docs/index.md                                 |   1 +
 ...nvfp4-vs-h200-fp8-3-6x-perf-per-dollar.mdx | 215 ++++
 ...vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx | 248 +++++
 ...h200-int4-kimi-k2-vllm-perf-per-dollar.mdx | 201 ++++
 ...seekv4-16t-day-0-to-day-43-performance.mdx | 588 +++++++++++
 ...vl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200.mdx | 136 +++
 ...b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx | 192 ++++
 ...nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx | 181 ++++
 ...max-open-source-inference-benchmarking.mdx | 671 +++++++++++++
 ...x-v2-nvidia-blackwell-vs-amd-vs-hopper.mdx | 943 ++++++++++++++++++
 ...deepseek-v4-pro-sglang-110x-in-26-days.mdx | 254 +++++
 ...x-glm5-fp8-sglang-40-cheaper-than-b200.mdx | 195 ++++
 ...mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx | 130 +++
 ...i355x-qwen3-5-sglang-v0-5-12-up-to-17x.mdx | 179 ++++
 ...-0-5-6-b200-deepseek-r1-fp4-up-to-1-8x.mdx | 112 +++
 packages/app/cypress/e2e/zh-pages.cy.ts       |  87 ++
 packages/app/src/app/(landing)/page.tsx       |   3 +-
 packages/app/src/app/about/page.tsx           |   3 +-
 packages/app/src/app/blog/[slug]/page.tsx     |  15 +-
 packages/app/src/app/blog/page.tsx            |   3 +-
 .../app/src/app/compare-per-dollar/page.tsx   |   4 +-
 packages/app/src/app/compare/page.tsx         |   4 +-
 .../app/src/app/land-acknowledgement/page.tsx |   3 +-
 packages/app/src/app/quotes/page.tsx          |   3 +-
 packages/app/src/app/sitemap.ts               |  91 +-
 .../app/zh/(dashboard)/calculator/page.tsx    |  23 +
 .../app/zh/(dashboard)/evaluation/page.tsx    |  21 +
 .../app/zh/(dashboard)/gpu-metrics/page.tsx   |  16 +
 .../src/app/zh/(dashboard)/gpu-specs/page.tsx |  16 +
 .../app/zh/(dashboard)/historical/page.tsx    |  19 +
 .../src/app/zh/(dashboard)/inference/page.tsx |  19 +
 .../app/src/app/zh/(dashboard)/layout.tsx     |   5 +
 .../app/zh/(dashboard)/reliability/page.tsx   |  19 +
 .../app/zh/(dashboard)/submissions/page.tsx   |  16 +
 packages/app/src/app/zh/about/page.tsx        | 215 ++++
 .../app/zh/blog/[slug]/opengraph-image.tsx    |  43 +
 packages/app/src/app/zh/blog/[slug]/page.tsx  | 214 ++++
 packages/app/src/app/zh/blog/layout.tsx       |   9 +
 packages/app/src/app/zh/blog/page.tsx         | 127 +++
 .../src/app/zh/compare-per-dollar/layout.tsx  |  13 +
 .../src/app/zh/compare-per-dollar/page.tsx    | 149 +++
 packages/app/src/app/zh/compare/layout.tsx    |  13 +
 packages/app/src/app/zh/compare/page.tsx      | 161 +++
 .../src/app/zh/land-acknowledgement/page.tsx  |  99 ++
 packages/app/src/app/zh/layout.tsx            |  18 +
 packages/app/src/app/zh/not-found.tsx         |  16 +
 packages/app/src/app/zh/page.tsx              |  26 +
 packages/app/src/app/zh/quotes/page.tsx       |  23 +
 .../app/src/components/about/faq-data-zh.ts   | 127 +++
 .../src/components/blog/blog-back-link.tsx    |  12 +-
 .../src/components/blog/blog-post-card.tsx    |   6 +-
 .../app/src/components/blog/blog-post-nav.tsx |  18 +-
 .../app/src/components/blog/blog-tag-link.tsx |   6 +-
 packages/app/src/components/blog/blog-toc.tsx |   9 +-
 packages/app/src/components/footer/footer.tsx |   8 +
 packages/app/src/components/header/header.tsx |  56 +-
 packages/app/src/components/intro-section.tsx |  22 +-
 .../src/components/landing/landing-page.tsx   | 135 ++-
 .../app/src/components/quote-carousel.tsx     |   5 +-
 .../src/components/quotes/quotes-content.tsx  |  36 +-
 .../app/src/components/quotes/quotes-data.ts  | 151 ++-
 .../app/src/components/set-document-lang.tsx  |  22 +
 packages/app/src/components/tab-nav.tsx       |  25 +-
 .../app/src/components/zh/zh-tab-intro.tsx    |  19 +
 packages/app/src/lib/blog.test.ts             | 139 +++
 packages/app/src/lib/blog.ts                  |  82 +-
 packages/app/src/lib/i18n.test.ts             | 103 ++
 packages/app/src/lib/i18n.ts                  | 110 ++
 packages/app/src/lib/tab-meta-zh.test.ts      |  62 ++
 packages/app/src/lib/tab-meta-zh.ts           | 144 +++
 packages/app/src/lib/tab-meta.ts              |   8 +-
 packages/constants/src/seo.ts                 |  10 +
 74 files changed, 6892 insertions(+), 212 deletions(-)
 create mode 100644 docs/i18n.md
 create mode 100644 packages/app/content/blog/zh/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar.mdx
 create mode 100644 packages/app/content/blog/zh/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx
 create mode 100644 packages/app/content/blog/zh/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar.mdx
 create mode 100644 packages/app/content/blog/zh/deepseekv4-16t-day-0-to-day-43-performance.mdx
 create mode 100644 packages/app/content/blog/zh/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200.mdx
 create mode 100644 packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx
 create mode 100644 packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx
 create mode 100644 packages/app/content/blog/zh/inferencemax-open-source-inference-benchmarking.mdx
 create mode 100644 packages/app/content/blog/zh/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper.mdx
 create mode 100644 packages/app/content/blog/zh/mi355x-deepseek-v4-pro-sglang-110x-in-26-days.mdx
 create mode 100644 packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx
 create mode 100644 packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx
 create mode 100644 packages/app/content/blog/zh/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x.mdx
 create mode 100644 packages/app/content/blog/zh/sglang-0-5-6-b200-deepseek-r1-fp4-up-to-1-8x.mdx
 create mode 100644 packages/app/cypress/e2e/zh-pages.cy.ts
 create mode 100644 packages/app/src/app/zh/(dashboard)/calculator/page.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/evaluation/page.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/gpu-metrics/page.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/gpu-specs/page.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/historical/page.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/inference/page.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/layout.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/reliability/page.tsx
 create mode 100644 packages/app/src/app/zh/(dashboard)/submissions/page.tsx
 create mode 100644 packages/app/src/app/zh/about/page.tsx
 create mode 100644 packages/app/src/app/zh/blog/[slug]/opengraph-image.tsx
 create mode 100644 packages/app/src/app/zh/blog/[slug]/page.tsx
 create mode 100644 packages/app/src/app/zh/blog/layout.tsx
 create mode 100644 packages/app/src/app/zh/blog/page.tsx
 create mode 100644 packages/app/src/app/zh/compare-per-dollar/layout.tsx
 create mode 100644 packages/app/src/app/zh/compare-per-dollar/page.tsx
 create mode 100644 packages/app/src/app/zh/compare/layout.tsx
 create mode 100644 packages/app/src/app/zh/compare/page.tsx
 create mode 100644 packages/app/src/app/zh/land-acknowledgement/page.tsx
 create mode 100644 packages/app/src/app/zh/layout.tsx
 create mode 100644 packages/app/src/app/zh/not-found.tsx
 create mode 100644 packages/app/src/app/zh/page.tsx
 create mode 100644 packages/app/src/app/zh/quotes/page.tsx
 create mode 100644 packages/app/src/components/about/faq-data-zh.ts
 create mode 100644 packages/app/src/components/set-document-lang.tsx
 create mode 100644 packages/app/src/components/zh/zh-tab-intro.tsx
 create mode 100644 packages/app/src/lib/i18n.test.ts
 create mode 100644 packages/app/src/lib/i18n.ts
 create mode 100644 packages/app/src/lib/tab-meta-zh.test.ts
 create mode 100644 packages/app/src/lib/tab-meta-zh.ts
diff --git a/AGENTS.md b/AGENTS.md
index c2b8f25b..282a322b 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -6,6 +6,8 @@ For detailed subsystem docs, see [docs/index.md](./docs/index.md).
 
 > **Translation quality bar:** write natural technical Chinese, not word-for-word machine translation (style reference: [`vllm-project/vllm-ascend` `README.zh.md`](https://github.com/vllm-project/vllm-ascend/blob/main/README.zh.md)). Preserve product names, hardware SKUs, framework/library names (Next.js, React Query, D3.js, Tailwind ...), flags, and code identifiers in English. Use parenthetical English clarification for acronyms on first use. Preferred terms: benchmark 基准测试, dashboard 仪表板, chart 图表, config 配置, throughput 吞吐量, latency 延迟, single-node/multi-node 单节点/多节点, evaluation 评估, artifact 产物.
 
+> **The website itself is bilingual too — every indexable page must ship a Simplified Chinese sibling under `/zh`.** See [Chinese Website Pages](#chinese-website-pages-zh--mandatory-for-all-indexable-surfaces) below; a new page, tab, or blog post without its `/zh` version is 🔴 BLOCKING on PR review.
+
 ## Project Overview
 
 InferenceX App — Next.js 16 dashboard for ML inference benchmark data. DB-backed with Neon PostgreSQL, React Query for data fetching, D3.js for charts.
@@ -121,6 +123,18 @@ When adding a chart feature (toggle, label, overlay, filter, export, share-link
 
 If the feature genuinely cannot apply to overlays (e.g., it depends on data only ingested for official runs), say so explicitly in code comments and the PR description. Default to "must support overlays."
 
+## Chinese Website Pages (/zh) — Mandatory for All Indexable Surfaces
+
+The site ships a hand-authored Simplified Chinese sibling for every indexable page under the `/zh` route prefix (`/` ↔ `/zh`, `/about` ↔ `/zh/about`, `/blog/<slug>` ↔ `/zh/blog/<slug>`, …) so the site is crawled and indexed in Chinese as well as English. There is no i18n framework — each `/zh` page is a real page that reuses the shared helpers in `packages/app/src/lib/i18n.ts` (`zhAlternates`, `enAlternates`, `ZH_OG_LOCALE`, `ZH_MIRRORED_ROUTES`) and `src/lib/tab-meta-zh.ts`. The translation quality bar above applies to all site content.
+
+**Every new indexable page, dashboard tab, or blog post MUST ship its Chinese version in the same PR:**
+
+1. **New page** → create `packages/app/src/app/zh/<route>/page.tsx` with fully translated content and metadata. Metadata: `alternates: zhAlternates('<en-path>')` plus `openGraph.locale: ZH_OG_LOCALE`. Switch the English page's `alternates` to `enAlternates('<en-path>')` so both sides carry bidirectional hreflang. Register the route in `ZH_MIRRORED_ROUTES` (`src/lib/i18n.ts`) so the header nav and EN↔中文 toggle link to it, and add it to the sitemap via `localizedPair()` in `src/app/sitemap.ts`.
+2. **New dashboard tab** → add the tab to `ZH_TAB_KEYS`, `TAB_META_ZH`, `TAB_INTRO_ZH`, and `TAB_LABELS_ZH` in `src/lib/tab-meta-zh.ts`, then create `src/app/zh/(dashboard)/<tab>/page.tsx` mirroring the English page with `tabMetadataZh('<tab>')` and a `<ZhTabIntro tab="<tab>" />` block above the chart (the interactive chart UI itself stays English). `tab-meta-zh.test.ts` enforces dictionary completeness.
+3. **New blog post** → the translation `packages/app/content/blog/zh/<same-filename>.mdx` is REQUIRED in the same PR. Translate frontmatter `title`/`subtitle` and the body; keep `date`, `publishDate`, `modifiedDate`, `tags`, and the filename/slug identical (English and Chinese posts pair by filename; visibility gating always follows the English post's `publishDate`). Rewrite internal `/blog/<slug>` links to `/zh/blog/<slug>`; never alter numbers, code blocks, or `<Figure>`/`<JsonLd>` structure. The `/zh/blog` listing, hreflang, and sitemap pick the file up automatically.
+4. **Editing an existing English page or post** → update its Chinese sibling in the same PR. Content drift between languages is a 🔴 BLOCKING review issue.
+5. **Intentionally not mirrored** (skip these, or add them to `ZH_MIRRORED_ROUTES` when you do mirror them): per-slug compare pages (`/compare/[slug]`, `/compare-per-dollar/[slug]` — the `/zh/compare*` index pages link to the English slug pages), `/datasets`, feature-gated tabs (`ai-chart`, `current-inferencex-image`, `feedback`), `feed.xml`/`llms.txt`, and per-post OG images (Chinese posts reuse the English post's OG image — the OG renderer's font has no CJK glyphs).
+
 ## Chart Interpolation — TS and Python Helpers MUST Stay in Sync
 
 The blog-writing workflow (`.claude/skills/write-inferencex-blog/`) ships a Python port of the chart's interpolation algorithm at `.claude/skills/write-inferencex-blog/iso_interactivity.py`. It exists so iso-interactivity tables in blog posts produce **exactly the same numbers** readers see when they hover the rendered chart. Linear-interpolation shell scripts will produce visibly different values — Cursor Bugbot has flagged this on prior posts.
@@ -197,7 +211,8 @@ Authoritative total / active parameter counts for every model in the dashboard.
 
 1. Create `packages/app/content/blog/<slug>.mdx` with frontmatter: `title`, `subtitle`, `date` (required), `tags`, `modifiedDate` (optional)
 2. Write content using Markdown + custom MDX components (`Figure`, `Blur`)
-3. No code changes needed — the post automatically appears in the blog list, sitemap, RSS feed, llms.txt, and gets a generated OG image
+3. Create the Simplified Chinese translation at `packages/app/content/blog/zh/<slug>.mdx` (**required** — see [Chinese Website Pages](#chinese-website-pages-zh--mandatory-for-all-indexable-surfaces))
+4. No code changes needed — the post automatically appears in the blog list, sitemap, RSS feed, llms.txt, and gets a generated OG image; the zh file appears on `/zh/blog` with hreflang pairing
 
 See [Blog](./docs/blog.md) for content format, available MDX components, and design details.
 
@@ -222,6 +237,7 @@ See [Blog](./docs/blog.md) for content format, available MDX components, and des
 2. Create a per-section context provider (see `InferenceContext.tsx`, `EvaluationContext.tsx` for patterns)
 3. Use `ChartLegend` with `variant="sidebar"`, sorted by `HW_REGISTRY` sort order, default expanded
 4. Analytics: all interactive elements use `track()` with `{tabname}_` prefix
+5. Create the Chinese sibling: extend `src/lib/tab-meta-zh.ts` dictionaries and add `src/app/zh/(dashboard)/<tab>/page.tsx` (see [Chinese Website Pages](#chinese-website-pages-zh--mandatory-for-all-indexable-surfaces))
 
 ### Bumping dependencies
 
diff --git a/docs/i18n.md b/docs/i18n.md
new file mode 100644
index 00000000..50bdaca1
--- /dev/null
+++ b/docs/i18n.md
@@ -0,0 +1,29 @@
+# Chinese Pages (/zh)
+
+Why the Simplified Chinese site is a hand-authored `/zh` page tree instead of an i18n framework, and how the pieces fit together. The authoring rules (what you MUST do when adding a page/tab/post) live in [AGENTS.md — Chinese Website Pages](../AGENTS.md#chinese-website-pages-zh--mandatory-for-all-indexable-surfaces); this doc covers the design rationale.
+
+## Why hand-authored pages, not next-intl / `[locale]` routing
+
+- **SEO is the goal, not full UI translation.** The objective is Chinese pages that crawlers can index: Chinese titles, meta descriptions, server-rendered Chinese content, and bidirectional hreflang. The interactive dashboard (charts, filters, tooltips) stays English — model/GPU/framework names are English-first terms in Chinese ML writing anyway, and translating deep chart UI would touch hundreds of client components for little search value.
+- **A `[locale]` root segment would move every route** and force middleware rewrites to keep existing URLs stable — high blast radius for a two-locale site. A parallel `/zh` tree adds pages without touching English URLs.
+- **No message-catalog indirection.** With exactly two locales and mostly page-level content, colocated `STRINGS = { en, zh }` dictionaries (landing page, quotes content) and dedicated zh files (`tab-meta-zh.ts`, `faq-data-zh.ts`) are easier to review than a parallel key-file hierarchy, and dead strings die with their page.
+
+## Architecture
+
+| Piece                  | Where                                     | Notes                                                                                                                                                                                                                                   |
+| ---------------------- | ----------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| URL mapping + hreflang | `src/lib/i18n.ts`                         | `zhPath`, `enAlternates`/`zhAlternates` (both sides emit the same `languages` map, `x-default` = English), `ZH_MIRRORED_ROUTES` (single source of truth for "which routes have a zh sibling"), `switchLocalePath` for the header toggle |
+| Tab dictionaries       | `src/lib/tab-meta-zh.ts`                  | `TAB_META_ZH` (metadata), `TAB_INTRO_ZH` (server-rendered intro), `TAB_LABELS_ZH` (TabNav), `NAV_LABELS_ZH` (header). Completeness enforced by `tab-meta-zh.test.ts`                                                                    |
+| zh page tree           | `src/app/zh/**`                           | Mirrors the English tree; dashboard pages render `<ZhTabIntro>` above the same chart components                                                                                                                                         |
+| Blog translations      | `content/blog/zh/<slug>.mdx`              | Same filename pairs the languages. Visibility gating (publishDate) always derives from the **English** post so a translation can never publish early; title/subtitle/reading time come from the zh file                                 |
+| Locale-aware chrome    | `header.tsx`, `tab-nav.tsx`, `footer.tsx` | Client components detect `/zh` from `usePathname()` — no prop drilling, no context                                                                                                                                                      |
+
+## Non-obvious decisions
+
+- **`<html lang>`**: the root layout hardcodes `lang="en"`, and Next.js cannot override it per segment without multiple root layouts. The zh layout therefore wraps content in `<div lang="zh-CN">` (valid HTML that scopes the language for crawlers and assistive technology), and `SetDocumentLang` corrects `document.documentElement.lang` after hydration. Google determines page language from visible content rather than the `lang` attribute, so this is not expected to hurt SEO.
+- **Slugs stay English.** zh posts keep the English filename/slug — `slugify()` would previously have destroyed CJK slugs, and shared slugs are what pair the two languages for hreflang. `slugify()` now _preserves_ Han characters, but only so Chinese _headings_ get meaningful anchor ids (`extractHeadings` and the MDX heading renderer share it).
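+- **Reading time is CJK-aware**: `getReadingTime` counts Han characters at 400 chars/min alongside Latin words at 265 wpm. Pure whitespace word-splitting would count an entire Chinese paragraph as ~1 "word" because Chinese text has no spaces between words, which badly underestimates reading time.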
+- **zh OG images reuse the English post meta** — the `next/og` default Satori font has no CJK glyphs, so a Chinese title would render as tofu. Loading a subset CJK font is a known follow-up.
+- **`/zh/inference` canonicalizes to `/zh`**, mirroring the English quirk where `/inference` canonicalizes to `/`.
+- **Compare slug pages are not mirrored**: `compareTableNarrative` (`compare-ssr.ts`) generates hundreds of lines of English prose per programmatic page; translating the templates is a separate project. The `/zh/compare*` index pages exist and link to the English slug pages.
+- **Sitemap pairs**: `localizedPair()` in `sitemap.ts` emits the EN and zh URL together, both carrying the same `alternates.languages` map. Blog posts without a translation fall back to an English-only entry, so a missing translation degrades gracefully instead of 404-ing crawlers.
diff --git a/docs/index.md b/docs/index.md
index a95646a2..ea6154ed 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,3 +15,4 @@ Design rationale and non-obvious conventions. See [CLAUDE.md](../CLAUDE.md) for
 - [Data Transforms](./data-transforms.md) — Full pipeline from BenchmarkRow to RenderableGraph: type hierarchy, hardware key construction, derived metrics, memoization strategy
 - [State Ownership](./state-ownership.md) — Which context owns which state, availability filtering cascade, comparison date mechanics, URL param sync
 - [Blog](./blog.md) — MDX content system, SEO features (OG images, RSS, llms.txt, JSON-LD), TOC sidebar, reading progress, heading links, analytics events
+- [Chinese Pages (/zh)](./i18n.md) — Why hand-authored /zh pages instead of an i18n framework, hreflang pairing, blog translation pairing, html lang workaround, CJK reading time/slugs
diff --git a/packages/app/content/blog/zh/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar.mdx b/packages/app/content/blog/zh/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar.mdx
new file mode 100644
index 00000000..36d30314
--- /dev/null
+++ b/packages/app/content/blog/zh/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar.mdx
@@ -0,0 +1,215 @@
+---
+title: 'B200 NVFP4 对比 H200 FP8 运行 GLM-5：SGLang MTP 下性价比提升高达 3.65 倍'
+subtitle: '两款 GPU 均运行 SGLang EAGLE MTP；Blackwell 世代在峰值处带来约 1.2 倍的性价比提升，NVIDIA GLM-5-NVFP4 检查点搭配 FlashInfer TRT-LLM 稀疏 MLA 在 8K/1K 场景下再叠加约 2.4–3.0 倍优势'
+date: '2026-05-26'
+publishDate: '2026-05-26'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - glm5
+  - nvidia
+  - b200
+  - h200
+  - sglang
+  - fp4
+---
+
+在 GLM-5 8K/1K 工作负载下，H200 和 B200 均运行 SGLang 时，NVIDIA 的 GLM-5-NVFP4 检查点在 B200 上实现了**等交互性（iso-interactivity）下性价比最高达 H200 SGLang FP8 的 3.65 倍**——在 80 tok/s/user 时，H200 的成本为 $1.06/M tokens，而 B200 NVFP4 仅为 $0.29/M tokens。该优势在 H200 的整个 25–84 tok/s/user 运行区间内保持在 3.24x–3.65x 范围。数据基于 2026-05-25 InferenceX 基准测试（benchmark），使用 SGLang v0.5.12。
+
+这 3.65 倍的提升在峰值处可清晰分解。在 80 tok/s/user 时，**B200 SGLang FP8 + MTP 的性价比是 H200 SGLang FP8 + MTP 的 1.22 倍**——这是在相同精度和相同 EAGLE 方案下，仅靠 Blackwell 世代硬件 + 软件带来的提升。**将 B200 的权重从 `zai-org/GLM-5-FP8` 切换为 `nvidia/GLM-5-NVFP4` 再叠加 2.98 倍**——这是仅靠精度切换带来的提升，得益于 FlashInfer 的 TRT-LLM 稀疏 MLA 内核——该内核已在 [sgl-project/sglang #21783](https://github.com/sgl-project/sglang/pull/21783) 中被设为 sm100/sm103 的默认后端。1.22 × 2.98 ≈ 3.65。在不同运行区间，两个因素的贡献比例会互换——世代因素在低交互性时贡献更大（50 tok/s/user 时为 1.36x），精度因素在高交互性时贡献更大（84 tok/s/user 时为 3.07x）——但组合提升始终保持稳定。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-25&g_model=GLM-5&g_runid=26381101926&i_prec=fp4%2Cfp8&i_active=b200_sglang_mtp%2Ch200_sglang_mtp&i_linelabel=1">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<Figure
+  srcLight="/images/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar/benchmark-light.png"
+  srcDark="/images/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar/benchmark-dark.png"
+  alt="GLM-5 8K/1K 每 GPU 吞吐量与交互性关系图，三条 SGLang MTP 曲线：B200 NVFP4（最上方）、B200 FP8（中间）、H200 FP8（最下方）。B200 NVFP4 在 18 tok/s/user 时峰值超过 4,000 tok/s/GPU。"
+  caption="GLM-5（744B / 40B 激活参数），ISL 8192 / OSL 1024。H200 和 B200 均使用 SGLang v0.5.12 及基于 EAGLE 的 MTP。标签标注了每个配置的 GPU 数量（FP4 前沿为 TP=4 的 4 GPU 加上右端一个 TP=8 / 8 GPU 数据点；FP8 曲线为 TP=8 的 8 GPU）。"
+/>
+
+GLM-5 是智谱（ZAI/Zhipu）的 MoE 旗舰模型，发布于 2026-02-11——距本次测试约 14 周。它是一个 **744B 参数的稀疏 MoE，每个 token 激活约 40B 参数**：256 个专家 + top-8 路由（约 5.9% 稀疏度）加共享专家，解码阶段使用 **DeepSeek Sparse Attention（DSA）** 并搭配 Multi-head Latent Attention（MLA）进行 KV 缓存压缩，上下文窗口为 200K。发布的架构名称为 `glm_moe_dsa`——与 DeepSeek 在 V3.2 中引入的稀疏注意力模式相同，也是 SGLang 在 Blackwell 上的 TRT-LLM 稀疏 MLA 后端所针对优化的架构。
+
+NVIDIA 还发布了量化权重版本 [`nvidia/GLM-5-NVFP4`](https://huggingface.co/nvidia/GLM-5-NVFP4)——与 `zai-org/GLM-5-FP8` 采用相同的模型架构，但所有 MoE GEMM 权重从 FP8 重新转换为 NVFP4（16 元素分块、FP8 逐块缩放因子、FP32 逐张量缩放因子）。KV 缓存保持 FP8。这就是图表（chart）中 B200 曲线所加载的检查点；H200 曲线加载 `zai-org/GLM-5-FP8`，因为 Hopper 没有 FP4 张量核心。
+
+## 纸面规格
+
+在介绍具体方案之前，先看硬件。H200 SXM（Hopper）和 B200 SXM（Blackwell）相隔一代。下方雷达图（chart）将每个轴归一化到 [`/gpu-specs`](/gpu-specs) 中所有 NVIDIA + AMD SKU 的最大值——因此 H200 和 B200 的多边形在 GB200/GB300 NVL72 设定上限的轴上显得较小（特别是 Scale-up Domain Memory 和 Scale-up Domain Memory Bandwidth，它们随 72-GPU NVLink 域的机架级规模而扩展）。
+
+<Figure
+  srcLight="/images/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar/specs-radar-light.png"
+  srcDark="/images/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar/specs-radar-dark.png"
+  alt="雷达图对比 H200 SXM 和 B200 SXM 在显存、显存带宽、FP4/FP8/BF16 TFLOP/s、Scale-up 带宽、Scale-up Domain Memory 和 Scale-up Domain Memory Bandwidth 上的表现。各轴归一化至 /gpu-specs 中所有 GPU 的最大值。"
+  caption="H200 SXM 与 B200 SXM 的 InferenceX /gpu-specs 雷达图对比。各轴归一化至该指标的跨厂商最大值（例如 FP4 最大值为 GB300 NVL72 的 15 PFLOP/s/GPU，因此 B200 的 9 PFLOP/s 显示约 60%；Scale-up Domain Memory 和带宽最大值由 GB200/GB300 NVL72 的 72-GPU NVLink 域设定，因此 H200 和 B200 的 8-GPU 域均显示较低）。H200 在 FP4 轴上为 0%，因为 Hopper 没有 FP4 张量核心。"
+/>
+
+本次基准测试中两款 SKU 的绝对值：
+
+| 规格                              | H200 SXM            | B200 SXM            | B200 / H200 |
+| --------------------------------- | ------------------- | ------------------- | ----------- |
+| HBM 容量                          | 141 GB (HBM3e)      | 180 GB (HBM3e)      | 1.28x       |
+| HBM 带宽                          | 4.8 TB/s            | 8.0 TB/s            | 1.67x       |
+| Dense FP4 (TFLOP/s)               | —                   | 9,000               | —           |
+| Dense FP8 (TFLOP/s)               | 1,979               | 4,500               | 2.27x       |
+| Dense BF16 (TFLOP/s)              | 989                 | 2,250               | 2.28x       |
+| Scale-up 每 GPU 带宽（单向）      | 450 GB/s (NVLink 4) | 900 GB/s (NVLink 5) | 2.00x       |
+| Scale-up 节点规模                 | 8                   | 8                   | 1.00x       |
+| Scale-up Domain HBM 容量          | 1,128 GB            | 1,440 GB            | 1.28x       |
+| Scale-up Domain HBM 带宽（聚合）  | 38.4 TB/s           | 64.0 TB/s           | 1.67x       |
+| TCO（SemiAnalysis AI Cloud 模型） | $1.41/GPU/hr        | $1.95/GPU/hr        | 1.38x       |
+
+对 FP8 对 FP8 对比的启示：在相同精度和相同方案下，B200 相对 H200 的性价比上限在完全计算瓶颈的工作负载上约为 `2.27 / 1.38 ≈ 1.64x`，在完全显存带宽瓶颈的工作负载上约为 `1.67 / 1.38 ≈ 1.21x`（以 HBM 为带宽轴；若以 NVLink 带宽计，则上限为 `2.00 / 1.38 ≈ 1.45x`）。实测在 80 tok/s/user 时为 1.22x，落在显存带宽瓶颈区间内——GLM-5 在此并发度下的解码阶段主要受 MoE 权重和 KV 缓存的 HBM 读取限制，而非 FP8 GEMM 吞吐量，因此 Blackwell 的 dense 计算余量大部分未被利用。NVFP4 才是打破 GEMM 天花板的关键杠杆：H200 没有 FP4 张量核心，而 B200 拥有 9 PFLOP/s，由此带来的精度提升在世代提升之上再叠加 2.41x–3.07x。
+
+## 促成此结果的上游变更
+
+**上游软件栈。** SGLang [v0.5.10](https://github.com/sgl-project/sglang/releases/tag/v0.5.10)（2026-04-07）是 GLM-5 首次在 Blackwell 上跨所有四个精度/MTP/分离式推理变体完整端到端运行的稳定版本——[跟踪 issue #19380](https://github.com/sgl-project/sglang/issues/19380) 在同日将每个 Functional 和 Baseline Perf 行标记为 DONE。本文的基准测试运行于 [v0.5.12](https://github.com/sgl-project/sglang/releases/tag/v0.5.12)（发布于 2026-05-16），它继承了相同的 Blackwell 默认配置并增加了第一轮性能优化。关键内核变更：
+
+- [sgl-project/sglang #21783](https://github.com/sgl-project/sglang/pull/21783) 将 **FlashInfer TRT-LLM 稀疏 MLA 内核设为 sm100/sm103**（B200/B300）**的默认注意力后端**。DSA prefill 和 decode 现在运行在 GLM-5/V3.2 所针对调优的内核上，而非曾在 B200 上引发 [GLM-5 精度回归](https://github.com/sgl-project/sglang/issues/21291) 的旧 `flashmla_kv` 路径。
+- [sgl-project/sglang #21405](https://github.com/sgl-project/sglang/pull/21405) 为稀疏 MLA 启用了 **IndexCache**，在连续 decode 步骤间复用索引张量，在相同内核调用序列上带来 >10% 的 decode 吞吐量提升。
+- [flashinfer-ai/flashinfer #2726](https://github.com/flashinfer-ai/flashinfer/pull/2726)（FlashInfer v0.6.6.post1）修复了一个间歇性 NVFP4 非法内存访问 bug，此前一直[阻塞](https://github.com/sgl-project/sglang/issues/19081) NVFP4 的功能验证签核；[flashinfer-ai/flashinfer #2836](https://github.com/flashinfer-ai/flashinfer/pull/2836)（v0.6.7）提升了 trtllm-gen 稀疏 MLA 的性能上限。
+
+**MTP。** GLM-5 复用了 SGLang 为 DeepSeek V3.2 构建的 EAGLE 推测解码管线（`--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`），并通过 `SGLANG_ENABLE_SPEC_V2=1` 启用 overlap 调度器。H200 和 B200 使用完全相同的参数集——两款 SKU 在下面方案中唯一的不同是模型检查点和注意力后端的选择。
+
+## 详细数据
+
+所有行均为 GLM-5 在 **ISL 8192 / OSL 1024** 下的单节点非分离式推理结果，数据来自 2026-05-25 的 InferenceX 基准测试，使用 **SGLang v0.5.12** 并在每个方案中启用基于 EAGLE 的 MTP。每百万 total tokens 成本计算公式为 `TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6)`，H200 为 $1.41/GPU/hr，B200 为 $1.95/GPU/hr，来源于 [SemiAnalysis AI Cloud TCO 模型](https://newsletter.semianalysis.com/p/ai-cloud-economics)。
+
+容器镜像：两款 SKU 均使用 `lmsysorg/sglang:v0.5.12-cu130`。
+
+**H200 SGLang FP8 + MTP，TP=8，8 GPU**（模型 `zai-org/GLM-5-FP8`）：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 347.9     | 84.49      | 11.84     | $1.13      |
+| 8    | 489.7     | 59.82      | 16.72     | $0.80      |
+| 16   | 675.9     | 39.64      | 25.22     | $0.58      |
+| 32   | 851.9     | 24.90      | 40.16     | $0.46      |
+| 64   | 847.2     | 20.80      | 48.08     | $0.46      |
+
+并发 64 时 tok/s/GPU 略有回落，因为首 token 延迟（TTFT）开始主导请求时间预算——并发 32 在此方案下设定了 H200 的吞吐量上限和成本下限。Pareto 前沿剔除了并发 64，因为并发 32 在两个轴上都优于它。
+
+**B200 SGLang FP8 + MTP，TP=8，8 GPU**（模型 `zai-org/GLM-5-FP8`）：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 417.0     | 100.85     | 9.92      | $1.30      |
+| 8    | 650.1     | 77.82      | 12.85     | $0.83      |
+| 16   | 952.7     | 56.93      | 17.57     | $0.57      |
+| 32   | 1,296.8   | 38.16      | 26.21     | $0.42      |
+| 64   | 1,619.3   | 23.56      | 42.45     | $0.33      |
+| 128  | 1,929.5   | 13.78      | 72.59     | $0.28      |
+| 256  | 1,947.3   | 11.88      | 84.15     | $0.28      |
+
+**B200 SGLang NVFP4 + MTP，TP=4，4 GPU**（模型 `nvidia/GLM-5-NVFP4`）——成本前沿的锚点：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 1,038.7   | 121.22     | 8.25      | $0.52      |
+| 8    | 1,523.5   | 94.53      | 10.58     | $0.36      |
+| 16   | 2,228.1   | 66.27      | 15.09     | $0.24      |
+| 32   | 3,037.3   | 43.99      | 22.73     | $0.18      |
+| 64   | 3,739.7   | 26.78      | 37.33     | $0.14      |
+| 128  | 4,115.5   | 17.63      | 56.73     | $0.13      |
+| 256  | 4,090.7   | 17.37      | 57.57     | $0.13      |
+
+**B200 SGLang NVFP4 + MTP，TP=8，8 GPU**——单个高交互性数据点，向右延伸 FP4 前沿：
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 579.2     | 140.08     | 7.14      | $0.94      |
+
+TP=8 / 8 GPU 配置以每 GPU 吞吐量下降约 44%（579.2 vs 1,038.7 tok/s/GPU）为代价，在相同并发下获得了比 TP=4 高约 16% 的交互性：额外的 GPU 将 TPOT 从 8.25 ms 降至 7.14 ms。FP4 的综合 Pareto 前沿从 18 tok/s/user 时的 $0.13/M（TP=4，并发 128）延伸至 140 tok/s/user 时的 $0.94/M（TP=8，并发 4）。
+
+## 等交互性下的性价比对比
+
+在匹配的交互性水平下，沿每款 SKU 的 Pareto 前沿插值得出的每 GPU 吞吐量和每百万 tokens 成本。最后一列的性价比提升倍数为 $/M 比值的倒数——B200 NVFP4 相对于 H200 的性价比。超出前沿测量范围的单元格标记为 _unreachable_。
+
+| 交互性 (tok/s/user) | H200 FP8 MTP $/M | B200 FP8 MTP $/M | B200 NVFP4 MTP $/M | B200 NVFP4 性价比 vs H200 |
+| ------------------- | ---------------- | ---------------- | ------------------ | ------------------------- |
+| 25                  | $0.46            | $0.34            | $0.14              | 3.24x                     |
+| 30                  | $0.50            | $0.37            | $0.15              | 3.32x                     |
+| 40                  | $0.58            | $0.43            | $0.17              | 3.44x                     |
+| 50                  | $0.69            | $0.51            | $0.19              | 3.54x                     |
+| 60                  | $0.80            | $0.60            | $0.22              | 3.60x                     |
+| 70                  | $0.93            | $0.72            | $0.26              | 3.63x                     |
+| **80**              | **$1.06**        | **$0.87**        | **$0.29**          | **3.65x**                 |
+| 84                  | $1.12            | $0.94            | $0.31              | 3.64x                     |
+| 100                 | _unreachable_    | $1.28            | $0.38              | _∞_                       |
+| 120                 | _unreachable_    | _unreachable_    | $0.51              | _∞_                       |
+| 140                 | _unreachable_    | _unreachable_    | $0.93              | _∞_                       |
+
+B200 NVFP4 相对 H200 的性价比提升在 **80 tok/s/user 时达到峰值 3.65 倍**，且在整个 H200 运行区间内保持在 3.24x–3.65x 范围：在任何交互性点上，H200 FP8 + MTP 与 B200 NVFP4 + MTP 的差距都不低于 3 倍。仅精度切换带来的提升（B200 FP8 → B200 NVFP4）随交互性单调递增，从 **25 tok/s/user 时的 2.41 倍到 84 tok/s/user 时的 3.07 倍**，因为 B200 FP8 的性价比随批量减小而下降得更快。在 84 tok/s/user 以上，对比便不复存在：H200 没有任何方案能再提供一个 tok/s/user 的交互性，而 B200 NVFP4 的运行区间还可以延伸约 56 tok/s/user，直达 TP=8 下的 140 tok/s/user。
+
+<Figure
+  srcLight="/images/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar/benchmark-light.png"
+  srcDark="/images/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar/benchmark-dark.png"
+  alt="GLM-5 8K/1K 每 GPU 吞吐量与交互性关系图，三条 SGLang MTP 曲线：B200 NVFP4（最上方）、B200 FP8（中间）、H200 FP8（最下方）。B200 NVFP4 在 18 tok/s/user 时峰值超过 4,000 tok/s/GPU。"
+  caption="GLM-5（744B / 40B 激活参数），ISL 8192 / OSL 1024。H200 和 B200 均使用 SGLang v0.5.12 及基于 EAGLE 的 MTP。标签标注了每个配置的 GPU 数量（FP4 前沿为 TP=4 的 4 GPU 加上右端一个 TP=8 / 8 GPU 数据点；FP8 曲线为 TP=8 的 8 GPU）。"
+/>
+
+[在线图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-25&g_model=GLM-5&g_runid=26381101926&i_prec=fp4%2Cfp8&i_active=b200_sglang_mtp%2Ch200_sglang_mtp&i_linelabel=1)，已预筛选为 2026-05-25 测试中 H200 + B200 上的 GLM-5 SGLang MTP。[在线成本视图](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-25&g_model=GLM-5&g_runid=26381101926&i_prec=fp4%2Cfp8&i_active=b200_sglang_mtp%2Ch200_sglang_mtp&i_metric=y_costh&i_linelabel=1)展示相同对比的成本维度。
+
+## GLM-5 在 Blackwell 上的后续进展
+
+三个方向仍有望进一步提升当前数字，均已在上游跟踪中：
+
+- **NVL72 上的分离式推理。** 上述数字均为单节点聚合方式。[跟踪 issue](https://github.com/sgl-project/sglang/issues/19380) 正在积极推进 FP8 B200 分离式 8K/1K 及 GB300 分离式 MTP 的工作。宽 EP（Expert Parallelism）在 NVL72 上已在 Kimi K2.5 上展示了[每 GPU 吞吐量约 3 倍的优势](/zh/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200)——同样的杠杆应该能在 FP4 前沿趋于平台的低交互性/高吞吐量端进一步提升 GLM-5 的性价比。
+
+对于 25–84 tok/s/user 区间的聊天场景 GLM-5 推理，在使用 SGLang 的每个可测量运行点上，B200 NVFP4 + MTP 的性价比均为 H200 FP8 + MTP 的 3.24–3.65 倍。
+
+## 致谢
+
+本轮方案优化进展迅速，得益于 [SGLang 与 NVIDIA 的协作](https://github.com/sgl-project/sglang/issues/19380)在大约一个季度内完成了 Blackwell 上 no-MTP/MTP 和 Agg/Disagg 的所有 Functional 和 Baseline Perf 行——FlashInfer 中的 NVFP4 IMA 修复、sm100/sm103 上的稀疏 MLA 默认配置、IndexCache、GLM-5 的基于 EAGLE 的 MTP——而 InferenceX 方案循环在上游稳定后一周内即完成了 H200 MTP 兄弟方案的接入。感谢 SGLang 维护者、FlashInfer 团队、NVIDIA SGLang 协作线程以及在[跟踪 issue](https://github.com/sgl-project/sglang/issues/19380) 上提交 PR 的所有人。上游到基准测试的闭环速度就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-25&g_model=GLM-5&g_runid=26381101926&i_prec=fp4%2Cfp8&i_active=b200_sglang_mtp%2Ch200_sglang_mtp&i_linelabel=1">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "NVIDIA B200 NVFP4 在 GLM-5 SGLang MTP 推理中的性价比比 H200 FP8 高多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 GLM-5 8K/1K 序列长度下使用 SGLang v0.5.12 并在两款 SKU 上均启用基于 EAGLE 的 MTP，B200 NVFP4（模型 nvidia/GLM-5-NVFP4）在等交互性下性价比最高达 H200 FP8（模型 zai-org/GLM-5-FP8）的 3.65 倍。峰值提升出现在 80 tok/s/user：H200 每百万 tokens 成本为 $1.06，B200 NVFP4 仅为 $0.29。该优势在 H200 的 25 至 84 tok/s/user 全运行区间内保持在 3.24 倍至 3.65 倍。TCO 输入为 H200 $1.41/GPU/hr、B200 $1.95/GPU/hr，来源于 SemiAnalysis AI Cloud TCO 模型。数据来自 2026-05-25 的 InferenceX 基准测试（GHA run 26381101926）。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "3.65 倍的 B200 NVFP4 vs H200 FP8 性价比提升中，世代因素和精度因素各贡献多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "该提升在峰值处可清晰分解。保持精度和 MTP 不变，B200 SGLang FP8 + MTP 在 80 tok/s/user 时的性价比约为 H200 SGLang FP8 + MTP 的 1.22 倍——这是在两款 SKU 上使用相同 EAGLE 方案和相同 zai-org/GLM-5-FP8 检查点时，仅靠 Blackwell 世代加软件带来的提升。将 B200 的权重从 zai-org/GLM-5-FP8 切换为 nvidia/GLM-5-NVFP4 再叠加 2.98 倍——纯粹的精度提升。1.22 乘以 2.98 约等于 3.65。在 H200 运行区间内，世代因素在 1.19 倍（84 tok/s/user）到 1.36 倍（50 tok/s/user）之间，精度因素在 2.41 倍（25 tok/s/user）到 3.07 倍（84 tok/s/user）之间；两者贡献比例互换但组合提升保持在 3.24 倍至 3.65 倍。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "SGLang v0.5.10 / v0.5.12 中的哪些变更使 GLM-5 NVFP4 + MTP 得以在 Blackwell 上运行？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "SGLang v0.5.10（发布于 2026-04-07）是 GLM-5 (G)B200 跟踪 issue（sgl-project/sglang #19380）的所有 Functional 和 Baseline Perf 行在 no-MTP/MTP 和 Agg/Disagg 全组合、FP8 和 NVFP4 两种精度下均标记为 DONE 的首个稳定版本。v0.5.12（发布于 2026-05-16）是 InferenceX 基准测试实际运行的版本。关键内核变更：sgl-project/sglang #21783 将 FlashInfer TRT-LLM 稀疏 MLA 内核设为 sm100/sm103（B200/B300）的默认后端；sgl-project/sglang #21405 启用 IndexCache 实现 >10% 的 decode 吞吐量提升；flashinfer-ai/flashinfer #2726 修复了一个间歇性 NVFP4 非法内存访问 bug，此前一直阻塞 NVFP4 的功能验证签核；flashinfer-ai/flashinfer #2836 提升了 trtllm-gen 稀疏 MLA 的性能上限。MTP 使用 EAGLE，参数为 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4，并通过 SGLANG_ENABLE_SPEC_V2=1 启用 overlap 调度器。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 B200 上 FP4 vs FP8 的精度差距从 25 tok/s/user 的 2.41 倍扩大到 84 tok/s/user 的 3.07 倍？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在低交互性（高并发）时，B200 FP8 + MTP 和 B200 NVFP4 + MTP 均达到权重加载带宽饱和，因此成本比接近 FP4 vs FP8 GEMM 吞吐量的原始 2 倍差异——18 tok/s/user 时为 2.26 倍，25 tok/s/user 时为 2.41 倍。随着交互性提升、并发降低，每个 decode 步骤在更少的 token 间分摊权重加载开销。每个 token 的 GEMM 时间增长，FP4 张量核心的计算优势与 TP=4 NVFP4 方案中更小的每 rank 权重占用（4 GPU vs FP8 曲线的 8 GPU）叠加。在 84 tok/s/user 时，B200 FP8 插值为每百万 tokens $0.94，而 B200 NVFP4 为 $0.31——差距为 3.07 倍，精度差异已成为该方案的成本下限。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "这次 B200 NVFP4 vs H200 FP8 的 GLM-5 对比中未涵盖哪些内容？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "还有三个方向未覆盖。（1）NVL72 上的分离式推理：本次对比均为单节点聚合方式；SGLang 跟踪 issue #19380 正在积极推进 FP8 B200 分离式 8K/1K 和 GB300 分离式 MTP 的工作，宽 Expert Parallelism 在 NVL72 上已在 Kimi K2.5 上展示了约 3 倍的每 GPU 吞吐量提升。（2）B200 FP8 聚合方式的分段 CUDA graph（sgl-project/sglang #23351 审核中，#24276 后续跟进），预计对 B200 FP8 曲线的提升大于 B200 NVFP4 曲线，将在高交互性端缩小仅精度差异的比值。（3）H200 SGLang MTP 方案（InferenceX PR #1480）在本次测试前一周才上线；H200 分离式推理、trtllm-mha 注意力或 H200 上的 KV FP8 都将在宣布世代对比完结之前进一步推高 H200 曲线。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx b/packages/app/content/blog/zh/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx
new file mode 100644
index 00000000..73da4323
--- /dev/null
+++ b/packages/app/content/blog/zh/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx
@@ -0,0 +1,248 @@
+---
+title: 'B200 NVFP4 vs H100 FP8 运行 MiniMax-M2.5：vLLM 下每美元性能最高提升 8.2 倍'
+subtitle: 'vLLM PR #36307 为 MiniMax 在 B200 上解锁了 trtllm-gen FP8 MoE 模块化内核；结合 NVFP4，在 8K/1K 负载下性能/成本从 22 tok/s/user 时的 4.0 倍扩大到 110 tok/s/user 时的 8.2 倍'
+date: '2026-05-26'
+publishDate: '2026-05-26'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - minimax
+  - nvidia
+  - b200
+  - h100
+  - vllm
+  - fp4
+---
+
+在 MiniMax-M2.5 8K/1K 负载下使用 vLLM，NVIDIA 的 NVFP4 量化版 MiniMax-M2.5 在 B200 上实现了**等交互性下相比 H100 vLLM FP8 最高 8.2 倍的每美元性能提升**——110 tok/s/user 时 H100 为 $0.74/M tokens，B200 NVFP4 为 $0.09/M tokens。提升幅度在 H100 的 21–111 tok/s/user 工作区间内单调递增，从低端的 4.0 倍（22 tok/s/user，$0.12 vs $0.031）增长到高端的 8.2 倍。测量于 2026-05-22 的 InferenceX。
+
+这 8.2 倍在峰值处可以清晰分解。在 110 tok/s/user 时，**B200 vLLM FP8 相比 H100 vLLM FP8 实现了 2.94 倍的每美元性能提升**——这是纯硬件代际提升，两个 SKU 上使用相同的 `MiniMaxAI/MiniMax-M2.5` 权重和相同的 vLLM 构建。**将 B200 权重切换为 `nvidia/MiniMax-M2.5-NVFP4` 后又叠加了 2.77 倍**——这是纯精度提升，由 [vllm-project/vllm #36307](https://github.com/vllm-project/vllm/pull/36307) 解锁——该 PR 添加了 trtllm-gen FP8 MoE 内核的**模块化变体**，使 MiniMax 的非标准路由方法得以使用该内核。2.94 × 2.77 ≈ 8.14。精度提升随交互性增长而扩大（22 tok/s/user 时 1.65 倍 → 110 时 2.77 倍），这是因为 trtllm-gen 内核相比旧版 triton 路径的优势在 GEMM 天花板成为瓶颈时更为突出。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=MiniMax-M2.5&i_prec=fp4%2Cfp8&i_active=b200_vllm%2Ch100_vllm&i_linelabel=1&i_advlabel=1">
+  点击查看完整的 InferenceX 仪表板 →
+</DashboardCTA>
+
+<Figure
+  srcLight="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/benchmark-light.png"
+  srcDark="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/benchmark-dark.png"
+  alt="MiniMax-M2.5 8K/1K 每 GPU 吞吐量与交互性对比，三条 vLLM 曲线：B200 NVFP4（最上方，峰值 17.6k tok/s/GPU、21 tok/s/user）、B200 FP8（中间）、H100 FP8（底部，峰值约 3k tok/s/GPU、21 tok/s/user）。"
+  caption="MiniMax-M2.5（230B / 10B 激活参数）在 ISL 8192 / OSL 1024 下的表现。H100 和 B200 均使用 vLLM。标注表示每种配置的张量并行和专家并行度（如 TEP2 = TP=2 + EP=2，TP4 = TP=4 密集）。B200 NVFP4 前沿涵盖 TP=1/2/4/8 配方；B200 FP8 前沿涵盖 TP=2 和 TP=4；H100 FP8 为 TP=8。"
+/>
+
+MiniMax-M2.5 是 MiniMax AI 的旗舰 MoE 模型：**总参数 230B，每 token 激活 10B**，采用 256 个小专家（架构不同于早期 MiniMax-Text-01 的 32 个大专家）。公开权重发布于 [`MiniMaxAI/MiniMax-M2.5`](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)，NVIDIA 发布了量化版 [`nvidia/MiniMax-M2.5-NVFP4`](https://huggingface.co/nvidia/MiniMax-M2.5-NVFP4)，将所有 MoE GEMM 权重从 BF16/FP8 重新量化为 NVFP4（16 元素分组、FP8 逐组缩放因子、FP32 逐张量缩放因子）。KV cache 保持 FP8。图表中 B200 NVFP4 曲线加载 NVIDIA 量化权重；B200 FP8 和 H100 FP8 曲线加载原始 MiniMaxAI 权重。
+
+本文涉及的关键架构细节是**路由方法**。MiniMax M2 的专家路由层生成的 routing logits 使用的数据类型不被原始（"单体式"）trtllm-gen FP8 MoE 内核接受——这就是为何 MiniMax 在 B200 上一直被限制在较慢的 triton MoE 路径上，直到 vLLM PR #36307 添加了在外部处理路由的模块化内核变体。我们将在"关键技术贡献"部分详细展开。
+
+## 为何 MiniMax-M2.5 值得投入优化
+
+MiniMax-M2.5 是 M2 系列中面向编码和智能体的开源权重模型。其 256 小专家路由层针对软件工程工作负载进行了调优（SWE-Bench 系列、Terminal Bench、智能体工具使用评估），在主要编码质量评估中**与 Claude Opus 4.5/4.6 在每项基准测试上的差距仅 1–4 分**，并在 Multi-SWE-Bench（51.3 vs 42.7）和 VIBE-Pro（54.2 vs 36.9）上**领先 Gemini 3 Pro**。特别是 Multi-SWE-Bench——Opus 4.5 得分 50.0、Opus 4.6 得分 50.3——M2.5 以 51.3 领先。
+
+<Figure
+  srcLight="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/quality-benchmarks-light.png"
+  srcDark="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/quality-benchmarks-dark.png"
+  alt="六个柱状图对比 MiniMax M2.5（红色）和 MiniMax M2.1（浅红色）与 Claude Opus 4.5、Claude Opus 4.6、Gemini 3 Pro 和 GPT-5.2 在 SWE-Bench Verified、SWE-Bench Pro、Terminal Bench 2、Multi-SWE-Bench、SWE-Bench Multilingual 和 VIBE-Pro (AVG) 上的表现。MiniMax M2.5 在每项基准测试上与 Claude Opus 的差距仅几分，在 Multi-SWE-Bench（51.3 vs 50.3）上领先，在与 Gemini 3 Pro 的对比中在 Multi-SWE-Bench 和 VIBE-Pro 上领先。"
+  caption="MiniMax-M2.5 与前沿编码模型在智能体和软件工程基准测试上的对比（MiniMax 发布的得分）。M2.5 在 SWE-Bench 系列和 Terminal Bench 上与 Claude Opus 4.5/4.6 持平或差距 0–4 分，在 Multi-SWE-Bench 上领先，在已报告的项目中与 GPT-5.2 持平或领先，在六项中的三项上领先 Gemini 3 Pro。M2.7（相同的 230B / 10B 激活架构，经过后训练优化）继承了这些结果并继续使用相同的 vLLM 内核路径。"
+/>
+
+质量水准是服务成本故事的前提。一个**激活参数仅 10B** 的开源 MoE 模型，在前沿专有编码模型的射程范围内——在 B200 NVFP4 的吞吐量锚点处仅需 **$0.031/M tokens**——相比将同样的工作负载路由到闭源 API 前沿模型，完全是不同的部署成本量级。下文介绍的 vLLM PR + NVFP4 + B200 技术栈，将这类智能体/SWE 循环工作负载的推理成本从"连续运行成本高昂"压缩到了可以让自主编码智能体持续运行数小时而不至于账单本身成为架构约束的区间。
+
+## H100 vs B200 纸面规格对比
+
+在介绍具体配方之前，先看硬件。H100 SXM（Hopper，2023）和 B200 SXM（Blackwell，2025）相隔一代。下方雷达图将每个轴归一化到 [`/gpu-specs`](/gpu-specs) 中所有 NVIDIA + AMD SKU 的最大值，因此在由 GB200/GB300 NVL72 设定天花板的轴上（特别是按 72-GPU NVLink 域扩展的扩展域显存容量和扩展域显存带宽），H100 和 B200 的可见多边形被压缩。
+
+<Figure
+  srcLight="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/specs-radar-light.png"
+  srcDark="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/specs-radar-dark.png"
+  alt="雷达图对比 H100 SXM 和 B200 SXM 在显存容量、显存带宽、FP4/FP8/BF16 TFLOP/s、扩展带宽、扩展域显存容量和扩展域显存带宽上的表现。各轴按 /gpu-specs 中所有 GPU 的最大值归一化。"
+  caption="H100 SXM vs B200 SXM 在 InferenceX /gpu-specs 雷达图上的对比。各轴按每个指标的跨厂商最大值归一化（例如 FP4 最大值为 GB300 NVL72 的 15 PFLOP/s/GPU，因此 B200 的 9 PFLOP/s 读数约 60%；扩展域显存和带宽最大值由 GB200/GB300 NVL72 的 72-GPU NVLink 域设定，因此 H100 和 B200 的 8-GPU 域读数都较低）。H100 的 FP4 读数为 0%，因为 Hopper 没有 FP4 张量核心。"
+/>
+
+本次基准测试涉及的两个 SKU 的绝对值：
+
+| 规格                              | H100 SXM            | B200 SXM            | B200 / H100 |
+| --------------------------------- | ------------------- | ------------------- | ----------- |
+| HBM 容量                          | 80 GB (HBM3)        | 180 GB (HBM3e)      | 2.25x       |
+| HBM 带宽                          | 3.35 TB/s           | 8.0 TB/s            | 2.39x       |
+| 密集 FP4 (TFLOP/s)                | —                   | 9,000               | —           |
+| 密集 FP8 (TFLOP/s)                | 1,979               | 4,500               | 2.27x       |
+| 密集 BF16 (TFLOP/s)               | 989                 | 2,250               | 2.28x       |
+| 每 GPU 扩展带宽（单向）           | 450 GB/s (NVLink 4) | 900 GB/s (NVLink 5) | 2.00x       |
+| 扩展域 GPU 数                     | 8                   | 8                   | 1.00x       |
+| 扩展域 HBM 容量                   | 640 GB              | 1,440 GB            | 2.25x       |
+| 扩展域 HBM 带宽（聚合）           | 26.8 TB/s           | 64.0 TB/s           | 2.39x       |
+| TCO（SemiAnalysis AI Cloud 模型） | $1.30/GPU/hr        | $1.95/GPU/hr        | 1.50x       |
+
+硅片本身带来了什么？**B200 的 FP8 算力是 H100 的 2.27 倍**（4,500 vs 1,979 TFLOP/s），**B200 的 FP4 算力是 H100 FP8 算力的 4.55 倍**（9,000 vs 1,979 TFLOP/s——Hopper 没有 FP4 张量核心，因此精度步骤是跨精度算力提升，而非同精度对比），**HBM 带宽提升 2.39 倍**（8.0 vs 3.35 TB/s），TCO 仅为 1.50 倍。这些原始比率限定了性能/成本的天花板：**FP8 vs FP8 计算瓶颈下为 1.51 倍**（`2.27 / 1.50`），**HBM 带宽瓶颈下为 1.59 倍**（`2.39 / 1.50`），或者 **B200 NVFP4 vs H100 FP8 跨精度算力轴下为 3.03 倍**（`4.55 / 1.50`）。
+
+实测数据超出了这些纸面天花板。2.94 倍的 FP8 代际提升**几乎是 FP8 硅片天花板（1.51 倍）的 2 倍**，8.16 倍的综合提升**约为跨精度 FP4 硅片天花板（3.03 倍）的 2.7 倍**。超出硅片天花板的部分来自 trtllm-gen 模块化 FP8 MoE 内核（vLLM PR #36307），它实现了旧版 triton MoE 路径无法做到的优化。换言之，与 B200 上的 trtllm-gen 内核路径相比，H100 上的 vLLM 技术栈在运行 MiniMax-M2.5 时仍有大量未发挥的性能空间。NVFP4 在此基础上叠加了精度步骤的提升，因为 B200 拥有 9 PFLOP/s 的 FP4 算力，而 H100 为零。
+
+## TensorRT-LLM MoE 内核集成至 vLLM
+
+**上游：vLLM PR #36307——TRTLLM FP8 MoE 模块化内核。** [vllm-project/vllm #36307](https://github.com/vllm-project/vllm/pull/36307)，由 [Wei Zhao](https://github.com/wzhao18) 提交，2026-03-12 合入，为 Blackwell 添加了 trtllm-gen FP8 MoE 内核的**模块化**变体。此前的"单体式"trtllm-gen 内核仅接受特定数据类型的 routing logits，这排除了 MiniMax M2 等路由层输出不同数据类型的模型。模块化内核在外部完成路由，从而消除了数据类型约束——MiniMax M2（及更广泛的 MoE 模型）现在可以使用 DeepSeek/Kimi/GLM-5 在 B200 上已有的快速 attention + MoE 内核路径。该 PR 的测试计划在 `MiniMaxAI/MiniMax-M2.5` 上以 TP=2 + 专家并行运行，与下方 B200 前沿使用的配方形状相同。
+
+内核的选择至关重要。以下是 MiniMax-M2.5 上的逐内核对比（1K/1K——_不是_ 8K/1K，但内核排序一致），展示了 vLLM 中各 MoE 后端的表现：
+
+<Figure
+  srcLight="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/moe-kernel-comparison-light.png"
+  srcDark="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/moe-kernel-comparison-dark.png"
+  alt="两个并排散点图对比 trtllm（蓝色）、deep_gemm（橙色）和 triton（绿色虚线）MoE 内核在 MiniMax-M2.5 ISL=1024 OSL=1024 下的表现。左图：生成 TPS vs 交互性 TPS。右图：总 TPS vs 交互性 TPS。trtllm 在全曲线上位于 triton 之上，且远超 deep_gemm（后者坍缩至约 250 gen TPS）。"
+  caption="vLLM MoE 后端在 B200 上运行 MiniMax-M2.5 ISL=1024 / OSL=1024 的对比。trtllm-gen 模块化（蓝色）是 vLLM PR #36307 解锁的内核；triton（绿色虚线）是此前的回退路径；deep_gemm（橙色）在此路由方法下无法产生有竞争力的工作点。1K/1K 负载相比本文其他部分使用的 8K/1K 放大了内核间差距（8K/1K 将每步内核时间分摊到更多的 prefill 带宽中），但排序相同。"
+/>
+
+## 数据详情
+
+所有数据行均为 MiniMax-M2.5 在 **ISL 8192 / OSL 1024** 下，使用单节点非分离式配置，于 2026-05-22 在 InferenceX 上使用 vLLM 测量。每百万总 token 的成本计算公式为 `TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6)`，其中 H100 为 $1.30/GPU/hr，B200 为 $1.95/GPU/hr，参照 [SemiAnalysis AI Cloud TCO 模型](https://newsletter.semianalysis.com/p/ai-cloud-economics)。
+
+**H100 vLLM FP8，TP=8，8 GPU**（模型 `MiniMaxAI/MiniMax-M2.5`）：
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 476.5     | 110.93     | 9.01      | $0.76      |
+| 8    | 771.4     | 89.71      | 11.15     | $0.47      |
+| 16   | 1,193.7   | 69.09      | 14.47     | $0.30      |
+| 32   | 1,707.6   | 49.70      | 20.12     | $0.21      |
+| 64   | 2,317.0   | 33.00      | 30.31     | $0.16      |
+| 128  | 2,985.6   | 21.19      | 47.19     | $0.12      |
+
+**B200 vLLM FP8，TP=2，2 GPU**（大部分曲线的吞吐量锚点）：
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 1,926.8   | 112.30     | 8.90      | $0.28      |
+| 8    | 3,143.5   | 92.25      | 10.84     | $0.17      |
+| 16   | 4,684.8   | 68.04      | 14.70     | $0.12      |
+| 32   | 6,514.0   | 47.40      | 21.10     | $0.08      |
+| 64   | 9,079.0   | 32.35      | 30.91     | $0.06      |
+| 128  | 10,053.9  | 23.91      | 41.82     | $0.05      |
+| 256  | 10,134.1  | 23.92      | 41.81     | $0.05      |
+| 512  | 10,112.2  | 23.85      | 41.93     | $0.05      |
+
+**B200 vLLM FP8，TP=4，4 GPU**（延伸低交互性段）：
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 256  | 11,035.8  | 19.67      | 50.85     | $0.05      |
+| 512  | 11,827.1  | 12.71      | 78.68     | $0.05      |
+
+**B200 vLLM NVFP4，TP=2，2 GPU**（左侧成本前沿锚点，模型 `nvidia/MiniMax-M2.5-NVFP4`）：
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 128  | 16,256.5  | 28.82      | 34.70     | $0.03      |
+| 256  | 17,407.3  | 20.61      | 48.51     | $0.03      |
+| 512  | 17,577.0  | 20.63      | 48.47     | $0.03      |
+
+**B200 vLLM NVFP4，TP=1，1 GPU**（中高交互性段）：
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 4,488.8   | 131.18     | 7.62      | $0.12      |
+| 8    | 6,683.0   | 97.87      | 10.22     | $0.08      |
+| 16   | 9,546.6   | 68.76      | 14.54     | $0.06      |
+| 32   | 11,698.0  | 44.16      | 22.65     | $0.05      |
+| 256  | 11,962.0  | 44.29      | 22.58     | $0.05      |
+
+**B200 vLLM NVFP4，TP=4 和 TP=8，4 和 8 GPU**（在 conc=4 处延伸高交互性段）：
+
+| 配方          | 并发 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ------------- | ---- | --------- | ---------- | --------- | ---------- |
+| TP=4 / 4 GPUs | 4    | 1,423.3   | 165.92     | 6.03      | $0.38      |
+| TP=4 / 4 GPUs | 8    | 2,468.9   | 144.91     | 6.90      | $0.22      |
+| TP=8 / 8 GPUs | 4    | 768.6     | 180.26     | 5.55      | $0.70      |
+
+NVFP4 组合帕累托前沿从 21 tok/s/user 时的 $0.031/M（TP=2，conc=512）延伸到 180 tok/s/user 时的 $0.70/M（TP=8，conc=4）。
+
+## 等交互性每美元性能
+
+在匹配交互性水平下的每 GPU 吞吐量和每百万 token 成本，沿各 SKU 的帕累托前沿插值。最后一列的每美元性能提升为 $/M 比率的倒数——B200 NVFP4 相对于 H100 的性能/成本。超出前沿测量范围的单元格显示为 _unreachable_。
+
+| 交互性 (tok/s/user) | H100 FP8 $/M  | B200 FP8 $/M  | B200 NVFP4 $/M | B200 NVFP4 性能/$ vs H100 |
+| ------------------- | ------------- | ------------- | -------------- | ------------------------- |
+| 22                  | $0.12         | $0.05         | $0.031         | 3.96x                     |
+| 30                  | $0.15         | $0.06         | $0.034         | 4.32x                     |
+| 40                  | $0.18         | $0.07         | $0.042         | 4.23x                     |
+| 50                  | $0.21         | $0.08         | $0.048         | 4.39x                     |
+| 60                  | $0.25         | $0.10         | $0.052         | 4.85x                     |
+| 70                  | $0.31         | $0.12         | $0.058         | 5.36x                     |
+| 80                  | $0.38         | $0.14         | $0.065         | 5.84x                     |
+| 90                  | $0.47         | $0.16         | $0.073         | 6.41x                     |
+| 100                 | $0.60         | $0.20         | $0.083         | 7.19x                     |
+| **110**             | **$0.74**     | **$0.25**     | **$0.091**     | **8.16x**                 |
+| 130                 | _unreachable_ | $0.44         | $0.118         | _∞_                       |
+| 150                 | _unreachable_ | _unreachable_ | $0.250         | _∞_                       |
+| 175                 | _unreachable_ | _unreachable_ | $0.569         | _∞_                       |
+
+B200 NVFP4 相对于 H100 的性能/成本优势**从 22 tok/s/user 时的 3.96 倍单调递增至 110 tok/s/user 时的 8.16 倍**，峰值 8.23 倍出现在 110.8 tok/s/user，即 H100 可测量范围的右端。与同代对比中提升基本恒定不同，此处提升增长迅速：H100 的前沿在右端急剧下降（其 conc=4 点仅为 476 tok/s/GPU、$0.76/M），而 B200 NVFP4 在相同交互性下的每 GPU 吞吐量仍约为 H100 的 12 倍（8.16 倍的性能/成本 × 1.50 倍的 TCO）。纯精度提升（B200 FP8 → B200 NVFP4）从 22 tok/s/user 时的 1.65 倍扩大到 110 时的 2.77 倍：随着批量缩小，trtllm-gen 内核相对 triton 路径的 GEMM 优势成为主导因素。在 110 tok/s/user 以上，对比不再成立：H100 没有任何配方能再提供一个 tok/s/user，而 B200 NVFP4 将可用范围延伸至 TP=8 下的 180 tok/s/user。
+
+<Figure
+  srcLight="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/benchmark-light.png"
+  srcDark="/images/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar/benchmark-dark.png"
+  alt="MiniMax-M2.5 8K/1K 每 GPU 吞吐量与交互性对比，三条 vLLM 曲线：B200 NVFP4（最上方，峰值 17.6k tok/s/GPU、21 tok/s/user）、B200 FP8（中间）、H100 FP8（底部，峰值约 3k tok/s/GPU、21 tok/s/user）。"
+  caption="MiniMax-M2.5（230B / 10B 激活参数）在 ISL 8192 / OSL 1024 下的表现。H100 和 B200 均使用 vLLM。标注表示每种配置的张量并行和专家并行度（如 TEP2 = TP=2 + EP=2，TP4 = TP=4 密集）。B200 NVFP4 前沿涵盖 TP=1/2/4/8 配方；B200 FP8 前沿涵盖 TP=2 和 TP=4；H100 FP8 为 TP=8。"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=MiniMax-M2.5&i_prec=fp4%2Cfp8&i_active=b200_vllm%2Ch100_vllm&i_linelabel=1&i_advlabel=1)，已预过滤为 MiniMax-M2.5 vLLM 在 H100 + B200 上 2026-05-22 运行的数据。[实时成本视图](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=MiniMax-M2.5&i_prec=fp4%2Cfp8&i_active=b200_vllm%2Ch100_vllm&i_metric=y_costh&i_linelabel=1&i_advlabel=1)展示相同对比。
+
+## Blackwell 在 MiniMax-M2.5 上的后续方向
+
+仍有三个方向可以进一步扩大或强化标题数据：
+
+- **NVL72 分离式（无需宽专家并行）。** **更宽的专家并行对此模型不是正确的杠杆**——在 10B 激活参数、256 个小专家的情况下，TP=2 / 8-GPU 配置中每个 rank 已经只持有少量专家，因此在 72-GPU NVLink 域中扩展 EP 不会显著缩小每 rank 权重占用（不同于 DeepSeek R1 或 Kimi K2.5，后者可通过 EP 集合上的计算-通信重叠实现复合增益）。**分离式 prefill + decode 仍然可行**：当前的单节点聚合配方将两个阶段放在同一个 TP=2 岛上，在 conc 256+ 饱和拐点处争抢 HBM 带宽；GB200/GB300 NVL72 上的分离式配方将 KV 通过 NVLink 5 在专用 prefill 和 decode 池之间传输，使 decode 池在饱和前能吸收更多并发。InferenceX 尚未发布 MiniMax 在 NVL72 上的分离式配方。
+
+对于当前的 MiniMax-M2.5 服务，B200 NVFP4 在 H100 可达的每个交互性点上都是更经济的选择，优势为 4 倍至 8.2 倍。
+
+## 致谢
+
+NVIDIA 的 [Wei Zhao](https://github.com/wzhao18) 于 2026-03-12 在 vLLM 中合入了 [trtllm FP8 MoE 模块化内核](https://github.com/vllm-project/vllm/pull/36307)。感谢 Wei Zhao 和 vLLM TRT-LLM 内核协作者、InferenceX 配方维护者以及 MiniMax AI 团队发布的开源权重。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=MiniMax-M2.5&i_prec=fp4%2Cfp8&i_active=b200_vllm%2Ch100_vllm&i_linelabel=1&i_advlabel=1">
+  点击查看完整的 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "NVIDIA B200 NVFP4 相比 H100 FP8 在 MiniMax-M2.5 vLLM 上每美元性能提升了多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 MiniMax-M2.5 8K/1K 序列长度下使用 vLLM，B200 NVFP4（模型 nvidia/MiniMax-M2.5-NVFP4）在等交互性下相比 H100 FP8（模型 MiniMaxAI/MiniMax-M2.5）最高实现 8.2 倍的每美元性能提升。峰值出现在 110 tok/s/user：H100 为每百万 token $0.74，B200 NVFP4 为每百万 token $0.091。提升从 22 tok/s/user 时的 3.96 倍单调递增至 110 时的 8.16 倍。TCO 参数为 H100 $1.30/GPU/小时、B200 $1.95/GPU/小时，来自 SemiAnalysis AI Cloud TCO 模型。测量于 2026-05-22 的 InferenceX（GHA 运行 26306422380）。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "vLLM PR #36307 做了什么？为什么对 B200 上的 MiniMax-M2.5 至关重要？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "vllm-project/vllm PR #36307 由 Wei Zhao 提交（2026-03-12 合入），为 Blackwell 添加了 trtllm-gen FP8 MoE 内核的模块化变体。此前的单体式 trtllm-gen 内核有 routing logits 数据类型约束，排除了 MiniMax M2 等路由器输出不同数据类型的模型。模块化内核在外部完成路由，消除了约束。有了 #36307，B200 上的 MiniMax M2 vLLM 终于可以使用 DeepSeek、Kimi 和 GLM-5 已有的快速 attention + MoE 内核路径。在 1K/1K 的补充逐内核对比中，trtllm-gen 模块化内核在低交互性时生成 TPS 比 triton MoE 回退路径高约 1.4 倍，总 TPS 高约 2 倍；deep_gemm 在此路由方法下无竞争力。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "B200 NVFP4 vs H100 FP8 的 8.2 倍每美元性能提升中，代际和精度各贡献了多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "提升在峰值处可以清晰分解。保持精度不变，B200 vLLM FP8 在 110 tok/s/user 时相比 H100 vLLM FP8 实现约 2.94 倍的每美元性能提升——这是 Blackwell 代际加软件步骤，两个 SKU 使用相同的 MiniMaxAI/MiniMax-M2.5 权重和相同的 vLLM 构建。将 B200 权重切换为 nvidia/MiniMax-M2.5-NVFP4 后在 110 时又叠加了 2.77 倍——纯精度步骤，依赖 vLLM PR #36307 解锁的 trtllm-gen FP8 MoE 模块化内核。2.94 乘以 2.77 约为 8.14。在 H100 的工作区间内，代际贡献在 2.40 倍（22 tok/s/user）到 2.99 倍（103）之间，精度贡献在 1.65 倍（22）到 2.81 倍（110）之间；两个步骤在更高交互性下都扩大，这就是综合提升从 3.96 倍增长到 8.16 倍而非保持恒定的原因。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 B200 vs H100 的代际提升（2.94 倍）几乎是纸面 FP8 天花板（1.51 倍）的 2 倍？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "从原始硅片看，B200 的密集 FP8 吞吐量是 H100 的 2.27 倍，TCO 为 1.50 倍，在计算瓶颈下的纸面 FP8 性能/成本天花板为 1.51 倍，在显存带宽瓶颈下为 1.59 倍。实测 110 tok/s/user 时 2.94 倍几乎是该天花板的 2 倍，这意味着对比不受硅片能力限制——H100 上的 vLLM 技术栈在运行 MiniMax-M2.5 时相比 B200 通过 PR #36307 获得的 trtllm-gen 内核路径还留有大量性能空间。如果 H100 技术栈采用 FP8 KV cache、FlashInfer 注意力升级或类似的快速 MoE 内核，代际步骤将向其纸面天花板收窄。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "B200 NVFP4 vs H100 FP8 的 MiniMax-M2.5 对比中还有哪些未覆盖的领域？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "仍有三处差距。（1）投机解码：两个配方均未在 MiniMax-M2.5 上运行 MTP 或 EAGLE，因为 vLLM 对 MiniMax M2 路由层的 MTP 支持落后于 DeepSeek/Kimi/GLM-5；一旦实现，高交互性下的精度步骤应收窄，低交互性下的绝对下限应进一步降低。（2）NVL72 上的分离式服务：本次对比为单节点聚合；NVL72 上的宽专家并行在 Kimi K2.5 上已展示约 3 倍的每 GPU 吞吐量提升，将进一步提升 MiniMax-M2.5 的性能/成本。（3）H100 vLLM 技术栈仍有提升空间：当前运行 vllm/vllm-openai v0.18.0；FlashInfer 注意力升级和 H100 上的 FP8 KV 路径均可推高 H100 曲线。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar.mdx b/packages/app/content/blog/zh/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar.mdx
new file mode 100644
index 00000000..38e84357
--- /dev/null
+++ b/packages/app/content/blog/zh/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar.mdx
@@ -0,0 +1,201 @@
+---
+title: 'B200 NVFP4 对比 H200 INT4 运行 Kimi K2.5/K2.6：性价比提升高达 2.95 倍'
+subtitle: '在 vLLM 8K/1K 工作负载下，B200 NVFP4 路径在 30–90 tok/s/user 推理区间内每百万 tokens 成本比 H200 INT4 低 2.71x–2.95x，比同一 B200 硬件上的 INT4 低 2.45x–2.74x。三个因素——B200 的 HBM 带宽、HBM 容量和 NVFP4 张量核心——可清晰分解该优势'
+date: '2026-05-26'
+publishDate: '2026-05-26'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - kimi
+  - nvidia
+  - b200
+  - h200
+  - vllm
+  - nvfp4
+---
+
+Kimi K2.5 和 K2.6 是 xAI Cursor Composer 2 和 Composer 2.5 背后的开源权重模型——Cursor IDE 日活用户超百万，且 K2.6 以 58.6% 的成绩领跑 SWE-Bench Pro。在 8K/1K 工作负载下，vLLM 在 NVIDIA B200 上以 NVFP4 运行 K2.5/K2.6，在整个单节点 Pareto 前沿上均比 H200 INT4 更便宜。**在 30–90 tok/s/user 推理区间内，B200 NVFP4 每百万 tokens 成本比 H200 INT4 低 2.71x–2.95x**，峰值为 **32 tok/s/user 时的 2.95 倍**（B200 NVFP4 为 $0.140/M vs H200 INT4 为 $0.413/M——成本降低 66%）。在相同 B200 硬件上，从 INT4 切换到 NVFP4 在等交互性下还可额外带来 **2.45x–2.74x 的优势**（40 tok/s/user 时 $0.397/M → $0.154/M）。数据来自 SemiAnalysis InferenceX，2026-05-19，[GHA run 26118912054](https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26118912054)。
+
+两款 SKU 均运行相同的 `vllm/vllm-openai:v0.21.0` 容器。差距来自硬件和精度。B200 的 FP8 dense 吞吐量是 H200 的 2.27 倍（4,500 vs 1,979 TFLOP/s）、HBM 带宽 1.67 倍（8 vs 4.8 TB/s）、NVLink Scale-up 带宽 2.00 倍（900 vs 450 GB/s 单向）。在 FP4 轴上 H200 完全空白——Hopper SM90 没有 FP4 张量核心，[官方数据表](https://resources.nvidia.com/en-us-data-center-overview/gtc24-h200-datasheet)止步于 FP8。B200 的 NVFP4 核心提供 9,000 TFLOP/s。实测的约 3 倍 token 成本差距，就是这些硅片比值在折算 B200 1.38 倍 TCO 溢价（$1.95 vs $1.41/GPU/hr，来源于 [SemiAnalysis AI Cloud TCO 模型](https://newsletter.semianalysis.com/p/ai-cloud-economics)）之后的呈现。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-19&g_runid=26118912054&g_model=Kimi-K2.5&i_prec=fp4%2Cint4&i_active=b200_vllm%2Ch200_vllm">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<Figure
+  srcLight="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/benchmark-light.png"
+  srcDark="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/benchmark-dark.png"
+  alt="Kimi K2.5/K2.6 1T 在 FP4 / INT4 下的 8K / 1K 吞吐量，三条 vLLM 曲线：B200 NVFP4（浅绿色圆点）在 32 tok/s/user 时峰值约 3.9k tok/s/GPU；B200 INT4（浅绿色方块）在 26 tok/s/user 时峰值约 1.8k tok/s/GPU；H200 INT4（深绿色方块）在 16.7 tok/s/user 时峰值约 1.17k tok/s/GPU。B200 NVFP4 曲线在整个重叠区间内大致位于 H200 INT4 之上 3 倍、B200 INT4 之上 2 倍。数据点标签标注每个配置的 GPU 数量（B200 NVFP4 高吞吐量段为 TP=4，其余为 TP=8）。"
+  caption="Kimi K2.5/K2.6（1T 总参数，32B 激活参数），vLLM，ISL 8192 / OSL 1024，单 NVIDIA 节点。来源：SemiAnalysis InferenceX，2026-05-19。数据点标签标注每个配置的 GPU 数量。"
+/>
+
+## Kimi K2.5 / K2.6 模型架构及下游 Cursor Composer 2.5 模型
+
+[Kimi K2.5](https://huggingface.co/moonshotai/Kimi-K2.5)（发布于 2026-01-27）和 [Kimi K2.6](https://huggingface.co/moonshotai/Kimi-K2.6)（发布于 2026-04-20）共享原始 Kimi K2 骨干网络：**1.0T 参数的 MoE，每个 token 激活 32B 参数**，**DeepSeek 风格的 top-8-of-385 专家路由，跨 61 个 Transformer 层（1 个 dense 块 + 60 个 MoE 块）**，**Multi-head Latent Attention（MLA）**、SwiGLU、**YaRN RoPE**，163,840 词汇量，以及 **256K 上下文窗口**（262,144 tokens）。HF 检查点为 [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5) 和 [`moonshotai/Kimi-K2.6`](https://huggingface.co/moonshotai/Kimi-K2.6)——两者是在同一预训练架构上的后训练优化，因此**本文中的每一个推理结果都同样适用于这两个版本**。
+
+<Figure
+  srcLight="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/kimi-k2-architecture-light.png"
+  srcDark="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/kimi-k2-architecture-dark.png"
+  alt="Kimi K2.5/K2.6 架构图，来自 Moonshot AI 模型卡：token embedding（d=7168，vocab=163840）→ 1 个 dense Transformer 块（FFN=18432）→ 60 个 MoE Transformer 块（Multi-head Latent Attention，top-8 of 385 专家）→ RMSNorm → 输出 LM head（vocab=163840）。类型：MoE。层数：1D + 60M。注意力：MLA。上下文：262K。专家：8/385。特性：Multi-head Latent Attention、DeepSeek 风格 MoE、YaRN RoPE。发布者：Moonshot AI，2026 年 1 月 26 日。"
+  caption="Kimi K2.5/K2.6 架构（1.0T 总参数 / 32B 激活参数 / 262K 上下文）。两个版本共享骨干网络——K2.6 是 K2.5 预训练权重的后训练优化版本。来源：Moonshot AI 模型卡，经 SemiAnalysis InferenceX 仪表板展示。"
+/>
+
+**K2.5 和 K2.6 是 xAI Cursor Composer 2 和 Composer 2.5 背后的开源权重模型**，服务于 Cursor IDE 超过百万的日活用户。**K2.6 还在公开 agentic 编程基准测试中领先前沿模型**：SWE-Bench Pro 得分 58.6%——领先 GPT-5.4（57.7%）、Claude Opus 4.6（53.4%）和 Gemini 3.1 Pro（54.2%）——SWE-Bench Verified 得分 80.2%（[Moonshot K2.6 模型卡](https://huggingface.co/moonshotai/Kimi-K2.6)）。Cline 的[生产部署数据](https://cline.bot/blog/moonshots-kimi-k2-for-coding-our-first-impressions-in-cline)显示其在复杂 diff 编辑任务上的失败率为 3.3%，与 Claude 4 Sonnet 持平。K2.6 的 Agent Swarm 原语可扇出至 **300 个并行子 agent，跨 4,000 个协调步骤**，从 K2.5 的 100 / 1,500 提升。如果你今天在托管开源 agentic 编程栈，K2.5 或 K2.6 就是你在服务的模型。
+
+关于量化的说明：Moonshot 发布 K2.5/K2.6 时，**原生 INT4 权重**是默认的开源权重检查点——本文中 H200 INT4 和 B200 INT4 曲线直接使用该检查点。**B200 NVFP4 曲线使用的是相同权重的 NVFP4 再量化版本**，以便 B200 的 FP4 张量核心能以全速率执行 MoE GEMM。H200 无法运行此路径——Hopper SM90 没有 FP4 张量核心。
+
+## 纸面规格
+
+NVIDIA B200 SXM（Blackwell，2025）vs NVIDIA H200 SXM（Hopper，2024）：两者均为 NVIDIA，均运行 vLLM，均部署在 8-GPU NVLink 域中。下方雷达图将每个轴归一化到 [`/gpu-specs`](/gpu-specs) 中的跨厂商最大值，因此在由 GB200 NVL72 / GB300 NVL72 设定上限的轴上（72-GPU 节点规模下的 Scale Up Domain 内存与带宽），两款 SKU 的多边形都会被明显压缩；FP4 轴的上限由 GB300 NVL72 的 15,000 TFLOP/s 设定，B200 的 9,000 TFLOP/s 在该轴上约为 60%。
+
+<Figure
+  srcLight="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/specs-radar-light.png"
+  srcDark="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/specs-radar-dark.png"
+  alt="GPU 规格雷达图对比 H200 SXM（深绿色）和 B200 SXM（浅绿色），来自 /gpu-specs。B200 在每个单 GPU 轴上都填充了更大面积，Memory 轴除外（约 60% vs H200 约 45%——两者均被 MI355X 288 GB 上限压缩）。B200 最显著的优势：Mem BW（100%，H200 约 55%）、Scale Up BW（100%，H200 约 50%）、BF16 + FP8 TFLOP/s（约 85%，H200 约 35%）。H200 在 FP4 轴上为 0%，因为 Hopper 没有 FP4 张量核心。"
+  caption="B200 SXM（浅绿色）vs H200 SXM（深绿色）的 /gpu-specs 对比。各轴归一化至跨厂商所有 SKU 的最大值。B200 在每个单 GPU 轴上领先 H200；FP4 轴差距无穷大——H200 为 0%，因为 Hopper 没有 FP4 张量核心路径。Scale-up Domain 轴被 GB200/GB300 NVL72 的 72-GPU 节点规模压缩，因此两款 8-GPU SKU 均约为 11%。"
+/>
+
+| 规格                              | H200 SXM            | B200 SXM            | B200 / H200 |
+| --------------------------------- | ------------------- | ------------------- | ----------- |
+| HBM 容量                          | 141 GB              | 180 GB              | 1.28x       |
+| HBM 带宽                          | 4.8 TB/s            | 8 TB/s              | **1.67x**   |
+| Dense FP4 (TFLOP/s)               | —（无 FP4 核心）    | 9,000               | **∞**       |
+| Dense FP8 (TFLOP/s)               | 1,979               | 4,500               | **2.27x**   |
+| Dense BF16 (TFLOP/s)              | 989                 | 2,250               | 2.27x       |
+| Scale-up 每 GPU 带宽（单向）      | 450 GB/s (NVLink 4) | 900 GB/s (NVLink 5) | **2.00x**   |
+| Scale-up 节点规模                 | 8                   | 8                   | 1.00x       |
+| Scale-up Domain HBM 容量          | 1.13 TB             | 1.44 TB             | 1.28x       |
+| Scale-up Domain HBM 带宽（聚合）  | 38.4 TB/s           | 64 TB/s             | 1.67x       |
+| TCO（SemiAnalysis AI Cloud 模型） | $1.41/GPU/hr        | $1.95/GPU/hr        | 1.38x       |
+
+**从硅片规格到实测性能的映射。** 当两款 SKU 在同一模型上都运行 vLLM INT4 时，工作负载在 **decode 路径上受 HBM 带宽瓶颈限制**——每一步通过 HBM 流式读取活跃专家权重，在并发用户间分批执行。B200 1.67 倍的 HBM 带宽优势直接体现在吞吐量上：在 iv = 26 tok/s/user 时，**B200 INT4 达到 1,791 tok/s/GPU vs H200 INT4 的插值 1,055——比值为 1.70x，正好位于硅片极限**。扣除 1.38 倍 TCO 溢价后，B200 INT4 相对 H200 INT4 获得 1.22 倍的 token 成本优势。
+
+**HBM 容量带来了雷达图上看不到的第二个硅片优势：更低的 TP，每个 token 更少的集合通信开销。** Kimi K2.5/K2.6 INT4 的模型活跃状态约占 **500 GB**（1T 总参数 × 约 4 bit + 激活值 + KV 缓存 + paged attention 暂存空间）。在 B200 的 **180 GB/GPU** 上，可以放入 **4 GPU（720 GB 聚合，约 30% 空间留给 KV 缓存和激活值）→ TP=4 可行**。在 H200 的 **141 GB/GPU** 上，同样的模型需要 **至少 8 GPU（1,128 GB 聚合）才能留出足够的 KV 缓存空间 → 必须使用 TP=8**。本文中每一个 Pareto 最优的 B200 NVFP4 数据点都是 **TP=4**；每一个 H200 INT4 数据点都是 **TP=8**。
+
+张量并行度减半意味着每个 decode 步骤的集合通信流量减半：注意力输出投影、MoE gather 和 post-MLP reduce 上各少一个 log₂N AllReduce 跳。按照 Amdahl 定律，串行的集合通信是每步延迟的下限，通信减少后这一下限随之降低。B200 NVFP4 曲线不仅因精度比值而位于 B200 INT4 之上；它还因每个 decode 步骤完成更快而向更高交互性（tok/s/user）一侧偏移。
+
+**精度解锁叠加在以上两个因素之上。** 将 B200 的路径从 INT4 切换为 NVFP4，使其 dense 张量核心吞吐量翻倍（这条路径承担了 K2 中 MoE GEMM 的大部分计算），且无需额外的 HBM 开销。B200 NVFP4 在 32 tok/s/user 时达到 **3,879 tok/s/GPU，是 B200 INT4 峰值（26 tok/s/user）的 2.17 倍**。将三个因素相乘（**1.67x HBM 带宽（decode 瓶颈下的吞吐量下限）× 约 2x NVFP4（精度解锁）× TP=4 vs TP=8 的集合通信优势**），再除以 1.38x TCO 溢价，即得到实测的 **2.95 倍每百万 tokens 成本优势**。
+
+## 详细数据
+
+所有行均为 Kimi K2.5 / K2.6 在 **ISL 8192 / OSL 1024** 下的单 8-GPU 节点结果，数据来自 2026-05-19 的 InferenceX 基准测试，[GHA run 26118912054](https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26118912054)。吞吐量为每 GPU 数值。每百万 tokens 成本使用 SemiAnalysis AI Cloud TCO 模型：H200 $1.41/GPU/hr，B200 $1.95/GPU/hr。公式：`$/M tok = TCO_$/GPU/hr × 1e6 / (3600 × tput_per_gpu)`。
+
+**H200 vLLM INT4 (TP=8)**——参考基准：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 384.4     | 91.18      | 10.97     | $1.019     |
+| 8    | 590.2     | 70.28      | 14.23     | $0.664     |
+| 16   | 797.9     | 46.64      | 21.44     | $0.491     |
+| 32   | 990.9     | 28.86      | 34.65     | $0.395     |
+| 64   | 1,174.5   | 16.67      | 59.98     | $0.334     |
+
+**B200 vLLM INT4 (TP=8)**——相同精度下的 Blackwell 硬件，隔离纯硬件差异：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 446.7     | 104.36     | 9.58      | $1.213     |
+| 8    | 692.8     | 81.12      | 12.33     | $0.782     |
+| 16   | 969.4     | 59.21      | 16.89     | $0.559     |
+| 32   | 1,351.4   | 40.48      | 24.70     | $0.401     |
+| 64   | 1,790.7   | 26.01      | 38.45     | $0.303     |
+
+**B200 vLLM NVFP4 (TP=4 + TP=8)**——标题中的最优方案；dense Pareto 最优段在所有并发度下均为 TP=4，外加一个 TP=8 conc=4 数据点延伸高交互性端：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens | TP   |
+| ---- | --------- | ---------- | --------- | ---------- | ---- |
+| 4    | 532.0     | 125.51     | 7.97      | $1.018     | TP=8 |
+| 4    | 947.4     | 111.08     | 9.00      | $0.572     | TP=4 |
+| 8    | 1,537.2   | 90.66      | 11.03     | $0.352     | TP=4 |
+| 16   | 2,318.7   | 67.40      | 14.84     | $0.234     | TP=4 |
+| 32   | 3,202.7   | 46.83      | 21.35     | $0.169     | TP=4 |
+| 64   | 3,879.3   | 32.19      | 31.07     | **$0.140** | TP=4 |
+
+加粗的单元格即为标题数字：**B200 NVFP4 在 32 tok/s/user 时每百万 tokens 仅需 $0.140**，为图表中的最低推理成本。
+
+## 等交互性成本对比
+
+在匹配的交互性水平下，沿每款 SKU 的 Pareto 前沿插值得出的每百万 tokens 成本。超出前沿测量范围的单元格标记为 `_unreachable_`（比值列标记为 `_∞_`）。三条曲线的重叠区间为 **30–90 tok/s/user**——这是有意义的三方对比所在的区间。
+
+| 交互性 (tok/s/user) | H200 INT4 $/M | B200 INT4 $/M | B200 NVFP4 $/M | H200 / B200 NVFP4 | H200 / B200 INT4 | B200 INT4 / B200 NVFP4 |
+| ------------------- | ------------- | ------------- | -------------- | ----------------- | ---------------- | ---------------------- |
+| **32**              | **$0.413**    | **$0.343**    | **$0.140**     | **2.95x**         | **1.20x**        | **2.45x**              |
+| 35                  | $0.427        | $0.362        | $0.145         | 2.95x             | 1.18x            | 2.50x                  |
+| 40                  | $0.453        | $0.397        | $0.154         | 2.94x             | 1.14x            | 2.58x                  |
+| 50                  | $0.511        | $0.477        | $0.177         | 2.88x             | 1.07x            | 2.69x                  |
+| 60                  | $0.569        | $0.566        | $0.206         | 2.75x             | 1.00x            | **2.74x**              |
+| 70                  | $0.660        | $0.655        | $0.244         | 2.71x             | 1.01x            | 2.69x                  |
+| 80                  | $0.811        | $0.766        | $0.286         | 2.84x             | 1.06x            | 2.68x                  |
+| 90                  | $0.996        | $0.927        | $0.347         | 2.87x             | 1.07x            | 2.67x                  |
+| 100                 | _unreachable_ | $1.123        | $0.421         | _∞_               | _∞_              | 2.67x                  |
+| 110                 | _unreachable_ | _unreachable_ | $0.550         | _∞_               | _∞_              | _∞_                    |
+| 125                 | _unreachable_ | _unreachable_ | $1.000         | _∞_               | _∞_              | _∞_                    |
+
+**B200 NVFP4 vs H200 INT4 的差距在重叠区间内几乎恒定：30 到 90 tok/s/user 范围内为 2.71x–2.95x。** 曲线的两端获得相同的优势。在低交互性/高批量端，工作负载受 decode 瓶颈限制，B200 的 HBM 带宽 + NVFP4 张量核心均保持饱和。在高交互性/低批量端，NVFP4 随批量缩小持续降低每 token 计算开销。同精度列（H200 INT4 vs B200 INT4）则呈现不同的趋势：在 **60–80 tok/s/user 时收窄至 1.00x–1.06x**，B200 的硅片优势仅能勉强覆盖其 TCO 溢价。精度解锁才是支撑标题数字的核心。
+
+在 100 tok/s/user 以上，只有 B200 NVFP4 还有可用方案。H200 INT4 的前沿在 91 tok/s/user 终止（并发 4 时单步计算饱和）；B200 INT4 在 104 tok/s/user 终止。**B200 NVFP4 仍可在 125 tok/s/user 时以 $1.00/M 提供服务**——这是任何 Hopper 方案都无法到达的区间。
+
+<Figure
+  srcLight="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/benchmark-light.png"
+  srcDark="/images/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar/benchmark-dark.png"
+  alt="Kimi K2.5/K2.6 1T 在 FP4 / INT4 下的 8K / 1K 吞吐量，三条 vLLM 曲线：B200 NVFP4（浅绿色圆点）在 32 tok/s/user 时峰值约 3.9k tok/s/GPU；B200 INT4（浅绿色方块）在 26 tok/s/user 时峰值约 1.8k tok/s/GPU；H200 INT4（深绿色方块）在 16.7 tok/s/user 时峰值约 1.17k tok/s/GPU。B200 NVFP4 曲线在整个重叠区间内大致位于 H200 INT4 之上 3 倍、B200 INT4 之上 2 倍。数据点标签标注每个配置的 GPU 数量（B200 NVFP4 高吞吐量段为 TP=4，其余为 TP=8）。"
+  caption="Kimi K2.5/K2.6（1T 总参数，32B 激活参数），vLLM，ISL 8192 / OSL 1024，单 NVIDIA 节点。来源：SemiAnalysis InferenceX，2026-05-19。数据点标签标注每个配置的 GPU 数量。"
+/>
+
+[在线图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-19&g_runid=26118912054&g_model=Kimi-K2.5&i_prec=fp4%2Cint4&i_active=b200_vllm%2Ch200_vllm)，已预筛选为 2026-05-19 测试中 B200 + H200 上的 vLLM Kimi K2.5/K2.6 FP4 和 INT4 对比。
+
+## 致谢
+
+Kimi K2.5 和 K2.6 是 [Moonshot AI](https://www.moonshot.ai/) 的工作成果，权重发布于 [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5) 和 [`moonshotai/Kimi-K2.6`](https://huggingface.co/moonshotai/Kimi-K2.6)。Blackwell 上的 vLLM NVFP4 路径是 [vLLM 项目](https://github.com/vllm-project/vllm)以及 NVIDIA TensorRT-LLM / AITER 内核团队的工作成果，vLLM 链接了他们的 FP4 MoE 内核。持续基准测试由 SemiAnalysis 在 InferenceX 上执行。速度就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-19&g_runid=26118912054&g_model=Kimi-K2.5&i_prec=fp4%2Cint4&i_active=b200_vllm%2Ch200_vllm">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "NVIDIA B200 NVFP4 在 Kimi K2.5 和 K2.6 推理中比 H200 INT4 便宜多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 8K/1K 工作负载下使用 vLLM，B200 NVFP4 在 30 到 90 tok/s/user 推理区间内每百万 tokens 成本比 H200 INT4 低 2.71 倍到 2.95 倍。峰值差距为 32 tok/s/user 时的 2.95 倍，B200 NVFP4 每百万 tokens 为 $0.140，H200 INT4 为 $0.413——成本降低约 66%。在 100 tok/s/user 以上，H200 INT4 没有可用方案，而 B200 NVFP4 仍可在 125 tok/s/user 时以 $1.00/M 提供服务。TCO 使用 SemiAnalysis AI Cloud TCO 模型：H200 $1.41/GPU/hr，B200 $1.95/GPU/hr。数据来自 InferenceX，GHA run 26118912054，2026-05-19。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "差距中有多少来自硅片，多少来自精度解锁？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "三个因素叠加。（1）HBM 带宽：B200 的 HBM 带宽是 H200 的 1.67 倍（8 vs 4.8 TB/s），且工作负载受 decode 瓶颈限制，因此在 iv = 26 tok/s/user 时 B200 INT4 达到 1,791 tok/s/GPU，而 H200 INT4 插值为 1,055——1.70 倍，正好位于硅片比值。（2）HBM 容量解锁更低的张量并行度：Kimi K2.5/K2.6 INT4 的模型活跃状态约 500 GB，可放入 4 块 B200 GPU（每块 180 GB，聚合 720 GB），但需要 8 块 H200 GPU（每块 141 GB，聚合 1,128 GB）才能留出足够的 KV 缓存空间。本文中每个 Pareto 最优的 B200 NVFP4 方案为 TP=4；每个 H200 INT4 数据点为 TP=8。张量并行度减半意味着每个 decode 步骤的集合通信流量减半（少一个 log-base-2-N AllReduce 跳），Amdahl 定律在串行集合通信瓶颈上拉低每步延迟下限。（3）精度解锁：将 B200 从 INT4 切换为 NVFP4 使 dense 张量核心吞吐量翻倍，B200 NVFP4 峰值达到 3,879 tok/s/GPU（B200 INT4 峰值的 2.17 倍）。三者相乘再除以 1.38 倍 TCO 溢价（B200 $1.95 vs H200 $1.41/GPU/hr），得到实测的 2.71x-2.95x 每百万 tokens 成本优势。NVFP4 是精度杠杆；HBM 带宽是吞吐量下限；HBM 容量是降低 TP 的杠杆；H200 三者皆无（Hopper 不支持 FP4 张量核心、更低的 HBM 带宽、更低的 HBM 容量迫使 TP=8）。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "在相同 B200 硬件上从 INT4 切换到 NVFP4 值得吗？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "值得。在相同 B200 硬件上，将 vLLM 精度从原生 INT4 切换为 NVFP4，在 30 到 90 tok/s/user 推理区间内可获得 2.45 倍到 2.74 倍的等交互性成本优势，峰值在 60 tok/s/user 时达到 2.74 倍（INT4 每百万 tokens $0.566 vs NVFP4 $0.206）。机制：NVFP4 启用了 B200 的 9,000 TFLOP/s FP4 张量核心，而 INT4 路径不使用这些核心。NVFP4 还扩展了可达的交互性范围——B200 INT4 上限为 104 tok/s/user，B200 NVFP4 可服务至 125 tok/s/user。无需更换硬件，无 TCO 变化，仅靠精度切换。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 Kimi K2.5 / K2.6 是这里重要的模型？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "Kimi K2.5 和 K2.6 是 xAI Cursor Composer 2 和 Composer 2.5 后端的开源权重模型，服务于 Cursor IDE 超过百万的日活用户。K2.6 还在公开 agentic 编程基准测试中领先前沿模型：SWE-Bench Pro 得分 58.6%，领先 GPT-5.4（57.7）、Claude Opus 4.6（53.4）和 Gemini 3.1 Pro（54.2），SWE-Bench Verified 得分 80.2%。Cline 的生产部署数据显示其在复杂 diff 编辑任务上的失败率为 3.3%，与 Claude 4 Sonnet 持平。架构为 1T 总参数、每 token 32B 激活、384 个专家（选 8 个加 1 个共享）、61 个 Transformer 层、Multi-head Latent Attention 和 256K 上下文窗口。K2.5（发布于 2026-01-27）和 K2.6（发布于 2026-04-20）共享相同的预训练骨干网络，因此本文中的每一个推理结果都同样适用于两者——它们是后训练优化，不是新架构。如果你今天在托管开源 agentic 编程栈，K2.5 或 K2.6 就是你在服务的模型。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "Kimi K2.5 / K2.6 推理还有哪些未覆盖的方向？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "四个方向。第一，AMD MI355X 目前还没有 K2.5 / K2.6 的 InferenceX 方案；一旦内核覆盖到位，相同的精度解锁论证应同样适用（MI355X 有 10,066 TFLOP/s FP4 张量核心，略高于 B200）。第二，PD 分离式推理（AMD 上的 mori-sglang、NVIDIA Dynamo）是下一个约 1.5 倍的提升杠杆，目前 InferenceX 循环中还没有 K2 的方案。第三，GB200 NVL72 和 GB300 NVL72 的机架级宽 Expert Parallelism 路径尚未为 K2.5 / K2.6 接入，尽管 384 专家的架构天然适配。第四，本文测量的是 8K / 1K；32K / 2K 和 128K / 2K 的 agentic 工具调用工作负载会在 KV 缓存压力开始对 256K 上下文窗口模型产生影响后重新排列曲线。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/deepseekv4-16t-day-0-to-day-43-performance.mdx b/packages/app/content/blog/zh/deepseekv4-16t-day-0-to-day-43-performance.mdx
new file mode 100644
index 00000000..7bb0d216
--- /dev/null
+++ b/packages/app/content/blog/zh/deepseekv4-16t-day-0-to-day-43-performance.mdx
@@ -0,0 +1,588 @@
+---
+title: 'DeepSeekV4 1.6T 第0天至第43天性能演进 — Huawei、GB300 NVL72、MI355X、B200'
+subtitle: '第0天推理性能、InferenceX、26天内性能提升100倍、每百万 token 成本、Huawei 950DT 推理 Trace 分析'
+date: '2026-06-09'
+publishDate: '2026-06-09'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - deepseek
+  - nvidia
+  - amd
+  - huawei
+  - gb300
+  - b300
+  - b200
+  - mi355x
+  - h200
+  - sglang
+  - vllm
+  - trtllm
+---
+
+_本文最初于 2026 年 6 月 9 日发布在 [SemiAnalysis 通讯](https://newsletter.semianalysis.com/p/deepseekv4-16t-day-0-to-day-43-performance)。_
+
+DeepSeek v4 的发布标志着开源模型社区又迈出了重要一步——毫不意外，它再次出自中国实验室之手。其推理性能的演进对整个 AI 生态系统至关重要。[开源 InferenceX 工程团队连续多个通宵，在第 0 天、第 1 天、第 2 天及之后持续测量该模型的性能表现，并将结果公之于众。](https://inferencex.semianalysis.com/)在本文中，我们将重点介绍 DeepSeek v4 的第 0 天性能，并解释模型发布后数周内所取得的重大改进。我们还将阐述 DeepSeek v4 模型架构的核心组件，并探讨其部分设计如何针对 Huawei Ascend 推理进行了协同优化。
+
+在博客文章的第二节中，我们对 DeepSeekv4 在 Huawei Ascend 950DT 上的第 0 天推理进行了全面分析。本文是 Ascend 950DT 上 DeepSeekv4 推理的首篇分析报告，我们详细拆解了 Huawei 为优化性能所做的计算↔通信重叠以及不同的计算流设计。
+
+InferenceX 的一个核心目标，尤其是在模型第 0 天发布窗口期间，是使用开源镜像和配方，尽可能多地在各种框架上记录每种 SKU 的性能表现，无论这些镜像和配方的实际性能如何。这使我们能够追踪性能的持续改进，我们认为这最能反映每种芯片真实可部署的性能。下方视频展示了 vLLM/SGLang 从第 0 天起非 MTP 配置的迭代改进过程。[访问 inferencemax.ai 查看从第 0 天起的 MTP 配置](https://inferencemax.ai/)。
+
+<video
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/day0-onward-vllm-nonmtp.mp4"
+  controls
+  muted
+  loop
+  playsInline
+  className="rounded-lg w-full md:w-3/4 mx-auto my-6 block"
+/>
+
+_vLLM 非 MTP DeepSeek V4 Pro 配置从第 0 天起的改进过程。来源：SemiAnalysis InferenceX_
+
+这些图表反映了数千小时的工程投入，用于调优 DeepSeek v4 推理性能，且大部分优化已合并到 SGLang/vLLM 的主分支。InferenceX 的核心目标之一是展示性能*随时间*的迭代改进，而非仅仅呈现性能的静态快照。毕竟在工程领域，探索过程中学到的东西往往与最终结果同等重要。
+
+<video
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/day0-onward-sglang-nonmtp.mp4"
+  controls
+  muted
+  loop
+  playsInline
+  className="rounded-lg w-full md:w-3/4 mx-auto my-6 block"
+/>
+
+_SGLang 非 MTP DeepSeek V4 Pro 配置从第 0 天起的改进过程。来源：SemiAnalysis InferenceX_
+
+在 DeepSeek v4 Pro 发布初期，CUDA vLLM、CUDA SGLang 以及 CUDA vLLM 分离式 prefill 均可开箱即用且表现出色，这证明了 vLLM 和 SGLang 开源生态的强大实力。这两个推理引擎对全球 ML 生态系统至关重要，以至于两个团队都已成立自己的公司——Inferact 和 RadixArk，各自融资数亿美元以持续推动其开源推理引擎的发展。
+
+Huawei Ascend 也在其文档中描述并展示了对 DeepSeekV4 的第 0 天推理支持及性能。目前中国在开源模型领域占据主导地位，[Kimi K2.6 在编程方面仍然击败了 Jensen 的 Nemotron Committee Coalition 的 Nemotron 3 Ultra](https://x.com/SemiAnalysis_/status/2062942704296743164)。此外，[Nvidia 自研的 TensorRT-LLM 对 DeepSeek v4 的支持不佳，我们 SemiAnalysis 不得不修复了他们开源的 mHC kernel 启动代码](https://github.com/NVIDIA/TensorRT-LLM/pull/13710)。[感谢 NVIDIA 工程师 rebase 并合并了我们的补丁](https://github.com/NVIDIA/TensorRT-LLM/pull/13771)！
+
+ROCm 在 DeepSeek v4 发布的头几天同样表现不佳。不过，在 HaiShaw 的技术领导下，AMD SGLang 工程团队在首月内大幅提升了性能——在第 26 天实现了超过 100 倍的性能提升。我们将在即将发布的综合性 State of AMD 2026 文章中更多地讨论 AMD 软件进展的优劣。
+
+所有性能追踪数据均记录在我们的开源 GitHub 仓库中。如果您觉得这个仓库有用，欢迎给我们一个 star：[https://github.com/SemiAnalysisAI/InferenceX](https://github.com/SemiAnalysisAI/InferenceX)。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/inferencex-github-repo.png"
+  alt="SemiAnalysisAI/InferenceX GitHub 仓库页面截图"
+  caption="来源：SemiAnalysis - InferenceX GitHub"
+/>
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?preset=dsv4-launch">
+  点击查看完整的 DeepSeekV4 InferenceX 仪表板 →
+</DashboardCTA>
+
+SemiAnalysis InferenceX 推理项目得到了 ML 社区众多伙伴的支持，包括 OpenAI、Oracle、Microsoft、Weka、PyTorch Foundation、vLLM、SGLang 和 CoreWeave 等。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/inferencex-supporters.png"
+  alt="InferenceX 支持者徽标，包括 OpenAI、Oracle、Microsoft、Weka、PyTorch Foundation、vLLM、SGLang 和 CoreWeave"
+  caption="来源：SemiAnalysis InferenceX 支持者"
+/>
+
+[查看所有 InferenceX 支持者](https://inferencex.semianalysis.com/quotes)
+
+InferenceX 团队非常感谢 vLLM 社区维护者和 Inferact 持续投入的工程努力，也感谢全球各地的 SGLang 维护者——来自 RadixArk、Meta 及其他机构。我们还要特别感谢 Nvidia 工程师 Kedar Potdar、Ankur Singh、Xin Li、Alec Flowers 以及许多其他 Nvidia 工程师在此项目第 0 天提供的支持。同时也向 AMD 工程团队致以谢意，感谢他们在 DeepSeek v4 Pro 发布后持续支持 ROCm 栈。
+
+不巧的是，DeepSeek v4 发布时我们的 GB300 集群恰好宕机了。幸运的是，[CoreWeave 伸出援手，为开源社区和维护者贡献了算力，紧急调配了两个备用开发 GB300 NVL72 机架。](https://x.com/SemiAnalysis_/status/2048082151711641829) 我们的 GB300 测试结果完全得益于他们的支持，目前我们正全天候使用这些资源来推动进一步的性能改进。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/coreweave-gb300-nvl72-racks.jpg"
+  alt="CoreWeave 数据中心的 GB300 NVL72 机架"
+  caption="来源：SemiAnalysis"
+/>
+
+如果您有兴趣从事底层基准测试（benchmark）、InferenceX 或其他有趣的技术工作，请将简历发送至 [letsgo@semianalysis.com](mailto:letsgo@semianalysis.com)，并附上三条要点来展示您的工程能力。如果方便，请附上 GitHub 仓库链接、个人网站或博客来展示您的项目、工作成果和专业知识。
+
+## 第一节：DeepSeekV4 Pro 第 0 天性能
+
+在本节中，我们将首先讨论 DeepSeek v4 Pro 在第 0 天的开箱即用性能。我们将引用吞吐量（throughput）-交互性曲线，说明不同并行策略如何在吞吐量和交互性之间取舍，以及 MTP 和分离式推理等其他推理优化技术——这些在 [InferenceX V2 文章](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs)中已有详细解释。
+
+> 请注意，为了避免潜在的"推理世界大战 3"以及防止[又一轮 vLLM vs SGLang 的推特骂战/说唱对决](https://x.com/EmbeddedLLM/status/1913854116545307094)，本文不会在同一图表上同时展示同一硬件 SKU 的 vLLM 和 SGLang 结果。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/vllm-vs-sglang-twitter-drama-meme.png"
+  alt="战争电影 meme，配文 'vLLM vs SGLang 推特骂战'"
+  caption="来源：SemiAnalysis"
+/>
+
+以下两张图表展示了我们记录的所有第 0 天配方（recipe），大多数配方使用原生模型检查点，采用混合 FP4 MoE-FP8 Attention 量化权重（H200 和 MI355X SKU 除外）。由于 DeepSeekV4 Pro 的原生 FP4+FP8 检查点在第 0 天无法在 MI355X 上使用，我们只能选择使用完整 FP8 非原生检查点。
+
+不幸的是，AMD SGLang 和 AMD vLLM 的分布式推理在 DeepSeekV4 Pro 上仍然无法工作。
+
+转向 [SGLang](https://github.com/sgl-project/sglang/pull/23600) 和 [vLLM](https://github.com/vllm-project/vllm/pull/40760)，两者均在 CUDA 平台上于模型公开发布的第一时间支持了原生 DeepSeek v4 Pro。大多数发布的配方，特别是针对 B200/B300 等较新 SKU 的配方，均可开箱即用且无重大问题。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/day0-vllm-pareto.png"
+  alt="第 0 天 DeepSeek V4 Pro 1.6T (FP4/FP8, 8K/1K) 各硬件 token 吞吐量/GPU 与交互性对比图，涵盖 GB200 NVL72 Dynamo vLLM、B300 vLLM、B200 vLLM、MI355X ATOM、MI355X SGLang 和 H200 vLLM。MI355X 数据点位于左下角。"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=DeepSeek-V4-Pro&g_rundate=2026-04-27&g_runid=25016676395&i_hc=1&i_prec=fp4%2Cfp8&i_active=b200_vllm%2Cb300_vllm%2Cgb200_dynamo-vllm%2Ch200_vllm%2Cmi355x_atom%2Cmi355x_sglang)
+
+下图展示了 SGLang 第 0 天的性能表现：
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/day0-sglang-pareto.png"
+  alt="SGLang 第 0 天 DeepSeek V4 Pro 1.6T 在 GB300 NVL72、B300、B200、MI355X 和 H200 上的 token 吞吐量/GPU 与交互性对比"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?i_hc=1&g_model=DeepSeek-V4-Pro&g_rundate=2026-04-25&i_prec=fp4%2Cfp8&g_runid=24943464864&i_active=b200_sglang%2Cb300_sglang%2Cmi355x_sglang)
+
+现在让我们深入了解每组第 0 天结果的细节。
+
+### 第 0 天 GB200 NVL72 多节点分离式 Prefill
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/day0-gb200-nvl72-disagg-vllm.png"
+  alt="GB200 NVL72 Dynamo vLLM 分离式 prefill 第 0 天吞吐量-交互性曲线与 B200 对比，在较低交互性下性能提升最高达 5 倍"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=DeepSeek-V4-Pro&g_rundate=2026-04-27&g_runid=25016676395&i_prec=fp4%2Cfp8&i_hc=1&i_legend=0&i_active=b200_vllm%2Cgb200_dynamo-vllm)
+
+vLLM 和 Nvidia 非常快速地在 [srt-slurm](https://github.com/NVIDIA/srt-slurm/pull/71) 中交付了他们的 GB200 分布式推理 Dynamo vLLM 配方。分离式推理（disaggregated inference）和宽专家并行（WideEP）是能够显著提升每美元性能的推理优化技术——读者可以在我们的 [InferenceX V2 文章](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs)中了解更多。该配方本身较为基础：prefill 使用 eager 模式，通过 NIXL 进行 KV cache 传输。我们独立复现了该配方，在使用较低交互性配置时，实现了比 B200 运行最高 5 倍的性能提升。
+
+这是 CUDA 护城河的一个典型案例：借助 CUDA，分布式推理通常能在最新开源模型发布的第 0 天附近即获得支持。
+
+### 第 3 天 多 token 预测（MTP）推测性解码
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/day3-sglang-mtp.png"
+  alt="SGLang 第 3 天 MTP DeepSeek V4 Pro 吞吐量-交互性曲线，在较高交互性下表现出显著提升"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=DeepSeek-V4-Pro&g_rundate=2026-04-27&g_runid=25016676395&i_prec=fp4%2Cfp8&i_hc=1&i_legend=0&i_active=b300_sglang%2Cb300_sglang_mtp)
+
+DeepSeek v4 的首个 MTP 支持在第 3 天由 SGLang 交付。使用 MTP 在较高交互性下带来了吞吐量的大幅提升。关于 MTP 及其如何惠及内存受限的小批量解码，可参阅我们的 [InferenceX V2 文章](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs)。
+
+### 第 0 天 ROCm AMD MI355X 的失望表现
+
+转向 AMD MI355X 上的 ROCm，我们第 0 天 DeepSeek v4 的结果令人困惑。大多数开源生态系统中的 AMD 用户也同样感到迷茫。MI355X 在第 0 天只能运行 FP8，在下方总览图表中的位置位于左下角。推理在技术上是可以运行的，但由于交互性极低——仅 1-2 tok/s/user，远低于用户平均阅读速度——使用体验极差。
+
+我们使用了 AMD 的 HaiShaw 等人通过 [SGLang PR](https://github.com/sgl-project/sglang/pull/23608#issuecomment-4311952977) 提供的第 0 天 WIP 配方。这是我们在第 0 天能找到的唯一可用配方。不幸的是，其性能令人失望且原生 FP4+FP8 检查点无法使用——这可能是由于 ROCm 生态系统相对不够成熟。然而，正如我们将在文章后面讨论的，HaiShaw 的团队最终交出了出色的答卷，通过一系列基于第一性原理的经典工程工作，从第 0 天到第 26 天实现了超过 100 倍的性能提升。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/day0-mi355x-rocm.png"
+  alt="第 0 天 MI355X ROCm DeepSeek V4 Pro 结果，交互性被限制在 1-2 tok/s/user，位于图表左下角"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=DeepSeek-V4-Pro&g_rundate=2026-04-25&i_hc=1&i_legend=0&g_runid=24943464864&i_prec=fp4%2Cfp8&i_gpus=mi355x_atom%2Cmi355x_sglang&i_dstart=2026-04-25&i_dend=2026-04-26&i_active=mi355x_sglang)
+
+### AMD ATOM 推理引擎的失望表现
+
+ATOM 在交互性方面稍好一些，但在并发数（concurrency）大于 1 时仍然力不从心。在 DeepSeek v4 早期，我们使用的 [ATOM #650](https://github.com/ROCm/ATOM/pull/650) 硬编码了 `kv_cache[:1,...]`，这意味着 KV cache 被锁定在单个序列槽位上。只有一个槽位可用，第二个并发请求便无处存储其 KV 状态。这是因为实现批处理的基础设施尚未就位，所以我们只能运行批大小为 1 的单用户请求。
+
+ATOM 的几乎每条热路径也都走了 fallback：FP4 MoE 被迫使用 Triton（因为 AITER 的 `fused_moe` 在 GFX950 上损坏了），mHC 的 pre-projection 也回退到了 Torch（因为 AITER 的 kernel 会崩溃），从而强制使用 eager 模式执行。
+
+### NVIDIA TensorRT-LLM 的 Bug 及缺乏第 0 天 DeepSeekV4 Pro 支持
+
+TensorRT-LLM 无法开箱即用地支持 DeepSeek v4，因为 `mhcFusedHcKernel.cu` 中有一个硬编码的 `FHC_HIDDEN = 4096` 常量。问题在于 SHAPE_K、residual/x TMA 描述符以及 MMA kernel 模板实例化都绑定在该隐藏维度大小上。所有之前的 DeepSeek 模型和 DeepSeek v4 Flash 的隐藏维度均为 4096，因此此前没有暴露问题。但尝试对 DeepSeek v4 Pro 进行推理时，会触发 `"mhcFusedHcLaunch: hidden_size=7168 not supported (only 4096)"` 的保护错误。
+
+Nvidia 工程师也遇到了这个保护错误，但他们没有添加代码来支持 DeepSeek v4 Pro 的 7168 隐藏维度，而是直接[移除了保护检查](https://github.com/NVIDIA/TensorRT-LLM/commit/b3f45bb608aecca666a451ca5138b81470487f05)。毫不意外，错误确实消失了。
+
+由于这个"修复"，有超过一周的时间，除非使用环境变量 `TRTLLM_MHC_ENABLE_FUSED_HC=0`，否则 kernel 仍只针对 4096 编译，且没有任何检查会拒绝隐藏维度为 7168 的调用。在默认设置下（fused HC 默认开启；B300 = SM10x → MMA 路径），stock trtllm-serve 运行 DeepSeek v4 Pro 时会将隐藏维度为 7168 的张量送入为 4096 编译的 kernel。在这些设置下运行推理[不会立即崩溃，但会产生隐藏的后果：引擎最终会损坏隐藏状态并产生无效生成结果](https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25231354124/job/73987414247)。这个问题通过[我们编写的 PR](https://github.com/NVIDIA/TensorRT-LLM/pull/13710) 得到了修复。令人惊讶的是，这样一个简单的问题竟然过了一周才被发现，而且 PR 又花了好几天才被批准。
+
+等到我们诊断出问题并将其定位为 fused HC 隐藏维度不匹配时，已经是 DeepSeek v4 Pro 发布后的第 9 天。这一事件是一个很好的案例研究，证明了原生 SGLang 和原生 vLLM 引擎开源生态的强大。正因为有了这些健壮的生态系统，第 0 天支持总是先落地原生 SGLang 和原生 vLLM，然后才轮到 TensorRT-LLM 或 AMD 的 ATOM 引擎（顺便说一下，ATOM 目前没有任何生产客户）。
+
+在下方图表中，我们可以看到截至目前，TRT-LLM 在较大批量时性能更优，但在较高交互性水平下往往落后。
+
+## 第 1.5 节：性能随时间的演进
+
+如文章前面提到的，我们捕获第 0 天各推理引擎和配方的性能快照，将其作为衡量性能随时间改进的基线。有了这一基线性能，我们便能测量并呈现以下分析性能提升的数据。
+
+### DeepSeek v4 Pro 在 MI355X 上 — 不到 1 个月内提升 100 倍
+
+在第 0 天，DeepSeek v4 Pro 在 MI355X 上技术上可以运行，但显然无法部署到任何生产工作流中。然而，此后的改进令人惊叹——在 HaiShaw 的领导下，AMD 团队在 DeepSeek v4 发布后不到一个月内实现了超过 100 倍的吞吐量提升。
+
+<video
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/mi355x-improvement-over-time.mp4"
+  controls
+  muted
+  loop
+  playsInline
+  className="rounded-lg w-full md:w-3/4 mx-auto my-6 block"
+/>
+
+_MI355X DeepSeek V4 Pro 性能从第 0 天起的改进过程。来源：SemiAnalysis InferenceX_
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/mi355x-100x-over-time.png"
+  alt="MI355X SGLang DeepSeek V4 Pro Pareto 前沿从第 0 天 FP8 构建（4 月 25 日）到 FP4 构建（5 月 27 日）提升超过 100 倍"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=DeepSeek-V4-Pro&g_rundate=2026-04-25&i_legend=0&g_runid=24943464864&i_prec=fp4%2Cfp8&i_gpus=mi355x_sglang%2Ch200_sglang&i_dstart=2026-04-25&i_dend=2026-05-27&i_dates=2026-05-02%2C2026-05-03%2C2026-05-04%2C2026-05-08%2C2026-05-10%2C2026-05-19%2C2026-05-21&i_active=mi355x_sglang)
+
+上方图表显示了从 4 月 25 日第 0 天发布的 FP8 构建到 5 月 27 日发布的 FP4 构建之间，吞吐量 Pareto 最优前沿的攀升。性能提升几乎完全来自 AMD 用真正的 AITER、Triton、TileLang 和 FlyDSL kernel 替换 PyTorch 原生 fallback 路径。
+
+推动最大份额性能提升的有两个关键步骤。最大百分比的改进实际上来自基线第 0 天提交之后的第一次 commit——团队清理了大量低垂果实，显著改善了 FP8 基线的首次迭代。下一个最大改进在几天后到来，AMD 团队终于让 FP4 权重 MoE 正常工作，使我们能将 MoE 专家从 FP8 切换到原生 FP4 (MXFP4)，改善了专家权重的带宽。这也将 FlashMLA 和稀疏注意力索引器从 torch fallback 移至 TileLang kernel，并启用了 HIP graphs。
+
+我们看到的下一个重大改进来自 AITER mHC kernel 的引入，这些 kernel 在每一层都会使用。这一改进使得 MI355X 在较低交互性水平下首次超越了 H200 的 DeepSeek v4 Pro 性能。
+
+在窗口注意力 kernel 运行之前，需要知道每个查询的窗口覆盖了哪些 KV-cache 槽位。这由 SWA-prepare 完成，其在 Triton 中的实现也有助于性能改进。
+
+下一个重大跳跃出现在 5 月 19 日，团队淘汰了剩余的 fallback：FlashMLA 从 TileLang 迁移到了 Triton，AITER FlyDSL FP4 MoE kernel 也同时落地。团队还启用了 fused hash-topk、DSv4 radix attention、fused store-cache、fused WQA/WKV projection 和 fused paged-compress，进一步提升了性能。并发扫描范围也增加到了 1024，补全了此前未测量的高吞吐、低交互性端前沿。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/atom-frontier-over-time.png"
+  alt="MI355X ATOM 前沿从单个 conc=1 数据点扩展到完整的 Pareto 前沿，H200 作为参考"
+  caption="ATOM 从左下角的单个数据点发展到完整前沿。H200 作为参考。来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-31&g_runid=26696268345&g_model=DeepSeek-V4-Pro&i_gpus=mi355x_atom%2Ch200_sglang&i_dstart=2026-04-26&i_dend=2026-05-14&i_prec=fp4%2Cfp8&i_active=mi355x_atom&i_dates=2026-05-02)
+
+ATOM 也取得了巨大进步，从单个 conc=1 数据点扩展到在整个 Pareto 前沿提供可观的吞吐量，部分数据点甚至超越了 H200。第一个提升来自 [AITER fix #2916](https://github.com/ROCm/aiter/pull/2916)，该修复纠正了导致 mHC 崩溃的设备分配 bug，使 ATOM 得以恢复该 AITER kernel。接着，FP4 专家迁移到了 AITER 的 fused MoE kernel（移除了 Triton override），稀疏注意力的 OOM 问题被清除，eager 模式和单序列限制得以解除。批处理支持也已实现，将并发扫描从 conc=1 扩展到 conc 1–512，性能大幅改善。
+
+#### MI355X MTP
+
+到第 4 周，MTP 已在 AMD 所有框架上正常工作，在等交互性（iso-interactivity）条件下实现了数倍的吞吐量提升。不过我们注意到一个一致的特征：MTP 在较高吞吐量下往往效果更差。这是因为 MTP 利用的是内存受限解码中的计算空闲，因此在计算受限的大批量解码任务中，MTP 的开销会超过草稿 token 带来的收益。
+
+### B300
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/b300-sglang-over-time.png"
+  alt="B300 SGLang DeepGEMM MegaMoE 吞吐量在不到一周内提升 3 倍"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-29&g_model=DeepSeek-V4-Pro&g_runid=26649066318&i_gpus=b300_sglang%2Cb300_trt%2Cb300_vllm&i_active=b300_sglang%2Cb300_trt&i_dstart=2026-04-24&i_dend=2026-05-18&i_dates=2026-04-25%2C2026-04-26%2C2026-04-27%2C2026-04-28%2C2026-04-29%2C2026-05-03%2C2026-05-05)
+
+对于 B300 上的 SGLang，DeepGEMM MegaMoE 的结果显示不到一周内实现了 3 倍性能提升，这得益于分组 FP4 MoE GEMM（将专家保持驻留并执行一次 mega-dispatch 而非逐专家 kernel 调度），以及从 EP8 调优至 EP4。
+
+### B200
+
+B200 的性能与 B300 较为相似，在较低交互性下 TRT-LLM 在 B200 上表现更优。但 TRT-LLM 无法开箱即用，相比之下 CUDA vLLM 和 CUDA SGLang 可以直接使用。
+
+### GB300 NVL72
+
+<video
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/gb300-nvl72-improvement-over-time.mp4"
+  controls
+  muted
+  loop
+  playsInline
+  className="rounded-lg w-full md:w-3/4 mx-auto my-6 block"
+/>
+
+_GB300 NVL72 DeepSeek V4 Pro 性能随时间的改进。来源：SemiAnalysis InferenceX_
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/gb300-nvl72-mtp-over-time.png"
+  alt="GB300 NVL72 SGLang MTP 吞吐量-交互性曲线：第 0 天窄 EP=8 配方 vs 6 月 2 日 W4A4 MXFP4 MegaMoE 宽 EP=16 解码拓扑"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?i_hc=1&g_model=DeepSeek-V4-Pro&g_rundate=2026-06-08&g_runid=27099659001&i_prec=fp4%2Cfp8&i_gpus=gb300_dynamo-sglang%2Cgb300_dynamo-sglang_mtp&i_active=gb300_dynamo-sglang&i_dates=2026-04-30%2C2026-05-07%2C2026-05-11%2C2026-05-20%2C2026-05-22%2C2026-05-28%2C2026-06-02%2C2026-06-03%2C2026-06-08&i_dstart=2026-04-30&i_dend=2026-06-08)
+
+GB300 SGLang MTP 最显著的改进出现在 6 月 2 日，源自 W4A4 (MXFP4) MegaMoE 的实现。相比 5 月 7 日使用的非 MTP 实现，6 月 2 日版本的主要改进完全来自 GB300 解码拓扑的重构，而非触及 kernel 或精度。第 0 天的配方大多数点以窄 EP=8 运行，由一到两个 prefill worker 供给，并发上限为 16,384；5 月 20 日的运行将解码扩展到 EP=16，prefill 扩展到每个解码 worker 对应四到十二个 prefill worker，并将并发推到了 21,504。
+
+根据上述图表和分析，我们可以看到，正如预期的那样，对于更大 world size 的推理系统，宽专家并行（Wide EP）是 GB300 卓越性能的主要杠杆，这通过在更多 GPU 之间分摊权重加载来实现。阅读 [InferenceX V2](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs) 文章了解更多关于 Wide EP 的内容。
+
+这些 GB300 的测试结果完全得益于 CoreWeave 的支持。
+
+## B200 每兆瓦（MW）token 吞吐量改进
+
+对于使用 vLLM 引擎的 B200，在 50 tok/s/user 交互性下，每全电源配置（all-in provisioned utility）兆瓦的 token 吞吐量在第 0 天达到每秒每 MW 300,000 token，到 6 月 5 日提升至接近每秒每 MW 500,000 token。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/b200-tokens-per-mw.png"
+  alt="B200 vLLM 每全电源配置兆瓦 token/秒从第 0 天约 300,000 提升至 6 月 5 日在 50 tok/s/user 下接近 500,000"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=DeepSeek-V4-Pro&g_rundate=2026-06-05&g_runid=27002889438&i_metric=y_tpPerMw&i_gpus=b200_vllm&i_dates=2026-04-28&i_dstart=2026-04-27&i_dend=2026-06-05&i_hc=1&i_linelabel=1)
+
+每全电源配置兆瓦 token 数是评估机群规模投资回报的最佳指标：它比单纯的每 GPU token 吞吐量承载更多信息，因为它反映了 PUE 和数据中心开销。由于 B200 的全电源配置功率大约固定在 2.17 kW/GPU，从约 300k 到约 500k tok/s/MW 的约 1.7 倍跃升反映的是纯软件优化收益。
+
+推动吞吐量前沿的同类优化（MegaMoE 分组 FP4 GEMM、更宽的 EP、FP4 权重路径、调度器调优）直接传导至能效提升，因为以 MW 计的全电源配置功率保持不变。
+
+许多组织从最大化稀缺电力资源的角度来评估推理机群。核心问题是如何将配置的 MW 在给定利用率和价格下转化为尽可能多的计费 token。最佳分析方式是参考每 MW 收入、每全电源配置功率 token 数、每 MW 资本支出等指标。这正是我们的 [Tokenomics 模型](https://semianalysis.com/tokenomics-model/)所要解决的商业问题。
+
+## 截至 2026 年 6 月 6 日的当前性能
+
+让我们通过快速回顾各系统和推理引擎的最佳性能来结束性能改进这一部分。使用 SGLang 时，GB300 继续碾压所有其他推理系统，展示了 GB300 NVL72 机架级 world size 的优势。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/current-sglang-pareto.png"
+  alt="截至 2026 年 6 月 6 日各系统 DeepSeek V4 Pro SGLang 最佳性能，GB300 NVL72 碾压所有其他系统"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=DeepSeek-V4-Pro&g_rundate=2026-06-05&g_runid=27002889438&i_active=b300_sglang_mtp%2Cgb300_dynamo-sglang%2Cgb300_dynamo-sglang_mtp%2Cmi355x_sglang_mtp&i_linelabel=1)
+
+开启 MTP 后，GB300 在我们分析的所有交互性水平上都无可匹敌。GB300 每百万输出 token 的成本在 50 tok/s/user 下达到 $0.156（假设 8k token 输入、1k token 输出）。欲了解更多关于我们如何计算总拥有成本（TCO）的信息，请参阅我们的 [TCO 模型](https://semianalysis.com/ai-cloud-tco-model/)。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/gb300-cost-per-million-tokens.png"
+  alt="启用 MTP 的每百万输出 token 成本与交互性对比：GB300 NVL72 在 8k 输入 / 1k 输出下，50 tok/s/user 时达到每百万 token $0.156"
+  caption="来源：InferenceX TCO 计算器"
+/>
+
+[实时 TCO 计算器视图](https://inferencex.semianalysis.com/calculator?g_rundate=2026-06-08&g_model=DeepSeek-V4-Pro&g_runid=27099659001)
+
+机架级优势本质上是一个纵向扩展（scale-up）的故事。NVL72 将 72 个 GPU 置于单个 NVLink 域中，使服务栈能够以足够宽的专家并行运行，让 DeepSeek V4 的 MoE dispatch/combine all-to-all 完全运行在 NVLink 上而无需溢出到较慢的横向扩展（scale-out）网络，同时在更多 rank 之间分摊专家权重加载。
+
+B200 和 B300 的 8 GPU NVLink 岛通过 InfiniBand 横向扩展会更早遇到瓶颈，而 MI355X 在纵向扩展域规模和通信栈成熟度上都更为落后。将这些每机架吞吐量优势转化为实际部署的服务容量是另一个问题，取决于每种 SKU 实际上线的数量：各 SKU 的出货量和平均售价、按客户和季度统计的装机量和有效 FLOPS——这些我们在 [Accelerator & HBM 模型](https://semianalysis.com/accelerator-hbm-model/)中持续追踪。
+
+### ROCm vLLM DeepSeek v4 Pro 的失望表现
+
+在 ROCm 方面，原生 vLLM 的进展远慢于原生 SGLang。ROCm vLLM 的性能远远落后于其 CUDA vLLM 对应版本。部分原因是 AMD 正在将重心转向 ATOM（一个服务零生产 token 的推理引擎），而非专注于原生 vLLM（一个许多主要客户都在使用的推理引擎）。我们将在即将发布的 State of AMD 2026 文章中详细讨论（涵盖 AMD 推理的优点、缺点和问题）。我们将提到的一个积极进展是，开源上游 AMD vLLM 的分布式推理功能终于在非 DeepSeekv4 模型上实现了开箱即用。虽然走到这一步花了很多个月，但 AMD vLLM 团队仍有大量工作要做。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/rocm-vllm-lag.png"
+  alt="ROCm vLLM DeepSeek V4 Pro 性能远落后于 CUDA vLLM"
+  caption="来源：InferenceX"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?i_hc=1&g_model=DeepSeek-V4-Pro&g_rundate=2026-06-05&i_prec=fp4%2Cfp8&i_active=b200_vllm%2Cmi355x_vllm&g_runid=27002889438)
+
+## DeepSeek v4 的下一步
+
+### vLLM
+
+vLLM 的计划在 [DeepSeek V4 路线图 issue (#40902)](https://github.com/vllm-project/vllm/issues/40902) 中追踪，已落地的代码见实现 PR #40860，该路线图同时描述了针对 SemiAnalysis InferenceX 仪表板的基准测试。FP4 Indexer 和初始 MegaMoE 支持已经实现，Hopper 也已支持。vLLM 针对 DeepSeek v4 的剩余工作涵盖五个方面：
+
+- **核心模型支持**：持续 MegaMoE 工作（PR #40833）和 NVFP4 支持。
+- **运行时与并行**：Model Runner V2 集成、MTP 优化、prefill/decode (PD) 优化、pipeline parallelism 支持。
+- **Kernel 集成**：paged prefill kernel、fast top-k kernel、更多水平融合、DeepEP V2，以及与 DeepSeek 自研 TileKernels 的集成。
+- **KV cache**：KV cache 卸载，涵盖 PD + CPU 卸载（PR #39654）和分布式 KV 卸载。
+- **硬件支持**：在已完成 Hopper 支持的基础上，SM120 和 AMD 支持仍是关键待办项。
+
+这里的核心主题聚焦在周边系统：新的模型运行器、pipeline parallelism、KV 卸载，以及更广泛的硬件覆盖。
+
+即将到来的 InferenceX 更新——SemiAnalysis 的开源公共 EcosystemX 仪表板——将可视化所有主要 ML 开源库在所有主要 AI 芯片上的软件演进、CI 覆盖率和队列时间：NVIDIA、AMD、TPU、Trainium、Huawei 等。
+
+### SGLang
+
+SGLang 的计划记录在[性能优化追踪 (#23666)](https://github.com/sgl-project/sglang/issues/23666) 中，Nvidia 围绕 DeepSeek v4 的网络架构图逐模块推进；部分项可能已被初始支持 PR (#23600) 部分覆盖，欢迎社区贡献。
+
+这里反映了三个高层目标：解码的 CUDA graph 支持、prefill 的分段 CUDA graph 支持，以及无运行时权重处理。此外，权重准备应只执行一次而非每步重复。在这三个高层目标之下，检查清单按 V4 的组件分组：
+
+- **mHC**：尝试对 `fc_hc_fn` GEMM（N 维度较小，可能需要特殊 kernel）使用 TF32/BF16，1/RMS + multiply 融合，单 kernel `hc_split_sinkhorn` 和 `hc_post`，以及在 attention 和 MoE 模块中融合 MulSum + RMSNorm（+ FP8/MXFP8 量化）。
+- **HCA（含 Compressor）**：将 `fc_qa` + `fc_kv` 水平融合为一个 FP8 GEMM，q-norm/k-norm 和 RMSNorm+RoPE 融合，去掉 `topk_idx` 的非稀疏 MQA 路径，MQA 直接从压缩和 SWA KV-cache 读取而无需拷贝/拼接，单 kernel InvRoPE，融合的 Compressor 状态更新（kv-update + ape-Add + score-update），以及使 HCA（尤其是 Compressor）兼容解码的 CUDA graph。
+- **CSA（Indexer + Compressor）**：稀疏路径类似的直接缓存读取，可选的（P1）fc_compressor + fc_idx_compressor 和 fc_qb + fc_idx_qb 融合，(RoPE +) Hadamard + MXFP4 量化融合，高效的 MXFP4 BMM+ReLU kernel（可能与 MulSum 甚至 Top-1024 融合），以及使 Indexer 和 Compressor 兼容 CUDA graph。
+- **MoE**：检验路由 GEMM 是否适用 TF32/BF16，将路由路径（softplus + sqrt + bias-add + Top-6 + gather + norm + multiply）尽可能压缩为最少的 kernel，融合 block-wise FP8 和 MXFP8 激活量化，确保 shared-expert 和 routed-expert FC13 均为单 kernel，并审计路由专家前的微型排序 kernel。
+
+SGLang 的重点是用单个融合 kernel 替换小算子链，使新注意力变体就地读取缓存，并将解码路径完全纳入 CUDA graph。
+
+## 第二节：Huawei 950DT 第 0 天 DeepSeek v4 分析
+
+DeepSeek v4 是首个在 Huawei Ascend 上获得一流第 0 天支持的重要开源模型，事实上，DeepSeek 官方 API 的部分服务从第 0 天起就运行在 Huawei 上。我们已获得 Huawei 在 DeepSeek v4 上的性能数据，并计划发布后续文章，深入对比 Huawei 与 H200 和 B200 上的推理性能，使用相同的基准测试（benchmark）工具进行苹果对苹果的比较。
+
+我们即将推出的公开开源 SemiAnalysis EcosystemX 仪表板将可视化所有主要 ML 开源库在所有主要 AI 芯片上的软件演进和 CI 覆盖率，包括 Ascend 栈。
+
+### CANN
+
+CANN（Compute Architecture for Neural Networks）是 Huawei 为在自家 Ascend 芯片上运行 AI 工作负载而开发的软件工具包。自 2025 年 8 月起，他们开源了 CANN 以吸引更多开发者，并"蚕食" Nvidia 的主导地位——尤其是在中国，鉴于美国政府严格限制 CUDA 芯片出口到中国。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/cann-open-source-slide.jpg"
+  alt="Huawei CANN 开源公告幻灯片"
+  caption="来源：CANN 幻灯片"
+/>
+
+[来源](https://gitcode.com/cann/community/blob/master/events/meetup/slides/DeepSeek-V4/20260424/DeepSeek-V4%E6%98%87%E8%85%BE%E9%A6%96%E5%8F%91_%E5%9F%BA%E4%BA%8ECANN%E7%9A%84%E9%AB%98%E6%80%A7%E8%83%BD%E6%8E%A8%E7%90%86%E4%BC%98%E5%8C%96%E5%AE%9E%E8%B7%B5.pdf)
+
+在第 0 天，CANN 发布了 Ascend 芯片的优化指南和基准测试数据。通过这些信息，我们可以看到 Huawei 的 CANN 策略：通过面向中国国产模型发布的全栈推理优化，使 Ascend 具备竞争力。Huawei 试图向中国生态系统表明，如果 DeepSeek 发布新架构，CANN 能够交付 kernel、graph 路径、量化、服务集成和部署配方。
+
+我们在基准测试 MTP 时观察到 CANN 团队的一个有趣方法论，不得不在此提及：他们如何处理 MTP 草稿 token 的 AR（acceptance rate，接受率）或 AL（acceptance length，接受长度）。基准测试 MTP 并非易事，因为基准测试的 AR/AL 可能与用户的实际用例不同。例如，基准测试平均每三个草稿 token 可能接受两个，但在极其多样化的实际部署场景中，可能平均只能接受 1.5 个。
+
+这意味着用户可能看到的性能低于基准测试，从而错误地认为他们的设置有问题。我们在 [InferenceX v2 文章](https://newsletter.semianalysis.com/i/188090866/multi-token-prediction-mtp)中通过与 MTBench 的 AR 对比来解决这个问题。我们基准测试的未来迭代将通过使用真实 trace 来全面解决这一差距。
+
+为应对这一问题，Huawei 选择将完整解码步骤的计时对齐到最后一个 MTP 模块，从而[记录每解码步骤的时间而非每 token 的时间](https://gitcode.com/cann/cann-recipes-infer/blob/052e0ba122043bf46a2b5d17e16488e53e7b0b60/executor/core/engine/execution_engine.py#L451)。最终发布的基准测试结果需要用户乘以其用例的 MTP AL 来得出可比较的性能指标——这是一种非常优雅的性能比较方式。
+
+### 嘿 NVIDIA Goliath，来了一位新 David — Ascend 950
+
+Huawei 对 Ascend 950 芯片的内部代号是"David"，这一代号在 CANN 代码库中被多次引用。毫无疑问，这是因为他们认为自己是对抗 Nvidia Goliath（巨人歌利亚）的 David（大卫）。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/ascend-950-david-codename.jpg"
+  alt="奇幻画风 meme：标注为 'Goliath NVIDIA' 的巨人俯瞰标注为 'Huawei Ascend 950, 代号 David' 的小骑士"
+  caption="来源：SemiAnalysis"
+/>
+
+SIMT/SIMD 950 芯片有两个型号：950PR 和 950DT。PR 代表 Prefill and Recommendation（预填充与推荐），是成本更低、性价比更高的芯片。DT 代表 Decode and Training（解码与训练），该型号具有更高的内存带宽和更高的性能。两者基于同一 Ascend 950 Die，使用双 die UMA 架构，但封装了不同的内存。每季度各 Huawei 芯片的路线图预估和出货量可在 [SemiAnalysis Accelerator 模型](https://semianalysis.com/accelerator-hbm-model/)中查看。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/ascend-950-architecture.jpg"
+  alt="基于同一双 die UMA Ascend 950 die 的 Huawei Ascend 950PR 和 950DT 产品型号"
+  caption="来源：CANN"
+/>
+
+[来源](https://cann.csdn.net/69d8a96e54b52172bc684f2e.html)
+
+芯片架构中有两个重要组件值得讨论：AIC（AI Cube）和 AIV（AI Vector）。AIC 是 Ascend AI Core 的**矩阵/张量核心**部分，用于密集矩阵运算：GEMM、matmul、类卷积张量操作、attention 投影、FFN 线性层等。Huawei 文档将 AIC 描述为分离式 AI Core 架构中的**矩阵计算**核心。AIV 是**向量核心**部分，处理逐元素/向量运算：激活函数、归一化组件、掩码、规约、类型转换、布局变换、matmul 的后处理等。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/ascend-aic-aiv-architecture.png"
+  alt="Huawei Ascend AI Core 分离为 AIC（AI Cube，矩阵计算）和 AIV（AI Vector，逐元素计算）核心的架构图"
+  caption="来源：CANN"
+/>
+
+[来源](https://cann.csdn.net/69d8a96e54b52172bc684f2e.html)
+
+这与 TPU 的 MXU 类似。不过 Ascend 将两种功能更直接地暴露为独立的分离核心，每个核心都能加载自己的代码段，并且支持"双主模式"——AIC 和 AIV 独立运行代码，而非由 AIV 通过消息驱动 AIC。
+
+AI CPU 是一个具有直接设备内存访问权限的设备端 ARM64 执行单元。它作为 AI Core 的补充，处理不适合在 SIMD/SIMT 核心上执行的工作：分支密集的控制流、标量逻辑、动态形状处理，以及 kernel 运行前需要的值依赖的调度/分块元数据。由于 AI CPU 位于设备上，Ascend 可以将这些不规则的控制型工作保留在本地，而非在主机 CPU 之间来回传输——后者是延迟（latency）和流水线气泡的主要来源。AI CPU 也是在专用 CCU 分担通信编排之前，历史上位于旧有 AICore → AICPU → SDMA 路径上的单元。
+
+与 TPU 和 Trainium 类似，Ascend 950 增加了专用的 CCU 通信引擎。该引擎与计算 die 并列，处理集合通信工作而不消耗 AI Core 的计算容量，支持远程读取 + 规约 + 本地写入，以及本地读取 + 远程写入。其优势在于更低的通信延迟、更少的 HBM 流量、更少的用户缓冲区拷贝，以及将计算核心从通信编排中解放出来，避免走旧有的 AICore → AICPU → SDMA 路径。
+
+### Huawei DeepSeekV4 Pro 950DT Profile
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/ascend-950dt-profile-overview.png"
+  alt="Ascend 950DT 上 DeepSeek V4 Flash 三个解码步骤的 Profiler trace，在 16 rank DP/EP 部署下，展示各流的活动"
+  caption="来源：SemiAnalysis, Huawei"
+/>
+
+上图展示了 DeepSeek V4 Flash 在 Ascend 950DT 上的三步 profile，使用 16 rank DP/EP 部署配置运行。图中可见全部 16 个 rank 参与集合通信，以及活跃的 MoE dispatch/combine 流量。
+
+与大多数栈目前的标准做法一样，CANN 也使用可在多个流上运行的独立计算和通信算子——通过控制 Cube 和 Vector 核心分配来避免资源争用从而提升性能。Prolog、Compressor 和 LightningIndexer 等操作可以重叠执行，C4A Compressor 可以完全隐藏，shared expert 计算可以隐藏在 routed expert 执行之下而不降低 routed expert 性能。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/ascend-950dt-decode-streams.png"
+  alt="单个 Ascend 950DT 解码步骤的放大 Profiler 视图，展示计算和通信算子分布在多个流上"
+  caption="来源：SemiAnalysis, Huawei"
+/>
+
+放大某个解码步骤，我们可以看到不同组件如何被分配到不同的流。不同流上的操作在设备有空闲的合适资源时可以并行运行。模型使用多个流，因为一层可能不是单一的串行链，而可以包含只需在结果合并时才同步的分支——例如 shared-expert 计算与 routed-expert 计算 100% 重叠。
+
+上图中的 Streams 145-148 对应元数据流。这些算子每次解码步骤触发一次，预计算后续 kernel 复用的值依赖调度/分块元数据。它们是解码步骤中唯一的 AI CPU 操作，占总时间的极小比例，且完全被 AI Core 计算重叠。其影响在更长上下文的基准测试中可能更显著，因为需要提前解析更多序列长度和掩码相关的分区。
+
+在 DeepSeek v4 中，Huawei 将稀疏注意力和 LightningIndexer 的值依赖调度阶段移到了 AI CPU 上，而非回弹到主机。这些元数据操作根据运行时序列长度、掩码和分页 KV 信息构建可复用的每核心分区张量；`SparseAttnSharedkv` 和 `QuantLightningIndexer` 随后使用它们来决定每个 cube 核心处理哪些 Batch/Head/Q-block/K-block 工作，以及相应的 vector 核心规约任务。从概念上讲，这类似于 FlashInfer 在主机端为分页 attention 所做的 planning 阶段：一个低成本、动态形状感知的设置步骤，只运行一次因此在各层间摊销——唯一区别是 Huawei 将同样的 planning 工作推到了设备端 AI CPU 上而非主机端。
+
+上图中的 Stream 152 包含 LM head、最后一层，以及倒数第二层的 `o_proj` 和 MoE。这是 `npugraph_ex` 图编译器的决策，可能是为了让 `npugraph_ex` 运行时将 stream 144 上的主图视为"完成"，同时尾部工作继续异步执行。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/ascend-950dt-mc2-operators.png"
+  alt="Ascend 950DT 解码 trace 展示 MoeDistributeDispatchV2 和 MoeDistributeCombineV2 MC2 计算-通信融合算子"
+  caption="来源：SemiAnalysis, Huawei"
+/>
+
+CANN 早在 2024 年就引入了 MC²（merged compute-communication，计算-通信融合）。这是一类既非普通 kernel 也非 HCCL 集合通信的融合算子，它们将通信和计算嵌入一个 kernel 中。在 DeepSeek v4 解码中，我们可以看到 `MoeDistributeDispatchV2` 和 `MoeDistributeCombineV2` MC² EP 算子被使用。
+
+这里的核心要点是 Ascend 在第 0 天即交付了可用的、经过优化的 DeepSeek v4 推理基础设施。Huawei CANN 栈是仅有的两个具备 DeepSeekV4 第 0 天支持的栈之一，另一个是 Nvidia 的 CUDA。如文章前面所述，AMD 的栈在第 0 天不幸未能良好运行。这与去年 DeepSeek v3/R1 发布时形成了鲜明对比。当时只有一个栈在第 0 天可用：Nvidia CUDA 栈。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/deepseek-day0-huawei-support.jpg"
+  alt="DeepSeek 宣布 DeepSeek V4 第 0 天支持 Huawei Ascend"
+  caption="来源：DeepSeek"
+/>
+
+[来源](https://x.com/deepseek_ai/status/2057854261699195173)
+
+赋予 Ascend 950 内部代号的那个圣经故事以巨人倒地收场。但故事中的 Goliath 是站着不动让 David 投石的，而 Nvidia 的 Goliath 却在不断运动，每年推出新架构并改进现有架构。Huawei 已经证明它能在第 0 天投出一块石头；至于能否击倒一个不断移动的巨人，尚待观察。
+
+## DeepSeek V4 架构深度解析与协同设计
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/dsv4-architecture-diagram.png"
+  alt="DeepSeek V4 架构图，来自 DSv4 技术报告"
+  caption="来源：DSv4 技术报告"
+/>
+
+### 针对 1M 上下文长度的推理优化
+
+DeepSeek v4 引入了压缩稀疏注意力（Compressed Sparse Attention, CSA）和高度压缩注意力（Heavily Compressed Attention, HCA），告别了多头潜在注意力（Multi-head Latent Attention, MLA）。该设计的核心动机是减小 KV cache 大小。
+
+本质上，HCA 的 KV cache 由 KV embedding 的滑动窗口和一组压缩 KV 条目组成，其中每个条目将连续 m′ 个 token 的 key/value 压缩为一条（DeepSeek V4 Pro 中 m′ = 128）。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/dsv4-hca-compression.png"
+  alt="高度压缩注意力（HCA）示意图：KV embedding 的滑动窗口加上每个跨越 128 个 token 的压缩 KV 条目"
+  caption="来源：DSv4 技术报告"
+/>
+
+CSA 使用与 HCA 相同的 KV cache 压缩技术，但压缩率较低（m=4）。CSA 还通过 lightning indexer 选择要 attend 的 token，对压缩后的 KV 条目应用稀疏注意力。该稀疏注意力继承自 DeepSeek v3.2 中的 DeepSeek Sparse Attention。
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/dsv4-csa-sparse-attention.png"
+  alt="压缩稀疏注意力（CSA）示意图，压缩率 m=4，lightning indexer 选择要 attend 的 token"
+  caption="来源：DSv4 技术报告"
+/>
+
+通过交错使用 CSA 和 HCA，DeepSeek v4 大幅压缩了 KV cache 大小，在 1M 上下文长度下实现了 50 倍的 KV cache 缩减。
+
+然而，CSA 和 HCA 的新颖性给服务框架带来了 KV cache 管理挑战。例如，vLLM 的 KV cache 内存分配器实现了复杂的策略来确保高效的内存加载模式并支持前缀缓存（prefix caching）等服务特性。这包括设置逻辑块大小使其能整除 CSA 和 HCA 的 KV 压缩率，以及页面大小分桶策略以避免因 KV cache、compressor 状态、indexer KV 各自每条目大小不同而导致的内存碎片。
+
+### 确定性
+
+为确保 RL 训练的稳定性，DeepSeek 全力推进计算的确定性。这一努力在 GPU kernel 及其部署基础设施中均有体现。DeepSeek 为所有操作编写了自定义 kernel 以实现批不变性（batch invariance），通过强制执行特定的规约顺序（无论批大小如何）来保证确定性。这包括批不变的 split KV attention forward、GEMM 和 MoE backward kernel。批不变 kernel 会带来性能损失，因为使用它们意味着无法采用许多流行的不保证确定性规约顺序的算法技术。DeepSeek 通过编写针对自身工作负载定制的 kernel 来缓解性能损失，例如为特定矩阵形状特化 kernel。在部署基础设施方面，DeepSeek 着力于容错以使所有 rollout 可复现。DeepSeek 为每个生成请求构建了 token 粒度的预写日志（WAL），因此在 prefill 或 decode 期间被抢占的任何请求都可以在不重新计算的情况下恢复。
+
+## MegaMoE
+
+DeepSeek V4 的发布还包含了一个新的融合 MoE kernel，实现了 MoE 层中所有操作的更好重叠。使用专家并行（Expert Parallelism）的 MoE 首先进行 token dispatch all-to-all，然后是 Linear 1、Activation、Linear 2，最后是 token combine all-to-all。Linear 1 和 Linear 2 是分组 GEMM 操作，其中给定 rank 中的每个专家将其权重应用于路由到它的 token。作者在 DeepSeek V4 论文中提到，其他实现将 token dispatch 与 Linear 1 以及 Combine 与 Linear 2 进行重叠/交错，但在操作边界（Linear 1、Activation 和 Linear 2 之间）仍然存在跨所有专家的同步。MegaMoE 则将专家拆分为多个 wave，分别调度每个 wave，实现更细粒度的操作重叠，从而隐藏更多通信延迟。这让人联想到分布式 GEMM 等计算-通信融合，即将计算 kernel 和依赖的通信 kernel 通过拆分为更小的片段并流水线化来重叠，以隐藏通信延迟。
+
+论文声称在 DeepSeek v4 Flash 配置下，理论加速比相对于 naive kernel 为 1.92 倍，这意味着 naive kernel 必须将接近 50% 的时间花在 Dispatch 和 Combine 通信上！
+
+<Figure
+  src="/images/deepseekv4-16t-day-0-to-day-43-performance/dsv4-megamoe-overlap.png"
+  alt="DSv4 技术报告中的 MegaMoE wave 调度示意图，展示 dispatch、分组 GEMM、activation 和 combine 在各专家 wave 间的重叠"
+  caption="来源：DSv4 技术报告"
+/>
+
+在详细讨论了性能基准测试之后，让我们来讨论在 H200 和 GB200 NVL72 上运行 DeepSeek v4 的总拥有成本和每 token 成本。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?preset=dsv4-launch">
+  点击查看完整的 DeepSeekV4 InferenceX 仪表板 →
+</DashboardCTA>
+
+_本文后续包含 H200 和 GB200 NVL72 的完整总拥有成本和每 token 成本分析，详见 [SemiAnalysis 通讯订阅版](https://newsletter.semianalysis.com/p/deepseekv4-16t-day-0-to-day-43-performance)。_
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "MI355X SGLang DeepSeek V4 Pro 在发布后 26 天内性能提升了多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "AMD MI355X SGLang DeepSeek V4 Pro 的吞吐量在发布后 26 天内提升了超过 100 倍。2026-04-25 的第 0 天 FP8 构建在技术上可以运行，但交互性被限制在 1-2 tok/s/user，远低于可用的服务水平。性能提升几乎完全来自用真正的 AITER、Triton、TileLang 和 FlyDSL kernel 替换 PyTorch 原生 fallback 路径，关键跃升包括启用原生 FP4 (MXFP4) 权重 MoE、在每层引入 AITER mHC kernel，以及退役剩余 fallback（FlashMLA 从 TileLang 迁移到 Triton、AITER FlyDSL FP4 MoE、fused hash-topk、DSv4 radix attention、fused store-cache、fused WQA/WKV projection 和 fused paged-compress）。数据来源：InferenceX。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "启用 MTP 的 GB300 NVL72 上 DeepSeek V4 Pro 每百万输出 token 成本是多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在启用 MTP 的 GB300 NVL72 SGLang 上，DeepSeek V4 Pro 在 8K 输入 / 1K 输出工作负载下，50 tok/s/user 时每百万输出 token 成本达到 $0.156（基于 InferenceX TCO 计算器）。机架级优势源于 NVLink 纵向扩展到 72 个 GPU，使 DeepSeek V4 的 MoE dispatch/combine all-to-all 完全运行在 NVLink 上而无需溢出到较慢的横向扩展网络，同时在更多 rank 之间分摊专家权重加载。B200 和 B300 等 8 GPU NVLink 岛通过 InfiniBand 横向扩展时会更早遇到瓶颈。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "DeepSeek V4 的注意力架构是什么？CSA / HCA 如何减少 KV cache？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "DeepSeek V4 用压缩稀疏注意力（CSA）和高度压缩注意力（HCA）取代了多头潜在注意力（MLA）。HCA 的 KV cache 由 KV embedding 的滑动窗口和压缩 KV 条目组成，每个条目跨越 128 个 token（DeepSeek V4 Pro 中 m'=128）压缩 key 和 value。CSA 使用相同的压缩机制但压缩率更低（m=4），并通过 lightning indexer 对压缩后的条目应用稀疏注意力，继承自 DeepSeek V3.2 的稀疏注意力模式。交错使用 CSA 和 HCA 在 1M 上下文长度下实现了约 50 倍的 KV cache 缩减。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "B200 vLLM 每配置兆瓦 token 数在 DeepSeek V4 Pro 上有何改善？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "B200 vLLM 每全电源配置兆瓦每秒 token 数从第 0 天的约 300,000 提升到 2026-06-05 时在 50 tok/s/user 交互性下接近 500,000。由于 B200 的全电源配置功率大约固定在每 GPU 2.17 kW，这约 1.7 倍的跃升是来自 MegaMoE 分组 FP4 GEMM、更宽的专家并行、FP4 权重路径和调度器调优的纯软件收益。每全电源配置兆瓦 token 数是评估机群规模投资回报的最佳指标，因为它在原始每 GPU 吞吐量之外还反映了 PUE 和数据中心开销。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "DeepSeek V4 中的 MegaMoE 是什么？它能为 MoE 层带来多少加速？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "MegaMoE 是随 DeepSeek V4 引入的融合 MoE kernel，它将专家拆分为多个 wave 并分别调度，相比此前在每个操作边界同步的实现，在 dispatch all-to-all、Linear 1、Activation、Linear 2 和 combine all-to-all 之间实现了更细粒度的重叠。DeepSeek V4 论文报告在 DeepSeek V4 Flash 配置下，理论加速比相对于 naive kernel 为 1.92 倍，这意味着 naive kernel 将接近 50% 的时间花在了 dispatch 和 combine 通信上。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "哪些推理栈在第 0 天支持了 DeepSeek V4 Pro？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "只有 NVIDIA CUDA（通过原生 SGLang 和原生 vLLM）和 Huawei CANN（在 Ascend 950DT 上）在第 0 天提供了可用的 DeepSeek V4 Pro 支持。AMD ROCm 在 MI355X 上技术上可以运行，但交互性被限制在 1-2 tok/s/user，远低于可用的服务水平。NVIDIA TensorRT-LLM 开箱即用时存在故障，因为 mhcFusedHcKernel.cu 硬编码了 FHC_HIDDEN=4096 常量，对 DeepSeek V4 Pro 的 7168 隐藏维度会静默损坏隐藏状态；SemiAnalysis 编写的补丁在数天后被合并。这与 DeepSeek V3 / R1 发布时形成鲜明对比——当时仅有 NVIDIA CUDA 在第 0 天可用。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200.mdx b/packages/app/content/blog/zh/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200.mdx
new file mode 100644
index 00000000..2abb029c
--- /dev/null
+++ b/packages/app/content/blog/zh/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200.mdx
@@ -0,0 +1,136 @@
+---
+title: 'GB200 NVL72 vs B200 Kimi K2.5 推理对比：宽 EP vLLM 带来 3.1 倍提升'
+subtitle: 'NVL72 的机架级 NVLink 使 Dynamo vLLM 能够以最高 Decode EP 16 运行 Kimi K2.5 宽 EP，在 8k/1k NVFP4 下峰值吞吐量从 4,021 提升至 12,587 tok/s/GPU'
+date: '2026-04-23'
+publishDate: '2026-04-23'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - kimi
+  - nvidia
+  - gb200
+  - b200
+  - vllm
+  - nvl72
+  - wide-ep
+---
+
+NVIDIA GB200 NVL72 运行 Dynamo vLLM 在 Kimi K2.5 NVFP4 8k/1k 上峰值达到 12,587 tok/s/GPU，而最优 B200 单节点 vLLM 配方在同一工作负载上峰值为 4,021 tok/s/GPU。这意味着每 GPU 峰值吞吐量有 3.13 倍的优势。NVL72 的机架级 NVLink 互联让解码端可以使用最高 Decode EP 16 的宽专家并行（在已测试的配方中），峰值配方为 8 GPU 解码池上的 Decode EP 8。B200 在最优实测配方上止步于 Decode EP 4。超过该点后，专家 all-to-all 通信开始受到 scale-out 互联延迟的制约。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+Kimi K2.5 是一个 1T 参数的 MoE 模型，拥有 384 个路由专家加 1 个共享专家，每 token 激活 8 个专家，共 60 层 MoE 层。每个 MoE 层执行一次路由式 all-to-all 分发加一次 all-to-all 汇聚，因此单次前向传播在 60 层中总共约有 120 次 all-to-all 操作。在 NVL72 上，这些流量始终运行在 NVLink 5 上，每 GPU 1.8 TB/s，聚合互联带宽达 130 TB/s。而在 B200 上，宽 EP 超过 8 GPU 后就离开了 NVLink 域，退回到 ConnectX 7 InfiniBand 的每 GPU 400 Gb/s，约为 NVL72 NVLink 带宽的 1/36。稀疏度为 48 的 MoE（如 K2.5）无法在规模化时容忍这种差距。
+
+## 宽 EP 对 Kimi K2.5 为何重要
+
+在 EP 4 下，每个 GPU 持有 Kimi K2.5 384 个专家中的 96 个。解码受限于每步从 HBM 重新加载这些专家权重所需的显存带宽。将 EP 扩展到 16 会将每 GPU 的专家占用降至 24 个。每次专家权重读取被摊销到更大的有效批次上——更多对等 GPU 通过该 rank 分发 token。这使解码从权重带宽受限转向算力和通信受限。在这种模式下，Blackwell 的 FP4 tensor core 和 NVLink 带宽都能发挥优势。
+
+扩宽 EP 的代价是每个 MoE 层增加一次 all-to-all 集合通信。如果该集合通信命中 scale-out 互联，交互性预算在吞吐量收益回本之前就会崩溃。NVL72 的 scale-up 域使得 EP 8 到 EP 16 的宽 EP 在 K2.5 解码池上可行。B200 的 8 GPU NVLink 域使得跨两个节点的 Decode EP 4 成为 scale-out 接管前的天花板。
+
+## 峰值吞吐量与并发曲线
+
+所有数据均为 Kimi K2.5 NVFP4、ISL 8192 / OSL 1024，在 InferenceX 上测量。B200 数据来自 2026-03-27 运行，由 [InferenceX PR #926](https://github.com/SemiAnalysisAI/InferenceX/pull/926) 触发，该 PR 在随机数据集上禁用了 Kimi K2.5 vLLM 基准测试的前缀缓存。GB200 NVL72 数据来自 2026-04-07 运行，由 [InferenceX PR #1008](https://github.com/SemiAnalysisAI/InferenceX/pull/1008) 触发，该 PR 添加了 GB200 Dynamo vLLM 分离式多节点配方（vLLM 0.18.0、nvidia/Kimi-K2.5-NVFP4、NixlConnector KV 传输、FLASHINFER_MLA 注意力）。两次运行间隔 11 天。两者均为各自硬件上峰值吞吐量配方的最新可用数据。
+
+B200 vLLM，2026-03-27 运行，非分离式，16 GPU 池：
+
+| Prefill    | Decode     | Conc | tok/s/GPU | TPOT (ms) | tok/s/user |
+| ---------- | ---------- | ---- | --------- | --------- | ---------- |
+| TP 4, EP 4 | TP 4, EP 4 | 4    | 878       | 9.8       | 101.8      |
+| TP 4, EP 4 | TP 4, EP 4 | 8    | 1,529     | 11.2      | 89.5       |
+| TP 4, EP 4 | TP 4, EP 4 | 16   | 2,286     | 15.1      | 66.3       |
+| TP 4, EP 4 | TP 4, EP 4 | 32   | 3,108     | 22.2      | 45.0       |
+| TP 4, EP 4 | TP 4, EP 4 | 64   | **4,021** | **34.1**  | **29.3**   |
+
+GB200 NVL72 Dynamo vLLM，2026-04-07 运行，分离式：
+
+| Prefill    | Decode       | Conc  | tok/s/GPU  | TPOT (ms) | tok/s/user |
+| ---------- | ------------ | ----- | ---------- | --------- | ---------- |
+| TP 4, EP 4 | TP 4, EP 4   | 4     | 231        | 7.1       | 140.8      |
+| TP 4, EP 4 | TP 4, EP 4   | 8     | 421        | 7.7       | 129.1      |
+| TP 4, EP 4 | TP 4, EP 4   | 16    | 744        | 8.7       | 114.7      |
+| TP 4, EP 4 | TP 4, EP 4   | 32    | 1,230      | 10.3      | 96.9       |
+| TP 4, EP 4 | TP 4, EP 4   | 128   | 2,173      | 12.8      | 77.9       |
+| TP 4, EP 4 | TP 16, EP 16 | 512   | 6,885      | 20.5      | 48.8       |
+| TP 4, EP 4 | TP 16, EP 16 | 1,024 | 7,565      | 21.6      | 46.2       |
+| TP 4, EP 4 | TP 8, EP 8   | 2,048 | **12,587** | 43.1      | 23.2       |
+| TP 4, EP 4 | TP 16, EP 16 | 4,096 | 12,576     | 27.5      | 36.3       |
+
+B200 在并发 64 时每 GPU 吞吐量饱和于 4,021 tok/s，此时 16 GPU 池已满载。NVL72 持续吸收并发直到 2,048 甚至更高。其解码池以 8 到 16 个 GPU 的宽 EP 运行在 scale-up 互联上，因此增加用户时 MoE all-to-all 仍保持带宽受限而非延迟受限。
+
+## 等交互性对比
+
+在 B200 峰值吞吐量工作点（并发 64，29.3 tok/s/user，4,021 tok/s/GPU）处，最接近的 GB200 NVL72 数据点为：
+
+| 交互性 (tok/s/user) | GB200 NVL72 tok/s/GPU | 配置                              |
+| ------------------- | --------------------- | --------------------------------- |
+| 36.3                | 12,576                | Decode TP 16, EP 16 at conc 4,096 |
+| 23.2                | 12,587                | Decode TP 8, EP 8 at conc 2,048   |
+
+GB200 NVL72 在 23 到 36 tok/s/user 区间内维持约 12,580 tok/s/GPU 的平台，在接近 B200 峰值的等交互性下给出 3.13 倍的吞吐量比率。
+
+<Figure
+  srcLight="/images/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200/benchmark-light.png"
+  srcDark="/images/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200/benchmark-dark.png"
+  alt="Kimi K2.5 NVFP4 8k/1k Pareto 前沿，GB200 NVL72 Dynamo vLLM vs B200 vLLM，y 轴为 tok/s/GPU，x 轴为 tok/s/user"
+  caption="Kimi K2.5 NVFP4 8k/1k Pareto 前沿。GB200 NVL72 Dynamo vLLM（2026-04-07）vs B200 vLLM（2026-03-27）。两次运行间隔 11 天。"
+/>
+
+[在线图表](https://inferencex.semianalysis.com/inference?g_model=Kimi-K2.5&g_rundate=2026-04-07&g_runid=24100518225)，已预筛选为 4 月 7 日的 Kimi K2.5 数据。
+
+## NVL72 上的 vLLM 宽 EP
+
+vLLM 在 v0.9 中推出了 PPLX all-to-all 后端，随后添加了 DeepEP。v0.11 完成了 V1 引擎迁移，并通过 PR [#24845](https://github.com/vllm-project/vllm/pull/24845) 扩展了双批重叠（DBO，Dual Batch Overlap）路径，添加了 DeepEP 高吞吐量内核以及 DBO 的预填充支持，使 all-to-all 通信可以隐藏在计算之后。上述基准测试运行的是 v0.18.0，未启用投机解码。
+
+GB200 NVL72 配置在 NVIDIA Dynamo 中以 vLLM 作为 worker 运行时，在 InferenceX 数据集中标记为 dynamo-vllm。Dynamo 将预填充（4 GPU、TP 4、EP 4）与解码（8 到 16 GPU，TP 和 EP 均扩展至 16）分离，并通过 NVL72 互联在两者之间路由请求。SGLang 和 TRT-LLM 在 NVL72 上也有类似的分离式 + 宽 EP 路径，其中 SGLang 公开的 GB200 结果目前最为成熟。
+
+## 各 SKU 的优势场景
+
+B200 在 16 GPU 池上以 30 tok/s/user 的交互性提供约 4k tok/s/GPU。TP 4、EP 4 配方在并发 64 附近饱和。超过此点后延迟下限崩溃。
+
+GB200 NVL72 在并发 2,048 到 4,096 范围内以 23 到 36 tok/s/user 的交互性提供约 12.6k tok/s/GPU。已测试的单节点 B200 配方没有可比的工作点。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "在 vLLM 推理 Kimi K2.5 时，GB200 NVL72 比 B200 快多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 Kimi K2.5 NVFP4 8k/1k 序列长度下，GB200 NVL72 运行 Dynamo vLLM 峰值达到 12,587 tok/s/GPU，而最优 B200 vLLM 配方峰值为 4,021 tok/s/GPU。这意味着每 GPU 峰值吞吐量有 3.13 倍的优势。在 B200 峰值吞吐量工作点（29.3 tok/s/user 交互性）下，GB200 NVL72 在 23 到 36 tok/s/user 范围内提供约 12,580 tok/s/GPU，等交互性下吞吐量为 3.13 倍。测量于 InferenceX，B200 数据来自 2026-03-27，GB200 NVL72 数据来自 2026-04-07。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 GB200 NVL72 在 MoE 模型上能扩展比 B200 更宽的专家并行？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "NVL72 是一个 72 GPU 的 NVLink scale-up 域。每个 GPU 可以 1.8 TB/s 的速率访问其他所有 GPU，聚合互联带宽达 130 TB/s。B200 每个 NVLink 域最多 8 GPU。更宽的专家并行必须跨越 InfiniBand，每 GPU 400 Gb/s，约为 NVL72 NVLink 带宽的 1/36。Kimi K2.5 有 384 个路由专家和 60 层 MoE 层，每层 MoE 执行一次 all-to-all 分发加汇聚，单次前向传播约 120 次 all-to-all。该集合通信仅在互联端到端均为 NVLink 时才能在 EP 16 或更高级别实际运行。在 NVL72 上流量保持在 scale-up 域内；B200 超过 8 GPU 后就不再如此。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "在 GB200 NVL72 上运行 Kimi K2.5 宽 EP 需要什么版本的 vLLM 和配方？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "Kimi K2.5 配方需要 vLLM 0.18 或更高版本，该版本依赖 Eagle3 投机解码支持。PPLX all-to-all 后端在 vLLM 0.9 中推出，随后是 DeepEP，v0.11 完成了 V1 引擎迁移并通过 PR #24845 扩展了双批重叠（DBO）路径（添加 DeepEP 高吞吐量内核和 DBO 预填充支持）。InferenceX GB200 基准测试在预填充池上使用 TP 4，在 NVL72 解码池上将 TP 和 EP 扩展至 16。上游 vLLM 配方使用 DP 4 / DP 16 形态，两侧均为 TP 1 加专家并行。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "B200 是否仍是部署 Kimi K2.5 的可行选择？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "是的，对于中等并发的工作负载而言。B200 vLLM 在 Decode TP 4、EP 4 配置下，16 GPU 池上约提供 4,021 tok/s/GPU、29 tok/s/user 的交互性。B200 的 Pareto 前沿在 16 GPU 池上约在并发 64 处达到上限。需要在单个推理域上容纳数千并发用户同时不损失每 GPU 效率的工作负载，正是 GB200 NVL72 拉开差距的场景——约 3 倍——因为 NVL72 的 scale-up 互联是使 384 专家 MoE 上 Decode EP 16 成为可能的关键。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx b/packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx
new file mode 100644
index 00000000..e4b34623
--- /dev/null
+++ b/packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx
@@ -0,0 +1,192 @@
+---
+title: 'GB200 NVL72 对比 B200 运行 DeepSeek R1 671B：在 125 tok/s/user 下每 GPU 吞吐量最高达 4.4 倍'
+subtitle: 'DeepSeek R1 FP4 1k/1k。NVL72 的 72-GPU NVLink 扩展域允许解码使用最高 EP=32 的宽专家并行，而 B200 的 8-GPU NVLink 岛通过 RoCEv2 上限为 EP=8'
+date: '2026-05-23'
+publishDate: '2026-05-23'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - deepseek
+  - nvidia
+  - gb200
+  - b200
+  - nvl72
+  - trtllm
+  - dynamo
+  - wide-ep
+  - disagg
+---
+
+在 DeepSeek R1 0528 FP4 1k/1k 工作负载下，使用 Dynamo TRT-LLM + MTP 并在两款 SKU 上均采用分离式预填充/解码，GB200 NVL72 在等交互性下的**每 GPU 吞吐量最高可达 B200 的 4.39 倍** — 峰值出现在 125 tok/s/user（GB200 NVL72 为 4,130 tok/s/GPU，B200 为 941 tok/s/GPU）。
+
+NVIDIA [GB200 NVL72](https://inferencex.semianalysis.com/gpu-specs) 通过 NVLink 5 连接全部 72 块 GPU，单向带宽 **900 GB/s/GPU**（双向 1.8 TB/s，Jensen 计算法 rx + tx 之和）。[B200](https://inferencex.semianalysis.com/gpu-specs) 服务器仅通过 NVLink 连接 8 块 GPU；当解码 EP 需要超过 8 个 rank 时，all-to-all 通信必须离开 NVLink 岛，转而通过 **ConnectX-7 RoCEv2 以太网，每 GPU 400 Gbit/s**。因此任何超过 8 路 EP 的集合通信可用每 GPU 带宽从 900 GB/s 降至 50 GB/s，降幅 18 倍。DeepSeek R1 的 256 个路由专家在 all-to-all 通信全程通过 NVLink 在 16 或 32 个 rank 间传输时能充分摊薄开销。
+
+<Figure
+  srcLight="/images/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt/gb200-nvl72-rack-light.png"
+  srcDark="/images/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt/gb200-nvl72-rack-dark.png"
+  alt="GB200 NVL72 42U 机柜布局。18 个计算托盘各装 4 块 GPU（共 72 块），机柜中部 9 个非可扩展 NVSwitch5 托盘将 72 块 GPU 编织成一个 NVLink-5 扩展域，4 个 33 kW 电源架，IPMI 管理刀片和接水盘。"
+  caption="GB200 NVL72 机柜布局 — 18 个计算托盘 × 每个 4 块 GPU = 72 块 GPU 组成一个 NVLink-5 扩展域，由 9 个 NVSwitch5 托盘互联。整个机柜使用与 HGX B200 节点内部 GPU 相同的互联架构；B200 多节点分离式部署跨机柜通过 InfiniBand 或 RoCEv2 以太网通信，每 GPU 带宽低 18 倍。"
+/>
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_seq=1k%2F1k&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<Figure
+  srcLight="/images/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt/benchmark-light.png"
+  srcDark="/images/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt/benchmark-dark.png"
+  alt="DeepSeek R1 0528 FP4 1k/1k tok/s/GPU 与交互性关系图。GB200 NVL72（Dynamo TRT，MTP）浅绿色，B200（Dynamo TRT，MTP）深绿色。每个曲线点标注其解码 TP 值。"
+  caption="DeepSeek R1 0528 FP4 1k/1k Pareto 前沿。GB200 NVL72 对比 B200，均使用 Dynamo TRT-LLM + MTP，均采用分离式预填充/解码。数据来自 InferenceX 2026-05-22 测量。标签标注解码 TP。"
+/>
+
+DeepSeek R1 0528 是 DeepSeek 于 2025 年 5 月发布的 671B 参数 MoE 模型 — 采用多头潜在注意力（MLA）进行 KV 缓存压缩，256 个路由专家中每 token 激活 8 个外加 1 个共享专家，共 61 层 transformer。每个 MoE 层在每次前向传播时触发一次路由 all-to-all 分发（dispatch）和一次 all-to-all 汇聚（combine）：大约每 token 120 次 all-to-all 通信。这一通信量级正是 NVLink 级别扩展带宽的用武之地。
+
+## GB200 NVL72 为何在曲线中段胜出
+
+在曲线中段 — 该工作负载下大约 75–175 tok/s/user — 解码变为**网络受限，瓶颈在 EP 分发和汇聚集合通信上**。每个 MoE 层在每 token 触发两次 all-to-all 集合通信：一次**分发**，将每个 token 路由到被分配的 256 个专家中的 8 个（在宽 EP 下通常位于远程 rank 上）；一次**汇聚**，将专家输出收集回每个 token 的主 rank。在 DeepSeek R1 的约 60 个 MoE 层中，每次前向传播大约有 120 次集合通信。
+
+当网络足够快时，运行时可以**将每次分发和汇聚与其所服务的矩阵乘法计算重叠**：发起分发，对已到达的 token 开始专家 GEMM 计算，在剩余字节到达期间大致完成 GEMM，然后发起汇聚。集合通信延迟基本从关键路径上消失，因为 GPU 始终在执行有用的计算。
+
+在 ConnectX-7 RoCEv2 以太网每 GPU 50 GB/s — 比 NVLink 低 18 倍的每 rank 带宽下 — 这种重叠无法实现。同样的集合通信每字节传输时间长达 18 倍，不再能适配 GEMM 时间预算，**暴露为纯粹的通信等待时间**。
+
+## 基准测试数据
+
+所有数据均为 DeepSeek R1 0528 FP4，**ISL 1024 / OSL 1024**，Dynamo TRT-LLM 启用 MTP，两款 SKU 均采用分离式预填充/解码、多节点部署，于 2026-05-22 在 InferenceX 上测量（run 26306422380）。每百万总 token 成本计算方式为 `TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6)`，B200 为 $1.95/GPU/hr，GB200 NVL72 为 $2.21/GPU/hr，数据来源 [SemiAnalysis AI Cloud TCO 模型](https://newsletter.semianalysis.com/p/ai-cloud-economics)。
+
+**GB200 NVL72（Dynamo TRT，MTP），DeepSeek R1 FP4 1k/1k 分离式部署：**
+
+| 并发数 | 预填充        | 解码          | tok/s/GPU    | tok/s/user | TPOT (ms) | $/M tok   |
+| ------ | ------------- | ------------- | ------------ | ---------- | --------- | --------- |
+| 4      | 4 GPU, TP=4   | 32 GPU, EP=8  | 60.7         | 286.40     | 3.49      | $10.12    |
+| 8      | 4 GPU, TP=4   | 32 GPU, EP=8  | 111.8        | 272.64     | 3.67      | $5.49     |
+| 12     | 4 GPU, TP=4   | 32 GPU, EP=8  | 165.2        | 257.11     | 3.89      | $3.72     |
+| 24     | 4 GPU, TP=4   | 32 GPU, EP=8  | 274.8        | 222.28     | 4.50      | $2.23     |
+| 48     | 4 GPU, TP=4   | 32 GPU, EP=8  | 363.3        | 207.30     | 4.82      | $1.69     |
+| 180    | 4 GPU, TP=4   | 32 GPU, EP=32 | 1,149.1      | 164.37     | 6.08      | $0.53     |
+| 2,253  | 12 GPU, TP=12 | 32 GPU, EP=32 | 7,698.0      | 90.99      | 10.99     | $0.08     |
+| 4,301  | 8 GPU, TP=8   | 16 GPU, EP=16 | 12,659.7     | 43.29      | 23.10     | $0.05     |
+| 16,130 | 12 GPU, TP=12 | 20 GPU, EP=4  | **14,659.4** | **17.82**  | **56.11** | **$0.04** |
+
+**B200（Dynamo TRT，MTP），DeepSeek R1 FP4 1k/1k 分离式多节点部署：**
+
+| 并发数 | 预填充        | 解码         | tok/s/GPU    | tok/s/user | TPOT (ms) | $/M tok   |
+| ------ | ------------- | ------------ | ------------ | ---------- | --------- | --------- |
+| 6      | 4 GPU, TP=4   | 40 GPU, EP=8 | 49.3         | 309.17     | 3.23      | $10.99    |
+| 10     | 4 GPU, TP=4   | 40 GPU, EP=8 | 118.7        | 277.39     | 3.61      | $4.56     |
+| 15     | 4 GPU, TP=4   | 40 GPU, EP=8 | 168.9        | 261.09     | 3.83      | $3.21     |
+| 25     | 4 GPU, TP=4   | 40 GPU, EP=8 | 242.4        | 224.59     | 4.45      | $2.23     |
+| 45     | 4 GPU, TP=4   | 40 GPU, EP=8 | 369.9        | 191.18     | 5.23      | $1.46     |
+| 90     | 4 GPU, TP=4   | 40 GPU, EP=8 | 577.3        | 150.56     | 6.64      | $0.94     |
+| 180    | 4 GPU, TP=4   | 40 GPU, EP=8 | 897.9        | 126.42     | 7.91      | $0.60     |
+| 875    | 4 GPU, TP=4   | 40 GPU, EP=8 | 2,832.9      | 101.79     | 9.82      | $0.19     |
+| 1,214  | 4 GPU, TP=4   | 16 GPU, EP=8 | 7,111.4      | 74.04      | 13.51     | $0.08     |
+| 4,968  | 12 GPU, TP=12 | 32 GPU, EP=8 | 9,660.7      | 56.35      | 17.75     | $0.06     |
+| 10,860 | 12 GPU, TP=12 | 20 GPU, EP=4 | **12,515.7** | **21.34**  | **46.86** | **$0.04** |
+
+## 等交互性吞吐量对比
+
+| 交互性 (tok/s/user) | GB200 NVL72 tok/s/GPU | B200 tok/s/GPU | GB200 NVL72 / B200 |
+| ------------------- | --------------------- | -------------- | ------------------ |
+| 25                  | 14,125                | 12,292         | 1.15x              |
+| 45                  | 12,508                | 10,853         | 1.15x              |
+| 60                  | 11,017                | 9,185          | 1.20x              |
+| 75                  | 9,379                 | 6,968          | 1.35x              |
+| 90                  | 7,796                 | 4,512          | 1.73x              |
+| 100                 | 6,781                 | 3,047          | 2.23x              |
+| **125**             | **4,130**             | **941**        | **4.39x**          |
+| 150                 | 1,922                 | 583            | 3.30x              |
+| 175                 | 826                   | 429            | 1.93x              |
+| 200                 | 432                   | 332            | 1.30x              |
+| 225                 | 262                   | 241            | 1.09x              |
+| 250                 | 186                   | 193            | 0.97x              |
+| 275                 | 103                   | 126            | 0.82x              |
+| 300                 | _不可达_              | 67             | _不适用_（B200 胜出） |
+
+以及同一对比按每百万 token 成本归一化的结果，GB200 NVL72 每 GPU 小时 TCO 高 13%（$2.21 vs $1.95）削弱了其吞吐量优势：
+
+| 交互性 (tok/s/user) | GB200 NVL72 $/M tok | B200 $/M tok | B200 / GB200 NVL72 |
+| ------------------- | ------------------- | ------------ | ------------------ |
+| 25                  | $0.0435             | $0.0441      | 1.01x              |
+| 45                  | $0.0491             | $0.0499      | 1.02x              |
+| 60                  | $0.0557             | $0.0590      | 1.06x              |
+| 75                  | $0.0655             | $0.0777      | 1.19x              |
+| 100                 | $0.0905             | $0.1778      | 1.96x              |
+| **125**             | **$0.1486**         | **$0.5755**  | **3.87x**          |
+| 150                 | $0.3194             | $0.9292      | 2.91x              |
+| 175                 | $0.7430             | $1.2638      | 1.70x              |
+| 200                 | $1.4215             | $1.6314      | 1.15x              |
+| 225                 | $2.3450             | $2.2454      | 0.96x              |
+| 250                 | $3.2962             | $2.8067      | 0.85x（B200 胜出） |
+
+4.39 倍的吞吐量峰值（3.87 倍的成本差距）出现在 125 tok/s/user，此时宽 EP 跨 NVLink 互联域发挥最大作用。
+
+<Figure
+  srcLight="/images/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt/benchmark-light.png"
+  srcDark="/images/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt/benchmark-dark.png"
+  alt="DeepSeek R1 0528 FP4 1k/1k tok/s/GPU 与交互性关系图。GB200 NVL72（Dynamo TRT，MTP）浅绿色，B200（Dynamo TRT，MTP）深绿色。每个曲线点标注其解码 TP 值。"
+  caption="DeepSeek R1 0528 FP4 1k/1k Pareto 前沿。GB200 NVL72 对比 B200，均使用 Dynamo TRT-LLM + MTP，均采用分离式预填充/解码。数据来自 InferenceX 2026-05-22 测量。标签标注解码 TP。"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_seq=1k%2F1k&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp)，预筛选为 2026-05-22 测试中 B200 和 GB200 NVL72 Dynamo TRT MTP 在 DeepSeek R1 FP4 1k/1k 上的结果。
+
+## 各 SKU 的优势区间
+
+- **GB200 NVL72 Dynamo TRT** 在 75 至 200 tok/s/user 区间内是最佳选择，此区间内 72-GPU NVLink 互联域支撑的宽 EP 是主导因素。成本差距在 125 tok/s/user 时达到峰值，GB200 NVL72 便宜 3.87 倍 — 聊天式和推理服务在生产级交互性目标下恰好落在此区间。
+
+NVIDIA 的 [SGLang GB200 NVL72 结果](https://lmsys.org/blog/2025-09-25-gb200-part-2/)在 SGLang 软件栈上展现了相同的扩展域优势。AMD 的 MI300/MI355X 在 [2026 年下半年工程样片](https://newsletter.semianalysis.com/p/ai-cloud-economics)之前没有对应的机架级 UALoE72 产品出货，因此目前在 AMD 侧无法进行该工作负载的机架级对比。
+
+## 致谢
+
+感谢 NVIDIA 的 Dynamo 和 TensorRT-LLM 团队 — 包括 Jatin Gangani、Kedar Potdar、Sridhar Ramaswamy、Ishan Dhanani 和 Sahithi Chigurupati — 交付了 B200 多节点 RoCEv2 和 GB200 NVL72 上的分离式部署方案。请查看我们另一篇关于 [GB200 NVL72 对比 B200 运行 Kimi K2.5 的博文](https://inferencex.semianalysis.com/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200)。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_seq=1k%2F1k&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "GB200 NVL72 在 DeepSeek R1 FP4 上使用 Dynamo TRT-LLM + MTP 比 B200 快多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 DeepSeek R1 0528 FP4、1k/1k 序列长度、Dynamo TRT-LLM + MTP、两款 SKU 均采用分离式预填充/解码的条件下，GB200 NVL72 在等交互性下每 GPU 吞吐量最高可达 B200 的 4.39 倍，峰值出现在 125 tok/s/user（仪表板单调三次 Hermite Pareto 插值显示 4,130 vs 941 tok/s/GPU）。在峰值吞吐量端（低于 25 tok/s/user），差距缩小至 1.15 倍，因为两款 SKU 均运行窄 EP=4 加 DP 注意力的相同 TP=32 分离式配置，工作负载受解码显存带宽限制。在 250 tok/s/user 以上曲线交叉，B200 领先约 1.2 倍，因为小批量工作负载可容纳在 8-GPU NVLink 岛内，跨机柜跳转变成了纯开销。数据来自 InferenceX 2026-05-22 测量，run 26306422380。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 GB200 NVL72 在吞吐量-交互性曲线中段优势如此大？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "75–175 tok/s/user 区间是解码变为网络受限的地方，瓶颈在 EP 分发和汇聚集合通信上。每个 MoE 层触发一次 all-to-all 分发和一次 all-to-all 汇聚；在约 60 个 MoE 层中，每 token 大约有 120 次集合通信。在 NVLink 5 单向 900 GB/s/GPU（双向 1.8 TB/s）下，每对分发/汇聚可以适配该层专家 GEMM 的矩阵乘法时间预算，因此运行时将集合通信与计算重叠，延迟从关键路径上消失。在 ConnectX-7 RoCEv2 以太网每 GPU 50 GB/s — 慢 18 倍 — 下，同样的集合通信每字节传输时间长 18 倍，不再能适配 GEMM，暴露为纯延迟且 GPU 空转等待网络。因此 NVL72 可以运行 EP=16 和 EP=32 而不增加 TPOT 成本；B200 多节点无法将跨节点 all-to-all 与计算重叠，只能退回单节点 EP=8（集合通信保持在 NVLink 上，但宽 EP 吞吐量提升大幅缩水）。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "GB200 NVL72 在此对比中每百万 token 成本也低于 B200 吗？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 75–175 tok/s/user 区间内是的，在两端基本持平。GB200 NVL72 每 GPU 小时 TCO 高约 13%（$2.21 vs $1.95，数据来源 SemiAnalysis AI Cloud TCO 模型），这在成本维度上削弱了吞吐量优势。在峰值吞吐量端（25 至 60 tok/s/user）成本基本持平：GB200 NVL72 便宜 1.01 至 1.06 倍。在 125 tok/s/user 时成本差距为 3.87 倍（B200 每百万 token $0.58，GB200 NVL72 $0.15）。在 225 tok/s/user 以上成本差距反转：B200 在 225 tok/s/user 时便宜 5%，在 250 tok/s/user 时便宜 15%，在 275 tok/s/user 时便宜 27%。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "在 DeepSeek R1 FP4 上 B200 在哪些场景仍优于 GB200 NVL72？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "大约在 250 tok/s/user 以上。在该交互性下工作负载以极小批量运行（B200 方案中 40-GPU 解码池上的 4 至 25 并发用户），每 token 解码工作量足够小，all-to-all 带宽不再是瓶颈，工作负载可舒适地容纳在 8-GPU NVLink 岛内。B200 省去了跨机柜跳转，在 275 tok/s/user 时领先约 1.2 倍。该数据集中 NVL72 没有高于 286 tok/s/user 的方案，因此该点以上只有 B200 可达。极低批量区间也是机架级 NVL72 优势在结构上最小的区间，因为在途 token 太少，宽 NVLink 带宽无用武之地。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "这些结果是否适用于其他模型，还是仅限于 DeepSeek R1？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "该模式适用于所有具有大量路由专家和高每层 all-to-all 通信量的稀疏 MoE 模型。我们此前关于 Kimi K2.5 NVFP4 8k/1k 的博文显示 GB200 NVL72 Dynamo vLLM 在峰值吞吐量下达到 12,587 tok/s/GPU，B200 为 4,021 tok/s/GPU，宽 EP 最高 EP=16 在 NVL72 互联域上带来 3.13 倍优势。本文的 DeepSeek R1 是不同的模型（671B vs 1T 参数，256 vs 384 个专家，MLA vs K2.5 的注意力机制），但每 MoE 层的 all-to-all 行为相似，因此整体格局重复：峰值吞吐量基本持平，曲线中段因 NVLink 上的宽 EP 产生大差距，高交互性端因批量可容纳在单个 NVLink 岛内而反转。稠密模型或路由专家较少的 MoE 模型将呈现更小的差距，因为 all-to-all 集合通信占每步成本的比例更小。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx b/packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx
new file mode 100644
index 00000000..dfff75bd
--- /dev/null
+++ b/packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx
@@ -0,0 +1,181 @@
+---
+title: 'GB300 NVL72 vs GB200 NVL72 推理性能与性价比对比 — DeepSeek-V4-Pro 1.6T：吞吐量最高提升 2.83 倍'
+subtitle: 'DSv4-Pro FP4 8K/1K，Dynamo+vLLM，两套机架均采用分离式部署。GB300 多出 50% 的 HBM（每 GPU 288 GB vs 192 GB）解锁了 GB200 无法容纳的更宽预填充+解码配方——尽管单 GPU TCO 溢价 20%，曲线中段性价比仍提升 2.31 倍。'
+date: '2026-05-27'
+publishDate: '2026-05-27'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - deepseek
+  - nvidia
+  - gb300
+  - gb200
+  - nvl72
+  - vllm
+  - dynamo
+  - wide-ep
+  - disagg
+---
+
+在 DeepSeek-V4-Pro FP4、8K/1K 输入输出长度、Dynamo vLLM 框架以及两套机架均启用分离式预填充/解码的条件下，GB300 NVL72 在等交互性下**每 GPU 吞吐量最高达 GB200 NVL72 的 2.83 倍**，峰值出现在 27 tok/s/user（GB300 为 6,182 tok/s/GPU，GB200 为 2,189 tok/s/GPU）。纸面上两者的硅片差异看似不大：显存带宽、NVLink 互联和 scale-up 规模完全相同，只有 HBM 容量和 FP4 算力各为 GB200 的 1.5 倍。但曲线中段的差距远超任何静态比率，因为 GB300 额外的 HBM 消除了 GB200 必须为之付出代价的一个软件约束。
+
+其机制在于 **HBM 余量**。DSv4-Pro 1.6T 参数量下，仅 FP4 权重就约占 800 GB，GB200 在窄预填充形态下可用 HBM 相当紧张，配方不得不在批大小上做出妥协以将模型装入显存。GB300 的 1.5 倍 HBM 容量（每 GPU 288 GB vs 192 GB）在相同形态下仍有数百 GB 的余量，使得预填充可以运行更大的批次来保持更宽解码池的饱和。在每 GPU TCO 溢价 20%（$2.65 vs $2.21/GPU/hr，数据来自 [SemiAnalysis AI Cloud TCO 模型](https://semianalysis.com/ai-cloud-tco-model/)）之后，GB300 在 27 tok/s/user 下**每百万 token 的成本仍便宜 2.31 倍**。更多 HBM，更多节省。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_active=gb200_dynamo-vllm%2Cgb300_dynamo-vllm&g_model=DeepSeek-V4-Pro&i_linelabel=1">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<Figure
+  srcLight="/images/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4/benchmark-light.png"
+  srcDark="/images/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4/benchmark-dark.png"
+  alt="DeepSeek-V4-Pro 1.6T FP4 8K/1K tok/s/GPU vs 交互性。GB300 NVL72（Dynamo vLLM）浅绿色，GB200 NVL72（Dynamo vLLM）深绿色。每个曲线点标注 TP 值。GB300 在 13–18 tok/s/user 交互性范围内保持约 10k tok/s/GPU；GB200 在 15–18 tok/s/user 范围内保持约 8.5k；两者在中段均有衰减；GB300 在全重叠区间内维持更高的每 GPU 吞吐量。"
+  caption="DeepSeek-V4-Pro 1.6T FP4 8K/1K Pareto 前沿。GB300 NVL72 vs GB200 NVL72，均使用 Dynamo vLLM，均采用分离式预填充/解码。在 InferenceX 上测量，日期为 2026-05-22（运行编号 26306422380）。点标签表示总 TP。"
+/>
+
+## DeepSeek-V4-Pro 模型架构
+
+DeepSeek-V4-Pro 是 DeepSeek 的旗舰 MoE 模型：**总参数量 1.6T，每 token 激活 49B**（来自 [DeepSeek V4 预览公告](https://api-docs.deepseek.com/news/news260424)）。该架构将 **token 级压缩**与 **DSA（DeepSeek 稀疏注意力）** 结合——这是 DeepSeek 在 V3.2 中引入的稀疏注意力模式，并扩展到更长的上下文（官方服务默认以 1M 上下文运行 DSv4）。开源权重检查点为 [`deepseek-ai/DeepSeek-V4-Pro`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)。
+
+## 纸面规格对比
+
+GB300 NVL72（Blackwell Ultra）和 GB200 NVL72（Blackwell）共享相同的 NVLink 5 scale-up 互联、相同的 72 GPU 规模、相同一代 NVSwitch 以及相同的每 GPU 8 TB/s HBM 带宽。差异在于 HBM 容量和 dense FP4 算力。数值直接取自 [/gpu-specs](/gpu-specs)：
+
+<Figure
+  srcLight="/images/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4/specs-radar-light.png"
+  srcDark="/images/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4/specs-radar-dark.png"
+  alt="GPU 规格雷达图，对比 GB200 NVL72（浅绿色）和 GB300 NVL72（深绿色）在 7 个维度上的表现：显存、显存带宽、FP4/FP8/BF16 TFLOP/s、Scale Up 带宽、Scale Up 域显存、Scale Up 域显存带宽。GB300 在显存和 FP4 两项上为 100%（设定天花板）；GB200 在这两项上约 67%，因为 GB300 的数值是其 1.5 倍。两者在显存带宽、Scale Up 带宽、Scale Up 域显存带宽、FP8、BF16 上持平。"
+  caption="GB200 NVL72（浅绿色）vs GB300 NVL72（深绿色），来自 /gpu-specs。各轴数值按面板中所有 SKU 的跨供应商最大值归一化。唯一的显著差异在于显存（GB300 设定 288 GB 天花板，GB200 约为 67%）和 FP4（GB300 设定 15 PFLOP/s 天花板，GB200 约为 67%）。其余——HBM 带宽、NVLink scale-up 带宽、规模、FP8、BF16——均相同。"
+/>
+
+| 规格                              | GB200 NVL72         | GB300 NVL72         | GB300 / GB200 |
+| --------------------------------- | ------------------- | ------------------- | ------------- |
+| HBM 容量                          | 192 GB              | 288 GB              | **1.50x**     |
+| HBM 带宽                          | 8 TB/s              | 8 TB/s              | 1.00x         |
+| Dense FP4 (TFLOP/s)               | 10,000              | 15,000              | **1.50x**     |
+| Dense FP8 (TFLOP/s)               | 5,000               | 5,000               | 1.00x         |
+| Dense BF16 (TFLOP/s)              | 2,500               | 2,500               | 1.00x         |
+| 每 GPU Scale-up 带宽（单向）      | 900 GB/s (NVLink 5) | 900 GB/s (NVLink 5) | 1.00x         |
+| Scale-up 规模                     | 72                  | 72                  | 1.00x         |
+| Scale-up 域 HBM 容量              | 13.5 TB             | 20.25 TB            | **1.50x**     |
+| Scale-up 域 HBM 带宽（聚合）      | 576 TB/s            | 576 TB/s            | 1.00x         |
+| TCO（SemiAnalysis AI Cloud 模型） | $2.21/GPU/hr        | $2.65/GPU/hr        | 1.20x         |
+
+如果解码纯粹受 HBM 容量限制，预填充纯粹受 FP4 算力限制，则纸面性价比上限在任一瓶颈上均为 `1.50 / 1.20 = 1.25x`（HBM 带宽两者相同，不贡献任何纸面增益）。实测 **2.31 倍的性价比峰值是该上限的 1.85 倍**，这正是本文的核心要点。提升来自一个 **硅片比率低估系统增益的区间**：HBM 容量是一个离散的解锁条件（决定哪种配方能装下），而非连续旋钮；一种配方在一套机架上能跑而另一套跑不了所带来的倍数增益，不会出现在任何规格表上。
+
+## 分离式部署 + 宽 EP 实际带来了什么
+
+稀疏 MoE 的推理有两个资源特征截然相反的阶段。**预填充**受算力限制：请求中的每个 token 都并行通过整个模型处理，因此 DSv4-Pro 的 384 个路由专家在每个提示的每一层都被全部激活。**解码**受显存带宽限制：每个生成 token 每层仅激活 384 个路由专家中的 6 个（加 1 个共享专家），每步开销主要取决于从 HBM 流式读取被路由到的专家权重。在相同 GPU 上同时运行两者，预填充的突发流量会不断干扰解码的稳态运行，最终两者都无法充分利用。
+
+**分离式部署**将两者拆分到独立调优的 GPU 池中。预填充实例以足够宽的配置运行，以摊销全专家激活的计算步骤；解码实例以最佳的 (TP, EP, DP) 形态运行，以在稳态负载下获得最大的每步 token 数。两个池通过 NVLink 互联通信（预填充 → 解码的 KV 传输），且可独立扩展。
+
+**宽专家并行（EP）** 则将解码侧的路由专家分片到多个 rank 上。在 EP=4 时，每个 GPU 持有 DSv4-Pro 384 个路由专家中的 96 个，所有这些都必须常驻 HBM 并随时准备响应路由到它们的 token。在 EP=8 时每 GPU 持有 48 个。在 EP=16 时每 GPU 持有 24 个——每 rank 的路由专家权重占用近似线性缩减，余下的 HBM 用于 KV 缓存和激活值。分片越宽，每个 GPU 的 HBM 带宽在服务路由到其专家的请求时分摊越均匀，每 GPU 解码效率也就越高。EP 组中每增加一个 rank 都在为其他所有 rank 做有用功——*这*就是"买得越多，省得越多"的杠杆，应用的不是批量硬件折扣，而是实际的硅片利用率。
+
+代价在于宽 EP 需要在每个 MoE 层的专家 GEMM 前后分别执行一次路由式 **all-to-all 分发**和 **all-to-all 汇聚**。在 DSv4-Pro 的 MoE 层中，这意味着每个 token 需要数百次集合通信。它们必须在 GEMM 计算背后重叠，否则就暴露为裸延迟。在 NVLink 5 上（每 GPU 单向 900 GB/s，双向 1.8 TB/s），该分发在 EP=8 到 EP=16 的中等批解码中可以嵌入 GEMM 时间预算内，运行时将其隐藏。而在 scale-out 侧（ConnectX-7 RoCEv2 Ethernet 或 InfiniBand，每 GPU 单向 50 GB/s，**慢 18 倍**），相同的集合通信需要 18 倍时间并暴露为延迟——这就是为什么宽 EP 需要机架级 NVLink 域，也是为什么无论谁先出货，GB200 NVL72 和 GB300 NVL72 在该负载上都优于任何 8-GPU HGX 节点。
+
+## 测试数据
+
+所有行均为 DeepSeek-V4-Pro FP4、**ISL 8192 / OSL 1024**、NVL72、Dynamo vLLM、分离式预填充/解码、无投机解码，在 InferenceX 上于 2026-05-22 测量（[GHA 运行编号 26306422380](https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26306422380)）。每百万总 token 成本按 `TCO_$/GPU/hr × 1e6 / (3600 × tput_per_gpu)` 计算，GB200 NVL72 为 $2.21/GPU/hr，GB300 NVL72 为 $2.65/GPU/hr，来自 [SemiAnalysis AI Cloud TCO 模型](https://semianalysis.com/ai-cloud-tco-model/)。
+
+**GB200 NVL72 (Dynamo vLLM)，DSv4-Pro FP4 8K/1K 分离式：**
+
+| Conc | Prefill      | Decode       | tok/s/GPU   | tok/s/user | TPOT (ms) | $/M tok   |
+| ---- | ------------ | ------------ | ----------- | ---------- | --------- | --------- |
+| 1    | 8 GPU, TP=8  | 8 GPU, EP=1  | 32.8        | 74.13      | 13.26     | $18.72    |
+| 256  | 8 GPU, TP=8  | 32 GPU, EP=1 | 1,613.8     | 32.69      | 30.83     | $0.38     |
+| 512  | 8 GPU, TP=8  | 32 GPU, EP=1 | 2,004.5     | 28.31      | 35.46     | $0.31     |
+| 256  | 8 GPU, TP=8  | 8 GPU, EP=8  | 3,148.0     | 24.42      | 41.23     | $0.20     |
+| 512  | 8 GPU, TP=8  | 8 GPU, EP=8  | 5,336.2     | 21.26      | 47.43     | $0.12     |
+| 1024 | 8 GPU, TP=8  | 8 GPU, EP=8  | 6,036.2     | 21.60      | 46.42     | $0.10     |
+| 4096 | 16 GPU, TP=8 | 8 GPU, EP=8  | 8,153.1     | 18.51      | 54.34     | $0.08     |
+| 4096 | 24 GPU, TP=8 | 8 GPU, EP=8  | **8,933.0** | **15.26**  | **66.26** | **$0.07** |
+
+**GB300 NVL72 (Dynamo vLLM)，DSv4-Pro FP4 8K/1K 分离式：**
+
+| Conc | Prefill      | Decode        | tok/s/GPU    | tok/s/user | TPOT (ms) | $/M tok   |
+| ---- | ------------ | ------------- | ------------ | ---------- | --------- | --------- |
+| 18   | 4 GPU, TP=4  | 68 GPU, EP=1  | 138.8        | 73.43      | 13.58     | $5.31     |
+| 192  | 4 GPU, TP=4  | 24 GPU, EP=1  | 1,920.0      | 36.78      | 27.44     | $0.38     |
+| 3072 | 28 GPU, TP=8 | 32 GPU, EP=16 | 6,812.0      | 25.91      | 38.77     | $0.11     |
+| 4096 | 16 GPU, TP=8 | 8 GPU, EP=8   | 10,214.0     | 17.58      | 57.12     | $0.07     |
+| 4096 | 20 GPU, TP=8 | 8 GPU, EP=8   | 10,853.1     | 14.74      | 69.17     | $0.07     |
+| 4096 | 24 GPU, TP=8 | 8 GPU, EP=8   | **11,055.6** | **13.12**  | **77.83** | **$0.07** |
+
+GB200 的每 GPU 峰值吞吐量为 8,933 tok/s/GPU，对应交互性 15.3 tok/s/user。GB300 的峰值为 11,056 tok/s/GPU，对应交互性 13.1 tok/s/user：**在更低的交互性下限处，每 GPU 吞吐量为 GB200 的 1.24 倍**，计入软件开销后接近 1.5 倍的硅片比率。峰值处的单位成本性价比基本持平（$0.069 vs $0.067），因为 GB300 的 20% TCO 溢价吃掉了 1.24 倍吞吐提升的大部分。标题中的倍数差异不出现在峰值，而出现在曲线中段，GB300 的 HBM 余量在那里带来了 GB200 不具备的配方。
+
+## 等交互性对比
+
+在匹配交互性下的每 GPU 吞吐量和每百万 token 成本，沿各 SKU 的 Pareto 前沿插值。超出前沿实测范围的单元格标记为 `_unreachable_`。
+
+| 交互性 (tok/s/user) | GB200 tok/s/GPU | GB300 tok/s/GPU | GB300 / GB200 | GB200 $/M tok | GB300 $/M tok | GB200 / GB300 |
+| ------------------- | --------------- | --------------- | ------------- | ------------- | ------------- | ------------- |
+| 16                  | 8,835           | 10,608          | 1.20x         | $0.07         | $0.07         | 1.00x         |
+| 18                  | 8,366           | 10,094          | 1.21x         | $0.07         | $0.07         | 1.01x         |
+| 20                  | 7,283           | 9,401           | 1.29x         | $0.08         | $0.08         | 1.07x         |
+| 22                  | 5,650           | 8,562           | 1.52x         | $0.11         | $0.08         | 1.31x         |
+| 25                  | 2,846           | 7,208           | 2.53x         | $0.21         | $0.10         | 2.11x         |
+| **27**              | **2,189**       | **6,182**       | **2.83x**     | **$0.28**     | **$0.12**     | **2.31x**     |
+| 28                  | 2,058           | 5,789           | 2.81x         | $0.30         | $0.13         | 2.30x         |
+| 32                  | 1,661           | 3,570           | 2.15x         | $0.36         | $0.21         | 1.76x         |
+| 36                  | 1,376           | 2,036           | 1.48x         | $0.65         | $0.35         | 1.88x         |
+| 50                  | 649             | 941             | 1.45x         | $4.78         | $1.58         | 3.03x         |
+
+标题中 **2.83 倍每 GPU 吞吐量峰值出现在 27 tok/s/user（性价比 2.31 倍），位于曲线中段**而非峰值吞吐处。在 20 tok/s/user 以下，两套机架都运行足够宽的预填充批次，HBM 余量优势被抹平；在 36 tok/s/user 以上，两者都运行窄批次，没有哪套机架拥有宽 EP 能充分利用的配方。22–32 tok/s/user 区间是 GB300 的 1.5 倍 HBM 容量让其停留在一个更高 Pareto 节点上的地方（`conc=3072, 28 GPU 预填充, 32 GPU 解码 EP=16, 6,812 tok/s/GPU at 25.9 tok/s/user`），而 GB200 在同等交互性下没有等效配方——其最接近的配方是在 32-GPU 解码池上 conc=256 / 512，仅能提供 1,614–2,005 tok/s/GPU。
+
+50 tok/s/user 行显示成本比率（3.03x）再次扩大，因为两条曲线都进入了右侧的陡峭衰减区。这里的解读需要更谨慎：两套机架在该区域的 Pareto 覆盖都很薄（GB200 在约 33 tok/s/user 处有一个节点，GB300 在约 37 tok/s/user 处有一个节点，之后两者都直接跳到约 73–74 tok/s/user 的长尾），因此插值是在两个间隔较大的实测节点之间读取差距。22–32 tok/s/user 区间才是 GB300 优势的可靠甜蜜点；50 tok/s/user 行仅应视为方向性参考。
+
+<Figure
+  srcLight="/images/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4/benchmark-light.png"
+  srcDark="/images/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4/benchmark-dark.png"
+  alt="DeepSeek-V4-Pro 1.6T FP4 8K/1K tok/s/GPU vs 交互性。GB300 NVL72（Dynamo vLLM）浅绿色，GB200 NVL72（Dynamo vLLM）深绿色。每个曲线点标注 TP 值。GB300 在 13–18 tok/s/user 交互性范围内保持约 10k tok/s/GPU；GB200 在 15–18 tok/s/user 范围内保持约 8.5k；两者在中段均有衰减；GB300 在全重叠区间内维持更高的每 GPU 吞吐量。"
+  caption="DeepSeek-V4-Pro 1.6T FP4 8K/1K Pareto 前沿。GB300 NVL72 vs GB200 NVL72，均使用 Dynamo vLLM，均采用分离式预填充/解码。在 InferenceX 上测量，日期为 2026-05-22（运行编号 26306422380）。点标签表示总 TP。"
+/>
+
+[在线图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_active=gb200_dynamo-vllm%2Cgb300_dynamo-vllm&g_model=DeepSeek-V4-Pro&i_linelabel=1)，已预筛选为 2026-05-22 运行中 GB200 NVL72 和 GB300 NVL72 Dynamo vLLM 的 DSv4-Pro FP4 8K/1K 数据。
+
+## 致谢
+
+感谢 NVIDIA 的 Dynamo 团队（包括 Jatin Gangani、Kedar Potdar、Sridhar Ramaswamy、Ishan Dhanani 和 Sahithi Chigurupati）以及 vLLM 团队，是他们将 GB200 和 GB300 的 DSv4-Pro 配方交付落地，使得机架间对比成为可能。配套文章：[GB200 NVL72 vs B200 DeepSeek R1 对比](https://inferencex.semianalysis.com/blog/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt)，覆盖了 SKU 梯队下一级的 scale-up 互联优势。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_active=gb200_dynamo-vllm%2Cgb300_dynamo-vllm&g_model=DeepSeek-V4-Pro&i_linelabel=1">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "使用 Dynamo vLLM 时，GB300 NVL72 比 GB200 NVL72 在 DeepSeek-V4-Pro 上快多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 DSv4-Pro FP4、8K/1K 输入输出长度、Dynamo vLLM 以及两套机架均启用分离式预填充/解码的条件下，GB300 NVL72 在等交互性下每 GPU 吞吐量最高达 GB200 NVL72 的 2.83 倍，峰值出现在 27 tok/s/user（仪表板 Pareto 插值为 6,182 vs 2,189 tok/s/GPU）。在计入 GB300 20% 的 TCO 溢价（$2.65 vs $2.21/GPU/hr）后，同等交互性下每百万 token 的峰值成本优势为 2.31 倍。在 20 tok/s/user 以下，两套机架都运行宽预填充批次，差距缩小至约 1.2 倍；在 36 tok/s/user 以上，两者都运行窄批次，宽 EP 无法充分利用。测量于 InferenceX，GHA 运行编号 26306422380，2026-05-22。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 GB300 NVL72 在吞吐量-交互性曲线中段胜出，尽管 NVLink 带宽相同？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "差距在于 HBM 容量而非 NVLink。DSv4-Pro 1.6T 参数量，约 800 GB FP4 权重。GB300 的 1.5 倍 HBM 容量（每 GPU 288 vs 192 GB）让预填充侧在相同 TP 形态下有足够余量运行更宽的批次，从而保持更宽解码池的饱和并提升每 GPU 吞吐量。在 22–32 tok/s/user 区间，GB300 停留在一个 GB200 没有等效配方的 Pareto 节点上（conc=3072, 28 GPU 预填充, 32 GPU 解码 EP=16），在同等交互性下 GB200 最接近的配方仅能提供 1,614–2,005 tok/s/GPU，而 GB300 在 25.9 tok/s/user 时为 6,812。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "在此对比中，GB300 NVL72 的每百万 token 成本是否低于 GB200 NVL72？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 22–32 tok/s/user 区间是的，在峰值吞吐量处基本持平。GB300 的 TCO 每 GPU 每小时高约 20%（$2.65 vs $2.21，来自 SemiAnalysis AI Cloud TCO 模型），这在成本维度上稀释了吞吐量优势。在峰值吞吐量附近（16–20 tok/s/user），成本基本持平：GB300 领先 1.00x 到 1.07x。在 27 tok/s/user 时，成本差距为 2.31 倍（GB200 $0.28/百万 token vs GB300 $0.12）。在 50 tok/s/user 时，插值成本差距扩大至 3.03 倍，但两侧的 Pareto 覆盖都很薄，因此极高交互性行应视为方向性参考而非定量结论。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "GB300 的优势是否可推广到其他工作负载和其他模型？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "其机制（HBM 余量驱动的宽 EP 配方解锁）可推广到任何足够大的稀疏 MoE 模型——只要其 FP4 权重占用在窄预填充形态下接近单 GPU HBM 容量。DeepSeek-R1 0528 的 671B / 37B 激活参数量，权重仅为 DSv4-Pro 的一半，在 GB200 上更容易容纳，因此 GB300 在 R1 上的提升更小。Kimi K2.5 / K2.6 的 1T / 32B 激活参数量介于两者之间。权重占用小得多的模型（Qwen3.5、GLM-5、MiniMax-M2.5、gpt-oss-120b）不会出现此差距，因为模型在两套机架上都能轻松容纳——对于这些模型，相关杠杆是 dense FP4 算力（GB300 高 1.5 倍），差距更接近硅片比率。输入长度超过 8K 的工作负载会放大 GB300 的优势，因为 KV 缓存随序列长度线性增长，而 1.5 倍 HBM 余量直接转化为同一 TP 形态下 1.5 倍的在途 token 数——但 InferenceX 尚未有 1M 上下文 DSv4-Pro 的配方。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/inferencemax-open-source-inference-benchmarking.mdx b/packages/app/content/blog/zh/inferencemax-open-source-inference-benchmarking.mdx
new file mode 100644
index 00000000..ecad0f7f
--- /dev/null
+++ b/packages/app/content/blog/zh/inferencemax-open-source-inference-benchmarking.mdx
@@ -0,0 +1,671 @@
+---
+title: 'InferenceMAX：开源推理基准测试'
+subtitle: 'NVIDIA GB200 NVL72、AMD MI355X、每 GPU 吞吐量 Token、延迟 Tok/s/user、性价比、每百万 Token 成本、每配置兆瓦 Token 数、DeepSeek R1 670B、GPTOSS 120B、Llama3 70B'
+date: '2025-10-09'
+publishDate: '2025-10-09'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - announcement
+---
+
+LLM 推理性能取决于两大支柱：硬件和软件。硬件创新通过每年发布新型 GPU/XPU 和新系统来驱动性能的阶跃式提升，而软件则每天都在演进，在这些硬件跃升之上持续带来性能增益。
+
+SGLang、vLLM、TensorRT-LLM、CUDA 和 ROCm 等 AI 软件通过内核级优化、分布式推理策略和调度创新实现持续的性能改进，以仅相隔数天的增量更新不断推高性能的帕累托前沿。
+
+这种软件进步的速度带来了一个挑战：在某个固定时间点进行的基准测试很快就会过时，无法反映使用最新软件包所能达到的真实性能。
+
+InferenceMAX 是[一个开源的自动化基准测试项目](https://github.com/InferenceMAX/InferenceMAX)，旨在与软件生态系统本身同步快速迭代，专为应对这一挑战而设计。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/4be54ad9-692e-4500-948c-38beb5018814_1734x922.png"
+  caption="来源：SemiAnalysis InferenceMAX GitHub 仓库"
+/>
+
+InferenceMAX 每夜在数百块芯片上运行我们的完整基准测试套件，持续对全球最流行的开源推理框架和模型进行重新测试，以实时追踪真实性能。随着这些软件栈的不断改进，InferenceMAX 近乎实时地捕捉这些进展，提供推理性能进步的实时指标。免费的公开实时仪表板可在 [https://inferencemax.ai/](https://inferencemax.ai/) 访问。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/fbf9bb59-ada3-44b0-8cba-c6ee11097b3f_1839x1344.png"
+  caption="来源：SemiAnalysis"
+/>
+
+AMD 和 Nvidia GPU 都能在不同的工作负载下提供有竞争力的性能，AMD 在某些类型的工作负载上表现更好，而 Nvidia 在其他类型上表现更优。的确，两个生态系统都在快速发展！
+
+在分析 InferenceMAX 的结果时有许多细微差别和需要考量的因素，这在很大程度上是因为它被设计为一个中立的基准测试，不会为了推广任何特定供应商或解决方案而进行数据筛选。因此，存在一些模型和交互性（tok/s/user）水平，在这些场景下 AMD 目前表现优于同代 Nvidia GPU；同样也存在 Nvidia 目前表现更好的交互性水平。InferenceMAX 的目标简单而宏大——提供既尽可能模拟真实应用场景、又反映软件创新持续步伐的基准测试。
+
+在 InferenceMAX v1 的首次发布中，我们对 GB200 NVL72、B200、MI355X、H200、MI325X、H100 和 MI300X 进行了基准测试。在接下来的两个月内，我们将扩展 InferenceMAX 以纳入 Google TPU 和 AWS Trainium 后端，使其成为首个真正跨 AMD、NVIDIA 和定制加速器的多供应商开放基准测试项目。
+
+InferenceMAX v1 远非完美，但我们相信它是朝着正确方向迈出的良好第一步。在未来的版本中，还有优化工作负载、扩展模型覆盖范围以及更好地反映真实工作负载的空间。
+
+## 致谢
+
+感谢 Lisa Su 和 Anush Elangovan 为这个免费开源项目提供了 MI355X 和 CDNA3 GPU。我们要感谢 Anush、Quentin Colombet 以及数十位 AMD 贡献者对各类问题的快速响应，以及他们在 AMD GPU 的调试、优化和性能验证方面给予的帮助。每当我们遇到 ROCm 问题（我们注意到，这些问题出现的频率远低于 2024 年底！），他们都会立即介入帮助找到临时修复方案以消除阻塞，随后将永久补丁合入 ROCm 以确保长期稳定性。Quentin 及其团队展现了 [AMD 2.0 的紧迫感](https://semianalysis.com/2025/04/23/amd-2-0-new-sense-of-urgency-mi450x-chance-to-beat-nvidia-nvidias-new-moat/)，这种态度[受到 xAI 等众多客户的高度赞赏](https://www.youtube.com/live/5dmFa9iXPWI?si=5HHNsDd7bw3lDASk&t=1073)。
+
+我们也感谢 Jensen Huang 和 Ian Buck 通过提供 GB200 NVL72 机架（通过 OCI）和 B200 GPU 的访问来支持这一开源项目。感谢 Kedar Pandurang Potdar、Sridhar Ramaswamy、Kyle Kranen、ptrblck、NVIDIA 推理团队、NVIDIA Dynamo 团队、NCCL 团队以及 Nvidia 固件/驱动团队在验证和优化 Blackwell 和 Hopper 配置以及快速修复 Bug 方面给予的帮助。
+
+我们还要感谢 SGLang、vLLM 和 TensorRT-LLM 的维护者们构建了世界一流的软件栈并将其开源给全世界。此外，我们要感谢 Simon Mo、Kaichao You、Michael Goin 和 Robert Shaw，他们在解决几个关键的 Blackwell Bug 方面提供了不可或缺的帮助。
+
+最后，我们感谢 Crusoe、CoreWeave、Nebius、TensorWave、Oracle 和 TogetherAI 通过计算资源支持开源创新，使这个项目得以实现，同时也感谢更广泛的社区推动推理基准测试的发展。
+
+## 我们正在招聘
+
+我们正在寻找一名工程师加入我们的特别项目团队。这是一个独特的机会，可以参与如 InferenceMAX 这样高曝光度的特别项目，并获得众多行业领袖和 CEO 的支持。如果你对性能工程、系统可靠性充满热情，并希望在硬件和软件的交叉领域工作，这是一个产生全行业影响的难得机会。
+
+**你将参与的工作：**
+
+- 跨多供应商（AMD、NVIDIA、TPU、Trainium 等）构建和运行大规模基准测试
+- 设计可复现的 CI/CD 流水线以自动化基准测试工作流
+- 确保行业合作伙伴使用的系统的可靠性和可扩展性
+
+**我们期望的候选人特质：**
+
+- 扎实的 Python 技能
+- 站点可靠性工程（SRE）或系统级问题解决的背景
+- CI/CD 流水线和现代 DevOps 实践经验
+- 对 GPU、TPU、Trainium、多云和性能基准测试的好奇心
+
+申请链接：[https://app.dover.com/apply/SemiAnalysis/2a9c8da5-6d59-4ac8-8302-3877345dbce1](https://app.dover.com/apply/SemiAnalysis/2a9c8da5-6d59-4ac8-8302-3877345dbce1)
+
+## InferenceMAX 项目支持者
+
+InferenceMAX 项目得到了许多大型计算资源买家和 ML 社区知名成员的支持，包括来自 OpenAI、Microsoft、PyTorch Foundation 等组织的人士：
+
+> _"随着我们以前所未有的规模构建系统，ML 社区拥有开放、透明的基准测试来反映推理在各种硬件和软件上的真实表现至关重要。InferenceMAX 的直接对比基准测试消除了噪声，提供了关于 token 吞吐量、性价比和每兆瓦 token 数的实时全景。这种开源努力增强了整个生态系统的实力，帮助从研究人员到前沿数据中心运营者在内的每一个人做出更明智的决策。"_
+>
+> -- Peter Hoeschele，OpenAI Stargate 基础设施与工业计算副总裁
+
+> _"开放协作正在推动 AI 创新的下一个时代。开源的 InferenceMAX 基准测试为社区提供了透明的每夜测试结果，激发信任并加速进步。它凸显了我们 AMD Instinct MI300、MI325X 和 MI355X GPU 在多样化工作负载下具有竞争力的 TCO 性能，彰显了我们平台的优势以及我们致力于让开发者实时了解软件进展的承诺。"_
+>
+> -- Lisa Su 博士，AMD 董事长兼 CEO
+
+> _"推理需求正在呈指数级增长，这是由长上下文推理所驱动的。NVIDIA Grace Blackwell NVL72 正是为这个思考型 AI 的新时代而发明的。NVIDIA 通过持续的硬件和软件创新来满足这一需求，为 AI 的下一步发展赋能。通过频繁的基准测试，InferenceMAX 为行业提供了 LLM 推理性能在真实工作负载上的透明视图。结果很明确：搭载 TRT-LLM 和 Dynamo 的 Grace Blackwell NVL72 提供了无与伦比的性价比和能效比——驱动着世界上最高效、最具成本效益的 AI 工厂。"_
+>
+> -- Jensen Huang，NVIDIA 创始人兼 CEO
+
+> _"速度就是护城河。InferenceMAX 的每夜基准测试与 AMD 软件栈的改进速度保持同步。看到 AMD 的 MI300、MI325 和 MI355 GPU 在多样化工作负载和交互性水平下表现如此出色，令人振奋。"_
+>
+> -- Anush Elangovan，AMD GPU 软件副总裁
+
+> _"InferenceMAX 凸显了 ML 社区关注的工作负载。在 NVIDIA，我们欢迎这些对比，因为它们证实了我们全栈方案的优势——从 GPU 硬件到 NVLink 网络，到 NVL72 机架规模，再到 Dynamo 分离式推理服务，始终在规模化场景下提供行业领先的推理性能和投资回报率。"_
+>
+> -- Ian Buck，NVIDIA 超大规模部门副总裁兼总经理、CUDA 发明者
+
+> _"InferenceMAX 的每夜测试结果凸显了 AMD 软件栈的快速进步。亲眼见证一个开放项目的诞生令人兴奋——它在 AMD 软件团队的工作与特定 ML 用例在我们 MI300、MI325 和 MI355 GPU 上的实际影响之间建立了紧密的反馈循环。我期待看到 InferenceMAX 的后续发展，并展示 AMD 平台的全部潜力。AMD GPU 将继续每周变得更快。"_
+>
+> -- Quentin Colombet，AMD 高级总监，前 Brium CEO
+
+> _"我们在 Azure 的使命是为客户提供最高性能、最高效、最具成本效益的 AI 云。SemiAnalysis InferenceMAX 通过提供透明、可复现的基准测试来支持这一使命，这些测试在真实工作负载下追踪不同 GPU 和软件栈的推理性能。这些关于吞吐量、效率和每瓦成本的持续数据增强了我们为规模化调优 Azure 推理平台的能力，帮助客户在 Microsoft Cloud 上自信地构建应用。"_
+>
+> -- Scott Guthrie，Microsoft 云与 AI 执行副总裁
+
+> _"在 Microsoft，为我们的客户大规模提供最佳推理性能和经济效益需要深入理解 AI 模型如何与真实硬件和软件交互。像 InferenceMAX 这样的开源、可复现的基准测试对于在真实工作负载下产生关于吞吐量、效率和成本的透明洞察至关重要。这些持续信号有助于指导我们的平台战略，使我们能够从硅片到系统再到软件优化整个栈，让每一层协同工作以释放基础设施的全部潜力。"_
+>
+> -- Saurabh Dighe，Azure 战略规划与架构公司副总裁
+
+> _"理论峰值与真实推理吞吐量之间的差距往往取决于系统软件：推理引擎、分布式策略和底层内核。InferenceMAX 的价值在于它对最新软件进行基准测试，展示了 FP4、MTP、投机解码和 wide-EP 等优化在不同硬件上的实际表现。这样的开放、可复现结果有助于整个社区更快前进。"_
+>
+> -- Tri Dao，Together AI 首席科学家、Flash Attention 发明者
+
+> _"行业需要更多公开、可复现的推理性能基准测试。我们很高兴从 vLLM 团队的角度与 InferenceMAX 合作。更多样化的工作负载和场景——每个人都可以信赖和引用——将有助于生态系统向前发展。公正、透明的测量驱动着从模型架构到推理引擎再到硬件的每一层进步。"_
+>
+> -- Simon Mo，vLLM 项目联合负责人
+
+> _"The benchmark is good sir"_
+>
+> -- Michael Goin，vLLM 维护者
+
+> _"InferenceMAX benchmark is pogchamp & W in chat"_
+>
+> -- Kaichao You，vLLM 项目联合负责人
+
+> _"InferenceMAX 展示了开放生态系统在实践中的运作方式。vLLM、SGLang 和 TensorRT-LLM 等许多领先的推理栈都构建在 PyTorch 之上，而这样的基准测试展示了内核、运行时和框架层面的创新如何转化为在包括 NVIDIA 和 AMD GPU 在内的各种硬件平台上的可量化性能。通过开源和每夜运行，InferenceMAX 为追踪进展和为 PyTorch 用户提供数据驱动的洞察提供了一种透明的、社区驱动的方法。"_
+>
+> -- Matt White，PyTorch Foundation 执行总监
+
+> _"Oracle Cloud Infrastructure 旨在为前沿实验室和企业提供灵活性和选择权，拥有多种 GPU SKU 可用于大规模 AI。InferenceMAX 通过提供开源、可复现的基准测试来增强这一使命，这些测试反映了最新硬件和软件上的真实性能、效率和成本。凭借这种透明度，客户可以自信地选择最符合其 AI 战略的平台。"_
+>
+> -- Jay Jackson，Oracle Cloud Infrastructure 副总裁
+
+> _"InferenceMAX 通过提供开放、透明的基准测试来提升标准，追踪推理在最新 GPU 和软件栈上的真实表现。对于客户而言，拥有测量真实每美元 token 数和每瓦 token 数的可复现数据，将抽象的营销数字转化为可操作的洞察。在 CoreWeave，我们支持这一努力，因为它为快速变化的领域带来了清晰度，帮助整个生态系统自信地构建。"_
+>
+> -- Peter Salanki，CoreWeave CTO
+
+> _"InferenceMAX 通过提供开放、透明的基准测试树立了新标准，揭示了推理在当今领先 GPU 和软件栈上的真实表现。凭借测量真实每美元 token 数和每瓦 token 数的可复现数据，客户可以超越营销宣传，获取可操作的洞察。对于我们 Nebius 而言——作为全栈 AI 云提供商——这一项目帮助我们自信地构建推理平台，确保我们与生态系统保持一致。"_
+>
+> -- Roman Chernin，Nebius 联合创始人兼首席商务官
+
+> _"在 Crusoe，我们相信成为一个出色的合作伙伴意味着赋予客户选择权和清晰度。这就是我们自豪地支持 InferenceMAX 的原因——它为整个 AI 社区提供了针对最新硬件的开源、可复现的基准测试。通过提供关于吞吐量、效率和成本的透明、真实的数据，InferenceMAX 穿透了炒作，帮助客户自信地为其独特工作负载选择最佳平台。"_
+>
+> -- Chase Lochmiller，Crusoe 联合创始人兼 CEO
+
+> _"Supermicro 对 InferenceMAX 的发布感到兴奋——这是 SemiAnalysis 打造的基准测试系统，用于测量真实吞吐量、性价比和能源效率。这个开源工具提供了在最新硬件和软件上运行的可复现基准测试，使 AI 实验室和企业能够大规模选择最佳平台。"_
+>
+> -- Charles Liang，Supermicro 创始人兼 CEO
+
+> _"在 TensorWave，我们正在 AMD GPU 上构建下一代云，因为我们相信当客户拥有强有力的替代选择时，创新才会蓬勃发展。InferenceMAX 通过提供开源、可复现的基准测试来支持这一愿景，这些测试追踪最新硬件和软件上的吞吐量、效率和成本。通过穿透合成数字并凸显真实推理性能，它帮助客户看到 AMD 平台在大规模 AI 方面的全部潜力。"_
+>
+> -- Darrick Horton，TensorWave CEO
+
+> _"Vultr 致力于提供一个开放的生态系统，为开发者提供自由选择——无论是在 NVIDIA 还是 AMD GPU 上——以构建和扩展 AI。借助 InferenceMAX，客户可以获得开放、可复现的基准测试，清晰地洞察最前沿硬件和软件的吞吐量、效率和成本。通过展示真实性能，我们赋能团队自信地为其 AI 工作负载选择合适的平台。"_
+>
+> -- Nathan Goulding，Vultr 工程高级副总裁
+
+## 吞吐量（tok/s/gpu）与延迟/交互性（tok/s/user）之间的根本权衡
+
+在大规模服务 LLM 时，核心权衡在于吞吐量与交互性（以每用户每秒 token 数为单位）。吞吐量是每块 GPU 处理 token 的速率（tok/s/gpu），而交互性描述的是为每个单独用户生成 token 的速率（tokens/sec/user）。简单来说，你可以为单个用户提供快速高效的服务——通常通过同时服务较少的用户来实现——但这样做的代价是整体 GPU 吞吐量降低。
+
+这种权衡的存在是因为 LLM 推理依赖于矩阵乘法运算，而这些运算受益于将多个请求批量处理——即同时服务更多用户。大批量处理能够实现更好的 GPU 利用率和更高的 token 吞吐量，但它将可用资源分配给了更多请求，从而降低了每个用户的 token 处理速度。反之，小批量将 GPU 资源集中在较少的请求上——即更少的用户，以整体吞吐量为代价换取高交互性。在实践中，大多数提供商的目标是在这两个极端之间取得平衡。在这一权衡曲线上的最优点取决于具体用例：一些应用优先考虑响应速度，而另一些则优先考虑吞吐量。然而，目标交互性水平直接决定了推理成本。更高的交互性意味着更高的成本。
+
+拥有或租用 GPU 系统进行推理通常带来固定的每小时美元成本。因此，随着交互性增加和整体吞吐量下降，每小时处理的 token 数量减少，推高了每 token 的单位成本（以每百万 token 成本衡量）。为保持盈利，提供商必须将每 token 价格设定在其服务成本之上。这意味着高交互性用例需要更高的每 token 价格以支撑这一更高的成本，而高吞吐量应用可以以更低的价格提供服务。
+
+一个简单的类比可以说明整个权衡关系。一辆公交车和一辆法拉利的绝对拥有成本可能非常相似，但公交车将这一成本分摊到几十名乘客身上，而法拉利只服务一两个人。法拉利通过即刻出发、直达路线和优质体验提供卓越的响应性，但每位乘客的成本从根本上更高。LLM 服务运营受制于类似的约束。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/539b0f8f-9421-41b4-9e89-a53f2697b0c8_1976x1454.png"
+  caption="来源：SemiAnalysis"
+/>
+
+## 帕累托前沿曲线
+
+吞吐量和延迟之间始终存在权衡。为了确定帕累托前沿曲线，我们尝试找到每一个数据点 P，使得不存在任何一个点在吞吐量和延迟两个维度上都优于点 P。这意味着数据点 P 是**帕累托最优**的，即没有其他点能在改善一个维度的同时不牺牲另一个维度。当我们将这些帕累托最优点连接起来，就得到了帕累托前沿曲线。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/4686438a-1880-4162-92c9-e64d2ab7718b_2852x993.png"
+  caption="来源：SemiAnalysis"
+/>
+
+## InferenceMAX v1 基准测试方法论
+
+提供反映不同 GPU、推理引擎和工作负载在多种交互性水平下全部可能性的基准测试，是 InferenceMAX 的核心目标。本节将介绍基准测试方法论如何被设计来实现这一目标。
+
+对于每次基准测试运行，我们设置一个推理服务器和一个基准测试客户端。推理服务器监听请求并处理它们。我们根据模型使用 vLLM、SGLang 和 TRT-LLM。对于基准测试客户端，我们使用移除了 vLLM 依赖的 vLLM benchmark serving 脚本。基准测试客户端发送请求、记录运行时间，并保存与推理作业相关的指标。
+
+我们选择使用随机序列的基准测试请求以避免前缀缓存（prefix caching），因为目前将前缀缓存纳入考量的复杂度较高。前缀缓存的效果因工作负载不同而差异显著，需要仔细调查请求模式才能选出具有代表性的前缀比率。在 InferenceMAX 的未来迭代中，我们将使用 ShareGPT 等数据集替代随机数据。我们将请求速率设置为无限，并设置最大并发请求数，从而捕获推理服务器在同时处理特定数量请求时的行为。我们还将总请求数设置得足够大，以摊销冷启动不稳定因素，例如 JIT 编译时间。
+
+对于输入/输出序列长度，我们最终确定了三组：1024 输入 token / 1024 输出 token 代表对话工作负载，1024 输入 token / 8192 输出 token 代表推理（reasoning）工作负载，8192 输入 token / 1024 输出 token 代表摘要工作负载。为了模拟真实请求具有不同输入序列长度的情况，我们将每个请求的输入长度在指定输入序列长度的 80% 到 100% 之间随机变化。
+
+基准测试运行的配置选项如下：
+
+- **模型**：Llama3 70B、DeepSeek R1、gpt-oss 120B
+- **精度**：MXFP4 weights、FP8、FP4
+- **GPU**：H100、H200、B200、GB200 NVL72、MI300X、MI325X、MI355X
+- **开源框架**：[vLLM](https://github.com/vllm-project/vllm)、[SGLang](https://github.com/sgl-project/sglang)、[TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
+- **并行度**：1、2、4、8 等
+- **最大并发数**：4、8、16、32、64 等
+
+在模型方面，我们选择 LLaMA3 70B 作为稠密企业模型部署的代表。
+
+对于稀疏 MoE 模型的基准测试，我们选择了 DeepSeekV3 670B。从算术强度、近似活跃参数数量、总参数量和内存访问模式来看，DeepSeekV3 的模型架构是最接近 OpenAI 4o/5 等前沿闭源模型架构的模型。因此，DeepSeek 是用于基准测试以推断 OpenAI 内部模型架构可能表现的最佳代理模型。
+
+对于较小的稀疏 MoE 模型，我们选择了 GPT-OSS 120B MoE，因为它在算术强度、近似活跃参数数量、总参数量和内存访问模式方面最接近 GPT-5 mini。
+
+我们根据硬件支持情况，在各模型上对 FP8、FP4 和 MXFP4 weights 进行基准测试。为了绘制完整的吞吐量和延迟曲线，我们扫描不同的最大并发用户数（一个类似于 batch size 的概念）。为了找到帕累托前沿曲线，我们还扫描不同的模型并行方案，因为更大的模型并行度可以减少内存加载时间，从而在一定程度上提高低延迟区间的吞吐量。
+
+为了防止 [SGLang 与 vLLM 基准测试大战的重演](https://x.com/dylan522p/status/1920638653677596836)并节省计算时间，我们决定首先为每个模型只选择 vLLM 或 SGLang 中的一个作为默认引擎。早在 7 月份，我们就告知 AMD 和 Nvidia，我们将使用 SGLang 作为 DeepSeek 670B 的引擎，使用 vLLM 作为 Llama3 70B 和 Llama4 的引擎。后来我们用 GPT-OSS 120B 替换了 Llama4，因为没有人使用 Llama4，而 GPT-OSS 120B 更接近较小的 "mini" 前沿模型。
+
+我们希望服务器配置尽可能反映真实部署情况，因此要求 AMD 和 Nvidia 提交与其文档指南中讨论如何在其硬件上部署这些模型时相当接近的配置：
+
+- [https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html](https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html)
+- [recipes.vllm.ai](https://docs.vllm.ai/projects/recipes/en/latest/)
+- [https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/inference-vllm-gpt-oss-120b.html](https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/inference-vllm-gpt-oss-120b.html)
+
+我们之前没有明确说明 InferenceMAX 是否允许预热（warmup），所以 Nvidia 在其 SGLang DeepSeek 提交中包含了一个预热阶段来处理某些 JIT 编译的内核。在基准测试开发接近尾声时，AMD 注意到了上述关于 Nvidia 提交的情况，并询问是否允许预热，因为他们没有意识到自己也可以这样做。经过 AMD、Nvidia 和 SemiAnalysis AI 工程团队之间的讨论，各方同意目前禁止预热，并将 DeepSeek 基准测试的时长延长最多 5 倍以确保公平。这次沟通混乱是我们的失误，因为从一开始就没有明确预热规则。我们计划在发布后重新审视这一议题，因为在真实生产推理中，预热通常在 Kubernetes 控制平面将 Pod 报告为健康之前发生。
+
+## 讨论：DeepSeek R1 服务策略
+
+我们允许供应商可选地为 DeepSeek R1 提交分离式服务（disaggregated serving）配置。分离式服务将推理的两个阶段——预填充（prefill）和解码（decode）——分配给不同的 GPU 资源。通过分离这两个阶段，处于不同阶段的请求不会相互干扰，从而实现更好的 SLA 保障，尤其在高并发场景下。
+
+我们还将分离式服务与大规模专家并行（wide EP）相结合。Wide EP 通过多种技术实现，其中最值得注意的是 DeepEP。DeepEP 提供两种分发模式：常规模式和低延迟模式。常规模式专注于提高预填充阶段的吞吐量，而低延迟模式则为降低解码阶段的延迟而定制。
+
+对于 DeepSeek R1 的分离式服务，我们还收到了启用多 token 预测（MTP）的提交。DeepSeek R1 实现了 MTP，即模型通过额外的 MTP 模块在每次前向传播中预测多个 token。根据 DeepSeek 的说法，使用 MTP 训练可以提高模型的规划能力。此外，在推理过程中使用 MTP 模块可以在对模型质量损失极小的情况下提升 token 吞吐量。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/18326fc6-9e43-4998-838b-b2b7087de7f1_2853x1341.png"
+  caption="来源：DeepSeek-V3 技术报告，图 3"
+/>
+
+Nvidia 已为 GB200 NVL72 上的 DeepSeek R1 提交了包含分离式服务、wide EP 和 MTP 的运行结果。Nvidia 还提交了特定配置以绘制帕累托前沿，我们计划未来扩展以扫描更大的配置空间。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/2c25289d-8cf8-4d1a-9be6-69a4e12f9886_2353x1271.png"
+  caption="来源：DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving"
+/>
+
+在服务 DeepSeek R1 时，SGLang 提供了多种并行策略，包括**张量并行（TP）**、**数据并行（DP）**和**专家并行（EP）**。并行策略在 GPU 之间分配工作，以降低每块 GPU 的内存使用并提高硬件利用率。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/076f3327-558d-481f-91be-87edecf55135_2153x1624.png"
+  caption="来源：SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs"
+/>
+
+通常，我们使用张量并行沿注意力头维度（通常为 128 个头）分配注意力层的工作。然而，这不太适合 DeepSeek R1，因为它使用了多头潜在注意力（Multi-head Latent Attention, MLA）。这是一种特殊的注意力机制，其中只有一个 KV 头，在张量并行下会导致 KV 缓存在各 GPU 上重复存储。为了解决这个问题，SGLang 在低交互性场景下使用数据并行注意力，沿 batch 维度分配工作，从而不再需要重复存储 KV 缓存，并减少了通信负载。
+
+DeepSeek R1 还有大量的专家层，因此我们应用专家并行，为每块 GPU 分配一组专家层。这降低了内存使用，但代价是更高的通信负载。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/b668c1ec-765e-4bfd-9e17-66c2bfb91c76_1590x1582.png"
+  caption="来源：SGLang v0.4"
+/>
+
+## InferenceMAX 架构
+
+InferenceMAX 使用 GitHub Actions 来编排基准测试运行。一个 GitHub Action 将每个基准测试配置作为一个 [job](https://docs.github.com/en/actions/get-started/understand-github-actions#jobs) 运行，并在 [runner](https://docs.github.com/en/actions/get-started/understand-github-actions#runners) 上执行。我们将 GPU 服务器作为 runner 接入 GitHub Actions，使其监听请求并执行 job。在执行 job 时，runner 会执行为该服务器编写的 runner 启动脚本，该脚本反过来使用 Docker 或 SLURM，具体取决于服务器的设置。启动脚本随后执行包含具体基准测试配置的基准测试脚本。
+
+我们将并行策略 + 最大并发数基准测试扫描的逻辑定义为参数化的 [workflow](https://docs.github.com/en/actions/get-started/understand-github-actions#workflows)，并通过增量组合 workflow 来覆盖所有 GPU 类型、所有模型以及不同的输入/输出序列长度。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/290f8a5a-3e55-45c6-a0e4-7c1dc6065175_2293x1701.png"
+  caption="来源：SemiAnalysis"
+/>
+
+## 性能结果——吞吐量 vs 端到端延迟/交互性（tok/s/user）
+
+以下是撰写本文时 2025 年 10 月 7 日每夜运行的性能快照。完整的每夜结果请访问我们在 [http://inferencemax.ai/](http://inferencemax.ai/) 上的仪表板。
+
+在解读吞吐量 vs 延迟/交互性图表时，请记住大多数实际应用运行在两个极端之间的某个位置。仅测量单一或有限吞吐量或交互性水平的基准测试结果有时可能具有误导性。
+
+例如，假设 GPU A 在某个交互性水平下（以面向人类的 AI 聊天机器人应用为例，取 5 tokens/s/user）提供了 GPU B 4 倍的吞吐量。但这个交互性水平实际上慢到无法使用，因此这种性能差异几乎没有实际意义。正确的做法是为给定应用选择一个现实的交互性水平。
+
+在本报告的后续部分，我们还将按 GPU 的总拥有成本（TCO）对吞吐量进行归一化。
+
+每百万 token 的 TCO 成本是客户真正关心的北极星指标——性能只是计算这一指标的中间步骤。例如，B200 可能提供比 MI355X 高 1.5 倍的吞吐量，但如果其每小时 TCO 成本是 MI355X 的 2 倍——那么 MI355X 将是更好的选择，因为即使 MI355X 在每 GPU 吞吐量的绝对性能上较低，它的 TCO 性价比更优。
+
+让我们逐步分析几个基准测试示例来解释如何分析结果。
+
+在我们的第一个结果中，H100 vLLM 与 MI300X ROCm 7.0 vLLM 在 Llama 3.3 70B FP8 推理场景（1k in/ 8k out）的对比显示了 MI300X 的强劲性能，尤其在低交互性水平（20 到 30 tok/s/user）下，这得益于 MI300X 在 TP1 运行时更好的内存带宽和内存容量优势。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/471369b0-e290-4560-a348-b3079c8b6e88_2329x1393.png"
+  caption="来源：SemiAnalysis"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/a46e880d-4076-46ff-84d1-5be99ba50a8a_2634x1582.png"
+  caption="来源：SemiAnalysis"
+/>
+
+在 H200 与 MI325X 使用 vLLM 运行 GPT-OSS 120B MX4 weights 摘要工作负载的对比中，我们看到了有竞争力的结果。MI325X 在交互性低于 110 tok/s/user 时相对 H200 具有优势，在高于 110 tok/s/user 的水平下仍与 Nvidia 保持一定竞争力。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/bec7ee8f-a2bc-4458-be18-76fa434ff7f1_2477x1470.png"
+  caption="来源：SemiAnalysis"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/cb664d13-b2f7-42e5-824d-fbf6319c0f7f_2406x1432.png"
+  caption="来源：SemiAnalysis"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/3350a97b-7c28-4e46-9c80-519555da1eb6_2342x1124.jpeg"
+  caption="来源：SemiAnalysis"
+/>
+
+在 LLaMA 70B FP4 方面，B200 在所有三种工作负载类型上的吞吐量性能都大幅超越 MI355X。这表明 AMD 的 FP4 内核仍有改进空间。
+
+转到 B200（vLLM 和 TRT-LLM）vs. MI355X vLLM 运行 GPT-OSS 120B 的对比，我们可以看到 MI355X 在按 TCO 归一化后与 B200 vLLM 具有竞争力。在下一节中，我们将看到 MI355X 在某些交互性范围内的 TCO 性价比优于 Nvidia。吞吐量-延迟图表显示竞争更为激烈，MI355X 在给定 tok/s/gpu 吞吐量下比 B200 慢不超过约 15 秒。我们在现实中看到的 GPT-OSS 120B 最实用的交互性范围约为 150-200 tok/s/user。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d40e2253-8285-466f-8f16-6459c1051fb4_2373x1419.png"
+  caption="来源：SemiAnalysis"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/918c3a71-6c31-400a-a5c9-fb0930f8a2f7_1879x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+转到 DeepSeek 670B MoE FP8，在 MI325X SGLang 与 H200 SGLang 的对比中，我们观察到 MI325X 在给定吞吐量水平下的延迟和交互性方面均明显落后。H200 SGLang 在相当的吞吐量下始终比 MI325X 的延迟低约 40%。此外，在比较两者的帕累托前沿交互性时，我们也看到了持续的差距。MI355X SGLang 与 B200 SGLang 的对比呈现出与 MI325X vs H200 类似的态势。AMD 在 SGLang 镜像方面似乎有很大的改进空间。
+
+我们还可以看到，GB200 NVL72 SGLang Dynamo FP8 机架级推理目前尚未完全优化，仍有改进空间。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/2755907c-48bb-47aa-99be-6058c695ac85_1857x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d68490f3-adfd-4ab7-b6d1-80bae3921ef9_1868x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+转到 FP4 DeepSeek 670B MoE，我们看到 GB200 NVL72 机架级 TRT-LLM 推理以大幅优势击败了单节点 SGLang 推理。我们期待在未来几个月内在多节点 8-GPU 机器上对 wideEP + 分离式预填充进行基准测试。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/9534de11-42bd-44c9-9398-16bb9deedcfd_1864x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/c98fcaff-2665-406e-8ce0-c63c456532e4_1374x839.png"
+  caption="来源：SemiAnalysis"
+/>
+
+接下来，我们比较 GB200 在 DeepSeek R1 上开启和关闭多 token 预测（MTP）的表现，场景为 8K 输入 / 1K 输出——这一输入/输出比例旨在反映摘要用例。MTP 开启的优势在比较吞吐量 vs. 交互性时尤为明显。在 70-140 tok/s/user 的范围内，我们看到 MTP 开启场景相比 MTP 关闭场景每 GPU 的吞吐量显著更高——在某些等交互性（tok/s/user）水平下甚至达到 2-3 倍的吞吐量。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/cb76eccd-5bb3-480e-98b0-e821f11b8d88_1849x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/717be24a-af21-4e21-a829-c907bf3aaa88_1883x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+## 性能结果——每百万 Token TCO 成本 vs 交互性（tok/s/user）
+
+然而，比较每 GPU 的 token 吞吐量只是得出真正底线——即每 token 总拥有成本（TCO）——所需的几个数据点之一。
+
+ML 推理工程师通常以每百万 token TCO 成本为单位来衡量这一指标。要从每 GPU 吞吐量转换为每百万 token 的 TCO 成本，在比较芯片之间的差异时，我们必须按以 USD/hr/GPU 为单位的总拥有成本进行归一化。例如，如果 B200 提供了比 MI355X 高 1.5 倍的吞吐量但每小时 TCO 成本是其 2 倍——那么 MI355X 将是更好的选择，即使它的绝对性能较低。
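+
+上述换算可以写成一个简单的公式。下面是一个示意性的推导：记号 `C`、`P`、`T` 是为便于说明而引入的，并非 InferenceMAX 代码中的命名；吞吐量之比 1.5 与成本之比 2 取自上文的假设性示例，而非实测数据。
+
+```latex
+% C: 每百万 token 的 TCO 成本 (USD)
+% P: 每 GPU 每小时的 TCO (USD/hr/GPU)
+% T: 每 GPU 吞吐量 (tok/s/GPU)
+C = \frac{P}{3600 \cdot T} \times 10^{6}
+
+% 示例: T_B200 = 1.5 T_MI355X, P_B200 = 2 P_MI355X
+\frac{C_{\mathrm{B200}}}{C_{\mathrm{MI355X}}}
+  = \frac{P_{\mathrm{B200}} / T_{\mathrm{B200}}}{P_{\mathrm{MI355X}} / T_{\mathrm{MI355X}}}
+  = \frac{2}{1.5} \approx 1.33
+```
+
+也就是说，在这一假设下，B200 的每百万 token 成本约为 MI355X 的 1.33 倍，MI355X 便宜约 25%。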
+
+在我们位于 [http://inferencemax.ai/](http://inferencemax.ai/) 的 InferenceMAX 门户上，我们估算了各种客户群体的每百万 token TCO 成本 vs 延迟/交互性，包括：
+
+- 购买并拥有芯片的超大规模厂商和一级前沿实验室（4 年经济使用寿命）
+- 计划拥有自有芯片的新型云巨头和大型托管推理服务提供商（4 年经济使用寿命）
+- 从新型云厂商租用 GPU 的客户（3 年合同，预付 25%）
+
+建模每 Token 总拥有成本绝非易事，涉及 SemiAnalysis 多个团队和业务领域。在 AI Token 工厂经济模型中，我们展示了用于推导这一北极星指标的所有假设，以及用于确定这些数值的 SemiAnalysis 模型。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d913a30b-22dc-42bd-93bc-f81500ac9d18_1415x864.png"
+  caption="来源：SemiAnalysis AI 加速器模型、SemiAnalysis BOM 与 ODM 模型、SemiAnalysis AI 网络模型、SemiAnalysis AI TCO 模型、SemiAnalysis 数据中心模型"
+/>
+
+特别是，[SemiAnalysis AI TCO 模型](https://semianalysis.com/ai-cloud-tco-model/) 提供了针对各种 AI 服务器方案和网络架构组合（即 InfiniBand vs SpectrumX vs Arista 以太网 vs 白盒以太网）的全面总拥有成本建模，是 InferenceMAX 中使用的每 GPU 总拥有成本以及新型云市场租赁价格的主要来源。
+
+SemiAnalysis GPU 云市场租赁价格报告基于对 70 多家 GPU 云和 100 多个从 GPU 云租赁的终端用户的调研。未来，我们将探索在 InferenceMAX.ai 门户上实现不同租赁定价合同期限（如 1 年或 1 个月）的仪表板。我们还计划允许自定义输入，以便你可以输入自己的 $/GPU/hr 报价，来确定最符合你的交互性目标和成本的 GPU。
+
+在以下分析中，我们聚焦于拥有芯片并将其商业模型基于 4 年经济使用寿命的超大规模运营商级别的每百万 token 成本。
+
+我们看到，在所有交互性水平下，MI325X vLLM 的每百万 token 成本都优于 H200 vLLM。当我们引入 Nvidia（大部分）开源的 TRT-LLM 时，我们看到 H200 当前的软件栈在与使用今天 vLLM 栈的 MI325X 的对比中胜出。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1217e5ad-909c-419e-b4a5-304a7145275e_1736x1165.png"
+  caption="来源：SemiAnalysis"
+/>
+
+当我们在推理输入/输出长度场景下比较 B200 vLLM 与 MI355 ROCm 7.0 vLLM 运行 Llama3 70B FP4 时，B200 目前表现优于 MI355。这也印证了我们的建议——AMD 应更加专注于优化 FP4 对 Llama3 的支持。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/5dde0cca-4570-4e17-a685-f8aa0fc9685c_1774x1178.png"
+  caption="来源：SemiAnalysis"
+/>
+
+对于 GPT-OSS 120B FP4 摘要任务，我们看到 MI355X vLLM 的每百万 token TCO 成本低于 B200 vLLM，甚至在交互性低于 225 tok/s/user 时可以击败 B200 TRT-LLM。对于高于 225 tok/s/user 的交互性水平，我们看到 B200 TRT-LLM 以及其他推理引擎的优化更到位，能够提供比 MI355X vLLM 更优的 TCO 性价比。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1629db8a-7dbb-4d06-b58c-60c58b66a248_1826x1237.png"
+  caption="来源：SemiAnalysis"
+/>
+
+在 GPT-OSS 120B MX4 weights 上，我们看到 MI300X 在整个交互性范围内相对 H100 展现了非常强劲的 TCO 性价比。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/2f3e8a00-c9fe-4a81-be97-6a3681eb15ee_1668x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+对于 gpt-oss 120B MX4 weights，H200 TRT-LLM 在交互性低于 135 tok/s/user 时与 MI325X 的 TCO 性价比不相上下。在此水平之上，MI325X vLLM 在每百万 token TCO 成本方面领先于 H200 TRT-LLM。
+
+这一结果令人惊讶之处在于，完全开源的 vLLM Hopper 版本比"大部分"开源的 TRT-LLM Hopper 版本更快。即使是 MI325X vLLM 也在交互性高于 135 tok/s/user 时击败了 H200 TRT-LLM。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/dd8c102c-4041-4e37-be5f-5b308b06adfc_1672x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+转到 DeepSeek 670B MoE FP8，我们看到在保持每百万 token TCO 成本不变的情况下，B200 SGLang 提供的交互性比 MI355X SGLang 快 1.5 倍。我们注意到 ROCm AITER 中仍有大量优化正在集成到 SGLang 中，因此我们预计 SGLang DeepSeek 670B MoE 的 TCO 性价比将很快得到改善。
+
+在交互性保持在约 35 tok/s/user 时，GB200 NVL72 击败了所有其他方案，提供了 4 倍更优的每百万 token TCO 成本。我们注意到 Dynamo 团队目前只有时间实现足以将并行成本帕累托前沿降低到 30 tok/s/user 区域的优化。他们仍有空间进一步优化，以将 GB200 NVL72 FP8 在约 40 及以上交互性水平的成本帕累托前沿进一步推低。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/aa169f01-a0c1-493e-9fcf-15e0c3e9e2dd_1646x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+转到 DeepSeek R1 FP4 摘要用例，我们看到在低于 90 tok/s/user 的交互性水平下，GB200 NVL72 使用 TRT-LLM 引擎配合 Dynamo 分离式预填充，在每百万 token TCO 成本上决定性地超越了所有单节点 8-GPU 服务器。有趣的是，在交互性高于 90 tok/s/user 时，B200 TRT-LLM 击败了 GB200 NVL72。也就是说，就目前而言，单节点 B200 服务器在高交互性用例中可以实现比 GB200 NVL72 更好的 TCO 性价比。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/a154ae09-41cd-487b-a99c-30f89c793381_1946x1302.png"
+  caption="来源：SemiAnalysis"
+/>
+
+在下面针对推理用例的基准测试中，我们看到 B200 SGLang 目前表现优于 MI355X SGLang。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d8274bbd-4310-44b0-9545-809260cfca44_1669x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+对于摘要场景，GB200 使用当前的 TRT-LLM Dynamo 软件在交互性低于 80 tok/s/user 时表现优于 B200 单节点。比较 MI355X SGLang 与 B200 SGLang，我们看到 B200 提供了更好的每百万 token TCO 成本。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f5dcfed6-70d1-460f-8963-14784577ca44_1695x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+我们还对使用 FP4 配合多 token 预测（MTP）运行的工作负载进行了基准测试。MTP 是 DeepSeek 团队在训练阶段实现的功能。我们看到，在每百万 token TCO 成本相同的情况下，启用 MTP 可以提供比不使用 MTP 高 2-3 倍的交互性（tok/s/user）。的确，大多数前沿实验室和一级托管 DeepSeek REST API 端点提供商已经在生产工作负载中启用了 MTP。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/90eb3363-3d59-4c46-aff7-521af7ecc45b_1679x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+## 每全额配置公用事业兆瓦估计 Token 吞吐量 vs 交互性（tok/s/user）
+
+电力是 AI 基础设施的终极约束。每个数据中心都在有限的功率范围内运行，通常以兆瓦（MW）为单位衡量。这直接决定了给定数据中心可以产生多少有用计算——以及最终可以产生多少 token。
+
+推理经济不仅可以通过 GPU 性能（吞吐量/GPU vs TCO）的视角来分析，还可以通过每功率吞吐量来分析——以每全额配置公用事业 MW 的 token/s 来衡量。全额公用事业功率涵盖 GPU、CPU、网络设备、其他相关集群 IT 设备以及设施开销的功率需求。设施开销包括电力分配损耗以及用于冷却设备（如冷水机组、CDU 和冷却塔等）的功率消耗。每 MW 处理的 token 数越多，每单位能源的潜在收入和利润就越大。请注意，对于 InferenceMAX，我们使用的是全额配置公用事业 MW，它考虑了上述设施开销，而不是每全额关键 IT MW 的 token 数（后者不考虑设施开销）。这些指标在不同站点之间有所不同，但我们根据我们的 [AI TCO 模型](https://semianalysis.com/ai-cloud-tco-model/)和[数据中心模型](https://semianalysis.com/datacenter-industry-model/)选择了行业代表性数值。
+
+请注意，托管租金和电力成本通常占总拥有成本的不到 20%。这意味着如果给定 GPU 与另一块 GPU 相比每 MW token 数低 20%，这仅会转化为不到 4% 的总拥有成本差异（即 20% \* 20% = 4%）。TCO 的大头来自各 GPU 硬件供应商收取的毛利率。一些厂商收取高达 75% 的毛利率（即售价为销货成本的 4 倍），而其他厂商的毛利率低于 50%（即售价不到销货成本的 2 倍）。
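+
+毛利率与成本加价倍数之间的换算可以这样推导。下面的记号 `m`（毛利率）、`p`（售价）、`c`（销货成本）是为说明而引入的，75% 和 50% 这两个数值取自上文：
+
+```latex
+m = \frac{p - c}{p}
+\quad\Longrightarrow\quad
+\frac{p}{c} = \frac{1}{1 - m}
+
+% m = 0.75: p / c = 1 / 0.25 = 4
+% m = 0.50: p / c = 1 / 0.50 = 2
+```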
+
+我们使用速率单位——即 token/s per MW——而非累积的每 token 能量单位（如焦耳每 token）。这是因为数据中心容量以兆瓦（MW）为单位来规划，这是一个速率单位，等同于每秒 1 兆焦耳（MJ）。如果我们将速率单位在给定时间段内积分，就得到该时间段内消耗的绝对能量值。
+
+目前，我们通过加总数据中心中各组件的热设计功耗（TDP）来估算给定集群所需的 MW。TDP 与预期平均功率不同。举例说明：对于内存带宽受限的解码工作负载，系统功耗永远不会达到 TDP，而是会在较低的功率水平——即预期平均功率附近波动。未来，我们将通过 ipmitool 对每个系统（和网络设备）的实际功耗进行基准测试。届时我们将转向每 token 的累积能量单位。
+
+我们基于 InferenceMAX 原始结果结合来自我们 [AI 数据中心行业模型](https://semianalysis.com/datacenter-industry-model/)的 AI 集群全额公用事业功率数据来估算每配置功率的吞吐量。该模型通过跨供应商、架构和推理栈的功率归一化估算来量化全额公用事业功率。完整的估算和持续的每夜基准测试可在 [InferenceMAX.ai](https://inferencemax.semianalysis.com/) 上获取。
+
+## 每 MW 性能结果
+
+我们看到，对于 gpt-oss 120B 使用 MX4 weights 的推理场景（1K 输入 token / 8K 输出 token），在 90 tok/s/user 交互性水平下，MI300X 每全额配置公用事业 MW 可处理 750,000 token/s（再次强调，这是按公用事业 MW 衡量的，而非按关键 IT 功率 MW），而 MI355X 每全额配置公用事业 MW 可处理 2,550,000 token/s。这代表了从 CDNA3 代到 CDNA4 代约 3 倍的能效提升。
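+
+由于 1 MW 等于每秒 10^6 焦耳，上面的速率指标可以直接换算为每 token 能耗。下面是一个示意性的换算，记号 `R`（每 MW 的 token/s）和 `E`（每 token 焦耳数）是为说明而引入的。由于功率是按 TDP 加总估算的配置功率，这里得到的是按配置功率折算的估计值，而非实测能耗：
+
+```latex
+E\ [\mathrm{J/token}] = \frac{10^{6}\ \mathrm{J/s}}{R\ [\mathrm{token/s\ per\ MW}]}
+
+% MI300X: E = 10^6 / 750{,}000     \approx 1.33\ J/token
+% MI355X: E = 10^6 / 2{,}550{,}000 \approx 0.39\ J/token
+% 代际比值: 2{,}550{,}000 / 750{,}000 = 3.4
+```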
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/0aa98f63-d0f8-4666-85fd-47a1bbf78c7a_1645x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+在 Nvidia 阵营中，我们看到了类似的代际趋势。对比 HGX H100 与 HGX B200 运行 gpt-oss 120B FP4 weights，H100 每 MW 可处理 900,000 token/s，而 B200 每 MW 可处理 2.8M token/s，B200 相比 H100 能效提升约 3 倍。当我们转向约 180 tok/s/user 的更高交互性水平时，B200 实现了令人瞩目的 7 倍能效提升。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/3b5b1ecd-8df1-48c8-8e21-5bfa9f21352b_1654x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+让我们比较 AMD 和 Nvidia 同代 GPU 的能效。我们首先看 GPT-OSS 120B 的每全额配置公用事业 MW token/s。根据我们的初始 InferenceMAX 结果快照，我们看到 Blackwell 在这一吞吐量/功率指标上比 CDNA4 架构能效高 20%。造成这一差异的一个重要因素是 MI355X 单 GPU TDP 明显更高，为 1.4kW/GPU，而 B200 为 1kW/GPU。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/594bc030-2862-4521-acf4-d06304a30dce_1687x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+在下一个基准测试中，我们以 30 tok/s/user 交互性水平来看 DeepSeek R1 的每功率 token 数。比较单节点 H200 FP8 与 GB200 NVL72 FP4（不含多 token 预测），GB200 NVL72 在每全额配置公用事业 MW 处理的 token/s 方面提供了约 8 倍的提升。请注意，H200 和 B200 的结果都来自单节点。我们将探索 B200 和 H200 通过在 SpectrumX 以及 InfiniBand 上实现分离式预填充和 wide 专家并行所能释放的更大每 MW token 吞吐量潜力。SGLang 的 [GB200 NVL72 分析](https://lmsys.org/blog/2025-09-25-gb200-part-2/)显示，8-GPU 系统确实可以通过实现 wide 专家并行获得强劲的性能提升。然而，SGLang 的博客也表明，即使 Hopper 同样实现了分离式预填充和 wide EP，GB200 NVL72 仍然胜出。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/e17e8c63-bc2a-4ce2-a9c9-0d00a586de3a_1566x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+继续 DeepSeek 但转到 FP8，我们看到 GB200 在 tok/s/gpu vs tok/s/user 上也主导了所有单节点系统。我们注意到这里有一些细微差别——B200 和 MI355X 都在运行单节点 SGLang，尽管对于 DeepSeek 而言，vLLM 可能在 MI355X 上比 SGLang 能提供更好的结果。我们将探索为 MI355X 添加 DeepSeek vLLM 支持和/或为所有 8-GPU 服务器添加 SGLang 多节点 wideEP。此外，如前所述，Dynamo 团队目前只有时间实现足以在约 30 tok/s/user 附近实现并行帕累托前沿下移的优化。进一步的优化可以将帕累托前沿进一步推低，从而在更高交互性水平下提升 GB200 NVL72 FP8 的每功率吞吐量。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/129e616b-82d6-413b-ae0c-e28f07b27d84_1661x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+## AMD Bug 与 NVIDIA Blackwell Bug
+
+我们在排查过程中遇到了几个有趣的 Blackwell Bug。第一个 Bug 是我们从 2025 年 7 月开始使用的 Blackwell vLLM 镜像会导致实例在我们的裸金属 B200 机器上挂起长达 30 分钟。这特别难以复现和调试，因为其他人尝试在他们的 Blackwell 集群上使用完全相同的镜像时并没有遇到任何挂起问题。
+
+我们用来调试这个挂起问题的第一个工具是 [py-spy](https://github.com/benfred/py-spy)——一个 Python 性能分析器——用来收集跟踪信息。我们注意到它卡在 [ncclCommInitRank](https://github.com/NVIDIA/nccl/blob/8d26308e6aba7f1667b24a861b5dc73f0f2e1f40/src/init.cc#L1974) 上，这很奇怪——许多 ML 性能工程师都知道，这个函数在单节点上应该运行得非常快。另外值得注意的是，vLLM 由于[各种技术原因](https://github.com/vllm-project/vllm/blob/3d1f67616da88cbf0033bf5027cc0c6e5e9cacf6/vllm/distributed/device_communicators/pynccl_wrapper.py#L4-L23)使用了他们[自己的 FFI 绑定](https://github.com/vllm-project/vllm/blob/3d1f67616da88cbf0033bf5027cc0c6e5e9cacf6/vllm/distributed/device_communicators/pynccl_wrapper.py#L144)来调用 NCCL。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/465df901-046f-4109-84b4-7ccaf092f79e_2810x867.png"
+  caption="来源：SemiAnalysis"
+/>
+
+在阅读了 vLLM 的 NCCL 绑定代码后，我们并不认为 FFI 绑定是问题的根本原因。运行 nvidia-smi 时，我们看到 GPU_UTIL 不是 100% 而是 0%——表明没有内核在 GPU 上运行，由此我们得出结论这不是设备端的 NCCL 死锁。
+
+接下来，我们使用 Linux [perf](https://perfwiki.github.io/main/) top 分析器查看 Python 层之下，试图更深入地了解具体是哪个共享库触发了这个问题。我们注意到该进程（及子进程）的大部分 CPU 周期都花在 "libnvidia-ptxjitcompiler.so" 上。查阅 "libnvidia-ptxjitcompiler" 文档后，我们找到了这样的描述：_"PTX JIT Compiler 库（/usr/lib/libnvidia-ptxjitcompiler.so.575.57.08）是一个 JIT 编译器，将 PTX 编译为 GPU 机器码，由 CUDA 驱动使用"_。这非常奇怪，因为我们不确定为什么初始化时要调用 PTX 编译器——通常所有 NCCL 内核都是在构建时预编译好的，不应该有需要即时编译的内核。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/a8a1c7d6-51e3-4b5f-85ea-41568792ac21_1937x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+我们当时~~太懒了~~太忙了，无法重建整个容器镜像并从头编译启用了调试符号的 NCCL。因此，我们接下来使用 [strace](https://man7.org/linux/man-pages/man1/strace.1.html) 来确定 ptxjitcompiler 正在进行什么系统调用，以便更深入一层了解正在调用的函数。我们发现 ptxjitcompiler 正在容器内的 ~/.nv/ComputeCache/ 中创建和添加文件。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/c9556f28-f5c9-4ffb-935a-d85b790bbe44_2125x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+再剥一层洋葱，我们研究了 ~/.nv/ComputeCache/ 的作用。根据文档，它是将 PTX 虚拟 ISA 转换为 SASS 机器码的缓存。这也让我们非常困惑，因为通常 NCCL 在构建时会同时打包机器码和 PTX 虚拟 ISA。我们开始阅读 [NCCL 构建脚本，注意到 SM100（Blackwell）在 CUDA 12 中未被启用](https://github.com/NVIDIA/nccl/commit/80f6bda4378b99d99e82b4d76a633791cc45fef0#diff-45a9034a0c75cbfbbb34e853a43f6513c1d4c933eccf6adca705abe234fc1113R42-R49)——而我们正在使用 CUDA 12——发现它们只为即将推出的 CUDA 13 启用了 SM100。这意味着 SM100 SASS 未被打包，我们实际上是在将 compute_90（Hopper）PTX JIT 转换为 SM100 SASS，导致这个过程耗时极长。其他人没有看到这个 Bug 是因为他们使用的是通过 SLURM 手动挂载了 home 目录的内部集群。由于 SASS JIT 缓存存储在 home 目录 ~/.nv/ComputeCache/ 中，SASS 已经被缓存了！
+
+原来 vLLM 7 月份的容器镜像基于 PyTorch 容器镜像，后者使用了一个未预构建 Blackwell SM100 的 NCCL 版本。修复方法是使用[修复后的 2.26.2 版本](https://pypi.org/project/nvidia-nccl-cu12/2.26.2.post1/)，其中包含了预构建的 Blackwell 支持，这样就不会浪费 30 分钟来编译虚拟 ISA 到机器码。这个 Bug 已在最新的 vLLM 容器镜像中修复。感谢 simon-mo、youkaichao、mgoin、Robert-shaw、ptrblck 和 Kedar Potdar 帮助实施永久修复并快速解决问题。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/e242f6a3-f0a2-4a95-9301-c4bdc6695544_1839x1121.png"
+  caption="来源：SemiAnalysis"
+/>
+
+我们遇到的另一个 Blackwell 问题是 vLLM/SGLang 的子依赖 Flashinfer 存在文件锁竞态条件（race condition）。出于某种原因，Nvidia 决定不将编译好的内核打包到容器镜像中，而是在服务器启动时下载它们。由于我们每个节点最多有 8 个进程（每个 GPU 1 个进程），如果代码不是进程安全的，我们在下载这些编译内核时就会遇到竞态条件。
+
+实际上，这个竞态条件是由于[一次旨在防止竞态条件的尝试](https://github.com/flashinfer-ai/flashinfer/pull/1779)而引入的！Flashinfer 没有依赖 Python FileLock 包内置的锁清理机制，而是手动清理锁，这反而导致了竞态条件。[这已在 Flashinfer 中修补](https://github.com/flashinfer-ai/flashinfer/pull/1779)，但尚未进入上游的 vLLM/SGLang Blackwell 发布容器镜像。非常感谢 Flashinfer 团队和 Kedar Potdar 迅速介入，在与团队对接后仅 4 小时内就完成了调试和修补。
+
+还有一个 Blackwell Bug 是 Flashinfer 将构建环境标志名称改为 FLASHINFER_CUDA_ARCH_LIST，但 Nvidia 方面没有通知 vLLM/SGLang 维护者，也没有提交自己的 PR，因此在几周时间内 [vLLM](https://github.com/vllm-project/vllm/pull/25730) 和 [SGLang](https://github.com/sgl-project/sglang/pull/11226) 都不支持 Flashinfer 的 AOT。
+
+我们发现 Nvidia 容器工具包偶尔会完全出错并显示以下消息：
+
+> _"docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'_
+>
+> _nvidia-container-cli: initialization error: driver rpc error: timed out: unknown"_
+
+尝试在 CLI 中使用 nvidia-smi 也会触发挂起。这表明整个 Nvidia 驱动实际上已经崩溃。在与 Nvidia 固件/驱动团队和 NVIDIA NCCL 团队进行详细的调试会议后，我们发现自 NCCL 2.26 以来存在一个缓慢的资源泄漏 Bug，因为我们使用了 CUDA graphs 并且每夜启动超过 500 个 Blackwell 容器。
+
+由于我们频繁地启动和停止大量 Blackwell 容器，所有这些启停操作累积起来最终导致驱动崩溃。资源泄漏 Bug 的具体原因在于，当启用 CUDA graph 时，NCCL 默认会启用用户缓冲区（user buffers）。如果没有这个资源泄漏 Bug，NCCL 用户缓冲区功能本应通过让 NCCL 使用应用程序缓冲区实现零拷贝，减少应用层缓冲区和 NCCL 内部缓冲区之间的数据移动。[临时修复方案是在过渡期内不启用 NCCL 用户缓冲区](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-graph-register)，直到 Bug 修复可以推出。修复的预计时间约为 10 月 20 日，预计作为 NCCL 2.28 的一次次要更新发布。感谢 Kedar Potdar 和众多 Nvidia 团队成员以令人难以置信的速度定位根因、修复了这个 Bug，并提供了大力支持。
+
+在 AMD 方面，我们在开发 InferenceMAX 过程中遇到的 Bug 较少，而且这些 Bug 更容易修复。其中一个 Bug 是 AMD 的 cuDNN 等价物 AITER 在一个辅助函数中崩溃，因为它没有考虑到 "/opt/rocm/llvm/bin/amdgpu-arch" 返回的不只是计算架构（如 gfx942），还会在 gfx942 之后附带后缀。AITER 通过模式匹配来判断所用的架构，但没有考虑后缀存在的情况。[编写临时修复](https://github.com/InferenceMAX/InferenceMAX/blob/3b8879031799cac260ef00bd8911dabbe5982d49/benchmarks/70b_fp8_mi325x_slurm.sh#L39)很简单，AITER 也将在接下来几周内推出永久修复。感谢 Quentin 帮助修补了这个问题！
+
+我们还在 MI355X 基准测试中遇到了一个 Bug，即基准测试运行崩溃并转储了 1TB 名为 gpucore.XXX 的文件。经过调查，我们发现根因是服务器配置中的 chunked prefill 大小设置过高。将其从 196608 降低到 32768 即修复了这个问题（[PR 链接](https://github.com/InferenceMAX/InferenceMAX/pull/80/files)）。
+
+AMD [最近添加了 pyxis 支持](https://instinct.docs.amd.com/projects/container-toolkit/en/release-1.1.x/container-runtime/enroot-pyxis-installation.html)，为在 SLURM 中使用容器带来了良好的使用体验，尤其是在多节点训练或多节点离线批量推理作业方面。然而，我们在他们的 ROCm 7.0 SGLang 镜像 _"rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915"_ 上遇到了一个 Bug，当尝试通过 pyxis SLURM 运行此镜像时导致硬崩溃。根因在于组成该 Docker 镜像的某些层的权限处理方式导致了层间的权限冲突。AMD 团队正在研究如何永久修复并防止此类错误再次发生。
+
+早在 7 月份，当我们尝试在 AMD GPU 上为 SGLang 启用 AITER 时，DeepSeek V3 的编译过程非常缓慢，耗时约为正常情况的 10 倍（总共约 30 分钟）（[GitHub issue 链接](https://github.com/sgl-project/sglang/issues/7826)）。这个问题已在后续版本中修复。
+
+## GitHub Action CI/CD Bug
+
+GitHub Actions 的[自托管 runner](https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/add-runners) 支持为我们在 InferenceMAX 中想要运行的基准测试提供了一个简单直接的解决方案。集成设置快速，允许在各种 GPU 集群上运行可复现的工作流，无需构建自定义基础设施。然而，随着 InferenceMAX 开始扩展以包含更多 job，GitHub Actions 的一些局限性也浮现出来。
+
+每个基准测试变体作为一个单独的 job 运行。对于每个模型，我们对以下不同组合进行基准测试：不同的 GPU、输入/输出序列长度、精度、张量并行度和并发数。这在添加更多配置时导致每个 workflow 的 job 数量产生[组合爆炸](https://en.wikipedia.org/wiki/Combinatorial_explosion)。
+
+具体来说：InferenceMAX 目前在最多 7 种 GPU 类型上对 3 个模型进行基准测试，涵盖 3 种不同的 ISL/OSL 对、2 种精度设置，以及大约 4 种并发数和张量并行选项。并非每个模型都使用所有可能的配置，但这个最坏情况估算给出 3 × 7 × 3 × 2 × 4 × 4 = 2016 个不同的 job。在这种规模下，GitHub Actions 工作流可视化遇到了限制：服务器在尝试渲染 DAG 时在十秒后超时，导致[错误消息](https://github.com/503.html)。这使得调试运行变得极其困难。我们的解决方法是将单个每夜 workflow 拆分为三个，按 ISL/OSL 对分拆。这将每个 workflow 的 job 数从大约 1500 减少到 500，服务器似乎可以可靠地处理。
+
+另一个 Bug 涉及使用 [download-artifact@v5](https://github.com/actions/download-artifact) action 时的硬限制。在每次完整扫描 workflow 结束时，会运行一个 job 来收集和汇总所有 job 的性能结果，这些结果作为 workflow 的 artifact 存储。作为收集过程的一部分，会调用 download-artifact@v5 action。它初始化一个 [artifact client](https://github.com/actions/toolkit/blob/main/packages/artifact/src/internal/client.ts)，该 client 反过来调用一个 [list artifacts 函数](https://github.com/actions/toolkit/blob/main/packages/artifact/src/internal/find/list-artifacts.ts)（需要列出所有 artifact 然后通过模式匹配找到请求的那个），该函数出于"性能原因"强制执行了 1000 的硬限制。据称当 client 尝试列出超过 1000 个 artifact 时应该会打印一个警告，但我们从未观察到这种行为。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/44f22920-4b5c-4e9c-9606-bd0199e77bd0_2351x1121.png"
+  caption="来源：GitHub"
+/>
+
+我们要感谢 Scott Guthrie 将我们与 GitHub 的合适人员对接，并感谢这些团队成员帮助我们实施了针对这些 Bug 的临时解决方案。我们期待继续使用 GitHub Actions 来创建开源世界中最大的 GPU CI/CD 集群之一。
+
+## 对 Nvidia 和 AMD 的建议
+
+尽管大量用户和 GPU 在 SGLang 和 vLLM 上运行，Nvidia 一直将大部分推理工程师资源分配给 TensorRT-LLM 的开发，而投入到支持 SGLang 和 vLLM 的工程资源相对较少。我们建议 Jensen 分配更多推理工程资源来支持和贡献 vLLM 和 SGLang 等流行推理引擎。这将使 Nvidia 更好地履行其加速工作负载的使命，无论用户选择哪种推理引擎。
+
+此外，ML 社区将受益于 Nvidia 为 QA 其 Blackwell 软件投入更多的时间和资源，以最大限度地减少终端用户在新平台上部署应用时遇到的 Bug 数量。在开发 InferenceMAX 的过程中，我们遇到了许多仅在 Blackwell 上出现、而在 Hopper 或其他平台上不存在的 Bug。
+
+在 AMD 方面，我们建议他们减少需要手动启用才能获得合理性能的 ROCm 特定标志数量。AMD 已经认识到这一点，并已开始着手确保优化配置默认启用。事实上，许多减少所需标志数量的更改已经合入了 master 分支。
+
+我们对 Nvidia 的 Blackwell 平台提出了相同的建议，并建议 Nvidia 通过[默认启用性能优化](https://github.com/vllm-project/vllm/issues/25689)来减少获得合理性能所需的[标志数量](https://github.com/vllm-project/vllm/pull/25924)。
+
+## InferenceMAX 后续计划
+
+在接下来的几个月内，我们将通过集成 Google TPU 和 Amazon Trainium 来扩展 InferenceMAX 的硬件覆盖范围，并计划在未来两个月内上线。这将实现跨 AMD、NVIDIA、Google 和 AWS 加速器的统一、同等条件对比，也标志着 InferenceMAX 向成为全行业真正跨供应商的开放基准测试平台迈出了重要一步。
+
+此外，我们还将推出另一项计划——对 FP4 模型进行每夜评估（eval），包括 MATH-500 和 GPQA-Diamond，使社区能够以一致、透明的方式衡量吞吐量与质量的权衡。这将有助于揭示低精度推理如何影响不同模型家族和部署场景下的准确性。此外，我们还将追踪输出 token 吞吐量以创建更全面的洞察。
+
+在 NVIDIA 和 AMD 系统方面，多项令人兴奋的计划正在推进中。我们正在 MI300 和 MI355 系列 GPU 以及 B200 GPU 上开展 DeepSeek 的分离式预填充 + 多节点专家并行配置的工作，测试这些高级并行优化如何在推理工作负载上实现扩展。同时，我们也期待测试 HGX B300 Blackwell Ultra 和 GB300 NVL72 Blackwell Ultra，以了解它们相对于 GB200 NVL72 的性能提升。
+
+InferenceMAX 并不完美，但我们坚信我们正朝着正确的方向前进——打造一个能够跟上 AI 软件进步步伐的基准测试——并将继续整合来自 AI 芯片供应商、前沿实验室和大型加速器消费者的反馈。
+
+接下来，我们将深入分析 InferenceMAX v1 中当前使用的各 GPU（H100、H200、B200、GB200 NVL72、MI300X、MI325X、MI355X）的 TCO 各组成部分。
+
+## 超大规模厂商总拥有成本——Hopper、Blackwell、GB200 NVL72、MI300X、MI325X、MI355X
+
+<Blur>
+
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
+
+</Blur>
+
+---
+
+_本文继续发表在我们的 Substack 上。[订阅 SemiAnalysis](https://newsletter.semianalysis.com/subscribe) 以阅读完整文章。_
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "什么是 InferenceMAX？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "InferenceMAX 是一个开源的自动化基准测试项目，旨在实时追踪 ML 推理性能。它每夜在数百块芯片上运行完整的基准测试套件，持续对全球最流行的开源推理框架和模型进行重新测试。免费的实时仪表板可在 inferencemax.ai 访问。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "InferenceMAX 对哪些 GPU 进行基准测试？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "InferenceMAX v1 对 NVIDIA GB200 NVL72、B200、H200 和 H100，以及 AMD MI355X、MI325X 和 MI300X 进行基准测试。该项目正在扩展以纳入 Google TPU 和 AWS Trainium 后端，使其成为首个真正的多供应商开放基准测试项目。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "InferenceMAX 基准测试多久运行一次？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "InferenceMAX 使用 GitHub Actions 在 GPU 集群上编排基准测试运行，每夜执行完整的基准测试套件。这种每夜节奏确保结果能够跟上 vLLM、SGLang 和 TensorRT-LLM 等推理引擎的快速软件改进步伐。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "在 InferenceMAX 基准测试中 AMD 和 NVIDIA GPU 表现如何对比？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "AMD 和 NVIDIA GPU 都能在不同工作负载下提供有竞争力的性能。例如，MI300X 在低交互性水平下凭借更好的内存带宽表现出相对 H100 的强劲性能，MI355X 在某些 GPT-OSS 120B 工作负载下的每百万 token TCO 成本可以击败 B200。然而，NVIDIA B200 在所有工作负载类型上的 Llama 70B FP4 性能都大幅超越 MI355X。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "谁支持 InferenceMAX 项目？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "InferenceMAX 得到了主要行业领袖的支持，包括 Lisa Su（AMD）、Jensen Huang（NVIDIA）、Scott Guthrie（Microsoft）和 Peter Hoeschele（OpenAI Stargate）。Crusoe、CoreWeave、Nebius、TensorWave、Oracle 和 Together AI 提供了计算资源。PyTorch Foundation、vLLM 和 SGLang 维护者也为该项目背书。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "InferenceMAX 对哪些模型进行基准测试？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "InferenceMAX v1 对 LLaMA 3 70B（代表稠密企业模型部署）、DeepSeek V3 670B（作为 OpenAI 等前沿稀疏 MoE 模型架构的代理）和 GPT-OSS 120B MoE（最接近 GPT-5 mini 的较小稀疏模型）进行基准测试。基准测试根据硬件支持跨 FP8、FP4 和 MX4 精度进行。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper.mdx b/packages/app/content/blog/zh/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper.mdx
new file mode 100644
index 00000000..0b71fc1b
--- /dev/null
+++ b/packages/app/content/blog/zh/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper.mdx
@@ -0,0 +1,943 @@
+---
+title: 'InferenceX v2：NVIDIA Blackwell 对决 AMD 与 Hopper — 前身为 InferenceMAX'
+subtitle: 'GB300 NVL72、MI355X、B200、H100、分离式推理、宽专家并行、大规模混合专家、SGLang、vLLM、TRTLLM'
+date: '2026-02-16'
+publishDate: '2026-02-16'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - announcement
+---
+
+## 引言
+
+InferenceXv2（前身为 InferenceMAX）建立在 InferenceMAXv1 奠定的基础之上。InferenceMAXv1 是[我们开源的持续更新推理基准测试](https://github.com/SemiAnalysisAI/InferenceX)，已为 AI 推理性能和经济性评估树立了新的行业标准。InferenceMAXv1 超越了传统静态、时间点式的基准测试，在数百块芯片和主流开源框架上持续运行测试。[免费仪表板在此。](https://inferencemax.ai/)
+
+[我们的基准测试已被广泛复现、验证和/或获得几乎所有主要算力采购方的支持](https://inferencemax.semianalysis.com/quotes)，包括 [Google Cloud](https://cloud.google.com/blog/products/compute/scaling-moe-inference-with-nvidia-dynamo-on-google-cloud-a4x)、[Microsoft Azure](https://blog.aks.azure.com/2025/10/24/dynamo-on-aks#enterprise-scale-inference-experiments--dynamo-with-gb200-running-on-aks)、[Oracle、OpenAI](https://inferencemax.semianalysis.com/quotes) 等众多机构。
+
+InferenceXv2 在此基础上进一步拓展：针对大规模 DeepSeek MoE，在分离式推理（分离预填充，简称"disagg"）配合宽专家并行（wideEP）优化的场景下，覆盖 NVIDIA 过去 4 年发布的全部 **6 款西方市场 GPU SKU**，以及 AMD 过去 3 年发布的所有西方市场 GPU SKU。InferenceXv2 在一次完整的基准测试运行中总共使用了近 1000 块前沿 GPU。
+
+在今天的发布中，InferenceXv2 成为首个对 Blackwell Ultra GB300 NVL72 和 B300 进行全帕累托前沿曲线基准测试的测试套件，也是首个第三方 disagg+wideEP 多节点 FP4 和 FP8 MI355X 性能基准测试。在未来的 InferenceX 迭代中，我们将继续重点关注分离式服务配合宽专家并行，因为这正是 OpenAI、Anthropic、xAI、Google DeepMind、DeepSeek 等前沿 AI 实验室以及 TogetherAI、Baseten、Fireworks 等高级 API 服务商在生产环境中实际部署的方案。在本文中，我们还将解析[最新 Claude Code Fast mode 功能](https://code.claude.com/docs/en/fast-mode)背后的系统工程原理和经济学。
+
+我们的基准测试完全以 Apache 2.0 协议开源——这意味着我们能够以与 AI 软件生态系统同样快速的步伐推进。如果您喜欢我们的工作并希望给予支持，[请在 GitHub 上点个星标](https://github.com/SemiAnalysisAI/InferenceX)！我们还为 ML 社区的所有人提供了免费数据可视化工具 [https://inferencex.com](https://inferencex.semianalysis.com/)，供大家自行探索完整数据集。
+
+我们将第一时间添加 DeepSeekv4 及其他热门中国前沿模型的支持，因为在过去 6 个月中，我们已清理了大量技术债务，现在能够以[稳定的基础设施快速推进](https://www.cnet.com/tech/mobile/zuckerberg-move-fast-and-break-things-isnt-how-we-operate-anymore/)。今年稍后，我们还将把 TPUv7 Ironwood 和 Trainium3 纳入 InferenceX！如果您想在获得有竞争力的薪酬的同时为我们的使命贡献力量，[请在此申请](https://app.dover.com/apply/semianalysis/2a9c8da5-6d59-4ac8-8302-3877345dbce1)。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1e9a8353-ca83-4bd3-ab4a-3541132f6665_1680x1175.png"
+  caption="来源：InferenceMAX GitHub"
+/>
+
+## 关键观察与重点结果
+
+在 FP8 MI355X disagg+wideEP SGLang 配置下，AMD 与 FP8 B200 disagg+wideEP SGLang 相比展现出有竞争力的性价比表现（perf per TCO），但与广泛使用的 Dynamo TRTLLM B200 FP8 相比，TRT 继续展现出碾压级优势。AMD SGLang 分离预填充+wideEP FP8 能够匹配 NVIDIA SGLang 的性能，这是令人振奋的好消息。
+
+我们还发现，在单节点聚合服务场景下，AMD 的 SGLang 在 FP8 上提供了比 NVIDIA SGLang 更好的性价比。[同样令人欣慰的是，AMD 已弃用其 vLLM 的二等公民分支，转而更贴近上游，并致力于提供一流体验。](https://x.com/vllm_project/status/2013928644302033208) 敬请期待我们的"AMD 现状"文章，我们将详细讨论 AMD 进步迅速的领域以及进展迟缓的领域。我们建议 NVIDIA 除了 TRTLLM 引擎外，进一步加大对 SGLang 和 vLLM 生态系统的投入。[Jensen 需要为 SGLang 和 vLLM 等开源生态系统调配更多资源与工程师。](https://www.linkedin.com/in/akbarnurlybayev?trk=feed-detail_main-feed-card_feed-actor-image)
+
+在前沿大规模推理服务所采用的最新推理技术（如 disagg 预填充+wideEP+FP4）方面，NVIDIA 的 B200、B300 以及 ASU 级别的机架规模 GB200/GB300 NVL72 在 SGLang 和 TRTLLM 上均实现了碾压级领先。NVIDIA GPU 在能效方面同样占据主导地位，在所有工作负载上，每 token 的全口径预分配能耗（皮焦耳）都低得多。
+
+转向 AMD 方面，我们发现其系统和软件在推理上最大的问题是*[可组合性](https://en.wikipedia.org/wiki/Composability)*。也就是说，AMD 的许多推理优化实现单独运行时表现良好，但当多种优化组合使用时，结果并不如预期那般有竞争力。具体而言，分离预填充、wideEP 和 FP4 推理优化的可组合性亟需大幅改进。
+
+虽然仅启用部分 SOTA 推理优化时 AMD 的性能具有竞争力，但当同时启用头部实验室使用的全部三项优化时，AMD 的性能目前无法与 NVIDIA 匹敌。我们强烈建议 AMD 将重心放在不同推理优化的可组合性上。据了解，AMD 将开始在整个软件栈中关注 FP4+分布式推理的软件可组合性。这一工作将在春节后展开，因为他们的大部分 disagg 预填充+wideEP 核心工程师都在中国。
+
+NVIDIA 的 GB300 NVL72 没有令人失望。与强劲的 H100 disagg+wideEP+MTP 基线相比，FP8 vs FP4 最高达到了 100x 的性能提升，FP8 vs FP8 则达到 65x。在 H100 vs GB200 NVL72 的对比中，我们在 75 tok/s/user 下观察到高达 55x 的实际性能差距。机架规模的 Blackwell NVL72 对 Hopper 形成了碾压，让 Hopper 相形见绌。正如 Jensen 在 GTC 2025 上所说的，[他是首席营收毁灭者。](https://newsletter.semianalysis.com/i/174558496/ai-total-cost-of-ownership-cost-declines)
+
+在 GTC 2024 上，Jensen 声称 Blackwell 相比 H100 的推理性能将提升高达 30x。事实证明，他在 Blackwell 推理性能上做到了低调承诺、超额兑现。这应该能让那些喜欢开"Jensen 数学"玩笑的分析师们暂时消停一段时间。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/2ed3fe4a-93e9-4c47-8fb2-91f17da1b7c5_2392x1418.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## 致谢与 InferenceX™（前身为 InferenceMAX）倡议支持者
+
+我们要感谢 Jensen Huang 和 Ian Buck 对这一开源工作的支持，他们提供了最新的 GB300 NVL72 系统以及代表过去四年所有 GPU SKU 的服务器访问权限。我们要感谢 NVIDIA 团队允许我们在这近 1000 块 GPU 上进行独立基准测试。感谢 Jatin Gangani、Kedar Potdar、Sridhar Ramaswamy、Ishan Dhanani、Sahithi Chigurupati 以及众多其他 NVIDIA 推理工程师帮助验证和优化 Blackwell 与 Hopper 配置。
+
+我们同样感谢 Lisa Su 和 Anush Elangovan 对 InferenceMAX 的支持，感谢他们安排了数十位 AMD 工程师（包括 Chun、Andy、Bill、Ramine、Theresa、Parth 等）为 InferenceMAX 和上游 vLLM/SGLang bug 修复做出贡献，以及在帮助调试和分类 AMD 专有 bug 以优化 AMD 性能方面的积极响应。
+
+我们还要向 SGLang、vLLM 和 TensorRT-LLM 的维护者们致敬，他们构建了世界级的软件栈并将其开源给全世界。您可以在此查看他们关于 InferenceX 的文章：
+
+- [SemiAnalysis InferenceMAX: vLLM maintainers & NVIDIA accelerate Blackwell Inference](https://blog.vllm.ai/2025/10/09/blackwell-inferencemax.html)
+- [GPT-OSS Performance Optimizations: Pushing Pareto Frontier](https://blog.vllm.ai/2026/02/01/gpt-oss-optimizations.html)
+- [SGLang & NVIDIA Accelerating SemiAnalysis InferenceMAX & GB200 Together](https://lmsys.org/blog/2025-10-14-sa-inference-max/)
+
+InferenceX 倡议还获得了来自 OpenAI、Microsoft、vLLM、Tri Dao、PyTorch Foundation、Oracle 等众多主要算力采购方和 ML 社区知名成员的支持。[完整名单请见此处](https://inferencemax.semianalysis.com/quotes)。
+
+## 重要技术概念入门
+
+在本节中，我们将对一些技术概念进行简要介绍，帮助读者更好地理解后续结果。部分读者可能不需要这些内容，可以直接跳到结果分析部分。我们将在结果分析之后对其中一些主题进行更深入的探讨。
+
+## 交互性与吞吐量的权衡
+
+LLM 推理的根本权衡在于吞吐量与延迟。_交互性_（tok/s/user）描述了系统中每个用户接收 token 的速度——它是每输出 token 时间（TPOT）的倒数。_吞吐量_（tok/s）描述了系统在所有用户之间总共能产出多少 token。可以通过批处理请求来获得更高的总吞吐量，但每个请求分配到的算力会减少，因此完成速度会更慢。这类似于乘坐公交车与跑车的选择。公交车服务众多乘客，但频繁停靠耗费时间，不过成本可以由多位乘客分摊。跑车只能搭载一两位乘客，但几乎不会额外停靠，意味着整体行程更快，只是每位乘客的费用高得多。对于周末去公园的人来说公交车可能更合理，而对于需要快速抵达目的地的名人来说跑车可能更好。没有放之四海而皆准的解决方案。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/18c9a3dd-3777-44d5-a3e2-b4d28140df38_2106x1380.png"
+  caption="来源：SemiAnalysis"
+/>
+
+本文将展示的大多数基准测试结果都是一条曲线。在不同的交互性/延迟水平下分析吞吐量非常重要，而不是仅仅看最大吞吐量（通常只能在单一低交互性水平下达到）。在推理领域，没有万能的用例方案。所需的交互性和吞吐量水平取决于具体用例。例如，实时语音模型需要极低的延迟，以便终端用户能与 LLM 维持自然的"对话"，而基础问答聊天机器人则允许更高的延迟。我们将这一判断留给读者，请根据曲线并应用此原则来定位自己的用例落在吞吐量-交互性曲线的哪个位置。
+
+成本/TCO 性能比 vs 交互性/端到端延迟曲线大致与吞吐量 vs 交互性/端到端延迟曲线一致：更多的 token/小时意味着更低的每 token 成本，因为固定的 $/小时成本被分摊到更多产出的 token 上。
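+
+上述关系可以用几行 Python 粗略示意：交互性是 TPOT 的倒数，每百万 token 成本等于每 GPU 每小时的固定成本除以该小时内产出的 token 数。这里的函数名和示例数值（TPOT 20 ms、每 GPU 每小时 2 美元、每 GPU 1000 tok/s）都是为演示而假设的，并非 InferenceX 的实测数据。
+
```python
def interactivity_from_tpot(tpot_seconds: float) -> float:
    """交互性（tok/s/user）是每输出 token 时间（TPOT）的倒数。"""
    return 1.0 / tpot_seconds


def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_sec_per_gpu: float) -> float:
    """把固定的每 GPU 每小时成本分摊到该小时内产出的 token 上。"""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# 以下数值均为假设，仅用于演示
print(interactivity_from_tpot(0.02))         # TPOT = 20 ms
print(cost_per_million_tokens(2.0, 1000.0))  # 2 美元/GPU/小时，1000 tok/s/GPU
print(cost_per_million_tokens(2.0, 2000.0))  # 吞吐量翻倍后的成本
```
+
+在同样的每小时成本下，吞吐量翻倍会让每百万 token 成本减半，这也是成本曲线与吞吐量曲线形状大致一致的原因。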
+
+### 预填充和解码阶段
+
+推理包含两个主要阶段：预填充和解码。*预填充*发生在请求生命周期的第一次前向传播中。由于请求中的所有 token 被并行处理，这一阶段计算密集。该阶段负责为序列"填充" KV 缓存。预填充之后，响应逐 token 生成（即*解码*）。每次前向传播都从 HBM 加载序列的整个 KV 缓存，而仅为单个 token 执行计算，因此解码是内存（带宽）密集型操作。
+
+当预填充和解码在同一引擎上执行时，预填充会不断打断解码批次，导致整体性能下降。
+
+### 分离预填充
+
+分离预填充（又称 PD 分离或简称"disagg"）是将预填充和解码阶段分离到不同 GPU 池或集群上的做法。这些独立的预填充和解码池可以分别调优和扩展，以匹配工作负载需求。
+
+## 张量并行、专家并行、数据并行（TP、EP、DP）
+
+TP 允许在小批次下最大化交互性，但必须在每一层执行一次 all-reduce。EP 对专家进行分片，利用 MoE 的稀疏性，缺点是 MoE 层需要执行 all-to-all 集合通信（比 all-reduce 等简单集合通信代价更高），并且在小批次下可能出现负载不均衡。DP 将整个模型（或模型的一部分，如注意力机制）复制到多组 GPU（rank）上，然后在各 rank 之间进行请求负载均衡。DP 最易于扩展，但重复了权重加载，在大规模部署下可能造成浪费。
+
+## 跟踪随时间的改进
+
+InferenceX 的核心目标之一是可视化性能随时间的改进。虽然新芯片的发布频率约为每年一次（O(年)级别），但软件版本的更新频率约为每周一次（O(周)级别）。我们的目标是持续使用最新最先进的软件改进来更新配方，并对各种配置进行基准测试。
+
+## DeepSeek R1
+
+AMD 团队已为所有 SGLang DeepSeek R1 FP4 配置显著提升了性能。在相同的交互性水平下，AMD 在不到 2 个月的时间内几乎将吞吐量翻了一番。此外，我们已推动 AMD 将其分支 SGLang 镜像中的性能优化上推至官方 SGLang 镜像。从 2025 年 12 月到 2026 年 1 月，AMD 的软件性能提升了多达 2x。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d0bd5df8-c675-4dce-a853-dfa6f4d381af_1498x1102.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+为了更接近一流体验，AMD 需要通过计算资源贡献和代码贡献来增强对 vLLM 和 SGLang 维护者的支持，并安排更多 AMD 代码评审人员来加速 AMD PR 合入上游的审核流程。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f7fc9e49-b04b-41b0-b0ec-df0d912c0a3c_800x434.jpeg"
+  caption="来源：SemiAnalysis"
+/>
+
+另一方面，NVIDIA 的结果更为稳定，B200 SGLang 在类似时间段内仅有小幅改进。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/19e48a4c-0c1b-4681-b180-03ef0c8c2ce3_2346x1340.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+许多成熟 SKU 的改进幅度很小。例如，H200 TRT 单节点在自 10 月以来的 4 个月间性能没有变化，但这是因为 Hopper 的支持从第一天起就很出色，性能自始至终都接近该工作负载的理论峰值，使得交付增量性能提升变得困难。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/ca0fbb96-36c4-4040-a022-49f2185b661a_2074x1224.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+MI300X 和 MI325X 有了一些改进，主要来自最新的 SGLang 版本。请注意，在 InferenceX 的大部分历史中，AMD 使用的是未上推到上游的"私有" ROCm 镜像，因此约 2026 年 1 月之前的运行结果不能与更新的结果直接比较。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/4b8c3b9b-7536-4cba-8b85-854d25169864_1922x1726.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+GB200 Dynamo TRT-LLM disagg 也有了显著改进，最大吞吐量在一个多月内提升了 20%。我们还看到中等交互性水平的改进，这部分采用了宽 EP。这可能归因于 GB200 上日趋成熟的宽 EP 内核。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/db4fa8dc-176c-4224-9ab5-6ebfe8f6af9c_1493x1280.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+B200 SGLang 自我们去年 10 月首次发布以来，FP4 和 FP8 场景均呈现稳步持续改进，在某些交互性水平下每 GPU 吞吐量翻了一番。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1d5636b8-69d8-4676-9c3c-823da8d03514_2638x1840.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+对于 MI355X 分离式推理服务，AMD 推荐使用 SGLang 配合 MoRI。[MoRI 是 AMD 的 MoE 分发/聚合集合通信与 KV 缓存传输库](https://github.com/ROCm/mori/tree/main)，由 AMD 精锐的中国工程团队从第一性原理出发构建。虽然 MoRI 在开放 CI 和测试方面仍需大量工作，但我们坚定支持 MoRI 的发展方向。这是因为 MoRI 没有沿用 AMD 过去的做法（即把 NVIDIA 的 NCCL fork 成 RCCL），而是汲取了 RCCL/NCCL 的经验教训，从零开始构建了一个全新的库。MoRI 的使用也在一个多月的时间内带来了良好的加速效果，在 20-45 tok/s/user 交互性范围内，每 GPU 吞吐量提升了 20% 以上。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/6b0d71aa-e6aa-425f-bbcc-25e2c1de2f4d_1900x1744.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## GPT-OSS 120B
+
+对于 MI300X 和 MI325X，我们在各方面看到了微小的改进。一些 AITER 优化帮助提升了 MI300X 在所有交互性水平下的性能，切换到上游 vLLM ROCm 镜像也带来了改进。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/10e95c72-6372-415e-8e51-d8021815182c_2142x1784.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+就 MI325X 而言，下游 ROCm 分支镜像（2025 年 10 月 5 日运行时使用）中存在的性能优化似乎并非全部都已合入官方 vLLM ROCm 镜像。
+
+遗憾的是，MI355X 目前仍在使用 vLLM 0.10.1 版本的分支构建 `rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1`。我们本希望现在已经更新了，但不幸的是，当前的官方镜像（撰写本文时为 0.15.1）尚未针对 MI355X 进行优化，并且会遇到硬错误。我们在 MI355X 上运行 vLLM 0.14 时也曾遇到硬错误崩溃。业界消息是 vLLM 0.16.0 将最终提供 MI355X 所需的全部改进。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1755b498-ab4d-4c02-b6fd-152ee538a34d_2126x1788.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+回到 NVIDIA 的系统，Hopper 和 Blackwell 在 vLLM 0.11.2 和 0.13.0 之间都实现了稳步性能提升。我们即将把 NVIDIA GPU 的配方更新到最新 vLLM 版本，预计切换后将获得更大的性能提升。我们还观察到最新的 TRT-LLM 1.2.0 版本带来了性能提升。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/53a95093-3d25-4d01-9d64-64ea9e113749_2376x1760.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/77c591fb-74ef-46ce-bba2-9f82a52f5f6f_2362x1752.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## 分离式推理框架
+
+NVIDIA 使用 Dynamo 作为其分离式推理方案。[Dynamo](https://docs.nvidia.com/dynamo/design-docs/overall-architecture) 是一个专为多节点分布式推理设计的推理框架，具备预填充-解码分离、请求路由和 KV 缓存卸载等技术。它与推理引擎无关，允许我们在基准测试中使用 SGLang 和 TRT-LLM 作为后端。对于 AMD，我们使用 SGLang 配合两种不同的 KV 缓存传输框架：MoRI 和 Mooncake。[MoRI](https://github.com/rocm/mori) 是一个高性能通信接口，专注于 RDMA 和 GPU 集成，提供网络集合操作和专家并行内核等应用。Mooncake [最近加入了 PyTorch 生态系统](https://pytorch.org/blog/mooncake-joins-pytorch-ecosystem/)，支持预填充-解码分离以及多种容错多节点功能。
+
+## DeepSeek Disagg + WideEP 结果深入分析
+
+在几乎所有交互性水平下，disagg 在每 GPU 总 token 吞吐量方面都优于聚合推理（灰色线条）。多节点分离式预填充相比单节点聚合服务有着碾压级的优势。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/7ace6118-029a-44df-b0ef-2e7595e6f388_2032x1339.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+NVIDIA 持续为 B200/GB200 FP8 推送新更新。最新数据展示了 DeepSeek FP8 B200 TRT 单节点（MTP 启用/禁用）vs GB200 Dynamo+TRT disagg（MTP 启用/禁用）的对比。这表明在改进机架规模推理软件和 wideEP 内核方面持续投入了工程力量。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/29485790-238d-4e1d-aa48-0559c79c9855_2132x1247.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在比较 MI355X 分离式推理与聚合推理时，我们注意到了类似的模式。分离式推理仅在低交互性、高批次的情况下超过聚合推理。这在 FP4 下尤为明显，可能源于优化不够充分的内核。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/25a7c41e-fa99-4117-8e49-ac121a22bf0f_2092x1241.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在 MI355X 上将 disagg 预填充+wideEP 与 FP4 组合使用时，我们观察到性能表现不佳。
+
+虽然理论建模显示 MI355X 上的 disagg 推理应该远优于单节点，但由于 ROCm 软件栈在组合多种 SOTA 推理优化时缺乏内核和集合通信优化，disagg 在较高交互性水平下的实际表现反而更差。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/2d82d32f-089b-405d-b4ef-94b4956676ed_2078x1233.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+### NVIDIA TensorRT-LLM 与 NVL72
+
+TensorRT-LLM 已在全球范围内为 TogetherAI 等服务商每小时处理数十亿 token，它真正让 GB200 NVL72 和 GB300 NVL72 大放异彩，在高吞吐量下性能提升超过一倍。MTP 进一步增强了这些结果，充分释放了芯片的全部潜力。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d4628887-37be-4563-ad68-091282e20ddf_2350x1486.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/fcea5602-9449-4cd3-9b9d-d9f58cc83f23_2296x1458.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+NVL72 系列更大的并行规模（world size）带来的优势也体现在成本图表上。在固定 60 tok/s/user 交互性水平下，每块 GB200 NVL72 GPU 产出的 token/s 略低于 B200 的三倍。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/36087d46-94e1-4629-90cb-4b0dfad1a8c1_1856x827.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+随着交互性提高，这一差距缩小。在 130 tok/s/user 时，GB200 NVL72 几乎没有优势，在 $/百万 token 的基准上甚至更贵。在低批次下，推理工作负载足够小，可以在单个 HGX 节点的 NVLink 域内（即 8 块 GPU）运行，GB200 NVL72 的大规模扩展优势开始消失。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/3e287d0e-947f-4fd7-9dc8-d697fad9ac7d_1781x822.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## NVIDIA 与 AMD 分离预填充对比
+
+在今天发布的 InferenceXv2 中，ML 社区首次能够看到开源 MI355X 分布式推理的完整帕累托前沿。我们展示了 B200 和 MI355X 在启用和未启用 MTP 的情况下的帕累托曲线。
+
+对于 FP8 分离预填充，MI355X（MoRI SGLang）与 B200（Dynamo SGLang）具有相当的竞争力。这两种配置均未使用宽 EP，所有预填充/解码实例最多使用 EP8。在吞吐量 vs 交互性帕累托前沿的两端，MI355X 略落后于 B200。然而，MI355X disagg 在曲线中段的某些交互性水平上有轻微优势。B200 和 MI355X 都受益于 MTP 的使用，且我们观察到两款芯片使用 MTP 时的相对性能提升幅度相同。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/99728443-e697-49cc-8416-7a380c60ad12_2147x1249.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+然而，如果我们仅衡量输出（解码）token 吞吐量，可以看到在较低交互性水平下 B200 的输出 token 吞吐量远高于 MI355X。请注意，在查看分离式推理配置的仅输出 token 吞吐量时，我们按解码 GPU 数量而非总 GPU 数量进行归一化。B200 和 MI355X 运行推理任务时可能使用了不同数量的解码 GPU，但关键是无论解码在什么配置上运行，B200 都能更快地完成解码任务。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f67a92c3-b159-4b2a-bf87-ecbb7002b23c_2118x1306.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+尽管 MI355X 在 FP8 disagg 上具有竞争力，但其 FP4 性能受到可组合性问题的影响。AMD 单节点 FP4 性能尚可，但当我们将 AMD FP4 分离预填充与 NVIDIA 进行比较时，性能表现不佳，MI355X 被 NVIDIA 的 B200 全面碾压。在 1k1k 场景下，MI355X（MoRI SGLang）启用 MTP 也仅勉强胜过未启用 MTP 的 B200（Dynamo SGLang）。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/a5b9e7bc-c484-4400-9ffe-96ed4bbfb70f_2138x1236.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+一旦引入 Dynamo TRT-LLM，B200 的性能进一步提升，以至于即使启用 MTP 的 MI355X 也无法匹配 B200 配合 Dynamo TRT-LLM 和 MTP 的性能。MI355X 只有在使用 MTP 时才能匹配 B200（未启用 MTP）的性能，且仅限于约 60 tok/s/user 到约 120 tok/s/user 的交互性范围。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/0be8b8f5-b627-4dc9-938b-4a407ef19c34_2103x1233.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在比较 Dynamo TRTLLM B200 分离预填充与 SGLang MoRI MI355X 分离预填充时，由于 TRTLLM 上分离预填充实现更为成熟，AMD 被全面碾压。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/89827e17-6cfd-42f1-b250-d7f07cbe6a09_2120x1242.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/c53a37b8-dd9f-4142-b114-60e6e2c7f3e7_3446x1946.png"
+  caption="来源：Dwarkesh Podcast 与 SemiAnalysis"
+/>
+
+下图展示了构成 MI355X（MoRI SGLang）帕累托前沿的各种并行配置。请注意，目前宽 EP 尚未用于任何数据点（即没有 EP 16、32 等配置）。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/e1b62a52-bd6a-4cd1-82e7-65b6903d82ac_2996x1774.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## 解析推理服务商的单元经济学
+
+以下是 OpenRouter 上所有服务 DeepSeek R1 0528 FP8 的推理服务商列表，包含其每百万输入/输出 token 的成本和平均交互性。排除 Chutes 后，中档服务商的交互性约为 35 tok/s/user。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/ce79108c-8341-4100-86de-943d8ca3c34e_916x1190.png"
+  caption="来源：OpenRouter"
+/>
+
+接下来，我们可以使用真实的 InferenceX 数据来插值计算 35 tok/sec/user 交互性水平下每百万输入/输出 token 的成本，鉴于上述数据，这是一个合理的交互性水平。
+
+正如我们在文章后面提到的，这最好被理解为*基线*数据，并不完全代表真实世界的推理场景，主要因为 InferenceX 使用随机数据进行基准测试并禁用了前缀缓存。换句话说，实际的性能/成本*至少*能达到这个水平。同样值得注意的是，并非*每个 GPU* 在*每个*交互性水平都有数据点。因此，我们无法在每个交互性级别进行*精确*比较。尽管如此，我们认为下面的柱状图用插值结果代替精确数据点，是一种（非常）合理的比较方式。
+
+比较该交互性水平下的 disagg+wideEP 配置，我们可以看到分布式推理技术在性价比和整体吞吐量方面的显著效果。我们还看到大规模扩展域（如 GB300 和 GB200 NVL72）在每 GPU 总吞吐量上的绝对主导地位。
+
+值得注意的是，在该交互性水平下（8k1k 工作负载类型），启用 MTP 的 B200 能够实现最佳性价比。以下我们还列出了每种 GPU 的总拥有成本（TCO）（自有 - 云服务商）：
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f200bfa6-02b5-464f-a4ea-ffe88cb6ed49_2520x81.png"
+  caption="来源：SemiAnalysis TCO 模型"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/fd2a22ee-c300-4fbd-a782-bdf5ac918c02_1882x1776.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/cf5af414-7def-47ca-bcee-7e4123d29560_1932x1760.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1eb79a88-ef6f-40fc-ae5d-392967666f11_1874x1772.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+让我们利用上述发现深入分析大规模 LLM 服务的单元经济学。从上面的 OpenRouter 数据可以看到，Crusoe 以 36 tok/sec/user 的交互性提供服务，输入 token 价格为 $1.35/M，输出 token 价格为 $5.40/M。如果我们假设没有缓存命中，并且 Crusoe 至少使用了配备 MTP、disagg 和宽 EP 等 SOTA 推理技术的 H200，上述数据表明他们的成本*不超过* $0.226/M 输入 token 和 $2.955/M 输出 token，即输入 token 的毛利率高达 83%（折旧计入销售成本），输出 token 的毛利率为 45%。
+
+当然，这些假设可能不*完全*正确，且这些计算没有考虑停机时间或利用率不足的情况，但这展示了使用 InferenceX 数据可以进行的一些有趣分析。更多关于推理经济学的分析可以在 [SemiAnalysis Tokenomics Model](https://semianalysis.com/tokenomics-model/) 中找到。
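+
+上文对 Crusoe 的毛利率估算可以用下面的小例子复核。价格与成本数字取自正文；`gross_margin` 这个函数名是为演示而起的假设名称，计算同样没有考虑停机时间、利用率不足或缓存命中。
+
```python
def gross_margin(price_per_m_tokens: float, cost_per_m_tokens: float) -> float:
    """毛利率 = (售价 - 成本) / 售价。"""
    return (price_per_m_tokens - cost_per_m_tokens) / price_per_m_tokens


# 正文中的数字：售价来自 OpenRouter，成本上限来自 InferenceX 的插值
input_margin = gross_margin(1.35, 0.226)
output_margin = gross_margin(5.40, 2.955)

print(f"输入 token 毛利率: {input_margin:.1%}")
print(f"输出 token 毛利率: {output_margin:.1%}")
```
+
+由于成本是*上限*估计，按这一口径算出的毛利率应理解为下限。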
+
+OpenRouter 数据还显示 Nebius AI Studio (Fast) 以 167 tok/sec/user 的交互性提供 DeepSeek FP4 服务，输入 token $2/M，输出 token $6/M。相应调整 InferenceX 中的交互性水平，我们可以看到以下数据。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f41d237c-16c2-4a9d-b681-a6668b01f62b_2398x1526.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/4a4dc479-c4ae-4fa2-8f35-76858a36a401_2276x1540.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/7b9491f1-0669-458d-84ea-3921c5aeb10f_2370x1544.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在如此高的交互性下，有必要采用 MTP 等推测解码技术来实现足够高的吞吐量，使推理具有经济可行性。幸运的是，MTP 能够在对模型精度影响极小的情况下提升吞吐量。我们将在文章后续部分进一步讨论 MTP 及其如何用于提升吞吐量/降低成本。
+
+最后，我们再展示一张 FP8 DeepSeek 工作负载在 125 tok/s/user 下的图表。这是另一个低延迟工作负载，MTP 在其中显著改善了经济可行性。与前面的例子一样，我们注意到在这些较高交互性范围内，最便宜的配置都使用了 MTP。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/ccabb1a5-220a-4623-a615-245053808f24_2086x1738.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+### NVIDIA 分离预填充与 WideEP
+
+EP 需要 all-to-all 通信，每块 GPU 都需要向其他所有 GPU 发送 token。这对带宽需求极高。请回忆，NVIDIA 的服务器有两个独立的网络域——NVLink 扩展域（scale-up）和 Scale-out 域，后者通常使用 InfiniBand 或以太网作为网络协议。
+
+- NVLink 域（NVL72 机架内）：72 块 GPU 通过 NVLink 连接，每 GPU 单向带宽 900 GB/s。这大约是基于 InfiniBand/以太网的 Scale-out 网络带宽的 7-10 倍。
+- InfiniBand/RoCEv2 以太网（NVL72 机架外）：通常每 GPU 单向 400-800 Gbit/s（50-100 GB/s）。请注意，我们对 NVIDIA 的所有测试都在基于 InfiniBand 的集群上进行。
+
+TP 将每一层的权重矩阵分片到各 GPU 上。这意味着每一层的每个 token 最多需要两次 all-reduce 通信（列并行 GEMM 后一次，行并行 GEMM 后一次）。对于 EP，all-to-all 仅在 MoE 层执行。每块 GPU 只发送被路由到相应专家的 token。这意味着与 TP 相比，EP 在所有层的通信成本更低。
+
+由于 EP 的 all-to-all 通信带宽需求随参与者数量增长，在跨越较慢的 IB/以太网网络之前尽量保持在高带宽 NVLink 域内是更好的选择。使用 NVL72，72 块 GPU 的 EP 可以在不离开 NVLink 的情况下完成，而前代产品（仅 8 块 GPU NVLink 域）只能在 8 块 GPU 之间以 NVLink 速度执行 EP，超出后就要使用较慢的 IB/以太网网络。
+
+宽 EP 在权重加载效率方面也有重大优势。对于 DeepSeek R1 这样的模型，解码是内存带宽受限的：瓶颈在于 GPU 从 HBM 加载权重的速度。使用宽 EP（例如 DEP32），32 块 GPU 共同持有并加载一次 670B 权重，每块仅加载其分片（约 21B）。所有 32 块芯片的总 HBM 带宽被用于加载模型的单个副本。相比之下，使用更窄的 EP 配合更多 DP 副本（例如 5xDEP8），5 个副本中的每一个都需要完整的 670B 权重副本——系统中总共有 5×670B = 3.35T 的冗余权重加载。EP 将权重在芯片间分摊；DP 则复制它们。这就是为什么更宽的 EP（在 NVLink 等高带宽互连的支持下）能带来显著更好的每 GPU 吞吐量。
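+
+这一段的算术可以用下面的草图复核。它沿用正文的简化假设：把 670B 权重视为全部按 EP 分片（忽略需要在各 rank 上复制的非专家权重），变量名均为演示用的假设名称。
+
```python
MODEL_PARAMS_B = 670  # 正文使用的近似参数量（单位：十亿）


def shard_per_gpu_b(ep_size: int) -> float:
    """EP 把一份权重均摊到 ep_size 块 GPU 上，每块只加载自己的分片。"""
    return MODEL_PARAMS_B / ep_size


def total_weights_loaded_b(num_replicas: int) -> int:
    """每个 DP 副本都要完整加载一份权重。"""
    return MODEL_PARAMS_B * num_replicas


print(shard_per_gpu_b(32))        # DEP32：每块 GPU 的分片大小
print(total_weights_loaded_b(1))  # DEP32：全系统只加载一份权重
print(total_weights_loaded_b(5))  # 5xDEP8：五个副本各加载一份
```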
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/7ed2a472-3511-4b29-afbd-0c593795085a_2434x1430.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+通常，在低并发度下 TP 更为合适，这主要是出于负载均衡的考虑。在小批次下，EP 会因 token 到专家的路由不均匀而导致部分 GPU 利用率不足、另一些则过载。TP 避免了这一问题，因为每块 GPU 持有每个专家的一个切片，始终获得等量的工作。在低并发度下，这种负载不均衡的代价超过了 TP 额外通信开销的成本。
+
+在更高并发度下，这种权衡发生变化。较大的批次使专家激活分布更加均匀，EP 的通信和权重加载优势开始主导 TP 昂贵的逐层 all-reduce。在曲线中段，混合 TP+EP 配置在两方面取得平衡——在每个专家内使用小规模 TP 组实现负载均衡，同时在更大范围的 GPU 上使用 EP 来分摊权重并减少通信。
+
+对于更高的交互性水平（小批次），大规模扩展世界往往不能带来更强的性能。B300 通过 IB 的 disagg 与 GB300 NVL72 的性能相同，因为工作负载受延迟限制而非带宽限制。NVL72 巨大的 NVLink 带宽优势并不重要，因为即使是慢得多的 IB 链路也不会被微小的 token 批次流量所饱和。
+
+预填充/解码分离也发挥了重要作用。预填充是计算密集且突发的；解码是内存带宽受限且稳态的。当它们共享同一 GPU 时，会相互干扰，导致延迟抖动和容量浪费。将它们分离到专用 GPU 池，使每个阶段运行与其特性匹配的工作负载，从而提高有效利用率。这就是为什么分离式 B200 配置在吞吐量-交互性曲线中段优于单节点 B200。与把两个阶段塞进单个 8-GPU 节点相比，PD 分离再配合通过 IB 跨更多 GPU 的更宽 EP，能够更高效地分摊权重。
+
+[附注：TogetherAI 的优秀推理工程师注意到多轮对话流量中一个模式，即首轮预填充的需求与后续轮次的预填充需求差异很大，通过对此进行分离实现了更好的首 token 延迟（TTFT）表现。](https://www.together.ai/blog/cache-aware-disaggregated-inference)
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/bfdcb99e-dc02-4468-bd72-b25a7be6c15d_2380x1386.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## Jensen 低调承诺、超额兑现——Hopper vs Blackwell vs 机架规模 NVL72
+
+在 GTC 2024 上，Jensen 在台上承诺从 H100 到 GB200 NVL72 将实现高达 30x 的性能提升，[所有人都认为这是典型的营销包装，在现实世界中不可能实现。](https://newsletter.semianalysis.com/p/nvidia-blackwell-perf-tco-analysis) 许多人试图为这种"现实扭曲力场"贴上标签，以便开更多 Jensen 数学的玩笑。确实——[我们曾指出 30x 性能差异的比较是将 H200 FP8 的最差情况与 GB200 FP4 的合理情况进行对比。](https://newsletter.semianalysis.com/i/175661150/benchmarking-the-h200-on-its-bad-hair-day)
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/4fec3378-2cf4-4c1c-a40d-bcbd788c9a70_3022x1964.jpeg"
+  caption="来源：Nvidia GTC 2024"
+/>
+
+但事实证明，闹笑话的反而是他们自己。快进将近两年，我们现在可以看到这并非营销炒作，Jensen 实际上一直在低调承诺 Blackwell 的性能。根据我们的测试，相比强劲的 H100 disagg+wideEP FP8 基线，Blackwell 在大规模 MoE 推理上表现出色，在 116 tok/s/user 下，GB200 NVL72 FP4 达到了高达 98x 的性能提升，GB300 NVL72 FP4 更是达到了高达 100x！也许新的 Jensen 数学法则就是：在 token 吞吐量方面，他兑现的是承诺值的两倍。买得越多，省得越多！
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/70638c7e-69a6-43f2-96a4-23766bcabbd2_2121x1248.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+即使考虑到 Blackwell 和 Blackwell Ultra 更高的总拥有成本，我们仍看到与 Hopper 相比 9.7x（40 tok/s/user）至 65x（116 tok/s/user）的每美元 token 提升。[您可以在我们的免费网站上详细探索 Hopper vs Blackwell 的性能对比](https://inferencemax.semianalysis.com/?i_seq=8k%2F1k&g_model=DeepSeek-R1-0528&g_rundate=2026-02-12&g_runid=21928999802&i_prec=fp4%2Cfp8&i_metric=y_costh&i_log=1#inference)。Blackwell 相对 Hopper 的性能优势如此之大，以至于我们不得不在仪表板中添加对数刻度来进行可视化展示。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/402b23af-7ad6-46e4-97af-a5698ea2bd87_2176x1416.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+如前文所述，B300 服务器最多通过 900 GB/s/GPU 的 NVLink 扩展网络连接 8 块 GPU，而 GB300 NVL72 服务器通过 NVLink 扩展网络连接 72 块 GPU。因此，当我们的推理部署需要超过 8 块 GPU（但少于 72 块）时，需要引入多节点 B300 服务器来组成推理系统，这意味着通信退回到带宽较低的 InfiniBand XDR Scale-out 网络，每 GPU 提供 800 Gbit/s（单向）带宽。相比之下，机架规模的 GB300 NVL72 通过 NVLink 连接 72 块 GPU，提供每 GPU 900 GB/s（单向）带宽，我们可以看到机架规模服务器使推理系统中的 GPU 之间通信带宽比多节点 B300 服务器高出 9 倍以上。
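+
+两种互连的带宽对比只是一次单位换算，下面的小例子把正文中的数字换算到同一单位。函数名是演示用的假设名称，带宽取正文给出的每 GPU 单向标称值，没有考虑协议开销。
+
```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Gbit/s 换算为 GB/s（1 字节 = 8 比特）。"""
    return gbit_per_s / 8


NVLINK_GB_PER_S = 900    # NVL72 机架内，每 GPU 单向带宽
IB_XDR_GBIT_PER_S = 800  # InfiniBand XDR Scale-out，每 GPU 单向带宽

ib_gb_per_s = gbit_to_gbyte_per_s(IB_XDR_GBIT_PER_S)
ratio = NVLINK_GB_PER_S / ib_gb_per_s

print(f"IB XDR: {ib_gb_per_s:.0f} GB/s, NVLink / IB = {ratio:.1f}x")
```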
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/8664f48c-037c-45cc-b6f8-1999ed0cee0e_2298x1430.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+诚然，GB300 NVL72 的全口径每 GPU 成本更高，但即使计入这一点，其单位 TCO 带宽优势也仍有约 8x。机架规模架构的带宽优势直接推动了更低的每 token 成本。Google TPU、AWS Trainium 和 NVIDIA 是目前仅有的部署了机架规模系统设计的 AI 芯片厂商。AMD 的首款机架规模 MI455X UALoE72 系统的工程样品和小批量生产将在 2026 年下半年完成，而由于制造延迟，大规模量产和首批生产 token 要到 2027 年 Q2 才能在 MI455X UALoE72 上产生。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/58c7b664-76a7-454b-ac99-036b0b6f4abb_2132x1456.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## Blackwell vs Blackwell Ultra
+
+在纸面参数上，新发布的 Blackwell Ultra 与 Blackwell 拥有相同的内存带宽和 FP8 性能，FP4 性能仅高 1.5x，但在实际测量中，我们发现 Blackwell Ultra 的 FP8 性能比 Blackwell 好了多达 1.5x，不过 FP4 仅好了 1.1x。这可能是因为 Blackwell Ultra 作为新发布的 GPU，软件尚未完全优化。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/a7625f0e-7e35-4170-8986-4fe0d66f7925_2125x1247.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/bfe9255c-33f0-4f1b-ab82-acf2321ae8f1_2124x1245.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## MI355X vs MI325X vs MI300X
+
+在 AMD SKU 上，我们看到 MI355X 相比 MI300X 性能提升高达 10x。AMD 目前仅在 MI355X 上成功运行了 DeepSeek SGLang 分离式推理，尚未提交 MI300X 或 MI325X 的分离式推理结果，可能是由于旧 SKU 上的软件问题仍在解决中。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d6dd3138-e228-4121-a061-4aa92c84d6a4_2334x1390.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/101c2a16-c861-40f5-8079-3f2e38038980_2491x1123.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/c34192a3-3ddc-4f85-8708-289261c4ec7a_2219x1024.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+从成本角度来看，对于 FP8 的 DeepSeek R1，在 24 tok/s/user 的交互性水平下，MI355X 的推理成本比 MI325X 低近 3 倍，每 GPU 吞吐量则接近 MI325X 的 4 倍。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/ab1ad749-fe92-4209-9347-4456d22b0cfd_2088x1432.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## AMD 在 FP4、分布式推理与宽专家并行上的可组合性问题
+
+虽然 AMD 在单节点 FP4 上表现尚可，在 FP8 分布式推理上与 B200 SGLang 具有竞争力，但当前 AMD 开源推理栈的问题在于：虽然各个推理优化单独表现良好，但实际客户部署时会将多种优化组合使用。顶级 AI 实验室都同时启用 FP4 **配合**分离式推理**配合**宽专家并行，问题正出在这里。
+
+AMD 软件仍未达标，SemiAnalysis 和 AMD 内部的理论极限建模都表明，对于 FP4，分离式推理配合宽专家并行应该优于 MI355X 单节点推理。遗憾的是，软件仍然是 AMD GPU 的巨大瓶颈。AMD 管理层需要继续优化工程人才的资源配置——例如，将工程资源从无人使用的单节点宠物项目（如 ATOM）转移到修复上述推理优化可组合性问题上。当前不佳的软件表现源于缺乏聚焦和优先级设置不当。所有顶级实验室已在使用分离式推理和宽专家并行；AMD 需要停止专注于单节点，大力投入多节点推理的开源方案。
+
+AMD 在开源分布式推理、宽专家并行和 FP4 可组合性方面落后超过六个月，[NVIDIA 和 SGLang 团队六个月前就已展示了他们在 DeepSeek 上的 NVFP4 性能。](https://lmsys.org/blog/2025-09-25-gb200-part-2/)
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/eddd9541-ed5a-4e49-aab2-291d49fd7e68_2132x1252.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## AMD ATOM 引擎
+
+AMD 推出了名为 ATOM 的新推理引擎。ATOM 能提供略好的单节点性能，但完全缺失许多关键功能，导致无法用于真实工作负载。例如，它不支持 NVMe 或 CPU KVCache 卸载、工具解析、宽专家并行或分离式服务。这导致生产环境中没有任何客户使用它。相比之下，NVIDIA 的 TRTLLM 在 TogetherAI 等公司全球范围内每小时生成数十亿 token，[并且支持工具解析和其他功能](https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve/trtllm-serve.html#cmdoption-trtllm-serve-serve-tool_parser)；而由于上述功能缺失，目前没有任何 token 工厂在使用 ATOM。
+
+此外，vLLM 等开源推理引擎的维护者对 AMD 感到失望，原因是 AMD 提供的工程和 GPU 资源不足。例如，vLLM 首席维护者 Simon Mo 在 GitHub RFC 中表示，他仍然没有可用的 MI355X 来添加到 vLLM CI 中，因此用户体验不佳。vLLM 上目前没有任何 MI355X 测试，而 NVIDIA 的 B200 在 vLLM 上有大量测试。同样，vLLM 上的 MI300X CI 机器数量仍然不够。上游 vLLM 至少还需要 20 台 MI300 机器、20 台 MI325 机器和 20 台 MI355X 机器才能达到与 CUDA 相同的可用性水平。
+
+在 SemiAnalysis，我们一直在推动 AMD 为 vLLM 贡献更多计算资源，并在最近几周取得了一些成果。vLLM 将开始获得几台 MI355X 机器，使其 CI 测试覆盖率从 0% 提升到非零水平。我们将在即将发布的"AMD 现状"文章中详细讨论 AMD 此前对 vLLM、SGLang、PyTorch CI 机器贡献不足的情况，以及 Anush 如何开始着手解决这一问题。在 SemiAnalysis，我们将建立内部仪表板来跟踪 AMD 和 NVIDIA 在 vLLM、SGLang、PyTorch 和 JAX 上运行的测试数量和质量。
+
+此外，vLLM 维护者表示，由于机器资源不足，他们无法为 ROCm 提供首日 vLLM 支持。这一巨大的上市时间差距导致 ROCm 持续落后，给 NVIDIA 留下了继续收取高达 75% 毛利率（4 倍成本加价）的巨大空间。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/96fd0617-347d-49a1-a971-19e42faeab25_1435x1289.png"
+  caption="来源：Github"
+/>
+
+最后，AMD 没有足够的提交者"通过功能引导和代码所有权展示持续的上游参与"，并且缺乏能审核自家代码的评审人员。这就是为什么 ROCm vLLM 的开发速度远慢于 CUDA vLLM。
+
+AMD 有许多才华横溢的优秀工程师在 ATOM 上工作，我们鼓励 AMD 管理层考虑将这些优秀工程师重新部署到人们实际使用的库和框架上，如 vLLM 和 SGLang。
+
+如前所述，AMD 还需要优先解决 FP4、wideEP 和分离式服务的可组合性问题，而不是过度专注于单节点的 FP4 优化。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/da3b4a10-0f65-403d-a9f6-093b86753c02_2120x1258.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## 多 Token 预测（MTP）
+
+推测解码通过使用一个小型、低成本的草稿模型提前提议多个 token 来降低自回归生成的成本。大模型然后在一次类似预填充计算的前向传播中验证所提议的 token。对于给定的输入序列长度，当输入多出 N 个 token 时，单次前向传播的耗时大致相同。推测解码利用这一特性，在小模型上运行推理生成多个 token 供主模型在一次前向传播中验证，在相似的时间预算内最多额外产出 N 个 token。
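+
+推测解码每次验证能产出多少 token，取决于草稿 token 的接受率。下面的草图采用一个常见的简化模型：假设每个草稿 token 以相同的概率被独立接受，一旦被拒绝，其后的草稿 token 全部作废，而主模型每次验证总会产出 1 个 token。接受率 0.8 和草稿长度 3 都是演示用的假设值，并非实测数据。
+
```python
def expected_tokens_per_step(accept_rate: float, num_draft_tokens: int) -> float:
    """每次主模型前向传播期望产出的 token 数。

    第 i 个草稿 token 只有在前 i 个全部被接受时才有效，概率为 accept_rate**i；
    i = 0 对应主模型自己保证产出的那个 token。
    """
    return sum(accept_rate ** i for i in range(num_draft_tokens + 1))


print(expected_tokens_per_step(0.8, 3))  # 假设接受率 0.8、每步 3 个草稿 token
print(expected_tokens_per_step(1.0, 3))  # 全部接受：1 + N 个 token
print(expected_tokens_per_step(0.0, 3))  # 全部拒绝：退化为普通解码
```
+
+在这个模型下，每步最多额外产出 N 个 token，接受率越低，收益越接近普通解码。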
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/b2b2aa12-c308-4f4b-84f7-969228600ce5_2296x1126.png"
+  caption="来源：Brendan Bycroft"
+/>
+
+这一关于在相同时间预算内额外产出 token 的假设对于稠密模型最为成立，因为批量验证可以在多个位置复用相同的权重流。对于混合专家模型，不同的 token 可能路由到不同的专家，因此验证多个草稿 token 可能激活更多专家，迫使从内存获取额外的专家权重。正如 EAGLE 论文中 Mixtral 8x7B Instruct 模型的结果所示，这些额外的内存访问会削弱带宽节省，使验证与标准解码步骤的成本相当。
+
+多 Token 预测在无需单独草稿模型的情况下追求类似的效益。模型架构中添加了辅助预测头，使单个模型能从同一底层表示中提议多个未来 token。这改善了分布对齐，因为提议来自最终进行评分的同一模型。多 Token 预测还避免了服务额外模型的运维复杂性，同时仍然支持多 token 生成策略，但要求 MTP 头与主模型一起预训练。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/27ee5a46-78b5-40dd-b76d-1f096e0ae06d_1755x1154.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在所有 SKU 上，启用 MTP 都带来了性能提升。通过利用通常未使用的 logits 来验证额外 token，仅增加了极少的计算开销，节省了解码过程中昂贵的权重加载。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/fb5fc8fa-d129-475c-bb87-664e08bc6179_1773x1151.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在大批次下，推理机制相比小批次受内存带宽限制更少。由于推测解码（包括 MTP）的工作原理是用多余的计算换取更少的内存受限解码步骤，推测 token 带来的额外验证工作可能无法恰好利用空闲算力，导致在大批次下的改进幅度较小。
+
+从成本角度来看，MTP 能带来巨大的成本节省。在下表中，我们看到使用 Dynamo TRT 运行 FP4 的 DeepSeek-R1-0528 每百万总 token 成本为 $0.251，但启用 MTP 可将成本大幅降低至每百万总 token 仅 $0.057。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/dcf44984-9cb9-49ae-b35a-aeb5b5d14244_1566x1778.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在所有配置中，当其他条件不变时，在 DeepSeek R1 上使用 MTP 可以提高交互性，且对模型精度没有显著影响。这与 DeepSeek V3 技术报告的结论一致。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1143164c-b38f-4ca9-888a-e9e270d6ef48_1757x1187.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+关于 MTP 性能数据的有效性，有人可能会认为合成数据集的分布不能代表真实数据。然而，比较 MTBench 和我们的 1k1k 基准测试之间的 MTP 接受行为，我们发现分布非常相似，这证实了我们的 InferenceX 基准测试是真实世界生产性能的良好代理。话虽如此，InferenceX 并非完美，我们始终在寻求改进。如果您想加入我们的使命，[请在此申请加入我们的特别项目团队](https://app.dover.com/apply/semianalysis/2a9c8da5-6d59-4ac8-8302-3877345dbce1)。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/6c4a7c01-3d56-486d-b959-cb4b6468f56f_2408x1390.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## 精度评估
+
+吞吐量优化有时会悄然牺牲精度（例如通过过于激进的接受率放松、解码调整、数值不稳定的内核或端点配置错误）。如果没有评估，一个配置错误的服务器（截断、错误的解码、错误的端点参数）仍然可以产生很好的吞吐量数字，但输出的答案却是垃圾。例如，这一额外的检查层帮助我们发现了 GPT-OSS 的某些 DP 注意力实现的问题。
+
+现在，每个代表性吞吐量配置都附带了数值精度检查。目前我们仅使用 GSM8k，但由于这是一个非常简单的基准测试，评估分数可能不会因数值计算差异而有太大变化，更难的基准测试可能会在数值精度方面产生更大的差异。因此，我们计划在未来扩展到更难的测试，如 GPQA、HLE、MATH-500、SWE-Bench verified。
+
+另一种性能-精度权衡体现在量化上。以更低精度服务模型可能导致模型输出变差。对于 DeepSeek R1，FP8 运行的评估分数略高于 FP4。请注意，GSM8k 评估已经饱和，且在 QAT/PAT 过程中通常会针对常见的 GSM8k、MATH-500 等进行校准，导致有时评估结果优异而真实世界终端用户评估表现不佳。如果您想加入团队研究如何正确评估推理引擎精度，[请在此申请](https://app.dover.com/apply/semianalysis/2a9c8da5-6d59-4ac8-8302-3877345dbce1)。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/e58e6323-b5d1-4221-9c51-ff39b44d1f98_1779x1180.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## Anthropic Fast Mode 推理解析
+
+Anthropic 最近伴随 Opus 4.6 发布了"[fast mode](https://code.claude.com/docs/en/fast-mode)"。其价值主张是：相同的模型质量，约 2.5 倍的速度提升，约 6-12 倍的价格。这两个数字可能都令人意外，一些用户推测[这一定需要新硬件](https://x.com/Yuchenj_UW/status/2020214926133063705)。实际上并不需要。这本质上就是那个根本性权衡在起作用。任何模型都可以在广泛的交互性水平（每用户 token/s）范围内提供服务，每百万 token 的成本（CPMT）会相应变化。用我们的类比来说，梅赛德斯既造公交车也造跑车。
+
+精打细算的人可能认为 fast mode 更贵，但如果从总拥有成本的角度来看，fast mode 在某些情况下实际上要便宜得多。例如，一个 GB200 NVL72 机架的造价可达 330 万美元。因此，如果 Claude Code 的智能体循环（在生产中运行在 Trainium 上）通过工具调用使用 NVL72 机架，而这些机架的推理速度慢了 2.5x，您就需要 2.5 倍数量的机架才能提供同样的推理服务。以一个机架为基准，这相当于多出 1.5 个机架，也就是说不启用 fast mode 将额外花费近 500 万美元。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/cad37655-7b9a-4c86-81a8-3314ad0526fe_1694x348.png"
+  caption="来源：Anthropic"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/4bb71482-fe77-4e33-b5cb-b7db512b61c1_1700x439.png"
+  caption="来源：Anthropic"
+/>
+
+以在 B200 上使用 TRT-LLM 运行的 DeepSeek R1 0528 FP4 编码工作流为例。在 50 tok/sec/user 的交互性下，推理成本约为 $0.56/M 输出 token。在 125 tok/sec/user 的交互性下，成本上升至约 $4/M 输出 token——速度提升 2.5 倍，价格增加约 7 倍，与我们在 Anthropic fast mode 中看到的情况非常接近。请注意，这假设 DeepSeek R1 与 Opus 4.6 相似，实际并非如此。但总体原理仍然成立。
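+
+这组对比的倍数可以直接从正文数字算出。下面的小例子只是复核算术，变量名为演示用的假设名称，成本取正文给出的近似值。
+
```python
SLOW_TOK_PER_S_USER = 50    # 较低交互性（tok/sec/user）
FAST_TOK_PER_S_USER = 125   # 较高交互性（tok/sec/user）
SLOW_COST_PER_M = 0.56      # 约 $0.56/M 输出 token
FAST_COST_PER_M = 4.0       # 约 $4/M 输出 token

speedup = FAST_TOK_PER_S_USER / SLOW_TOK_PER_S_USER
price_multiple = FAST_COST_PER_M / SLOW_COST_PER_M

print(f"速度提升 {speedup:.1f} 倍，价格增加约 {price_multiple:.1f} 倍")
```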
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/66509f21-d3e5-435f-9163-50d9be56c789_1930x1162.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/6621150f-7da2-44ae-9695-493374487825_1972x1122.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+这直接源于 LLM 推理中延迟与吞吐量的根本权衡。在大批次下，GPU 实现了更好的利用率和更高的总 token 吞吐量，即同时服务更多用户、更低的每 token 成本。在小批次下，每个请求有更大的并行度，每个用户获得更快的响应，但总 token 吞吐量下降。由于[加速器的每小时成本](https://semianalysis.com/ai-cloud-tco-model/)无论如何使用都是固定的，更低的吞吐量意味着更少的 token 来分摊该成本，因此每 token 价格更高。
+
+简言之，fast mode 不一定是硬件层面的故事，只是在相同 GPU 上用吞吐量换延迟的自然结果。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/132f55e4-43c7-4df3-bb4e-1408d85c2782_2718x1796.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+此外，我们观察到推测解码等推理优化技术可以直接降低推理成本，无需新芯片。
+
+以下面的例子为例，DeepSeek R1 FP4 在 8k/1k 工作负载上。在 150 tok/sec/user 的交互性水平下，基线 GB300 Dynamo TRT 的每百万 token 成本约为 $2.35，而启用 MTP 将价格降至约 $0.11。仅通过采用一种推理优化技术，就实现了该交互性水平下约 21 倍的价格降低。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f88b30b6-aa73-4ad2-a008-b2e8f940cfd0_1958x1104.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f6dfa226-93d7-4596-9dc5-feebd5ef1dce_1966x1098.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/8742f134-05d4-4a07-9257-8c93b4730cd7_2704x1790.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+固定 50 tok/sec/user 的交互性水平，我们进一步看到 MTP 如何在各种芯片上有效降低每百万 token 成本（CPMT）。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/bc992849-b42d-4899-81a3-77105c86886b_1950x1250.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## 宽专家并行（WideEP）与分离预填充
+
+在本节中，我们将深入探讨专家并行，并解释什么是*宽*专家并行。然后我们将解释分离预填充的概念、它与 WideEP 的区别，以及 WideEP 和分离预填充如何协同使用以实现 SOTA 性能。
+
+## WideEP
+
+目前，大多数前沿 AI 实验室采用的是混合专家（MoE）模型架构而非稠密架构。在 MoE 架构中，每个 token 只激活一部分"专家"。例如，DeepSeek R1 总共有 671B 参数，但仅有 37B 活跃参数。具体来说，DeepSeek R1 有 256 个路由专家（和 1 个共享专家），每个 token 被路由到 8 个不同的专家。这种架构天然适合专家并行（EP），即将专家权重均匀分布到一定数量的 GPU 上。
+
+考虑在单个 8-GPU 服务器上服务 DeepSeek R1。在 671B 参数的规模下，需要某种形式的并行才能将模型放入可用的 HBM 中。最简单的方法是张量并行（TP），它将每个权重矩阵分片到所有 GPU 上。这对稠密模型效果很好，但忽略了 MoE 的稀疏激活模式。使用 TP=8 时，每个专家的权重都分片到所有 8 块 GPU 上，意味着每次专家激活都需要在所有 GPU 之间执行 all-reduce——即使 256 个专家中每个 token 仅激活 8 个，且归约维度的 GEMM 更小导致算术强度更低。TP 将每个专家当作稠密层对待，在模型稀疏性未被利用的情况下承担了全部跨 GPU 通信成本。
+
+专家并行采用了更合适的方法，将完整的专家分配给各个 GPU。使用 EP=8 时，我们将每层的 256 个专家分配到 8 块 GPU 上，每块 GPU 每层 32 个专家。每块 GPU 持有约 1/8 的专家权重加上非专家权重的完整副本（注意力投影、嵌入层、归一化层和共享专家）。由于 DeepSeek R1 约 90% 以上的参数是路由专家权重，EP 捕获了大部分内存节省，而将剩余不到 30B 的非专家参数在所有 8 块 GPU 上复制是可以承受的。
+
+前向传播在每一层分两个阶段进行。在注意力阶段，每块 GPU 充当独立的数据并行 rank，使用其复制的非专家权重处理自己的请求子集——无需 GPU 间通信。在 MoE 阶段，轻量级路由器确定每个 token 需要哪些专家，token 通过 all-to-all 通信被分发到相应的 GPU。每块 GPU 仅对路由到它的 token 执行其本地专家计算，结果通过第二次 all-to-all 返回。
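+
+MoE 阶段的分发步骤可以用一个玩具示例说明：路由器为每个 token 选出 8 个专家，all-to-all 再把 token 发往持有这些专家的 GPU。下面的草图假设专家按编号连续分块放置（专家 0 到 31 在 GPU 0，依此类推），两个 token 的路由结果也是随手编的；真实引擎的放置策略和路由结果都会不同。
+
```python
from collections import Counter

NUM_EXPERTS = 256
EP_SIZE = 8
EXPERTS_PER_GPU = NUM_EXPERTS // EP_SIZE  # 每块 GPU 每层 32 个专家


def gpu_for_expert(expert_id: int) -> int:
    """假设的放置方式：按专家编号连续分块。"""
    return expert_id // EXPERTS_PER_GPU


def dispatch_counts(routed_experts: list[list[int]]) -> Counter:
    """统计分发阶段每块 GPU 收到的 (token, 专家) 对的数量。"""
    counts: Counter = Counter()
    for experts in routed_experts:
        for expert_id in experts:
            counts[gpu_for_expert(expert_id)] += 1
    return counts


# 两个 token 的假设路由结果，每个 token 路由到 8 个专家
tokens = [
    [0, 1, 40, 70, 100, 130, 200, 255],
    [5, 33, 34, 35, 96, 160, 161, 224],
]
counts = dispatch_counts(tokens)
for gpu in range(EP_SIZE):
    print(f"GPU {gpu}: {counts[gpu]} 个 (token, 专家) 对")
```
+
+即使只有两个 token，各 GPU 收到的工作量也已经不均匀，这正是小批次下 EP 负载不均衡的来源。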
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/2f923fd4-57c0-418e-8b01-49025b9c48d5_8236x3544.png"
+  caption="DeepSeek R1 的 EP8 DP8 部署方案。每层 256 个专家均匀分配到 8 块 GPU，而注意力层及其他非专家权重（共享专家、门控网络、RMSNorm、LM head 等）在所有 8 个 DP rank 上复制。来源：SemiAnalysis"
+/>
+
+最直接的扩展方式是复制：在 N 个节点上部署 N 个独立的 EP8 实例。每个实例独立服务请求，无跨节点通信。这使吞吐量线性扩展，但每块 GPU 仍然持有每层 32 个专家，每个 token 最多激活其中 8 个。75% 的专家权重闲置在 HBM 中。
+
+**宽专家并行**（WideEP）采用了不同的方法，将 EP *跨*节点扩展而非复制独立实例。在 64-GPU 集群（8 个节点）上，DP64/EP64 将每层每块 GPU 仅放置 256/64 = 4 个专家，同时每块 GPU 仍持有非专家权重的完整副本。在 MoE 阶段，所有 64 个 DP rank 的 token 通过 all-to-all 分发到托管其路由专家的 GPU。
+
+与单节点 EP8 基线相比，这带来了三重叠加效益。首先，将专家占用从每 GPU 32 个减少到 4 个，释放了大量 HBM 用于 KV 缓存，直接增加了每 GPU 批次容量。其次，64 个 DP rank 的 token 汇聚到更少的每 GPU 专家上，增加了每专家 token 数，提高了算术强度（每字节权重加载的 FLOPS 更多），改善了计算利用率。相同的专家权重在每步服务 8 倍的 token。第三，聚合 HBM 带宽随 GPU 数量线性扩展；64 块 GPU 同时加载专家权重提供了单节点 8 倍的内存带宽，减少了内存瓶颈。
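+
+这三项效益中的两个数字可以用简单算术复核。下面的草图假设路由完全均匀、每个 DP rank 的批次大小相同（这里取假设值 32），函数名均为演示用的假设名称。
+
```python
NUM_ROUTED_EXPERTS = 256  # DeepSeek R1 每层的路由专家数（来自正文）
TOP_K = 8                 # 每个 token 路由到的专家数（来自正文）
BATCH_PER_RANK = 32       # 假设值，仅用于演示


def experts_per_gpu(ep_size: int) -> int:
    """每块 GPU 每层持有的路由专家数。"""
    return NUM_ROUTED_EXPERTS // ep_size


def avg_tokens_per_expert(dp_ranks: int, batch_per_rank: int) -> float:
    """路由完全均匀时，每个专家每步平均收到的 token 数。"""
    return dp_ranks * batch_per_rank * TOP_K / NUM_ROUTED_EXPERTS


for ep in (8, 64):
    print(ep, experts_per_gpu(ep), avg_tokens_per_expert(ep, BATCH_PER_RANK))

ratio = avg_tokens_per_expert(64, BATCH_PER_RANK) / avg_tokens_per_expert(8, BATCH_PER_RANK)
print(f"EP64 相对 EP8 的每专家 token 数: {ratio:.0f} 倍")
```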
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1ae2668e-28ef-4a1f-8ab1-0b5f1373a1d1_8476x3546.png"
+  caption="DeepSeek R1 的 WideEP EP64 DP64 部署方案。每层 256 个专家均匀分配到 64 块 GPU（8 个节点），注意力层及其他非专家权重（共享专家、门控网络、RMSNorm、LM head 等）在所有 64 个 DP rank 上复制。来源：SemiAnalysis"
+/>
+
+上述配置仅使用 DP+EP（也称为 DEP），其中每块 GPU 持有所有非专家权重的完整副本。随着 GPU 数量增加，这种复制变得越来越浪费。在 64-GPU 的 DP64/EP64 部署中，每块 GPU 都存储着同一份非专家参数（如前所述，不到 30B）的完整副本。
+
+在 GPU 组内添加张量并行可以解决这个问题。在 EP64/DP8/TP8 配置中，64 块 GPU 被组织成 8 个 DP 组，每组 8 块 GPU。在每个 TP 组内，注意力投影、共享专家、归一化层和 LM head 被 8 路分片，因此每块 GPU 仅持有 1/8 的非专家权重。在整个集群中，256 个专家仍然像之前一样分布——每块 GPU 4 个。
+
+纯 DEP 只有一种通信模式：用于专家路由的 all-to-all。添加 TP 后，每个 TP 组内的注意力和非专家计算引入了第二种通信模式：all-reduce。关键设计原则是将 TP 组放置在单个节点内（NVLink 或 MNNVL 提供高带宽互连），而将 EP/DP 跨节点运行（all-to-all 通信模式可以容忍更高的延迟）。
+
+一如既往，这里的权衡是吞吐量与延迟的取舍。一个组内的 TP=8 意味着这 8 块 GPU 共享一个批次并且必须在每个解码步同步，将有效 DP 度从 64 降低到 8。注意力侧的每 GPU 批处理独立性丧失了。但每个 DP 组现在每步处理注意力的速度提高了 8 倍，因为矩阵乘法在 TP 组内被 8 路分割。每 token 延迟下降，同时峰值并发度也下降——相对于纯 DEP，该配置沿延迟-吞吐量帕累托前沿滑动。
+
+## 分离预填充
+
+分离预填充，有时也称为预填充-解码（PD）分离，是将 LLM 推理的预填充和解码阶段在不同节点上执行的过程。预填充发生在请求首次处理时，对所有 token 执行一次前向传播，从而"预填充"该请求的 KV 缓存。这是一个计算密集型操作，因为所有 token 同时通过前向传播。随后，token 逐个生成或"解码"，每个解码步都从 HBM 加载 KV 缓存。这是一个内存密集型过程，因为不断增长的 KV 缓存持续被加载。
+
+在传统的单节点推理中，引擎在同一 GPU 上交替执行预填充和解码。到来的预填充请求会阻塞正在进行的解码批次，增加首 token 延迟（TTFT）和 token 间延迟。分块预填充通过将长预填充拆分为更小的片段来缓解这一问题，但资源竞争的根本问题仍然存在。分离预填充彻底消除了这一问题！
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/0bc87a96-aa31-4b37-99c6-603c98f332f3_1318x733.png"
+  caption="来源：DistServe"
+/>
+
+分离还支持对每个阶段进行独立扩展和优化。有了独立的节点，每个阶段可以分别调优：不同的并行策略、不同的批次大小和不同的内存分配比例。预填充与解码节点的比例也可以根据工作负载的输入-输出长度比进行匹配。例如，预填充主导的工作负载（长输入、短输出，如摘要生成、RAG、大上下文窗口的智能体编码）分配更多预填充实例。解码主导的工作负载（短输入、长输出，如思维链推理、长文本生成）分配更多解码实例。缓存命中率高的工作负载也倾向于使用更多解码，因为来自共享系统提示或多轮对话历史的复用 KV 缓存条目完全跳过了预填充。
+
+分离的关键代价是 KV 缓存传输。预填充完成后，该请求的完整 KV 缓存必须从预填充节点传输到解码节点，然后才能生成第一个解码 token。对于像 DeepSeek R1 这样具有 61 层和 FP8 KV 缓存的模型，8192 个 token 的预填充产生大约 500MB 的 KV 数据需要通过网络传输，这直接增加了 TTFT。这种传输通过 RDMA（通常是 RoCE 或 InfiniBand）进行，使用零拷贝 GPU 到 GPU 的数据移动，无需 CPU 参与。NIXL（NVIDIA Inference Xfer Library）等库将数据移动层抽象在统一的异步 API 后面，具有可插拔的 UCX、GPUDirect Storage 等传输后端。这将推理引擎与任何特定传输协议解耦，并支持跨异构硬件的分离，其中预填充和解码实例可能跨越不同的设备类型或互连。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/3b56d901-ef89-43c9-8d11-c18062f1b7b9_1165x1165.png"
+  caption="来源：GitHub"
+/>
+
+## 使用宽 EP + 分离式服务优化推理
+
+宽 EP 和分离预填充是两种独立的技术，通常一起使用以实现帕累托最优性能。在本节中，我们将逐步分析 InferenceX 的真实结果，帮助建立对在不同交互性水平下何种并行策略、宽 EP 和分离预填充组合更合适的直觉。
+
+首先了解对于单节点配置，哪些并行策略落在帕累托前沿的哪些部分会有所帮助。以在单个 8-GPU B200 节点上使用 TRT-LLM 运行 DeepSeek R1 FP4 8k/1k 为例。最优策略随着在前沿上的移动而变化，主要由批次大小及其对专家激活密度的影响驱动。
+
+在最高交互性水平（批次 1-16）下，纯 TP 优于任何涉及 EP 的配置。在小批次下，每步仅有少量专家被激活。使用 EP 时，这些激活在 GPU 之间分布不均：在批次 4 时，256 个专家中至多只有 32 个被触发，任何给定 GPU 在给定层中接收零路由 token 的概率约为两位数百分比。TP 通过将每个专家分片到所有 GPU 来避免这一问题，因此无论路由器选择哪些专家，所有 8 块 GPU 都平等参与每个专家的计算。我们在分析 DeepSeek R1 时收集了专家激活比例与批次大小的数据，确认在批次 16 及以下时，每层的专家激活率非常低。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/5ca10b5a-f80e-45b4-8d22-e3134d30b54d_2232x1446.png"
+  caption="来源：SemiAnalysis"
+/>
+
+随着我们移至稍低的交互性水平，批次大小仍然足够小，专家权重仍然通过 TP 而非 EP 进行分片。交叉点出现在大约批次 32 处，此时约 50-60% 的专家在每层被激活。在这个密度下，EP 的负载不均衡变得可以容忍，其 token 路由开销比 TP 所需的逐专家 all-reduce 更低。该范围内的配置使用 TEP：注意力使用张量并行（所有 GPU 协作完成每个注意力计算），MoE 层使用专家并行（专家分配到特定 GPU 并通过 all-to-all 路由）。在最高吞吐量、最低交互性区域，批次很大（128+），配置转向完全 DEP：注意力权重在所有 GPU 上作为独立数据并行 rank 完全复制，专家通过 EP 分布，批次容量最大化但牺牲了每 token 延迟。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d13280a5-ddc2-4610-84bb-bf470301cc8e_2086x1233.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在扩展到宽 EP 配合分离预填充时，我们观察到了相同的一般模式。预填充和解码使用独立的并行策略和节点数量，两者都根据工作负载和目标交互性水平进行调优。以 8k/1k 工作负载（预填充密集型）在高吞吐量、低交互性区域为例。预填充是瓶颈，因为每个请求需要对 8192 个输入 token 执行一次计算密集的前向传播。该区域的配方分配更多预填充节点而非解码（4P1D、7P2D、4P3D）以维持高预填充吞吐量。这些预填充节点运行 DEP 配置，在独立的数据并行 rank 上复制注意力权重，以便同时处理多个长上下文预填充。解码节点数量较少但以同样的原则运行宽 DEP 配合大批次。
+
+在高交互性端，同时进行的请求较少，因此单个预填充实例就能跟上传入需求的节奏。但每个请求仍需 1024 个解码步骤，且在高交互性下这些步骤必须很快。该区域的配方转向更多解码节点而非预填充（1P3D、1P4D），每个解码实例在小批次下运行 TEP。注意力的张量并行通过将计算分片到实例内的所有 GPU 来最小化每步延迟，而专家并行在中等批次（EP 负载均衡足够好的情况下）处理 MoE 路由。多个小批次解码实例（而非较少的大批次实例）保持了低每 token 延迟，同时仍提供足够的并发服务能力。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/61e2a61e-1b95-4ecb-a03d-061d15615c40_2086x1214.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/1f027e38-879b-4074-960e-928ceca839e2_2112x1227.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/d6c3bf7c-a035-48fb-bfaf-5ae0169e5c1a_2097x1225.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## 深入分析 DeepSeek R1 单节点结果
+
+在 DeepSeek R1 FP8 1k/1k 上，我们看到 MI355X 在单节点场景下与对标的 B200 具有竞争力——尽管在 FP4 多节点场景下被碾压。MI355X（SGLang）在较低交互性水平的吞吐量性能上甚至超过了 B200（SGLang）。此外，MI355X（SGLang）在大多数场景下从性价比角度优于 B200（TRT 和 SGLang）。
+
+遗憾的是，时至 2026 年，大多数前沿实验室和推理服务商既不使用 FP8 也不使用单节点推理。
+
+这一结果表明 AMD 的芯片本身非常出色，如果他们能在软件方面推进得更快，完全可以与 NVIDIA 展开极具竞争力的较量。速度就是护城河。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/a4e8da6f-c4ee-4d39-96ae-9143459d3ea9_2102x1236.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/7ce2b96f-840d-411b-9c6c-2f821219fba5_2130x1444.png"
+  caption="来源：SemiAnalysis InferenceMAX"
+/>
+
+因此，我们看到 MI355X 在 FP4 性能上明显落后于 B200：
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/dbc1dd2c-e15c-45b7-acf7-508d38ad1913_2406x1430.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+在比较 H200（SGLang）和 MI325X（SGLang）上的 DeepSeek R1 FP8 性能时，自我们去年 10 月首次发布 InferenceX v1 以来变化不大。MI325X 数据采集于 2026 年 2 月 12 日，使用 SGLang 0.5.8，而 H200 数据采集于 2026 年 1 月 23 日，使用 SGLang 0.5.7。
+
+一个值得关注的问题是，MI325X 的交互性范围比 H200 小得多：H200 的范围为 30-90 tok/s/user，而 MI325X 仅为 13-35 tok/s/user。这对于希望在更广泛交互性范围内服务用户的服务商来说是个问题。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/f3ba43db-8f65-4b28-a4a2-66282670449f_2117x1236.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## GPT-OSS 120B 单节点
+
+MI300X、MI325X、H200 和 H100 集中在吞吐量 vs 交互性图的左下方，表明它们之间的权衡大致相似，NVIDIA 通常保持适度领先。下一个层级是 MI355X，在给定交互性水平下每 GPU 的 token 吞吐量提升 2 倍以上。在 MI355X 上，ATOM 将曲线向低交互性高吞吐量方向移动，表明它优先考虑峰值吞吐量而非每用户响应速度。
+
+在这一层级之上是 NVIDIA 的 B200 和 GB200，它们在整个前沿线上都优于 MI355X。虽然 B200 和 GB200 共享相同的 Blackwell 计算核心，但 GB200 实现了更高的吞吐量-交互性曲线，因为该平台和服务栈减少了大规模部署中的非计算瓶颈（互连/拓扑、CPU-GPU 耦合和运行时调度），从而实现了有效的横向扩展和更低的每 token 开销。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/478b3a9a-c57d-4766-bde1-c3ee1fef550a_2068x1178.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+如果我们将成本纳入考量，MI355X 变得更有竞争力：在高吞吐量下优于 B200。然而，GB200 仍然是最便宜的选择。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/028672d5-2c24-4dbd-974d-9f50d163df27_1796x1182.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+再次回到 B200 与 GB200 NVL72 的比较，NVL72 的影响显而易见。我们在本文前面讨论过 GB200 NVL72 的 72 GPU 扩展域（scale-up world size）与 B200 的 8 GPU 扩展域之间的差异。在约 100 tok/s/user 的交互性范围内，每 GPU 的输出 token 吞吐量翻了一倍以上，展示了 NVL72 更大扩展域的影响。
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/0186cfbc-1b42-46ae-ae1a-0d7791afcb20_2081x1306.png"
+  caption="来源：SemiAnalysis InferenceX"
+/>
+
+## InferenceX 仓库核心更新
+
+我们对 InferenceX 仓库进行了一些核心架构变更，使基准测试更易于理解和复现。此外，我们已全面拥抱 AI 工具以最大化生产力并提高开发效率。
+
+## 自 InferenceXv1 以来的核心变更
+
+我们自 v1 以来做出的主要改变之一是执行扫描的频率。此前我们每晚对每个配置执行完整扫描。然而，随着我们添加了更多芯片、分离预填充、宽 EP 和其他功能，我们意识到每晚运行既过于耗时又浪费资源。而且，这也没有必要——基准测试只有在配方变更或新软件版本发布时才真正需要重新运行。
+
+我们现在基于仓库根目录的 [changelog](https://github.com/InferenceMAX/InferenceMAX/blob/main/perf-changelog.yaml) 的修改来触发扫描。当开发者对给定配置进行了影响性能的变更时，他们会在 changelog 中添加一个条目，列出受影响的配置和变更的简要描述。所有配置定义在一个[主配置 YAML 文件](https://github.com/InferenceMAX/InferenceMAX/blob/main/.github/configs/nvidia-master.yaml)中，它是每个待扫描数据点的完整状态表示，包括 ISL/OSL、EP、TP、DP、MTP 等核心设置。当包含 changelog 添加的 PR 被合并时，一个工作流会解析引用的配置键，从主配置中提取相应的扫描定义，并将它们作为独立的 GitHub Actions 任务分发。这些任务收集完整扫描的所有数据点，并将结果作为产物上传。
+
+以下是 InferenceX 启动任务的高层示意图。
+
+<Figure src="https://substack-post-media.s3.amazonaws.com/public/images/74936db5-88cb-418e-932a-e7a8693a6857_2904x2845.png" />
+
+## Claude Code 深度 AI 应用
+
+在 InferenceX v1 发布后不久，我们意识到不充分利用 AI 会让多少开发吞吐量白白浪费。因此，我们卷起袖子决定拥抱 Claude Code，开始一次一个 token 地吸收智能，达到了目前每天 $6,000 的消费速率。如果您想为我们年化吸收 300 万美元 Claude 智能的 KPI 做贡献，[请在此申请加入使命。](https://app.dover.com/apply/semianalysis/2a9c8da5-6d59-4ac8-8302-3877345dbce1) 我们的启蒙之旅始于发现 GitHub Copilot agent 是免费的——起初我们简直不敢相信这个功能居然不收费！很快我们意识到 Copilot 很糟糕，也就明白了为什么 GitHub 要免费赠送。要让我们继续用它，恐怕*得倒贴钱*给我们才行。
+
+自 Claude Code 发布以来，我们一直在本地使用它。但最近，我们将 Claude Code 集成到了 InferenceX 的开发中，除了常规的 PR 审查等任务外，还赋予了它在集群上执行扫描的能力。通过我们搭建的工作流，Claude 可以手动启动运行、查看结果并进行迭代。这使我们即使在外出途中，也能通过 GitHub 应用轻松部署快速修复。
+
+另一个酷炫的用例是使用 Claude 为新的 vLLM/SGLang 镜像寻找配方。当新镜像发布时，配方有时需要更新以实现最佳性能（新环境变量、修改的引擎参数等）。通过我们的 Claude Code 集成，我们只需打开一个 issue 并要求 Claude 搜索镜像 changelog 中的所有提交，以找到需要添加到配方中的必要变更。这效果相当好，虽然不*完美*，但通常能提供一个良好的起点。
+
+## GitHub Actions
+
+秉承开源精神，所有运行都在 GitHub Actions 上进行，因此基准测试结果是可验证、透明且可复现的。然而，GitHub 的故障最近一直是我们目标的持续障碍。[最近我们看到的独角兽比任何其他动物都多](https://github.com/503.html)！但也许是时候出去接触一下大自然了。
+
+Microsoft/GitHub 自己也意识到了这个问题，已停止在其状态页面上更新综合正常运行时间数字，在过去 90 天内只剩下一个 9：97.36%。忽视问题并不会让它消失...
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/3b921859-49f3-4b0b-b02e-dd0bf7a36e2e_3000x975.png"
+  caption="来源：Outages project"
+/>
+
+<Figure
+  src="https://substack-post-media.s3.amazonaws.com/public/images/dd7aad58-ba30-4364-9565-980ae6464534_3000x975.png"
+  caption="来源：Outages project"
+/>
+
+总的来说，GitHub Actions 只是勉强够用。它为开发者提供了一种平庸的体验。它显然不是为在数百块 GPU 的集群上启动数千个任务而设计的。尽管如此，自发布以来我们与一些 GitHub Actions 工程师进行了密切合作以更好地满足 InferenceX 的需求，我们可以自信地说他们非常好合作。此外，我们的一个直接需求是在点击工作流运行时实现任务的懒加载，虽然他们花了一些时间，[但最终实现了这一功能。](http://github.blog/changelog/2025-12-22-improved-performance-for-github-actions-workflows-page/)
+
+## InferenceX 的未来
+
+自 2025 年 10 月初 InferenceX 首次发布以来，我们一直在努力持续改进。发布后，我们花了一些时间重构代码库使其更具可扩展性，使新模型和推理技术现在可以"即插即用"地添加。这些改变使我们能够无缝集成 H100、H200、B200、B300、GB200、GB300 和 MI355X 的 PD-disagg 基准测试。我们还在默认基准测试管线中添加了精度评估，以确保在所有配置中对模型性能的可见性。
+
+虽然自发布以来我们做了很多改进，但要达到提供最贴近真实世界推理基准测试这一北极星目标，仍有大量工作要做。为此，我们计划在真实数据集上进行基准测试、添加智能体编码性能基准测试、包含更多 SOTA 推理优化、测试更多模型，以及更多。
+
+## 迁移至真实多轮对话和智能体编码数据集
+
+目前，InferenceX 使用完全随机的 token 作为基准测试的输入。每个请求的 ISL 在 [0.8 × ISL, ISL] 区间内均匀采样，OSL 同理。由于使用随机数据，我们在所有基准测试中禁用了前缀缓存，因为完全随机数据的前缀缓存命中率期望值为 0%。此外，所有随机数据都是单轮的，意味着每个对话只包含一个提示和一个回答。虽然这提供了良好的基线帕累托前沿，但它不是模拟真实世界生产推理工作负载的实际基准测试方案。
+
+在近期，我们将使用类似 [allenai/WildChat-4.8M](https://huggingface.co/datasets/allenai/WildChat-4.8M) 这样的数据集创建基础多轮基准测试，该数据集记录了真实用户的多轮对话。除了在所有场景中启用前缀缓存外，我们还将启用 KV 缓存 CPU 卸载，因为这是我们在生产工作负载中看到的做法。这将更准确地评估每款芯片的优缺点。例如，MI355X 拥有 288GB HBM3e 而 B200 仅有 192GB。因此，我们预期 MI355X 在高并发多轮场景中表现更好，因为更多内存可以分配给 KV 缓存。另一方面，在 GPU KV 缓存紧张、块被卸载到 CPU 的场景中，我们预期 GB 系列表现更优，因为这些芯片拥有 900 GB/s 双向 CPU-GPU 带宽，相比之下 HGX 使用 PCIe 5.0 和 6.0 分别只有 128 GB/s 和 256 GB/s。此外，我们目前看到 AMD 的 CPU 卸载软件表现不佳，这可能在相同场景中对性能产生负面影响。
+
+关键是：真实世界的多轮数据集测试了更多 SOTA 推理引擎功能，能够在所有芯片上捕获更细致和可靠的性能数据。
+
+随着 Claude Code、Codex 和 Kimi 的兴起，在智能体编码场景中进行性能基准测试变得越来越重要。与上述类似，这些场景是多轮的，但还包括超长上下文对话以及工具使用。在接下来几个月中，我们计划创建一个基准测试套件，能够在所有芯片上最准确地捕获开放模型在这些智能体编码场景中的性能。
+
+## 添加 TPU、Trainium 及更多模型
+
+目前，我们持续对 DeepSeek R1 和 GPT OSS 120B（此前还有 Llama 3.1 70B）进行基准测试。为了跟上最新的模型架构，我们计划在接下来几个月内添加 DeepSeek V3.2（含 DSA）、DeepSeek V4 首日支持、Kimi K2.5、Qwen3、GLM5 等众多模型。我们还将最终添加多模态模型，并使用 EPD 和 CFD（由 TogetherAI 发明）优化。
+
+除了新模型外，我们正在积极推进添加 TPU 和 Trainium 的工作。
+
+## 总拥有成本（NVL72、Blackwell、Blackwell Ultra、MI355、Hopper、MI325、MI300）
+
+<Blur>
+
+Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
+
+</Blur>
+
+---
+
+_本文完整版发布在我们的 Substack 上。[订阅 SemiAnalysis](https://newsletter.semianalysis.com/subscribe) 阅读完整文章。_
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "NVIDIA Blackwell 相比 Hopper 在推理上快了多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "GB300 NVL72 FP4 相比强劲的 H100 disagg+wideEP FP8 基线性能提升高达 100x，FP8 vs FP8 则达 65x。即使考虑更高的总拥有成本，Blackwell 在每美元 token 数方面相比 Hopper 实现了 9.7x 至 65x 的提升。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "什么是 LLM 推理中的分离预填充？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "分离预填充将 LLM 推理中计算密集的预填充阶段和内存密集的解码阶段分离到不同的 GPU 池上。这消除了阶段间的资源竞争，支持独立扩展和调优，与在相同 GPU 上运行两个阶段相比，改善了首 token 延迟（TTFT）和 token 间延迟。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "什么是宽专家并行（WideEP）？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "宽专家并行将 EP 跨多个节点扩展，而非复制独立实例。例如，在运行 DeepSeek R1 的 64-GPU 集群上，WideEP 每块 GPU 仅放置 4 个专家（而非 32 个），释放 HBM 用于 KV 缓存，增加每专家 token 数以提高计算利用率，并相比单节点 EP8 提供 8 倍的聚合内存带宽。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "NVIDIA B200 与 AMD MI355X 在推理上如何比较？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 FP8 分离预填充上，MI355X 使用 SGLang 与 B200 具有竞争力。然而，在前沿实验室使用的 FP4 disagg+wideEP 工作负载上，由于 AMD 在组合多种推理优化时存在可组合性问题，B200 显著优于 MI355X。AMD 的单节点 FP8 性能强劲，但多节点 FP4 分布式推理落后较多。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "DeepSeek R1 推理每百万 token 的成本是多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "使用 B200 配合 Dynamo TRT-LLM FP4，DeepSeek R1 推理成本约为每百万总 token $0.251。启用 MTP 可将成本降至每百万总 token $0.057。在 GB300 NVL72 FP4 上以 150 tok/s/user 的交互性运行时，启用 MTP 将成本从 $2.35 降至约 $0.11 每百万 token，降幅达 21 倍。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "GB200 NVL72 机架规模架构提供了什么优势？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "GB200 NVL72 通过 NVLink 以每 GPU 900 GB/s 的带宽连接 72 块 GPU，是 B200 节点间使用的 InfiniBand Scale-out 网络带宽的 9 倍以上。这一巨大的带宽优势直接推动了 DeepSeek R1 等需要跨多 GPU 进行 all-to-all 通信的宽专家并行大型 MoE 模型的更低每 token 成本。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "多 Token 预测（MTP）如何提升推理性能？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "MTP 使用模型内置的辅助预测头从相同的表示中提议多个未来 token，无需单独的草稿模型。在所有测试的 GPU SKU 上，启用 MTP 均能提升吞吐量且对模型精度没有显著影响，在高交互性水平下可将每百万 token 成本降低高达 21 倍。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "Blackwell Ultra GB300 与 Blackwell GB200 相比如何？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "尽管纸面上内存带宽和 FP8 规格相同，Blackwell Ultra 在实际测试中 FP8 性能比 Blackwell 好了多达 1.5 倍。然而，FP4 性能仅好 1.1 倍，可能是因为软件尚未针对新发布的 Blackwell Ultra GPU 完全优化。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/mi355x-deepseek-v4-pro-sglang-110x-in-26-days.mdx b/packages/app/content/blog/zh/mi355x-deepseek-v4-pro-sglang-110x-in-26-days.mdx
new file mode 100644
index 00000000..8a6134f2
--- /dev/null
+++ b/packages/app/content/blog/zh/mi355x-deepseek-v4-pro-sglang-110x-in-26-days.mdx
@@ -0,0 +1,254 @@
+---
+title: 'MI355X 上 DeepSeek-V4-Pro 搭配 SGLang：26 天内每 GPU 吞吐量提升 110.5 倍'
+subtitle: 'amd/deepseek_v4 分支合入了 TileLang 注意力索引器、Triton 稀疏 MLA、融合 RoPE/Hadamard、FlyDSL MoE 以及 FP4 权重，历经 31 个性能优化 PR——将首次点亮时 20 tok/s/GPU、2.4 tok/s/user 的水平提升至 8K/1K 负载下 2,256 tok/s/GPU、9.4 tok/s/user，吞吐量与交互性同步攀升'
+date: '2026-05-26'
+publishDate: '2026-05-26'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - deepseek
+  - amd
+  - mi355x
+  - sglang
+  - rocm
+  - fp4
+---
+
+DeepSeek-V4-Pro 于 [2026-04-24](https://api-docs.deepseek.com/news/news260424) 发布后仅 26 天，AMD MI355X 上基于 SGLang（[sgl-project/sglang `amd/deepseek_v4` 分支](https://github.com/sgl-project/sglang/compare/main...amd/deepseek_v4)）的服务在 8K/1K 负载下达到 **2,256 tok/s/GPU、9.4 tok/s/user**——相比 2026-04-25 首次点亮时 20.4 tok/s/GPU、2.4 tok/s/user 的水平，**每 GPU 吞吐量提升 110.5 倍**，而且是罕见的双轴同步提升：每 GPU 吞吐量提升 110.5 倍的*同时*，交互性也提升了 3.85 倍。SemiAnalysis 此前在[推文](https://x.com/SemiAnalysis_/status/2053520440589451720)中指出，14 天内在内核层面取得了约 75 倍的提升；仪表板现在又记录了此后 12 天的持续优化。
+
+**31 个性能优化 PR** 在 AMD 分支上紧密接力完成了这些内核工作：FP4 权重启用（[#24031](https://github.com/sgl-project/sglang/pull/24031)）、用于 DeepSeek Sparse Attention 的 TileLang 注意力索引器（[#24033](https://github.com/sgl-project/sglang/pull/24033)、[#24050](https://github.com/sgl-project/sglang/pull/24050)）、Triton 稀疏 MLA 内核及后续融合调度优化（[#24930](https://github.com/sgl-project/sglang/pull/24930)、[#25878](https://github.com/sgl-project/sglang/pull/25878)、[#25977](https://github.com/sgl-project/sglang/pull/25977)）、融合的多头压缩 / RoPE / Hadamard（[#24355](https://github.com/sgl-project/sglang/pull/24355)、[#24727](https://github.com/sgl-project/sglang/pull/24727)、[#26014](https://github.com/sgl-project/sglang/pull/26014)）、FlyDSL MoE（[#24971](https://github.com/sgl-project/sglang/pull/24971)）、融合 hash topk（[#24728](https://github.com/sgl-project/sglang/pull/24728)）、AITER MHC 前处理/后处理，以及六个以上的压缩器逐元素内核融合。速度就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=DeepSeek-V4-Pro&i_gpus=mi355x_sglang&i_dates=2026-05-03%2C2026-05-04%2C2026-05-08%2C2026-05-19%2C2026-05-21%2C2026-04-25&i_prec=fp4%2Cfp8&i_dstart=2026-04-25&i_dend=2026-05-21&i_linelabel=1">
+  点击查看完整的 InferenceX 仪表板 →
+</DashboardCTA>
+
+<Figure
+  srcLight="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/benchmark-light.png"
+  srcDark="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/benchmark-dark.png"
+  alt="MI355X SGLang DeepSeek-V4-Pro 每 GPU 吞吐量与交互性对比，5 个日期：2026-04-25（FP8 基线，约 67 tok/s/GPU、低于 3 tok/s/user）、2026-05-02（FP4 首次点亮，峰值约 500 tok/s/GPU）、2026-05-04（约 615）、2026-05-10（约 1503）、2026-05-21（约 2256 tok/s/GPU、9.4 tok/s/user）。每个日期的曲线均向右上方移动。"
+  caption="MI355X SGLang DeepSeek-V4-Pro（1.6T / 49B 激活参数）在 ISL 8192 / OSL 1024 下的表现。来自 amd/deepseek_v4 SGLang 分支的 5 个日期，跨度 26 天。标注表示 8-GPU TP=8 配置；后期日期在高并发段使用了 DP attention。"
+/>
+
+## DeepSeek-V4-Pro 模型架构
+
+DeepSeek-V4-Pro 是 DeepSeek 的旗舰 MoE 模型：**总参数 1.6T，每 token 激活 49B**（据 [DeepSeek V4 预览公告](https://api-docs.deepseek.com/news/news260424)）。该架构将新颖的**逐 token 压缩**路径与 **DSA（DeepSeek Sparse Attention）** 相结合——这是 DeepSeek 在 V3.2 中引入的稀疏注意力模式，现扩展至更长的上下文（官方服务默认在 **1M 上下文**下运行 DSv4）。V4-Pro 的官方定位是"极致效率：世界领先的长上下文能力，大幅降低计算和显存开销"；开源权重发布于 [`deepseek-ai/DeepSeek-V4-Pro`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)。
+
+注意力机制是 SGLang AMD 分支需要编写大量内核的根本原因。逐 token 压缩引入了围绕注意力块的**多头压缩（mHC）前/后处理对**——运行时将其与 RoPE 和 Hadamard 变换融合——而解码路径上的 DSA 需要单独的**注意力索引器**以及一个**稀疏 MLA 内核**来仅遍历被路由到的位置。整个技术栈足够新，以至于上游 `main` 分支在发布时无法在 Blackwell 或 ROCm 上运行 DeepSeek-V4-Pro；AMD 分支正是在 MI355X 上弥合了这一差距。
+
+**MI355X 上的 FP4 权重支持在发布时同样不存在。** 2026-04-25 的首次点亮测量使用的是 FP8——并且需要 `SGLANG_HACK_FLASHMLA_BACKEND=torch` 加上 `--time=300` 的 SLURM 扩展才能在不触及 3 小时 CI 上限的情况下通过约 30 分钟的 MoE JIT 编译——因为当时 [PR #24031](https://github.com/sgl-project/sglang/pull/24031)（kk，2026-04-29）尚未合入，ROCm 上的 FP4 模型路径还未启用。一旦该 PR 合入（加上 2026-05-02 的 InferenceX 配方更新——启用 `SGLANG_DSV4_FP4_EXPERTS=True` 并拉取了 `deepseek-ai/DeepSeek-V4-Pro` 的 FP4 权重），曲线就进入了可量测的服务区间。本文从 2026-05-02 起的所有日期均使用 FP4；仅 2026-04-25 使用 FP8。
+
+## DeepSeek-V4-Pro 与 Claude Opus 4.6、GPT-5.4、Gemini 3.1 Pro 的对比
+
+DeepSeek 在预览发布时公布了 V4-Pro-Max 与 Claude Opus 4.6、GPT-5.4-xHigh 和 Gemini 3.1-Pro-High 在知识/推理及智能体基准测试（benchmark）上的评估（evaluation）。从质量角度看，这是一个**开源前沿编码模型**：
+
+<Figure
+  srcLight="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/quality-benchmarks-light.png"
+  srcDark="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/quality-benchmarks-dark.png"
+  alt="柱状图对比 DeepSeek-V4-Pro-Max（蓝色斜线填充）与 Claude Opus 4.6-Max、GPT-5.4-xHigh、Gemini 3.1-Pro-High 在 SimpleQA Verified、HLE、Apex Shortlist、Codeforces、SWE Verified、Terminal Bench 2.0 和 Toolathlon 上的得分。DSv4 在 SimpleQA Verified（57.9）、Apex Shortlist（90.2）、Codeforces（3206）和 SWE Verified（80.6，并列）上领先。"
+  caption="DeepSeek-V4-Pro-Max 与 Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 在知识+推理及智能体基准测试上的对比（来源：DeepSeek V4 预览发布于 api-docs.deepseek.com/news/news260424）。DSv4-Pro 在 SimpleQA（57.9 vs Opus 46.2 / GPT 45.3）、Apex Shortlist（90.2 vs Opus 85.9）、Codeforces（3206 vs Opus 3168）上领先，SWE Verified 并列（80.6 vs Opus 80.8 / GPT 80.6）。在 Terminal Bench 2.0（67.9 vs GPT 75.1）和 Toolathlon（51.8 vs GPT 54.6）上落后。"
+/>
+
+这一质量水准正是 HaiShaw 领导的 AMD SGLang 团队将 MI355X 服务视为 14 天冲刺的原因：一个前沿的开源编码模型值得投入工程力量，而一旦在 AMD 硅片上形成可用的性能曲线，服务栈上每一个百分点的性能/成本改进都会推动真实工作负载的迁移。
+
+## 实现这一切的关键贡献
+
+**上游技术栈：`amd/deepseek_v4` SGLang 分支。** [sgl-project/sglang `amd/deepseek_v4`](https://github.com/sgl-project/sglang/compare/main...amd/deepseek_v4) 是一个持续 rebase 的分支，以编号的性能优化 PR 形式合入 AMD 专用的 DeepSeek-V4-Pro 内核。截至 2026-05-22 共 31 个 PR，四位主要贡献者。本文中的每一次测量均基于该分支的镜像，而非 SGLang main（关于上游合并的情况请参见[后续计划](#mi355x-deepseek-v4-pro-的后续计划)）。按机制分组的关键优化：
+
+- **DSA 注意力（TileLang 索引器 + Triton 稀疏 MLA）。** [#24033](https://github.com/sgl-project/sglang/pull/24033)（Thomas Wang，04-29）将 TileLang 注意力路径移植至 ROCm；[#24050](https://github.com/sgl-project/sglang/pull/24050)（Thomas Wang，04-29）在 TileLang 中添加了**注意力索引器**；[#24930](https://github.com/sgl-project/sglang/pull/24930)（amd-danli103，05-11）引入了 **Triton 稀疏 MLA 内核**；[#25878](https://github.com/sgl-project/sglang/pull/25878)（05-20）和 [#25977](https://github.com/sgl-project/sglang/pull/25977)（jacky.cheng，05-22）分别将 prefill 和 extend 的 gather + attention 路径融合为单次调度。
+- **mHC 融合（多头压缩，逐 token 压缩路径）。** [#24355](https://github.com/sgl-project/sglang/pull/24355)（kk，05-04）"优化 mhc 性能"；[#24424](https://github.com/sgl-project/sglang/pull/24424)（Thomas Wang，05-05）**压缩器逐元素内核融合**；[#25020](https://github.com/sgl-project/sglang/pull/25020)（Xinyi Song，05-12）压缩器优化；[#25245](https://github.com/sgl-project/sglang/pull/25245)（jacky.cheng，05-15）**融合 softmax pool Triton 内核用于压缩器**；[#25353](https://github.com/sgl-project/sglang/pull/25353)（Xinyi Song，05-15）"启用新压缩器路径"；[#26014](https://github.com/sgl-project/sglang/pull/26014)（Xinyi Song，05-22）**低并发下的 Triton 融合 mhc_post_pre**。
+- **RoPE + Hadamard 融合。** [#24727](https://github.com/sgl-project/sglang/pull/24727)（Xinyi Song，05-09）**使用 `rope_rotate_activation` 融合 RoPE Hadamard**——消除了 CPU 侧的一次 launch 调用，改善了每步解码循环的 HBM 利用率。[#24249](https://github.com/sgl-project/sglang/pull/24249)（Xinyi Song，05-02）完成了类似的**融合 compress-decode** 内核。
+- **MoE：FlyDSL + FP4 + 融合 hash topk。** [#24031](https://github.com/sgl-project/sglang/pull/24031)（kk，04-29）启用 **FP4 模型路径**；[#24728](https://github.com/sgl-project/sglang/pull/24728)（Xinyi Song，05-09）**融合 hash topk** 路由步骤；[#24971](https://github.com/sgl-project/sglang/pull/24971)（Thomas Wang，05-11）合入 **FlyDSL MoE 后端**用于 ROCm；[#25070](https://github.com/sgl-project/sglang/pull/25070)（Thomas Wang，05-12）添加了 swiglu-limit 密集 MoE / shared expert 路径。
+- **AITER 内核 + 其他融合。** 05-07 cherry-pick 了 AITER MHC 前/后处理修复（[commit b639cb6](https://github.com/sgl-project/sglang/commit/b639cb6)）；[#25043](https://github.com/sgl-project/sglang/pull/25043)（jacky.cheng，05-12）**将注意力路径上的 input_layernorm 与 FP8 per-128 group 量化融合**；[#25251](https://github.com/sgl-project/sglang/pull/25251)（jacky.cheng，05-19）为全 greedy 采样使用 **AITER `greedy_sample`**；[#25097](https://github.com/sgl-project/sglang/pull/25097)（Raiden Makoto，05-13）**ROCm 上的 Triton 融合 store cache**；[#25375](https://github.com/sgl-project/sglang/pull/25375)（Thomas Wang，05-18）wqb 输入的 **rmsnorm_quant 融合**。
+
+**InferenceX 配方迭代循环。** InferenceX 基准测试（benchmark）配方通过大约每 2–3 天一次的镜像更新吸收每一波上游改进：容器镜像从 `rocm/sgl-dev:v0.5.10rc0-rocm720-mi35x-20260414`（04-25，仅 FP8，配方需要 `SGLANG_HACK_FLASHMLA_BACKEND=torch` 才能编译）→ `rocm/sgl-dev:rocm720-mi35x-583b1b6-20260501-DSv4`（05-02，通过 `SGLANG_DSV4_FP4_EXPERTS=True` 启用 FP4）→ `a8410de6-20260502`（05-03，融合 compress-decode）→ `bfd32b6-20260507`（05-08，AITER MHC 前/后处理 + Triton SWA prepare）→ `0363e6c-20260509` → `b19052c-20260518`（05-19，稳定的 `lmsysorg/sglang:v0.5.12-rocm720-mi35x` 仓库，含 Triton 注意力后端、FlyDSL MoE、融合 hash topk）→ `8c3b5aa-20260521`（05-21 最终版）。镜像更新之间的配方调优收紧了 `--num-continuous-decode-steps`（4 → 8，+4.7%），将 `--max-running-requests` 和 `--cuda-graph-max-bs` 调整为矩阵并发值，并在 DP attention 配置上启用了 `--enable-prefill-delayer`。
+
+## 数据详情
+
+所有数据行均为 DeepSeek-V4-Pro 在 **ISL 8192 / OSL 1024** 下，使用单台 MI355X 8-GPU 节点，于 2026-04-25 至 2026-05-21 期间在 InferenceX 上测量。吞吐量（throughput）为每 GPU 值。精度：2026-04-25 为 FP8（发布时唯一可用的路径）；2026-05-02 起为 FP4，使用 `deepseek-ai/DeepSeek-V4-Pro` 并设置 `SGLANG_DSV4_FP4_EXPERTS=True`。后期运行在高并发下启用了 DP attention。
+
+**2026-04-25（FP8，基线首次点亮）：**
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) |
+| ---- | --------- | ---------- | --------- |
+| 8    | 20.4      | 2.43       | 411       |
+| 32   | 42.0      | 1.19       | 843       |
+| 64   | 67.4      | 0.93       | 1,074     |
+
+**2026-05-02（FP4 首次点亮，+TileLang 注意力，FP4 启用）：**
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) |
+| ---- | --------- | ---------- | --------- |
+| 1    | 25.2      | 23.89      | 41.86     |
+| 2    | 45.4      | 21.65      | 46.41     |
+| 4    | 76.5      | 18.38      | 54.87     |
+| 8    | 115.8     | 13.87      | 72.92     |
+| 16   | 167.2     | 10.07      | 97.87     |
+| 32   | 247.0     | 7.33       | 138.64    |
+| 64   | 359.9     | 5.23       | 199.14    |
+| 128  | 500.2     | 3.61       | 288.50    |
+
+**2026-05-04（+融合 compress-decode，+TileLang MHC 后处理，移除 Torch 回退）：**
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) |
+| ---- | --------- | ---------- | --------- |
+| 1    | 33.3      | 31.82      | 31.43     |
+| 4    | 102.1     | 24.65      | 40.86     |
+| 8    | 153.0     | 18.43      | 54.82     |
+| 16   | 218.9     | 13.04      | 77.62     |
+| 32   | 324.2     | 10.10      | 100.26    |
+| 64   | 455.7     | 6.86       | 151.33    |
+| 128  | 614.6     | 4.54       | 227.59    |
+
+**2026-05-10（+AITER MHC 前/后处理，+Triton SWA prepare，+FlyDSL MoE 预览版）：**
+
+| 并发 | tok/s/GPU | tok/s/user | TPOT (ms) |
+| ---- | --------- | ---------- | --------- |
+| 1    | 43.9      | 42.44      | 23.56     |
+| 4    | 136.0     | 33.11      | 30.45     |
+| 8    | 233.4     | 28.63      | 35.44     |
+| 16   | 336.1     | 20.33      | 49.86     |
+| 32   | 488.3     | 16.80      | 60.58     |
+| 64   | 802.9     | 14.81      | 66.43     |
+| 128  | 1,194.3   | 10.17      | 98.80     |
+| 256  | 1,503.2   | 6.14       | 164.86    |
+
+**2026-05-21（最新：SGLang v0.5.12 + Triton 注意力后端 + 融合 hash topk + FlyDSL MoE）：**
+
+| 并发    | tok/s/GPU   | tok/s/user | TPOT (ms)  |
+| ------- | ----------- | ---------- | ---------- |
+| 1       | 59.2        | 57.06      | 17.52      |
+| 4       | 198.5       | 47.71      | 20.96      |
+| 8       | 348.2       | 41.78      | 23.94      |
+| 16      | 561.3       | 33.37      | 29.97      |
+| 32      | 811.7       | 23.99      | 41.68      |
+| 64      | 959.6       | 16.79      | 59.56      |
+| 128     | 1,556.0     | 13.76      | 72.69      |
+| **256** | **2,256.1** | **9.37**   | **106.75** |
+| 512     | 1,814.4     | 5.59       | 178.90     |
+
+加粗行即为标题数据：**在并发 256 + DP attention 下达到 2,256 tok/s/GPU、9.4 tok/s/user**——相比 04-25 首次点亮时 20.4 tok/s/GPU、2.4 tok/s/user **提升 110.5 倍**（即使与 04-25 峰值 67.4 tok/s/GPU、0.9 tok/s/user 相比也有 33.5 倍提升，而那已经不是可用的服务工作点）。这是 MI355X 上 DSv4-Pro 单节点聚合服务的全新性能天花板。
+
+## 等交互性吞吐量对比
+
+在匹配交互性水平下的每 GPU 吞吐量，沿各日期的帕累托前沿进行插值。2026-04-25 的交互性未超过 2.5 tok/s/user，因此该日期每行均显示 `_unreachable_`——模型当时尚未进入服务区间。超出前沿测量范围的单元格显示为 `_unreachable_`。
+
+| 交互性 (tok/s/user) | 04-25         | 05-02         | 05-04         | 05-10         | 05-21         | 05-02 → 05-21 |
+| ------------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
+| 8                   | _unreachable_ | 221           | 401           | 1,363         | _unreachable_ | _n/a_         |
+| 10                  | _unreachable_ | 169           | 328           | 1,208         | 2,162         | **12.8x**     |
+| 12                  | _unreachable_ | 136           | 247           | 1,065         | 1,855         | **13.6x**     |
+| **15**              | _unreachable_ | **104**       | **194**       | **775**       | **1,272**     | **12.2x**     |
+| 17                  | _unreachable_ | 88            | 169           | 473           | 951           | 10.8x         |
+| 20                  | _unreachable_ | 61            | 139           | 361           | 876           | **14.3x**     |
+| 25                  | _unreachable_ | _unreachable_ | 99            | 266           | 788           | _∞_           |
+| 30                  | _unreachable_ | _unreachable_ | 50            | 205           | 653           | _∞_           |
+| 40                  | _unreachable_ | _unreachable_ | _unreachable_ | 89            | 393           | _∞_           |
+| 50                  | _unreachable_ | _unreachable_ | _unreachable_ | _unreachable_ | 140           | _∞_           |
+
+核心结论是**从 2026-05-02 到 2026-05-21，在 10–20 tok/s/user 的服务区间内，等交互性下的每 GPU 吞吐量提升了约 11–14 倍**。提升逐日期层层累加——每次镜像更新都将曲线再推高 1.6–4.4 倍。高交互性段（25+ tok/s/user）在 05-04 之后才完全打开，而 50 tok/s/user 仅在 05-21 配合最新的 FlyDSL MoE + 融合 hash topk 内核（`lmsysorg/sglang:v0.5.12-rocm720-mi35x`）后才可量测。
+
+<Figure
+  srcLight="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/benchmark-light.png"
+  srcDark="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/benchmark-dark.png"
+  alt="MI355X SGLang DeepSeek-V4-Pro 每 GPU 吞吐量与交互性对比，5 个日期：2026-04-25（FP8 基线，约 67 tok/s/GPU、低于 3 tok/s/user）、2026-05-02（FP4 首次点亮，峰值约 500 tok/s/GPU）、2026-05-04（约 615）、2026-05-10（约 1503）、2026-05-21（约 2256 tok/s/GPU、9.4 tok/s/user）。每个日期的曲线均向右上方移动。"
+  caption="MI355X SGLang DeepSeek-V4-Pro（1.6T / 49B 激活参数）在 ISL 8192 / OSL 1024 下的表现。来自 amd/deepseek_v4 SGLang 分支的 5 个日期，跨度 26 天。标注表示 8-GPU TP=8 配置；后期日期在高并发段使用了 DP attention。"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=DeepSeek-V4-Pro&i_gpus=mi355x_sglang&i_dates=2026-05-03%2C2026-05-04%2C2026-05-08%2C2026-05-19%2C2026-05-21%2C2026-04-25&i_prec=fp4%2Cfp8&i_dstart=2026-04-25&i_dend=2026-05-21&i_linelabel=1)，已预过滤为 MI355X SGLang DSv4-Pro 在 5 个测量日期的数据。
+
+## MI355X DeepSeek-V4-Pro 的后续计划
+
+**MI355X 与 NVIDIA 在 DSv4-Pro 上的差距不在硅片——在于软件。** 从纸面参数看，MI355X 的 HBM 更大（288 GB vs B200 的 180 GB——**1.60 倍容量**），HBM 带宽相同（均为 8 TB/s），密集计算在各精度下均略高（FP4 / FP8 / BF16 均为 B200 的 **1.12 倍**）。B200 唯一领先的硅片参数是节点内扩展带宽——NVLink 5 单向 900 GB/s vs 第五代 Infinity Fabric 的 576 GB/s，1.56 倍优势——但在单节点 TP=8 运行 1.6T 激活 49B 的 MoE 模型时，这一差距的影响小于 AMD 分支仍在弥合的内核栈成熟度差距。
+
+<Figure
+  srcLight="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/specs-radar-light.png"
+  srcDark="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/specs-radar-dark.png"
+  alt="GPU 规格雷达图对比 MI355X（红色）与 B200 SXM（绿色），来自 /gpu-specs。MI355X 多边形在 Memory 轴达到 100%（288 GB，与 GB300 NVL72 持平），在 FP8 + BF16 TFLOP/s 轴亦然（5,033 / 2,516——单 GPU 天花板）。B200 多边形仅在 Scale Up BW 轴领先（NVLink 5 的 900 GB/s vs MI355X 的 Infinity Fabric 576 GB/s）。FP4 轴上两者均压缩于 GB300 NVL72 的 15,000 TFLOP/s 天花板之下。扩展域轴以 GB200/GB300 NVL72 的 72 GPU 为上限，因此两个 8-GPU SKU 均读数约 11%。"
+  caption="MI355X（红色）与 B200 SXM（绿色）在 /gpu-specs 上的对比。各轴按面板中所有厂商 SKU 的最大值归一化。MI355X 在每 GPU FP8 / BF16 / Memory 上达到天花板值；B200 仅在扩展带宽上领先。"
+/>
+
+| 规格                    | MI355X                     | B200 SXM            | MI355X / B200 |
+| ----------------------- | -------------------------- | ------------------- | ------------- |
+| HBM 容量                | 288 GB                     | 180 GB              | **1.60x**     |
+| HBM 带宽                | 8 TB/s                     | 8 TB/s              | 1.00x         |
+| 密集 FP4 (TFLOP/s)      | 10,066                     | 9,000               | 1.12x         |
+| 密集 FP8 (TFLOP/s)      | 5,033                      | 4,500               | 1.12x         |
+| 密集 BF16 (TFLOP/s)     | 2,516                      | 2,250               | 1.12x         |
+| 每 GPU 扩展带宽（单向） | 576 GB/s (Infinity Fabric) | 900 GB/s (NVLink 5) | 0.64x         |
+| 扩展域 GPU 数           | 8                          | 8                   | 1.00x         |
+| 扩展域 HBM 容量         | 2.30 TB                    | 1.44 TB             | **1.60x**     |
+| 扩展域 HBM 带宽（聚合） | 64 TB/s                    | 64 TB/s             | 1.00x         |
+
+因此，当实测的 B200 SGLang DSv4-Pro 曲线在 15–30 tok/s/user 服务区间内比 MI355X SGLang 高出约 5 倍（完全相同的 FP4 / 8K / 1K 负载）时，这一差距不来自算力，不来自 HBM 容量，不来自 HBM 带宽，也几乎与扩展带宽无关。差距在于**上游内核覆盖度、融合完整性以及调度器调优**——正是 `amd/deepseek_v4` 分支持续 rebase 追赶的方向，也是 26 天内 110.5 倍吞吐量提升的来源：
+
+<Figure
+  srcLight="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/b200-vs-mi355x-light.png"
+  srcDark="/images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/b200-vs-mi355x-dark.png"
+  alt="DeepSeek V4 Pro 1.6T FP4 8K/1K——B200（SGLang，绿色）vs MI355X（SGLang，红色）每 GPU 吞吐量与交互性对比。B200 SGLang 在并发 8（低交互性）时峰值约 3.5k tok/s/GPU，在超过 70 tok/s/user 时仍有可用吞吐量。MI355X SGLang 在低交互性时峰值约 2.25k tok/s/GPU，在约 50 tok/s/user 以下开始下降。在 15–30 tok/s/user 服务区间的等交互性垂直差距约为 4–5 倍——完全是软件差距。"
+  caption="B200 SGLang vs MI355X SGLang 在 DeepSeek-V4-Pro FP4 ISL 8192 / OSL 1024 下的对比（InferenceX，2026-05-22 运行）。相同模型、相同精度、相同框架、均为单节点 TP=8 聚合服务。来源：SemiAnalysis InferenceX。"
+/>
+
+据 [SemiAnalysis 评估](https://x.com/SemiAnalysis_/status/2053520440589451720)，接下来需要弥合的差距：
+
+- **还需约 5 倍吞吐量才能追平单节点聚合 B200。** B200 上的 SGLang DSv4-Pro 技术栈在 70+ tok/s/user 时已达到数千 tok/s/GPU 区间，而 MI355X SGLang 仅在低交互性的左侧边缘触及此水平。按照 `amd/deepseek_v4` 分支当前的 PR 节奏，AMD 在未来几周内追平这一差距是现实可行的——硅片具备能力，内核只需继续追赶。
+- **还需额外约 1.5 倍以追平 PD 分离式 B200。** InferenceX 尚未发布 MI355X DSv4-Pro 的分离式配方。`mori-sglang` AMD 分离式分支具备 prefill/decode 分离原语，但尚未接入 InferenceX 循环中的 DSv4-Pro 配方。
+- **AMD 分支上的持续内核节奏。** 31 个 PR 的开发节奏造就了 110.5 倍的提升；[开放的 compare 视图](https://github.com/sgl-project/sglang/compare/main...amd/deepseek_v4)仍在每 2–3 天添加新的性能优化 PR，因此本文中的曲线到下周就会过时。新的压缩器路径（[#25353](https://github.com/sgl-project/sglang/pull/25353)）和 extend 的融合 nosplitk 注意力调度（[#25977](https://github.com/sgl-project/sglang/pull/25977)）在 2026-05-21 数据集之后才合入，尚未反映在图表中。
+- **分支 → SGLang main 上游迁移。** 第一批代码在 [PR #24933](https://github.com/sgl-project/sglang/pull/24933)（kk，2026-05-18 合入，跨 17 个文件 +3,678 / -70）中合入，足以通过 `is_hip` / `use_aiter` 门控、替换无法在 ROCm 上编译的 JIT 融合内核的 Triton 版本，以及新的 HIP 注意力后端在 SGLang main 上以 **eager 模式**运行 DSv4-Pro。PR 描述明确标注了后续工作："后续 PR 将从 `amd/deepseek_v4` 分支合入剩余的 DSv4 优化"；压缩流融合、多流启用、TileLang 注意力索引器、FlyDSL MoE 以及关键的 `SGLANG_OPT_*` 开关截至 2026-05-22 仍为分支专有。在这些迁移完成之前，MI355X 在 SGLang `main` 上的 DSv4-Pro 服务性能将比本文测量结果低一个数量级；分支镜像（`lmsysorg/sglang:v0.5.12-rocm720-mi35x-*`）仍是复现上述曲线的唯一途径。
+
+对于当前的 MI355X DSv4-Pro 服务，基于 `lmsysorg/sglang:v0.5.12-rocm720-mi35x-20260517` 的 2026-05-21 配方是生产前沿——任何早于 05-10 的版本不应作为基准测试对比对象。
+
+## 致谢
+
+31 个性能优化 PR 是以下贡献者的工作成果：[Thomas Wang](https://github.com/thomawan)（TileLang 注意力索引器、FlyDSL MoE、压缩器逐元素融合、带 CUDA graph 的 attn early-exit、rmsnorm-quant 融合）、[Xinyi Song](https://github.com/xinyiisme)（融合 compress-decode、融合 RoPE Hadamard、融合 hash topk、压缩器优化）、[HaiShaw](https://github.com/HaiShaw)（集成协调 + 环境配置）、[amd-danli103](https://github.com/amd-danli103)（Triton 稀疏 MLA + 融合调度）、[jacky.cheng](https://github.com/jackylee99)（input_layernorm + FP8 per-group 量化融合、softmax pool、AITER greedy_sample）、[kk](https://github.com/kkHuang-amd)（FP4 启用、MHC 性能、fuse_wqkv）、[Raiden Makoto](https://github.com/raidenmakoto)（Triton 融合 store cache）、[Xinyu Jiang](https://github.com/xinyujiang)（radix 优化），以及更广泛的 AMD AI 团队。从上游到基准测试的迭代速度就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=DeepSeek-V4-Pro&i_gpus=mi355x_sglang&i_dates=2026-05-03%2C2026-05-04%2C2026-05-08%2C2026-05-19%2C2026-05-21%2C2026-04-25&i_prec=fp4%2Cfp8&i_dstart=2026-04-25&i_dend=2026-05-21&i_linelabel=1">
+  点击查看完整的 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "AMD MI355X SGLang DeepSeek-V4-Pro 自发布以来性能提升了多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 8K/1K 负载下，MI355X SGLang DeepSeek-V4-Pro 的每 GPU 吞吐量从 2026-04-25 的 20.4 tok/s/GPU、2.4 tok/s/user（FP8 首次点亮，并发 8）增长到 2026-05-21 的 2,256 tok/s/GPU、9.4 tok/s/user（FP4，并发 256，启用 DP attention）——26 天内每 GPU 吞吐量提升 110.5 倍，同时交互性也提升了 3.85 倍。在 10–20 tok/s/user 的服务区间内，从 2026-05-02 首次 FP4 测量到 2026-05-21 的等交互性累积提升为 12–14 倍（15 tok/s/user 时从 104 到 1,272 tok/s/GPU；20 tok/s/user 时从 61 到 876）。SemiAnalysis 此前指出 14 天内核心级别的提升约为 75 倍。测量于 InferenceX，GHA 运行 26306422380。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "SGLang amd/deepseek_v4 分支上有哪些关键改动？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "截至 2026-05-22，sgl-project/sglang amd/deepseek_v4 分支上共有 31 个编号的性能优化 PR。关键内核变更包括：TileLang 注意力路径及注意力索引器（PR 24033 和 24050，Thomas Wang）；Triton 稀疏 MLA 内核及后续融合 gather+attention 调度（PR 24930、25878、25977，amd-danli103 和 jacky.cheng）；融合多头压缩（mHC）操作（PR 24355 kk、PR 24424 Thomas Wang、PR 25353 Xinyi Song、PR 26014 Xinyi Song）；融合 RoPE 和 Hadamard（PR 24727 Xinyi Song）；FlyDSL MoE 后端（PR 24971 Thomas Wang）；融合 hash topk 路由（PR 24728 Xinyi Song）；FP4 模型路径启用（PR 24031 kk）；AITER MHC 前/后处理引入；input_layernorm 与 FP8 per-128 group 量化融合（PR 25043 jacky.cheng）；wqb 输入的 rmsnorm-quant 融合（PR 25375 Thomas Wang）。InferenceX 配方通过大约每 2–3 天一次的容器镜像更新吸收每波上游改进，从 rocm/sgl-dev:v0.5.10rc0-rocm720-mi35x-20260414 发展到 lmsysorg/sglang:v0.5.12-rocm720-mi35x-20260517。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 2026-04-25 的首次点亮测量如此之慢？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "DeepSeek-V4-Pro 于 2026-04-24 发布，采用新颖的注意力路径（逐 token 压缩加 DSA，即 DeepSeek Sparse Attention），上游 SGLang main 分支在发布时无法在 Blackwell 或 ROCm 上运行该模型。04-25 的 InferenceX 配方被迫使用 SGLANG_HACK_FLASHMLA_BACKEND=torch 作为回退，且仅 FP8 路径可以编译，因此实测的内核时间主要由 torch 回退路径主导，而非后续两周陆续合入的生产级注意力索引器或压缩器内核。结果是峰值仅 67 tok/s/GPU、0.93 tok/s/user，这并非可用的服务工作点。2026-05-02 使用正式 TileLang 注意力路径的首次 FP4 测量才是曲线首次进入可用交互性区间。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "MI355X 与 NVIDIA B200 在 DeepSeek-V4-Pro 上的对比如何？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "据 SemiAnalysis 评估，MI355X DSv4-Pro 在等交互性下仍需约 5 倍的吞吐量提升才能追平相同负载下的单节点聚合 B200，并在此基础上再提升约 1.5 倍才能追平 PD 分离式 B200。按照 amd/deepseek_v4 SGLang 分支当前的 PR 节奏（26 天内 31 个性能优化 PR），在未来几周内弥合单节点差距是现实可行的。InferenceX 尚未发布 MI355X DSv4-Pro 的分离式配方。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "MI355X DeepSeek-V4-Pro 在 SGLang 上还有哪些未覆盖的领域？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "仍有三处差距。第一，本文中的仪表板图表是 2026-05-21 的快照；新的压缩器路径（PR 25353）、extend 的融合 nosplitk 注意力调度（PR 25977）以及低并发下的 Triton 融合 mhc_post_pre（PR 26014）在 05-21 数据集之后才合入 amd/deepseek_v4 分支，尚未反映在图表中。第二，MI355X 在 InferenceX 中尚无 DSv4-Pro 的分离式 prefill+decode 配方；mori-sglang AMD 分离式分支具备相关原语，但尚未接入 DSv4-Pro 配方。第三，本文的 8K/1K 负载为单节点 TP=8 并在高并发下启用 DP attention；更长上下文（DSv4 默认 1M）和分离式配方仍待上游合入。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx b/packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx
new file mode 100644
index 00000000..c8c0e976
--- /dev/null
+++ b/packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx
@@ -0,0 +1,195 @@
+---
+title: 'AMD MI355X GLM-5 推理：SGLang FP8 单节点每百万 token 成本比 B200 最高低 40%'
+subtitle: 'GLM-5 发布 14 周后，AMD 在 MI355X 上同时实现了 SGLang FP8 的 MTP 和非 MTP 方案 — 通过 TileLang 实现的融合 MLA + FP8 KV 缓存在大部分性能 Pareto 前沿上将单节点 FP8 成本曲线翻转为 AMD 占优'
+date: '2026-05-25'
+publishDate: '2026-05-25'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - glm5
+  - amd
+  - nvidia
+  - mi355x
+  - b200
+  - sglang
+  - rocm
+---
+
+GLM-5 发布 14 周后，AMD MI355X SGLang FP8 在 8k/1k 工作负载的大部分单节点 Pareto 前沿上，每百万 token 成本低于 NVIDIA B200 SGLang FP8（从约 10 到约 77 tok/s/user；B200 在约 90 tok/s/user 以上反超）。峰值差距为**使用 MTP 时在 18 tok/s/user 下达到 1.41 倍**（B200 $0.30/M vs MI355X $0.22/M，降低 40%），**不使用 MTP 时在 10 tok/s/user 下达到 1.36 倍**（$0.31/M vs $0.23/M）。两项测试均使用 **SGLang v0.5.12**，MI355X 的 ROCm 软件栈在此版本上已与 B200 的 CUDA 软件栈功能对齐：均支持 MTP 和非 MTP 方案，均支持 FP8 KV 缓存，均基于 SGLang 最新的 TileLang MLA 路径。
+
+关键在于节奏。GLM-5 发布后的一个季度内，AMD 就完成了上游 SGLang 内核的合入（[sgl-project/sglang PR #21511](https://github.com/sgl-project/sglang/pull/21511)）及其他优化，并提交了配套的 InferenceX 方案（[InferenceX PR #1440](https://github.com/SemiAnalysisAI/InferenceX/pull/1440)），将该模型的 FP8 单节点成本曲线翻转为 AMD 占优。速度就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_model=GLM-5&i_prec=fp8&g_rundate=2026-05-20&g_runid=26187777287&i_active=b200_sglang%2Cb200_sglang_mtp%2Cmi355x_sglang%2Cmi355x_sglang_mtp&i_metric=y_costh&i_linelabel=1">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+GLM-5 是智谱（ZAI）的 MoE 旗舰模型，于 2026-02-11 发布 — 距本文所述的 InferenceX 测试正好 14 周。该模型拥有 **744B 参数的稀疏 MoE 架构，每 token 激活约 40B**：256 个专家采用 top-8 路由（约 5.9% 稀疏度），外加共享专家。公开的架构名称为 `glm_moe_dsa` — 模型在解码路径中集成了 **DeepSeek 稀疏注意力（DSA）**，这与 DeepSeek 在 V3.2 中引入的稀疏注意力模式相同，也是 SGLang 的 TileLang 后端所围绕构建的核心，同时采用多头潜在注意力（MLA）进行 KV 缓存压缩以支持其 200K 上下文窗口。
+
+在 MI355X 上，等效能力在四月中旬通过 SGLang 的 TileLang 后端落地，由此带来的解码吞吐量提升使得 MI355X 较低的单 GPU TCO（$1.48/GPU/hr，B200 为 $1.95/GPU/hr，数据来源 [SemiAnalysis AI Cloud TCO 模型](https://newsletter.semianalysis.com/p/ai-cloud-economics)）得以转化为真正的每 token 成本优势，而非被软件差距所淹没。
+
+## 推动这一结果的关键优化
+
+AMD 方面的标志性性能优化之一是 [sgl-project/sglang PR #21511](https://github.com/sgl-project/sglang/pull/21511)（由 [HaiShaw](https://github.com/HaiShaw) 提交，2026-04-03 合入）。该 PR 通过 SGLang 的 TileLang 后端为 MI300/MI355 启用了 FP8 KV 缓存和 FP8 注意力内核（在 DeepSeek-V3.2 和 GLM-5 上均已测试），并针对不同硬件代际采用了不同的融合策略：
+
+- **在 MI355 上**，该 PR **复用了现有的 `fused_qk_rope_cat_and_cache_mla` 内核来同时处理 Q 和 KV 的 FP8 量化**。QK rope 拼接、MLA 缓存写入以及 Q 和 KV 的 FP8 量化全部合并到每个解码步骤的单次内核调用中 — 无需额外的 HBM 往返，无需单独的量化内核启动。
+
+在 MI300 上，KV 缓存量化则由独立的 Triton 内核 `set_mla_kv_buffer_fp8_quant` 处理。此外，TileLang 依赖版本已更新以在 AMD 上启用 FP8 GEMM，并新增了 `sparse_mla_fwd_decode_partial_fp8` 内核用于部分解码归约路径。该 PR 报告 MI355 吞吐量提升超过 5%（MI300 超过 10%），gsm8k 准确率无下降（DeepSeek-V3.2 0.945 → 0.946；GLM-5 0.946 → 0.950）；该功能通过 `--kv-cache-dtype fp8_e4m3` 配合 TileLang 预填充/解码后端启用。
+
+## 基准测试数据
+
+所有数据均为 GLM-5 FP8，**ISL 8192 / OSL 1024**，单节点非分离式部署，于 2026-05-20 在 InferenceX 上测量，CUDA（B200）和 ROCm（MI355X）均使用 **SGLang v0.5.12**。每百万总 token 成本计算方式为 `TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6)`，B200 为 $1.95/GPU/hr，MI355X 为 $1.48/GPU/hr。
+
+容器镜像：
+
+- **B200:** `lmsysorg/sglang:v0.5.12-cu130`
+- **MI355X:** `lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517`
+
+**B200 SGLang FP8 MTP，TP=8，8 GPU：**
+
+| 并发数 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ------ | --------- | ---------- | --------- | ---------- |
+| 4      | 417.0     | 100.85     | 9.92      | $1.30      |
+| 8      | 650.1     | 77.82      | 12.85     | $0.83      |
+| 16     | 952.7     | 56.93      | 17.57     | $0.57      |
+| 32     | 1,296.8   | 38.16      | 26.21     | $0.42      |
+| 64     | 1,619.3   | 23.56      | 42.45     | $0.34      |
+| 128    | 1,929.5   | 13.78      | 72.59     | $0.28      |
+| 256    | 1,947.3   | 11.88      | 84.15     | $0.28      |
+
+**MI355X SGLang FP8 MTP，TP=4，4 GPU**（Pareto 锚定方案）：
+
+| 并发数 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ------ | --------- | ---------- | --------- | ---------- |
+| 4      | 625.5     | 76.80      | 13.02     | $0.66      |
+| 8      | 911.7     | 54.59      | 18.32     | $0.45      |
+| 16     | 1,208.1   | 35.82      | 27.92     | $0.34      |
+| 32     | 1,707.4   | 24.83      | 40.27     | $0.24      |
+| 64     | 1,895.0   | 18.19      | 54.99     | $0.22      |
+| 128    | 1,911.7   | 18.05      | 55.40     | $0.22      |
+
+**MI355X SGLang FP8 MTP，TP=8，8 GPU**（高交互性分支）：
+
+| 并发数 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ------ | --------- | ---------- | --------- | ---------- |
+| 4      | 373.4     | 90.43      | 11.06     | $1.10      |
+| 8      | 534.2     | 65.05      | 15.37     | $0.77      |
+
+**B200 SGLang FP8 非 MTP，TP=8，8 GPU：**
+
+| 并发数 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ------ | --------- | ---------- | --------- | ---------- |
+| 4      | 231.3     | 54.25      | 18.43     | $2.34      |
+| 8      | 382.4     | 46.07      | 21.71     | $1.42      |
+| 16     | 613.2     | 36.65      | 27.28     | $0.88      |
+| 32     | 933.7     | 27.47      | 36.40     | $0.58      |
+| 64     | 1,291.8   | 18.42      | 54.28     | $0.42      |
+| 128    | 1,669.1   | 11.87      | 84.23     | $0.32      |
+| 256    | 1,746.1   | 10.72      | 93.27     | $0.31      |
+
+**MI355X SGLang FP8 非 MTP，TP=4，4 GPU：**
+
+| 并发数 | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ------ | --------- | ---------- | --------- | ---------- |
+| 4      | 358.8     | 42.03      | 23.79     | $1.15      |
+| 8      | 579.6     | 34.68      | 28.83     | $0.71      |
+| 16     | 870.8     | 25.86      | 38.67     | $0.47      |
+| 32     | 1,274.0   | 18.57      | 53.86     | $0.32      |
+| 64     | 1,660.1   | 11.83      | 84.56     | $0.25      |
+| 128    | 2,071.4   | 7.33       | 136.36    | $0.20      |
+| 256    | 2,189.4   | 6.69       | 149.45    | $0.19      |
+
+## 等交互性成本对比
+
+对两条 Pareto 前沿在匹配交互性下进行插值。对于 MI355X MTP，Pareto 前沿在每个交互性水平上取 TP=4 和 TP=8 中每百万 token 成本较低者：TP=4 在约 77 tok/s/user 以下占优，TP=8 并发数 4 在高交互性端（约 90 tok/s/user）接管，因为 TP=4 无法达到该区间。
+
+**MTP：**
+
+| 交互性 (tok/s/user) | B200 SGLang MTP $/M tok | MI355X SGLang MTP $/M tok | B200 / MI355X |
+| ------------------- | ----------------------- | ------------------------- | ------------- |
+| **18**              | **$0.30**               | **$0.22**                 | **1.41x**     |
+| 24                  | $0.34                   | $0.24                     | 1.40x         |
+| 35                  | $0.40                   | $0.34                     | 1.17x         |
+| 55                  | $0.55                   | $0.45                     | 1.22x         |
+| 77                  | $0.82                   | $0.66                     | 1.25x         |
+| 90                  | $1.08                   | $1.10                     | 0.98x         |
+
+**非 MTP：**
+
+| 交互性 (tok/s/user) | B200 SGLang $/M tok | MI355X SGLang $/M tok | B200 / MI355X |
+| ------------------- | ------------------- | --------------------- | ------------- |
+| 15                  | $0.37               | $0.28                 | 1.31x         |
+| 20                  | $0.45               | $0.35                 | 1.27x         |
+| 30                  | $0.66               | $0.58                 | 1.14x         |
+| 40                  | $1.07               | $1.03                 | 1.05x         |
+
+<Figure
+  srcLight="/images/mi355x-glm5-fp8-sglang-40-cheaper-than-b200/benchmark-light.png"
+  srcDark="/images/mi355x-glm5-fp8-sglang-40-cheaper-than-b200/benchmark-dark.png"
+  alt="GLM-5 FP8 8k/1k 每百万总 token 成本与交互性关系图，B200 SGLang 和 MI355X SGLang，含和不含 MTP 投机解码"
+  caption="GLM-5 FP8 8k/1k。每百万总 token 成本与交互性。B200 SGLang 和 MI355X SGLang，含和不含 MTP。标签标注每个配置的 GPU 数量。"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_model=GLM-5&i_prec=fp8&g_rundate=2026-05-20&g_runid=26187777287&i_active=b200_sglang%2Cb200_sglang_mtp%2Cmi355x_sglang%2Cmi355x_sglang_mtp&i_metric=y_costh&i_linelabel=1)，预筛选为 2026-05-20 测试中 B200 和 MI355X SGLang 上的 GLM-5 FP8。
+
+## MI355X 在 GLM-5 上的后续展望
+
+此次结果为单节点、聚合、仅 FP8。仍有两个差距待弥合：
+
+- **FP4 可组合性。** 本次对比中 B200 使用的是 CUDA nightly 上的 FP8。B200 NVFP4 SGLang 的 GLM-5 方案已开始交付，将进一步压缩 B200 的成本曲线。MI355X MXFP4 GLM-5.1 SGLang 已通过 [InferenceX PR #1098](https://github.com/SemiAnalysisAI/InferenceX/pull/1098) 于 2026-04-21 交付，但 MI355X 上的 FP4 + MTP 组合尚未达到本文展示的 FP8 + MTP 方案的水平。
+- **分离式部署和宽专家并行。** MI355X 上的 GLM-5 尚无分离式部署或宽 EP 方案。NVIDIA 的 GB200 NVL72 Dynamo TRT-LLM 和 Dynamo vLLM 方案在 Kimi K2.5 上已展示了[机架级宽 EP 带来的约 3 倍每 GPU 吞吐量优势](https://inferencex.semianalysis.com/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200)。AMD 尚未为 GLM-5 交付分离式部署方案。
+
+## 致谢
+
+该方案的快速落地得益于 [Anush Elangovan](https://x.com/AnushElangovan)、[HaiShaw](https://github.com/HaiShaw) 及更广泛的 AMD AI 团队在 GLM-5 发布后 14 周内完成了上游 SGLang TileLang 融合 MLA + FP8 KV 内核的提交。SGLang 维护者在提交后数天内即完成了审查与合入。从上游到基准测试的闭环速度就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_model=GLM-5&i_prec=fp8&g_rundate=2026-05-20&g_runid=26187777287&i_active=b200_sglang%2Cb200_sglang_mtp%2Cmi355x_sglang%2Cmi355x_sglang_mtp&i_metric=y_costh&i_linelabel=1">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "AMD MI355X 在 GLM-5 FP8 单节点推理上比 NVIDIA B200 便宜多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 GLM-5 FP8、8k/1k 序列长度、SGLang MTP 条件下，MI355X 的每百万 token 成本比 B200 最高低 1.41 倍 — 降低 40%。峰值差距出现在 18 tok/s/user，B200 每百万 token 成本为 $0.30，MI355X 为 $0.22。不使用 MTP 时，峰值差距为 1.36 倍，出现在 10 tok/s/user（$0.31 vs $0.23）。MI355X 的成本优势覆盖整个 MTP 曲线的约 12 至约 77 tok/s/user 区间。核心配置为 MI355X 上 4-GPU TP=4 MTP 方案，TCO 为 $1.48/GPU/hr，B200 为 $1.95/GPU/hr。数据基于 2026-05-20 InferenceX 测量。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "SGLang PR #21511 是什么？它如何帮助 MI355X 运行 GLM-5？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "sgl-project/sglang PR #21511（由 HaiShaw 提交，2026-04-03 合入）通过 SGLang 的 TileLang 后端为 AMD MI300 和 MI355 启用了 FP8 KV 缓存和 FP8 注意力内核，在 DeepSeek-V3.2 和 GLM-5 上均已测试。在 MI355 上，它复用了现有的 fused_qk_rope_cat_and_cache_mla 内核来同时处理 Q 和 KV 的 FP8 量化，将 QK rope 拼接、MLA 缓存写入以及 FP8 量化合并到每个解码步骤的单次内核调用中。在 MI300 上，使用独立的 Triton 内核 set_mla_kv_buffer_fp8_quant 处理 KV 缓存量化。该 PR 报告 MI355 吞吐量提升超过 5%（MI300 超过 10%），gsm8k 准确率无下降（GLM-5 0.946 → 0.950，DeepSeek-V3.2 0.945 → 0.946），通过 --kv-cache-dtype fp8_e4m3 配合 TileLang 预填充/解码后端激活。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 MI355X 的成本差距在 18 tok/s/user 时最大？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "两个效应叠加。首先，MI355X SGLang MTP 方案提供了 4-GPU TP=4 变体，在 FP8 下没有直接对应的 4-GPU B200 方案，因此 MI355X 以更少的 GPU 和更低的单 GPU TCO（$1.48 vs $1.95/GPU/hr）分摊相同的吞吐量。其次，MI355X TP=4 的吞吐量曲线在并发数 64 和 128 时稳定在 $0.22/M，两者均提供约 18 tok/s/user，而 B200 在相同交互性下正在并发数 128（$0.28，13.8 tok/s/user）和并发数 64（$0.34，23.6 tok/s/user）之间插值。叠加效应在 18 tok/s/user 时达到峰值（B200 $0.30 vs MI355X $0.22，1.41 倍即便宜 40%）。在 90 tok/s/user 以上，对比略微翻转回 B200 有利，因为没有 MI355X 方案能匹配 B200 的 TP=8 并发数 4 在 100+ tok/s/user 的表现。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "MI355X 在不使用 MTP 的 GLM-5 FP8 上也比 B200 便宜吗？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "是的，但差距在更高交互性下缩小。不使用 MTP 时，MI355X SGLang FP8（TP=4）在 10 tok/s/user 下比 B200 SGLang FP8（TP=8）便宜 1.36 倍（$0.23 vs $0.31/M），在 20 tok/s/user 下便宜 1.27 倍（$0.35 vs $0.45），在 40 tok/s/user 下便宜 1.05 倍（$1.03 vs $1.07）。MTP 拉大了低成本端的差距，因为投机解码提升了 MI355X TP=4 的每步有效吞吐量约 1.34 倍：在并发数 32 时，TPOT 从 53.9 ms（非 MTP）降至 40.3 ms（MTP），tok/s/GPU 从 1,274 升至 1,707。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "AMD MI355X 在 GLM-5 上还有哪些差距待弥合？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "主要有两个差距。FP4 可组合性：B200 NVFP4 SGLang 的 GLM-5 方案已开始交付，速度明显快于 MI355X MXFP4。MI355X MXFP4 GLM-5.1 SGLang 已通过 InferenceX PR #1098 交付，但 MI355X 上的 FP4 + MTP 组合尚未达到本文 FP8 + MTP 方案的水平。分离式部署和宽专家并行：MI355X 上的 GLM-5 尚无分离式部署或宽 EP 方案。NVIDIA GB200 NVL72 Dynamo 方案在 Kimi K2.5 上已展示了机架级宽 EP 带来的约 3 倍每 GPU 吞吐量优势。AMD 尚未为 GLM-5 交付分离式部署方案。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx b/packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx
new file mode 100644
index 00000000..0331d3fc
--- /dev/null
+++ b/packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx
@@ -0,0 +1,130 @@
+---
+title: 'AMD MI355X Kimi K2.5 推理：vLLM 25 天内吞吐量提升 7.7 倍、交互性最高提升 15 倍'
+subtitle: 'vLLM PR #35850 修复了 MI355X CDNA4 上的 AITER MLA 分发路径，解锁 TP=8 下的 Kimi K2.5 推理性能，随 vLLM 0.18 一同发布'
+date: '2026-04-22'
+publishDate: '2026-04-22'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - kimi
+  - amd
+  - vllm
+  - rocm
+  - mi355x
+---
+
+仅凭一个 vLLM PR，AMD MI355X 上 Kimi K2.5 MXFP4 在 8k/1k 工作负载上同等并发下的性能就从 6.6 tok/s/user 跃升至 78.9 tok/s/user。同一个 PR 还带来了其他惊人的性能增益，包括低批次下交互性提升 12.0 倍、峰值吞吐量提升 7.7 倍，以及等吞吐量下最高 15 倍的交互性提升。
+
+然而，更令人印象深刻的是曲线改变的速度。[vLLM PR #35850](https://github.com/vllm-project/vllm/pull/35850) 于 3 月 6 日合入并随 vLLM 0.18 发布，到 3 月 26 日 InferenceX 的基准测试流水线就通过 [InferenceX PR #936](https://github.com/SemiAnalysisAI/InferenceX/pull/936)（该 PR 启用了 MI355X Kimi K2.5 [配方](https://recipes.vllm.ai/moonshotai/Kimi-K2.5?hardware=mi355x&features=tool_calling%2Creasoning%2Cencoder_parallel)上的 AITER、专家并行及 vLLM 0.18.0 升级）捕获了完整效果，距离我们 3 月 1 日的 vLLM 0.16.0 基线仅 25 天。MI355X Kimi K2.5 MXFP4 上的每一个工作点都从一个几乎不可用、只有单点延迟下限的状态，被重写为一条可达 78.9 tok/s/user 低批次交互性和 2,687 tok/s/GPU 峰值吞吐量的完整 Pareto 前沿。这正是我们构建 [InferenceX](https://github.com/SemiAnalysisAI/InferenceX) 自动化基准测试的原因：高效地捕获并报告此类变化。
+
+我们在 [InferenceXv2](https://inferencex.semianalysis.com/blog/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper) 中对 AMD Kimi K2.5 推理提出的最持续的批评之一是可组合性。MI355X 在 CDNA4 上的硅片能力在 tensor-core 层面与 B200 有竞争力，但 AMD 的 ROCm 和 vLLM 路径并不总能释放该能力。这在推理配方仍在成熟中的新一代前沿 MoE 模型上尤为明显。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+## PR #35850 修复了什么
+
+Kimi K2.5 是一个 1T 参数的 MoE 模型，使用多头潜在注意力（MLA，Multi-head Latent Attention），这是 DeepSeek 在 V2 中引入的注意力变体。MLA 通过将 key 和 value 投影到共享潜空间来减少 KV 缓存占用。由此产生的每 rank 注意力头数取决于 tensor-parallel rank 数：TP=4 时 Kimi K2.5 为 16 头/rank，TP=8 时为 8 头/rank。
+
+AITER（AMD 面向 ROCm 的手动调优 AI Tensor Engine）在 CDNA4 上有优化的 MLA 内核路径，但 vLLM 的集成代码在 TP=8 时未能正确分发到该路径。AITER 的 MLA 解码内核围绕 `gqa_ratio=16` 的 ASM 路径构建，原生接受 16 头/rank（TP=4）和 128 头/rank，并拒绝中间值。在 Kimi K2.5 TP=8 的 8 头/rank 下，分发因头数断言失败而回退到 vLLM 的参考 TritonMLA 路径，该路径在 MXFP4 下比 AITER 慢很多。
+
+PR #35850 在单次提交中完成了三项修改：通过头重复技巧（将 8 头填充到 16 头以使用现有的 `gqa_ratio=16` ASM 内核）支持 `num_heads < 16` 的 AITER MLA——这为 Kimi K2.5 TP=8 和 Kimi-Linear TP=16 解锁了性能；放宽了头数断言以接受 4、8 或 [16, 128] 内任意 16 的倍数；以及当使用 FP8 KV 缓存时自动从 TritonMLA 回退到 AITER MLA（TritonMLA 对 FP8 KV 抛出 `NotImplementedError`）。三项修改均随 vLLM 0.18 发布。此外，AMD 对 MoE 专家形态的 MXFP4 GEMM 自动调优也与该 PR 一起贡献了观测到的吞吐量提升。
+
+## 解读曲线
+
+InferenceX 的基准测试结果在变化一落地就捕获到了：
+
+| 日期           | Conc  | Decode TP | tok/s/GPU | TPOT      | tok/s/user | 同并发增益 |
+| -------------- | ----- | --------- | --------- | --------- | ---------- | ---------- |
+| 3 月 1 日      | 4     | 8         | 28.7      | 152 ms    | 6.6        | （基线）   |
+| 3 月 1 日      | 8     | 8         | 55.0      | 158 ms    | 6.3        | （基线）   |
+| 3 月 1 日      | 16    | 8         | 104.8     | 164 ms    | 6.1        | （基线）   |
+| 3 月 1 日      | 32    | 8         | 191.2     | 179 ms    | 5.6        | （基线）   |
+| 3 月 1 日      | 64    | 8         | 348.5     | 199 ms    | 5.0        | （基线）   |
+| **3 月 26 日** | **4** | **8**     | **337**   | **13 ms** | **78.9**   | **12.0x**  |
+| 3 月 26 日     | 8     | 8         | 521       | 16 ms     | 60.8       | 9.7x       |
+| 3 月 26 日     | 16    | 8         | 870       | 20 ms     | 50.5       | 8.3x       |
+| 3 月 26 日     | 32    | 8         | 1,255     | 27 ms     | 36.4       | 6.5x       |
+| 3 月 26 日     | 64    | 8         | 1,647     | 43 ms     | 23.3       | 4.7x       |
+
+两个日期均使用 TP=8 以进行等条件对比。延迟下限从 152–199 ms 崩塌至 13–43 ms，覆盖整个批次曲线。
+
+对于峰值吞吐量，修复后的最优配方切换到 TP=4，以少量低批次交互性为代价换取更高的每 GPU token 数：
+
+| 日期       | Conc | TP  | tok/s/GPU | TPOT  | tok/s/user |
+| ---------- | ---- | --- | --------- | ----- | ---------- |
+| 3 月 26 日 | 4    | 4   | 650       | 13 ms | 76.2       |
+| 3 月 26 日 | 64   | 4   | **2,687** | 53 ms | 19.0       |
+
+### 等吞吐量对比：15 倍提升之所在
+
+上表中的 12.0 倍增益是在相同批大小下对比两个版本。更有价值的对比是固定每 GPU 吞吐量，考察每位用户的响应速度提升了多少。在 8k/1k 下沿两条 TP=8 曲线在匹配 tok/s/GPU 水平上进行插值：
+
+| 等吞吐量 (tok/s/GPU) | v0.16 交互性 (tok/s/user) | v0.18 交互性 (tok/s/user) | 交互性增益 |
+| -------------------- | ------------------------- | ------------------------- | ---------- |
+| 337                  | 5.1（插值，约并发 62）    | **78.9**（实测，并发 4）  | **15.6x**  |
+| 380                  | 4.9（外推）               | 74.7（插值，约并发 5）    | 15.2x      |
+
+"最高 15 倍"的标题出现在 337 tok/s/GPU 处，v0.16 破损的延迟下限（152–199 ms TPOT，与批大小无关）与 v0.18 在并发 4 下 13 ms 的正常下限在此交汇。在该工作点上，vLLM v0.18 已能在 Kimi K2.5 推理上实现接近实时语音的延迟。
+
+<Figure
+  srcLight="/images/mi355x-kimi-k2-5-vllm-aiter-7x-speedup/benchmark-light.png"
+  srcDark="/images/mi355x-kimi-k2-5-vllm-aiter-7x-speedup/benchmark-dark.png"
+  alt="MI355X Kimi K2.5 MXFP4 8k/1k Pareto 前沿：vLLM 0.16（3 月 1 日基线）vs vLLM 0.18（3 月 26 日），tok/s/GPU vs tok/s/user"
+  caption="MI355X Kimi K2.5 MXFP4 8k/1k Pareto 前沿。vLLM 0.16（3 月 1 日基线）vs vLLM 0.18（3 月 26 日）。"
+/>
+
+您可以在[此处](https://inferencex.semianalysis.com/inference?g_rundate=2026-04-20&g_runid=24695468813&g_model=Kimi-K2.5&i_gpus=mi355x_vllm&i_dstart=2026-03-01&i_dend=2026-03-26)找到该图表的在线版本，已预筛选为 3 月 1 日到 3 月 26 日间 MI355X vLLM 上的 Kimi K2.5 数据。
+
+## 速度即护城河
+
+修复后，MI355X Kimi K2.5 推理在 8k/1k MXFP4 上峰值达到 2,687 tok/s/GPU，约为 B200 单节点 vLLM FP4 4,021 tok/s/GPU 的 67%。在超大规模云厂商和新兴云服务商租用 Instinct MI355X 的较低每 GPU TCO 下，确实存在 MI355X 每百万 token 更便宜的工作点。然而，机架级分离式推理方面的差距仍未弥合：在该工作负载上 MI355X 落后 GB200 NVL72 Dynamo vLLM 和 TRT-LLM 4.7x–5.3x。大多数 MI355X 配置仅限单节点，限制在 4 或 8 GPU，没有分离式预填充/解码拆分，没有跨机架级互联的宽专家并行。
+
+AMD 已经展示了其能够在自有技术栈上交付生产级分离式推理的能力。MI355X DeepSeek R1 结果使用了 `mori-sglang` 配合分离式预填充/解码、MXFP4 和 MXFP8，含和不含 MTP 投机解码两种模式。我们期待尽快在 Kimi K2.5 上看到同样的方案。
+
+而这正是我们基准测试所展示的更新节奏为何如此重要。如果在 3 月 1 日做一次时间点 MI355X vs B200 对比，结论会是 MI355X 落后 10 倍且几乎不可用。然而，仅仅 25 天后的数据证明 MI355X 已进入 B200 单节点的射程之内。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "vLLM PR #35850 为 MI355X 上的 Kimi K2.5 修复了什么？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "vLLM PR #35850（2026 年 3 月 6 日合入）通过头重复技巧为 MI355X CDNA4 添加了 num_heads < 16 的 AITER MLA 支持，将 8 个头填充到 16 个后调用优化的 gqa_ratio=16 ASM 内核。这为 Kimi K2.5 TP=8（8 头/rank）和 Kimi-Linear TP=16（4 头/rank）解锁了性能。该 PR 还放宽了头数断言以接受 4、8 或 [16, 128] 内任意 16 的倍数，并在使用 FP8 KV 缓存时添加了从 TritonMLA 到 AITER MLA 的自动回退。修复前，TP=8 的 Kimi K2.5 回退到参考 Triton MLA 路径，在 MXFP4 下比 AITER 慢很多。该修复随 vLLM 0.18 发布。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "vLLM 0.18 后 MI355X 上的 Kimi K2.5 快了多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 Kimi K2.5 MXFP4 8k/1k 下，MI355X v0.18 在等吞吐量下的交互性比 v0.16 最高提升 15.6 倍，测量于 337 tok/s/GPU（v0.18 TP=8 曲线的最低实测点，v0.18 为 78.9 tok/s/user vs v0.16 插值为 5.1 tok/s/user）。在 TP=8 同并发下，交互性从 6.6 提升至 78.9 tok/s/user，增益 12.0 倍。峰值吞吐量提升 7.7 倍，从 348.5 tok/s/GPU（TP=8，v0.16）到 2,687 tok/s/GPU（TP=4，v0.18，新的最高吞吐量配方）。总耗时 25 天，3 月 1 日至 3 月 26 日。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "修复后 MI355X 在 Kimi K2.5 上能与 NVIDIA B200 竞争吗？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在单节点吞吐量上，是的。修复后 MI355X MXFP4 达到 2,687 tok/s/GPU，而 B200 单节点 vLLM FP4 为 4,021 tok/s/GPU，约为 B200 吞吐量的 67%，且 CDNA4 Instinct 部署的每 GPU TCO 更低。与 GB200 NVL72（Dynamo vLLM 12,586 tok/s/GPU，Dynamo TRT-LLM 14,187 tok/s/GPU）之间 4.7x 到 5.3x 的差距在于可组合性。MI355X 在 Kimi K2.5 上尚无分离式推理和宽专家并行方案。AMD 已通过 mori-sglang 为 DeepSeek R1 交付了分离式推理，因此技术路线已有先例。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "MI355X 上部署 Kimi K2.5 是否应升级到 vLLM 0.18？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "是的。PR #35850 中的 AITER MLA 分发修复和 MXFP4 GEMM 自动调优是达到 Kimi K2.5 8k/1k 峰值 2,687 tok/s/GPU 以及低批次下 sub-15 ms TPOT 延迟下限的必要条件。0.18 之前的 vLLM 在 MI355X 上回退到参考注意力实现，在该模型上浪费了超过一个数量级的性能。"
+      }
+    }
+  ]
+}`}</JsonLd>
diff --git a/packages/app/content/blog/zh/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x.mdx b/packages/app/content/blog/zh/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x.mdx
new file mode 100644
index 00000000..ac7908b2
--- /dev/null
+++ b/packages/app/content/blog/zh/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x.mdx
@@ -0,0 +1,179 @@
+---
+title: 'AMD MI355X Qwen3.5 397B-A17B 推理：SGLang FP8 三个月内每 GPU 吞吐量提升最高 19 倍'
+subtitle: '从 v0.5.8（2 月）→ v0.5.10rc0（4 月）→ v0.5.12（5 月），三次 AITER 内核合入 MI355X 加上从 TP=8 到 TP=2/TP=4 的重新调优，将 Qwen3.5 8k/1k 峰值从 1.3k 推高至 6.4k tok/s/GPU，并将曲线延伸至 75 tok/s/user'
+date: '2026-05-25'
+publishDate: '2026-05-25'
+tags:
+  - benchmark
+  - gpu
+  - inference
+  - qwen
+  - amd
+  - mi355x
+  - sglang
+  - rocm
+---
+
+阿里巴巴于 [2026-02-16 发布 Qwen3.5-397B-A17B](https://www.alibabacloud.com/blog/602894) 后 13 周，AMD MI355X 上 SGLang FP8 在 8k/1k 工作负载下的每 GPU 吞吐量在 40 tok/s/user 的等交互性下**最高提升至 19.0 倍**（在仪表板的单调三次 Hermite Pareto 插值上，从 2026-02-20 v0.5.8.post1 基线的 192 tok/s/GPU 提升至 2026-05-19 v0.5.12 运行的 3,660 tok/s/GPU）。这一增长叠加了三个 SGLang 版本以及三次 AITER MoE 内核合入带来的大部分提升，5 月的 v0.5.10rc0 → v0.5.12 镜像升级又在此基础上额外贡献了约 **1.5 倍**。
+
+这完全是软件优化——全程使用相同的 MI355X CDNA4 硅片，TCO 始终为 $1.48/GPU/hr。相关记录：[sgl-project/sglang#20736](https://github.com/sgl-project/sglang/pull/20736)、[sgl-project/sglang#21188](https://github.com/sgl-project/sglang/pull/21188) 和 [sgl-project/sglang#21421](https://github.com/sgl-project/sglang/pull/21421)，均在 3–4 月合入，且均通过 `SGLANG_USE_AITER=1` 开关控制。上游到基准测试的闭环速度本身就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_model=Qwen-3.5-397B-A17B&g_rundate=2026-05-19&i_gpus=mi355x_sglang&i_dstart=2026-02-20&i_dend=2026-05-19&i_prec=fp8">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<Figure
+  srcLight="/images/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x/benchmark-light.png"
+  srcDark="/images/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x/benchmark-dark.png"
+  alt="Qwen3.5 FP8 8k/1k tok/s/GPU vs 交互性，MI355X SGLang 三个日期对比：2026-02-20（v0.5.8.post1）、2026-04-16（v0.5.10rc0）、2026-05-19（v0.5.12）。每条曲线标注日期和各点的 TP 值。"
+  caption="Qwen3.5-397B-A17B FP8 8k/1k 在 MI355X SGLang 上的表现。三个月内三次运行：v0.5.8.post1（2 月 20 日，TP=8）、v0.5.10rc0（4 月 16 日，TP=2/4）、v0.5.12（5 月 19 日，TP=2/4）。点标签表示该配置所使用的 TP 值。"
+/>
+
+Qwen3.5-397B-A17B 是阿里巴巴的 MoE 旗舰模型，于 2026-02-16 发布，总参数量 397B，每 token 激活 17B，拥有 **512 个专家**（top-K 路由），并采用混合注意力架构，交替使用 Gated DeltaNet 和 Gated Attention 层。首次 InferenceX 基准测试在模型发布四天后便在 MI355X 上完成。
+
+## 推动性能提升的具体优化
+
+带来这些巨大性能提升的部分优化包括：
+
+- **[sgl-project/sglang PR #20736](https://github.com/sgl-project/sglang/pull/20736)**，由 [zhentaocc](https://github.com/zhentaocc) 提交（合著者 [yichiche](https://github.com/yichiche)），2026-04-15 合入——**在 Qwen2 MoE 和 Qwen3.5 MoE 中将共享专家与路由专家融合**。当 `shared_expert_intermediate_size == moe_intermediate_size` 时，共享专家被视为额外的一个专家（top-K + 1），在单次 AITER MoE 分发中一并处理。每个 MoE 层减少一次内核启动，共享专家权重的 HBM 往返次数也减少。据报告在 Qwen3.5 并发 16 时总吞吐量提升 +4.6%，TPOT 降低 -4%；FP8 精度最初需要 AITER split-K 修复后才能启用。
+- **[sgl-project/sglang PR #21188](https://github.com/sgl-project/sglang/pull/21188)**，由 [yichiche](https://github.com/yichiche) 提交，2026-03-23 合入——**为 `GemmaRMSNorm` 添加 `forward_hip` 路径，使 AMD GPU 使用融合 RMSNorm 内核（AITER `fused_add_rms_norm` / `rms_norm`）而非原生回退路径**。原生路径在 MI355X 上受标量运算限制；融合路径将 Gemma 风格的 `weight + 1.0` 偏移吸收进内核中。据报告在 8x MI355X、并发 1、8k/1k 下：**中位端到端延迟降低 -23.1%，总吞吐量提升 +30.0%，中位首 token 延迟（TTFT）降低 -17.0%**，同时 GSM8K 精度从 0.943 提升至 0.955。
+- **[sgl-project/sglang PR #21421](https://github.com/sgl-project/sglang/pull/21421)**，由 [zhentaocc](https://github.com/zhentaocc) 提交，2026-03-26 合入——**将 AITER 的 `fused_topk` 内核集成到 SGLang 的 `fused_topk` 中，用于 softmax 评分的 MoE top-K 选择**。启用 AITER 时自动分发到 `aiter.fused_moe.fused_topk`。内核微基准测试显示：在 Qwen3.5 形态（E=512, top-K=10）上比 sgl-kernel 基线快约 1.31x 到 **6.29x**，在高 token 数下增益最大。端到端 bs=64 1k/1k 下：总吞吐量提升 +1.9%，GSM8K 精度与基线偏差在 ±0.001 以内。
+
+## 测试数据
+
+所有行均为 Qwen3.5-397B-A17B FP8、**ISL 8192 / OSL 1024**、单节点非分离式 MI355X，在 InferenceX 上测量。每百万总 token 成本按 `TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6)` 计算，MI355X TCO 为 $1.48/GPU/hr，来自 [SemiAnalysis AI Cloud TCO 模型](https://newsletter.semianalysis.com/p/ai-cloud-economics)。
+
+各日期使用的容器镜像：
+
+- **2026-02-20:** `rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260218`
+- **2026-04-16:** `lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260414`
+- **2026-05-19:** `lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517`
+
+**2026-02-20，MI355X SGLang FP8，TP=8、8 GPU**（基线）：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 171.9     | 40.86      | 24.47     | $2.39      |
+| 8    | 312.1     | 37.66      | 26.55     | $1.32      |
+| 16   | 568.0     | 35.47      | 28.19     | $0.72      |
+| 32   | 917.8     | 28.48      | 35.11     | $0.45      |
+| 64   | 1,288.0   | 19.22      | 52.03     | $0.32      |
+
+**2026-04-16，MI355X SGLang FP8，TP=2、2 GPU**（重新调优 + AITER PR 后）：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 1,074.3   | 63.89      | 15.65     | $0.38      |
+| 8    | 1,704.6   | 50.98      | 19.61     | $0.24      |
+| 16   | 2,571.9   | 38.50      | 26.51     | $0.16      |
+| 32   | 3,567.8   | 26.22      | 38.15     | $0.12      |
+
+**2026-04-16，MI355X SGLang FP8，TP=4、4 GPU**（高吞吐量分支）：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 32   | 2,584.9   | 38.56      | 25.94     | $0.16      |
+| 64   | 3,426.6   | 24.84      | 40.25     | $0.12      |
+| 128  | 4,263.2   | 15.38      | 65.01     | $0.10      |
+| 256  | 5,099.3   | 9.20       | 108.64    | $0.08      |
+
+**2026-05-19，MI355X SGLang FP8，TP=2、2 GPU**（v0.5.12 升级）：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 4    | 1,267.5   | 75.22      | 13.29     | $0.32      |
+| 8    | 2,008.1   | 59.67      | 16.76     | $0.20      |
+| 16   | 3,175.6   | 46.73      | 21.40     | $0.13      |
+| 32   | 4,346.8   | 31.91      | 31.34     | $0.09      |
+
+**2026-05-19，MI355X SGLang FP8，TP=4、4 GPU**（v0.5.12 升级）：
+
+| Conc | tok/s/GPU | tok/s/user | TPOT (ms) | $/M tokens |
+| ---- | --------- | ---------- | --------- | ---------- |
+| 32   | 3,171.8   | 46.82      | 21.36     | $0.13      |
+| 64   | 4,113.4   | 29.83      | 33.53     | $0.10      |
+| 128  | 5,019.6   | 18.09      | 55.27     | $0.08      |
+| 256  | 6,409.1   | 11.56      | 86.53     | $0.06      |
+
+## 等交互性吞吐量对比
+
+每个日期沿其 Pareto 前沿插值（4 月和 5 月运行取 TP=2 和 TP=4 中较高的每交互性吞吐量；2 月基线仅有 TP=8）。比率为匹配 tok/s/user 下的每 GPU 吞吐量：
+
+| 交互性 (tok/s/user) | 2 月 v0.5.8 tok/s/GPU | 4 月 v0.5.10rc0 tok/s/GPU | 5 月 v0.5.12 tok/s/GPU | 5 月 / 2 月 | 5 月 / 4 月 |
+| ------------------- | --------------------- | ------------------------- | ---------------------- | ----------- | ----------- |
+| 20                  | 1,259                 | 3,906                     | 4,861                  | 3.86x       | 1.24x       |
+| 30                  | 859                   | 3,278                     | 4,449                  | 5.18x       | 1.36x       |
+| 35                  | 612                   | 2,867                     | 4,114                  | 6.72x       | 1.44x       |
+| **40**              | **192**               | **2,476**                 | **3,660**              | **19.0x**   | **1.48x**   |
+| 50                  | _unreachable_         | 1,765                     | 2,959                  | _∞_         | 1.68x       |
+| 60                  | _unreachable_         | 1,244                     | 1,985                  | _∞_         | 1.60x       |
+
+40 tok/s/user 处的 19 倍峰值部分源于区间延伸——2 月 TP=8 配方在并发 4 时有 24.5 ms 的 TPOT 下限（40.86 tok/s/user），在该工作负载上无法再降低，因此对比区间的上限恰好是旧配方已开始崩溃的位置。到 50 tok/s/user 时 v0.5.8 曲线已不存在；到 75 tok/s/user 时只有 v0.5.12 曲线仍有数据点。仅 5 月 v0.5.12 镜像就在 35 到 60 tok/s/user 区间内比 4 月基线额外提升 1.44x 到 1.68x（在 20 tok/s/user 处为 1.24x）——这是纯粹的版本升级收益。
+
+<Figure
+  srcLight="/images/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x/benchmark-light.png"
+  srcDark="/images/mi355x-qwen3-5-sglang-v0-5-12-up-to-17x/benchmark-dark.png"
+  alt="Qwen3.5 FP8 8k/1k tok/s/GPU vs 交互性，MI355X SGLang 三个日期对比：2026-02-20（v0.5.8.post1）、2026-04-16（v0.5.10rc0）、2026-05-19（v0.5.12）。每条曲线标注日期和各点的 TP 值。"
+  caption="Qwen3.5-397B-A17B FP8 8k/1k 在 MI355X SGLang 上的表现。三个月内三次运行：v0.5.8.post1（2 月 20 日，TP=8）、v0.5.10rc0（4 月 16 日，TP=2/4）、v0.5.12（5 月 19 日，TP=2/4）。点标签表示该配置所使用的 TP 值。"
+/>
+
+[在线图表](https://inferencex.semianalysis.com/inference?g_model=Qwen-3.5-397B-A17B&g_rundate=2026-05-19&i_gpus=mi355x_sglang&i_dstart=2026-02-20&i_dend=2026-05-19&i_prec=fp8)，已预筛选为 MI355X SGLang Qwen3.5 FP8 三次运行的数据。
+
+## MI355X 上 Qwen3.5 的下一步
+
+- **分离式推理服务。** Qwen3.5 的 512 专家池恰好是分离式预填充/解码拆分能大显身手的场景。目前尚无 MI355X Qwen3.5 分离式配方，AMD 也尚未为 Qwen3.5 交付分离式推理方案。
+
+## 致谢
+
+这条三个月的性能提升曲线来自 AMD 的 [zhentaocc](https://github.com/zhentaocc)（Todd Chen）和 [yichiche](https://github.com/yichiche)（Jacky Cheng），他们编写了全部三个上游 SGLang PR，由 [HaiShaw](https://github.com/HaiShaw) 审核并合入。上游到基准测试的闭环速度本身就是护城河。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_model=Qwen-3.5-397B-A17B&g_rundate=2026-05-19&i_gpus=mi355x_sglang&i_dstart=2026-02-20&i_dend=2026-05-19&i_prec=fp8">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "与三个月前相比，MI355X SGLang 在 Qwen3.5 FP8 上快了多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 Qwen3.5-397B-A17B FP8 8k/1k 的 MI355X SGLang 上，等交互性下的每 GPU 吞吐量在 2026-02-20 v0.5.8.post1 基线和 2026-05-19 v0.5.12 运行之间最高提升 19.0 倍，峰值出现在 40 tok/s/user（在仪表板的单调三次 Hermite Pareto 插值上为 192 到 3,660 tok/s/GPU）。每 GPU 峰值吞吐量从 1,288 提升至 6,409 tok/s/GPU（5.0 倍）。2 月基线使用 TP=8；4 月和 5 月运行使用 TP=2 和 TP=4。全程使用单台 MI355X 节点，无硬件变更。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "哪些 SGLang PR 推动了 MI355X Qwen3.5 的加速？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "三个由 AMD 编写、通过 SGLANG_USE_AITER=1 开关控制的上游 PR 被纳入 v0.5.10rc0 镜像。PR #20736（zhentaocc 提交，2026-04-15 合入）在 Qwen2/Qwen3.5 MoE 中将共享专家与路由专家融合为单次 AITER 分发中的 topk+1。PR #21188（yichiche 提交，2026-03-23 合入）为 GemmaRMSNorm 添加 forward_hip 路径，使 AMD GPU 使用融合 RMSNorm 内核（AITER fused_add_rms_norm / rms_norm）而非原生回退路径，据报告在 8k/1k 并发 1 下端到端延迟降低 23.1%，吞吐量提升 30.0%。PR #21421（zhentaocc 提交，2026-03-26 合入）将 AITER 的 fused_topk 内核集成到 SGLang 的 fused_topk 中用于 softmax MoE top-K 选择，内核微基准测试在 Qwen3.5 形态上比 sgl-kernel 基线快 1.31x 到 6.29x。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "从 TP=8 切换到 TP=2/TP=4 带来了多少增益？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "2026 年 4 月的 InferenceX 配方更新将 TP 重新调优、v0.5.10rc0 镜像升级和 AITER PR 打包为一次变更，因此从公开的 InferenceX 数据集无法将它们完全分离。可以确定的是：TP=8 在 Qwen3.5 的 512 专家 MoE 分发路径上浪费了 MI355X 大部分的 tensor-core 算力，而 TP=2 / TP=4 将解码批次分配到更少的 rank 上以保持 AITER 融合 MoE 分发的高效运行。TP 重新调优是 AITER 内核优化在端到端层面得以体现的必要条件。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "仅 5 月 v0.5.12 镜像升级带来了多少加速？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在保持 TP=2/TP=4 配方不变的情况下，2026-05-19 SGLang v0.5.12 镜像在 35 到 60 tok/s/user 区间内的等交互性下，每 GPU 吞吐量比 2026-04-16 v0.5.10rc0 镜像提升了 1.44x 到 1.68x，并将 Pareto 前沿延伸至 75 tok/s/user，而 4 月配方的上限为 64。TP=4 并发 256 时的峰值吞吐量从 5,099 提升至 6,409 tok/s/GPU。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "这一仅限 MI355X 的 Qwen3.5 结果未涵盖什么？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "分离式推理服务。Qwen3.5 的 512 专家池恰好是分离式预填充/解码拆分能大显身手的场景，但目前尚无 MI355X Qwen3.5 分离式配方，AMD 也尚未为 Qwen3.5 交付分离式推理方案。"
+      }
+    }
+  ]
+}`}</JsonLd>
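Review note on the `$/M tokens` column in the MI355X post above: the figures follow from the formula quoted in its 测试数据 section, `TCO_$/GPU/hr / (3600 × tput_per_gpu / 1e6)`. The sketch below reproduces that arithmetic for two table rows. The helper name `costPerMillionTokens` is hypothetical and does not come from the InferenceX codebase; the $1.48/GPU/hr TCO and the throughput values are the ones quoted in the post.

```typescript
// Hypothetical helper for illustration only; not part of the InferenceX codebase.
const MI355X_TCO_PER_GPU_HOUR = 1.48; // $/GPU/hr, as quoted in the post

// Cost per million total tokens = hourly GPU cost / millions of tokens per GPU-hour.
function costPerMillionTokens(tcoPerGpuHour: number, tokPerSecPerGpu: number): number {
  const millionTokensPerGpuHour = (3600 * tokPerSecPerGpu) / 1e6;
  return tcoPerGpuHour / millionTokensPerGpuHour;
}

// 2026-02-20 baseline, TP=8, concurrency 4 (171.9 tok/s/GPU).
const febConc4 = costPerMillionTokens(MI355X_TCO_PER_GPU_HOUR, 171.9);
// 2026-05-19 run, TP=4, concurrency 256 (6,409.1 tok/s/GPU).
const mayConc256 = costPerMillionTokens(MI355X_TCO_PER_GPU_HOUR, 6409.1);

console.log(`$${febConc4.toFixed(2)}`, `$${mayConc256.toFixed(2)}`);
// prints "$2.39 $0.06"
```

Both values match the table rows they were taken from, which suggests the cost column was rounded to two decimals after applying the formula.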
diff --git a/packages/app/content/blog/zh/sglang-0-5-6-b200-deepseek-r1-fp4-up-to-1-8x.mdx b/packages/app/content/blog/zh/sglang-0-5-6-b200-deepseek-r1-fp4-up-to-1-8x.mdx
new file mode 100644
index 00000000..a631535a
--- /dev/null
+++ b/packages/app/content/blog/zh/sglang-0-5-6-b200-deepseek-r1-fp4-up-to-1-8x.mdx
@@ -0,0 +1,112 @@
+---
+title: 'SGLang 0.5.6 在 B200 DeepSeek R1 FP4 上的表现：低并发下最高提升 1.8 倍'
+subtitle: '针对 DeepSeek V3 的分段 CUDA graph、统一事件循环和 JIT 内核将 8k/1k 吞吐量从 508 提升至 907 tok/s/GPU，使用相同的 16 GPU B200 资源池'
+date: '2026-05-02'
+publishDate: '2026-05-02'
+tags:
+  - benchmark
+  - inference
+  - gpu
+  - nvidia
+  - b200
+  - deepseek
+  - sglang
+  - fp4
+---
+
+B200 运行 SGLang 0.5.6 处理 DeepSeek R1 NVFP4 在 8k/1k 工作负载并发数 4 下达到 907 tok/s/GPU，相比 0.5.5 的 508 tok/s/GPU 提升了 1.79 倍。两次测试使用相同的 16 GPU 资源池，TP 4 / EP 4 配置。唯一的变化是 Docker 镜像从 lmsysorg/sglang:v0.5.5-cu129-amd64 更新为 lmsysorg/sglang:v0.5.6-cu129-amd64。
+
+SGLang 0.5.6 于 2025-12-03 发布，InferenceX 基准测试在 28 天后的 2025-12-31（镜像更新当天）捕捉到了完整的性能提升效果。这正是我们构建 InferenceX 自动化基准测试循环的原因：在硬件不变的情况下，第一时间捕捉软件驱动的性能变化。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+性能提升在低并发下最为显著。在并发数 4 和 8 时，解码循环的每步耗时中有相当比例花在 Python 调度器和内核分发代码上而非矩阵乘法中，因此 0.5.6 的调度器和计算图优化效果最为直接。在高并发下 tensor core 接近饱和，较小的吞吐量提升（并发数 64 为 1.03 倍，并发数 128 为 1.16 倍）来自重构后的注意力内核路径。
+
+## SGLang 0.5.6 中的关键更新
+
+[SGLang 0.5.6](https://github.com/sgl-project/sglang/releases/tag/v0.5.6) 于 2025-12-03 发布。三项更新与低并发吞吐量提升直接相关。分段 CUDA graph 支持扩展到了 DeepSeek V3 和 MLA 注意力路径，减少了构建和回放计算图的每步 Python 开销。事件循环在 PD 分离式、重叠式和 DP 注意力服务模式间统一，降低了内部循环开销。引入了 JIT 内核，减少了启动成本并允许内核编译针对运行时观察到的形状进行特化。
+
+另外三项 0.5.6 更新影响注意力内核路径。MHA 和 MLA KV 缓存重构以支持 FP4。FlashInfer TRTLLM GEN MHA 路径重新启用。FlashInfer 更新至 0.5.2。这些在高并发下更为重要，此时 KV 缓存较大且注意力计算是主要开销。并发数 128 下的 1.16 倍提升即来自此路径。
+
+## 基准测试数据
+
+所有数据均为 DeepSeek R1 NVFP4，ISL 8192 / OSL 1024，来自 InferenceX。0.5.5 数据来自 2025-12-15 的测试，镜像由 [InferenceX PR #204](https://github.com/SemiAnalysisAI/InferenceX/pull/204) 设置，该 PR 于 2025-11-10 将 B200 SGLang 配置从 v0.5.3rc1-cu129-b200 升级至 v0.5.5-cu129-amd64。0.5.6 数据来自 2025-12-31 的测试，由 [InferenceX PR #276](https://github.com/SemiAnalysisAI/InferenceX/pull/276) 触发，该 PR 将 Docker 镜像更新至 v0.5.6-cu129-amd64，无其他配置更改。
+
+B200 SGLang，DeepSeek R1 NVFP4，TP 4 / EP 4 解码，16 GPU 非分离式资源池。该方案遵循 [SGLang DeepSeek V3/R1 部署指南](https://docs.sglang.io/basic_usage/deepseek_v3.html)。
+
+| 版本      | 并发数 | tok/s/GPU | TPOT (ms) | tok/s/user | 提升      |
+| --------- | ------ | --------- | --------- | ---------- | --------- |
+| 0.5.5     | 4      | 508       | 9.2       | 108.4      | 基准      |
+| 0.5.5     | 8      | 903       | 11.6      | 86.5       | 基准      |
+| 0.5.5     | 16     | 1,471     | 15.7      | 63.8       | 基准      |
+| 0.5.5     | 32     | 2,302     | 22.2      | 45.1       | 基准      |
+| 0.5.5     | 64     | 3,323     | 33.7      | 29.6       | 基准      |
+| 0.5.5     | 128    | 4,430     | 54.9      | 18.2       | 基准      |
+| **0.5.6** | **4**  | **907**   | **9.2**   | **108.5**  | **1.79x** |
+| 0.5.6     | 8      | 1,437     | 11.6      | 86.0       | 1.59x     |
+| 0.5.6     | 16     | 1,500     | 15.5      | 64.6       | 1.02x     |
+| 0.5.6     | 32     | 3,063     | 22.0      | 45.6       | 1.33x     |
+| 0.5.6     | 64     | 3,419     | 32.9      | 30.4       | 1.03x     |
+| 0.5.6     | 128    | 5,145     | 53.7      | 18.6       | 1.16x     |
+
+加粗行是本文的核心数据：0.5.6 在并发数 4 下达到 907 tok/s/GPU，相比 0.5.5 的 508 提升 1.79 倍，硬件和方案完全相同。在匹配并发数下，两个版本的交互性几乎一致。各并发数下 TPOT 基本不变（差异均在 2.5% 以内）。0.5.6 以相同的每用户 token 速率服务了更多用户。
+
+<Figure
+  srcLight="/images/sglang-0-5-6-b200-deepseek-r1-fp4-up-to-1-8x/benchmark-light.png"
+  srcDark="/images/sglang-0-5-6-b200-deepseek-r1-fp4-up-to-1-8x/benchmark-dark.png"
+  alt="DeepSeek R1 NVFP4 8k/1k B200 SGLang 在 TP 4 EP 4 下的每 GPU 吞吐量，SGLang 0.5.5 vs 0.5.6 跨并发数扫描"
+  caption="B200 DeepSeek R1 NVFP4 8k/1k，SGLang 0.5.5（2025-12-15）vs SGLang 0.5.6（2025-12-31）跨并发数的每 GPU 吞吐量。"
+/>
+
+[实时图表](https://inferencex.semianalysis.com/inference?g_rundate=2025-12-31&g_runid=20621824084&i_prec=fp4%2Cfp8&i_gpus=b200_sglang&i_dstart=2025-12-15&i_dend=2025-12-31)，预筛选为 0.5.5 和 0.5.6 两次测试中 B200 SGLang 上的 DeepSeek R1。
+
+## 各项优化在曲线上的作用位置
+
+TP 4 / EP 4 配置下 DeepSeek R1 NVFP4 的解码有固定的每步开销。内核启动、Python 调度器处理和计算图构建是主要贡献项，与注意力和 MoE GEMM 计算并列。在并发数 4 时，GEMM 规模较小，固定开销占步骤耗时的比例显著。减少固定开销直接加速了每步执行，这就是最大提升比（并发数 4 为 1.79 倍，并发数 8 为 1.59 倍）出现在低并发的原因。分段 CUDA graph 和 JIT 内核是相关的发布项。
+
+在并发数 128 时，KV 缓存较大，注意力计算是每步的主要开销。重构后的 MHA 和 MLA KV 缓存对 FP4 的支持以及重新启用的 FlashInfer TRTLLM GEN MHA 路径在并发数 128 下产生了 1.16 倍的提升，尽管调度器开销的减少在此并发量级已趋于平坦。在中等并发数（16、32、64）下，两种效果均不占主导，吞吐量提升较小且不太稳定（1.02 倍、1.33 倍、1.03 倍）。
+
+<DashboardCTA href="https://inferencex.semianalysis.com/inference">
+  点击查看完整 InferenceX 仪表板 →
+</DashboardCTA>
+
+<JsonLd>{`{
+  "@context": "https://schema.org",
+  "@type": "FAQPage",
+  "mainEntity": [
+    {
+      "@type": "Question",
+      "name": "SGLang 0.5.6 在 B200 DeepSeek R1 FP4 上比 0.5.5 快多少？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "在 B200 TP 4 / EP 4 非分离式部署、DeepSeek R1 NVFP4、8k/1k 序列长度条件下，SGLang 0.5.6 在并发数 4 时 tok/s/GPU 为 0.5.5 的 1.79 倍（508 → 907），并发数 8 时为 1.59 倍（903 → 1,437）。中高并发的吞吐量提升较小：并发数 16 为 1.02 倍，并发数 32 为 1.33 倍，并发数 64 为 1.03 倍，并发数 128 为 1.16 倍。在匹配并发数下，两个版本的交互性（tok/s/user）几乎不变。数据来自 InferenceX，0.5.5 使用 2025-12-15 测试数据，0.5.6 使用 2025-12-31 测试数据。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "SGLang 0.5.6 中哪些变更带来了性能提升？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "三项更新在低并发下贡献最大。分段 CUDA graph 支持扩展到了 DeepSeek V3 和 MLA 注意力路径，减少了每步 Python 和内核启动开销。事件循环在 PD 分离式、重叠式和 DP 注意力服务模式间统一，收紧了内部解码循环。引入了 JIT 内核，降低了启动成本并允许内核编译针对运行时观察到的形状进行特化。MHA 和 MLA KV 缓存重构以支持 FP4，FlashInfer 更新至 0.5.2，这些提高了注意力内核的上限，但对低并发吞吐量提升贡献较小。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "为什么 0.5.6 的性能提升在低并发下最大？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "调度器和内核启动开销在小批量时占每个解码步骤的比例更大。在 8k/1k 并发数 4 下，MoE GEMM 形状较小，每步固定开销占解码步骤的比例显著，因此分段 CUDA graph 和 JIT 内核的变更带来了明显的 1.79 倍提升。在中等并发（16 至 64）时，调度器优化的收益减弱而注意力内核尚未成为主导，提升比率较小且不太稳定。在并发数 128 时 KV 缓存较大，注意力计算主导了每步耗时，重构的 MHA 和 MLA KV 缓存加上重新启用的 FlashInfer TRTLLM GEN MHA 路径带来了 1.16 倍的提升。"
+      }
+    },
+    {
+      "@type": "Question",
+      "name": "InferenceX 多快捕捉到了 SGLang 0.5.6 的性能提升？",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "SGLang 0.5.6 于 2025-12-03 发布。InferenceX PR #276 于 2025-12-31（28 天后）将 NVIDIA DeepSeek SGLang Docker 镜像从 v0.5.5-cu129-amd64 更新至 v0.5.6-cu129-amd64，DeepSeek R1 FP4 B200 配置的首次 0.5.6 测试在同日运行。无硬件或并行配置变更，仅 Docker 镜像更新。"
+      }
+    }
+  ]
+}`}</JsonLd>
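Review note on the 提升 column in the SGLang 0.5.6 post above: each ratio is the 0.5.6 tok/s/GPU divided by the 0.5.5 tok/s/GPU at the same concurrency. A minimal sketch that recomputes the column from the table values, assuming a hypothetical `speedup` helper and a hand-copied `rows` array (neither exists in the repository):

```typescript
// tok/s/GPU per concurrency, copied by hand from the benchmark table in the post.
const rows = [
  { concurrency: 4, v055: 508, v056: 907 },
  { concurrency: 8, v055: 903, v056: 1437 },
  { concurrency: 16, v055: 1471, v056: 1500 },
  { concurrency: 32, v055: 2302, v056: 3063 },
  { concurrency: 64, v055: 3323, v056: 3419 },
  { concurrency: 128, v055: 4430, v056: 5145 },
];

// Speedup is the plain ratio of new to old per-GPU throughput.
function speedup(oldTput: number, newTput: number): number {
  return newTput / oldTput;
}

const speedups = rows.map(
  (r) => `${r.concurrency}: ${speedup(r.v055, r.v056).toFixed(2)}x`,
);
console.log(speedups.join(', '));
// prints "4: 1.79x, 8: 1.59x, 16: 1.02x, 32: 1.33x, 64: 1.03x, 128: 1.16x"
```

The recomputed ratios agree with the published column at every concurrency level.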
diff --git a/packages/app/cypress/e2e/zh-pages.cy.ts b/packages/app/cypress/e2e/zh-pages.cy.ts
new file mode 100644
index 00000000..8b52407e
--- /dev/null
+++ b/packages/app/cypress/e2e/zh-pages.cy.ts
@@ -0,0 +1,87 @@
+describe('Chinese (/zh) pages', () => {
+  describe('zh landing page', () => {
+    before(() => {
+      cy.visit('/zh');
+    });
+
+    it('renders the Chinese landing content', () => {
+      cy.contains('h2', '完整仪表板').should('exist');
+      cy.contains('快速对比').should('exist');
+    });
+
+    it('links into the Chinese dashboard tree', () => {
+      cy.get('a[href="/zh/inference"]').should('exist');
+    });
+
+    it('sets hreflang alternates to the English homepage', () => {
+      cy.get('link[rel="alternate"][hreflang="en"]').should('exist');
+      cy.get('link[rel="alternate"][hreflang="zh-CN"]').should('exist');
+    });
+
+    it('header language toggle points back to English', () => {
+      cy.get('[data-testid="language-toggle"]').should('have.attr', 'href', '/');
+    });
+  });
+
+  describe('zh dashboard tab page', () => {
+    before(() => {
+      cy.visit('/zh/inference');
+    });
+
+    it('renders the Chinese SEO intro above the chart', () => {
+      cy.get('[data-testid="zh-tab-intro"]').within(() => {
+        cy.contains('h1', 'AI 推理基准测试').should('exist');
+      });
+    });
+
+    it('tab nav shows Chinese labels linking within /zh', () => {
+      cy.get('[data-testid="tab-trigger-evaluation"]')
+        .should('contain.text', '准确率评估')
+        .and('have.attr', 'href')
+        .and('match', /^\/zh\/evaluation/u);
+    });
+  });
+
+  describe('zh blog', () => {
+    before(() => {
+      cy.visit('/zh/blog');
+    });
+
+    it('renders the Chinese blog listing', () => {
+      cy.contains('h2', '文章').should('exist');
+      cy.get('a[href^="/zh/blog/"]').should('have.length.gte', 1);
+    });
+  });
+
+  describe('zh blog post page', () => {
+    before(() => {
+      cy.visit('/zh/blog/inferencemax-open-source-inference-benchmarking');
+    });
+
+    it('renders translated content with Chinese chrome', () => {
+      cy.get('article.prose').should('exist');
+      cy.contains('分钟阅读').should('exist');
+      cy.get('a[href="/zh/blog"]').should('exist');
+    });
+
+    it('links to the English original', () => {
+      cy.get('a[href="/blog/inferencemax-open-source-inference-benchmarking"]').should('exist');
+    });
+  });
+
+  describe('English pages expose the Chinese sibling', () => {
+    before(() => {
+      cy.visit('/blog');
+    });
+
+    it('has a zh-CN hreflang alternate and a language toggle', () => {
+      // hreflang URLs are absolute against the production origin.
+      cy.get('link[rel="alternate"][hreflang="zh-CN"]')
+        .should('have.attr', 'href')
+        .and('match', /\/zh\/blog$/u);
+      cy.get('[data-testid="language-toggle"]')
+        .should('contain.text', '中文')
+        .and('have.attr', 'href', '/zh/blog');
+    });
+  });
+});
diff --git a/packages/app/src/app/(landing)/page.tsx b/packages/app/src/app/(landing)/page.tsx
index 64097c94..9d691104 100644
--- a/packages/app/src/app/(landing)/page.tsx
+++ b/packages/app/src/app/(landing)/page.tsx
@@ -1,13 +1,14 @@
 import type { Metadata } from 'next';
 
 import { LandingPage } from '@/components/landing/landing-page';
+import { enAlternates } from '@/lib/i18n';
 import { LANDING_META } from '@/lib/tab-meta';
 import { SITE_URL } from '@semianalysisai/inferencex-constants';
 
 export const metadata: Metadata = {
   title: LANDING_META.title,
   description: LANDING_META.description,
-  alternates: { canonical: SITE_URL },
+  alternates: enAlternates('/'),
   openGraph: {
     title: `${LANDING_META.title} | InferenceX`,
     description: LANDING_META.description,
diff --git a/packages/app/src/app/about/page.tsx b/packages/app/src/app/about/page.tsx
index 4cc1c5d2..7e80dd7b 100644
--- a/packages/app/src/app/about/page.tsx
+++ b/packages/app/src/app/about/page.tsx
@@ -4,6 +4,7 @@ import Link from 'next/link';
 import { Card } from '@/components/ui/card';
 import { FAQ_ITEMS } from '@/components/about/faq-data';
 import { JsonLd } from '@/components/json-ld';
+import { enAlternates } from '@/lib/i18n';
 import { GITHUB_OWNER, GITHUB_REPO, SITE_URL } from '@semianalysisai/inferencex-constants';
 
 const faqJsonLd = {
@@ -23,7 +24,7 @@ export const metadata: Metadata = {
   title: 'About',
   description:
     'InferenceX is an independent, vendor neutral, reproducible benchmark which continuously benchmarks inference software across a wide range of AI accelerators.',
-  alternates: { canonical: `${SITE_URL}/about` },
+  alternates: enAlternates('/about'),
   openGraph: {
     title: 'About | InferenceX',
     description:
diff --git a/packages/app/src/app/blog/[slug]/page.tsx b/packages/app/src/app/blog/[slug]/page.tsx
index 7b302fbb..05bda7e6 100644
--- a/packages/app/src/app/blog/[slug]/page.tsx
+++ b/packages/app/src/app/blog/[slug]/page.tsx
@@ -15,7 +15,14 @@ import { ReadingProgressBar } from '@/components/blog/reading-progress-bar';
 import { ShareTwitterButton, ShareLinkedInButton } from '@/components/share-buttons';
 import { Card } from '@/components/ui/card';
 import { JsonLd } from '@/components/json-ld';
-import { getAllPosts, getAdjacentPosts, extractHeadings, getPostBySlug } from '@/lib/blog';
+import {
+  getAllPosts,
+  getAdjacentPosts,
+  extractHeadings,
+  getPostBySlug,
+  hasZhTranslation,
+} from '@/lib/blog';
+import { languageAlternates } from '@/lib/i18n';
 import {
   AUTHOR_HANDLE,
   AUTHOR_NAME,
@@ -42,7 +49,11 @@ export async function generateMetadata({ params }: Props): Promise<Metadata> {
     description: meta.subtitle,
     keywords: meta.tags,
     authors: [{ name: AUTHOR_NAME }],
-    alternates: { canonical: `${SITE_URL}/blog/${slug}` },
+    alternates: {
+      canonical: `${SITE_URL}/blog/${slug}`,
+      // hreflang to the Chinese translation when one exists under content/blog/zh/.
+      ...(hasZhTranslation(slug) && { languages: languageAlternates(`/blog/${slug}`) }),
+    },
     openGraph: {
       title: `${meta.title} | ${SITE_NAME}`,
       description: meta.subtitle,
diff --git a/packages/app/src/app/blog/page.tsx b/packages/app/src/app/blog/page.tsx
index a34c147f..d63bb098 100644
--- a/packages/app/src/app/blog/page.tsx
+++ b/packages/app/src/app/blog/page.tsx
@@ -6,12 +6,13 @@ import { BlogTagLink } from '@/components/blog/blog-tag-link';
 import { Card } from '@/components/ui/card';
 import { JsonLd } from '@/components/json-ld';
 import { getAllPosts } from '@/lib/blog';
+import { enAlternates } from '@/lib/i18n';
 import { SITE_URL, SITE_NAME, AUTHOR_NAME } from '@semianalysisai/inferencex-constants';
 
 export const metadata: Metadata = {
   title: 'Articles',
   description: `Technical articles from ${SITE_NAME} by ${AUTHOR_NAME} — AI inference benchmarking, GPU performance analysis, and ML infrastructure insights.`,
-  alternates: { canonical: `${SITE_URL}/blog` },
+  alternates: enAlternates('/blog'),
   openGraph: {
     title: `Articles | ${SITE_NAME} by ${AUTHOR_NAME}`,
     description: 'AI inference benchmarking insights and GPU performance analysis.',
diff --git a/packages/app/src/app/compare-per-dollar/page.tsx b/packages/app/src/app/compare-per-dollar/page.tsx
index beecbf65..93090112 100644
--- a/packages/app/src/app/compare-per-dollar/page.tsx
+++ b/packages/app/src/app/compare-per-dollar/page.tsx
@@ -7,6 +7,8 @@ import {
   SUPPORTERS_LINE,
 } from '@semianalysisai/inferencex-constants';
 
+import { enAlternates } from '@/lib/i18n';
+
 import { ComparePairCardLink } from '@/components/compare/compare-pair-card-link';
 import { JsonLd } from '@/components/json-ld';
 import { Card } from '@/components/ui/card';
@@ -21,7 +23,7 @@ const DESCRIPTION = `Which GPU delivers more inference performance per dollar? I
 export const metadata: Metadata = {
   title: 'GPU Performance per Dollar',
   description: DESCRIPTION,
-  alternates: { canonical: `${SITE_URL}/compare-per-dollar` },
+  alternates: enAlternates('/compare-per-dollar'),
   openGraph: {
     title: `GPU Performance per Dollar | ${SITE_NAME}`,
     description: DESCRIPTION,
diff --git a/packages/app/src/app/compare/page.tsx b/packages/app/src/app/compare/page.tsx
index 0c4a63f8..fda22b38 100644
--- a/packages/app/src/app/compare/page.tsx
+++ b/packages/app/src/app/compare/page.tsx
@@ -8,6 +8,8 @@ import {
   SUPPORTERS_LINE,
 } from '@semianalysisai/inferencex-constants';
 
+import { enAlternates } from '@/lib/i18n';
+
 import { ComparePairCardLink } from '@/components/compare/compare-pair-card-link';
 import { JsonLd } from '@/components/json-ld';
 import { Card } from '@/components/ui/card';
@@ -22,7 +24,7 @@ const DESCRIPTION = `InferenceX is the independent, open-source GPU inference be
 export const metadata: Metadata = {
   title: 'GPU Comparisons',
   description: DESCRIPTION,
-  alternates: { canonical: `${SITE_URL}/compare` },
+  alternates: enAlternates('/compare'),
   openGraph: {
     title: `GPU Comparisons | ${SITE_NAME}`,
     description: DESCRIPTION,
diff --git a/packages/app/src/app/land-acknowledgement/page.tsx b/packages/app/src/app/land-acknowledgement/page.tsx
index e154c908..bb85154c 100644
--- a/packages/app/src/app/land-acknowledgement/page.tsx
+++ b/packages/app/src/app/land-acknowledgement/page.tsx
@@ -1,6 +1,7 @@
 import type { Metadata } from 'next';
 
 import { Card } from '@/components/ui/card';
+import { enAlternates } from '@/lib/i18n';
 import { SITE_URL } from '@semianalysisai/inferencex-constants';
 
 const REGIONAL_ACKNOWLEDGEMENTS = [
@@ -29,7 +30,7 @@ export const metadata: Metadata = {
   title: 'Land Acknowledgement',
   description:
     'A land acknowledgement for the Indigenous peoples and homelands connected to InferenceX US benchmark clusters in San Jose, Los Angeles, and Chicago.',
-  alternates: { canonical: `${SITE_URL}/land-acknowledgement` },
+  alternates: enAlternates('/land-acknowledgement'),
   openGraph: {
     title: 'Land Acknowledgement | InferenceX',
     description:
diff --git a/packages/app/src/app/quotes/page.tsx b/packages/app/src/app/quotes/page.tsx
index 0cba135a..388fdc69 100644
--- a/packages/app/src/app/quotes/page.tsx
+++ b/packages/app/src/app/quotes/page.tsx
@@ -1,13 +1,14 @@
 import type { Metadata } from 'next';
 
 import { QuotesContent } from '@/components/quotes/quotes-content';
+import { enAlternates } from '@/lib/i18n';
 import { SITE_URL } from '@semianalysisai/inferencex-constants';
 
 export const metadata: Metadata = {
   title: 'Supporters',
   description:
     'InferenceX initiative is supported by major buyers of compute and prominent members of the ML community including those from MiniMax, Moonshot Kimi, Alibaba Qwen, OpenAI, Microsoft, vLLM, PyTorch Foundation, Oracle and more.',
-  alternates: { canonical: `${SITE_URL}/quotes` },
+  alternates: enAlternates('/quotes'),
   openGraph: {
     title: 'Supporters | InferenceX by SemiAnalysis',
     description:
diff --git a/packages/app/src/app/sitemap.ts b/packages/app/src/app/sitemap.ts
index d1717aa3..6f3ae968 100644
--- a/packages/app/src/app/sitemap.ts
+++ b/packages/app/src/app/sitemap.ts
@@ -3,6 +3,7 @@ import type { MetadataRoute } from 'next';
 import { getAllPosts } from '@/lib/blog';
 import { getAllComparableCompareSlugs } from '@/lib/compare-availability';
 import { canonicalCompareSlug } from '@/lib/compare-slug';
+import { languageAlternates, zhPath } from '@/lib/i18n';
 import { SITE_URL as BASE_URL } from '@semianalysisai/inferencex-constants';
 
 const TABS = [
@@ -14,61 +15,67 @@ const TABS = [
   'gpu-metrics',
 ] as const;
 
+type SitemapEntry = MetadataRoute.Sitemap[number];
+
+/**
+ * Emit an English page and its Chinese sibling as a pair, both carrying the
+ * full hreflang set so crawlers link the two versions.
+ */
+function localizedPair(
+  enPath: string,
+  entry: Omit<SitemapEntry, 'url' | 'alternates'>,
+): SitemapEntry[] {
+  const languages = languageAlternates(enPath);
+  return [
+    {
+      ...entry,
+      url: enPath === '/' ? BASE_URL : `${BASE_URL}${enPath}`,
+      alternates: { languages },
+    },
+    { ...entry, url: `${BASE_URL}${zhPath(enPath)}`, alternates: { languages } },
+  ];
+}
+
 export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
   const now = new Date().toISOString();
   // Only emit (model, pair) URLs that have benchmark data on both sides —
   // avoids polluting the sitemap with empty pages that hurt crawl budget.
   const compareSlugs = await getAllComparableCompareSlugs();
+  const zhPosts = new Set(getAllPosts('zh').map((post) => post.slug));
 
   return [
-    {
-      url: BASE_URL,
-      lastModified: now,
-      changeFrequency: 'daily',
-      priority: 1,
-    },
-    ...TABS.map((tab) => ({
-      url: `${BASE_URL}/${tab}`,
-      lastModified: now,
-      changeFrequency: 'daily' as const,
-      priority: 0.9,
-    })),
-    {
-      url: `${BASE_URL}/quotes`,
-      lastModified: now,
-      changeFrequency: 'monthly',
-      priority: 0.6,
-    },
-    {
-      url: `${BASE_URL}/land-acknowledgement`,
+    ...localizedPair('/', { lastModified: now, changeFrequency: 'daily', priority: 1 }),
+    ...TABS.flatMap((tab) =>
+      localizedPair(`/${tab}`, {
+        lastModified: now,
+        changeFrequency: 'daily' as const,
+        priority: 0.9,
+      }),
+    ),
+    ...localizedPair('/quotes', { lastModified: now, changeFrequency: 'monthly', priority: 0.6 }),
+    ...localizedPair('/about', { lastModified: now, changeFrequency: 'monthly', priority: 0.6 }),
+    ...localizedPair('/land-acknowledgement', {
       lastModified: now,
       changeFrequency: 'yearly',
       priority: 0.4,
-    },
-    {
-      url: `${BASE_URL}/compare`,
-      lastModified: now,
-      changeFrequency: 'daily',
-      priority: 0.8,
-    },
-    {
-      url: `${BASE_URL}/compare-per-dollar`,
+    }),
+    ...localizedPair('/compare', { lastModified: now, changeFrequency: 'daily', priority: 0.8 }),
+    ...localizedPair('/compare-per-dollar', {
       lastModified: now,
       changeFrequency: 'daily',
       priority: 0.8,
-    },
-    {
-      url: `${BASE_URL}/blog`,
-      lastModified: now,
-      changeFrequency: 'weekly',
-      priority: 0.8,
-    },
-    ...getAllPosts().map((post) => ({
-      url: `${BASE_URL}/blog/${post.slug}`,
-      lastModified: new Date(`${post.modifiedDate ?? post.date}T00:00:00Z`).toISOString(),
-      changeFrequency: 'monthly' as const,
-      priority: 0.7,
-    })),
+    }),
+    ...localizedPair('/blog', { lastModified: now, changeFrequency: 'weekly', priority: 0.8 }),
+    ...getAllPosts().flatMap((post) => {
+      const entry = {
+        lastModified: new Date(`${post.modifiedDate ?? post.date}T00:00:00Z`).toISOString(),
+        changeFrequency: 'monthly' as const,
+        priority: 0.7,
+      };
+      // Posts without a Chinese translation stay English-only in the sitemap.
+      if (!zhPosts.has(post.slug)) return [{ ...entry, url: `${BASE_URL}/blog/${post.slug}` }];
+      return localizedPair(`/blog/${post.slug}`, entry);
+    }),
     ...compareSlugs.map(({ modelSlug, a, b }) => ({
       url: `${BASE_URL}/compare/${canonicalCompareSlug(modelSlug, a, b)}`,
       lastModified: now,
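Review note on the sitemap hunk above: `localizedPair` relies on `languageAlternates` and `zhPath` from `src/lib/i18n.ts`, and the page hunks rely on `enAlternates`, none of which appear in this part of the patch. The sketch below is one way such helpers could behave, inferred only from how they are called here and from the en / zh-CN / x-default set named in the commit message. Everything in it is an assumption: the bodies, the `absoluteUrl` helper, the placeholder origin, and the choice to point `x-default` at the English URL. The real implementation may differ.

```typescript
// Hypothetical stand-ins; the real helpers live in src/lib/i18n.ts and may differ.
const SITE_URL = 'https://example.com'; // placeholder origin, not the production value

// Map an English path to its Chinese sibling: '/' -> '/zh', '/blog' -> '/zh/blog'.
function zhPath(enPath: string): string {
  return enPath === '/' ? '/zh' : `/zh${enPath}`;
}

// Build an absolute URL without a trailing slash on the homepage.
function absoluteUrl(path: string): string {
  return path === '/' ? SITE_URL : `${SITE_URL}${path}`;
}

// hreflang map shared by both language versions of a page.
// Assumption: x-default points at the English page.
function languageAlternates(enPath: string): Record<string, string> {
  return {
    en: absoluteUrl(enPath),
    'zh-CN': absoluteUrl(zhPath(enPath)),
    'x-default': absoluteUrl(enPath),
  };
}

// Value for the `alternates` metadata field on an English page.
function enAlternates(enPath: string): { canonical: string; languages: Record<string, string> } {
  return { canonical: absoluteUrl(enPath), languages: languageAlternates(enPath) };
}

console.log(enAlternates('/blog'));
```

Under these assumptions the `zh-CN` URL for `/blog` ends in `/zh/blog`, which is the pattern the Cypress spec in this patch asserts against.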
diff --git a/packages/app/src/app/zh/(dashboard)/calculator/page.tsx b/packages/app/src/app/zh/(dashboard)/calculator/page.tsx
new file mode 100644
index 00000000..2bd53fca
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/calculator/page.tsx
@@ -0,0 +1,23 @@
+import type { Metadata } from 'next';
+
+import ThroughputCalculatorDisplay from '@/components/calculator/ThroughputCalculatorDisplay';
+import { resolveCalculatorUrlSeed } from '@/components/calculator/url-seed';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('calculator');
+
+interface Props {
+  searchParams: Promise<Record<string, string | string[] | undefined>>;
+}
+
+export default async function ZhCalculatorPage({ searchParams }: Props) {
+  const sp = await searchParams;
+  const seed = resolveCalculatorUrlSeed(sp);
+  return (
+    <>
+      <ZhTabIntro tab="calculator" />
+      <ThroughputCalculatorDisplay urlSeed={seed} />
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/(dashboard)/evaluation/page.tsx b/packages/app/src/app/zh/(dashboard)/evaluation/page.tsx
new file mode 100644
index 00000000..a12ca6f3
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/evaluation/page.tsx
@@ -0,0 +1,21 @@
+import type { Metadata } from 'next';
+
+import { EvaluationProvider } from '@/components/evaluation/EvaluationContext';
+import EvaluationChartDisplay from '@/components/evaluation/ui/ChartDisplay';
+import { NudgeEngine } from '@/components/nudge-engine';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('evaluation');
+
+export default function ZhEvaluationPage() {
+  return (
+    <>
+      <ZhTabIntro tab="evaluation" />
+      <EvaluationProvider>
+        <EvaluationChartDisplay />
+        <NudgeEngine scope="evaluation" />
+      </EvaluationProvider>
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/(dashboard)/gpu-metrics/page.tsx b/packages/app/src/app/zh/(dashboard)/gpu-metrics/page.tsx
new file mode 100644
index 00000000..8909104e
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/gpu-metrics/page.tsx
@@ -0,0 +1,16 @@
+import type { Metadata } from 'next';
+
+import GpuMetricsDisplay from '@/components/gpu-power/GpuPowerDisplay';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('gpu-metrics');
+
+export default function ZhGpuMetricsPage() {
+  return (
+    <>
+      <ZhTabIntro tab="gpu-metrics" />
+      <GpuMetricsDisplay />
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/(dashboard)/gpu-specs/page.tsx b/packages/app/src/app/zh/(dashboard)/gpu-specs/page.tsx
new file mode 100644
index 00000000..297f4bc6
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/gpu-specs/page.tsx
@@ -0,0 +1,16 @@
+import type { Metadata } from 'next';
+
+import { GpuSpecsContent } from '@/components/gpu-specs/gpu-specs-content';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('gpu-specs');
+
+export default function ZhGpuSpecsPage() {
+  return (
+    <>
+      <ZhTabIntro tab="gpu-specs" />
+      <GpuSpecsContent />
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/(dashboard)/historical/page.tsx b/packages/app/src/app/zh/(dashboard)/historical/page.tsx
new file mode 100644
index 00000000..4240aef1
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/historical/page.tsx
@@ -0,0 +1,19 @@
+import type { Metadata } from 'next';
+
+import { InferenceProvider } from '@/components/inference/InferenceContext';
+import HistoricalTrendsDisplay from '@/components/trends/HistoricalTrendsDisplay';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('historical');
+
+export default function ZhHistoricalPage() {
+  return (
+    <>
+      <ZhTabIntro tab="historical" />
+      <InferenceProvider activeTab="historical">
+        <HistoricalTrendsDisplay />
+      </InferenceProvider>
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/(dashboard)/inference/page.tsx b/packages/app/src/app/zh/(dashboard)/inference/page.tsx
new file mode 100644
index 00000000..a659d153
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/inference/page.tsx
@@ -0,0 +1,19 @@
+import type { Metadata } from 'next';
+
+import { InferenceProvider } from '@/components/inference/InferenceContext';
+import InferenceChartDisplay from '@/components/inference/ui/ChartDisplay';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('inference');
+
+export default function ZhInferencePage() {
+  return (
+    <>
+      <ZhTabIntro tab="inference" />
+      <InferenceProvider activeTab="inference">
+        <InferenceChartDisplay />
+      </InferenceProvider>
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/(dashboard)/layout.tsx b/packages/app/src/app/zh/(dashboard)/layout.tsx
new file mode 100644
index 00000000..7a237413
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/layout.tsx
@@ -0,0 +1,5 @@
+import { DashboardShell } from '@/components/dashboard-shell';
+
+export default function ZhDashboardLayout({ children }: { children: React.ReactNode }) {
+  return <DashboardShell>{children}</DashboardShell>;
+}
diff --git a/packages/app/src/app/zh/(dashboard)/reliability/page.tsx b/packages/app/src/app/zh/(dashboard)/reliability/page.tsx
new file mode 100644
index 00000000..d4748140
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/reliability/page.tsx
@@ -0,0 +1,19 @@
+import type { Metadata } from 'next';
+
+import { ReliabilityProvider } from '@/components/reliability/ReliabilityContext';
+import ReliabilityChartDisplay from '@/components/reliability/ui/ChartDisplay';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('reliability');
+
+export default function ZhReliabilityPage() {
+  return (
+    <>
+      <ZhTabIntro tab="reliability" />
+      <ReliabilityProvider>
+        <ReliabilityChartDisplay />
+      </ReliabilityProvider>
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/(dashboard)/submissions/page.tsx b/packages/app/src/app/zh/(dashboard)/submissions/page.tsx
new file mode 100644
index 00000000..97c8546d
--- /dev/null
+++ b/packages/app/src/app/zh/(dashboard)/submissions/page.tsx
@@ -0,0 +1,16 @@
+import type { Metadata } from 'next';
+
+import SubmissionsDisplay from '@/components/submissions/SubmissionsDisplay';
+import { ZhTabIntro } from '@/components/zh/zh-tab-intro';
+import { tabMetadataZh } from '@/lib/tab-meta-zh';
+
+export const metadata: Metadata = tabMetadataZh('submissions');
+
+export default function ZhSubmissionsPage() {
+  return (
+    <>
+      <ZhTabIntro tab="submissions" />
+      <SubmissionsDisplay />
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/about/page.tsx b/packages/app/src/app/zh/about/page.tsx
new file mode 100644
index 00000000..688f719a
--- /dev/null
+++ b/packages/app/src/app/zh/about/page.tsx
@@ -0,0 +1,215 @@
+import type { Metadata } from 'next';
+import Link from 'next/link';
+
+import { Card } from '@/components/ui/card';
+import { FAQ_ITEMS_ZH } from '@/components/about/faq-data-zh';
+import { JsonLd } from '@/components/json-ld';
+import { zhAlternates, ZH_OG_LOCALE, ZH_LANG_TAG } from '@/lib/i18n';
+import { GITHUB_OWNER, GITHUB_REPO, SITE_URL } from '@semianalysisai/inferencex-constants';
+
+const faqJsonLd = {
+  '@context': 'https://schema.org',
+  '@type': 'FAQPage',
+  inLanguage: ZH_LANG_TAG,
+  mainEntity: FAQ_ITEMS_ZH.map((item) => ({
+    '@type': 'Question',
+    name: item.question,
+    acceptedAnswer: {
+      '@type': 'Answer',
+      text: [item.answer, item.link?.text, ...(item.list ?? [])].filter(Boolean).join(' '),
+    },
+  })),
+};
+
+export const metadata: Metadata = {
+  title: '关于',
+  description:
+    'InferenceX 是一个独立、厂商中立、可复现的基准测试平台，持续测试各类 AI 加速器上的推理软件性能。',
+  alternates: zhAlternates('/about'),
+  openGraph: {
+    title: '关于 | InferenceX',
+    description:
+      'InferenceX 是一个独立、厂商中立、可复现的基准测试平台，持续测试各类 AI 加速器上的推理软件性能。',
+    url: `${SITE_URL}/zh/about`,
+    locale: ZH_OG_LOCALE,
+  },
+  twitter: {
+    title: '关于 | InferenceX',
+    description:
+      'InferenceX 是一个独立、厂商中立、可复现的基准测试平台，持续测试各类 AI 加速器上的推理软件性能。',
+  },
+};
+
+export default function AboutPageZh() {
+  return (
+    <main className="relative">
+      <JsonLd data={faqJsonLd} />
+      <div className="container mx-auto px-4 lg:px-8 flex flex-col gap-6 lg:gap-4 pb-8">
+        <section>
+          <Card>
+            <h2 className="text-lg font-semibold mb-2">
+              开源持续推理基准测试——受万亿美元级吉瓦规模 Token 工厂运营者的信赖
+            </h2>
+            <p className="text-muted-foreground mb-2">
+              随着世界以指数级速度迈向
+              AGI，软件开发和模型发布日新月异。现有基准测试因其静态性质而迅速过时，参与者往往提交专为基准测试定制的软件镜像，无法反映真实的线上推理性能。
+            </p>
+            <p className="text-muted-foreground mb-2">
+              <strong>InferenceX&trade;</strong>（原名
+              InferenceMAX）是我们独立、厂商中立、可复现的基准测试平台，通过持续测试实际可用于 ML
+              社区的各类 AI 加速器上的推理软件来解决这些问题。
+            </p>
+            <p className="text-muted-foreground">
+              我们的开放数据与洞察已被 ML 社区广泛采用，包括万亿美元级 Token 工厂和 AI
+              实验室的容量规划策略团队，以及多家数十亿美元级
+              NeoCloud。了解更多详情请阅读我们的文章：{' '}
+              <Link
+                href="/zh/blog/inferencemax-open-source-inference-benchmarking"
+                className="text-brand hover:underline font-medium"
+              >
+                InferenceX v1
+              </Link>
+              、{' '}
+              <Link
+                href="/zh/blog/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper"
+                className="text-brand hover:underline font-medium"
+              >
+                InferenceX v2
+              </Link>
+              。
+            </p>
+          </Card>
+        </section>
+
+        <section id="reproducibility" className="scroll-mt-24">
+          <Card>
+            <h2 className="text-lg font-semibold mb-2">可复现性</h2>
+            <p className="text-muted-foreground mb-4">
+              仪表板上的每一个数据点均来自公开的 GitHub Actions
+              工作流运行。测试配方、日志、产物以及数据库记录端到端关联，任何人都可以审计、重新运行或
+              fork 基准测试。
+            </p>
+            <ol className="space-y-3 text-sm text-muted-foreground mb-4">
+              <li className="flex gap-3">
+                <span className="flex size-6 shrink-0 items-center justify-center rounded-full bg-brand/10 text-brand font-semibold text-xs">
+                  1
+                </span>
+                <div>
+                  <strong className="text-foreground">配方提交至仓库。</strong>{' '}
+                  每种硬件、框架、模型和精度的组合都是一个提交到公开仓库的 shell
+                  脚本。镜像、命令行和并行度均在源码中固定。
+                </div>
+              </li>
+              <li className="flex gap-3">
+                <span className="flex size-6 shrink-0 items-center justify-center rounded-full bg-brand/10 text-brand font-semibold text-xs">
+                  2
+                </span>
+                <div>
+                  <strong className="text-foreground">在真实硬件上运行。</strong> GitHub Actions
+                  将工作流调度到实际的目标加速器（NVIDIA、AMD
+                  等）上，并在运行过程中公开流式输出完整的任务日志。
+                </div>
+              </li>
+              <li className="flex gap-3">
+                <span className="flex size-6 shrink-0 items-center justify-center rounded-full bg-brand/10 text-brand font-semibold text-xs">
+                  3
+                </span>
+                <div>
+                  <strong className="text-foreground">上传产物。</strong> 请求延迟、token 计数、GPU
+                  功耗遥测数据和评估样本均附加到运行页面。GitHub Actions 保留这些产物 90
+                  天，同时每周发布完整基准测试数据库的快照作为公开的 GitHub
+                  Release，以实现更长期的可审计性。
+                </div>
+              </li>
+              <li className="flex gap-3">
+                <span className="flex size-6 shrink-0 items-center justify-center rounded-full bg-brand/10 text-brand font-semibold text-xs">
+                  4
+                </span>
+                <div>
+                  <strong className="text-foreground">导入仪表板。</strong>{' '}
+                  成功的运行将被加载到数据库中并在此展示。每个图表 tooltip
+                  都附带一个直接链接，指向生成该数据点的 GitHub Actions
+                  运行。点击任意数据点即可审计其来源。
+                </div>
+              </li>
+            </ol>
+            <div className="flex flex-wrap gap-3 text-sm">
+              <Link
+                href={`https://github.com/${GITHUB_OWNER}/${GITHUB_REPO}/actions?query=branch%3Amain+event%3Apush`}
+                target="_blank"
+                rel="noopener noreferrer"
+                className="inline-flex items-center gap-1.5 rounded-md border border-border px-3 py-1.5 hover:bg-accent transition-colors"
+              >
+                浏览工作流运行
+              </Link>
+              <Link
+                href={`https://github.com/${GITHUB_OWNER}/${GITHUB_REPO}/tree/main/benchmarks`}
+                target="_blank"
+                rel="noopener noreferrer"
+                className="inline-flex items-center gap-1.5 rounded-md border border-border px-3 py-1.5 hover:bg-accent transition-colors"
+              >
+                查看基准测试配方
+              </Link>
+              <Link
+                href="https://github.com/SemiAnalysisAI/InferenceX-app/releases?q=db-dump"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="inline-flex items-center gap-1.5 rounded-md border border-border px-3 py-1.5 hover:bg-accent transition-colors"
+              >
+                每周数据库快照
+              </Link>
+              <Link
+                href={`https://github.com/${GITHUB_OWNER}/${GITHUB_REPO}`}
+                target="_blank"
+                rel="noopener noreferrer"
+                className="inline-flex items-center gap-1.5 rounded-md border border-border px-3 py-1.5 hover:bg-accent transition-colors"
+              >
+                源代码仓库
+              </Link>
+            </div>
+          </Card>
+        </section>
+
+        <section>
+          <Card>
+            <h2 className="text-lg font-semibold mb-4">常见问题</h2>
+            <dl className="divide-y divide-border">
+              {FAQ_ITEMS_ZH.map((item) => (
+                <div key={item.question} className="py-4 first:pt-0 last:pb-0">
+                  <dt className="font-medium mb-1">{item.question}</dt>
+                  <dd className="text-muted-foreground text-sm">
+                    {item.answer && (
+                      <p>
+                        {item.answer}
+                        {item.link && (
+                          <>
+                            {' '}
+                            <a
+                              href={item.link.href}
+                              target="_blank"
+                              rel="noopener noreferrer"
+                              className="text-brand hover:underline font-medium"
+                            >
+                              {item.link.text}
+                            </a>
+                          </>
+                        )}
+                      </p>
+                    )}
+                    {item.list && (
+                      <ul className="mt-1.5 ml-8 list-disc space-y-0.5">
+                        {item.list.map((li) => (
+                          <li key={li}>{li}</li>
+                        ))}
+                      </ul>
+                    )}
+                  </dd>
+                </div>
+              ))}
+            </dl>
+          </Card>
+        </section>
+      </div>
+    </main>
+  );
+}
diff --git a/packages/app/src/app/zh/blog/[slug]/opengraph-image.tsx b/packages/app/src/app/zh/blog/[slug]/opengraph-image.tsx
new file mode 100644
index 00000000..378444bf
--- /dev/null
+++ b/packages/app/src/app/zh/blog/[slug]/opengraph-image.tsx
@@ -0,0 +1,43 @@
+import { ImageResponse } from 'next/og';
+
+import { getAllPosts, getPostBySlug } from '@/lib/blog';
+
+// The OG renderer's default Satori font has no CJK glyphs, so Chinese posts
+// reuse the ENGLISH post metadata for the image — same visual as the original
+// article card. Swapping in a CJK-capable font is a known follow-up.
+import { renderOgImage, size } from '../../../blog/[slug]/og-image-render';
+
+export const alt = 'InferenceX Articles';
+export { size };
+export const contentType = 'image/png';
+
+export function generateStaticParams() {
+  return getAllPosts('zh').map((post) => ({ slug: post.slug }));
+}
+
+export default async function ZhOgImage({ params }: { params: Promise<{ slug: string }> }) {
+  const { slug } = await params;
+  const result = getPostBySlug(slug);
+
+  if (!result) {
+    return new ImageResponse(
+      <div
+        style={{
+          display: 'flex',
+          alignItems: 'center',
+          justifyContent: 'center',
+          width: '100%',
+          height: '100%',
+          backgroundColor: '#18181b',
+          color: '#fafafa',
+          fontSize: 48,
+        }}
+      >
+        InferenceX Articles
+      </div>,
+      size,
+    );
+  }
+
+  return renderOgImage(result.meta);
+}
diff --git a/packages/app/src/app/zh/blog/[slug]/page.tsx b/packages/app/src/app/zh/blog/[slug]/page.tsx
new file mode 100644
index 00000000..6caac102
--- /dev/null
+++ b/packages/app/src/app/zh/blog/[slug]/page.tsx
@@ -0,0 +1,214 @@
+import type { Metadata } from 'next';
+import Link from 'next/link';
+import { notFound } from 'next/navigation';
+import { compileMDX } from 'next-mdx-remote/rsc';
+import rehypeShikiFromHighlighter from '@shikijs/rehype/core';
+import remarkGfm from 'remark-gfm';
+import { createHighlighterCore } from 'shiki/core';
+import { createOnigurumaEngine } from 'shiki/engine/oniguruma';
+
+import { BlogBackLink } from '@/components/blog/blog-back-link';
+import { BlogPostNav } from '@/components/blog/blog-post-nav';
+import { BlogToc } from '@/components/blog/blog-toc';
+import { HashScroll } from '@/components/blog/hash-scroll';
+import { createMdxComponents } from '@/components/blog/mdx-components';
+import { ReadingProgressBar } from '@/components/blog/reading-progress-bar';
+import { ShareTwitterButton, ShareLinkedInButton } from '@/components/share-buttons';
+import { Card } from '@/components/ui/card';
+import { JsonLd } from '@/components/json-ld';
+import { getAllPosts, getAdjacentPosts, extractHeadings, getPostBySlug } from '@/lib/blog';
+import { ZH_LANG_TAG, ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+import {
+  AUTHOR_HANDLE,
+  AUTHOR_NAME,
+  SITE_NAME,
+  SITE_URL,
+} from '@semianalysisai/inferencex-constants';
+
+interface Props {
+  params: Promise<{ slug: string }>;
+}
+
+export function generateStaticParams() {
+  return getAllPosts('zh').map((post) => ({ slug: post.slug }));
+}
+
+export async function generateMetadata({ params }: Props): Promise<Metadata> {
+  const { slug } = await params;
+  const result = getPostBySlug(slug, 'zh');
+  if (!result) return {};
+  const { meta } = result;
+
+  return {
+    title: meta.title,
+    description: meta.subtitle,
+    keywords: meta.tags,
+    authors: [{ name: AUTHOR_NAME }],
+    alternates: zhAlternates(`/blog/${slug}`),
+    openGraph: {
+      title: `${meta.title} | ${SITE_NAME}`,
+      description: meta.subtitle,
+      url: `${SITE_URL}/zh/blog/${slug}`,
+      type: 'article',
+      locale: ZH_OG_LOCALE,
+      publishedTime: `${meta.date}T00:00:00Z`,
+      ...(meta.modifiedDate && { modifiedTime: `${meta.modifiedDate}T00:00:00Z` }),
+      authors: [AUTHOR_NAME],
+      tags: meta.tags,
+    },
+    twitter: {
+      card: 'summary_large_image',
+      title: meta.title,
+      description: meta.subtitle,
+      site: AUTHOR_HANDLE,
+      creator: AUTHOR_HANDLE,
+    },
+  };
+}
+
+let highlighterPromise: ReturnType<typeof createHighlighterCore> | null = null;
+
+function getHighlighter() {
+  if (!highlighterPromise) {
+    highlighterPromise = createHighlighterCore({
+      themes: [import('shiki/themes/github-dark.mjs'), import('shiki/themes/github-light.mjs')],
+      langs: [
+        import('shiki/langs/typescript.mjs'),
+        import('shiki/langs/javascript.mjs'),
+        import('shiki/langs/python.mjs'),
+        import('shiki/langs/bash.mjs'),
+        import('shiki/langs/json.mjs'),
+        import('shiki/langs/yaml.mjs'),
+        import('shiki/langs/css.mjs'),
+        import('shiki/langs/html.mjs'),
+        import('shiki/langs/tsx.mjs'),
+        import('shiki/langs/jsx.mjs'),
+        import('shiki/langs/sql.mjs'),
+        import('shiki/langs/go.mjs'),
+        import('shiki/langs/rust.mjs'),
+      ],
+      engine: createOnigurumaEngine(import('shiki/wasm')),
+    });
+  }
+  return highlighterPromise;
+}
+
+export default async function ZhBlogPostPage({ params }: Props) {
+  const { slug } = await params;
+  const result = getPostBySlug(slug, 'zh');
+  if (!result) notFound();
+
+  const { meta, raw } = result;
+  const adjacent = getAdjacentPosts(slug, 'zh');
+  const headings = extractHeadings(raw);
+  const highlighter = await getHighlighter();
+
+  const { content } = await compileMDX({
+    source: raw,
+    components: createMdxComponents(),
+    options: {
+      mdxOptions: {
+        remarkPlugins: [remarkGfm],
+        rehypePlugins: [
+          [
+            rehypeShikiFromHighlighter,
+            highlighter,
+            {
+              themes: { dark: 'github-dark', light: 'github-light' },
+              defaultColor: false,
+            },
+          ],
+        ],
+      },
+    },
+  });
+
+  const jsonLd = {
+    '@context': 'https://schema.org',
+    '@type': 'BlogPosting',
+    headline: meta.title,
+    author: { '@type': 'Person', name: AUTHOR_NAME },
+    publisher: { '@type': 'Organization', name: AUTHOR_NAME },
+    datePublished: `${meta.date}T00:00:00Z`,
+    ...(meta.modifiedDate && { dateModified: `${meta.modifiedDate}T00:00:00Z` }),
+    description: meta.subtitle,
+    url: `${SITE_URL}/zh/blog/${slug}`,
+    inLanguage: ZH_LANG_TAG,
+    wordCount: (raw.match(/\p{Script=Han}|[A-Za-z0-9]+/gu) ?? []).length,
+    timeRequired: `PT${meta.readingTime}M`,
+  };
+
+  return (
+    <main className="relative">
+      <HashScroll />
+      <ReadingProgressBar slug={slug} />
+      <JsonLd data={jsonLd} />
+      <div className="container mx-auto px-4 lg:px-8 flex flex-col gap-4">
+        <section data-blog-section="true" className="flex flex-col gap-4">
+          <Card>
+            <BlogBackLink href="/zh/blog" label="返回文章列表" />
+            <header>
+              <h2 className="text-2xl lg:text-4xl font-bold tracking-tight">{meta.title}</h2>
+              <p className="mt-3 text-base lg:text-lg text-muted-foreground">{meta.subtitle}</p>
+              <div className="flex flex-wrap items-center gap-3 text-sm text-muted-foreground mt-3">
+                <span>{AUTHOR_NAME}</span>
+                <span>&middot;</span>
+                <time dateTime={meta.date}>
+                  {new Date(`${meta.date}T00:00:00Z`).toLocaleDateString('zh-CN', {
+                    year: 'numeric',
+                    month: 'long',
+                    day: 'numeric',
+                    timeZone: 'UTC',
+                  })}
+                </time>
+                <span>&middot;</span>
+                <span>{meta.readingTime} 分钟阅读</span>
+                <span>&middot;</span>
+                <Link href={`/blog/${slug}`} hrefLang="en" className="hover:underline text-brand">
+                  阅读英文原文
+                </Link>
+                {meta.tags && meta.tags.length > 0 && (
+                  <>
+                    <span>&middot;</span>
+                    {meta.tags.map((tag) => (
+                      <span key={tag} className="rounded-full bg-muted px-3 py-0.5 text-xs">
+                        {tag}
+                      </span>
+                    ))}
+                  </>
+                )}
+              </div>
+              <div className="flex items-center gap-1.5 mt-4">
+                <ShareTwitterButton text={meta.title} />
+                <ShareLinkedInButton />
+              </div>
+            </header>
+            {headings.length > 0 && (
+              <div className="mt-4">
+                <BlogToc headings={headings} label="本页目录" />
+              </div>
+            )}
+            <div className="mt-6 pt-6 border-t border-border/40">
+              <article
+                data-blog-article
+                className="prose prose-neutral dark:prose-invert max-w-none blog-prose"
+              >
+                {content}
+                <p className="text-xs text-muted-foreground">
+                  本文由英文原文翻译而来，如有歧义以英文版为准。所有文章版权归 &copy; SemiAnalysis
+                  所有，保留所有权利。覆盖应用源代码的 AGPL-3.0 许可证不适用于文章内容。
+                </p>
+              </article>
+            </div>
+          </Card>
+          <BlogPostNav
+            prev={adjacent.prev ? { slug: adjacent.prev.slug, title: adjacent.prev.title } : null}
+            next={adjacent.next ? { slug: adjacent.next.slug, title: adjacent.next.title } : null}
+            basePath="/zh/blog"
+            labels={{ prev: '上一篇', next: '下一篇' }}
+          />
+        </section>
+      </div>
+    </main>
+  );
+}
diff --git a/packages/app/src/app/zh/blog/layout.tsx b/packages/app/src/app/zh/blog/layout.tsx
new file mode 100644
index 00000000..21764bb4
--- /dev/null
+++ b/packages/app/src/app/zh/blog/layout.tsx
@@ -0,0 +1,9 @@
+export default function ZhBlogLayout({ children }: { children: React.ReactNode }) {
+  return (
+    <>
+      <link rel="preconnect" href="https://substack-post-media.s3.amazonaws.com" />
+      <link rel="dns-prefetch" href="https://substack-post-media.s3.amazonaws.com" />
+      {children}
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/blog/page.tsx b/packages/app/src/app/zh/blog/page.tsx
new file mode 100644
index 00000000..ee61e664
--- /dev/null
+++ b/packages/app/src/app/zh/blog/page.tsx
@@ -0,0 +1,127 @@
+import type { Metadata } from 'next';
+import Link from 'next/link';
+
+import { BlogPostCard } from '@/components/blog/blog-post-card';
+import { BlogTagLink } from '@/components/blog/blog-tag-link';
+import { Card } from '@/components/ui/card';
+import { JsonLd } from '@/components/json-ld';
+import { getAllPosts } from '@/lib/blog';
+import { ZH_LANG_TAG, ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+import { SITE_URL, SITE_NAME, AUTHOR_NAME } from '@semianalysisai/inferencex-constants';
+
+export const metadata: Metadata = {
+  title: '文章',
+  description: `${SITE_NAME} by ${AUTHOR_NAME} 的技术文章——AI 推理基准测试、GPU 性能分析与 ML 基础设施洞见。`,
+  alternates: zhAlternates('/blog'),
+  openGraph: {
+    title: `文章 | ${SITE_NAME} by ${AUTHOR_NAME}`,
+    description: 'AI 推理基准测试洞见与 GPU 性能分析。',
+    url: `${SITE_URL}/zh/blog`,
+    locale: ZH_OG_LOCALE,
+  },
+};
+
+const jsonLd = {
+  '@context': 'https://schema.org',
+  '@type': 'Blog',
+  name: `${SITE_NAME} 文章`,
+  url: `${SITE_URL}/zh/blog`,
+  inLanguage: ZH_LANG_TAG,
+  publisher: {
+    '@type': 'Organization',
+    name: AUTHOR_NAME,
+  },
+};
+
+export default async function ZhBlogPage({
+  searchParams,
+}: {
+  searchParams: Promise<{ tag?: string }>;
+}) {
+  const { tag: activeTag } = await searchParams;
+  const posts = getAllPosts('zh');
+  const allTags = [...new Set(posts.flatMap((p) => p.tags ?? []))].toSorted();
+  const filtered = activeTag ? posts.filter((p) => p.tags?.includes(activeTag)) : posts;
+
+  return (
+    <main className="relative">
+      <JsonLd data={jsonLd} />
+      <div className="container mx-auto px-4 lg:px-8 flex flex-col gap-4">
+        <section className="flex flex-col gap-4">
+          <Card>
+            <h2 className="text-2xl lg:text-4xl font-bold tracking-tight">文章</h2>
+            <p className="mt-3 text-base lg:text-lg text-muted-foreground">
+              关于 AI 推理基准测试、GPU 性能与 ML 基础设施的深度洞见。
+            </p>
+            {allTags.length > 0 && (
+              <div className="flex flex-wrap gap-2 mt-4">
+                <Link
+                  href="/zh/blog"
+                  className={`rounded-full px-3 py-0.5 text-xs transition-colors ${
+                    activeTag
+                      ? 'bg-muted text-muted-foreground hover:bg-muted/80'
+                      : 'bg-primary/15 text-primary ring-1 ring-primary/30'
+                  }`}
+                >
+                  全部
+                </Link>
+                {allTags.map((tag) => (
+                  <BlogTagLink key={tag} tag={tag} active={activeTag === tag} basePath="/zh/blog" />
+                ))}
+              </div>
+            )}
+            <div className="mt-6 pt-6 border-t border-border/40">
+              {filtered.length === 0 ? (
+                <p className="text-muted-foreground">
+                  {activeTag ? `没有标签为“${activeTag}”的文章。` : '即将上线。'}
+                </p>
+              ) : (
+                <div className="flex flex-col gap-8">
+                  {filtered.map((post) => (
+                    <BlogPostCard
+                      key={post.slug}
+                      slug={post.slug}
+                      title={post.title}
+                      basePath="/zh/blog"
+                    >
+                      <article className="min-w-0">
+                        <div className="flex items-center gap-3 text-sm text-muted-foreground mb-2">
+                          <time dateTime={post.date}>
+                            {new Date(`${post.date}T00:00:00Z`).toLocaleDateString('zh-CN', {
+                              year: 'numeric',
+                              month: 'long',
+                              day: 'numeric',
+                              timeZone: 'UTC',
+                            })}
+                          </time>
+                          <span>&middot;</span>
+                          <span>{post.readingTime} 分钟阅读</span>
+                        </div>
+                        <h2 className="text-2xl font-semibold mb-2 group-hover:underline group-hover:text-brand">
+                          {post.title}
+                        </h2>
+                        <p className="text-muted-foreground mb-3">{post.subtitle}</p>
+                        {post.tags && post.tags.length > 0 && (
+                          <div className="flex flex-wrap gap-2">
+                            {post.tags.map((tag) => (
+                              <span
+                                key={tag}
+                                className="rounded-full bg-muted px-3 py-0.5 text-xs text-muted-foreground"
+                              >
+                                {tag}
+                              </span>
+                            ))}
+                          </div>
+                        )}
+                      </article>
+                    </BlogPostCard>
+                  ))}
+                </div>
+              )}
+            </div>
+          </Card>
+        </section>
+      </div>
+    </main>
+  );
+}
diff --git a/packages/app/src/app/zh/compare-per-dollar/layout.tsx b/packages/app/src/app/zh/compare-per-dollar/layout.tsx
new file mode 100644
index 00000000..0812858a
--- /dev/null
+++ b/packages/app/src/app/zh/compare-per-dollar/layout.tsx
@@ -0,0 +1,13 @@
+import { UnofficialRunProvider } from '@/components/unofficial-run-provider';
+
+export default function ComparePerDollarLayout({ children }: { children: React.ReactNode }) {
+  return (
+    <UnofficialRunProvider>
+      <main className="relative">
+        <div className="container mx-auto px-4 lg:px-8 flex flex-col gap-6 lg:gap-4 pb-8">
+          {children}
+        </div>
+      </main>
+    </UnofficialRunProvider>
+  );
+}
diff --git a/packages/app/src/app/zh/compare-per-dollar/page.tsx b/packages/app/src/app/zh/compare-per-dollar/page.tsx
new file mode 100644
index 00000000..32d5544e
--- /dev/null
+++ b/packages/app/src/app/zh/compare-per-dollar/page.tsx
@@ -0,0 +1,149 @@
+import type { Metadata } from 'next';
+
+import {
+  HW_REGISTRY,
+  SITE_NAME,
+  SITE_URL,
+  SUPPORTERS_LINE_ZH,
+} from '@semianalysisai/inferencex-constants';
+
+import { ComparePairCardLink } from '@/components/compare/compare-pair-card-link';
+import { JsonLd } from '@/components/json-ld';
+import { Card } from '@/components/ui/card';
+import { getComparablePairsByModelSlug } from '@/lib/compare-availability';
+import { type ComparePair, COMPARE_MODEL_SLUGS, type CompareModelSlug } from '@/lib/compare-slug';
+import { bucketComparePairsByVendor, formatModelList } from '@/lib/compare-ssr';
+import { ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+
+export const dynamic = 'force-dynamic';
+
+const DESCRIPTION = `哪款 GPU 每美元推理性能最高？InferenceX 是 SemiAnalysis 推出的独立开源基准测试平台，提供经过验证的、可复现的测试结果。${SUPPORTERS_LINE_ZH}横向对比 DeepSeek V4 Pro、DeepSeek R1、Kimi K2、MiniMax M3、GLM 5、Qwen 3.5 等模型基于云服务商 TCO 归一化的每百万 token 成本。`;
+
+export const metadata: Metadata = {
+  title: 'GPU 每美元性能',
+  description: DESCRIPTION,
+  alternates: zhAlternates('/compare-per-dollar'),
+  openGraph: {
+    title: `GPU 每美元性能 | ${SITE_NAME}`,
+    description: DESCRIPTION,
+    url: `${SITE_URL}/zh/compare-per-dollar`,
+    type: 'website',
+    locale: ZH_OG_LOCALE,
+  },
+  twitter: {
+    card: 'summary_large_image',
+    title: `GPU 每美元性能 | ${SITE_NAME}`,
+    description: DESCRIPTION,
+  },
+};
+
+interface VendorGroup {
+  heading: string;
+  description: string;
+  pairs: { a: string; b: string; slug: string; label: string }[];
+}
+
+function groupPairsByVendorForModel(
+  model: CompareModelSlug,
+  comparablePairs: ComparePair[],
+): VendorGroup[] {
+  const { cross, nvidia, amd } = bucketComparePairsByVendor(model.slug, comparablePairs);
+  const groups: VendorGroup[] = [];
+  if (cross.length > 0) {
+    groups.push({
+      heading: 'NVIDIA vs AMD',
+      description: '跨厂商的不同架构代际每 token 成本对比。',
+      pairs: cross,
+    });
+  }
+  if (nvidia.length > 0) {
+    groups.push({
+      heading: 'NVIDIA vs NVIDIA',
+      description: 'Hopper 与 Blackwell 代际每 token 成本对比。',
+      pairs: nvidia,
+    });
+  }
+  if (amd.length > 0) {
+    groups.push({
+      heading: 'AMD vs AMD',
+      description: 'CDNA 3 与 CDNA 4 代际每 token 成本对比。',
+      pairs: amd,
+    });
+  }
+  return groups;
+}
+
+const jsonLd = {
+  '@context': 'https://schema.org',
+  '@type': 'CollectionPage',
+  name: `GPU 每美元性能 | ${SITE_NAME}`,
+  description: DESCRIPTION,
+  url: `${SITE_URL}/zh/compare-per-dollar`,
+  inLanguage: 'zh-CN',
+};
+
+export default async function ComparePerDollarIndexPageZh() {
+  const comparablePairsByModel = await getComparablePairsByModelSlug();
+  const totalUrls = [...comparablePairsByModel.values()].reduce((s, p) => s + p.length, 0);
+  const modelsWithPairs = COMPARE_MODEL_SLUGS.filter(
+    (m) => (comparablePairsByModel.get(m.slug)?.length ?? 0) > 0,
+  );
+
+  return (
+    <>
+      <JsonLd data={jsonLd} />
+      <section>
+        <Card>
+          <h1 className="text-2xl lg:text-4xl font-bold tracking-tight">GPU 每美元性能</h1>
+          <p className="mt-3 text-base lg:text-lg text-muted-foreground max-w-3xl">
+            {totalUrls.toLocaleString('zh-CN')} 组每百万 token 成本的正面对比，涵盖{' '}
+            {formatModelList(modelsWithPairs)}
+            。性能按所属云服务商 TCO 归一化——每个页面展示每 token 成本图表及插值美元/百万 token
+            对比表格，帮助您在任意目标交互性水平下选出更经济的 GPU。
+          </p>
+        </Card>
+      </section>
+
+      {modelsWithPairs.map((model) => {
+        const pairs = comparablePairsByModel.get(model.slug) ?? [];
+        const groups = groupPairsByVendorForModel(model, pairs);
+        return (
+          <section key={model.slug} id={model.slug}>
+            <Card className="flex flex-col gap-4">
+              <div>
+                <h2 className="text-xl lg:text-2xl font-bold tracking-tight">{model.label}</h2>
+                <p className="text-sm text-muted-foreground mt-1">
+                  {pairs.length} 组 GPU 对比具有 {model.label} 的每 token 成本基准测试数据。
+                </p>
+              </div>
+              {groups.map((group) => (
+                <div key={`${model.slug}__${group.heading}`} className="flex flex-col gap-3">
+                  <div>
+                    <h3 className="text-base font-semibold">{group.heading}</h3>
+                    <p className="text-xs text-muted-foreground mt-1">{group.description}</p>
+                  </div>
+                  <div className="grid grid-cols-1 sm:grid-cols-2 lg:grid-cols-3 gap-3">
+                    {group.pairs.map(({ slug, label, a, b }) => {
+                      const aMeta = HW_REGISTRY[a];
+                      const bMeta = HW_REGISTRY[b];
+                      const archLine = `${aMeta?.arch ?? '—'} · ${bMeta?.arch ?? '—'}`;
+                      return (
+                        <ComparePairCardLink
+                          key={slug}
+                          href={`/compare-per-dollar/${slug}`}
+                          slug={slug}
+                          label={label}
+                          archLine={archLine}
+                        />
+                      );
+                    })}
+                  </div>
+                </div>
+              ))}
+            </Card>
+          </section>
+        );
+      })}
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/compare/layout.tsx b/packages/app/src/app/zh/compare/layout.tsx
new file mode 100644
index 00000000..9ccb8b2d
--- /dev/null
+++ b/packages/app/src/app/zh/compare/layout.tsx
@@ -0,0 +1,13 @@
+import { UnofficialRunProvider } from '@/components/unofficial-run-provider';
+
+export default function CompareLayout({ children }: { children: React.ReactNode }) {
+  return (
+    <UnofficialRunProvider>
+      <main className="relative">
+        <div className="container mx-auto px-4 lg:px-8 flex flex-col gap-6 lg:gap-4 pb-8">
+          {children}
+        </div>
+      </main>
+    </UnofficialRunProvider>
+  );
+}
diff --git a/packages/app/src/app/zh/compare/page.tsx b/packages/app/src/app/zh/compare/page.tsx
new file mode 100644
index 00000000..a3640a4c
--- /dev/null
+++ b/packages/app/src/app/zh/compare/page.tsx
@@ -0,0 +1,161 @@
+import type { Metadata } from 'next';
+import Link from 'next/link';
+
+import {
+  HW_REGISTRY,
+  SITE_NAME,
+  SITE_URL,
+  SUPPORTERS_LINE_ZH,
+} from '@semianalysisai/inferencex-constants';
+
+import { ComparePairCardLink } from '@/components/compare/compare-pair-card-link';
+import { JsonLd } from '@/components/json-ld';
+import { Card } from '@/components/ui/card';
+import { getComparablePairsByModelSlug } from '@/lib/compare-availability';
+import { type ComparePair, COMPARE_MODEL_SLUGS, type CompareModelSlug } from '@/lib/compare-slug';
+import { bucketComparePairsByVendor, formatModelList } from '@/lib/compare-ssr';
+import { ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+
+export const dynamic = 'force-dynamic';
+
+const DESCRIPTION = `InferenceX 是 SemiAnalysis 推出的独立开源 GPU 推理基准测试平台，提供经过验证的、可复现的每夜测试结果。${SUPPORTERS_LINE_ZH}横向对比 DeepSeek V4 Pro、DeepSeek R1、Kimi K2、MiniMax M3、GLM 5、Qwen 3.5 等模型的延迟、吞吐量与成本。`;
+
+export const metadata: Metadata = {
+  title: 'GPU 对比',
+  description: DESCRIPTION,
+  alternates: zhAlternates('/compare'),
+  openGraph: {
+    title: `GPU 对比 | ${SITE_NAME}`,
+    description: DESCRIPTION,
+    url: `${SITE_URL}/zh/compare`,
+    type: 'website',
+    locale: ZH_OG_LOCALE,
+  },
+  twitter: {
+    card: 'summary_large_image',
+    title: `GPU 对比 | ${SITE_NAME}`,
+    description: DESCRIPTION,
+  },
+};
+
+interface VendorGroup {
+  heading: string;
+  description: string;
+  pairs: { a: string; b: string; slug: string; label: string }[];
+}
+
+function groupPairsByVendorForModel(
+  model: CompareModelSlug,
+  comparablePairs: ComparePair[],
+): VendorGroup[] {
+  const { cross, nvidia, amd } = bucketComparePairsByVendor(model.slug, comparablePairs);
+  const groups: VendorGroup[] = [];
+  if (cross.length > 0) {
+    groups.push({
+      heading: 'NVIDIA vs AMD',
+      description: '跨厂商的不同架构代际对比。',
+      pairs: cross,
+    });
+  }
+  if (nvidia.length > 0) {
+    groups.push({
+      heading: 'NVIDIA vs NVIDIA',
+      description: 'Hopper 与 Blackwell 代际对比。',
+      pairs: nvidia,
+    });
+  }
+  if (amd.length > 0) {
+    groups.push({
+      heading: 'AMD vs AMD',
+      description: 'CDNA 3 与 CDNA 4 代际对比。',
+      pairs: amd,
+    });
+  }
+  return groups;
+}
+
+const jsonLd = {
+  '@context': 'https://schema.org',
+  '@type': 'CollectionPage',
+  name: `GPU 对比 | ${SITE_NAME}`,
+  description: DESCRIPTION,
+  url: `${SITE_URL}/zh/compare`,
+  inLanguage: 'zh-CN',
+};
+
+export default async function CompareIndexPageZh() {
+  const comparablePairsByModel = await getComparablePairsByModelSlug();
+  const totalUrls = [...comparablePairsByModel.values()].reduce((s, p) => s + p.length, 0);
+  const modelsWithPairs = COMPARE_MODEL_SLUGS.filter(
+    (m) => (comparablePairsByModel.get(m.slug)?.length ?? 0) > 0,
+  );
+
+  return (
+    <>
+      <JsonLd data={jsonLd} />
+      <section>
+        <Card>
+          <h1 className="text-2xl lg:text-4xl font-bold tracking-tight">GPU 对比</h1>
+          <p className="mt-3 text-base lg:text-lg text-muted-foreground max-w-3xl">
+            {totalUrls.toLocaleString()} 组推理基准测试的正面对比，涵盖{' '}
+            {formatModelList(modelsWithPairs)}
+            。每个页面均包含延迟、吞吐量和成本指标的交互式图表，以及插值对比表格。
+          </p>
+          <div className="mt-6">
+            <Link
+              data-testid="compare-index-per-dollar-link-zh"
+              href="/zh/compare-per-dollar"
+              className="inline-flex items-center gap-2 rounded-md bg-brand px-5 py-3 text-base lg:text-lg font-semibold text-primary-foreground shadow-sm transition-colors hover:bg-brand/90"
+            >
+              GPU 每美元性能对比
+              <span aria-hidden="true" className="text-lg lg:text-xl">
+                →
+              </span>
+            </Link>
+          </div>
+        </Card>
+      </section>
+
+      {modelsWithPairs.map((model) => {
+        const pairs = comparablePairsByModel.get(model.slug) ?? [];
+        const groups = groupPairsByVendorForModel(model, pairs);
+        return (
+          <section key={model.slug} id={model.slug}>
+            <Card className="flex flex-col gap-4">
+              <div>
+                <h2 className="text-xl lg:text-2xl font-bold tracking-tight">{model.label}</h2>
+                <p className="text-sm text-muted-foreground mt-1">
+                  {pairs.length} 组 GPU 对比具有 {model.label} 的基准测试数据。
+                </p>
+              </div>
+              {groups.map((group) => (
+                <div key={`${model.slug}__${group.heading}`} className="flex flex-col gap-3">
+                  <div>
+                    <h3 className="text-base font-semibold">{group.heading}</h3>
+                    <p className="text-xs text-muted-foreground mt-1">{group.description}</p>
+                  </div>
+                  <div className="grid grid-cols-1 sm:grid-cols-2 lg:grid-cols-3 gap-3">
+                    {group.pairs.map(({ slug, label, a, b }) => {
+                      const aMeta = HW_REGISTRY[a];
+                      const bMeta = HW_REGISTRY[b];
+                      const archLine = `${aMeta?.arch ?? '—'} · ${bMeta?.arch ?? '—'}`;
+                      return (
+                        <ComparePairCardLink
+                          key={slug}
+                          href={`/compare/${slug}`}
+                          slug={slug}
+                          label={label}
+                          archLine={archLine}
+                        />
+                      );
+                    })}
+                  </div>
+                </div>
+              ))}
+            </Card>
+          </section>
+        );
+      })}
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/land-acknowledgement/page.tsx b/packages/app/src/app/zh/land-acknowledgement/page.tsx
new file mode 100644
index 00000000..f9211700
--- /dev/null
+++ b/packages/app/src/app/zh/land-acknowledgement/page.tsx
@@ -0,0 +1,99 @@
+import type { Metadata } from 'next';
+
+import { Card } from '@/components/ui/card';
+import { zhAlternates, ZH_OG_LOCALE } from '@/lib/i18n';
+import { SITE_URL } from '@semianalysisai/inferencex-constants';
+
+const REGIONAL_ACKNOWLEDGEMENTS_ZH = [
+  {
+    region: 'San Jose',
+    peoples: 'Muwekma Ohlone 部落',
+    acknowledgement:
+      '我们位于 San Jose 地区的基准测试基础设施运行在旧金山湾区 Muwekma Ohlone 部落未被让渡的祖传家园之上。',
+  },
+  {
+    region: 'Los Angeles',
+    peoples: 'Tongva、Tataviam、Serrano、Kizh 和 Chumash 族群',
+    acknowledgement:
+      '我们位于 Los Angeles 地区的基准测试基础设施运行在 Tongva、Tataviam、Serrano、Kizh 和 Chumash 族群最初居住并至今仍在守护的土地之上。',
+  },
+  {
+    region: 'Chicago',
+    peoples: '三火议会、Illinois 联盟、Miami、Ho-Chunk、Menominee、Fox 和 Sac 族群',
+    acknowledgement:
+      '我们位于 Chicago 地区的基准测试基础设施运行在由三火议会（Ojibwe、Odawa 和 Potawatomi 部落）、Illinois 联盟以及包括 Miami、Ho-Chunk、Menominee、Fox 和 Sac 在内的众多原住民族群守护的土地之上。',
+  },
+];
+
+export const metadata: Metadata = {
+  title: '土地致谢',
+  description:
+    '对与 InferenceX 美国基准测试集群（San Jose、Los Angeles 和 Chicago）所在土地相关的原住民族群和家园的致谢。',
+  alternates: zhAlternates('/land-acknowledgement'),
+  openGraph: {
+    title: '土地致谢 | InferenceX',
+    description:
+      '对与 InferenceX 美国基准测试集群（San Jose、Los Angeles 和 Chicago）所在土地相关的原住民族群和家园的致谢。',
+    url: `${SITE_URL}/zh/land-acknowledgement`,
+    locale: ZH_OG_LOCALE,
+  },
+  twitter: {
+    title: '土地致谢 | InferenceX',
+    description:
+      '对与 InferenceX 美国基准测试集群（San Jose、Los Angeles 和 Chicago）所在土地相关的原住民族群和家园的致谢。',
+  },
+};
+
+export default function LandAcknowledgementPageZh() {
+  return (
+    <main data-testid="land-acknowledgement-page" className="relative">
+      <div className="container mx-auto px-4 lg:px-8 pb-8">
+        <Card className="gap-10">
+          <header className="max-w-3xl">
+            <p className="mb-3 text-xs font-semibold uppercase tracking-[0.32em] text-brand">
+              土地致谢
+            </p>
+            <h1 className="text-4xl font-semibold tracking-[-0.04em] text-foreground md:text-5xl">
+              我们致敬与我们美国基础设施所在土地相关的原住民家园。
+            </h1>
+            <p className="mt-4 text-sm leading-6 text-muted-foreground md:text-base">
+              InferenceX 基准测试集群为多个地区提供服务。本页聚焦于我们在美国的 San Jose、Los
+              Angeles 和 Chicago
+              站点，并向世代守护这些土地、至今仍在延续这一使命的原住民族群致以敬意。
+            </p>
+          </header>
+
+          <section
+            data-testid="land-acknowledgement-regions"
+            className="grid gap-4 lg:grid-cols-3"
+            aria-label="各地区土地致谢"
+          >
+            {REGIONAL_ACKNOWLEDGEMENTS_ZH.map((entry) => (
+              <article
+                key={entry.region}
+                data-testid={`land-acknowledgement-${entry.region
+                  .toLowerCase()
+                  .replaceAll(' ', '-')}`}
+                className="rounded-2xl border border-border/40 bg-background/20 p-5"
+              >
+                <p className="text-xs font-semibold uppercase tracking-[0.28em] text-muted-foreground">
+                  {entry.region}
+                </p>
+                <h2 className="mt-3 text-xl font-semibold tracking-[-0.04em] text-foreground">
+                  {entry.peoples}
+                </h2>
+                <p className="mt-4 text-sm leading-6 text-muted-foreground">
+                  {entry.acknowledgement}
+                </p>
+              </article>
+            ))}
+          </section>
+
+          <p className="max-w-3xl text-sm leading-6 text-muted-foreground">
+            致谢只是一个起点。我们怀着对原住民主权、历史和持续存在的社区的尊重分享这份声明，如果措辞需要改进，欢迎指正。
+          </p>
+        </Card>
+      </div>
+    </main>
+  );
+}
diff --git a/packages/app/src/app/zh/layout.tsx b/packages/app/src/app/zh/layout.tsx
new file mode 100644
index 00000000..057d3555
--- /dev/null
+++ b/packages/app/src/app/zh/layout.tsx
@@ -0,0 +1,18 @@
+import { SetDocumentLang } from '@/components/set-document-lang';
+import { ZH_LANG_TAG } from '@/lib/i18n';
+
+/**
+ * Simplified Chinese page tree. Every page under /zh is a hand-authored
+ * Chinese sibling of an English page (see AGENTS.md "Chinese Website Pages").
+ * The lang attribute on the wrapper scopes the content language for crawlers
+ * and assistive tech before hydration; SetDocumentLang fixes up <html lang>
+ * after hydration.
+ */
+export default function ZhLayout({ children }: { children: React.ReactNode }) {
+  return (
+    <div lang={ZH_LANG_TAG} className="contents">
+      <SetDocumentLang lang={ZH_LANG_TAG} />
+      {children}
+    </div>
+  );
+}
diff --git a/packages/app/src/app/zh/not-found.tsx b/packages/app/src/app/zh/not-found.tsx
new file mode 100644
index 00000000..209a74cb
--- /dev/null
+++ b/packages/app/src/app/zh/not-found.tsx
@@ -0,0 +1,16 @@
+import Link from 'next/link';
+
+export default function ZhNotFound() {
+  return (
+    <div className="flex flex-col items-center justify-center grow text-foreground">
+      <h1 className="text-4xl font-bold mb-4">404 - 页面不存在</h1>
+      <p className="text-lg mb-8">您访问的页面不存在。</p>
+      <Link
+        href="/zh"
+        className="px-4 py-2 bg-primary text-primary-foreground rounded-md hover:bg-primary/90"
+      >
+        返回首页
+      </Link>
+    </div>
+  );
+}
diff --git a/packages/app/src/app/zh/page.tsx b/packages/app/src/app/zh/page.tsx
new file mode 100644
index 00000000..3bb97746
--- /dev/null
+++ b/packages/app/src/app/zh/page.tsx
@@ -0,0 +1,26 @@
+import type { Metadata } from 'next';
+
+import { LandingPage } from '@/components/landing/landing-page';
+import { ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+import { LANDING_META_ZH } from '@/lib/tab-meta-zh';
+import { SITE_URL } from '@semianalysisai/inferencex-constants';
+
+export const metadata: Metadata = {
+  title: LANDING_META_ZH.title,
+  description: LANDING_META_ZH.description,
+  alternates: zhAlternates('/'),
+  openGraph: {
+    title: `${LANDING_META_ZH.title} | InferenceX`,
+    description: LANDING_META_ZH.description,
+    url: `${SITE_URL}/zh`,
+    locale: ZH_OG_LOCALE,
+  },
+  twitter: {
+    title: `${LANDING_META_ZH.title} | InferenceX`,
+    description: LANDING_META_ZH.description,
+  },
+};
+
+export default function ZhHomePage() {
+  return <LandingPage locale="zh" />;
+}
diff --git a/packages/app/src/app/zh/quotes/page.tsx b/packages/app/src/app/zh/quotes/page.tsx
new file mode 100644
index 00000000..b13e3dab
--- /dev/null
+++ b/packages/app/src/app/zh/quotes/page.tsx
@@ -0,0 +1,23 @@
+import type { Metadata } from 'next';
+
+import { QuotesContent } from '@/components/quotes/quotes-content';
+import { ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+import { SITE_URL } from '@semianalysisai/inferencex-constants';
+
+export const metadata: Metadata = {
+  title: '支持者',
+  description:
+    'InferenceX 计划获得众多主要算力买家与 ML 社区知名成员的支持，包括来自 MiniMax、Moonshot Kimi、阿里巴巴 Qwen、OpenAI、Microsoft、vLLM、PyTorch 基金会、Oracle 等机构的支持者。',
+  alternates: zhAlternates('/quotes'),
+  openGraph: {
+    title: '支持者 | InferenceX by SemiAnalysis',
+    description:
+      '获得 MiniMax、Moonshot Kimi、阿里巴巴 Qwen、OpenAI、Microsoft、vLLM、PyTorch 基金会、Oracle 及 ML 社区知名成员的支持。',
+    url: `${SITE_URL}/zh/quotes`,
+    locale: ZH_OG_LOCALE,
+  },
+};
+
+export default function ZhQuotesPage() {
+  return <QuotesContent locale="zh" />;
+}
diff --git a/packages/app/src/components/about/faq-data-zh.ts b/packages/app/src/components/about/faq-data-zh.ts
new file mode 100644
index 00000000..279cba25
--- /dev/null
+++ b/packages/app/src/components/about/faq-data-zh.ts
@@ -0,0 +1,127 @@
+import {
+  GPU_KEYS,
+  GPU_VENDORS,
+  DB_MODEL_TO_DISPLAY,
+  PRECISION_KEYS,
+  GITHUB_OWNER,
+  GITHUB_REPO,
+  FRAMEWORK_LABELS,
+} from '@semianalysisai/inferencex-constants';
+import { CAROUSEL_ORGS, CAROUSEL_LABELS } from '@/components/quotes/quotes-data';
+
+import type { FaqItem } from './faq-data';
+
+/* ---------- Dynamic lists from constants ---------- */
+
+const gpusByVendor = [...GPU_KEYS].reduce<Record<string, string[]>>((acc, key) => {
+  const vendor = GPU_VENDORS[key] ?? 'Other';
+  (acc[vendor] ??= []).push(key.toUpperCase());
+  return acc;
+}, {});
+
+const modelNames = Object.values({
+  ...DB_MODEL_TO_DISPLAY,
+  'kimik2.6': 'Kimi-K2.6',
+  'kimik2.7-code': 'Kimi-K2.7-Code',
+  'minimaxm2.7': 'MiniMax-M2.7',
+  'glm5.1': 'GLM-5.1',
+});
+
+const frameworkNames = [...new Set(Object.values(FRAMEWORK_LABELS))].map((n) =>
+  n.replace(/[¹²³⁴⁵⁶⁷⁸⁹⁰]+$/u, ''),
+);
+
+const supporterOrgs = CAROUSEL_ORGS.map((org) => CAROUSEL_LABELS[org] ?? org);
+
+/* ---------- FAQ content (Simplified Chinese) ---------- */
+
+export const FAQ_ITEMS_ZH: FaqItem[] = [
+  {
+    question: '什么是 InferenceX？',
+    answer:
+      'InferenceX（原名 InferenceMAX）是一个开源、厂商中立的基准测试（benchmark）平台，持续衡量各类 GPU 和软件栈的 AI 推理性能。每当配置发生变化时，基准测试会重新运行，确保结果始终跟随模型和框架的演进保持最新。',
+  },
+  {
+    question: 'InferenceX 由谁开发？',
+    answer: `InferenceX 由独立半导体与 AI 研究机构 SemiAnalysis 构建，受到 ${supporterOrgs.join('、')} 的支持与信赖。基准测试代码、数据和仪表板均在 GitHub 上开源。`,
+  },
+  {
+    question: 'InferenceX 测试了哪些 GPU？',
+    answer: '我们会在新加速器可用时持续添加。',
+    list: Object.entries(gpusByVendor).map(([vendor, gpus]) => `${vendor}: ${gpus.join(', ')}`),
+  },
+  {
+    question: '测试了哪些 AI 模型？',
+    answer: '每个模型均在多种序列长度配置（1k/1k、1k/8k、8k/1k tokens）和并发级别下进行测试。',
+    list: modelNames,
+  },
+  {
+    question: '测试了哪些推理框架和配置？',
+    answer: '',
+    list: [
+      `框架：${frameworkNames.join(', ')}`,
+      `精度：${[...PRECISION_KEYS].map((p) => p.toUpperCase()).join(', ')}`,
+      '运行时：CUDA、ROCm',
+      '分离式推理（Disaggregated serving，独立的 prefill/decode GPU 池）',
+      '多 token 预测（MTP）',
+      '面向 MoE 模型的宽专家并行（Wide Expert Parallelism）',
+    ],
+  },
+  {
+    question: 'InferenceX 测量哪些指标？',
+    answer: '',
+    list: [
+      '交互性（tok/s/user）',
+      '每 GPU token 吞吐量（tok/s/gpu）',
+      '每 GPU 输入和输出吞吐量',
+      '每兆瓦 token 吞吐量（tok/s/MW）',
+      'P99 首 token 延迟（TTFT）',
+      '每百万 token 成本（总计、输入、输出）——涵盖超大规模云、NeoCloud 和裸机租赁定价',
+      '每 token 能耗（焦耳，总计、输入、输出）',
+      '用户自定义成本和功耗计算',
+    ],
+  },
+  {
+    question: '基准测试多久运行一次？',
+    answer:
+      '基准测试最初按每日计划运行，但随着硬件/框架/模型组合数量的增长，这种方式已不再可行。现在，当配置发生变化（例如新软件发布、驱动更新或模型添加）时重新运行。仪表板中保留了历史数据。',
+  },
+  {
+    question: 'InferenceX 是开源的吗？',
+    answer: '是的。代码、数据和仪表板均为开源。',
+    link: {
+      text: `${GITHUB_OWNER}/${GITHUB_REPO}`,
+      href: `https://github.com/${GITHUB_OWNER}/${GITHUB_REPO}`,
+    },
+  },
+  {
+    question: 'InferenceX 与其他 AI 基准测试有何不同？',
+    answer:
+      '大多数 AI 基准测试是静态的、单时间点测量，参与者提交的是专为基准测试定制的镜像，无法反映真实的线上推理性能。InferenceX 在真实硬件上持续运行，采用完全可复现的配置。所有测试脚本均提交至代码仓库，基准测试日志在 GitHub Actions 上公开可见，结果端到端可审计。',
+  },
+  {
+    question: '结果如何实现可复现？',
+    answer:
+      '仪表板上的每一个数据点均由公开的 GitHub Actions 工作流运行产生。测试配方（模型、框架、精度、并行度、序列长度、并发数）已提交至仓库，在目标硬件上实际执行，产物（日志、指标、GPU 追踪数据）上传至运行页面。用户可从任何图表的 tooltip 直接点击链接，跳转到生成该数据点的 GitHub Actions 运行。',
+  },
+  {
+    question: '在哪里可以查看原始基准测试日志？',
+    answer:
+      '在图表上点击任意数据点即可打开 tooltip。其中的"GitHub Actions Run"链接将直接跳转到生成该数据点的工作流运行。在那里您可以查看完整的任务日志、框架和驱动版本、命令行参数，以及下载原始产物（包括请求延迟、token 计数和 GPU 功耗遥测数据）。',
+  },
+  {
+    question: '我可以自己重新运行基准测试吗？',
+    answer:
+      '可以。基准测试配方位于代码仓库的 /benchmarks 目录中，以独立的 shell 脚本形式存在。如果您拥有相同的硬件，可以 fork 仓库并直接运行脚本，或触发相同的 GitHub Actions 工作流来复现结果。',
+  },
+  {
+    question: '历史运行记录是否保留？',
+    answer:
+      '是的。GitHub Actions 保留工作流运行日志和产物 90 天。为了更长期的可审计性，我们还会每周发布完整基准测试数据库的快照作为公开的 GitHub Release，任何人都可以下载历史数据集并复现或重新分析仪表板中的任何图表。',
+  },
+  {
+    question: '我可以使用 InferenceX 的数据进行自己的分析吗？',
+    answer:
+      '可以。所有数据均可自由获取。仪表板支持按 GPU、模型、框架和日期范围筛选，您也可以直接从任何图表导出原始 CSV 数据。',
+  },
+];
diff --git a/packages/app/src/components/blog/blog-back-link.tsx b/packages/app/src/components/blog/blog-back-link.tsx
index e895da59..978ad787 100644
--- a/packages/app/src/components/blog/blog-back-link.tsx
+++ b/packages/app/src/components/blog/blog-back-link.tsx
@@ -3,15 +3,21 @@
 import Link from 'next/link';
 import { track } from '@/lib/analytics';
 
-export function BlogBackLink() {
+export function BlogBackLink({
+  href = '/blog',
+  label = 'Back to articles',
+}: {
+  href?: string;
+  label?: string;
+} = {}) {
   return (
     <nav>
       <Link
-        href="/blog"
+        href={href}
         className="text-sm text-muted-foreground hover:underline mb-4 inline-block"
         onClick={() => track('blog_back_clicked')}
       >
-        &larr;&nbsp;&nbsp;Back to articles
+        &larr;&nbsp;&nbsp;{label}
       </Link>
     </nav>
   );
diff --git a/packages/app/src/components/blog/blog-post-card.tsx b/packages/app/src/components/blog/blog-post-card.tsx
index ffa6b032..4e336358 100644
--- a/packages/app/src/components/blog/blog-post-card.tsx
+++ b/packages/app/src/components/blog/blog-post-card.tsx
@@ -8,13 +8,15 @@ import { track } from '@/lib/analytics';
 interface BlogPostCardProps {
   slug: string;
   title: string;
+  /** Blog list base path, e.g. '/zh/blog' on Chinese pages. */
+  basePath?: string;
   children: ReactNode;
 }
 
-export function BlogPostCard({ slug, title, children }: BlogPostCardProps) {
+export function BlogPostCard({ slug, title, basePath = '/blog', children }: BlogPostCardProps) {
   return (
     <Link
-      href={`/blog/${slug}`}
+      href={`${basePath}/${slug}`}
       className="group relative block rounded-xl border border-border bg-background/20 backdrop-blur-[2px] p-4 md:p-8 transition-all duration-200 hover:border-brand/50 hover:shadow-lg hover:shadow-brand/5 hover:scale-[1.01]"
       onClick={() => track('blog_post_clicked', { slug, title })}
     >
diff --git a/packages/app/src/components/blog/blog-post-nav.tsx b/packages/app/src/components/blog/blog-post-nav.tsx
index 01dde6b7..12b65a2d 100644
--- a/packages/app/src/components/blog/blog-post-nav.tsx
+++ b/packages/app/src/components/blog/blog-post-nav.tsx
@@ -12,22 +12,30 @@ interface PostLink {
 interface BlogPostNavProps {
   prev: PostLink | null;
   next: PostLink | null;
+  /** Blog list base path, e.g. '/zh/blog' on Chinese pages. */
+  basePath?: string;
+  labels?: { prev: string; next: string };
 }
 
-export function BlogPostNav({ prev, next }: BlogPostNavProps) {
+export function BlogPostNav({
+  prev,
+  next,
+  basePath = '/blog',
+  labels = { prev: 'Previous', next: 'Next' },
+}: BlogPostNavProps) {
   if (!prev && !next) return null;
 
   return (
     <nav className="flex flex-col sm:flex-row justify-between gap-4 mt-2">
       {prev ? (
         <Link
-          href={`/blog/${prev.slug}`}
+          href={`${basePath}/${prev.slug}`}
           className="group relative flex items-center gap-3 rounded-xl border border-border bg-background/20 backdrop-blur-[2px] p-4 transition-all duration-200 hover:border-brand/50 hover:shadow-lg hover:shadow-brand/5 hover:scale-[1.01] flex-1"
           onClick={() => track('blog_nav_prev', { slug: prev.slug, title: prev.title })}
         >
           <ChevronLeft className="size-5 text-muted-foreground shrink-0" />
           <div className="min-w-0">
-            <p className="text-xs text-muted-foreground">Previous</p>
+            <p className="text-xs text-muted-foreground">{labels.prev}</p>
             <p className="text-sm font-medium truncate group-hover:underline">{prev.title}</p>
           </div>
         </Link>
@@ -36,12 +44,12 @@ export function BlogPostNav({ prev, next }: BlogPostNavProps) {
       )}
       {next ? (
         <Link
-          href={`/blog/${next.slug}`}
+          href={`${basePath}/${next.slug}`}
           className="group relative flex items-center justify-end gap-3 rounded-xl border border-border bg-background/20 backdrop-blur-[2px] p-4 transition-all duration-200 hover:border-brand/50 hover:shadow-lg hover:shadow-brand/5 hover:scale-[1.01] flex-1 text-right"
           onClick={() => track('blog_nav_next', { slug: next.slug, title: next.title })}
         >
           <div className="min-w-0">
-            <p className="text-xs text-muted-foreground">Next</p>
+            <p className="text-xs text-muted-foreground">{labels.next}</p>
             <p className="text-sm font-medium truncate group-hover:underline">{next.title}</p>
           </div>
           <ChevronRight className="size-5 text-muted-foreground shrink-0" />
diff --git a/packages/app/src/components/blog/blog-tag-link.tsx b/packages/app/src/components/blog/blog-tag-link.tsx
index c7ae87c7..e3bb37b9 100644
--- a/packages/app/src/components/blog/blog-tag-link.tsx
+++ b/packages/app/src/components/blog/blog-tag-link.tsx
@@ -6,12 +6,14 @@ import { track } from '@/lib/analytics';
 interface BlogTagLinkProps {
   tag: string;
   active?: boolean;
+  /** Blog list base path, e.g. '/zh/blog' on Chinese pages. */
+  basePath?: string;
 }
 
-export function BlogTagLink({ tag, active }: BlogTagLinkProps) {
+export function BlogTagLink({ tag, active, basePath = '/blog' }: BlogTagLinkProps) {
   return (
     <Link
-      href={`/blog?tag=${encodeURIComponent(tag)}`}
+      href={`${basePath}?tag=${encodeURIComponent(tag)}`}
       className={`rounded-full px-3 py-0.5 text-xs transition-colors ${
         active
           ? 'bg-primary/15 text-primary ring-1 ring-primary/30'
diff --git a/packages/app/src/components/blog/blog-toc.tsx b/packages/app/src/components/blog/blog-toc.tsx
index 775fe498..591b9da3 100644
--- a/packages/app/src/components/blog/blog-toc.tsx
+++ b/packages/app/src/components/blog/blog-toc.tsx
@@ -7,6 +7,8 @@ import type { TocHeading } from '@/lib/blog';
 
 interface BlogTocProps {
   headings: TocHeading[];
+  /** Heading label, e.g. '本页目录' on Chinese pages. */
+  label?: string;
 }
 
 function handleClick(heading: TocHeading) {
@@ -17,7 +19,7 @@ function handleClick(heading: TocHeading) {
   window.scrollTo({ top, behavior: 'smooth' });
 }
 
-export function BlogToc({ headings }: BlogTocProps) {
+export function BlogToc({ headings, label = 'On this page' }: BlogTocProps) {
   const [activeId, setActiveId] = useState('');
   const [showSidebar, setShowSidebar] = useState(false);
   const observerRef = useRef<IntersectionObserver | null>(null);
@@ -141,8 +143,7 @@ export function BlogToc({ headings }: BlogTocProps) {
       {!showSidebar && (
         <details aria-label="Table of contents">
           <summary className="text-sm font-medium cursor-pointer">
-            On this page{' '}
-            <span className="text-muted-foreground font-normal">(click to expand)</span>
+            {label} <span className="text-muted-foreground font-normal">(click to expand)</span>
           </summary>
           <div className="mt-2">{list}</div>
         </details>
@@ -161,7 +162,7 @@ export function BlogToc({ headings }: BlogTocProps) {
             }}
             aria-label="Table of contents"
           >
-            <p className="text-sm font-medium mb-2">On this page</p>
+            <p className="text-sm font-medium mb-2">{label}</p>
             {list}
           </nav>,
           document.body,
diff --git a/packages/app/src/components/footer/footer.tsx b/packages/app/src/components/footer/footer.tsx
index 7410d362..182a1c2f 100644
--- a/packages/app/src/components/footer/footer.tsx
+++ b/packages/app/src/components/footer/footer.tsx
@@ -133,6 +133,14 @@ export const Footer = ({ starCount }: { starCount?: number | null }) => (
             >
               Performance per Dollar
             </Link>
+            <Link
+              data-testid="footer-link-zh"
+              href="/zh"
+              hrefLang="zh-CN"
+              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+            >
+              中文版
+            </Link>
           </div>
         </div>
 
diff --git a/packages/app/src/components/header/header.tsx b/packages/app/src/components/header/header.tsx
index eebef27d..576a3bdb 100644
--- a/packages/app/src/components/header/header.tsx
+++ b/packages/app/src/components/header/header.tsx
@@ -9,6 +9,8 @@ import { track } from '@/lib/analytics';
 import { ModeToggle } from '@/components/ui/mode-toggle';
 import { MinecraftToggles } from '@/components/minecraft/minecraft-toggles';
 import { navigateInApp } from '@/lib/client-navigation';
+import { hasZhSibling, isZhPathname, switchLocalePath, ZH_PREFIX, zhPath } from '@/lib/i18n';
+import { NAV_LABELS_ZH } from '@/lib/tab-meta-zh';
 import { cn } from '@/lib/utils';
 
 import { GitHubStars } from './GithubStars';
@@ -57,12 +59,36 @@ const NAV_LINKS = [
 ] as const;
 
 function isActive(pathname: string, href: string): boolean {
-  if (href === '/') return pathname === '/';
-  if (href === '/inference') return DASHBOARD_TABS.some((tab) => pathname.startsWith(tab));
+  // Chinese pages mirror the English tree under /zh; active state is computed
+  // against the English path so both trees highlight the same nav entry.
+  const enPathname = isZhPathname(pathname)
+    ? pathname === ZH_PREFIX
+      ? '/'
+      : pathname.slice(ZH_PREFIX.length)
+    : pathname;
+  if (href === '/') return enPathname === '/';
+  if (href === '/inference') return DASHBOARD_TABS.some((tab) => enPathname.startsWith(tab));
   // Exact match or a child path under `<href>/...`. The bare `startsWith` would
   // light up `/compare` when the user is on `/compare-per-dollar/...` since the
   // latter starts with the literal string `/compare`.
-  return pathname === href || pathname.startsWith(`${href}/`);
+  return enPathname === href || enPathname.startsWith(`${href}/`);
+}
+
+/** EN ↔ 中文 switcher; maps the current page to its sibling in the other language. */
+function LanguageToggle({ pathname }: { pathname: string }) {
+  const isZh = isZhPathname(pathname);
+  const target = switchLocalePath(pathname);
+  return (
+    <Link
+      href={target}
+      data-testid="language-toggle"
+      hrefLang={isZh ? 'en' : 'zh-CN'}
+      className="px-2 py-1.5 rounded-md text-sm font-medium text-muted-foreground hover:text-foreground hover:bg-muted transition-colors whitespace-nowrap"
+      onClick={() => track('header_language_toggled', { to: isZh ? 'en' : 'zh' })}
+    >
+      {isZh ? 'EN' : '中文'}
+    </Link>
+  );
 }
 
 export const Header = ({ starCount }: { starCount?: number | null }) => {
@@ -71,7 +97,16 @@ export const Header = ({ starCount }: { starCount?: number | null }) => {
   const [mobileMenuOpen, setMobileMenuOpen] = useState(false);
   const menuRef = useRef<HTMLDivElement>(null);
 
-  const navLinks = NAV_LINKS;
+  const isZh = isZhPathname(pathname);
+  // On /zh pages, nav entries with a Chinese sibling navigate within the
+  // Chinese tree and show Chinese labels; the rest keep their English target.
+  const navLinks = isZh
+    ? NAV_LINKS.map((link) => ({
+        ...link,
+        label: NAV_LABELS_ZH[link.href] ?? link.label,
+        displayHref: hasZhSibling(link.href) ? zhPath(link.href) : link.href,
+      }))
+    : NAV_LINKS.map((link) => ({ ...link, displayHref: link.href }));
 
   // Close menu on route change
   useEffect(() => {
@@ -126,11 +161,11 @@ export const Header = ({ starCount }: { starCount?: number | null }) => {
 
           {/* Desktop nav */}
           <nav className="hidden lg:flex items-center gap-1">
-            {navLinks.map(({ href, label, testId, event }) => (
+            {navLinks.map(({ href, displayHref, label, testId, event }) => (
               <Link
                 key={href}
                 data-testid={testId}
-                href={href}
+                href={displayHref}
                 className={cn(
                   'px-3 py-1.5 rounded-md text-sm font-medium transition-colors',
                   isActive(pathname, href)
@@ -139,7 +174,7 @@ export const Header = ({ starCount }: { starCount?: number | null }) => {
                 )}
                 onClick={(e) => {
                   track(event);
-                  if (href === '/inference') navigateInApp(e, router, href);
+                  if (href === '/inference') navigateInApp(e, router, displayHref);
                 }}
               >
                 {label}
@@ -150,6 +185,7 @@ export const Header = ({ starCount }: { starCount?: number | null }) => {
           {/* Right side */}
           <div className="ml-auto flex items-center gap-2">
             <GitHubStars owner="SemiAnalysisAI" repo="InferenceX" starCount={starCount} />
+            <LanguageToggle pathname={pathname} />
             <MinecraftToggles />
             <ModeToggle />
 
@@ -180,10 +216,10 @@ export const Header = ({ starCount }: { starCount?: number | null }) => {
               </button>
               {mobileMenuOpen && (
                 <div className="absolute right-0 top-full mt-2 z-50 flex flex-col rounded-lg border border-border bg-background p-1.5 shadow-lg min-w-40">
-                  {navLinks.map(({ href, label, event }) => (
+                  {navLinks.map(({ href, displayHref, label, event }) => (
                     <Link
                       key={href}
-                      href={href}
+                      href={displayHref}
                       className={cn(
                         'px-3 py-2 rounded-md text-sm font-medium transition-colors',
                         isActive(pathname, href)
@@ -192,7 +228,7 @@ export const Header = ({ starCount }: { starCount?: number | null }) => {
                       )}
                       onClick={(e) => {
                         track(event);
-                        if (href === '/inference') navigateInApp(e, router, href);
+                        if (href === '/inference') navigateInApp(e, router, displayHref);
                       }}
                     >
                       {label}
diff --git a/packages/app/src/components/intro-section.tsx b/packages/app/src/components/intro-section.tsx
index ec5a324a..e148f338 100644
--- a/packages/app/src/components/intro-section.tsx
+++ b/packages/app/src/components/intro-section.tsx
@@ -4,6 +4,7 @@ import { Card } from '@/components/ui/card';
 import { MinecraftSplash } from '@/components/minecraft/minecraft-splash';
 import { QuoteCarousel } from '@/components/quote-carousel';
 import { QUOTES, CAROUSEL_ORGS, CAROUSEL_LABELS } from '@/components/quotes/quotes-data';
+import type { Locale } from '@/lib/i18n';
 
 // Carousel order follows QUOTES order — carousel orgs are listed first there.
 const carouselQuotes = QUOTES.filter((q) => (CAROUSEL_ORGS as readonly string[]).includes(q.org));
@@ -12,22 +13,31 @@ const CAROUSEL_OVERRIDES = {
   labels: CAROUSEL_LABELS,
 };
 
-export function IntroSection() {
+const HEADING = {
+  en: 'Open Source Continuous Inference Benchmark Trusted by GigaWatt Token Factories',
+  zh: '受吉瓦级 token 工厂信赖的开源持续推理基准测试',
+} as const;
+
+export function IntroSection({ locale = 'en' }: { locale?: Locale } = {}) {
+  const isZh = locale === 'zh';
+  // Quotes fall back to the English original until a translation lands.
+  const quotes = isZh
+    ? carouselQuotes.map((q) => ({ ...q, text: q.textZh ?? q.text }))
+    : carouselQuotes;
   return (
     <section>
       <Card data-testid="intro-section">
         <div className="relative flex items-start gap-2 mb-4">
           <Quote className="size-5 shrink-0 mt-1 text-brand" />
-          <h2 className="text-lg font-semibold">
-            Open Source Continuous Inference Benchmark Trusted by GigaWatt Token Factories
-          </h2>
+          <h2 className="text-lg font-semibold">{HEADING[locale]}</h2>
           <MinecraftSplash />
         </div>
         <div>
           <QuoteCarousel
-            quotes={carouselQuotes}
+            quotes={quotes}
             overrides={CAROUSEL_OVERRIDES}
-            moreHref="/quotes"
+            moreHref={isZh ? '/zh/quotes' : '/quotes'}
+            moreLabel={isZh ? '查看更多支持者 →' : undefined}
           />
         </div>
       </Card>
diff --git a/packages/app/src/components/landing/landing-page.tsx b/packages/app/src/components/landing/landing-page.tsx
index ecc1c4c6..61c4f0d7 100644
--- a/packages/app/src/components/landing/landing-page.tsx
+++ b/packages/app/src/components/landing/landing-page.tsx
@@ -7,14 +7,74 @@ import { CuratedViewCard } from '@/components/landing/curated-view-card';
 import { NudgeEngine } from '@/components/nudge-engine';
 import { FAVORITE_PRESETS } from '@/components/favorites/favorite-presets';
 import { GITHUB_OWNER, GITHUB_REPO } from '@semianalysisai/inferencex-constants';
+import type { Locale } from '@/lib/i18n';
 
-export function LandingPage() {
+const STRINGS = {
+  en: {
+    fullDashboard: 'Full Dashboard',
+    fullDashboardP1:
+      'Every model, GPU, framework, and metric. Fully configurable inference benchmark charts with date ranges, concurrency sweeps, and raw data export.',
+    fullDashboardP2:
+      'Compare NVIDIA GB300 NVL72, GB200 NVL72, B300, B200, H200, H100, AMD MI355X, MI325X, MI300X and soon VR200 NVL72, AMD MI455X UALoE72, TPUv7 Ironwood, etc across DeepSeekv4 Pro, Qwen, Kimi, GLM, MiniMax, gpt-oss, Llama and other models.',
+    openDashboard: 'Open Dashboard',
+    reproTitle: 'Every Result Is Transparently done through Public GitHub Actions Automation',
+    reproP1:
+      'Every data point on the dashboard is produced by a public GitHub Actions workflow run. The recipe lives in the repo, the run executes on the actual target hardware, and the full logs and artifacts are publicly viewable. Click any point on a chart to jump straight to the run that produced it. All reproducible, auditable, and open source.',
+    reproStat: '1,000+ new benchmark datapoints added per week on average.',
+    reproStatTail: 'Browse every new model, GPU, framework, and configuration as it lands.',
+    actionsRunsTitle: 'Public Actions runs',
+    actionsRunsDesc:
+      'Every benchmark executes on GitHub Actions with full logs visible while the run is in progress.',
+    openRecipesTitle: 'Open recipes',
+    openRecipesDesc:
+      'Every model, framework, precision, and parallelism setting is committed to the public repo as a shell script.',
+    dbSnapshotsTitle: 'Weekly DB snapshots',
+    dbSnapshotsDesc:
+      'The full benchmark database is published as a public GitHub Release every week so the historical dataset stays auditable.',
+    browseSubmissions: 'Browse submissions',
+    viewRuns: 'View benchmark runs on GitHub Actions',
+    howItWorks: 'How it works',
+    quickComparisons: 'Quick Comparisons',
+    quickComparisonsDesc:
+      'Jump straight into the most popular GPU inference benchmark comparisons, curated and ready to explore.',
+  },
+  zh: {
+    fullDashboard: '完整仪表板',
+    fullDashboardP1:
+      '覆盖所有模型、GPU、框架与指标。完全可配置的推理基准测试图表，支持日期范围、并发扫描与原始数据导出。',
+    fullDashboardP2:
+      '跨 DeepSeekv4 Pro、Qwen、Kimi、GLM、MiniMax、gpt-oss、Llama 等模型，对比 NVIDIA GB300 NVL72、GB200 NVL72、B300、B200、H200、H100、AMD MI355X、MI325X、MI300X，以及即将上线的 VR200 NVL72、AMD MI455X UALoE72、TPUv7 Ironwood 等硬件。',
+    openDashboard: '打开仪表板',
+    reproTitle: '每一条结果都通过公开的 GitHub Actions 自动化流程透明产生',
+    reproP1:
+      '仪表板上的每个数据点都由公开的 GitHub Actions 工作流运行产生。配置方案（recipe）保存在公开仓库中，运行在真实目标硬件上执行，完整日志与产物公开可查。点击图表上的任意数据点即可跳转到生成它的那次运行。一切都可复现、可审计、开源。',
+    reproStat: '平均每周新增 1,000+ 条基准测试数据点。',
+    reproStatTail: '第一时间浏览每个新上线的模型、GPU、框架与配置。',
+    actionsRunsTitle: '公开的 Actions 运行',
+    actionsRunsDesc: '每次基准测试都在 GitHub Actions 上执行，运行过程中即可实时查看完整日志。',
+    openRecipesTitle: '开放的配置方案',
+    openRecipesDesc: '每个模型、框架、精度与并行配置都以 shell 脚本形式提交在公开仓库中。',
+    dbSnapshotsTitle: '每周数据库快照',
+    dbSnapshotsDesc:
+      '完整基准测试数据库每周以公开 GitHub Release 的形式发布，历史数据集持续可审计。',
+    browseSubmissions: '浏览提交记录',
+    viewRuns: '在 GitHub Actions 上查看基准测试运行',
+    howItWorks: '工作原理',
+    quickComparisons: '快速对比',
+    quickComparisonsDesc: '一键进入最热门的 GPU 推理基准测试对比，精选视图开箱即用。',
+  },
+} as const;
+
+export function LandingPage({ locale = 'en' }: { locale?: Locale } = {}) {
+  const t = STRINGS[locale];
+  // Internal links stay within the current language tree.
+  const prefix = locale === 'zh' ? '/zh' : '';
   return (
     <main className="relative">
       <LandingPageAnalytics />
       <NudgeEngine scope="landing" />
       <div className="container mx-auto px-4 lg:px-8 flex flex-col gap-6 lg:gap-4">
-        <IntroSection />
+        <IntroSection locale={locale} />
 
         {/* Split: Dashboard vs Presets */}
         <section className="flex flex-col gap-4 pb-8">
@@ -22,25 +82,18 @@ export function LandingPage() {
           <Card>
             <div className="flex items-center gap-2 mb-3">
               <BarChart3 className="size-5 shrink-0 text-brand" />
-              <h2 className="text-lg font-semibold">Full Dashboard</h2>
+              <h2 className="text-lg font-semibold">{t.fullDashboard}</h2>
             </div>
-            <p className="text-sm text-muted-foreground mb-2">
-              Every model, GPU, framework, and metric. Fully configurable inference benchmark charts
-              with date ranges, concurrency sweeps, and raw data export.
-            </p>
-            <p className="text-sm text-muted-foreground mb-6">
-              Compare NVIDIA GB300 NVL72, GB200 NVL72, B300, B200, H200, H100, AMD MI355X, MI325X,
-              MI300X and soon VR200 NVL72, AMD MI455X UALoE72, TPUv7 Ironwood, etc across DeepSeekv4
-              Pro, Qwen, Kimi, GLM, MiniMax, gpt-oss, Llama and other models.
-            </p>
+            <p className="text-sm text-muted-foreground mb-2">{t.fullDashboardP1}</p>
+            <p className="text-sm text-muted-foreground mb-6">{t.fullDashboardP2}</p>
             <div className="mt-auto">
               <LandingTrackedLink
-                href="/inference"
+                href={`${prefix}/inference`}
                 analyticsEvent="landing_full_dashboard_clicked"
                 appNavigation
                 className="inline-flex items-center justify-center gap-2 rounded-md text-sm sm:text-base font-medium h-12 px-8 bg-brand text-primary-foreground hover:bg-brand/90 transition-colors"
               >
-                Open Dashboard
+                {t.openDashboard}
                 <ArrowRight className="size-4" />
               </LandingTrackedLink>
             </div>
@@ -50,54 +103,35 @@ export function LandingPage() {
           <Card>
             <div className="flex items-center gap-2 mb-3">
               <ShieldCheck className="size-5 shrink-0 text-brand" />
-              <h2 className="text-lg font-semibold">
-                Every Result Is Transparently done through Public GitHub Actions Automation
-              </h2>
+              <h2 className="text-lg font-semibold">{t.reproTitle}</h2>
             </div>
+            <p className="text-sm text-muted-foreground mb-4">{t.reproP1}</p>
             <p className="text-sm text-muted-foreground mb-4">
-              Every data point on the dashboard is produced by a public GitHub Actions workflow run.
-              The recipe lives in the repo, the run executes on the actual target hardware, and the
-              full logs and artifacts are publicly viewable. Click any point on a chart to jump
-              straight to the run that produced it. All reproducible, auditable, and open source.
-            </p>
-            <p className="text-sm text-muted-foreground mb-4">
-              <span className="font-semibold text-foreground">
-                1,000+ new benchmark datapoints added per week on average.
-              </span>{' '}
-              Browse every new model, GPU, framework, and configuration as it lands.
+              <span className="font-semibold text-foreground">{t.reproStat}</span> {t.reproStatTail}
             </p>
             <div className="grid grid-cols-1 sm:grid-cols-3 gap-3 mb-4">
               <div className="rounded-md border border-border bg-card p-3">
-                <div className="text-sm font-semibold text-foreground">Public Actions runs</div>
-                <div className="text-xs text-muted-foreground mt-1">
-                  Every benchmark executes on GitHub Actions with full logs visible while the run is
-                  in progress.
-                </div>
+                <div className="text-sm font-semibold text-foreground">{t.actionsRunsTitle}</div>
+                <div className="text-xs text-muted-foreground mt-1">{t.actionsRunsDesc}</div>
               </div>
               <div className="rounded-md border border-border bg-card p-3">
-                <div className="text-sm font-semibold text-foreground">Open recipes</div>
-                <div className="text-xs text-muted-foreground mt-1">
-                  Every model, framework, precision, and parallelism setting is committed to the
-                  public repo as a shell script.
-                </div>
+                <div className="text-sm font-semibold text-foreground">{t.openRecipesTitle}</div>
+                <div className="text-xs text-muted-foreground mt-1">{t.openRecipesDesc}</div>
               </div>
               <div className="rounded-md border border-border bg-card p-3">
-                <div className="text-sm font-semibold text-foreground">Weekly DB snapshots</div>
-                <div className="text-xs text-muted-foreground mt-1">
-                  The full benchmark database is published as a public GitHub Release every week so
-                  the historical dataset stays auditable.
-                </div>
+                <div className="text-sm font-semibold text-foreground">{t.dbSnapshotsTitle}</div>
+                <div className="text-xs text-muted-foreground mt-1">{t.dbSnapshotsDesc}</div>
               </div>
             </div>
             <div className="flex flex-wrap gap-3 text-sm">
               <LandingTrackedLink
-                href="/submissions"
+                href={`${prefix}/submissions`}
                 data-testid="landing-submissions-link"
                 analyticsEvent="landing_submissions_clicked"
                 appNavigation
                 className="inline-flex items-center gap-1.5 rounded-md bg-brand text-primary-foreground hover:bg-brand/90 px-3 py-1.5 transition-colors font-medium"
               >
-                Browse submissions
+                {t.browseSubmissions}
                 <ArrowRight className="size-3.5" />
               </LandingTrackedLink>
               <LandingTrackedLink
@@ -107,15 +141,15 @@ export function LandingPage() {
                 analyticsEvent="landing_reproducibility_actions_clicked"
                 className="inline-flex items-center gap-1.5 rounded-md border border-border px-3 py-1.5 hover:bg-accent transition-colors"
               >
-                View benchmark runs on GitHub Actions
+                {t.viewRuns}
                 <ArrowRight className="size-3.5" />
               </LandingTrackedLink>
               <LandingTrackedLink
-                href="/about#reproducibility"
+                href={`${prefix}/about#reproducibility`}
                 analyticsEvent="landing_reproducibility_about_clicked"
                 className="inline-flex items-center gap-1.5 rounded-md border border-border px-3 py-1.5 hover:bg-accent transition-colors"
               >
-                How it works
+                {t.howItWorks}
               </LandingTrackedLink>
             </div>
           </Card>
@@ -124,12 +158,9 @@ export function LandingPage() {
           <Card>
             <div className="flex items-center gap-2 mb-3">
               <Sparkles className="size-5 shrink-0 text-brand" />
-              <h2 className="text-lg font-semibold">Quick Comparisons</h2>
+              <h2 className="text-lg font-semibold">{t.quickComparisons}</h2>
             </div>
-            <p className="text-sm text-muted-foreground mb-4">
-              Jump straight into the most popular GPU inference benchmark comparisons, curated and
-              ready to explore.
-            </p>
+            <p className="text-sm text-muted-foreground mb-4">{t.quickComparisonsDesc}</p>
             <div className="grid grid-cols-1 sm:grid-cols-2 gap-3">
               {FAVORITE_PRESETS.filter((preset) => !preset.hidden).map((preset) => (
                 <CuratedViewCard key={preset.id} preset={preset} />
diff --git a/packages/app/src/components/quote-carousel.tsx b/packages/app/src/components/quote-carousel.tsx
index 8fca5798..b0c46a8b 100644
--- a/packages/app/src/components/quote-carousel.tsx
+++ b/packages/app/src/components/quote-carousel.tsx
@@ -24,6 +24,8 @@ export interface QuoteCarouselProps {
   };
   /** Link to a page with all quotes */
   moreHref?: string;
+  /** Label for the moreHref link (default "See more supporters →") */
+  moreLabel?: string;
   /** Auto-rotate interval in ms (default 8000) */
   intervalMs?: number;
 }
@@ -94,6 +96,7 @@ export function QuoteCarousel({
   quotes,
   overrides = {},
   moreHref,
+  moreLabel,
   intervalMs = 8_000,
 }: QuoteCarouselProps) {
   const { labels = {} } = overrides;
@@ -202,7 +205,7 @@ export function QuoteCarousel({
             className="text-xs font-bold text-brand hover:underline"
             onClick={() => track('quote_carousel_see_more_clicked')}
           >
-            See more supporters &rarr;
+            {moreLabel ?? 'See more supporters →'}
           </Link>
         </div>
       )}
diff --git a/packages/app/src/components/quotes/quotes-content.tsx b/packages/app/src/components/quotes/quotes-content.tsx
index d3b2a5d7..d62ccd7c 100644
--- a/packages/app/src/components/quotes/quotes-content.tsx
+++ b/packages/app/src/components/quotes/quotes-content.tsx
@@ -7,9 +7,26 @@ import { Card } from '@/components/ui/card';
 import { ExternalLinkIcon } from '@/components/ui/external-link-icon';
 import { track } from '@/lib/analytics';
 
+import type { Locale } from '@/lib/i18n';
+
 import { CompanyLogo, highlightBrand } from './quote-utils';
 import { QUOTES } from './quotes-data';
 
+const STRINGS = {
+  en: {
+    heading: <>InferenceX&trade; Initiative Supporters</>,
+    intro:
+      'InferenceX™ initiative is supported by many major buyers of compute and prominent members of the ML community including those from MiniMax, Moonshot Kimi, Alibaba Qwen, OpenAI, Microsoft, vLLM, PyTorch Foundation, Oracle and more.',
+    jumpTo: (org: string) => `Jump to ${org}’s quote`,
+  },
+  zh: {
+    heading: <>InferenceX&trade; 计划支持者</>,
+    intro:
+      'InferenceX™ 计划获得众多主要算力买家与 ML 社区知名成员的支持，包括来自 MiniMax、Moonshot Kimi、阿里巴巴 Qwen、OpenAI、Microsoft、vLLM、PyTorch 基金会、Oracle 等机构的支持者。',
+    jumpTo: (org: string) => `跳转到 ${org} 的评价`,
+  },
+} as const;
+
 /** Stable anchor id for an org's quote (first occurrence wins). */
 function orgAnchorId(org: string): string {
   const slug = org
@@ -89,20 +106,15 @@ function QuoteCard({
   return content;
 }
 
-export function QuotesContent() {
+export function QuotesContent({ locale = 'en' }: { locale?: Locale } = {}) {
+  const t = STRINGS[locale];
   return (
     <main className="relative">
       <div className="container mx-auto px-4 lg:px-8 flex flex-col gap-4">
         <section className="flex flex-col gap-4">
           <Card>
-            <h2 className="text-2xl lg:text-4xl font-bold tracking-tight">
-              InferenceX&trade; Initiative Supporters
-            </h2>
-            <p className="mt-3 text-base lg:text-lg text-muted-foreground">
-              InferenceX&trade; initiative is supported by many major buyers of compute and
-              prominent members of the ML community including those from MiniMax, Moonshot Kimi,
-              Alibaba Qwen, OpenAI, Microsoft, vLLM, PyTorch Foundation, Oracle and more.
-            </p>
+            <h2 className="text-2xl lg:text-4xl font-bold tracking-tight">{t.heading}</h2>
+            <p className="mt-3 text-base lg:text-lg text-muted-foreground">{t.intro}</p>
             <div className="mt-6 flex flex-wrap items-center justify-center gap-2">
               {orgLogos.map(({ org, logo }) => (
                 <button
@@ -113,8 +125,8 @@ export function QuotesContent() {
                     scrollToOrg(org);
                   }}
                   className="group flex items-center justify-center h-10 px-3 cursor-pointer rounded-md transition-colors hover:bg-muted focus-visible:outline-none focus-visible:ring-2 focus-visible:ring-brand"
-                  title={`Jump to ${org}’s quote`}
-                  aria-label={`Jump to ${org}’s quote`}
+                  title={t.jumpTo(org)}
+                  aria-label={t.jumpTo(org)}
                 >
                   <img
                     src={`/logos/${logo}`}
@@ -135,7 +147,7 @@ export function QuotesContent() {
                       id={
                         firstQuoteIndexForOrg[quote.org] === i ? orgAnchorId(quote.org) : undefined
                       }
-                      text={quote.text}
+                      text={locale === 'zh' ? (quote.textZh ?? quote.text) : quote.text}
                       name={quote.name}
                       title={quote.title}
                       org={quote.org}
diff --git a/packages/app/src/components/quotes/quotes-data.ts b/packages/app/src/components/quotes/quotes-data.ts
index 71158c07..4f84f727 100644
--- a/packages/app/src/components/quotes/quotes-data.ts
+++ b/packages/app/src/components/quotes/quotes-data.ts
@@ -1,5 +1,7 @@
 export interface Quote {
   text: string;
+  /** Simplified Chinese translation of `text`, shown on /zh pages. */
+  textZh?: string;
   name: string;
   title: string;
   org: string;
@@ -10,6 +12,8 @@ export interface Quote {
 export const QUOTES: Quote[] = [
   {
     text: "Vendor-neutral, continuously updated benchmarking is essential as models and inference stacks co-evolve. MiniMax M3 was built with both frontier capability and real-world deployment efficiency in mind, and the day-one vLLM support from the community reflects the collaborative spirit we're proud to be part of. InferenceX provides the kind of transparent, reproducible data the ecosystem needs.",
+    textZh:
+      '在模型与推理技术栈协同演进的今天，厂商中立、持续更新的基准测试不可或缺。MiniMax M3 在设计之初就兼顾了前沿能力与实际部署效率，而社区第一时间对 vLLM 的支持也体现了我们引以为豪的协作精神。InferenceX 正是生态所需的透明、可复现的数据平台。',
     name: 'Ryan Lee',
     title: 'Head of DevRel, MiniMax',
     org: 'MiniMax',
@@ -18,6 +22,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'At Moonshot AI, we are dedicated to supporting the open-source ecosystem by advancing frontier open models. As the Kimi K2 series evolves, we are glad to see its performance tracked in InferenceX™’s open and reproducible benchmarks. InferenceX™ helps the community better understand industry-level performance and encourages the ecosystem to keep improving and optimizing.',
+    textZh:
+      'Moonshot AI 致力于通过推动前沿开源模型来支持开源生态。随着 Kimi K2 系列的不断演进，我们很高兴看到其性能被 InferenceX™ 的开放、可复现基准测试持续追踪。InferenceX™ 帮助社区更好地理解行业级性能水平，并推动生态持续改进与优化。',
     name: 'Moonshot AI',
     title: '',
     org: 'Moonshot AI',
@@ -26,6 +32,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: "Qwen has always been about putting capable models into the hands of as many developers as possible, and real-world inference efficiency is what makes that scale. InferenceX™ brings rigorous, vendor-neutral measurement to exactly the questions that matter: how models like Qwen3.5 actually perform across accelerators. Independent, reproducible benchmarks on real hardware give the community the clarity it needs to deploy with confidence, and we're glad to see that level of transparency driving the inference ecosystem forward.",
+    textZh:
+      'Qwen 始终致力于将强大的模型交到尽可能多的开发者手中，而真实推理效率是实现规模化的关键。InferenceX™ 为最重要的问题带来了严谨、厂商中立的测量：像 Qwen3.5 这样的模型在各类加速器上的实际表现如何。基于真实硬件的独立、可复现基准测试为社区提供了自信部署所需的清晰洞察，我们很高兴看到这种透明度推动着推理生态不断向前发展。',
     name: 'Alibaba Qwen',
     title: '',
     org: 'Alibaba Qwen',
@@ -33,7 +41,9 @@ export const QUOTES: Quote[] = [
     link: 'https://qwen.ai',
   },
   {
-    text: "As we build systems at unprecedented scale, it's critical for the ML community to have open, transparent benchmarks that reflect how inference really performs across hardware and software. InferenceMAX\u2122's head-to-head benchmarks cut through the noise and provide a living picture of token throughput, performance per dollar, and tokens per Megawatt. This kind of open source effort strengthens the entire ecosystem and helps everyone, from researchers to operators of frontier datacenters, make smarter decisions.",
+    text: "As we build systems at unprecedented scale, it's critical for the ML community to have open, transparent benchmarks that reflect how inference really performs across hardware and software. InferenceMAX™'s head-to-head benchmarks cut through the noise and provide a living picture of token throughput, performance per dollar, and tokens per Megawatt. This kind of open source effort strengthens the entire ecosystem and helps everyone, from researchers to operators of frontier datacenters, make smarter decisions.",
+    textZh:
+      '在我们以前所未有的规模构建系统之际，机器学习社区拥有开放、透明的基准测试至关重要——它们真实反映了推理在不同硬件和软件上的表现。InferenceMAX™ 的对比基准测试穿透噪音，提供了关于 token 吞吐量、每美元性能和每兆瓦 token 数的动态全景。这种开源努力增强了整个生态，帮助从研究者到前沿数据中心运营者的每一个人做出更明智的决策。',
     name: 'Peter Hoeschele',
     title: 'VP of Infrastructure and Industrial Compute, OpenAI Stargate',
     org: 'OpenAI',
@@ -41,7 +51,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/peter-hoeschele/',
   },
   {
-    text: "Our mission at Azure is to give customers the most performant, efficient, and cost-effective cloud for AI. SemiAnalysis InferenceMAX\u2122 supports that mission by providing transparent, reproducible benchmarks that track inference performance across GPUs and software stacks under realistic workloads. This continuous data on throughput, efficiency, and cost per watt strengthens our ability to tune Azure's inference platform for scale, helping customers build with confidence on Microsoft Cloud.",
+    text: "Our mission at Azure is to give customers the most performant, efficient, and cost-effective cloud for AI. SemiAnalysis InferenceMAX™ supports that mission by providing transparent, reproducible benchmarks that track inference performance across GPUs and software stacks under realistic workloads. This continuous data on throughput, efficiency, and cost per watt strengthens our ability to tune Azure's inference platform for scale, helping customers build with confidence on Microsoft Cloud.",
+    textZh:
+      'Azure 的使命是为客户提供性能最强、效率最高且最具成本效益的 AI 云。SemiAnalysis InferenceMAX™ 通过提供透明、可复现的基准测试来追踪各类 GPU 和软件栈在真实工作负载下的推理性能，有力地支持了这一使命。关于吞吐量、效率和每瓦成本的持续数据增强了我们优化 Azure 推理平台规模化的能力，帮助客户在 Microsoft Cloud 上自信构建。',
     name: 'Scott Guthrie',
     title: 'Executive Vice President, Microsoft Cloud & AI',
     org: 'Microsoft',
@@ -49,7 +61,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/guthriescott/',
   },
   {
-    text: 'At Microsoft, delivering the best inference performance and economics for our customers at scale requires a deep understanding of how AI models interact with real-world hardware and software. Open-source, reproducible benchmarks, like InferenceMAX\u2122, are essential for generating transparent insights into throughput, efficiency, and cost under realistic workloads. These continuous signals help guide our platform strategy, enabling us to optimize the entire stack from silicon, to systems, to software, so that every layer works together to unlock the full potential of our infrastructure.',
+    text: 'At Microsoft, delivering the best inference performance and economics for our customers at scale requires a deep understanding of how AI models interact with real-world hardware and software. Open-source, reproducible benchmarks, like InferenceMAX™, are essential for generating transparent insights into throughput, efficiency, and cost under realistic workloads. These continuous signals help guide our platform strategy, enabling us to optimize the entire stack from silicon, to systems, to software, so that every layer works together to unlock the full potential of our infrastructure.',
+    textZh:
+      '在 Microsoft，为客户大规模交付最佳推理性能和经济性，需要深入理解 AI 模型如何与真实硬件和软件交互。像 InferenceMAX™ 这样的开源、可复现基准测试对于产出关于吞吐量、效率和成本的透明洞察至关重要。这些持续信号帮助指导我们的平台战略，使我们能够从芯片到系统再到软件对整个技术栈进行优化，让每一层协同工作，充分释放基础设施的潜力。',
     name: 'Saurabh Dighe',
     title: 'Corporate Vice President, Azure Strategic Planning & Architecture',
     org: 'Microsoft',
@@ -57,7 +71,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/saurabhdighe/',
   },
   {
-    text: 'PyTorch was built on the belief that open tools accelerate the entire AI ecosystem. InferenceX\u2122 embodies that same philosophy\u2014open, reproducible, and vendor-neutral benchmarks that give the community real data on real hardware. As inference workloads scale to serve billions of users, having a continuously updated, transparent performance baseline across accelerators is essential for practitioners and platform teams making critical infrastructure decisions.',
+    text: 'PyTorch was built on the belief that open tools accelerate the entire AI ecosystem. InferenceX™ embodies that same philosophy—open, reproducible, and vendor-neutral benchmarks that give the community real data on real hardware. As inference workloads scale to serve billions of users, having a continuously updated, transparent performance baseline across accelerators is essential for practitioners and platform teams making critical infrastructure decisions.',
+    textZh:
+      'PyTorch 基于一个信念而生：开放工具能加速整个 AI 生态。InferenceX™ 体现了同样的理念——开放、可复现、厂商中立的基准测试，为社区提供真实硬件上的真实数据。随着推理工作负载扩展到服务数十亿用户，在各类加速器上持续更新、透明的性能基线对于做出关键基础设施决策的从业者和平台团队而言不可或缺。',
     name: 'Joseph Spisak',
     title: 'Product Director, Meta Super Intelligence Lab',
     org: 'Meta Superintelligence Labs',
@@ -66,6 +82,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'Oracle Cloud Infrastructure is built to give frontier labs & enterprises flexibility and choice, with many GPU SKUs available for AI at scale. InferenceMAX strengthens that mission by delivering open source, reproducible benchmarks that reflect real-world performance, efficiency, and cost on the latest hardware and software. With this transparency, customers can confidently select the platforms that best align with their AI strategies.',
+    textZh:
+      'Oracle Cloud Infrastructure 旨在为前沿实验室和企业提供灵活性与选择，提供多种 GPU SKU 用于大规模 AI。InferenceMAX 通过提供开源、可复现的基准测试来支持这一使命，真实反映最新硬件和软件上的性能、效率与成本。凭借这种透明度，客户可以自信地选择与其 AI 战略最契合的平台。',
     name: 'Jay Jackson',
     title: 'Vice President, Oracle Cloud Infrastructure',
     org: 'Oracle',
@@ -73,7 +91,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/jayejackson/',
   },
   {
-    text: 'The gap between theoretical peak and real-world inference throughput is often determined by systems software: inference engine, distributed strategies, and low-level kernels. InferenceMAX\u2122 is valuable because it benchmarks the latest software showing how optimizations like FP4, MTP, speculative decode, and wide-EP actually play out across various hardware. Open, reproducible results like these help the whole community move faster.',
+    text: 'The gap between theoretical peak and real-world inference throughput is often determined by systems software: inference engine, distributed strategies, and low-level kernels. InferenceMAX™ is valuable because it benchmarks the latest software showing how optimizations like FP4, MTP, speculative decode, and wide-EP actually play out across various hardware. Open, reproducible results like these help the whole community move faster.',
+    textZh:
+      '理论峰值与实际推理吞吐量之间的差距往往取决于系统软件：推理引擎、分布式策略和底层内核。InferenceMAX™ 的价值在于它对最新软件进行基准测试，展示了 FP4、MTP、投机解码和 wide-EP 等优化在不同硬件上的实际效果。这种开放、可复现的结果帮助整个社区更快地前进。',
     name: 'Tri Dao',
     title: 'Chief Scientist of Together AI & Inventor of Flash Attention',
     org: 'Together AI',
@@ -81,7 +101,9 @@ export const QUOTES: Quote[] = [
     link: 'https://tridao.me/',
   },
   {
-    text: "The industry needs many public, reproducible benchmarks of inference performance. We're excited to collaborate with InferenceMAX\u2122 from the vLLM team. More diverse workloads and scenarios that everyone can trust and reference will help the ecosystem move forward. Fair, transparent measurements drive progress across every layer of the stack, from model architectures to inference engines to hardware.",
+    text: "The industry needs many public, reproducible benchmarks of inference performance. We're excited to collaborate with InferenceMAX™ from the vLLM team. More diverse workloads and scenarios that everyone can trust and reference will help the ecosystem move forward. Fair, transparent measurements drive progress across every layer of the stack, from model architectures to inference engines to hardware.",
+    textZh:
+      '行业需要大量公开、可复现的推理性能基准测试。vLLM 团队很高兴与 InferenceMAX™ 合作。更多元化的、人人可信赖和引用的工作负载与场景将推动生态向前发展。公平、透明的测量驱动着技术栈每一层的进步——从模型架构到推理引擎再到硬件。',
     name: 'Simon Mo',
     title: 'vLLM Project Co-Lead',
     org: 'vLLM',
@@ -89,7 +111,8 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/simon-mo-834217162/',
   },
   {
-    text: 'InferenceMAX\u2122 benchmark is pogchamp & W in chat',
+    text: 'InferenceMAX™ benchmark is pogchamp & W in chat',
+    textZh: 'InferenceMAX™ 基准测试绝绝子，大写的赢',
     name: 'Kaichao You',
     title: 'vLLM Project Co-Lead & PhD Student @ Tsinghua University',
     org: 'vLLM',
@@ -98,6 +121,7 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'Arguably the most important OSS benchmark suite out today InferenceX',
+    textZh: 'InferenceX 堪称当下最重要的开源基准测试套件',
     name: 'Mark Saroufim',
     title: 'GPU Mode Founder & Meta PyTorch Engineer',
     org: 'GPU Mode',
@@ -105,7 +129,9 @@ export const QUOTES: Quote[] = [
     link: 'https://x.com/marksaroufim',
   },
   {
-    text: 'InferenceMAX\u2122 demonstrates how an open ecosystem can operate in practice. Many leading inference stacks such as vLLM, SGLang, and TensorRT-LLM are built on PyTorch, and benchmarks like this show how innovations across kernels, runtimes, and frameworks translate into measurable performance on a range of hardware platforms, including NVIDIA and AMD GPUs. By being open source and running nightly, InferenceMAX\u2122 offers a transparent, community-driven approach to tracking progress and providing PyTorch users with data-driven insights.',
+    text: 'InferenceMAX™ demonstrates how an open ecosystem can operate in practice. Many leading inference stacks such as vLLM, SGLang, and TensorRT-LLM are built on PyTorch, and benchmarks like this show how innovations across kernels, runtimes, and frameworks translate into measurable performance on a range of hardware platforms, including NVIDIA and AMD GPUs. By being open source and running nightly, InferenceMAX™ offers a transparent, community-driven approach to tracking progress and providing PyTorch users with data-driven insights.',
+    textZh:
+      'InferenceMAX™ 展示了开放生态如何在实践中运作。vLLM、SGLang 和 TensorRT-LLM 等众多领先推理栈均构建于 PyTorch 之上，而这样的基准测试展示了内核、运行时和框架层面的创新如何转化为 NVIDIA 和 AMD GPU 等多种硬件平台上可衡量的性能。凭借开源属性和每夜运行，InferenceMAX™ 提供了一种透明的、社区驱动的方式来追踪进展，并为 PyTorch 用户提供数据驱动的洞察。',
     name: 'Matt White',
     title: 'Executive Director, PyTorch Foundation',
     org: 'PyTorch Foundation',
@@ -113,7 +139,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/mdwdata/',
   },
   {
-    text: 'InferenceMAX\u2122 raises the bar by delivering open, transparent benchmarks that track how inference really performs across the latest GPUs and software stacks. For customers, having reproducible data that measures real world tokens per dollar & tokens per watt, turns abstract marketing numbers into actionable insight. At CoreWeave, we support this effort because it brings clarity to a fast-moving space and helps the entire ecosystem build with confidence.',
+    text: 'InferenceMAX™ raises the bar by delivering open, transparent benchmarks that track how inference really performs across the latest GPUs and software stacks. For customers, having reproducible data that measures real world tokens per dollar & tokens per watt, turns abstract marketing numbers into actionable insight. At CoreWeave, we support this effort because it brings clarity to a fast-moving space and helps the entire ecosystem build with confidence.',
+    textZh:
+      'InferenceMAX™ 通过提供开放、透明的基准测试来追踪推理在最新 GPU 和软件栈上的实际表现，树立了新标杆。对客户而言，拥有衡量真实每美元 token 数和每瓦 token 数的可复现数据，将抽象的营销数字转化为可操作的洞察。CoreWeave 支持这一努力，因为它为这个快速发展的领域带来了清晰度，帮助整个生态自信构建。',
     name: 'Peter Salanki',
     title: 'CTO, CoreWeave',
     org: 'CoreWeave',
@@ -121,7 +149,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/salanki/',
   },
   {
-    text: "InferenceMAX\u2122 sets a new standard by providing open, transparent benchmarks that reveal how inference performs across today's leading GPUs and software stacks. With reproducible data measuring real-world tokens per dollar and tokens per watt, customers can move beyond marketing claims to actionable insights. For us at Nebius, as a full-stack AI cloud provider, this initiative helps us build our inference platform with confidence and ensure we are aligned with the ecosystem.",
+    text: "InferenceMAX™ sets a new standard by providing open, transparent benchmarks that reveal how inference performs across today's leading GPUs and software stacks. With reproducible data measuring real-world tokens per dollar and tokens per watt, customers can move beyond marketing claims to actionable insights. For us at Nebius, as a full-stack AI cloud provider, this initiative helps us build our inference platform with confidence and ensure we are aligned with the ecosystem.",
+    textZh:
+      'InferenceMAX™ 通过提供开放、透明的基准测试，揭示了推理在当今领先 GPU 和软件栈上的表现，树立了新标准。凭借衡量真实每美元 token 数和每瓦 token 数的可复现数据，客户可以超越营销宣传，获得可操作的洞察。对于作为全栈 AI 云服务商的 Nebius 而言，这一计划帮助我们自信地构建推理平台，并确保与生态保持一致。',
     name: 'Roman Chernin',
     title: 'Co-Founder & Chief Business Officer, Nebius',
     org: 'Nebius',
@@ -129,7 +159,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/roman-chernin-1b4b8758/',
   },
   {
-    text: "At TensorWave, we're building a next-generation cloud on AMD GPUs because we believe innovation thrives when customers have strong alternatives. InferenceMAX\u2122 reinforces that vision by providing open source, reproducible benchmarks that track throughput, efficiency, and cost across the latest hardware and software. By cutting through synthetic numbers and highlighting real-world inference performance, it helps customers see the full potential of AMD platforms for AI at scale.",
+    text: "At TensorWave, we're building a next-generation cloud on AMD GPUs because we believe innovation thrives when customers have strong alternatives. InferenceMAX™ reinforces that vision by providing open source, reproducible benchmarks that track throughput, efficiency, and cost across the latest hardware and software. By cutting through synthetic numbers and highlighting real-world inference performance, it helps customers see the full potential of AMD platforms for AI at scale.",
+    textZh:
+      '在 TensorWave，我们基于 AMD GPU 构建下一代云，因为我们相信当客户拥有强有力的替代方案时，创新才能蓬勃发展。InferenceMAX™ 通过提供开源、可复现的基准测试来追踪最新硬件和软件的吞吐量、效率与成本，强化了这一愿景。它穿透合成数据，突出真实推理性能，帮助客户看到 AMD 平台在大规模 AI 中的全部潜力。',
     name: 'Darrick Horton',
     title: 'CEO, TensorWave',
     org: 'TensorWave',
@@ -137,7 +169,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/darrick-horton/',
   },
   {
-    text: "SGLang is the inference engine behind many production inference factories such as xAI's Grok, earning its recognition as THE Inference King. At scale, we see firsthand how much performance varies across hardware, models, and configurations. InferenceX\u2122 benchmarks SGLang across every major GPU platform nightly, capturing that variance in a way no other benchmark does, continuously, & reproducibly.",
+    text: "SGLang is the inference engine behind many production inference factories such as xAI's Grok, earning its recognition as THE Inference King. At scale, we see firsthand how much performance varies across hardware, models, and configurations. InferenceX™ benchmarks SGLang across every major GPU platform nightly, capturing that variance in a way no other benchmark does, continuously, & reproducibly.",
+    textZh:
+      'SGLang 是 xAI Grok 等众多生产级推理工厂背后的推理引擎，被誉为推理之王。在大规模场景中，我们深刻体会到性能在不同硬件、模型和配置间的巨大差异。InferenceX™ 每夜在所有主流 GPU 平台上对 SGLang 进行基准测试，以其他基准测试无法做到的方式——持续且可复现地——捕捉这种差异。',
     name: 'Mingyi Lu',
     title: 'SGLang Product Lead',
     org: 'SGLang',
@@ -145,7 +179,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/mingyi-lu/',
   },
   {
-    text: "InferenceX\u2122 ensembles precisely that \u2014 open, reproducible benchmarks that are continuously updated as xPU accelerators (GPUs/TPUs/LPUs), memory, storage, and software stacks evolve. I'm excited to see the InferenceX benchmarking roadmap include agentic coding workloads that stress CPU KV Cache offloading & soon NVMe KV Cache offloading from xPUs. As WEKA helps scale the Memory Wall by building the KV Cache infrastructure that feeds these xPUs, having this level of visibility into inference performance helps the entire ecosystem make smarter decisions about where to invest.",
+    text: "InferenceX™ ensembles precisely that — open, reproducible benchmarks that are continuously updated as xPU accelerators (GPUs/TPUs/LPUs), memory, storage, and software stacks evolve. I'm excited to see the InferenceX benchmarking roadmap include agentic coding workloads that stress CPU KV Cache offloading & soon NVMe KV Cache offloading from xPUs. As WEKA helps scale the Memory Wall by building the KV Cache infrastructure that feeds these xPUs, having this level of visibility into inference performance helps the entire ecosystem make smarter decisions about where to invest.",
+    textZh:
+      'InferenceX™ 恰好体现了这一点——开放、可复现的基准测试，随着 xPU 加速器（GPU/TPU/LPU）、内存、存储和软件栈的演进而持续更新。我很高兴看到 InferenceX 基准测试路线图纳入了对 CPU KV Cache 卸载乃至即将到来的 NVMe KV Cache 卸载施压的智能体编程工作负载。WEKA 通过构建为这些 xPU 供给的 KV Cache 基础设施来帮助突破内存墙，拥有这种对推理性能的深度可见性有助于整个生态做出更明智的投资决策。',
     name: 'Val Bercovici',
     title: 'Chief AI Officer, WEKA',
     org: 'WEKA',
@@ -153,7 +189,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/valentinbercovici/',
   },
   {
-    text: 'For researchers working on inference optimizations, understanding how new techniques interact across the software and hardware stack is critical yet incredibly hard to measure. InferenceX\u2122 provides much-needed insights into how inference performance evolves across major hardware platforms, moving the field forward with open, reproducible data that makes the gaps and progress visible.',
+    text: 'For researchers working on inference optimizations, understanding how new techniques interact across the software and hardware stack is critical yet incredibly hard to measure. InferenceX™ provides much-needed insights into how inference performance evolves across major hardware platforms, moving the field forward with open, reproducible data that makes the gaps and progress visible.',
+    textZh:
+      '对于从事推理优化的研究者而言，理解新技术如何在软硬件栈中交互至关重要，却极难衡量。InferenceX™ 提供了亟需的洞察，展示了推理性能在各主要硬件平台上的演进轨迹，以开放、可复现的数据让差距与进展清晰可见，推动了该领域的发展。',
     name: 'Simon Guo',
     title: 'PhD Student, Stanford CS',
     org: 'Stanford',
@@ -161,7 +199,9 @@ export const QUOTES: Quote[] = [
     link: 'https://simonguo.tech/',
   },
   {
-    text: 'Hugging Face exists to make AI open and accessible to everyone. InferenceX\u2122 extends that mission to ai chip performance, pulling models directly from the Hub and benchmarking them across every major accelerator, continuously and transparently. When the community can see exactly how frontier open models perform on real hardware in real time, it raises the bar for the entire ecosystem.',
+    text: 'Hugging Face exists to make AI open and accessible to everyone. InferenceX™ extends that mission to ai chip performance, pulling models directly from the Hub and benchmarking them across every major accelerator, continuously and transparently. When the community can see exactly how frontier open models perform on real hardware in real time, it raises the bar for the entire ecosystem.',
+    textZh:
+      'Hugging Face 的存在是为了让 AI 对每个人都开放且可及。InferenceX™ 将这一使命延伸到 AI 芯片性能领域，直接从 Hub 拉取模型，在所有主流加速器上持续、透明地进行基准测试。当社区能够实时看到前沿开源模型在真实硬件上的确切表现时，整个生态的标准都将被提升。',
     name: 'Clement Delangue',
     title: 'CEO, Hugging Face',
     org: 'Hugging Face',
@@ -169,7 +209,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/cdelangue/',
   },
   {
-    text: 'Lambda exists to make GPU compute simple and accessible for AI teams, from individual researchers to the largest labs. InferenceX\u2122 aligns with that mission by giving the community open, reproducible benchmarks that measure what actually matters: real-world throughput, cost efficiency, and performance per watt across the latest hardware and software stacks. Teams can make informed compute choices grounded in transparent, continuously updated data.',
+    text: 'Lambda exists to make GPU compute simple and accessible for AI teams, from individual researchers to the largest labs. InferenceX™ aligns with that mission by giving the community open, reproducible benchmarks that measure what actually matters: real-world throughput, cost efficiency, and performance per watt across the latest hardware and software stacks. Teams can make informed compute choices grounded in transparent, continuously updated data.',
+    textZh:
+      'Lambda 致力于让 GPU 算力对 AI 团队——从个人研究者到大型实验室——都简单易用。InferenceX™ 通过为社区提供开放、可复现的基准测试来衡量真正重要的指标：真实吞吐量、成本效率以及最新硬件和软件栈上的每瓦性能，与这一使命高度契合。团队可以基于透明、持续更新的数据做出明智的算力选择。',
     name: 'Stephen Balaban',
     title: 'Co-founder and CEO, Lambda',
     org: 'Lambda',
@@ -177,7 +219,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/sbalaban/',
   },
   {
-    text: 'When we introduced DistServe, the thesis was simple: split prefill and decode and optimize each on its own terms. Eighteen months later, disaggregation is the default architecture across the industry. InferenceX\u2122 is the benchmark that comparing disaggregated and aggregated serving across the whole pareto curve. InferenceX shows exactly when and where P/D separation pays off in TTFT, TPOT, throughput, and cost.',
+    text: 'When we introduced DistServe, the thesis was simple: split prefill and decode and optimize each on its own terms. Eighteen months later, disaggregation is the default architecture across the industry. InferenceX™ is the benchmark that comparing disaggregated and aggregated serving across the whole pareto curve. InferenceX shows exactly when and where P/D separation pays off in TTFT, TPOT, throughput, and cost.',
+    textZh:
+      '当我们推出 DistServe 时，核心论点很简单：将预填充和解码分离，分别优化。十八个月后，解聚已成为行业默认架构。InferenceX™ 是在整条帕累托曲线上对比解聚与聚合服务的基准测试。InferenceX 精确展示了 P/D 分离在 TTFT、TPOT、吞吐量和成本方面何时何地带来收益。',
     name: 'Hao Zhang',
     title: 'Assistant Professor, UC San Diego & Co-Creator of DistServe, vLLM, and FastVideo',
     org: 'UC San Diego',
@@ -186,6 +230,7 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'The benchmark is good sir',
+    textZh: '这基准测试真不错',
     name: 'Michael Goin',
     title: 'vLLM Core Maintainer & Senior Principal Engineer at Red Hat',
     org: 'Red Hat',
@@ -194,6 +239,7 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'Now commonly hearing "We want the Semianalysis for X". Testament to what @dylan522p has built.',
+    textZh: '现在经常听到"我们想要 X 领域的 SemiAnalysis"。这是对 @dylan522p 所构建之物的最好证明。',
     name: 'Sriram Krishnan',
     title: 'White House Senior AI Advisor',
     org: 'White House',
@@ -202,6 +248,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'Open collaboration is driving the next era of AI innovation. The open-source InferenceMAX benchmark gives the community transparent, nightly results that inspire trust and accelerate progress. It highlights the competitive TCO performance of our AMD Instinct MI300, MI325X, and MI355X GPUs across diverse workloads, underscoring the strength of our platform and our commitment to giving developers real-time visibility into our software progress.',
+    textZh:
+      '开放协作正在推动 AI 创新的下一个时代。开源的 InferenceMAX 基准测试为社区提供透明的每夜结果，激发信任并加速进步。它突出了我们 AMD Instinct MI300、MI325X 和 MI355X GPU 在多样化工作负载中极具竞争力的 TCO 表现，彰显了我们平台的实力以及我们致力于让开发者实时了解软件进展的承诺。',
     name: 'Dr. Lisa Su',
     title: 'Chair and CEO, AMD',
     org: 'AMD',
@@ -209,7 +257,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/lisasu-amd/',
   },
   {
-    text: "Inference demand is growing exponentially, driven by long-context reasoning. NVIDIA Grace Blackwell NVL72 was invented for this new era of thinking AI. NVIDIA is meeting that demand through constant hardware and software innovation to enable what's next in AI. By benchmarking frequently, InferenceMAX\u2122 gives the industry a transparent view of LLM inference performance on real-world workloads. The results are clear: Grace Blackwell NVL72 with TRT-LLM and Dynamo delivers unmatched performance per dollar and per megawatt\u2014powering the most productive and cost-effective AI factories in the world.",
+    text: "Inference demand is growing exponentially, driven by long-context reasoning. NVIDIA Grace Blackwell NVL72 was invented for this new era of thinking AI. NVIDIA is meeting that demand through constant hardware and software innovation to enable what's next in AI. By benchmarking frequently, InferenceMAX™ gives the industry a transparent view of LLM inference performance on real-world workloads. The results are clear: Grace Blackwell NVL72 with TRT-LLM and Dynamo delivers unmatched performance per dollar and per megawatt—powering the most productive and cost-effective AI factories in the world.",
+    textZh:
+      '推理需求在长上下文推理的驱动下呈指数级增长。NVIDIA Grace Blackwell NVL72 正是为这个思考型 AI 的新时代而生。NVIDIA 通过持续的硬件和软件创新来满足这一需求，推动 AI 的下一步发展。通过高频基准测试，InferenceMAX™ 为行业提供了 LLM 推理在真实工作负载上性能的透明视角。结果一目了然：Grace Blackwell NVL72 搭配 TRT-LLM 和 Dynamo 提供了无与伦比的每美元性能和每兆瓦性能——驱动着全球最高效、最具成本效益的 AI 工厂。',
     name: 'Jensen Huang',
     title: 'Founder & CEO, NVIDIA',
     org: 'NVIDIA',
@@ -217,7 +267,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/jenhsunhuang/',
   },
   {
-    text: "Speed is the moat. InferenceMAX\u2122's nightly benchmarks match the speed of improvement of the AMD software stack. It's fantastic to see AMD's MI300, MI325, and MI355 GPUs performing so well across diverse workloads and interactivity levels.",
+    text: "Speed is the moat. InferenceMAX™'s nightly benchmarks match the speed of improvement of the AMD software stack. It's fantastic to see AMD's MI300, MI325, and MI355 GPUs performing so well across diverse workloads and interactivity levels.",
+    textZh:
+      '速度就是护城河。InferenceMAX™ 的每夜基准测试与 AMD 软件栈的改进速度同步。看到 AMD MI300、MI325 和 MI355 GPU 在多样化工作负载和交互级别上表现如此出色，令人振奋。',
     name: 'Anush Elangovan',
     title: 'VP GPU Software, AMD',
     org: 'AMD',
@@ -225,7 +277,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/anushelangovan/',
   },
   {
-    text: 'InferenceMAX\u2122 highlights workloads that the ML community cares about. At NVIDIA, we welcome these comparisons because they underscore the advantage of our full-stack approach\u2014from GPUs hardware to NVLink networking to NVL72 Rack Scale to Dynamo disaggregated serving that consistently delivers industry-leading inference performance and ROI at scale.',
+    text: 'InferenceMAX™ highlights workloads that the ML community cares about. At NVIDIA, we welcome these comparisons because they underscore the advantage of our full-stack approach—from GPUs hardware to NVLink networking to NVL72 Rack Scale to Dynamo disaggregated serving that consistently delivers industry-leading inference performance and ROI at scale.',
+    textZh:
+      'InferenceMAX™ 聚焦机器学习社区关注的工作负载。在 NVIDIA，我们欢迎这些对比，因为它们凸显了我们全栈方案的优势——从 GPU 硬件到 NVLink 网络，到 NVL72 机架级系统，再到 Dynamo 解聚服务，持续提供业界领先的推理性能和大规模投资回报率。',
     name: 'Ian Buck',
     title: 'VP & GM, Hyperscale, NVIDIA & Inventor of CUDA',
     org: 'NVIDIA',
@@ -233,7 +287,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/ian-buck-19201315/',
   },
   {
-    text: "InferenceMAX\u2122's nightly results highlight the rapid pace of progress in the AMD software stack. It's exciting to witness the birth of an open project that provides a tied feedback loop between what the software team works on here at AMD and how it affects specific ML use cases across our MI300, MI325, and MI355 GPUs. I'm looking forward to see what's next for InferenceMAX and to showcase what the AMD platform can do. AMD GPUs will continue to get faster every week.",
+    text: "InferenceMAX™'s nightly results highlight the rapid pace of progress in the AMD software stack. It's exciting to witness the birth of an open project that provides a tied feedback loop between what the software team works on here at AMD and how it affects specific ML use cases across our MI300, MI325, and MI355 GPUs. I'm looking forward to see what's next for InferenceMAX and to showcase what the AMD platform can do. AMD GPUs will continue to get faster every week.",
+    textZh:
+      'InferenceMAX™ 的每夜结果突出展示了 AMD 软件栈的快速进步。能够见证一个开源项目的诞生令人兴奋——它在 AMD 软件团队的工作与其对我们 MI300、MI325 和 MI355 GPU 上特定机器学习用例的影响之间建立了紧密的反馈闭环。我期待看到 InferenceMAX 的下一步发展，并展示 AMD 平台的能力。AMD GPU 将持续每周变得更快。',
     name: 'Quentin Colombet',
     title: 'Senior Director, AMD, Ex-Brium CEO',
     org: 'AMD',
@@ -241,7 +297,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/quentincolombet/',
   },
   {
-    text: "At Crusoe, we believe being a great partner means empowering our customers with choice and clarity. That's why we're proud to support InferenceMAX\u2122, which provides the entire AI community with open-source, reproducible benchmarks for the latest hardware. By delivering transparent, real-world data on throughput, efficiency, and cost, InferenceMAX\u2122 cuts through the hype and helps our customers confidently select the very best platform for their unique workloads.",
+    text: "At Crusoe, we believe being a great partner means empowering our customers with choice and clarity. That's why we're proud to support InferenceMAX™, which provides the entire AI community with open-source, reproducible benchmarks for the latest hardware. By delivering transparent, real-world data on throughput, efficiency, and cost, InferenceMAX™ cuts through the hype and helps our customers confidently select the very best platform for their unique workloads.",
+    textZh:
+      '在 Crusoe，我们相信成为优秀合作伙伴意味着赋予客户选择权和清晰度。这就是我们自豪地支持 InferenceMAX™ 的原因——它为整个 AI 社区提供最新硬件上开源、可复现的基准测试。通过提供关于吞吐量、效率和成本的透明真实数据，InferenceMAX™ 穿透炒作，帮助客户自信地为其独特工作负载选择最佳平台。',
     name: 'Chase Lochmiller',
     title: 'Co-Founder & CEO, Crusoe',
     org: 'Crusoe',
@@ -249,7 +307,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/chase-lochmiller-604483341/',
   },
   {
-    text: 'Supermicro is excited about the launch of InferenceMAX\u2122, the SemiAnalysis benchmarking system that measures real-world throughput, performance per dollar, and energy efficiency. This open-source tool provides reproducible benchmarks running on the latest hardware and software enabling AI labs and enterprises to choose the best platforms at scale.',
+    text: 'Supermicro is excited about the launch of InferenceMAX™, the SemiAnalysis benchmarking system that measures real-world throughput, performance per dollar, and energy efficiency. This open-source tool provides reproducible benchmarks running on the latest hardware and software enabling AI labs and enterprises to choose the best platforms at scale.',
+    textZh:
+      'Supermicro 对 InferenceMAX™ 的发布感到振奋——这是 SemiAnalysis 的基准测试系统，衡量真实吞吐量、每美元性能和能效。这一开源工具在最新硬件和软件上提供可复现的基准测试，帮助 AI 实验室和企业在大规模场景中选择最佳平台。',
     name: 'Charles Liang',
     title: 'Founder & CEO, Supermicro',
     org: 'Supermicro',
@@ -257,7 +317,9 @@ export const QUOTES: Quote[] = [
     link: 'https://en.wikipedia.org/wiki/Charles_Liang',
   },
   {
-    text: 'Vultr is committed to providing an open ecosystem that gives developers freedom in how they build and scale AI \u2014 whether on NVIDIA or AMD GPUs. With InferenceMAX\u2122, customers gain open, reproducible benchmarks that deliver clear insights into throughput, efficiency, and cost across cutting-edge hardware and software. By showcasing real-world performance, we empower teams to confidently choose the right platform for their AI workloads.',
+    text: 'Vultr is committed to providing an open ecosystem that gives developers freedom in how they build and scale AI — whether on NVIDIA or AMD GPUs. With InferenceMAX™, customers gain open, reproducible benchmarks that deliver clear insights into throughput, efficiency, and cost across cutting-edge hardware and software. By showcasing real-world performance, we empower teams to confidently choose the right platform for their AI workloads.',
+    textZh:
+      'Vultr 致力于提供一个开放生态，让开发者自由选择如何构建和扩展 AI——无论是在 NVIDIA 还是 AMD GPU 上。借助 InferenceMAX™，客户获得开放、可复现的基准测试，对前沿硬件和软件的吞吐量、效率与成本提供清晰洞察。通过展示真实性能，我们赋能团队自信地为其 AI 工作负载选择合适的平台。',
     name: 'Nathan Goulding',
     title: 'SVP of Engineering, Vultr',
     org: 'Vultr',
@@ -265,7 +327,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/nathangoulding/',
   },
   {
-    text: "At Prime Intellect, we're pushing the frontier of AI post-training and open research. InferenceX\u2122 complements that work by providing open, reproducible benchmarks that track real-world inference performance across hardware and software stacks as they evolve. For researchers like us, having transparent, continuously updated data on throughput and efficiency means we can focus on building better models instead of second-guessing infrastructure. This is the kind of community-driven effort that accelerates progress for everyone.",
+    text: "At Prime Intellect, we're pushing the frontier of AI post-training and open research. InferenceX™ complements that work by providing open, reproducible benchmarks that track real-world inference performance across hardware and software stacks as they evolve. For researchers like us, having transparent, continuously updated data on throughput and efficiency means we can focus on building better models instead of second-guessing infrastructure. This is the kind of community-driven effort that accelerates progress for everyone.",
+    textZh:
+      '在 Prime Intellect，我们正在推动 AI 后训练和开放研究的前沿。InferenceX™ 通过提供开放、可复现的基准测试来追踪推理性能在不断演进的硬件和软件栈上的真实表现，与我们的工作形成互补。对于像我们这样的研究者，拥有关于吞吐量和效率的透明、持续更新的数据意味着我们可以专注于构建更好的模型，而不必为基础设施选择纠结。这正是加速每个人进步的社区驱动力量。',
     name: 'Jack Min Ong',
     title: 'Researcher, Prime Intellect',
     org: 'Prime Intellect',
@@ -273,7 +337,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/jackminong/',
   },
   {
-    text: "At Firmus, we're building the most energy-efficient AI Factories in the world \u2014 and efficiency only matters if you can measure it. InferenceX\u2122 gives the industry open, reproducible benchmarks that track real-world throughput, cost, and performance per watt across the latest GPU platforms and software stacks. As we scale gigawatts of renewable-powered AI infrastructure across Asia-Pacific & Australia, this kind of transparent, continuously updated data helps the entire ecosystem understand what these systems actually deliver.",
+    text: "At Firmus, we're building the most energy-efficient AI Factories in the world — and efficiency only matters if you can measure it. InferenceX™ gives the industry open, reproducible benchmarks that track real-world throughput, cost, and performance per watt across the latest GPU platforms and software stacks. As we scale gigawatts of renewable-powered AI infrastructure across Asia-Pacific & Australia, this kind of transparent, continuously updated data helps the entire ecosystem understand what these systems actually deliver.",
+    textZh:
+      '在 Firmus，我们正在建造全球最节能的 AI 工厂——而效率只有在可衡量时才有意义。InferenceX™ 为行业提供开放、可复现的基准测试，追踪最新 GPU 平台和软件栈上的真实吞吐量、成本和每瓦性能。随着我们在亚太和澳洲扩展吉瓦级可再生能源驱动的 AI 基础设施，这种透明、持续更新的数据帮助整个生态了解这些系统的实际交付能力。',
     name: 'Tim Rosenfield',
     title: 'Co-Founder & Co-CEO, Firmus',
     org: 'Firmus',
@@ -282,6 +348,7 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'InferenceMAX has been useful for us even if Dylan Patel is a nice little guy with feelings',
+    textZh: 'InferenceMAX 对我们很有用，即使 Dylan Patel 是个有感情的可爱小伙子',
     name: 'Matthew Leavitt',
     title: 'Chief Science Officer, DatologyAI',
     org: 'DatologyAI',
@@ -289,14 +356,18 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/matthew-leavitt-6797703b/',
   },
   {
-    text: "InferenceX\u2122 provides the open source measurements the community needs \u2014 nightly results across real workloads, real hardware, and real software stacks. As someone who has written extensively about the gap between theoretical and actual system performance, I'm glad to see a project that makes that gap visible and trackable for everyone.",
+    text: "InferenceX™ provides the open source measurements the community needs — nightly results across real workloads, real hardware, and real software stacks. As someone who has written extensively about the gap between theoretical and actual system performance, I'm glad to see a project that makes that gap visible and trackable for everyone.",
+    textZh:
+      'InferenceX™ 提供了社区所需的开源测量——真实工作负载、真实硬件和真实软件栈上的每夜结果。作为一位大量撰写过理论性能与实际系统性能差距的人，我很高兴看到一个让这种差距对每个人都清晰可见、可追踪的项目。',
     name: 'Stas Bekman',
     title: 'Developer & Author of Machine Learning Engineering Open Book (17.5K+ ⭐)',
     org: 'Stas Bekman',
     link: 'https://github.com/stas00/ml-engineering',
   },
   {
-    text: 'We use InferenceX benchmarks ourselves as one of the key datapoints to help us make infrastructure decisions at Adaptive ML. Inference performance is critical for large-scale RL workloads, where fast generation directly impacts time to market & revenue for our customers. InferenceX\u2122 benchmarks the full stack continuously \u2014 engine, model, software, and hardware across rack-scale systems like GB300 NVL72. This is the kind of open, transparent, reproducible signal the ecosystem has been missing.',
+    text: 'We use InferenceX benchmarks ourselves as one of the key datapoints to help us make infrastructure decisions at Adaptive ML. Inference performance is critical for large-scale RL workloads, where fast generation directly impacts time to market & revenue for our customers. InferenceX™ benchmarks the full stack continuously — engine, model, software, and hardware across rack-scale systems like GB300 NVL72. This is the kind of open, transparent, reproducible signal the ecosystem has been missing.',
+    textZh:
+      '我们在 Adaptive ML 自己也使用 InferenceX 基准测试作为帮助我们做出基础设施决策的关键数据点之一。推理性能对于大规模强化学习工作负载至关重要，快速生成直接影响客户的上市时间和收入。InferenceX™ 持续对全栈进行基准测试——引擎、模型、软件和硬件，覆盖 GB300 NVL72 等机架级系统。这正是生态一直缺少的那种开放、透明、可复现的信号。',
     name: 'Julien Launay',
     title: 'Co-Founder & CEO, Adaptive ML',
     org: 'Adaptive ML',
@@ -304,7 +375,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/julienlaunay/',
   },
   {
-    text: "Our customers ship AI to production using frontier open-source models \u2014 and at scale, every token per second and every dollar per million tokens matters. InferenceX\u2122 gives the ecosystem something we've always needed: an objective, open benchmark that tracks real inference performance continuously across hardware such as GB300 NVL72, GB200 NVL72, H100 & soon Rubin & TPU & Trainium. Very helpful in allowing the wider community to understand the landscape and creating a clear taxonomy around performance.",
+    text: "Our customers ship AI to production using frontier open-source models — and at scale, every token per second and every dollar per million tokens matters. InferenceX™ gives the ecosystem something we've always needed: an objective, open benchmark that tracks real inference performance continuously across hardware such as GB300 NVL72, GB200 NVL72, H100 & soon Rubin & TPU & Trainium. Very helpful in allowing the wider community to understand the landscape and creating a clear taxonomy around performance.",
+    textZh:
+      '我们的客户使用前沿开源模型将 AI 投入生产——在大规模场景中，每秒每个 token 和每百万 token 的每一美元都至关重要。InferenceX™ 为生态提供了我们一直需要的东西：一个客观、开放的基准测试，持续追踪 GB300 NVL72、GB200 NVL72、H100 以及即将到来的 Rubin、TPU 和 Trainium 等硬件上的真实推理性能。这对帮助更广泛的社区理解行业格局并建立清晰的性能分类体系非常有价值。',
     name: 'Alex Ker',
     title: 'Engineer, Baseten',
     org: 'Baseten',
@@ -313,6 +386,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'We founded Verda to give AI engineers frictionless access to cutting-edge compute without gatekeeping. InferenceX supports this mission by giving AI builders open, reproducible benchmarks that show what GPUs actually deliver under real inference workloads. We want our customers to see transparent, continuously updated performance data, without marketing fluff. InferenceX provides exactly that.',
+    textZh:
+      '我们创立 Verda 是为了让 AI 工程师无障碍地使用前沿算力，没有门槛。InferenceX 通过为 AI 构建者提供开放、可复现的基准测试来支持这一使命，展示 GPU 在真实推理工作负载下的实际交付能力。我们希望客户看到透明、持续更新的性能数据，没有营销虚辞。InferenceX 恰好提供了这一切。',
     name: 'Ruben Bryon',
     title: 'Founder & CEO, Verda',
     org: 'Verda',
@@ -320,7 +395,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/ruben-bryon/',
   },
   {
-    text: 'Voltage Park is built to give AI teams fast, affordable access to GPU compute at scale. InferenceX\u2122 supports that goal by providing open, reproducible benchmarks that show how inference actually performs across the latest hardware and software stacks. With transparent, continuously updated data on throughput, efficiency, and cost, teams can make confident compute decisions instead of guessing. We\u2019re happy to back an effort that brings this level of clarity to the ecosystem.',
+    text: 'Voltage Park is built to give AI teams fast, affordable access to GPU compute at scale. InferenceX™ supports that goal by providing open, reproducible benchmarks that show how inference actually performs across the latest hardware and software stacks. With transparent, continuously updated data on throughput, efficiency, and cost, teams can make confident compute decisions instead of guessing. We’re happy to back an effort that brings this level of clarity to the ecosystem.',
+    textZh:
+      'Voltage Park 旨在为 AI 团队提供快速、经济的大规模 GPU 算力。InferenceX™ 通过提供开放、可复现的基准测试来展示推理在最新硬件和软件栈上的实际表现，有力支持了这一目标。凭借关于吞吐量、效率和成本的透明、持续更新数据，团队可以自信地做出算力决策而非凭空猜测。我们很高兴支持一项为生态带来如此清晰度的工作。',
     name: 'Saurabh Giri',
     title: 'CTO, Voltage Park',
     org: 'Voltage Park',
@@ -328,7 +405,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/saurabh-giri/',
   },
   {
-    text: "At Periodic Labs, we're building AI scientists that turn compute into real-world scientific discoveries. That means we care deeply about what each GPU actually delivers. InferenceX\u2122 provides open, reproducible benchmarks that cut through spec sheets and show real-world throughput, efficiency, and cost across the latest hardware and software stacks. Having done inference across thousands of GPUs, I can say this kind of transparent, continuously updated data is exactly what practitioners need to make smart infrastructure decisions.",
+    text: "At Periodic Labs, we're building AI scientists that turn compute into real-world scientific discoveries. That means we care deeply about what each GPU actually delivers. InferenceX™ provides open, reproducible benchmarks that cut through spec sheets and show real-world throughput, efficiency, and cost across the latest hardware and software stacks. Having done inference across thousands of GPUs, I can say this kind of transparent, continuously updated data is exactly what practitioners need to make smart infrastructure decisions.",
+    textZh:
+      '在 Periodic Labs，我们正在构建将算力转化为真实科学发现的 AI 科学家。这意味着我们非常关注每块 GPU 的实际交付能力。InferenceX™ 提供开放、可复现的基准测试，穿透规格表，展示最新硬件和软件栈上的真实吞吐量、效率与成本。在数千块 GPU 上做过推理后，我可以说这种透明、持续更新的数据正是从业者做出明智基础设施决策所需要的。',
     name: 'Xander Dunn',
     title: 'Founding Team, Periodic Labs',
     org: 'Periodic Labs',
@@ -337,6 +416,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'As AI infrastructure scales globally, no single vendor or region can define the benchmarks that matter for everyone. InferenceX is an important step toward a shared, transparent view of inference performance and TCO, enabling more rational investments for sovereign AI Cloud operators, as well as healthier competition, and ultimately more accessible AI capacity worldwide.',
+    textZh:
+      '随着 AI 基础设施在全球范围内扩展，没有任何单一厂商或地区能够定义适用于所有人的基准测试。InferenceX 是朝着共享、透明的推理性能和 TCO 视角迈出的重要一步，为主权 AI 云运营商带来更理性的投资决策、更健康的竞争，并最终在全球范围内提供更可及的 AI 算力。',
     name: 'Talal M. Al Kaissi',
     title: 'CEO',
     org: 'Core42',
@@ -344,6 +425,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'It is important to have an open and continuously updated platform for benchmarking inference engines across real workloads and diverse hardware. InferenceX provides this kind of transparent and practical evaluation, helping the community better understand real system bottlenecks and tradeoffs. Benchmarks like this are essential for building more efficient and scalable AI systems. Moreover, as LLM agents become increasingly capable at improving systems, such a platform can provide the reliable feedback needed to close the automatic optimization loop, further driving progress in this field.',
+    textZh:
+      '拥有一个开放且持续更新的平台来对推理引擎在真实工作负载和多样化硬件上进行基准测试非常重要。InferenceX 提供了这种透明、实用的评估，帮助社区更好地理解真实系统瓶颈和权衡。这样的基准测试对于构建更高效、更可扩展的 AI 系统至关重要。此外，随着 LLM 智能体在改进系统方面日益强大，这样的平台可以提供闭合自动优化循环所需的可靠反馈，进一步推动该领域的进步。',
     name: 'Cao Shiyi',
     title: 'Researcher, Sky Computing Lab',
     org: 'UC Berkeley',
@@ -351,6 +434,8 @@ export const QUOTES: Quote[] = [
   },
   {
     text: 'At GMI Cloud, we believe inference has become the center of AI value creation. SemiAnalysis has done something the industry has long needed with InferenceX—they’ve turned inference from a black box into a continuously measured, real-world system. By benchmarking not just hardware, but the full stack—models, runtimes, and distributed systems—InferenceX reflects how AI actually runs in production, not how it’s marketed.',
+    textZh:
+      '在 GMI Cloud，我们认为推理已成为 AI 价值创造的核心。SemiAnalysis 通过 InferenceX 做了行业期盼已久的事——将推理从一个黑箱变成了一个被持续衡量的真实系统。InferenceX 不仅对硬件进行基准测试，还覆盖完整技术栈——模型、运行时和分布式系统，反映的是 AI 在生产中的实际运行方式，而非营销宣传。',
     name: 'Alex Yeh',
     title: 'Founder & CEO, GMI Cloud',
     org: 'GMI Cloud',
@@ -358,7 +443,9 @@ export const QUOTES: Quote[] = [
     link: 'https://www.linkedin.com/in/gmi-yeh',
   },
   {
-    text: 'At EmbeddedLLM, our team works deep in the production inference stack, including major maintainer and contributor work in vLLM, so we see every day how much real-world AI performance depends on the full system: model, runtime, kernels, scheduling, and hardware. InferenceX\u2122 matters because it benchmarks that full system continuously and openly. It turns inference from a marketing conversation into an engineering discipline, giving AI labs, neoclouds, and enterprises the data they need to make decisions on throughput, cost, and efficiency at production scale.',
+    text: 'At EmbeddedLLM, our team works deep in the production inference stack, including major maintainer and contributor work in vLLM, so we see every day how much real-world AI performance depends on the full system: model, runtime, kernels, scheduling, and hardware. InferenceX™ matters because it benchmarks that full system continuously and openly. It turns inference from a marketing conversation into an engineering discipline, giving AI labs, neoclouds, and enterprises the data they need to make decisions on throughput, cost, and efficiency at production scale.',
+    textZh:
+      '在 EmbeddedLLM，我们的团队深耕于生产推理栈，包括 vLLM 的核心维护和贡献工作，因此我们每天都能看到真实 AI 性能在多大程度上取决于完整系统：模型、运行时、内核、调度和硬件。InferenceX™ 之所以重要，是因为它持续且公开地对完整系统进行基准测试。它将推理从营销话题转变为工程学科，为 AI 实验室、新型云服务商和企业提供在生产规模上做出吞吐量、成本和效率决策所需的数据。',
     name: 'Pin Siang Tan',
     title: 'Co-founder & CTO, EmbeddedLLM',
     org: 'EmbeddedLLM',
diff --git a/packages/app/src/components/set-document-lang.tsx b/packages/app/src/components/set-document-lang.tsx
new file mode 100644
index 00000000..8ef3f323
--- /dev/null
+++ b/packages/app/src/components/set-document-lang.tsx
@@ -0,0 +1,22 @@
+'use client';
+
+import { useEffect } from 'react';
+
+/**
+ * The root layout hardcodes <html lang="en"> and Next.js offers no supported
+ * way to override it per route segment without splitting into multiple root
+ * layouts. The /zh layout renders this to stamp the document language after
+ * hydration (crawlers detect page language from content and hreflang, so the
+ * pre-hydration attribute is not SEO-relevant; this keeps a11y tools correct).
+ */
+export function SetDocumentLang({ lang }: { lang: string }) {
+  useEffect(() => {
+    const previous = document.documentElement.lang;
+    document.documentElement.lang = lang;
+    return () => {
+      document.documentElement.lang = previous;
+    };
+  }, [lang]);
+
+  return null;
+}
diff --git a/packages/app/src/components/tab-nav.tsx b/packages/app/src/components/tab-nav.tsx
index ce5f3257..00aeb03e 100644
--- a/packages/app/src/components/tab-nav.tsx
+++ b/packages/app/src/components/tab-nav.tsx
@@ -21,6 +21,8 @@ import {
   SelectValue,
 } from '@/components/ui/select';
 import { UnofficialRunContext } from '@/components/unofficial-run-provider';
+import { isZhPathname, ZH_PREFIX } from '@/lib/i18n';
+import { TAB_LABELS_ZH } from '@/lib/tab-meta-zh';
 import { cn } from '@/lib/utils';
 
 const VISIBLE_TABS = [
@@ -61,8 +63,9 @@ const currentTabClass = (active: boolean) =>
     : 'hover:border-muted-foreground/30';
 
 function activeTab(pathname: string): string {
-  const seg = pathname.split('/').filter(Boolean)[0] || 'inference';
-  return seg;
+  const segments = pathname.split('/').filter(Boolean);
+  if (segments[0] === ZH_PREFIX.slice(1)) segments.shift();
+  return segments[0] || 'inference';
 }
 
 function handleDesktopClick(tab: string) {
@@ -74,8 +77,14 @@ export function TabNav() {
   const pathname = usePathname();
   const router = useRouter();
   const featureGateUnlocked = useFeatureGate();
+  const isZh = isZhPathname(pathname);
   const current = activeTab(pathname);
   const selectedTab = TAB_VALUES.has(current) ? current : '';
+  // On /zh pages, visible tabs navigate within the Chinese tree and show
+  // Chinese labels. Gated tabs have no /zh sibling and keep English targets.
+  const tabLabel = (tab: { href: string; label: string }) =>
+    isZh ? (TAB_LABELS_ZH[tab.href.slice(1)] ?? tab.label) : tab.label;
+  const localizedPath = (path: string) => (isZh ? `${ZH_PREFIX}${path}` : path);
 
   // Preserve the `unofficialrun(s)` URL param across tab navigation so an
   // overlay loaded on /inference doesn't get dropped when switching to
@@ -108,7 +117,7 @@ export function TabNav() {
   const handleMobileChange = (value: string) => {
     window.dispatchEvent(new CustomEvent('inferencex:tab-change'));
     track('tab_changed', { tab: value });
-    router.push(tabHref(`/${value}`));
+    router.push(tabHref(GATED_VALUES.has(value) ? `/${value}` : localizedPath(`/${value}`)));
   };
 
   return (
@@ -118,17 +127,17 @@ export function TabNav() {
         <div className="w-full pb-6" />
         <Card>
           <div className="space-y-2">
-            <Label htmlFor="chart-select">Select Chart</Label>
+            <Label htmlFor="chart-select">{isZh ? '选择图表' : 'Select Chart'}</Label>
             <Select value={selectedTab} onValueChange={handleMobileChange}>
               <SelectTrigger id="chart-select" data-testid="mobile-chart-select" className="w-full">
-                <SelectValue placeholder="Select Chart" />
+                <SelectValue placeholder={isZh ? '选择图表' : 'Select Chart'} />
               </SelectTrigger>
               <SelectContent>
                 {VISIBLE_TABS.map((tab) => {
                   const value = tab.href.slice(1);
                   return (
                     <SelectItem key={value} value={value} data-ph-capture-attribute-tab={value}>
-                      {tab.label}
+                      {tabLabel(tab)}
                     </SelectItem>
                   );
                 })}
@@ -168,13 +177,13 @@ export function TabNav() {
             {VISIBLE_TABS.map((tab) => (
               <Link
                 key={tab.href}
-                href={tabHref(tab.href)}
+                href={tabHref(localizedPath(tab.href))}
                 data-testid={tab.testId}
                 data-ph-capture-attribute-tab={tab.href.slice(1)}
                 onClick={() => handleDesktopClick(tab.href.slice(1))}
                 className={cn(tabLinkClass, currentTabClass(current === tab.href.slice(1)))}
               >
-                {tab.label}
+                {tabLabel(tab)}
               </Link>
             ))}
             {featureGateUnlocked && (
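
Review note on the tab-nav change: the locale handling above reduces to two small pure functions, sketched standalone below. This is an illustration under assumptions, not the component's code. `GATED_TABS` and the tab name `gated-example` are hypothetical stand-ins for the real `GATED_VALUES` set, whose contents are not shown in this diff.

```typescript
// Standalone sketch of the locale-aware tab logic; not the actual component.
const ZH_PREFIX = '/zh';

// Hypothetical gated tab name, for illustration only.
const GATED_TABS: ReadonlySet<string> = new Set(['gated-example']);

// Drop a leading "zh" segment so /zh/evaluation and /evaluation
// resolve to the same active tab.
function activeTab(pathname: string): string {
  const segments = pathname.split('/').filter(Boolean);
  if (segments[0] === ZH_PREFIX.slice(1)) segments.shift();
  return segments[0] || 'inference';
}

// Visible tabs stay inside the Chinese tree; gated tabs have no /zh
// sibling, so they keep their English target.
function tabTarget(tab: string, isZh: boolean): string {
  const path = `/${tab}`;
  return isZh && !GATED_TABS.has(tab) ? `${ZH_PREFIX}${path}` : path;
}

console.log(activeTab('/zh'));                  // "inference"
console.log(activeTab('/zh/evaluation'));       // "evaluation"
console.log(tabTarget('historical', true));     // "/zh/historical"
console.log(tabTarget('gated-example', true));  // "/gated-example"
```

Keeping both helpers pure makes the /zh behaviour testable without mounting the component or mocking `usePathname`.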
diff --git a/packages/app/src/components/zh/zh-tab-intro.tsx b/packages/app/src/components/zh/zh-tab-intro.tsx
new file mode 100644
index 00000000..e39eee80
--- /dev/null
+++ b/packages/app/src/components/zh/zh-tab-intro.tsx
@@ -0,0 +1,19 @@
+import { Card } from '@/components/ui/card';
+import { TAB_INTRO_ZH, TAB_META_ZH, type ZhTabKey } from '@/lib/tab-meta-zh';
+
+/**
+ * Server-rendered Chinese intro above the interactive dashboard on /zh tab
+ * pages. The charts below render in English; this block gives crawlers and
+ * readers genuine Chinese content describing what the page shows.
+ */
+export function ZhTabIntro({ tab }: { tab: ZhTabKey }) {
+  return (
+    <Card data-testid="zh-tab-intro">
+      <h1 className="text-xl lg:text-2xl font-bold tracking-tight">{TAB_META_ZH[tab].title}</h1>
+      <p className="mt-2 text-sm lg:text-base text-muted-foreground">{TAB_INTRO_ZH[tab]}</p>
+      <p className="mt-2 text-xs text-muted-foreground">
+        下方交互式图表界面目前为英文。图表中的模型、GPU 与框架名称均为业界通用英文名称。
+      </p>
+    </Card>
+  );
+}
diff --git a/packages/app/src/lib/blog.test.ts b/packages/app/src/lib/blog.test.ts
index 2a50b288..67633e87 100644
--- a/packages/app/src/lib/blog.test.ts
+++ b/packages/app/src/lib/blog.test.ts
@@ -7,6 +7,7 @@ import {
   getAllPosts,
   getPostBySlug,
   getReadingTime,
+  hasZhTranslation,
   slugify,
 } from './blog';
 
@@ -80,11 +81,44 @@ date: '2025-08-01'
 This post has no publishDate field at all.
 `;
 
+const FAKE_MDX_ZH = `---
+title: '测试文章'
+subtitle: '一个测试副标题'
+date: '2026-01-15'
+tags:
+  - testing
+---
+
+# 测试标题
+
+这是一段中文正文内容。
+`;
+
 vi.mock('node:fs', async (importOriginal) => {
   const actual = await importOriginal<typeof fs>();
   return { ...actual, default: { ...actual } };
 });
 
+const isZhPath = (p: string) => p.includes('/zh/');
+
+/**
+ * Mock the content tree with English posts and a zh/ translations subdir.
+ * Directory existence checks return true; file checks consult the maps.
+ */
+function mockLocalizedFiles(en: Record<string, string>, zh: Record<string, string>) {
+  const lookup = (p: string) => {
+    const files = isZhPath(p) ? zh : en;
+    return Object.entries(files).find(([name]) => p.includes(name.replace('.mdx', '')))?.[1];
+  };
+  vi.spyOn(fs, 'existsSync').mockImplementation((filePath) => {
+    const p = String(filePath);
+    if (!p.endsWith('.mdx')) return true;
+    return lookup(p) !== undefined;
+  });
+  vi.spyOn(fs, 'readdirSync').mockReturnValue(Object.keys(en) as any);
+  vi.spyOn(fs, 'readFileSync').mockImplementation((filePath) => lookup(String(filePath)) ?? '');
+}
+
 beforeEach(() => {
   vi.restoreAllMocks();
 });
@@ -112,6 +146,12 @@ describe('slugify', () => {
   it('passes through already-valid slugs unchanged', () => {
     expect(slugify('hello-world')).toBe('hello-world');
   });
+
+  it('keeps Han characters so Chinese headings get meaningful ids', () => {
+    expect(slugify('性能 分析')).toBe('性能-分析');
+    expect(slugify('GB200 性能对比')).toBe('gb200-性能对比');
+    expect(slugify('（结论）')).toBe('结论');
+  });
 });
 
 describe('getReadingTime', () => {
@@ -124,6 +164,19 @@ describe('getReadingTime', () => {
     // 500 words / 265 wpm = 1.89 → ceil = 2
     expect(getReadingTime(words)).toBe(2);
   });
+
+  it('counts CJK prose by characters, not whitespace-separated words', () => {
+    // 800 Han chars with no spaces: 800 / 400 cpm = 2 minutes. The old
+    // word-split logic would have counted this as a single "word" → 1 minute.
+    const cjk = '推'.repeat(800);
+    expect(getReadingTime(cjk)).toBe(2);
+  });
+
+  it('combines CJK characters and Latin words in mixed content', () => {
+    // 400 Han chars (1 min at 400 cpm) + 265 Latin words (1 min at 265 wpm)
+    const mixed = `${'理'.repeat(400)} ${Array.from({ length: 265 }, () => 'word').join(' ')}`;
+    expect(getReadingTime(mixed)).toBe(2);
+  });
 });
 
 describe('getAllPosts', () => {
@@ -279,6 +332,92 @@ describe('getAllPosts — publishDate filtering', () => {
   });
 });
 
+describe('getAllPosts — zh locale', () => {
+  afterEach(() => {
+    vi.unstubAllEnvs();
+  });
+
+  it('returns only posts with a zh translation, using zh frontmatter', () => {
+    mockLocalizedFiles(
+      { 'test-post.mdx': FAKE_MDX, 'older-post.mdx': FAKE_MDX_OLDER },
+      { 'test-post.mdx': FAKE_MDX_ZH },
+    );
+
+    const posts = getAllPosts('zh');
+    expect(posts).toHaveLength(1);
+    expect(posts[0].slug).toBe('test-post');
+    expect(posts[0].title).toBe('测试文章');
+    expect(posts[0].subtitle).toBe('一个测试副标题');
+  });
+
+  it('inherits publishDate gating from the English post in production', () => {
+    vi.stubEnv('NODE_ENV', 'production');
+    // English original is unpublished (no publishDate) — the zh translation
+    // must not leak even though its file exists.
+    mockLocalizedFiles(
+      { 'no-publish.mdx': FAKE_MDX_NO_PUBLISH },
+      { 'no-publish.mdx': FAKE_MDX_ZH },
+    );
+
+    expect(getAllPosts('zh')).toHaveLength(0);
+  });
+
+  it('keeps English getAllPosts unaffected by zh translations', () => {
+    mockLocalizedFiles({ 'test-post.mdx': FAKE_MDX }, { 'test-post.mdx': FAKE_MDX_ZH });
+
+    const posts = getAllPosts();
+    expect(posts).toHaveLength(1);
+    expect(posts[0].title).toBe('Test Post');
+  });
+});
+
+describe('getPostBySlug — zh locale', () => {
+  it('returns the zh translation meta and content', () => {
+    mockLocalizedFiles({ 'test-post.mdx': FAKE_MDX }, { 'test-post.mdx': FAKE_MDX_ZH });
+
+    const result = getPostBySlug('test-post', 'zh');
+    expect(result).not.toBeNull();
+    expect(result!.meta.title).toBe('测试文章');
+    expect(result!.raw).toContain('# 测试标题');
+  });
+
+  it('returns null when no zh translation exists', () => {
+    mockLocalizedFiles({ 'test-post.mdx': FAKE_MDX }, {});
+
+    expect(getPostBySlug('test-post', 'zh')).toBeNull();
+  });
+});
+
+describe('hasZhTranslation', () => {
+  it('reflects existence of the zh translation file', () => {
+    mockLocalizedFiles(
+      { 'test-post.mdx': FAKE_MDX, 'older-post.mdx': FAKE_MDX_OLDER },
+      { 'test-post.mdx': FAKE_MDX_ZH },
+    );
+
+    expect(hasZhTranslation('test-post')).toBe(true);
+    expect(hasZhTranslation('older-post')).toBe(false);
+  });
+});
+
+describe('getAdjacentPosts — zh locale', () => {
+  it('navigates within translated posts only', () => {
+    mockLocalizedFiles(
+      {
+        'test-post.mdx': FAKE_MDX,
+        'middle-post.mdx': FAKE_MDX_MIDDLE,
+        'older-post.mdx': FAKE_MDX_OLDER,
+      },
+      // middle-post has no translation: zh prev/next must skip over it.
+      { 'test-post.mdx': FAKE_MDX_ZH, 'older-post.mdx': FAKE_MDX_ZH },
+    );
+
+    const { prev, next } = getAdjacentPosts('test-post', 'zh');
+    expect(next).toBeNull();
+    expect(prev!.slug).toBe('older-post');
+  });
+});
+
 describe('getPostBySlug', () => {
   it('returns null for non-existent slug', () => {
     vi.spyOn(fs, 'existsSync').mockReturnValue(false);
diff --git a/packages/app/src/lib/blog.ts b/packages/app/src/lib/blog.ts
index 42c2df57..b69885ff 100644
--- a/packages/app/src/lib/blog.ts
+++ b/packages/app/src/lib/blog.ts
@@ -19,22 +19,37 @@ export interface BlogPostMeta extends BlogFrontmatter {
 
 const CONTENT_DIR = path.join(process.cwd(), 'content', 'blog');
 const WORDS_PER_MINUTE = 265;
+// CJK prose has no word boundaries; reading speed studies put Chinese at
+// roughly 300-500 characters per minute — we use a middle value.
+const CJK_CHARS_PER_MINUTE = 400;
+
+export type BlogLocale = 'en' | 'zh';
+
+/** Simplified Chinese translations live in content/blog/zh, paired by filename. */
+function contentDir(locale: BlogLocale): string {
+  return locale === 'zh' ? path.join(CONTENT_DIR, 'zh') : CONTENT_DIR;
+}
 
 export function slugify(raw: string): string {
   return (
     raw
       .toLowerCase()
-      .replaceAll(/[^a-z0-9]+/gu, '-')
+      // Keep Han characters so Chinese headings get meaningful anchor ids
+      // instead of all collapsing to the empty-slug fallback.
+      .replaceAll(/[^a-z0-9\p{Script=Han}]+/gu, '-')
       .replaceAll(/^-+|-+$/gu, '') || 'post'
   );
 }
 
+const CJK_CHAR_REGEX = /\p{Script=Han}/gu;
+
 export function getReadingTime(content: string): number {
-  const words = content.trim().split(/\s+/u).length;
-  return Math.max(1, Math.ceil(words / WORDS_PER_MINUTE));
+  const cjkChars = content.match(CJK_CHAR_REGEX)?.length ?? 0;
+  const words = content.replaceAll(CJK_CHAR_REGEX, ' ').trim().split(/\s+/u).filter(Boolean).length;
+  return Math.max(1, Math.ceil(words / WORDS_PER_MINUTE + cjkChars / CJK_CHARS_PER_MINUTE));
 }
 
-export function getAllPosts(): BlogPostMeta[] {
+export function getAllPosts(locale: BlogLocale = 'en'): BlogPostMeta[] {
   if (!fs.existsSync(CONTENT_DIR)) return [];
 
   const files = fs.readdirSync(CONTENT_DIR).filter((f) => f.endsWith('.mdx'));
@@ -57,7 +72,36 @@ export function getAllPosts(): BlogPostMeta[] {
       ? posts.filter((p) => Boolean(p.publishDate) && new Date(`${p.publishDate}T00:00:00Z`) <= now)
       : posts;
 
-  return visible.toSorted((a, b) => new Date(b.date).getTime() - new Date(a.date).getTime());
+  const sorted = visible.toSorted(
+    (a, b) => new Date(b.date).getTime() - new Date(a.date).getTime(),
+  );
+  if (locale === 'en') return sorted;
+
+  // Chinese visibility derives from the English post (single source of truth
+  // for publishDate gating) plus the existence of a translation file. Title,
+  // subtitle, and reading time come from the translation.
+  return sorted.flatMap((post) => {
+    const zh = readPost(post.slug, 'zh');
+    return zh ? [zh.meta] : [];
+  });
+}
+
+function readPost(slug: string, locale: BlogLocale): { meta: BlogPostMeta; raw: string } | null {
+  const safe = slugify(slug);
+  const filePath = path.join(contentDir(locale), `${safe}.mdx`);
+  if (!fs.existsSync(filePath)) return null;
+
+  const fileContent = fs.readFileSync(filePath, 'utf8');
+  const { data, content } = matter(fileContent);
+
+  return {
+    meta: {
+      ...(data as BlogFrontmatter),
+      slug: safe,
+      readingTime: getReadingTime(content),
+    },
+    raw: content,
+  };
 }
 
 export interface AdjacentPosts {
@@ -65,8 +109,8 @@ export interface AdjacentPosts {
   next: BlogPostMeta | null;
 }
 
-export function getAdjacentPosts(slug: string): AdjacentPosts {
-  const posts = getAllPosts();
+export function getAdjacentPosts(slug: string, locale: BlogLocale = 'en'): AdjacentPosts {
+  const posts = getAllPosts(locale);
   const index = posts.findIndex((p) => p.slug === slug);
   if (index === -1) return { prev: null, next: null };
   return {
@@ -105,20 +149,14 @@ export function extractHeadings(rawMdx: string): TocHeading[] {
   return headings;
 }
 
-export function getPostBySlug(slug: string): { meta: BlogPostMeta; raw: string } | null {
-  const safe = slugify(slug);
-  const filePath = path.join(CONTENT_DIR, `${safe}.mdx`);
-  if (!fs.existsSync(filePath)) return null;
-
-  const fileContent = fs.readFileSync(filePath, 'utf8');
-  const { data, content } = matter(fileContent);
+export function getPostBySlug(
+  slug: string,
+  locale: BlogLocale = 'en',
+): { meta: BlogPostMeta; raw: string } | null {
+  return readPost(slug, locale);
+}
 
-  return {
-    meta: {
-      ...(data as BlogFrontmatter),
-      slug: safe,
-      readingTime: getReadingTime(content),
-    },
-    raw: content,
-  };
+/** Whether a Simplified Chinese translation exists for a post (any visibility). */
+export function hasZhTranslation(slug: string): boolean {
+  return fs.existsSync(path.join(contentDir('zh'), `${slugify(slug)}.mdx`));
 }
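
Review note on the reading-time change: the blended estimate can be checked in isolation with the sketch below. It uses the same constants as the diff (265 words per minute, 400 Han characters per minute); the function name `estimateReadingTime` is a hypothetical stand-in, and `String.prototype.replace` with a global regex is used in place of `replaceAll` only to keep the snippet portable to older targets.

```typescript
// Standalone sketch of the blended reading-time estimate.
const WORDS_PER_MINUTE = 265;
const CJK_CHARS_PER_MINUTE = 400;
const HAN_CHAR = /\p{Script=Han}/gu;

function estimateReadingTime(content: string): number {
  // Han characters are counted one by one: CJK prose has no spaces to split on.
  const hanChars = content.match(HAN_CHAR)?.length ?? 0;
  // Swap each Han character for a space so it is not counted as a word
  // and does not glue neighbouring Latin words together.
  const words = content
    .replace(HAN_CHAR, ' ')
    .trim()
    .split(/\s+/u)
    .filter(Boolean).length;
  // Sum the two partial durations first, then round up once.
  return Math.max(1, Math.ceil(words / WORDS_PER_MINUTE + hanChars / CJK_CHARS_PER_MINUTE));
}

const latin = Array.from({ length: 265 }, () => 'word').join(' ');
console.log(estimateReadingTime('推'.repeat(800)));               // 2
console.log(estimateReadingTime(`${'理'.repeat(400)} ${latin}`)); // 2
console.log(estimateReadingTime('短文'));                         // 1
```

Rounding once after summing matters: rounding each part separately would report 2 minutes for a post with one Latin word and one Han character.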
diff --git a/packages/app/src/lib/i18n.test.ts b/packages/app/src/lib/i18n.test.ts
new file mode 100644
index 00000000..e99843e4
--- /dev/null
+++ b/packages/app/src/lib/i18n.test.ts
@@ -0,0 +1,103 @@
+import { describe, expect, it } from 'vitest';
+
+import { SITE_URL } from '@semianalysisai/inferencex-constants';
+
+import {
+  enAlternates,
+  hasZhSibling,
+  isZhPathname,
+  languageAlternates,
+  switchLocalePath,
+  zhAlternates,
+  zhPath,
+} from './i18n';
+
+describe('zhPath', () => {
+  it('maps the root to /zh without a trailing slash', () => {
+    expect(zhPath('/')).toBe('/zh');
+  });
+
+  it('prefixes non-root paths', () => {
+    expect(zhPath('/blog')).toBe('/zh/blog');
+    expect(zhPath('/blog/some-post')).toBe('/zh/blog/some-post');
+  });
+});
+
+describe('isZhPathname', () => {
+  it('matches the zh root and zh children', () => {
+    expect(isZhPathname('/zh')).toBe(true);
+    expect(isZhPathname('/zh/inference')).toBe(true);
+  });
+
+  it('does not match English paths or lookalikes', () => {
+    expect(isZhPathname('/')).toBe(false);
+    expect(isZhPathname('/inference')).toBe(false);
+    expect(isZhPathname('/zhejiang')).toBe(false);
+  });
+});
+
+describe('hasZhSibling', () => {
+  it('matches mirrored exact routes', () => {
+    expect(hasZhSibling('/')).toBe(true);
+    expect(hasZhSibling('/inference')).toBe(true);
+    expect(hasZhSibling('/about')).toBe(true);
+  });
+
+  it('matches blog child paths but not compare slug pages', () => {
+    expect(hasZhSibling('/blog/some-post')).toBe(true);
+    // Per-slug comparison pages are English-only; only the index is mirrored.
+    expect(hasZhSibling('/compare')).toBe(true);
+    expect(hasZhSibling('/compare/deepseek-r1-h100-vs-h200')).toBe(false);
+  });
+
+  it('rejects unmirrored routes', () => {
+    expect(hasZhSibling('/datasets')).toBe(false);
+    expect(hasZhSibling('/feedback')).toBe(false);
+  });
+});
+
+describe('switchLocalePath', () => {
+  it('switches English pages to their zh sibling', () => {
+    expect(switchLocalePath('/')).toBe('/zh');
+    expect(switchLocalePath('/inference')).toBe('/zh/inference');
+    expect(switchLocalePath('/blog/some-post')).toBe('/zh/blog/some-post');
+  });
+
+  it('switches zh pages back to English', () => {
+    expect(switchLocalePath('/zh')).toBe('/');
+    expect(switchLocalePath('/zh/quotes')).toBe('/quotes');
+    expect(switchLocalePath('/zh/blog/some-post')).toBe('/blog/some-post');
+  });
+
+  it('falls back to the other homepage for unmirrored paths', () => {
+    expect(switchLocalePath('/datasets')).toBe('/zh');
+    expect(switchLocalePath('/compare/foo-vs-bar')).toBe('/zh');
+    expect(switchLocalePath('/zh/unknown-page')).toBe('/');
+  });
+});
+
+describe('languageAlternates', () => {
+  it('links both languages with English as x-default', () => {
+    expect(languageAlternates('/about')).toEqual({
+      en: `${SITE_URL}/about`,
+      'zh-CN': `${SITE_URL}/zh/about`,
+      'x-default': `${SITE_URL}/about`,
+    });
+  });
+
+  it('uses the bare site URL for the root path', () => {
+    const alternates = languageAlternates('/');
+    expect(alternates.en).toBe(SITE_URL);
+    expect(alternates['zh-CN']).toBe(`${SITE_URL}/zh`);
+  });
+});
+
+describe('enAlternates / zhAlternates', () => {
+  it('canonicalizes each side to its own URL with a shared language set', () => {
+    const en = enAlternates('/quotes');
+    const zh = zhAlternates('/quotes');
+    expect(en.canonical).toBe(`${SITE_URL}/quotes`);
+    expect(zh.canonical).toBe(`${SITE_URL}/zh/quotes`);
+    expect(en.languages).toEqual(zh.languages);
+  });
+});
diff --git a/packages/app/src/lib/i18n.ts b/packages/app/src/lib/i18n.ts
new file mode 100644
index 00000000..de38f3c6
--- /dev/null
+++ b/packages/app/src/lib/i18n.ts
@@ -0,0 +1,110 @@
+import { SITE_URL } from '@semianalysisai/inferencex-constants';
+
+/**
+ * Minimal locale plumbing for the Simplified Chinese (/zh) page tree.
+ *
+ * The site is not fully internationalized — instead, every indexable page has
+ * a hand-authored Chinese sibling under /zh (see AGENTS.md "Chinese Website
+ * Pages"). These helpers keep the URL mapping and hreflang alternates in one
+ * place so English and Chinese pages always point at each other consistently.
+ */
+
+export type Locale = 'en' | 'zh';
+
+export const ZH_PREFIX = '/zh';
+
+/** BCP 47 tag used for hreflang, JSON-LD inLanguage, and the html lang attribute. */
+export const ZH_LANG_TAG = 'zh-CN';
+
+/** Open Graph locale for zh pages. */
+export const ZH_OG_LOCALE = 'zh_CN';
+
+/** `/` → `/zh`, `/blog/foo` → `/zh/blog/foo`. */
+export function zhPath(enPath: string): string {
+  return enPath === '/' ? ZH_PREFIX : `${ZH_PREFIX}${enPath}`;
+}
+
+export function isZhPathname(pathname: string): boolean {
+  return pathname === ZH_PREFIX || pathname.startsWith(`${ZH_PREFIX}/`);
+}
+
+/**
+ * English routes that have a Chinese sibling page. Used by the header nav and
+ * the language switcher so we never link into a /zh URL that doesn't exist.
+ * `exact` entries only match the path itself; prefix entries also match any
+ * child path (e.g. /blog matches /blog/some-post).
+ */
+export const ZH_MIRRORED_ROUTES: readonly { path: string; exact?: boolean }[] = [
+  { path: '/', exact: true },
+  { path: '/inference', exact: true },
+  { path: '/evaluation', exact: true },
+  { path: '/historical', exact: true },
+  { path: '/calculator', exact: true },
+  { path: '/reliability', exact: true },
+  { path: '/gpu-specs', exact: true },
+  { path: '/gpu-metrics', exact: true },
+  { path: '/submissions', exact: true },
+  { path: '/about', exact: true },
+  { path: '/quotes', exact: true },
+  { path: '/land-acknowledgement', exact: true },
+  { path: '/compare', exact: true },
+  { path: '/compare-per-dollar', exact: true },
+  { path: '/blog' },
+];
+
+export function hasZhSibling(enPathname: string): boolean {
+  return ZH_MIRRORED_ROUTES.some((route) =>
+    route.exact
+      ? enPathname === route.path
+      : enPathname === route.path || enPathname.startsWith(`${route.path}/`),
+  );
+}
+
+/**
+ * Map the current pathname to its counterpart in the other language, for the
+ * header language switcher. English pages without a Chinese sibling fall back
+ * to the /zh homepage; unknown /zh paths fall back to the English homepage.
+ */
+export function switchLocalePath(pathname: string): string {
+  if (isZhPathname(pathname)) {
+    const enPathname = pathname === ZH_PREFIX ? '/' : pathname.slice(ZH_PREFIX.length);
+    return hasZhSibling(enPathname) ? enPathname : '/';
+  }
+  return hasZhSibling(pathname) ? zhPath(pathname) : ZH_PREFIX;
+}
+
+/**
+ * hreflang map linking an English page and its Chinese sibling. Spread into
+ * `alternates.languages` on BOTH pages so each references the full set, per
+ * Google's bidirectional hreflang requirement. English is the x-default.
+ */
+export function languageAlternates(enPath: string): Record<string, string> {
+  const enUrl = enPath === '/' ? SITE_URL : `${SITE_URL}${enPath}`;
+  return {
+    en: enUrl,
+    [ZH_LANG_TAG]: `${SITE_URL}${zhPath(enPath)}`,
+    'x-default': enUrl,
+  };
+}
+
+/** `alternates` metadata for the English side of a mirrored page pair. */
+export function enAlternates(enPath: string): {
+  canonical: string;
+  languages: Record<string, string>;
+} {
+  return {
+    canonical: enPath === '/' ? SITE_URL : `${SITE_URL}${enPath}`,
+    languages: languageAlternates(enPath),
+  };
+}
+
+/** `alternates` metadata for the Chinese side of a mirrored page pair. */
+export function zhAlternates(enPath: string): {
+  canonical: string;
+  languages: Record<string, string>;
+} {
+  return {
+    canonical: `${SITE_URL}${zhPath(enPath)}`,
+    languages: languageAlternates(enPath),
+  };
+}
diff --git a/packages/app/src/lib/tab-meta-zh.test.ts b/packages/app/src/lib/tab-meta-zh.test.ts
new file mode 100644
index 00000000..dd0505b5
--- /dev/null
+++ b/packages/app/src/lib/tab-meta-zh.test.ts
@@ -0,0 +1,62 @@
+import { describe, expect, it } from 'vitest';
+
+import { SITE_URL } from '@semianalysisai/inferencex-constants';
+
+import { isValidTab, TAB_META } from './tab-meta';
+import {
+  isZhTab,
+  TAB_INTRO_ZH,
+  TAB_LABELS_ZH,
+  TAB_META_ZH,
+  tabMetadataZh,
+  ZH_TAB_KEYS,
+} from './tab-meta-zh';
+
+const HAN_REGEX = /\p{Script=Han}/u;
+
+describe('ZH_TAB_KEYS', () => {
+  it.each(ZH_TAB_KEYS)('mirrors a valid English tab "%s"', (tab) => {
+    expect(isValidTab(tab)).toBe(true);
+    expect(TAB_META[tab]).toBeDefined();
+  });
+
+  it.each(ZH_TAB_KEYS)('has complete Chinese meta, intro, and label for "%s"', (tab) => {
+    // Actual Chinese text, not an English placeholder that slipped through.
+    expect(TAB_META_ZH[tab].title).toMatch(HAN_REGEX);
+    expect(TAB_META_ZH[tab].description).toMatch(HAN_REGEX);
+    expect(TAB_INTRO_ZH[tab]).toMatch(HAN_REGEX);
+    expect(TAB_LABELS_ZH[tab]).toMatch(HAN_REGEX);
+  });
+});
+
+describe('isZhTab', () => {
+  it('accepts mirrored tabs and rejects unmirrored or unknown ones', () => {
+    expect(isZhTab('inference')).toBe(true);
+    expect(isZhTab('ai-chart')).toBe(false);
+    expect(isZhTab('feedback')).toBe(false);
+    expect(isZhTab('nonexistent')).toBe(false);
+  });
+});
+
+describe('tabMetadataZh', () => {
+  it('canonicalizes the inference tab to the zh homepage, mirroring English', () => {
+    const meta = tabMetadataZh('inference');
+    expect(meta.alternates?.canonical).toBe(`${SITE_URL}/zh`);
+  });
+
+  it('canonicalizes other tabs to their own zh URL with bidirectional hreflang', () => {
+    const meta = tabMetadataZh('evaluation');
+    expect(meta.alternates?.canonical).toBe(`${SITE_URL}/zh/evaluation`);
+    expect(meta.alternates?.languages).toEqual({
+      en: `${SITE_URL}/evaluation`,
+      'zh-CN': `${SITE_URL}/zh/evaluation`,
+      'x-default': `${SITE_URL}/evaluation`,
+    });
+  });
+
+  it('sets the zh Open Graph locale and URL', () => {
+    const meta = tabMetadataZh('gpu-specs');
+    expect(meta.openGraph?.locale).toBe('zh_CN');
+    expect(meta.openGraph?.url).toBe(`${SITE_URL}/zh/gpu-specs`);
+  });
+});
diff --git a/packages/app/src/lib/tab-meta-zh.ts b/packages/app/src/lib/tab-meta-zh.ts
new file mode 100644
index 00000000..643e5404
--- /dev/null
+++ b/packages/app/src/lib/tab-meta-zh.ts
@@ -0,0 +1,144 @@
+import type { Metadata } from 'next';
+
+import { AUTHOR_NAME, SITE_NAME, SITE_URL } from '@semianalysisai/inferencex-constants';
+import { ZH_OG_LOCALE, zhAlternates, zhPath } from '@/lib/i18n';
+
+export const LANDING_META_ZH = {
+  title: '开源 AI 推理基准测试',
+  description:
+    '跨 GPU 与推理框架对比 AI 推理性能。基于 NVIDIA GB200、B200、AMD MI355X 等硬件的真实基准测试。免费、开源、持续更新。',
+};
+
+/**
+ * Tabs that have a Chinese sibling page under /zh. Internal or feature-gated
+ * tabs (ai-chart, current-inferencex-image, feedback) are intentionally not
+ * mirrored — they are not indexable surfaces.
+ */
+export const ZH_TAB_KEYS = [
+  'inference',
+  'evaluation',
+  'historical',
+  'calculator',
+  'reliability',
+  'gpu-specs',
+  'gpu-metrics',
+  'submissions',
+] as const;
+
+export type ZhTabKey = (typeof ZH_TAB_KEYS)[number];
+
+export function isZhTab(tab: string): tab is ZhTabKey {
+  return (ZH_TAB_KEYS as readonly string[]).includes(tab);
+}
+
+export const TAB_META_ZH: Record<ZhTabKey, { title: string; description: string }> = {
+  inference: {
+    title: 'AI 推理基准测试',
+    description:
+      '跨 GPU 与云服务商对比 AI 推理延迟、吞吐量与首 token 延迟（TTFT）。基于 NVIDIA GB200、H100、AMD MI355X 等硬件的真实基准测试。',
+  },
+  evaluation: {
+    title: 'LLM 评估结果',
+    description: 'LLM 评估得分与准确率基准测试。使用标准化评估指标对比各服务商的模型质量。',
+  },
+  historical: {
+    title: '历史推理性能趋势',
+    description:
+      '跟踪 AI 推理性能随时间的变化。历史基准测试数据展示各 GPU 与服务商在延迟、吞吐量和成本上的改进。',
+  },
+  calculator: {
+    title: '吞吐量与 TCO 计算器',
+    description:
+      '计算 AI 推理吞吐量与总拥有成本（TCO）。跨硬件配置对比 LLM 推理服务的 GPU 成本效益。',
+  },
+  reliability: {
+    title: '服务商可靠性指标',
+    description: 'AI 推理服务商可靠性与可用性跟踪。对比各 GPU 云服务商的错误率与可用性。',
+  },
+  'gpu-specs': {
+    title: 'GPU 规格与对比',
+    description:
+      '面向 AI 推理的详细 GPU 规格。对比 NVIDIA、AMD 与 Intel GPU 的显存带宽、FLOPS、互连与拓扑。',
+  },
+  'gpu-metrics': {
+    title: 'GPU 功耗与能效指标',
+    description: 'AI 推理负载下的 GPU 功耗与能效指标。跨硬件对比每瓦 token 数。',
+  },
+  submissions: {
+    title: '基准测试提交记录',
+    description:
+      '提交到 InferenceX 的全部基准测试配置。查看各 GPU 厂商的提交历史、活动趋势与数据点数量。',
+  },
+};
+
+/**
+ * Server-rendered Chinese intro shown above the interactive dashboard on each
+ * /zh tab page. The charts themselves render in English; this block gives
+ * crawlers and readers genuine Chinese content describing the page.
+ */
+export const TAB_INTRO_ZH: Record<ZhTabKey, string> = {
+  inference:
+    '本页面展示 InferenceX 的 AI 推理基准测试结果：跨 GPU、推理框架与模型对比吞吐量（token/s/GPU）、交互性（token/s/用户）、首 token 延迟（TTFT）等指标。每个数据点都来自公开的 GitHub Actions 工作流，可复现、可审计。',
+  evaluation:
+    '本页面展示 LLM 评估（evaluation）结果：使用标准化评估集对比各模型与部署配置的准确率，验证推理优化不会损害模型质量。',
+  historical:
+    '本页面展示历史趋势图表：跟踪各 GPU、框架与模型的推理性能随时间的演进，量化软件栈优化带来的收益。',
+  calculator:
+    '本页面提供吞吐量与总拥有成本（TCO）计算器：基于真实基准测试数据，估算不同 GPU 配置下 LLM 推理服务的每百万 token 成本与性价比。',
+  reliability:
+    '本页面展示基准测试基础设施的可靠性指标：各 GPU 集群与服务商的运行成功率、错误率与可用性。',
+  'gpu-specs':
+    '本页面提供 GPU 规格对比：NVIDIA、AMD 等厂商加速器的显存容量、显存带宽、FLOPS、互连拓扑与功耗规格。',
+  'gpu-metrics':
+    '本页面展示 GPU 功耗与能效指标（PowerX）：推理负载下的实测功耗、每瓦 token 数与每兆瓦 token 产出。',
+  submissions:
+    '本页面列出提交到 InferenceX 的全部基准测试配置：按 GPU 厂商查看提交历史、活动趋势与数据点数量。',
+};
+
+/** Chinese labels for the dashboard tab bar (TabNav) on /zh pages. */
+export const TAB_LABELS_ZH: Record<string, string> = {
+  inference: '推理性能',
+  evaluation: '准确率评估',
+  historical: '历史趋势',
+  calculator: 'TCO 计算器',
+  reliability: '可靠性',
+  'gpu-specs': 'GPU 规格',
+  'gpu-metrics': 'GPU 功耗',
+  submissions: '提交记录',
+};
+
+/** Chinese labels for the site header nav on /zh pages, keyed by English href. */
+export const NAV_LABELS_ZH: Record<string, string> = {
+  '/': '首页',
+  '/inference': '仪表板',
+  '/compare': 'GPU 对比',
+  '/quotes': '支持者',
+  '/datasets': '数据集',
+  '/blog': '文章',
+  '/about': '关于',
+};
+
+const TITLE_SUFFIX = `${SITE_NAME} by ${AUTHOR_NAME}`;
+
+/** Generate Next.js Metadata for a /zh tab page (mirrors `tabMetadata`). */
+export function tabMetadataZh(tab: ZhTabKey): Metadata {
+  const meta = TAB_META_ZH[tab];
+  // The English inference tab canonicalizes to the site root; mirror that.
+  const enPath = tab === 'inference' ? '/' : `/${tab}`;
+  const url = `${SITE_URL}${zhPath(enPath)}`;
+  return {
+    title: meta.title,
+    description: meta.description,
+    alternates: zhAlternates(enPath),
+    openGraph: {
+      title: `${meta.title} | ${SITE_NAME}`,
+      description: meta.description,
+      url,
+      locale: ZH_OG_LOCALE,
+    },
+    twitter: {
+      title: `${meta.title} | ${TITLE_SUFFIX}`,
+      description: meta.description,
+    },
+  };
+}
diff --git a/packages/app/src/lib/tab-meta.ts b/packages/app/src/lib/tab-meta.ts
index 9145532b..b312a6e7 100644
--- a/packages/app/src/lib/tab-meta.ts
+++ b/packages/app/src/lib/tab-meta.ts
@@ -1,6 +1,7 @@
 import type { Metadata } from 'next';
 
 import { AUTHOR_NAME, SITE_NAME, SITE_URL } from '@semianalysisai/inferencex-constants';
+import { hasZhSibling, languageAlternates } from '@/lib/i18n';
 
 export const LANDING_META = {
   title: 'Open Source AI Inference Benchmark',
@@ -95,11 +96,16 @@ export function getTabTitle(tab: string): string {
 /** Generate Next.js Metadata for a tab page. */
 export function tabMetadata(tab: TabKey): Metadata {
   const meta = TAB_META[tab];
+  const enPath = tab === 'inference' ? '/' : `/${tab}`;
   const url = tab === 'inference' ? SITE_URL : `${SITE_URL}/${tab}`;
   return {
     title: meta.title,
     description: meta.description,
-    alternates: { canonical: url },
+    alternates: {
+      canonical: url,
+      // hreflang to the Chinese sibling page, for tabs mirrored under /zh.
+      ...(hasZhSibling(enPath) && { languages: languageAlternates(enPath) }),
+    },
     openGraph: {
       title: `${meta.title} | InferenceX`,
       description: meta.description,
diff --git a/packages/constants/src/seo.ts b/packages/constants/src/seo.ts
index 2f02bb33..f0332083 100644
--- a/packages/constants/src/seo.ts
+++ b/packages/constants/src/seo.ts
@@ -15,3 +15,13 @@ export const DESCRIPTION =
  */
 export const SUPPORTERS_LINE = 'Supported by OpenAI, Microsoft & the PyTorch Foundation.';
 export const OG_IMAGE = `${SITE_URL}/og-image.png`;
+
+/**
+ * Simplified Chinese equivalents for the /zh page tree. Brand and product
+ * names (InferenceX, SemiAnalysis, GPU SKUs) stay in English per the
+ * translation quality bar in AGENTS.md.
+ */
+export const SITE_TITLE_ZH = `${SITE_NAME} by ${AUTHOR_NAME} — AI 推理基准测试`;
+export const DESCRIPTION_ZH =
+  'InferenceX 是紧跟现代 AI 发展节奏的开源 AI 推理基准测试，由规模领先的开源 GPU CI/CD 集群持续驱动，涵盖 NVIDIA GB200、AMD MI355X 等众多硬件。';
+export const SUPPORTERS_LINE_ZH = '获得 OpenAI、Microsoft 与 PyTorch 基金会的支持。';

From ba03097e20eed69a1fa409cc95a965543449fa0f Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Sat, 4 Jul 2026 03:22:08 -0400
Subject: [PATCH 2/3] feat(i18n): localize UI chrome + mirror compare slug
 pages under /zh
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round 2 of the Chinese site, addressing review feedback that /zh pages
still showed English chrome and links escaping the Chinese tree:

- Footer fully localized (client component + useLocale hook); internal
  links stay under /zh; language link flips to "English" on zh pages
- New useLocale() + component-local STRINGS={en,zh} pattern applied to:
  landing Quick Comparisons preset cards, TCO calculator (labels,
  tooltips, toggles, table headers, chart title), inference chart
  display/controls + shared selectors + disagg caveat banners,
  evaluation + historical displays, reliability/gpu-specs/gpu-power/
  submissions displays, all 9 nudges, MTP conflict toast, share
  buttons, blog DashboardCTA. English output stays byte-identical.
- /zh/compare/[slug] + /zh/compare-per-dollar/[slug] fully mirrored:
  Chinese narrative templates in compare-ssr-zh.ts (1:1 port reusing
  compare-ssr.ts data helpers), localized page-clients, canonical zh
  redirects, bidirectional hreflang on EN slug pages, zh index cards
  now link within /zh, sitemap emits EN+zh pairs for every slug
- Cursor review fixes: gated tabs and header brand link no longer drop
  the /zh prefix (localize only when a zh sibling exists)
- AGENTS.md + docs/i18n.md updated: chrome localization pattern is now
  mandatory for new UI strings; compare-ssr-zh.ts sync rule added
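
For readers new to the chrome-localization pattern named above, here is a
minimal sketch. It assumes the locale is derived from the pathname;
`localeFromPathname` and `fullComparisonLabel` are hypothetical names used
only for illustration, and the real `useLocale()` hook in
src/lib/use-locale.ts may work differently. The two strings are copied from
the compare-per-dollar page client in this patch.

```typescript
type Locale = 'en' | 'zh';

// Assumption: a path is Chinese when it is exactly /zh or lives under /zh/.
// A bare prefix check would wrongly match paths such as /zhongwen.
function localeFromPathname(pathname: string): Locale {
  return pathname === '/zh' || pathname.startsWith('/zh/') ? 'zh' : 'en';
}

// Component-local dictionary: the `en` entry keeps the original string
// verbatim so English output does not change.
const STRINGS = {
  en: { fullComparisonLinkText: 'View full latency + throughput comparison →' },
  zh: { fullComparisonLinkText: '查看完整延迟与吞吐量对比 →' },
} as const;

// Hypothetical helper: look up one label for the current path.
function fullComparisonLabel(pathname: string): string {
  const t = STRINGS[localeFromPathname(pathname)];
  return t.fullComparisonLinkText;
}
```

Keeping the dictionary next to the component means a missing `zh` key shows
up as a type error at the lookup site instead of a silent English fallback.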

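Chinese summary (translated): round 2 of the Chinese site. The footer is
fully localized; dashboard UI copy (calculator, inference, evaluation,
historical trends, reliability, GPU specs, power, submissions, nudges,
share buttons, curated comparison cards) is translated into Chinese via
useLocale() and component-local STRINGS dicts, with English pages kept
byte-identical; full Chinese mirrors are added at /zh/compare/[slug] and
/zh/compare-per-dollar/[slug] (narrative templates live in
compare-ssr-zh.ts, one-to-one with the English templates), and the sitemap
emits paired URLs for every compare page; two /zh prefix drops found in
Cursor review are fixed; AGENTS.md gains rules for UI copy localization and
narrative template sync.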

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 AGENTS.md                                     |   4 +-
 docs/i18n.md                                  |   3 +-
 packages/app/cypress/e2e/zh-pages.cy.ts       |  12 +
 .../compare-per-dollar/[slug]/page-client.tsx | 136 ++++--
 .../app/compare-per-dollar/[slug]/page.tsx    |   8 +-
 .../src/app/compare/[slug]/page-client.tsx    |  79 +++-
 packages/app/src/app/compare/[slug]/page.tsx  |   8 +-
 packages/app/src/app/sitemap.ts               |  27 +-
 packages/app/src/app/zh/blog/[slug]/page.tsx  |   2 +-
 .../[slug]/opengraph-image.tsx                |   7 +
 .../app/zh/compare-per-dollar/[slug]/page.tsx | 185 ++++++++
 .../src/app/zh/compare-per-dollar/page.tsx    |   2 +-
 .../app/zh/compare/[slug]/opengraph-image.tsx |   7 +
 .../app/src/app/zh/compare/[slug]/page.tsx    | 178 +++++++
 packages/app/src/app/zh/compare/page.tsx      |   2 +-
 .../src/components/blog/mdx-components.tsx    |  15 +-
 .../components/calculator/CalculatorTable.tsx |  46 +-
 .../calculator/ThroughputBarChart.tsx         |   8 +-
 .../ThroughputCalculatorDisplay.tsx           | 260 ++++++++---
 .../evaluation/ui/ChartControls.tsx           |  43 +-
 .../components/evaluation/ui/ChartDisplay.tsx |  73 ++-
 .../components/favorites/favorite-presets.ts  |  22 +
 packages/app/src/components/footer/footer.tsx | 370 ++++++++-------
 .../components/gpu-power/GpuPowerDisplay.tsx  | 117 +++--
 .../gpu-specs/gpu-specs-content.tsx           | 167 ++++---
 packages/app/src/components/header/header.tsx |   2 +-
 .../components/inference/ui/ChartControls.tsx | 108 ++++-
 .../components/inference/ui/ChartDisplay.tsx  |  77 +++-
 .../components/landing/curated-view-card.tsx  |  11 +-
 .../components/mtp-engine-conflict-toast.tsx  |  38 +-
 packages/app/src/components/nudge-engine.tsx  |  61 ++-
 .../reliability/ui/ChartControls.tsx          |  43 +-
 .../reliability/ui/ChartDisplay.tsx           |  29 +-
 packages/app/src/components/share-buttons.tsx |  25 +-
 .../submissions/SubmissionsDisplay.tsx        |  88 +++-
 .../submissions/SubmissionsTable.tsx          | 221 ++++++---
 packages/app/src/components/tab-nav.tsx       |  14 +-
 .../trends/HistoricalTrendsDisplay.tsx        |  71 ++-
 .../app/src/components/ui/bottom-toast.tsx    |   4 +-
 .../components/ui/chart-display-helpers.tsx   |  94 +++-
 .../app/src/components/ui/chart-selectors.tsx |  68 ++-
 .../ui/unofficial-domain-notice.tsx           |  20 +-
 .../app/src/components/zh/zh-tab-intro.tsx    |   2 +-
 packages/app/src/lib/compare-ssr-zh.ts        | 436 ++++++++++++++++++
 packages/app/src/lib/compare-ssr.ts           |  19 +-
 packages/app/src/lib/i18n.test.ts             |  14 +-
 packages/app/src/lib/i18n.ts                  |   4 +-
 packages/app/src/lib/nudges/registry.tsx      |  30 ++
 packages/app/src/lib/nudges/types.ts          |   5 +
 packages/app/src/lib/use-locale.ts            |  16 +
 50 files changed, 2618 insertions(+), 663 deletions(-)
 create mode 100644 packages/app/src/app/zh/compare-per-dollar/[slug]/opengraph-image.tsx
 create mode 100644 packages/app/src/app/zh/compare-per-dollar/[slug]/page.tsx
 create mode 100644 packages/app/src/app/zh/compare/[slug]/opengraph-image.tsx
 create mode 100644 packages/app/src/app/zh/compare/[slug]/page.tsx
 create mode 100644 packages/app/src/lib/compare-ssr-zh.ts
 create mode 100644 packages/app/src/lib/use-locale.ts

diff --git a/AGENTS.md b/AGENTS.md
index 282a322b..41522234 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -133,7 +133,9 @@ The site ships a hand-authored Simplified Chinese sibling for every indexable pa
 2. **New dashboard tab** → add the tab to `ZH_TAB_KEYS`, `TAB_META_ZH`, `TAB_INTRO_ZH`, and `TAB_LABELS_ZH` in `src/lib/tab-meta-zh.ts`, then create `src/app/zh/(dashboard)/<tab>/page.tsx` mirroring the English page with `tabMetadataZh('<tab>')` and a `<ZhTabIntro tab="<tab>" />` block above the chart (the interactive chart UI itself stays English). `tab-meta-zh.test.ts` enforces dictionary completeness.
 3. **New blog post** → the translation `packages/app/content/blog/zh/<same-filename>.mdx` is REQUIRED in the same PR. Translate frontmatter `title`/`subtitle` and the body; keep `date`, `publishDate`, `modifiedDate`, `tags`, and the filename/slug identical (English and Chinese posts pair by filename; visibility gating always follows the English post's `publishDate`). Rewrite internal `/blog/<slug>` links to `/zh/blog/<slug>`; never alter numbers, code blocks, or `<Figure>`/`<JsonLd>` structure. The `/zh/blog` listing, hreflang, and sitemap pick the file up automatically.
 4. **Editing an existing English page or post** → update its Chinese sibling in the same PR. Content drift between languages is a 🔴 BLOCKING review issue.
-5. **Intentionally not mirrored** (skip these, or add them to `ZH_MIRRORED_ROUTES` when you do mirror them): per-slug compare pages (`/compare/[slug]`, `/compare-per-dollar/[slug]` — the `/zh/compare*` index pages link to the English slug pages), `/datasets`, feature-gated tabs (`ai-chart`, `current-inferencex-image`, `feedback`), `feed.xml`/`llms.txt`, and per-post OG images (Chinese posts reuse the English post's OG image — the OG renderer's font has no CJK glyphs).
+5. **Shared UI chrome** (headers, footers, dashboard card titles/descriptions, control labels, buttons, nudges) is localized in place, not duplicated: client components call `useLocale()` (`src/lib/use-locale.ts`) and read from a component-local `STRINGS = { en, zh }` dict; server components take an optional `locale` prop passed from the /zh page. The `en` dict must keep the exact original strings so English pages stay byte-identical. New user-visible chrome strings MUST ship both variants. Chart-internal rendering (D3 axes/tooltips/legend series, CSV export) and data-registry display values (model/GPU/framework/precision names) stay English.
+6. **Compare slug narrative sync**: the per-slug compare pages are mirrored at `/zh/compare/[slug]` and `/zh/compare-per-dollar/[slug]`; their Chinese prose templates live in `src/lib/compare-ssr-zh.ts`, a 1:1 port of the English templates in `compare-ssr.ts`. Any PR that changes the English narrative templates MUST update the zh port in the same commit.
+7. **Intentionally not mirrored** (skip these, or add them to `ZH_MIRRORED_ROUTES` when you do mirror them): `/datasets`, feature-gated tabs (`ai-chart`, `current-inferencex-image`, `feedback`), `feed.xml`/`llms.txt`, and per-post OG images (Chinese posts reuse the English post's OG image — the OG renderer's font has no CJK glyphs).
 
 ## Chart Interpolation — TS and Python Helpers MUST Stay in Sync
 
diff --git a/docs/i18n.md b/docs/i18n.md
index 50bdaca1..7b7ca0ae 100644
--- a/docs/i18n.md
+++ b/docs/i18n.md
@@ -25,5 +25,6 @@ Why the Simplified Chinese site is a hand-authored `/zh` page tree instead of an
 - **Reading time is CJK-aware**: `getReadingTime` counts Han characters at 400 chars/min alongside Latin words at 265 wpm; pure word-splitting counts an entire Chinese paragraph as ~1 "word".
 - **zh OG images reuse the English post meta** — the `next/og` default Satori font has no CJK glyphs, so a Chinese title would render as tofu. Loading a subset CJK font is a known follow-up.
 - **`/zh/inference` canonicalizes to `/zh`**, mirroring the English quirk where `/inference` canonicalizes to `/`.
-- **Compare slug pages are not mirrored**: `compareTableNarrative` (`compare-ssr.ts`) generates hundreds of lines of English prose per programmatic page; translating the templates is a separate project. The `/zh/compare*` index pages exist and link to the English slug pages.
+- **Shared chrome is localized in place** via `useLocale()` + component-local `STRINGS = { en, zh }` dicts (footer, TabNav, dashboard display headings/labels, nudges, preset cards). The `en` dict keeps the exact original strings so English pages are byte-identical; chart-internal rendering and data-registry display values stay English.
+- **Compare slug pages are mirrored** at `/zh/compare/[slug]` and `/zh/compare-per-dollar/[slug]`. The Chinese narrative templates live in `compare-ssr-zh.ts` as a 1:1 port of `compare-ssr.ts` (data logic is imported, only sentence templates differ) — the two files must change together.
 - **Sitemap pairs**: `localizedPair()` in `sitemap.ts` emits the EN and zh URL together, both carrying the same `alternates.languages` map. Blog posts without a translation fall back to an English-only entry, so a missing translation degrades gracefully instead of 404-ing crawlers.
diff --git a/packages/app/cypress/e2e/zh-pages.cy.ts b/packages/app/cypress/e2e/zh-pages.cy.ts
index 8b52407e..77fc8cfc 100644
--- a/packages/app/cypress/e2e/zh-pages.cy.ts
+++ b/packages/app/cypress/e2e/zh-pages.cy.ts
@@ -21,6 +21,18 @@ describe('Chinese (/zh) pages', () => {
     it('header language toggle points back to English', () => {
       cy.get('[data-testid="language-toggle"]').should('have.attr', 'href', '/');
     });
+
+    it('footer renders in Chinese with zh-internal links', () => {
+      cy.get('[data-testid="footer-brand-description"]').should('contain.text', '开源推理基准测试');
+      cy.get('[data-testid="footer-link-land-acknowledgement"]').should(
+        'have.attr',
+        'href',
+        '/zh/land-acknowledgement',
+      );
+      cy.get('[data-testid="footer-link-zh"]')
+        .should('contain.text', 'English')
+        .and('have.attr', 'href', '/');
+    });
   });
 
   describe('zh dashboard tab page', () => {
diff --git a/packages/app/src/app/compare-per-dollar/[slug]/page-client.tsx b/packages/app/src/app/compare-per-dollar/[slug]/page-client.tsx
index b6ee7550..b2216a38 100644
--- a/packages/app/src/app/compare-per-dollar/[slug]/page-client.tsx
+++ b/packages/app/src/app/compare-per-dollar/[slug]/page-client.tsx
@@ -19,6 +19,48 @@ interface SsrTableData {
   interactivityRange: { min: number; max: number };
 }
 
+/** Only show Cost + Concurrency in the interpolated table — the rest of the
+ *  metric rows (Throughput, tok/s/MW) live on the sibling /compare page. */
+const PER_DOLLAR_TABLE_METRICS = ['Cost ($/M tok)', 'Concurrency'];
+
+/** Rename "Cost ($/M tok)" to the full-English "Dollar per Million Tokens"
+ *  in the per-dollar table so the cell reads in line with the page's
+ *  "Performance per Dollar" framing and surfaces the SEO term verbatim. */
+const PER_DOLLAR_LABEL_OVERRIDES = {
+  'Cost ($/M tok)': 'Dollar per Million Tokens',
+};
+
+/** y_costh = Cost per Million Total Tokens (Owning - Hyperscaler). Defined in
+ *  packages/app/src/components/inference/inference-chart-config.json. */
+const PER_DOLLAR_DEFAULT_Y_AXIS = 'y_costh';
+
+const STRINGS = {
+  en: {
+    eyebrowSuffix: 'Performance per Dollar',
+    h1Suffix: 'Performance per Dollar',
+    mainChartLinkText: 'the main inference chart',
+    fullComparisonLinkText: 'View full latency + throughput comparison →',
+    caveatSeqFallback: 'sequence',
+    caveatPrecFallback: 'precision',
+    pricingLabel: 'GPU pricing (owning hyperscaler):',
+    pricingSource: 'Source:',
+    emptyState:
+      'No interpolated cost-per-token data available for the default model on this GPU pair. Use the chart controls below to select a model and precision with benchmark data for both GPUs.',
+  },
+  zh: {
+    eyebrowSuffix: '每美元性能',
+    h1Suffix: '每美元性能',
+    mainChartLinkText: '主推理图表',
+    fullComparisonLinkText: '查看完整延迟与吞吐量对比 →',
+    caveatSeqFallback: '序列',
+    caveatPrecFallback: '精度',
+    pricingLabel: 'GPU 定价（所属云服务商）：',
+    pricingSource: '来源：',
+    emptyState:
+      '当前默认模型在此 GPU 组合上没有可用的插值每 token 成本数据。请使用下方图表控件选择一个两款 GPU 均有基准测试数据的模型和精度。',
+  },
+} as const;
+
 interface ComparePerDollarPageClientProps {
   a: string;
   b: string;
@@ -50,23 +92,9 @@ interface ComparePerDollarPageClientProps {
   bCostPerGpuHr: number;
   /** Crawlable data graphic generated for the canonical default comparison. */
   heroImageSrc: string;
+  locale?: 'en' | 'zh';
 }
 
-/** Only show Cost + Concurrency in the interpolated table — the rest of the
- *  metric rows (Throughput, tok/s/MW) live on the sibling /compare page. */
-const PER_DOLLAR_TABLE_METRICS = ['Cost ($/M tok)', 'Concurrency'];
-
-/** Rename "Cost ($/M tok)" to the full-English "Dollar per Million Tokens"
- *  in the per-dollar table so the cell reads in line with the page's
- *  "Performance per Dollar" framing and surfaces the SEO term verbatim. */
-const PER_DOLLAR_LABEL_OVERRIDES = {
-  'Cost ($/M tok)': 'Dollar per Million Tokens',
-};
-
-/** y_costh = Cost per Million Total Tokens (Owning - Hyperscaler). Defined in
- *  packages/app/src/components/inference/inference-chart-config.json. */
-const PER_DOLLAR_DEFAULT_Y_AXIS = 'y_costh';
-
 function toModel(value: string): Model | undefined {
   return Object.values(Model).includes(value as Model) ? (value as Model) : undefined;
 }
@@ -101,6 +129,7 @@ export default function ComparePerDollarPageClient({
   aCostPerGpuHr,
   bCostPerGpuHr,
   heroImageSrc,
+  locale = 'en',
 }: ComparePerDollarPageClientProps) {
   useEffect(() => {
     track('compare_per_dollar_page_view', { gpu_a: a, gpu_b: b, default_model: defaultModel });
@@ -110,6 +139,8 @@ export default function ComparePerDollarPageClient({
   const initialModel = toModel(defaultModel);
   const initialSequence = toSequence(defaultSequence);
   const initialPrecisions = toPrecisions(defaultPrecision);
+  const t = STRINGS[locale];
+  const isZh = locale === 'zh';
 
   return (
     <GlobalFilterProvider
@@ -127,23 +158,37 @@ export default function ComparePerDollarPageClient({
           <Card className="flex w-full min-w-0 flex-col gap-3">
             <header>
               <div className="text-xs uppercase tracking-wider text-muted-foreground">
-                {modelLabel} · Performance per Dollar
+                {modelLabel} · {t.eyebrowSuffix}
               </div>
               <h1 className="text-2xl lg:text-3xl font-bold tracking-tight mt-1">
-                {label} Performance per Dollar
+                {label} {t.h1Suffix}
               </h1>
-              <p className="mt-2 text-sm text-muted-foreground">
-                Cost per million tokens of <strong>{aLabel}</strong> ({aVendor} {aArch}) versus{' '}
-                <strong>{bLabel}</strong> ({bVendor} {bArch}) on <strong>{modelLabel}</strong>.
-                Owning-hyperscaler TCO normalized by output tokens — performance per dollar across
-                LLM workloads. Pick the more cost-efficient SKU at every target interactivity level.
-                Use the chart controls below to switch sequences, precisions, and metrics — same
-                interactions as{' '}
-                <Link href="/" className="underline hover:text-primary">
-                  the main inference chart
-                </Link>
-                .
-              </p>
+              {isZh ? (
+                <p className="mt-2 text-sm text-muted-foreground">
+                  <strong>{aLabel}</strong>（{aVendor} {aArch}）与 <strong>{bLabel}</strong>（
+                  {bVendor} {bArch}）在 <strong>{modelLabel}</strong> 上的每百万 token
+                  成本。基于所属云服务商 TCO 归一化的输出 token 性能——在各类 LLM
+                  工作负载下的每美元性能。在每个目标交互性水平下选出更经济的
+                  SKU。使用下方图表控件切换序列、精度和指标——交互方式与
+                  <Link href="/zh" className="underline hover:text-primary">
+                    {t.mainChartLinkText}
+                  </Link>
+                  相同。
+                </p>
+              ) : (
+                <p className="mt-2 text-sm text-muted-foreground">
+                  Cost per million tokens of <strong>{aLabel}</strong> ({aVendor} {aArch}) versus{' '}
+                  <strong>{bLabel}</strong> ({bVendor} {bArch}) on <strong>{modelLabel}</strong>.
+                  Owning-hyperscaler TCO normalized by output tokens — performance per dollar across
+                  LLM workloads. Pick the more cost-efficient SKU at every target interactivity
+                  level. Use the chart controls below to switch sequences, precisions, and metrics —
+                  same interactions as{' '}
+                  <Link href="/" className="underline hover:text-primary">
+                    {t.mainChartLinkText}
+                  </Link>
+                  .
+                </p>
+              )}
               {narrative.length > 0 && (
                 <div
                   className="mt-3 flex flex-col gap-2"
@@ -156,10 +201,9 @@ export default function ComparePerDollarPageClient({
                         <>
                           {' '}
                           <span className="text-muted-foreground italic">
-                            (Numbers reflect the default {defaultSequence ?? 'sequence'} ·{' '}
-                            {defaultPrecision ?? 'precision'} selection for this URL — table and
-                            chart below update if you change sequence, precision, or model in the
-                            controls.)
+                            {isZh
+                              ? `（数据反映此 URL 的默认 ${defaultSequence ?? t.caveatSeqFallback} · ${defaultPrecision ?? t.caveatPrecFallback} 选择——如果您在控件中更改序列、精度或模型，下方表格和图表会自动更新。）`
+                              : `(Numbers reflect the default ${defaultSequence ?? t.caveatSeqFallback} · ${defaultPrecision ?? t.caveatPrecFallback} selection for this URL — table and chart below update if you change sequence, precision, or model in the controls.)`}
                           </span>
                         </>
                       )}
@@ -172,10 +216,11 @@ export default function ComparePerDollarPageClient({
                   className="mt-2 text-xs text-muted-foreground"
                   data-testid="compare-per-dollar-pricing"
                 >
-                  GPU pricing (owning hyperscaler): <strong>{aLabel}</strong>{' '}
+                  {t.pricingLabel} <strong>{aLabel}</strong>{' '}
                   {aCostPerGpuHr > 0 ? `$${aCostPerGpuHr.toFixed(2)}/GPU/hr` : '—'} ·{' '}
                   <strong>{bLabel}</strong>{' '}
-                  {bCostPerGpuHr > 0 ? `$${bCostPerGpuHr.toFixed(2)}/GPU/hr` : '—'}. Source:{' '}
+                  {bCostPerGpuHr > 0 ? `$${bCostPerGpuHr.toFixed(2)}/GPU/hr` : '—'}.{' '}
+                  {t.pricingSource}{' '}
                   <a
                     href="https://semianalysis.com/ai-cloud-tco-model/"
                     target="_blank"
@@ -190,11 +235,11 @@ export default function ComparePerDollarPageClient({
               )}
               <p className="mt-2 text-sm">
                 <Link
-                  href={`/compare/${slug}`}
+                  href={isZh ? `/zh/compare/${slug}` : `/compare/${slug}`}
                   className="underline hover:text-primary text-muted-foreground"
                   onClick={() => track('compare_per_dollar_cross_link_to_full', { slug })}
                 >
-                  View full latency + throughput comparison →
+                  {t.fullComparisonLinkText}
                 </Link>
               </p>
             </header>
@@ -204,7 +249,11 @@ export default function ComparePerDollarPageClient({
             >
               <img
                 src={heroImageSrc}
-                alt={`${modelLabel}: ${aLabel} versus ${bLabel} cost per million tokens at matched interactivity levels`}
+                alt={
+                  isZh
+                    ? `${modelLabel}：${aLabel} 与 ${bLabel} 在相同交互性水平下的每百万 token 成本`
+                    : `${modelLabel}: ${aLabel} versus ${bLabel} cost per million tokens at matched interactivity levels`
+                }
                 width={1200}
                 height={675}
                 loading="eager"
@@ -212,8 +261,9 @@ export default function ComparePerDollarPageClient({
                 className="w-full rounded-lg border border-border/50"
               />
               <figcaption className="text-xs text-muted-foreground">
-                {aLabel} versus {bLabel} cost per million tokens for this comparison's canonical
-                default workload. Lower cost indicates better performance per dollar.
+                {isZh
+                  ? `${aLabel} 与 ${bLabel} 在此对比默认工作负载下的每百万 token 成本。成本越低表示每美元性能越高。`
+                  : `${aLabel} versus ${bLabel} cost per million tokens for this comparison's canonical default workload. Lower cost indicates better performance per dollar.`}
               </figcaption>
             </figure>
             <CompareTableSection
@@ -222,6 +272,7 @@ export default function ComparePerDollarPageClient({
               aLabel={aLabel}
               bLabel={bLabel}
               ssrTableData={ssrTableData}
+              emptyStateText={t.emptyState}
             />
           </Card>
           <InferenceChartDisplay />
@@ -237,12 +288,14 @@ function CompareTableSection({
   aLabel,
   bLabel,
   ssrTableData,
+  emptyStateText,
 }: {
   a: string;
   b: string;
   aLabel: string;
   bLabel: string;
   ssrTableData: SsrTableData;
+  emptyStateText: string;
 }) {
   const { effectiveSequence, effectivePrecisions, selectedRunDate, selectedModel } =
     useGlobalFilters();
@@ -270,8 +323,7 @@ function CompareTableSection({
   if (ssrTableData.defaultTargets.length === 0) {
     return (
       <div className="border border-border/50 rounded-md px-4 py-3 text-sm text-muted-foreground bg-muted/30">
-        No interpolated cost-per-token data available for the default model on this GPU pair. Use
-        the chart controls below to select a model and precision with benchmark data for both GPUs.
+        {emptyStateText}
       </div>
     );
   }
diff --git a/packages/app/src/app/compare-per-dollar/[slug]/page.tsx b/packages/app/src/app/compare-per-dollar/[slug]/page.tsx
index 24d60579..5b12dd01 100644
--- a/packages/app/src/app/compare-per-dollar/[slug]/page.tsx
+++ b/packages/app/src/app/compare-per-dollar/[slug]/page.tsx
@@ -9,6 +9,7 @@ import {
 } from '@semianalysisai/inferencex-constants';
 
 import { JsonLd } from '@/components/json-ld';
+import { languageAlternates } from '@/lib/i18n';
 import { pickPairDefaults } from '@/lib/compare-pair-defaults';
 import {
   canonicalCompareSlug,
@@ -55,7 +56,12 @@ export async function generateMetadata({ params }: Props): Promise<Metadata> {
   return {
     title: `${fullLabel} — Performance per Dollar`,
     description,
-    alternates: { canonical: url },
+    alternates: {
+      canonical: url,
+      languages: languageAlternates(
+        `/compare-per-dollar/${canonicalCompareSlug(parsed.model.slug, parsed.a, parsed.b)}`,
+      ),
+    },
     openGraph: {
       title: `${fullLabel} — Performance per Dollar | ${SITE_NAME}`,
       description,
diff --git a/packages/app/src/app/compare/[slug]/page-client.tsx b/packages/app/src/app/compare/[slug]/page-client.tsx
index 6dc26b05..027014f0 100644
--- a/packages/app/src/app/compare/[slug]/page-client.tsx
+++ b/packages/app/src/app/compare/[slug]/page-client.tsx
@@ -19,6 +19,27 @@ interface SsrTableData {
   interactivityRange: { min: number; max: number };
 }
 
+const STRINGS = {
+  en: {
+    eyebrowSuffix: 'GPU comparison',
+    mainChartLinkText: 'the main inference chart',
+    perDollarLinkText: 'View performance-per-dollar view →',
+    caveatSeqFallback: 'sequence',
+    caveatPrecFallback: 'precision',
+    emptyState:
+      'No interpolated comparison data available for the default model. Use the chart controls below to select a model with benchmark data for both GPUs.',
+  },
+  zh: {
+    eyebrowSuffix: 'GPU 对比',
+    mainChartLinkText: '主推理图表',
+    perDollarLinkText: '查看每美元性能对比 →',
+    caveatSeqFallback: '序列',
+    caveatPrecFallback: '精度',
+    emptyState:
+      '当前默认模型没有可用的插值对比数据。请使用下方图表控件选择一个两款 GPU 均有基准测试数据的模型。',
+  },
+} as const;
+
 interface ComparePageClientProps {
   a: string;
   b: string;
@@ -46,6 +67,7 @@ interface ComparePageClientProps {
   bVendor: string;
   aArch: string;
   bArch: string;
+  locale?: 'en' | 'zh';
 }
 
 function toModel(value: string): Model | undefined {
@@ -79,6 +101,7 @@ export default function ComparePageClient({
   bVendor,
   aArch,
   bArch,
+  locale = 'en',
 }: ComparePageClientProps) {
   useEffect(() => {
     track('compare_page_view', { gpu_a: a, gpu_b: b, default_model: defaultModel });
@@ -88,6 +111,8 @@ export default function ComparePageClient({
   const initialModel = toModel(defaultModel);
   const initialSequence = toSequence(defaultSequence);
   const initialPrecisions = toPrecisions(defaultPrecision);
+  const t = STRINGS[locale];
+  const isZh = locale === 'zh';
 
   return (
     <GlobalFilterProvider
@@ -104,20 +129,33 @@ export default function ComparePageClient({
           <Card className="flex flex-col gap-3">
             <header>
               <div className="text-xs uppercase tracking-wider text-muted-foreground">
-                {modelLabel} · GPU comparison
+                {modelLabel} · {t.eyebrowSuffix}
               </div>
               <h1 className="text-2xl lg:text-3xl font-bold tracking-tight mt-1">{label}</h1>
-              <p className="mt-2 text-sm text-muted-foreground max-w-3xl">
-                Head-to-head AI inference benchmark comparison of <strong>{aLabel}</strong> (
-                {aVendor} {aArch}) and <strong>{bLabel}</strong> ({bVendor} {bArch}) on{' '}
-                <strong>{modelLabel}</strong>. Latency, throughput, and cost across LLM workloads.
-                Use the chart controls below to switch sequences, precisions, and metrics — same
-                interactions as{' '}
-                <Link href="/" className="underline hover:text-primary">
-                  the main inference chart
-                </Link>
-                .
-              </p>
+              {isZh ? (
+                <p className="mt-2 text-sm text-muted-foreground max-w-3xl">
+                  <strong>{aLabel}</strong>（{aVendor} {aArch}）与 <strong>{bLabel}</strong>（
+                  {bVendor} {bArch}）在 <strong>{modelLabel}</strong> 上的正面 AI
+                  推理基准测试对比。涵盖各类 LLM
+                  工作负载的延迟、吞吐量与成本。使用下方图表控件切换序列、精度和指标——交互方式与
+                  <Link href="/zh" className="underline hover:text-primary">
+                    {t.mainChartLinkText}
+                  </Link>
+                  相同。
+                </p>
+              ) : (
+                <p className="mt-2 text-sm text-muted-foreground max-w-3xl">
+                  Head-to-head AI inference benchmark comparison of <strong>{aLabel}</strong> (
+                  {aVendor} {aArch}) and <strong>{bLabel}</strong> ({bVendor} {bArch}) on{' '}
+                  <strong>{modelLabel}</strong>. Latency, throughput, and cost across LLM workloads.
+                  Use the chart controls below to switch sequences, precisions, and metrics — same
+                  interactions as{' '}
+                  <Link href="/" className="underline hover:text-primary">
+                    {t.mainChartLinkText}
+                  </Link>
+                  .
+                </p>
+              )}
               {narrative.length > 0 && (
                 <div className="mt-3 flex flex-col gap-2 max-w-3xl" data-testid="compare-narrative">
                   {narrative.map((para, i) => (
@@ -127,10 +165,9 @@ export default function ComparePageClient({
                         <>
                           {' '}
                           <span className="text-muted-foreground italic">
-                            (Numbers reflect the default {defaultSequence ?? 'sequence'} ·{' '}
-                            {defaultPrecision ?? 'precision'} selection for this URL — table and
-                            chart below update if you change sequence, precision, or model in the
-                            controls.)
+                            {isZh
+                              ? `（数据反映此 URL 的默认 ${defaultSequence ?? t.caveatSeqFallback} · ${defaultPrecision ?? t.caveatPrecFallback} 选择——如果您在控件中更改序列、精度或模型，下方表格和图表会自动更新。）`
+                              : `(Numbers reflect the default ${defaultSequence ?? t.caveatSeqFallback} · ${defaultPrecision ?? t.caveatPrecFallback} selection for this URL — table and chart below update if you change sequence, precision, or model in the controls.)`}
                           </span>
                         </>
                       )}
@@ -140,11 +177,11 @@ export default function ComparePageClient({
               )}
               <p className="mt-2 text-sm">
                 <Link
-                  href={`/compare-per-dollar/${slug}`}
+                  href={isZh ? `/zh/compare-per-dollar/${slug}` : `/compare-per-dollar/${slug}`}
                   className="underline hover:text-primary text-muted-foreground"
                   onClick={() => track('compare_cross_link_to_per_dollar', { slug })}
                 >
-                  View performance-per-dollar view →
+                  {t.perDollarLinkText}
                 </Link>
               </p>
             </header>
@@ -154,6 +191,7 @@ export default function ComparePageClient({
               aLabel={aLabel}
               bLabel={bLabel}
               ssrTableData={ssrTableData}
+              emptyStateText={t.emptyState}
             />
           </Card>
           <InferenceChartDisplay />
@@ -169,12 +207,14 @@ function CompareTableSection({
   aLabel,
   bLabel,
   ssrTableData,
+  emptyStateText,
 }: {
   a: string;
   b: string;
   aLabel: string;
   bLabel: string;
   ssrTableData: SsrTableData;
+  emptyStateText: string;
 }) {
   const { effectiveSequence, effectivePrecisions, selectedRunDate, selectedModel } =
     useGlobalFilters();
@@ -206,8 +246,7 @@ function CompareTableSection({
   if (ssrTableData.defaultTargets.length === 0) {
     return (
       <div className="border border-border/50 rounded-md px-4 py-3 text-sm text-muted-foreground bg-muted/30">
-        No interpolated comparison data available for the default model. Use the chart controls
-        below to select a model with benchmark data for both GPUs.
+        {emptyStateText}
       </div>
     );
   }
diff --git a/packages/app/src/app/compare/[slug]/page.tsx b/packages/app/src/app/compare/[slug]/page.tsx
index 89296005..763b000f 100644
--- a/packages/app/src/app/compare/[slug]/page.tsx
+++ b/packages/app/src/app/compare/[slug]/page.tsx
@@ -9,6 +9,7 @@ import {
 } from '@semianalysisai/inferencex-constants';
 
 import { JsonLd } from '@/components/json-ld';
+import { languageAlternates } from '@/lib/i18n';
 import { pickPairDefaults } from '@/lib/compare-pair-defaults';
 import {
   canonicalCompareSlug,
@@ -50,7 +51,12 @@ export async function generateMetadata({ params }: Props): Promise<Metadata> {
   return {
     title: `${fullLabel} Inference Benchmark`,
     description,
-    alternates: { canonical: url },
+    alternates: {
+      canonical: url,
+      languages: languageAlternates(
+        `/compare/${canonicalCompareSlug(parsed.model.slug, parsed.a, parsed.b)}`,
+      ),
+    },
     openGraph: {
       title: `${fullLabel} | ${SITE_NAME}`,
       description,
diff --git a/packages/app/src/app/sitemap.ts b/packages/app/src/app/sitemap.ts
index 6f3ae968..fbe5d987 100644
--- a/packages/app/src/app/sitemap.ts
+++ b/packages/app/src/app/sitemap.ts
@@ -76,23 +76,24 @@ export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
       if (!zhPosts.has(post.slug)) return [{ ...entry, url: `${BASE_URL}/blog/${post.slug}` }];
       return localizedPair(`/blog/${post.slug}`, entry);
     }),
-    ...compareSlugs.map(({ modelSlug, a, b }) => ({
-      url: `${BASE_URL}/compare/${canonicalCompareSlug(modelSlug, a, b)}`,
-      lastModified: now,
-      changeFrequency: 'daily' as const,
-      priority: 0.7,
-    })),
+    ...compareSlugs.flatMap(({ modelSlug, a, b }) =>
+      localizedPair(`/compare/${canonicalCompareSlug(modelSlug, a, b)}`, {
+        lastModified: now,
+        changeFrequency: 'daily' as const,
+        priority: 0.7,
+      }),
+    ),
     // Every indexed per-dollar landing page has a stable data graphic so image
-    // crawlers discover the PNG alongside the canonical comparison URL.
-    ...compareSlugs.map(({ modelSlug, a, b }) => {
-      const url = `${BASE_URL}/compare-per-dollar/${canonicalCompareSlug(modelSlug, a, b)}`;
-      return {
-        url,
-        images: [`${url}/performance-per-dollar.png`],
+    // crawlers discover the PNG alongside the canonical comparison URL. The
+    // Chinese sibling references the same English-hosted PNG.
+    ...compareSlugs.flatMap(({ modelSlug, a, b }) => {
+      const enPath = `/compare-per-dollar/${canonicalCompareSlug(modelSlug, a, b)}`;
+      return localizedPair(enPath, {
+        images: [`${BASE_URL}${enPath}/performance-per-dollar.png`],
         lastModified: now,
         changeFrequency: 'daily' as const,
         priority: 0.7,
-      };
+      });
     }),
   ];
 }
diff --git a/packages/app/src/app/zh/blog/[slug]/page.tsx b/packages/app/src/app/zh/blog/[slug]/page.tsx
index 6caac102..85b93660 100644
--- a/packages/app/src/app/zh/blog/[slug]/page.tsx
+++ b/packages/app/src/app/zh/blog/[slug]/page.tsx
@@ -105,7 +105,7 @@ export default async function ZhBlogPostPage({ params }: Props) {
 
   const { content } = await compileMDX({
     source: raw,
-    components: createMdxComponents(),
+    components: createMdxComponents('zh'),
     options: {
       mdxOptions: {
         remarkPlugins: [remarkGfm],
diff --git a/packages/app/src/app/zh/compare-per-dollar/[slug]/opengraph-image.tsx b/packages/app/src/app/zh/compare-per-dollar/[slug]/opengraph-image.tsx
new file mode 100644
index 00000000..0f1b62f3
--- /dev/null
+++ b/packages/app/src/app/zh/compare-per-dollar/[slug]/opengraph-image.tsx
@@ -0,0 +1,7 @@
+export {
+  default,
+  alt,
+  size,
+  contentType,
+  generateStaticParams,
+} from '../../../compare-per-dollar/[slug]/opengraph-image';
diff --git a/packages/app/src/app/zh/compare-per-dollar/[slug]/page.tsx b/packages/app/src/app/zh/compare-per-dollar/[slug]/page.tsx
new file mode 100644
index 00000000..ae2b19e1
--- /dev/null
+++ b/packages/app/src/app/zh/compare-per-dollar/[slug]/page.tsx
@@ -0,0 +1,185 @@
+import type { Metadata } from 'next';
+import { notFound, permanentRedirect } from 'next/navigation';
+
+import {
+  HW_REGISTRY,
+  SITE_NAME,
+  SITE_URL,
+  SUPPORTERS_LINE_ZH,
+} from '@semianalysisai/inferencex-constants';
+
+import { JsonLd } from '@/components/json-ld';
+import { pickPairDefaults } from '@/lib/compare-pair-defaults';
+import {
+  canonicalCompareSlug,
+  compareDisplayLabel,
+  compareModelDisplayLabel,
+  parseCompareSlug,
+} from '@/lib/compare-slug';
+import { getGpuSpecs } from '@/lib/constants';
+import {
+  computeCompareTableData,
+  dateRangeForPair,
+  getCachedBenchmarks,
+  KNOWN_MODELS,
+  KNOWN_PRECISIONS,
+  KNOWN_SEQUENCES,
+  pickString,
+  summarize,
+} from '@/lib/compare-ssr';
+import {
+  buildBreadcrumbJsonLdZh,
+  buildJsonLdZh,
+  compareTableNarrativeZh,
+} from '@/lib/compare-ssr-zh';
+import { ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+
+import ComparePerDollarPageClient from '../../../compare-per-dollar/[slug]/page-client';
+
+export const dynamic = 'force-dynamic';
+
+interface Props {
+  params: Promise<{ slug: string }>;
+  searchParams: Promise<Record<string, string | string[] | undefined>>;
+}
+
+export async function generateMetadata({ params }: Props): Promise<Metadata> {
+  const { slug } = await params;
+  const parsed = parseCompareSlug(slug);
+  if (!parsed) return {};
+  const fullLabel = compareModelDisplayLabel(parsed.model, parsed.a, parsed.b);
+  const gpuLabel = compareDisplayLabel(parsed.a, parsed.b);
+  const canonical = canonicalCompareSlug(parsed.model.slug, parsed.a, parsed.b);
+  const url = `${SITE_URL}/zh/compare-per-dollar/${canonical}`;
+  const description = `${gpuLabel} 在 ${parsed.model.label} 上的每美元性能：来自 InferenceX（SemiAnalysis 推出的独立开源基准测试平台）的经验证、可复现的每百万 token 成本结果，基于云服务商 TCO 归一化。${SUPPORTERS_LINE_ZH}查看哪款 GPU 在各交互性水平下更经济。`;
+  return {
+    title: `${fullLabel} — 每美元性能`,
+    description,
+    alternates: zhAlternates(`/compare-per-dollar/${canonical}`),
+    openGraph: {
+      title: `${fullLabel} — 每美元性能 | ${SITE_NAME}`,
+      description,
+      url,
+      type: 'website',
+      locale: ZH_OG_LOCALE,
+    },
+    twitter: {
+      card: 'summary_large_image',
+      title: `${fullLabel} — 每美元性能`,
+      description,
+    },
+  };
+}
+
+export default async function ComparePerDollarPageZh({ params, searchParams }: Props) {
+  const { slug } = await params;
+  const parsed = parseCompareSlug(slug);
+  if (!parsed) notFound();
+
+  const sp = await searchParams;
+
+  const canonical = canonicalCompareSlug(parsed.model.slug, parsed.a, parsed.b);
+  if (canonical !== slug.toLowerCase()) {
+    const qs = Object.entries(sp)
+      .flatMap(([k, v]) => {
+        if (Array.isArray(v)) return v.map((vv) => [k, vv] as const);
+        if (v === undefined) return [];
+        return [[k, v] as const];
+      })
+      .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
+      .join('&');
+    permanentRedirect(`/zh/compare-per-dollar/${canonical}${qs ? `?${qs}` : ''}`);
+  }
+
+  const rows = await getCachedBenchmarks(parsed.model.dbKeys);
+  const summaryA = summarize(rows, parsed.a);
+  const summaryB = summarize(rows, parsed.b);
+  const { sequence: pickedSequence, precision: pickedPrecision } = pickPairDefaults(
+    rows,
+    parsed.a,
+    parsed.b,
+  );
+
+  const urlSeq = pickString(sp.i_seq);
+  const urlPrec = pickString(sp.i_prec);
+  const urlModel = pickString(sp.g_model);
+  const effectiveSequence = urlSeq && KNOWN_SEQUENCES.has(urlSeq) ? urlSeq : pickedSequence;
+  const effectivePrecision = urlPrec && KNOWN_PRECISIONS.has(urlPrec) ? urlPrec : pickedPrecision;
+  const effectiveModel =
+    urlModel && KNOWN_MODELS.has(urlModel) ? urlModel : parsed.model.displayName;
+
+  const { defaultTargets, ssrRows, interactivityRange } = computeCompareTableData(
+    rows,
+    parsed.a,
+    parsed.b,
+    effectiveSequence,
+    effectivePrecision,
+  );
+
+  const url = `${SITE_URL}/zh/compare-per-dollar/${canonical}`;
+  const imageUrl = `${SITE_URL}/compare-per-dollar/${canonical}/performance-per-dollar.png`;
+  const { oldest, newest } = dateRangeForPair(rows, parsed.a, parsed.b);
+  const jsonLd = buildJsonLdZh(
+    'per-dollar',
+    parsed.model,
+    parsed.a,
+    parsed.b,
+    url,
+    summaryA,
+    summaryB,
+    ssrRows,
+    imageUrl,
+    oldest,
+    newest,
+    parsed.model.displayName,
+  );
+  const breadcrumbJsonLd = buildBreadcrumbJsonLdZh(
+    'per-dollar',
+    compareModelDisplayLabel(parsed.model, parsed.a, parsed.b),
+    url,
+  );
+  const label = compareModelDisplayLabel(parsed.model, parsed.a, parsed.b);
+  const aMeta = HW_REGISTRY[parsed.a];
+  const bMeta = HW_REGISTRY[parsed.b];
+  const aLabel = aMeta?.label ?? parsed.a.toUpperCase();
+  const bLabel = bMeta?.label ?? parsed.b.toUpperCase();
+  const narrative = compareTableNarrativeZh(
+    'per-dollar',
+    parsed.model.label,
+    aLabel,
+    bLabel,
+    ssrRows,
+    interactivityRange,
+  );
+  const aCostPerGpuHr = getGpuSpecs(parsed.a).costh;
+  const bCostPerGpuHr = getGpuSpecs(parsed.b).costh;
+
+  return (
+    <>
+      <JsonLd data={jsonLd} />
+      <JsonLd data={breadcrumbJsonLd} />
+      <ComparePerDollarPageClient
+        a={parsed.a}
+        b={parsed.b}
+        slug={canonical}
+        label={label}
+        modelLabel={parsed.model.label}
+        defaultModel={effectiveModel}
+        defaultSequence={effectiveSequence}
+        defaultPrecision={effectivePrecision}
+        ssrTableData={{ defaultTargets, ssrRows, interactivityRange }}
+        narrative={narrative}
+        aLabel={aLabel}
+        bLabel={bLabel}
+        aVendor={aMeta?.vendor ?? ''}
+        bVendor={bMeta?.vendor ?? ''}
+        aArch={aMeta?.arch ?? ''}
+        bArch={bMeta?.arch ?? ''}
+        aCostPerGpuHr={aCostPerGpuHr}
+        bCostPerGpuHr={bCostPerGpuHr}
+        heroImageSrc={`/compare-per-dollar/${canonical}/performance-per-dollar.png`}
+        locale="zh"
+      />
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/compare-per-dollar/page.tsx b/packages/app/src/app/zh/compare-per-dollar/page.tsx
index 32d5544e..6014dfbb 100644
--- a/packages/app/src/app/zh/compare-per-dollar/page.tsx
+++ b/packages/app/src/app/zh/compare-per-dollar/page.tsx
@@ -130,7 +130,7 @@ export default async function ComparePerDollarIndexPageZh() {
                       return (
                         <ComparePairCardLink
                           key={slug}
-                          href={`/compare-per-dollar/${slug}`}
+                          href={`/zh/compare-per-dollar/${slug}`}
                           slug={slug}
                           label={label}
                           archLine={archLine}
diff --git a/packages/app/src/app/zh/compare/[slug]/opengraph-image.tsx b/packages/app/src/app/zh/compare/[slug]/opengraph-image.tsx
new file mode 100644
index 00000000..5f3d0707
--- /dev/null
+++ b/packages/app/src/app/zh/compare/[slug]/opengraph-image.tsx
@@ -0,0 +1,7 @@
+export {
+  default,
+  alt,
+  size,
+  contentType,
+  generateStaticParams,
+} from '../../../compare/[slug]/opengraph-image';
diff --git a/packages/app/src/app/zh/compare/[slug]/page.tsx b/packages/app/src/app/zh/compare/[slug]/page.tsx
new file mode 100644
index 00000000..4cd11d03
--- /dev/null
+++ b/packages/app/src/app/zh/compare/[slug]/page.tsx
@@ -0,0 +1,178 @@
+import type { Metadata } from 'next';
+import { notFound, permanentRedirect } from 'next/navigation';
+
+import {
+  HW_REGISTRY,
+  SITE_NAME,
+  SITE_URL,
+  SUPPORTERS_LINE_ZH,
+} from '@semianalysisai/inferencex-constants';
+
+import { JsonLd } from '@/components/json-ld';
+import { pickPairDefaults } from '@/lib/compare-pair-defaults';
+import {
+  canonicalCompareSlug,
+  compareDisplayLabel,
+  compareModelDisplayLabel,
+  parseCompareSlug,
+} from '@/lib/compare-slug';
+import {
+  computeCompareTableData,
+  dateRangeForPair,
+  getCachedBenchmarks,
+  KNOWN_MODELS,
+  KNOWN_PRECISIONS,
+  KNOWN_SEQUENCES,
+  pickString,
+  summarize,
+} from '@/lib/compare-ssr';
+import {
+  buildBreadcrumbJsonLdZh,
+  buildJsonLdZh,
+  compareTableNarrativeZh,
+} from '@/lib/compare-ssr-zh';
+import { ZH_OG_LOCALE, zhAlternates } from '@/lib/i18n';
+
+import ComparePageClient from '../../../compare/[slug]/page-client';
+
+export const dynamic = 'force-dynamic';
+
+interface Props {
+  params: Promise<{ slug: string }>;
+  searchParams: Promise<Record<string, string | string[] | undefined>>;
+}
+
+export async function generateMetadata({ params }: Props): Promise<Metadata> {
+  const { slug } = await params;
+  const parsed = parseCompareSlug(slug);
+  if (!parsed) return {};
+  const fullLabel = compareModelDisplayLabel(parsed.model, parsed.a, parsed.b);
+  const gpuLabel = compareDisplayLabel(parsed.a, parsed.b);
+  const canonical = canonicalCompareSlug(parsed.model.slug, parsed.a, parsed.b);
+  const url = `${SITE_URL}/zh/compare/${canonical}`;
+  const description = `${gpuLabel} 在 ${parsed.model.label} 上的推理基准测试：来自 InferenceX（SemiAnalysis 推出的独立开源 GPU 基准测试平台）的经验证、可复现的正面对比结果。${SUPPORTERS_LINE_ZH}对比延迟、吞吐量与成本。`;
+  return {
+    title: `${fullLabel} 推理基准测试`,
+    description,
+    alternates: zhAlternates(`/compare/${canonical}`),
+    openGraph: {
+      title: `${fullLabel} | ${SITE_NAME}`,
+      description,
+      url,
+      type: 'website',
+      locale: ZH_OG_LOCALE,
+    },
+    twitter: {
+      card: 'summary_large_image',
+      title: `${fullLabel} 推理基准测试`,
+      description,
+    },
+  };
+}
+
+export default async function ComparePageZh({ params, searchParams }: Props) {
+  const { slug } = await params;
+  const parsed = parseCompareSlug(slug);
+  if (!parsed) notFound();
+
+  const sp = await searchParams;
+
+  const canonical = canonicalCompareSlug(parsed.model.slug, parsed.a, parsed.b);
+  if (canonical !== slug.toLowerCase()) {
+    const qs = Object.entries(sp)
+      .flatMap(([k, v]) => {
+        if (Array.isArray(v)) return v.map((vv) => [k, vv] as const);
+        if (v === undefined) return [];
+        return [[k, v] as const];
+      })
+      .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
+      .join('&');
+    permanentRedirect(`/zh/compare/${canonical}${qs ? `?${qs}` : ''}`);
+  }
+
+  const rows = await getCachedBenchmarks(parsed.model.dbKeys);
+  const summaryA = summarize(rows, parsed.a);
+  const summaryB = summarize(rows, parsed.b);
+  const { sequence: pickedSequence, precision: pickedPrecision } = pickPairDefaults(
+    rows,
+    parsed.a,
+    parsed.b,
+  );
+
+  const urlSeq = pickString(sp.i_seq);
+  const urlPrec = pickString(sp.i_prec);
+  const urlModel = pickString(sp.g_model);
+  const effectiveSequence = urlSeq && KNOWN_SEQUENCES.has(urlSeq) ? urlSeq : pickedSequence;
+  const effectivePrecision = urlPrec && KNOWN_PRECISIONS.has(urlPrec) ? urlPrec : pickedPrecision;
+  const effectiveModel =
+    urlModel && KNOWN_MODELS.has(urlModel) ? urlModel : parsed.model.displayName;
+
+  const { defaultTargets, ssrRows, interactivityRange } = computeCompareTableData(
+    rows,
+    parsed.a,
+    parsed.b,
+    effectiveSequence,
+    effectivePrecision,
+  );
+
+  const url = `${SITE_URL}/zh/compare/${canonical}`;
+  const { oldest, newest } = dateRangeForPair(rows, parsed.a, parsed.b);
+  const jsonLd = buildJsonLdZh(
+    'full',
+    parsed.model,
+    parsed.a,
+    parsed.b,
+    url,
+    summaryA,
+    summaryB,
+    ssrRows,
+    undefined,
+    oldest,
+    newest,
+    parsed.model.displayName,
+  );
+  const breadcrumbJsonLd = buildBreadcrumbJsonLdZh(
+    'full',
+    compareModelDisplayLabel(parsed.model, parsed.a, parsed.b),
+    url,
+  );
+  const label = compareModelDisplayLabel(parsed.model, parsed.a, parsed.b);
+  const aMeta = HW_REGISTRY[parsed.a];
+  const bMeta = HW_REGISTRY[parsed.b];
+  const aLabel = aMeta?.label ?? parsed.a.toUpperCase();
+  const bLabel = bMeta?.label ?? parsed.b.toUpperCase();
+  const narrative = compareTableNarrativeZh(
+    'full',
+    parsed.model.label,
+    aLabel,
+    bLabel,
+    ssrRows,
+    interactivityRange,
+  );
+
+  return (
+    <>
+      <JsonLd data={jsonLd} />
+      <JsonLd data={breadcrumbJsonLd} />
+      <ComparePageClient
+        a={parsed.a}
+        b={parsed.b}
+        slug={canonical}
+        label={label}
+        modelLabel={parsed.model.label}
+        defaultModel={effectiveModel}
+        defaultSequence={effectiveSequence}
+        defaultPrecision={effectivePrecision}
+        ssrTableData={{ defaultTargets, ssrRows, interactivityRange }}
+        narrative={narrative}
+        aLabel={aLabel}
+        bLabel={bLabel}
+        aVendor={aMeta?.vendor ?? ''}
+        bVendor={bMeta?.vendor ?? ''}
+        aArch={aMeta?.arch ?? ''}
+        bArch={bMeta?.arch ?? ''}
+        locale="zh"
+      />
+    </>
+  );
+}
diff --git a/packages/app/src/app/zh/compare/page.tsx b/packages/app/src/app/zh/compare/page.tsx
index a3640a4c..67867fcc 100644
--- a/packages/app/src/app/zh/compare/page.tsx
+++ b/packages/app/src/app/zh/compare/page.tsx
@@ -142,7 +142,7 @@ export default async function CompareIndexPageZh() {
                       return (
                         <ComparePairCardLink
                           key={slug}
-                          href={`/compare/${slug}`}
+                          href={`/zh/compare/${slug}`}
                           slug={slug}
                           label={label}
                           archLine={archLine}
diff --git a/packages/app/src/components/blog/mdx-components.tsx b/packages/app/src/components/blog/mdx-components.tsx
index 9abc1d6d..a77c7fe9 100644
--- a/packages/app/src/components/blog/mdx-components.tsx
+++ b/packages/app/src/components/blog/mdx-components.tsx
@@ -2,6 +2,7 @@ import type { ReactNode } from 'react';
 import Image from 'next/image';
 import Link from 'next/link';
 import { slugify } from '@/lib/blog';
+import type { Locale } from '@/lib/i18n';
 import { HeadingLink } from '@/components/blog/heading-link';
 import { JsonLd } from '@/components/json-ld';
 
@@ -50,7 +51,9 @@ function Blur(props: { children?: ReactNode }) {
 }
 
 /** Creates a fresh set of MDX components with clean heading dedup state per render. */
-export function createMdxComponents(): Record<string, React.ComponentType<any>> {
+export function createMdxComponents(
+  locale: Locale = 'en',
+): Record<string, React.ComponentType<any>> {
   const seen = new Set<string>();
   const parents: string[] = [];
   let figureCount = 0;
@@ -159,7 +162,13 @@ export function createMdxComponents(): Record<string, React.ComponentType<any>>
     ),
     Blur,
     DashboardCTA: (props: { href?: string; children?: ReactNode }) => {
-      const href = props.href ?? 'https://inferencex.semianalysis.com';
+      const defaultHref =
+        locale === 'zh'
+          ? 'https://inferencex.semianalysis.com/zh'
+          : 'https://inferencex.semianalysis.com';
+      const defaultLabel =
+        locale === 'zh' ? '查看完整 InferenceX 仪表板' : 'See full InferenceX Dashboard';
+      const href = props.href ?? defaultHref;
       return (
         <div className="my-6 flex justify-center">
           <a
@@ -168,7 +177,7 @@ export function createMdxComponents(): Record<string, React.ComponentType<any>>
             rel="noopener noreferrer"
             className="inline-flex items-center gap-2 rounded-md bg-brand px-4 py-0 text-sm font-medium text-primary-foreground shadow-sm transition-colors hover:bg-brand/90"
           >
-            {props.children ?? 'See full InferenceX Dashboard'}
+            {props.children ?? defaultLabel}
           </a>
         </div>
       );
diff --git a/packages/app/src/components/calculator/CalculatorTable.tsx b/packages/app/src/components/calculator/CalculatorTable.tsx
index 1b700f08..b2cc9fd3 100644
--- a/packages/app/src/components/calculator/CalculatorTable.tsx
+++ b/packages/app/src/components/calculator/CalculatorTable.tsx
@@ -9,6 +9,7 @@ import {
 } from '@/components/calculator/ThroughputBarChart';
 import { type DataTableColumn, DataTable } from '@/components/ui/data-table';
 import type { HardwareConfig } from '@/components/inference/types';
+import { useLocale } from '@/lib/use-locale';
 import { getDisplayLabel } from '@/lib/utils';
 
 interface CalculatorTableProps {
@@ -29,13 +30,43 @@ function getCost(r: InterpolatedResult, costType: CostType): number {
   return r.cost;
 }
 
+const STRINGS = {
+  en: {
+    throughputTotal: 'Total',
+    throughputInput: 'Input',
+    throughputOutput: 'Output',
+    throughputSuffix: ' Throughput (tok/s/gpu)',
+    costPrefix: 'Cost (',
+    costSuffix: ')',
+    concurrency: 'Concurrency',
+    footer:
+      'Values are interpolated from real InferenceMAX benchmark data points. Only GPUs with data in the measured range are shown.',
+  },
+  zh: {
+    throughputTotal: '总',
+    throughputInput: '输入',
+    throughputOutput: '输出',
+    throughputSuffix: '吞吐量 (tok/s/gpu)',
+    costPrefix: '成本 (',
+    costSuffix: ')',
+    concurrency: '并发数',
+    footer: '数值基于真实 InferenceMAX 基准测试数据插值计算。仅显示在测量范围内有数据的 GPU。',
+  },
+} as const;
+
 export default function CalculatorTable({
   results,
   costType,
   hardwareConfig,
 }: CalculatorTableProps) {
+  const locale = useLocale();
+  const s = STRINGS[locale];
   const throughputLabel =
-    costType === 'input' ? 'Input' : costType === 'output' ? 'Output' : 'Total';
+    costType === 'input'
+      ? s.throughputInput
+      : costType === 'output'
+        ? s.throughputOutput
+        : s.throughputTotal;
   const costLabel = `$/M ${costType === 'input' ? 'input ' : costType === 'output' ? 'output ' : ''}tok`;
   const mwLabel =
     costType === 'input'
@@ -53,14 +84,14 @@ export default function CalculatorTable({
         className: 'font-medium whitespace-nowrap',
       },
       {
-        header: `${throughputLabel} Throughput (tok/s/gpu)`,
+        header: `${throughputLabel}${s.throughputSuffix}`,
         align: 'right',
         cell: (r) => getThroughputForType(r, costType).toFixed(1),
         sortValue: (r) => getThroughputForType(r, costType),
         className: 'tabular-nums',
       },
       {
-        header: `Cost (${costLabel})`,
+        header: `${s.costPrefix}${costLabel}${s.costSuffix}`,
         align: 'right',
         cell: (r) => `$${getCost(r, costType).toFixed(3)}`,
         sortValue: (r) => getCost(r, costType),
@@ -74,14 +105,14 @@ export default function CalculatorTable({
         className: 'tabular-nums',
       },
       {
-        header: 'Concurrency',
+        header: s.concurrency,
         align: 'right',
         cell: (r) => `~${r.concurrency}`,
         sortValue: (r) => r.concurrency,
         className: 'tabular-nums',
       },
     ],
-    [costType, hardwareConfig, throughputLabel, costLabel, mwLabel],
+    [costType, hardwareConfig, throughputLabel, costLabel, mwLabel, s],
   );
 
   return (
@@ -92,10 +123,7 @@ export default function CalculatorTable({
         testId="calculator-results-table"
         analyticsPrefix="calculator_table"
       />
-      <p className="text-xs text-muted-foreground mt-3">
-        Values are interpolated from real InferenceMAX benchmark data points. Only GPUs with data in
-        the measured range are shown.
-      </p>
+      <p className="text-xs text-muted-foreground mt-3">{s.footer}</p>
     </>
   );
 }
diff --git a/packages/app/src/components/calculator/ThroughputBarChart.tsx b/packages/app/src/components/calculator/ThroughputBarChart.tsx
index c42ed063..de3c145b 100644
--- a/packages/app/src/components/calculator/ThroughputBarChart.tsx
+++ b/packages/app/src/components/calculator/ThroughputBarChart.tsx
@@ -3,6 +3,7 @@
 import { track } from '@/lib/analytics';
 import * as d3 from 'd3';
 import { useEffect, useMemo, useRef } from 'react';
+import { useLocale } from '@/lib/use-locale';
 
 import type { HardwareConfig } from '@/components/inference/types';
 import { getHardwareConfig } from '@/lib/constants';
@@ -554,14 +555,17 @@ export default function ThroughputBarChart({
     applySelectionOpacities(svg as any, selectedBars);
   }, [selectedBars]);
 
+  const locale = useLocale();
+
   if (results.length === 0) {
     return (
       <div
         className="flex items-center justify-center h-64 text-muted-foreground"
         data-testid="calculator-no-data"
       >
-        No data available for the current selection. Try adjusting the model, sequence, or
-        precision.
+        {locale === 'zh'
+          ? '当前选择无可用数据。请尝试调整模型、序列长度或精度。'
+          : 'No data available for the current selection. Try adjusting the model, sequence, or precision.'}
       </div>
     );
   }
diff --git a/packages/app/src/components/calculator/ThroughputCalculatorDisplay.tsx b/packages/app/src/components/calculator/ThroughputCalculatorDisplay.tsx
index 6c9a0ef0..f3fdd3f4 100644
--- a/packages/app/src/components/calculator/ThroughputCalculatorDisplay.tsx
+++ b/packages/app/src/components/calculator/ThroughputCalculatorDisplay.tsx
@@ -3,6 +3,7 @@
 import { track } from '@/lib/analytics';
 import Link from 'next/link';
 import { BarChart3, Table2 } from 'lucide-react';
+import { useLocale } from '@/lib/use-locale';
 import { useCallback, useEffect, useMemo, useRef, useState } from 'react';
 
 import CalculatorTable from '@/components/calculator/CalculatorTable';
@@ -44,10 +45,17 @@ import { calculatorChartToCsv } from '@/lib/csv-export-helpers';
 
 import ThroughputBarChart, {
   getChartTitle,
+  getCostProviderLabel,
   getThroughputForType,
   getTpPerMwForType,
 } from './ThroughputBarChart';
-import type { BarMetric, CostProvider, CostType, InterpolatedResult } from './types';
+import type {
+  BarMetric,
+  CalculatorMode,
+  CostProvider,
+  CostType,
+  InterpolatedResult,
+} from './types';
 import { useThroughputData } from './useThroughputData';
 
 const COST_PROVIDER_OPTIONS: { value: CostProvider; label: string }[] = [
@@ -68,12 +76,6 @@ const BAR_METRIC_OPTIONS: { value: BarMetric; label: string }[] = [
   { value: 'cost', label: 'Cost' },
 ];
 
-const getBarMetricLabel = (metric: BarMetric) => {
-  if (metric === 'throughput') return 'Throughput';
-  if (metric === 'cost') return 'Cost';
-  return 'tok/s/MW';
-};
-
 type CalculatorViewMode = 'chart' | 'table';
 
 const CALCULATOR_VIEW_MODE_OPTIONS: SegmentedToggleOption<CalculatorViewMode>[] = [
@@ -94,6 +96,124 @@ const CALCULATOR_VIEW_MODE_OPTIONS: SegmentedToggleOption<CalculatorViewMode>[]
 const CALCULATOR_MOBILE_VIEW_MODE_OPTIONS: SegmentedToggleOption<CalculatorViewMode>[] =
   CALCULATOR_VIEW_MODE_OPTIONS.map(({ testId: _testId, ...option }) => option);
 
+const STRINGS = {
+  en: {
+    title: 'TCO Calculator',
+    description:
+      'Set a target interactivity (tokens/sec/user) and compare the throughput and cost across all GPUs. Values are interpolated from real benchmark data.',
+    costProviderLabel: 'Cost Provider',
+    costProviderTooltip:
+      'The pricing tier used to calculate cost per million tokens. Hyperscaler (e.g. AWS/GCP), Neocloud (e.g. CoreWeave), or 3-year rental.',
+    costProviderPlaceholder: 'Cost provider',
+    tokenTypeLabel: 'Token Type',
+    tokenTypeTooltip:
+      'Whether to show costs for total tokens, input tokens only, or output tokens only.',
+    tokenTypePlaceholder: 'Token type',
+    metricLabel: 'Metric',
+    metricTooltip:
+      'The comparison metric shown in the chart. Throughput (tok/s/gpu), power efficiency (tok/s/MW), or cost per million tokens.',
+    targetLabel: 'Target Interactivity (tok/s/user)',
+    targetTooltip:
+      'The interactivity operating point used for interpolation. Adjust the slider to compare GPU throughput, cost, and power efficiency at different interactivity levels.',
+    metricThroughput: 'Throughput',
+    metricCost: 'Cost',
+    viewChart: 'Chart',
+    viewTable: 'Table',
+    viewModeAria: 'View mode',
+    errorLoading: 'Error loading data. Please try a different selection.',
+    clickToCompare: 'selected. Click another bar to compare.',
+    clearSelection: 'Clear selection',
+    highContrast: 'High Contrast',
+    resetFilter: 'Reset filter',
+    totalTokens: 'Total Tokens',
+    inputTokens: 'Input Tokens',
+    outputTokens: 'Output Tokens',
+    allInPower: 'All in Power/GPU: ',
+    tcoPerHr: 'TCO $/GPU/hr: ',
+    source: 'Source: ',
+    updated: ' • Updated: ',
+    note: 'Note:',
+    disaggCost:
+      ' Disaggregated inference configurations (e.g., MoRI SGLang, Dynamo TRTLLM) calculate cost per decode GPU or per prefill GPU, rather than per total GPU count. This makes direct cost comparison with aggregated configs not an apples-to-apples comparison.',
+    disaggThroughput:
+      ' Disaggregated inference configurations (e.g., MoRI SGLang, Dynamo TRTLLM) calculate throughput per decode GPU or per prefill GPU, rather than per total GPU count. This makes direct throughput comparison with aggregated configs not an apples-to-apples comparison.',
+    compMetricThroughput: 'throughput',
+    compMetricCost: 'cost efficiency',
+    compMetricPower: 'tok/s/MW',
+  },
+  zh: {
+    title: 'TCO 计算器',
+    description:
+      '设定目标交互性（tokens/sec/user），比较所有 GPU 的吞吐量和成本。数值基于真实基准测试数据插值计算。',
+    costProviderLabel: '成本供应商',
+    costProviderTooltip:
+      '用于计算每百万 token 成本的定价层级。Hyperscaler（如 AWS/GCP）、Neocloud（如 CoreWeave）或 3 年租赁。',
+    costProviderPlaceholder: '成本供应商',
+    tokenTypeLabel: 'Token 类型',
+    tokenTypeTooltip: '选择显示总 token、仅输入 token 还是仅输出 token 的成本。',
+    tokenTypePlaceholder: 'Token 类型',
+    metricLabel: '指标',
+    metricTooltip:
+      '图表中显示的比较指标。吞吐量（tok/s/gpu）、能效（tok/s/MW）或每百万 token 成本。',
+    targetLabel: '目标交互性 (tok/s/user)',
+    targetTooltip:
+      '用于插值的交互性操作点。调整滑块以比较不同交互性级别下 GPU 的吞吐量、成本和能效。',
+    metricThroughput: '吞吐量',
+    metricCost: '成本',
+    viewChart: '图表',
+    viewTable: '表格',
+    viewModeAria: '视图模式',
+    errorLoading: '加载数据出错，请尝试其他选择。',
+    clickToCompare: '已选中。点击另一个柱条进行对比。',
+    clearSelection: '清除选择',
+    highContrast: '高对比度',
+    resetFilter: '重置筛选',
+    totalTokens: '总 Token',
+    inputTokens: '输入 Token',
+    outputTokens: '输出 Token',
+    allInPower: '全含功率/GPU：',
+    tcoPerHr: 'TCO $/GPU/hr：',
+    source: '来源：',
+    updated: ' • 更新于：',
+    note: '注意：',
+    disaggCost:
+      '解耦推理配置（如 MoRI SGLang、Dynamo TRTLLM）按解码 GPU 或预填充 GPU 计算成本，而非按 GPU 总数。因此与聚合配置的直接成本对比并非同类比较。',
+    disaggThroughput:
+      '解耦推理配置（如 MoRI SGLang、Dynamo TRTLLM）按解码 GPU 或预填充 GPU 计算吞吐量，而非按 GPU 总数。因此与聚合配置的直接吞吐量对比并非同类比较。',
+    compMetricThroughput: '吞吐量',
+    compMetricCost: '成本效率',
+    compMetricPower: 'tok/s/MW',
+  },
+} as const;
+
+function getChartTitleZh(
+  barMetric: BarMetric,
+  mode: CalculatorMode,
+  targetValue: number,
+  costType: CostType,
+  costProvider?: CostProvider,
+): string {
+  const targetLabel =
+    mode === 'interactivity_to_throughput'
+      ? `${targetValue} tok/s/user 交互性`
+      : `${targetValue} tok/s/gpu 吞吐量`;
+  const tokenTypeLabel = costType === 'input' ? '输入' : costType === 'output' ? '输出' : '总';
+  switch (barMetric) {
+    case 'power': {
+      return `${targetLabel}下每满配兆瓦${tokenTypeLabel} token 数`;
+    }
+    case 'cost': {
+      const providerLabel = getCostProviderLabel(costProvider || 'costh');
+      return `${targetLabel}下每百万${tokenTypeLabel} token 成本（${providerLabel}）`;
+    }
+    default: {
+      return mode === 'interactivity_to_throughput'
+        ? `${targetLabel}下每 GPU ${tokenTypeLabel} token 吞吐量`
+        : `${targetLabel}下的交互性`;
+    }
+  }
+}
+
 export default function ThroughputCalculatorDisplay({ urlSeed }: { urlSeed?: CalculatorUrlSeed }) {
   if (urlSeed && (urlSeed.model || urlSeed.sequence || urlSeed.precisions)) {
     return (
@@ -110,6 +230,8 @@ export default function ThroughputCalculatorDisplay({ urlSeed }: { urlSeed?: Cal
 }
 
 function ThroughputCalculatorInner() {
+  const locale = useLocale();
+  const t = STRINGS[locale];
   const [openDropdown, setOpenDropdown] = useState<string | null>(null);
   const handleDropdownOpenChange = (dropdownKey: string) => (isOpen: boolean) => {
     if (isOpen) {
@@ -145,6 +267,24 @@ function ThroughputCalculatorInner() {
   const [highContrast, setHighContrast] = useState(false);
   const [viewMode, setViewMode] = useState<CalculatorViewMode>('chart');
 
+  const costTypeLabels: Record<CostType, string> = useMemo(
+    () => ({ total: t.totalTokens, input: t.inputTokens, output: t.outputTokens }),
+    [t],
+  );
+
+  const viewModeOptions = useMemo<SegmentedToggleOption<CalculatorViewMode>[]>(() => {
+    if (locale === 'en') return CALCULATOR_VIEW_MODE_OPTIONS;
+    return CALCULATOR_VIEW_MODE_OPTIONS.map((opt) => ({
+      ...opt,
+      label: opt.value === 'chart' ? t.viewChart : t.viewTable,
+    }));
+  }, [locale, t]);
+
+  const mobileViewModeOptions = useMemo<SegmentedToggleOption<CalculatorViewMode>[]>(() => {
+    if (locale === 'en') return CALCULATOR_MOBILE_VIEW_MODE_OPTIONS;
+    return viewModeOptions.map(({ testId: _testId, ...opt }) => opt);
+  }, [locale, viewModeOptions]);
+
   const { hardwareConfig, ranges, getResults, loading, error, hasData, availableHwKeys } =
     useThroughputData(selectedModel, selectedSequence, selectedPrecisions, selectedRunDate);
 
@@ -348,7 +488,11 @@ function ThroughputCalculatorInner() {
     };
 
     const metricName =
-      barMetric === 'power' ? 'tok/s/MW' : barMetric === 'cost' ? 'cost efficiency' : 'throughput';
+      barMetric === 'power'
+        ? t.compMetricPower
+        : barMetric === 'cost'
+          ? t.compMetricCost
+          : t.compMetricThroughput;
 
     // Generate pairwise comparisons — always use lower as denominator
     const comparisons: string[] = [];
@@ -384,15 +528,21 @@ function ThroughputCalculatorInner() {
 
         if (lowerVal > 0) {
           const ratio = higherVal / lowerVal;
-          comparisons.push(
-            `${getLabel(higher)} is ${ratio.toFixed(1)}x more ${metricName} than ${getLabel(lower)}`,
-          );
+          if (locale === 'zh') {
+            comparisons.push(
+              `${getLabel(higher)} 的${metricName}是 ${getLabel(lower)} 的 ${ratio.toFixed(1)} 倍`,
+            );
+          } else {
+            comparisons.push(
+              `${getLabel(higher)} is ${ratio.toFixed(1)}x more ${metricName} than ${getLabel(lower)}`,
+            );
+          }
         }
       }
     }
 
     return comparisons;
-  }, [selectedBars, results, hardwareConfig, barMetric, costType, mode]);
+  }, [selectedBars, results, hardwareConfig, barMetric, costType, mode, locale, t]);
 
   // Build legend items for ChartLegend sidebar, sorted by MODEL_ORDER (same as Inference Performance tab)
   const legendItems = useMemo(() => {
@@ -416,7 +566,7 @@ function ThroughputCalculatorInner() {
     return (
       <Card>
         <div className="flex items-center justify-center h-64 text-muted-foreground">
-          Error loading data. Please try a different selection.
+          {t.errorLoading}
         </div>
       </Card>
     );
@@ -429,11 +579,8 @@ function ThroughputCalculatorInner() {
           <div className="flex flex-col gap-4">
             <div className="flex items-start justify-between">
               <div>
-                <h2 className="text-lg font-semibold mb-2">TCO Calculator</h2>
-                <p className="text-muted-foreground text-sm mb-4">
-                  Set a target interactivity (tokens/sec/user) and compare the throughput and cost
-                  across all GPUs. Values are interpolated from real benchmark data.
-                </p>
+                <h2 className="text-lg font-semibold mb-2">{t.title}</h2>
+                <p className="text-muted-foreground text-sm mb-4">{t.description}</p>
               </div>
               <ChartShareActions />
             </div>
@@ -472,8 +619,8 @@ function ThroughputCalculatorInner() {
                 <div className="flex flex-col space-y-1.5 lg:col-span-1">
                   <LabelWithTooltip
                     htmlFor="calc-cost"
-                    label="Cost Provider"
-                    tooltip="The pricing tier used to calculate cost per million tokens. Hyperscaler (e.g. AWS/GCP), Neocloud (e.g. CoreWeave), or 3-year rental."
+                    label={t.costProviderLabel}
+                    tooltip={t.costProviderTooltip}
                   />
                   <div id="calc-cost" data-testid="calc-cost-selector">
                     <MultiSelect
@@ -489,7 +636,7 @@ function ThroughputCalculatorInner() {
                       }}
                       open={openDropdown === 'costProvider'}
                       onOpenChange={handleDropdownOpenChange('costProvider')}
-                      placeholder="Cost provider"
+                      placeholder={t.costProviderPlaceholder}
                       minSelections={1}
                       maxSelections={1}
                       showClearAll={false}
@@ -503,14 +650,14 @@ function ThroughputCalculatorInner() {
                 <div className="flex flex-col space-y-1.5 lg:col-span-1">
                   <LabelWithTooltip
                     htmlFor="calc-cost-type"
-                    label="Token Type"
-                    tooltip="Whether to show costs for total tokens, input tokens only, or output tokens only."
+                    label={t.tokenTypeLabel}
+                    tooltip={t.tokenTypeTooltip}
                   />
                   <div id="calc-cost-type" data-testid="calc-cost-type-selector">
                     <MultiSelect
                       options={COST_TYPE_OPTIONS.map((ct) => ({
                         value: ct.value,
-                        label: ct.label,
+                        label: costTypeLabels[ct.value],
                       }))}
                       value={[costType]}
                       onChange={(values) => {
@@ -520,7 +667,7 @@ function ThroughputCalculatorInner() {
                       }}
                       open={openDropdown === 'costType'}
                       onOpenChange={handleDropdownOpenChange('costType')}
-                      placeholder="Token type"
+                      placeholder={t.tokenTypePlaceholder}
                       minSelections={1}
                       maxSelections={1}
                       showClearAll={false}
@@ -536,8 +683,8 @@ function ThroughputCalculatorInner() {
                 <div className="flex flex-col space-y-1.5">
                   <LabelWithTooltip
                     htmlFor="calc-metric"
-                    label="Metric"
-                    tooltip="The comparison metric shown in the chart. Throughput (tok/s/gpu), power efficiency (tok/s/MW), or cost per million tokens."
+                    label={t.metricLabel}
+                    tooltip={t.metricTooltip}
                   />
                   <div className="flex rounded-lg border border-border overflow-hidden h-9">
                     {BAR_METRIC_OPTIONS.map((opt) => (
@@ -552,7 +699,11 @@ function ThroughputCalculatorInner() {
                         }`}
                         onClick={() => handleBarMetricChange(opt.value)}
                       >
-                        {getBarMetricLabel(opt.value)}
+                        {opt.value === 'throughput'
+                          ? t.metricThroughput
+                          : opt.value === 'cost'
+                            ? t.metricCost
+                            : 'tok/s/MW'}
                       </button>
                     ))}
                   </div>
@@ -563,8 +714,8 @@ function ThroughputCalculatorInner() {
                 <div className="space-y-2">
                   <LabelWithTooltip
                     htmlFor="calc-target"
-                    label="Target Interactivity (tok/s/user)"
-                    tooltip="The interactivity operating point used for interpolation. Adjust the slider to compare GPU throughput, cost, and power efficiency at different interactivity levels."
+                    label={t.targetLabel}
+                    tooltip={t.targetTooltip}
                   />
                   <div className="flex items-center gap-4">
                     <div className="flex-1">
@@ -632,9 +783,9 @@ function ThroughputCalculatorInner() {
             leadingControls={
               <SegmentedToggle
                 value={viewMode}
-                options={CALCULATOR_VIEW_MODE_OPTIONS}
+                options={viewModeOptions}
                 onValueChange={handleViewModeChange}
-                ariaLabel="View mode"
+                ariaLabel={t.viewModeAria}
                 testId="calculator-view-toggle"
                 className="shrink-0"
               />
@@ -650,13 +801,15 @@ function ThroughputCalculatorInner() {
                     <>
                       <div className="flex items-start justify-between gap-4">
                         <h2 className="text-lg font-semibold">
-                          {getChartTitle(barMetric, mode, targetValue, costType, costProvider)}
+                          {locale === 'zh'
+                            ? getChartTitleZh(barMetric, mode, targetValue, costType, costProvider)
+                            : getChartTitle(barMetric, mode, targetValue, costType, costProvider)}
                         </h2>
                         <SegmentedToggle
                           value={viewMode}
-                          options={CALCULATOR_MOBILE_VIEW_MODE_OPTIONS}
+                          options={mobileViewModeOptions}
                           onValueChange={handleViewModeChange}
-                          ariaLabel="View mode"
+                          ariaLabel={t.viewModeAria}
                           className="md:hidden shrink-0"
                         />
                       </div>
@@ -665,8 +818,13 @@ function ThroughputCalculatorInner() {
                         {selectedPrecisions
                           .map((p) => getPrecisionLabel(p as Precision))
                           .join(', ')}{' '}
-                        • {getSequenceLabel(selectedSequence)} • Source: SemiAnalysis InferenceX™
-                        {selectedRunDate && <> • Updated: {selectedRunDate}</>}
+                        • {getSequenceLabel(selectedSequence)} • {t.source}SemiAnalysis InferenceX™
+                        {selectedRunDate && (
+                          <>
+                            {t.updated}
+                            {selectedRunDate}
+                          </>
+                        )}
                       </p>
                       {barMetric === 'power' && results.length > 0 && (
                         <>
@@ -674,7 +832,7 @@ function ThroughputCalculatorInner() {
                             className="text-muted-foreground mb-2 flex flex-wrap gap-2 items-center"
                             data-testid="calculator-cost-badges"
                           >
-                            All in Power/GPU:{' '}
+                            {t.allInPower}
                             {Object.entries(HW_REGISTRY).map(([base, specs]) => (
                               <Badge key={base} variant="outline">
                                 {base.toUpperCase()}: {specs.power}kW
@@ -683,7 +841,7 @@ function ThroughputCalculatorInner() {
                           </p>
                           <p className="text-muted-foreground">
                             <small>
-                              Source:{' '}
+                              {t.source}
                               <Link
                                 target="_blank"
                                 className="underline hover:text-foreground"
@@ -702,7 +860,7 @@ function ThroughputCalculatorInner() {
                             className="text-muted-foreground mb-2 flex flex-wrap gap-2 items-center"
                             data-testid="calculator-cost-badges"
                           >
-                            TCO $/GPU/hr:{' '}
+                            {t.tcoPerHr}
                             {Object.entries(HW_REGISTRY).map(([base, specs]) => (
                               <Badge key={base} variant="outline">
                                 {base.toUpperCase()}: $
@@ -718,7 +876,7 @@ function ThroughputCalculatorInner() {
                           </p>
                           <p className="text-muted-foreground">
                             <small>
-                              Source:{' '}
+                              {t.source}
                               <Link
                                 target="_blank"
                                 className="underline hover:text-foreground"
@@ -737,10 +895,8 @@ function ThroughputCalculatorInner() {
                         }`}
                       >
                         <p className="text-muted-foreground text-xs mt-2 border-l-2 border-amber-500 pl-2 bg-amber-500/5 py-1">
-                          <strong>Note:</strong> Disaggregated inference configurations (e.g., MoRI
-                          SGLang, Dynamo TRTLLM) calculate cost per decode GPU or per prefill GPU,
-                          rather than per total GPU count. This makes direct cost comparison with
-                          aggregated configs not an apples-to-apples comparison.
+                          <strong>{t.note}</strong>
+                          {t.disaggCost}
                         </p>
                       </div>
                       <div
@@ -751,10 +907,8 @@ function ThroughputCalculatorInner() {
                         }`}
                       >
                         <p className="text-muted-foreground text-xs mt-2 border-l-2 border-amber-500 pl-2 bg-amber-500/5 py-1">
-                          <strong>Note:</strong> Disaggregated inference configurations (e.g., MoRI
-                          SGLang, Dynamo TRTLLM) calculate throughput per decode GPU or per prefill
-                          GPU, rather than per total GPU count. This makes direct throughput
-                          comparison with aggregated configs not an apples-to-apples comparison.
+                          <strong>{t.note}</strong>
+                          {t.disaggThroughput}
                         </p>
                       </div>
                       <UnofficialDomainNotice />
@@ -788,7 +942,7 @@ function ThroughputCalculatorInner() {
                             switches={[
                               {
                                 id: 'calc-high-contrast',
-                                label: 'High Contrast',
+                                label: t.highContrast,
                                 checked: highContrast,
                                 onCheckedChange: (checked: boolean) => {
                                   setHighContrast(checked);
@@ -801,7 +955,7 @@ function ThroughputCalculatorInner() {
                                 ? [
                                     {
                                       id: 'calc-reset-filter',
-                                      label: 'Reset filter',
+                                      label: t.resetFilter,
                                       onClick: handleResetGpus,
                                     },
                                   ]
@@ -845,7 +999,7 @@ function ThroughputCalculatorInner() {
                       const baseName = config ? getDisplayLabel(config) : r.hwKey;
                       return r.precision ? `${baseName} (${r.precision.toUpperCase()})` : baseName;
                     })()}{' '}
-                    selected. Click another bar to compare.
+                    {t.clickToCompare}
                   </p>
                 )}
                 {comparisonText && comparisonText.length > 0 && (
@@ -866,7 +1020,7 @@ function ThroughputCalculatorInner() {
                 }}
                 className="text-xs text-muted-foreground hover:text-foreground underline shrink-0"
               >
-                Clear selection
+                {t.clearSelection}
               </button>
             </div>
           </Card>
diff --git a/packages/app/src/components/evaluation/ui/ChartControls.tsx b/packages/app/src/components/evaluation/ui/ChartControls.tsx
index 455c7152..0ca9a115 100644
--- a/packages/app/src/components/evaluation/ui/ChartControls.tsx
+++ b/packages/app/src/components/evaluation/ui/ChartControls.tsx
@@ -3,6 +3,7 @@
 import { useState } from 'react';
 
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
 import { ChevronDownIcon } from 'lucide-react';
 
 import { useEvaluation } from '@/components/evaluation/EvaluationContext';
@@ -14,7 +15,31 @@ import { MultiSelect } from '@/components/ui/multi-select';
 import { Popover, PopoverContent, PopoverTrigger } from '@/components/ui/popover';
 import { TooltipProvider } from '@/components/ui/tooltip';
 
+const STRINGS = {
+  en: {
+    benchmarkLabel: 'Benchmark',
+    benchmarkTooltip:
+      'The standardized test used to measure model performance. Common benchmarks include reasoning, coding, and knowledge-based evaluations.',
+    selectBenchmark: 'Select benchmark',
+    selectRunDate: 'Select run date',
+    changelog: 'Changelog',
+    newResultsOn: 'New results on',
+    noNewResults: 'No new results for this model on this date.',
+  },
+  zh: {
+    benchmarkLabel: '基准测试',
+    benchmarkTooltip:
+      '用于衡量模型性能的标准化测试。常见的基准测试包括推理能力、编程能力和知识评估。',
+    selectBenchmark: '选择基准测试',
+    selectRunDate: '选择运行日期',
+    changelog: '变更记录',
+    newResultsOn: '新结果 ·',
+    noNewResults: '该日期该模型无新结果。',
+  },
+} as const;
+
 export default function EvaluationChartControls() {
+  const t = STRINGS[useLocale()];
   const [openDropdown, setOpenDropdown] = useState<string | null>(null);
   const handleDropdownOpenChange = (dropdownKey: string) => (isOpen: boolean) => {
     if (isOpen) {
@@ -59,8 +84,8 @@ export default function EvaluationChartControls() {
         <div className="flex flex-col space-y-1.5 lg:col-span-1">
           <LabelWithTooltip
             htmlFor="eval-benchmark-select"
-            label="Benchmark"
-            tooltip="The standardized test used to measure model performance. Common benchmarks include reasoning, coding, and knowledge-based evaluations."
+            label={t.benchmarkLabel}
+            tooltip={t.benchmarkTooltip}
           />
           <div>
             <MultiSelect
@@ -79,7 +104,7 @@ export default function EvaluationChartControls() {
               onOpenChange={handleDropdownOpenChange('benchmark')}
               triggerId="eval-benchmark-select"
               triggerTestId="evaluation-benchmark-selector"
-              placeholder="Select benchmark"
+              placeholder={t.selectBenchmark}
               minSelections={1}
               maxSelections={1}
               showClearAll={false}
@@ -116,7 +141,7 @@ export default function EvaluationChartControls() {
             setSelectedRunDate(date);
             track('evaluation_date_selected', { date });
           }}
-          placeholder="Select run date"
+          placeholder={t.selectRunDate}
           availableDates={availableDates}
         />
 
@@ -124,13 +149,15 @@ export default function EvaluationChartControls() {
         <Popover>
           <PopoverTrigger asChild>
             <Button variant="ghost" className="self-start">
-              <strong>Changelog</strong>
+              <strong>{t.changelog}</strong>
               <ChevronDownIcon />
             </Button>
           </PopoverTrigger>
           <PopoverContent className="w-[400px]">
             <div className="flex flex-col gap-3">
-              <div className="text-xs font-bold">New results on {selectedRunDate}</div>
+              <div className="text-xs font-bold">
+                {t.newResultsOn} {selectedRunDate}
+              </div>
               {changelogEntries.length > 0 ? (
                 changelogEntries.map((entry) => (
                   <div key={entry.benchmark} className="flex flex-col gap-1 text-xs">
@@ -143,9 +170,7 @@ export default function EvaluationChartControls() {
                   </div>
                 ))
               ) : (
-                <p className="text-xs text-muted-foreground">
-                  No new results for this model on this date.
-                </p>
+                <p className="text-xs text-muted-foreground">{t.noNewResults}</p>
               )}
             </div>
           </PopoverContent>
diff --git a/packages/app/src/components/evaluation/ui/ChartDisplay.tsx b/packages/app/src/components/evaluation/ui/ChartDisplay.tsx
index a6b90db9..99c7aef3 100644
--- a/packages/app/src/components/evaluation/ui/ChartDisplay.tsx
+++ b/packages/app/src/components/evaluation/ui/ChartDisplay.tsx
@@ -4,6 +4,7 @@ import { useCallback, useMemo, useState } from 'react';
 import { BarChart3, Table2 } from 'lucide-react';
 
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
 import { useEvaluation } from '@/components/evaluation/EvaluationContext';
 import EvaluationTable from '@/components/evaluation/ui/EvaluationTable';
 import { Card } from '@/components/ui/card';
@@ -21,22 +22,34 @@ import EvalBarChartD3 from './BarChartD3';
 
 type EvalViewMode = 'chart' | 'table';
 
-const VIEW_MODE_OPTIONS: SegmentedToggleOption<EvalViewMode>[] = [
-  {
-    value: 'chart',
-    label: 'Chart',
-    icon: <BarChart3 className="size-3.5" />,
-    testId: 'evaluation-chart-view-btn',
+const STRINGS = {
+  en: {
+    chartView: 'Chart',
+    tableView: 'Table',
+    viewModeAria: 'View mode',
+    heading: 'Accuracy Evals',
+    description:
+      'Benchmark results showing model quality versus throughput trade-offs across different GPUs, quantization levels, and inference configurations.',
+    captionHeading: 'Evaluation Score by Hardware Configuration',
+    sourceUnofficial: 'Source: UNOFFICIAL',
+    sourceOfficial: 'Source: SemiAnalysis InferenceX™',
+    updated: 'Updated:',
   },
-  {
-    value: 'table',
-    label: 'Table',
-    icon: <Table2 className="size-3.5" />,
-    testId: 'evaluation-table-view-btn',
+  zh: {
+    chartView: '图表',
+    tableView: '表格',
+    viewModeAria: '视图模式',
+    heading: '准确率评估',
+    description: '基准测试结果展示不同 GPU、量化精度和推理配置下，模型质量与吞吐量之间的权衡。',
+    captionHeading: '各硬件配置的评估得分',
+    sourceUnofficial: '来源：非官方',
+    sourceOfficial: '来源：SemiAnalysis InferenceX™',
+    updated: '更新时间：',
   },
-];
+} as const;
 
 export default function EvaluationChartDisplay() {
+  const t = STRINGS[useLocale()];
   const CHART_ID = 'evaluation-chart';
   const {
     selectedModel,
@@ -62,6 +75,24 @@ export default function EvaluationChartDisplay() {
     track('evaluation_view_changed', { view: value });
   };
 
+  const viewModeOptions = useMemo(
+    (): SegmentedToggleOption<EvalViewMode>[] => [
+      {
+        value: 'chart',
+        label: t.chartView,
+        icon: <BarChart3 className="size-3.5" />,
+        testId: 'evaluation-chart-view-btn',
+      },
+      {
+        value: 'table',
+        label: t.tableView,
+        icon: <Table2 className="size-3.5" />,
+        testId: 'evaluation-table-view-btn',
+      },
+    ],
+    [t],
+  );
+
   const handleExportCsv = useCallback(() => {
     const { headers, rows } = evaluationChartToCsv(chartData);
     exportToCsv(`InferenceX_evaluation_${selectedModel}_${selectedBenchmark}`, headers, rows);
@@ -69,16 +100,15 @@ export default function EvaluationChartDisplay() {
 
   const caption = (
     <>
-      <h3 className="text-lg font-semibold">Evaluation Score by Hardware Configuration</h3>
+      <h3 className="text-lg font-semibold">{t.captionHeading}</h3>
       <p className="text-sm text-muted-foreground mb-2">
         {selectedModel} •{' '}
         {selectedPrecisions.map((p) => getPrecisionLabel(p as Precision)).join(', ')} •{' '}
-        {selectedBenchmark} •{' '}
-        {isUnofficialRun ? 'Source: UNOFFICIAL' : 'Source: SemiAnalysis InferenceX™'}
+        {selectedBenchmark} • {isUnofficialRun ? t.sourceUnofficial : t.sourceOfficial}
         {selectedRunDate && (
           <>
             {' '}
-            • Updated:{' '}
+            • {t.updated}{' '}
             {new Date(`${selectedRunDate}T00:00:00Z`).toLocaleDateString('en-US', {
               year: 'numeric',
               month: '2-digit',
@@ -99,11 +129,8 @@ export default function EvaluationChartDisplay() {
           <div className="flex flex-col gap-4">
             <div className="flex items-start justify-between">
               <div>
-                <h2 className="text-lg font-semibold mb-2">Accuracy Evals</h2>
-                <p className="text-muted-foreground text-sm mb-4">
-                  Benchmark results showing model quality versus throughput trade-offs across
-                  different GPUs, quantization levels, and inference configurations.
-                </p>
+                <h2 className="text-lg font-semibold mb-2">{t.heading}</h2>
+                <p className="text-muted-foreground text-sm mb-4">{t.description}</p>
               </div>
               <ChartShareActions />
             </div>
@@ -122,9 +149,9 @@ export default function EvaluationChartDisplay() {
         leadingControls={
           <SegmentedToggle
             value={viewMode}
-            options={VIEW_MODE_OPTIONS}
+            options={viewModeOptions}
             onValueChange={handleViewModeChange}
-            ariaLabel="View mode"
+            ariaLabel={t.viewModeAria}
             testId="evaluation-view-toggle"
           />
         }
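
Note on the change above: the view-mode options move from a module-level constant into the component because their labels now depend on the locale, which is only known at render time. A minimal sketch of that lookup, stripped of React and icons so it runs standalone; `Locale`, `ViewStrings`, and `buildViewModeOptions` are illustrative names and are not taken from the patch.

```typescript
// Sketch only: mirrors the STRINGS[locale] lookup from the diff without React.
// `Locale`, `ViewStrings`, and `buildViewModeOptions` are assumed names.
type Locale = 'en' | 'zh';
type EvalViewMode = 'chart' | 'table';

interface ViewStrings {
  chartView: string;
  tableView: string;
}

interface ViewModeOption {
  value: EvalViewMode;
  label: string;
  testId: string;
}

// One entry per locale; the Record type makes a missing locale a compile error.
const STRINGS: Record<Locale, ViewStrings> = {
  en: { chartView: 'Chart', tableView: 'Table' },
  zh: { chartView: '图表', tableView: '表格' },
};

// Labels vary by locale; values and test ids stay stable so tests and
// analytics events are unaffected by the language switch.
function buildViewModeOptions(locale: Locale): ViewModeOption[] {
  const t = STRINGS[locale];
  return [
    { value: 'chart', label: t.chartView, testId: 'evaluation-chart-view-btn' },
    { value: 'table', label: t.tableView, testId: 'evaluation-table-view-btn' },
  ];
}
```

In the component itself the same array is wrapped in `useMemo` keyed on `t`, so the options keep a stable identity until the locale changes.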
diff --git a/packages/app/src/components/favorites/favorite-presets.ts b/packages/app/src/components/favorites/favorite-presets.ts
index cb6fb127..8efc594b 100644
--- a/packages/app/src/components/favorites/favorite-presets.ts
+++ b/packages/app/src/components/favorites/favorite-presets.ts
@@ -4,7 +4,9 @@ import { getModelExclusion, Model, Sequence } from '@/lib/data-mappings';
 export interface FavoritePreset {
   id: string;
   title: string;
+  titleZh?: string;
   description: string;
+  descriptionZh?: string;
   tags: string[];
   category: 'comparison' | 'improvements';
   wide?: boolean;
@@ -142,8 +144,10 @@ export const FAVORITE_PRESETS: FavoritePreset[] = [
   {
     id: 'minimax-m3-launch',
     title: 'MiniMax M3 — First Look',
+    titleZh: 'MiniMax M3 — 首发基准测试',
     description:
       'First benchmarks of MiniMax M3 across every available GPU. New configurations appear here as they come online.',
+    descriptionZh: '涵盖所有可用 GPU 的 MiniMax M3 首批基准测试结果。新配置上线后将在此同步更新。',
     tags: ['MiniMax', 'M3', 'New'],
     category: 'comparison',
     wide: true,
@@ -198,7 +202,10 @@ export const FAVORITE_PRESETS: FavoritePreset[] = [
   {
     id: 'gb200-vs-b200',
     title: 'GB200 NVL72 vs B200 — Multi vs Single Node',
+    titleZh: 'GB200 NVL72 vs B200 — 多节点 vs 单节点',
     description: 'GB200 NVL72 Dynamo TRTLLM vs B200 Dynamo TRTLLM on DeepSeek R1 (8k/1k) at FP4.',
+    descriptionZh:
+      'GB200 NVL72 Dynamo TRTLLM vs B200 Dynamo TRTLLM，基于 DeepSeek R1 (8k/1k)，FP4 精度。',
     tags: ['DeepSeek', 'GB200', 'B200', 'Dynamo', 'FP4', 'NVL72'],
     category: 'comparison',
     config: {
@@ -213,8 +220,11 @@ export const FAVORITE_PRESETS: FavoritePreset[] = [
   {
     id: 'b200-vs-h200',
     title: 'B200 vs H200 — Blackwell vs Hopper',
+    titleZh: 'B200 vs H200 — Blackwell vs Hopper',
     description:
       'Blackwell B200 vs Hopper H200 Dynamo TRTLLM throughput per GPU on DeepSeek R1 (8k/1k) at FP8.',
+    descriptionZh:
+      'Blackwell B200 vs Hopper H200 Dynamo TRTLLM 每 GPU 吞吐量对比，基于 DeepSeek R1 (8k/1k)，FP8 精度。',
     tags: ['DeepSeek', 'B200', 'H200', 'Dynamo', 'FP8'],
     category: 'comparison',
     config: {
@@ -229,8 +239,11 @@ export const FAVORITE_PRESETS: FavoritePreset[] = [
   {
     id: 'amd-generations',
     title: 'AMD MI300X → MI325X → MI355X',
+    titleZh: 'AMD MI300X → MI325X → MI355X',
     description:
       'Three generations of AMD Instinct on SGLang at FP8. Generational throughput scaling on DeepSeek R1 (8k/1k).',
+    descriptionZh:
+      'AMD Instinct 三代产品在 SGLang FP8 下的对比。DeepSeek R1 (8k/1k) 代际吞吐量提升趋势。',
     tags: ['DeepSeek', 'MI300X', 'MI325X', 'MI355X', 'SGLang', 'FP8'],
     category: 'comparison',
     config: {
@@ -245,7 +258,10 @@ export const FAVORITE_PRESETS: FavoritePreset[] = [
   {
     id: 'h100-vs-gb300-disagg',
     title: 'H100 vs GB300 Disagg — DeepSeek',
+    titleZh: 'H100 vs GB300 分离式推理 — DeepSeek',
     description: 'H100 FP8 disagg vs GB300 FP8 disagg vs GB300 FP4 disagg on DeepSeek R1 (8k/1k).',
+    descriptionZh:
+      'H100 FP8 分离式 vs GB300 FP8 分离式 vs GB300 FP4 分离式，基于 DeepSeek R1 (8k/1k)。',
     tags: ['DeepSeek', 'H100', 'GB300', 'Disagg', 'FP8', 'FP4'],
     category: 'comparison',
     config: {
@@ -260,8 +276,11 @@ export const FAVORITE_PRESETS: FavoritePreset[] = [
   {
     id: 'disagg-b200-vs-mi355x',
     title: 'Disagg B200 SGLang vs MI355X vs B200 TRTLLM',
+    titleZh: '分离式 B200 SGLang vs MI355X vs B200 TRTLLM',
     description:
       'Disaggregated B200 Dynamo SGLang vs MI355X MoRI SGLang vs B200 Dynamo TRTLLM on DeepSeek R1 (8k/1k) at FP8.',
+    descriptionZh:
+      '分离式 B200 Dynamo SGLang vs MI355X MoRI SGLang vs B200 Dynamo TRTLLM，基于 DeepSeek R1 (8k/1k)，FP8 精度。',
     tags: ['DeepSeek', 'B200', 'MI355X', 'Dynamo', 'MoRI', 'FP8', 'Disagg'],
     category: 'comparison',
     config: {
@@ -276,8 +295,11 @@ export const FAVORITE_PRESETS: FavoritePreset[] = [
   {
     id: 'mi355x-sglang-disagg-timeline',
     title: 'MI355X SGLang Disagg Over Time — DeepSeek (FP8)',
+    titleZh: 'MI355X SGLang 分离式推理历史趋势 — DeepSeek (FP8)',
     description:
       'MI355X SGLang disaggregated inference on DeepSeek R1 (8k/1k) FP8. Tracks throughput improvements over time.',
+    descriptionZh:
+      'MI355X SGLang 分离式推理在 DeepSeek R1 (8k/1k) FP8 下的表现，追踪吞吐量随时间的提升。',
     tags: ['DeepSeek', 'MI355X', 'SGLang', 'FP8', 'Disagg', 'Timeline'],
     category: 'improvements',
     config: {
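
Note on the change above: `titleZh` and `descriptionZh` are optional, so a consumer needs a fallback for presets that have no translation. This diff section does not show the consuming code, so the following is only a sketch under that assumption; `presetText` and `LocalizedPreset` are hypothetical names, and the sample strings are shortened from the presets above.

```typescript
// Sketch only: a hypothetical consumer-side helper, not part of the patch.
type Locale = 'en' | 'zh';

interface LocalizedPreset {
  title: string;
  titleZh?: string;
  description: string;
  descriptionZh?: string;
}

// Prefer the Chinese field on the zh tree and fall back to English when the
// translation is missing, so an untranslated preset never renders empty.
function presetText(
  preset: LocalizedPreset,
  locale: Locale,
): { title: string; description: string } {
  if (locale === 'zh') {
    return {
      title: preset.titleZh ?? preset.title,
      description: preset.descriptionZh ?? preset.description,
    };
  }
  return { title: preset.title, description: preset.description };
}
```

Using `??` rather than `||` treats only `undefined` and `null` as missing. Whether an empty string should also fall back is a design choice the patch does not settle.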
diff --git a/packages/app/src/components/footer/footer.tsx b/packages/app/src/components/footer/footer.tsx
index 182a1c2f..2243d4e3 100644
--- a/packages/app/src/components/footer/footer.tsx
+++ b/packages/app/src/components/footer/footer.tsx
@@ -1,178 +1,236 @@
+'use client';
+
 import Image from 'next/image';
 import Link from 'next/link';
 
 import { ShareTwitterButton, ShareLinkedInButton } from '@/components/share-buttons';
+import { useLocale } from '@/lib/use-locale';
 
 import { StarButton } from './footer-star-cta';
 
-export const Footer = ({ starCount }: { starCount?: number | null }) => (
-  <footer data-testid="footer" className="relative w-full overflow-visible mt-auto pt-32">
-    <div className="container mx-auto px-4 lg:px-8 py-12">
-      {/* Main grid */}
-      <div className="flex flex-col md:flex-row md:justify-between gap-10 md:gap-8 mb-10">
-        {/* Left — Brand */}
-        <div data-testid="footer-brand" className="flex flex-col gap-4 items-center md:items-start">
-          <Link
-            data-testid="footer-brand-link"
-            target="_blank"
-            href="https://semianalysis.com/"
-            className="inline-block w-35 h-14.5"
-          >
-            <Image
-              width={140}
-              height={58}
-              src="/brand/logo-color.webp"
-              alt="SemiAnalysis logo"
-              className="h-auto"
-            />
-          </Link>
-          <p
-            data-testid="footer-brand-description"
-            className="text-sm text-muted-foreground max-w-xs text-center md:text-left"
-          >
-            Continuous open-source inference benchmarking. Real-world, reproducible, auditable
-            performance data trusted by trillion dollar AI infrastructure operators like OpenAI,
-            Meta, Oracle, Microsoft, etc.
-          </p>
-        </div>
+const STRINGS = {
+  en: {
+    description:
+      'Continuous open-source inference benchmarking. Real-world, reproducible, auditable performance data trusted by trillion dollar AI infrastructure operators like OpenAI, Meta, Oracle, Microsoft, etc.',
+    semianalysis: 'SemiAnalysis',
+    mainSite: 'Main Site',
+    newsletter: 'Newsletter',
+    about: 'About',
+    legal: 'Legal',
+    landAcknowledgement: 'Land Acknowledgement',
+    privacyPolicy: 'Privacy Policy',
+    cookiePolicy: 'Cookie Policy',
+    contribute: 'Contribute',
+    benchmarks: 'Benchmarks',
+    frontend: 'Frontend',
+    more: 'More',
+    gpuReliability: 'GPU Reliability',
+    perfPerDollar: 'Performance per Dollar',
+    languageLink: '中文版',
+    languageHref: '/zh',
+    languageHrefLang: 'zh-CN',
+    cta: 'If this data helps your work, consider starring us on GitHub or sharing with your network.',
+    rights: 'All rights reserved.',
+  },
+  zh: {
+    description:
+      '持续的开源推理基准测试。真实、可复现、可审计的性能数据，获得 OpenAI、Meta、Oracle、Microsoft 等万亿美元级 AI 基础设施运营方的信赖。',
+    semianalysis: 'SemiAnalysis',
+    mainSite: '官方网站',
+    newsletter: '订阅通讯',
+    about: '关于我们',
+    legal: '法律信息',
+    landAcknowledgement: '土地致谢',
+    privacyPolicy: '隐私政策',
+    cookiePolicy: 'Cookie 政策',
+    contribute: '参与贡献',
+    benchmarks: '基准测试仓库',
+    frontend: '前端仓库',
+    more: '更多',
+    gpuReliability: 'GPU 可靠性',
+    perfPerDollar: '每美元性能',
+    languageLink: 'English',
+    languageHref: '/',
+    languageHrefLang: 'en',
+    cta: '如果这些数据对您的工作有帮助，欢迎在 GitHub 上为我们加星或分享给您的同事。',
+    rights: '保留所有权利。',
+  },
+} as const;
 
-        {/* Center — Links */}
-        <div data-testid="footer-links" className="grid grid-cols-3 gap-x-6 gap-y-8">
-          <div data-testid="footer-links-semianalysis" className="flex flex-col gap-2.5">
-            <span className="text-sm font-medium text-foreground">SemiAnalysis</span>
-            <a
-              data-testid="footer-link-main-site"
-              href="https://semianalysis.com"
-              target="_blank"
-              rel="noopener noreferrer"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              Main Site
-            </a>
-            <a
-              data-testid="footer-link-newsletter"
-              href="https://newsletter.semianalysis.com"
-              target="_blank"
-              rel="noopener noreferrer"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              Newsletter
-            </a>
-            <a
-              data-testid="footer-link-about"
-              href="https://semianalysis.com/about/"
-              target="_blank"
-              rel="noopener noreferrer"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              About
-            </a>
-          </div>
-          <div data-testid="footer-links-legal" className="flex flex-col gap-2.5">
-            <span className="text-sm font-medium text-foreground">Legal</span>
+export const Footer = ({ starCount }: { starCount?: number | null }) => {
+  const locale = useLocale();
+  const t = STRINGS[locale];
+  // Internal links stay within the current language tree.
+  const prefix = locale === 'zh' ? '/zh' : '';
+  return (
+    <footer data-testid="footer" className="relative w-full overflow-visible mt-auto pt-32">
+      <div className="container mx-auto px-4 lg:px-8 py-12">
+        {/* Main grid */}
+        <div className="flex flex-col md:flex-row md:justify-between gap-10 md:gap-8 mb-10">
+          {/* Left — Brand */}
+          <div
+            data-testid="footer-brand"
+            className="flex flex-col gap-4 items-center md:items-start"
+          >
             <Link
-              data-testid="footer-link-land-acknowledgement"
-              href="/land-acknowledgement"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              Land Acknowledgement
-            </Link>
-            <a
-              data-testid="footer-link-privacy"
-              href="https://semianalysis.com/privacy-policy/"
-              target="_blank"
-              rel="noopener noreferrer"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              Privacy Policy
-            </a>
-            <a
-              data-testid="footer-link-cookies"
-              href="https://semianalysis.com/cookie-policy/"
-              target="_blank"
-              rel="noopener noreferrer"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              Cookie Policy
-            </a>
-          </div>
-          <div data-testid="footer-links-contribute" className="flex flex-col gap-2.5">
-            <span className="text-sm font-medium text-foreground">Contribute</span>
-            <a
-              data-testid="footer-link-benchmarks"
-              href="https://github.com/SemiAnalysisAI/InferenceX"
+              data-testid="footer-brand-link"
               target="_blank"
-              rel="noopener noreferrer"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              href="https://semianalysis.com/"
+              className="inline-block w-35 h-14.5"
             >
-              Benchmarks
-            </a>
-            <a
-              data-testid="footer-link-frontend"
-              href="https://github.com/SemiAnalysisAI/InferenceX-app"
-              target="_blank"
-              rel="noopener noreferrer"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              Frontend
-            </a>
-          </div>
-          <div data-testid="footer-links-more" className="flex flex-col gap-2.5">
-            <span className="text-sm font-medium text-foreground">More</span>
-            <Link
-              data-testid="footer-link-reliability"
-              href="/reliability"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              GPU Reliability
+              <Image
+                width={140}
+                height={58}
+                src="/brand/logo-color.webp"
+                alt="SemiAnalysis logo"
+                className="h-auto"
+              />
             </Link>
-            <Link
-              data-testid="footer-link-compare-per-dollar"
-              href="/compare-per-dollar"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+            <p
+              data-testid="footer-brand-description"
+              className="text-sm text-muted-foreground max-w-xs text-center md:text-left"
             >
-              Performance per Dollar
-            </Link>
-            <Link
-              data-testid="footer-link-zh"
-              href="/zh"
-              hrefLang="zh-CN"
-              className="text-sm text-muted-foreground hover:text-foreground transition-colors"
-            >
-              中文版
-            </Link>
+              {t.description}
+            </p>
           </div>
-        </div>
 
-        {/* Right — CTA + Social */}
-        <div data-testid="footer-cta" className="flex flex-col gap-4 items-center md:items-end">
-          <div data-testid="footer-social-buttons" className="flex items-center gap-1.5">
-            <div className="rounded-md bg-background/80 w-fit">
-              <StarButton starCount={starCount} />
+          {/* Center — Links */}
+          <div data-testid="footer-links" className="grid grid-cols-3 gap-x-6 gap-y-8">
+            <div data-testid="footer-links-semianalysis" className="flex flex-col gap-2.5">
+              <span className="text-sm font-medium text-foreground">{t.semianalysis}</span>
+              <a
+                data-testid="footer-link-main-site"
+                href="https://semianalysis.com"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.mainSite}
+              </a>
+              <a
+                data-testid="footer-link-newsletter"
+                href="https://newsletter.semianalysis.com"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.newsletter}
+              </a>
+              <a
+                data-testid="footer-link-about"
+                href="https://semianalysis.com/about/"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.about}
+              </a>
+            </div>
+            <div data-testid="footer-links-legal" className="flex flex-col gap-2.5">
+              <span className="text-sm font-medium text-foreground">{t.legal}</span>
+              <Link
+                data-testid="footer-link-land-acknowledgement"
+                href={`${prefix}/land-acknowledgement`}
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.landAcknowledgement}
+              </Link>
+              <a
+                data-testid="footer-link-privacy"
+                href="https://semianalysis.com/privacy-policy/"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.privacyPolicy}
+              </a>
+              <a
+                data-testid="footer-link-cookies"
+                href="https://semianalysis.com/cookie-policy/"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.cookiePolicy}
+              </a>
             </div>
-            <div className="rounded-md bg-background/80 w-fit">
-              <ShareTwitterButton />
+            <div data-testid="footer-links-contribute" className="flex flex-col gap-2.5">
+              <span className="text-sm font-medium text-foreground">{t.contribute}</span>
+              <a
+                data-testid="footer-link-benchmarks"
+                href="https://github.com/SemiAnalysisAI/InferenceX"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.benchmarks}
+              </a>
+              <a
+                data-testid="footer-link-frontend"
+                href="https://github.com/SemiAnalysisAI/InferenceX-app"
+                target="_blank"
+                rel="noopener noreferrer"
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.frontend}
+              </a>
             </div>
-            <div className="rounded-md bg-background/80 w-fit">
-              <ShareLinkedInButton />
+            <div data-testid="footer-links-more" className="flex flex-col gap-2.5">
+              <span className="text-sm font-medium text-foreground">{t.more}</span>
+              <Link
+                data-testid="footer-link-reliability"
+                href={`${prefix}/reliability`}
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.gpuReliability}
+              </Link>
+              <Link
+                data-testid="footer-link-compare-per-dollar"
+                href={`${prefix}/compare-per-dollar`}
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.perfPerDollar}
+              </Link>
+              <Link
+                data-testid="footer-link-zh"
+                href={t.languageHref}
+                hrefLang={t.languageHrefLang}
+                className="text-sm text-muted-foreground hover:text-foreground transition-colors"
+              >
+                {t.languageLink}
+              </Link>
             </div>
           </div>
-          <p className="text-sm text-muted-foreground text-center md:text-right max-w-xs">
-            If this data helps your work, consider starring us on GitHub or sharing with your
-            network.
-          </p>
+
+          {/* Right — CTA + Social */}
+          <div data-testid="footer-cta" className="flex flex-col gap-4 items-center md:items-end">
+            <div data-testid="footer-social-buttons" className="flex items-center gap-1.5">
+              <div className="rounded-md bg-background/80 w-fit">
+                <StarButton starCount={starCount} />
+              </div>
+              <div className="rounded-md bg-background/80 w-fit">
+                <ShareTwitterButton />
+              </div>
+              <div className="rounded-md bg-background/80 w-fit">
+                <ShareLinkedInButton />
+              </div>
+            </div>
+            <p className="text-sm text-muted-foreground text-center md:text-right max-w-xs">
+              {t.cta}
+            </p>
+          </div>
         </div>
-      </div>
 
-      {/* Bottom bar */}
-      <div
-        data-testid="footer-bottom-bar"
-        className="border-t border-border/40 pt-6 flex flex-col md:flex-row items-center justify-between gap-4"
-      >
-        <p data-testid="footer-copyright" className="text-xs text-muted-foreground">
-          &copy; {new Date().getFullYear()} semianalysis.com. All rights reserved.
-        </p>
+        {/* Bottom bar */}
+        <div
+          data-testid="footer-bottom-bar"
+          className="border-t border-border/40 pt-6 flex flex-col md:flex-row items-center justify-between gap-4"
+        >
+          <p data-testid="footer-copyright" className="text-xs text-muted-foreground">
+            &copy; {new Date().getFullYear()} semianalysis.com. {t.rights}
+          </p>
+        </div>
       </div>
-    </div>
-  </footer>
-);
+    </footer>
+  );
+};
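
Note on the change above: the footer keeps internal links inside the current language tree by prepending `/zh` on the Chinese side, and the language link flips its target depending on the active locale. A minimal sketch of both rules as pure functions; `localizeHref` and `languageToggle` are hypothetical names, since the patch inlines this logic with a template string and the `STRINGS` table.

```typescript
// Sketch only: the helper names are assumptions; the values come from the diff.
type Locale = 'en' | 'zh';

// Internal paths get a /zh prefix on the Chinese tree and stay bare otherwise.
function localizeHref(locale: Locale, path: string): string {
  const prefix = locale === 'zh' ? '/zh' : '';
  return `${prefix}${path}`;
}

interface LanguageToggle {
  href: string;
  hrefLang: string;
  label: string;
}

// The toggle always points at the *other* language: English pages link to
// /zh, Chinese pages link back to the English root.
function languageToggle(locale: Locale): LanguageToggle {
  return locale === 'zh'
    ? { href: '/', hrefLang: 'en', label: 'English' }
    : { href: '/zh', hrefLang: 'zh-CN', label: '中文版' };
}
```

External links (semianalysis.com, GitHub) are deliberately not passed through the prefix helper; only site-internal routes have a `/zh` sibling.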
diff --git a/packages/app/src/components/gpu-power/GpuPowerDisplay.tsx b/packages/app/src/components/gpu-power/GpuPowerDisplay.tsx
index 31e2538b..5034da88 100644
--- a/packages/app/src/components/gpu-power/GpuPowerDisplay.tsx
+++ b/packages/app/src/components/gpu-power/GpuPowerDisplay.tsx
@@ -23,6 +23,7 @@ import {
 } from '@/components/ui/select';
 
 import { relockFeatureGate } from '@/lib/use-feature-gate';
+import { useLocale } from '@/lib/use-locale';
 
 import GpuCorrelationChart from './GpuCorrelationChart';
 import GpuMetricsChart from './GpuPowerChart';
@@ -38,6 +39,63 @@ import {
 
 const GPU_COLORS = d3.schemeTableau10;
 
+const STRINGS = {
+  en: {
+    heading: 'PowerX',
+    descPre: 'Enter a GitHub Actions run ID to visualize GPU metrics over time from',
+    descPost: 'artifacts.',
+    relockButton: 'Re-lock feature gate',
+    runIdLabel: 'Run ID',
+    runIdPlaceholder: 'e.g. 22806827144',
+    loadButton: 'Load',
+    loadingButton: 'Loading...',
+    runLabel: 'Run:',
+    branchLabel: 'Branch:',
+    dateLabel: 'Date:',
+    statusLabel: 'Status:',
+    dataPointsLabel: 'Data points:',
+    artifactLabel: 'Artifact',
+    metricLabel: 'Metric',
+    copied: 'Copied',
+    share: 'Share',
+    xAxis: 'X Axis',
+    yAxis: 'Y Axis',
+    metricOverTimeSuffix: ' over Time',
+    metricCorrelation: 'Metric Correlation',
+    resetFilter: 'Reset filter',
+    downsample: 'Downsample',
+    perGpuStats: 'Per-GPU Statistics',
+    rows: 'rows',
+  },
+  zh: {
+    heading: 'PowerX',
+    descPre: '输入 GitHub Actions 运行 ID，可视化',
+    descPost: '产物中 GPU 指标的时间变化趋势。',
+    relockButton: '重新锁定功能入口',
+    runIdLabel: '运行 ID',
+    runIdPlaceholder: '例如 22806827144',
+    loadButton: '加载',
+    loadingButton: '加载中...',
+    runLabel: '运行：',
+    branchLabel: '分支：',
+    dateLabel: '日期：',
+    statusLabel: '状态：',
+    dataPointsLabel: '数据点：',
+    artifactLabel: '产物',
+    metricLabel: '指标',
+    copied: '已复制',
+    share: '分享',
+    xAxis: 'X 轴',
+    yAxis: 'Y 轴',
+    metricOverTimeSuffix: ' 时间趋势',
+    metricCorrelation: '指标相关性',
+    resetFilter: '重置筛选',
+    downsample: '降采样',
+    perGpuStats: '每 GPU 统计信息',
+    rows: '行',
+  },
+} as const;
+
 type GpuMetricsView = 'chart' | 'correlation';
 
 const GPU_METRICS_VIEW_OPTIONS: SegmentedToggleOption<GpuMetricsView>[] = [
@@ -57,6 +115,7 @@ const GPU_METRICS_VIEW_OPTIONS: SegmentedToggleOption<GpuMetricsView>[] = [
 
 export default function GpuMetricsDisplay() {
   const router = useRouter();
+  const t = STRINGS[useLocale()];
   const [runIdInput, setRunIdInput] = useState('22806827144');
   const [loading, setLoading] = useState(false);
   const [error, setError] = useState<string | null>(null);
@@ -237,10 +296,11 @@ export default function GpuMetricsDisplay() {
         <div className="space-y-3">
           <div className="flex items-start justify-between">
             <div>
-              <h2 className="text-lg font-semibold mb-2">PowerX</h2>
+              <h2 className="text-lg font-semibold mb-2">{t.heading}</h2>
               <p className="text-muted-foreground text-sm">
-                Enter a GitHub Actions run ID to visualize GPU metrics over time from{' '}
-                <code className="text-xs bg-muted px-1 py-0.5 rounded">gpu_metrics</code> artifacts.
+                {t.descPre}{' '}
+                <code className="text-xs bg-muted px-1 py-0.5 rounded">gpu_metrics</code>{' '}
+                {t.descPost}
               </p>
             </div>
             <div className="flex items-center gap-1.5">
@@ -256,18 +316,18 @@ export default function GpuMetricsDisplay() {
-                title="Re-lock feature gate"
+                title={t.relockButton}
               >
                 <Lock className="size-3" />
-                Re-lock feature gate
+                {t.relockButton}
               </Button>
               <ChartShareActions />
             </div>
           </div>
           <div className="flex flex-wrap items-end gap-3">
             <div className="flex-1 max-w-sm space-y-1">
-              <Label htmlFor="gpu-metrics-run-id">Run ID</Label>
+              <Label htmlFor="gpu-metrics-run-id">{t.runIdLabel}</Label>
               <Input
                 id="gpu-metrics-run-id"
                 data-testid="gpu-metrics-run-input"
-                placeholder="e.g. 22806827144"
+                placeholder={t.runIdPlaceholder}
                 value={runIdInput}
                 onChange={(e) => setRunIdInput(e.target.value)}
                 onKeyDown={(e) => {
@@ -283,10 +343,10 @@ export default function GpuMetricsDisplay() {
               {loading ? (
                 <>
                   <Loader2 className="mr-2 size-4 animate-spin" />
-                  Loading...
+                  {t.loadingButton}
                 </>
               ) : (
-                'Load'
+                t.loadButton
               )}
             </Button>
           </div>
@@ -304,7 +364,7 @@ export default function GpuMetricsDisplay() {
           <Card className="mb-4">
             <div className="flex flex-wrap gap-x-6 gap-y-1 text-sm mb-4">
               <span>
-                <span className="text-muted-foreground">Run:</span>{' '}
+                <span className="text-muted-foreground">{t.runLabel}</span>{' '}
                 <a
                   href={runInfo.url}
                   target="_blank"
@@ -315,17 +375,17 @@ export default function GpuMetricsDisplay() {
                 </a>
               </span>
               <span>
-                <span className="text-muted-foreground">Branch:</span> {runInfo.branch}
+                <span className="text-muted-foreground">{t.branchLabel}</span> {runInfo.branch}
               </span>
               <span>
-                <span className="text-muted-foreground">Date:</span>{' '}
+                <span className="text-muted-foreground">{t.dateLabel}</span>{' '}
                 {new Date(runInfo.createdAt).toLocaleDateString()}
               </span>
               <span>
-                <span className="text-muted-foreground">Status:</span> {runInfo.conclusion}
+                <span className="text-muted-foreground">{t.statusLabel}</span> {runInfo.conclusion}
               </span>
               <span>
-                <span className="text-muted-foreground">Data points:</span>{' '}
+                <span className="text-muted-foreground">{t.dataPointsLabel}</span>{' '}
                 {currentData.length.toLocaleString()}
               </span>
             </div>
@@ -333,7 +393,7 @@ export default function GpuMetricsDisplay() {
             <div className="grid grid-cols-1 sm:grid-cols-[1fr_auto] items-end gap-3">
               {artifacts.length > 1 && (
                 <div className="space-y-1 min-w-0">
-                  <Label htmlFor="gpu-metrics-artifact-select">Artifact</Label>
+                  <Label htmlFor="gpu-metrics-artifact-select">{t.artifactLabel}</Label>
                   <Select value={selectedArtifact} onValueChange={handleArtifactChange}>
                     <SelectTrigger
                       id="gpu-metrics-artifact-select"
@@ -345,7 +405,7 @@ export default function GpuMetricsDisplay() {
                     <SelectContent>
                       {artifacts.map((a) => (
                         <SelectItem key={a.name} value={a.name}>
-                          {a.name} ({a.data.length.toLocaleString()} rows)
+                          {a.name} ({a.data.length.toLocaleString()} {t.rows})
                         </SelectItem>
                       ))}
                     </SelectContent>
@@ -353,7 +413,7 @@ export default function GpuMetricsDisplay() {
                 </div>
               )}
               <div className="space-y-1">
-                <Label htmlFor="gpu-metrics-metric-select">Metric</Label>
+                <Label htmlFor="gpu-metrics-metric-select">{t.metricLabel}</Label>
                 <Select value={selectedMetric} onValueChange={handleMetricChange}>
                   <SelectTrigger
                     id="gpu-metrics-metric-select"
@@ -402,12 +462,12 @@ export default function GpuMetricsDisplay() {
                   {copied ? (
                     <>
                       <Check className="size-3" />
-                      Copied
+                      {t.copied}
                     </>
                   ) : (
                     <>
                       <LinkIcon className="size-3" />
-                      Share
+                      {t.share}
                     </>
                   )}
                 </Button>
@@ -417,7 +477,7 @@ export default function GpuMetricsDisplay() {
             {chartView === 'correlation' && (
               <div className="flex flex-wrap items-end gap-3 mb-3 no-export">
                 <div className="space-y-1">
-                  <Label className="text-xs">X Axis</Label>
+                  <Label className="text-xs">{t.xAxis}</Label>
                   <Select
                     value={corrXMetric}
                     onValueChange={(v) => setCorrXMetric(v as GpuMetricKey)}
@@ -435,7 +495,7 @@ export default function GpuMetricsDisplay() {
                   </Select>
                 </div>
                 <div className="space-y-1">
-                  <Label className="text-xs">Y Axis</Label>
+                  <Label className="text-xs">{t.yAxis}</Label>
                   <Select
                     value={corrYMetric}
                     onValueChange={(v) => setCorrYMetric(v as GpuMetricKey)}
@@ -464,7 +524,10 @@ export default function GpuMetricsDisplay() {
                 maxPoints={downsample ? 2000 : Infinity}
                 caption={
                   <>
-                    <h2 className="text-lg font-semibold">{metricConfig.label} over Time</h2>
+                    <h2 className="text-lg font-semibold">
+                      {metricConfig.label}
+                      {t.metricOverTimeSuffix}
+                    </h2>
                     <UnofficialDomainNotice />
                   </>
                 }
@@ -491,7 +554,7 @@ export default function GpuMetricsDisplay() {
                         : [
                             {
                               id: 'gpu-metrics-reset-filter',
-                              label: 'Reset filter',
+                              label: t.resetFilter,
                               onClick: selectAllGpus,
                             },
                           ]
@@ -499,7 +562,7 @@ export default function GpuMetricsDisplay() {
                     switches={[
                       {
                         id: 'gpu-metrics-downsample',
-                        label: 'Downsample',
+                        label: t.downsample,
                         checked: downsample,
                         onCheckedChange: (c) => {
                           setDownsample(c);
@@ -520,7 +583,7 @@ export default function GpuMetricsDisplay() {
                 maxPoints={downsample ? 2000 : Infinity}
                 caption={
                   <>
-                    <h2 className="text-lg font-semibold">Metric Correlation</h2>
+                    <h2 className="text-lg font-semibold">{t.metricCorrelation}</h2>
                     <UnofficialDomainNotice />
                   </>
                 }
@@ -547,7 +610,7 @@ export default function GpuMetricsDisplay() {
                         : [
                             {
                               id: 'gpu-metrics-reset-filter-2',
-                              label: 'Reset filter',
+                              label: t.resetFilter,
                               onClick: selectAllGpus,
                             },
                           ]
@@ -555,7 +618,7 @@ export default function GpuMetricsDisplay() {
                     switches={[
                       {
                         id: 'gpu-metrics-downsample-corr',
-                        label: 'Downsample',
+                        label: t.downsample,
                         checked: downsample,
                         onCheckedChange: (c) => {
                           setDownsample(c);
@@ -572,7 +635,7 @@ export default function GpuMetricsDisplay() {
           {/* Statistics Table */}
           <Card className="mt-4">
             <h3 className="text-sm font-semibold mb-2">
-              Per-GPU Statistics ({metricConfig.label})
+              {t.perGpuStats} ({metricConfig.label})
             </h3>
             <GpuStatsTable data={currentData} metricKey={selectedMetric} />
           </Card>
diff --git a/packages/app/src/components/gpu-specs/gpu-specs-content.tsx b/packages/app/src/components/gpu-specs/gpu-specs-content.tsx
index b4c0ec12..2900f100 100644
--- a/packages/app/src/components/gpu-specs/gpu-specs-content.tsx
+++ b/packages/app/src/components/gpu-specs/gpu-specs-content.tsx
@@ -1,6 +1,6 @@
 'use client';
 
-import { useRef, useState } from 'react';
+import { useMemo, useRef, useState } from 'react';
 import { track } from '@/lib/analytics';
 import { BarChart3, Radar, Table2 } from 'lucide-react';
 
@@ -26,6 +26,7 @@ import {
 } from '@/components/gpu-specs/scale-up-topology-diagram';
 import { GpuSpecsBarChart } from '@/components/gpu-specs/gpu-specs-bar-chart';
 import { GpuSpecsRadarChart } from '@/components/gpu-specs/gpu-specs-radar-chart';
+import { useLocale } from '@/lib/use-locale';
 
 function SpecCell({
   children,
@@ -65,28 +66,69 @@ function VendorBadge({ vendor }: { vendor: GpuSpec['vendor'] }) {
   );
 }
 
-type GpuSpecsViewMode = 'table' | 'chart' | 'radar';
-
-const GPU_SPECS_VIEW_MODE_OPTIONS: SegmentedToggleOption<GpuSpecsViewMode>[] = [
-  {
-    value: 'table',
-    label: 'Table',
-    icon: <Table2 className="size-3.5" />,
-    testId: 'gpu-specs-table-view-btn',
+const STRINGS = {
+  en: {
+    heading: 'GPU Specifications',
+    description:
+      'Hardware specifications for GPUs used in InferenceX™ benchmarks, including compute performance, memory bandwidth, and interconnect details.',
+    viewTable: 'Table',
+    viewChart: 'Chart',
+    viewRadar: 'Radar',
+    colGpu: 'GPU',
+    colMemory: 'Memory',
+    colMemBw: 'Mem BW',
+    colScaleUp: 'Scale Up',
+    colScaleUpBw: 'Scale Up BW',
+    colWorldSize: 'World Size',
+    colScaleUpDomainMem: 'Scale Up Domain Memory',
+    colScaleUpDomainMemBw: 'Scale Up Domain Mem BW',
+    colScaleUpTopology: 'Scale Up Topology',
+    colScaleUpSwitch: 'Scale Up Switch',
+    colScaleOutBwPerGpu: 'Scale Out BW per GPU',
+    colScaleOutTech: 'Scale Out Tech',
+    colScaleOutSwitch: 'Scale Out Switch',
+    colScaleOutTopology: 'Scale Out Topology',
+    colNic: 'NIC',
+    footnote1: 'Dense tensor core peak TFLOP/s (without sparsity).',
+    footnote2: 'Scale out isn’t used in InferenceX™ for rack scale.',
+    scaleOutHeading: 'Scale-Out Topology Diagrams',
+    scaleOutDescription:
+      'Per-server scale-out network topology for each GPU SKU, showing GPU → NIC → leaf switch connectivity.',
+    scaleUpHeading: 'Scale-Up Topology Diagrams',
+    scaleUpDescription:
+      'Intra-node scale-up interconnect topology for each GPU SKU, showing GPU → NVSwitch or direct GPU-to-GPU connectivity.',
   },
-  {
-    value: 'chart',
-    label: 'Chart',
-    icon: <BarChart3 className="size-3.5" />,
-    testId: 'gpu-specs-chart-view-btn',
+  zh: {
+    heading: 'GPU 规格',
+    description: 'InferenceX™ 基准测试中使用的 GPU 硬件规格，包括计算性能、显存带宽和互联详情。',
+    viewTable: '表格',
+    viewChart: '图表',
+    viewRadar: '雷达图',
+    colGpu: 'GPU',
+    colMemory: '显存',
+    colMemBw: '显存带宽',
+    colScaleUp: '纵向扩展',
+    colScaleUpBw: '纵向扩展带宽',
+    colWorldSize: '域内 GPU 数',
+    colScaleUpDomainMem: '纵向扩展域显存',
+    colScaleUpDomainMemBw: '纵向扩展域显存带宽',
+    colScaleUpTopology: '纵向扩展拓扑',
+    colScaleUpSwitch: '纵向扩展交换机',
+    colScaleOutBwPerGpu: '每 GPU 横向扩展带宽',
+    colScaleOutTech: '横向扩展技术',
+    colScaleOutSwitch: '横向扩展交换机',
+    colScaleOutTopology: '横向扩展拓扑',
+    colNic: 'NIC',
+    footnote1: '密集 Tensor Core 峰值 TFLOP/s（不含稀疏加速）。',
+    footnote2: 'InferenceX™ 机柜级测试不使用横向扩展。',
+    scaleOutHeading: '横向扩展拓扑图',
+    scaleOutDescription: '每台服务器的横向扩展网络拓扑，展示 GPU → NIC → Leaf 交换机的连接方式。',
+    scaleUpHeading: '纵向扩展拓扑图',
+    scaleUpDescription: '节点内纵向扩展互联拓扑，展示 GPU → NVSwitch 或 GPU 直连方式。',
   },
-  {
-    value: 'radar',
-    label: 'Radar',
-    icon: <Radar className="size-3.5" />,
-    testId: 'gpu-specs-radar-view-btn',
-  },
-];
+} as const;
+
+type GpuSpecsViewMode = 'table' | 'chart' | 'radar';
 
 function GpuSpecsTable({
   onTopologyClick,
@@ -95,19 +137,20 @@ function GpuSpecsTable({
   onTopologyClick?: (gpuName: string) => void;
   onScaleUpTopologyClick?: (gpuName: string) => void;
 }) {
+  const t = STRINGS[useLocale()];
   return (
     <div className="overflow-x-auto">
       <table className="w-full border-collapse min-w-[1400px]">
         <thead>
           <tr className="border-b border-border">
             <SpecCell header align="left" sticky>
-              GPU
+              {t.colGpu}
             </SpecCell>
             <SpecCell header align="right">
-              Memory
+              {t.colMemory}
             </SpecCell>
             <SpecCell header align="right">
-              Mem BW
+              {t.colMemBw}
             </SpecCell>
             <SpecCell header align="right">
               FP4{' '}
@@ -128,40 +171,40 @@ function GpuSpecsTable({
               </span>
             </SpecCell>
             <SpecCell header align="left">
-              Scale Up
+              {t.colScaleUp}
             </SpecCell>
             <SpecCell header align="right">
-              Scale Up BW
+              {t.colScaleUpBw}
             </SpecCell>
             <SpecCell header align="right">
-              World Size
+              {t.colWorldSize}
             </SpecCell>
             <SpecCell header align="right" className="min-w-36">
-              Scale Up Domain Memory
+              {t.colScaleUpDomainMem}
             </SpecCell>
             <SpecCell header align="right" className="min-w-36">
-              Scale Up Domain Mem BW
+              {t.colScaleUpDomainMemBw}
             </SpecCell>
             <SpecCell header align="left">
-              Scale Up Topology
+              {t.colScaleUpTopology}
             </SpecCell>
             <SpecCell header align="left">
-              Scale Up Switch
+              {t.colScaleUpSwitch}
             </SpecCell>
             <SpecCell header align="right" className="min-w-28">
-              Scale Out BW per GPU
+              {t.colScaleOutBwPerGpu}
             </SpecCell>
             <SpecCell header align="left">
-              Scale Out Tech
+              {t.colScaleOutTech}
             </SpecCell>
             <SpecCell header align="left">
-              Scale Out Switch
+              {t.colScaleOutSwitch}
             </SpecCell>
             <SpecCell header align="left">
-              Scale Out Topology
+              {t.colScaleOutTopology}
             </SpecCell>
             <SpecCell header align="left">
-              NIC
+              {t.colNic}
             </SpecCell>
           </tr>
         </thead>
@@ -286,10 +329,35 @@ function GpuSpecsTable({
 
 export function GpuSpecsContent() {
   const specsWithTopology = GPU_SPECS.filter((spec) => spec.scaleOutTopology !== null);
+  const t = STRINGS[useLocale()];
 
   const [viewMode, setViewMode] = useState<GpuSpecsViewMode>('table');
   const [selectedMetric, setSelectedMetric] = useState(GPU_CHART_METRICS[0].key);
 
+  const viewModeOptions = useMemo<SegmentedToggleOption<GpuSpecsViewMode>[]>(
+    () => [
+      {
+        value: 'table',
+        label: t.viewTable,
+        icon: <Table2 className="size-3.5" />,
+        testId: 'gpu-specs-table-view-btn',
+      },
+      {
+        value: 'chart',
+        label: t.viewChart,
+        icon: <BarChart3 className="size-3.5" />,
+        testId: 'gpu-specs-chart-view-btn',
+      },
+      {
+        value: 'radar',
+        label: t.viewRadar,
+        icon: <Radar className="size-3.5" />,
+        testId: 'gpu-specs-radar-view-btn',
+      },
+    ],
+    [t],
+  );
+
   // Refs for each scale-out topology diagram, keyed by GPU name
   const diagramRefs = useRef<Record<string, TopologyDiagramHandle | null>>({});
   // Refs for each scale-up topology diagram, keyed by GPU name
@@ -314,11 +382,8 @@ export function GpuSpecsContent() {
         <Card>
           <div className="flex items-start justify-between">
             <div>
-              <h2 className="text-lg font-semibold mb-2">GPU Specifications</h2>
-              <p className="text-muted-foreground text-sm">
-                Hardware specifications for GPUs used in InferenceX&trade; benchmarks, including
-                compute performance, memory bandwidth, and interconnect details.
-              </p>
+              <h2 className="text-lg font-semibold mb-2">{t.heading}</h2>
+              <p className="text-muted-foreground text-sm">{t.description}</p>
             </div>
             <ChartShareActions />
           </div>
@@ -330,7 +395,7 @@ export function GpuSpecsContent() {
             <div />
             <SegmentedToggle
               value={viewMode}
-              options={GPU_SPECS_VIEW_MODE_OPTIONS}
+              options={viewModeOptions}
               onValueChange={handleViewModeChange}
               ariaLabel="View mode"
               testId="gpu-specs-view-toggle"
@@ -346,10 +411,10 @@ export function GpuSpecsContent() {
               />
               <div className="px-4 md:px-8 pt-4">
                 <p className="text-xs text-muted-foreground">
-                  <sup>1</sup> Dense tensor core peak TFLOP/s (without sparsity).
+                  <sup>1</sup> {t.footnote1}
                 </p>
                 <p className="mt-1 text-xs text-muted-foreground">
-                  <sup>2</sup> Scale out isn&apos;t used in InferenceX&trade; for rack scale.
+                  <sup>2</sup> {t.footnote2}
                 </p>
               </div>
             </>
@@ -362,11 +427,8 @@ export function GpuSpecsContent() {
       </section>
       <section className="pt-8 md:pt-0">
         <Card>
-          <h3 className="text-lg font-semibold mb-2">Scale-Out Topology Diagrams</h3>
-          <p className="text-muted-foreground text-sm mb-6">
-            Per-server scale-out network topology for each GPU SKU, showing GPU &rarr; NIC &rarr;
-            leaf switch connectivity.
-          </p>
+          <h3 className="text-lg font-semibold mb-2">{t.scaleOutHeading}</h3>
+          <p className="text-muted-foreground text-sm mb-6">{t.scaleOutDescription}</p>
           <div className="grid grid-cols-1 md:grid-cols-2 xl:grid-cols-3 gap-6">
             {specsWithTopology.map((spec) => (
               <div
@@ -388,11 +450,8 @@ export function GpuSpecsContent() {
       </section>
       <section className="pt-8 md:pt-0">
         <Card>
-          <h3 className="text-lg font-semibold mb-2">Scale-Up Topology Diagrams</h3>
-          <p className="text-muted-foreground text-sm mb-6">
-            Intra-node scale-up interconnect topology for each GPU SKU, showing GPU &rarr; NVSwitch
-            or direct GPU-to-GPU connectivity.
-          </p>
+          <h3 className="text-lg font-semibold mb-2">{t.scaleUpHeading}</h3>
+          <p className="text-muted-foreground text-sm mb-6">{t.scaleUpDescription}</p>
           <div className="grid grid-cols-1 md:grid-cols-2 xl:grid-cols-3 gap-6">
             {GPU_SPECS.map((spec) => (
               <div
diff --git a/packages/app/src/components/header/header.tsx b/packages/app/src/components/header/header.tsx
index 576a3bdb..0fe42e86 100644
--- a/packages/app/src/components/header/header.tsx
+++ b/packages/app/src/components/header/header.tsx
@@ -145,7 +145,7 @@ export const Header = ({ starCount }: { starCount?: number | null }) => {
       <div className="container mx-auto px-4 lg:px-8">
         <div className="flex h-14 items-center gap-6">
           {/* Brand */}
-          <Link href="/" className="flex items-center gap-2 shrink-0">
+          <Link href={isZh ? '/zh' : '/'} className="flex items-center gap-2 shrink-0">
             <span className="pride-wordmark text-lg font-bold tracking-tight">InferenceX</span>
             <span className="hidden sm:flex items-center gap-1.5 text-xs text-muted-foreground">
               by
diff --git a/packages/app/src/components/inference/ui/ChartControls.tsx b/packages/app/src/components/inference/ui/ChartControls.tsx
index 9f333482..2ac914a0 100644
--- a/packages/app/src/components/inference/ui/ChartControls.tsx
+++ b/packages/app/src/components/inference/ui/ChartControls.tsx
@@ -30,6 +30,69 @@ import chartDefinitions from '@/components/inference/inference-chart-config.json
 import type { ChartDefinition, DisaggMode, SpecMode } from '@/components/inference/types';
 import { FRAMEWORK_FAMILIES } from '@/components/inference/utils/quickFilters';
 import { Sequence, type Model, type Percentile } from '@/lib/data-mappings';
+import { useLocale } from '@/lib/use-locale';
+
+const STRINGS = {
+  en: {
+    yAxisMetric: 'Y-Axis Metric',
+    yAxisMetricTooltip:
+      "The performance metric displayed on the chart's Y-axis. Options include throughput (tokens/sec), cost per million tokens, and custom user-defined values.",
+    xAxisMetric: 'X-Axis Metric',
+    xAxisMetricTooltip:
+      "The latency metric displayed on the chart's X-axis: P90 Time To First Token.",
+    xAxisScale: 'X-Axis Scale',
+    xAxisScaleTooltip:
+      'The scale type for the X-axis. Auto automatically chooses between linear and logarithmic based on the data range. Linear uses a linear scale. Logarithmic uses a log scale for better visualization of wide-ranging values.',
+    scaleAuto: 'Auto',
+    scaleLinear: 'Linear',
+    scaleLog: 'Logarithmic',
+    gpuConfig: 'GPU Config',
+    gpuConfigTooltip:
+      'Select up to 4 GPU configurations to compare their historical performance over time. This allows for tracking how software updates may affect specific hardware.',
+    gpuConfigPlaceholder: 'Select a GPU Config for comparison',
+    comparisonDateRange: 'Comparison Date Range',
+    comparisonDateRangeTooltip:
+      'Select the start and end dates for the historical comparison. The chart will show performance data for the selected GPU configs across this time range.',
+    dateRangePlaceholder: 'Select date range',
+    quickFilters: 'Quick Filters',
+    quickFiltersTooltip:
+      'Narrow the chart to any combination of GPU vendor, serving framework, aggregation mode (aggregated vs disaggregated serving), and speculative decoding (MTP vs standard). Selecting none in a group shows all.',
+    filterVendor: 'Vendor',
+    filterFramework: 'Framework',
+    filterAggregation: 'Aggregation',
+    filterSpecDecoding: 'Spec Decoding',
+    noData: 'No data for the current selection',
+  },
+  zh: {
+    yAxisMetric: 'Y 轴指标',
+    yAxisMetricTooltip:
+      '图表 Y 轴显示的性能指标。可选项包括吞吐量（token/秒）、每百万 token 成本以及用户自定义值。',
+    xAxisMetric: 'X 轴指标',
+    xAxisMetricTooltip: '图表 X 轴显示的延迟指标：P90 Time To First Token。',
+    xAxisScale: 'X 轴刻度',
+    xAxisScaleTooltip:
+      'X 轴的刻度类型。自动模式根据数据范围自动选择线性或对数刻度。线性使用线性刻度。对数使用对数刻度，更适合展示范围较大的数据。',
+    scaleAuto: '自动',
+    scaleLinear: '线性',
+    scaleLog: '对数',
+    gpuConfig: 'GPU 配置',
+    gpuConfigTooltip:
+      '最多选择 4 个 GPU 配置以对比其历史性能趋势。可用于追踪软件更新对特定硬件的影响。',
+    gpuConfigPlaceholder: '选择 GPU 配置进行对比',
+    comparisonDateRange: '对比日期范围',
+    comparisonDateRangeTooltip:
+      '选择历史对比的起止日期。图表将展示所选 GPU 配置在此时间范围内的性能数据。',
+    dateRangePlaceholder: '选择日期范围',
+    quickFilters: '快捷筛选',
+    quickFiltersTooltip:
+      '按 GPU 厂商、推理框架、聚合模式（聚合 vs 分离式）和投机解码（MTP vs 标准）的任意组合筛选图表。某组不选则显示全部。',
+    filterVendor: '厂商',
+    filterFramework: '框架',
+    filterAggregation: '聚合模式',
+    filterSpecDecoding: '投机解码',
+    noData: '当前选择无可用数据',
+  },
+} as const;
 
 /**
  * Y-axis metric options from static chart config JSON — available immediately, no API wait.
@@ -110,6 +173,7 @@ interface ChartControlsProps {
 }
 
 export default function ChartControls({ hideGpuComparison = false }: ChartControlsProps) {
+  const t = STRINGS[useLocale()];
   // The percentile selector is rendered conditionally on `selectedSequence`,
   // which on the client is hydrated from URL params. SSR doesn't see the URL,
   // so deferring the conditional until after mount keeps the initial DOM
@@ -316,7 +380,7 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
   }[] = [
     {
       key: 'vendor',
-      label: 'Vendor',
+      label: t.filterVendor,
       options: QUICK_FILTER_VENDORS.map((o) => ({
         ...o,
         available: availableQuickFilters.vendors.includes(o.value),
@@ -327,7 +391,7 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
       ? [
           {
             key: 'framework' as const,
-            label: 'Framework',
+            label: t.filterFramework,
             options: frameworkOptions,
             selected: quickFilters.frameworks,
           },
@@ -335,7 +399,7 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
       : []),
     {
       key: 'disagg',
-      label: 'Aggregation',
+      label: t.filterAggregation,
       options: QUICK_FILTER_DISAGG.map((o) => ({
         ...o,
         available: availableQuickFilters.disagg.includes(o.value),
@@ -344,7 +408,7 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
     },
     {
       key: 'spec',
-      label: 'Spec Decoding',
+      label: t.filterSpecDecoding,
       options: QUICK_FILTER_SPEC.map((o) => ({
         ...o,
         available: availableQuickFilters.spec.includes(o.value),
@@ -391,8 +455,8 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
           <div className="flex flex-col space-y-1.5 lg:col-span-2">
             <LabelWithTooltip
               htmlFor="y-axis-select"
-              label="Y-Axis Metric"
-              tooltip="The performance metric displayed on the chart's Y-axis. Options include throughput (tokens/sec), cost per million tokens, and custom user-defined values."
+              label={t.yAxisMetric}
+              tooltip={t.yAxisMetricTooltip}
             />
             <SearchableSelect
               triggerId="y-axis-select"
@@ -414,8 +478,8 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
               <div className="flex flex-col space-y-1.5 lg:col-span-1">
                 <LabelWithTooltip
                   htmlFor="x-axis-select"
-                  label="X-Axis Metric"
-                  tooltip="The latency metric displayed on the chart's X-axis: P90 Time To First Token."
+                  label={t.xAxisMetric}
+                  tooltip={t.xAxisMetricTooltip}
                 />
                 <Select
                   onValueChange={handleXAxisMetricChange}
@@ -440,8 +504,8 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
               <div className="flex flex-col space-y-1.5 lg:col-span-1">
                 <LabelWithTooltip
                   htmlFor="scale-type-select"
-                  label="X-Axis Scale"
-                  tooltip="The scale type for the X-axis. Auto automatically chooses between linear and logarithmic based on the data range. Linear uses a linear scale. Logarithmic uses a log scale for better visualization of wide-ranging values."
+                  label={t.xAxisScale}
+                  tooltip={t.xAxisScaleTooltip}
                 />
                 <Select onValueChange={handleScaleTypeChange} value={scaleType}>
                   <SelectTrigger
@@ -452,9 +516,9 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
                     <SelectValue />
                   </SelectTrigger>
                   <SelectContent portalled={false}>
-                    <SelectItem value="auto">Auto</SelectItem>
-                    <SelectItem value="linear">Linear</SelectItem>
-                    <SelectItem value="log">Logarithmic</SelectItem>
+                    <SelectItem value="auto">{t.scaleAuto}</SelectItem>
+                    <SelectItem value="linear">{t.scaleLinear}</SelectItem>
+                    <SelectItem value="log">{t.scaleLog}</SelectItem>
                   </SelectContent>
                 </Select>
               </div>
@@ -464,8 +528,8 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
             <div className="flex flex-col space-y-1.5 lg:col-span-2">
               <LabelWithTooltip
                 htmlFor="gpu-config-select"
-                label="GPU Config"
-                tooltip="Select up to 4 GPU configurations to compare their historical performance over time. This allows for tracking how software updates may affect specific hardware."
+                label={t.gpuConfig}
+                tooltip={t.gpuConfigTooltip}
               />
               <div data-testid="gpu-multiselect">
                 <MultiSelect
@@ -474,7 +538,7 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
                   onChange={handleGPUChange}
                   open={openDropdown === 'gpu'}
                   onOpenChange={handleDropdownOpenChange('gpu')}
-                  placeholder="Select a GPU Config for comparison"
+                  placeholder={t.gpuConfigPlaceholder}
                   maxSelections={4}
                 />
               </div>
@@ -485,13 +549,13 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
             <div className="flex flex-col space-y-1.5 lg:col-span-2">
               <LabelWithTooltip
                 htmlFor="date-picker"
-                label="Comparison Date Range"
-                tooltip="Select the start and end dates for the historical comparison. The chart will show performance data for the selected GPU configs across this time range."
+                label={t.comparisonDateRange}
+                tooltip={t.comparisonDateRangeTooltip}
               />
               <DateRangePicker
                 dateRange={selectedDateRange}
                 onChange={handleDateRangeChange}
-                placeholder="Select date range"
+                placeholder={t.dateRangePlaceholder}
                 availableDates={dateRangeAvailableDates}
                 isCheckingAvailableDates={isCheckingAvailableDates}
                 className={
@@ -509,8 +573,8 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
           <div className="flex flex-col space-y-1.5" data-testid="quick-filters">
             <LabelWithTooltip
               htmlFor="quick-filters"
-              label="Quick Filters"
-              tooltip="Narrow the chart to any combination of GPU vendor, serving framework, aggregation mode (aggregated vs disaggregated serving), and speculative decoding (MTP vs standard). Selecting none in a group shows all."
+              label={t.quickFilters}
+              tooltip={t.quickFiltersTooltip}
             />
             <div className="flex flex-wrap items-center gap-x-6 gap-y-2">
               {quickFilterGroups.map((group) => (
@@ -530,7 +594,7 @@ export default function ChartControls({ hideGpuComparison = false }: ChartContro
                           variant={active ? 'default' : 'outline'}
                           aria-pressed={active}
                           disabled={disabled}
-                          title={disabled ? 'No data for the current selection' : undefined}
+                          title={disabled ? t.noData : undefined}
                           // Active pills use the brand color (blue in light, amber in dark)
                           // rather than the amber primary fill.
                           className={cn(
diff --git a/packages/app/src/components/inference/ui/ChartDisplay.tsx b/packages/app/src/components/inference/ui/ChartDisplay.tsx
index 55b90124..b4f24c44 100644
--- a/packages/app/src/components/inference/ui/ChartDisplay.tsx
+++ b/packages/app/src/components/inference/ui/ChartDisplay.tsx
@@ -65,6 +65,7 @@ import { isAgenticOnlyXAxisMode, type XAxisMode } from '@/components/inference/h
 import { isPersistedBenchmarkId } from '@/lib/benchmark-id';
 import { useTrendData } from '@/components/inference/hooks/useTrendData';
 import { getHardwareConfig, hardwareKeyMatchesAnyBase } from '@/lib/constants';
+import { useLocale } from '@/lib/use-locale';
 
 import ChartControls from './ChartControls';
 import ComparisonChangelog from './ComparisonChangelog';
@@ -82,6 +83,41 @@ import WorkflowInfoDisplay from './WorkflowInfoDisplay';
 
 type InferenceViewMode = 'chart' | 'table';
 
+const STRINGS = {
+  en: {
+    inferencePerformance: 'Inference Performance',
+    inferencePerformanceDesc:
+      'Inference performance metrics across different models, hardware configurations, and serving parameters.',
+    chart: 'Chart',
+    table: 'Table',
+    sourceUnofficial: 'Source: UNOFFICIAL',
+    sourceOfficial: 'Source: SemiAnalysis InferenceX™',
+    updated: 'Updated:',
+    normalizedE2eDisclaimer:
+      'Normalized E2E requires persisted per-request traces, so unofficial-run overlays are unavailable for this experimental view.',
+    selectDateRange: 'Select a date range or add a run to view GPU comparison',
+    performanceOverTime: 'Performance Over Time',
+    performanceOverTimeDesc:
+      'Double-click points on the scatter chart to track configurations over time.',
+    viewMode: 'View mode',
+  },
+  zh: {
+    inferencePerformance: '推理性能',
+    inferencePerformanceDesc: '不同模型、硬件配置和服务参数下的推理性能指标。',
+    chart: '图表',
+    table: '表格',
+    sourceUnofficial: '来源：非官方',
+    sourceOfficial: '来源：SemiAnalysis InferenceX™',
+    updated: '更新时间：',
+    normalizedE2eDisclaimer:
+      'Normalized E2E 需要持久化的逐请求 trace 数据，因此该实验性视图无法叠加显示非官方运行。',
+    selectDateRange: '请选择日期范围或添加运行以查看 GPU 对比',
+    performanceOverTime: '性能趋势',
+    performanceOverTimeDesc: '双击散点图上的数据点以追踪配置随时间的变化。',
+    viewMode: '视图模式',
+  },
+} as const;
+
 const X_AXIS_MODE_BUTTONS: { value: XAxisMode; label: string }[] = [
   { value: 'interactivity', label: 'Interactivity' },
   { value: 'e2e', label: 'E2E Latency' },
@@ -153,6 +189,8 @@ const VIEW_MODE_OPTIONS: SegmentedToggleOption<InferenceViewMode>[] = [
  * the current filtered benchmark data.
  */
 export default function ChartDisplay() {
+  const locale = useLocale();
+  const t = STRINGS[locale];
   const {
     graphs,
     loading,
@@ -252,6 +290,15 @@ export default function ChartDisplay() {
     track('inference_view_changed', { view: value, chartIndex: index });
   };
 
+  const viewModeOptions = useMemo<SegmentedToggleOption<InferenceViewMode>[]>(
+    () =>
+      VIEW_MODE_OPTIONS.map((opt) => ({
+        ...opt,
+        label: opt.value === 'chart' ? t.chart : t.table,
+      })),
+    [t],
+  );
+
   const {
     unofficialRunInfo,
     unofficialRunInfos,
@@ -506,9 +553,9 @@ export default function ChartDisplay() {
                     leadingControls={
                       <SegmentedToggle
                         value={getViewMode(graphIndex)}
-                        options={VIEW_MODE_OPTIONS}
+                        options={viewModeOptions}
                         onValueChange={(v) => handleViewModeChange(graphIndex, v)}
-                        ariaLabel="View mode"
+                        ariaLabel={t.viewMode}
                         testId={`inference-view-toggle-${graphIndex}`}
                       />
                     }
@@ -621,15 +668,13 @@ export default function ChartDisplay() {
                               .map((prec) => getPrecisionLabel(prec as Precision))
                               .join(', ')}{' '}
                             • {getSequenceLabel(graph.sequence as Sequence)} •{' '}
-                            {isUnofficialRun
-                              ? 'Source: UNOFFICIAL'
-                              : 'Source: SemiAnalysis InferenceX™'}
+                            {isUnofficialRun ? t.sourceUnofficial : t.sourceOfficial}
                             {selectedRunDate && (
                               <>
                                 {' '}
-                                • Updated:{' '}
+                                • {t.updated}{' '}
                                 {new Date(`${selectedRunDate}T00:00:00Z`).toLocaleDateString(
-                                  'en-US',
+                                  locale === 'zh' ? 'zh-CN' : 'en-US',
                                   {
                                     year: 'numeric',
                                     month: '2-digit',
@@ -643,8 +688,7 @@ export default function ChartDisplay() {
                           <MetricAssumptionNotes selectedYAxisMetric={selectedYAxisMetric} />
                           {isUnofficialRun && selectedXAxisMode === 'normalized-e2e' && (
                             <p className="mb-2 text-xs text-muted-foreground">
-                              Normalized E2E requires persisted per-request traces, so
-                              unofficial-run overlays are unavailable for this experimental view.
+                              {t.normalizedE2eDisclaimer}
                             </p>
                           )}
                           <UnofficialDomainNotice />
@@ -720,7 +764,7 @@ export default function ChartDisplay() {
                             selectedDates.length === 0 && (
                               <div className="absolute inset-0 flex items-center justify-center bg-background/60 backdrop-blur-[2px] rounded-lg z-10">
                                 <p className="text-sm font-medium text-muted-foreground bg-background/90 border border-border rounded-md px-4 py-2 shadow-sm">
-                                  Select a date range or add a run to view GPU comparison
+                                  {t.selectDateRange}
                                 </p>
                               </div>
                             )}
@@ -755,11 +799,8 @@ export default function ChartDisplay() {
           <div className="flex flex-col gap-4">
             <div className="flex items-start justify-between">
               <div>
-                <h2 className="text-lg font-semibold mb-2">Inference Performance</h2>
-                <p className="text-muted-foreground text-sm mb-4">
-                  Inference performance metrics across different models, hardware configurations,
-                  and serving parameters.
-                </p>
+                <h2 className="text-lg font-semibold mb-2">{t.inferencePerformance}</h2>
+                <p className="text-muted-foreground text-sm mb-4">{t.inferencePerformanceDesc}</p>
               </div>
               <ChartShareActions />
             </div>
@@ -854,10 +895,8 @@ export default function ChartDisplay() {
       >
         <DialogContent className="max-w-4xl max-h-[90vh] overflow-y-auto">
           <DialogHeader>
-            <DialogTitle>Performance Over Time</DialogTitle>
-            <DialogDescription>
-              Double-click points on the scatter chart to track configurations over time.
-            </DialogDescription>
+            <DialogTitle>{t.performanceOverTime}</DialogTitle>
+            <DialogDescription>{t.performanceOverTimeDesc}</DialogDescription>
           </DialogHeader>
           <div className="flex flex-wrap gap-2 mb-4">
             {trackedConfigs.map((config) => (
diff --git a/packages/app/src/components/landing/curated-view-card.tsx b/packages/app/src/components/landing/curated-view-card.tsx
index 6104595e..869f3d96 100644
--- a/packages/app/src/components/landing/curated-view-card.tsx
+++ b/packages/app/src/components/landing/curated-view-card.tsx
@@ -4,12 +4,17 @@ import { ArrowRight } from 'lucide-react';
 
 import { Badge } from '@/components/ui/badge';
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
 import type { FavoritePreset } from '@/components/favorites/favorite-presets';
 
 export function CuratedViewCard({ preset }: { preset: FavoritePreset }) {
+  const locale = useLocale();
+  const isZh = locale === 'zh';
   const isNew = preset.tags.some((t) => t.toLowerCase() === 'new');
   const visibleTags = preset.tags.filter((t) => t.toLowerCase() !== 'new');
-  const href = `/inference?preset=${preset.id}`;
+  const title = (isZh && preset.titleZh) || preset.title;
+  const description = (isZh && preset.descriptionZh) || preset.description;
+  const href = `${isZh ? '/zh' : ''}/inference?preset=${preset.id}`;
   const onClick = (e: React.MouseEvent<HTMLAnchorElement>) => {
     if (e.metaKey || e.ctrlKey || e.shiftKey || e.altKey || e.button !== 0) return;
     e.preventDefault();
@@ -31,7 +36,7 @@ export function CuratedViewCard({ preset }: { preset: FavoritePreset }) {
       <div className="absolute inset-y-3 left-0 w-0.5 rounded-full bg-brand/60 transition-all duration-200 group-hover:bg-brand group-hover:inset-y-2" />
       <div className="flex items-start justify-between gap-2">
         <h3 className="font-semibold text-sm leading-tight group-hover:text-brand transition-colors duration-200">
-          <span className="align-middle">{preset.title}</span>
+          <span className="align-middle">{title}</span>
           {isNew && (
             <span className="ml-2 inline-flex items-center gap-1.5 align-middle rounded-full bg-brand px-2 py-0.5 text-[10px] font-bold uppercase tracking-wider text-primary-foreground shadow-sm">
               New
@@ -41,7 +46,7 @@ export function CuratedViewCard({ preset }: { preset: FavoritePreset }) {
         <ArrowRight className="size-4 shrink-0 text-muted-foreground transition-all duration-200 group-hover:translate-x-0.5 group-hover:text-brand" />
       </div>
       <p className="text-xs text-muted-foreground leading-relaxed mt-1.5 line-clamp-2">
-        {preset.description}
+        {description}
       </p>
       <div className="flex flex-wrap gap-1.5 mt-auto pt-3">
         {visibleTags.map((tag) => (
diff --git a/packages/app/src/components/mtp-engine-conflict-toast.tsx b/packages/app/src/components/mtp-engine-conflict-toast.tsx
index 480912b4..d15f977e 100644
--- a/packages/app/src/components/mtp-engine-conflict-toast.tsx
+++ b/packages/app/src/components/mtp-engine-conflict-toast.tsx
@@ -6,6 +6,8 @@ import { useEffect, useState } from 'react';
 import { FRAMEWORK_LABELS } from '@semianalysisai/inferencex-constants';
 
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
+import type { Locale } from '@/lib/i18n';
 import { BottomToast } from '@/components/ui/bottom-toast';
 
 /**
@@ -31,7 +33,15 @@ function joinList(parts: string[]): string {
   return `${parts.slice(0, -1).join(', ')}, and ${parts.at(-1)}`;
 }
 
-function describe(detail: MtpEngineConflictDetail): string {
+function joinListZh(parts: string[]): string {
+  if (parts.length === 0) return '';
+  if (parts.length === 1) return parts[0];
+  if (parts.length === 2) return `${parts[0]} 和 ${parts[1]}`;
+  return `${parts.slice(0, -1).join('、')} 和 ${parts.at(-1)}`;
+}
+
+function describe(detail: MtpEngineConflictDetail, locale: Locale): string {
+  if (locale === 'zh') return describeZh(detail);
   if (detail.kind === 'blocked') {
     const attempted = familyLabel(detail.attempted);
     if (detail.existing) {
@@ -47,12 +57,34 @@ function describe(detail: MtpEngineConflictDetail): string {
   return `${joinList(labels)} use different MTP acceptance-rate implementations and can't be shown on the same graph. All MTP configs are disabled by default. Enable one from the legend to view it.`;
 }
 
+function describeZh(detail: MtpEngineConflictDetail): string {
+  if (detail.kind === 'blocked') {
+    const attempted = familyLabel(detail.attempted);
+    if (detail.existing) {
+      const existing = familyLabel(detail.existing);
+      return `${attempted} 和 ${existing} 使用不同的 MTP 接受率实现，数值不可直接比较。请先移除 ${existing} MTP 配置再切换。`;
+    }
+    return `另一个引擎的 MTP 处于启用状态时，无法启用 ${attempted} MTP。请先移除现有 MTP 配置。`;
+  }
+  const labels = [...detail.families].toSorted().map(familyLabel);
+  if (labels.length === 0) {
+    return '不同引擎的 MTP 配置使用不同的接受率实现，无法在同一图表上显示。所有 MTP 配置默认禁用，请从图例中启用一项来查看。';
+  }
+  return `${joinListZh(labels)} 使用不同的 MTP 接受率实现，无法在同一图表上显示。所有 MTP 配置默认禁用，请从图例中启用一项来查看。`;
+}
+
+const TITLES = {
+  en: "MTP configs from different engines can't share a graph",
+  zh: '不同引擎的 MTP 配置无法共享同一图表',
+} as const;
+
 interface Props {
   detail: MtpEngineConflictDetail | null;
   onDismiss?: () => void;
 }
 
 export function MtpEngineConflictToast({ detail, onDismiss }: Props) {
+  const locale = useLocale();
   const [seq, setSeq] = useState(0);
 
   useEffect(() => {
@@ -73,8 +105,8 @@ export function MtpEngineConflictToast({ detail, onDismiss }: Props) {
       key={seq}
       testId="mtp-engine-conflict-toast"
       icon={<AlertTriangle className="text-amber-500" />}
-      title="MTP configs from different engines can't share a graph"
-      description={describe(detail)}
+      title={TITLES[locale]}
+      description={describe(detail, locale)}
       onDismiss={onDismiss}
     />
   );
diff --git a/packages/app/src/components/nudge-engine.tsx b/packages/app/src/components/nudge-engine.tsx
index a731d5fe..3a24786e 100644
--- a/packages/app/src/components/nudge-engine.tsx
+++ b/packages/app/src/components/nudge-engine.tsx
@@ -4,6 +4,8 @@ import { ArrowRight, X } from 'lucide-react';
 import { useCallback, useEffect, useMemo, useRef, useState } from 'react';
 
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
+import type { Locale } from '@/lib/i18n';
 import {
   isDismissed,
   isPermanentlySuppressed,
@@ -331,6 +333,24 @@ function setupTrigger(
   return null;
 }
 
+// ---------------------------------------------------------------------------
+// Locale helpers
+// ---------------------------------------------------------------------------
+
+function localized(locale: Locale, en: string, zh?: string): string {
+  return locale === 'zh' && zh ? zh : en;
+}
+
+const RENDERER_STRINGS = {
+  en: {
+    explore: 'Explore',
+    maybeLater: 'Maybe Later',
+    close: 'Close',
+    dismissBanner: 'Dismiss launch banner',
+  },
+  zh: { explore: '探索', maybeLater: '稍后再看', close: '关闭', dismissBanner: '关闭发布横幅' },
+} as const;
+
 // ---------------------------------------------------------------------------
 // Renderers
 // ---------------------------------------------------------------------------
@@ -344,18 +364,19 @@ function ToastRenderer({
   onDismiss: () => void;
   onAction: () => void;
 }) {
+  const locale = useLocale();
   const { content } = def;
   const Icon = content.icon;
   return (
     <BottomToast
       testId={content.testId}
       icon={<Icon className={content.iconClassName} />}
-      title={content.title}
-      description={content.description}
+      title={localized(locale, content.title, content.titleZh)}
+      description={localized(locale, content.description, content.descriptionZh)}
       action={
         content.action
           ? {
-              label: content.action.label,
+              label: localized(locale, content.action.label, content.action.labelZh),
               icon: content.action.icon,
               onClick: onAction,
             }
@@ -375,6 +396,8 @@ function ModalRenderer({
   onDismiss: () => void;
   onAction: () => void;
 }) {
+  const locale = useLocale();
+  const rs = RENDERER_STRINGS[locale];
   const { content } = def;
   const Icon = content.icon;
   const idPrefix = def.id;
@@ -397,7 +420,7 @@ function ModalRenderer({
         type="button"
         onClick={onDismiss}
         className="absolute right-4 top-4 rounded-sm opacity-70 transition-opacity hover:opacity-100 focus:outline-none focus:ring-2 focus:ring-ring focus:ring-offset-2"
-        aria-label="Close"
+        aria-label={rs.close}
       >
         <X className="size-4" />
       </button>
@@ -408,15 +431,15 @@ function ModalRenderer({
           <div className="space-y-1.5 pr-6">
             <h2 id={`${idPrefix}-title`} className="flex items-center gap-2 text-lg font-semibold">
               <Icon className={`size-5 ${content.iconClassName ?? ''}`} />
-              {content.title}
+              {localized(locale, content.title, content.titleZh)}
               {content.badge && (
                 <span className="ml-1 inline-flex items-center rounded-full bg-brand px-2 py-0.5 text-[10px] font-bold uppercase tracking-wider text-primary-foreground shadow-sm">
-                  {content.badge}
+                  {localized(locale, content.badge, content.badgeZh)}
                 </span>
               )}
             </h2>
             <p id={`${idPrefix}-description`} className="text-sm text-muted-foreground">
-              {content.description}
+              {localized(locale, content.description, content.descriptionZh)}
             </p>
           </div>
           <div className="flex flex-row justify-end gap-2">
@@ -425,7 +448,11 @@ function ModalRenderer({
               onClick={onDismiss}
               data-testid={content.testId ? `${content.testId}-dismiss` : undefined}
             >
-              {content.dismissLabel ?? 'Maybe Later'}
+              {localized(
+                locale,
+                content.dismissLabel ?? 'Maybe Later',
+                content.dismissLabelZh ?? rs.maybeLater,
+              )}
             </Button>
             {content.primaryAction && (
               <Button
@@ -434,7 +461,7 @@ function ModalRenderer({
                 className={content.actionClassName}
               >
                 {content.primaryAction.icon}
-                {content.primaryAction.label}
+                {localized(locale, content.primaryAction.label, content.primaryAction.labelZh)}
               </Button>
             )}
           </div>
@@ -448,7 +475,7 @@ function ModalRenderer({
       <div className="fixed inset-0 z-50 flex items-center justify-center p-4">
         <button
           type="button"
-          aria-label="Close"
+          aria-label={rs.close}
           onClick={onDismiss}
           className="absolute inset-0 cursor-default bg-black/50"
         />
@@ -469,6 +496,8 @@ function BannerRenderer({
   onDismiss: () => void;
   onAction: () => void;
 }) {
+  const locale = useLocale();
+  const rs = RENDERER_STRINGS[locale];
   const { content } = def;
   const Icon = content.icon;
 
@@ -502,19 +531,21 @@ function BannerRenderer({
         <div className="relative flex flex-1 flex-col sm:flex-row sm:items-center sm:gap-3 min-w-0">
           <div className="flex-1 min-w-0">
             <p className="text-sm font-semibold leading-tight truncate">
-              <span className="align-middle">{content.title}</span>
+              <span className="align-middle">
+                {localized(locale, content.title, content.titleZh)}
+              </span>
               {content.badge && (
                 <span className="ml-2 inline-flex items-center gap-1.5 align-middle rounded-full bg-brand px-2 py-0.5 text-[10px] font-bold uppercase tracking-wider text-primary-foreground shadow-sm">
-                  {content.badge}
+                  {localized(locale, content.badge, content.badgeZh)}
                 </span>
               )}
             </p>
             <p className="text-xs text-muted-foreground leading-snug truncate">
-              {content.description}
+              {localized(locale, content.description, content.descriptionZh)}
             </p>
           </div>
           <span className="hidden sm:inline-flex items-center gap-1 text-xs font-medium text-brand shrink-0 group-hover:translate-x-0.5 transition-transform duration-200">
-            Explore
+            {rs.explore}
             <ArrowRight className="size-3.5" />
           </span>
         </div>
@@ -522,7 +553,7 @@ function BannerRenderer({
           type="button"
           onClick={handleDismiss}
           className="relative ml-1 rounded-md p-1 text-muted-foreground opacity-70 transition-opacity hover:opacity-100 focus:outline-none focus:ring-2 focus:ring-ring"
-          aria-label="Dismiss launch banner"
+          aria-label={rs.dismissBanner}
           data-testid={content.testId ? `${content.testId}-dismiss` : undefined}
         >
           <X className="size-4" />
diff --git a/packages/app/src/components/reliability/ui/ChartControls.tsx b/packages/app/src/components/reliability/ui/ChartControls.tsx
index 53f295ba..28ef8c95 100644
--- a/packages/app/src/components/reliability/ui/ChartControls.tsx
+++ b/packages/app/src/components/reliability/ui/ChartControls.tsx
@@ -12,17 +12,44 @@ import {
   SelectValue,
 } from '@/components/ui/select';
 import { TooltipProvider } from '@/components/ui/tooltip';
+import { useLocale } from '@/lib/use-locale';
+
+const STRINGS = {
+  en: {
+    dateRangeLabel: 'Date Range',
+    dateRangeTooltip:
+      'Time window for calculating GPU reliability metrics. Longer ranges provide more stable statistics but may not reflect recent changes in hardware performance.',
+    dateRangePlaceholder: 'Select date range',
+    last3Days: 'Last 3 days',
+    last7Days: 'Last 7 days',
+    lastMonth: 'Last month',
+    last3Months: 'Last 3 months',
+    allTime: 'All time',
+  },
+  zh: {
+    dateRangeLabel: '时间范围',
+    dateRangeTooltip:
+      '计算 GPU 可靠性指标的时间窗口。更长的范围可提供更稳定的统计数据，但可能无法反映近期的硬件性能变化。',
+    dateRangePlaceholder: '选择时间范围',
+    last3Days: '最近 3 天',
+    last7Days: '最近 7 天',
+    lastMonth: '最近一个月',
+    last3Months: '最近三个月',
+    allTime: '全部时间',
+  },
+} as const;
 
 export default function ReliabilityChartControls() {
   const { dateRange, setDateRange } = useReliabilityContext();
+  const t = STRINGS[useLocale()];
 
   return (
     <TooltipProvider delayDuration={0}>
       <div className="flex flex-col space-y-1.5 sm:w-45">
         <LabelWithTooltip
           htmlFor="date-range-select"
-          label="Date Range"
-          tooltip="Time window for calculating GPU reliability metrics. Longer ranges provide more stable statistics but may not reflect recent changes in hardware performance."
+          label={t.dateRangeLabel}
+          tooltip={t.dateRangeTooltip}
         />
         <Select
           value={dateRange}
@@ -36,14 +63,14 @@ export default function ReliabilityChartControls() {
             data-testid="reliability-date-range"
             className="w-full"
           >
-            <SelectValue placeholder="Select date range" />
+            <SelectValue placeholder={t.dateRangePlaceholder} />
           </SelectTrigger>
           <SelectContent>
-            <SelectItem value="last-3-days">Last 3 days</SelectItem>
-            <SelectItem value="last-7-days">Last 7 days</SelectItem>
-            <SelectItem value="last-month">Last month</SelectItem>
-            <SelectItem value="last-3-months">Last 3 months</SelectItem>
-            <SelectItem value="all-time">All time</SelectItem>
+            <SelectItem value="last-3-days">{t.last3Days}</SelectItem>
+            <SelectItem value="last-7-days">{t.last7Days}</SelectItem>
+            <SelectItem value="last-month">{t.lastMonth}</SelectItem>
+            <SelectItem value="last-3-months">{t.last3Months}</SelectItem>
+            <SelectItem value="all-time">{t.allTime}</SelectItem>
           </SelectContent>
         </Select>
       </div>
diff --git a/packages/app/src/components/reliability/ui/ChartDisplay.tsx b/packages/app/src/components/reliability/ui/ChartDisplay.tsx
index 8fecf277..6edff5e9 100644
--- a/packages/app/src/components/reliability/ui/ChartDisplay.tsx
+++ b/packages/app/src/components/reliability/ui/ChartDisplay.tsx
@@ -9,13 +9,31 @@ import { ChartSection } from '@/components/ui/chart-section';
 import { UnofficialDomainNotice } from '@/components/ui/unofficial-domain-notice';
 import { exportToCsv } from '@/lib/csv-export';
 import { reliabilityChartToCsv } from '@/lib/csv-export-helpers';
+import { useLocale } from '@/lib/use-locale';
 
 import ReliabilityBarChartD3 from './BarChartD3';
 import ReliabilityChartControls from './ChartControls';
 
+const STRINGS = {
+  en: {
+    heading: 'GPU Reliability',
+    description:
+      'Success rate percentages for inference runs across GPU models, showing hardware reliability for inference runs over time.',
+    captionHeading: 'Success Rate by GPU Model',
+    captionSource: 'Source: SemiAnalysis InferenceX™',
+  },
+  zh: {
+    heading: 'GPU 可靠性',
+    description: '各 GPU 型号推理运行的成功率百分比，展示硬件在一段时间内的推理运行可靠性。',
+    captionHeading: '各 GPU 型号成功率',
+    captionSource: '数据来源：SemiAnalysis InferenceX™',
+  },
+} as const;
+
 export default function ReliabilityChartDisplay() {
   const CHART_ID = 'reliability-chart';
   const { setIsLegendExpanded, chartData } = useReliabilityContext();
+  const t = STRINGS[useLocale()];
 
   const handleExportCsv = useCallback(() => {
     const { headers, rows } = reliabilityChartToCsv(chartData);
@@ -29,11 +47,8 @@ export default function ReliabilityChartDisplay() {
           <div className="flex flex-col gap-4">
             <div className="flex items-start justify-between">
               <div>
-                <h2 className="text-lg font-semibold mb-2">GPU Reliability</h2>
-                <p className="text-muted-foreground text-sm mb-4">
-                  Success rate percentages for inference runs across GPU models, showing hardware
-                  reliability for inference runs over time.
-                </p>
+                <h2 className="text-lg font-semibold mb-2">{t.heading}</h2>
+                <p className="text-muted-foreground text-sm mb-4">{t.description}</p>
               </div>
               <ChartShareActions />
             </div>
@@ -53,8 +68,8 @@ export default function ReliabilityChartDisplay() {
         <ReliabilityBarChartD3
           caption={
             <>
-              <h3 className="text-lg font-semibold">Success Rate by GPU Model</h3>
-              <p className="text-sm text-muted-foreground">Source: SemiAnalysis InferenceX™</p>
+              <h3 className="text-lg font-semibold">{t.captionHeading}</h3>
+              <p className="text-sm text-muted-foreground">{t.captionSource}</p>
               <UnofficialDomainNotice />
             </>
           }
diff --git a/packages/app/src/components/share-buttons.tsx b/packages/app/src/components/share-buttons.tsx
index ed9d4aad..734fcad6 100644
--- a/packages/app/src/components/share-buttons.tsx
+++ b/packages/app/src/components/share-buttons.tsx
@@ -2,11 +2,24 @@
 
 import { Button } from '@/components/ui/button';
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
 
 const SITE_URL = 'https://inferencex.semianalysis.com';
 
-const SHARE_TEXT =
-  'Check out InferenceX — open-source ML inference benchmarks comparing GPUs across real-world workloads. Transparent, up-to-date data for the ML community.';
+const STRINGS = {
+  en: {
+    shareText:
+      'Check out InferenceX — open-source ML inference benchmarks comparing GPUs across real-world workloads. Transparent, up-to-date data for the ML community.',
+    twitter: 'Share on X (Twitter)',
+    linkedin: 'Share on LinkedIn',
+  },
+  zh: {
+    shareText:
+      '来看 InferenceX——开源 ML 推理基准测试，跨真实工作负载对比 GPU 性能。为 ML 社区提供透明、最新的数据。',
+    twitter: '分享到 X（推特）',
+    linkedin: '分享到 LinkedIn',
+  },
+} as const;
 
 function getShareUrl(): string {
   if (typeof window === 'undefined') return SITE_URL;
@@ -14,17 +27,18 @@ function getShareUrl(): string {
 }
 
 export function ShareTwitterButton({ text }: { text?: string }) {
+  const t = STRINGS[useLocale()];
   return (
     <Button
       variant="outline"
       size="icon"
       className="size-7"
-      title="Share on X (Twitter)"
+      title={t.twitter}
       data-testid="share-twitter"
       onClick={() => {
         const url = getShareUrl();
         window.open(
-          `https://twitter.com/intent/tweet?text=${encodeURIComponent(text ?? SHARE_TEXT)}&url=${encodeURIComponent(url)}`,
+          `https://twitter.com/intent/tweet?text=${encodeURIComponent(text ?? t.shareText)}&url=${encodeURIComponent(url)}`,
           '_blank',
           'noopener,noreferrer,width=600,height=400',
         );
@@ -39,12 +53,13 @@ export function ShareTwitterButton({ text }: { text?: string }) {
 }
 
 export function ShareLinkedInButton() {
+  const t = STRINGS[useLocale()];
   return (
     <Button
       variant="outline"
       size="icon"
       className="size-7"
-      title="Share on LinkedIn"
+      title={t.linkedin}
       data-testid="share-linkedin"
       onClick={() => {
         const url = getShareUrl();
diff --git a/packages/app/src/components/submissions/SubmissionsDisplay.tsx b/packages/app/src/components/submissions/SubmissionsDisplay.tsx
index e63b67b2..8fdbad61 100644
--- a/packages/app/src/components/submissions/SubmissionsDisplay.tsx
+++ b/packages/app/src/components/submissions/SubmissionsDisplay.tsx
@@ -10,6 +10,7 @@ import { SegmentedToggle, type SegmentedToggleOption } from '@/components/ui/seg
 import { exportToCsv } from '@/lib/csv-export';
 import { submissionsVolumeToCsv } from '@/lib/csv-export-helpers';
 import { useSubmissions } from '@/hooks/api/use-submissions';
+import { useLocale } from '@/lib/use-locale';
 
 import SubmissionsChart, { type ChartMode } from './SubmissionsChart';
 import SubmissionsTable from './SubmissionsTable';
@@ -17,15 +18,61 @@ import { computeTotalStats } from './submissions-utils';
 
 const CHART_ID = 'submissions-chart';
 
-const SUBMISSIONS_CHART_MODE_OPTIONS: SegmentedToggleOption<ChartMode>[] = [
-  { value: 'weekly', label: 'Weekly', testId: 'submissions-weekly-btn' },
-  { value: 'cumulative', label: 'Cumulative', testId: 'submissions-cumulative-btn' },
-];
+const STRINGS = {
+  en: {
+    heading: 'Benchmark Submissions',
+    description:
+      'All benchmark configurations submitted to InferenceX. View submission history, activity trends, and datapoint volumes.',
+    modeWeekly: 'Weekly',
+    modeCumulative: 'Cumulative',
+    loadingChart: 'Loading chart data...',
+    loadingTable: 'Loading submissions...',
+    errorText: 'Failed to load submission data.',
+    chartCaption: 'Submission Activity',
+    chartSource: 'Source: SemiAnalysis InferenceX™',
+    statDatapoints: 'Datapoints Generated',
+    statConfigs: 'Distinct Configurations',
+    statModels: 'Unique Models',
+    statHardware: 'Unique Hardware',
+    subtitleResults: 'results',
+    subtitleTested: 'tested',
+    subtitleLLMs: 'LLMs',
+    subtitleSKUs: 'SKUs',
+  },
+  zh: {
+    heading: '基准测试提交',
+    description: '所有提交至 InferenceX 的基准测试配置。查看提交历史、活动趋势和数据点数量。',
+    modeWeekly: '按周',
+    modeCumulative: '累计',
+    loadingChart: '正在加载图表数据...',
+    loadingTable: '正在加载提交记录...',
+    errorText: '加载提交数据失败。',
+    chartCaption: '提交活动',
+    chartSource: '数据来源：SemiAnalysis InferenceX™',
+    statDatapoints: '已生成数据点',
+    statConfigs: '不同配置数',
+    statModels: '模型数',
+    statHardware: '硬件类型',
+    subtitleResults: '条结果',
+    subtitleTested: '已测试',
+    subtitleLLMs: '个 LLM',
+    subtitleSKUs: '种 SKU',
+  },
+} as const;
 
 export default function SubmissionsDisplay() {
   const { data, isLoading, error } = useSubmissions();
+  const t = STRINGS[useLocale()];
   const [chartMode, setChartMode] = useState<ChartMode>('weekly');
 
+  const chartModeOptions = useMemo<SegmentedToggleOption<ChartMode>[]>(
+    () => [
+      { value: 'weekly', label: t.modeWeekly, testId: 'submissions-weekly-btn' },
+      { value: 'cumulative', label: t.modeCumulative, testId: 'submissions-cumulative-btn' },
+    ],
+    [t],
+  );
+
   useEffect(() => {
     track('submissions_page_viewed');
   }, []);
@@ -49,7 +96,7 @@ export default function SubmissionsDisplay() {
   if (error) {
     return (
       <Card>
-        <p className="text-destructive text-sm">Failed to load submission data.</p>
+        <p className="text-destructive text-sm">{t.errorText}</p>
       </Card>
     );
   }
@@ -61,11 +108,8 @@ export default function SubmissionsDisplay() {
         <Card>
           <div className="flex items-start justify-between">
             <div>
-              <h2 className="text-lg font-semibold mb-2">Benchmark Submissions</h2>
-              <p className="text-muted-foreground text-sm">
-                All benchmark configurations submitted to InferenceX. View submission history,
-                activity trends, and datapoint volumes.
-              </p>
+              <h2 className="text-lg font-semibold mb-2">{t.heading}</h2>
+              <p className="text-muted-foreground text-sm">{t.description}</p>
             </div>
             <div className="flex items-center gap-1.5">
               <ChartShareActions />
@@ -79,10 +123,14 @@ export default function SubmissionsDisplay() {
         <section>
           <div className="grid grid-cols-2 md:grid-cols-4 gap-3">
             {[
-              { label: 'Datapoints Generated', value: stats.totalDatapoints, subtitle: 'results' },
-              { label: 'Distinct Configurations', value: stats.totalConfigs, subtitle: 'tested' },
-              { label: 'Unique Models', value: stats.uniqueModels, subtitle: 'LLMs' },
-              { label: 'Unique Hardware', value: stats.uniqueGpus, subtitle: 'SKUs' },
+              {
+                label: t.statDatapoints,
+                value: stats.totalDatapoints,
+                subtitle: t.subtitleResults,
+              },
+              { label: t.statConfigs, value: stats.totalConfigs, subtitle: t.subtitleTested },
+              { label: t.statModels, value: stats.uniqueModels, subtitle: t.subtitleLLMs },
+              { label: t.statHardware, value: stats.uniqueGpus, subtitle: t.subtitleSKUs },
             ]
               .toSorted((a, b) => b.value - a.value)
               .map((s) => (
@@ -109,7 +157,7 @@ export default function SubmissionsDisplay() {
             leadingControls={
               <SegmentedToggle
                 value={chartMode}
-                options={SUBMISSIONS_CHART_MODE_OPTIONS}
+                options={chartModeOptions}
                 onValueChange={handleModeChange}
                 ariaLabel="Chart mode"
                 testId="submissions-mode-toggle"
@@ -120,7 +168,7 @@ export default function SubmissionsDisplay() {
           <Card>
             {isLoading ? (
               <div className="h-[600px] flex items-center justify-center text-muted-foreground text-sm">
-                Loading chart data...
+                {t.loadingChart}
               </div>
             ) : data?.volume ? (
               <SubmissionsChart
@@ -128,10 +176,8 @@ export default function SubmissionsDisplay() {
                 mode={chartMode}
                 caption={
                   <>
-                    <h3 className="text-lg font-semibold">Submission Activity</h3>
-                    <p className="text-sm text-muted-foreground">
-                      Source: SemiAnalysis InferenceX&trade;
-                    </p>
+                    <h3 className="text-lg font-semibold">{t.chartCaption}</h3>
+                    <p className="text-sm text-muted-foreground">{t.chartSource}</p>
                   </>
                 }
               />
@@ -145,7 +191,7 @@ export default function SubmissionsDisplay() {
         <Card>
           {isLoading ? (
             <div className="h-32 flex items-center justify-center text-muted-foreground text-sm">
-              Loading submissions...
+              {t.loadingTable}
             </div>
           ) : data?.summary ? (
             <SubmissionsTable data={data.summary} />
diff --git a/packages/app/src/components/submissions/SubmissionsTable.tsx b/packages/app/src/components/submissions/SubmissionsTable.tsx
index 5dd56f47..8e5c8975 100644
--- a/packages/app/src/components/submissions/SubmissionsTable.tsx
+++ b/packages/app/src/components/submissions/SubmissionsTable.tsx
@@ -20,6 +20,8 @@ import {
   TooltipContent,
 } from '@/components/ui/tooltip';
 
+import { useLocale } from '@/lib/use-locale';
+
 import {
   buildInferenceCompareUrl,
   computePreviousImages,
@@ -28,6 +30,116 @@ import {
   submissionRowKey,
 } from './submissions-utils';
 
+const STRINGS = {
+  en: {
+    searchPlaceholder: 'Search configs...',
+    colGpu: 'GPU',
+    colModel: 'Model',
+    colPrecision: 'Precision',
+    colSpecMethod: 'Spec Method',
+    colFramework: 'Framework',
+    colDate: 'Date',
+    colDatapoints: 'Datapoints',
+    colCompare: 'Compare',
+    noMatch: 'No matching submissions found.',
+    noData: 'No submission data available.',
+    vsPrev: 'vs prev',
+    vendorLabel: 'Vendor:',
+    vendorTip: 'GPU manufacturer',
+    specMethodLabel: 'Spec Method:',
+    specMethodTip: 'Speculative decoding method (e.g. MTP, Eagle)',
+    disaggLabel: 'Disaggregated:',
+    disaggTip: 'Prefill and decode run on separate GPU pools',
+    multinodeLabel: 'Multinode:',
+    multinodeTip: 'Config spans multiple physical nodes',
+    totalGpusLabel: 'Total GPUs:',
+    totalGpusTip: 'Total physical GPUs. When disaggregated, prefill + decode are separate pools',
+    prefillGpusLabel: 'Prefill GPUs:',
+    prefillGpusTip: 'GPUs for the prefill (prompt processing) phase',
+    decodeGpusLabel: 'Decode GPUs:',
+    decodeGpusTip: 'GPUs for the decode (token generation) phase',
+    prefillTpEpLabel: 'Prefill TP/EP:',
+    prefillTpEpTip: 'Tensor parallelism / Expert parallelism for prefill',
+    decodeTpEpLabel: 'Decode TP/EP:',
+    decodeTpEpTip: 'Tensor parallelism / Expert parallelism for decode',
+    sequencesLabel: 'Sequences:',
+    sequencesTip: 'Distinct ISL/OSL sequence length combinations tested',
+    concurrenciesLabel: 'Concurrencies:',
+    concurrenciesTip: 'Distinct concurrency levels tested',
+    imageLabel: 'Image:',
+    imageTipChanged:
+      'Container image used for this benchmark configuration. The previous run of this config used a different image — shown on the left.',
+    imageTipDefault: 'Container image used for this benchmark configuration',
+    showMorePre: 'Show ',
+    showMorePost: ' more',
+    hiddenPre: '(',
+    hiddenPost: ' hidden)',
+    showingPrefix: 'Showing ',
+    showingOf: ' of ',
+    configSingular: ' config',
+    configPlural: ' configs',
+    totalDatapointsSuffix: ' total datapoints',
+    compareTipPre: 'Compare ',
+    compareTipPost: ' on chart',
+    maxPrefix: 'max ',
+    yes: 'Yes',
+    no: 'No',
+  },
+  zh: {
+    searchPlaceholder: '搜索配置...',
+    colGpu: 'GPU',
+    colModel: '模型',
+    colPrecision: '精度',
+    colSpecMethod: '推测解码',
+    colFramework: '框架',
+    colDate: '日期',
+    colDatapoints: '数据点',
+    colCompare: '对比',
+    noMatch: '未找到匹配的提交记录。',
+    noData: '暂无提交数据。',
+    vsPrev: '对比上次',
+    vendorLabel: '厂商：',
+    vendorTip: 'GPU 制造商',
+    specMethodLabel: '推测解码方法：',
+    specMethodTip: '推测解码方法（如 MTP、Eagle）',
+    disaggLabel: '分离式部署：',
+    disaggTip: 'Prefill 和 Decode 在不同 GPU 池上运行',
+    multinodeLabel: '多节点：',
+    multinodeTip: '配置跨多个物理节点',
+    totalGpusLabel: '总 GPU 数：',
+    totalGpusTip: '物理 GPU 总数。分离式部署时，Prefill 和 Decode 使用不同的 GPU 池',
+    prefillGpusLabel: 'Prefill GPU 数：',
+    prefillGpusTip: '用于 Prefill（提示处理）阶段的 GPU',
+    decodeGpusLabel: 'Decode GPU 数：',
+    decodeGpusTip: '用于 Decode（Token 生成）阶段的 GPU',
+    prefillTpEpLabel: 'Prefill TP/EP：',
+    prefillTpEpTip: 'Prefill 的张量并行 / 专家并行',
+    decodeTpEpLabel: 'Decode TP/EP：',
+    decodeTpEpTip: 'Decode 的张量并行 / 专家并行',
+    sequencesLabel: '序列组合：',
+    sequencesTip: '测试的不同 ISL/OSL 序列长度组合数',
+    concurrenciesLabel: '并发数：',
+    concurrenciesTip: '测试的不同并发级别数',
+    imageLabel: '镜像：',
+    imageTipChanged: '此基准测试配置使用的容器镜像。上一次运行使用了不同的镜像——显示在左侧。',
+    imageTipDefault: '此基准测试配置使用的容器镜像',
+    showMorePre: '再显示 ',
+    showMorePost: ' 条',
+    hiddenPre: '（还有 ',
+    hiddenPost: ' 条隐藏）',
+    showingPrefix: '显示 ',
+    showingOf: ' / ',
+    configSingular: ' 条配置',
+    configPlural: ' 条配置',
+    totalDatapointsSuffix: ' 个数据点',
+    compareTipPre: '对比 ',
+    compareTipPost: '（在图表中查看）',
+    maxPrefix: '最大 ',
+    yes: '是',
+    no: '否',
+  },
+} as const;
+
 const ROW_PAGE_SIZE = 100;
 
 function DetailItem({
@@ -105,6 +217,7 @@ function SortHeader({
 }
 
 export default function SubmissionsTable({ data }: SubmissionsTableProps) {
+  const t = STRINGS[useLocale()];
   const [sortKey, setSortKey] = useState<SortKey>('date');
   const [sortDir, setSortDir] = useState<SortDir>('desc');
   const [search, setSearch] = useState('');
@@ -189,7 +302,7 @@ export default function SubmissionsTable({ data }: SubmissionsTableProps) {
         onBlur={() => {
           if (search.trim()) track('submissions_table_searched', { query: search.trim() });
         }}
-        placeholder="Search configs..."
+        placeholder={t.searchPlaceholder}
         className="w-full max-w-sm px-3 py-1.5 rounded-md border border-border bg-background text-sm placeholder:text-muted-foreground focus:outline-none focus:ring-1 focus:ring-ring"
       />
       <div className="overflow-x-auto rounded-md border border-border">
@@ -199,13 +312,13 @@ export default function SubmissionsTable({ data }: SubmissionsTableProps) {
               <th className="w-8 px-2" />
               {(
                 [
-                  ['GPU', 'hardware'],
-                  ['Model', 'model'],
-                  ['Precision', 'precision'],
-                  ['Spec Method', 'spec_method'],
-                  ['Framework', 'framework'],
-                  ['Date', 'date'],
-                  ['Datapoints', 'total_datapoints'],
+                  [t.colGpu, 'hardware'],
+                  [t.colModel, 'model'],
+                  [t.colPrecision, 'precision'],
+                  [t.colSpecMethod, 'spec_method'],
+                  [t.colFramework, 'framework'],
+                  [t.colDate, 'date'],
+                  [t.colDatapoints, 'total_datapoints'],
                 ] as [string, SortKey][]
               ).map(([label, field]) => (
                 <SortHeader
@@ -221,7 +334,7 @@ export default function SubmissionsTable({ data }: SubmissionsTableProps) {
                 className="px-3 py-2 text-left text-xs font-medium text-muted-foreground select-none"
                 scope="col"
               >
-                Compare
+                {t.colCompare}
               </th>
             </tr>
           </thead>
@@ -243,7 +356,7 @@ export default function SubmissionsTable({ data }: SubmissionsTableProps) {
             {sorted.length === 0 && (
               <tr>
                 <td colSpan={9} className="px-3 py-8 text-center text-muted-foreground">
-                  {search ? 'No matching submissions found.' : 'No submission data available.'}
+                  {search ? t.noMatch : t.noData}
                 </td>
               </tr>
             )}
@@ -259,15 +372,25 @@ export default function SubmissionsTable({ data }: SubmissionsTableProps) {
             onClick={loadMore}
             data-testid="submissions-load-more"
           >
-            Show {Math.min(ROW_PAGE_SIZE, hiddenCount)} more
-            <span className="text-muted-foreground">({hiddenCount} hidden)</span>
+            {t.showMorePre}
+            {Math.min(ROW_PAGE_SIZE, hiddenCount)}
+            {t.showMorePost}
+            <span className="text-muted-foreground">
+              {t.hiddenPre}
+              {hiddenCount}
+              {t.hiddenPost}
+            </span>
           </Button>
         </div>
       )}
       <p className="text-xs text-muted-foreground">
-        Showing {visibleRows.length} of {filtered.length} config
-        {filtered.length === 1 ? '' : 's'} ·{' '}
-        {filtered.reduce((sum, r) => sum + r.total_datapoints, 0).toLocaleString()} total datapoints
+        {t.showingPrefix}
+        {visibleRows.length}
+        {t.showingOf}
+        {filtered.length}
+        {filtered.length === 1 ? t.configSingular : t.configPlural} ·{' '}
+        {filtered.reduce((sum, r) => sum + r.total_datapoints, 0).toLocaleString()}
+        {t.totalDatapointsSuffix}
       </p>
     </div>
   );
@@ -286,6 +409,7 @@ function SubmissionRow({
   previousRun: SubmissionSummaryRow | null;
   onToggle: () => void;
 }) {
+  const t = STRINGS[useLocale()];
   const vendor = getVendor(row.hardware);
   const compareUrl = previousRun ? buildInferenceCompareUrl(row, previousRun) : null;
 
@@ -344,13 +468,15 @@ function SubmissionRow({
                       }}
                     >
                       <GitCompare className="size-3.5" />
-                      <span className="hidden lg:inline">vs prev</span>
+                      <span className="hidden lg:inline">{t.vsPrev}</span>
                     </a>
                   </Button>
                 </TooltipTrigger>
                 <TooltipContent side="left" collisionPadding={10}>
                   <span className="text-xs">
-                    Compare {previousRun.date} → {row.date} on chart
+                    {t.compareTipPre}
+                    {previousRun.date} → {row.date}
+                    {t.compareTipPost}
                   </span>
                 </TooltipContent>
               </TooltipRoot>
@@ -366,79 +492,54 @@ function SubmissionRow({
           <td colSpan={8} className="px-3 py-3">
             <TooltipProvider>
               <div className="grid grid-cols-2 md:grid-cols-4 gap-x-8 gap-y-2 text-sm">
-                <DetailItem label="Vendor:" tip="GPU manufacturer">
+                <DetailItem label={t.vendorLabel} tip={t.vendorTip}>
                   {vendor}
                 </DetailItem>
-                <DetailItem
-                  label="Spec Method:"
-                  tip="Speculative decoding method (e.g. MTP, Eagle)"
-                >
+                <DetailItem label={t.specMethodLabel} tip={t.specMethodTip}>
                   {row.spec_method && row.spec_method !== 'none'
                     ? resolveFrameworkPartLabel(DB_MODEL_TO_DISPLAY[row.model], row.spec_method)
                     : 'none'}
                 </DetailItem>
-                <DetailItem
-                  label="Disaggregated:"
-                  tip="Prefill and decode run on separate GPU pools"
-                >
-                  {row.disagg ? 'Yes' : 'No'}
+                <DetailItem label={t.disaggLabel} tip={t.disaggTip}>
+                  {row.disagg ? t.yes : t.no}
                 </DetailItem>
-                <DetailItem label="Multinode:" tip="Config spans multiple physical nodes">
-                  {row.is_multinode ? 'Yes' : 'No'}
+                <DetailItem label={t.multinodeLabel} tip={t.multinodeTip}>
+                  {row.is_multinode ? t.yes : t.no}
                 </DetailItem>
-                <DetailItem
-                  label="Total GPUs:"
-                  tip="Total physical GPUs. When disaggregated, prefill + decode are separate pools"
-                >
+                <DetailItem label={t.totalGpusLabel} tip={t.totalGpusTip}>
                   <span className="tabular-nums">
                     {row.disagg ? row.num_prefill_gpu + row.num_decode_gpu : row.num_prefill_gpu}
                   </span>
                 </DetailItem>
-                <DetailItem
-                  label="Prefill GPUs:"
-                  tip="GPUs for the prefill (prompt processing) phase"
-                >
+                <DetailItem label={t.prefillGpusLabel} tip={t.prefillGpusTip}>
                   <span className="tabular-nums">{row.num_prefill_gpu}</span>
                 </DetailItem>
-                <DetailItem label="Decode GPUs:" tip="GPUs for the decode (token generation) phase">
+                <DetailItem label={t.decodeGpusLabel} tip={t.decodeGpusTip}>
                   <span className="tabular-nums">{row.num_decode_gpu}</span>
                 </DetailItem>
-                <DetailItem
-                  label="Prefill TP/EP:"
-                  tip="Tensor parallelism / Expert parallelism for prefill"
-                >
+                <DetailItem label={t.prefillTpEpLabel} tip={t.prefillTpEpTip}>
                   <span className="tabular-nums">
                     {row.prefill_tp ?? '—'}/{row.prefill_ep ?? '—'}
                   </span>
                 </DetailItem>
-                <DetailItem
-                  label="Decode TP/EP:"
-                  tip="Tensor parallelism / Expert parallelism for decode"
-                >
+                <DetailItem label={t.decodeTpEpLabel} tip={t.decodeTpEpTip}>
                   <span className="tabular-nums">
                     {row.decode_tp ?? '—'}/{row.decode_ep ?? '—'}
                   </span>
                 </DetailItem>
-                <DetailItem
-                  label="Sequences:"
-                  tip="Distinct ISL/OSL sequence length combinations tested"
-                >
+                <DetailItem label={t.sequencesLabel} tip={t.sequencesTip}>
                   <span className="tabular-nums">{row.distinct_sequences ?? '—'}</span>
                 </DetailItem>
-                <DetailItem label="Concurrencies:" tip="Distinct concurrency levels tested">
+                <DetailItem label={t.concurrenciesLabel} tip={t.concurrenciesTip}>
                   <span className="tabular-nums">
                     {row.distinct_concurrencies ?? '—'}
-                    {row.max_concurrency ? ` (max ${row.max_concurrency})` : ''}
+                    {row.max_concurrency ? ` (${t.maxPrefix}${row.max_concurrency})` : ''}
                   </span>
                 </DetailItem>
                 <div className="col-span-2 md:col-span-4">
                   <DetailItem
-                    label="Image:"
-                    tip={
-                      previousImage
-                        ? 'Container image used for this benchmark configuration. The previous run of this config used a different image — shown on the left.'
-                        : 'Container image used for this benchmark configuration'
-                    }
+                    label={t.imageLabel}
+                    tip={previousImage ? t.imageTipChanged : t.imageTipDefault}
                   >
                     {previousImage ? (
                       <span
@@ -476,7 +577,9 @@ function SubmissionRow({
                         }}
                       >
                         <GitCompare className="size-3.5" />
-                        Compare {previousRun.date} → {row.date} on chart
+                        {t.compareTipPre}
+                        {previousRun.date} → {row.date}
+                        {t.compareTipPost}
                       </a>
                     </Button>
                   </div>
diff --git a/packages/app/src/components/tab-nav.tsx b/packages/app/src/components/tab-nav.tsx
index 00aeb03e..d15ec06d 100644
--- a/packages/app/src/components/tab-nav.tsx
+++ b/packages/app/src/components/tab-nav.tsx
@@ -21,7 +21,7 @@ import {
   SelectValue,
 } from '@/components/ui/select';
 import { UnofficialRunContext } from '@/components/unofficial-run-provider';
-import { isZhPathname, ZH_PREFIX } from '@/lib/i18n';
+import { hasZhSibling, isZhPathname, ZH_PREFIX } from '@/lib/i18n';
 import { TAB_LABELS_ZH } from '@/lib/tab-meta-zh';
 import { cn } from '@/lib/utils';
 
@@ -80,11 +80,13 @@ export function TabNav() {
   const isZh = isZhPathname(pathname);
   const current = activeTab(pathname);
   const selectedTab = TAB_VALUES.has(current) ? current : '';
-  // On /zh pages, visible tabs navigate within the Chinese tree and show
-  // Chinese labels. Gated tabs have no /zh sibling and keep English targets.
+  // On /zh pages, tabs with a Chinese sibling navigate within the Chinese
+  // tree and show Chinese labels; the rest (most gated tabs) keep English
+  // targets.
   const tabLabel = (tab: { href: string; label: string }) =>
     isZh ? (TAB_LABELS_ZH[tab.href.slice(1)] ?? tab.label) : tab.label;
-  const localizedPath = (path: string) => (isZh ? `${ZH_PREFIX}${path}` : path);
+  const localizedPath = (path: string) =>
+    isZh && hasZhSibling(path) ? `${ZH_PREFIX}${path}` : path;
 
   // Preserve the `unofficialrun(s)` URL param across tab navigation so an
   // overlay loaded on /inference doesn't get dropped when switching to
@@ -117,7 +119,7 @@ export function TabNav() {
   const handleMobileChange = (value: string) => {
     window.dispatchEvent(new CustomEvent('inferencex:tab-change'));
     track('tab_changed', { tab: value });
-    router.push(tabHref(GATED_VALUES.has(value) ? `/${value}` : localizedPath(`/${value}`)));
+    router.push(tabHref(localizedPath(`/${value}`)));
   };
 
   return (
@@ -189,7 +191,7 @@ export function TabNav() {
             {featureGateUnlocked && (
               <HiddenTabsPopover
                 current={current}
-                tabHref={tabHref}
+                tabHref={(path) => tabHref(localizedPath(path))}
                 onSelect={handleDesktopClick}
               />
             )}
diff --git a/packages/app/src/components/trends/HistoricalTrendsDisplay.tsx b/packages/app/src/components/trends/HistoricalTrendsDisplay.tsx
index 813a0883..0aa8f3c4 100644
--- a/packages/app/src/components/trends/HistoricalTrendsDisplay.tsx
+++ b/packages/app/src/components/trends/HistoricalTrendsDisplay.tsx
@@ -1,6 +1,7 @@
 'use client';
 
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
 import React, { useCallback, useMemo, useState } from 'react';
 
 import { useInference } from '@/components/inference/InferenceContext';
@@ -31,7 +32,41 @@ import {
 import { getDisplayLabel } from '@/lib/utils';
 import { useThemeColors } from '@/hooks/useThemeColors';
 
+const STRINGS = {
+  en: {
+    heading: 'Historical Trends',
+    description:
+      'Interpolated performance metrics over time at a fixed interactivity operating point.',
+    targetLabel: 'Target Interactivity (tok/s/user)',
+    targetTooltip:
+      "The interactivity operating point used for interpolation. Move the slider to see how each GPU's performance changes at different interactivity levels.",
+    captionTitle: (yTitle: string, target: number) =>
+      `${yTitle} Over Time at ${target} tok/s/user Interactivity`,
+    source: 'Source: SemiAnalysis InferenceX™',
+    updated: 'Updated:',
+    logScale: 'Log Scale',
+    highContrast: 'High Contrast',
+    resetFilter: 'Reset filter',
+    noData: 'No interactivity chart data available for the selected model and sequence.',
+  },
+  zh: {
+    heading: '历史趋势',
+    description: '在固定交互性操作点下，各性能指标随时间的插值变化。',
+    targetLabel: '目标交互性 (tok/s/user)',
+    targetTooltip: '用于插值的交互性操作点。移动滑块可查看各 GPU 在不同交互性水平下的性能变化。',
+    captionTitle: (yTitle: string, target: number) =>
+      `${yTitle} 随时间变化（交互性 ${target} tok/s/user）`,
+    source: '来源：SemiAnalysis InferenceX™',
+    updated: '更新时间：',
+    logScale: '对数缩放',
+    highContrast: '高对比度',
+    resetFilter: '重置筛选',
+    noData: '所选模型和序列无可用的交互性图表数据。',
+  },
+};
+
 export default function HistoricalTrendsDisplay() {
+  const t = STRINGS[useLocale()];
   const {
     graphs,
     loading,
@@ -162,10 +197,8 @@ export default function HistoricalTrendsDisplay() {
         <Card className="relative z-30">
           <div className="flex flex-col gap-4">
             <div>
-              <h2 className="text-lg font-semibold mb-2">Historical Trends</h2>
-              <p className="text-muted-foreground text-sm mb-4">
-                Interpolated performance metrics over time at a fixed interactivity operating point.
-              </p>
+              <h2 className="text-lg font-semibold mb-2">{t.heading}</h2>
+              <p className="text-muted-foreground text-sm mb-4">{t.description}</p>
             </div>
             <ChartControls hideGpuComparison />
             <div className="space-y-2">
@@ -190,10 +223,8 @@ export default function HistoricalTrendsDisplay() {
         <div className="flex flex-col gap-4">
           <div className="flex items-start justify-between">
             <div>
-              <h2 className="text-lg font-semibold mb-2">Historical Trends</h2>
-              <p className="text-muted-foreground text-sm mb-4">
-                Interpolated performance metrics over time at a fixed interactivity operating point.
-              </p>
+              <h2 className="text-lg font-semibold mb-2">{t.heading}</h2>
+              <p className="text-muted-foreground text-sm mb-4">{t.description}</p>
             </div>
             <ChartShareActions />
           </div>
@@ -205,8 +236,8 @@ export default function HistoricalTrendsDisplay() {
               <div className="space-y-2">
                 <LabelWithTooltip
                   htmlFor="historical-target"
-                  label="Target Interactivity (tok/s/user)"
-                  tooltip="The interactivity operating point used for interpolation. Move the slider to see how each GPU's performance changes at different interactivity levels."
+                  label={t.targetLabel}
+                  tooltip={t.targetTooltip}
                 />
                 <div className="flex items-center gap-4">
                   <div className="flex-1">
@@ -287,17 +318,19 @@ export default function HistoricalTrendsDisplay() {
                 caption={
                   <>
                     <h2 className="text-lg font-semibold">
-                      {currentYTitle} Over Time at {targetInteractivity} tok/s/user Interactivity
+                      {t.captionTitle(currentYTitle, targetInteractivity)}
                     </h2>
                     <p className="text-sm text-muted-foreground mb-2">
                       {getModelLabel(selectedModel as Model)} •{' '}
                       {selectedPrecisions
                         .map((prec: string) => getPrecisionLabel(prec as Precision))
                         .join(', ')}{' '}
-                      • {getSequenceLabel(selectedSequence as Sequence)} • Source: SemiAnalysis
-                      InferenceX™
+                      • {getSequenceLabel(selectedSequence as Sequence)} • {t.source}
                       {workflowInfo && workflowInfo.length > 0 && workflowInfo[0]?.run_date && (
-                        <> • Updated: {workflowInfo[0].run_date.split(',')[0]}</>
+                        <>
+                          {' '}
+                          • {t.updated} {workflowInfo[0].run_date.split(',')[0]}
+                        </>
                       )}
                     </p>
                     <MetricAssumptionNotes
@@ -343,7 +376,7 @@ export default function HistoricalTrendsDisplay() {
                     switches={[
                       {
                         id: 'historical-log-scale',
-                        label: 'Log Scale',
+                        label: t.logScale,
                         checked: logScale,
                         onCheckedChange: (checked: boolean) => {
                           setLogScale(checked);
@@ -352,7 +385,7 @@ export default function HistoricalTrendsDisplay() {
                       },
                       {
                         id: 'historical-high-contrast',
-                        label: 'High Contrast',
+                        label: t.highContrast,
                         checked: highContrast,
                         onCheckedChange: (checked: boolean) => {
                           setHighContrast(checked);
@@ -365,7 +398,7 @@ export default function HistoricalTrendsDisplay() {
                         ? [
                             {
                               id: 'historical-reset-filter',
-                              label: 'Reset filter',
+                              label: t.resetFilter,
                               onClick: () => {
                                 selectAllHwTypes();
                                 track('historical_legend_filter_reset');
@@ -384,9 +417,7 @@ export default function HistoricalTrendsDisplay() {
         </section>
       ) : (
         <Card>
-          <p className="text-muted-foreground text-sm">
-            No interactivity chart data available for the selected model and sequence.
-          </p>
+          <p className="text-muted-foreground text-sm">{t.noData}</p>
         </Card>
       )}
     </section>
diff --git a/packages/app/src/components/ui/bottom-toast.tsx b/packages/app/src/components/ui/bottom-toast.tsx
index ea0e92f0..4129974a 100644
--- a/packages/app/src/components/ui/bottom-toast.tsx
+++ b/packages/app/src/components/ui/bottom-toast.tsx
@@ -4,6 +4,7 @@ import { X } from 'lucide-react';
 import { useCallback, useEffect, useRef, useState } from 'react';
 
 import { track } from '@/lib/analytics';
+import { useLocale } from '@/lib/use-locale';
 
 const DISMISS_EVENT = 'inferencex:dismiss-toast';
 
@@ -34,6 +35,7 @@ export function BottomToast({
   onDismiss,
   testId,
 }: BottomToastProps) {
+  const locale = useLocale();
   const [animate, setAnimate] = useState(false);
   const [visible, setVisible] = useState(true);
   const actionClickedRef = useRef(false);
@@ -80,7 +82,7 @@ export function BottomToast({
           type="button"
           onClick={dismiss}
           className="absolute top-2 right-2 text-muted-foreground hover:text-foreground transition-colors"
-          aria-label="Dismiss"
+          aria-label={locale === 'zh' ? '关闭' : 'Dismiss'}
         >
           <X className="size-3.5" />
         </button>
diff --git a/packages/app/src/components/ui/chart-display-helpers.tsx b/packages/app/src/components/ui/chart-display-helpers.tsx
index f7798a65..5c7d97c4 100644
--- a/packages/app/src/components/ui/chart-display-helpers.tsx
+++ b/packages/app/src/components/ui/chart-display-helpers.tsx
@@ -1,3 +1,5 @@
+'use client';
+
 import Link from 'next/link';
 import type { ReactNode } from 'react';
 
@@ -5,6 +7,8 @@ import { Badge } from '@/components/ui/badge';
 import { ExternalLinkIcon } from '@/components/ui/external-link-icon';
 import { ShareButton } from '@/components/ui/share-button';
 import { HW_REGISTRY } from '@semianalysisai/inferencex-constants';
+import { useLocale } from '@/lib/use-locale';
+import type { Locale } from '@/lib/i18n';
 
 // Keep these metric-key groups in sync with chart-utils/chart configs when new source-backed
 // metrics are added; this helper owns which caption notes and caveats appear for each family.
@@ -35,11 +39,19 @@ function MetricBadges({
   );
 }
 
-function SourceLink({ href, children }: { href: string; children: ReactNode }) {
+function SourceLink({
+  href,
+  children,
+  sourceLabel = 'Source:',
+}: {
+  href: string;
+  children: ReactNode;
+  sourceLabel?: string;
+}) {
   return (
     <p className="text-muted-foreground">
       <small>
-        Source:{' '}
+        {sourceLabel}{' '}
         <Link target="_blank" className="underline hover:text-foreground" href={href}>
           {children}
           <ExternalLinkIcon />
@@ -49,15 +61,45 @@ function SourceLink({ href, children }: { href: string; children: ReactNode }) {
   );
 }
 
+const NOUN_ZH: Record<string, string> = {
+  cost: '成本',
+  'input throughput': '输入吞吐量',
+  'output throughput': '输出吞吐量',
+  power: '功耗',
+  Joules: '能耗',
+  'Joules per token': '每 token 能耗',
+};
+
 function DisaggCaveat({
   visible,
   calculationNoun,
   comparisonNoun = calculationNoun,
+  locale = 'en',
 }: {
   visible: boolean;
   calculationNoun: string;
   comparisonNoun?: string;
+  locale?: Locale;
 }) {
+  const content =
+    locale === 'zh' ? (
+      <>
+        <strong>注意：</strong>分离式推理配置（如 MoRI SGLang、Dynamo TRTLLM）按解码 GPU 或预填充
+        GPU 计算
+        {NOUN_ZH[calculationNoun] ?? calculationNoun}
+        ，而非按 GPU 总数计算。因此，与聚合配置进行
+        {NOUN_ZH[comparisonNoun] ?? comparisonNoun}
+        的直接对比并不完全等价。
+      </>
+    ) : (
+      <>
+        <strong>Note:</strong> Disaggregated inference configurations (e.g., MoRI SGLang, Dynamo
+        TRTLLM) calculate {calculationNoun} per decode GPU or per prefill GPU, rather than per total
+        GPU count. This makes direct {comparisonNoun} comparison with aggregated configs not an
+        apples-to-apples comparison.
+      </>
+    );
+
   return (
     <div
       className={`overflow-hidden transition-all duration-200 ease-in-out ${
@@ -65,10 +107,7 @@ function DisaggCaveat({
       }`}
     >
       <p className="text-muted-foreground text-xs mt-2 border-l-2 border-amber-500 pl-2 bg-amber-500/5 py-1">
-        <strong>Note:</strong> Disaggregated inference configurations (e.g., MoRI SGLang, Dynamo
-        TRTLLM) calculate {calculationNoun} per decode GPU or per prefill GPU, rather than per total
-        GPU count. This makes direct {comparisonNoun} comparison with aggregated configs not an
-        apples-to-apples comparison.
+        {content}
       </p>
     </div>
   );
@@ -106,6 +145,7 @@ export function MetricAssumptionNotes({
   includeAllPowerThroughputMetrics?: boolean;
   includePowerThroughputCaveat?: boolean;
 }) {
+  const locale = useLocale();
   const showPowerSource = includeAllPowerThroughputMetrics
     ? POWER_SOURCE_METRICS.has(selectedYAxisMetric)
     : selectedYAxisMetric === 'y_tpPerMw';
@@ -121,37 +161,60 @@ export function MetricAssumptionNotes({
       ? getCostValues(selectedYAxisMetric)
       : null;
 
+  const powerLabel = locale === 'zh' ? '全包功耗/GPU：' : 'All in Power/GPU:';
+  const costLabel = locale === 'zh' ? 'TCO $/GPU/小时：' : 'TCO $/GPU/hr:';
+  const sourceLabel = locale === 'zh' ? '来源：' : 'Source:';
+
   return (
     <>
       {showPowerSource && (
         <>
-          <MetricBadges label="All in Power/GPU:" values={POWER_VALUES} />
-          <SourceLink href="https://semianalysis.com/datacenter-industry-model/">
+          <MetricBadges label={powerLabel} values={POWER_VALUES} />
+          <SourceLink
+            href="https://semianalysis.com/datacenter-industry-model/"
+            sourceLabel={sourceLabel}
+          >
             SemiAnalysis Datacenter Industry Model
           </SourceLink>
         </>
       )}
       {costValues && (
         <>
-          <MetricBadges label="TCO $/GPU/hr:" values={costValues} />
-          <SourceLink href="https://semianalysis.com/ai-cloud-tco-model/">
+          <MetricBadges label={costLabel} values={costValues} />
+          <SourceLink href="https://semianalysis.com/ai-cloud-tco-model/" sourceLabel={sourceLabel}>
             SemiAnalysis Market August 2025 Pricing Surveys & AI Cloud TCO Model
           </SourceLink>
         </>
       )}
-      <DisaggCaveat visible={selectedYAxisMetric.startsWith('y_cost')} calculationNoun="cost" />
-      <DisaggCaveat visible={showInputThroughputCaveat} calculationNoun="input throughput" />
-      <DisaggCaveat visible={showOutputThroughputCaveat} calculationNoun="output throughput" />
+      <DisaggCaveat
+        visible={selectedYAxisMetric.startsWith('y_cost')}
+        calculationNoun="cost"
+        locale={locale}
+      />
+      <DisaggCaveat
+        visible={showInputThroughputCaveat}
+        calculationNoun="input throughput"
+        locale={locale}
+      />
+      <DisaggCaveat
+        visible={showOutputThroughputCaveat}
+        calculationNoun="output throughput"
+        locale={locale}
+      />
       {includePowerThroughputCaveat && (
         <DisaggCaveat
           visible={POWER_SOURCE_METRICS.has(selectedYAxisMetric)}
           calculationNoun="power"
+          locale={locale}
         />
       )}
       {showJouleSource && (
         <>
-          <MetricBadges label="All in Power/GPU:" values={POWER_VALUES} />
-          <SourceLink href="https://semianalysis.com/datacenter-industry-model/">
+          <MetricBadges label={powerLabel} values={POWER_VALUES} />
+          <SourceLink
+            href="https://semianalysis.com/datacenter-industry-model/"
+            sourceLabel={sourceLabel}
+          >
             SemiAnalysis Datacenter Industry Model
           </SourceLink>
         </>
@@ -160,6 +223,7 @@ export function MetricAssumptionNotes({
         visible={showJouleSource}
         calculationNoun="Joules"
         comparisonNoun="Joules per token"
+        locale={locale}
       />
     </>
   );
diff --git a/packages/app/src/components/ui/chart-selectors.tsx b/packages/app/src/components/ui/chart-selectors.tsx
index 6aee97dd..f63f4e69 100644
--- a/packages/app/src/components/ui/chart-selectors.tsx
+++ b/packages/app/src/components/ui/chart-selectors.tsx
@@ -30,6 +30,41 @@ import {
   groupByCategory,
   sequenceKind,
 } from '@/lib/data-mappings';
+import { useLocale } from '@/lib/use-locale';
+
+const STRINGS = {
+  en: {
+    model: 'Model',
+    modelTooltip: 'The language model being benchmarked.',
+    islOsl: 'ISL / OSL',
+    islOslTooltip:
+      'Input Sequence Length / Output Sequence Length. Defines the number of input and output tokens for the benchmark (e.g., 1K/8K means 1,024 input tokens and 8,192 output tokens).',
+    scenario: 'Scenario',
+    scenarioTooltip:
+      'Benchmark scenario. Fixed Sequence Length runs use a defined input/output token count (ISL/OSL). Agentic Traces replay real agentic workloads with variable inputs/outputs.',
+    latencyPercentile: 'Latency Percentile',
+    latencyPercentileTooltip:
+      'Percentile of the latency distribution used for the chart x-axis on agentic runs.',
+    precision: 'Precision',
+    precisionTooltip:
+      "Numerical precision used for model weights. Lower precision like 'FP4' uses less memory and increases throughput but may slightly reduce accuracy compared to higher precisions like 'FP8'.",
+  },
+  zh: {
+    model: '模型',
+    modelTooltip: '正在进行基准测试的语言模型。',
+    islOsl: 'ISL / OSL',
+    islOslTooltip:
+      '输入序列长度 / 输出序列长度（Input Sequence Length / Output Sequence Length）。定义基准测试的输入和输出 token 数量（如 1K/8K 表示 1,024 个输入 token 和 8,192 个输出 token）。',
+    scenario: '场景',
+    scenarioTooltip:
+      '基准测试场景。Fixed Sequence Length 使用预设的输入/输出 token 数（ISL/OSL）。Agentic Traces 回放具有可变输入/输出的真实智能体工作负载。',
+    latencyPercentile: '延迟分位数',
+    latencyPercentileTooltip: '用于智能体运行图表 X 轴的延迟分布分位数。',
+    precision: '精度',
+    precisionTooltip:
+      '模型权重的数值精度。FP4 等低精度占用更少显存并提高吞吐量，但与 FP8 等高精度相比可能略微降低准确度。',
+  },
+} as const;
 
 function CategorySectionTitle({ label, reason }: { label: string; reason: string }) {
   return (
@@ -69,6 +104,7 @@ export function ModelSelector({
   availableModels,
   'data-testid': testId,
 }: ModelSelectorProps) {
+  const t = STRINGS[useLocale()];
   const groups = groupByCategory(availableModels, (m) => getModelCategory(m as Model));
   const sections = [
     {
@@ -128,11 +164,7 @@ export function ModelSelector({
 
   return (
     <div className="flex flex-col space-y-1.5 lg:col-span-2">
-      <LabelWithTooltip
-        htmlFor={id}
-        label="Model"
-        tooltip="The language model being benchmarked."
-      />
+      <LabelWithTooltip htmlFor={id} label={t.model} tooltip={t.modelTooltip} />
       <div>
         <MultiSelect
           sections={sections}
@@ -179,6 +211,7 @@ export function SequenceSelector({
   availableSequences,
   'data-testid': testId,
 }: SequenceSelectorProps) {
+  const t = STRINGS[useLocale()];
   const groups = groupByCategory(availableSequences, (s) => getSequenceCategory(s as Sequence));
   const sections = [
     {
@@ -209,11 +242,7 @@ export function SequenceSelector({
 
   return (
     <div className="flex flex-col space-y-1.5 lg:col-span-1">
-      <LabelWithTooltip
-        htmlFor={id}
-        label="ISL / OSL"
-        tooltip="Input Sequence Length / Output Sequence Length. Defines the number of input and output tokens for the benchmark (e.g., 1K/8K means 1,024 input tokens and 8,192 output tokens)."
-      />
+      <LabelWithTooltip htmlFor={id} label={t.islOsl} tooltip={t.islOslTooltip} />
       <div>
         <MultiSelect
           sections={sections}
@@ -265,17 +294,14 @@ export function ScenarioSelector({
   availableSequences,
   'data-testid': testId,
 }: ScenarioSelectorProps) {
+  const t = STRINGS[useLocale()];
   const fixedSeq = availableSequences.filter((s) => sequenceKind(s as Sequence) === 'fixed-seq');
   const agentic = availableSequences.filter((s) => sequenceKind(s as Sequence) === 'agentic');
   const fixedGroups = groupByCategory(fixedSeq, (s) => getSequenceCategory(s as Sequence));
 
   return (
     <div className="flex flex-col space-y-1.5 lg:col-span-1">
-      <LabelWithTooltip
-        htmlFor={id}
-        label="Scenario"
-        tooltip="Benchmark scenario. Fixed Sequence Length runs use a defined input/output token count (ISL/OSL). Agentic Traces replay real agentic workloads with variable inputs/outputs."
-      />
+      <LabelWithTooltip htmlFor={id} label={t.scenario} tooltip={t.scenarioTooltip} />
       <Select
         value={value}
         onValueChange={(v) => {
@@ -349,12 +375,13 @@ export function PercentileSelector({
   onChange,
   'data-testid': testId,
 }: PercentileSelectorProps) {
+  const t = STRINGS[useLocale()];
   return (
     <div className="flex flex-col space-y-1.5 lg:col-span-1">
       <LabelWithTooltip
         htmlFor={id}
-        label="Latency Percentile"
-        tooltip="Percentile of the latency distribution used for the chart x-axis on agentic runs."
+        label={t.latencyPercentile}
+        tooltip={t.latencyPercentileTooltip}
       />
       <Select
         value={value}
@@ -397,13 +424,10 @@ export function PrecisionSelector({
   availablePrecisions,
   'data-testid': testId,
 }: PrecisionSelectorProps) {
+  const t = STRINGS[useLocale()];
   return (
     <div className="flex flex-col space-y-1.5 lg:col-span-1">
-      <LabelWithTooltip
-        htmlFor={id}
-        label="Precision"
-        tooltip="Numerical precision used for model weights. Lower precision like 'FP4' uses less memory and increases throughput but may slightly reduce accuracy compared to higher precisions like 'FP8'."
-      />
+      <LabelWithTooltip htmlFor={id} label={t.precision} tooltip={t.precisionTooltip} />
       <div>
         <MultiSelect
           options={availablePrecisions.map((p) => ({
diff --git a/packages/app/src/components/ui/unofficial-domain-notice.tsx b/packages/app/src/components/ui/unofficial-domain-notice.tsx
index 91712439..4646855e 100644
--- a/packages/app/src/components/ui/unofficial-domain-notice.tsx
+++ b/packages/app/src/components/ui/unofficial-domain-notice.tsx
@@ -3,11 +3,26 @@
 import { useEffect, useState } from 'react';
 
 import { SITE_URL } from '@semianalysisai/inferencex-constants';
+import { useLocale } from '@/lib/use-locale';
 
 const OFFICIAL_HOSTNAME = new URL(SITE_URL).hostname;
 
+const STRINGS = {
+  en: {
+    note: 'Note:',
+    text: 'and is not affiliated with or endorsed by SemiAnalysis. Data shown here may be unofficial, modified, or out of date — visit the official site for authoritative InferenceX™ results.',
+    notHosted: 'This deployment is not hosted at',
+  },
+  zh: {
+    note: '注意：',
+    text: '且与 SemiAnalysis 无关联，也未获其背书。此处显示的数据可能为非官方、已修改或过期数据，请访问官方网站获取权威的 InferenceX™ 结果。',
+    notHosted: '此部署未托管在',
+  },
+} as const;
+
 export function UnofficialDomainNotice() {
   const [isUnofficial, setIsUnofficial] = useState(false);
+  const t = STRINGS[useLocale()];
 
   useEffect(() => {
     setIsUnofficial(window.location.hostname !== OFFICIAL_HOSTNAME);
@@ -17,7 +32,7 @@ export function UnofficialDomainNotice() {
 
   return (
     <p className="text-muted-foreground text-xs mt-2 border-l-2 border-amber-500 pl-2 bg-amber-500/5 py-1">
-      <strong>Note:</strong> This deployment is not hosted at{' '}
+      <strong>{t.note}</strong> {t.notHosted}{' '}
       <a
         href={SITE_URL}
         target="_blank"
@@ -26,8 +41,7 @@ export function UnofficialDomainNotice() {
       >
         {OFFICIAL_HOSTNAME}
       </a>{' '}
-      and is not affiliated with or endorsed by SemiAnalysis. Data shown here may be unofficial,
-      modified, or out of date — visit the official site for authoritative InferenceX™ results.
+      {t.text}
     </p>
   );
 }
diff --git a/packages/app/src/components/zh/zh-tab-intro.tsx b/packages/app/src/components/zh/zh-tab-intro.tsx
index e39eee80..be1340e8 100644
--- a/packages/app/src/components/zh/zh-tab-intro.tsx
+++ b/packages/app/src/components/zh/zh-tab-intro.tsx
@@ -12,7 +12,7 @@ export function ZhTabIntro({ tab }: { tab: ZhTabKey }) {
       <h1 className="text-xl lg:text-2xl font-bold tracking-tight">{TAB_META_ZH[tab].title}</h1>
       <p className="mt-2 text-sm lg:text-base text-muted-foreground">{TAB_INTRO_ZH[tab]}</p>
       <p className="mt-2 text-xs text-muted-foreground">
-        下方交互式图表界面目前为英文。图表中的模型、GPU 与框架名称均为业界通用英文名称。
+        图表中的模型、GPU、框架与指标名称均沿用业界通用英文名称。
       </p>
     </Card>
   );
diff --git a/packages/app/src/lib/compare-ssr-zh.ts b/packages/app/src/lib/compare-ssr-zh.ts
new file mode 100644
index 00000000..cd48611a
--- /dev/null
+++ b/packages/app/src/lib/compare-ssr-zh.ts
@@ -0,0 +1,436 @@
+/**
+ * Simplified Chinese ports of the English-prose-generating functions in
+ * compare-ssr.ts. Provides zh narrative templates, JSON-LD builders, and
+ * breadcrumb helpers for /zh/compare and /zh/compare-per-dollar slug pages.
+ *
+ * MUST be updated whenever compare-ssr.ts narrative templates change.
+ */
+import {
+  AUTHOR_NAME,
+  AUTHOR_URL,
+  HW_REGISTRY,
+  SITE_URL,
+} from '@semianalysisai/inferencex-constants';
+
+import { type CompareModelSlug, compareModelDisplayLabel } from '@/lib/compare-slug';
+import {
+  bandFor,
+  type CompareJsonLdVariant,
+  fmtCost,
+  fmtPctDelta,
+  type FullBoth,
+  jsonLdEntryFor,
+  type PairSummary,
+  type PerDollarBoth,
+  pickRotated,
+  type SsrInterpolatedRow,
+} from '@/lib/compare-ssr';
+
+// ---------------------------------------------------------------------------
+// Band phrase — Chinese
+// ---------------------------------------------------------------------------
+
+const BAND_PHRASE_ZH: Record<'low' | 'middle' | 'high', string> = {
+  low: '低端',
+  middle: '中部',
+  high: '高端',
+};
+
+// ---------------------------------------------------------------------------
+// /compare-per-dollar variant — both GPUs, no tie, non-zero costs
+// ---------------------------------------------------------------------------
+
+const PER_DOLLAR_BOTH_TEMPLATES_ZH: ((i: PerDollarBoth) => string)[] = [
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，${i.aLabel} 每百万 token 成本为 ${fmtCost(i.aCost)}，${i.bLabel} 为 ${fmtCost(i.bCost)}。${i.cheaper} 在此工作点上的成本效率高出 ${fmtPctDelta(i.ratio)}。`,
+  (i) =>
+    `${i.cheaper} 在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时领先于 ${i.pricier}——每百万 token 成本 ${fmtCost(i.cheaperCost)} 对 ${fmtCost(i.pricierCost)}，差距达 ${fmtPctDelta(i.ratio)}。`,
+  (i) =>
+    `将 ${i.modelLabel} 推至 ${i.target} tok/s/user 时，${i.aLabel} 每百万 token 成本为 ${fmtCost(i.aCost)}，${i.bLabel} 为 ${fmtCost(i.bCost)}——${i.cheaper} 领先 ${fmtPctDelta(i.ratio)}。`,
+  (i) =>
+    `${i.aLabel}：每百万 token ${fmtCost(i.aCost)}。${i.bLabel}：${fmtCost(i.bCost)}。均在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行，${i.cheaper} 便宜 ${fmtPctDelta(i.ratio)}。`,
+  (i) =>
+    `在 ${i.range} 交互性区间的${BAND_PHRASE_ZH[i.band]}——即 ${i.target} tok/s/user 处——${i.aLabel} 运行 ${i.modelLabel} 每百万 token 成本为 ${fmtCost(i.aCost)}，${i.bLabel} 为 ${fmtCost(i.bCost)}。${i.cheaper} 便宜 ${fmtPctDelta(i.ratio)}。`,
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，每百万 token 成本分别为：${i.aLabel} ${fmtCost(i.aCost)}、${i.bLabel} ${fmtCost(i.bCost)}；${i.cheaper} 每美元多产出 ${fmtPctDelta(i.ratio)} 的 token。`,
+];
+
+const PER_DOLLAR_TIED_TEMPLATES_ZH: ((i: PerDollarBoth) => string)[] = [
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，${i.aLabel} 和 ${i.bLabel} 的每百万 token 成本几乎相同（${fmtCost(i.aCost)} 对 ${fmtCost(i.bCost)}），差距在 ~1% 以内。`,
+  (i) =>
+    `${i.aLabel} ${fmtCost(i.aCost)}、${i.bLabel} ${fmtCost(i.bCost)} 每百万 token，在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行：成本实质相同。`,
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，${i.aLabel}（${fmtCost(i.aCost)}）与 ${i.bLabel}（${fmtCost(i.bCost)}）的每百万 token 成本基本持平。`,
+];
+
+const PER_DOLLAR_ZERO_TEMPLATES_ZH: ((args: {
+  modelLabel: string;
+  aLabel: string;
+  bLabel: string;
+  target: number;
+  aCost: number;
+  bCost: number;
+}) => string)[] = [
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，${i.aLabel} 和 ${i.bLabel} 每百万 token 成本分别为 ${fmtCost(i.aCost)} 和 ${fmtCost(i.bCost)}——其中一方缺少定价或吞吐量数据，无法进行等价比较。`,
+  (i) =>
+    `${i.aLabel}（${fmtCost(i.aCost)}）与 ${i.bLabel}（${fmtCost(i.bCost)}）每百万 token，在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行：至少有一方数据为零，无法计算比率。`,
+];
+
+const PER_DOLLAR_SINGLE_TEMPLATES_ZH: ((args: {
+  modelLabel: string;
+  presentLabel: string;
+  missingLabel: string;
+  target: number;
+  presentCost: number;
+}) => string)[] = [
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，${i.presentLabel} 每百万 token 成本为 ${fmtCost(i.presentCost)}；${i.missingLabel} 在此目标点没有基准测试数据。`,
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，${i.presentLabel} 每百万 token 成本为 ${fmtCost(i.presentCost)}。${i.missingLabel} 尚未在此工作点进行基准测试。`,
+  (i) =>
+    `仅 ${i.presentLabel} 在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时有成本数据——每百万 token ${fmtCost(i.presentCost)}。${i.missingLabel} 在此目标点尚未测试。`,
+];
+
+// ---------------------------------------------------------------------------
+// /compare 'full' variant — both GPUs, mentions cost AND throughput
+// ---------------------------------------------------------------------------
+
+function fullSummaryZh(i: FullBoth): string {
+  const costPart = i.costTied
+    ? '每 token 成本基本持平'
+    : i.costRatio === null
+      ? null
+      : `${i.cheaper} 每 token 成本低 ${fmtPctDelta(i.costRatio)}`;
+  const tputPart = i.tputTied
+    ? '每 GPU 吞吐量基本持平'
+    : i.tputRatio === null
+      ? null
+      : `${i.faster} 每 GPU 吞吐量高出 ${fmtPctDelta(i.tputRatio)}`;
+  const both = [costPart, tputPart].filter(Boolean).join('；');
+  return both.length > 0 ? both : '差距极小，难以判定优劣';
+}
+
+const FULL_BOTH_TEMPLATES_ZH: ((i: FullBoth) => string)[] = [
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 交互性运行时，${i.aLabel} 吞吐量为 ${i.aValue.toFixed(0)} tok/s/GPU，每百万 token 成本 ${fmtCost(i.aCost)}；${i.bLabel} 吞吐量为 ${i.bValue.toFixed(0)} tok/s/GPU，成本 ${fmtCost(i.bCost)}。${fullSummaryZh(i)}。`,
+  (i) =>
+    `${i.aLabel} 在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时达到 ${i.aValue.toFixed(0)} tok/s/GPU（每百万 token ${fmtCost(i.aCost)}）；${i.bLabel} 达到 ${i.bValue.toFixed(0)} tok/s/GPU（${fmtCost(i.bCost)}）。${fullSummaryZh(i)}。`,
+  (i) =>
+    `${i.modelLabel} 在 ${i.target} tok/s/user 交互性下的吞吐量：${i.aLabel} 为 ${i.aValue.toFixed(0)} tok/s/GPU，${i.bLabel} 为 ${i.bValue.toFixed(0)}。每百万 token 成本分别为 ${fmtCost(i.aCost)} 和 ${fmtCost(i.bCost)}。${fullSummaryZh(i)}。`,
+  (i) =>
+    `${i.aLabel} / ${i.bLabel} 在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行：${i.aValue.toFixed(0)} / ${i.bValue.toFixed(0)} tok/s/GPU，${fmtCost(i.aCost)} / ${fmtCost(i.bCost)} 每百万 token。${fullSummaryZh(i)}。`,
+  (i) =>
+    `在 ${i.range} 交互性区间的${BAND_PHRASE_ZH[i.band]}，即 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时：${i.aLabel} 达到 ${i.aValue.toFixed(0)} tok/s/GPU（${fmtCost(i.aCost)}/百万 token），${i.bLabel} 达到 ${i.bValue.toFixed(0)}（${fmtCost(i.bCost)}/百万）。${fullSummaryZh(i)}。`,
+  (i) =>
+    `以 ${i.target} tok/s/user 为目标在 ${i.modelLabel} 上运行时，${i.aLabel} 产出 ${i.aValue.toFixed(0)} tok/s/GPU（每百万 token ${fmtCost(i.aCost)}），${i.bLabel} 产出 ${i.bValue.toFixed(0)}（${fmtCost(i.bCost)}）。${fullSummaryZh(i)}。`,
+];
+
+const FULL_SINGLE_TEMPLATES_ZH: ((args: {
+  modelLabel: string;
+  presentLabel: string;
+  missingLabel: string;
+  target: number;
+  presentValue: number;
+  presentCost: number;
+}) => string)[] = [
+  (i) =>
+    `在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时，${i.presentLabel} 吞吐量为 ${i.presentValue.toFixed(0)} tok/s/GPU，每百万 token 成本 ${fmtCost(i.presentCost)}；${i.missingLabel} 在此目标点没有基准测试数据。`,
+  (i) =>
+    `${i.presentLabel} 在 ${i.modelLabel} 上以 ${i.target} tok/s/user 运行时达到 ${i.presentValue.toFixed(0)} tok/s/GPU（每百万 token ${fmtCost(i.presentCost)}）。${i.missingLabel} 在此工作点没有数据。`,
+  (i) =>
+    `${i.presentLabel}：${i.presentValue.toFixed(0)} tok/s/GPU，每百万 token ${fmtCost(i.presentCost)}（${i.modelLabel} 上以 ${i.target} tok/s/user 运行）。${i.missingLabel} 在此点尚未测试。`,
+];
+
+// ---------------------------------------------------------------------------
+// compareTableNarrativeZh
+// ---------------------------------------------------------------------------
+
+export function compareTableNarrativeZh(
+  variant: CompareJsonLdVariant,
+  modelLabel: string,
+  aLabel: string,
+  bLabel: string,
+  ssrRows: SsrInterpolatedRow[],
+  interactivityRange: { min: number; max: number },
+): string[] {
+  if (ssrRows.length === 0) return [];
+
+  const range = `${interactivityRange.min}–${interactivityRange.max} tok/s/user`;
+  const pageSeed = `${variant}|${modelLabel}|${aLabel}|${bLabel}`;
+  const paragraphs: string[] = [];
+
+  for (const [rowIndex, row] of ssrRows.entries()) {
+    const { target, a, b } = row;
+    if (!a && !b) continue;
+    const band = bandFor(target, interactivityRange);
+
+    if (variant === 'per-dollar') {
+      if (a && b) {
+        if (!(a.cost > 0 && b.cost > 0)) {
+          paragraphs.push(
+            pickRotated(
+              PER_DOLLAR_ZERO_TEMPLATES_ZH,
+              pageSeed,
+              rowIndex,
+            )({
+              modelLabel,
+              aLabel,
+              bLabel,
+              target,
+              aCost: a.cost,
+              bCost: b.cost,
+            }),
+          );
+          continue;
+        }
+        const aCheaper = a.cost < b.cost;
+        const cheaper = aCheaper ? aLabel : bLabel;
+        const pricier = aCheaper ? bLabel : aLabel;
+        const ratio = aCheaper ? b.cost / a.cost : a.cost / b.cost;
+        const inputs: PerDollarBoth = {
+          modelLabel,
+          aLabel,
+          bLabel,
+          cheaper,
+          pricier,
+          cheaperCost: aCheaper ? a.cost : b.cost,
+          pricierCost: aCheaper ? b.cost : a.cost,
+          ratio,
+          target,
+          aCost: a.cost,
+          bCost: b.cost,
+          range,
+          band,
+        };
+        const pool = ratio < 1.01 ? PER_DOLLAR_TIED_TEMPLATES_ZH : PER_DOLLAR_BOTH_TEMPLATES_ZH;
+        paragraphs.push(pickRotated(pool, pageSeed, rowIndex)(inputs));
+        continue;
+      }
+      const present = (a ?? b)!;
+      paragraphs.push(
+        pickRotated(
+          PER_DOLLAR_SINGLE_TEMPLATES_ZH,
+          pageSeed,
+          rowIndex,
+        )({
+          modelLabel,
+          presentLabel: a ? aLabel : bLabel,
+          missingLabel: a ? bLabel : aLabel,
+          target,
+          presentCost: present.cost,
+        }),
+      );
+      continue;
+    }
+
+    // 'full' variant
+    if (a && b) {
+      const costOk = a.cost > 0 && b.cost > 0;
+      const tputOk = a.value > 0 && b.value > 0;
+      const aCheaper = a.cost < b.cost;
+      const aFaster = a.value > b.value;
+      const costRatio = costOk ? (aCheaper ? b.cost / a.cost : a.cost / b.cost) : null;
+      const tputRatio = tputOk ? (aFaster ? a.value / b.value : b.value / a.value) : null;
+      const inputs: FullBoth = {
+        modelLabel,
+        aLabel,
+        bLabel,
+        cheaper: aCheaper ? aLabel : bLabel,
+        faster: aFaster ? aLabel : bLabel,
+        costRatio,
+        tputRatio,
+        costTied: costOk && costRatio !== null && costRatio < 1.01,
+        tputTied: tputOk && tputRatio !== null && tputRatio < 1.01,
+        target,
+        aCost: a.cost,
+        bCost: b.cost,
+        aValue: a.value,
+        bValue: b.value,
+        range,
+        band,
+      };
+      paragraphs.push(pickRotated(FULL_BOTH_TEMPLATES_ZH, pageSeed, rowIndex)(inputs));
+      continue;
+    }
+    const present = (a ?? b)!;
+    paragraphs.push(
+      pickRotated(
+        FULL_SINGLE_TEMPLATES_ZH,
+        pageSeed,
+        rowIndex,
+      )({
+        modelLabel,
+        presentLabel: a ? aLabel : bLabel,
+        missingLabel: a ? bLabel : aLabel,
+        target,
+        presentValue: present.value,
+        presentCost: present.cost,
+      }),
+    );
+  }
+
+  return paragraphs;
+}
+
+// ---------------------------------------------------------------------------
+// JSON-LD — Chinese
+// ---------------------------------------------------------------------------
+
+export function buildBreadcrumbJsonLdZh(
+  variant: CompareJsonLdVariant,
+  pairLabel: string,
+  url: string,
+) {
+  const indexUrl =
+    variant === 'per-dollar' ? `${SITE_URL}/zh/compare-per-dollar` : `${SITE_URL}/zh/compare`;
+  const indexName = variant === 'per-dollar' ? 'GPU 每美元性能' : 'GPU 对比';
+  return {
+    '@context': 'https://schema.org',
+    '@type': 'BreadcrumbList',
+    itemListElement: [
+      { '@type': 'ListItem', position: 1, name: '首页', item: `${SITE_URL}/zh` },
+      { '@type': 'ListItem', position: 2, name: indexName, item: indexUrl },
+      { '@type': 'ListItem', position: 3, name: pairLabel, item: url },
+    ],
+  };
+}
+
+export function buildJsonLdZh(
+  variant: CompareJsonLdVariant,
+  model: CompareModelSlug,
+  a: string,
+  b: string,
+  url: string,
+  summaryA: PairSummary,
+  summaryB: PairSummary,
+  ssrRows: SsrInterpolatedRow[],
+  imageUrl?: string,
+  datePublished?: string,
+  dateModified?: string,
+  modelApiKey?: string,
+) {
+  const aLabel = HW_REGISTRY[a]?.label ?? a.toUpperCase();
+  const bLabel = HW_REGISTRY[b]?.label ?? b.toUpperCase();
+  const fullLabel = compareModelDisplayLabel(model, a, b);
+
+  const itemListName =
+    variant === 'per-dollar' ? `${fullLabel} — 每美元性能` : `${fullLabel} 推理基准测试`;
+  const itemListDescription =
+    variant === 'per-dollar'
+      ? `${aLabel} 与 ${bLabel} 在 ${model.label} 上的每百万 token 成本。基于所属云服务商 TCO 归一化的 GPU 推理性能。`
+      : `${aLabel} 与 ${bLabel} 在 ${model.label} 上的正面 AI 推理基准测试对比。`;
+  const datasetName =
+    variant === 'per-dollar'
+      ? `${aLabel} vs ${bLabel}（${model.label}）每美元性能对比`
+      : `${aLabel} vs ${bLabel}（${model.label}）插值基准测试对比`;
+  const datasetDescription =
+    variant === 'per-dollar'
+      ? `${aLabel} 与 ${bLabel} 在 ${model.label} 上的所属云服务商每百万 token 成本，在相同交互性水平下对齐——美元归一化推理基准测试。`
+      : `${aLabel} 与 ${bLabel} 在 ${model.label} 上在相同交互性水平下的插值吞吐量、成本、能效及并发数。`;
+
+  const comparisonRows = ssrRows
+    .filter((row) => row.a || row.b)
+    .map((row) => {
+      const metrics: { name: string; value: string }[] = [
+        { name: 'Model', value: model.displayName },
+        { name: 'Target Interactivity (tok/s/user)', value: String(row.target) },
+      ];
+      if (row.a) {
+        metrics.push(
+          { name: `${aLabel} Throughput (tok/s/gpu)`, value: row.a.value.toFixed(1) },
+          { name: `${aLabel} Cost ($/M tok)`, value: row.a.cost.toFixed(3) },
+          { name: `${aLabel} tok/s/MW`, value: row.a.tpPerMw.toFixed(0) },
+          { name: `${aLabel} Concurrency`, value: String(Math.round(row.a.concurrency)) },
+        );
+      }
+      if (row.b) {
+        metrics.push(
+          { name: `${bLabel} Throughput (tok/s/gpu)`, value: row.b.value.toFixed(1) },
+          { name: `${bLabel} Cost ($/M tok)`, value: row.b.cost.toFixed(3) },
+          { name: `${bLabel} tok/s/MW`, value: row.b.tpPerMw.toFixed(0) },
+          { name: `${bLabel} Concurrency`, value: String(Math.round(row.b.concurrency)) },
+        );
+      }
+      return {
+        '@type': 'Dataset',
+        name: `${model.label} 在 ${row.target} tok/s/user 交互性下的对比`,
+        variableMeasured: metrics.map((m) => ({
+          '@type': 'PropertyValue',
+          name: m.name,
+          value: m.value,
+        })),
+      };
+    });
+
+  return {
+    '@context': 'https://schema.org',
+    '@graph': [
+      {
+        '@type': 'ItemList',
+        name: itemListName,
+        description: itemListDescription,
+        url,
+        inLanguage: 'zh-CN',
+        ...(imageUrl && { image: imageUrl }),
+        itemListOrder: 'https://schema.org/ItemListOrderAscending',
+        numberOfItems: 2,
+        itemListElement: [jsonLdEntryFor(a, summaryA, 1), jsonLdEntryFor(b, summaryB, 2)],
+      },
+      ...(comparisonRows.length > 0
+        ? [
+            {
+              '@type': 'Dataset',
+              name: datasetName,
+              description: datasetDescription,
+              url,
+              inLanguage: 'zh-CN',
+              license: 'https://www.apache.org/licenses/LICENSE-2.0',
+              isAccessibleForFree: true,
+              measurementTechnique:
+                'Open-source automated GPU CI/CD inference benchmark (github.com/SemiAnalysisAI/InferenceX)',
+              keywords: [
+                ...new Set(
+                  [
+                    'AI inference benchmark',
+                    'GPU comparison',
+                    variant === 'per-dollar' ? 'cost per million tokens' : 'inference latency',
+                    variant === 'per-dollar' ? 'performance per dollar' : 'tokens per second',
+                    model.label,
+                    aLabel,
+                    bLabel,
+                    HW_REGISTRY[a]?.vendor,
+                    HW_REGISTRY[b]?.vendor,
+                  ].filter(Boolean),
+                ),
+              ].join(', '),
+              ...(datePublished && { datePublished }),
+              ...(dateModified && { dateModified }),
+              creator: {
+                '@type': 'Organization',
+                name: AUTHOR_NAME,
+                url: AUTHOR_URL,
+              },
+              ...(modelApiKey && {
+                distribution: {
+                  '@type': 'DataDownload',
+                  encodingFormat: 'application/json',
+                  contentUrl: `${SITE_URL}/api/v1/benchmarks?model=${encodeURIComponent(modelApiKey)}`,
+                  name: `${model.label} latest benchmark rows (JSON)`,
+                },
+              }),
+              ...(imageUrl && {
+                image: {
+                  '@type': 'ImageObject',
+                  contentUrl: imageUrl,
+                  caption: datasetName,
+                },
+              }),
+              hasPart: comparisonRows,
+            },
+          ]
+        : []),
+    ],
+  };
+}
diff --git a/packages/app/src/lib/compare-ssr.ts b/packages/app/src/lib/compare-ssr.ts
index 92c0912d..daf2f582 100644
--- a/packages/app/src/lib/compare-ssr.ts
+++ b/packages/app/src/lib/compare-ssr.ts
@@ -319,7 +319,7 @@ export function computeCompareImageRows(
 // JSON-LD graph
 // ---------------------------------------------------------------------------
 
-function jsonLdEntryFor(key: string, summary: PairSummary, position: number) {
+export function jsonLdEntryFor(key: string, summary: PairSummary, position: number) {
   const meta = HW_REGISTRY[key];
   const label = meta?.label ?? key.toUpperCase();
   const props: { name: string; value: string | number }[] = [{ name: 'Category', value: 'GPU' }];
@@ -376,13 +376,13 @@ export type CompareJsonLdVariant = 'full' | 'per-dollar';
 // ---------------------------------------------------------------------------
 
 /** Format cost as $X.XX or $X.X depending on magnitude. */
-function fmtCost(v: number): string {
+export function fmtCost(v: number): string {
   if (v >= 10) return `$${v.toFixed(1)}`;
   return `$${v.toFixed(2)}`;
 }
 
 /** Round a ratio (always ≥ 1) into a percentage delta, e.g. 1.3 → "30%". */
-function fmtPctDelta(ratio: number): string {
+export function fmtPctDelta(ratio: number): string {
   return `${Math.round((ratio - 1) * 100)}%`;
 }
 
@@ -400,7 +400,10 @@ function hashStr(s: string): number {
 /** Bucket the target into low / middle / high segment of the benchmarked
  *  range. Used by templates that say things like "Near the low end" or
  *  "At the upper edge" so the same prose array doesn't all read identically. */
-function bandFor(target: number, range: { min: number; max: number }): 'low' | 'middle' | 'high' {
+export function bandFor(
+  target: number,
+  range: { min: number; max: number },
+): 'low' | 'middle' | 'high' {
   const span = range.max - range.min;
   if (span <= 0) return 'middle';
   const t = (target - range.min) / span;
@@ -409,7 +412,7 @@ function bandFor(target: number, range: { min: number; max: number }): 'low' | '
   return 'middle';
 }
 
-interface PerDollarBoth {
+export interface PerDollarBoth {
   modelLabel: string;
   aLabel: string;
   bLabel: string;
@@ -425,7 +428,7 @@ interface PerDollarBoth {
   band: 'low' | 'middle' | 'high';
 }
 
-interface FullBoth {
+export interface FullBoth {
   modelLabel: string;
   aLabel: string;
   bLabel: string;
@@ -444,7 +447,7 @@ interface FullBoth {
   band: 'low' | 'middle' | 'high';
 }
 
-const BAND_PHRASE: Record<'low' | 'middle' | 'high', string> = {
+export const BAND_PHRASE: Record<'low' | 'middle' | 'high', string> = {
   low: 'near the low end',
   middle: 'around the middle',
   high: 'toward the upper edge',
@@ -566,7 +569,7 @@ const FULL_SINGLE_TEMPLATES: ((args: {
  *  pages get different starting templates. Avoids the birthday-problem
  *  collisions that pickTemplate alone produces when sampling N times from a
  *  pool of size M near N. */
-function pickRotated<T>(arr: T[], pageSeed: string, rowIndex: number): T {
+export function pickRotated<T>(arr: T[], pageSeed: string, rowIndex: number): T {
   const start = hashStr(pageSeed) % arr.length;
   return arr[(start + rowIndex) % arr.length];
 }
diff --git a/packages/app/src/lib/i18n.test.ts b/packages/app/src/lib/i18n.test.ts
index e99843e4..c3ce6356 100644
--- a/packages/app/src/lib/i18n.test.ts
+++ b/packages/app/src/lib/i18n.test.ts
@@ -43,11 +43,11 @@ describe('hasZhSibling', () => {
     expect(hasZhSibling('/about')).toBe(true);
   });
 
-  it('matches blog child paths but not compare slug pages', () => {
+  it('matches blog and compare child paths', () => {
     expect(hasZhSibling('/blog/some-post')).toBe(true);
-    // Per-slug comparison pages are English-only; only the index is mirrored.
     expect(hasZhSibling('/compare')).toBe(true);
-    expect(hasZhSibling('/compare/deepseek-r1-h100-vs-h200')).toBe(false);
+    expect(hasZhSibling('/compare/deepseek-r1-h100-vs-h200')).toBe(true);
+    expect(hasZhSibling('/compare-per-dollar/deepseek-r1-h100-vs-h200')).toBe(true);
   });
 
   it('rejects unmirrored routes', () => {
@@ -69,9 +69,15 @@ describe('switchLocalePath', () => {
     expect(switchLocalePath('/zh/blog/some-post')).toBe('/blog/some-post');
   });
 
+  it('switches compare slug pages within the language trees', () => {
+    expect(switchLocalePath('/compare/foo-vs-bar')).toBe('/zh/compare/foo-vs-bar');
+    expect(switchLocalePath('/zh/compare-per-dollar/foo-vs-bar')).toBe(
+      '/compare-per-dollar/foo-vs-bar',
+    );
+  });
+
   it('falls back to the other homepage for unmirrored paths', () => {
     expect(switchLocalePath('/datasets')).toBe('/zh');
-    expect(switchLocalePath('/compare/foo-vs-bar')).toBe('/zh');
     expect(switchLocalePath('/zh/unknown-page')).toBe('/');
   });
 });
diff --git a/packages/app/src/lib/i18n.ts b/packages/app/src/lib/i18n.ts
index de38f3c6..cdd5afc8 100644
--- a/packages/app/src/lib/i18n.ts
+++ b/packages/app/src/lib/i18n.ts
@@ -47,8 +47,8 @@ export const ZH_MIRRORED_ROUTES: readonly { path: string; exact?: boolean }[] =
   { path: '/about', exact: true },
   { path: '/quotes', exact: true },
   { path: '/land-acknowledgement', exact: true },
-  { path: '/compare', exact: true },
-  { path: '/compare-per-dollar', exact: true },
+  { path: '/compare' },
+  { path: '/compare-per-dollar' },
   { path: '/blog' },
 ];
 
diff --git a/packages/app/src/lib/nudges/registry.tsx b/packages/app/src/lib/nudges/registry.tsx
index 234e129a..783cc9fb 100644
--- a/packages/app/src/lib/nudges/registry.tsx
+++ b/packages/app/src/lib/nudges/registry.tsx
@@ -51,10 +51,14 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: ShieldCheck,
       iconClassName: 'text-brand',
       title: 'Every result is reproducible',
+      titleZh: '每项结果均可复现',
       description:
         'Each data point is produced by a public GitHub Actions run. Click any point on a chart to jump to the exact run, logs, and artifacts.',
+      descriptionZh:
+        '每个数据点都由公开的 GitHub Actions 运行产生。点击图表上的任意数据点即可跳转到对应的运行记录、日志和产物。',
       action: {
         label: 'See how',
+        labelZh: '了解详情',
         onClick: () => {
           window.location.href = '/about#reproducibility';
         },
@@ -84,9 +88,12 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: Star,
       iconClassName: 'text-yellow-500 fill-yellow-500',
       title: 'Finding us useful?',
+      titleZh: '觉得有用吗？',
       description: 'Help the project grow so we can add more benchmarks! Star us on GitHub.',
+      descriptionZh: '帮助项目成长，让我们可以添加更多基准测试！在 GitHub 上为我们加星。',
       action: {
         label: 'Star on GitHub',
+        labelZh: '在 GitHub 上加星',
         icon: <GitHubIcon />,
         onClick: () => {
           window.open(GITHUB_REPO_URL, '_blank', 'noopener,noreferrer');
@@ -117,8 +124,10 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: Download,
       iconClassName: 'text-blue-500',
       title: 'Need the data?',
+      titleZh: '需要数据？',
       description:
         'Use the download button on any chart to export as PNG or CSV — no need to copy from tooltips.',
+      descriptionZh: '使用任意图表上的下载按钮导出 PNG 或 CSV——无需从提示框中复制。',
       testId: 'export-nudge',
     },
     analytics: {
@@ -138,10 +147,13 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: Palette,
       iconClassName: 'text-purple-500',
       title: 'Try Gradient Labels',
+      titleZh: '试试渐变标签',
       description:
         'Gradient labels color-code data points by parallelism level, making it easier to spot performance patterns at a glance.',
+      descriptionZh: '渐变标签按并行级别对数据点进行颜色编码，让您一目了然地发现性能模式。',
       action: {
         label: 'Enable Gradient Labels',
+        labelZh: '启用渐变标签',
         onClick: (eventDetail?: unknown) => {
           const detail = eventDetail as { enableGradient?: () => void } | undefined;
           detail?.enableGradient?.();
@@ -179,8 +191,10 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: MessageSquareText,
       iconClassName: 'text-brand',
       title: "See the model's actual answers",
+      titleZh: '查看模型的实际回答',
       description:
         'Click Prompts on any row to compare each prompt, the expected answer, and what the model actually responded.',
+      descriptionZh: '点击任意行的"提示词"按钮，对比每条提示、预期答案和模型的实际回复。',
       testId: 'eval-samples-nudge',
     },
     analytics: {
@@ -210,7 +224,9 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: MessageSquareText,
       iconClassName: 'text-brand',
       title: 'Help us improve InferenceX',
+      titleZh: '帮助我们改进 InferenceX',
       description: "We'd love to hear what's working and what isn't.",
+      descriptionZh: '我们非常希望了解哪些方面做得好，哪些方面需要改进。',
       testId: 'feedback-modal',
       centered: true,
       renderContent: ({ dismiss }) => <FeedbackForm onDismiss={dismiss} />,
@@ -236,14 +252,20 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: Sparkles,
       iconClassName: 'text-brand',
       title: 'MiniMax M3 is live',
+      titleZh: 'MiniMax M3 已上线',
       description:
         'Day-zero benchmarks for MiniMax M3 are now available across the latest NVIDIA and AMD GPUs. Results are experimental — see how the new model performs across hardware.',
+      descriptionZh:
+        'MiniMax M3 的首日基准测试数据现已覆盖最新的 NVIDIA 和 AMD GPU。结果为实验性数据——来看看新模型在不同硬件上的表现。',
       testId: 'launch-modal',
       containerClassName: 'border-brand/40',
       badge: 'New',
+      badgeZh: '最新',
       dismissLabel: 'Maybe Later',
+      dismissLabelZh: '稍后再看',
       primaryAction: {
         label: 'Explore',
+        labelZh: '开始探索',
         icon: <ArrowRight className="size-4" />,
         onClick: () => {
           window.location.href = '/inference?preset=minimax-m3-launch';
@@ -270,12 +292,17 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: Star,
       iconClassName: 'text-yellow-500 fill-yellow-500',
       title: 'Star InferenceX on GitHub',
+      titleZh: '在 GitHub 上为 InferenceX 加星',
       description:
         'Star InferenceX on GitHub to get notified when we publish new benchmark data. We update GPU performance comparisons regularly — starring is the easiest way to stay in the loop and help the project grow.',
+      descriptionZh:
+        '在 GitHub 上为 InferenceX 加星，以便在我们发布新基准测试数据时收到通知。我们定期更新 GPU 性能对比——加星是保持关注并帮助项目成长的最简单方式。',
       testId: 'github-star-modal',
       dismissLabel: 'Maybe Later',
+      dismissLabelZh: '稍后再看',
       primaryAction: {
         label: 'Star on GitHub',
+        labelZh: '在 GitHub 上加星',
         icon: <GitHubIcon className="size-4" />,
         onClick: () => {
           window.open(GITHUB_REPO_URL, '_blank', 'noopener,noreferrer');
@@ -307,9 +334,12 @@ export const NUDGE_REGISTRY: NudgeDefinition[] = [
       icon: Sparkles,
       iconClassName: 'text-brand',
       title: 'MiniMax M3 benchmarks are live',
+      titleZh: 'MiniMax M3 基准测试已上线',
       description: 'First inference numbers across NVIDIA and AMD GPUs, click to explore.',
+      descriptionZh: 'NVIDIA 和 AMD GPU 的首批推理数据，点击探索。',
       testId: 'launch-banner',
       badge: 'New',
+      badgeZh: '最新',
       href: '/inference?preset=minimax-m3-launch',
       onLinkClick: () => {
         window.location.href = '/inference?preset=minimax-m3-launch';
diff --git a/packages/app/src/lib/nudges/types.ts b/packages/app/src/lib/nudges/types.ts
index 05160181..6605d1c9 100644
--- a/packages/app/src/lib/nudges/types.ts
+++ b/packages/app/src/lib/nudges/types.ts
@@ -45,6 +45,7 @@ export interface NudgeCondition {
 
 export interface NudgeAction {
   label: string;
+  labelZh?: string;
   icon?: ReactNode;
   /**
    * Called when the user clicks the action button.
@@ -62,7 +63,9 @@ export interface NudgeContent {
   icon: ComponentType<{ className?: string }>;
   iconClassName?: string;
   title: string;
+  titleZh?: string;
   description: string;
+  descriptionZh?: string;
   action?: NudgeAction;
   /** data-testid on the nudge container (preserves existing selectors). */
   testId?: string;
@@ -74,6 +77,7 @@ export interface NudgeContent {
 
   /** Label for the dismiss button (default "Maybe Later"). */
   dismissLabel?: string;
+  dismissLabelZh?: string;
   /** Label + handler for the primary CTA (modals only). */
   primaryAction?: NudgeAction;
   /** Extra CSS class on the modal container (e.g. branded border). */
@@ -84,6 +88,7 @@ export interface NudgeContent {
   actionClassName?: string;
   /** Badge text rendered next to the title (e.g. "New"). */
   badge?: string;
+  badgeZh?: string;
 
   // -- Banner-specific (ignored by toasts/modals) --
 
diff --git a/packages/app/src/lib/use-locale.ts b/packages/app/src/lib/use-locale.ts
new file mode 100644
index 00000000..4de38fc9
--- /dev/null
+++ b/packages/app/src/lib/use-locale.ts
@@ -0,0 +1,16 @@
+'use client';
+
+import { usePathname } from 'next/navigation';
+
+import { isZhPathname, type Locale } from '@/lib/i18n';
+
+/**
+ * Current page language, derived from the /zh route prefix. Lets shared
+ * client components (footer, dashboard chrome, nudges) render Chinese
+ * strings on /zh pages without prop drilling — pair with a component-local
+ * `STRINGS = { en: {...}, zh: {...} }` dictionary.
+ */
+export function useLocale(): Locale {
+  const pathname = usePathname();
+  return isZhPathname(pathname ?? '') ? 'zh' : 'en';
+}
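
Note on the hook above: a minimal sketch of the pairing its doc comment describes, plus the English fallback that the optional `*Zh` nudge fields imply. It is not part of the patch. `localeFromPathname`, `downloadLabel`, `localized`, and the `download` string are hypothetical names for illustration, and the `/zh` prefix test is an assumed stand-in for `isZhPathname`, whose body is not shown in this hunk.

```typescript
// Hypothetical stand-ins: the real Locale type and isZhPathname live in
// src/lib/i18n.ts and are not shown in this hunk.
type Locale = 'en' | 'zh';

function localeFromPathname(pathname: string | null): Locale {
  const p = pathname ?? '';
  // Assumption: a path is Chinese when it is exactly /zh or starts with /zh/.
  return p === '/zh' || p.startsWith('/zh/') ? 'zh' : 'en';
}

// Component-local dictionary, as the doc comment suggests.
const STRINGS = {
  en: { download: 'Download' },
  zh: { download: '下载' },
} as const;

function downloadLabel(pathname: string | null): string {
  return STRINGS[localeFromPathname(pathname)].download;
}

// Optional *Zh fields (titleZh, labelZh, ...) fall back to the English text
// when the Chinese string is missing or the page is not under /zh.
function localized(locale: Locale, en: string, zh?: string): string {
  return locale === 'zh' && zh ? zh : en;
}
```

In a client component the pathname would come from `useLocale()` rather than being passed in; the pure functions here only keep the sketch runnable outside Next.js.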

From 37a4d820b03892209cc3fc37f1bf5f7a188bab38 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Sat, 4 Jul 2026 03:26:57 -0400
Subject: [PATCH 3/3] fix(i18n): keep cross-post links in zh blog posts within
 /zh tree
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Four zh translations linked sibling posts via absolute English URLs
(https://inferencex.semianalysis.com/blog/...); rewrite them to the
/zh/blog/... siblings per the AGENTS.md translation rule, so readers
and crawlers on the Chinese tree stay in it. Flagged by Cursor Bugbot.

中文：修复 4 篇中文博客译文中指向英文 /blog/ 绝对链接的交叉引用，改为
指向 /zh/blog/ 中文版本，使中文读者与爬虫停留在中文页面树内。
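
For reference, a rough sketch of how links like these could be detected or rewritten in bulk. It is not how this commit was produced; the diff below only shows the four resulting one-line changes. `findEnglishBlogLinks` and `toZhBlogLinks` are hypothetical helper names, and the sketch assumes every English post has a zh sibling with the same slug.

```typescript
// Matches absolute links into the English blog tree. Links that already
// point at /zh/blog/ do not match, because "/blog/" must follow the host.
const EN_BLOG_LINK = /https:\/\/inferencex\.semianalysis\.com\/blog\/[^\s)]*/g;

// Hypothetical lint helper: list English blog links found in a zh post.
function findEnglishBlogLinks(mdx: string): string[] {
  return mdx.match(EN_BLOG_LINK) ?? [];
}

// Hypothetical rewrite helper: point those links at the /zh/blog/ sibling.
function toZhBlogLinks(mdx: string): string {
  return mdx.replace(
    /(https:\/\/inferencex\.semianalysis\.com)\/blog\//g,
    '$1/zh/blog/',
  );
}
```

Dashboard links such as `https://inferencex.semianalysis.com/inference` are left untouched by both helpers, which matches the `DashboardCTA` lines in the hunks below.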

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 .../gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx   | 2 +-
 .../blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx    | 2 +-
 .../blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx     | 2 +-
 .../content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx  | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx b/packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx
index e4b34623..c1cd0c43 100644
--- a/packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx
+++ b/packages/app/content/blog/zh/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt.mdx
@@ -138,7 +138,7 @@ NVIDIA 的 [SGLang GB200 NVL72 结果](https://lmsys.org/blog/2025-09-25-gb200-p
 
 ## 致谢
 
-感谢 NVIDIA 的 Dynamo 和 TensorRT-LLM 团队 — 包括 Jatin Gangani、Kedar Potdar、Sridhar Ramaswamy、Ishan Dhanani 和 Sahithi Chigurupati — 交付了 B200 多节点 RoCEv2 和 GB200 NVL72 上的分离式部署方案。请查看我们另一篇关于 [GB200 NVL72 对比 B200 运行 Kimi K2.5 的博文](https://inferencex.semianalysis.com/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200)。
+感谢 NVIDIA 的 Dynamo 和 TensorRT-LLM 团队 — 包括 Jatin Gangani、Kedar Potdar、Sridhar Ramaswamy、Ishan Dhanani 和 Sahithi Chigurupati — 交付了 B200 多节点 RoCEv2 和 GB200 NVL72 上的分离式部署方案。请查看我们另一篇关于 [GB200 NVL72 对比 B200 运行 Kimi K2.5 的博文](https://inferencex.semianalysis.com/zh/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200)。
 
 <DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_seq=1k%2F1k&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp">
   点击查看完整 InferenceX 仪表板 →
diff --git a/packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx b/packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx
index dfff75bd..38fb02d1 100644
--- a/packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx
+++ b/packages/app/content/blog/zh/gb300-nvl72-vs-gb200-nvl72-dsv4-pro-vllm-fp4.mdx
@@ -135,7 +135,7 @@ GB200 的每 GPU 峰值吞吐量为 8,933，交互性为 15.3 tok/s/user。GB300
 
 ## 致谢
 
-感谢 NVIDIA 的 Dynamo 和 vLLM 团队——包括 Jatin Gangani、Kedar Potdar、Sridhar Ramaswamy、Ishan Dhanani 和 Sahithi Chigurupati——以及 vLLM 团队，是他们将 GB200 和 GB300 的 DSv4-Pro 配方交付落地，使得机架间对比成为可能。配套文章：[GB200 NVL72 vs B200 DeepSeek R1 对比](https://inferencex.semianalysis.com/blog/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt)，覆盖了 SKU 梯队下一级的 scale-up 互联优势。
+感谢 NVIDIA 的 Dynamo 和 vLLM 团队——包括 Jatin Gangani、Kedar Potdar、Sridhar Ramaswamy、Ishan Dhanani 和 Sahithi Chigurupati——以及 vLLM 团队，是他们将 GB200 和 GB300 的 DSv4-Pro 配方交付落地，使得机架间对比成为可能。配套文章：[GB200 NVL72 vs B200 DeepSeek R1 对比](https://inferencex.semianalysis.com/zh/blog/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt)，覆盖了 SKU 梯队下一级的 scale-up 互联优势。
 
 <DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&i_active=gb200_dynamo-vllm%2Cgb300_dynamo-vllm&g_model=DeepSeek-V4-Pro&i_linelabel=1">
   点击查看完整 InferenceX 仪表板 →
diff --git a/packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx b/packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx
index c8c0e976..a4817108 100644
--- a/packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx
+++ b/packages/app/content/blog/zh/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx
@@ -137,7 +137,7 @@ TileLang 依赖版本已更新以在 AMD 上启用 FP8 GEMM，并新增了 `spar
 此次结果为单节点、聚合、仅 FP8。仍有两个差距待弥合：
 
 - **FP4 可组合性。** 本次对比中 B200 使用的是 CUDA nightly 上的 FP8。B200 NVFP4 SGLang 的 GLM-5 方案已开始交付，将进一步压缩 B200 的成本曲线。MI355X MXFP4 GLM-5.1 SGLang 已通过 [InferenceX PR #1098](https://github.com/SemiAnalysisAI/InferenceX/pull/1098) 于 2026-04-21 交付，但 MI355X 上的 FP4 + MTP 组合尚未达到本文展示的 FP8 + MTP 方案的水平。
-- **分离式部署和宽专家并行。** MI355X 上的 GLM-5 尚无分离式部署或宽 EP 方案。NVIDIA 的 GB200 NVL72 Dynamo TRT-LLM 和 Dynamo vLLM 方案在 Kimi K2.5 上已展示了[机架级宽 EP 带来的约 3 倍每 GPU 吞吐量优势](https://inferencex.semianalysis.com/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200)。AMD 尚未为 GLM-5 交付分离式部署方案。
+- **分离式部署和宽专家并行。** MI355X 上的 GLM-5 尚无分离式部署或宽 EP 方案。NVIDIA 的 GB200 NVL72 Dynamo TRT-LLM 和 Dynamo vLLM 方案在 Kimi K2.5 上已展示了[机架级宽 EP 带来的约 3 倍每 GPU 吞吐量优势](https://inferencex.semianalysis.com/zh/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200)。AMD 尚未为 GLM-5 交付分离式部署方案。
 
 ## 致谢
 
diff --git a/packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx b/packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx
index 0331d3fc..e76bf1bd 100644
--- a/packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx
+++ b/packages/app/content/blog/zh/mi355x-kimi-k2-5-vllm-aiter-7x-speedup.mdx
@@ -18,7 +18,7 @@ tags:
 
 然而，他们最令人印象深刻的成就在于推动曲线改变的速度。[vLLM PR #35850](https://github.com/vllm-project/vllm/pull/35850) 于 3 月 6 日合入并随 vLLM 0.18 发布，到 3 月 26 日 InferenceX 的基准测试流水线就通过 [InferenceX PR #936](https://github.com/SemiAnalysisAI/InferenceX/pull/936)（该 PR 启用了 MI355X Kimi K2.5 [配方](https://recipes.vllm.ai/moonshotai/Kimi-K2.5?hardware=mi355x&features=tool_calling%2Creasoning%2Cencoder_parallel)上的 AITER、专家并行及 vLLM 0.18.0 升级）捕获了完整效果——距离我们 3 月 1 日的 vLLM 0.16.0 基线仅 25 天。MI355X Kimi K2.5 MXFP4 上的每一个工作点都从一个几乎不可用、只有单点延迟下限的状态，被重写为一条可达 78.9 tok/s/user 低批次交互性和 2,687 tok/s/GPU 峰值吞吐量的完整 Pareto 前沿。这正是我们构建 [InferenceX](https://github.com/SemiAnalysisAI/InferenceX) 自动化基准测试的原因——高效地捕获并报告此类变化。
 
-我们在 [InferenceXv2](https://inferencex.semianalysis.com/blog/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper) 中对 AMD Kimi K2.5 推理提出的最持续的批评之一是可组合性。MI355X 在 CDNA4 上的硅片能力在 tensor-core 层面与 B200 有竞争力，但 AMD 的 ROCm 和 vLLM 路径并不总能释放该能力。这在推理配方仍在成熟中的新一代前沿 MoE 模型上尤为明显。
+我们在 [InferenceXv2](https://inferencex.semianalysis.com/zh/blog/inferencex-v2-nvidia-blackwell-vs-amd-vs-hopper) 中对 AMD Kimi K2.5 推理提出的最持续的批评之一是可组合性。MI355X 在 CDNA4 上的硅片能力在 tensor-core 层面与 B200 有竞争力，但 AMD 的 ROCm 和 vLLM 路径并不总能释放该能力。这在推理配方仍在成熟中的新一代前沿 MoE 模型上尤为明显。
 
 <DashboardCTA href="https://inferencex.semianalysis.com/inference">
   点击查看完整 InferenceX 仪表板 →