diff --git a/.gitignore b/.gitignore
index 475b250..593147e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,4 +27,4 @@ config.yaml
 .playwright-mcp/
 
 # Log files (dual-write logging)
-coding-proxy.log*
+.logs/
diff --git a/AGENTS.md b/AGENTS.md
index 30d9d7a..ea86087 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -2,15 +2,11 @@
 
 ## Collaboration Protocol (协作协议)
 
-本文件旨在规范 AI Agent（Claude Code、Antigravity 等）在本项目中的代码与文档协作行为。
+本文件旨在规范 AI Agent（Claude Code、Antigravity 等）在本项目中的代码与文档协作行为。项目定位详见 [README.md](./README.md)。
 
 - **Core Language**: Output MUST be in **Chinese (Simplified)** unless serving code/technical constraints.
 - **Tone**: Professional, precise, and evidence-based.
 
-## Project Positioning (项目定位)
-
-参考 README.md
-
 ## Engineering Code of Conduct (工程行为准则)
 
 **Core Philosophy**: **Entropy Reduction (熵减)**. 通过上下文锚定、复用驱动与标准化流水线，对抗软件系统的无序熵增。
@@ -19,67 +15,44 @@
 
 - **Context-Driven (上下文驱动)**: 上下文是第一性要素 (Context Quality First)。任何变更需建立在深度理解之上（CDD），拒绝基于关键字匹配的机械式修改。
 - **Minimal Intervention (最小干预)**: 遵循奥卡姆剃刀与 YAGNI 原则，仅实施必要的变更，推崇演进式设计 (Evolutionary Design) 而非过度设计。
-- **Evidence-Based (循证工程)**: 杜绝主观臆断，核心决策需以权威文献（IEEE 格式）为佐证，构建 Feedback Loops 以验证假设。
-- **Systemic Integrity (系统完整性)**: 具备全局视角与二阶思维 (Second-Order Thinking)，评估变更对上下游依赖及整个生态（Engine, Adapter, Agent, UI）的“涟漪效应”，优先保障整体稳定性与逻辑自洽。
+- **Evidence-Based (循证工程)**: 杜绝主观臆断，核心决策需以**最新**且**权威**的文献（IEEE 格式）为佐证，构建“设计-实现-验证”的完整反馈闭环，确保每一项工程行动都能产生可观测的反馈信号（测试、日志、监控），以验证假设并指导迭代。
+- **Systemic Integrity (系统完整性)**: 具备全局视角与二阶思维 (Second-Order Thinking)，评估变更对上下游依赖及整个生态（Engine, Adapter, Agent, UI）的“涟漪效应”，不只关注变更的直接结果，更要预测“结果的结果”（如引入缓存导致的陈旧数据、重试机制引发的雪崩），优先保障整体稳定性与逻辑自洽。
+- **Knowledge Crystallization (知识结晶)**: 将系统视为有机体，通过将工程错误与 AI 失败案例转化为经验约束 (Negative Prompts) 和持久化知识，驱动系统的自我进化与持续熵减。
+- **Proactive Navigation (主动导航)**: 智能体不应止步于被动响应，需即时转化为“领航者”。在交付任务结果的同时，**必须**基于上下文预判并提出**下一步最佳行动建议 (Next Best Action)**，不仅交付“答案”，更要交付“路径”，消除用户决策的认知摩擦。
 
 ### 法 (Strategy - 架构原则)
 
-- **Plan Node Default (默认规划模式)**: 面对任何非琐碎任务（预估步骤 > 3 或涉及架构级决策），**必须**率先进入 Plan 模式。规划产物需明确界定：功能边界、边缘 Case 应对策略、与现有逻辑的交互锚点以及预计改动的爆炸半径。
+- **Plan-First Default (规划先行)**: 面对任何非琐碎任务（预估步骤 > 3 或涉及架构级决策），**必须**率先进入 Plan 模式。规划产物需明确界定：功能边界、边缘 Case 应对策略、与现有逻辑的交互锚点以及预计改动的爆炸半径。
 - **Subagent Strategy (子代理并发策略)**: 面对高复杂度命题，严禁主 Agent 单点统揽。应贯彻“算力换空间”思路，果断编排 Subagent 进行任务拆解与并行攻坚，主 Agent 的职责需严格收敛于上下文协同与最终成果的组装整合。
 - **Verification Before Done (交付前验证定式)**: 严禁在缺乏确凿运行证据的情况下标记任务为“已完成”。交付阶段**强制要求**提供客观自证材料：Diff 变更分析、测试用例覆盖、实施日志截图及核心链路边缘 Case 验证结果，并时刻以“方案是否能通过 Staff Engineer 严格审查”的视角自检。
-- **Reuse-Driven (复用驱动)**: Composition over Construction。系统变更**必须**主动参考业界经典设计模式与最佳实践。在进入实质性编码前，需率先对相关领域的成熟范式进行深度调研，并结合当前项目上下文输出充分的关联分析与方案梳理。坚决贯彻“拿来主义”，优先通过组合与集成来构建系统，防范闭门造车与重复造轮子。
+- **Reuse-Driven (复用驱动)**: Compose over Reinvent。系统变更**必须**主动参考业界经典设计模式与最佳实践。在进入实质性编码前，需率先对相关领域的成熟范式进行深度调研，并结合当前项目上下文输出充分的关联分析与方案梳理。坚决贯彻“拿来主义”，优先通过组合与集成来构建系统，防范闭门造车与重复造轮子。
 - **Boundary Management (边界管理)**: 严控模块/Agent 间的职责边界与契约，确保高内聚低耦合，防范隐式依赖穿透。
 - **Orthogonal Decomposition (正交分解)**: 坚持“正交地提取概念主体”。识别系统中独立变化的维度并进行解耦（如机制与策略分离），确保单一概念主体的变更具备局部性，避免逻辑纠缠。
-- **Feedback Loops (反馈闭环)**：构建“设计-实现-验证”的完整闭环，确保每一项工程行动都能产生可观测的反馈信号（测试、日志、监控），以验证假设并指导迭代。
-- **Evolutionary Design (演进式设计)**: 将系统视为有机体，通过将 AI 错误转化为经验约束 (Negative Prompts) 和持久化知识，实现系统的自我进化与熵减。
-- **Second-Order Thinking (二阶思维)**：不只关注变更的直接结果，更要预测“结果的结果”（如引入缓存导致的陈旧数据、重试机制引发的雪崩），未雨绸缪防范隐性风险。
 - **Single Source of Truth (单一事实源)**：严格维护唯一的权威定义源。引用时**必须**使用轻量级指针 (Link/ID) 而非数据副本 (Copy-Paste)，从根源消除断裂 (Split-Brain) 风险。
-- **Proactive Navigation (主动导航)**: 智能体不应止步于被动响应，需即时转化为“领航者”。在交付任务结果的同时，**必须**基于上下文预判并提出**下一步最佳行动建议 (Next Best Action)**。不仅交付“答案”，更要交付“路径”，消除用户决策的认知摩擦，确保持续的熵减动量。
 
 ### 术 (Tactics - 执行规范)
 
-- **Vibe Coding Pipeline**: 遵循 **Specification-Driven (规划驱动)** + **Context-Anchored (上下文锚定)** + **AI-Pair (AI 结对)** 模式，将开发固化为可审计的流水线，避免代码腐化为无法维护的“大泥球 (Big Ball of Mud)”。
-- **Visual Documentation (图文并茂)**: 对于复杂逻辑，优先使用 Mermaid 图表（Sequence/Flowchart/Class）辅助说明，构建“图文并茂”的直观文档。
-- **Direct Hyperlinking (直接跳转)**: 在文档中提及 Repo 内其他资源（文档/代码）时，**必须**构建可跳转的相对路径链接（如 `[Doc Name](./path.md)`），严禁使用“死文本”引用，以降低信息检索熵。
+- **Structured AI-Pair Pipeline (规范化 AI 结对流水线)**: 遵循 **Specification-Driven (规约驱动)** + **Context-Anchored (上下文锚定)** + **AI-Pair (AI 结对)** 模式，将开发固化为可审计的流水线，避免代码腐化为无法维护的“大泥球 (Big Ball of Mud)”。
 - **Operational Excellence (卓越运营)**:
-  1. **Git Hygiene**: 如非显性要求，严禁调用 git commit；
+  1. **Git Discipline**: 默认严禁调用 git commit；当用户显式要求提交时，一律使用 Claude Code 的自定义 Slash Command: `/commit-no-push` 进行操作（若非 Claude Code 运行环境，则读取 /commit-no-push 命令中的规则执行）。严禁执行 Rebase；
   2. **Temp Management**: 临时产物（执行计划等）一律收敛至 `.temp/` 并及时清理；
   3. **Link Validity**: 确保所有引用的 URL 可访问且具备明确的上下文价值；
-  4. **Git Commit**: 在需要提交变更到 Git 时，一律使用 Shell 调用 Claude Code 的自定义 Slash Command: `/commit` 进行 git commit 操作（若环境中未安装 Claude Code，则直接读取 `~/.claude/commands/commit.md`，按照其中的规则进行 git commit 操作）。不要执行 Rebase。
-  5. **Pre-commit Hooks**: 克隆仓库后执行 `uv run pre-commit install` 激活本地 Git hooks，使 Ruff lint（含 auto-fix）、Ruff format 及通用代码卫生检查在每次 commit 前自动运行。若 hooks 自动修复了问题，提交会被中断，执行 `git add -p` 审阅修复内容后重新提交即可。
-  6. **Issue**: 在 docs/issue.md 中维护你处理过的 Issue 摘要（问题描述、表因根因、处理方式、后续防范、同类问题影响与处理注意实现等），便于同类问题的跨上下文处理；注意识别相同 Issue，不要同 Issue 多处维护。
+  4. **Testing**: 统一在 tests/ 下维护测试用例，区分单元测试（unit）和集成测试（integration），所有测试的本地运行总时间控制在 3 min 以内；
+  5. **Pre-commit Hooks**: 首次克隆仓库后执行 `uv run pre-commit install` 激活本地 Git hooks，使 Ruff lint（含 auto-fix）、Ruff format 及通用代码卫生检查在每次 commit 前自动运行。若 hooks 自动修复了问题，提交会被中断，执行 `git add -p` 审阅修复内容后重新提交即可；
+  6. **Issue**: 在 [issue.md](docs/agents/issue.md) 中维护你处理过的 Issue 摘要（问题描述、表因根因、处理方式、后续防范、同类问题影响与处理注意事项等），便于同类问题的跨上下文处理；注意识别相同 Issue，不要同 Issue 多处维护；
 - **Package Management Standardization (包管理规范)**:
   1. **Python**: 严禁使用 pip/poetry，**必须**统一使用 `uv` 进行包管理与脚本执行（如 `uv run`）；
-  2. **JavaScript/TypeScript**: 严禁使用 npm/yarn，**必须**统一使用 `pnpm` 进行包管理与脚本执行。
+  2. **JavaScript/TypeScript**: 严禁使用 npm/yarn，**必须**统一使用 `pnpm` 进行包管理与脚本执行；
 - **Database Management**: 谨慎操作，数据迁移、测试等操作严禁将现有数据删除，谨慎操作数据迁移的回滚，防止数据被清理。
 - **In-depth and close to the facts**：系统且全面地进行问题的分析，深入贴近事实，如有疑问，需先发问，不要乱做决定。
-
-## Documentation Standards (文档规范)
-
-### Mermaid Visualization Norms (Mermaid 可视化规范)
-
-- **色彩语义与兼容性**：为图表节点配置具备语义辨识度的色彩，并确保在深色模式（Dark Mode）下具有极高的对比度与清晰度。
-- **逻辑模块化解构**：针对业务跨度较大的架构流程，强制采用 `subgraph` 容器进行层级解构与边界划分，以增强图表的自解说（Self-explaining）能力。
-
-### Reference Specifications (IEEE)
-
-为保障工程决策的可追溯性与学术严谨性，核心引用需遵循 **IEEE 标准引用格式**。
-
-> **模版准则**：[编号] 作者缩写. 姓, "文章标题," _刊名/会议名缩写 (斜体)_, 卷号, 期数, 页码, 年份.
-
-```latex
-[1] A. Author, B. Author, and C. Author, "Title of paper," *Abbrev. Title of Journal*, vol. X, no. Y, pp. XX–XX, Year.
-```
-
-**引用实践**
-
-- **文内锚定**：采用标准上标链接形式：`描述内容<sup>[[1]](#ref1)</sup>`。
-- **文献索引**：底层采用 HTML 锚点 `id` 实现跳转稳定性。
-
-```latex
-<a id="ref1"></a>[1] A. Vaswani et al., "Attention is all you need," Adv. Neural Inf. Process. Syst., vol. 30, pp. 5998–6008, 2017.
-```
-
-## Knowledge Map (知识索引)
-
-(WIP)
+- **Browser Validation Protocol (浏览器验证准则)**：Agent 不得自行完成、绕过或模拟任何 OAuth / SSO 认证流程，所有登录态均来源于用户已认证的 Chrome 主 profile（真实用户登录态）。完整协议（连通性自检、凭证管理、E2E 集成、实机回归等）详见 [浏览器验证协议](./docs/agents/browser-validation.md)；
+  1. **安全红线**：禁止在 Sandbox 浏览器中跳转 Google 同意屏；禁止以模拟用户或第三方账号替代真实登录态；禁止要求用户在 chat 中粘贴密码、Cookie 或验证码；
+- **Knowledge Map (知识索引)**：项目所有文档索引统一维护在 [知识索引](./docs/agents/knowledge-map.md)，并在文档目录变更时即时同步更新；
+- **Documentation Standards (文档规范)**：
+  1. **Visual Documentation (图文并茂)**: 对于复杂逻辑，优先使用 Mermaid 图表并遵循 **Mermaid Visualization Norms (Mermaid 可视化规范)**，构建“图文并茂”的直观文档；
+     - **色彩语义与兼容性**：为图表节点配置具备语义辨识度的色彩，并确保在深色模式（Dark Mode）下具有极高的对比度与清晰度；
+     - **逻辑模块化解构**：针对业务跨度较大的架构流程，强制采用 `subgraph` 容器进行层级解构与边界划分，以增强图表的自解说（Self-explaining）能力；
+  2. **语言叙事**：用语精准，叙事完备，行文专业，聚焦核心，篇幅精炼，形象具体，体现真实作用与用户吸引性，字数恰当；
+  3. **Direct Hyperlinking (直接跳转)**: 在文档中提及 Repo 内其他资源（文档/代码）时，**必须**构建可跳转的相对路径链接（如 `[Doc Name](./path.md)`），严禁使用“死文本”引用，以降低信息检索熵；
+  4. **实操截图**：文档需要引入浏览器实操截图时，需自行通过默认浏览器打开相关页面并现场实操截图，将截图保存至文档所在路径后在文档中引用；
+- **Reference Specifications (IEEE)**：为保障工程决策的可追溯性与学术严谨性，核心引用需遵循 [reference-specifications.md](docs/agents/reference-specifications.md) 中定义的 IEEE 标准引用格式；
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0fb0f1d..f745ec1 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,30 @@
 
 ## [Unreleased]
 
+## [v0.5.0](https://github.com/ThreeFish-AI/coding-proxy/releases/tag/v0.5.0) - 2026-05-27
+
+> [!IMPORTANT]
+>
+> **🚀 Model Calling 实时状态！**
+>
+> 模型并发与排队深度一目了然，运行时动态调整每个模型并行度，预防 vendor 侧的 429 幺蛾子。
+
+![model-calling](assets/model-calling-v0.5.0.png)
+
+### ✨ 核心亮点
+
+- feat(concurrency): 新增 Model Calling 实时状态模块，可视化每模型并发与排队深度，支持运行时动态修改每模型并行度 (#250) (#251)
+- feat(zhipu): 新增每模型并发限制，默认 3 个并行请求 FIFO 排队 (#248)
+- feat(zhipu): 为 429 Rate Limit 添加指数退避重试挽回机制 (#242)
+
+### 🔧 更多特性
+
+- fix(antigravity): 修复 v1internal 模式检测逻辑并新增 E2E 测试 (#234)
+- fix(routes): 修复 count_tokens 路由对 target_vendor.name 的错误属性访问 (#235)
+- fix(vendor-channels): 修复 zhipu→anthropic 通道 tool_use/tool_result 配对漏洞 (#236)
+- fix(native-api): 修复 Gemini :verb 路径中 %3A URL 编码导致上游 400 的兼容问题 (#237)
+- fix(zhipu): 诊断首选 tier 语义拒绝降级问题，增强可观测性并提取跨供应商清洗共享函数 (#243)
+
 ## [v0.4.0](https://github.com/ThreeFish-AI/coding-proxy/releases/tag/v0.4.0) — 2026-05-01
 
 > [!IMPORTANT]
diff --git a/README.md b/README.md
index 1383338..6cb7211 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ When you're deeply immersed in your coding "zone" with **Claude Code** (or any A
 ## 🌟 Core Features
 
 <div align="center">
-    <img src="assets/dashboard-v0.2.4.png">
+    <img src="assets/dashboard-v0.4.0.png">
 </div>
 
 - **⛓️ N-tier Chained Failover**: Autonomous descending sequence, supporting Claude's official plans, as well as Coding Plans from GitHub Copilot, Google Antigravity, Z AI, MiniMax, Alibaba Qwen, Xiaomi, Kimi, Doubao, etc.
diff --git a/assets/dashboard-v0.2.4.png b/assets/dashboard-v0.2.4.png
deleted file mode 100644
index aef75f7..0000000
Binary files a/assets/dashboard-v0.2.4.png and /dev/null differ
diff --git a/assets/dashboard-v0.4.0.png b/assets/dashboard-v0.4.0.png
new file mode 100644
index 0000000..14e985b
Binary files /dev/null and b/assets/dashboard-v0.4.0.png differ
diff --git a/assets/model-calling-v0.5.0.png b/assets/model-calling-v0.5.0.png
new file mode 100644
index 0000000..1b1e31b
Binary files /dev/null and b/assets/model-calling-v0.5.0.png differ
diff --git a/docs/agents/browser-validation.md b/docs/agents/browser-validation.md
new file mode 100644
index 0000000..ee4b705
--- /dev/null
+++ b/docs/agents/browser-validation.md
@@ -0,0 +1,172 @@
+# Browser Validation Protocol（浏览器验证协议）
+
+> 由 [AGENTS.md §Browser Validation Protocol](../../AGENTS.md) 锚定的浏览器自动化与认证态使用协议。本协议是工程行为准则的子集，**任何 AI Agent 在执行浏览器自动化任务前必须完整遵循**。
+>
+> **协议版本**：v1.0 ｜ **生效范围**：所有面向本仓库的 AI Agent 协作场景
+>
+> **关联工具**：`chrome-devtools` MCP、`claude-in-chrome` MCP、`playwright` MCP
+
+[TOC]
+
+---
+
+## 1. 协议目的
+
+为 AI Agent 在浏览器自动化场景下提供**统一、可审计、不可绕过**的认证态使用规范，解决以下问题：
+
+- AI Agent 不应也不可代用户决策"我是谁"——所有登录态归属问题必须由用户本人主导
+- 浏览器自动化能力一旦失控，可能在用户毫不知情时产生不可撤销的副作用（消息发送、订单提交、权限变更等）
+- OAuth / SSO 同意屏在自动化上下文中存在被绕过的潜在风险，违反平台 ToS 与基本伦理
+
+本协议通过"原则—红线—操作流程—验证"四层结构，将上述问题约束在工程可控范围内。
+
+---
+
+## 2. 核心原则
+
+| 原则                       | 具体含义                                                                                              |
+| -------------------------- | ----------------------------------------------------------------------------------------------------- |
+| **登录态归属于用户**       | Agent 不得自行完成、绕过或模拟任何 OAuth / SSO 认证流程；所有登录态来源于用户已认证的 Chrome 主 profile |
+| **真实主 profile 优先**    | 浏览器自动化默认接入用户日常使用的 Chrome 主 profile，复用其 Cookie / Session / SSO 状态              |
+| **可审计、可回放**         | 浏览器路径关键操作（点击、表单填写、跳转）应留下可被 GIF 回放或日志追溯的痕迹                         |
+| **最小副作用**             | 优先以只读方式（查看、提取、断言）完成任务；写操作（提交、发送）需在协议第 5 节框架下显式确认         |
+
+---
+
+## 3. 安全红线
+
+> 以下条款**不可协商**，违反任一条款即视为协议违反。
+
+1. **禁止跳转 Google 同意屏**：在 Sandbox / 自动化浏览器环境内**严禁**触发 Google OAuth 同意屏跳转。同意屏只能在用户主 profile 的真实浏览会话中由用户本人完成。
+2. **禁止模拟身份**：禁止以模拟用户身份、虚构 Cookie、第三方账号或测试账号替代真实登录态完成任务。
+3. **禁止凭证泄露**：禁止要求用户在 chat 中粘贴密码、Cookie、Session Token、二维码扫描结果或任何形式的验证码（含 6 位数字、短信、TOTP）。
+4. **禁止跨账号操作**：在多用户环境下，Agent 不得在未经显式确认的情况下切换 profile 或账号身份。
+5. **禁止规避 ToS**：不得通过 Headless 模式、UA 伪装、Captcha 自动求解等方式规避目标站点的服务条款。
+6. **禁止下载执行**：浏览器路径触发的任何文件下载需在主对话中显式确认；下载文件不得自动执行或注入到项目目录。
+
+---
+
+## 4. 连通性自检（Connectivity Probe）
+
+执行浏览器自动化任务前，Agent **必须**完成以下自检序列：
+
+| 步骤                  | 操作                                                              | 通过判据                                                  |
+| --------------------- | ----------------------------------------------------------------- | --------------------------------------------------------- |
+| 4.1 工具可用性        | 列出当前会话可用的 MCP 工具                                       | 至少存在 `chrome-devtools` / `claude-in-chrome` 之一      |
+| 4.2 主 profile 加载   | 通过工具调用获取当前 Tab 列表或 Page 列表                         | 返回非空，且 Tab 标题来自用户真实浏览历史而非空白会话     |
+| 4.3 目标域名可达      | 通过 `navigate_page` 或 `browser_navigate` 访问目标域名首页       | HTTP 200 / 已登录态正常加载                               |
+| 4.4 登录态识别        | 在目标域名首页定位"已登录"标识（头像、用户名、退出按钮）          | 能在 Snapshot / AOM 中找到一致标识                        |
+| 4.5 异常路径分类      | 若 4.4 失败，按"未登录 vs 会话过期 vs 拒绝服务"分类，**不**自动重登 | 输出明确分类，转入第 5 节的用户接力流程                   |
+
+> **失败处置**：自检任一步骤失败，Agent **必须**停止任务、向用户输出诊断结论，**不得**尝试 OAuth / 凭证补救。
+
+---
+
+## 5. 凭证管理（Credential Lifecycle）
+
+### 5.1 发现路径
+
+凭证通过以下路径**被动**发现，Agent **不**主动读取、导出或日志化：
+
+- 浏览器 Cookie / LocalStorage（仅由浏览器引擎内部使用）
+- 浏览器扩展（如 Claude in Chrome）持有的 Session
+- 用户在 chat 中以"我刚登录了 X"形式给出的事实陈述（非凭证本身）
+
+### 5.2 过期检测信号
+
+| 信号                                     | 处置                                            |
+| ---------------------------------------- | ----------------------------------------------- |
+| HTTP 401 / 403                           | 转 5.3 接力流程                                 |
+| 重定向到登录页（含 `/login`、`/signin`） | 转 5.3 接力流程                                 |
+| 同意屏触发（OAuth scope 变更）           | **立即停止**，由用户在主 profile 完成同意       |
+| Captcha 出现                             | **立即停止**，输出"需用户介入"                  |
+
+### 5.3 用户接力流程（Re-authentication Handoff）
+
+```
+1. Agent 检测到登录态失效
+2. Agent 向用户输出：（a）失效域名 （b）建议在用户主 profile 完成登录的指引
+3. Agent 暂停浏览器任务，**不**触发任何登录流程
+4. 用户在真实浏览器完成登录后，回到 chat 通知 Agent
+5. Agent 重新执行第 4 节连通性自检
+6. 自检通过后恢复任务
+```
+
+### 5.4 凭证刷新约束
+
+- Agent **不**调用任何 refresh_token / device_code 接口
+- Agent **不**触发邮箱链接、短信验证码、TOTP 输入
+- 凭证刷新由用户在原始登录路径自主完成
+
+---
+
+## 6. E2E 集成（End-to-End Integration）
+
+### 6.1 与项目 OAuth 模块的边界
+
+本项目内置 GitHub Device Flow 与 Google OAuth 模块（`src/coding/proxy/auth/`）。浏览器协议与之的边界如下：
+
+- **项目 OAuth 模块**：服务端运行时凭证管理，由 `coding-proxy auth login/reauth` CLI 触发，目标是给 **proxy 自身**获取上游 API 凭证
+- **本协议**：客户端浏览器自动化场景，目标是让 **Agent 协助用户**完成日常任务（如查文档、填表单）
+
+二者**互不调用**：Agent 不调用 `coding-proxy auth` 替用户完成项目 OAuth；项目 OAuth 流程也不依赖本协议第 4 节自检。
+
+### 6.2 与 CLI 命令的协同
+
+| 场景                            | 由谁触发                  |
+| ------------------------------- | ------------------------- |
+| 给 proxy 注入 GitHub PAT        | 用户运行 `auth login`     |
+| 给 proxy 注入 Google OAuth      | 用户运行 `auth login`     |
+| 凭证过期重认证                  | 用户运行 `auth reauth`    |
+| 浏览器查看 GitHub Token 状态    | Agent 通过本协议浏览器访问 |
+
+### 6.3 测试用例的浏览器隔离
+
+- 单元测试（`tests/unit/`）**不**触发任何浏览器路径
+- 集成测试（`tests/integration/`）**不**触发任何浏览器路径
+- 浏览器路径仅在交互式 Agent 会话中触发，不进入 CI 自动化测试链路
+
+---
+
+## 7. 实机回归（Real-Device Regression）
+
+### 7.1 提交前的浏览器路径自检清单
+
+涉及浏览器路径的改动在提交前需手工核验：
+
+- [ ] 第 4 节连通性自检在用户主 profile 通过
+- [ ] 第 3 节安全红线未被触碰（特别是同意屏、密码粘贴）
+- [ ] 浏览器路径的关键操作有 GIF / Snapshot 留痕
+- [ ] 失败路径输出明确的用户接力指引
+
+### 7.2 与 CI 的边界
+
+CI 流水线（详见 [ops/ci-cd.md](../ops/ci-cd.md)）**不**触发浏览器自动化路径。所有浏览器侧验证均在本地实机完成。
+
+### 7.3 回归失败上报
+
+若实机回归失败：
+
+1. 在 [docs/agents/issue.md](./issue.md) 记录现象、根因、防范
+2. 若涉及协议本身缺陷，提交 PR 修订本文件并同步 [AGENTS.md](../../AGENTS.md) 锚点
+3. 不通过的 Agent 行为应在 [knowledge-map.md](./knowledge-map.md) 标注为已知问题
+
+---
+
+## 8. 引用规范
+
+- 本协议章节可被 [AGENTS.md](../../AGENTS.md) / [CLAUDE.md](../../CLAUDE.md) 通过标题锚点形式引用
+- 修订本协议**必须**在 [docs/agents/issue.md](./issue.md) 留存背景与决策记录
+- 协议条款发生变更时，需同步检查 [AGENTS.md §Browser Validation Protocol](../../AGENTS.md) 的兜底原则与本协议是否一致
+
+---
+
+## 附录 A：术语对照
+
+| 术语                | 说明                                                              |
+| ------------------- | ----------------------------------------------------------------- |
+| 主 profile          | 用户日常使用的 Chrome / Edge 浏览器档案，含真实登录态             |
+| Sandbox 浏览器      | 自动化工具启动的临时/隔离浏览器，无真实用户态                     |
+| 同意屏（Consent）   | OAuth 流程中用户授予权限范围的页面                                |
+| 接力流程            | Agent 停止 → 用户介入完成 → Agent 恢复 的三段式协作               |
+| 实机回归            | 在用户真实终端（非 CI）完成的端到端验证                           |
diff --git a/docs/agents/issue.md b/docs/agents/issue.md
new file mode 100644
index 0000000..c202b8a
--- /dev/null
+++ b/docs/agents/issue.md
@@ -0,0 +1,280 @@
+# Issue 处理档案
+
+> 维护已处理过的 Issue 摘要（问题描述、表因根因、处理方式、后续防范、同类问题影响与处理注意事项），便于同类问题的跨上下文处理。识别相同 Issue 时应在原条目追加复盘，避免同 Issue 多处维护。
+
+---
+
+## streaming usage parse failed: 'NoneType' object has no attribute 'get'
+
+**问题描述**
+
+OpenAI 兼容 SSE 流式响应过程中，单次请求日志反复刷出数十条 WARNING：
+
+```
+WARNING streaming usage parse failed: 'NoneType' object has no attribute 'get'
+```
+
+警告本身被上层 `try/except` 吞掉不影响主链路，但日志噪声严重，且每帧都丢失了 usage 累加。
+
+**表因**
+
+`StreamingUsageAccumulator.feed` 调用 `parse_usage_from_chunk` 解析 SSE chunk 时抛出 `AttributeError`。
+
+**根因**
+
+`src/coding/proxy/routing/usage_parser.py::parse_usage_from_chunk` 中 Anthropic message_start 与 Anthropic message_delta / OpenAI 两条分支都使用了脆弱的判空模式：
+
+```python
+if "usage" in data:        # 仅判断 key 存在
+    u = data["usage"]      # 但值可能是 null
+    u.get("output_tokens", 0)  # AttributeError
+```
+
+部分上游（含某些 OpenAI 兼容供应商）在中间 chunk 显式发送 `"usage": null` 占位帧，`in` 检查通过但取出的是 `None`。
+
+**处理方式**
+
+将两处 guard 统一改为 `u = container.get("usage"); if isinstance(u, dict):`，既排除缺省也排除 null，并顺手移除内部冗余的 `if isinstance(u, dict):` 包装层（已被外层 guard 覆盖）。同时新增三个回归用例覆盖 `data.usage = null` / `message.usage = null` / null 帧后跟有效帧三种场景。
+
+**后续防范**
+
+- 解析外部 SSE / JSON 结构时，不要单独使用 `if key in data` 作为安全 guard，应统一采用 `value = data.get(key); if isinstance(value, dict):` 的双重保护，同时排除缺省与显式 null。
+- 对 try/except 包裹的 WARNING 路径要保持警觉：异常被吞不代表无害，重复刷屏的同类警告往往暗示防御性 guard 过窄，需要回溯至根因修复，而非依赖 except 兜底。
+
+**同类问题影响与处理注意事项**
+
+- 本仓库内 `parse_usage_from_chunk` 的 Gemini `usageMetadata` 分支 (line ~219) 已经使用 `isinstance(um, dict)` 防御，不受影响，可作为参考实现。
+- 检查其他解析器 (如 routing / vendor adapter 层) 是否还有 `if "key" in data: v = data["key"]; v.get(...)` 这种模式，必要时同步加固。
+
+---
+
+## anthropic 400: `tool_use` ids were found without `tool_result` blocks immediately after
+
+**问题描述**
+
+zhipu → anthropic 通道流式请求偶发 400, 错误形如:
+
+```
+WARNING anthropic stream error: status=400 body=...
+  messages.3: `tool_use` ids were found without `tool_result` blocks immediately after: toolu_normalized_2.
+INFO  Failover: anthropic → zhipu (reason: HTTP 400)
+INFO  Tier zhipu stream succeeded (took over from failed tier: anthropic)
+```
+
+同一请求伴随 `Applied transition channel zhipu → anthropic: rewritten_N_srvtoolu_ids, misplaced_tool_result_relocated, stripped_M_thinking_blocks` 的 adaptations 但**没有 `orphaned_tool_use_repaired`**, 即转换层主观上认为已配对、但 Anthropic 仍判定结构不合规。Failover 至 zhipu 后请求成功, 证明上游消息体本身没有损坏, 问题出在 zhipu→anthropic 通道转换过程引入了不一致。
+
+**表因**
+
+`src/coding/proxy/convert/vendor_channels.py::_rewrite_srvtoolu_ids` 在单遍循环中同时承担 Case A (assistant 端 `server_tool_use` → `tool_use` 与 `srvtoolu_*` ID 重写) 与 Case B (任意位置 `tool_result.tool_use_id` 同步重写)。Case B 依赖 `id_map` 已被 Case A 填入。
+
+**根因**
+
+Zhipu GLM-5 流式响应偶发将 inline `tool_result` 块输出在**对应的 `server_tool_use` 块之前** (同 assistant content 内乱序), 或将 `tool_result` 放在更早的 user 消息中而对应 `tool_use` 在更晚的 assistant 消息。两种乱序下, 单遍扫描遍历到 `tool_result` 时 `id_map` 还是空 → `tool_result.tool_use_id` 不被改写, 停留在 `srvtoolu_X`; 随后 Case A 把对应 `tool_use.id` 改写为 `toolu_normalized_N`。
+
+后续 `enforce_anthropic_tool_pairing` Step A 提取这条 misplaced tool_result 时使用**旧 ID** 作为 `extracted_tool_results` 字典 key, Step F 用新 ID 去查 → 不命中 → 走 `existing_result_ids` 分支, 因为相邻 user 的 tool_result 已经被改写到新 ID, 该 uid 命中 `existing_result_ids` 被 continue 跳过, 于是 enforce 错误地认为完成配对、不产生 `orphaned_tool_use_repaired` 标签, 而被默默丢弃的 misplaced tool_result 本应填补到的 user 槽位实际上**仍然缺位**。最终 body 中某条 assistant 的 tool_use 在下一条 user 中找不到对应 tool_result → Anthropic 400。
+
+**处理方式**
+
+1. `_rewrite_srvtoolu_ids` 改为**两遍扫描**: Pass 1 仅遍历 assistant 消息收集 `id_map` (按 assistant 出现顺序分配, 保持序号兼容性); Pass 2 全量遍历改写任意 `tool_result.tool_use_id`。以"先建表、后改写"的次序消除时序耦合。
+2. 在 `enforce_anthropic_tool_pairing` 主循环末尾追加独立 helper `_enforce_pairing_sanity_pass`, 仅做检测+合成 `is_error=True` 占位 (不剥离、不重定位), 命中追加 `pairing_sanity_repaired` adaptation 并打 WARNING (含 message index 与 uid)。这层作为纵深防御, 在主循环未来重构时仍能稳定守住 Anthropic 配对约束。
+3. 新增回归测试覆盖三类场景: 同 assistant content 内乱序、跨消息边界 tool_result 早于 tool_use、端到端复现日志故障形态。新增 `TestEnforcePairingSanityPass` 独立测试套件确保兜底分支具备正向回归保护。
+
+**后续防范**
+
+- 任何在多 content block 之间存在**前向引用** (后出现的块定义的标识符被前面的块引用) 的就地改写逻辑, 都必须采用两遍扫描或全局表先建后用, 不可依赖遍历位置上 "上一次循环已经写入" 的隐含次序。
+- 纵深防御层 (sanity helper) 必须**独立可单测**, 而不是把 sanity 内嵌在主路径内部 — 否则主路径的快速通道会让 sanity 分支永远走不到正向测试, 缺乏回归保护。
+- adaptations 标签 (`pairing_sanity_repaired`) 与主循环标签 (`orphaned_tool_use_repaired`) 分离, 便于运维聚合时按层归因。
+
+**同类问题影响与处理注意事项**
+
+- 历史教训: commit `9061cd0` 曾经实现"两遍扫描 + sanity helper", 修复的正是这类问题, 但 commit `2bac9a7` revert 至 v0.3.0 时**连带回滚**了它 — revert 的真实目标是去除 `f497077` / `fdd4a92` / `43488a1` 引入的"zhipu 自清理通道"和"tool_result.id 注入"副作用, 两遍扫描属无辜方。**后续若再次需要 revert `vendor_channels.py`**, 必须先 `grep _enforce_pairing_sanity_pass` 与 `Pass 1` / `Pass 2` 注释, 确认这两段是核心修复而非可以一起回滚的实验性代码。
+- 类似 "vendor 私有 ID 跨消息体改写" 场景 (如 doubao、minimax 未来若引入类似机制), 实现时同样应当遵循"先全局收集 id_map、后统一改写"的两阶段模式。
+- 单元测试覆盖"块顺序敏感"类 bug 时, 建议在用例命名中显式标注顺序条件 (如 `test_two_pass_handles_inline_tool_result_before_server_tool_use`), 让未来 reviewer 一眼看出测试的边界价值。
+
+---
+
+## count_tokens 路由 `AttributeError: 'ZhipuVendor' object has no attribute 'name'`
+
+**问题描述**
+
+后台日志反复出现 `POST /v1/messages/count_tokens?beta=true 500 Internal Server Error`，并伴随：
+
+```
+File ".../coding/proxy/server/routes.py", line 153, in count_tokens
+    channel_fn = get_transition_channel(source, target_vendor.name)
+AttributeError: 'ZhipuVendor' object has no attribute 'name'
+```
+
+同一时间窗口内大量请求 200 OK、少量请求 500，呈"间歇性"故障特征。
+
+**表因**
+
+`src/coding/proxy/server/routes.py` 的 `count_tokens` 在 153 / 160 两处访问 `target_vendor.name`，触发 `AttributeError` 被 ASGI 中间件捕获返回 500。
+
+**根因**
+
+`BaseVendor` 仅暴露**抽象方法** `get_name() -> str`（`src/coding/proxy/vendors/base.py:75-77`），所有派生类（`AnthropicVendor`、`ZhipuVendor`、`CopilotVendor`、`MinimaxVendor`、`DoubaoVendor`、`KimiVendor` 等）均通过 `_vendor_name` 类属性配合 `get_name()` 返回名称 —— **并无 `name` 实例属性**。该错误访问在 lint/类型检查阶段无告警（因 `BaseVendor` 未在类型系统中约束 `name` 字段），仅在运行时触发。
+
+间歇性原因：第 152 行 `if source:` 是守卫；`source` 由 `infer_source_vendor_from_body(body)`（`src/coding/proxy/convert/vendor_channels.py:357-394`）从请求体启发式推断，仅当出现 zhipu 私有产物（`srvtoolu_*` 形式的 `tool_use.id` 或 `server_tool_use` / `server_tool_use_delta` 类型 content block）时返回 `"zhipu"`，否则 `None`。纯净的首轮 count_tokens 请求 `source is None` 自然绕过 153 行，因此 200/500 共存。
+
+**处理方式**
+
+1. `routes.py:153,160` 将 `target_vendor.name` 改为 `target_vendor.get_name()`，并将结果提取到局部变量 `target_name` 复用，避免重复方法调用与日志/调用点不一致风险。
+2. `tests/test_app_routes.py` 新增 `test_count_tokens_triggers_zhipu_to_target_channel`：通过注入 `server_tool_use` + `srvtoolu_*` 让 `infer_source_vendor_from_body` 返回 `"zhipu"`，断言返回 200 且 debug 日志含 `"count_tokens channel zhipu → anthropic"`，证明通道被实际触发。此前 6 个 count_tokens 测试的请求体都是纯净的、未触达该分支，是 bug 长期漏过的根因。
+
+**后续防范**
+
+- 跨模块引用 Vendor 实例字段时，**统一通过 `BaseVendor` 暴露的方法**（`get_name()`、`map_model()` 等），避免直接访问派生类未定义的"假属性"。
+- 长期演进可考虑在 `BaseVendor` 增加 `@property name` 指向 `get_name()`，将契约前移到类型系统由 mypy / pyright 拦截 —— 该重构属"演进式设计"范畴，不在本次最小干预范围内。
+- 测试覆盖原则：路由层涉及"内容感知"分支（如 `infer_source_vendor_from_body`）时，至少补一个让分支命中的最小用例，避免守卫掩盖代码缺陷。
+
+**同类问题影响与处理注意事项**
+
+- 已 `grep -rn "vendor\.name\b" src/` 全仓扫描，确认 `target_vendor.name | vendor.name` 误用仅 routes.py 的这两处，已随本次修复一并消除。`/v1/messages` 主链路在 executor 中调用 `tier.name`（`Tier` 对象的合法 dataclass 属性），与 vendor 实例 `name` 无关，不受影响。
+- 若未来新增 Vendor 子类，仍只需实现 `get_name()` 抽象方法；外部调用方应遵循同一契约，本档案的修复模式可作为参考。
+
+---
+
+## Gemini embedding 透传至 Vertex AI 上游返回 `request body doesn't contain valid prompts`
+
+**问题描述**
+
+通过本代理调用 Gemini embedding 模型时，上游返回 400：
+
+```
+litellm.BadRequestError: GeminiException BadRequestError -
+{"error":{"message":"request body doesn't contain valid prompts"}}
+POST /api/gemini/v1beta/models/gemini-embedding-001%3AbatchEmbedContents 400
+```
+
+litellm 报错日志中 URL 路径是 `:batchEmbedContents`，调用端疑似格式不兼容。
+
+**表因**
+
+litellm 按 Google AI Studio 格式构造请求：
+- 路径：`POST {api_base}/v1beta/models/{model}:batchEmbedContents`
+- Body：`{"requests": [{"model": "models/...", "content": {"parts": [{"text": "..."}]}}]}`
+
+但实际上游（如 `llms.as-in.io` 这类 Vertex AI 风格网关）只接受 Vertex AI 格式：
+- 路径：`POST {api_base}/v1beta1/publishers/google/models/{model}:embedContent`
+- Body：`{"content": {"parts": [{"text": "..."}]}}`
+
+且无 `batchEmbedContents` 端点。
+
+**根因**
+
+1. 代理 `NativeProxyHandler.dispatch()` 是字节级透传，对 embedding 端点未做协议适配，直接把 Google AI Studio 格式的 URL/Body 转给 Vertex AI 上游，路由不匹配。
+2. litellm `_check_custom_proxy()` 在自定义 `api_base` 场景下会丢失 `v1beta/` 版本前缀，发送 `{api_base}/models/{model}:verb`，使代理原有的 `OperationClassifier` 正则（要求 `v1beta/` 前缀）失配，进而走原始透传分支再次失败。
+
+**处理方式**
+
+1. `src/coding/proxy/native_api/operation.py`：放宽 Gemini 路径正则中的 `v1(?:beta1?)?/` 段为可选，兼容 litellm 丢失版本前缀的异常路径。
+2. `src/coding/proxy/native_api/handler.py`：在 `dispatch()` 中新增 Gemini embedding Vertex AI 适配分支：
+   - 仅当 `provider == "gemini"`、`operation in {"embedding", "embedding.batch"}`、且 `base_url` 非官方 `generativelanguage.googleapis.com` 时启用；
+   - `embedContent` → 重写路径为 `v1beta1/publishers/google/models/{model}:embedContent`，剥离 body 中的 `model` 字段；
+   - `batchEmbedContents` → 拆分为多次并发 `embedContent` 调用（`asyncio.gather`），聚合响应为 `{"embeddings": [...]}` 返回；
+   - 用量抽取累加各子请求的 `usageMetadata`。
+3. `tests/test_native_api_handler.py`：新增 3 个回归测试覆盖单次 / 批量 / 官方上游透传不变三类场景。
+
+**后续防范**
+
+- 协议适配层只对**非官方上游**生效，官方 `generativelanguage.googleapis.com` 仍走字节级透传，避免引入不必要的转换开销与协议偏差。
+- 上游路径分支的判定优先用 base_url 域名而非依赖网关行为特征，便于后续扩展（如 Vertex Express、其他 LLM gateway）时的精确匹配。
+- 真实链路验证：使用 litellm `embedding(api_base=..., api_key=...)` 单输入 / 多输入分别调用，确认返回 3072 维向量及正确批量计数。
+
+**同类问题影响与处理注意事项**
+
+- litellm 在 Gemini 其他端点（`generateContent` / `countTokens`）同样存在 `_check_custom_proxy` 丢失 `v1beta/` 前缀的 bug；本次仅放宽了 `operation.py` 中的路径正则（让分类器能识别此类异常路径），未对这些端点做格式转换，因为非 embedding 端点的 Google AI Studio / Vertex AI 请求体差异较小，多数上游兼容。如未来出现类似失配再做针对性适配。
+- 若上游网关同时支持 OpenAI `/v1/embeddings` 与 Vertex AI 路径，建议优先在客户端配置 OpenAI 兼容路径，减少协议转换链路。
+
+---
+
+## Dashboard Sessions 页 `Tokens` 列漏算缓存 Token
+
+**问题描述**
+
+Dashboard 的 **Sessions** 标签页中，每条会话的 `Tokens` 列与展开详情卡的 `Tokens` 值，仅统计 `input + output`，遗漏了 `cache_creation`（写缓存）与 `cache_read`（读缓存）。在长链路 Anthropic Prompt Cache 场景下，读取命中常常是 input/output 的数倍，导致 Sessions 页总量被显著低估，与 Overview 标签页（卡片、Token 时序图）跨页口径分裂。
+
+**表因**
+
+前端 `dashboard.py:1597 / 1614` 直接渲染 `s.total_tokens`，该值由 `/api/dashboard/sessions` 透传自 `token_logger.query_recent_sessions()` 的聚合结果。
+
+**根因**
+
+`src/coding/proxy/logging/db.py` 中两条按 `session_key` 分组的聚合 SQL 使用了不完整的求和口径：
+
+```sql
+SUM(input_tokens + output_tokens) AS total_tokens   -- 第 607 行（query_recent_sessions）
+SUM(input_tokens + output_tokens) AS total_tokens   -- 第 634 行（query_session_profile）
+```
+
+而同文件内 `query_usage()`（第 465–466 行分别 `SUM(...)` 四列）与 `query_total_tokens_by_vendor()`（第 584 行 `SUM(input + output + cache_creation + cache_read)`）已采用完整四项口径，构成了同文件内的口径双标。
+
+**处理方式**
+
+复用 `query_total_tokens_by_vendor` 的四项求和表达式，将两处 `total_tokens` 改写为：
+
+```sql
+SUM(input_tokens + output_tokens
+    + cache_creation_tokens + cache_read_tokens) AS total_tokens
+```
+
+不改动 API 返回结构、不新增字段、不改前端 detail-card——前端 `fmtTokens(s.total_tokens)` 调用无须变更。同时在 `tests/test_session_aware.py` 的 `test_query_recent_sessions_basic` / `test_query_session_profile_found` 中追加 `cache_creation_tokens` / `cache_read_tokens` 入参与完整口径断言，覆盖回归。
+
+**后续防范**
+
+- SQL 聚合层涉及"总 Tokens"概念时，必须保持**单一权威定义**（Single Source of Truth）：要么所有视图共用同一求和表达式，要么抽取为常量片段集中引用，杜绝多处独立维护造成的语义漂移。
+- 未来若引入新的 token 维度（如 reasoning_tokens、tool_tokens 等），需要全文检索 `SUM(input_tokens + output_tokens` 这一历史模式并同步补齐，避免出现新的口径分裂点。
+
+**同类问题影响与处理注意事项**
+
+- 历次 PR 中 cache token 字段的引入是渐进式的（schema 已有四列、`log()` 入参齐全、Overview 已全口径消费），但部分聚合视图的口径升级被遗漏；任何向 `usage_log` 增列后，**必须**审计所有 `SUM(input_tokens` / `SUM(output_tokens` 出现处的聚合表达式是否需要同步更新。
+- 跨标签页同一指标（如"总 Tokens"）的口径一致性，建议在添加新视图时主动与 Overview 现有口径做交叉核对，必要时在 SQL 注释中标注口径来源，便于后续 review。
+
+---
+
+## Zhipu vendor 间歇性 `[1210][API 调用参数有误]` 拒绝（诊断阶段）
+
+**问题描述**
+
+Zhipu vendor 作为首选 tier 时，处理 `claude-haiku-* → glm-5-turbo` 的部分请求被上游直接拒绝：
+
+```
+WARNING Tier zhipu semantic rejection
+  (type=invalid_request_error,
+   msg=[1210][API 调用参数有误，请检查文档。][...])
+  [model=claude-haiku-4-5-20251001, messages=1], trying next tier without recording failure
+INFO  Tier anthropic message succeeded (took over from failed tier: zhipu)
+```
+
+失败请求统一表现为 `duration<1s + tokens=[0 0 0 0]`，被 zhipu 在入口校验阶段直接拒绝、未消耗任何 token。两次观察窗口失败率分别为 4%（2026-05-23 22:24，glm-4.7 旧映射）与 27%（2026-05-25 17:26+，glm-5-turbo 当前映射），均触发降级至 anthropic / copilot。
+
+**表因**
+
+`is_semantic_rejection` 检测到 zhipu 返回 `invalid_request_error + 1210` 含「API 调用参数有误」中文标记，判定为语义拒绝，不记录失败并直接转入下一层 tier。1210 是智谱官方错误码，[官方文档](https://docs.bigmodel.cn/cn/api/api-code) 定义为「参数格式/类型不符规范」（区别于 1213「必需字段缺失」、1214「字段参数非法」）。
+
+**根因（已定位，修复中）**
+
+PR #247 (Step 1 v2) 部署后，2026-05-26 16:30–16:31 的诊断日志显示 8 次连续拒绝**全部携带 `thinking={"type": "adaptive"}`**（Anthropic Claude 4.x 新增的参数类型），而同一时段其他会话的请求持续成功。之前 curl 测试仅验证了 `{"type": "enabled"}`，未覆盖 `adaptive` 类型。GLM 可能不支持此特定类型值，导致 [1210] 参数校验失败。
+
+**处理方式（分阶段）**
+
+- **Step 1（PR #244，已合并）**：在 `executor.py::_build_semantic_rejection_diagnostic` 中输出 thinking / cache_control 相关字段 — 但证据反转，覆盖不足以定位真因。
+- **Step 1 v2（PR #247，已合并）**：扩展诊断函数覆盖 `system_kind|blocks(+cc)` / `tools` / `tool_choice` / 采样参数 / `stream` / `metadata_keys` / `content_types` / `body_bytes` 等维度。所有项「仅存在时输出」以控制日志噪声。配套 14 个单元测试（`TestBuildSemanticRejectionDiagnostic`）覆盖各字段组合。
+- **Step 2（进行中）**：基于 Step 1 v2 的日志证据，在 `ZhipuVendor._prepare_request` 中实现 **兼容转换**（而非移除）：
+  - `thinking.type="adaptive"` → `{"type": "enabled", "budget_tokens": 16000}`（保留 thinking 能力）
+  - 新增 `_build_zhipu_request_snapshot` 诊断快照，同时覆盖成功/失败请求，建立可对比证据链
+  - 扩展语义拒绝日志的错误体截断限制（200 → 500 字符），保留完整字段级诊断
+  - `metadata` 暂不处理（待进一步诊断确认兼容性）
+
+**后续防范**
+
+- **「无证据，不下结论」**：当初版诊断字段无法覆盖根因时，禁止反复猜测，应优先扩展诊断维度抓取更多线索。本次先扩展再修复的迭代节奏可作为同类「黑盒 API 报错」问题的范式。
+- **诊断字段设计原则**：所有诊断项应「仅存在时输出」，避免常态化噪声；输出格式紧凑（`key=val`）便于日志检索；参数值用 `!r:.N` 截断防止巨型对象灌入日志。
+- **错误码差异化**：智谱 12xx 系列错误码语义并不等价（1210 ≠ 1213 ≠ 1214），未来面对类似 `[code][message]` 形式的供应商错误时，应优先查阅其官方错误码字典，避免基于错误消息字面意思的误判。
+
+**同类问题影响与处理注意事项**
+
+- 其他薄透传 vendor（minimax / kimi / doubao / alibaba / xiaomi）共用 `NativeAnthropicVendor._prepare_request`，若它们也开始报「参数错误」类语义拒绝，可复用本次扩展的诊断函数定位差异。
+- 若证据指向 `tools` 字段（如工具 schema 不兼容）、`metadata` 字段（如自定义键被 zhipu 拒收）等具体路径，修复时应优先复用 `convert/vendor_channels.py` 中已有的 `normalize_for_zhipu` / `strip_thinking_blocks` 工具，避免在 vendor 内部重复实现剥离逻辑。
+- 部署 Step 1 v2 后，建议观察至少 48 小时收集足够样本（>20 次失败），通过失败/成功请求形态对比统计找出**唯一差异维度**，再进入 Step 2。
diff --git a/docs/agents/knowledge-map.md b/docs/agents/knowledge-map.md
new file mode 100644
index 0000000..08bd983
--- /dev/null
+++ b/docs/agents/knowledge-map.md
@@ -0,0 +1,95 @@
+# Knowledge Map（知识索引）
+
+> 项目所有文档的统一入口与权威索引。由 [AGENTS.md §Knowledge Map](../../AGENTS.md) 锚定，文档目录变更时**必须**即时同步更新本文件。
+>
+> **使用方式**：按"受众 × 目的"二维定位所需文档；不确定起点时，从「入口导航」开始。
+
+[TOC]
+
+---
+
+## 1. 入口导航
+
+| 文档                                          | 角色                                            | 受众            |
+| --------------------------------------------- | ----------------------------------------------- | --------------- |
+| [README.md](../../README.md)                  | 项目首页（英文版门面）                          | 公开访客        |
+| [docs/zh-CN/README.md](../zh-CN/README.md)    | 项目首页中文镜像（与英文版功能对等）            | 中文公开访客    |
+| [docs/user-guide.md](../user-guide.md)        | 用户操作上位导航 + 配置概览速查                 | 终端用户        |
+| [docs/framework.md](../framework.md)          | 架构枢纽（项目动机、设计目标、模块清单）        | 架构师/贡献者   |
+
+---
+
+## 2. 用户向（[docs/guide/](../guide/)）
+
+> 面向最终用户的操作手册，按"安装 → 配置 → 运行 → 观测 → 排障"线性铺陈。
+
+| 文档                                              | 主旨                                                |
+| ------------------------------------------------- | --------------------------------------------------- |
+| [guide/quickstart.md](../guide/quickstart.md)     | 环境要求、安装、最小配置、启动、Claude Code 集成    |
+| [guide/vendors.md](../guide/vendors.md)           | 全部 9 种供应商配置详情、模型映射、定价表           |
+| [guide/cli-reference.md](../guide/cli-reference.md) | start / status / usage / reset / auth 全部命令      |
+| [guide/api-reference.md](../guide/api-reference.md) | /v1/messages、health、status、reset、dashboard 等   |
+| [guide/dashboard.md](../guide/dashboard.md)       | Web 可视化看板功能与交互                            |
+| [guide/monitoring.md](../guide/monitoring.md)     | 日志、用量统计、性能调优、常见场景、故障排查        |
+
+---
+
+## 3. 架构向（[docs/arch/](../arch/)）
+
+> 面向贡献者与维护者的架构与实现细节，从 [framework.md](../framework.md) 正交分解而来。
+
+| 文档                                                  | 主旨                                                  |
+| ----------------------------------------------------- | ----------------------------------------------------- |
+| [arch/config-reference.md](../arch/config-reference.md) | 配置参数权威定义（Single Source of Truth）            |
+| [arch/design-patterns.md](../arch/design-patterns.md) | 13 种设计模式详解（熔断器、状态机、Composite 等）     |
+| [arch/routing.md](../arch/routing.md)                 | 路由引擎 12 个子模块职责                              |
+| [arch/vendors.md](../arch/vendors.md)                 | Vendor 类层次结构与 9 种实现                          |
+| [arch/convert.md](../arch/convert.md)                 | Anthropic ↔ Gemini ↔ OpenAI 三向格式转换              |
+| [arch/testing.md](../arch/testing.md)                 | 测试覆盖矩阵与工具链                                  |
+
+---
+
+## 4. 运维向（[docs/ops/](../ops/)）
+
+> 面向运维与发布工程的流程文档。
+
+| 文档                                | 主旨                                              |
+| ----------------------------------- | ------------------------------------------------- |
+| [ops/ci-cd.md](../ops/ci-cd.md)     | 发布流程、热修复、回滚、CI/CD 故障排查            |
+
+---
+
+## 5. Agent 协作（[docs/agents/](./)）
+
+> AGENTS.md 工程行为准则的卫星文件，定义 AI Agent 协作过程中的规范与协议。
+
+| 文档                                                            | 主旨                                          |
+| --------------------------------------------------------------- | --------------------------------------------- |
+| [agents/knowledge-map.md](./knowledge-map.md)                   | 本文件——项目文档统一索引                      |
+| [agents/reference-specifications.md](./reference-specifications.md) | IEEE 文献引用格式模板与实践指南               |
+| [agents/browser-validation.md](./browser-validation.md)         | 浏览器验证协议（连通性自检、凭证管理、E2E）   |
+
+---
+
+## 6. 问题档案
+
+| 文档                              | 主旨                                                  |
+| --------------------------------- | ----------------------------------------------------- |
+| [docs/issue.md](../issue.md)      | 已处理 Issue 摘要档案（表因、根因、防范）              |
+
+---
+
+## 7. 工程规范（顶层）
+
+| 文档                              | 主旨                                                  |
+| --------------------------------- | ----------------------------------------------------- |
+| [AGENTS.md](../../AGENTS.md)      | 工程行为准则与 AI Agent 协作协议（与 CLAUDE.md 同源） |
+| [CHANGELOG.md](../../CHANGELOG.md)| 版本历史与变更日志                                    |
+
+---
+
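+## Maintenance rules
+
+1. **Keep in sync**: whenever a .md file under `docs/` is added, deleted, or renamed, this index **must** be updated accordingly.
+2. **Path base**: this file lives in `docs/agents/`, and every relative path is resolved from there (`../` reaches `docs/`, `../../` reaches the repository root).
+3. **Link check**: after editing this file, maintainers should verify (for example with grep) that every `path` in a `[...](path)` link points to an existing file.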
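
Reviewer note (not part of the patch): the link-check rule at the end of the index could be automated with a small script rather than an ad hoc grep. The sketch below is an assumption about how that might look; `find_broken_links` and `LINK_RE` are hypothetical names, and the regex only handles simple inline links without spaces or nested parentheses in the target.

```python
import re
from pathlib import Path

# Matches the target of an inline Markdown link: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(markdown_text: str, base_dir: Path) -> list[str]:
    """Return relative link targets that do not exist under base_dir."""
    broken: list[str] = []
    for target in LINK_RE.findall(markdown_text):
        if target.startswith(("http://", "https://", "mailto:", "#")):
            continue  # external links and in-page anchors are out of scope
        path = target.split("#", 1)[0]  # ignore a trailing #fragment
        if path and not (base_dir / path).exists():
            broken.append(target)
    return broken
```

A maintainer would call it with the text of `docs/agents/knowledge-map.md` and `Path("docs/agents")` as the base directory, matching the path-base rule stated in the file.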
diff --git a/docs/agents/reference-specifications.md b/docs/agents/reference-specifications.md
new file mode 100644
index 0000000..896b866
--- /dev/null
+++ b/docs/agents/reference-specifications.md
@@ -0,0 +1,16 @@
+# Reference Specifications (IEEE)
+
+> **Template**: [No.] Initial(s). Surname, "Article title," _Abbreviated journal/conference name (italic)_, volume, issue, pages, year.
+
+```text
+[1] A. Author, B. Author, and C. Author, "Title of paper," *Abbrev. Title of Journal*, vol. X, no. Y, pp. XX–XX, Year.
+```
+
+**Citation practice**
+
+- **In-text anchor**: use the standard superscript link form: `description<sup>[[1]](#ref1)</sup>`.
+- **Reference index**: each entry carries an HTML anchor `id` so that jumps stay stable.
+
+```html
+<a id="ref1"></a>[1] A. Vaswani et al., "Attention is all you need," Adv. Neural Inf. Process. Syst., vol. 30, pp. 5998–6008, 2017.
+```
diff --git a/docs/arch/config-reference.md b/docs/arch/config-reference.md
index 24e11e5..1f4460f 100644
--- a/docs/arch/config-reference.md
+++ b/docs/arch/config-reference.md
@@ -89,12 +89,13 @@ flowchart TD
 
 ## 5. VendorConfig 弹性字段
 
-| 字段                 | 类型           | 默认值               | 说明                        |
-| -------------------- | -------------- | -------------------- | --------------------------- |
-| `circuit_breaker`    | config \| None | `None`               | 熔断器配置（None = 终端层） |
-| `retry`              | config         | `RetryConfig()`      | 重试策略配置                |
-| `quota_guard`        | config         | `QuotaGuardConfig()` | 日度配额守卫配置            |
-| `weekly_quota_guard` | config         | `QuotaGuardConfig()` | 周度配额守卫配置            |
+| 字段                 | 类型           | 默认值               | 说明                                |
+| -------------------- | -------------- | -------------------- | ----------------------------------- |
+| `circuit_breaker`    | config \| None | `None`               | 熔断器配置（None = 终端层）         |
+| `retry`              | config         | `RetryConfig()`      | 重试策略配置                        |
+| `quota_guard`        | config         | `QuotaGuardConfig()` | 日度配额守卫配置                    |
+| `weekly_quota_guard` | config         | `QuotaGuardConfig()` | 周度配额守卫配置                    |
+| `concurrency`        | config \| None | `None`               | `[zhipu]` 每模型并发限制（详见 5.5） |
 
 <a id="elastic-params"></a>
 
@@ -143,6 +144,33 @@ flowchart TD
 | `error_types`            | list[str] | `["rate_limit_error", "overloaded_error", "api_error"]`                            |
 | `error_message_patterns` | list[str] | `["quota", "limit exceeded", "usage cap", "capacity", "internal network failure"]` |
 
+### 5.5 ZhipuConcurrencyConfig — Zhipu 每模型并发参数
+
+仅对 `vendor: zhipu` 生效，基于 `asyncio.Semaphore` 实现 FIFO 公平排队。
+
+| 字段      | 类型           | 默认值 | 说明                                                                             |
+| --------- | -------------- | ------ | -------------------------------------------------------------------------------- |
+| `default` | int            | `3`    | 全局默认并行度（适用于所有未在 `models` 中显式覆盖的模型）；取值范围 `[1, 20]`   |
+| `models`  | map[str → int] | `{}`   | 按映射后模型名（如 `glm-5v-turbo` / `glm-5.1` / `glm-4.5-air`）自定义并行度上限 |
+
+YAML 示例：
+
+```yaml
+- vendor: zhipu
+  concurrency:
+    default: 3
+    models:
+      glm-5v-turbo: 5
+      glm-5.1: 2
+```
+
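+Behavior:
+
+- Semaphores are keyed by the **mapped model name**, matching the model that actually serves the request upstream; streaming and non-streaming requests share the same slots.
+- When all slots are taken, new requests queue in FIFO order and are woken only when an in-flight request releases a slot.
+- A slot stays occupied during 429 retries (a retry counts as a continuation of the same request).
+- The top-level `concurrency` field defaults to `None`; when forwarded to `ZhipuConfig` it falls back to `default=3`. To turn limiting off entirely, set it to `null` explicitly at the `ZhipuConfig` construction layer (normally unnecessary).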
+
 ---
 
 ## 6. 供应商专属字段
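
Reviewer note (not part of the patch): a minimal sketch of the per-model limiting described in section 5.5, assuming one `asyncio.Semaphore` per mapped model name. `ConcurrencyLimits`, `PerModelLimiter`, and `measure_peak` are hypothetical names for illustration only; the project's actual vendor code is not shown in this diff and may differ.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ConcurrencyLimits:
    """Stand-in for the default/models pair (hypothetical, stdlib only)."""

    default: int = 3
    models: dict[str, int] = field(default_factory=dict)

    def get_limit(self, model: str) -> int:
        return self.models.get(model, self.default)


class PerModelLimiter:
    """Lazily creates one semaphore per mapped model name."""

    def __init__(self, limits: ConcurrencyLimits) -> None:
        self._limits = limits
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    def slot(self, model: str) -> asyncio.Semaphore:
        if model not in self._semaphores:
            limit = self._limits.get_limit(model)
            self._semaphores[model] = asyncio.Semaphore(limit)
        return self._semaphores[model]


async def measure_peak(limiter: PerModelLimiter, model: str, requests: int) -> int:
    """Run fake requests and return the highest concurrency observed."""
    in_flight = 0
    peak = 0

    async def call() -> None:
        nonlocal in_flight, peak
        async with limiter.slot(model):  # slot is held for the whole request
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the upstream call
            in_flight -= 1

    await asyncio.gather(*(call() for _ in range(requests)))
    return peak
```

Holding the slot inside a single `async with` is what makes a retry count as a continuation of the same request: the semaphore is released only when the whole block exits.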
diff --git a/docs/arch/vendors.md b/docs/arch/vendors.md
index 2ec79ad..0e0d862 100644
--- a/docs/arch/vendors.md
+++ b/docs/arch/vendors.md
@@ -1,7 +1,7 @@
 # 供应商模块（vendors/）
 
 > 路径约定：相对于 `src/coding/proxy/`
-> 定位：从 [framework.md](./framework.md) 提取，详述供应商分类体系与各供应商实现。
+> 定位：从 [framework.md](../framework.md) 提取，详述供应商分类体系与各供应商实现。
 
 [TOC]
 
diff --git a/docs/guide/monitoring.md b/docs/guide/monitoring.md
index 7e89341..e11e648 100644
--- a/docs/guide/monitoring.md
+++ b/docs/guide/monitoring.md
@@ -31,7 +31,7 @@
 ```yaml
 logging:
   level: "DEBUG"    # 查看详细的模型映射和路由决策
-  file: "coding-proxy.log"  # 输出到文件
+  file: ".logs/coding-proxy.log"  # 输出到文件
   max_bytes: 5242880        # 单文件 5 MB，触发轮转
   backup_count: 5           # 保留 5 个 gzip 压缩备份
 ```
diff --git a/docs/issue.md b/docs/issue.md
deleted file mode 100644
index c8f9765..0000000
--- a/docs/issue.md
+++ /dev/null
@@ -1,47 +0,0 @@
-# Issue 处理档案
-
-> 维护已处理过的 Issue 摘要（问题描述、表因根因、处理方式、后续防范、同类问题影响与处理注意事项），便于同类问题的跨上下文处理。识别相同 Issue 时应在原条目追加复盘，避免同 Issue 多处维护。
-
----
-
-## streaming usage parse failed: 'NoneType' object has no attribute 'get'
-
-**问题描述**
-
-OpenAI 兼容 SSE 流式响应过程中，单次请求日志反复刷出数十条 WARNING：
-
-```
-WARNING streaming usage parse failed: 'NoneType' object has no attribute 'get'
-```
-
-警告本身被上层 `try/except` 吞掉不影响主链路，但日志噪声严重，且每帧都丢失了 usage 累加。
-
-**表因**
-
-`StreamingUsageAccumulator.feed` 调用 `parse_usage_from_chunk` 解析 SSE chunk 时抛出 `AttributeError`。
-
-**根因**
-
-`src/coding/proxy/routing/usage_parser.py::parse_usage_from_chunk` 中 Anthropic message_start 与 Anthropic message_delta / OpenAI 两条分支都使用了脆弱的判空模式：
-
-```python
-if "usage" in data:        # 仅判断 key 存在
-    u = data["usage"]      # 但值可能是 null
-    u.get("output_tokens", 0)  # AttributeError
-```
-
-部分上游（含某些 OpenAI 兼容供应商）在中间 chunk 显式发送 `"usage": null` 占位帧，`in` 检查通过但取出的是 `None`。
-
-**处理方式**
-
-将两处 guard 统一改为 `u = container.get("usage"); if isinstance(u, dict):`，既排除缺省也排除 null，并顺手移除内部冗余的 `if isinstance(u, dict):` 包装层（已被外层 guard 覆盖）。同时新增三个回归用例覆盖 `data.usage = null` / `message.usage = null` / null 帧后跟有效帧三种场景。
-
-**后续防范**
-
-- 解析外部 SSE / JSON 结构时, 不要单独使用 `if key in data` 作为安全 guard, 应统一采用 `value = data.get(key); if isinstance(value, dict):` 的双重保护, 同时排除缺省与显式 null。
-- 对 try/except 包裹的 WARNING 路径要保持警觉: 异常被吞不代表无害，重复刷屏的同类警告往往暗示防御性 guard 过窄，需要回溯至根因修复，而非依赖 except 兜底。
-
-**同类问题影响与处理注意事项**
-
-- 本仓库内 `parse_usage_from_chunk` 的 Gemini `usageMetadata` 分支 (line ~219) 已经使用 `isinstance(um, dict)` 防御, 不受影响, 可作为参考实现。
-- 检查其他解析器 (如 routing / vendor adapter 层) 是否还有 `if "key" in data: v = data["key"]; v.get(...)` 这种模式, 必要时同步加固。
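
Reviewer note (not part of the patch): the guard this archived entry recommends, `value = data.get(key)` followed by `isinstance(value, dict)`, can be sketched as below. `extract_output_tokens` is a hypothetical helper written for illustration; it is not the project's `parse_usage_from_chunk`.

```python
from typing import Any


def extract_output_tokens(chunk: dict[str, Any]) -> int:
    """Return usage.output_tokens, treating a missing or null usage as zero."""
    usage = chunk.get("usage")
    if not isinstance(usage, dict):
        return 0  # covers both an absent key and an explicit "usage": null
    value = usage.get("output_tokens", 0)
    return value if isinstance(value, int) else 0
```

A bare `"usage" in chunk` check would pass for a `"usage": null` placeholder frame and then fail on `.get`, which is the failure mode the entry describes.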
diff --git a/docs/ci-cd.md b/docs/ops/ci-cd.md
similarity index 98%
rename from docs/ci-cd.md
rename to docs/ops/ci-cd.md
index 6b35b38..65d0464 100644
--- a/docs/ci-cd.md
+++ b/docs/ops/ci-cd.md
@@ -211,7 +211,7 @@ CI 流水线中使用的工具及其版本均与项目实际配置严格对齐
 
 | 工具           | 版本 / 引用                         | 来源 (Action)                            | 与项目配置的对齐关系                                                       |
 | -------------- | ----------------------------------- | ---------------------------------------- | -------------------------------------------------------------------------- |
-| Python         | `["3.12", "3.13", "3.14"]` (matrix) | `actions/setup-python@v5`                | 对齐 [`pyproject.toml`](../pyproject.toml) 中 `requires-python = ">=3.12"` |
+| Python         | `["3.12", "3.13", "3.14"]` (matrix) | `actions/setup-python@v5`                | 对齐 [`pyproject.toml`](../../pyproject.toml) 中 `requires-python = ">=3.12"` |
 | uv             | latest (v4)                         | `astral-sh/setup-uv@v4`                  | 项目强制包管理器（见 AGENTS.md 包管理规范）                                |
 | build          | latest                              | `uv pip install --system build`          | PEP 517 构建前端，后端为 hatchling                                         |
 | twine          | latest                              | `uv pip install --system twine`          | 包元数据校验与上传工具                                                     |
@@ -435,7 +435,7 @@ flowchart TD
 
 ### 4.1 promote.yml 工作流架构
 
-[`promote.yml`](../.github/workflows/promote.yml) 由两个 Job 组成，形成 **Validate → Promote** 的串行管线：
+[`promote.yml`](../../.github/workflows/promote.yml) 由两个 Job 组成，形成 **Validate → Promote** 的串行管线：
 
 #### Job 1：validate（验证门控）
 
@@ -629,7 +629,7 @@ flowchart TD
 | 问题现象                        | 可能原因                                               | 排查步骤                                              | 解决方案                                                  |
 | ------------------------------- | ------------------------------------------------------ | ----------------------------------------------------- | --------------------------------------------------------- |
 | `release.yml` 未触发            | Release 创建时未触发 `published` 事件（如 Draft 状态） | 检查 Actions 页面是否有该 workflow run                | 确保 Release 为非 Draft 状态；或重新发布                  |
-| `build` Job 失败                | `twine check` 报错（包元数据不合规）                   | 查看 build Job 日志中的 twine 输出                    | 修复 [`pyproject.toml`](../pyproject.toml) 中的元数据字段 |
+| `build` Job 失败                | `twine check` 报错（包元数据不合规）                   | 查看 build Job 日志中的 twine 输出                    | 修复 [`pyproject.toml`](../../pyproject.toml) 中的元数据字段 |
 | publish 失败 (HTTP 400)         | 包名或版本号冲突（目标仓库已有同版本）                 | 查看 verbose 日志中的响应体（已启用 `verbose: true`） | 检查 TestPyPI/PyPI 是否已有同版本；使用递增版本号         |
 | publish 失败 (HTTP 403)         | 认证失败（Token 无效或缺失）                           | 检查 Job 日志中的认证错误详情                         | 验证 Secret 配置或 Trusted Publisher 设置（参见 §7.2）    |
 | `promote.yml` validate 失败     | Target release 不是 prerelease（已是 stable）          | 查看 validate Job 错误信息                            | 确认输入的 `tag_name` 对应的是 prerelease release         |
@@ -701,7 +701,7 @@ CI 流水线中的工具版本选择并非随意，每一项都与项目配置
 
 | CI 配置                                          | 项目配置                                                              | 对齐关系                                                                    |
 | ------------------------------------------------ | --------------------------------------------------------------------- | --------------------------------------------------------------------------- |
-| `python-version: "${{ matrix.python-version }}"` | `requires-python = ">=3.12"` in [`pyproject.toml`](../pyproject.toml) | CI 构建环境必须满足项目的最低 Python 版本要求（matrix: 3.12 / 3.13 / 3.14） |
+| `python-version: "${{ matrix.python-version }}"` | `requires-python = ">=3.12"` in [`pyproject.toml`](../../pyproject.toml) | CI 构建环境必须满足项目的最低 Python 版本要求（matrix: 3.12 / 3.13 / 3.14） |
 | `hatchling.build` (build-backend)                | `[build-system] requires = ["hatchling"]`                             | 构建后端声明必须一致                                                        |
 | `uv pip install --system`                        | AGENTS.md 强制使用 `uv`                                               | GitHub Actions Runner 默认无激活的 virtualenv，需 `--system` 标志           |
 | `retention-days: 14`                             | —                                                                     | Artifact 保留两周，覆盖正常的验证窗口期（通常 1-3 天）                      |
@@ -714,7 +714,7 @@ CI 流水线中的工具版本选择并非随意，每一项都与项目配置
 
 ### 8.1 release.yml 结构索引
 
-[`.github/workflows/release.yml`](../.github/workflows/release.yml) 文件结构一览：
+[`.github/workflows/release.yml`](../../.github/workflows/release.yml) 文件结构一览：
 
 | 行范围  | 区块                    | 内容摘要                                                                                             |
 | ------- | ----------------------- | ---------------------------------------------------------------------------------------------------- |
@@ -729,7 +729,7 @@ CI 流水线中的工具版本选择并非随意，每一项都与项目配置
 
 ### 8.2 promote.yml 结构索引
 
-[`.github/workflows/promote.yml`](../.github/workflows/promote.yml) 文件结构一览：
+[`.github/workflows/promote.yml`](../../.github/workflows/promote.yml) 文件结构一览：
 
 | 行范围 | 区块            | 内容摘要                                                       |
 | ------ | --------------- | -------------------------------------------------------------- |
diff --git a/docs/user-guide.md b/docs/user-guide.md
index 81bbba1..f9ecad8 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -202,7 +202,7 @@ database:
 
 logging:
   level: "INFO"          # DEBUG / INFO / WARNING / ERROR
-  # file: "coding-proxy.log"  # 输出到文件
+  # file: ".logs/coding-proxy.log"  # 输出到文件
   # max_bytes: 5242880        # 单文件 5 MB
   # backup_count: 5           # 保留 5 个备份
 ```
diff --git a/docs/zh-CN/README.md b/docs/zh-CN/README.md
index 658e27f..4b32986 100644
--- a/docs/zh-CN/README.md
+++ b/docs/zh-CN/README.md
@@ -30,7 +30,7 @@
 ## 🌟 核心特性 (Core Features)
 
 <div align="center">
-    <img src="../../assets/dashboard-v0.2.4.png">
+    <img src="../../assets/dashboard-v0.4.0.png">
 </div>
 
 - **⛓️ N-tier 链式故障转移 (Failover)**：自主降序序列，支持 Claude 官方 Plans，以及 GitHub Copilot、Google Antigravity、智谱、MiniMax、阿里千问、小米、Kimi、豆包等的 Coding Plan。
diff --git a/pyproject.toml b/pyproject.toml
index 24630e1..14dcba1 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "coding-proxy"
-version = "0.4.0"
+version = "0.5.0"
 description = "A High-Availability, Transparent, and Smart Multi-Vendor Proxy for Claude Code. Support Claude Plans, GitHub Copilot, Google Antigravity, ZAI/GLM, MiniMax, Qwen, Xiaomi, Kimi, Doubao..."
 readme = "README.md"
 requires-python = ">=3.12"
@@ -84,7 +84,10 @@ docstring-code-format = true
 [tool.pytest.ini_options]
 asyncio_mode = "auto"
 testpaths = ["tests"]
-addopts = "-v --tb=short"
+addopts = "-v --tb=short -m 'not e2e'"
+markers = [
+    "e2e: marks tests as end-to-end (deselect with '-m \"not e2e\"')",
+]
 filterwarnings = [
     "ignore::DeprecationWarning",
 ]
diff --git a/src/coding/proxy/cli/__init__.py b/src/coding/proxy/cli/__init__.py
index 3b479fb..b51f089 100644
--- a/src/coding/proxy/cli/__init__.py
+++ b/src/coding/proxy/cli/__init__.py
@@ -109,7 +109,7 @@ def start(
     print_banner(console, host=cfg.server.host, port=cfg.server.port)
 
     # 解析文件日志路径：未显式配置时使用默认值
-    _file_path: str | None = cfg.logging.file or "coding-proxy.log"
+    _file_path: str | None = cfg.logging.file or ".logs/coding-proxy.log"
     uvicorn.run(
         fastapi_app,
         host=cfg.server.host,
diff --git a/src/coding/proxy/config/config.default.yaml b/src/coding/proxy/config/config.default.yaml
index 40808fd..d945125 100644
--- a/src/coding/proxy/config/config.default.yaml
+++ b/src/coding/proxy/config/config.default.yaml
@@ -8,7 +8,7 @@ server:
 
 logging:
   level: "INFO"
-  # file: "coding-proxy.log"          # 文件日志路径；设为 null 或空字符串禁用
+  # file: ".logs/coding-proxy.log"    # 文件日志路径；设为 null 或空字符串禁用
   # max_bytes: 5242880                # 单文件上限（5 MB），触发轮转
   # backup_count: 5                   # 保留 gzip 压缩备份文件数
 
@@ -119,6 +119,14 @@ vendors:
       window_hours: 24.0
       threshold_percent: 95.0
       probe_interval_seconds: 300
+    # 每模型并发限制：默认 3 个并行请求；超出则按 FIFO 排队等待
+    # 可通过 models 字段覆盖单个模型的限制（如 glm-5.1: 5）
+    concurrency:
+      default: 3
+      # models:
+      #   glm-5v-turbo: 3
+      #   glm-5.1: 3
+      #   glm-4.5-air: 3
 
   # Vendor 4: MiniMax（默认禁用，需手动启用并添加到 tiers）
   - vendor: minimax
diff --git a/src/coding/proxy/config/routing.py b/src/coding/proxy/config/routing.py
index 3326a0b..2c29363 100644
--- a/src/coding/proxy/config/routing.py
+++ b/src/coding/proxy/config/routing.py
@@ -9,6 +9,7 @@
 from pydantic import BaseModel, BeforeValidator, Field, PrivateAttr, model_validator
 
 from .resiliency import CircuitBreakerConfig, QuotaGuardConfig, RetryConfig
+from .vendors import ZhipuConcurrencyConfig
 
 # ── 价格字段解析（$ / ¥ 前缀支持） ──────────────────────────
 
@@ -64,13 +65,13 @@ def _detect_currency(v: Any) -> str | None:
         "api_key",
     }
 )
-# 向后兼容别名
-_ZHIPU_FIELDS = _NATIVE_ANTHROPIC_FIELDS
+# Zhipu 独占字段：在通用 api_key 基础上增加每模型并发限制
+_ZHIPU_FIELDS: frozenset[str] = _NATIVE_ANTHROPIC_FIELDS | frozenset({"concurrency"})
 
 _VENDOR_EXCLUSIVE_FIELDS: dict[str, frozenset[str]] = {
     "copilot": _COPILOT_FIELDS,
     "antigravity": _ANTIGRAVITY_FIELDS,
-    "zhipu": _NATIVE_ANTHROPIC_FIELDS,
+    "zhipu": _ZHIPU_FIELDS,
     "minimax": _NATIVE_ANTHROPIC_FIELDS,
     "kimi": _NATIVE_ANTHROPIC_FIELDS,
     "doubao": _NATIVE_ANTHROPIC_FIELDS,
@@ -285,6 +286,12 @@ class VendorConfig(BaseModel):
     quota_guard: QuotaGuardConfig = Field(default_factory=QuotaGuardConfig)
     weekly_quota_guard: QuotaGuardConfig = Field(default_factory=QuotaGuardConfig)
 
+    # ── Zhipu 专属：每模型并发限制 ───────────────────────────
+    concurrency: ZhipuConcurrencyConfig | None = Field(
+        default=None,
+        description="[zhipu] Per-model concurrency limit; None falls back to default=3",
+    )
+
     @model_validator(mode="after")
     def _warn_irrelevant_fields(self) -> VendorConfig:
         """对非当前 vendor 类型的非空专属字段发出 warning."""
diff --git a/src/coding/proxy/config/schema.py b/src/coding/proxy/config/schema.py
index ee21ee7..40e5428 100644
--- a/src/coding/proxy/config/schema.py
+++ b/src/coding/proxy/config/schema.py
@@ -54,6 +54,7 @@
     KimiConfig,
     MinimaxConfig,
     XiaomiConfig,
+    ZhipuConcurrencyConfig,
     ZhipuConfig,
 )
 
@@ -318,6 +319,7 @@ def compat_state_path(self) -> Path:
     "CopilotConfig",
     "AntigravityConfig",
     "ZhipuConfig",
+    "ZhipuConcurrencyConfig",
     # resiliency
     "CircuitBreakerConfig",
     "RetryConfig",
diff --git a/src/coding/proxy/config/server.py b/src/coding/proxy/config/server.py
index 7d67207..6fa3e8f 100644
--- a/src/coding/proxy/config/server.py
+++ b/src/coding/proxy/config/server.py
@@ -21,7 +21,7 @@ class LoggingConfig(BaseModel):
 
     Attributes:
         level: 控制台日志级别（INFO / WARNING / DEBUG 等）。
-        file: 文件日志路径。为 ``None`` 时使用默认值 ``coding-proxy.log``；
+        file: 文件日志路径。为 ``None`` 时使用默认值 ``.logs/coding-proxy.log``；
              设为空字符串可禁用文件日志。
         max_bytes: 单个日志文件最大字节数（触发轮转）。默认 5 MB。
         backup_count: 保留的已压缩备份文件数。默认 5。
diff --git a/src/coding/proxy/config/vendors.py b/src/coding/proxy/config/vendors.py
index 4f15531..a1c0280 100644
--- a/src/coding/proxy/config/vendors.py
+++ b/src/coding/proxy/config/vendors.py
@@ -2,7 +2,21 @@
 
 from __future__ import annotations
 
-from pydantic import BaseModel
+from pydantic import BaseModel, Field
+
+
+class ZhipuConcurrencyConfig(BaseModel):
+    """Zhipu 每模型并发限制配置."""
+
+    default: int = Field(default=3, ge=1, le=20, description="全局默认并行度")
+    models: dict[str, int] = Field(
+        default_factory=dict,
+        description="按映射后模型名自定义并行度（覆盖 default）",
+    )
+
+    def get_limit(self, model: str) -> int:
+        """获取指定模型的并行度限制."""
+        return self.models.get(model, self.default)
 
 
 class AnthropicConfig(BaseModel):
@@ -48,6 +62,7 @@ class ZhipuConfig(BaseModel):
     base_url: str = "https://open.bigmodel.cn/api/anthropic"
     api_key: str = ""
     timeout_ms: int = 3000000
+    concurrency: ZhipuConcurrencyConfig = Field(default_factory=ZhipuConcurrencyConfig)
 
 
 class MinimaxConfig(BaseModel):
@@ -100,6 +115,7 @@ class AlibabaConfig(BaseModel):
     "CopilotConfig",
     "AntigravityConfig",
     "ZhipuConfig",
+    "ZhipuConcurrencyConfig",
     "MinimaxConfig",
     "KimiConfig",
     "DoubaoConfig",
diff --git a/src/coding/proxy/convert/vendor_channels.py b/src/coding/proxy/convert/vendor_channels.py
index bec46f7..456a9b3 100644
--- a/src/coding/proxy/convert/vendor_channels.py
+++ b/src/coding/proxy/convert/vendor_channels.py
@@ -219,9 +219,114 @@ def enforce_anthropic_tool_pairing(
             ", ".join(synthesized_ids),
         )
 
+    # 纵深防御: sanity 兜底，捕获主循环未覆盖的边角配对漏洞
+    adaptations.extend(_enforce_pairing_sanity_pass(messages_list))
+
     return adaptations
 
 
+def _enforce_pairing_sanity_pass(
+    messages_list: list[dict[str, Any]],
+) -> list[str]:
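+    """Sanity-pass fallback run after the ``enforce_anthropic_tool_pairing`` main loop.
+
+    Orthogonal to the main loop (never strips tool_result, never inserts a user message):
+
+    1. For every ``role == "assistant"`` message containing ``tool_use`` blocks, check
+       that ``messages[i+1]`` is a ``user`` message holding a ``tool_result`` whose
+       ``tool_use_id`` matches each ``tool_use.id``.
+    2. Append an ``is_error=True`` placeholder to that user message for each missing id.
+       If the next message is not a user message at all, insert nothing and only log a
+       WARNING; the main loop guarantees a next user message, so this should never fire.
+
+    The helper is split out for two reasons:
+
+    - Unit tests can feed it inputs that bypass the main loop, giving the sanity branch
+      its own regression coverage (lesson learned: ``9061cd0`` added two-pass scanning
+      plus sanity and was reverted along with ``2bac9a7``, largely for lack of such tests).
+    - If main-loop steps A-F are refactored later, it still enforces Anthropic pairing.
+
+    Args:
+        messages_list: Message list, modified in place.
+
+    Returns:
+        Adaptation tags added: ``["pairing_sanity_repaired"]`` on repair, else ``[]``.
+    """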
+    repaired: list[tuple[int, str]] = []
+
+    for i, msg in enumerate(messages_list):
+        if not isinstance(msg, dict) or msg.get("role") != "assistant":
+            continue
+        content = msg.get("content")
+        if not isinstance(content, list):
+            continue
+        tool_use_ids = [
+            b["id"]
+            for b in content
+            if isinstance(b, dict) and b.get("type") == "tool_use" and b.get("id")
+        ]
+        if not tool_use_ids:
+            continue
+
+        next_idx = i + 1
+        if (
+            next_idx >= len(messages_list)
+            or not isinstance(messages_list[next_idx], dict)
+            or messages_list[next_idx].get("role") != "user"
+        ):
+            # 主循环正常情况下已保证 next 为 user；此处仅日志告警，不做隐式插入
+            # 以避免与主循环职责重叠。
+            logger.warning(
+                "Sanity pass: assistant at messages[%d] has tool_use without "
+                "user next message (tool_use_ids=%s). Main enforce loop may have a regression.",
+                i,
+                ", ".join(tool_use_ids),
+            )
+            continue
+
+        user_msg = messages_list[next_idx]
+        user_content = user_msg.get("content")
+        if not isinstance(user_content, list):
+            # 主循环 D 步已将 string content 归一化为 list；这里防御性兜底
+            user_msg["content"] = (
+                [{"type": "text", "text": user_content}]
+                if isinstance(user_content, str)
+                else []
+            )
+            user_content = user_msg["content"]
+
+        existing_result_ids = {
+            b["tool_use_id"]
+            for b in user_content
+            if isinstance(b, dict)
+            and b.get("type") == "tool_result"
+            and b.get("tool_use_id")
+        }
+        for uid in tool_use_ids:
+            if uid in existing_result_ids:
+                continue
+            user_content.append(
+                {
+                    "type": "tool_result",
+                    "tool_use_id": uid,
+                    "content": "",
+                    "is_error": True,
+                }
+            )
+            repaired.append((i, uid))
+
+    if not repaired:
+        return []
+
+    logger.warning(
+        "Sanity pass repaired %d unpaired tool_use(s) missed by main enforce loop. "
+        "Affected: %s",
+        len(repaired),
+        ", ".join(f"messages[{idx}]:{uid}" for idx, uid in repaired),
+    )
+    return ["pairing_sanity_repaired"]
+
+
 def _strip_cache_control(body: dict[str, Any]) -> int:
     """从 system/messages/tools 中移除 cache_control 字段（就地）.
 
@@ -262,6 +367,59 @@ def _strip_cache_control(body: dict[str, Any]) -> int:
     return removed
 
 
+# ── zhipu 共享清洗函数 ──────────────────────────────────────────
+
+# 跨供应商转换时主动剥离的顶层参数。
+# 首选 tier 场景的 thinking.type=adaptive 兼容转换由
+# ZhipuVendor._prepare_request 处理（转换为 enabled + budget，保留功能），
+# 此处仅负责 failover 路径的全量剥离（跨供应商 thinking signature 失效）。
+_ZHIPU_UNSUPPORTED_PARAMS: frozenset[str] = frozenset(
+    {"thinking", "extended_thinking", "reasoning_effort"}
+)
+
+
+def normalize_for_zhipu(body: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
+    """为 zhipu GLM 的 Anthropic 兼容端点清洗请求体（就地，不 deep copy）.
+
+    为跨供应商转换通道 ``prepare_copilot_to_zhipu`` 提供请求体清洗。
+
+    清洗内容：
+    1. 剥离 cache_control 字段（GLM 静默忽略，主动剥离以减少噪音）
+    2. 移除顶层 thinking/extended_thinking/reasoning_effort 参数（GLM 原生支持
+       thinking、静默忽略 reasoning_effort，但跨供应商场景下这些参数来自原供应商
+       的协议语义，主动剥离以确保请求语义一致性）
+    3. 强制 tool_use/tool_result 配对约束
+
+    不包含 thinking blocks 剥离：跨供应商场景下 history 中的 thinking blocks
+    来自原供应商（签名失效），由调用方在调用本函数之前通过
+    ``strip_thinking_blocks`` 单独处理。
+
+    所有操作均为幂等，安全地在已清洗的请求体上重复调用。
+
+    Returns:
+        (body, adaptations) — body 为就地修改后的同一引用，adaptations 为变换描述列表。
+    """
+    adaptations: list[str] = []
+
+    # Step 1: 剥离 cache_control
+    removed_cc = _strip_cache_control(body)
+    if removed_cc:
+        adaptations.append(f"removed_{removed_cc}_cache_control_fields")
+
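+    # Step 2: drop the top-level params that are stripped on the cross-vendor path
+    for param in sorted(_ZHIPU_UNSUPPORTED_PARAMS):  # sorted: stable adaptation order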
+        if param in body:
+            del body[param]
+            adaptations.append(f"removed_{param}_param")
+
+    # Step 3: 强制 tool_use/tool_result 配对
+    pairing_fixes = enforce_anthropic_tool_pairing(body.get("messages", []))
+    if pairing_fixes:
+        adaptations.extend(pairing_fixes)
+
+    return body, adaptations
+
+
 def _remove_vendor_blocks(body: dict[str, Any], block_types: set[str]) -> int:
     """从 messages[].content[] 中就地移除指定 type 的内容块.
 
@@ -294,8 +452,22 @@ def _rewrite_srvtoolu_ids(body: dict[str, Any]) -> tuple[int, dict[str, str]]:
 
     Anthropic API 要求 tool_use 类型与 ``toolu_*`` 格式的 ID。Zhipu 的
     ``server_tool_use`` + ``srvtoolu_*`` 在上游 Anthropic 兼容端点可用，但无法
-    透传至其他供应商；同时还需重写紧随其后 user 消息中 ``tool_result.tool_use_id``
-    引用，保持配对关系。
+    透传至其他供应商；同时还需重写所有 ``tool_result.tool_use_id`` 引用，保持配对关系。
+
+    **两遍扫描（消除块顺序敏感性）**:
+
+    - Pass 1: 仅遍历 ``role == "assistant"`` 的消息，按 assistant 出现顺序为每个
+      待改写的 tool_use 分配 ``toolu_normalized_N`` 新 ID，建立完整 ``id_map``。
+    - Pass 2: 全量遍历消息，对任意 ``tool_result.tool_use_id ∈ id_map`` 的块
+      原地改写为新 ID（不分 user / assistant，覆盖 misplaced 与跨消息边界场景）。
+
+    单遍方案在 GLM-5 偶发将 inline ``tool_result`` 输出在对应 ``server_tool_use``
+    之前的乱序场景下，会因 Case B 时 ``id_map`` 尚未填入而漏改 ``tool_use_id``，
+    导致 ``enforce_anthropic_tool_pairing`` 后 ``extracted_tool_results`` 的 key
+    与 ``tool_use_ids`` 不一致，进而把本应配对的 misplaced tool_result 默默丢弃，
+    最终触发 Anthropic ``messages.x: tool_use ids were found without tool_result
+    blocks immediately after`` 400 错误。两遍扫描以"先建表、后改写"的次序消除该
+    时序耦合。
 
     Returns:
         (rewritten_count, id_map) — 重写次数与 {原 ID: 新 ID} 映射。
@@ -308,45 +480,56 @@ def next_id() -> str:
         counter += 1
         return f"toolu_normalized_{counter}"
 
+    # Pass 1: 扫描 assistant 消息，改写 tool_use / server_tool_use 的 id 与 type，
+    # 按出现顺序填充 id_map（保持与单遍版本相同的序号分配，避免破坏既有断言）。
     for message in body.get("messages", []):
-        if not isinstance(message, dict):
+        if not isinstance(message, dict) or message.get("role") != "assistant":
             continue
         content = message.get("content")
         if not isinstance(content, list):
             continue
-        role = message.get("role")
         for block in content:
             if not isinstance(block, dict):
                 continue
             block_type = block.get("type")
+            if block_type not in {"tool_use", "server_tool_use"}:
+                continue
             block_id = block.get("id")
-
-            # Case A: assistant 消息里的 server_tool_use / srvtoolu_* → 改写
-            if role == "assistant" and block_type in {"tool_use", "server_tool_use"}:
-                if isinstance(block_id, str) and _ANTHROPIC_SERVER_TOOL_USE_ID_RE.match(
-                    block_id
-                ):
-                    new_id = next_id()
-                    id_map[block_id] = new_id
-                    block["id"] = new_id
-                    block["type"] = "tool_use"
-                elif (
-                    isinstance(block_id, str)
-                    and block_id
-                    and not _ANTHROPIC_TOOL_USE_ID_RE.match(block_id)
-                    and block.get("name")
-                ):
-                    # 非标准 ID（非 toolu_ / srvtoolu_），且具备 name 可改写
-                    new_id = next_id()
-                    id_map[block_id] = new_id
-                    block["id"] = new_id
-                    block["type"] = "tool_use"
-                elif block_type == "server_tool_use" and isinstance(block_id, str):
-                    # 兜底: 类型是 server_tool_use 但 ID 已是标准 toolu_ 形式，仅纠正类型
-                    block["type"] = "tool_use"
-
-            # Case B: user 消息里的 tool_result.tool_use_id 同步重写
-            if block_type == "tool_result":
+            if isinstance(block_id, str) and _ANTHROPIC_SERVER_TOOL_USE_ID_RE.match(
+                block_id
+            ):
+                new_id = next_id()
+                id_map[block_id] = new_id
+                block["id"] = new_id
+                block["type"] = "tool_use"
+            elif (
+                isinstance(block_id, str)
+                and block_id
+                and not _ANTHROPIC_TOOL_USE_ID_RE.match(block_id)
+                and block.get("name")
+            ):
+                # Non-standard ID (neither toolu_ nor srvtoolu_) with a name: safe to rewrite
+                new_id = next_id()
+                id_map[block_id] = new_id
+                block["id"] = new_id
+                block["type"] = "tool_use"
+            elif block_type == "server_tool_use" and isinstance(block_id, str):
+                # Fallback: type is server_tool_use but the ID is already a standard toolu_ ID; fix the type only
+                block["type"] = "tool_use"
+
+    # Pass 2: scan every message and rewrite any tool_result whose tool_use_id is in id_map.
+    if id_map:
+        for message in body.get("messages", []):
+            if not isinstance(message, dict):
+                continue
+            content = message.get("content")
+            if not isinstance(content, list):
+                continue
+            for block in content:
+                if not isinstance(block, dict):
+                    continue
+                if block.get("type") != "tool_result":
+                    continue
                 tool_use_id = block.get("tool_use_id")
                 if isinstance(tool_use_id, str) and tool_use_id in id_map:
                     block["tool_use_id"] = id_map[tool_use_id]
@@ -414,26 +597,14 @@ def prepare_copilot_to_zhipu(
     prepared = copy.deepcopy(body)
     adaptations: list[str] = []
 
-    # Step 1: strip thinking/redacted_thinking blocks
+    # Step 1: strip thinking/redacted_thinking blocks (their signatures are invalid across vendors)
     stripped = strip_thinking_blocks(prepared)
     if stripped:
         adaptations.append(f"stripped_{stripped}_thinking_blocks")
 
-    # Step 2: remove cache_control fields
-    removed_cc = _strip_cache_control(prepared)
-    if removed_cc:
-        adaptations.append(f"removed_{removed_cc}_cache_control_fields")
-
-    # Step 3: remove top-level thinking/extended_thinking params (unsupported by GLM-5)
-    for param in ("thinking", "extended_thinking"):
-        if param in prepared:
-            del prepared[param]
-            adaptations.append(f"removed_{param}_param")
-
-    # Step 4: enforce tool_use/tool_result pairing
-    pairing_fixes = enforce_anthropic_tool_pairing(prepared.get("messages", []))
-    if pairing_fixes:
-        adaptations.extend(pairing_fixes)
+    # Step 2: shared cleanup (cache_control, unsupported top-level params, tool pairing)
+    _, norm_adaptations = normalize_for_zhipu(prepared)
+    adaptations.extend(norm_adaptations)
 
     return prepared, adaptations
 
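Reviewer note: the two-pass ID rewrite above can be sketched in isolation. This is a minimal illustration, not the patched function: `rewrite_tool_ids`, `STANDARD_ID_RE`, and the `toolu_sketch_` ID format are hypothetical names, and the sketch rewrites every non-standard ID without the `name` check the real code applies. It shows why the second pass is separated out: a `tool_result` is synchronized regardless of message role or position.

```python
import itertools
import re

# Assumed shape of a standard ID, for illustration only; the real pattern
# lives in the patched module.
STANDARD_ID_RE = re.compile(r"^toolu_")


def rewrite_tool_ids(messages: list[dict]) -> dict[str, str]:
    """Rewrite non-standard tool_use IDs, then sync matching tool_result blocks."""
    counter = itertools.count(1)
    id_map: dict[str, str] = {}

    # Pass 1: rewrite IDs on tool_use / server_tool_use blocks and record the mapping.
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") not in {"tool_use", "server_tool_use"}:
                continue
            block_id = block.get("id")
            if isinstance(block_id, str) and block_id and not STANDARD_ID_RE.match(block_id):
                new_id = f"toolu_sketch_{next(counter):04d}"
                id_map[block_id] = new_id
                block["id"] = new_id
            block["type"] = "tool_use"

    # Pass 2: full scan; any tool_result that references a rewritten ID is updated.
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_result":
                old_id = block.get("tool_use_id")
                if isinstance(old_id, str) and old_id in id_map:
                    block["tool_use_id"] = id_map[old_id]
    return id_map
```

Because the mapping is complete before any `tool_result` is touched, the result does not depend on where the `tool_result` sits relative to its `tool_use`.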
diff --git a/src/coding/proxy/logging/db.py b/src/coding/proxy/logging/db.py
index ffe9b2c..8470966 100644
--- a/src/coding/proxy/logging/db.py
+++ b/src/coding/proxy/logging/db.py
@@ -190,6 +190,14 @@ def _local_month_udf(ts_str: str) -> str:
 );
 """
 
+_CREATE_SESSION_META = """
+CREATE TABLE IF NOT EXISTS session_meta (
+    session_key TEXT PRIMARY KEY,
+    title TEXT NOT NULL DEFAULT '',
+    created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now'))
+);
+"""
+
 _CREATE_INDEXES = """
 CREATE INDEX IF NOT EXISTS idx_usage_ts ON usage_log(ts);
 CREATE INDEX IF NOT EXISTS idx_usage_vendor ON usage_log(vendor);
@@ -245,6 +253,7 @@ async def init(self) -> None:
         self._db.row_factory = aiosqlite.Row
         await self._db.execute("PRAGMA journal_mode=WAL")
         await self._db.executescript(_CREATE_TABLES)
+        await self._db.executescript(_CREATE_SESSION_META)
         # Migrations must run before index creation so the vendor column exists
         await self._migrate_rename_backend_to_vendor()
         await self._migrate_add_failover_from()
@@ -316,6 +325,28 @@ async def _migrate_rename_backend_to_vendor(self) -> None:
                     "Migration: renamed 'backend' column to 'vendor' in %s", table
                 )
 
+    async def set_session_title(self, session_key: str, title: str) -> None:
+        """Set the title for a new session (idempotent: only the first write sticks)."""
+        if not self._db or not title or not session_key:
+            return
+        await self._db.execute(
+            "INSERT OR IGNORE INTO session_meta (session_key, title) VALUES (?, ?)",
+            (session_key, title),
+        )
+        await self._db.commit()
+
+    async def get_session_titles(self, session_keys: list[str]) -> dict[str, str]:
+        """Look up session titles in bulk."""
+        if not self._db or not session_keys:
+            return {}
+        placeholders = ",".join("?" for _ in session_keys)
+        cursor = await self._db.execute(
+            f"SELECT session_key, title FROM session_meta WHERE session_key IN ({placeholders})",
+            session_keys,
+        )
+        rows = await cursor.fetchall()
+        return {row["session_key"]: row["title"] for row in rows}
+
     async def log(
         self,
         vendor: str,
@@ -604,7 +635,8 @@ async def query_recent_sessions(
                       MIN(ts) AS first_seen_ts,
                       MAX(ts) AS last_active_ts,
                       COUNT(*) AS total_requests,
-                      SUM(input_tokens + output_tokens) AS total_tokens,
+                      SUM(input_tokens + output_tokens
+                          + cache_creation_tokens + cache_read_tokens) AS total_tokens,
                       SUM(input_tokens) AS total_input,
                       SUM(output_tokens) AS total_output,
                       GROUP_CONCAT(DISTINCT model_served) AS models,
@@ -620,7 +652,13 @@ async def query_recent_sessions(
             (cutoff_iso, limit),
         )
         rows = await cursor.fetchall()
-        return [dict(row) for row in rows]
+        sessions = [dict(row) for row in rows]
+        if sessions:
+            keys = [s["session_key"] for s in sessions]
+            titles = await self.get_session_titles(keys)
+            for s in sessions:
+                s["title"] = titles.get(s["session_key"], "")
+        return sessions
 
     async def query_session_profile(self, session_key: str) -> dict | None:
         """Query the full aggregated data for a single session."""
@@ -631,7 +669,8 @@ async def query_session_profile(self, session_key: str) -> dict | None:
                       MIN(ts) AS first_seen_ts,
                       MAX(ts) AS last_active_ts,
                       COUNT(*) AS total_requests,
-                      SUM(input_tokens + output_tokens) AS total_tokens,
+                      SUM(input_tokens + output_tokens
+                          + cache_creation_tokens + cache_read_tokens) AS total_tokens,
                       SUM(input_tokens) AS total_input,
                       SUM(output_tokens) AS total_output,
                       GROUP_CONCAT(DISTINCT model_served) AS models,
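Reviewer note: the first-write-wins behaviour of `set_session_title` and the batched `IN (...)` lookup can be tried out with the standard-library `sqlite3` module. This is a synchronous sketch under assumptions: the patch itself uses `aiosqlite`, the table is reduced to two columns, and `set_title` / `get_titles` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE IF NOT EXISTS session_meta ("
    " session_key TEXT PRIMARY KEY,"
    " title TEXT NOT NULL DEFAULT '')"
)


def set_title(session_key: str, title: str) -> None:
    """Insert the title once; later calls for the same key are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO session_meta (session_key, title) VALUES (?, ?)",
        (session_key, title),
    )
    conn.commit()


def get_titles(session_keys: list[str]) -> dict[str, str]:
    """Fetch titles for many keys with one parameterized IN (...) query."""
    if not session_keys:
        return {}
    placeholders = ",".join("?" for _ in session_keys)
    rows = conn.execute(
        f"SELECT session_key, title FROM session_meta WHERE session_key IN ({placeholders})",
        session_keys,
    ).fetchall()
    return {row["session_key"]: row["title"] for row in rows}


set_title("s1", "first title")
set_title("s1", "second title")  # ignored: the primary key already exists
titles = get_titles(["s1", "s2"])  # "s2" has no row, so it is absent from the result
```

Keys without a stored title are simply missing from the returned dict, which is why the caller in the patch falls back to an empty string with `titles.get(..., "")`.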
diff --git a/src/coding/proxy/native_api/handler.py b/src/coding/proxy/native_api/handler.py
index 790c5f2..ab7b344 100644
--- a/src/coding/proxy/native_api/handler.py
+++ b/src/coding/proxy/native_api/handler.py
@@ -13,11 +13,14 @@
 
 from __future__ import annotations
 
+import asyncio
 import json
 import logging
+import re
 import time
 from collections.abc import AsyncIterator
 from typing import TYPE_CHECKING
+from urllib.parse import unquote
 
 import httpx
 
@@ -172,8 +175,16 @@ async def dispatch(
             )
 
         method = request.method.upper()
-        operation = OperationClassifier.classify(provider, method, rest_path)
-        endpoint = rest_path if rest_path.startswith("/") else f"/{rest_path}"
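+        # Defensive URL decoding: turn %3A into ":" for Gemini's :verb path syntax.
+        # The ASGI spec requires scope["path"] to be decoded already, but some
+        # servers/reverse proxies may leave legal path characters (e.g. colons) encoded.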
+        decoded_rest_path = unquote(rest_path)
+        operation = OperationClassifier.classify(provider, method, decoded_rest_path)
+        endpoint = (
+            decoded_rest_path
+            if decoded_rest_path.startswith("/")
+            else f"/{decoded_rest_path}"
+        )
 
         upstream_headers = _filter_request_headers(dict(request.headers))
         # Force identity to block upstream compression (httpx adds gzip,deflate by default;
@@ -185,6 +196,28 @@ async def dispatch(
         start_ts = time.perf_counter()
         client = self._get_client(provider)
 
+        # ── Gemini embedding: Vertex AI format conversion ──────────────────
+        # When the upstream is not the official Google AI Studio (generativelanguage.googleapis.com),
+        # the Google AI Studio format sent by litellm (v1beta/models/{model}:batchEmbedContents)
+        # must be converted to the Vertex AI format (v1beta1/publishers/google/models/{model}:embedContent).
+        vertex_rewrite = (
+            provider == "gemini"
+            and operation in ("embedding", "embedding.batch")
+            and cfg.base_url
+            and "generativelanguage.googleapis.com" not in cfg.base_url
+        )
+        if vertex_rewrite:
+            return await self._dispatch_gemini_vertex_embedding(
+                client=client,
+                operation=operation,
+                endpoint=endpoint,
+                body_bytes=body_bytes,
+                upstream_headers=upstream_headers,
+                query_string=query_string,
+                provider=provider,
+                start_ts=start_ts,
+            )
+
         # Build the upstream URL (keeping the query string)
         upstream_url = endpoint
         if query_string:
@@ -286,6 +319,313 @@ async def dispatch(
             media_type=content_type or None,
         )
 
+    # ── Gemini embedding → Vertex AI format conversion ──────────────────
+
+    # Google AI Studio path regex: [v1beta/]models/{model}:{verb}
+    # The version segment is optional to tolerate the litellm `_check_custom_proxy` bug that drops the v1beta prefix.
+    _GEMINI_EMBED_PATH_RE = re.compile(
+        r"^/?(?:v1(?:beta1?)?/)?models/(?P<model>[^/:]+)(?::|%3A)(?P<verb>embedContent|batchEmbedContents)/?$"
+    )
+
+    async def _dispatch_gemini_vertex_embedding(
+        self,
+        *,
+        client: httpx.AsyncClient,
+        operation: str,
+        endpoint: str,
+        body_bytes: bytes,
+        upstream_headers: dict[str, str],
+        query_string: str,
+        provider: str,
+        start_ts: float,
+    ) -> StarletteResponse:
+        """Convert a Google AI Studio embedding request to the Vertex AI format.
+
+        Google AI Studio:
+          POST v1beta/models/{model}:batchEmbedContents
+          Body: {"requests": [{"model": "models/{model}", "content": {...}}]}
+
+        Vertex AI:
+          POST v1beta1/publishers/google/models/{model}:embedContent
+          Body: {"content": {...}}
+        """
+        from fastapi.responses import Response as FastAPIResponse
+
+        match = self._GEMINI_EMBED_PATH_RE.match(endpoint)
+        if not match:
+            return FastAPIResponse(
+                content=json.dumps(
+                    {
+                        "error": {
+                            "message": f"unrecognized gemini embedding path: {endpoint}"
+                        }
+                    }
+                ).encode(),
+                status_code=400,
+                media_type="application/json",
+            )
+
+        model_name = match.group("model")
+        verb = match.group("verb")
+
+        # Parse the original request body
+        try:
+            body = json.loads(body_bytes) if body_bytes else {}
+        except (json.JSONDecodeError, UnicodeDecodeError):
+            return FastAPIResponse(
+                content=json.dumps(
+                    {"error": {"message": "invalid JSON body for embedding request"}}
+                ).encode(),
+                status_code=400,
+                media_type="application/json",
+            )
+
+        if verb == "batchEmbedContents":
+            return await self._vertex_batch_embed(
+                client=client,
+                model_name=model_name,
+                body=body,
+                upstream_headers=upstream_headers,
+                query_string=query_string,
+                provider=provider,
+                operation=operation,
+                endpoint=endpoint,
+                start_ts=start_ts,
+            )
+
+        # Single embedContent: convert directly
+        content = body.get("content", body)
+        return await self._vertex_single_embed(
+            client=client,
+            model_name=model_name,
+            content=content,
+            upstream_headers=upstream_headers,
+            query_string=query_string,
+            provider=provider,
+            operation=operation,
+            endpoint=endpoint,
+            start_ts=start_ts,
+        )
+
+    async def _vertex_single_embed(
+        self,
+        *,
+        client: httpx.AsyncClient,
+        model_name: str,
+        content: dict,
+        upstream_headers: dict[str, str],
+        query_string: str,
+        provider: str,
+        operation: str,
+        endpoint: str,
+        start_ts: float,
+    ) -> StarletteResponse:
+        """Send a single Vertex AI embedContent request."""
+        from fastapi.responses import Response as FastAPIResponse
+
+        vertex_path = f"/v1beta1/publishers/google/models/{model_name}:embedContent"
+        vertex_url = vertex_path
+        if query_string:
+            vertex_url = f"{vertex_path}?{query_string}"
+
+        vertex_body = json.dumps({"content": content}).encode()
+
+        req = client.build_request(
+            method="POST",
+            url=vertex_url,
+            content=vertex_body,
+            headers=upstream_headers,
+        )
+
+        try:
+            upstream_resp = await client.send(req, stream=True)
+        except (
+            httpx.TimeoutException,
+            httpx.ConnectError,
+            httpx.ReadError,
+            httpx.RemoteProtocolError,
+        ) as exc:
+            duration_ms = int((time.perf_counter() - start_ts) * 1000)
+            await self._record_failure(
+                provider=provider,
+                operation=operation,
+                endpoint=endpoint,
+                duration_ms=duration_ms,
+                reason=str(exc),
+            )
+            return FastAPIResponse(
+                content=json.dumps(
+                    {
+                        "error": {
+                            "message": f"upstream unreachable: {exc}",
+                            "type": "api_error",
+                        }
+                    }
+                ).encode(),
+                status_code=502,
+                media_type="application/json",
+            )
+
+        try:
+            raw_body = await upstream_resp.aread()
+        finally:
+            await upstream_resp.aclose()
+
+        duration_ms = int((time.perf_counter() - start_ts) * 1000)
+        status = upstream_resp.status_code
+        content_type = upstream_resp.headers.get("content-type", "").lower()
+        resp_headers = _filter_response_headers(dict(upstream_resp.headers))
+
+        # Extract usage
+        extraction = ExtractionResult()
+        if "application/json" in content_type and raw_body:
+            try:
+                parsed = json.loads(raw_body.decode("utf-8", errors="replace"))
+                if isinstance(parsed, dict):
+                    extraction = extract_usage(
+                        provider, operation, parsed, status, dict(upstream_resp.headers)
+                    )
+            except (json.JSONDecodeError, UnicodeDecodeError):
+                pass
+
+        vendor_label = _VENDOR_LABEL[provider]
+        await self._record_usage(
+            provider=provider,
+            operation=operation,
+            endpoint=endpoint,
+            duration_ms=duration_ms,
+            status=status,
+            extraction=extraction,
+            evidence_records=_build_nonstream_evidence(
+                vendor=vendor_label, extraction=extraction
+            ),
+        )
+
+        return FastAPIResponse(
+            content=raw_body,
+            status_code=status,
+            headers=resp_headers,
+            media_type=content_type or None,
+        )
+
+    async def _vertex_batch_embed(
+        self,
+        *,
+        client: httpx.AsyncClient,
+        model_name: str,
+        body: dict,
+        upstream_headers: dict[str, str],
+        query_string: str,
+        provider: str,
+        operation: str,
+        endpoint: str,
+        start_ts: float,
+    ) -> StarletteResponse:
+        """Split batchEmbedContents into multiple embedContent calls and aggregate the responses."""
+        from fastapi.responses import Response as FastAPIResponse
+
+        requests_list = body.get("requests", [])
+        if not requests_list:
+            return FastAPIResponse(
+                content=json.dumps(
+                    {
+                        "error": {
+                            "message": "batchEmbedContents requires non-empty 'requests' field"
+                        }
+                    }
+                ).encode(),
+                status_code=400,
+                media_type="application/json",
+            )
+
+        vertex_path = f"/v1beta1/publishers/google/models/{model_name}:embedContent"
+        vertex_url = vertex_path
+        if query_string:
+            vertex_url = f"{vertex_path}?{query_string}"
+
+        # Send all embedContent requests concurrently
+        async def _single(req_body: dict) -> tuple[dict, int]:
+            content = req_body.get("content", req_body)
+            vertex_body = json.dumps({"content": content}).encode()
+            req = client.build_request(
+                method="POST",
+                url=vertex_url,
+                content=vertex_body,
+                headers=upstream_headers,
+            )
+            try:
+                resp = await client.send(req, stream=False)
+            except (
+                httpx.TimeoutException,
+                httpx.ConnectError,
+                httpx.ReadError,
+                httpx.RemoteProtocolError,
+            ) as exc:
+                return {"error": {"message": f"upstream unreachable: {exc}"}}, 502
+            try:
+                return resp.json(), resp.status_code
+            except Exception:
+                return {"error": {"message": resp.text[:200]}}, resp.status_code
+
+        results = await asyncio.gather(*[_single(r) for r in requests_list])
+
+        # Check for failed requests
+        embeddings = []
+        for resp_json, resp_status in results:
+            if resp_status != 200:
+                # Return the first error
+                return FastAPIResponse(
+                    content=json.dumps(resp_json).encode(),
+                    status_code=resp_status,
+                    media_type="application/json",
+                )
+            embedding_data = resp_json.get("embedding", {})
+            embeddings.append(embedding_data)
+
+        # Aggregate into the batchEmbedContents response format
+        batch_response = {"embeddings": embeddings}
+        duration_ms = int((time.perf_counter() - start_ts) * 1000)
+
+        # Extract usage
+        extraction = ExtractionResult()
+        for resp_json, _ in results:
+            if isinstance(resp_json, dict):
+                ext = extract_usage(provider, operation, resp_json, 200, {})
+                extraction = ExtractionResult(
+                    input_tokens=extraction.input_tokens + ext.input_tokens,
+                    output_tokens=extraction.output_tokens + ext.output_tokens,
+                    cache_creation_tokens=extraction.cache_creation_tokens
+                    + ext.cache_creation_tokens,
+                    cache_read_tokens=extraction.cache_read_tokens
+                    + ext.cache_read_tokens,
+                    request_id=ext.request_id or extraction.request_id,
+                    model_served=ext.model_served or extraction.model_served,
+                    raw_usage=ext.raw_usage or extraction.raw_usage,
+                    source_field_map=ext.source_field_map
+                    or extraction.source_field_map,
+                    evidence_kind=ext.evidence_kind or extraction.evidence_kind,
+                    extra_usage=ext.extra_usage or extraction.extra_usage,
+                )
+
+        vendor_label = _VENDOR_LABEL[provider]
+        await self._record_usage(
+            provider=provider,
+            operation=operation,
+            endpoint=endpoint,
+            duration_ms=duration_ms,
+            status=200,
+            extraction=extraction,
+            evidence_records=_build_nonstream_evidence(
+                vendor=vendor_label, extraction=extraction
+            ),
+        )
+
+        return FastAPIResponse(
+            content=json.dumps(batch_response).encode(),
+            status_code=200,
+            media_type="application/json",
+        )
+
     # ── SSE streaming passthrough (accumulating usage along the way) ─────────────────────────
 
     async def _stream_and_accumulate(
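Reviewer note: the fan-out and aggregation done by `_vertex_batch_embed` can be sketched without any network access. This is an illustration under assumptions: `fake_embed` is a hypothetical stand-in for one upstream `embedContent` call, and `batch_embed` returns a `(payload, status)` tuple instead of a FastAPI response.

```python
import asyncio


async def fake_embed(content: dict) -> tuple[dict, int]:
    """Hypothetical stand-in for a single upstream embedContent call."""
    text = content["parts"][0]["text"]
    if not text:
        return {"error": {"message": "empty text"}}, 400
    return {"embedding": {"values": [float(len(text))]}}, 200


async def batch_embed(body: dict) -> tuple[dict, int]:
    """Split a batch request into single calls and merge the results in order."""
    requests_list = body.get("requests", [])
    if not requests_list:
        return {"error": {"message": "requests must be non-empty"}}, 400

    # asyncio.gather returns results in the order the awaitables were passed,
    # so embeddings line up with the original requests.
    results = await asyncio.gather(
        *[fake_embed(r.get("content", r)) for r in requests_list]
    )

    embeddings = []
    for payload, status in results:
        if status != 200:
            # Mirror the patch: surface the first failure and drop the rest.
            return payload, status
        embeddings.append(payload.get("embedding", {}))
    return {"embeddings": embeddings}, 200
```

One design consequence worth noting: every sub-request is sent even when an early one fails, and a single failure discards the embeddings that did succeed.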
diff --git a/src/coding/proxy/native_api/operation.py b/src/coding/proxy/native_api/operation.py
index 12f3307..2080b6c 100644
--- a/src/coding/proxy/native_api/operation.py
+++ b/src/coding/proxy/native_api/operation.py
@@ -48,30 +48,34 @@ class _Rule:
 )
 
 # ── Gemini ────────────────────────────────────────────────────────
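-# Gemini puts the method verb in a path suffix (``:generateContent``); extract it with a regex
+# Gemini puts the method verb in a path suffix (``:generateContent``); extract it with a regex.
+# The ``v1(?:beta1?)?/`` prefix is optional to tolerate the litellm `_check_custom_proxy` bug
+# that drops the version segment when a custom ``api_base`` is set (see litellm issue #17759).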
 _GEMINI_RULES: tuple[_Rule, ...] = (
     _Rule(
-        re.compile(r"^/?v1(?:beta)?/models/[^/]+:streamGenerateContent/?$"),
+        re.compile(
+            r"^/?(?:v1(?:beta1?)?/)?models/[^/]+(?:%3A|:)streamGenerateContent/?$"
+        ),
         "generate_content",
     ),
     _Rule(
-        re.compile(r"^/?v1(?:beta)?/models/[^/]+:generateContent/?$"),
+        re.compile(r"^/?(?:v1(?:beta1?)?/)?models/[^/]+(?:%3A|:)generateContent/?$"),
         "generate_content",
     ),
     _Rule(
-        re.compile(r"^/?v1(?:beta)?/models/[^/]+:countTokens/?$"),
+        re.compile(r"^/?(?:v1(?:beta1?)?/)?models/[^/]+(?:%3A|:)countTokens/?$"),
         "count_tokens",
     ),
     _Rule(
-        re.compile(r"^/?v1(?:beta)?/models/[^/]+:embedContent/?$"),
+        re.compile(r"^/?(?:v1(?:beta1?)?/)?models/[^/]+(?:%3A|:)embedContent/?$"),
         "embedding",
     ),
     _Rule(
-        re.compile(r"^/?v1(?:beta)?/models/[^/]+:batchEmbedContents/?$"),
+        re.compile(r"^/?(?:v1(?:beta1?)?/)?models/[^/]+(?:%3A|:)batchEmbedContents/?$"),
         "embedding.batch",
     ),
     _Rule(
-        re.compile(r"^/?v1(?:beta)?/models/[^/]+:predict/?$"),
+        re.compile(r"^/?(?:v1(?:beta1?)?/)?models/[^/]+(?:%3A|:)predict/?$"),
         "predict",
     ),
     _Rule(
@@ -159,7 +163,8 @@ def is_stream_path(provider: str, path: str) -> bool:
         normalized = path if path.startswith("/") else f"/{path}"
         return bool(
             re.match(
-                r"^/?v1(?:beta)?/models/[^/]+:streamGenerateContent/?$", normalized
+                r"^/?(?:v1(?:beta1?)?/)?models/[^/]+(?:%3A|:)streamGenerateContent/?$",
+                normalized,
             )
         )
 
diff --git a/src/coding/proxy/routing/executor.py b/src/coding/proxy/routing/executor.py
index 9d33ca9..4c37f02 100644
--- a/src/coding/proxy/routing/executor.py
+++ b/src/coding/proxy/routing/executor.py
@@ -6,7 +6,9 @@
 
 from __future__ import annotations
 
+import json
 import logging
+import re
 import time
 from collections.abc import AsyncIterator
 from typing import Any
@@ -43,10 +45,320 @@
 # Backward-compatible aliases
 BackendResponse = VendorResponse
 NoCompatibleBackendError = NoCompatibleVendorError
-from ..compat.canonical import CompatibilityStatus, build_canonical_request
+from ..compat.canonical import (
+    CanonicalPartType,
+    CompatibilityStatus,
+    build_canonical_request,
+)
+from ..model.compat import CanonicalRequest
 
 logger = logging.getLogger(__name__)
 
+_SESSION_TITLE_MAX_LEN = 30
+
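+# "Noise" tags injected by Claude Code: system-level context that must not leak into session titles.
+# The CC harness concatenates these tags into the first user message content; they are nearly
+# identical across sessions, so using them as titles would make sessions indistinguishable.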
+_NOISE_TAG_PATTERN = re.compile(
+    r"<(?P<tag>system-reminder|user-preferences|"
+    r"local-command-stdout|local-command-stderr|"
+    r"bash-input|bash-stdout|bash-stderr|"
+    r"ide_selection|stdin|system_instruction)\b[^>]*>"
+    r".*?</(?P=tag)>",
+    flags=re.DOTALL | re.IGNORECASE,
+)
+
+# Slash-command sub-tags: detect command-style invocations such as /commit or /review
+# and compose a "command + args" title.
+_CMD_NAME_PATTERN = re.compile(r"<command-name>(.*?)</command-name>", flags=re.DOTALL)
+_CMD_ARGS_PATTERN = re.compile(r"<command-args>(.*?)</command-args>", flags=re.DOTALL)
+# Strip leftover command-* wrapper tags (secondary tags such as command-message/command-stdout).
+_CMD_WRAPPER_PATTERN = re.compile(
+    r"<command-[\w-]+>.*?</command-[\w-]+>", flags=re.DOTALL
+)
+
+
+def _sanitize_user_text(raw: str) -> str:
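+    """Strip the system-level XML blocks injected by Claude Code to recover the real user input.
+
+    Processing order:
+    1. Slash commands first: if <command-name> is present, compose a "command + args"
+       title (the remaining text is usually empty, so the tag content is more useful).
+    2. Generic noise stripping: remove system-reminder and the other known noise tags.
+    3. Leftover command-* wrappers: as a fallback, remove secondary tags such as command-message.
+    4. Whitespace normalization: collapse runs of whitespace into one space before the 30-character truncation.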
+    """
+    if not raw:
+        return ""
+
+    # Stage 1: slash-command short circuit
+    cmd = _CMD_NAME_PATTERN.search(raw)
+    if cmd:
+        name = cmd.group(1).strip()
+        args_match = _CMD_ARGS_PATTERN.search(raw)
+        args = args_match.group(1).strip() if args_match else ""
+        composed = f"{name} {args}".strip() if args else name
+        if composed:
+            return composed
+
+    # Stage 2: generic noise stripping
+    cleaned = _NOISE_TAG_PATTERN.sub("", raw)
+    cleaned = _CMD_WRAPPER_PATTERN.sub("", cleaned)
+
+    # Stage 3: whitespace collapsing
+    return re.sub(r"\s+", " ", cleaned).strip()
+
+
+def _extract_session_title(request: CanonicalRequest) -> str:
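+    """Extract the first user message text from the canonical request as the session title.
+
+    Skips the system-level XML blocks injected by Claude Code (system-reminder, user-preferences, etc.)
+    so the title reflects what the user actually typed rather than near-identical system templates.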
+    """
+    for part in request.messages:
+        if part.role != "user" or part.type != CanonicalPartType.TEXT:
+            continue
+        cleaned = _sanitize_user_text(part.text)
+        if cleaned:
+            return cleaned[:_SESSION_TITLE_MAX_LEN]
+    return ""
+
+
+def _build_semantic_rejection_diagnostic(body: dict[str, Any]) -> str:
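+    """Build diagnostic context about the request body for a semantic rejection.
+
+    Appends a snapshot of suspicious request parameters to the semantic rejection log
+    to help pinpoint the parameter that failed the vendor's validation.
+
+    Coverage:
+      * model / message count (baseline)
+      * top-level thinking parameters + number of thinking blocks in history
+      * system shape (string / blocks, with cache_control count)
+      * tool count + tool_choice shape
+      * sampling parameters (max_tokens / temperature / top_p / top_k / stop_sequences)
+      * stream / metadata shape
+      * presence of cache_control
+      * distribution of messages.content types
+      * estimated request body size (byte length of json.dumps)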
+    """
+    parts: list[str] = []
+
+    # ── Model + message count (baseline, always emitted) ──
+    parts.append(f"model={body.get('model', 'N/A')}")
+    parts.append(f"messages={len(body.get('messages', []))}")
+
+    # ── Top-level thinking parameters ──
+    for key in ("thinking", "extended_thinking", "reasoning_effort"):
+        if key in body:
+            val = body[key]
+            parts.append(f"{key}={val!r:.80}")
+
+    # ── system shape ──
+    system = body.get("system")
+    if isinstance(system, str):
+        parts.append(f"system_kind=string(len={len(system)})")
+    elif isinstance(system, list):
+        cc_count = sum(
+            1 for item in system if isinstance(item, dict) and "cache_control" in item
+        )
+        if cc_count:
+            parts.append(f"system_blocks={len(system)},cc={cc_count}")
+        else:
+            parts.append(f"system_blocks={len(system)}")
+
+    # ── tools and tool_choice ──
+    tools = body.get("tools")
+    if isinstance(tools, list):
+        parts.append(f"tools={len(tools)}")
+    tool_choice = body.get("tool_choice")
+    if tool_choice is not None:
+        parts.append(f"tool_choice={tool_choice!r:.60}")
+
+    # ── Sampling parameters (emitted only when present) ──
+    for key in ("max_tokens", "temperature", "top_p", "top_k"):
+        if key in body:
+            parts.append(f"{key}={body[key]!r:.40}")
+    stop_sequences = body.get("stop_sequences")
+    if isinstance(stop_sequences, list) and stop_sequences:
+        parts.append(f"stop_sequences={len(stop_sequences)}")
+
+    # ── stream / metadata ──
+    if "stream" in body:
+        parts.append(f"stream={body['stream']}")
+    metadata = body.get("metadata")
+    if isinstance(metadata, dict) and metadata:
+        parts.append(f"metadata_keys={len(metadata)}")
+
+    # ── Thinking blocks in history and content_types distribution ──
+    thinking_count = 0
+    content_type_counts: dict[str, int] = {}
+    for msg in body.get("messages", []):
+        content = msg.get("content")
+        if isinstance(content, str):
+            content_type_counts["string"] = content_type_counts.get("string", 0) + 1
+            continue
+        if not isinstance(content, list):
+            continue
+        for block in content:
+            if not isinstance(block, dict):
+                continue
+            btype = block.get("type")
+            if isinstance(btype, str):
+                content_type_counts[btype] = content_type_counts.get(btype, 0) + 1
+            if btype in ("thinking", "redacted_thinking"):
+                thinking_count += 1
+    if thinking_count:
+        parts.append(f"thinking_blocks_in_history={thinking_count}")
+    if content_type_counts:
+        type_repr = ",".join(f"{k}:{v}" for k, v in sorted(content_type_counts.items()))
+        parts.append(f"content_types={{{type_repr}}}")
+
+    # ── cache_control presence check (messages / tools; system is counted separately above) ──
+    has_cc = False
+    sections: list[Any] = []
+    for m in body.get("messages", []):
+        if isinstance(m.get("content"), list):
+            sections.append(m["content"])
+    if isinstance(body.get("tools"), list):
+        sections.append(body["tools"])
+    for section in sections:
+        for item in section:
+            if isinstance(item, dict) and "cache_control" in item:
+                has_cc = True
+                break
+        if has_cc:
+            break
+    if has_cc:
+        parts.append("cache_control_fields=present")
+
+    # ── Estimated request body size ──
+    try:
+        body_bytes = len(json.dumps(body, ensure_ascii=False).encode("utf-8"))
+        parts.append(f"body_bytes={body_bytes}")
+    except (TypeError, ValueError):
+        # In rare cases body contains non-serializable objects; skip
+        pass
+
+    return f" [{', '.join(parts)}]" if parts else ""
+
+
+def _build_semantic_rejection_diagnostic(body: dict[str, Any]) -> str:
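+    """Build diagnostic context about the request body for a semantic rejection.
+
+    Appends a snapshot of suspicious request parameters to the semantic rejection log
+    to help pinpoint the parameter that failed the vendor's validation.
+
+    Coverage:
+      * model / message count (baseline)
+      * top-level thinking parameters + number of thinking blocks in history
+      * system shape (string / blocks, with cache_control count)
+      * tool count + tool_choice shape
+      * sampling parameters (max_tokens / temperature / top_p / top_k / stop_sequences)
+      * stream / metadata shape
+      * presence of cache_control
+      * distribution of messages.content types
+      * estimated request body size (byte length of json.dumps)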
+    """
+    parts: list[str] = []
+
+    # ── Model + message count (baseline, always emitted) ──
+    parts.append(f"model={body.get('model', 'N/A')}")
+    parts.append(f"messages={len(body.get('messages', []))}")
+
+    # ── Top-level thinking parameters ──
+    for key in ("thinking", "extended_thinking", "reasoning_effort"):
+        if key in body:
+            val = body[key]
+            parts.append(f"{key}={val!r:.80}")
+
+    # ── system shape ──
+    system = body.get("system")
+    if isinstance(system, str):
+        parts.append(f"system_kind=string(len={len(system)})")
+    elif isinstance(system, list):
+        cc_count = sum(
+            1 for item in system if isinstance(item, dict) and "cache_control" in item
+        )
+        if cc_count:
+            parts.append(f"system_blocks={len(system)},cc={cc_count}")
+        else:
+            parts.append(f"system_blocks={len(system)}")
+
+    # ── tools and tool_choice ──
+    tools = body.get("tools")
+    if isinstance(tools, list):
+        parts.append(f"tools={len(tools)}")
+    tool_choice = body.get("tool_choice")
+    if tool_choice is not None:
+        parts.append(f"tool_choice={tool_choice!r:.60}")
+
+    # ── Sampling parameters (emitted only when present) ──
+    for key in ("max_tokens", "temperature", "top_p", "top_k"):
+        if key in body:
+            parts.append(f"{key}={body[key]!r:.40}")
+    stop_sequences = body.get("stop_sequences")
+    if isinstance(stop_sequences, list) and stop_sequences:
+        parts.append(f"stop_sequences={len(stop_sequences)}")
+
+    # ── stream / metadata ──
+    if "stream" in body:
+        parts.append(f"stream={body['stream']}")
+    metadata = body.get("metadata")
+    if isinstance(metadata, dict) and metadata:
+        parts.append(f"metadata_keys={len(metadata)}")
+
+    # ── thinking blocks in the conversation history and content-type distribution ──
+    thinking_count = 0
+    content_type_counts: dict[str, int] = {}
+    for msg in body.get("messages") or []:
+        content = msg.get("content") if isinstance(msg, dict) else None
+        if isinstance(content, str):
+            content_type_counts["string"] = content_type_counts.get("string", 0) + 1
+            continue
+        if not isinstance(content, list):
+            continue
+        for block in content:
+            if not isinstance(block, dict):
+                continue
+            btype = block.get("type")
+            if isinstance(btype, str):
+                content_type_counts[btype] = content_type_counts.get(btype, 0) + 1
+            if btype in ("thinking", "redacted_thinking"):
+                thinking_count += 1
+    if thinking_count:
+        parts.append(f"thinking_blocks_in_history={thinking_count}")
+    if content_type_counts:
+        type_repr = ",".join(f"{k}:{v}" for k, v in sorted(content_type_counts.items()))
+        parts.append(f"content_types={{{type_repr}}}")
+
+    # ── cache_control presence (messages / tools; system is counted separately above) ──
+    has_cc = False
+    sections: list[Any] = []
+    for m in body.get("messages") or []:
+        if isinstance(m, dict) and isinstance(m.get("content"), list):
+            sections.append(m["content"])
+    if isinstance(body.get("tools"), list):
+        sections.append(body["tools"])
+    for section in sections:
+        for item in section:
+            if isinstance(item, dict) and "cache_control" in item:
+                has_cc = True
+                break
+        if has_cc:
+            break
+    if has_cc:
+        parts.append("cache_control_fields=present")
+
+    # ── estimated request body size ──
+    try:
+        body_bytes = len(json.dumps(body, ensure_ascii=False).encode("utf-8"))
+        parts.append(f"body_bytes={body_bytes}")
+    except (TypeError, ValueError):
+        # In rare cases body holds non-serializable objects; skip
+        pass
+
+    return f" [{', '.join(parts)}]" if parts else ""
+
 
 def _log_http_error_detail(
     tier_name: str,
@@ -341,10 +653,16 @@ async def execute_stream(
         failed_tier_name: str | None = None
         request_caps = build_request_capabilities(body)
         canonical_request = build_canonical_request(body, headers)
-        session_record = await self._session_mgr.get_or_create_record(
+        session_record, is_new_session = await self._session_mgr.get_or_create_record(
             canonical_request.session_key,
             canonical_request.trace_id,
         )
+        if is_new_session:
+            title = _extract_session_title(canonical_request)
+            if title:
+                await self._recorder.set_session_title(
+                    canonical_request.session_key, title
+                )
         incompatible_reasons: list[str] = []
         effective_tiers = self._resolve_effective_tiers(canonical_request.session_key)
         last_idx = len(effective_tiers) - 1
@@ -512,10 +830,16 @@ async def execute_message(
         failed_tier_name: str | None = None
         request_caps = build_request_capabilities(body)
         canonical_request = build_canonical_request(body, headers)
-        session_record = await self._session_mgr.get_or_create_record(
+        session_record, is_new_session = await self._session_mgr.get_or_create_record(
             canonical_request.session_key,
             canonical_request.trace_id,
         )
+        if is_new_session:
+            title = _extract_session_title(canonical_request)
+            if title:
+                await self._recorder.set_session_title(
+                    canonical_request.session_key, title
+                )
         incompatible_reasons: list[str] = []
         effective_tiers = self._resolve_effective_tiers(canonical_request.session_key)
         last_idx = len(effective_tiers) - 1
@@ -601,10 +925,17 @@ async def execute_message(
                     )
 
                 if not is_last and is_semantic:
+                    diagnostic = _build_semantic_rejection_diagnostic(body)
+                    # Vendor error bodies (e.g. zhipu) carry field-level detail such as a [1210]
+                    # code + request_id; 500 chars should keep it from being truncated
+                    err_msg = (resp.error_message or "N/A")[:500]
                     logger.warning(
-                        "Tier %s semantic rejection (%s), trying next tier without recording failure",
+                        "Tier %s semantic rejection (type=%s, msg=%s)%s, "
+                        "trying next tier without recording failure",
                         tier.name,
                         resp.error_type or resp.status_code,
+                        err_msg,
+                        diagnostic,
                     )
                     failed_tier_name = tier.name
                     continue
@@ -836,6 +1167,20 @@ async def _handle_http_error(
                 )
 
             if semantic_rejection and not is_last:
+                if request_body is not None:
+                    diagnostic = _build_semantic_rejection_diagnostic(request_body)
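+                    # message may be missing or None; fall back before slicing
+                    raw_msg = error.get("message") if isinstance(error, dict) else None
+                    stream_err_msg = str(raw_msg or "N/A")
+                    # Keep up to 500 chars to preserve field-level diagnostics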
+                    logger.warning(
+                        "Tier %s stream semantic rejection (type=%s, msg=%s)%s, "
+                        "trying next tier without recording failure",
+                        tier.name,
+                        error.get("type") if isinstance(error, dict) else None,
+                        stream_err_msg[:500],
+                        diagnostic,
+                    )
                 return True, tier.name, exc
 
             rl_info = parse_rate_limit_headers(
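
The diagnostic helper in this file caps each logged value with the `!r:.80` format spec. A minimal sketch of that idiom follows, assuming a hypothetical `snapshot` helper and made-up request values; it is an illustration, not the project's code.

```python
from typing import Any


def snapshot(body: dict[str, Any], keys: tuple[str, ...], width: int = 80) -> str:
    """Render selected keys as key=repr(value), with each repr capped at `width` chars."""
    parts = []
    for key in keys:
        if key in body:
            # `!r` applies repr() first; the `.{width}` precision then truncates that string.
            parts.append(f"{key}={body[key]!r:.{width}}")
    return f" [{', '.join(parts)}]" if parts else ""


# Hypothetical request body, for illustration only.
body = {"model": "demo-model", "thinking": {"type": "enabled", "budget_tokens": 10_000}}
print(snapshot(body, ("model", "thinking", "top_k"), width=20))
```

Truncating the repr rather than the raw value keeps quoting visible in the log, so a string `'1'` and an integer `1` stay distinguishable.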
diff --git a/src/coding/proxy/routing/session_manager.py b/src/coding/proxy/routing/session_manager.py
index 845ac87..aaef0ba 100644
--- a/src/coding/proxy/routing/session_manager.py
+++ b/src/coding/proxy/routing/session_manager.py
@@ -19,13 +19,18 @@ def __init__(self, compat_session_store: CompatSessionStore | None = None) -> No
 
     async def get_or_create_record(
         self, session_key: str, trace_id: str
-    ) -> CompatSessionRecord | None:
+    ) -> tuple[CompatSessionRecord | None, bool]:
+        """Get or create the compat session record.
+
+        Returns:
+            (record, is_new): is_new is True when the record was created by this call.
+        """
         if self._store is None:
-            return None
+            return None, False
         record = await self._store.get(session_key)
         if record is not None:
-            return record
-        return CompatSessionRecord(session_key=session_key, trace_id=trace_id)
+            return record, False
+        return CompatSessionRecord(session_key=session_key, trace_id=trace_id), True
 
     def apply_compat_context(
         self,
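
The `(record, is_new)` return shape can be sketched in isolation. `Record` and `InMemoryStore` below are hypothetical stand-ins (the real `CompatSessionRecord` and `CompatSessionStore` are not shown in this diff), and persisting the record is left to the caller as an assumption.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class Record:  # hypothetical stand-in for CompatSessionRecord
    session_key: str
    trace_id: str


class InMemoryStore:  # hypothetical store, for illustration only
    def __init__(self) -> None:
        self._rows: dict[str, Record] = {}

    async def get(self, key: str) -> Record | None:
        return self._rows.get(key)

    def put(self, record: Record) -> None:
        self._rows[record.session_key] = record


async def get_or_create(
    store: InMemoryStore | None, key: str, trace_id: str
) -> tuple[Record | None, bool]:
    if store is None:
        return None, False  # no store configured: nothing to create
    record = await store.get(key)
    if record is not None:
        return record, False
    return Record(session_key=key, trace_id=trace_id), True


async def main() -> list[bool]:
    store = InMemoryStore()
    first, is_new_1 = await get_or_create(store, "s1", "t1")
    store.put(first)  # in this sketch the caller persists the new record
    _, is_new_2 = await get_or_create(store, "s1", "t2")
    return [is_new_1, is_new_2]


print(asyncio.run(main()))
```

In this sketch, `is_new` stays `True` on later calls until the record is actually stored. Whether the real store persists the record before the next request arrives is not visible in this diff.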
diff --git a/src/coding/proxy/routing/usage_recorder.py b/src/coding/proxy/routing/usage_recorder.py
index 525a6c1..8887c09 100644
--- a/src/coding/proxy/routing/usage_recorder.py
+++ b/src/coding/proxy/routing/usage_recorder.py
@@ -28,6 +28,11 @@ def __init__(
     def set_pricing_table(self, table: PricingTable) -> None:
         self._pricing_table = table
 
+    async def set_session_title(self, session_key: str, title: str) -> None:
+        """Set the title for a new session (delegates to TokenLogger)."""
+        if self._token_logger:
+            await self._token_logger.set_session_title(session_key, title)
+
     # ── 用量信息构建 ──────────────────────────────────────
 
     @staticmethod
diff --git a/src/coding/proxy/server/dashboard.py b/src/coding/proxy/server/dashboard.py
index 07bd6a3..75dd812 100644
--- a/src/coding/proxy/server/dashboard.py
+++ b/src/coding/proxy/server/dashboard.py
@@ -411,6 +411,7 @@ def _build_favicon() -> bytes:
     .session-table td.cell-tags { white-space: normal; overflow: visible; text-overflow: clip; line-height: 1.8; vertical-align: middle; }
     .session-table tr:hover td { background: var(--bg-card-hover); }
     .session-table .session-key { font-family: 'JetBrains Mono', monospace; font-size: 12px; color: var(--accent-blue); cursor: default; white-space: nowrap; overflow: hidden; text-overflow: ellipsis; }
+    .session-table .session-title { font-size: 12px; color: var(--text-secondary); white-space: nowrap; overflow: hidden; text-overflow: ellipsis; max-width: 0; }
     .session-id { display: flex; align-items: center; gap: 4px; }
     .session-id-text { overflow: hidden; text-overflow: ellipsis; }
     .copy-btn { background: none; border: none; color: var(--text-tertiary); cursor: pointer; padding: 2px; border-radius: 4px; font-size: 12px; line-height: 1; opacity: .5; flex-shrink: 0; }
@@ -556,6 +557,126 @@ def _build_favicon() -> bytes:
     .tab-btn:focus-visible { outline: 2px solid var(--accent-blue); outline-offset: 2px; }
     .tab-pane { display: none; }
     .tab-pane.active { display: block; }
+
+    /* ── Model Calling live status ─────────────────────── */
+    .model-calling-card {
+      margin-bottom: 5px;
+    }
+    .mc-empty {
+      text-align: center;
+      color: var(--text-muted);
+      padding: 16px 0;
+      font-size: 13px;
+    }
+    .mc-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fill, minmax(320px, 1fr));
+      gap: 8px;
+    }
+    .mc-model-row {
+      display: flex;
+      align-items: center;
+      gap: 10px;
+      padding: 8px 12px;
+      background: var(--bg-secondary);
+      border-radius: var(--radius-sm);
+      border: 1px solid var(--border-subtle);
+    }
+    .mc-model-name {
+      font-family: 'JetBrains Mono', monospace;
+      font-size: 12px;
+      color: var(--text-primary);
+      min-width: 140px;
+      white-space: nowrap;
+      overflow: hidden;
+      text-overflow: ellipsis;
+    }
+    .mc-bar-wrap {
+      flex: 1;
+      min-width: 60px;
+      height: 6px;
+      background: rgba(255,255,255,.06);
+      border-radius: 3px;
+      overflow: hidden;
+    }
+    .mc-bar-fill {
+      height: 100%;
+      border-radius: 3px;
+      transition: width .3s ease, background .3s ease;
+    }
+    .mc-bar-fill.mc-low { background: var(--accent-green); }
+    .mc-bar-fill.mc-mid { background: var(--accent-yellow); }
+    .mc-bar-fill.mc-high { background: var(--accent-red); }
+    .mc-stats {
+      display: flex;
+      align-items: center;
+      gap: 6px;
+      font-size: 11px;
+      font-family: 'JetBrains Mono', monospace;
+      color: var(--text-muted);
+      white-space: nowrap;
+    }
+    .mc-badge {
+      display: inline-flex;
+      align-items: center;
+      padding: 1px 6px;
+      border-radius: 4px;
+      font-size: 10px;
+      font-weight: 600;
+      font-family: 'JetBrains Mono', monospace;
+    }
+    .mc-badge-pending {
+      background: rgba(251,146,60,.15);
+      color: #fb923c;
+    }
+    .mc-badge-active {
+      background: rgba(74,222,128,.12);
+      color: #4ade80;
+    }
+    .mc-vendor-tag {
+      font-size: 10px;
+      color: var(--text-muted);
+      background: rgba(255,255,255,.06);
+      padding: 1px 6px;
+      border-radius: 3px;
+    }
+    .mc-limit-editable {
+      cursor: pointer;
+      border-bottom: 1px dashed rgba(74,222,128,.4);
+      transition: border-color .2s, color .2s;
+    }
+    .mc-limit-editable:hover {
+      border-bottom-color: #4ade80;
+      color: #4ade80;
+    }
+    .mc-limit-input {
+      width: 36px;
+      background: var(--bg-primary);
+      border: 1px solid var(--accent-blue);
+      border-radius: 3px;
+      color: var(--text-primary);
+      font-size: 10px;
+      font-family: 'JetBrains Mono', monospace;
+      text-align: center;
+      padding: 0 2px;
+      outline: none;
+      -moz-appearance: textfield;
+    }
+    .mc-limit-input::-webkit-outer-spin-button,
+    .mc-limit-input::-webkit-inner-spin-button {
+      -webkit-appearance: none;
+      margin: 0;
+    }
+    .mc-limit-flash-ok { animation: mc-flash-ok .6s ease; }
+    .mc-limit-flash-err { animation: mc-flash-err .6s ease; }
+    @keyframes mc-flash-ok {
+      0%,100% { color: inherit; }
+      40% { color: #4ade80; }
+    }
+    @keyframes mc-flash-err {
+      0%,100% { color: inherit; }
+      40% { color: #f87171; }
+    }
   </style>
 </head>
 <body>
@@ -625,6 +746,14 @@ def _build_favicon() -> bytes:
     </div>
   </div>
 
+  <!-- Model Calling live status -->
+  <div class="card model-calling-card" id="model-calling-card">
+    <div class="card-title">📡 Model Calling 实时状态</div>
+    <div class="model-calling-wrap" id="model-calling-wrap">
+      <div class="mc-empty">加载中…</div>
+    </div>
+  </div>
+
   <!-- 供应商状态 + 请求量趋势折线图 -->
   <div class="charts-grid">
     <div class="card">
@@ -676,20 +805,22 @@ def _build_favicon() -> bytes:
     <div class="session-table-wrap" id="sessions-table-wrap">
       <table class="session-table">
         <colgroup>
-          <col style="width:12%">
-          <col style="width:7%">
+          <col style="width:10%">
+          <col style="width:15%">
           <col style="width:6%">
+          <col style="width:5%">
+          <col style="width:5%">
+          <col style="width:15%">
+          <col style="width:10%">
           <col style="width:6%">
-          <col style="width:17%">
-          <col style="width:12%">
-          <col style="width:7%">
-          <col style="width:9%">
-          <col style="width:12%">
-          <col style="width:12%">
+          <col style="width:8%">
+          <col style="width:10%">
+          <col style="width:10%">
         </colgroup>
         <thead>
           <tr>
             <th>Session ID</th>
+            <th>Title</th>
             <th>Last Active</th>
             <th>Requests</th>
             <th>Tokens</th>
@@ -702,7 +833,7 @@ def _build_favicon() -> bytes:
           </tr>
         </thead>
         <tbody id="sessions-tbody">
-          <tr><td colspan="10" class="empty">Loading...</td></tr>
+          <tr><td colspan="11" class="empty">Loading...</td></tr>
         </tbody>
       </table>
       <div class="session-pagination" id="session-pagination">
@@ -1131,6 +1262,148 @@ def _build_favicon() -> bytes:
   }).join('');
 }
 
+// ── Model Calling live status ────────────────────────────────
+function updateModelCalling(status) {
+  var wrap = document.getElementById('model-calling-wrap');
+  if (!wrap || _mcEditing) return;  // skip re-render while a limit is being edited
+  var tiers = status.tiers || [];
+
+  // Collect every model that reports concurrency diagnostics
+  var models = [];
+  for (var i = 0; i < tiers.length; i++) {
+    var tier = tiers[i];
+    var diag = tier.diagnostics || {};
+    var conc = diag.concurrency;
+    if (!conc) continue;
+    var names = Object.keys(conc);
+    for (var j = 0; j < names.length; j++) {
+      var model = names[j];
+      var d = conc[model];
+      models.push({
+        vendor: tier.name,
+        model: model,
+        limit: d.limit || 0,
+        in_use: d.in_use || 0,
+        available: d.available || 0,
+        pending: d.pending || 0,
+      });
+    }
+  }
+
+  if (!models.length) {
+    wrap.innerHTML = '<div class="mc-empty">无活跃模型调用</div>';
+    return;
+  }
+
+  var html = '<div class="mc-grid">';
+  for (var k = 0; k < models.length; k++) {
+    var m = models[k];
+    var pct = m.limit > 0 ? Math.round((m.in_use / m.limit) * 100) : 0;
+    var barClass = pct <= 50 ? 'mc-low' : (pct <= 80 ? 'mc-mid' : 'mc-high');
+
+    html += '<div class="mc-model-row">'
+      + '<span class="mc-model-name">' + escapeHtml(m.vendor + '/' + m.model) + '</span>'
+      + '<div class="mc-bar-wrap"><div class="mc-bar-fill ' + barClass + '" style="width:' + pct + '%"></div></div>'
+      + '<div class="mc-stats">'
+      + '<span class="mc-badge mc-badge-active">' + m.in_use
+      + '/<span class="mc-limit-editable" data-tier="' + escapeHtml(m.vendor) + '" data-model="' + escapeHtml(m.model) + '" data-limit="' + m.limit + '" title="点击修改并行度">' + m.limit + '</span></span>'
+      + (m.pending > 0 ? '<span class="mc-badge mc-badge-pending">⏳ ' + m.pending + '</span>' : '')
+      + '</div>'
+      + '</div>';
+  }
+  html += '</div>';
+  wrap.innerHTML = html;
+}
+
+// Model Calling uses its own short-interval poll
+var _mcTimer = null;
+function startModelCallingPoll() {
+  stopModelCallingPoll();
+  function tick() {
+    fetchJSON('/api/status').then(function(status) {
+      updateModelCalling(status);
+    }).catch(function() {});
+  }
+  tick();
+  _mcTimer = setInterval(tick, 5000);
+}
+function stopModelCallingPoll() {
+  if (_mcTimer) { clearInterval(_mcTimer); _mcTimer = null; }
+}
+
+// ── Runtime editing of the concurrency limit ──────────────
+var _mcEditing = false;
+document.addEventListener('click', function(e) {
+  if (_mcEditing) return;
+  var el = e.target.closest('.mc-limit-editable');
+  if (!el) return;
+  e.preventDefault();
+  _mcEditing = true;
+  var oldVal = el.getAttribute('data-limit');
+  var tier = el.getAttribute('data-tier');
+  var model = el.getAttribute('data-model');
+  var input = document.createElement('input');
+  input.type = 'number';
+  input.className = 'mc-limit-input';
+  input.min = '1';
+  input.max = '20';
+  input.value = oldVal;
+  el.style.display = 'none';
+  el.parentNode.insertBefore(input, el.nextSibling);
+  input.focus();
+  input.select();
+
+  var _cancelled = false;
+
+  function restore() {
+    _mcEditing = false;
+    if (input.parentNode) input.parentNode.removeChild(input);
+    el.style.display = '';
+  }
+
+  function flash(cls) {
+    el.classList.add(cls);
+    setTimeout(function() { el.classList.remove(cls); }, 600);
+  }
+
+  input.addEventListener('keydown', function(ev) {
+    if (ev.key === 'Escape') { _cancelled = true; restore(); return; }
+    if (ev.key !== 'Enter') return;
+    ev.preventDefault();
+    submit();
+  });
+
+  input.addEventListener('blur', function() {
+    setTimeout(function() { if (!_cancelled) submit(); }, 50);
+  });
+  function submit() {
+    if (_cancelled) return;
+    _cancelled = true;  // also blocks the blur that follows Enter from submitting twice
+    var v = parseInt(input.value, 10);
+    if (isNaN(v) || v < 1 || v > 20) { restore(); flash('mc-limit-flash-err'); return; }
+    if (String(v) === oldVal) { restore(); return; }
+    fetch('/api/concurrency', {
+      method: 'PUT',
+      headers: {'Content-Type': 'application/json'},
+      body: JSON.stringify({tier: tier, model: model, limit: v})
+    }).then(function(res) {
+      if (res.ok) {
+        return res.json().then(function() {
+          el.textContent = v;
+          el.setAttribute('data-limit', v);
+          flash('mc-limit-flash-ok');
+        });
+      } else {
+        flash('mc-limit-flash-err');
+      }
+    }).catch(function() {
+      flash('mc-limit-flash-err');
+    }).finally(function() {
+      restore();
+    });
+  }
+});
+
 // ── 按 tiers 顺序排序 vendor 列表 ─────────────────────────
 function sortByTierOrder(vendors, tierOrder) {
   if (!tierOrder || !tierOrder.length) return vendors.sort();
@@ -1573,7 +1846,7 @@ def _build_favicon() -> bytes:
   var tbody = document.getElementById('sessions-tbody');
 
   if (!total) {
-    tbody.innerHTML = '<tr><td colspan="10" class="empty"><div class="empty-icon">📭</div>No session data</td></tr>';
+    tbody.innerHTML = '<tr><td colspan="11" class="empty"><div class="empty-icon">📭</div>No session data</td></tr>';
   } else {
     tbody.innerHTML = page.map(function(s) {
       var parsed = parseSessionKey(s.session_key);
@@ -1582,6 +1855,7 @@ def _build_favicon() -> bytes:
       var modelsFull = (s.models || '').split(',').map(function(c){return c.trim();});
       var vendorsFull = (s.vendors || '').split(',').map(function(v){return formatVendorLabel(v.trim());});
       var sr = s.success_rate != null ? Math.round(s.success_rate) : null;
+      var sessionTitle = s.title || '';
       return '<tr data-row onclick="toggleRow(this)">' +
         '<td class="session-key" onclick="event.stopPropagation()">' +
           '<div class="session-id" data-key="' + escapeHtml(s.session_key) + '" title="' + escapeHtml(s.session_key) + '">' +
@@ -1592,6 +1866,7 @@ def _build_favicon() -> bytes:
             'dev:' + escapeHtml(shortId(parsed.device_id, 8)) + ' · acct:' + escapeHtml(shortId(parsed.account_uuid, 8)) +
           '</div>' +
         '</td>' +
+        '<td class="session-title" title="' + escapeHtml(sessionTitle) + '">' + (sessionTitle ? escapeHtml(sessionTitle) : '–') + '</td>' +
         '<td>' + relativeTime(s.last_active_ts) + '</td>' +
         '<td style="font-family:JetBrains Mono,monospace">' + fmtNum(s.total_requests) + '</td>' +
         '<td style="font-family:JetBrains Mono,monospace">' + fmtTokens(s.total_tokens) + '</td>' +
@@ -1602,9 +1877,10 @@ def _build_favicon() -> bytes:
         '<td onclick="event.stopPropagation()">' + selectHtml + '</td>' +
         '<td>' + formatCategories(s.client_categories) + '</td>' +
         '</tr>' +
-        '<tr class="row-detail"><td colspan="10"><div class="detail-card">' +
+        '<tr class="row-detail"><td colspan="11"><div class="detail-card">' +
           '<div class="detail-identity-row">' +
             '<div class="detail-item"><div class="detail-label">Session ID</div><div class="detail-value" title="' + escapeHtml(s.session_key) + '">' + escapeHtml(parsed.session_id || s.session_key) + '</div></div>' +
+            '<div class="detail-item"><div class="detail-label">Title</div><div class="detail-value">' + (sessionTitle ? escapeHtml(sessionTitle) : '–') + '</div></div>' +
             '<div class="detail-item"><div class="detail-label">Device</div><div class="detail-value" title="' + escapeHtml(parsed.device_id || '') + '">' + (parsed.device_id ? escapeHtml(parsed.device_id) : '–') + '</div></div>' +
             '<div class="detail-item"><div class="detail-label">Account</div><div class="detail-value" title="' + escapeHtml(parsed.account_uuid || '') + '">' + (parsed.account_uuid ? escapeHtml(parsed.account_uuid) : '–') + '</div></div>' +
           '</div>' +
@@ -1707,6 +1983,7 @@ def _build_favicon() -> bytes:
 
   updateKPI(summary);
   updateVendorStatus(status);
+  updateModelCalling(status);
   updateChartTitles(days);
 
   const rows = timeline.rows || [];
@@ -1782,6 +2059,8 @@ def _build_favicon() -> bytes:
   currentTab = name;
   applyTabState(name);
   syncTabUrl(name);
+  // Start or stop Model Calling polling when the tab changes
+  if (name === 'overview') { startModelCallingPoll(); } else { stopModelCallingPoll(); }
   refresh();
 }
 
@@ -1801,6 +2080,7 @@ def _build_favicon() -> bytes:
   }).catch(function(){});
   refresh();                     // 仅加载初始页签的数据
   setInterval(refresh, 600000);  // 每 10 分钟刷新当前页签
+  if (initial === 'overview') startModelCallingPoll();
 })();
 </script>
 </body>
diff --git a/src/coding/proxy/server/factory.py b/src/coding/proxy/server/factory.py
index a1f64a3..4e7632d 100644
--- a/src/coding/proxy/server/factory.py
+++ b/src/coding/proxy/server/factory.py
@@ -156,13 +156,17 @@ def _create_vendor_from_config(
             cfg = _resolve_antigravity_credentials(cfg, token_store)
             return AntigravityVendor(cfg, failover_cfg, mapper)
         case "zhipu":
-            cfg = ZhipuConfig(
-                enabled=vendor_cfg.enabled,
-                base_url=vendor_cfg.base_url
+            zhipu_kwargs: dict[str, Any] = {
+                "enabled": vendor_cfg.enabled,
+                "base_url": vendor_cfg.base_url
                 or "https://open.bigmodel.cn/api/anthropic",
-                api_key=vendor_cfg.api_key,
-                timeout_ms=vendor_cfg.timeout_ms,
-            )
+                "api_key": vendor_cfg.api_key,
+                "timeout_ms": vendor_cfg.timeout_ms,
+            }
+            # Forward concurrency only when explicitly configured; otherwise ZhipuConfig's default applies
+            if vendor_cfg.concurrency is not None:
+                zhipu_kwargs["concurrency"] = vendor_cfg.concurrency
+            cfg = ZhipuConfig(**zhipu_kwargs)
             return ZhipuVendor(cfg, mapper, failover_cfg)
         case "minimax":
             cfg = MinimaxConfig(
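
The zhipu branch above builds its keyword arguments conditionally so that an unset `concurrency` falls back to the config class default instead of passing `None`. A minimal sketch of that pattern follows, assuming a hypothetical `DemoConfig` with an invented default; the real `ZhipuConfig` fields are not shown in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class DemoConfig:  # hypothetical stand-in for ZhipuConfig
    api_key: str
    # Invented default, used only to make the fallback observable.
    concurrency: dict[str, int] = field(default_factory=lambda: {"default": 2})


def build_config(api_key: str, concurrency: dict[str, int] | None) -> DemoConfig:
    kwargs: dict[str, Any] = {"api_key": api_key}
    # Omit the key entirely when unset so the dataclass default applies.
    if concurrency is not None:
        kwargs["concurrency"] = concurrency
    return DemoConfig(**kwargs)


print(build_config("k", None).concurrency)
print(build_config("k", {"demo-model": 5}).concurrency)
```

Passing `concurrency=None` directly would override the default with `None`, which is the case the conditional avoids.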
diff --git a/src/coding/proxy/server/routes.py b/src/coding/proxy/server/routes.py
index 7f157f0..7c13d2f 100644
--- a/src/coding/proxy/server/routes.py
+++ b/src/coding/proxy/server/routes.py
@@ -150,14 +150,15 @@ async def count_tokens(request: Request) -> Response:
 
         source = infer_source_vendor_from_body(body)
         if source:
-            channel_fn = get_transition_channel(source, target_vendor.name)
+            target_name = target_vendor.get_name()
+            channel_fn = get_transition_channel(source, target_name)
             if channel_fn is not None:
                 body, adaptations = channel_fn(body)
                 if adaptations:
                     logger.debug(
                         "count_tokens channel %s → %s: %s",
                         source,
-                        target_vendor.name,
+                        target_name,
                         ", ".join(adaptations),
                     )
 
@@ -224,6 +225,61 @@ async def status() -> dict:
         return result
 
 
+def register_concurrency_route(app: Any, router: Any) -> None:
+    """Register the route for adjusting concurrency limits at runtime."""
+
+    @app.put("/api/concurrency")
+    async def update_concurrency(request: Request) -> Response:
+        try:
+            body = await request.json()
+        except Exception:
+            return json_error_response(
+                400, error_type="invalid_request_error", message="body must be JSON"
+            )
+        tier_name = body.get("tier")
+        model = body.get("model")
+        limit = body.get("limit")
+        if not tier_name or not model or limit is None:
+            return json_error_response(
+                400,
+                error_type="invalid_request_error",
+                message="requires tier, model, limit",
+            )
+        if type(limit) is not int or limit < 1 or limit > 20:  # bool is an int subclass; reject it
+            return json_error_response(
+                400,
+                error_type="invalid_request_error",
+                message="limit must be an integer between 1 and 20",
+            )
+        for tier in router.tiers:
+            if tier.name == tier_name:
+                vendor = tier.vendor
+                update_fn = getattr(vendor, "update_concurrency", None)
+                if update_fn is None:
+                    return json_error_response(
+                        400,
+                        error_type="invalid_request_error",
+                        message=f"vendor '{tier_name}' does not support concurrency",
+                    )
+                try:
+                    update_fn(model, limit)
+                except (ValueError, AttributeError) as exc:
+                    return json_error_response(
+                        400, error_type="invalid_request_error", message=str(exc)
+                    )
+                return Response(
+                    content=json.dumps(
+                        {"ok": True, "tier": tier_name, "model": model, "limit": limit},
+                        ensure_ascii=False,
+                    ).encode(),
+                    status_code=200,
+                    media_type="application/json",
+                )
+        return json_error_response(
+            404, error_type="not_found", message=f"tier '{tier_name}' not found"
+        )
+
+
 def register_copilot_routes(app: Any, router: Any) -> None:
     """注册 Copilot 诊断与模型探测路由."""
     from .factory import _find_copilot_vendor
@@ -456,6 +512,7 @@ def register_all_routes(
     register_core_routes(app, router)
     register_health_routes(app)
     register_status_route(app, router)
+    register_concurrency_route(app, router)
     register_copilot_routes(app, router)
     register_admin_routes(app, router)
     register_session_vendor_routes(app, router)
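
The range check in `update_concurrency` depends on Python's type rules: `bool` is a subclass of `int`, so a JSON `true` satisfies a plain `isinstance(limit, int)` test and is then treated as `1`. A sketch of a stricter check follows, using a hypothetical `validate_limit` helper and the 1 to 20 bounds from the route.

```python
def validate_limit(limit: object, low: int = 1, high: int = 20) -> int:
    """Return `limit` if it is a real integer within [low, high], else raise ValueError."""
    # Reject bool explicitly: isinstance(True, int) is True in Python.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValueError("limit must be an integer")
    if not low <= limit <= high:
        raise ValueError(f"limit must be between {low} and {high}")
    return limit


for candidate in (5, True, 0, "3"):
    try:
        print(candidate, "->", validate_limit(candidate))
    except ValueError as exc:
        print(candidate, "-> rejected:", exc)
```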
diff --git a/src/coding/proxy/vendors/antigravity.py b/src/coding/proxy/vendors/antigravity.py
index b9bbfb5..b4d7199 100644
--- a/src/coding/proxy/vendors/antigravity.py
+++ b/src/coding/proxy/vendors/antigravity.py
@@ -141,7 +141,14 @@ def __init__(
             config.refresh_token,
         )
         TokenBackendMixin.__init__(self, token_manager)
-        BaseVendor.__init__(self, config.base_url, config.timeout_ms, failover_config)
+        # v1internal mode: strip the /v1internal path suffix from base_url,
+        # because the endpoint uses the full path /v1internal:generateContent (colon form).
+        # httpx joins the base_url path with the endpoint path,
+        # so a base_url that still contains /v1internal would duplicate the path.
+        init_base_url = config.base_url
+        if init_base_url.rstrip("/").endswith("/v1internal"):
+            init_base_url = init_base_url.rstrip("/").removesuffix("/v1internal")
+        BaseVendor.__init__(self, init_base_url, config.timeout_ms, failover_config)
         self._model_endpoint = config.model_endpoint
         self._model_mapper = model_mapper
         self._default_model = config.model_endpoint.removeprefix("models/")
@@ -149,6 +156,7 @@ def __init__(
         self._safety_settings = config.safety_settings
         # v1internal 协议字段
         self._project_id: str = config.project_id
+        self._v1internal_enabled: bool = "v1internal" in config.base_url
         self._session_id: str = uuid.uuid4().hex[:16]
         self._message_count: int = 0
         # project_id 自动发现状态
@@ -159,8 +167,11 @@ def get_name(self) -> str:
         return "antigravity"
 
     def _is_v1internal_mode(self) -> bool:
-        """检测是否启用 v1internal 协议模式（与 Antigravity-Manager 对齐）."""
-        return bool(self._effective_project_id) and "v1internal" in self._base_url
+        """Return whether v1internal protocol mode is enabled (aligned with Antigravity-Manager).
+
+        The mode is triggered by the originally configured base_url path or by project_id auto-discovery.
+        """
+        return self._v1internal_enabled
 
     @property
     def _effective_project_id(self) -> str:
@@ -229,7 +240,11 @@ async def _discover_project_id(self, access_token: str) -> str:
                 return ""
 
             # 发现成功：原子性切换到 v1internal 模式
-            self._base_url = _V1INTERNAL_BASE_URL
+            # Keep only the host part of base_url (strip the /v1internal path suffix)
+            self._base_url = _V1INTERNAL_BASE_URL.rstrip("/").removesuffix(
+                "/v1internal"
+            )
+            self._v1internal_enabled = True
             self._project_id_discovered = project_id
 
             # 重建 HTTP 客户端（base_url 是初始化参数）
@@ -339,8 +354,13 @@ async def _prepare_request(
         self._last_request_adaptations = converted.adaptations
         token = await self._token_manager.get_token()
 
-        # 懒加载：未配置 project_id 时自动发现并切换 v1internal 模式
-        if not self._project_id and not self._project_discovery_attempted:
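+        # Lazy discovery: when project_id is not configured, try auto-discovery (only needed in standard GLA mode)
+        # v1internal mode does not depend on project_id, so discovery is skipped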
+        if (
+            not self._project_id
+            and not self._project_discovery_attempted
+            and not self._v1internal_enabled
+        ):
             discovered = await self._discover_project_id(token)
             if discovered:
                 logger.info(
@@ -450,11 +470,11 @@ async def send_message(
         body, prepared_headers = await self._prepare_request(request_body, headers)
         client = self._get_client()
         resolved_model = self._last_resolved_model
-        endpoint = (
-            ":generateContent"
-            if self._is_v1internal_mode()
-            else f"/models/{resolved_model}:generateContent"
-        )
+        if self._is_v1internal_mode():
+            # The v1internal endpoint needs the full path (colon form), overriding the path part of base_url
+            endpoint = "/v1internal:generateContent"
+        else:
+            endpoint = f"/models/{resolved_model}:generateContent"
 
         logger.debug("send_message: POST %s", endpoint)
         response = await client.post(endpoint, json=body, headers=prepared_headers)
@@ -496,11 +516,10 @@ async def send_message_stream(
         body, prepared_headers = await self._prepare_request(request_body, headers)
         client = self._get_client()
         resolved_model = self._last_resolved_model
-        endpoint = (
-            ":streamGenerateContent?alt=sse"
-            if self._is_v1internal_mode()
-            else f"/models/{resolved_model}:streamGenerateContent?alt=sse"
-        )
+        if self._is_v1internal_mode():
+            endpoint = "/v1internal:streamGenerateContent?alt=sse"
+        else:
+            endpoint = f"/models/{resolved_model}:streamGenerateContent?alt=sse"
 
         logger.debug("send_message_stream: POST %s", endpoint)
 
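Reviewer note: the base-URL and endpoint changes above can be checked in isolation. The following is a minimal sketch, assuming `_V1INTERNAL_BASE_URL` has the value used by the e2e fixtures in this patch (`https://cloudcode-pa.googleapis.com/v1internal`). `split_v1internal_base` and `build_endpoint` are hypothetical helpers written for illustration; they are not part of the patch.

```python
# Assumed value, taken from the e2e fixtures rather than from the vendor module.
V1INTERNAL_BASE_URL = "https://cloudcode-pa.googleapis.com/v1internal"


def split_v1internal_base(url: str) -> str:
    """Keep only scheme and host by dropping a trailing /v1internal segment."""
    return url.rstrip("/").removesuffix("/v1internal")


def build_endpoint(v1internal: bool, model: str, stream: bool = False) -> str:
    """Mirror the endpoint selection in send_message / send_message_stream."""
    action = "streamGenerateContent?alt=sse" if stream else "generateContent"
    if v1internal:
        # The path now carries the /v1internal prefix that base_url used to hold.
        return f"/v1internal:{action}"
    return f"/models/{model}:{action}"
```

With the prefix moved out of the base URL, host plus endpoint concatenates to the same request URL in both modes, which is what the hunk above relies on.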
diff --git a/src/coding/proxy/vendors/concurrency.py b/src/coding/proxy/vendors/concurrency.py
new file mode 100644
index 0000000..7944bdd
--- /dev/null
+++ b/src/coding/proxy/vendors/concurrency.py
@@ -0,0 +1,162 @@
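+"""Per-model concurrency limiter: queued waiting with runtime-adjustable limits.
+
+Maintains an independent ``_ConcurrencySlot`` for each mapped model (e.g. ``glm-5v-turbo``),
+so that the number of in-flight requests for that model never exceeds the configured limit.
+When every slot is taken, new requests queue and wait until a slot is released.
+
+Design notes:
+  - **Lazy creation**: a slot is created only when the first request for a model arrives
+  - **Approximately FIFO**: ``asyncio.Event`` waiters wake in arrival order, though a new caller may take the fast path first
+  - **Dynamic adjustment**: per-model limits can be changed at runtime without a restart
+  - **Keyed by mapped model name**: aligned with real upstream capacity, not the client-requested name
+"""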
+
+from __future__ import annotations
+
+import asyncio
+import logging
+
+from ..config.vendors import ZhipuConcurrencyConfig
+
+logger = logging.getLogger(__name__)
+
+
+class _ConcurrencySlot:
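+    """Concurrency slot with a dynamically adjustable limit.
+
+    Uses ``asyncio.Event`` as the wait/notify primitive: ``acquire`` awaits it,
+    and ``release`` / ``set_limit`` set it. After ``set_limit`` changes the limit,
+    all waiters are woken at once and each rechecks whether it can take a slot.
+    """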
+
+    def __init__(self, limit: int) -> None:
+        self._limit = limit
+        self._in_use: int = 0
+        self._pending: int = 0
+        self._wake = asyncio.Event()
+        self._wake.set()
+
+    async def acquire(self) -> _ConcurrencySlot:
+        """Acquire one concurrency slot, queueing if necessary.
+
+        Returns ``self``; the caller calls ``release()`` once the request completes.
+        """
+        # Fast path
+        if self._in_use < self._limit:
+            self._in_use += 1
+            return self
+        # Slow path: wait for a slot to be released
+        self._pending += 1
+        try:
+            while True:
+                self._wake.clear()
+                await self._wake.wait()
+                if self._in_use < self._limit:
+                    self._in_use += 1
+                    return self
+        finally:
+            self._pending -= 1
+
+    def release(self) -> None:
+        """Release one concurrency slot."""
+        self._in_use = max(0, self._in_use - 1)
+        self._wake.set()
+
+    def set_limit(self, new_limit: int) -> None:
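+        """Adjust the concurrency limit at runtime.
+
+        Raising the limit wakes waiters immediately; lowering it leaves slots that
+        are already held untouched, and the new limit applies to later acquires.
+        """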
+        self._limit = new_limit
+        self._wake.set()
+
+    @property
+    def limit(self) -> int:
+        return self._limit
+
+    @property
+    def in_use(self) -> int:
+        return self._in_use
+
+    @property
+    def available(self) -> int:
+        return max(0, self._limit - self._in_use)
+
+    @property
+    def pending(self) -> int:
+        return self._pending
+
+
+class ModelConcurrencyLimiter:
+    """Limiter that provides an independent concurrency slot per model name.
+
+    Usage::
+
+        limiter = ModelConcurrencyLimiter(config)
+        slot = await limiter.acquire("glm-5v-turbo")
+        try:
+            ...  # perform the request
+        finally:
+            slot.release()
+    """
+
+    def __init__(self, config: ZhipuConcurrencyConfig) -> None:
+        self._config = config
+        self._slots: dict[str, _ConcurrencySlot] = {}
+
+    def _get_or_create_slot(self, model: str) -> _ConcurrencySlot:
+        """Return the concurrency slot for a model, creating it lazily."""
+        slot = self._slots.get(model)
+        if slot is None:
+            limit = self._config.get_limit(model)
+            slot = _ConcurrencySlot(limit)
+            self._slots[model] = slot
+            logger.debug(
+                "ModelConcurrencyLimiter: created slot model=%s limit=%d",
+                model,
+                limit,
+            )
+        return slot
+
+    async def acquire(self, model: str) -> _ConcurrencySlot:
+        """Acquire the concurrency slot for a model, queueing if necessary.
+
+        Returns the acquired slot; the caller must call ``release()`` when the request completes.
+        """
+        slot = self._get_or_create_slot(model)
+        await slot.acquire()
+        return slot
+
+    def set_limit(self, model: str, new_limit: int) -> None:
+        """Change the concurrency limit for a model at runtime.
+
+        Also updates config.models so that later lazy creation uses the new value.
+        """
+        slot = self._slots.get(model)
+        if slot is None:
+            slot = _ConcurrencySlot(new_limit)
+            self._slots[model] = slot
+        else:
+            slot.set_limit(new_limit)
+        self._config.models[model] = new_limit
+        logger.info(
+            "ModelConcurrencyLimiter: updated limit model=%s new_limit=%d",
+            model,
+            new_limit,
+        )
+
+    def get_diagnostics(self) -> dict[str, dict[str, int]]:
+        """Return a snapshot of per-model concurrency state (for observability)."""
+        snapshot: dict[str, dict[str, int]] = {}
+        for model, slot in self._slots.items():
+            snapshot[model] = {
+                "limit": slot.limit,
+                "in_use": slot.in_use,
+                "available": slot.available,
+                "pending": slot.pending,
+            }
+        return snapshot
+
+
+__all__ = ["ModelConcurrencyLimiter"]
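Reviewer note: the wait/notify pattern in `concurrency.py` can be exercised without the rest of the proxy. The sketch below is a simplified stand-in, not the patch's class: `Slot` and `run_demo` are hypothetical names, the fast and slow paths are folded into one loop, and the `pending` counter and `set_limit` are omitted. It only shows that the number of concurrent holders stays at or below the limit.

```python
import asyncio


class Slot:
    """Simplified stand-in for _ConcurrencySlot (illustrative only)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.in_use = 0
        self._wake = asyncio.Event()

    async def acquire(self) -> None:
        # Recheck after every wake-up: several waiters may be woken by one release.
        while self.in_use >= self.limit:
            self._wake.clear()
            await self._wake.wait()
        self.in_use += 1

    def release(self) -> None:
        self.in_use = max(0, self.in_use - 1)
        self._wake.set()


async def run_demo(limit: int, jobs: int) -> int:
    """Push `jobs` fake requests through one slot and return the peak concurrency."""
    slot = Slot(limit)
    peak = 0

    async def job() -> None:
        nonlocal peak
        await slot.acquire()
        try:
            peak = max(peak, slot.in_use)
            await asyncio.sleep(0.01)  # stands in for the upstream request
        finally:
            slot.release()

    await asyncio.gather(*(job() for _ in range(jobs)))
    return peak
```

Calling `asyncio.run(run_demo(2, 5))` runs five jobs through a slot of size two; the returned peak is the highest number of jobs that held the slot at the same time.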
diff --git a/src/coding/proxy/vendors/zhipu.py b/src/coding/proxy/vendors/zhipu.py
index 528cabf..64407ba 100644
--- a/src/coding/proxy/vendors/zhipu.py
+++ b/src/coding/proxy/vendors/zhipu.py
@@ -1,23 +1,64 @@
-"""智谱 GLM 供应商 — 原生 Anthropic 兼容端点薄透传代理.
+"""Zhipu GLM vendor: proxy for the native Anthropic-compatible endpoint (compat conversion + 429 retry).
 
-官方端点 (https://open.bigmodel.cn/api/anthropic) 已完整支持
-Anthropic Messages API 协议，本模块仅做两项最小适配：
+The official endpoint (https://open.bigmodel.cn/api/anthropic) supports most of the
+Anthropic Messages API protocol. This module applies the following adaptations:
   1. 模型名映射（Claude -> GLM）
   2. 认证头替换（x-api-key）
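+  3. Parameter compatibility conversion when zhipu is the first tier (_prepare_request)
+
+Observed GLM handling of Anthropic extension parameters (verified by testing):
+- thinking.type="enabled": supported natively (GLM has its own thinking mechanism)
+- thinking.type="adaptive": unsupported, triggers parameter error [1210]; converted to enabled + budget
+- cache_control field: silently ignored (GLM uses implicit automatic caching)
+- reasoning_effort parameter: silently ignored
+- metadata field: not handled yet (compatibility pending further diagnosis)
+
+Also provides a dedicated retry mechanism for 429 rate limits:
+  - max_attempts = 5 (1 initial + 4 retries)
+  - Exponential backoff + Full Jitter (1s → 2s → 4s → 8s)
+  - A server retry-after header takes priority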
 """
 
 from __future__ import annotations
 
+import asyncio
+import json
+import logging
+from collections.abc import AsyncIterator
+from typing import Any
+
+import httpx
+
 from ..config.schema import FailoverConfig, ZhipuConfig
 from ..routing.model_mapper import ModelMapper
+from ..routing.rate_limit import (
+    compute_effective_retry_seconds,
+    parse_rate_limit_headers,
+)
+from ..routing.retry import RetryConfig, calculate_delay
+from .base import VendorResponse
+from .concurrency import ModelConcurrencyLimiter
 from .native_anthropic import NativeAnthropicVendor
 
+logger = logging.getLogger(__name__)
+
+# Default retry configuration for 429 rate limits
+_RATE_LIMIT_RETRY = RetryConfig(
+    max_retries=4,  # 4 retries + 1 initial attempt = 5 attempts in total
+    initial_delay_ms=1000,
+    max_delay_ms=30000,
+    backoff_multiplier=2.0,
+    jitter=True,
+)
+
 
 class ZhipuVendor(NativeAnthropicVendor):
-    """智谱 GLM 原生 Anthropic 兼容端点供应商（薄透传）.
+    """Zhipu GLM vendor for the native Anthropic-compatible endpoint (thin passthrough + 429 retry).
 
     通过官方 /api/anthropic 端点转发请求，
     仅替换模型名和认证头，其余原样透传。
+
+    On a 429 rate limit, retries automatically (exponential backoff) to reduce failover frequency.
     """
 
     _vendor_name = "zhipu"
@@ -30,7 +71,269 @@ def __init__(
         failover_config: FailoverConfig | None = None,
     ) -> None:
         super().__init__(config, model_mapper, failover_config)
+        self._rl_retry = _RATE_LIMIT_RETRY
+        # Per-model concurrency limiter (disabled when config.concurrency is None)
+        self._concurrency_limiter: ModelConcurrencyLimiter | None = (
+            ModelConcurrencyLimiter(config.concurrency)
+            if config.concurrency is not None
+            else None
+        )
+
+    # ── First-tier parameter compatibility conversion ─────────
+
+    # Default budget when converting adaptive thinking to enabled (the adaptive-equivalent value recommended by Anthropic)
+    _ADAPTIVE_THINKING_BUDGET = 16000
+
+    async def _prepare_request(
+        self,
+        request_body: dict[str, Any],
+        headers: dict[str, Any],
+    ) -> tuple[dict[str, Any], dict[str, str]]:
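+        """Deep copy + model mapping + auth header replacement + GLM compatibility conversion.
+
+        When zhipu is the first tier (source_vendor=None), the request body comes straight from
+        the client and skips the cross-vendor conversion path. Known GLM-incompatible parameters
+        are converted here (not removed), preserving the full CC (Claude Code) feature set.
+        """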
+        body, new_headers = await super()._prepare_request(request_body, headers)
+
+        adaptations: list[str] = []
+
+        # thinking.type="adaptive" is a type added in Anthropic Claude 4.x;
+        # GLM does not support this value and returns parameter error [1210].
+        # Convert it to enabled + budget to keep the thinking capability.
+        thinking = body.get("thinking")
+        if isinstance(thinking, dict) and thinking.get("type") == "adaptive":
+            body["thinking"] = {
+                "type": "enabled",
+                "budget_tokens": self._ADAPTIVE_THINKING_BUDGET,
+            }
+            adaptations.append(
+                "converted_thinking_adaptive→enabled"
+                f"(budget={self._ADAPTIVE_THINKING_BUDGET})"
+            )
+
+        if adaptations:
+            logger.debug(
+                "ZhipuVendor first-tier compat: %s%s",
+                ", ".join(adaptations),
+                _build_zhipu_request_snapshot(body),
+            )
+
+        return body, new_headers
+
+    # ── Non-streaming: 429 retry ────────────────────────────
+
+    async def send_message(
+        self,
+        request_body: dict[str, Any],
+        headers: dict[str, str],
+    ) -> VendorResponse:
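+        """Non-streaming request with automatic retry on 429.
+
+        The per-model concurrency slot is acquired outside the 429 retry loop, so the number of
+        in-flight requests for one model never exceeds the configured limit; excess requests queue.
+        """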
+        sem = await self._maybe_acquire_concurrency_slot(request_body)
+        try:
+            return await self._send_message_with_retry(request_body, headers)
+        finally:
+            if sem is not None:
+                sem.release()
+
+    async def _send_message_with_retry(
+        self,
+        request_body: dict[str, Any],
+        headers: dict[str, str],
+    ) -> VendorResponse:
+        """Original send_message body (without concurrency control)."""
+        max_attempts = self._rl_retry.max_attempts
+
+        for attempt in range(max_attempts):
+            resp = await super().send_message(request_body, headers)
+            if resp.status_code != 429:
+                return resp
+
+            if attempt == max_attempts - 1:
+                logger.warning(
+                    "Zhipu 429 rate limit exhausted after %d attempts",
+                    max_attempts,
+                )
+                return resp
+
+            delay = self._compute_retry_delay_from_headers(
+                resp.response_headers, attempt
+            )
+            logger.info(
+                "Zhipu 429 rate limit, retry %d/%d in %.1fms",
+                attempt + 1,
+                max_attempts - 1,
+                delay,
+            )
+            await asyncio.sleep(delay / 1000.0)
+
+        return resp  # pragma: no cover
+
+    # ── Streaming: 429 retry ────────────────────────────────
+
+    async def send_message_stream(
+        self,
+        request_body: dict[str, Any],
+        headers: dict[str, str],
+    ) -> AsyncIterator[bytes]:
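+        """Streaming request with automatic retry on 429.
+
+        Safety: BaseVendor.send_message_stream raises on 429 during the
+        status-code check (before any chunk is yielded),
+        so a retry cannot leave already-emitted data inconsistent.
+
+        The per-model concurrency slot is acquired outside the 429 retry loop, so streaming and
+        non-streaming requests share one limiter and the total in-flight count per model is capped.
+        """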
+        sem = await self._maybe_acquire_concurrency_slot(request_body)
+        max_attempts = self._rl_retry.max_attempts
+
+        try:
+            for attempt in range(max_attempts):
+                try:
+                    # A 429 is raised during the status-code check (before any chunk),
+                    # so __anext__ is safe: it either returns the first chunk or raises.
+                    ait = super().send_message_stream(request_body, headers)
+                    head = await ait.__anext__()
+                except StopAsyncIteration:
+                    return
+                except httpx.HTTPStatusError as exc:
+                    if exc.response is None or exc.response.status_code != 429:
+                        raise
+                    if attempt == max_attempts - 1:
+                        logger.warning(
+                            "Zhipu 429 stream rate limit exhausted after %d attempts",
+                            max_attempts,
+                        )
+                        raise
+
+                    delay = self._compute_retry_delay_from_response(
+                        exc.response, attempt
+                    )
+                    logger.info(
+                        "Zhipu 429 stream rate limit, retry %d/%d in %.1fms",
+                        attempt + 1,
+                        max_attempts - 1,
+                        delay,
+                    )
+                    await asyncio.sleep(delay / 1000.0)
+                    continue
+
+                # yield sits outside try/except so exceptions thrown in via athrow are not caught
+                yield head
+                async for chunk in ait:
+                    yield chunk
+                return
+        finally:
+            if sem is not None:
+                sem.release()
+
+    # ── Concurrency control ─────────────────────────────────
+
+    async def _maybe_acquire_concurrency_slot(
+        self,
+        request_body: dict[str, Any],
+    ) -> asyncio.Semaphore | None:
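+        """Acquire a slot keyed by the mapped model name; return None when concurrency is not configured.
+
+        ``map_model()`` is a pure synchronous dict lookup, so calling it before waiting on the slot is safe and keeps
+        the queue key aligned with the upstream model. The returned object is the limiter's slot, which exposes ``release()``.
+        """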
+        if self._concurrency_limiter is None:
+            return None
+        raw_model = request_body.get("model", "") if request_body else ""
+        mapped_model = self.map_model(raw_model) if raw_model else ""
+        if not mapped_model:
+            return None
+        return await self._concurrency_limiter.acquire(mapped_model)
+
+    # ── Diagnostics ─────────────────────────────────────────
+
+    def get_diagnostics(self) -> dict[str, Any]:
+        """Return vendor runtime diagnostics, including per-model concurrency state."""
+        diagnostics = super().get_diagnostics()
+        if self._concurrency_limiter is not None:
+            diagnostics["concurrency"] = self._concurrency_limiter.get_diagnostics()
+        return diagnostics
+
+    def update_concurrency(self, model: str, limit: int) -> None:
+        """Update the concurrency limit for a model at runtime."""
+        if self._concurrency_limiter is None:
+            msg = "Concurrency limiter is not enabled for this vendor"
+            raise ValueError(msg)
+        self._concurrency_limiter.set_limit(model, limit)
+
+    # ── Delay calculation ───────────────────────────────────
+
+    def _compute_retry_delay_from_headers(
+        self,
+        headers: dict[str, str] | None,
+        attempt: int,
+    ) -> float:
+        """Compute the retry delay in milliseconds, preferring the server's retry-after."""
+        rl_info = parse_rate_limit_headers(headers, 429, None)
+        server_delay_s = compute_effective_retry_seconds(rl_info)
+        if server_delay_s is not None:
+            return min(server_delay_s * 1000, self._rl_retry.max_delay_ms)
+        return calculate_delay(attempt, self._rl_retry)
+
+    def _compute_retry_delay_from_response(
+        self,
+        response: httpx.Response,
+        attempt: int,
+    ) -> float:
+        """Compute the retry delay in milliseconds from the headers of an httpx.Response."""
+        rl_info = parse_rate_limit_headers(
+            response.headers,
+            response.status_code,
+            response.text[:500] if response.text else None,
+        )
+        server_delay_s = compute_effective_retry_seconds(rl_info)
+        if server_delay_s is not None:
+            return min(server_delay_s * 1000, self._rl_retry.max_delay_ms)
+        return calculate_delay(attempt, self._rl_retry)
 
 
 # 向后兼容别名
 ZhipuBackend = ZhipuVendor
+
+
+def _build_zhipu_request_snapshot(body: dict[str, Any]) -> str:
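+    """Build a lightweight parameter snapshot of a request sent to zhipu, for diagnostic logs.
+
+    The output format matches executor._build_semantic_rejection_diagnostic,
+    so logs of successful and failed requests can be diffed directly to locate the differing dimension.
+
+    Emitted only when a conversion happens (DEBUG level), to avoid routine log noise.
+    """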
+    parts: list[str] = []
+    parts.append(f"messages={len(body.get('messages', []))}")
+
+    thinking = body.get("thinking")
+    if isinstance(thinking, dict):
+        parts.append(f"thinking_type={thinking.get('type', 'unknown')}")
+
+    metadata = body.get("metadata")
+    if isinstance(metadata, dict) and metadata:
+        parts.append(f"metadata_keys={len(metadata)}")
+
+    tools = body.get("tools")
+    if isinstance(tools, list):
+        parts.append(f"tools={len(tools)}")
+
+    system = body.get("system")
+    if isinstance(system, list):
+        parts.append(f"system_blocks={len(system)}")
+
+    try:
+        body_bytes = len(json.dumps(body, ensure_ascii=False).encode("utf-8"))
+        parts.append(f"body_bytes={body_bytes}")
+    except (TypeError, ValueError):
+        pass
+
+    return f" [{', '.join(parts)}]" if parts else ""
diff --git a/tests/e2e/__init__.py b/tests/e2e/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/e2e/conftest.py b/tests/e2e/conftest.py
new file mode 100644
index 0000000..cf41f45
--- /dev/null
+++ b/tests/e2e/conftest.py
@@ -0,0 +1,199 @@
+"""Shared fixtures for E2E integration tests: loading real Antigravity credentials and building test objects."""
+
+from __future__ import annotations
+
+import os
+from typing import Any
+
+import pytest
+
+# ── Gate: e2e tests are skipped unless the environment variable is set (enforced in e2e_credentials) ──
+
+_SKIP_REASON = "Set RUN_ANTIGRAVITY_E2E=1 to enable Antigravity E2E tests"
+
+
+def pytest_configure(config: pytest.Config) -> None:
+    config.addinivalue_line(
+        "markers", "e2e: End-to-end tests requiring real Antigravity credentials"
+    )
+
+
+def _load_real_credentials() -> dict[str, str] | None:
+    """Load real Google OAuth credentials from ~/.coding-proxy/."""
+    from coding.proxy.auth.providers.google import (
+        _DEFAULT_CLIENT_ID,
+        _DEFAULT_CLIENT_SECRET,
+    )
+    from coding.proxy.auth.store import TokenStoreManager
+    from coding.proxy.config.loader import load_config
+
+    try:
+        token_store = TokenStoreManager()
+        token_store.load()
+        google_tokens = token_store.get("google")
+        if not google_tokens.refresh_token:
+            return None
+
+        config = load_config()
+
+        # Look up the antigravity entry in the vendors list
+        client_id = ""
+        client_secret = ""
+        base_url = ""
+        model_endpoint = "models/claude-sonnet-4-20250514"
+        project_id = ""
+
+        for vc in config.vendors:
+            if vc.vendor == "antigravity":
+                client_id = vc.client_id or _DEFAULT_CLIENT_ID
+                client_secret = vc.client_secret or _DEFAULT_CLIENT_SECRET
+                base_url = (
+                    vc.base_url or "https://generativelanguage.googleapis.com/v1beta"
+                )
+                model_endpoint = vc.model_endpoint or model_endpoint
+                break
+
+        # Prefer refresh_token from config.yaml, else the token store (which must hold one: see the early return above)
+        refresh_token = ""
+        for vc in config.vendors:
+            if vc.vendor == "antigravity" and vc.refresh_token:
+                refresh_token = vc.refresh_token
+                break
+        if not refresh_token:
+            refresh_token = google_tokens.refresh_token
+
+        return {
+            "client_id": client_id,
+            "client_secret": client_secret,
+            "refresh_token": refresh_token,
+            "base_url": base_url,
+            "model_endpoint": model_endpoint,
+            "project_id": project_id,
+        }
+    except Exception:
+        return None
+
+
+# ── Fixtures ──
+
+
+@pytest.fixture(scope="session")
+def e2e_credentials() -> dict[str, str]:
+    """Load real Antigravity OAuth credentials; skip on failure."""
+    if os.environ.get("RUN_ANTIGRAVITY_E2E") != "1":
+        pytest.skip(_SKIP_REASON)
+    creds = _load_real_credentials()
+    if creds is None:
+        pytest.skip("No valid Antigravity credentials found in ~/.coding-proxy/")
+    return creds
+
+
+@pytest.fixture(scope="session")
+def antigravity_config(e2e_credentials: dict[str, str]) -> Any:
+    """Build an AntigravityConfig for standard GLA mode."""
+    from coding.proxy.config.vendors import AntigravityConfig
+
+    return AntigravityConfig(
+        enabled=True,
+        client_id=e2e_credentials["client_id"],
+        client_secret=e2e_credentials["client_secret"],
+        refresh_token=e2e_credentials["refresh_token"],
+        base_url=e2e_credentials["base_url"],
+        model_endpoint=e2e_credentials["model_endpoint"],
+        timeout_ms=60000,
+    )
+
+
+@pytest.fixture(scope="session")
+def antigravity_config_v1internal(e2e_credentials: dict[str, str]) -> Any:
+    """Build an AntigravityConfig for v1internal mode (no project_id, which triggers auto-discovery)."""
+    from coding.proxy.config.vendors import AntigravityConfig
+
+    return AntigravityConfig(
+        enabled=True,
+        client_id=e2e_credentials["client_id"],
+        client_secret=e2e_credentials["client_secret"],
+        refresh_token=e2e_credentials["refresh_token"],
+        base_url="https://cloudcode-pa.googleapis.com/v1internal",
+        model_endpoint=e2e_credentials["model_endpoint"],
+        timeout_ms=60000,
+    )
+
+
+@pytest.fixture
+async def antigravity_vendor(antigravity_config: Any) -> Any:
+    """Build an AntigravityVendor in standard GLA mode (function scope, independent per test)."""
+    from coding.proxy.config.schema import FailoverConfig
+    from coding.proxy.routing.model_mapper import ModelMapper
+    from coding.proxy.vendors.antigravity import AntigravityVendor
+
+    vendor = AntigravityVendor(antigravity_config, FailoverConfig(), ModelMapper([]))
+    yield vendor
+    await vendor.close()
+
+
+@pytest.fixture
+async def antigravity_vendor_v1internal(antigravity_config_v1internal: Any) -> Any:
+    """Build an AntigravityVendor in v1internal mode."""
+    from coding.proxy.config.schema import FailoverConfig
+    from coding.proxy.routing.model_mapper import ModelMapper
+    from coding.proxy.vendors.antigravity import AntigravityVendor
+
+    vendor = AntigravityVendor(
+        antigravity_config_v1internal, FailoverConfig(), ModelMapper([])
+    )
+    yield vendor
+    await vendor.close()
+
+
+@pytest.fixture
+def minimal_request_body() -> dict[str, Any]:
+    """Minimal Anthropic-format request body (keeps token usage to a minimum)."""
+    return {
+        "model": "claude-sonnet-4-20250514",
+        "messages": [{"role": "user", "content": "Say exactly: pong"}],
+        "max_tokens": 32,
+    }
+
+
+@pytest.fixture(scope="session")
+def e2e_app(e2e_credentials: dict[str, str]) -> Any:
+    """Build a FastAPI app with only Antigravity enabled (temporary DB)."""
+    import tempfile
+
+    from coding.proxy.config.schema import ProxyConfig
+    from coding.proxy.server.app import create_app
+
+    tmpdir = tempfile.mkdtemp(prefix="e2e-antigravity-")
+    db_path = os.path.join(tmpdir, "usage.db")
+    compat_path = os.path.join(tmpdir, "compat.db")
+
+    config = ProxyConfig(
+        vendors=[
+            {
+                "vendor": "antigravity",
+                "enabled": True,
+                "client_id": e2e_credentials["client_id"],
+                "client_secret": e2e_credentials["client_secret"],
+                "refresh_token": e2e_credentials["refresh_token"],
+                "base_url": "https://cloudcode-pa.googleapis.com/v1internal",
+                "model_endpoint": e2e_credentials["model_endpoint"],
+                "timeout_ms": 60000,
+            },
+        ],
+        tiers=["antigravity"],
+        database={"path": db_path, "compat_state_path": compat_path},
+    )
+    return create_app(config)
+
+
+@pytest.fixture
+async def e2e_client(e2e_app: Any) -> Any:
+    """Build an async HTTP client (supports SSE streaming tests)."""
+    import httpx
+
+    transport = httpx.ASGITransport(app=e2e_app)
+    async with httpx.AsyncClient(
+        transport=transport, base_url="http://test", timeout=60.0
+    ) as client:
+        yield client
diff --git a/tests/e2e/test_e2e_http.py b/tests/e2e/test_e2e_http.py
new file mode 100644
index 0000000..fe84db5
--- /dev/null
+++ b/tests/e2e/test_e2e_http.py
@@ -0,0 +1,263 @@
+"""Level 3 E2E: full HTTP end-to-end, simulating Claude Code using Antigravity through coding-proxy."""
+
+from __future__ import annotations
+
+import json
+
+import pytest
+
+# Typical headers sent by Claude Code
+CLAUDE_CODE_HEADERS = {
+    "anthropic-version": "2023-06-01",
+    "content-type": "application/json",
+    "x-api-key": "sk-ant-placeholder",
+}
+
+
+def _is_quota_exhausted(response: object) -> bool:
+    """Check whether the response indicates quota exhaustion (429)."""
+    if response.status_code != 429:
+        return False
+    try:
+        body = response.json()
+        err = body.get("error", {})
+        msg = err.get("message", "").lower()
+        return "resource" in msg or "quota" in msg or "exhausted" in msg
+    except Exception:
+        return False
+
+
+def _is_scope_error(response: object) -> bool:
+    """Check whether the response indicates insufficient scope (403)."""
+    if response.status_code != 403:
+        return False
+    try:
+        body = response.json()
+        err = body.get("error", {})
+        return "scope" in json.dumps(err).lower()
+    except Exception:
+        return False
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_http_non_streaming(
+    e2e_client: object,
+    minimal_request_body: dict,
+) -> None:
+    """POST /v1/messages (non-streaming) → verify the protocol integration is correct."""
+    response = await e2e_client.post(
+        "/v1/messages",
+        json=minimal_request_body,
+        headers=CLAUDE_CODE_HEADERS,
+    )
+
+    if _is_scope_error(response):
+        pytest.skip("Insufficient scope on the GLA endpoint; v1internal mode is required")
+    if _is_quota_exhausted(response):
+        print("\n[E2E] HTTP non-streaming: protocol integration OK, but quota is exhausted (429)")
+        return
+
+    assert response.status_code == 200, (
+        f"expected 200, got {response.status_code}: {response.text[:300]}"
+    )
+
+    body = response.json()
+    assert body["type"] == "message", f"expected type=message, got: {body.get('type')}"
+    assert body["role"] == "assistant"
+    assert len(body["content"]) > 0, "content is empty"
+    assert body["content"][0]["type"] == "text"
+    assert body["usage"]["input_tokens"] > 0, "input_tokens should be > 0"
+
+    print(
+        f"\n[E2E] HTTP non-streaming succeeded: model={body.get('model')}, "
+        f"input={body['usage']['input_tokens']}, output={body['usage']['output_tokens']}"
+    )
+    print(f"  content: {body['content'][0].get('text', '')[:100]}")
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_http_streaming(e2e_client: object) -> None:
+    """POST /v1/messages (stream=true) → verify the SSE protocol."""
+    body = {
+        "model": "claude-sonnet-4-20250514",
+        "messages": [{"role": "user", "content": "Say exactly: pong"}],
+        "max_tokens": 32,
+        "stream": True,
+    }
+
+    events: list[str] = []
+    content_chunks: list[str] = []
+
+    try:
+        async with e2e_client.stream(
+            "POST", "/v1/messages", json=body, headers=CLAUDE_CODE_HEADERS
+        ) as response:
+            if response.status_code == 429:
+                print("\n[E2E] HTTP streaming: protocol integration OK, but quota is exhausted (429)")
+                return
+
+            assert response.status_code == 200, f"expected 200, got {response.status_code}"
+
+            async for line in response.aiter_lines():
+                line = line.strip()
+                if not line:
+                    continue
+                if line.startswith("event:"):
+                    events.append(line[6:].strip())
+                elif line.startswith("data:"):
+                    payload = line[5:].strip()
+                    if payload == "[DONE]":
+                        continue
+                    try:
+                        data = json.loads(payload)
+                        if data.get("type") == "content_block_delta":
+                            delta = data.get("delta", {})
+                            if delta.get("type") == "text_delta":
+                                content_chunks.append(delta.get("text", ""))
+                    except json.JSONDecodeError:
+                        pass
+
+        assert "message_start" in events, f"missing message_start, got: {events[:10]}"
+        assert "content_block_delta" in events, "missing content_block_delta"
+        assert "message_stop" in events, "missing message_stop"
+
+        full_text = "".join(content_chunks)
+        print(
+            f"\n[E2E] HTTP streaming succeeded: events={len(events)}, content='{full_text[:100]}'"
+        )
+    except Exception as exc:
+        error_str = str(exc)
+        if "429" in error_str or "exhausted" in error_str.lower():
+            print("\n[E2E] HTTP streaming: protocol integration OK, but quota is exhausted (429)")
+            return
+        raise
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_http_with_tools(e2e_client: object) -> None:
+    """POST /v1/messages with tool definitions → the request round-trips normally."""
+    body = {
+        "model": "claude-sonnet-4-20250514",
+        "messages": [
+            {"role": "user", "content": "What is 2+2? Reply with just the number."}
+        ],
+        "max_tokens": 128,
+        "tools": [
+            {
+                "name": "calculator",
+                "description": "Performs arithmetic",
+                "input_schema": {
+                    "type": "object",
+                    "properties": {"expression": {"type": "string"}},
+                    "required": ["expression"],
+                },
+            }
+        ],
+    }
+    response = await e2e_client.post(
+        "/v1/messages", json=body, headers=CLAUDE_CODE_HEADERS
+    )
+
+    if _is_scope_error(response):
+        pytest.skip("Insufficient scope on the GLA endpoint")
+    if _is_quota_exhausted(response):
+        print("\n[E2E] HTTP with tools: protocol integration OK, quota exhausted")
+        return
+
+    assert response.status_code == 200, (
+        f"expected 200, got {response.status_code}: {response.text[:300]}"
+    )
+
+    resp_body = response.json()
+    assert resp_body["type"] == "message"
+    assert len(resp_body["content"]) > 0
+    content_types = [b["type"] for b in resp_body["content"]]
+    print(f"\n[E2E] HTTP with tools succeeded: content_types={content_types}")
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_http_health_probe(e2e_client: object) -> None:
+    """HEAD / and GET /health → 200 (Claude Code connectivity probe)."""
+    head_resp = await e2e_client.head("/")
+    assert head_resp.status_code == 200, (
+        f"HEAD / expected 200, got {head_resp.status_code}"
+    )
+
+    get_resp = await e2e_client.get("/")
+    assert get_resp.status_code == 200, f"GET / expected 200, got {get_resp.status_code}"
+
+    health_resp = await e2e_client.get("/health")
+    assert health_resp.status_code == 200
+    assert health_resp.json() == {"status": "ok"}
+
+    print("\n[E2E] HTTP health probe succeeded: HEAD /=200, GET /=200, /health=ok")
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_http_status_diagnostics(e2e_client: object) -> None:
+    """GET /api/status → includes diagnostics for the antigravity tier."""
+    response = await e2e_client.get("/api/status")
+    assert response.status_code == 200
+
+    data = response.json()
+    assert "tiers" in data
+    antigravity_tiers = [t for t in data["tiers"] if t["name"] == "antigravity"]
+    assert len(antigravity_tiers) == 1, (
+        f"expected 1 antigravity tier, got: {len(antigravity_tiers)}"
+    )
+
+    tier = antigravity_tiers[0]
+    assert "diagnostics" in tier, "missing diagnostics"
+
+    diag = tier["diagnostics"]
+    print("\n[E2E] status diagnostics:")
+    for k, v in diag.items():
+        if isinstance(v, dict):
+            print(f"  {k}: {json.dumps(v, ensure_ascii=False)[:200]}")
+        else:
+            print(f"  {k}: {v}")
+
+    # token_manager diagnostics may be empty (if no error occurred); only report whether they are present
+    if "token_manager" in diag:
+        print("  token_manager diagnostics present")
+    else:
+        print("  (token_manager diagnostics empty — no token errors)")
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_http_claude_code_headers(e2e_client: object) -> None:
+    """A request with full Claude Code headers succeeds (verifies x-api-key does not interfere with Antigravity)."""
+    headers = {
+        "anthropic-version": "2023-06-01",
+        "content-type": "application/json",
+        "x-api-key": "sk-ant-api03-fake-key-for-testing",
+        "accept": "application/json",
+    }
+    body = {
+        "model": "claude-sonnet-4-20250514",
+        "messages": [{"role": "user", "content": "Say: ok"}],
+        "max_tokens": 16,
+    }
+    response = await e2e_client.post("/v1/messages", json=body, headers=headers)
+
+    if _is_quota_exhausted(response):
+        print("\n[E2E] Claude Code headers: protocol OK, quota exhausted")
+        return
+
+    assert response.status_code == 200, (
+        f"expected 200, got {response.status_code}: {response.text[:300]}"
+    )
+
+    resp_body = response.json()
+    assert resp_body["type"] == "message"
+    assert len(resp_body["content"]) > 0
+
+    print(
+        f"\n[E2E] Claude Code headers succeeded: content='{resp_body['content'][0].get('text', '')[:80]}'"
+    )
diff --git a/tests/e2e/test_e2e_token.py b/tests/e2e/test_e2e_token.py
new file mode 100644
index 0000000..dd3bb7b
--- /dev/null
+++ b/tests/e2e/test_e2e_token.py
@@ -0,0 +1,93 @@
+"""Level 1 E2E: Google OAuth2 token refresh, exercising the real credential chain."""
+
+from __future__ import annotations
+
+import pytest
+
+from coding.proxy.vendors.antigravity import GoogleOAuthTokenManager
+from coding.proxy.vendors.token_manager import TokenAcquireError, TokenErrorKind
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_real_token_refresh(e2e_credentials: dict[str, str]) -> None:
+    """A real refresh_token should yield a valid access_token (ya29. prefix)."""
+    tm = GoogleOAuthTokenManager(
+        e2e_credentials["client_id"],
+        e2e_credentials["client_secret"],
+        e2e_credentials["refresh_token"],
+    )
+    try:
+        token = await tm.get_token()
+        assert token, "access_token is empty"
+        assert token.startswith("ya29."), f"bad access_token prefix: {token[:10]}..."
+        print(f"[E2E DIAG] access_token={token[:10]}... (len={len(token)})")
+    finally:
+        await tm.close()
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_real_token_caching(e2e_credentials: dict[str, str]) -> None:
+    """Consecutive get_token() calls should return the same cached token."""
+    tm = GoogleOAuthTokenManager(
+        e2e_credentials["client_id"],
+        e2e_credentials["client_secret"],
+        e2e_credentials["refresh_token"],
+    )
+    try:
+        token1 = await tm.get_token()
+        token2 = await tm.get_token()
+        assert token1 == token2, "cache not used: two calls returned different tokens"
+        assert tm._expires_at > 0, "expires_at was not set"
+        print(f"[E2E DIAG] caching OK: expires_at={tm._expires_at}")
+    finally:
+        await tm.close()
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_invalid_refresh_token_raises(e2e_credentials: dict[str, str]) -> None:
+    """An invalid refresh_token should raise TokenAcquireError(INVALID_CREDENTIALS)."""
+    tm = GoogleOAuthTokenManager(
+        e2e_credentials["client_id"],
+        e2e_credentials["client_secret"],
+        "1//invalid_token_for_e2e_test_00000000",
+    )
+    try:
+        with pytest.raises(TokenAcquireError) as exc_info:
+            await tm.get_token()
+        assert exc_info.value.kind == TokenErrorKind.INVALID_CREDENTIALS, (
+            f"expected INVALID_CREDENTIALS, got {exc_info.value.kind}"
+        )
+        assert exc_info.value.needs_reauth is True
+        print(f"[E2E DIAG] invalid_grant caught as expected: {exc_info.value}")
+    finally:
+        await tm.close()
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_token_invalidation_triggers_refresh(
+    e2e_credentials: dict[str, str],
+) -> None:
+    """Fetching a token again after invalidate() should succeed."""
+    tm = GoogleOAuthTokenManager(
+        e2e_credentials["client_id"],
+        e2e_credentials["client_secret"],
+        e2e_credentials["refresh_token"],
+    )
+    try:
+        token1 = await tm.get_token()
+        assert token1, "initial token fetch failed"
+
+        tm.invalidate()
+        assert tm._expires_at == 0.0, "expires_at should be 0 after invalidate()"
+
+        token2 = await tm.get_token()
+        assert token2, "token fetch after invalidate() failed"
+        print(
+            f"[E2E DIAG] invalidation OK: token1={token1[:10]}... token2={token2[:10]}..."
+        )
+    finally:
+        await tm.close()
diff --git a/tests/e2e/test_e2e_vendor.py b/tests/e2e/test_e2e_vendor.py
new file mode 100644
index 0000000..1781235
--- /dev/null
+++ b/tests/e2e/test_e2e_vendor.py
@@ -0,0 +1,327 @@
+"""Level 2 E2E: direct AntigravityVendor calls over the GLA and v1internal protocols."""
+
+from __future__ import annotations
+
+import json
+
+import pytest
+
+
+def _print_diagnostics(vendor: object, label: str) -> None:
+    diag = vendor.get_diagnostics()
+    print(f"\n[E2E DIAG] {label}:")
+    for k, v in diag.items():
+        if isinstance(v, dict):
+            print(f"  {k}: {json.dumps(v, ensure_ascii=False)[:200]}")
+        else:
+            print(f"  {k}: {v}")
+
+
+def _is_quota_exhausted(resp: object) -> bool:
+    """Return True if the response signals quota exhaustion (429 RESOURCE_EXHAUSTED).
+
+    A 429 means the protocol works but quota is spent; tests treat it as expected.
+    """
+    if resp.status_code != 429:
+        return False
+    error_msg = (resp.error_message or "").lower()
+    return "resource" in error_msg or "quota" in error_msg or "exhausted" in error_msg
+
+
+def _is_scope_error(resp: object) -> bool:
+    """Return True if the response is an insufficient-scope error (403)."""
+    if resp.status_code != 403:
+        return False
+    return "scope" in (resp.error_message or "").lower()
+
+
+# ── Standard GLA mode ──
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_gla_non_streaming_text(
+    antigravity_vendor: object,
+    minimal_request_body: dict,
+) -> None:
+    """GLA mode, non-streaming request: verify the protocol integration."""
+    resp = await antigravity_vendor.send_message(minimal_request_body, {})
+    _print_diagnostics(antigravity_vendor, "GLA non-streaming")
+
+    # 403 scope error: GLA does not accept these credentials (expected; use v1internal)
+    if _is_scope_error(resp):
+        pytest.skip("GLA endpoint lacks required scope; v1internal mode needed")
+
+    # 429 quota exhausted: the protocol integration works, only quota is the problem
+    if _is_quota_exhausted(resp):
+        print("\n[E2E] GLA non-streaming: protocol OK, quota exhausted (429)")
+        return
+
+    assert resp.status_code == 200, (
+        f"expected 200, got {resp.status_code}: {resp.error_message}"
+    )
+
+    body = json.loads(resp.raw_body)
+    assert body["type"] == "message", f"expected type=message, got {body.get('type')}"
+    assert body["role"] == "assistant"
+    assert len(body["content"]) > 0, "content is empty"
+    assert body["content"][0]["type"] == "text"
+    assert body["stop_reason"] in ("end_turn", "max_tokens")
+    assert body["usage"]["input_tokens"] > 0, "input_tokens should be > 0"
+
+    print(
+        f"\n[E2E] GLA non-streaming succeeded: model={body.get('model')}, "
+        f"input={body['usage']['input_tokens']}, output={body['usage']['output_tokens']}, "
+        f"stop_reason={body['stop_reason']}"
+    )
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_gla_streaming_text(
+    antigravity_vendor: object,
+    minimal_request_body: dict,
+) -> None:
+    """GLA mode, streaming request: verify the SSE protocol integration."""
+    minimal_request_body["stream"] = True
+
+    events: list[str] = []
+    content_chunks: list[str] = []
+    quota_exhausted = False
+
+    try:
+        async for chunk in antigravity_vendor.send_message_stream(
+            minimal_request_body, {}
+        ):
+            text = chunk.decode("utf-8", errors="replace")
+            for line in text.split("\n"):
+                line = line.strip()
+                if line.startswith("event:"):
+                    events.append(line[6:].strip())
+                elif line.startswith("data:"):
+                    try:
+                        data = json.loads(line[5:].strip())
+                        if data.get("type") == "content_block_delta":
+                            delta = data.get("delta", {})
+                            if delta.get("type") == "text_delta":
+                                content_chunks.append(delta.get("text", ""))
+                    except json.JSONDecodeError:
+                        pass
+    except Exception as exc:
+        error_str = str(exc).lower()
+        if "403" in error_str and "scope" in error_str:
+            pytest.skip("GLA endpoint lacks required scope; v1internal mode needed")
+        if "429" in error_str or "quota" in error_str or "exhausted" in error_str:
+            quota_exhausted = True
+            print("\n[E2E] GLA streaming: protocol OK, quota exhausted (429)")
+        else:
+            raise
+
+    if not quota_exhausted:
+        _print_diagnostics(antigravity_vendor, "GLA streaming")
+        assert "message_start" in events, (
+            f"missing message_start event; events seen: {events[:10]}"
+        )
+        assert "content_block_delta" in events, "missing content_block_delta event"
+        assert "message_stop" in events, "missing message_stop event"
+
+        full_text = "".join(content_chunks)
+        print(
+            f"\n[E2E] GLA streaming succeeded: events={len(events)}, content='{full_text[:100]}'"
+        )
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_gla_with_system_prompt(
+    antigravity_vendor: object,
+    minimal_request_body: dict,
+) -> None:
+    """GLA mode: a request with a system prompt succeeds."""
+    minimal_request_body["system"] = (
+        "You are a test assistant. Always respond with exactly one word."
+    )
+    resp = await antigravity_vendor.send_message(minimal_request_body, {})
+
+    if _is_scope_error(resp):
+        pytest.skip("GLA endpoint lacks required scope")
+    if _is_quota_exhausted(resp):
+        print("\n[E2E] GLA with system prompt: protocol OK, quota exhausted")
+        return
+
+    assert resp.status_code == 200, (
+        f"expected 200, got {resp.status_code}: {resp.error_message}"
+    )
+    body = json.loads(resp.raw_body)
+    assert body["type"] == "message"
+    assert len(body["content"]) > 0
+
+    print(
+        f"\n[E2E] GLA with system prompt succeeded: content='{body['content'][0].get('text', '')[:80]}'"
+    )
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_gla_with_tools(
+    antigravity_vendor: object,
+    minimal_request_body: dict,
+) -> None:
+    """GLA mode: a request with tool definitions round-trips successfully."""
+    minimal_request_body["tools"] = [
+        {
+            "name": "calculator",
+            "description": "Performs arithmetic",
+            "input_schema": {
+                "type": "object",
+                "properties": {"expression": {"type": "string"}},
+                "required": ["expression"],
+            },
+        }
+    ]
+    minimal_request_body["messages"] = [
+        {"role": "user", "content": "What is 2+2? Reply with just the number."}
+    ]
+    resp = await antigravity_vendor.send_message(minimal_request_body, {})
+
+    if _is_scope_error(resp):
+        pytest.skip("GLA endpoint lacks required scope")
+    if _is_quota_exhausted(resp):
+        print("\n[E2E] GLA with tools: protocol OK, quota exhausted")
+        return
+
+    assert resp.status_code == 200, (
+        f"expected 200, got {resp.status_code}: {resp.error_message}"
+    )
+    body = json.loads(resp.raw_body)
+    assert body["type"] == "message"
+    assert len(body["content"]) > 0
+
+    _print_diagnostics(antigravity_vendor, "GLA with tools")
+    print(
+        f"\n[E2E] GLA with tools succeeded: content_types={[b['type'] for b in body['content']]}"
+    )
+
+
+# ── v1internal mode ──
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_v1internal_non_streaming(
+    antigravity_vendor_v1internal: object,
+    minimal_request_body: dict,
+) -> None:
+    """v1internal mode, non-streaming request: verify the protocol integration."""
+    resp = await antigravity_vendor_v1internal.send_message(minimal_request_body, {})
+
+    _print_diagnostics(antigravity_vendor_v1internal, "v1internal non-streaming")
+
+    # 429: the protocol integration works, only quota is the problem
+    if _is_quota_exhausted(resp):
+        diag = antigravity_vendor_v1internal.get_diagnostics()
+        print(
+            f"\n[E2E] v1internal non-streaming: protocol OK (is_v1internal={diag.get('is_v1internal_mode')}), quota exhausted (429)"
+        )
+        return
+
+    assert resp.status_code == 200, (
+        f"expected 200, got {resp.status_code}: {resp.error_message}"
+    )
+    body = json.loads(resp.raw_body)
+    assert body["type"] == "message"
+    assert body["role"] == "assistant"
+    assert len(body["content"]) > 0
+
+    diag = antigravity_vendor_v1internal.get_diagnostics()
+    print(
+        f"\n[E2E] v1internal non-streaming succeeded: "
+        f"is_v1internal={diag.get('is_v1internal_mode')}, "
+        f"project_id_source={diag.get('project_id_source')}, "
+        f"input={body['usage']['input_tokens']}, output={body['usage']['output_tokens']}"
+    )
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_v1internal_streaming(
+    antigravity_vendor_v1internal: object,
+    minimal_request_body: dict,
+) -> None:
+    """v1internal mode, streaming request: verify the SSE protocol."""
+    minimal_request_body["stream"] = True
+
+    events: list[str] = []
+    content_chunks: list[str] = []
+    quota_exhausted = False
+
+    try:
+        async for chunk in antigravity_vendor_v1internal.send_message_stream(
+            minimal_request_body, {}
+        ):
+            text = chunk.decode("utf-8", errors="replace")
+            for line in text.split("\n"):
+                line = line.strip()
+                if line.startswith("event:"):
+                    events.append(line[6:].strip())
+                elif line.startswith("data:"):
+                    try:
+                        data = json.loads(line[5:].strip())
+                        if data.get("type") == "content_block_delta":
+                            delta = data.get("delta", {})
+                            if delta.get("type") == "text_delta":
+                                content_chunks.append(delta.get("text", ""))
+                    except json.JSONDecodeError:
+                        pass
+    except Exception as exc:
+        error_str = str(exc)
+        if "429" in error_str:
+            quota_exhausted = True
+            print("\n[E2E] v1internal streaming: protocol OK, quota exhausted (429)")
+        else:
+            raise
+
+    if not quota_exhausted:
+        _print_diagnostics(antigravity_vendor_v1internal, "v1internal streaming")
+        assert "message_start" in events, "missing message_start"
+        assert "content_block_delta" in events, "missing content_block_delta"
+        assert "message_stop" in events, "missing message_stop"
+
+        full_text = "".join(content_chunks)
+        print(
+            f"\n[E2E] v1internal streaming succeeded: events={len(events)}, content='{full_text[:100]}'"
+        )
+
+
+@pytest.mark.e2e
+@pytest.mark.asyncio
+async def test_project_id_auto_discovery(
+    antigravity_vendor_v1internal: object,
+    minimal_request_body: dict,
+) -> None:
+    """After the first request, check v1internal mode and project_id discovery."""
+    resp = await antigravity_vendor_v1internal.send_message(minimal_request_body, {})
+
+    diag = antigravity_vendor_v1internal.get_diagnostics()
+    source = diag.get("project_id_source", "unknown")
+    is_v1 = diag.get("is_v1internal_mode", False)
+
+    print(f"\n[E2E] project_id discovery: source={source}, is_v1internal={is_v1}")
+
+    # v1internal mode should be enabled (driven by the base_url configuration)
+    assert is_v1 is True, "v1internal mode should be enabled"
+    assert source in ("discovered", "none", "configured"), (
+        f"unknown project_id_source: {source}"
+    )
+
+    # The request must reach the API: both 200 and 429 (quota) prove the protocol works
+    assert resp.status_code in (200, 429), (
+        f"expected 200/429, got {resp.status_code}: {(resp.error_message or '')[:200]}"
+    )
+
+    if resp.status_code == 429:
+        print("  quota exhausted (429), but the protocol integration is verified")
+    elif source == "discovered":
+        print(f"  discovered_project_id={diag.get('discovered_project_id')}")
+    elif source == "none":
+        print("  no project_id discovered; v1internal does not require one")
diff --git a/tests/test_antigravity.py b/tests/test_antigravity.py
index 6256bfb..cc93127 100644
--- a/tests/test_antigravity.py
+++ b/tests/test_antigravity.py
@@ -384,12 +384,12 @@ def test_is_v1internal_mode_with_project_id_and_v1internal_url():
 
 
 def test_is_v1internal_mode_without_project_id():
-    """未配置 project_id 时即使 URL 含 v1internal 也不启用."""
+    """base_url drives v1internal mode; no project_id needed (per reference project)."""
     config = AntigravityConfig(
         base_url="https://cloudcode-pa.googleapis.com/v1internal",
     )
     vendor = AntigravityVendor(config, FailoverConfig(), ModelMapper([]))
-    assert vendor._is_v1internal_mode() is False
+    assert vendor._is_v1internal_mode() is True
 
 
 def test_is_v1internal_mode_standard_gla_url():
@@ -527,7 +527,7 @@ async def test_discover_project_id_single_active_project():
 
     assert result == "my-gcp-123"
     assert vendor._project_id_discovered == "my-gcp-123"
-    assert vendor._base_url == "https://cloudcode-pa.googleapis.com/v1internal"
+    assert vendor._base_url == "https://cloudcode-pa.googleapis.com"
     assert vendor._is_v1internal_mode() is True
 
 
@@ -743,20 +743,19 @@ async def mock_discover(token):
 
 
 def test_is_v1internal_mode_uses_effective_project_id():
-    """_is_v1internal_mode 应基于 _effective_project_id 判断."""
+    """_is_v1internal_mode should depend on base_url (no longer on project_id)."""
     config = AntigravityConfig(base_url=_V1INTERNAL_BASE_URL)
     vendor = AntigravityVendor(config, FailoverConfig(), ModelMapper([]))
 
-    # 未配置、未发现 → False
-    assert vendor._is_v1internal_mode() is False
+    # base_url contains v1internal → True (even without a project_id)
+    assert vendor._is_v1internal_mode() is True
 
-    # 发现后 → True
+    # Discovering a project_id does not affect v1internal mode detection
     vendor._project_id_discovered = "found-it"
     assert vendor._is_v1internal_mode() is True
 
-    # 配置值覆盖发现值
+    # Clearing the discovered value does not affect it either
     vendor._project_id_discovered = ""
-    vendor._project_id = "manual"
     assert vendor._is_v1internal_mode() is True
 
 
diff --git a/tests/test_app_routes.py b/tests/test_app_routes.py
index 8df0277..4c460e3 100644
--- a/tests/test_app_routes.py
+++ b/tests/test_app_routes.py
@@ -286,6 +286,76 @@ def test_count_tokens_falls_back_to_tiers0_on_cold_start():
             assert resp.json()["input_tokens"] == 88
 
 
+def test_count_tokens_triggers_zhipu_to_target_channel(caplog):
+    """count_tokens with zhipu-private artifacts triggers the cross-vendor channel and returns 200.
+
+    Regression test: routes.py used to read target_vendor.name, but BaseVendor exposes only
+    get_name() and has no name attribute. When infer_source_vendor_from_body() inferred a
+    non-empty source, that raised AttributeError and returned 500. Injecting zhipu artifacts (a
+    srvtoolu_* id and a server_tool_use block) hits that path; assert 200 and the adaptations log.
+    """
+    config = ProxyConfig(
+        tiers=[
+            {"vendor": "anthropic", "enabled": True, "api_key": "sk-ant-test"},
+        ],
+        database={"path": "/tmp/test-count-tokens-zhipu-channel.db"},
+    )
+    app = create_app(config)
+
+    mock_response = MagicMock()
+    mock_response.content = b'{"input_tokens": 99}'
+    mock_response.status_code = 200
+
+    body_with_zhipu_artifact = {
+        "model": "claude-sonnet-4-20250514",
+        "messages": [
+            {"role": "user", "content": "Hello"},
+            {
+                "role": "assistant",
+                "content": [
+                    {
+                        "type": "server_tool_use",
+                        "id": "srvtoolu_abc123",
+                        "name": "web_search",
+                        "input": {"query": "test"},
+                    },
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "tool_result",
+                        "tool_use_id": "srvtoolu_abc123",
+                        "content": "result",
+                    },
+                ],
+            },
+        ],
+    }
+
+    with TestClient(app) as client:
+        with patch.object(
+            httpx.AsyncClient,
+            "post",
+            new_callable=AsyncMock,
+            return_value=mock_response,
+        ):
+            with caplog.at_level(logging.DEBUG, logger="coding.proxy.server.routes"):
+                resp = client.post(
+                    "/v1/messages/count_tokens?beta=true",
+                    json=body_with_zhipu_artifact,
+                    headers={"authorization": "Bearer sk-test"},
+                )
+            assert resp.status_code == 200
+            assert resp.json()["input_tokens"] == 99
+            # Evidence the channel actually fired: a debug log containing "count_tokens channel zhipu → anthropic"
+            assert any(
+                "count_tokens channel zhipu" in record.message
+                for record in caplog.records
+            ), "expected zhipu→anthropic channel adaptation log"
+
+
 def test_status_exposes_vendor_diagnostics():
     """状态接口暴露供应商诊断信息，便于排查凭证交换异常."""
     config = ProxyConfig(
diff --git a/tests/test_native_api_handler.py b/tests/test_native_api_handler.py
index d8db031..14be66c 100644
--- a/tests/test_native_api_handler.py
+++ b/tests/test_native_api_handler.py
@@ -14,6 +14,7 @@
 
 from __future__ import annotations
 
+import json
 from collections.abc import Iterator
 
 import httpx
@@ -372,3 +373,191 @@ def factory(make_transport):
         r = client.request(method, "/api/openai/v1/files/abc")
         assert r.status_code == 200
         assert captured[0].method == method
+
+
+# ── Gemini batchEmbedContents end to end ─────────────────────────
+
+
+def test_gemini_batch_embed_forwards_correctly() -> None:
+    """The Gemini batchEmbedContents endpoint (literal colon) is forwarded correctly."""
+
+    def route(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(
+            200,
+            json={"embeddings": [{"values": [0.1, 0.2]}]},
+        )
+
+    def factory(make_transport):
+        cfg = NativeApiConfig(
+            gemini=NativeProviderConfig(
+                enabled=True, base_url="https://generativelanguage.googleapis.com"
+            ),
+        )
+        transport = make_transport(route)
+        return NativeProxyHandler(cfg, transport=transport), transport
+
+    for client, captured in _make_app(factory):
+        r = client.post(
+            "/api/gemini/v1beta/models/gemini-embedding-001:batchEmbedContents?key=secret123",
+            json={
+                "requests": [
+                    {
+                        "model": "models/gemini-embedding-001",
+                        "content": {"parts": [{"text": "hello"}]},
+                    }
+                ]
+            },
+        )
+        assert r.status_code == 200
+        assert r.json()["embeddings"][0]["values"] == [0.1, 0.2]
+        upstream = captured[0]
+        # The upstream URL must contain a literal colon, not %3A
+        upstream_str = str(upstream.url)
+        assert ":batchEmbedContents" in upstream_str
+        assert "%3A" not in upstream_str
+        assert upstream.url.params.get("key") == "secret123"
+
+
+def test_gemini_url_encoded_colon_decoded_for_upstream() -> None:
+    """When %3A reaches the proxy, the upstream must receive a literal colon."""
+
+    def route(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(200, json={"ok": True})
+
+    def factory(make_transport):
+        cfg = NativeApiConfig(
+            gemini=NativeProviderConfig(
+                enabled=True, base_url="https://generativelanguage.googleapis.com"
+            ),
+        )
+        transport = make_transport(route)
+        return NativeProxyHandler(cfg, transport=transport), transport
+
+    for client, captured in _make_app(factory):
+        r = client.post(
+            "/api/gemini/v1beta/models/gemini-embedding-001%3AbatchEmbedContents?key=k",
+            json={"requests": []},
+        )
+        assert r.status_code == 200
+        upstream = captured[0]
+        upstream_str = str(upstream.url)
+        # The upstream URL must contain a literal colon, not %3A
+        assert "%3A" not in upstream_str
+        assert ":batchEmbedContents" in upstream_str
+
+
+# ── Gemini embedding: Vertex AI format conversion ───────────────
+
+
+def test_gemini_vertex_embed_content_single() -> None:
+    """Non-official upstream: embedContent is converted to the Vertex AI format."""
+
+    def route(request: httpx.Request) -> httpx.Response:
+        body = json.loads(request.content)
+        assert "content" in body
+        assert "model" not in body
+        assert "requests" not in body
+        assert ":embedContent" in str(request.url)
+        assert "v1beta1/publishers/google/models" in str(request.url)
+        return httpx.Response(200, json={"embedding": {"values": [0.1, 0.2]}})
+
+    def factory(make_transport):
+        cfg = NativeApiConfig(
+            gemini=NativeProviderConfig(enabled=True, base_url="http://llms.as-in.io"),
+        )
+        transport = make_transport(route)
+        return NativeProxyHandler(cfg, transport=transport), transport
+
+    for client, captured in _make_app(factory):
+        r = client.post(
+            "/api/gemini/v1beta/models/gemini-embedding-2-preview:embedContent",
+            json={
+                "model": "models/gemini-embedding-2-preview",
+                "content": {"parts": [{"text": "hello"}]},
+            },
+        )
+        assert r.status_code == 200
+        assert "embedding" in r.json()
+
+
+def test_gemini_vertex_batch_embed_contents() -> None:
+    """Non-official upstream: batchEmbedContents is split into embedContent calls and merged."""
+
+    call_count = 0
+
+    def route(request: httpx.Request) -> httpx.Response:
+        nonlocal call_count
+        call_count += 1
+        body = json.loads(request.content)
+        assert "content" in body
+        assert ":embedContent" in str(request.url)
+        assert "v1beta1/publishers/google/models" in str(request.url)
+        return httpx.Response(
+            200,
+            json={"embedding": {"values": [float(call_count), 0.5]}},
+        )
+
+    def factory(make_transport):
+        cfg = NativeApiConfig(
+            gemini=NativeProviderConfig(enabled=True, base_url="http://llms.as-in.io"),
+        )
+        transport = make_transport(route)
+        return NativeProxyHandler(cfg, transport=transport), transport
+
+    for client, captured in _make_app(factory):
+        r = client.post(
+            "/api/gemini/v1beta/models/gemini-embedding-2-preview:batchEmbedContents",
+            json={
+                "requests": [
+                    {
+                        "model": "models/gemini-embedding-2-preview",
+                        "content": {"parts": [{"text": "hello"}]},
+                    },
+                    {
+                        "model": "models/gemini-embedding-2-preview",
+                        "content": {"parts": [{"text": "world"}]},
+                    },
+                ]
+            },
+        )
+        assert r.status_code == 200
+        data = r.json()
+        assert "embeddings" in data
+        assert len(data["embeddings"]) == 2
+        assert data["embeddings"][0]["values"] == [1.0, 0.5]
+        assert data["embeddings"][1]["values"] == [2.0, 0.5]
+        assert call_count == 2
+
+
+def test_gemini_vertex_embed_official_upstream_unchanged() -> None:
+    """Official upstream: batchEmbedContents is passed through with no format conversion."""
+
+    def route(request: httpx.Request) -> httpx.Response:
+        return httpx.Response(200, json={"embeddings": [{"values": [0.1, 0.2]}]})
+
+    def factory(make_transport):
+        cfg = NativeApiConfig(
+            gemini=NativeProviderConfig(
+                enabled=True, base_url="https://generativelanguage.googleapis.com"
+            ),
+        )
+        transport = make_transport(route)
+        return NativeProxyHandler(cfg, transport=transport), transport
+
+    for client, captured in _make_app(factory):
+        r = client.post(
+            "/api/gemini/v1beta/models/gemini-embedding-001:batchEmbedContents?key=k",
+            json={
+                "requests": [
+                    {
+                        "model": "models/gemini-embedding-001",
+                        "content": {"parts": [{"text": "hello"}]},
+                    }
+                ]
+            },
+        )
+        assert r.status_code == 200
+        # The official upstream takes the passthrough path; the URL keeps the v1beta/models/ form
+        upstream = captured[0]
+        assert "v1beta/models" in str(upstream.url)
+        assert "v1beta1/publishers" not in str(upstream.url)
diff --git a/tests/test_native_api_operation.py b/tests/test_native_api_operation.py
index 64cd160..fc237bc 100644
--- a/tests/test_native_api_operation.py
+++ b/tests/test_native_api_operation.py
@@ -55,6 +55,12 @@ def test_classify_openai(path: str, expected: str) -> None:
         ("/v1beta/models/text-embedding-004:embedContent", "embedding"),
         ("/v1beta/models/text-embedding-004:batchEmbedContents", "embedding.batch"),
         ("/v1beta/models/imagegeneration@006:predict", "predict"),
+        # Compatibility with %3A (URL-encoded colon)
+        ("/v1beta/models/gemini-embedding-001%3AbatchEmbedContents", "embedding.batch"),
+        ("/v1beta/models/text-embedding-004%3AembedContent", "embedding"),
+        ("/v1beta/models/gemini-2.0-flash%3AgenerateContent", "generate_content"),
+        ("/v1beta/models/gemini-2.0-flash%3AstreamGenerateContent", "generate_content"),
+        ("/v1beta/models/gemini-1.5-pro%3AcountTokens", "count_tokens"),
         ("/v1beta/cachedContents", "cache"),
         ("/v1beta/cachedContents/cachedContents-xyz", "cache"),
         ("/v1beta/files", "file"),
@@ -128,3 +134,14 @@ def test_is_stream_path() -> None:
     # OpenAI / Anthropic 不走路径判定（以响应 content-type 为准）
     assert not OperationClassifier.is_stream_path("openai", "/v1/chat/completions")
     assert not OperationClassifier.is_stream_path("anthropic", "/v1/messages")
+
+
+def test_is_stream_path_with_encoded_colon() -> None:
+    """is_stream_path should also recognize %3A (URL-encoded colon)."""
+    assert OperationClassifier.is_stream_path(
+        "gemini", "/v1beta/models/gemini-2.0-flash%3AstreamGenerateContent"
+    )
+    # %3A with a non-streaming path should still return False
+    assert not OperationClassifier.is_stream_path(
+        "gemini", "/v1beta/models/gemini-2.0-flash%3AgenerateContent"
+    )
diff --git a/tests/test_router_executor.py b/tests/test_router_executor.py
index 1e40ea6..9506e67 100644
--- a/tests/test_router_executor.py
+++ b/tests/test_router_executor.py
@@ -20,11 +20,15 @@
     build_canonical_request,
 )
 from coding.proxy.routing.executor import (
+    _SESSION_TITLE_MAX_LEN,
     _VENDOR_PROTOCOL_LABEL_MAP,
+    _build_semantic_rejection_diagnostic,
+    _extract_session_title,
     _has_tool_results,
     _is_likely_request_format_error,
     _log_vendor_response_error,
     _RouteExecutor,
+    _sanitize_user_text,
 )
 from coding.proxy.routing.session_manager import RouteSessionManager
 from coding.proxy.routing.tier import VendorTier
@@ -222,7 +226,7 @@ async def test_eligible_when_all_checks_pass(self):
         headers = {}
         caps = RequestCapabilities()
         req = build_canonical_request(body, headers)
-        session_record = await exec_inst._session_mgr.get_or_create_record(
+        session_record, _is_new = await exec_inst._session_mgr.get_or_create_record(
             req.session_key, req.trace_id
         )
         reasons: list[str] = []
@@ -246,7 +250,7 @@ async def test_skip_when_capability_unsupported(self):
         body = {"model": "test"}
         headers = {}
         req = build_canonical_request(body, headers)
-        session_record = await exec_inst._session_mgr.get_or_create_record(
+        session_record, _is_new = await exec_inst._session_mgr.get_or_create_record(
             req.session_key, req.trace_id
         )
         reasons: list[str] = []
@@ -275,7 +279,7 @@ async def test_skip_when_unsafe_compatibility(self):
         body = {"model": "test", "thinking": {"type": "enabled"}}
         headers = {}
         req = build_canonical_request(body, headers)
-        session_record = await exec_inst._session_mgr.get_or_create_record(
+        session_record, _is_new = await exec_inst._session_mgr.get_or_create_record(
             req.session_key, req.trace_id
         )
         reasons: list[str] = []
@@ -651,9 +655,10 @@ class TestRouteSessionManagerIntegration:
     @pytest.mark.asyncio
     async def test_get_or_create_without_store(self):
         mgr = RouteSessionManager(compat_session_store=None)
-        record = await mgr.get_or_create_record("sk_test", "trace_1")
-        # 无 store 时返回 None（由 executor 层面处理空 record 场景）
+        record, is_new = await mgr.get_or_create_record("sk_test", "trace_1")
+        # Without a store, returns (None, False)
         assert record is None
+        assert is_new is False
 
     @pytest.mark.asyncio
     async def test_persist_session_without_store_is_noop(self):
@@ -1948,3 +1953,374 @@ def test_returns_body_for_unknown_tier(self):
         result = exec_inst._prepare_body_for_tier(body, tier, source_vendor="zhipu")
 
         assert result is body
+
+
+class TestBuildSemanticRejectionDiagnostic:
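+    """Covers _build_semantic_rejection_diagnostic, used to diagnose vendor semantic rejections such as [1210].
+
+    Key checks:
+    - baseline fields (model / messages) are always emitted
+    - optional items are emitted only when the parameter is present (avoids log noise)
+    - the output format of each field is stable
+    """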
+
+    def test_baseline_minimal_body(self):
+        """Minimal request body: only model + messages are emitted."""
+        body = {"model": "glm-5-turbo", "messages": [{"role": "user", "content": "hi"}]}
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "model=glm-5-turbo" in result
+        assert "messages=1" in result
+        # Unused fields must not be emitted
+        assert "thinking" not in result
+        assert "tools" not in result
+        assert "cache_control" not in result
+
+    def test_includes_thinking_param(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [],
+            "thinking": {"type": "enabled", "budget_tokens": 1024},
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "thinking=" in result
+        assert "budget_tokens" in result
+
+    def test_includes_system_string(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [],
+            "system": "You are helpful." * 5,
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "system_kind=string(len=" in result
+
+    def test_includes_system_blocks_with_cache_control(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [],
+            "system": [
+                {
+                    "type": "text",
+                    "text": "rule1",
+                    "cache_control": {"type": "ephemeral"},
+                },
+                {"type": "text", "text": "rule2"},
+            ],
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "system_blocks=2,cc=1" in result
+
+    def test_includes_tools_and_tool_choice(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [],
+            "tools": [{"name": "a"}, {"name": "b"}, {"name": "c"}],
+            "tool_choice": {"type": "auto"},
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "tools=3" in result
+        assert "tool_choice=" in result
+
+    def test_includes_sampling_params(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [],
+            "max_tokens": 8192,
+            "temperature": 0.7,
+            "top_p": 0.9,
+            "top_k": 40,
+            "stop_sequences": ["\n\n", "END"],
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "max_tokens=8192" in result
+        assert "temperature=0.7" in result
+        assert "top_p=0.9" in result
+        assert "top_k=40" in result
+        assert "stop_sequences=2" in result
+
+    def test_includes_stream_and_metadata(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [],
+            "stream": True,
+            "metadata": {"user_id": "x", "session_id": "y"},
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "stream=True" in result
+        assert "metadata_keys=2" in result
+
+    def test_content_type_distribution(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [
+                {
+                    "role": "user",
+                    "content": [
+                        {"type": "text", "text": "hi"},
+                        {"type": "text", "text": "bye"},
+                        {"type": "image", "source": {}},
+                    ],
+                },
+                {
+                    "role": "assistant",
+                    "content": [
+                        {"type": "tool_use", "id": "t1", "name": "x", "input": {}},
+                    ],
+                },
+            ],
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        # Keys are sorted alphabetically
+        assert "content_types={image:1,text:2,tool_use:1}" in result
+
+    def test_content_type_string_messages(self):
+        """String-valued messages.content is counted as string:N."""
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [
+                {"role": "user", "content": "hello"},
+                {"role": "assistant", "content": "hi"},
+            ],
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "content_types={string:2}" in result
+
+    def test_thinking_blocks_in_history(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [
+                {
+                    "role": "assistant",
+                    "content": [
+                        {"type": "thinking", "thinking": "..."},
+                        {"type": "redacted_thinking", "data": "..."},
+                        {"type": "text", "text": "result"},
+                    ],
+                }
+            ],
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "thinking_blocks_in_history=2" in result
+
+    def test_cache_control_in_messages_or_tools(self):
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "text",
+                            "text": "x",
+                            "cache_control": {"type": "ephemeral"},
+                        },
+                    ],
+                }
+            ],
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "cache_control_fields=present" in result
+
+    def test_body_bytes_estimated(self):
+        body = {"model": "glm-5-turbo", "messages": [{"role": "user", "content": "ok"}]}
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "body_bytes=" in result
+
+    def test_body_bytes_skipped_when_unserializable(self):
+        """A request body containing non-serializable objects must not raise."""
+
+        class NonSerializable:
+            pass
+
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [],
+            "metadata": {"obj": NonSerializable()},
+        }
+        # Must not raise
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "model=glm-5-turbo" in result
+
+    def test_combined_real_world_failure_case(self):
+        """Simulates a real failing request shape (messages=1, no thinking/cache_control, with system + tools)."""
+        body = {
+            "model": "glm-5-turbo",
+            "messages": [{"role": "user", "content": "需要修复一个 bug"}],
+            "system": [{"type": "text", "text": "You are Claude Code."}],
+            "tools": [{"name": "Read"}, {"name": "Edit"}],
+            "max_tokens": 8192,
+            "temperature": 1.0,
+            "metadata": {"user_id": "x"},
+            "stream": True,
+        }
+        result = _build_semantic_rejection_diagnostic(body)
+        assert "model=glm-5-turbo" in result
+        assert "messages=1" in result
+        assert "system_blocks=1" in result
+        assert "tools=2" in result
+        assert "max_tokens=8192" in result
+        assert "temperature=1.0" in result
+        assert "metadata_keys=1" in result
+        assert "stream=True" in result
+        # Items absent from the request must not appear
+        assert "thinking_blocks_in_history" not in result
+        assert "cache_control_fields" not in result
+
+
+# ── Session title sanitization and extraction tests ─────────────────────────────────
+
+
+class TestSanitizeUserText:
+    """``_sanitize_user_text``: strips system-level XML blocks injected by CC.
+
+    Covers typical system-reminder/user-preferences noise, the slash command
+    short-circuit, whitespace collapsing, and edge cases.
+    """
+
+    def test_strips_system_reminder(self):
+        raw = "<system-reminder>MCP 指令</system-reminder>这是用户真实输入"
+        assert _sanitize_user_text(raw) == "这是用户真实输入"
+
+    def test_strips_user_preferences(self):
+        raw = "用户问题<user-preferences>遵循 AGENTS.md</user-preferences>"
+        assert _sanitize_user_text(raw) == "用户问题"
+
+    def test_strips_multiple_noise_blocks(self):
+        raw = (
+            "<system-reminder>A</system-reminder>"
+            "<system-reminder>B</system-reminder>"
+            "<system-reminder>C</system-reminder>"
+            "<system-reminder>D</system-reminder>"
+            "真实输入文本"
+            "<user-preferences>P</user-preferences>"
+        )
+        assert _sanitize_user_text(raw) == "真实输入文本"
+
+    def test_strips_multiline_system_reminder(self):
+        """A multi-line system-reminder block must be matched in full (DOTALL) and stripped."""
+        raw = (
+            "<system-reminder>\n"
+            "# MCP Server Instructions\n"
+            "Use this server to fetch ...\n"
+            "</system-reminder>\n"
+            "TITLE 中的 Session 标题应当取自用户输入"
+        )
+        assert _sanitize_user_text(raw) == "TITLE 中的 Session 标题应当取自用户输入"
+
+    def test_strips_tag_with_attributes(self):
+        """Tolerates tags that carry attributes (e.g. <system-reminder type="x">)."""
+        raw = '<system-reminder type="x">noise</system-reminder>真实'
+        assert _sanitize_user_text(raw) == "真实"
+
+    def test_slash_command_with_args(self):
+        raw = (
+            "<command-message>commit (user)</command-message>"
+            "<command-name>/commit</command-name>"
+            "<command-args>修复标题</command-args>"
+        )
+        assert _sanitize_user_text(raw) == "/commit 修复标题"
+
+    def test_slash_command_no_args(self):
+        raw = "<command-name>/review</command-name>"
+        assert _sanitize_user_text(raw) == "/review"
+
+    def test_collapses_whitespace(self):
+        raw = "<system-reminder>X</system-reminder>\n\n   多余  空白\t\t折叠   "
+        assert _sanitize_user_text(raw) == "多余 空白 折叠"
+
+    def test_empty_after_strip(self):
+        raw = "<system-reminder>仅噪声</system-reminder>"
+        assert _sanitize_user_text(raw) == ""
+
+    def test_empty_input(self):
+        assert _sanitize_user_text("") == ""
+
+    def test_preserves_user_xml_like_content(self):
+        """Legitimate XML/HTML fragments in user input (non-whitelisted tags) must be preserved intact."""
+        raw = "请帮我审查这段代码:<div>hello</div> 是否符合规范?"
+        assert _sanitize_user_text(raw) == raw
+
+    def test_strips_local_command_output(self):
+        raw = "<local-command-stdout>build ok</local-command-stdout>构建后的下一步问题"
+        assert _sanitize_user_text(raw) == "构建后的下一步问题"
+
+
+class TestExtractSessionTitle:
+    """``_extract_session_title``: end-to-end title extraction from a CanonicalRequest."""
+
+    @staticmethod
+    def _build_request(messages: list[dict]):
+        return build_canonical_request({"model": "test", "messages": messages}, {})
+
+    def test_truncates_to_max_len(self):
+        long_text = "用户输入文本" * 20
+        req = self._build_request([{"role": "user", "content": long_text}])
+        title = _extract_session_title(req)
+        assert len(title) == _SESSION_TITLE_MAX_LEN
+        assert title == long_text[:_SESSION_TITLE_MAX_LEN]
+
+    def test_strips_noise_from_first_user_message(self):
+        raw = (
+            "<system-reminder>MCP 指令</system-reminder>"
+            "<user-preferences>偏好</user-preferences>"
+            "测试标题 ABC"
+        )
+        req = self._build_request([{"role": "user", "content": raw}])
+        assert _extract_session_title(req) == "测试标题 ABC"
+
+    def test_handles_real_cc_first_message_shape(self):
+        """Simulates a real CC first message (several consecutive system-reminder blocks + user text)."""
+        raw = (
+            "<system-reminder>\n# MCP Server Instructions\n...</system-reminder>"
+            "<system-reminder>\nThe following skills...\n</system-reminder>"
+            "<system-reminder>\nPlan mode is active...\n</system-reminder>"
+            "\n\nTITLE 中的 Session 标题应当取自用户输入的信息前 30 个字\n\n"
+            "<user-preferences>始终遵循 AGENTS.md</user-preferences>"
+        )
+        req = self._build_request([{"role": "user", "content": raw}])
+        title = _extract_session_title(req)
+        assert title.startswith("TITLE 中的 Session")
+        assert len(title) <= _SESSION_TITLE_MAX_LEN
+
+    def test_extracts_slash_command(self):
+        raw = (
+            "<command-name>/commit</command-name>"
+            "<command-args>feat: 新增标题清洗</command-args>"
+        )
+        req = self._build_request([{"role": "user", "content": raw}])
+        assert _extract_session_title(req) == "/commit feat: 新增标题清洗"
+
+    def test_returns_empty_when_only_noise(self):
+        raw = "<system-reminder>纯噪声</system-reminder>"
+        req = self._build_request([{"role": "user", "content": raw}])
+        assert _extract_session_title(req) == ""
+
+    def test_returns_empty_for_no_user_messages(self):
+        req = self._build_request([{"role": "assistant", "content": "你好"}])
+        assert _extract_session_title(req) == ""
+
+    def test_skips_noise_only_part_to_find_real_input(self):
+        """When the first user text part is pure noise, fall back to the next non-empty user part."""
+        messages = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "text",
+                        "text": "<system-reminder>noise</system-reminder>",
+                    },
+                    {"type": "text", "text": "真实问题"},
+                ],
+            }
+        ]
+        req = self._build_request(messages)
+        assert _extract_session_title(req) == "真实问题"
+
+    def test_skips_assistant_role(self):
+        """Text from the assistant role must not be used as a title candidate."""
+        messages = [
+            {"role": "assistant", "content": "上一轮回答"},
+            {"role": "user", "content": "新的用户问题"},
+        ]
+        req = self._build_request(messages)
+        assert _extract_session_title(req) == "新的用户问题"
diff --git a/tests/test_schema.py b/tests/test_schema.py
index ae7120e..30d691c 100644
--- a/tests/test_schema.py
+++ b/tests/test_schema.py
@@ -31,7 +31,8 @@ def test_antigravity_fields_set():
 
 def test_zhipu_fields_set():
     assert "api_key" in _ZHIPU_FIELDS
-    assert len(_ZHIPU_FIELDS) == 1
+    assert "concurrency" in _ZHIPU_FIELDS
+    assert len(_ZHIPU_FIELDS) == 2
 
 
 def test_vendor_exclusive_fields_mapping_complete():
diff --git a/tests/test_session_aware.py b/tests/test_session_aware.py
index 0c08449..29518e5 100644
--- a/tests/test_session_aware.py
+++ b/tests/test_session_aware.py
@@ -160,6 +160,8 @@ async def test_query_recent_sessions_basic(logger):
             model_served="claude-sonnet",
             input_tokens=100 * (i + 1),
             output_tokens=50 * (i + 1),
+            cache_creation_tokens=10 * (i + 1),
+            cache_read_tokens=1000 * (i + 1),
             session_key="session-alpha",
             duration_ms=100 + i * 50,
         )
@@ -186,9 +188,15 @@ async def test_query_recent_sessions_basic(logger):
 
     alpha = next(s for s in sessions if s["session_key"] == "session-alpha")
     assert alpha["total_requests"] == 3
-    assert alpha["total_tokens"] == (100 + 200 + 300) + (50 + 100 + 150)
-    assert alpha["total_input"] == 100 + 200 + 300
-    assert alpha["total_output"] == 50 + 100 + 150
+    expected_input = 100 + 200 + 300
+    expected_output = 50 + 100 + 150
+    expected_cache_creation = 10 + 20 + 30
+    expected_cache_read = 1000 + 2000 + 3000
+    assert alpha["total_tokens"] == (
+        expected_input + expected_output + expected_cache_creation + expected_cache_read
+    )
+    assert alpha["total_input"] == expected_input
+    assert alpha["total_output"] == expected_output
     assert "claude-sonnet" in alpha["models"]
     assert "anthropic" in alpha["vendors"]
     assert alpha["success_rate"] == 100.0
@@ -269,12 +277,15 @@ async def test_query_session_profile_found(logger):
         model_served="m",
         input_tokens=100,
         output_tokens=50,
+        cache_creation_tokens=20,
+        cache_read_tokens=400,
         session_key="profile-test",
     )
     profile = await logger.query_session_profile("profile-test")
     assert profile is not None
     assert profile["session_key"] == "profile-test"
     assert profile["total_requests"] == 1
+    assert profile["total_tokens"] == 100 + 50 + 20 + 400
 
 
 @pytest.mark.asyncio
diff --git a/tests/test_vendor_channels.py b/tests/test_vendor_channels.py
index 774b85a..f9c9bb5 100644
--- a/tests/test_vendor_channels.py
+++ b/tests/test_vendor_channels.py
@@ -15,12 +15,14 @@
 
 from coding.proxy.convert.vendor_channels import (
     VENDOR_TRANSITIONS,
+    _enforce_pairing_sanity_pass,
     _remove_vendor_blocks,
     _rewrite_srvtoolu_ids,
     _strip_cache_control,
     enforce_anthropic_tool_pairing,
     get_transition_channel,
     infer_source_vendor_from_body,
+    normalize_for_zhipu,
     prepare_copilot_to_zhipu,
     prepare_zhipu_to_anthropic,
     prepare_zhipu_to_copilot,
@@ -1008,6 +1010,91 @@ def test_skips_non_matching_user_tool_result(self):
         assert count == 0
         assert body["messages"][0]["content"][0]["tool_use_id"] == "toolu_other"
 
+    def test_two_pass_handles_inline_tool_result_before_server_tool_use(self):
+        """Out-of-order regression: tool_result precedes server_tool_use within one assistant content.
+
+        A real shape observed in Zhipu GLM-5 streaming responses. With a single-pass scan,
+        Case B runs on the tool_result block before Case A has populated ``id_map``, so
+        ``tool_result.tool_use_id`` keeps its stale ``srvtoolu_*`` reference, which
+        ultimately triggers the Anthropic API 400 error ``messages.x: tool_use ids were
+        found without tool_result blocks immediately after``.
+
+        The fixed two-pass scan must build ``id_map`` completely in Pass 1 and
+        rewrite tool_result.tool_use_id in Pass 2, regardless of block order.
+        """
+        body = {
+            "messages": [
+                {"role": "user", "content": "ask"},
+                {
+                    "role": "assistant",
+                    "content": [
+                        {
+                            "type": "tool_result",
+                            "tool_use_id": "srvtoolu_oof",
+                            "content": "out",
+                        },
+                        {
+                            "type": "server_tool_use",
+                            "id": "srvtoolu_oof",
+                            "name": "bash",
+                            "input": {},
+                        },
+                    ],
+                },
+            ],
+        }
+        count, id_map = _rewrite_srvtoolu_ids(body)
+        assert count == 1
+        new_id = id_map["srvtoolu_oof"]
+        assert new_id.startswith("toolu_normalized_")
+
+        blocks = body["messages"][1]["content"]
+        tool_result_block = next(b for b in blocks if b.get("type") == "tool_result")
+        tool_use_block = next(b for b in blocks if b.get("type") == "tool_use")
+        assert tool_result_block["tool_use_id"] == new_id
+        assert tool_use_block["id"] == new_id
+        assert tool_use_block["type"] == "tool_use"
+
+    def test_two_pass_handles_tool_result_in_earlier_user_message(self):
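+        """Cross-message out-of-order case: tool_result appears first, in an earlier user message.
+
+        When the old single-pass scan reached the user tool_result in msg[0], ``id_map`` did
+        not yet contain ``srvtoolu_late`` (its tool_use is in msg[1]), so the rewrite was missed;
+        the two-pass scan must still rewrite tool_result.tool_use_id correctly in this scenario.
+        """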
+        body = {
+            "messages": [
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "tool_result",
+                            "tool_use_id": "srvtoolu_late",
+                            "content": "prefetched",
+                        },
+                    ],
+                },
+                {
+                    "role": "assistant",
+                    "content": [
+                        {
+                            "type": "server_tool_use",
+                            "id": "srvtoolu_late",
+                            "name": "bash",
+                            "input": {},
+                        },
+                    ],
+                },
+            ],
+        }
+        count, id_map = _rewrite_srvtoolu_ids(body)
+        assert count == 1
+        new_id = id_map["srvtoolu_late"]
+        assert body["messages"][0]["content"][0]["tool_use_id"] == new_id, (
+            "Pass 2 must rewrite a tool_result.tool_use_id that appears before its tool_use"
+        )
+        assert body["messages"][1]["content"][0]["id"] == new_id
+
 
 # ── infer_source_vendor_from_body 单元测试 ─────────────────────────
 
@@ -1582,6 +1669,209 @@ def test_next_message_is_assistant_inserts_user(self):
         assert messages[2]["role"] == "assistant"
 
 
+# ── _enforce_pairing_sanity_pass unit tests (defense-in-depth fallback layer) ─────────────
+
+
+class TestEnforcePairingSanityPass:
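+    """Unit tests for ``_enforce_pairing_sanity_pass``.
+
+    Defense in depth that runs after the enforce main loop. The helper is tested directly so that,
+    even if a future main-loop refactor misses a case, it still upholds the Anthropic pairing constraint.
+    """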
+
+    def test_noop_when_all_paired(self):
+        """Returns an empty list and leaves the input unmodified when every tool_use is paired."""
+        messages = [
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "tool_use", "id": "toolu_x", "name": "bash", "input": {}}
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "tool_result", "tool_use_id": "toolu_x", "content": "ok"}
+                ],
+            },
+        ]
+        snapshot = copy.deepcopy(messages)
+        result = _enforce_pairing_sanity_pass(messages)
+        assert result == []
+        assert messages == snapshot
+
+    def test_appends_is_error_placeholder_when_user_lacks_tool_result(self):
+        """Appends an is_error placeholder when the assistant has a tool_use but the user lacks a tool_result."""
+        messages = [
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "tool_use", "id": "toolu_x", "name": "bash", "input": {}}
+                ],
+            },
+            {"role": "user", "content": [{"type": "text", "text": "ok"}]},
+        ]
+        result = _enforce_pairing_sanity_pass(messages)
+        assert result == ["pairing_sanity_repaired"]
+        user_content = messages[1]["content"]
+        appended = next(b for b in user_content if b.get("type") == "tool_result")
+        assert appended == {
+            "type": "tool_result",
+            "tool_use_id": "toolu_x",
+            "content": "",
+            "is_error": True,
+        }
+
+    def test_repairs_only_missing_ids_when_partially_paired(self):
+        """With 3 tool_use blocks but only 2 user tool_results, only the missing one is added."""
+        messages = [
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "tool_use", "id": "toolu_a", "name": "bash", "input": {}},
+                    {"type": "tool_use", "id": "toolu_b", "name": "read", "input": {}},
+                    {"type": "tool_use", "id": "toolu_c", "name": "write", "input": {}},
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "tool_result", "tool_use_id": "toolu_a", "content": "a"},
+                    {"type": "tool_result", "tool_use_id": "toolu_c", "content": "c"},
+                ],
+            },
+        ]
+        result = _enforce_pairing_sanity_pass(messages)
+        assert result == ["pairing_sanity_repaired"]
+        result_ids = {
+            b["tool_use_id"]
+            for b in messages[1]["content"]
+            if b.get("type") == "tool_result"
+        }
+        assert result_ids == {"toolu_a", "toolu_b", "toolu_c"}
+        # Only toolu_b is the synthesized is_error fallback placeholder
+        b_block = next(
+            b for b in messages[1]["content"] if b.get("tool_use_id") == "toolu_b"
+        )
+        assert b_block.get("is_error") is True
+        a_block = next(
+            b for b in messages[1]["content"] if b.get("tool_use_id") == "toolu_a"
+        )
+        assert a_block.get("is_error") is not True
+
+    def test_warns_when_next_message_not_user(self, caplog):
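+        """When the next message is not a user message: log a WARNING only, modify nothing, return no adaptation.
+
+        The main loop normally guarantees the next message is a user message; this is an observability fallback for the degenerate case.
+        """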
+        messages = [
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "tool_use", "id": "toolu_x", "name": "bash", "input": {}}
+                ],
+            },
+            {
+                "role": "assistant",
+                "content": [{"type": "text", "text": "weird"}],
+            },
+        ]
+        snapshot = copy.deepcopy(messages)
+        import logging
+
+        with caplog.at_level(
+            logging.WARNING, logger="coding.proxy.convert.vendor_channels"
+        ):
+            result = _enforce_pairing_sanity_pass(messages)
+        assert result == []
+        assert messages == snapshot
+        assert any("Sanity pass" in rec.message for rec in caplog.records)
+
+    def test_normalizes_user_string_content_before_repair(self):
+        """String user content is normalized to a list before the placeholder is appended."""
+        messages = [
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "tool_use", "id": "toolu_x", "name": "bash", "input": {}}
+                ],
+            },
+            {"role": "user", "content": "ack"},
+        ]
+        result = _enforce_pairing_sanity_pass(messages)
+        assert result == ["pairing_sanity_repaired"]
+        user_content = messages[1]["content"]
+        assert isinstance(user_content, list)
+        assert user_content[0] == {"type": "text", "text": "ack"}
+        assert user_content[1]["tool_use_id"] == "toolu_x"
+        assert user_content[1]["is_error"] is True
+
+    def test_skips_non_assistant_messages(self):
+        """user / system / malformed messages are all skipped."""
+        messages = [
+            {"role": "user", "content": "hi"},
+            {"role": "system", "content": "ctx"},
+            "not a dict",  # type: ignore[list-item]
+        ]
+        snapshot = copy.deepcopy(messages)
+        result = _enforce_pairing_sanity_pass(messages)
+        assert result == []
+        assert messages == snapshot
+
+    def test_skips_assistant_without_tool_use(self):
+        """A plain-text assistant message (no tool_use) short-circuits and leaves the next user message untouched."""
+        messages = [
+            {
+                "role": "assistant",
+                "content": [{"type": "text", "text": "just chatting"}],
+            },
+            {"role": "user", "content": "ok"},
+        ]
+        snapshot = copy.deepcopy(messages)
+        result = _enforce_pairing_sanity_pass(messages)
+        assert result == []
+        assert messages == snapshot
+
+    def test_enforce_main_loop_chains_sanity_helper(self):
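+        """The sanity helper runs at the end of the main enforce flow; here the main loop repairs first, so its label is absent."""
+        # Build a degenerate scenario intended to slip past the main loop: a single unpaired tool_use,
+        # with an unrelated tool_result pre-placed on the user side to get around the main loop's existing check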
+        messages = [
+            {
+                "role": "assistant",
+                "content": [
+                    {
+                        "type": "tool_use",
+                        "id": "toolu_main",
+                        "name": "bash",
+                        "input": {},
+                    }
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "tool_result",
+                        "tool_use_id": "toolu_unrelated",
+                        "content": "x",
+                    }
+                ],
+            },
+        ]
+        fixes = enforce_anthropic_tool_pairing(messages)
+        # Step F of the main loop synthesizes orphaned_tool_use_repaired first, so sanity no longer fires
+        assert "orphaned_tool_use_repaired" in fixes
+        assert "pairing_sanity_repaired" not in fixes
+        # But toolu_main must end up with a matching tool_result
+        result_ids = {
+            b["tool_use_id"]
+            for b in messages[1]["content"]
+            if b.get("type") == "tool_result"
+        }
+        assert "toolu_main" in result_ids
+
+
 # ── 通道层端到端集成（zhipu 产物全量清洗） ───────────────────────────
 
 
@@ -1687,6 +1977,135 @@ def test_full_zhipu_artifacts_combined(self):
         assert relocated[0]["tool_use_id"] == new_id
         assert any("misplaced_tool_result_relocated" in a for a in adaptations)
 
+    def test_handles_out_of_order_inline_tool_result_end_to_end(self):
+        """End-to-end repro of the logged failure: tool_result precedes server_tool_use inside assistant content.
+
+        Minimal equivalent repro of the production log error `messages.3: tool_use ids were
+        found without tool_result blocks immediately after: toolu_normalized_2`.
+
+        The old single-pass ``_rewrite_srvtoolu_ids`` missed the ``tool_use_id`` of such a misplaced
+        tool_result, so enforce keyed its extracted_tool_results dict by the old ID while
+        tool_use_ids already held the new ID, misaligning the pairing. The fixed two-pass scan
+        ensures every assistant.tool_use_id matches the next user.tool_result.tool_use_id
+        one-to-one, and that no ``srvtoolu_*`` / ``server_tool_use`` remains in the message body.
+        """
+        body = {
+            "messages": [
+                {"role": "user", "content": "begin"},
+                # Round 1: normal pairing, establishes toolu_normalized_1
+                {
+                    "role": "assistant",
+                    "content": [
+                        {
+                            "type": "thinking",
+                            "thinking": "...",
+                            "signature": "zhipu_sig_1",
+                        },
+                        {
+                            "type": "server_tool_use",
+                            "id": "srvtoolu_first",
+                            "name": "bash",
+                            "input": {},
+                        },
+                    ],
+                },
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "tool_result",
+                            "tool_use_id": "srvtoolu_first",
+                            "content": "first ok",
+                        }
+                    ],
+                },
+                # Round 2: failure shape, tool_result inlined before server_tool_use
+                {
+                    "role": "assistant",
+                    "content": [
+                        {
+                            "type": "thinking",
+                            "thinking": "...",
+                            "signature": "zhipu_sig_2",
+                        },
+                        {
+                            "type": "tool_result",
+                            "tool_use_id": "srvtoolu_second",
+                            "content": "inline glm5",
+                        },
+                        {
+                            "type": "server_tool_use",
+                            "id": "srvtoolu_second",
+                            "name": "bash",
+                            "input": {},
+                        },
+                    ],
+                },
+                {"role": "user", "content": "continue"},
+            ],
+        }
+        prepared, adaptations = prepare_zhipu_to_anthropic(body)
+        messages = prepared["messages"]
+
+        # No assistant message may retain server_tool_use / srvtoolu_* / tool_result
+        for msg in messages:
+            if msg.get("role") != "assistant":
+                continue
+            for b in msg.get("content", []):
+                assert isinstance(b, dict)
+                assert b.get("type") != "server_tool_use"
+                assert b.get("type") != "tool_result"
+                bid = b.get("id")
+                if isinstance(bid, str):
+                    assert not bid.startswith("srvtoolu_"), (
+                        f"assistant content retains a srvtoolu_* ID: {bid}"
+                    )
+
+        # No tool_result.tool_use_id may keep the srvtoolu_* form
+        for msg in messages:
+            for b in msg.get("content") or []:
+                if isinstance(b, dict) and b.get("type") == "tool_result":
+                    tid = b.get("tool_use_id")
+                    assert isinstance(tid, str)
+                    assert not tid.startswith("srvtoolu_"), (
+                        f"tool_result still references an old srvtoolu_* ID: {tid}"
+                    )
+
+        # Every assistant tool_use.id must match a tool_result in the next user message
+        for i, msg in enumerate(messages):
+            if msg.get("role") != "assistant":
+                continue
+            tool_use_ids = [
+                b["id"]
+                for b in (msg.get("content") or [])
+                if isinstance(b, dict) and b.get("type") == "tool_use" and b.get("id")
+            ]
+            if not tool_use_ids:
+                continue
+            next_msg = messages[i + 1]
+            assert next_msg.get("role") == "user"
+            next_tool_result_ids = {
+                b["tool_use_id"]
+                for b in (next_msg.get("content") or [])
+                if isinstance(b, dict)
+                and b.get("type") == "tool_result"
+                and b.get("tool_use_id")
+            }
+            for uid in tool_use_ids:
+                assert uid in next_tool_result_ids, (
+                    f"messages[{i}].tool_use_id={uid} has no matching tool_result in "
+                    f"messages[{i + 1}] (next ids = {next_tool_result_ids})"
+                )
+
+        # adaptations must cover the key transformations
+        assert any("srvtoolu_ids" in a for a in adaptations)
+        assert any("misplaced_tool_result_relocated" in a for a in adaptations)
+        assert any("thinking_blocks" in a for a in adaptations)
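+        # The sanity repair must not fire: the two-pass scan plus the main enforce step already completed every pairing
+        assert "pairing_sanity_repaired" not in adaptations
+        # The main enforce step should relocate the inline tool_result and complete the pairing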
+        assert "orphaned_tool_use_repaired" not in adaptations
+
 
 class TestZhipuToCopilotChannelFullCleanup:
     """Verify that prepare_zhipu_to_copilot fully cleans up zhipu artifacts."""
@@ -1729,3 +2148,80 @@ def test_rewrites_srvtoolu_and_strips_vendor_delta(self):
         assert prepared["messages"][1]["content"][0]["tool_use_id"] == new_id
         assert any("zhipu_vendor_blocks" in a for a in adaptations)
         assert any("srvtoolu_ids" in a for a in adaptations)
+
+
+# ── normalize_for_zhipu shared cleanup function ─────────────
+
+
+class TestNormalizeForZhipu:
+    """Tests for the shared normalize_for_zhipu cleanup function."""
+
+    def test_strips_cache_control_and_params(self):
+        body = {
+            "model": "claude-sonnet-4-20250514",
+            "messages": [],
+            "thinking": {"type": "enabled", "budget_tokens": 5000},
+            "extended_thinking": {"type": "enabled"},
+            "reasoning_effort": "high",
+            "system": [
+                {
+                    "type": "text",
+                    "text": "sys",
+                    "cache_control": {"type": "ephemeral"},
+                },
+            ],
+            "tools": [
+                {
+                    "name": "Bash",
+                    "input_schema": {"type": "object"},
+                    "cache_control": {"type": "ephemeral"},
+                },
+            ],
+        }
+        result, adaptations = normalize_for_zhipu(body)
+
+        assert "thinking" not in result
+        assert "extended_thinking" not in result
+        assert "reasoning_effort" not in result
+        assert "cache_control" not in result["system"][0]
+        assert "cache_control" not in result["tools"][0]
+        assert any("cache_control" in a for a in adaptations)
+        assert any("thinking" in a for a in adaptations)
+        assert any("reasoning_effort" in a for a in adaptations)
+
+    def test_operates_in_place(self):
+        body = {"model": "x", "messages": []}
+        result, _ = normalize_for_zhipu(body)
+        assert result is body
+
+    def test_idempotent(self):
+        body = {
+            "model": "x",
+            "messages": [],
+            "thinking": {"type": "enabled"},
+        }
+        normalize_for_zhipu(body)
+        _, adaptations = normalize_for_zhipu(body)
+        assert adaptations == []
+
+    def test_no_deep_copy(self):
+        messages = [{"role": "user", "content": "hi"}]
+        body = {"model": "x", "messages": messages}
+        result, _ = normalize_for_zhipu(body)
+        assert result["messages"] is messages
+
+    def test_preserves_supported_params(self):
+        body = {
+            "model": "x",
+            "messages": [{"role": "user", "content": "hello"}],
+            "max_tokens": 1024,
+            "temperature": 0.7,
+            "stream": True,
+            "metadata": {"user_id": "test"},
+        }
+        result, adaptations = normalize_for_zhipu(body)
+        assert result["max_tokens"] == 1024
+        assert result["temperature"] == 0.7
+        assert result["stream"] is True
+        assert result["metadata"] == {"user_id": "test"}
+        assert adaptations == []
diff --git a/tests/test_vendors.py b/tests/test_vendors.py
index f771e9b..bc72602 100644
--- a/tests/test_vendors.py
+++ b/tests/test_vendors.py
@@ -396,7 +396,7 @@ async def test_zhipu_prepare_request_preserves_metadata():
 
 @pytest.mark.asyncio
 async def test_zhipu_prepare_request_preserves_thinking():
-    """ZhipuVendor._prepare_request should keep the thinking field as-is (supported by the native endpoint)."""
+    """ZhipuVendor._prepare_request should keep thinking.type=enabled as-is (natively supported by GLM)."""
     mapper = ModelMapper([])
     zhipu_vendor = ZhipuVendor(ZhipuConfig(api_key="sk-test"), mapper)
     body = {
@@ -405,12 +405,35 @@ async def test_zhipu_prepare_request_preserves_thinking():
         "thinking": {"type": "enabled", "budget_tokens": 10000},
     }
     prepared_body, _ = await zhipu_vendor._prepare_request(body, {})
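-    # thinking is passed through as-is; no fields are stripped anymore
+    # thinking.type=enabled is passed through as-is (natively supported by GLM)
     assert prepared_body["thinking"] == {"type": "enabled", "budget_tokens": 10000}
     # The original body must not be modified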
     assert body["thinking"]["budget_tokens"] == 10000
 
 
+@pytest.mark.asyncio
+async def test_zhipu_prepare_request_converts_thinking_adaptive():
+    """ZhipuVendor._prepare_request should convert thinking.type=adaptive to enabled+budget.
+
+    GLM does not support the adaptive type, so it is converted to the enabled + budget_tokens
+    format, which is confirmed safe, so that thinking capability is not lost.
+    """
+    mapper = ModelMapper([])
+    zhipu_vendor = ZhipuVendor(ZhipuConfig(api_key="sk-test"), mapper)
+    body = {
+        "model": "claude-opus-4-7",
+        "messages": [],
+        "thinking": {"type": "adaptive"},
+    }
+    prepared_body, _ = await zhipu_vendor._prepare_request(body, {})
+
+    # adaptive should be converted to enabled + budget
+    assert prepared_body["thinking"]["type"] == "enabled"
+    assert prepared_body["thinking"]["budget_tokens"] == 16000
+    # The original body must not be modified
+    assert body["thinking"] == {"type": "adaptive"}
+
+
 @pytest.mark.asyncio
 async def test_zhipu_prepare_request_preserves_anthropic_beta_header():
     zhipu_vendor = ZhipuVendor(ZhipuConfig(api_key="sk-test"), ModelMapper([]))
diff --git a/tests/test_zhipu.py b/tests/test_zhipu.py
index 2eceb41..aa05b21 100644
--- a/tests/test_zhipu.py
+++ b/tests/test_zhipu.py
@@ -5,20 +5,23 @@
   - The rest of the request body/response is passed through as-is
   - 401 error normalization
   - All capability declarations are NATIVE
+  - 429 Rate Limit retry recovery
 """
 
 import json
+from unittest.mock import AsyncMock, patch
 
+import httpx
 import pytest
 
 from coding.proxy.compat.canonical import CompatibilityStatus
 from coding.proxy.config.schema import ModelMappingRule, ZhipuConfig
 from coding.proxy.routing.model_mapper import ModelMapper
+from coding.proxy.vendors.native_anthropic import NativeAnthropicVendor
 from coding.proxy.vendors.zhipu import ZhipuVendor
 
 
-@pytest.fixture
-def zhipu_vendor():
+def _make_zhipu_vendor(api_key: str = "test-zhipu-key") -> ZhipuVendor:
     """Create a ZhipuVendor instance with the default configuration."""
     mapper = ModelMapper(
         [
@@ -42,7 +45,13 @@ def zhipu_vendor():
             ),
         ]
     )
-    return ZhipuVendor(ZhipuConfig(api_key="test-zhipu-key"), mapper)
+    return ZhipuVendor(ZhipuConfig(api_key=api_key), mapper)
+
+
+@pytest.fixture
+def zhipu_vendor():
+    """Create a ZhipuVendor instance with the default configuration."""
+    return _make_zhipu_vendor()
 
 
 # ── Model mapping ─────────────────────────────────────────
@@ -69,7 +78,7 @@ def test_unknown_model_falls_back_to_default(self, zhipu_vendor):
 
 
 class TestRequestPassthrough:
-    """Verify that _prepare_request only modifies model and headers."""
+    """Verify model mapping, header replacement, and compatibility conversion in _prepare_request."""
 
     @pytest.mark.asyncio
     async def test_body_passthrough_except_model(self, zhipu_vendor):
@@ -94,24 +103,60 @@ async def test_body_passthrough_except_model(self, zhipu_vendor):
 
         # Only model is mapped
         assert prepared_body["model"] == "glm-5.1"
+        # thinking.type=enabled is kept as-is (natively supported by GLM)
+        assert prepared_body["thinking"] == {"type": "enabled", "budget_tokens": 5000}
         # All other fields are kept as-is
         assert prepared_body["max_tokens"] == 1024
         assert prepared_body["temperature"] == 0.7
         assert prepared_body["top_p"] == 0.9
         assert prepared_body["stream"] is True
-        # thinking is no longer stripped
-        assert prepared_body["thinking"] == {"type": "enabled", "budget_tokens": 5000}
-        # metadata is no longer stripped
         assert prepared_body["metadata"] == {"user_id": "test-user"}
-        # system is not removed
         assert prepared_body["system"] == "You are a helpful assistant."
-        # tools are not truncated or filtered
         assert len(prepared_body["tools"]) == 3
-        # tool_choice is not modified
         assert prepared_body["tool_choice"] == {"type": "auto"}
         # The original body is unmodified (deep copy)
         assert body["model"] == "claude-sonnet-4-20250514"
 
+    @pytest.mark.asyncio
+    async def test_thinking_adaptive_converted_to_enabled(self, zhipu_vendor):
+        """thinking.type=adaptive should be converted to enabled+budget (GLM does not support adaptive)."""
+        body = {
+            "model": "claude-opus-4-7",
+            "messages": [],
+            "thinking": {"type": "adaptive"},
+        }
+        prepared_body, _ = await zhipu_vendor._prepare_request(body, {})
+
+        assert prepared_body["thinking"]["type"] == "enabled"
+        assert prepared_body["thinking"]["budget_tokens"] == 16000
+        # The original body is unmodified
+        assert body["thinking"] == {"type": "adaptive"}
+
+    @pytest.mark.asyncio
+    async def test_thinking_enabled_preserved_unchanged(self, zhipu_vendor):
+        """thinking.type=enabled should be kept as-is (natively supported by GLM)."""
+        body = {
+            "model": "claude-sonnet-4-20250514",
+            "messages": [],
+            "thinking": {"type": "enabled", "budget_tokens": 8000},
+        }
+        prepared_body, _ = await zhipu_vendor._prepare_request(body, {})
+
+        assert prepared_body["thinking"] == {"type": "enabled", "budget_tokens": 8000}
+        assert body["thinking"]["budget_tokens"] == 8000
+
+    @pytest.mark.asyncio
+    async def test_no_thinking_param_unchanged(self, zhipu_vendor):
+        """No conversion is triggered when the thinking parameter is absent."""
+        body = {
+            "model": "claude-sonnet-4-20250514",
+            "messages": [{"role": "user", "content": "hi"}],
+        }
+        prepared_body, _ = await zhipu_vendor._prepare_request(body, {})
+
+        assert "thinking" not in prepared_body
+        assert prepared_body["model"] == "glm-5.1"
+
     @pytest.mark.asyncio
     async def test_headers_replaces_auth(self, zhipu_vendor):
         """Verify that x-api-key is set correctly and authorization is stripped."""
@@ -292,3 +337,332 @@ def test_never_triggers_failover(self, zhipu_vendor):
     async def test_health_check_always_true(self, zhipu_vendor):
         result = await zhipu_vendor.check_health()
         assert result is True
+
+
+# ── 429 Rate Limit retry recovery ───────────────────────────
+
+
+def _make_429_response(
+    headers: dict[str, str] | None = None,
+) -> httpx.Response:
+    """Build a 429 HTTP response."""
+    return httpx.Response(
+        status_code=429,
+        content=b'{"error":{"type":"rate_limit_error","message":"Too many requests"}}',
+        headers=headers or {},
+        request=httpx.Request(
+            "POST", "https://open.bigmodel.cn/api/anthropic/v1/messages"
+        ),
+    )
+
+
+def _make_200_response() -> httpx.Response:
+    """Build a 200 HTTP response."""
+    body = json.dumps(
+        {
+            "id": "msg_test",
+            "type": "message",
+            "role": "assistant",
+            "content": [{"type": "text", "text": "hello"}],
+            "model": "glm-5.1",
+            "usage": {"input_tokens": 10, "output_tokens": 5},
+        }
+    ).encode()
+    return httpx.Response(
+        status_code=200,
+        content=body,
+        headers={"content-type": "application/json"},
+        request=httpx.Request(
+            "POST", "https://open.bigmodel.cn/api/anthropic/v1/messages"
+        ),
+    )
+
+
+class TestRateLimitRetry:
+    """429 Rate Limit retry-recovery mechanism."""
+
+    # ── Non-streaming ──────────────────────────────────────
+
+    @pytest.mark.asyncio
+    async def test_nonstream_429_retries_and_succeeds(self):
+        """Two 429s followed by a 200: the retry succeeds."""
+        vendor = _make_zhipu_vendor()
+        call_count = 0
+
+        async def mock_post(*args, **kwargs):
+            nonlocal call_count
+            call_count += 1
+            if call_count <= 2:
+                return _make_429_response()
+            return _make_200_response()
+
+        with patch.object(vendor, "_get_client") as mock_client:
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            resp = await vendor.send_message(
+                {"model": "claude-sonnet-4-20250514", "messages": []},
+                {},
+            )
+
+        assert resp.status_code == 200
+        assert call_count == 3
+
+    @pytest.mark.asyncio
+    async def test_nonstream_429_exhausted_retries(self):
+        """Five consecutive 429s: returns 429 once retries are exhausted."""
+        vendor = _make_zhipu_vendor()
+        call_count = 0
+
+        async def mock_post(*args, **kwargs):
+            nonlocal call_count
+            call_count += 1
+            return _make_429_response()
+
+        with patch.object(vendor, "_get_client") as mock_client:
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            with patch("asyncio.sleep", new_callable=AsyncMock):
+                resp = await vendor.send_message(
+                    {"model": "claude-sonnet-4-20250514", "messages": []},
+                    {},
+                )
+
+        assert resp.status_code == 429
+        assert call_count == 5
+
+    @pytest.mark.asyncio
+    async def test_nonstream_non_429_no_retry(self):
+        """A 500 does not trigger a retry."""
+        vendor = _make_zhipu_vendor()
+        call_count = 0
+
+        async def mock_post(*args, **kwargs):
+            nonlocal call_count
+            call_count += 1
+            return httpx.Response(
+                status_code=500,
+                content=b'{"error":{"type":"api_error","message":"Internal error"}}',
+                request=httpx.Request("POST", "https://example.com"),
+            )
+
+        with patch.object(vendor, "_get_client") as mock_client:
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            resp = await vendor.send_message(
+                {"model": "claude-sonnet-4-20250514", "messages": []},
+                {},
+            )
+
+        assert resp.status_code == 500
+        assert call_count == 1
+
+    # ── Streaming ──────────────────────────────────────────
+
+    @pytest.mark.asyncio
+    async def test_stream_429_retries_and_succeeds(self):
+        """Streaming: succeeds after two 429s."""
+        call_count = 0
+
+        async def fake_stream(self, body, headers):
+            nonlocal call_count
+            call_count += 1
+            if call_count <= 2:
+                resp = _make_429_response()
+                raise httpx.HTTPStatusError(
+                    "429",
+                    request=resp.request,
+                    response=resp,
+                )
+            yield b'data: {"type":"content_block_start"}\n\n'
+            yield b'data: {"type":"content_block_delta"}\n\n'
+
+        vendor = _make_zhipu_vendor()
+        chunks = []
+        with (
+            patch.object(NativeAnthropicVendor, "send_message_stream", fake_stream),
+            patch("asyncio.sleep", new_callable=AsyncMock),
+        ):
+            async for chunk in vendor.send_message_stream(
+                {"model": "claude-sonnet-4-20250514", "messages": []},
+                {},
+            ):
+                chunks.append(chunk)
+
+        assert len(chunks) == 2
+        assert call_count == 3
+
+    @pytest.mark.asyncio
+    async def test_stream_429_exhausted_retries_raises(self):
+        """Streaming: consecutive 429s raise once retries are exhausted."""
+        call_count = 0
+
+        async def fake_stream(self, body, headers):
+            nonlocal call_count
+            call_count += 1
+            resp = _make_429_response()
+            raise httpx.HTTPStatusError(
+                "429",
+                request=resp.request,
+                response=resp,
+            )
+            yield  # Makes the function an async generator (unreachable; only affects the function kind)
+
+        vendor = _make_zhipu_vendor()
+        with (
+            patch.object(NativeAnthropicVendor, "send_message_stream", fake_stream),
+            patch("asyncio.sleep", new_callable=AsyncMock),
+            pytest.raises(httpx.HTTPStatusError) as exc_info,
+        ):
+            async for _ in vendor.send_message_stream(
+                {"model": "claude-sonnet-4-20250514", "messages": []},
+                {},
+            ):
+                pass
+
+        assert exc_info.value.response.status_code == 429
+        assert call_count == 5
+
+    @pytest.mark.asyncio
+    async def test_stream_500_no_retry_raises(self):
+        """Streaming: a 500 does not trigger a retry and raises immediately."""
+        call_count = 0
+
+        async def fake_stream(self, body, headers):
+            nonlocal call_count
+            call_count += 1
+            resp = httpx.Response(
+                status_code=500,
+                content=b'{"error":{"type":"api_error"}}',
+                request=httpx.Request("POST", "https://example.com"),
+            )
+            raise httpx.HTTPStatusError(
+                "500",
+                request=resp.request,
+                response=resp,
+            )
+            yield  # Makes the function an async generator
+
+        vendor = _make_zhipu_vendor()
+        with (
+            patch.object(NativeAnthropicVendor, "send_message_stream", fake_stream),
+            pytest.raises(httpx.HTTPStatusError) as exc_info,
+        ):
+            async for _ in vendor.send_message_stream(
+                {"model": "claude-sonnet-4-20250514", "messages": []},
+                {},
+            ):
+                pass
+
+        assert exc_info.value.response.status_code == 500
+        assert call_count == 1
+
+    # ── retry-after header ─────────────────────────────────
+
+    @pytest.mark.asyncio
+    async def test_respects_retry_after_header(self):
+        """Use the server-suggested delay when the response carries retry-after."""
+        vendor = _make_zhipu_vendor()
+        call_count = 0
+        sleep_delays = []
+
+        async def mock_post(*args, **kwargs):
+            nonlocal call_count
+            call_count += 1
+            if call_count == 1:
+                return _make_429_response(headers={"retry-after": "2"})
+            return _make_200_response()
+
+        async def mock_sleep(delay):
+            sleep_delays.append(delay)
+
+        with (
+            patch.object(vendor, "_get_client") as mock_client,
+            patch("asyncio.sleep", side_effect=mock_sleep),
+        ):
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            resp = await vendor.send_message(
+                {"model": "claude-sonnet-4-20250514", "messages": []},
+                {},
+            )
+
+        assert resp.status_code == 200
+        assert len(sleep_delays) == 1
+        # retry-after=2 → 2 * 1.1 = 2.2s → 2200ms → sleep(2.2)
+        assert 2.0 <= sleep_delays[0] <= 2.2
+
+    # ── Backoff delay growth ───────────────────────────────
+
+    @pytest.mark.asyncio
+    async def test_backoff_delays_increase(self):
+        """Without retry-after, delays grow exponentially."""
+        vendor = _make_zhipu_vendor()
+        sleep_delays = []
+
+        async def mock_sleep(delay):
+            sleep_delays.append(delay)
+
+        # Disable jitter so the delays can be verified exactly
+        import dataclasses
+
+        original_jitter = vendor._rl_retry.jitter
+        vendor._rl_retry = dataclasses.replace(vendor._rl_retry, jitter=False)
+
+        call_count = 0
+
+        async def mock_post(*args, **kwargs):
+            nonlocal call_count
+            call_count += 1
+            if call_count <= 4:
+                return _make_429_response()
+            return _make_200_response()
+
+        try:
+            with (
+                patch.object(vendor, "_get_client") as mock_client,
+                patch("asyncio.sleep", side_effect=mock_sleep),
+            ):
+                client = AsyncMock()
+                client.post = mock_post
+                mock_client.return_value = client
+
+                resp = await vendor.send_message(
+                    {"model": "claude-sonnet-4-20250514", "messages": []},
+                    {},
+                )
+
+            assert resp.status_code == 200
+            assert len(sleep_delays) == 4
+            # initial=1000ms, multiplier=2.0
+            # attempt 0: 1000 * 2^0 = 1000ms → sleep(1.0)
+            # attempt 1: 1000 * 2^1 = 2000ms → sleep(2.0)
+            # attempt 2: 1000 * 2^2 = 4000ms → sleep(4.0)
+            # attempt 3: 1000 * 2^3 = 8000ms → sleep(8.0)
+            assert sleep_delays[0] == pytest.approx(1.0)
+            assert sleep_delays[1] == pytest.approx(2.0)
+            assert sleep_delays[2] == pytest.approx(4.0)
+            assert sleep_delays[3] == pytest.approx(8.0)
+        finally:
+            vendor._rl_retry = dataclasses.replace(
+                vendor._rl_retry, jitter=original_jitter
+            )
+
+    # ── Missing API key ────────────────────────────────────
+
+    @pytest.mark.asyncio
+    async def test_missing_api_key_skips_retry(self):
+        """A missing API key fails fast with 401 and does not trigger 429 retries."""
+        vendor = _make_zhipu_vendor(api_key="")
+        resp = await vendor.send_message(
+            {"model": "claude-sonnet-4-20250514", "messages": []},
+            {},
+        )
+        assert resp.status_code == 401
diff --git a/tests/test_zhipu_concurrency.py b/tests/test_zhipu_concurrency.py
new file mode 100644
index 0000000..7566b24
--- /dev/null
+++ b/tests/test_zhipu_concurrency.py
@@ -0,0 +1,557 @@
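+"""Dedicated tests for Zhipu per-model concurrency limits.
+
+Verify concurrency control once ``ModelConcurrencyLimiter`` is integrated with ``ZhipuVendor``:
+  - With the default ``concurrency.default=3``, a single model runs at most 3 concurrent requests
+  - Requests over the limit queue in FIFO order and wake only after a slot is released
+  - Different models are independent and do not block each other
+  - The Semaphore is still released on exception paths, avoiding leaks
+  - Streaming and non-streaming requests share the same semaphore
+  - Compatible with the 429 retry mechanism (the slot stays held during retries)
+  - ``concurrency=None`` disables the limit (backward compatible)
+"""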
+
+from __future__ import annotations
+
+import asyncio
+import json
+from unittest.mock import AsyncMock, patch
+
+import httpx
+import pytest
+
+from coding.proxy.config.schema import (
+    ModelMappingRule,
+    ZhipuConcurrencyConfig,
+    ZhipuConfig,
+)
+from coding.proxy.routing.model_mapper import ModelMapper
+from coding.proxy.vendors.concurrency import ModelConcurrencyLimiter
+from coding.proxy.vendors.native_anthropic import NativeAnthropicVendor
+from coding.proxy.vendors.zhipu import ZhipuVendor
+
+# ─── Test helpers ──────────────────────────────────────────
+
+
+def _make_mapper() -> ModelMapper:
+    """Build a ModelMapper with the standard three-model mapping."""
+    return ModelMapper(
+        [
+            ModelMappingRule(
+                pattern="claude-sonnet-.*",
+                target="glm-5v-turbo",
+                is_regex=True,
+                vendors=["zhipu"],
+            ),
+            ModelMappingRule(
+                pattern="claude-opus-.*",
+                target="glm-5.1",
+                is_regex=True,
+                vendors=["zhipu"],
+            ),
+            ModelMappingRule(
+                pattern="claude-haiku-.*",
+                target="glm-4.5-air",
+                is_regex=True,
+                vendors=["zhipu"],
+            ),
+        ]
+    )
+
+
+def _make_vendor(
+    concurrency: ZhipuConcurrencyConfig | None = None,
+    api_key: str = "test-zhipu-key",
+) -> ZhipuVendor:
+    """Build a ZhipuVendor; concurrency limiting is enabled by default (default=3)."""
+    cfg_kwargs: dict = {"api_key": api_key}
+    if concurrency is not None:
+        cfg_kwargs["concurrency"] = concurrency
+    return ZhipuVendor(ZhipuConfig(**cfg_kwargs), _make_mapper())
+
+
+def _make_200_response() -> httpx.Response:
+    body = json.dumps(
+        {
+            "id": "msg_test",
+            "type": "message",
+            "role": "assistant",
+            "content": [{"type": "text", "text": "ok"}],
+            "model": "glm-5.1",
+            "usage": {"input_tokens": 1, "output_tokens": 1},
+        }
+    ).encode()
+    return httpx.Response(
+        status_code=200,
+        content=body,
+        headers={"content-type": "application/json"},
+        request=httpx.Request(
+            "POST", "https://open.bigmodel.cn/api/anthropic/v1/messages"
+        ),
+    )
+
+
+def _make_429_response() -> httpx.Response:
+    return httpx.Response(
+        status_code=429,
+        content=b'{"error":{"type":"rate_limit_error","message":"slow down"}}',
+        headers={},
+        request=httpx.Request(
+            "POST", "https://open.bigmodel.cn/api/anthropic/v1/messages"
+        ),
+    )
+
+
+# ─── Config-layer tests ────────────────────────────────────
+
+
+class TestZhipuConcurrencyConfig:
+    """Behavior of the ZhipuConcurrencyConfig config model."""
+
+    def test_defaults(self) -> None:
+        cfg = ZhipuConcurrencyConfig()
+        assert cfg.default == 3
+        assert cfg.models == {}
+
+    def test_get_limit_falls_back_to_default(self) -> None:
+        cfg = ZhipuConcurrencyConfig(default=5)
+        assert cfg.get_limit("glm-5.1") == 5
+        assert cfg.get_limit("any-unknown-model") == 5
+
+    def test_get_limit_uses_per_model_override(self) -> None:
+        cfg = ZhipuConcurrencyConfig(default=3, models={"glm-5v-turbo": 1})
+        assert cfg.get_limit("glm-5v-turbo") == 1
+        assert cfg.get_limit("glm-5.1") == 3  # falls back to default when not overridden
+
+    def test_default_must_be_positive(self) -> None:
+        with pytest.raises(ValueError):
+            ZhipuConcurrencyConfig(default=0)
+
+    def test_zhipu_config_default_concurrency(self) -> None:
+        cfg = ZhipuConfig()
+        assert cfg.concurrency is not None
+        assert cfg.concurrency.default == 3
+
+
+# ─── ModelConcurrencyLimiter unit tests ────────────────────
+
+
+class TestModelConcurrencyLimiter:
+    """Basic behavior of ModelConcurrencyLimiter."""
+
+    @pytest.mark.asyncio
+    async def test_lazy_semaphore_creation(self) -> None:
+        limiter = ModelConcurrencyLimiter(ZhipuConcurrencyConfig(default=2))
+        slot_a = limiter._get_or_create_slot("model-a")
+        slot_b = limiter._get_or_create_slot("model-b")
+        # Different models get independent slots
+        assert slot_a is not slot_b
+        # The same model reuses its slot
+        assert limiter._get_or_create_slot("model-a") is slot_a
+
+    @pytest.mark.asyncio
+    async def test_acquire_blocks_when_full(self) -> None:
+        limiter = ModelConcurrencyLimiter(ZhipuConcurrencyConfig(default=2))
+
+        # Fill both slots
+        sem1 = await limiter.acquire("glm-5.1")
+        sem2 = await limiter.acquire("glm-5.1")
+        assert sem1 is sem2  # same semaphore
+
+        # The third acquire must block
+        task = asyncio.create_task(limiter.acquire("glm-5.1"))
+        await asyncio.sleep(0.05)
+        assert not task.done(), "the third request should be waiting in the queue"
+
+        # After one slot is released, the waiter is woken
+        sem1.release()
+        await asyncio.sleep(0.05)
+        assert task.done()
+        (await task).release()
+        sem2.release()
+
+    @pytest.mark.asyncio
+    async def test_per_model_independent(self) -> None:
+        limiter = ModelConcurrencyLimiter(
+            ZhipuConcurrencyConfig(default=1, models={"glm-5.1": 1})
+        )
+        # Fill the only glm-5.1 slot
+        sem_51 = await limiter.acquire("glm-5.1")
+        # glm-5v-turbo can still be acquired immediately
+        sem_5v = await asyncio.wait_for(limiter.acquire("glm-5v-turbo"), timeout=0.5)
+        assert sem_51 is not sem_5v
+        sem_51.release()
+        sem_5v.release()
+
+    def test_diagnostics_snapshot(self) -> None:
+        limiter = ModelConcurrencyLimiter(ZhipuConcurrencyConfig(default=3))
+        # Trigger slot creation
+        limiter._get_or_create_slot("glm-5.1")
+        snap = limiter.get_diagnostics()
+        assert "glm-5.1" in snap
+        assert snap["glm-5.1"]["limit"] == 3
+        assert snap["glm-5.1"]["available"] == 3
+        assert snap["glm-5.1"]["in_use"] == 0
+
+
+# ─── ZhipuVendor integration tests: non-streaming ──────────
+
+
+class TestZhipuVendorNonStreamConcurrency:
+    """Concurrency-limit behavior of non-streaming send_message."""
+
+    @pytest.mark.asyncio
+    async def test_limits_parallel_requests(self) -> None:
+        """With concurrency.default=2, only 2 of 3 concurrent requests run at the same time."""
+        vendor = _make_vendor(ZhipuConcurrencyConfig(default=2))
+        active = 0
+        peak = 0
+        gate = asyncio.Event()
+
+        async def mock_post(*_, **__) -> httpx.Response:
+            nonlocal active, peak
+            active += 1
+            peak = max(peak, active)
+            # Wait for an external release to keep the concurrency observation window open
+            await gate.wait()
+            active -= 1
+            return _make_200_response()
+
+        with patch.object(vendor, "_get_client") as mock_client:
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            tasks = [
+                asyncio.create_task(
+                    vendor.send_message(
+                        {"model": "claude-opus-4-6", "messages": []},
+                        {},
+                    )
+                )
+                for _ in range(3)
+            ]
+            # Wait for two requests to become active
+            for _ in range(40):
+                if active >= 2:
+                    break
+                await asyncio.sleep(0.01)
+
+            assert active == 2, "exactly 2 requests should be running (the 3rd is queued)"
+            gate.set()
+            results = await asyncio.gather(*tasks)
+            assert all(r.status_code == 200 for r in results)
+            assert peak == 2, "peak concurrency must not exceed 2"
+
+    @pytest.mark.asyncio
+    async def test_per_model_independent(self) -> None:
+        """Slots for different models do not affect each other."""
+        cfg = ZhipuConcurrencyConfig(
+            default=3,
+            models={"glm-5v-turbo": 1, "glm-5.1": 1},
+        )
+        vendor = _make_vendor(cfg)
+        gate = asyncio.Event()
+        seen_models: list[str] = []
+
+        async def mock_post(*_args, **kwargs) -> httpx.Response:
+            body = kwargs.get("json", {})
+            seen_models.append(body.get("model", ""))
+            await gate.wait()
+            return _make_200_response()
+
+        with patch.object(vendor, "_get_client") as mock_client:
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            # claude-opus → glm-5.1, claude-sonnet → glm-5v-turbo;
+            # they use two independent semaphores and should run concurrently
+            task_opus = asyncio.create_task(
+                vendor.send_message(
+                    {"model": "claude-opus-4-6", "messages": []},
+                    {},
+                )
+            )
+            task_sonnet = asyncio.create_task(
+                vendor.send_message(
+                    {"model": "claude-sonnet-4-6", "messages": []},
+                    {},
+                )
+            )
+            for _ in range(40):
+                if len(seen_models) >= 2:
+                    break
+                await asyncio.sleep(0.01)
+
+            assert len(seen_models) == 2, "two different models should run concurrently"
+            assert set(seen_models) == {"glm-5.1", "glm-5v-turbo"}
+            gate.set()
+            await asyncio.gather(task_opus, task_sonnet)
+
+    @pytest.mark.asyncio
+    async def test_semaphore_released_on_exception(self) -> None:
+        """Upstream exceptions still release the semaphore; later requests proceed."""
+        vendor = _make_vendor(ZhipuConcurrencyConfig(default=1))
+        call_count = 0
+
+        async def mock_post(*_, **__) -> httpx.Response:
+            nonlocal call_count
+            call_count += 1
+            if call_count == 1:
+                raise RuntimeError("upstream boom")
+            return _make_200_response()
+
+        with patch.object(vendor, "_get_client") as mock_client:
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            with pytest.raises(RuntimeError):
+                await vendor.send_message(
+                    {"model": "claude-opus-4-6", "messages": []},
+                    {},
+                )
+
+            # The slot should be free, so the second request completes normally
+            resp = await asyncio.wait_for(
+                vendor.send_message(
+                    {"model": "claude-opus-4-6", "messages": []},
+                    {},
+                ),
+                timeout=1.0,
+            )
+            assert resp.status_code == 200
+
+    @pytest.mark.asyncio
+    async def test_429_retry_holds_slot(self) -> None:
+        """The slot stays held across 429 retries and is released once they finish."""
+        vendor = _make_vendor(ZhipuConcurrencyConfig(default=1))
+        call_count = 0
+
+        async def mock_post(*_, **__) -> httpx.Response:
+            nonlocal call_count
+            call_count += 1
+            if call_count <= 2:
+                return _make_429_response()
+            return _make_200_response()
+
+        with (
+            patch.object(vendor, "_get_client") as mock_client,
+            patch("asyncio.sleep", new_callable=AsyncMock),
+        ):
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            resp = await vendor.send_message(
+                {"model": "claude-opus-4-6", "messages": []},
+                {},
+            )
+            assert resp.status_code == 200
+            assert call_count == 3  # two 429s + one success, all on the same slot
+
+    @pytest.mark.asyncio
+    async def test_no_concurrency_when_config_is_none(self) -> None:
+        """concurrency=None disables limiting, exactly matching the old behavior."""
+        # Force a ZhipuConfig with concurrency=None (bypassing the default factory)
+        cfg = ZhipuConfig(api_key="key")
+        cfg = cfg.model_copy(update={"concurrency": None})
+        vendor = ZhipuVendor(cfg, _make_mapper())
+        assert vendor._concurrency_limiter is None
+
+        gate = asyncio.Event()
+        active = 0
+        peak = 0
+
+        async def mock_post(*_, **__) -> httpx.Response:
+            nonlocal active, peak
+            active += 1
+            peak = max(peak, active)
+            await gate.wait()
+            active -= 1
+            return _make_200_response()
+
+        with patch.object(vendor, "_get_client") as mock_client:
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            tasks = [
+                asyncio.create_task(
+                    vendor.send_message(
+                        {"model": "claude-opus-4-6", "messages": []},
+                        {},
+                    )
+                )
+                for _ in range(5)
+            ]
+            for _ in range(40):
+                if active >= 5:
+                    break
+                await asyncio.sleep(0.01)
+
+            assert peak == 5, "without a limit, all requests should run in parallel"
+            gate.set()
+            await asyncio.gather(*tasks)
+
+
+# ─── ZhipuVendor integration tests: streaming ────────────────
+
+
+class TestZhipuVendorStreamConcurrency:
+    """Concurrency-limit behavior of the streaming send_message_stream."""
+
+    @pytest.mark.asyncio
+    async def test_stream_limits_parallel_requests(self) -> None:
+        """Streaming requests honor the concurrency limit; excess requests queue."""
+        vendor = _make_vendor(ZhipuConcurrencyConfig(default=1))
+        active = 0
+        peak = 0
+        gate = asyncio.Event()
+
+        async def fake_stream(self, _body, _headers):  # noqa: ARG001
+            nonlocal active, peak
+            active += 1
+            peak = max(peak, active)
+            try:
+                await gate.wait()
+                yield b'data: {"type":"message_start"}\n\n'
+            finally:
+                active -= 1
+
+        async def consume(model: str) -> int:
+            chunks: list[bytes] = []
+            async for chunk in vendor.send_message_stream(
+                {"model": model, "messages": []}, {}
+            ):
+                chunks.append(chunk)
+            return len(chunks)
+
+        with patch.object(NativeAnthropicVendor, "send_message_stream", fake_stream):
+            tasks = [asyncio.create_task(consume("claude-opus-4-6")) for _ in range(3)]
+            for _ in range(40):
+                if active >= 1:
+                    break
+                await asyncio.sleep(0.01)
+
+            assert active == 1, "concurrency=1 allows only 1 active stream"
+            gate.set()
+            results = await asyncio.gather(*tasks)
+            assert all(c >= 1 for c in results)
+            assert peak == 1
+
+    @pytest.mark.asyncio
+    async def test_stream_releases_slot_on_completion(self) -> None:
+        """The slot is released once the stream generator is exhausted normally."""
+        vendor = _make_vendor(ZhipuConcurrencyConfig(default=1))
+
+        async def fake_stream(self, _body, _headers):  # noqa: ARG001
+            yield b'data: {"type":"message_start"}\n\n'
+            yield b'data: {"type":"message_stop"}\n\n'
+
+        with patch.object(NativeAnthropicVendor, "send_message_stream", fake_stream):
+            # Two consecutive streaming requests both complete (so the slot was released)
+            for _ in range(2):
+                chunks = []
+                async for chunk in vendor.send_message_stream(
+                    {"model": "claude-opus-4-6", "messages": []}, {}
+                ):
+                    chunks.append(chunk)
+                assert len(chunks) == 2
+
+        # Confirm the slot is fully available again
+        assert vendor._concurrency_limiter is not None
+        slot = vendor._concurrency_limiter._get_or_create_slot("glm-5.1")
+        assert slot.available == 1
+
+    @pytest.mark.asyncio
+    async def test_stream_releases_slot_on_error(self) -> None:
+        """A failing stream still releases its slot, so later requests proceed."""
+        vendor = _make_vendor(ZhipuConcurrencyConfig(default=1))
+        call_count = 0
+
+        async def fake_stream(self, _body, _headers):  # noqa: ARG001
+            nonlocal call_count
+            call_count += 1
+            if call_count == 1:
+                resp = httpx.Response(
+                    status_code=500,
+                    content=b'{"error":{"type":"api_error"}}',
+                    request=httpx.Request("POST", "https://example.com"),
+                )
+                raise httpx.HTTPStatusError("500", request=resp.request, response=resp)
+                yield b""  # unreachable; makes this function an async generator
+            yield b'data: {"type":"message_start"}\n\n'
+
+        with patch.object(NativeAnthropicVendor, "send_message_stream", fake_stream):
+            with pytest.raises(httpx.HTTPStatusError):
+                async for _ in vendor.send_message_stream(
+                    {"model": "claude-opus-4-6", "messages": []}, {}
+                ):
+                    pass
+
+            # The slot should be free, so the second request can proceed
+            chunks = []
+            async for chunk in vendor.send_message_stream(
+                {"model": "claude-opus-4-6", "messages": []}, {}
+            ):
+                chunks.append(chunk)
+            assert chunks == [b'data: {"type":"message_start"}\n\n']
+
+    @pytest.mark.asyncio
+    async def test_stream_and_nonstream_share_semaphore(self) -> None:
+        """Stream and non-stream requests share a semaphore, keyed by mapped model."""
+        vendor = _make_vendor(ZhipuConcurrencyConfig(default=1))
+        gate = asyncio.Event()
+        active = 0
+
+        async def fake_stream(self, _body, _headers):  # noqa: ARG001
+            nonlocal active
+            active += 1
+            try:
+                await gate.wait()
+                yield b'data: {"type":"message_start"}\n\n'
+            finally:
+                active -= 1
+
+        async def mock_post(*_, **__) -> httpx.Response:
+            nonlocal active
+            active += 1
+            active -= 1
+            return _make_200_response()
+
+        with (
+            patch.object(NativeAnthropicVendor, "send_message_stream", fake_stream),
+            patch.object(vendor, "_get_client") as mock_client,
+        ):
+            client = AsyncMock()
+            client.post = mock_post
+            mock_client.return_value = client
+
+            # Start a streaming request and wait for it to take the slot
+            async def consume_stream() -> None:
+                async for _ in vendor.send_message_stream(
+                    {"model": "claude-opus-4-6", "messages": []}, {}
+                ):
+                    pass
+
+            stream_task = asyncio.create_task(consume_stream())
+            for _ in range(40):
+                if active >= 1:
+                    break
+                await asyncio.sleep(0.01)
+            assert active == 1
+
+            # The non-streaming request should block on the same semaphore
+            nonstream_task = asyncio.create_task(
+                vendor.send_message(
+                    {"model": "claude-opus-4-6", "messages": []},
+                    {},
+                )
+            )
+            await asyncio.sleep(0.05)
+            assert not nonstream_task.done(), "non-stream call must wait for the slot"
+
+            # Once released, both can finish
+            gate.set()
+            await asyncio.gather(stream_task, nonstream_task)
diff --git a/uv.lock b/uv.lock
index 79995a3..d04ad46 100644
--- a/uv.lock
+++ b/uv.lock
@@ -74,7 +74,7 @@ wheels = [
 
 [[package]]
 name = "coding-proxy"
-version = "0.4.0"
+version = "0.5.0"
 source = { editable = "." }
 dependencies = [
     { name = "aiosqlite" },