
fix(zhipu): switch the 429/529 fallback backoff jitter from Full Jitter to Equal Jitter #263

Merged
ThreeFish-AI merged 1 commit into feature/1.x.x from ThreeFish-AI/zhipu-529-exponential-backoff
Jun 30, 2026

@ThreeFish-AI

Owner

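Background

When cc calls zhipu and a 529 (overloaded) response triggers retries, the delay sequence is non-monotonic (measured: 418.8ms → 1857.7ms → 961.6ms → 3769.7ms; the third delay is shorter than the second), unlike the clean exponential backoff seen for 429. The expectation is that 529 follows the same exponential backoff rule as 429.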

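Root cause

On inspection, 429 and 529 already share a single backoff path in code, ZhipuVendor._compute_retry_delay_from_headers; there are not "two sets of rules". The perceived difference comes from two factors combined:

  1. Asymmetric server response headers: 429 (rate limit) responses usually carry a Retry-After header → they take the deterministic server-guided path (retry_after * 1.1), which looks "cleanly increasing"; 529 (overloaded) responses usually omit that header → they fall into the jittered fallback branch, calculate_delay.
  2. The fallback jitter strategy is Full Jitter (random.uniform(0, ceiling)), which is inherently non-monotonic: each delay is a uniform random value in [0, ceiling]. Each measured value falls inside the corresponding range for attempts 0/1/2/3: random(0,1000)/(0,2000)/(0,4000)/(0,8000).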
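To make the second point concrete, here is a minimal Python sketch of Full Jitter, not the repository's actual calculate_delay. The function name full_jitter_delay, its parameter names, and the 60000 ms cap are assumptions for illustration; the 1000 ms initial delay and multiplier of 2 are inferred from the ranges quoted above.

```python
import random


def full_jitter_delay(attempt, initial_ms=1000.0, multiplier=2.0,
                      max_ms=60000.0, rng=random):
    """Full Jitter: a uniform draw from [0, ceiling] (hypothetical helper)."""
    # max_ms is an assumed cap; the PR does not state the configured value.
    ceiling = min(initial_ms * multiplier ** attempt, max_ms)
    return rng.uniform(0, ceiling)


def is_non_decreasing(values):
    """True when every value is <= the one that follows it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Delays reported in the PR, in milliseconds.
observed = [418.8, 1857.7, 961.6, 3769.7]
# Full Jitter ceilings for attempts 0..3 under the assumed parameters.
ceilings = [min(1000.0 * 2.0 ** n, 60000.0) for n in range(4)]

if __name__ == "__main__":
    for n, (value, ceiling) in enumerate(zip(observed, ceilings)):
        print(f"attempt {n}: {value} ms within [0, {ceiling}]: "
              f"{0 <= value <= ceiling}")
    print("observed sequence non-decreasing:", is_non_decreasing(observed))
```

Because every draw starts at 0, a later attempt can always land below an earlier one, which is the pattern seen in the measured sequence.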

Fix

Change the jitter in calculate_delay from Full Jitter to Equal Jitter (AWS, M. Brooker, "Exponential Backoff And Jitter," 2015):

temp  = min(initial * backoff^attempt, max)
delay = temp/2 + random.uniform(0, temp/2)      # falls in [temp/2, temp]

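Under the Zhipu configuration, the per-attempt delay ranges (in ms) narrow from [0,1000]/[0,2000]/[0,4000]/[0,8000] to [500,1000]/[1000,2000]/[2000,4000]/[4000,8000]. Adjacent ranges touch only at their boundaries, so the delays are almost certainly monotonically non-decreasing. The shared 429/529 path benefits as well; jitter is retained to guard against thundering herd; the retry-after priority chain is unchanged.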
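The formula can be sketched in Python as follows. This is an illustration under stated assumptions, not the merged code: equal_jitter_delay and its parameter names are hypothetical, and the 60000 ms cap is an assumed value since the PR does not quote the configured max.

```python
import random


def equal_jitter_delay(attempt, initial_ms=1000.0, multiplier=2.0,
                       max_ms=60000.0, rng=random):
    """Equal Jitter: half the capped exponential delay is fixed,
    the other half is drawn uniformly (hypothetical helper)."""
    temp = min(initial_ms * multiplier ** attempt, max_ms)
    # Result lies in [temp/2, temp].
    return temp / 2 + rng.uniform(0, temp / 2)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the run repeatable
    delays = [equal_jitter_delay(n, rng=rng) for n in range(4)]
    for n, delay in enumerate(delays):
        print(f"attempt {n}: {delay:.1f} ms")
    print("non-decreasing:", delays == sorted(delays))
```

With a multiplier of 2 and no cap in play, the upper bound of attempt n equals the lower bound of attempt n+1, so a later delay cannot fall below an earlier one.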

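Changes

  • src/coding/proxy/routing/retry.py: core change to calculate_delay (1 line) + module/function docstrings (including a statement of the monotonicity contract boundary)
  • src/coding/proxy/vendors/zhipu.py: 2 docstrings synced (Full Jitter → Equal Jitter)
  • tests/test_retry.py: new file, 6 independent unit tests (filling the previous zero-coverage gap for calculate_delay: exact exponential without jitter, Equal Jitter range bounds, max cap, monotonic non-decreasing, very small initial, reproducibility)
  • tests/test_zhipu.py: adds test_529_equal_jitter_delay_in_expected_band (when a 529 carries no retry-after, the first retry delay falls in [0.5, 1.0]s)
  • CHANGELOG.md / docs/.agents/issue.md: change record and root-cause write-up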

Verification

  • Full uv run pytest: 1608 passed, zero regressions
  • uv run ruff check / format + pre-commit hooks: all passed

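Design trade-offs

  • Why modify calculate_delay directly (option A) rather than add a configurable jitter field (option B): a repository-wide grep confirms that calculate_delay is called only by Zhipu; option B is both YAGNI and would worsen the repository's existing "duplicate RetryConfig dead code" debt (routing/retry.py active vs config/resiliency.py dead code), which has been recorded in issue.md for later cleanup.
  • Honest trade-off: Full Jitter comes out ahead in Brooker's 2015 distributed-throughput benchmark, but this scenario is overload recovery for a single low-concurrency proxy where the goal is explainability (delays in the logs increase predictably), so Equal Jitter fits better.
  • Contract boundary: monotonic non-decreasing behavior depends on backoff_multiplier ≥ 2.0 and on not hitting the max_delay_ms cap (unreachable with the current max_retries=4); this is stated in the docstring.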
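The contract boundary can be checked by comparing adjacent Equal Jitter bands. The sketch below is an illustration, not code from the PR: equal_jitter_band and monotonic_guaranteed are hypothetical helpers, and the cap values passed in the examples are assumed rather than taken from the Zhipu configuration.

```python
def equal_jitter_band(attempt, initial_ms, multiplier, max_ms):
    """Return the (low, high) Equal Jitter range for one attempt."""
    temp = min(initial_ms * multiplier ** attempt, max_ms)
    return (temp / 2, temp)


def monotonic_guaranteed(attempts, initial_ms, multiplier, max_ms):
    """True when each band's upper bound is <= the next band's lower bound,
    so no random draw can produce a shorter delay on a later attempt."""
    bands = [equal_jitter_band(n, initial_ms, multiplier, max_ms)
             for n in range(attempts)]
    return all(high <= next_low
               for (_, high), (next_low, _) in zip(bands, bands[1:]))


if __name__ == "__main__":
    # Multiplier 2.0, cap not reached: bands only touch at their edges.
    print(monotonic_guaranteed(4, 1000.0, 2.0, 60000.0))
    # Multiplier below 2.0: adjacent bands overlap.
    print(monotonic_guaranteed(4, 1000.0, 1.5, 60000.0))
    # Cap reached (assumed 3000 ms): capped attempts share one band.
    print(monotonic_guaranteed(4, 1000.0, 2.0, 3000.0))
```

Outside the guaranteed region the delays are still bounded below by temp/2, so any inversion is limited to the overlap between adjacent bands.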

🤖 Generated with Claude Code

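Fix non-monotonic 529 overload retry delays (measured 418.8→1857.7→961.6→3769.7ms,
not increasing). The root cause is that Full Jitter in calculate_delay
(random.uniform(0, ceiling)) is inherently non-monotonic, and 529 responses usually
lack a Retry-After header and so fall into this fallback branch. After switching to
Equal Jitter (temp/2 + random(0, temp/2)) the ranges are
[500,1000]→[1000,2000]→[2000,4000]→[4000,8000], monotonically non-decreasing; the
shared 429/529 backoff path benefits as well, and retry-after priority is unchanged.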

- Add tests/test_retry.py with standalone unit tests (calculate_delay previously had zero coverage)
- Add test_zhipu.py::test_529_equal_jitter_delay_in_expected_band
- Sync retry.py / zhipu.py docstrings, CHANGELOG, and issue.md

🤖 Generated with [Claude Code](https://github.com/claude), [CodeX](https://openai.com), [Gemini](https://github.com/apps/gemini-code-assist)
Co-Authored-By: Aurelius Huang <threefish.ai@gmail.com>
@ThreeFish-AI ThreeFish-AI merged commit 0fe07be into feature/1.x.x Jun 30, 2026
6 checks passed
@ThreeFish-AI ThreeFish-AI deleted the ThreeFish-AI/zhipu-529-exponential-backoff branch June 30, 2026 11:36