Commit fd6f133

feat: unified install integrity — .mcpp_ok marker + auto-cleanup (#73)
* feat: unified install integrity - .mcpp_ok marker + auto-cleanup

  Unified mechanism for detecting and recovering from interrupted installs
  (Ctrl+C, network failure, kill -9). Applies to all package types:
  toolchains, bootstrap tools, and modular libraries.

  src/fallback/install_integrity.cppm:
  - is_install_complete(): check .mcpp_ok marker or backward-compat heuristic
  - mark_install_complete(): write .mcpp_ok after verified install
  - clean_incomplete_install(): remove directory if incomplete
  - clean_all_incomplete(): scan xpkgs/ and clean all residue

  resolve_xpkg_path() now:
  1. Check complete (marker or heuristic) → use
  2. Clean incomplete residue → install → mark complete
  3. Install failed → clean residue → copy fallback → mark complete
  4. All failed → clear error with hint

  mcpp self init now scans and cleans incomplete xpkgs.
  Bootstrap tools (patchelf/ninja) get .mcpp_ok after ensure.

* fix: check_base_init warns instead of blocking, fix Windows build

* fix: address 4 review issues on install integrity

  1. is_install_complete: handle mcpplibs nested layout (single subdir
     with src/ or mcpp.toml), prevents false deletion of old packages
  2. copy_from_global: check return value and verify completeness before
     marking .mcpp_ok; clean partial copies on failure
  3. Restore !inst error propagation in resolve_xpkg_path - don't mask
     xlings launch/protocol errors behind generic "payload missing"
  4. Per-command bootstrap gating: check_base_init is deferred to
     get_cfg(requireBootstrap=true) in build/run/toolchain-install, not
     run globally in load_or_init. Light commands (self env, toolchain
     list) skip the check.

* fix: address round-2 review - 4 issues

  1. toolchain install now calls check_base_init() before proceeding,
     failing early if patchelf/ninja bootstrap is incomplete
  2. Preserve original xlings install error (exit code + message) in
     final error instead of masking with generic "payload missing"
  3. One-time legacy migration: migrate_legacy_installs() scans xpkgs and
     writes .mcpp_ok markers for old complete packages on first run.
     is_install_complete() still has legacy fallback for un-migrated
     packages but migration ensures it's rarely needed.
  4. Clean trailing whitespace in .agents/docs/*.md

* fix: .mcpp_ok only written after verified binary exists

  Round-3 review fixes:
  1. Bootstrap marker: mark_install_complete() only called after
     verifying the actual binary (bin/patchelf, bin/ninja) exists, not
     just after ensure_*() returns (which may have failed).
  2. Remove automatic migrate_legacy_installs() from load_or_init().
     Heuristic-based marker writing could stamp half-extracted packages
     as complete. Legacy heuristic remains in is_install_complete() as
     read-only fallback (won't delete old packages), but .mcpp_ok is only
     written on explicit success paths or via `mcpp self init`.

* fix: strict marker semantics for cleanup, remove dead migration code

  Round-4 review fixes:
  1. clean_incomplete_install() now uses marker-only check:
     - Has .mcpp_ok → keep (verified install)
     - No marker but looks_complete_legacy → keep (pre-upgrade package)
     - No marker, no legacy content → clean (genuinely incomplete)
     This prevents half-extracted packages that happen to have bin/lib
     from escaping cleanup.
  2. Remove migrate_legacy_installs() - it was dead code (declared but
     never called). The legacy fallback in is_install_complete() handles
     old packages read-only without writing markers.

* fix: strict marker semantics for cleanup, remove dead migration code

  Round-4 review fix: clean_incomplete_install() now uses STRICT
  marker-only semantics. Used on the resolve/install path for the CURRENT
  target - absence of .mcpp_ok unambiguously means the install attempt
  was incomplete. A half-extracted dir with bin/ would otherwise escape
  cleanup and corrupt subsequent installs.

  clean_all_incomplete() (global scan via `mcpp self init`) keeps the
  legacy-aware behavior: packages without marker but with legacy content
  dirs are preserved for backward compatibility with pre-upgrade installs.

  is_install_complete() retains the legacy fallback for read-only compat
  in resolve_xpkg_path() - old packages are recognized as usable, but this
  doesn't shield them from explicit cleanup on the install path.

* fix: strict marker-only on resolve path, no legacy adoption

  Round-5 review fix: is_install_complete() is now strict marker-only.
  No more legacy heuristic fallback on the resolve/install path.

  Rationale: from directory layout alone we cannot distinguish a
  legacy-complete install (bin/ exists, full) from a half-extracted
  residue (bin/ exists, partial). Adopting the latter silently corrupts
  the user's toolchain. Strict semantics close this gap.

  Cost: upgrade users do a one-time reinstall per toolchain. In practice
  this hits the fast copy_xpkg_from_global() fallback that reuses
  ~/.xlings/, so it's rarely a real download.

  clean_all_incomplete() (mcpp self init) still preserves legacy packages
  (no marker + legacy layout) as user-visible assets - that's a separate
  concern from the resolve path's strictness.

  looks_complete_legacy() is now exported for explicit legacy-aware call
  sites (currently only clean_all_incomplete uses it).

* fix: copy fallback uses legacy heuristic to validate copied content

  Round-6 review fix: After Round-5 made is_install_complete() strict
  marker-only, the copy fallback path broke:

      bool copyOk = copy_xpkg_from_global(verdir);
      if (copyOk && is_install_complete(verdir)) {   // ← always false
          mark_install_complete(...);                // never reached
          return make_payload();
      }
      clean_incomplete_install(verdir);              // ← wipes the copy

  copy_xpkg_from_global() doesn't (and can't) write .mcpp_ok in the
  copied directory, so the marker-only check would always fail, and the
  just-copied package would be immediately wiped, returning "xpkg payload
  missing".

  Fix: validate the copied content via looks_complete_legacy() (the
  structural heuristic) before writing the marker. This is safe in this
  context because:
  1. Step 2 of the resolve chain already cleaned any pre-existing residue
     using strict marker-only semantics - so anything at verdir now MUST
     be the result of our just-completed copy.
  2. copy_xpkg_from_global() only returns true on a clean recursive copy
     (no partial copies reach this branch).
  3. The heuristic validates that the source actually had content (rules
     out copying from an empty/broken global xlings dir).

  This restores the documented "copy_xpkg_from_global is the typical fast
  fallback" behavior that Round-5 unintentionally broke.

* fix: restrict copy fallback to XLINGS_HOME propagation scenario only

  Round-7 review fix: Previously the copy fallback ran after ANY xlings
  install failure (exitCode != 0), copying whatever was in ~/.xlings/ and
  validating with looks_complete_legacy(). That heuristic only checks for
  top-level bin/lib/include/share - a half-extracted residue in the
  GLOBAL xlings directory (which we cannot clean) would pass this check
  and get marked as complete, permanently masking the broken install.

  Fix: split into three branches based on what xlings reported.
  - exitCode == 0 && verdir exists → normal success, mark complete
  - exitCode == 0 && verdir missing → XLINGS_HOME propagation bug; this
    is the ONLY scenario where we trust the global location enough to
    fall through to copy
  - exitCode != 0 → genuine install failure; propagate the original error
    without trying global copy (global may also be residue from the same
    failure, and looks_complete_legacy can't tell them apart)

  Also clarifies the autoInstall=false branch: still allow copy from
  global if the user previously installed via system xlings (no install
  attempt to confuse the state).

* fix: remove autoInstall=false copy fallback (round-8 cleanup)

  The autoInstall=false branch was performing copy_xpkg_from_global
  recovery without a "this session's xlings install reported success"
  witness, falling outside the safety boundary established in round-7.

  Currently no caller passes autoInstall=false, so this is a no-op
  cleanup that removes a future foot-gun: anyone adding such a caller
  would inadvertently re-introduce the "half-extracted residue marked as
  complete" window.

  Semantic: when the caller explicitly disables auto-install, do not
  perform any implicit recovery - return "payload missing" so the caller
  (and the user) sees the truth instead of a silently-recovered
  possibly-broken package.
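
The strict marker-only semantics described in the commit message can be illustrated with a minimal sketch. The function names and the `.mcpp_ok` file name come from the commit message; the signatures, return types, and bodies below are assumptions made for illustration and are not taken from `src/fallback/install_integrity.cppm`.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Marker file name as given in the commit message.
inline constexpr const char* kMarker = ".mcpp_ok";

// Strict marker-only check (assumed signature): a directory counts as
// installed only when the marker file exists, regardless of bin/ or lib/.
bool is_install_complete(const fs::path& dir) {
    std::error_code ec;
    return fs::is_regular_file(dir / kMarker, ec);
}

// Writes the marker. The caller is expected to have verified the payload
// first, for example that bin/ninja exists. Returns false if the write failed.
bool mark_install_complete(const fs::path& dir) {
    std::ofstream out(dir / kMarker, std::ios::trunc);
    out << "ok\n";
    out.flush();
    return static_cast<bool>(out);
}

// Removes the directory when it exists but carries no marker.
// Returns true only if residue was actually removed.
bool clean_incomplete_install(const fs::path& dir) {
    std::error_code ec;
    if (!fs::exists(dir, ec) || is_install_complete(dir)) {
        return false;
    }
    const auto removed = fs::remove_all(dir, ec);
    return !ec && removed > 0;
}
```

In this sketch a directory that has `bin/` but no marker is treated as residue and removed, which is the behavior the round-5 entry argues for on the resolve/install path.
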
1 parent 37e7176 commit fd6f133

13 files changed

Lines changed: 2812 additions & 36 deletions
Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
# resolve_xpkg_path(): analysis of the copy-priority problem

**Date**: 2026-05-22

## 1. Current flow

`resolve_xpkg_path()` (`src/pm/package_fetcher.cppm:580-718`) executes in this order:

```
resolve_xpkg_path(target, autoInstall)
│
├─ resolve()                      ← first call
│   ├─ in the sandbox? → return immediately ✅
│   ├─ not in the sandbox? → check ~/.xlings/
│   │   ├─ in ~/.xlings/? → copy into the sandbox → return ✅
│   │   └─ not in ~/.xlings/? → return error
│   └─ return error
│
├─ resolve() succeeded? → return (install is never triggered)
│
├─ autoInstall=false? → return error
│
├─ install()                      ← reached only when resolve() failed and autoInstall=true
│   └─ xlings interface install_packages
│
└─ resolve()                      ← second call (resolve again after install)
    └─ same logic as above (sandbox → copy → error)
```

## 2. Problem: copy short-circuits install

**Core problem**: whenever `~/.xlings/` contains the package, `resolve()` copies it and returns success,
so the `install()` path is **never reached**.

### Scenario 1: the user previously installed LLVM with the system xlings

```
~/.xlings/data/xpkgs/xim-x-llvm/20.1.7/          ← exists (old version)
~/.mcpp/registry/data/xpkgs/xim-x-llvm/20.1.7/   ← does not exist

resolve():
  sandbox miss → check ~/.xlings/ → hit → copy → return success
  ↑ install is skipped entirely, even if the version in ~/.xlings/ may be broken
```

**Consequences**:
- mcpp gets an old package from the global xlings environment, which may be incompatible with the mcpp sandbox
- The ELF RUNPATH points to `~/.xlings/...` (this is the root cause of the libatomic bug)
- mcpp cannot guarantee that the package it gets was installed with `XLINGS_HOME=~/.mcpp/registry`

### Scenario 2: not in the global directory either, so a fresh install is needed

```
~/.xlings/data/xpkgs/xim-x-llvm/20.1.7/          ← does not exist
~/.mcpp/registry/data/xpkgs/xim-x-llvm/20.1.7/   ← does not exist

resolve():
  sandbox miss → check ~/.xlings/ → also a miss → return error

install():
  xlings interface install_packages → exitCode=0
  but LLVM was not actually installed into the sandbox (xlings bug)
  nor into ~/.xlings/ (the install may be incomplete)

resolve() (second call):
  sandbox miss → ~/.xlings/ also a miss → return "xpkg payload missing"
```

**Consequence**: a fresh install fails completely (this is the situation you ran into)

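Scenario 2 is the "exit code 0 but nothing in the sandbox" case, which the round-7 entry of the commit message separates from a genuine install failure. The following is a minimal sketch of that three-way split. The enum and both function names are hypothetical; only the three branch conditions are taken from the commit message.

```cpp
// Hypothetical classification of what the xlings install step reported.
enum class InstallOutcome {
    Installed,       // exitCode == 0 and the version directory exists
    PayloadMissing,  // exitCode == 0 but the version directory is missing
    Failed           // exitCode != 0: genuine install failure
};

// Maps the two observations (exit code, directory presence) to an outcome.
InstallOutcome classify_install(int exitCode, bool verdirExists) {
    if (exitCode != 0) {
        return InstallOutcome::Failed;
    }
    return verdirExists ? InstallOutcome::Installed
                        : InstallOutcome::PayloadMissing;
}

// Per the round-7 entry, copying from the global directory is attempted only
// when the installer reported success yet the sandbox payload is absent.
bool may_copy_from_global(InstallOutcome outcome) {
    return outcome == InstallOutcome::PayloadMissing;
}
```

Keeping the classification separate from the recovery action makes it harder for a later change to reintroduce the copy after a non-zero exit code.
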
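### Scenario 3: already in the sandbox

```
~/.mcpp/registry/data/xpkgs/xim-x-llvm/20.1.7/   ← exists

resolve():
  sandbox hit → return success immediately
  ↑ no check of version, completeness, or RUNPATH correctness
```

**Consequence**: if a previously copied package is broken (for example, a wrong RUNPATH), it is never repaired automatically
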
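## 3. Problem layers

| | Problem | Severity |
|----|------|--------|
| **Priority inversion** | copy takes priority over install, so the install path is almost never executed | |
| **Untrusted source** | Packages copied from `~/.xlings/` were not built for the mcpp sandbox | |
| **No integrity check** | After the copy, nothing verifies that the package is complete or that its paths are correct | |
| **Unreliable install path** | The xlings NDJSON interface reports success for large packages without actually installing them | |
| **No version/timestamp validation** | Nothing checks whether the package in `~/.xlings/` is newer than the one in the sandbox | |
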
## 4. Ideal execution flow

```
resolve_xpkg_path(target, autoInstall)
│
├─ 1. In the sandbox and complete? → return immediately ✅
│
├─ 2. autoInstall?
│   ├─ yes → install() (install into the sandbox with XLINGS_HOME=sandbox)
│   │   ├─ succeeded and present in the sandbox? → return ✅
│   │   └─ failed → go to fallback
│   └─ no → go to fallback
│
├─ 3. fallback: present in ~/.xlings/?
│   ├─ yes → copy + post-copy fixup → return ✅
│   └─ no → return error
│
└─ 4. Return the result
```

Key change: **install takes priority over copy**. Copy is only a fallback, not the preferred path.

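The ordering in this flow can be sketched with the three steps injected as callbacks, so the priority logic is visible without any real I/O. Every name here (`ResolveSteps`, `Source`, `resolve_sketch`) is hypothetical, and the post-copy fixup is assumed to live inside the `copy_from_global` callback. This follows the flow proposed in this document; the commit message describes later rounds that narrow when the copy fallback may run.

```cpp
#include <functional>
#include <optional>

// Hypothetical hooks standing in for the real sandbox check, install, and copy.
struct ResolveSteps {
    std::function<bool()> sandbox_complete;      // is the package present and complete?
    std::function<bool()> install_into_sandbox;  // install with XLINGS_HOME pointing at the sandbox
    std::function<bool()> copy_from_global;      // copy from ~/.xlings/ plus post-copy fixup
};

// Which step ended up providing the package.
enum class Source { Sandbox, Install, CopyFallback };

// Install takes priority over copy; copy is tried only as a fallback.
// Returns std::nullopt when every step failed ("payload missing").
std::optional<Source> resolve_sketch(const ResolveSteps& steps, bool autoInstall) {
    if (steps.sandbox_complete()) {
        return Source::Sandbox;
    }
    if (autoInstall && steps.install_into_sandbox() && steps.sandbox_complete()) {
        return Source::Install;
    }
    if (steps.copy_from_global() && steps.sandbox_complete()) {
        return Source::CopyFallback;
    }
    return std::nullopt;
}
```

Re-checking `sandbox_complete()` after each step means a step that reports success without producing a payload (Scenario 2) cannot be mistaken for a successful resolve.
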
## 5. Fix options

### Option A: swap the priority of install and copy

Move the copy workaround in `resolve()` to after `install()`:

```
resolve_xpkg_path(target, autoInstall):
  1. check sandbox → return if exists
  2. if autoInstall → install via xlings
  3. check sandbox again → return if exists
  4. FALLBACK: copy from ~/.xlings/ (workaround)
  5. check sandbox again → return if exists
  6. error: payload missing
```

**Pros**: the install path runs first; copy is only the last resort
**Cons**: if install is slow or fails, the user experience gets worse (previously the copy was near-instant)

### Option B: install first + copy fallback + timeout

```
resolve_xpkg_path(target, autoInstall):
  1. check sandbox → return if exists
  2. if autoInstall → try install (with timeout)
  3. check sandbox → return if exists
  4. copy from ~/.xlings/ if available
  5. post-copy fixup (patchelf RUNPATH)
  6. return or error
```

**Pros**: balances speed (fast fallback when install fails) and correctness
**Cons**: the timeout logic adds complexity

### Option C: install first + direct install invocation (not NDJSON)

Earlier debugging found that the NDJSON interface path is unreliable for installing large packages. `install_with_progress()`
already has a "direct invocation" fallback (`std::system("xlings install ... -y")`).

Switch toolchain installation to `install_with_progress()` (direct-invocation mode) instead of
`install()` (NDJSON interface mode):

```
resolve_xpkg_path(target, autoInstall):
  1. check sandbox → return if exists
  2. if autoInstall → install_with_progress (direct mode)
  3. check sandbox → return if exists
  4. copy from ~/.xlings/ as fallback
  5. return or error
```

**Pros**:
- Fixes the unreliability of the NDJSON interface when installing large packages
- When install works correctly, the package lands directly in the sandbox and no copy is needed
- copy acts as a last resort only when install has genuinely failed

**Cons**: `install_with_progress` has to be introduced at the package_fetcher layer

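### Option D: keep copy first but add a post-copy fixup (current state)

The approach in the current PR #67: keep copy first, but correct RUNPATH during the toolchain post-install step.

**Pros**: smallest change, already implemented
**Cons**:
- copy still takes priority over install, so the install path is almost never tested
- Depends on `~/.xlings/` holding a correct package (on a fresh machine without `~/.xlings/` it fails completely)
- Every toolchain needs its own fixup
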
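## 6. Recommendations

**Short term (done)**: Option D, with the post-copy fixup as a safety net

**Mid term (recommended)**: Option C, install first + direct-invocation mode
- Change the step order in `resolve_xpkg_path()`
- Toolchain installation uses `install_with_progress()` (direct invocation)
- copy is demoted to a fallback
- This is the most pragmatic option: it addresses both the priority inversion and the NDJSON unreliability

**Long term**: Option C + a unified RUNPATH fixup after the copy fallback
- Move the patchelf fixup out of each toolchain's post-install step and handle it in one place at the copy exit point
- Toolchains added in the future can no longer miss it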
