[DO NOT MERGE] Temporary CAS updates and fixes. #2301
Conversation
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
The previous 2-second PING ceiling was tight enough that a routine
Redis BGSAVE fork would push every RedisStore indicator over the line
simultaneously: on an 11 GB production master under load we observe
fork-induced pauses around 3 seconds, and with three RedisStore
indicators (AC, CAS, scheduler) all PING-ing through the same
connection pool, all three return Failed in lockstep — surfacing as a
503 on /status and a kubelet probe-failure event even though the
Redis service is otherwise healthy.
Verified by capturing /status response bodies during a flap window:
[{"namespace": ".../SCHEDULER_STORE/RedisStore",
"status": {"Failed": {"message":
"RedisStore::check_health: PING exceeded 2 s timeout"}}},
{"namespace": ".../CAS_REDIS_STORE/RedisStore",
"status": {"Failed": {"message":
"RedisStore::check_health: PING exceeded 2 s timeout"}}},
{"namespace": ".../AC_REDIS_STORE/RedisStore",
"status": {"Failed": {"message":
"RedisStore::check_health: PING exceeded 2 s timeout"}}}]
The HealthServer's per-indicator wrapper budget is 5 s
(DEFAULT_HEALTH_CHECK_TIMEOUT_SECONDS), so 4 s leaves a small
safety margin while comfortably absorbing the BGSAVE worst case we
have observed in practice.
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
Production worker pods were OOMKilled (exit 137) after the ConnectionManager's `available_connections: usize` counter underflowed to ~u64::MAX while `waiting_connections` climbed unbounded. The manual decrement-on-issue / increment-on-Dropped accounting balances on paper, but a leak path was occasionally missing a `Dropped` delivery during tonic transport errors and task aborts. Switch to `Arc<Semaphore>` with `OwnedSemaphorePermit` on the Connection. RAII makes leakage structurally impossible: every Drop path (panic, abort, dropped oneshot receiver, transport error) releases the permit exactly once. Adds 3 integration tests covering the request/acquire/release cycle, an aborted-caller-future cleanup scenario, and the MAX_CONCURRENT ceiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`RedisSubscription::Drop` previously dropped the `watch::Receiver` *before* taking the `subscribed_keys` write lock, then decided whether to remove the publisher entry based on `receiver_count() == 0`. Two concurrent drops on subscriptions sharing a publisher (e.g. multiple `WaitExecution` clients on the same operation_id) could both decrement their counts before either took the lock, then race for it: the loser saw the entry already removed and emitted a spurious "Key … was not found in subscribed keys" error. Worse, if a fresh `subscribe(same_key)` interleaved between the two drops, the second drop could remove the freshly-inserted publisher and silently strand its subscribers. Acquire the write lock *first*, evaluate "count == 1 with my receiver still alive", remove the entry under the lock if so, then drop the receiver. The lock now serialises both the count read and the map mutation, closing both race windows. Demote the absence log from `error!` to `warn!`: with the fix, that path now indicates a genuine unexpected mutation outside the lock, not the race noise. Adds 4 regression tests covering single-drop silence, drop-one-of-two preserving the publisher, 200-iteration concurrent-drop race, and resubscribe-after-drop creating a fresh publisher. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in upstream changes since the last sync, including: - feb6a15 Bound CAS leader-wait + per-blob batch deadline (TraceMachina#2298) - 43ab01d Add expiry to completed redis actions (TraceMachina#2315) - 6cdcf8e fix RBE CI for hermetic LLVM (TraceMachina#2314) - f5846df Migrate to hermetic llvm (TraceMachina#2312) Conflicts resolved (all kept the local superset where the local change extended an upstream one): - nativelink-store/src/fast_slow_store.rs: kept HEAD's `huge_blob_dedup_bypasses` / `fast_store_stale_map_falls_through` metrics and the `DEFAULT_BYPASS_DEDUP_THRESHOLD_BYTES` const alongside upstream's `LEADER_WAIT_TIMEOUT` / `leader_wait_timeouts` - nativelink-store/src/filesystem_store.rs: kept HEAD's path_type=Temp bookkeeping inside the ENOENT branch on top of upstream's debug-demote of the rename failure - nativelink-store/src/redis_store.rs: kept HEAD's 4s PING_TIMEOUT and richer doc comment - nativelink-store/tests/{fast_slow_store_test,redis_store_test}.rs: concatenated both branches' independent test additions; merged `use` lists; updated `test_search_by_index_skips_int_from_cursor_read` to expect the local FT.AGGREGATE TIMEOUT clause; added `bypass_dedup_threshold_bytes: 0` to upstream's new FastSlowSpec literal so it satisfies the field added locally. All 30 test binaries across nativelink-store and nativelink-util pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
…tion `add_action` wrote the `cid_<client_operation_id>` → `operation_id` pointer with `expiry=None`. PR TraceMachina#2315 ("Add expiry to completed redis actions") then started attaching `retain_completed_for_s` TTL onto the matching `aa_*` key on completion. The TTL mismatch produced permanent orphans: aa_* expired after retain_completed_for_s, cid_* lingered forever. A subsequent WaitExecution resolving the stale cid_* hit the orphan path, returned NotFound, and the client (Bazel) restarted Execute — which created *another* unbounded cid_*. In production we saw the cid_* count reach 3.8M with ~4.5% already orphaned and intermittent "lost-action" symptoms in long builds. The action's full lifetime is unknown at insert time (queue + execute + retain), so the exact-correct TTL can't be computed at this site without preserving the original ClientOperationId on AwaitedAction (invasive schema change) or adding a reverse-lookup index. A generous 24h fixed ceiling comfortably exceeds the longest production `client_action_timeout_s + max_action_executing_timeout_s + retain_completed_for_s` combinations observed (~1500s in customer configs) and bounds orphan accumulation to a single day's worth of builds. Test infra: `dynamic_fake_redis::FakeRedisBackend` previously panicked on EXPIRE. Add an EXPIRE handler + `expiries` HashMap so tests can assert TTL attachment without standing up real Redis. The new `add_action_attaches_ttl_to_cid_mapping` regression test fails on the pre-fix code path (no EXPIRE recorded) and passes after. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
…ob_retries `UpdateOperationType::UpdateWithDisconnect` was handled as an unconditional `ActionStage::Queued` with no `attempts` increment, so `max_job_retries` never tripped. When a worker pod OOMKilled (kernel SIGKILL → gRPC stream drops → liveness check evicts the worker with `is_disconnect=true`), every in-flight action got re-queued fresh. The scheduler then re-dispatched the same action set to the next worker, which OOMed the same way, and the loop continued indefinitely. Bazel's client-side `--test_timeout` was the only thing eventually breaking the cycle, with the symptom surfacing to the user as TIMEOUT (looks like a slow test) or NO STATUS (looks like a stuck dependency) — neither pointing at the cluster. BuildBarn surfaces this as a clean failure; NL loops. Mirror the `UpdateWithError` semantics for disconnects: increment `attempts`, and once `attempts > max_job_retries` transition to `Completed` with an `Aborted` error that explicitly names the worker-disconnect path (operators can grep for "Worker disconnected" instead of chasing ambiguous Bazel-side TIMEOUT/NO STATUS). Tradeoff: a transient network blip now counts. With max_job_retries=5 (customer-helm setting) a single blip still has 4 attempts of headroom, so legitimate blips remain harmless; sustained crash loops bail after 5 disconnects. This does not address the orthogonal "no backend-side action deadline" gap — an alive-but-stuck worker still relies on `max_action_executing_timeout_s`. Adds `worker_retries_on_disconnect_and_fails_test` mirroring the existing internal-error retry test; verifies that the second disconnect (with max=1) surfaces a Completed error whose message names the disconnect failure mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/build-image nativelink-worker-init |
|
/build-image |
|
Image built and pushed! |
|
Image built and pushed! |
Description
This is a combination of PRs that I have added here, such that I can get a docker image out of this and maintain a track of PRs to be merged.
Fixes # (issue)
Type of change
Please delete options that aren't relevant.
not work as expected)
How Has This Been Tested?
Please also list any relevant details for your test configuration
Checklist
bazel test //...passes locallygit amendsee some docsThis change is