[DO NOT MERGE] Temporary CAS updates and fixes. by amankrx · Pull Request #2301 · TraceMachina/nativelink

amankrx · 2026-05-05T06:11:30Z

Description

This PR combines several pending PRs so that I can build a Docker image from them and keep track of which PRs still need to be merged.

amankrx · 2026-05-05T06:12:01Z

/build-image nativelink-worker-init

amankrx · 2026-05-05T06:12:08Z

/build-image

github-actions · 2026-05-05T06:21:54Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:749e1b4

github-actions · 2026-05-05T06:22:37Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:749e1b4

amankrx · 2026-05-06T00:17:49Z

/build-image nativelink-worker-init

amankrx · 2026-05-06T00:17:56Z

/build-image

github-actions · 2026-05-06T00:27:58Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:8266b7d

github-actions · 2026-05-06T00:28:01Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:8266b7d

amankrx · 2026-05-06T00:33:46Z

/build-image nativelink-worker-init

amankrx · 2026-05-06T00:33:55Z

/build-image

github-actions · 2026-05-06T00:43:24Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:4f8e090

github-actions · 2026-05-06T00:50:37Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:4f8e090

amankrx · 2026-05-06T00:53:05Z

/build-image nativelink-worker-init

amankrx · 2026-05-06T00:53:12Z

/build-image

github-actions · 2026-05-06T01:02:43Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:c461440

github-actions · 2026-05-06T01:03:22Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:c461440

amankrx · 2026-05-06T12:02:51Z

/build-image nativelink-worker-init

amankrx · 2026-05-06T12:03:07Z

/build-image

github-actions · 2026-05-06T12:12:51Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:065daaa

The previous 2-second PING ceiling was tight enough that a routine Redis BGSAVE fork would push every RedisStore indicator over the line simultaneously. On an 11 GB production master under load we observe fork-induced pauses of around 3 seconds. With three RedisStore indicators (AC, CAS, scheduler) all PING-ing through the same connection pool, all three return Failed in lockstep. This surfaces as a 503 on /status and a kubelet probe-failure event even though the Redis service is otherwise healthy.

Verified by capturing /status response bodies during a flap window:

    [{"namespace": ".../SCHEDULER_STORE/RedisStore", "status": {"Failed": {"message": "RedisStore::check_health: PING exceeded 2 s timeout"}}},
     {"namespace": ".../CAS_REDIS_STORE/RedisStore", "status": {"Failed": {"message": "RedisStore::check_health: PING exceeded 2 s timeout"}}},
     {"namespace": ".../AC_REDIS_STORE/RedisStore", "status": {"Failed": {"message": "RedisStore::check_health: PING exceeded 2 s timeout"}}}]

The HealthServer's per-indicator wrapper budget is 5 s (DEFAULT_HEALTH_CHECK_TIMEOUT_SECONDS), so a 4 s PING ceiling leaves a 1 s safety margin while comfortably absorbing the roughly 3 s BGSAVE worst case we have observed in practice.
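The timeout layering described above can be sketched with the standard library alone. This is a minimal illustration, not NativeLink's implementation: `Health`, `check_with_ceiling`, and `ceiling_is_safe` are hypothetical names, and the probe closure merely stands in for a Redis PING.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of one health indicator, loosely modelled on the /status body.
#[derive(Debug, PartialEq)]
pub enum Health {
    Ok,
    Failed(String),
}

/// Runs `probe` on a helper thread and gives up after `ceiling`.
/// `probe` is a stand-in for a Redis PING (an assumption of this sketch).
pub fn check_with_ceiling<F>(probe: F, ceiling: Duration) -> Health
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        probe();
        // The receiver may already have timed out; ignore the send error.
        let _ = tx.send(());
    });
    match rx.recv_timeout(ceiling) {
        Ok(()) => Health::Ok,
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Health::Failed(format!("PING exceeded {} ms timeout", ceiling.as_millis()))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Health::Failed("probe ended without replying".to_string())
        }
    }
}

/// A stall of `pause` passes only if it fits under the inner ceiling, and the
/// inner ceiling must itself fit under the outer wrapper budget.
pub fn ceiling_is_safe(pause: Duration, ceiling: Duration, wrapper: Duration) -> bool {
    pause < ceiling && ceiling < wrapper
}
```

With the figures from the commit message, a 3 s pause fails a 2 s ceiling but passes a 4 s ceiling, which still sits 1 s under the 5 s wrapper budget.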

amankrx · 2026-05-07T14:27:05Z

/build-image nativelink-worker-init

amankrx · 2026-05-07T14:27:10Z

/build-image

github-actions · 2026-05-07T14:38:19Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:7a89d6b

github-actions · 2026-05-07T14:38:25Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:7a89d6b

Production worker pods were OOMKilled (exit 137) after the ConnectionManager's `available_connections: usize` counter underflowed to ~u64::MAX while `waiting_connections` climbed unbounded. The manual decrement-on-issue / increment-on-Dropped accounting balances on paper, but a leak path was occasionally missing a `Dropped` delivery during tonic transport errors and task aborts.

Switch to `Arc<Semaphore>` with `OwnedSemaphorePermit` on the Connection. RAII makes leakage structurally impossible: every Drop path (panic, abort, dropped oneshot receiver, transport error) releases the permit exactly once.

Adds 3 integration tests covering the request/acquire/release cycle, an aborted-caller-future cleanup scenario, and the MAX_CONCURRENT ceiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
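The RAII idea behind this change can be illustrated without tokio. The sketch below is a std-only stand-in, not the ConnectionManager code: `Permits` and `Permit` are hypothetical types that play the roles of `Arc<Semaphore>` and `OwnedSemaphorePermit`, and acquisition is non-blocking to keep the example short.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Fixed pool of permits (hypothetical stand-in for a semaphore).
pub struct Permits {
    available: AtomicUsize,
}

/// Held by a connection; gives its permit back when dropped.
pub struct Permit {
    pool: Arc<Permits>,
}

impl Permits {
    pub fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { available: AtomicUsize::new(max) })
    }

    /// Non-blocking acquire: returns `None` when the pool is exhausted.
    /// `checked_sub` means the counter can never wrap below zero.
    pub fn try_acquire(self: &Arc<Self>) -> Option<Permit> {
        self.available
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .ok()
            .map(|_| Permit { pool: Arc::clone(self) })
    }

    pub fn available(&self) -> usize {
        self.available.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    /// The only place a permit is returned, so every exit path
    /// (normal return, early return, panic unwind) releases exactly once.
    fn drop(&mut self) {
        self.pool.available.fetch_add(1, Ordering::AcqRel);
    }
}
```

Because the release lives in `Drop`, there is no separate "increment on Dropped" message that can go missing; a thread that panics while holding a `Permit` still returns it during unwinding.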

`RedisSubscription::Drop` previously dropped the `watch::Receiver` *before* taking the `subscribed_keys` write lock, then decided whether to remove the publisher entry based on `receiver_count() == 0`. Two concurrent drops on subscriptions sharing a publisher (e.g. multiple `WaitExecution` clients on the same operation_id) could both decrement their counts before either took the lock, then race for it: the loser saw the entry already removed and emitted a spurious "Key … was not found in subscribed keys" error. Worse, if a fresh `subscribe(same_key)` interleaved between the two drops, the second drop could remove the freshly-inserted publisher and silently strand its subscribers.

Acquire the write lock *first*, evaluate "count == 1 with my receiver still alive", remove the entry under the lock if so, then drop the receiver. The lock now serialises both the count read and the map mutation, closing both race windows.

Demote the absence log from `error!` to `warn!`: with the fix, that path now indicates a genuine unexpected mutation outside the lock, not the race noise.

Adds 4 regression tests covering single-drop silence, drop-one-of-two preserving the publisher, 200-iteration concurrent-drop race, and resubscribe-after-drop creating a fresh publisher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
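The lock-first ordering can be sketched with std types. This is an illustration under stated assumptions, not the `RedisSubscription` code: `Registry`, `Publisher`, and `Subscription` are hypothetical, `Arc::strong_count` stands in for `receiver_count()` (so "map entry plus my receiver" is a count of 2), and the receiver is released while the lock is still held, which is a choice made for this sketch.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for the per-key publisher; subscribers hold `Arc` clones of it.
pub struct Publisher;

#[derive(Default)]
pub struct Registry {
    keys: Mutex<HashMap<String, Arc<Publisher>>>,
}

pub struct Subscription {
    key: String,
    receiver: Option<Arc<Publisher>>,
    registry: Arc<Registry>,
}

impl Registry {
    /// Joins the existing publisher for `key`, or creates one, under the lock.
    pub fn subscribe(self: &Arc<Self>, key: &str) -> Subscription {
        let mut keys = self.keys.lock().unwrap();
        let publisher = keys
            .entry(key.to_string())
            .or_insert_with(|| Arc::new(Publisher));
        Subscription {
            key: key.to_string(),
            receiver: Some(Arc::clone(publisher)),
            registry: Arc::clone(self),
        }
    }

    pub fn contains(&self, key: &str) -> bool {
        self.keys.lock().unwrap().contains_key(key)
    }
}

impl Drop for Subscription {
    fn drop(&mut self) {
        // 1. Take the lock before touching the receiver.
        let mut keys = self.registry.keys.lock().unwrap();
        // 2. Count while our receiver is still alive: map entry + ours == 2.
        let last = match (keys.get(&self.key), self.receiver.as_ref()) {
            (Some(entry), Some(mine)) => {
                Arc::ptr_eq(entry, mine) && Arc::strong_count(entry) == 2
            }
            _ => false,
        };
        if last {
            keys.remove(&self.key);
        }
        // 3. Only now release our receiver, still under the lock.
        drop(self.receiver.take());
    }
}
```

The `Arc::ptr_eq` check is what protects a freshly inserted publisher: a subscription only removes the entry it actually belongs to.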

Brings in upstream changes since the last sync, including:

- feb6a15 Bound CAS leader-wait + per-blob batch deadline (TraceMachina#2298)
- 43ab01d Add expiry to completed redis actions (TraceMachina#2315)
- 6cdcf8e fix RBE CI for hermetic LLVM (TraceMachina#2314)
- f5846df Migrate to hermetic llvm (TraceMachina#2312)

Conflicts resolved (all kept the local superset where the local change extended an upstream one):

- nativelink-store/src/fast_slow_store.rs: kept HEAD's `huge_blob_dedup_bypasses` / `fast_store_stale_map_falls_through` metrics and the `DEFAULT_BYPASS_DEDUP_THRESHOLD_BYTES` const alongside upstream's `LEADER_WAIT_TIMEOUT` / `leader_wait_timeouts`
- nativelink-store/src/filesystem_store.rs: kept HEAD's path_type=Temp bookkeeping inside the ENOENT branch on top of upstream's debug-demote of the rename failure
- nativelink-store/src/redis_store.rs: kept HEAD's 4s PING_TIMEOUT and richer doc comment
- nativelink-store/tests/{fast_slow_store_test,redis_store_test}.rs: concatenated both branches' independent test additions; merged `use` lists; updated `test_search_by_index_skips_int_from_cursor_read` to expect the local FT.AGGREGATE TIMEOUT clause; added `bypass_dedup_threshold_bytes: 0` to upstream's new FastSlowSpec literal so it satisfies the field added locally.

All 30 test binaries across nativelink-store and nativelink-util pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

amankrx · 2026-05-09T01:25:26Z

/build-image nativelink-worker-init

amankrx · 2026-05-09T01:25:35Z

/build-image

github-actions · 2026-05-09T01:35:25Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:b3d5473

github-actions · 2026-05-09T01:35:33Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:b3d5473

…tion

`add_action` wrote the `cid_<client_operation_id>` → `operation_id` pointer with `expiry=None`. PR TraceMachina#2315 ("Add expiry to completed redis actions") then started attaching `retain_completed_for_s` TTL onto the matching `aa_*` key on completion. The TTL mismatch produced permanent orphans: aa_* expired after retain_completed_for_s, cid_* lingered forever. A subsequent WaitExecution resolving the stale cid_* hit the orphan path, returned NotFound, and the client (Bazel) restarted Execute, which created *another* unbounded cid_*. In production we saw the cid_* count reach 3.8M with ~4.5% already orphaned and intermittent "lost-action" symptoms in long builds.

The action's full lifetime is unknown at insert time (queue + execute + retain), so the exact-correct TTL can't be computed at this site without preserving the original ClientOperationId on AwaitedAction (invasive schema change) or adding a reverse-lookup index. A generous 24h fixed ceiling comfortably exceeds the longest production `client_action_timeout_s + max_action_executing_timeout_s + retain_completed_for_s` combinations observed (~1500s in customer configs) and bounds orphan accumulation to a single day's worth of builds.

Test infra: `dynamic_fake_redis::FakeRedisBackend` previously panicked on EXPIRE. Add an EXPIRE handler + `expiries` HashMap so tests can assert TTL attachment without standing up real Redis. The new `add_action_attaches_ttl_to_cid_mapping` regression test fails on the pre-fix code path (no EXPIRE recorded) and passes after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
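The TTL attachment and the test-double idea can be sketched as follows. Everything here is hypothetical and only illustrates the shape of the fix: `FakeRedis`, `write_cid_mapping`, and `CID_TTL_S` are invented for the example and are not the `FakeRedisBackend` or `add_action` code.

```rust
use std::collections::HashMap;

/// Fixed ceiling from the description: 24 hours, expressed in seconds.
pub const CID_TTL_S: u64 = 24 * 60 * 60;

/// Tiny in-memory stand-in for a Redis test double that records EXPIRE calls.
#[derive(Default)]
pub struct FakeRedis {
    pub values: HashMap<String, String>,
    pub expiries: HashMap<String, u64>,
}

impl FakeRedis {
    pub fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_string(), value.to_string());
    }

    /// Loosely mirrors EXPIRE: returns false when the key does not exist.
    pub fn expire(&mut self, key: &str, seconds: u64) -> bool {
        if !self.values.contains_key(key) {
            return false;
        }
        self.expiries.insert(key.to_string(), seconds);
        true
    }
}

/// Writes the `cid_*` pointer and bounds its lifetime in the same step,
/// so the pointer can no longer linger forever.
pub fn write_cid_mapping(redis: &mut FakeRedis, client_operation_id: &str, operation_id: &str) {
    let key = format!("cid_{}", client_operation_id);
    redis.set(&key, operation_id);
    redis.expire(&key, CID_TTL_S);
}
```

The ceiling works out to 86,400 s, far above the ~1500 s combined timeout quoted above, and a test can assert on `expiries` directly instead of talking to a real Redis.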

amankrx · 2026-05-11T16:37:25Z

/build-image nativelink-worker-init

amankrx · 2026-05-11T16:37:32Z

/build-image

github-actions · 2026-05-11T16:48:15Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:c742d9f

github-actions · 2026-05-11T16:48:26Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:c742d9f

…ob_retries

`UpdateOperationType::UpdateWithDisconnect` was handled as an unconditional `ActionStage::Queued` with no `attempts` increment, so `max_job_retries` never tripped. When a worker pod OOMKilled (kernel SIGKILL → gRPC stream drops → liveness check evicts the worker with `is_disconnect=true`), every in-flight action got re-queued fresh. The scheduler then re-dispatched the same action set to the next worker, which OOMed the same way, and the loop continued indefinitely. Bazel's client-side `--test_timeout` was the only thing eventually breaking the cycle, with the symptom surfacing to the user as TIMEOUT (looks like a slow test) or NO STATUS (looks like a stuck dependency), neither of which points at the cluster. BuildBarn surfaces this as a clean failure; NL loops.

Mirror the `UpdateWithError` semantics for disconnects: increment `attempts`, and once `attempts > max_job_retries` transition to `Completed` with an `Aborted` error that explicitly names the worker-disconnect path (operators can grep for "Worker disconnected" instead of chasing ambiguous Bazel-side TIMEOUT/NO STATUS).

Tradeoff: a transient network blip now counts. With max_job_retries=5 (customer-helm setting) a single blip still has 4 attempts of headroom, so legitimate blips remain harmless; sustained crash loops bail after 5 disconnects. This does not address the orthogonal "no backend-side action deadline" gap: an alive-but-stuck worker still relies on `max_action_executing_timeout_s`.

Adds `worker_retries_on_disconnect_and_fails_test` mirroring the existing internal-error retry test; verifies that the second disconnect (with max=1) surfaces a Completed error whose message names the disconnect failure mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
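The retry rule described above reduces to a small state transition. The sketch below is a simplified model, not the scheduler code: `Stage`, `Action`, and `on_worker_disconnect` are hypothetical names, and the error string is only an example of a message containing "Worker disconnected".

```rust
/// Simplified action stage; the real scheduler has more states.
#[derive(Debug, PartialEq)]
pub enum Stage {
    Queued,
    Completed { error: String },
}

pub struct Action {
    pub attempts: u32,
    pub stage: Stage,
}

/// Applies one worker-disconnect event using the `attempts > max_job_retries`
/// rule: count the attempt, then either re-queue or finish with an error.
pub fn on_worker_disconnect(action: &mut Action, max_job_retries: u32) {
    action.attempts += 1;
    action.stage = if action.attempts > max_job_retries {
        Stage::Completed {
            error: format!(
                "Aborted: Worker disconnected (attempt {} exceeds max_job_retries {})",
                action.attempts, max_job_retries
            ),
        }
    } else {
        Stage::Queued
    };
}
```

With `max_job_retries = 1`, the first disconnect re-queues the action and the second one completes it with the disconnect error, which matches the scenario the new regression test is described as covering.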

amankrx · 2026-05-11T17:47:31Z

/build-image nativelink-worker-init

amankrx · 2026-05-11T17:47:39Z

/build-image

github-actions · 2026-05-11T17:58:20Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink:e06e3e7

github-actions · 2026-05-11T17:58:20Z

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:e06e3e7

amankrx added 6 commits May 2, 2026 10:24

fast_slow_store: only bound followers' wait, never the leader's populate

a644f64

fast_slow_store: never pass caller's writer into follower closures

35cf8f1

execution_server: pre-validate CAS blobs and return PreconditionFailure

1a236a0

execution_server: detect missing Action proto and surface PreconditionFailure

20b1de9

ft_aggregate: pass explicit TIMEOUT to absorb RediSearch slow scans

55ac2f8

health_utils: run indicator checks in parallel, not serially

749e1b4

health_utils: run indicator checks in parallel, not serially

a424b38

action_messages: surface PreconditionFailure for any missing-CAS-blob worker error

8266b7d

fast_slow_store: fall through to slow on stale fast-tier map entries

4f8e090

dynamic_fake_redis: tolerate the new explicit FT.AGGREGATE TIMEOUT clause

c461440

store_awaited_action_db: retry try_subscribe once on miss to close dedup

065daaa

amankrx temporarily deployed to production May 7, 2026 14:26 — with GitHub Actions Inactive

amankrx and others added 3 commits May 9, 2026 01:49

amankrx temporarily deployed to production May 9, 2026 01:25 — with GitHub Actions Inactive

amankrx temporarily deployed to production May 11, 2026 16:19 — with GitHub Actions Inactive

amankrx temporarily deployed to production May 11, 2026 17:47 — with GitHub Actions Inactive
