Skip to content

[DO NOT MERGE] Temporary CAS updates and fixes. #2301

Open
amankrx wants to merge 22 commits into
TraceMachina:mainfrom
amankrx:temp-cas-bulk-changes
Open

[DO NOT MERGE] Temporary CAS updates and fixes. #2301
amankrx wants to merge 22 commits into
TraceMachina:mainfrom
amankrx:temp-cas-bulk-changes

Conversation

@amankrx
Copy link
Copy Markdown
Collaborator

@amankrx amankrx commented May 5, 2026

Description

This is a combination of PRs that I have added here, such that I can get a docker image out of this and maintain a track of PRs to be merged.
Fixes # (issue)

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please also list any relevant details for your test configuration

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 5, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 5, 2026

/build-image

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:749e1b4

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:749e1b4

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:8266b7d

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:8266b7d

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:4f8e090

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:4f8e090

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:c461440

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:c461440

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 6, 2026

/build-image

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 6, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:065daaa

The previous 2-second PING ceiling was tight enough that a routine
Redis BGSAVE fork would push every RedisStore indicator over the line
simultaneously: on an 11 GB production master under load we observe
fork-induced pauses around 3 seconds, and with three RedisStore
indicators (AC, CAS, scheduler) all PING-ing through the same
connection pool, all three return Failed in lockstep — surfacing as a
503 on /status and a kubelet probe-failure event even though the
Redis service is otherwise healthy.

Verified by capturing /status response bodies during a flap window:

  [{"namespace": ".../SCHEDULER_STORE/RedisStore",
    "status": {"Failed": {"message":
      "RedisStore::check_health: PING exceeded 2 s timeout"}}},
   {"namespace": ".../CAS_REDIS_STORE/RedisStore",
    "status": {"Failed": {"message":
      "RedisStore::check_health: PING exceeded 2 s timeout"}}},
   {"namespace": ".../AC_REDIS_STORE/RedisStore",
    "status": {"Failed": {"message":
      "RedisStore::check_health: PING exceeded 2 s timeout"}}}]

The HealthServer's per-indicator wrapper budget is 5 s
(DEFAULT_HEALTH_CHECK_TIMEOUT_SECONDS), so 4 s leaves a small
safety margin while comfortably absorbing the BGSAVE worst case we
have observed in practice.
@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 7, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 7, 2026

/build-image

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:7a89d6b

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:7a89d6b

amankrx and others added 3 commits May 9, 2026 01:49
Production worker pods were OOMKilled (exit 137) after the
ConnectionManager's `available_connections: usize` counter underflowed
to ~u64::MAX while `waiting_connections` climbed unbounded. The manual
decrement-on-issue / increment-on-Dropped accounting balances on paper,
but a leak path was occasionally missing a `Dropped` delivery during
tonic transport errors and task aborts.

Switch to `Arc<Semaphore>` with `OwnedSemaphorePermit` on the Connection.
RAII makes leakage structurally impossible: every Drop path (panic,
abort, dropped oneshot receiver, transport error) releases the permit
exactly once.

Adds 3 integration tests covering the request/acquire/release cycle, an
aborted-caller-future cleanup scenario, and the MAX_CONCURRENT ceiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`RedisSubscription::Drop` previously dropped the `watch::Receiver`
*before* taking the `subscribed_keys` write lock, then decided whether
to remove the publisher entry based on `receiver_count() == 0`. Two
concurrent drops on subscriptions sharing a publisher (e.g. multiple
`WaitExecution` clients on the same operation_id) could both decrement
their counts before either took the lock, then race for it: the loser
saw the entry already removed and emitted a spurious "Key … was not
found in subscribed keys" error. Worse, if a fresh `subscribe(same_key)`
interleaved between the two drops, the second drop could remove the
freshly-inserted publisher and silently strand its subscribers.

Acquire the write lock *first*, evaluate "count == 1 with my receiver
still alive", remove the entry under the lock if so, then drop the
receiver. The lock now serialises both the count read and the map
mutation, closing both race windows. Demote the absence log from
`error!` to `warn!`: with the fix, that path now indicates a genuine
unexpected mutation outside the lock, not the race noise.

Adds 4 regression tests covering single-drop silence, drop-one-of-two
preserving the publisher, 200-iteration concurrent-drop race, and
resubscribe-after-drop creating a fresh publisher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in upstream changes since the last sync, including:
  - feb6a15 Bound CAS leader-wait + per-blob batch deadline (TraceMachina#2298)
  - 43ab01d Add expiry to completed redis actions (TraceMachina#2315)
  - 6cdcf8e fix RBE CI for hermetic LLVM (TraceMachina#2314)
  - f5846df Migrate to hermetic llvm (TraceMachina#2312)

Conflicts resolved (all kept the local superset where the local change
extended an upstream one):
  - nativelink-store/src/fast_slow_store.rs: kept HEAD's
    `huge_blob_dedup_bypasses` / `fast_store_stale_map_falls_through`
    metrics and the `DEFAULT_BYPASS_DEDUP_THRESHOLD_BYTES` const
    alongside upstream's `LEADER_WAIT_TIMEOUT` / `leader_wait_timeouts`
  - nativelink-store/src/filesystem_store.rs: kept HEAD's path_type=Temp
    bookkeeping inside the ENOENT branch on top of upstream's
    debug-demote of the rename failure
  - nativelink-store/src/redis_store.rs: kept HEAD's 4s PING_TIMEOUT
    and richer doc comment
  - nativelink-store/tests/{fast_slow_store_test,redis_store_test}.rs:
    concatenated both branches' independent test additions; merged
    `use` lists; updated `test_search_by_index_skips_int_from_cursor_read`
    to expect the local FT.AGGREGATE TIMEOUT clause; added
    `bypass_dedup_threshold_bytes: 0` to upstream's new FastSlowSpec
    literal so it satisfies the field added locally.

All 30 test binaries across nativelink-store and nativelink-util pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 9, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 9, 2026

/build-image

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink:b3d5473

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:b3d5473

…tion

`add_action` wrote the `cid_<client_operation_id>` → `operation_id`
pointer with `expiry=None`. PR TraceMachina#2315 ("Add expiry to completed redis
actions") then started attaching `retain_completed_for_s` TTL onto the
matching `aa_*` key on completion. The TTL mismatch produced permanent
orphans: aa_* expired after retain_completed_for_s, cid_* lingered
forever. A subsequent WaitExecution resolving the stale cid_* hit the
orphan path, returned NotFound, and the client (Bazel) restarted
Execute — which created *another* unbounded cid_*. In production we
saw the cid_* count reach 3.8M with ~4.5% already orphaned and
intermittent "lost-action" symptoms in long builds.

The action's full lifetime is unknown at insert time (queue + execute
+ retain), so the exact-correct TTL can't be computed at this site
without preserving the original ClientOperationId on AwaitedAction
(invasive schema change) or adding a reverse-lookup index. A generous
24h fixed ceiling comfortably exceeds the longest production
`client_action_timeout_s + max_action_executing_timeout_s +
retain_completed_for_s` combinations observed (~1500s in customer
configs) and bounds orphan accumulation to a single day's worth of
builds.

Test infra: `dynamic_fake_redis::FakeRedisBackend` previously panicked
on EXPIRE. Add an EXPIRE handler + `expiries` HashMap so tests can
assert TTL attachment without standing up real Redis. The new
`add_action_attaches_ttl_to_cid_mapping` regression test fails on the
pre-fix code path (no EXPIRE recorded) and passes after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 11, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 11, 2026

/build-image

@github-actions
Copy link
Copy Markdown

Image built and pushed!

ghcr.io/TraceMachina/nativelink:c742d9f

@github-actions
Copy link
Copy Markdown

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:c742d9f

…ob_retries

`UpdateOperationType::UpdateWithDisconnect` was handled as an
unconditional `ActionStage::Queued` with no `attempts` increment, so
`max_job_retries` never tripped. When a worker pod OOMKilled (kernel
SIGKILL → gRPC stream drops → liveness check evicts the worker with
`is_disconnect=true`), every in-flight action got re-queued fresh.
The scheduler then re-dispatched the same action set to the next
worker, which OOMed the same way, and the loop continued
indefinitely. Bazel's client-side `--test_timeout` was the only thing
eventually breaking the cycle, with the symptom surfacing to the user
as TIMEOUT (looks like a slow test) or NO STATUS (looks like a stuck
dependency) — neither pointing at the cluster. BuildBarn surfaces
this as a clean failure; NL loops.

Mirror the `UpdateWithError` semantics for disconnects: increment
`attempts`, and once `attempts > max_job_retries` transition to
`Completed` with an `Aborted` error that explicitly names the
worker-disconnect path (operators can grep for "Worker disconnected"
instead of chasing ambiguous Bazel-side TIMEOUT/NO STATUS).

Tradeoff: a transient network blip now counts. With max_job_retries=5
(customer-helm setting) a single blip still has 4 attempts of
headroom, so legitimate blips remain harmless; sustained crash loops
bail after 5 disconnects. This does not address the orthogonal "no
backend-side action deadline" gap — an alive-but-stuck worker still
relies on `max_action_executing_timeout_s`.

Adds `worker_retries_on_disconnect_and_fails_test` mirroring the
existing internal-error retry test; verifies that the second
disconnect (with max=1) surfaces a Completed error whose message
names the disconnect failure mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 11, 2026

/build-image nativelink-worker-init

@amankrx
Copy link
Copy Markdown
Collaborator Author

amankrx commented May 11, 2026

/build-image

@github-actions
Copy link
Copy Markdown

Image built and pushed!

ghcr.io/TraceMachina/nativelink:e06e3e7

@github-actions
Copy link
Copy Markdown

Image built and pushed!

ghcr.io/TraceMachina/nativelink-worker-init:e06e3e7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant