Dov/antithesis poc upstream#36512
Draft
DAlperin wants to merge 62 commits into
754deec to d4373eb
…older .env

mzbuild's _build_locked runs `git clean -ffdX <image_path>` before each build, which wipes any gitignored file in the build context — including the .env we generate. Two fixes:
1. publish:false on antithesis-config so the standard ci.test.build flow skips it entirely on regular nightly builds (where .env never exists). Only build-antithesis.sh / push-antithesis.py builds this image, and they write .env first.
2. Commit a placeholder .env so the file is tracked (survives git clean) and participates in mzbuild's fingerprint computation. build-antithesis.sh overwrites it with real registry refs before the build runs; the fingerprint reflects the overwritten content per build.
Add 16 Antithesis properties for Kafka source ingestion (NONE + UPSERT
envelopes) to the scratchbook, plus the workload-side implementation of
upsert-key-reflects-latest-value.
Scratchbook additions:
- sut-analysis Appendix A: kafka source pipeline detail
- existing-assertions: enumerated SUT-side panic/assert sites that are
candidates for Antithesis SDK instrumentation
- property-catalog Category 7: 16 new Kafka/UPSERT properties
- property-relationships clusters 7-10 plus cross-cluster connections
- 16 per-property evidence files
- evaluation/synthesis.md: four-lens review
Workload:
- parallel_driver_upsert_latest_value.py: produces upserts+tombstones
with deterministic randomness, requests a quiet period, polls
mz_source_statistics for catchup, and asserts per-key value match
(two always() assertions + one sometimes() liveness anchor).
- helper_pg / helper_kafka / helper_quiet / helper_random /
helper_source_stats / helper_upsert_source: shared utilities for
subsequent Kafka source properties.
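The per-key assert phase described above could be sketched as follows. The `always`/`sometimes` import is the Antithesis Python SDK's assertions module; the fallback stub, the `assert_latest_values` helper, and the expected/observed dict shapes are illustrative, not the driver's actual code:

```python
# Sketch only: the SDK import is real (antithesis-sdk-python); everything
# else here is an assumed shape, not the driver's actual implementation.
try:
    from antithesis.assertions import always, sometimes
except ImportError:  # running outside Antithesis: degrade to plain asserts
    def always(condition, message, details):
        assert condition, f"{message}: {details}"

    def sometimes(condition, message, details):
        pass  # liveness anchors are a no-op without the SDK


def assert_latest_values(expected: dict, observed: dict) -> None:
    """expected: key -> last value produced (None means tombstoned)."""
    for key, value in expected.items():
        if value is None:
            # A tombstone means the key must be absent after catchup.
            always(key not in observed,
                   "upsert: tombstoned key absent after catchup",
                   {"key": key})
        else:
            always(observed.get(key) == value,
                   "upsert: key reflects latest produced value",
                   {"key": key, "expected": value, "got": observed.get(key)})
    # Liveness anchor: we reached the assert phase with data at least once.
    sometimes(len(observed) > 0, "upsert: observed rows after catchup", {})
```

The `sometimes()` anchor is what distinguishes "property held" from "property never exercised" in the Antithesis triage report.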
…o-data-duplication
… state-rehydrates-correctly
…config} + transitive deps
…atalog properties
…zable-reads workload driver
… catalog-recovery-consistency workload driver
…imeouts; remove dead upsert.rs (classic) antithesis asserts
Wraps the SELECT in a session that has SET real_time_recency = TRUE. Under strict-serializable, this pushes the chosen-ts lower bound to the source's real-time upstream frontier, so the SELECT waits for ingestion to reach the broker/upstream high-water mark before responding.
The existing 'wait_for_catchup' on mz_source_statistics.offset_committed is insufficient as a queryability gate: offset_committed tracks the data-shard upper, which can advance past oracle_read_ts via the source's reclock while the corresponding rows live at an mz_ts further forward (assigned by the next-probe binding). The strict-serializable SELECT then picks a chosen-ts between the two and returns count=0.
Used by drivers that produce-then-assert against kafka/mysql sources. MV-over-table drivers don't need this; tables have no upstream to probe and the table writer's commit already advances the timestamp oracle.
Apply real_time_recency=True to the SELECTs that follow wait_for_catchup or
the equivalent in:
- parallel_driver_upsert_latest_value.py
- singleton_driver_upsert_state_rehydration.py
- parallel_driver_kafka_none_envelope.py
- parallel_driver_mysql_cdc.py
These drivers all produce upstream (kafka/mysql), wait for a catchup signal,
then SELECT and assert. The current catchup signal (offset_committed in
mz_source_statistics, or a COUNT-based poll) clears before the just-ingested
rows are visible at the strict-serializable read timestamp the SELECT picks:
* offset_committed reflects the data-shard upper reclocked to upstream
offsets. It can advance past oracle_read_ts via the source's reclock
binding while the corresponding rows live at an mz_ts further forward
(assigned by the next-probe binding).
* COUNT-based polling only requires a single chosen-ts where the count
matches; the immediately-following per-row SELECT picks oracle_read_ts
afresh and can race.
real_time_recency forces the SELECT's chosen-ts lower bound to the source's
real-time upstream frontier, so the SELECT waits for ingestion to reach the
broker's/replica's current high-water mark before responding. See the
docstring on helper_pg.query_retry for the full reasoning.
Not applied to parallel_driver_mv_reflects_table_updates: tables have no
upstream to probe (RTR no-ops), and the existing count-based poll on the MV
is already queryability-based.
Not applied to parallel_driver_strict_serializable_reads: it already opens
fresh connections with explicit SET REAL_TIME_RECENCY TO TRUE.
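The session shape the four drivers adopt could be sketched like this. The cursor protocol is psycopg-style (what the workload helpers use); the table name and the fake-free structure here are illustrative:

```python
# Sketch: run the produce-then-assert SELECT under real_time_recency.
# The session variable is Materialize's; the table name is illustrative.
RTR_SESSION_SETUP = [
    # Raise the strict-serializable chosen-ts lower bound to the source's
    # real-time upstream frontier before the assert SELECT runs.
    "SET real_time_recency = TRUE",
]


def assert_select(cur, table: str, key: str):
    """Execute the assert SELECT on a cursor with RTR enabled first."""
    for stmt in RTR_SESSION_SETUP:
        cur.execute(stmt)
    # This SELECT now blocks until ingestion reaches the broker's current
    # high-water mark, closing the offset_committed / chosen-ts race.
    cur.execute(f"SELECT value FROM {table} WHERE key = %s", (key,))
    return cur.fetchone()
```

The point of routing every assert SELECT through one helper is that the RTR setup cannot be forgotten on a new driver.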
…tes concurrent races

Multiple parallel-driver invocations race the deterministic object-name pool in _create_database_for_antithesis (role0..roleN, cluster-0..cluster-N, etc.). Setup statements run without IF NOT EXISTS / IF EXISTS guards in many places, and there is no IF EXISTS form for DROP CLUSTER or DROP ROLE — so the loser of any given race sees:
* cluster 'cluster-0' already exists
* unknown role 'role0'
* unknown cluster 'cluster-0'
* role "role0" cannot be dropped because some objects depend on it
These are the same concurrent-DDL outcomes the parallel_workload framework already tolerates inside the worker loop via Action.errors_to_ignore at DDL complexity. The setup phase had no equivalent tolerance, so any of these escaped as a setup_failure and the always-zero-exit assertion fired:
always(unexpected is None, "parallel workload: no unexpected SQL errors …")
Add _tolerate_setup_race, which catches QueryError or Exception with any of the expected race substrings and proceeds. Wrap every setup statement, including db.drop/db.create, the cluster/role enumerate-and-drop loops, the DROP/CREATE CONNECTION + SECRET statements, and the per-relation create loop. The pattern list mirrors action.Action.errors_to_ignore for the DDL tier.
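A minimal sketch of the tolerance wrapper described above. The pattern list is abbreviated (the real one mirrors action.Action.errors_to_ignore) and the helper name is taken from the commit message:

```python
# Sketch of _tolerate_setup_race: swallow the expected loser-of-the-race
# DDL errors during setup; anything else still escapes as a real failure.
from contextlib import contextmanager

# Abbreviated; the real list mirrors Action.errors_to_ignore at DDL tier.
SETUP_RACE_PATTERNS = (
    "already exists",
    "unknown role",
    "unknown cluster",
    "cannot be dropped because some objects depend on it",
)


@contextmanager
def tolerate_setup_race():
    try:
        yield
    except Exception as e:
        if any(p in str(e) for p in SETUP_RACE_PATTERNS):
            return  # expected concurrent-DDL outcome; proceed with setup
        raise  # genuine failure: let the always() assertion see it


# usage sketch:
#   with tolerate_setup_race():
#       cur.execute("CREATE CLUSTER ...")
```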
…nt-state view

The Sometimes assertions:
* fault recovery: observed antithesis_cluster replica non-online at least once
* kafka source resumes: observed antithesis_cluster replica non-online
both rely on `_replica_non_online()` returning True at least once across all invocations in a run. The previous implementation queried `mz_cluster_replica_statuses` (DISTINCT ON (replica_id, process_id) over the underlying history shard), which shows only the latest tick per process. With a 0.5s probe cadence and a 30s invocation budget, an Antithesis fault that takes a clusterd offline-then-back-online within a sub-second window slips between two consecutive polls — the SDK never sees a non-online status, the Sometimes assertion never fires, and we get a 0-pass / N-fail finding even though the fault recipe is correctly hitting the cluster.
Switch to `mz_internal.mz_cluster_replica_status_history` and filter on `h.status = 'offline'`. This is the underlying audit log; any past offline event remains visible from any later poll within the retention window, so we record the fault even if the transition fully completed before the next probe. Same change in both drivers (the helper was duplicated).
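The switched probe could look like the sketch below. The catalog relation and status filter are the ones named above; the join shape and cluster name are an assumed reconstruction, not the drivers' exact query:

```python
# Sketch of the history-based probe. mz_internal.mz_cluster_replica_status_history
# is the audit log named in the commit message; the join columns are assumed.
OFFLINE_EVER_SQL = """
    SELECT count(*)
    FROM mz_internal.mz_cluster_replica_status_history h
    JOIN mz_cluster_replicas r ON r.id = h.replica_id
    JOIN mz_clusters c ON c.id = r.cluster_id
    WHERE c.name = 'antithesis_cluster'
      AND h.status = 'offline'
"""


def replica_non_online(cur) -> bool:
    # Audit-log semantics: a past offline event stays visible from any later
    # poll within retention, so a sub-second offline->online flap between
    # two probes is still recorded.
    cur.execute(OFFLINE_EVER_SQL)
    (n,) = cur.fetchone()
    return n > 0
```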
…al clusterds
Three coupled additions to make parallel_workload safe to run as multiple
concurrent invocations sharing one Materialize instance, with each
invocation's cluster routed to a dedicated external clusterd container.
1. Database(seed_scoped_names: bool = False). When True, forwards a
'name_scope' string to every Role and Cluster the framework creates,
producing 'cluster-<seed>-<id>' and 'role-<seed>-<id>' rather than
'cluster-<id>' / 'role<id>'. Schemas / tables / views / sources etc.
don't need this — their fully-qualified names already flow through
DB.name() which already embeds the seed. Default False so non-
Antithesis consumers keep their existing name shapes.
Role.__str__ now passes through pg8000.native.identifier() so the
quoted-dashed names round-trip correctly; no-op for ASCII names.
2. Database(pool_members: list[ClusterdPoolMember] | None = None) and
a new ClusterdPoolMember dataclass (host + storagectl/computectl/
compute/storage ports + workers). When set, the framework provisions
unmanaged Cluster replicas with explicit STORAGECTL/STORAGE/
COMPUTECTL/COMPUTE ADDRESSES pointed at the supplied member(s)
instead of emitting managed SIZE/REPLICATION FACTOR. The is_pool_backed
property on Cluster gates the rendering.
3. CreateClusterAction / CreateClusterReplicaAction / DropClusterReplicaAction
skip pool-backed clusters: there is no in-band allocator for grabbing
additional pool members from a worker thread, and replication factor
manipulation has no analogue in unmanaged-replica mode. The framework
therefore only ever touches the pool members the caller pre-allocated.
These three pieces only make sense together: seed-scoping by itself
doesn't isolate the clusterd workload; the pool backend by itself collides
on global names; skipping the dynamic DDL by itself would just leave
clusters un-grown in a managed-cluster topology where that's the workload.
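The pool-member dataclass and the unmanaged CREATE CLUSTER rendering described above might look like this sketch. The field names follow the commit message; the port defaults and replica naming are illustrative assumptions:

```python
# Sketch of ClusterdPoolMember and the unmanaged-replica rendering.
# Port defaults are illustrative, not necessarily the compose topology's.
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterdPoolMember:
    host: str
    storagectl_port: int = 2100
    computectl_port: int = 2101
    compute_port: int = 2102
    storage_port: int = 2103
    workers: int = 4


def render_unmanaged_cluster(name: str, members: list) -> str:
    """Render explicit ADDRESSES replicas instead of managed SIZE/RF."""
    replicas = []
    for i, m in enumerate(members):
        replicas.append(
            f"replica{i + 1} ("
            f"STORAGECTL ADDRESSES ['{m.host}:{m.storagectl_port}'], "
            f"STORAGE ADDRESSES ['{m.host}:{m.storage_port}'], "
            f"COMPUTECTL ADDRESSES ['{m.host}:{m.computectl_port}'], "
            f"COMPUTE ADDRESSES ['{m.host}:{m.compute_port}'], "
            f"WORKERS {m.workers})"
        )
    return f'CREATE CLUSTER "{name}" REPLICAS ({", ".join(replicas)})'
```

One replica per pool member keeps the fault domain one-container-per-replica, which is what lets Antithesis fault a single replica in isolation.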
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reserve a pool of pre-existing clusterd containers
(`clusterd-pool-{0..N-1}`) so each parallel-workload cluster can land on
its own container and Antithesis can fault-inject it in isolation.
Without the pool, parallel-workload's clusters all live as child
processes of environmentd under the materialized container's process
orchestrator, and Antithesis can only kill / pause / partition the
container as a unit.
Pool size is read from `ANTITHESIS_CLUSTERD_POOL_SIZE` (env), default 8.
Each member is identical to clusterd1 / clusterd2: 4 timely workers,
no scratch, restart=no. `workflow_default` brings them up before
materialized so the controller can reach them when CREATE CLUSTER
references their addresses.
`Materialized` already has `unsafe_enable_unorchestrated_cluster_replicas`
set so CREATE CLUSTER ... STORAGECTL ADDRESSES is accepted.
config/docker-compose.yaml regenerated via
bin/pyactivate test/antithesis/export-compose.py
to match — the YAML is generated, not hand-edited.
… clusterd
Wires the antithesis parallel-workload driver to:
* Claim one clusterd-pool-{i} container per invocation via a real
allocator (fcntl.flock on /tmp/clusterd-pool-slots/{i}.lock; lock
held for the lifetime of the invocation, released on normal return
or exception). Slots are tried in randomized order so the slot a
driver lands on doesn't correlate with the invocation seed. If
every slot is held the driver tags a sometimes() and exits cleanly
rather than running unisolated.
* Construct ClusterdPoolMember(host='clusterd-pool-<slot>', workers=4)
and pass to Database(pool_members=[member], seed_scoped_names=True).
The initial cluster lands on its own clusterd container — Antithesis
fault injection targets that container in isolation, which is the
point of the whole change.
* Scope setup-phase catalog sweeps to objects this invocation owns:
'cluster-<seed>-%' and 'role-<seed>-%'. The previous 'c%' / 'r%'
patterns would have torn down every concurrent invocation's
still-running state. The shared connections (kafka_conn, csr_conn,
aws_conn, minio) live outside any seed-scoped database; we never
drop them (CREATE ... IF NOT EXISTS is idempotent and dropping
would CASCADE through another invocation's in-flight sources).
* Drop seed-scoped clusters / databases / roles in main()'s finally
so each invocation leaves the catalog clean and frees its pool-
slot's clusterd. The DROP CLUSTER on an unmanaged cluster
re-arms the clusterd to accept a fresh controller connection via
the same reconcile() path that handles environmentd restarts
(storage_state::reconcile drops stale objects, transport::serve
cancels the prior connection on the next connect).
Pool size is read from CLUSTERD_POOL_SIZE (env), matching the
ANTITHESIS_CLUSTERD_POOL_SIZE knob in test/antithesis/mzcompose.py.
Default 8.
v1 scope (documented for the next round of work):
* MAX_INITIAL_CLUSTERS = 1 per invocation, REPLICATION FACTOR = 1.
Multi-replica coverage stays in antithesis_cluster.
* CreateClusterAction / CreateClusterReplicaAction /
DropClusterReplicaAction are skipped in pool mode; no in-band
allocator inside the framework yet.
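The slot allocator described above could be sketched as follows, assuming the /tmp/clusterd-pool-slots layout and the CLUSTERD_POOL_SIZE knob from this commit (error handling trimmed):

```python
# Sketch of the fcntl.flock slot allocator. Lock-file layout and env knob
# follow the commit message; everything else is an assumed shape.
import fcntl
import os
import random
from contextlib import contextmanager

POOL_SIZE = int(os.environ.get("CLUSTERD_POOL_SIZE", "8"))
LOCK_DIR = "/tmp/clusterd-pool-slots"


@contextmanager
def claim_pool_slot(rng: random.Random, lock_dir: str = LOCK_DIR):
    """Yield a claimed slot index, or None if every slot is held."""
    os.makedirs(lock_dir, exist_ok=True)
    slots = list(range(POOL_SIZE))
    rng.shuffle(slots)  # decorrelate the landed-on slot from the seed
    for slot in slots:
        f = open(os.path.join(lock_dir, f"{slot}.lock"), "w")
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            f.close()
            continue  # held by a concurrent invocation; try the next slot
        try:
            yield slot  # lock held for the invocation's lifetime
            return
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # released on return or exception
            f.close()
    yield None  # all slots held: caller tags sometimes() and exits cleanly
```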
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-workload

Documents the pool-backed parallel-workload topology: pool of clusterd containers, file-lock slot allocator, seed-scoped naming, drop-on-exit, and the reconcile-path correctness argument for clusterd reuse across DROP/CREATE CLUSTER cycles. Lists current failure modes (all-slots-held, crash-before-cleanup, sizing) and v1 limitations the next round of work will close.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…not N single-replica clusters

The driver now claims, best-effort, up to PW_DESIRED_REPLICAS (default 2) clusterd-pool slots per invocation; the framework consumes the whole pool_members list into a single unmanaged cluster with one replica per member instead of one cluster per member. This gives multi-replica fault coverage to parallel-workload (previously only antithesis_cluster ran multi-replica) when pool capacity allows, and degrades gracefully to a single-replica cluster under contention.
The driver helper is renamed from _claim_pool_slot (yields one slot or None) to _claim_pool_slots (yields a list of 0..desired slots) so the contention fallback is just a shorter list rather than a special case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…el-workload setup and worker phases
Three fault-injection shapes were escaping the parallel-workload driver as
'unexpected' SQL errors and tripping the always() assertion:
1. SETUP: "Failed to resolve hostname" during CREATE CONNECTION FOR
KAFKA — materialized's broker-validation can't reach `kafka`
because Antithesis paused the kafka container or partitioned DNS.
2. SETUP: "connection timeout expired" / "server closed the
connection unexpectedly" from the driver's psycopg.connect to
`materialized:6875` — Antithesis paused the materialized container
during setup.
3. WORKER: "thread pw-worker-N exited early" with no captured cause
— `Worker.run`'s initial psycopg.connect / websocket / SET
statements run outside any try/except, so a fault that lands during
worker startup kills the thread without populating
`occurred_exception`. The driver's worker_failed payload then
reports just the thread-exit-early string with no underlying cause.
None of these are SUT correctness issues — they're the expected cost of
fault injection landing at inconvenient moments.
The fix adds a second tolerance category, _SETUP_FAULT_PATTERNS, alongside
the existing _SETUP_RACE_PATTERNS, and:
* Inside _tolerate_setup_race: swallow per-statement exceptions whose
message matches either pattern set.
* Around the whole setup phase: if setup_failure matches, demote out of
the always() assertion and into a sometimes() for visibility.
* Around worker thread death: if occurred_exception is None
(fault-killed startup) or its message matches a fault pattern,
demote out of the always() assertion and into a sometimes(). A
worker that captures a non-fault exception still fails the run as
before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous design (CREATE+DROP CLUSTER per parallel-workload
invocation, targeting a pool clusterd) hit a hard clusterd halt on the
second invocation against the same slot:
WARN: halting process: new instance configuration not compatible with
existing instance configuration: ... index_logs:
IntrospectionSourceIndex(...876897) vs Some(IntrospectionSourceIndex(...876641))
InstanceConfig::compatible_with compares LoggingConfig, which includes
the per-cluster IntrospectionSourceIndex GlobalIds — those are freshly
allocated by every CREATE CLUSTER. Pointing a different cluster
identity at a clusterd that already saw a prior cluster's introspection
indexes trips this check. reconcile() handles environmentd restarts
against the same cluster identity, but not cluster-identity changes.
Pivot to permanent pool clusters: one long-lived pool_cluster_<i>
per clusterd-pool-<i>, bootstrapped by the workload-entrypoint once
at compose-up and never dropped. Each parallel-workload invocation
picks a slot at random and runs against the pool cluster for that
slot. The cluster identity stays stable per slot, so the only
reconnect events the pool clusterds see are environmentd restarts and
Antithesis-injected pauses of the pool clusterd itself — the path
reconcile is designed for.
Concurrent invocations may share a pool cluster: every workload object
is in a seed-scoped database (db-pw-<seed>-*) with seed-scoped roles,
so DDL/DML never collides across invocations. No coordination
required — no fcntl.flock, no slot-claim contextmanager, no "no slot
available, exit cleanly" fallback. Antithesis still faults containers,
not invocations, so the per-container fault domain is preserved; two
invocations witnessing the same fault simply give us more independent
reproductions per failure.
Framework changes:
* Cluster.pre_existing_name: when set, create()/drop() are no-ops,
name() returns the literal, is_pool_backed flips True for action
skips.
* Database.existing_cluster_name: replaces pool_members on the
Antithesis path; wraps one pre-existing cluster.
* ClusterReplica loses its pool_member field (no longer used).
* action.CreateClusterAction checks existing_cluster_name (was
pool_members).
Driver changes:
* Slot pick is rng.randrange(CLUSTERD_POOL_SIZE). No coordination.
* _drop_seed_scoped_objects stops dropping clusters. The seed-scoped
database drop cascades through every workload-created object on
the cluster, returning the bound clusterd to an idle baseline.
* Setup-phase pre-create sweep no longer touches mz_clusters.
mzcompose / entrypoint changes:
* Workload service env now exports ANTITHESIS_CLUSTERD_POOL_SIZE +
CLUSTERD_POOL_SIZE so the bootstrap script and the driver agree
on the slot count.
* workload-entrypoint.sh loops POOL_SIZE times and CREATEs each
pool_cluster_<i> on its clusterd-pool-<i> (idempotent across
compose-up).
Multi-replica parallel-workload coverage is gone in this iteration —
each pool cluster has one replica. Multi-replica coverage stays in
antithesis_cluster; a future revision could pair pool clusterds into
2-replica pool clusters at the cost of doubling the pool footprint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…perty

The sibling property upsert-key-reflects-latest-value tests only freshly-written keys within one invocation. This new property tests the complementary case: a key that has been resident in the upsert source for a long time (many invocations, many faults, possibly many clusterd restarts) must still accept a fresh write and have that write reflected in the source. The bug class this catches is a long-resident-state rehydration regression where the upsert operator's state-store remembers a key's value with enough fidelity to serve reads but enough wrongness that fresh writes are silently dropped — the user's pipeline appears stuck with no error.
Implementation: parallel_driver_upsert_ancient_key_writable.py owns a dedicated key ring (ancient-k<0..31>) so it never collides with the sibling driver's per-invocation keys. Each invocation picks 5 ring slots at random, snapshots their current values, produces fresh 'cross-<prefix>-<nonce>' values, waits for catchup, and asserts that each key's reflected value changed (or, for first-touch ring slots, that a row now exists).
The 'always' assertion is race-tolerant against concurrent invocations of this driver writing to the same ring slot — the only forbidden outcome is 'row still has the exact old value we tried to overwrite, with no peer interference,' which means our write was silently lost. A separate 'sometimes' clause records when our specific new value reached the source as the win-the-race liveness signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers per clusterd: 4 -> 16. Single-process clusterds at workers=16
exercise the same intra-process concurrency surface as a 4-process
scale=4,workers=4 production deployment, giving us realistic per-shard
parallelism, scheduler contention, and Antithesis-thread-pause-fault
depth.
Pool size: 8 -> 2. The no-lock allocator already tolerates
oversubscription (concurrent invocations may share a pool cluster
because every workload object lives in a seed-scoped database), so a
smaller pool isn't a correctness concern. A pool of 2 keeps the
topology closer to production replica counts and makes each pool
cluster behave more like a busy production cluster.
Workers count is now plumbed through end-to-end:
* mzcompose.py declares CLUSTERD_WORKERS=16 and uses it for every
Clusterd(...) service AND exports it in the Workload service env.
* workload-entrypoint.sh reads CLUSTERD_WORKERS and templates it into
every CREATE CLUSTER REPLICAS' WORKERS clause (antithesis_cluster
plus each pool_cluster_<i>). The controller reads WORKERS from this
clause, not from clusterd's runtime config, so the two must stay
in lockstep.
Total worker thread count goes from 4*8 + 4*2 = 40 (old: 8 pool + 2
antithesis) to 16*2 + 16*2 = 64 (new: 2 pool + 2 antithesis). Modest
memory increase, big throughput / parallelism gain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the non-transactional DML axis of the MySQL CDC source.
Materialize's MySQL source code path is engine-agnostic (binlog ROW
events look the same for MyISAM and InnoDB), so this exists to assert
the engine-agnostic contract holds in practice when the upstream is
MyISAM:
* BEGIN/COMMIT around MyISAM statements is silently ignored — each
statement commits immediately with its own GTID.
* No rollback semantics: a statement killed mid-write leaves whatever
rows it managed to insert committed.
* Table-level locking instead of row-level.
A MyISAM-specific regression would surface as the new driver's
'mysql myisam: CDC source row has correct value after catchup' firing
false while the existing InnoDB-backed driver looks healthy.
What landed:
* first_mysql_replica_setup.py creates antithesis.cdc_test_myisam
(ENGINE=MyISAM) alongside the existing cdc_test (ENGINE=InnoDB) on
the primary, and waits for both to replicate to the replica.
* helper_mysql_source.py exposes MYSQL_TABLE_MYISAM /
TABLE_NAME_MYISAM constants and an ensure_mysql_cdc_myisam_table()
helper. ensure_mysql_cdc_source() now creates both subsources off
the single mysql_cdc_source SOURCE.
* parallel_driver_mysql_myisam.py mirrors the InnoDB sibling's shape
against the MyISAM subsource with the 'myi-p<u64hex>' batch prefix
so the two drivers don't interfere.
* Property doc scratchbook/properties/mysql-myisam-cdc-no-data-loss.md
captures the bug class, MyISAM-specific binlog semantics, and the
assertion list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dge network

Two changes the Antithesis platform needs from the docker-compose YAML, applied together at export time so they stay in lockstep:
1. Set `container_name` and `hostname` on every service (matching the service key). Per Antithesis docker best practices, triage reports attribute log lines and assertions by hostname; without an explicit hostname the platform infers one (possibly the container id) that's harder to recognize. The workload container is the highest-value case, but the rule is uniform.
2. Define a named bridge network (`antithesis-net`) at the top level and put every service on it. Relying on docker-compose's auto-generated `default` network left DNS resolution up to whatever the surrounding Antithesis orchestration decided; an earlier run on this stack failed with kafka unable to resolve `zookeeper` (UnknownHostException) during setup. Antithesis support pointed at the network shape as the likely cause and suggested declaring it explicitly. We do not set `internal: true` per Antithesis docker best practices — that would cut us off from the Antithesis-side instrumentation network.
Both transforms live in export-compose.py so they apply uniformly to every present and future service. Sanity-check that no service key contains an underscore (RFC-1123); all current keys already use hyphens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
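The export-time transforms could be sketched over the compose dict like this. The network name and hostname rule come from the commit message; the function name and dict shapes are illustrative:

```python
# Sketch of the export-compose transforms: explicit hostnames plus a named
# bridge network. Operates on the parsed compose YAML as a plain dict.
import re


def apply_antithesis_transforms(compose: dict) -> dict:
    services = compose.setdefault("services", {})
    for name, svc in services.items():
        # RFC-1123 sanity check: service keys double as hostnames, so no
        # underscores (all current keys already use hyphens).
        assert re.fullmatch(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?", name), name
        svc["container_name"] = name
        svc["hostname"] = name  # triage reports attribute by hostname
        nets = svc.setdefault("networks", [])
        if "antithesis-net" not in nets:
            nets.append("antithesis-net")
    # Named bridge network instead of the auto-generated `default`;
    # `internal: true` is deliberately omitted so the Antithesis-side
    # instrumentation network stays reachable.
    compose.setdefault("networks", {})["antithesis-net"] = {"driver": "bridge"}
    return compose
```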
95d6209 to 8060eb2
…available
Data-driven export-compose transform: for every depends_on entry that
uses `condition: service_started` against a dependency that declares a
`healthcheck`, upgrade the condition to `service_healthy`. Dependencies
without a healthcheck (currently only clusterd) are left as
`service_started` since there's nothing to wait on.
Under the Antithesis platform, `service_started` proved unreliable as a
readiness gate during initial container startup. Docker fires it as
soon as the dependency's container process starts, before the
dependency's DNS entry is reliably resolvable. The previous run on the
fault-isolated topology saw kafka hit
`java.net.UnknownHostException: zookeeper: Name or service not known`
148+ times in a row before its retry loop landed on a successful
lookup, with the same cascade downstream (schema-registry ↔ kafka).
Both containers exited with code 1 from those retries, tripping the
"No unexpected container exits" property.
Upgraded edges:
kafka -> zookeeper (zookeeper:2181 nc healthcheck)
schema-registry -> kafka (kafka:9092 nc healthcheck)
materialized -> minio (minio /minio/health/live curl)
workload -> schema-registry (schema-registry curl healthcheck)
Left alone:
workload -> clusterd{1,2} (no clusterd healthcheck)
Gating on the healthcheck (which probes the actual listen port)
eliminates the DNS-race shape because docker won't fire
`service_healthy` until the dependency is answering on its port —
and DNS is reliably resolvable by then.
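The data-driven upgrade described above might be implemented roughly as follows; the compose dict shapes assume the long-form `depends_on` mapping, and the function name is illustrative:

```python
# Sketch of the depends_on upgrade: service_started -> service_healthy
# wherever the dependency declares a healthcheck; otherwise left alone.
def upgrade_depends_on(compose: dict) -> dict:
    services = compose.get("services", {})
    for svc in services.values():
        deps = svc.get("depends_on")
        if not isinstance(deps, dict):
            continue  # short-form list has no conditions to upgrade
        for dep_name, dep_cfg in deps.items():
            if (dep_cfg.get("condition") == "service_started"
                    and "healthcheck" in services.get(dep_name, {})):
                # The healthcheck probes the real listen port, so
                # service_healthy implies DNS and port readiness.
                dep_cfg["condition"] = "service_healthy"
    return compose
```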
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Antithesis feedback noted that parallel_driver_parallel_workload pulls one u64 from the SDK, seeds a stdlib `random.Random`, and then makes every downstream decision deterministically off that seed — locking the fuzzer out of all branches in the framework's action/expression subtree.
Add `AntithesisRandom`, a `random.Random` subclass that overrides `getrandbits()` and `random()` to draw from the Antithesis SDK on every call. Plug it into `parallel_driver_parallel_workload` so action selection, DDL choices, expression shape, sample sizes, and every other in-framework `self.rng.*` call route through the SDK per draw. Each worker thread gets its own instance.
Also add `random_float(low, high)` in helper_random — needed by the follow-up commit that swarms `TOMBSTONE_PROB`/`DROP_PROBABILITY` across invocations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
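A sketch of the subclass and the helper. `antithesis.random.get_random` is the SDK's u64 draw; the fallback stub and the exact bit-assembly in `getrandbits` are assumptions so the sketch runs anywhere:

```python
# Sketch of AntithesisRandom: every draw routes through the SDK so the
# fuzzer can steer each branch, not just the initial seed.
import random
import secrets

try:
    from antithesis.random import get_random  # one fresh u64 per call
except ImportError:
    def get_random() -> int:  # local fallback so the sketch is runnable
        return secrets.randbits(64)


class AntithesisRandom(random.Random):
    """random.Random whose primitive draws come from the Antithesis SDK."""

    def getrandbits(self, k: int) -> int:
        # Assemble k bits from as many u64 draws as needed. Methods like
        # choice()/randrange() route through getrandbits via _randbelow.
        bits, have = 0, 0
        while have < k:
            bits = (bits << 64) | get_random()
            have += 64
        return bits >> (have - k)

    def random(self) -> float:
        return self.getrandbits(53) / (1 << 53)


def random_float(low: float, high: float) -> float:
    """Per-invocation parameter draw used to swarm workload knobs."""
    return low + (high - low) * AntithesisRandom().random()
```

Because `random.Random` derives its higher-level methods from `getrandbits()` and `random()`, overriding just those two is enough to put every in-framework `self.rng.*` call under SDK control.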
Antithesis feedback called out hardcoded probability constants
(TOMBSTONE_PROB, DROP_PROBABILITY) as missed swarm-testing opportunities
— every timeline ran the exact same workload mix instead of letting the
fuzzer drive the parameter.
Replace the three hardcoded constants with per-invocation draws from
helper_random.random_float() over sensible ranges:
parallel_driver_upsert_latest_value:
TOMBSTONE_PROB 0.15 -> random_float(0.05, 0.50)
singleton_driver_upsert_state_rehydration:
TOMBSTONE_PROB 0.20 -> random_float(0.05, 0.50) (fixed per run
so cross-cycle stability of `expected` still tests rehydration)
singleton_driver_catalog_recovery_consistency:
DROP_PROBABILITY 0.20 -> random_float(0.10, 0.50)
The draw happens once at the top of main() and is logged for triage.
Each timeline ends up with a different mix; the fuzzer is free to push
toward whichever extreme reveals a bug.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ator
Antithesis feedback: with every parallel driver requesting its own
`ANTITHESIS_STOP_FAULTS` window, the union of overlapping per-driver
quiet periods leaves the SUT mostly un-faulted. Faults should arrive
on a single coordinated cadence driven from a dedicated container, and
workloads should stay robust to whatever quiet/faulting transitions
the orchestrator picks — the catchup-then-assert pattern already there
fits that model.
Topology change: add a `fault-orchestrator` service backed by `bash:5`
running `test/antithesis/fault-orchestrator/pause_faults.sh`, adapted
from the Antithesis hands-on tutorial. It alternates
faults-OFF/faults-ON windows at randomised intervals
(START_DELAY=30, MIN_ON/MAX_ON/MIN_OFF/MAX_OFF=20-40), centralising the
cadence. Outside Antithesis (`ANTITHESIS_STOP_FAULTS` unset) the script
no-ops so snouty local validate still works.
The script is loaded via `Path(...).read_text()` inside
`FaultOrchestrator(Service)`; every `$` is doubled to `$$` before
embedding into the compose YAML so docker-compose's parse-time
variable interpolation doesn't eat shell references like `${RANDOM}`
or `${ANTITHESIS_STOP_FAULTS}`. The on-disk .sh file stays plain bash
so shellcheck and direct execution still work.
Driver-side: delete `helper_quiet.py` and every `request_quiet_period`
call site (9 drivers). Each driver's `wait_for_catchup` timeout (or
the equivalent FINAL_READ_TIMEOUT_S in the strict-serializable driver)
is bumped to span at least one MAX_OFF window plus catchup overhead
— concretely 90s for the short Kafka/MV/upsert drivers, 120s for
MySQL CDC paths, and 180s for the singleton rehydration driver which
must survive a clusterd kill landing inside a quiet window. Liveness
`sometimes(...)` anchor messages were renamed
"after quiet period" → "within catchup budget" to match the new
semantics; scratchbook docs that quoted the exact strings are updated
to match.
Regenerate test/antithesis/config/docker-compose.yaml via
export-compose.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or windows
The global fault-orchestrator alternates faults-ON/OFF windows of up
to MAX_ON / MAX_OFF seconds each (defaults 40s, set in the
FaultOrchestrator service). With the previous timeouts a single
connect attempt or producer flush could expire entirely inside one
40s faults-ON window — fast-failing before the orchestrator opened
the next quiet window and burning retry budget on TCP timeouts.
helper_pg:
CONNECT_TIMEOUT_S 15 -> 30 (renamed from _CONNECT_TIMEOUT_S so
the parallel-workload driver can
reuse it instead of hardcoding 15)
_RETRY_BUDGET_S 120 -> 180 (spans one full ON+OFF cycle + margin)
helper_mysql: same logic — same values. Replication adds primary→replica
hops so the budgets match helper_pg's. `wait_for_host`'s 5s probe stays
short: it runs at bootstrap before fault injection begins.
helper_kafka: new explicit librdkafka producer config —
`request.timeout.ms=60000`, `delivery.timeout.ms=180000`. New module-
level `ADMIN_TIMEOUT_S=90` for `admin.list_topics` and `create_topics`
result waits; new `FLUSH_TIMEOUT_S=90` exported for drivers so
`producer.flush(timeout=...)` waits past a single MAX_ON window before
declaring pending messages "skipping assertions" material.
Per-driver direct psycopg.connect in parallel_driver_parallel_workload
(3 sites) now use `CONNECT_TIMEOUT_S` instead of literal 15. The four
Kafka-source drivers' `producer.flush(timeout=30)` calls now use
`FLUSH_TIMEOUT_S` from helper_kafka.
Probe timeouts are intentionally kept short — they exist to *measure*
unavailability, not wait through it:
anytime_fault_recovery_exercised.PROBE_CONNECT_TIMEOUT_S = 2.0
singleton_driver_catalog_recovery_consistency.PROBE_CONNECT_TIMEOUT_S = 2.0
parallel_driver_strict_serializable_reads.PROBE_CONNECT_TIMEOUT_S = 5
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ually runs

The FaultOrchestrator service was wired up with `entrypoint: ["bash", "-s"]` and the script body passed as `command`. But `bash -s` reads commands from stdin — and there is no stdin in a detached docker container, so bash exited immediately with no output and the script string was silently used as `$0`.
Net effect: the orchestrator container started, exited cleanly, and ANTITHESIS_STOP_FAULTS was never called. Antithesis fault injection ran unconstrained for the entire run, with no quiet windows ever opening. Every driver that needed more than one connection (the four Kafka drivers do CREATE CONNECTION + admin metadata fetch + CREATE SOURCE; the two MySQL drivers do primary writes + MZ reads; the parallel-workload driver does multi-step setup; the strict-serializable driver opens many fresh psycopg connects) effectively starved. Only `parallel_driver_mv_reflects_table_updates.py` ever reached its "driver done" log line: it does one batched INSERT and then polls materialize on a single retried connection, so a brief calm in the faults occasionally let it through.
Fix: use `bash -c <script>`, which actually runs the script string. Verified locally that the round-tripped YAML script body executes under bash -c, including the no-op exit when ANTITHESIS_STOP_FAULTS is unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Debugging the recent fault-orchestrator no-op required staring at logs
without a way to tell which of N concurrent driver invocations a given
line came from, and the helpers logged so little that "stuck at
ensure_kafka_connection" looked the same as "stuck at the CREATE SOURCE
broker validation" or "stuck on the kafka admin metadata fetch."
Two changes:
1. helper_logging.py mints a short hex `INVOCATION_ID` once per process
and stamps it into every log record via the root formatter
(`[inv=<id>]`). Every driver swaps its bare `logging.basicConfig`
block for `helper_logging.setup_logging(name)`, so grepping
`inv=a3f9b1c2` isolates one invocation's records across helpers,
threads, and subprocesses.
2. Helpers grow start/done lifecycle logs with elapsed timings at every
meaningful checkpoint:
helper_pg.connect start, established/giving up + elapsed
helper_pg.execute_retry per-statement, with truncated SQL summary
helper_pg.query_retry same shape, plus row count on success
helper_pg.execute_internal same shape
helper_mysql._open per-host start/established/giving up
helper_kafka.make_producer config snapshot at construction
helper_kafka.ensure_topic probing / list_topics done / creating / created
helper_upsert_source per-phase markers around ensure_kafka_connection
and ensure_upsert_text_source
helper_none_source same
helper_table_mv per-phase markers
helper_source_stats wait_for_catchup start, progress, done/timeout
Retry attempts get attempt count + per-attempt elapsed + budget used,
so "still trying" vs. "burning the budget" is visible in one line.
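A hypothetical shape for that retry logging — the `execute_retry` name, budget
default, and log fields here are illustrative, not the helpers' actual code:

```python
import logging
import time


def execute_retry(run, budget_s: float = 60.0, base_delay_s: float = 1.0):
    """Retry `run` until success or budget exhaustion, logging attempt
    count, per-attempt elapsed time, and budget used on every line."""
    log = logging.getLogger("helper_pg")
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        t0 = time.monotonic()
        try:
            result = run()
            log.info(
                "attempt %d ok in %.2fs (%.2fs of %.0fs budget used)",
                attempt, time.monotonic() - t0, time.monotonic() - start, budget_s,
            )
            return result
        except Exception as exc:
            used = time.monotonic() - start
            if used >= budget_s:
                # "burning the budget": final line mirrors the giving-up shape.
                log.warning(
                    "giving up after %d attempts (%.2fs total): %s",
                    attempt, used, exc,
                )
                raise
            # "still trying": one line per attempt with budget context.
            log.info(
                "attempt %d failed in %.2fs (%.2fs/%.0fs budget): %s",
                attempt, time.monotonic() - t0, used, budget_s, exc,
            )
            time.sleep(base_delay_s)
```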
`bin/pyactivate -m ruff check --fix` cleaned up the unused `import
logging` statements that fell out of the basicConfig removal.
LOG_LEVEL env var (default INFO) lets a triage user re-run with
`LOG_LEVEL=DEBUG` to surface the per-statement SQL summaries the helpers
emit at DEBUG without flooding INFO.
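As a sketch of the pattern (helper_logging's real internals may differ; the
formatter fields here are illustrative):

```python
import logging
import os
import uuid

# One short hex id minted per process; every record carries it so one
# invocation's lines can be grepped out of interleaved multi-driver output.
INVOCATION_ID = uuid.uuid4().hex[:8]


def setup_logging(name: str) -> logging.Logger:
    """Replace each driver's bare basicConfig with a shared, tagged config.

    LOG_LEVEL (default INFO) lets a triage re-run surface DEBUG records
    without flooding normal runs.
    """
    level = os.environ.get("LOG_LEVEL", "INFO").upper()
    logging.basicConfig(
        level=level,
        format=(
            "%(asctime)s %(levelname)s "
            f"[inv={INVOCATION_ID}] "
            "%(name)s: %(message)s"
        ),
    )
    return logging.getLogger(name)
```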
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er validation
A driver invocation failed after one attempt with:
WARNING pg execute: giving up after 1 attempts (17.88s total) on
CREATE SOURCE IF NOT EXISTS upsert_text_src IN CLUSTER antithesis_cluster
FROM KAFKA CONNECTION antithesis_kafka_conn ...:
Meta data fetch error: BrokerTransportFailure (Local: Broker transport failure)
Materialize validates `CREATE SOURCE ... FROM KAFKA CONNECTION` by doing a
broker metadata fetch as part of the planning path. Under the global
fault-orchestrator's faults-ON windows, that fetch fails — and materialize
surfaces the failure as a plain `psycopg.errors.InternalError`, *not*
`OperationalError`. Our `_retryable` predicate only recognised
OperationalError + InterfaceError, so the driver burned through one
attempt in 17s and aborted instead of waiting for the next quiet window.
Widen `_retryable` to also accept `InternalError` whose message matches
one of a small set of patterns identifying transient external
dependencies:
* librdkafka surfaces: BrokerTransportFailure, Meta data fetch error,
"Local: All broker connections are down",
"Local: Timed out"
* schema-registry HTTP: "schema registry", Connection refused/reset
* DNS partitioned: Failed to resolve hostname, no route to host,
Temporary failure in name resolution
* postgres/mysql source upstream:
could not translate host name,
could not connect to server
Patterns are intentionally narrow so a real catalog/schema error
("relation does not exist", "syntax error at or near...") still
propagates after one attempt rather than spinning silently.
Applies to execute_retry / query_retry / execute_internal_retry via the
shared predicate; `create_source_idempotent` keeps its separate
already-exists-tolerance path on top.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Almost no workloads have been finishing on recent builds. Dov noted that
Antithesis's deterministic hypervisor runs the whole fleet on a single core,
which makes the workers-per-clusterd bump from 4 to 16 (commit 86d1fbb,
"bump clusterd workers to 16 and shrink pool to 2") look suspicious:

* Total Timely workers across the four clusterd containers went from 40 to 64.
* Per-process worker count went from 4 to 16 — sixteen work-stealing threads
  sharing one core would burn most wakeups on context-switch overhead and
  starve dependent steps.

Revert just `CLUSTERD_WORKERS = 16 -> 4` (keeping the pool size at 2; pool
size by itself shouldn't cause this). Regenerate the compose yaml; every
CREATE CLUSTER REPLICAS WORKERS clause now matches the clusterd `workers=`
argument again.

If a build at this commit shows workloads finishing again, the workers bump
is the regression and we go back to 4. If they still don't finish, the
regression is elsewhere (rpqlkrmq's container_name + hostname +
named-bridge-network is the next candidate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>