
Dov/antithesis poc upstream#36512

Draft
DAlperin wants to merge 62 commits into MaterializeInc:main from DAlperin:dov/antithesis-poc-upstream

Conversation

@DAlperin
Member

Remove these sections if your commit already has a good description!

Motivation

Why does this change exist? Link to a GitHub issue, design doc, Slack
thread, or explain the problem in a sentence or two. A reviewer who has
no context should understand why after reading this section.

If this implements or addresses an existing issue, it's enough to link to that:
Closes
Fixes
etc.

Description

What does this PR actually do? Focus on the approach and any non-obvious
decisions. The diff shows the code --- use this space to explain what the
diff can't tell a reviewer.

Verification

How do you know this change is correct? Describe new or existing automated
tests, or manual steps you took.

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Thank you for your submission! We really appreciate it. Like many source-available projects, we require that you sign our Contributor License Agreement (CLA) before we can accept your contribution.

You can sign the CLA by posting a comment with the message below.


I have read the Contributor License Agreement (CLA) and I hereby sign the CLA.


3 out of 4 committers have signed the CLA.
✅ [DAlperin](https://github.com/DAlperin)
✅ [patrickwwbutler](https://github.com/patrickwwbutler)
✅ [def-](https://github.com/def-)
❌ @mitchwagner-antithesis
You can retrigger this bot by commenting `recheck` in this Pull Request. Posted by the CLA Assistant Lite bot.

@DAlperin DAlperin force-pushed the dov/antithesis-poc-upstream branch from 754deec to d4373eb Compare May 11, 2026 19:21
DAlperin added 13 commits May 11, 2026 16:10
…older .env

mzbuild's _build_locked runs `git clean -ffdX <image_path>` before each
build, which wipes any gitignored file in the build context — including
the .env we generate. Two fixes:

1. publish:false on antithesis-config so the standard ci.test.build flow
   skips it entirely on regular nightly builds (where .env never exists).
   Only build-antithesis.sh / push-antithesis.py builds this image, and
   they write .env first.

2. Commit a placeholder .env so the file is tracked (survives git clean)
   and participates in mzbuild's fingerprint computation. build-antithesis.sh
   overwrites it with real registry refs before the build runs;
   fingerprint reflects the overwritten content per build.
Add 16 Antithesis properties for Kafka source ingestion (NONE + UPSERT
envelopes) to the scratchbook, plus the workload-side implementation of
upsert-key-reflects-latest-value.

Scratchbook additions:
  - sut-analysis Appendix A: kafka source pipeline detail
  - existing-assertions: enumerated SUT-side panic/assert sites that are
    candidates for Antithesis SDK instrumentation
  - property-catalog Category 7: 16 new Kafka/UPSERT properties
  - property-relationships clusters 7-10 plus cross-cluster connections
  - 16 per-property evidence files
  - evaluation/synthesis.md: four-lens review

Workload:
  - parallel_driver_upsert_latest_value.py: produces upserts+tombstones
    with deterministic randomness, requests a quiet period, polls
    mz_source_statistics for catchup, and asserts per-key value match
    (two always() assertions + one sometimes() liveness anchor).
  - helper_pg / helper_kafka / helper_quiet / helper_random /
    helper_source_stats / helper_upsert_source: shared utilities for
    subsequent Kafka source properties.
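The catchup-then-assert shape described above can be sketched as follows. The `always`/`sometimes` stand-ins below are plain placeholders for the Antithesis SDK calls the drivers use; the key/value dict shapes are illustrative, not the driver's exact data model.

```python
# In the drivers these come from the Antithesis SDK
# (from antithesis.assertions import always, sometimes); plain stand-ins
# keep the sketch self-contained.
def always(condition: bool, message: str, details: dict) -> None:
    assert condition, f"{message}: {details}"

def sometimes(condition: bool, message: str, details: dict) -> None:
    pass  # the real SDK records that this point was reached


def assert_upsert_state(expected: dict, observed: dict) -> None:
    """expected maps key -> latest produced value (None = tombstone);
    observed maps key -> value read back from the upsert source."""
    for key, want in expected.items():
        if want is None:
            # a tombstoned key must not be visible after catchup
            always(key not in observed,
                   "upsert: tombstoned key is absent after catchup",
                   {"key": key})
        else:
            always(observed.get(key) == want,
                   "upsert: key reflects latest value after catchup",
                   {"key": key, "want": want, "got": observed.get(key)})
    # liveness anchor: the assertion body actually ran at least once
    sometimes(len(expected) > 0, "upsert: asserted at least one key", {})
```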
… catalog-recovery-consistency workload driver
…imeouts; remove dead upsert.rs (classic) antithesis asserts
def- and others added 19 commits May 13, 2026 15:59
Wraps the SELECT in a session that has SET real_time_recency = TRUE. Under
strict-serializable, this pushes the chosen-ts lower bound to the source's
real-time upstream frontier, so the SELECT waits for ingestion to reach the
broker/upstream high-water mark before responding.

Existing 'wait_for_catchup' on mz_source_statistics.offset_committed is
insufficient as a queryability gate: offset_committed tracks the data-shard
upper, which can advance past oracle_read_ts via the source's reclock while
the corresponding rows live at an mz_ts further forward (assigned by the
next-probe binding). The strict-serializable SELECT then picks a chosen-ts
between the two and returns count=0.

Used by drivers that produce-then-assert against kafka/mysql sources. MV-over-
table drivers don't need this; tables have no upstream to probe and the table
writer's commit already advances the timestamp oracle.
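The wrapper described above can be sketched as below; `cur` is any DB-API cursor (psycopg in the drivers), `real_time_recency` is the real Materialize session variable, and the function name is illustrative rather than the helper's actual name.

```python
def select_with_rtr(cur, query: str, params=()):
    """Run one SELECT under real-time recency (sketch of the wrapper
    described above)."""
    # Under strict serializable, this pushes the SELECT's chosen-ts lower
    # bound to the source's real-time upstream frontier, so the read
    # blocks until ingestion reaches the upstream high-water mark.
    cur.execute("SET real_time_recency = TRUE")
    cur.execute(query, params)
    return cur.fetchall()
```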
Apply real_time_recency=True to the SELECTs that follow wait_for_catchup or
the equivalent in:

  - parallel_driver_upsert_latest_value.py
  - singleton_driver_upsert_state_rehydration.py
  - parallel_driver_kafka_none_envelope.py
  - parallel_driver_mysql_cdc.py

These drivers all produce upstream (kafka/mysql), wait for a catchup signal,
then SELECT and assert. The current catchup signal (offset_committed in
mz_source_statistics, or a COUNT-based poll) clears before the just-ingested
rows are visible at the strict-serializable read timestamp the SELECT picks:

  * offset_committed reflects the data-shard upper reclocked to upstream
    offsets. It can advance past oracle_read_ts via the source's reclock
    binding while the corresponding rows live at an mz_ts further forward
    (assigned by the next-probe binding).
  * COUNT-based polling only requires a single chosen-ts where the count
    matches; the immediately-following per-row SELECT picks oracle_read_ts
    afresh and can race.

real_time_recency forces the SELECT's chosen-ts lower bound to the source's
real-time upstream frontier, so the SELECT waits for ingestion to reach the
broker's/replica's current high-water mark before responding. See the
docstring on helper_pg.query_retry for the full reasoning.

Not applied to parallel_driver_mv_reflects_table_updates: tables have no
upstream to probe (RTR no-ops), and the existing count-based poll on the MV
is already queryability-based.

Not applied to parallel_driver_strict_serializable_reads: it already opens
fresh connections with explicit SET REAL_TIME_RECENCY TO TRUE.
…tes concurrent races

Multiple parallel-driver invocations race the deterministic object-name pool
in _create_database_for_antithesis (role0..roleN, cluster-0..cluster-N, etc).
Setup statements run without IF NOT EXISTS / IF EXISTS guards in many places,
and there is no IF EXISTS form for DROP CLUSTER or DROP ROLE — so the loser
of any given race sees:

  * cluster 'cluster-0' already exists
  * unknown role 'role0'
  * unknown cluster 'cluster-0'
  * role "role0" cannot be dropped because some objects depend on it

These are the same concurrent-DDL outcomes the parallel_workload framework
already tolerates inside the worker loop via Action.errors_to_ignore at
DDL complexity. The setup phase had no equivalent tolerance, so any of these
escaped as a setup_failure and the always-zero-exit assertion fired:
  always(unexpected is None, "parallel workload: no unexpected SQL errors …")

Add _tolerate_setup_race that catches QueryError or Exception with any of the
expected race substrings and proceeds. Wrap every setup statement, including
db.drop/db.create, the cluster/role enumerate-and-drop loops, the
DROP/CREATE CONNECTION + SECRET statements, and the per-relation create loop.

The pattern list mirrors action.Action.errors_to_ignore for the DDL tier.
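A minimal sketch of that tolerance wrapper, assuming a context-manager shape (the driver's actual pattern list is longer; the substrings below are the four race outcomes enumerated above):

```python
from contextlib import contextmanager

# Substrings of the expected concurrent-DDL loser errors; mirrors
# action.Action.errors_to_ignore at the DDL tier (abridged).
_SETUP_RACE_PATTERNS = (
    "already exists",
    "unknown role",
    "unknown cluster",
    "cannot be dropped because some objects depend on it",
)


@contextmanager
def tolerate_setup_race():
    """Swallow the loser's error when concurrent invocations race a
    setup statement; re-raise anything that isn't a known race shape."""
    try:
        yield
    except Exception as e:
        if any(pat in str(e) for pat in _SETUP_RACE_PATTERNS):
            return  # another invocation won this race; proceed
        raise
```

Each setup statement then runs as `with tolerate_setup_race(): cur.execute(...)`.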
…nt-state view

The Sometimes assertions:
  * fault recovery: observed antithesis_cluster replica non-online at least once
  * kafka source resumes: observed antithesis_cluster replica non-online

both rely on `_replica_non_online()` returning True at least once across all
invocations in a run. The previous implementation queried
`mz_cluster_replica_statuses` (DISTINCT ON (replica_id, process_id) over the
underlying history shard), which shows only the latest tick per process. With
a 0.5s probe cadence and a 30s invocation budget, an Antithesis fault that
takes a clusterd offline-then-back-online within a sub-second window slips
between two consecutive polls — the SDK never sees a non-online status, the
Sometimes assertion never fires, and we get a 0-pass / N-fail finding even
though the fault recipe is correctly hitting the cluster.

Switch to `mz_internal.mz_cluster_replica_status_history` and filter on
`h.status = 'offline'`. This is the underlying audit log; any past offline
event remains visible from any later poll within the retention window, so we
record the fault even if the transition fully completed before the next probe.

Same change in both drivers (the helper was duplicated).
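The post-switch probe can be sketched roughly as follows; the join columns are assumptions based on the mz_internal catalog relations named above, and `cur` is any DB-API cursor.

```python
def replica_went_offline(cur, cluster_name: str) -> bool:
    """True if any replica of the cluster has ever recorded an 'offline'
    event in the status-history audit log (within its retention window),
    so a sub-second offline/online blip between two polls still counts."""
    cur.execute(
        """SELECT count(*) > 0
           FROM mz_internal.mz_cluster_replica_status_history h
           JOIN mz_cluster_replicas r ON r.id = h.replica_id
           JOIN mz_clusters c ON c.id = r.cluster_id
           WHERE c.name = %s AND h.status = 'offline'""",
        (cluster_name,),
    )
    return bool(cur.fetchone()[0])
```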
…al clusterds

Three coupled additions to make parallel_workload safe to run as multiple
concurrent invocations sharing one Materialize instance, with each
invocation's cluster routed to a dedicated external clusterd container.

  1. Database(seed_scoped_names: bool = False). When True, forwards a
     'name_scope' string to every Role and Cluster the framework creates,
     producing 'cluster-<seed>-<id>' and 'role-<seed>-<id>' rather than
     'cluster-<id>' / 'role<id>'. Schemas / tables / views / sources etc.
     don't need this — their fully-qualified names already flow through
     DB.name() which already embeds the seed. Default False so non-
     Antithesis consumers keep their existing name shapes.

     Role.__str__ now passes through pg8000.native.identifier() so the
     quoted-dashed names round-trip correctly; no-op for ASCII names.

  2. Database(pool_members: list[ClusterdPoolMember] | None = None) and
     a new ClusterdPoolMember dataclass (host + storagectl/computectl/
     compute/storage ports + workers). When set, the framework provisions
     unmanaged Cluster replicas with explicit STORAGECTL/STORAGE/
     COMPUTECTL/COMPUTE ADDRESSES pointed at the supplied member(s)
     instead of emitting managed SIZE/REPLICATION FACTOR. The is_pool_backed
     property on Cluster gates the rendering.

  3. CreateClusterAction / CreateClusterReplicaAction / DropClusterReplicaAction
     skip pool-backed clusters: there is no in-band allocator for grabbing
     additional pool members from a worker thread, and replication factor
     manipulation has no analogue in unmanaged-replica mode. The framework
     therefore only ever touches the pool members the caller pre-allocated.

These three pieces only make sense together: seed-scoping by itself
doesn't isolate the clusterd workload; the pool backend by itself collides
on global names; skipping the dynamic DDL by itself would just leave
clusters un-grown in a managed-cluster topology where that's the workload.
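A sketch of the pool-member shape and the unmanaged-replica rendering it gates, assuming the field names from the commit text; the port defaults and the exact option fragment are illustrative, not copied from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterdPoolMember:
    """One externally-provisioned clusterd container (illustrative ports)."""
    host: str
    storagectl_port: int = 2100
    computectl_port: int = 2101
    compute_port: int = 2102
    storage_port: int = 2103
    workers: int = 4


def render_unmanaged_replica(m: ClusterdPoolMember) -> str:
    # Explicit-address replica options instead of managed SIZE /
    # REPLICATION FACTOR; accepting these requires
    # unsafe_enable_unorchestrated_cluster_replicas on the server.
    return (
        f"STORAGECTL ADDRESSES ['{m.host}:{m.storagectl_port}'], "
        f"STORAGE ADDRESSES ['{m.host}:{m.storage_port}'], "
        f"COMPUTECTL ADDRESSES ['{m.host}:{m.computectl_port}'], "
        f"COMPUTE ADDRESSES ['{m.host}:{m.compute_port}'], "
        f"WORKERS {m.workers}"
    )
```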

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reserve a pool of pre-existing clusterd containers
(`clusterd-pool-{0..N-1}`) so each parallel-workload cluster can land on
its own container and Antithesis can fault-inject it in isolation.
Without the pool, parallel-workload's clusters all live as child
processes of environmentd under the materialized container's process
orchestrator, and Antithesis can only kill / pause / partition the
container as a unit.

Pool size is read from `ANTITHESIS_CLUSTERD_POOL_SIZE` (env), default 8.
Each member is identical to clusterd1 / clusterd2: 4 timely workers,
no scratch, restart=no. `workflow_default` brings them up before
materialized so the controller can reach them when CREATE CLUSTER
references their addresses.

`Materialized` already has `unsafe_enable_unorchestrated_cluster_replicas`
set so CREATE CLUSTER ... STORAGECTL ADDRESSES is accepted.

config/docker-compose.yaml regenerated via
  bin/pyactivate test/antithesis/export-compose.py
to match — the YAML is generated, not hand-edited.
… clusterd

Wires the antithesis parallel-workload driver to:

  * Claim one clusterd-pool-{i} container per invocation via a real
    allocator (fcntl.flock on /tmp/clusterd-pool-slots/{i}.lock; lock
    held for the lifetime of the invocation, released on normal return
    or exception). Slots are tried in randomized order so the slot a
    driver lands on doesn't correlate with the invocation seed. If
    every slot is held the driver tags a sometimes() and exits cleanly
    rather than running unisolated.

  * Construct ClusterdPoolMember(host='clusterd-pool-<slot>', workers=4)
    and pass to Database(pool_members=[member], seed_scoped_names=True).
    The initial cluster lands on its own clusterd container — Antithesis
    fault injection targets that container in isolation, which is the
    point of the whole change.

  * Scope setup-phase catalog sweeps to objects this invocation owns:
    'cluster-<seed>-%' and 'role-<seed>-%'. The previous 'c%' / 'r%'
    patterns would have torn down every concurrent invocation's
    still-running state. The shared connections (kafka_conn, csr_conn,
    aws_conn, minio) live outside any seed-scoped database; we never
    drop them (CREATE ... IF NOT EXISTS is idempotent and dropping
    would CASCADE through another invocation's in-flight sources).

  * Drop seed-scoped clusters / databases / roles in main()'s finally
    so each invocation leaves the catalog clean and frees its pool-
    slot's clusterd. The DROP CLUSTER on an unmanaged cluster
    re-arms the clusterd to accept a fresh controller connection via
    the same reconcile() path that handles environmentd restarts
    (storage_state::reconcile drops stale objects, transport::serve
    cancels the prior connection on the next connect).

Pool size is read from CLUSTERD_POOL_SIZE (env), matching the
ANTITHESIS_CLUSTERD_POOL_SIZE knob in test/antithesis/mzcompose.py.
Default 8.

v1 scope (documented for the next round of work):
  * MAX_INITIAL_CLUSTERS = 1 per invocation, REPLICATION FACTOR = 1.
    Multi-replica coverage stays in antithesis_cluster.
  * CreateClusterAction / CreateClusterReplicaAction /
    DropClusterReplicaAction are skipped in pool mode; no in-band
    allocator inside the framework yet.
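The slot allocator described above (one `fcntl.flock` per slot, held for the invocation's lifetime, slots tried in randomized order) can be sketched as:

```python
import fcntl
import os
import random
from contextlib import contextmanager

SLOT_DIR = "/tmp/clusterd-pool-slots"  # lock directory from the commit text


@contextmanager
def claim_pool_slot(pool_size: int):
    """Yield a free slot index, or None if every slot is held (the driver
    then tags a sometimes() and exits cleanly rather than run unisolated)."""
    os.makedirs(SLOT_DIR, exist_ok=True)
    slots = list(range(pool_size))
    random.shuffle(slots)  # so the slot doesn't correlate with the seed
    for slot in slots:
        fd = os.open(os.path.join(SLOT_DIR, f"{slot}.lock"),
                     os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)  # slot held by another invocation; try the next
            continue
        try:
            yield slot  # lock held while the invocation runs
            return
        finally:
            os.close(fd)  # closing the fd releases the flock
    yield None  # all slots held
```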

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-workload

Documents the pool-backed parallel-workload topology: pool of clusterd
containers, file-lock slot allocator, seed-scoped naming, drop-on-exit,
and the reconcile-path correctness argument for clusterd reuse across
DROP/CREATE CLUSTER cycles. Lists current failure modes (all-slots-held,
crash-before-cleanup, sizing) and v1 limitations the next round of work
will close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…not N single-replica clusters

The driver now claims best-effort up to PW_DESIRED_REPLICAS (default 2)
clusterd-pool slots per invocation; the framework consumes the whole
pool_members list into a single unmanaged cluster with one replica per
member instead of one cluster per member. Gives multi-replica fault
coverage to parallel-workload (previously only antithesis_cluster ran
multi-replica) when pool capacity allows, and degrades gracefully to a
single-replica cluster under contention.

The driver helper renamed from _claim_pool_slot (yields one slot or None)
to _claim_pool_slots (yields list of 0..desired slots) so the contention
fallback is just a shorter list rather than a special case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…el-workload setup and worker phases

Three fault-injection shapes were escaping the parallel-workload driver as
'unexpected' SQL errors and tripping the always() assertion:

  1. SETUP: "Failed to resolve hostname" during CREATE CONNECTION FOR
     KAFKA — materialized's broker-validation can't reach `kafka`
     because Antithesis paused the kafka container or partitioned DNS.
  2. SETUP: "connection timeout expired" / "server closed the
     connection unexpectedly" from the driver's psycopg.connect to
     `materialized:6875` — Antithesis paused the materialized container
     during setup.
  3. WORKER: "thread pw-worker-N exited early" with no captured cause
     — `Worker.run`'s initial psycopg.connect / websocket / SET
     statements run outside any try/except, so a fault that lands during
     worker startup kills the thread without populating
     `occurred_exception`. The driver's worker_failed payload then
     reports just the thread-exit-early string with no underlying cause.

None of these are SUT correctness issues — they're the expected cost of
fault injection landing at inconvenient moments.

The fix adds a second tolerance category, _SETUP_FAULT_PATTERNS, alongside
the existing _SETUP_RACE_PATTERNS, and:

  * Inside _tolerate_setup_race: swallow per-statement exceptions whose
    message matches either pattern set.
  * Around the whole setup phase: if setup_failure matches, demote out of
    the always() assertion and into a sometimes() for visibility.
  * Around worker thread death: if occurred_exception is None
    (fault-killed startup) or its message matches a fault pattern,
    demote out of the always() assertion and into a sometimes(). A
    worker that captures a non-fault exception still fails the run as
    before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous design (CREATE+DROP CLUSTER per parallel-workload
invocation, targeting a pool clusterd) hit a hard clusterd halt on the
second invocation against the same slot:

  WARN: halting process: new instance configuration not compatible with
  existing instance configuration: ... index_logs:
  IntrospectionSourceIndex(...876897) vs Some(IntrospectionSourceIndex(...876641))

InstanceConfig::compatible_with compares LoggingConfig, which includes
the per-cluster IntrospectionSourceIndex GlobalIds — those are freshly
allocated by every CREATE CLUSTER. Pointing a different cluster
identity at a clusterd that already saw a prior cluster's introspection
indexes trips this check. reconcile() handles environmentd restarts
against the same cluster identity, but not cluster-identity changes.

Pivot to permanent pool clusters: one long-lived pool_cluster_<i>
per clusterd-pool-<i>, bootstrapped by the workload-entrypoint once
at compose-up and never dropped. Each parallel-workload invocation
picks a slot at random and runs against the pool cluster for that
slot. The cluster identity stays stable per slot, so the only
reconnect events the pool clusterds see are environmentd restarts and
Antithesis-injected pauses of the pool clusterd itself — the path
reconcile is designed for.

Concurrent invocations may share a pool cluster: every workload object
is in a seed-scoped database (db-pw-<seed>-*) with seed-scoped roles,
so DDL/DML never collides across invocations. No coordination
required — no fcntl.flock, no slot-claim contextmanager, no "no slot
available, exit cleanly" fallback. Antithesis still faults containers,
not invocations, so the per-container fault domain is preserved; two
invocations witnessing the same fault simply give us more independent
reproductions per failure.

Framework changes:
  * Cluster.pre_existing_name: when set, create()/drop() are no-ops,
    name() returns the literal, is_pool_backed flips True for action
    skips.
  * Database.existing_cluster_name: replaces pool_members on the
    Antithesis path; wraps one pre-existing cluster.
  * ClusterReplica loses its pool_member field (no longer used).
  * action.CreateClusterAction checks existing_cluster_name (was
    pool_members).

Driver changes:
  * Slot pick is rng.randrange(CLUSTERD_POOL_SIZE). No coordination.
  * _drop_seed_scoped_objects stops dropping clusters. The seed-scoped
    database drop cascades through every workload-created object on
    the cluster, returning the bound clusterd to an idle baseline.
  * Setup-phase pre-create sweep no longer touches mz_clusters.

mzcompose / entrypoint changes:
  * Workload service env now exports ANTITHESIS_CLUSTERD_POOL_SIZE +
    CLUSTERD_POOL_SIZE so the bootstrap script and the driver agree
    on the slot count.
  * workload-entrypoint.sh loops POOL_SIZE times and CREATEs each
    pool_cluster_<i> on its clusterd-pool-<i> (idempotent across
    compose-up).

Multi-replica parallel-workload coverage is gone in this iteration —
each pool cluster has one replica. Multi-replica coverage stays in
antithesis_cluster; a future revision could pair pool clusterds into
2-replica pool clusters at the cost of doubling the pool footprint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…perty

Sibling property upsert-key-reflects-latest-value tests only freshly-
written keys within one invocation. This new property tests the
complementary case: a key that has been resident in the upsert source
for a long time (many invocations, many faults, possibly many clusterd
restarts) must still accept a fresh write and have that write reflected
in the source.

The bug class this catches is a long-resident-state rehydration
regression where the upsert operator's state-store remembers a key's
value with enough fidelity to serve reads but enough wrongness that
fresh writes are silently dropped — the user's pipeline appears stuck
with no error.

Implementation: parallel_driver_upsert_ancient_key_writable.py owns a
dedicated key ring (ancient-k<0..31>) so it never collides with the
sibling driver's per-invocation keys. Each invocation picks 5 ring
slots at random, snapshots their current values, produces fresh
'cross-<prefix>-<nonce>' values, waits for catchup, and asserts that
each key's reflected value changed (or, for first-touch ring slots,
that a row now exists).

The 'always' assertion is race-tolerant against concurrent invocations
of this driver writing to the same ring slot — the only forbidden
outcome is 'row still has the exact old value we tried to overwrite,
with no peer interference,' which means our write was silently lost.
A separate 'sometimes' clause records when our specific new value
reached the source as the win-the-race liveness signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers per clusterd: 4 -> 16. Single-process clusterds at workers=16
exercise the same intra-process concurrency surface as a 4-process
scale=4,workers=4 production deployment, giving us realistic per-shard
parallelism, scheduler contention, and Antithesis-thread-pause-fault
depth.

Pool size: 8 -> 2. The no-lock allocator already tolerates
oversubscription (concurrent invocations may share a pool cluster
because every workload object lives in a seed-scoped database), so a
smaller pool isn't a correctness concern. A pool of 2 keeps the
topology closer to production replica counts and makes each pool
cluster behave more like a busy production cluster.

Workers count is now plumbed through end-to-end:
  * mzcompose.py declares CLUSTERD_WORKERS=16 and uses it for every
    Clusterd(...) service AND exports it in the Workload service env.
  * workload-entrypoint.sh reads CLUSTERD_WORKERS and templates it into
    every CREATE CLUSTER REPLICAS' WORKERS clause (antithesis_cluster
    plus each pool_cluster_<i>). The controller reads WORKERS from this
    clause, not from clusterd's runtime config, so the two must stay
    in lockstep.

Total worker thread count goes from 4*8 + 4*2 = 40 (old: 8 pool + 2
antithesis) to 16*2 + 16*2 = 64 (new: 2 pool + 2 antithesis). Modest
memory increase, big throughput / parallelism gain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the non-transactional DML axis of the MySQL CDC source.
Materialize's MySQL source code path is engine-agnostic (binlog ROW
events look the same for MyISAM and InnoDB), so this exists to assert
the engine-agnostic contract holds in practice when the upstream is
MyISAM:

  * BEGIN/COMMIT around MyISAM statements is silently ignored — each
    statement commits immediately with its own GTID.
  * No rollback semantics: a statement killed mid-write leaves whatever
    rows it managed to insert committed.
  * Table-level locking instead of row-level.

A MyISAM-specific regression would surface as the new driver's
'mysql myisam: CDC source row has correct value after catchup' firing
false while the existing InnoDB-backed driver looks healthy.

What landed:
  * first_mysql_replica_setup.py creates antithesis.cdc_test_myisam
    (ENGINE=MyISAM) alongside the existing cdc_test (ENGINE=InnoDB) on
    the primary, and waits for both to replicate to the replica.
  * helper_mysql_source.py exposes MYSQL_TABLE_MYISAM /
    TABLE_NAME_MYISAM constants and an ensure_mysql_cdc_myisam_table()
    helper. ensure_mysql_cdc_source() now creates both subsources off
    the single mysql_cdc_source SOURCE.
  * parallel_driver_mysql_myisam.py mirrors the InnoDB sibling's shape
    against the MyISAM subsource with the 'myi-p<u64hex>' batch prefix
    so the two drivers don't interfere.
  * Property doc scratchbook/properties/mysql-myisam-cdc-no-data-loss.md
    captures the bug class, MyISAM-specific binlog semantics, and the
    assertion list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dge network

Two changes the Antithesis platform needs from the docker-compose YAML,
applied together at export time so they stay in lockstep:

1. Set `container_name` and `hostname` on every service (matching the
   service key). Per Antithesis docker best practices, triage reports
   attribute log lines and assertions by hostname; without an explicit
   hostname the platform infers one (possibly the container id) that's
   harder to recognize. The workload container is the highest-value
   case but the rule is uniform.

2. Define a named bridge network (`antithesis-net`) at the top level
   and put every service on it. Relying on docker-compose's auto-
   generated `default` network was leaving DNS resolution up to
   whatever the surrounding Antithesis orchestration decided; an
   earlier run on this stack failed with kafka unable to resolve
   `zookeeper` (UnknownHostException) during setup. Antithesis support
   pointed at the network shape as the likely cause and suggested
   declaring it explicitly. Not setting `internal: true` per Antithesis
   docker best practices — that would cut us off from the Antithesis-
   side instrumentation network.

Both transforms live in export-compose.py so they apply uniformly to
every present and future service. Sanity-check that no service key
contains an underscore (RFC-1123); all current keys already use
hyphens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@DAlperin DAlperin force-pushed the dov/antithesis-poc-upstream branch from 95d6209 to 8060eb2 Compare May 14, 2026 18:30
DAlperin and others added 9 commits May 14, 2026 16:41
…available

Data-driven export-compose transform: for every depends_on entry that
uses `condition: service_started` against a dependency that declares a
`healthcheck`, upgrade the condition to `service_healthy`. Dependencies
without a healthcheck (currently only clusterd) are left as
`service_started` since there's nothing to wait on.

Under the Antithesis platform, `service_started` proved unreliable as a
readiness gate during initial container startup. Docker fires it as
soon as the dependency's container process starts, before the
dependency's DNS entry is reliably resolvable. The previous run on the
fault-isolated topology saw kafka hit
`java.net.UnknownHostException: zookeeper: Name or service not known`
148+ times in a row before its retry loop landed on a successful
lookup, with the same cascade downstream (schema-registry ↔ kafka).
Both containers exited with code 1 from those retries, tripping the
"No unexpected container exits" property.

Upgraded edges:
  kafka            -> zookeeper       (zookeeper:2181 nc healthcheck)
  schema-registry  -> kafka           (kafka:9092 nc healthcheck)
  materialized     -> minio           (minio /minio/health/live curl)
  workload         -> schema-registry (schema-registry curl healthcheck)

Left alone:
  workload -> clusterd{1,2}           (no clusterd healthcheck)

Gating on the healthcheck (which probes the actual listen port)
eliminates the DNS-race shape because docker won't fire
`service_healthy` until the dependency is answering on its port —
and DNS is reliably resolvable by then.
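The data-driven transform can be sketched over the parsed compose mapping roughly as below (names are illustrative; the real transform lives in export-compose.py):

```python
def upgrade_depends_on(compose: dict) -> None:
    """For every depends_on edge with condition service_started whose
    dependency declares a healthcheck, upgrade to service_healthy.
    Dependencies without a healthcheck are left as service_started."""
    services = compose.get("services", {})
    for svc in services.values():
        deps = svc.get("depends_on")
        if not isinstance(deps, dict):
            continue  # list-form depends_on carries no condition
        for dep_name, spec in deps.items():
            if (spec.get("condition") == "service_started"
                    and "healthcheck" in services.get(dep_name, {})):
                spec["condition"] = "service_healthy"
```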

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Antithesis feedback noted that parallel_driver_parallel_workload pulls
one u64 from the SDK, seeds a stdlib `random.Random`, and then makes
every downstream decision deterministically off that seed — locking the
fuzzer out of all branches in the framework's action/expression
subtree.

Add `AntithesisRandom`, a `random.Random` subclass that overrides
`getrandbits()` and `random()` to draw from the Antithesis SDK on every
call. Plug it into `parallel_driver_parallel_workload` so action
selection, DDL choices, expression shape, sample sizes, and every other
in-framework `self.rng.*` call route through the SDK per draw. Each
worker thread gets its own instance.
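A sketch of that subclass, with `os.urandom` standing in for the Antithesis SDK draw so the sketch runs anywhere (the real class pulls each draw from the SDK instead):

```python
import os
import random


class AntithesisRandom(random.Random):
    """random.Random whose entropy comes from an external source on every
    call, so the fuzzer (not a fixed seed) drives each decision. All the
    derived methods (randrange, choice, shuffle, ...) route through the
    two overrides below."""

    def getrandbits(self, k: int) -> int:
        # draw k fresh bits per call (SDK draw in the real driver)
        nbytes = (k + 7) // 8
        return int.from_bytes(os.urandom(nbytes), "big") >> (nbytes * 8 - k)

    def random(self) -> float:
        # 53-bit float in [0, 1), matching random.Random's contract
        return self.getrandbits(53) / (1 << 53)

    def seed(self, *args, **kwargs):
        pass  # no internal state: every draw is external
```

Each worker thread constructs its own instance, e.g. `self.rng = AntithesisRandom()`.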

Also add `random_float(low, high)` in helper_random — needed by the
follow-up commit that swarms `TOMBSTONE_PROB`/`DROP_PROBABILITY` across
invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Antithesis feedback called out hardcoded probability constants
(TOMBSTONE_PROB, DROP_PROBABILITY) as missed swarm-testing opportunities
— every timeline ran the exact same workload mix instead of letting the
fuzzer drive the parameter.

Replace the three hardcoded constants with per-invocation draws from
helper_random.random_float() over sensible ranges:

  parallel_driver_upsert_latest_value:
      TOMBSTONE_PROB 0.15  ->  random_float(0.05, 0.50)
  singleton_driver_upsert_state_rehydration:
      TOMBSTONE_PROB 0.20  ->  random_float(0.05, 0.50) (fixed per run
      so cross-cycle stability of `expected` still tests rehydration)
  singleton_driver_catalog_recovery_consistency:
      DROP_PROBABILITY 0.20 ->  random_float(0.10, 0.50)

The draw happens once at the top of main() and is logged for triage.
Each timeline ends up with a different mix; the fuzzer is free to push
toward whichever extreme reveals a bug.
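The per-invocation draw has the following shape (a sketch of the `helper_random.random_float` described above; in the drivers the underlying rng draws from the Antithesis SDK per call):

```python
import random


def random_float(low: float, high: float, rng=random) -> float:
    """Uniform draw in [low, high)."""
    return low + (high - low) * rng.random()


# Drawn once at the top of main() and logged for triage, so each
# timeline runs a different workload mix.
TOMBSTONE_PROB = random_float(0.05, 0.50)
```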

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ator

Antithesis feedback: with every parallel driver requesting its own
`ANTITHESIS_STOP_FAULTS` window, the union of overlapping per-driver
quiet periods leaves the SUT mostly un-faulted. Faults should arrive
on a single coordinated cadence driven from a dedicated container, and
workloads should stay robust to whatever quiet/faulting transitions
the orchestrator picks — the catchup-then-assert pattern already there
fits that model.

Topology change: add a `fault-orchestrator` service backed by `bash:5`
running `test/antithesis/fault-orchestrator/pause_faults.sh`, adapted
from the Antithesis hands-on tutorial. It alternates
faults-OFF/faults-ON windows at randomised intervals
(START_DELAY=30, MIN_ON/MAX_ON/MIN_OFF/MAX_OFF=20-40), centralising the
cadence. Outside Antithesis (`ANTITHESIS_STOP_FAULTS` unset) the script
no-ops so local validate runs still work.

The script is loaded via `Path(...).read_text()` inside
`FaultOrchestrator(Service)`; every `$` is doubled to `$$` before
embedding into the compose YAML so docker-compose's parse-time
variable interpolation doesn't eat shell references like `${RANDOM}`
or `${ANTITHESIS_STOP_FAULTS}`. The on-disk .sh file stays plain bash
so shellcheck and direct execution still work.
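The escaping step amounts to one substitution at embed time (the function name and path are illustrative):

```python
from pathlib import Path


def embed_shell_script(path: str) -> str:
    """Read a bash script and double every '$' so docker-compose's
    parse-time variable interpolation leaves shell references like
    ${RANDOM} and ${ANTITHESIS_STOP_FAULTS} intact in the compose YAML."""
    return Path(path).read_text().replace("$", "$$")
```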

Driver-side: delete `helper_quiet.py` and every `request_quiet_period`
call site (9 drivers). Each driver's `wait_for_catchup` timeout (or
the equivalent FINAL_READ_TIMEOUT_S in the strict-serializable driver)
is bumped to span at least one MAX_OFF window plus catchup overhead
— concretely 90s for the short Kafka/MV/upsert drivers, 120s for
MySQL CDC paths, and 180s for the singleton rehydration driver which
must survive a clusterd kill landing inside a quiet window. Liveness
`sometimes(...)` anchor messages were renamed
"after quiet period" → "within catchup budget" to match the new
semantics; scratchbook docs that quoted the exact strings are updated
to match.

Regenerate test/antithesis/config/docker-compose.yaml via
export-compose.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or windows

The global fault-orchestrator alternates faults-ON/OFF windows of up
to MAX_ON / MAX_OFF seconds each (defaults 40s, set in the
FaultOrchestrator service). With the previous timeouts a single
connect attempt or producer flush could expire entirely inside one
40s faults-ON window — fast-failing before the orchestrator opened
the next quiet window and burning retry budget on TCP timeouts.

helper_pg:
  CONNECT_TIMEOUT_S  15 -> 30   (renamed from _CONNECT_TIMEOUT_S so
                                 the parallel-workload driver can
                                 reuse it instead of hardcoding 15)
  _RETRY_BUDGET_S    120 -> 180  (spans one full ON+OFF cycle + margin)

helper_mysql: same logic — same values. Replication adds primary→replica
hops so the budgets match helper_pg's. `wait_for_host`'s 5s probe stays
short: it runs at bootstrap before fault injection begins.
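The per-attempt-cap vs. overall-budget split described above can be sketched as follows (the `connect` callable and pause interval are hypothetical; the real helpers also log attempt counts and elapsed time):

```python
import time

CONNECT_TIMEOUT_S = 30  # per-attempt cap (renamed from _CONNECT_TIMEOUT_S)
RETRY_BUDGET_S = 180    # spans one full faults-ON + faults-OFF cycle, plus margin

def connect_with_budget(connect, budget_s=RETRY_BUDGET_S, pause_s=1.0):
    """Retry `connect` (a caller-supplied zero-arg callable) until it succeeds
    or the overall budget is spent; each call is expected to enforce its own
    per-attempt timeout, e.g. CONNECT_TIMEOUT_S."""
    deadline = time.monotonic() + budget_s
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except OSError as exc:
            if time.monotonic() + pause_s >= deadline:
                raise TimeoutError(f"giving up after {attempt} attempts") from exc
            time.sleep(pause_s)
```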

helper_kafka: new explicit librdkafka producer config —
`request.timeout.ms=60000`, `delivery.timeout.ms=180000`. New module-
level `ADMIN_TIMEOUT_S=90` for `admin.list_topics` and `create_topics`
result waits; new `FLUSH_TIMEOUT_S=90` exported for drivers so
`producer.flush(timeout=...)` waits past a single MAX_ON window before
declaring pending messages "skipping assertions" material.

The three direct psycopg.connect call sites in
parallel_driver_parallel_workload now use `CONNECT_TIMEOUT_S` instead of
the literal 15. The four Kafka-source drivers' `producer.flush(timeout=30)`
calls now use `FLUSH_TIMEOUT_S` from helper_kafka.

Probe timeouts are intentionally kept short — they exist to *measure*
unavailability, not wait through it:
  anytime_fault_recovery_exercised.PROBE_CONNECT_TIMEOUT_S = 2.0
  singleton_driver_catalog_recovery_consistency.PROBE_CONNECT_TIMEOUT_S = 2.0
  parallel_driver_strict_serializable_reads.PROBE_CONNECT_TIMEOUT_S = 5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ually runs

The FaultOrchestrator service was wired up with
`entrypoint: ["bash", "-s"]` and the script body passed as `command`.
But `bash -s` reads commands from stdin, and a detached docker container
has no stdin: bash hit end-of-file and exited immediately with no output,
the script string silently consumed as a positional parameter instead of
being executed.

Net effect: the orchestrator container started, exited cleanly, and
ANTITHESIS_STOP_FAULTS was never called. Antithesis fault injection
ran unconstrained for the entire run, with no quiet windows ever
opening. Every driver that needed more than one connection (the four
Kafka drivers do CREATE CONNECTION + admin metadata fetch + CREATE
SOURCE; the two MySQL drivers do primary writes + MZ reads; the
parallel-workload driver does multi-step setup; the strict-
serializable driver opens many fresh psycopg connects) effectively
starved.

Only `parallel_driver_mv_reflects_table_updates.py` ever reached its
"driver done" log line: it does one batched INSERT and then polls
materialize on a single retried connection, so a brief calm in the
faults occasionally let it through.

Fix: use `bash -c <script>`, which actually runs the script string.
Verified locally that the round-tripped YAML script body executes
under bash -c, including the no-op exit when ANTITHESIS_STOP_FAULTS
is unset.
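The failure mode reproduces outside docker too; a minimal Python repro (assuming bash is on PATH):

```python
import subprocess

script = 'echo RUNNING'

# entrypoint ["bash", "-s"] plus script-as-command: bash reads stdin
# (empty in a detached container) and never executes the script string.
broken = subprocess.run(
    ["bash", "-s", script],
    stdin=subprocess.DEVNULL, capture_output=True, text=True,
)

# The fix: bash -c executes the script string.
fixed = subprocess.run(["bash", "-c", script], capture_output=True, text=True)

print(repr(broken.stdout), broken.returncode)  # '' 0; exited cleanly, ran nothing
print(repr(fixed.stdout))                      # 'RUNNING\n'
```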

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Debugging the recent fault-orchestrator no-op required staring at logs
without a way to tell which of N concurrent driver invocations a given
line came from, and the helpers logged so little that "stuck at
ensure_kafka_connection" looked the same as "stuck at the CREATE SOURCE
broker validation" or "stuck on the kafka admin metadata fetch."

Two changes:

1. helper_logging.py mints a short hex `INVOCATION_ID` once per process
   and stamps it into every log record via the root formatter
   (`[inv=<id>]`). Every driver swaps its bare `logging.basicConfig`
   block for `helper_logging.setup_logging(name)`, so grepping
   `inv=a3f9b1c2` isolates one invocation's records across helpers,
   threads, and subprocesses.

2. Helpers grow start/done lifecycle logs with elapsed timings at every
   meaningful checkpoint:

   helper_pg.connect             start, established/giving up + elapsed
   helper_pg.execute_retry       per-statement, with truncated SQL summary
   helper_pg.query_retry         same shape, plus row count on success
   helper_pg.execute_internal    same shape
   helper_mysql._open            per-host start/established/giving up
   helper_kafka.make_producer    config snapshot at construction
   helper_kafka.ensure_topic     probing / list_topics done / creating / created
   helper_upsert_source          per-phase markers around ensure_kafka_connection
                                 and ensure_upsert_text_source
   helper_none_source            same
   helper_table_mv               per-phase markers
   helper_source_stats           wait_for_catchup start, progress, done/timeout

   Retry attempts get attempt count + per-attempt elapsed + budget used,
   so "still trying" vs. "burning the budget" is visible in one line.

`bin/pyactivate -m ruff check --fix` cleaned up the unused `import
logging` statements that fell out of the basicConfig removal.

LOG_LEVEL env var (default INFO) lets a triage user re-run with
`LOG_LEVEL=DEBUG` to surface the per-statement SQL summaries the helpers
emit at DEBUG without flooding INFO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er validation

A driver invocation failed after one attempt with:

  WARNING pg execute: giving up after 1 attempts (17.88s total) on
  CREATE SOURCE IF NOT EXISTS upsert_text_src IN CLUSTER antithesis_cluster
  FROM KAFKA CONNECTION antithesis_kafka_conn ...:
  Meta data fetch error: BrokerTransportFailure (Local: Broker transport failure)

Materialize validates `CREATE SOURCE ... FROM KAFKA CONNECTION` by doing a
broker metadata fetch as part of the planning path. Under the global fault-
orchestrator's faults-ON windows, that fetch fails — and materialize
surfaces the failure as a plain `psycopg.errors.InternalError`, *not*
`OperationalError`. Our `_retryable` predicate only recognised
OperationalError + InterfaceError, so the driver burned through one
attempt in 17s and aborted instead of waiting for the next quiet window.

Widen `_retryable` to also accept `InternalError` whose message matches
one of a small set of patterns identifying transient external
dependencies:

  * librdkafka surfaces:  BrokerTransportFailure, Meta data fetch error,
                          "Local: All broker connections are down",
                          "Local: Timed out"
  * schema-registry HTTP: "schema registry", Connection refused/reset
  * DNS partitioned:      Failed to resolve hostname, no route to host,
                          Temporary failure in name resolution
  * postgres/mysql source upstream:
                          could not translate host name,
                          could not connect to server

Patterns are intentionally narrow so a real catalog/schema error
("relation does not exist", "syntax error at or near...") still
propagates after one attempt rather than spinning silently.

Applies to execute_retry / query_retry / execute_internal_retry via the
shared predicate; `create_source_idempotent` keeps its separate
already-exists-tolerance path on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Almost no workloads have been finishing on recent builds. Dov noted
that Antithesis's deterministic hypervisor runs the whole fleet on a
single core, which makes the workers-per-clusterd bump from 4 to 16
(commit 86d1fbb, "bump clusterd workers to 16 and shrink pool
to 2") look suspicious:

  * Total Timely workers across the four clusterd containers went from
    40 to 64.
  * Per-process worker count went from 4 to 16 — sixteen work-stealing
    threads sharing one core would burn most wakeups on context-switch
    overhead and starve dependent steps.

Revert just `CLUSTERD_WORKERS = 16 -> 4` (keeping the pool size at 2;
pool size by itself shouldn't cause this). Regenerate the compose
yaml; every CREATE CLUSTER REPLICAS WORKERS clause now matches the
clusterd `workers=` argument again.

If a build at this commit shows workloads finishing again, the workers
bump was the regression and we stay at 4. If they still don't
finish, the regression is elsewhere (rpqlkrmq's container_name +
hostname + named-bridge-network is the next candidate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@DAlperin DAlperin force-pushed the dov/antithesis-poc-upstream branch from 7fe53c5 to 714e252 Compare May 15, 2026 01:18