
Dov/antithesis poc upstream#36512

Draft
DAlperin wants to merge 62 commits into MaterializeInc:main from DAlperin:dov/antithesis-poc-upstream

Conversation

@DAlperin
Member

Remove these sections if your commit already has a good description!

Motivation

Why does this change exist? Link to a GitHub issue, design doc, Slack
thread, or explain the problem in a sentence or two. A reviewer who has
no context should understand why after reading this section.

If this implements or addresses an existing issue, it's enough to link to that:
Closes
Fixes
etc.

Description

What does this PR actually do? Focus on the approach and any non-obvious
decisions. The diff shows the code --- use this space to explain what the
diff can't tell a reviewer.

Verification

How do you know this change is correct? Describe new or existing automated
tests, or manual steps you took.

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Thank you for your submission! We really appreciate it. Like many source-available projects, we require that you sign our Contributor License Agreement (CLA) before we can accept your contribution.

You can sign the CLA by posting a comment with the message below.


I have read the Contributor License Agreement (CLA) and I hereby sign the CLA.


3 out of 4 committers have signed the CLA.
✅ [DAlperin](https://github.com/DAlperin)
✅ [patrickwwbutler](https://github.com/patrickwwbutler)
✅ [def-](https://github.com/def-)
❌ @mitchwagner-antithesis
You can retrigger this bot by commenting `recheck` in this Pull Request. Posted by the CLA Assistant Lite bot.

@DAlperin DAlperin force-pushed the dov/antithesis-poc-upstream branch from 754deec to d4373eb Compare May 11, 2026 19:21
DAlperin added 13 commits May 11, 2026 16:10
…older .env

mzbuild's _build_locked runs `git clean -ffdX <image_path>` before each
build, which wipes any gitignored file in the build context — including
the .env we generate. Two fixes:

1. publish:false on antithesis-config so the standard ci.test.build flow
   skips it entirely on regular nightly builds (where .env never exists).
   Only build-antithesis.sh / push-antithesis.py builds this image, and
   they write .env first.

2. Commit a placeholder .env so the file is tracked (survives git clean)
   and participates in mzbuild's fingerprint computation. build-antithesis.sh
   overwrites it with real registry refs before the build runs;
   fingerprint reflects the overwritten content per build.
Add 16 Antithesis properties for Kafka source ingestion (NONE + UPSERT
envelopes) to the scratchbook, plus the workload-side implementation of
upsert-key-reflects-latest-value.

Scratchbook additions:
  - sut-analysis Appendix A: kafka source pipeline detail
  - existing-assertions: enumerated SUT-side panic/assert sites that are
    candidates for Antithesis SDK instrumentation
  - property-catalog Category 7: 16 new Kafka/UPSERT properties
  - property-relationships clusters 7-10 plus cross-cluster connections
  - 16 per-property evidence files
  - evaluation/synthesis.md: four-lens review

Workload:
  - parallel_driver_upsert_latest_value.py: produces upserts+tombstones
    with deterministic randomness, requests a quiet period, polls
    mz_source_statistics for catchup, and asserts per-key value match
    (two always() assertions + one sometimes() liveness anchor).
  - helper_pg / helper_kafka / helper_quiet / helper_random /
    helper_source_stats / helper_upsert_source: shared utilities for
    subsequent Kafka source properties.
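The catchup-then-assert shape described above can be sketched as follows. The `always`/`sometimes` stand-ins below are plain placeholders for the Antithesis SDK calls the drivers use; the key/value dict shapes are illustrative, not the driver's exact data model.

```python
# In the drivers these come from the Antithesis SDK
# (from antithesis.assertions import always, sometimes); plain stand-ins
# keep the sketch self-contained.
def always(condition: bool, message: str, details: dict) -> None:
    assert condition, f"{message}: {details}"

def sometimes(condition: bool, message: str, details: dict) -> None:
    pass  # the real SDK records that this point was reached


def assert_upsert_state(expected: dict, observed: dict) -> None:
    """expected maps key -> latest produced value (None = tombstone);
    observed maps key -> value read back from the upsert source."""
    for key, want in expected.items():
        if want is None:
            # a tombstoned key must not be visible after catchup
            always(key not in observed,
                   "upsert: tombstoned key is absent after catchup",
                   {"key": key})
        else:
            always(observed.get(key) == want,
                   "upsert: key reflects latest value after catchup",
                   {"key": key, "want": want, "got": observed.get(key)})
    # liveness anchor: the assertion body actually ran at least once
    sometimes(len(expected) > 0, "upsert: asserted at least one key", {})
```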
… catalog-recovery-consistency workload driver
…imeouts; remove dead upsert.rs (classic) antithesis asserts
def- and others added 19 commits May 13, 2026 15:59
Wraps the SELECT in a session that has SET real_time_recency = TRUE. Under
strict-serializable, this pushes the chosen-ts lower bound to the source's
real-time upstream frontier, so the SELECT waits for ingestion to reach the
broker/upstream high-water mark before responding.

Existing 'wait_for_catchup' on mz_source_statistics.offset_committed is
insufficient as a queryability gate: offset_committed tracks the data-shard
upper, which can advance past oracle_read_ts via the source's reclock while
the corresponding rows live at an mz_ts further forward (assigned by the
next-probe binding). The strict-serializable SELECT then picks a chosen-ts
between the two and returns count=0.

Used by drivers that produce-then-assert against kafka/mysql sources. MV-over-
table drivers don't need this; tables have no upstream to probe and the table
writer's commit already advances the timestamp oracle.
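The wrapper described above can be sketched as below; `cur` is any DB-API cursor (psycopg in the drivers), `real_time_recency` is the real Materialize session variable, and the function name is illustrative rather than the helper's actual name.

```python
def select_with_rtr(cur, query: str, params=()):
    """Run one SELECT under real-time recency (sketch of the wrapper
    described above)."""
    # Under strict serializable, this pushes the SELECT's chosen-ts lower
    # bound to the source's real-time upstream frontier, so the read
    # blocks until ingestion reaches the upstream high-water mark.
    cur.execute("SET real_time_recency = TRUE")
    cur.execute(query, params)
    return cur.fetchall()
```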
Apply real_time_recency=True to the SELECTs that follow wait_for_catchup or
the equivalent in:

  - parallel_driver_upsert_latest_value.py
  - singleton_driver_upsert_state_rehydration.py
  - parallel_driver_kafka_none_envelope.py
  - parallel_driver_mysql_cdc.py

These drivers all produce upstream (kafka/mysql), wait for a catchup signal,
then SELECT and assert. The current catchup signal (offset_committed in
mz_source_statistics, or a COUNT-based poll) clears before the just-ingested
rows are visible at the strict-serializable read timestamp the SELECT picks:

  * offset_committed reflects the data-shard upper reclocked to upstream
    offsets. It can advance past oracle_read_ts via the source's reclock
    binding while the corresponding rows live at an mz_ts further forward
    (assigned by the next-probe binding).
  * COUNT-based polling only requires a single chosen-ts where the count
    matches; the immediately-following per-row SELECT picks oracle_read_ts
    afresh and can race.

real_time_recency forces the SELECT's chosen-ts lower bound to the source's
real-time upstream frontier, so the SELECT waits for ingestion to reach the
broker's/replica's current high-water mark before responding. See the
docstring on helper_pg.query_retry for the full reasoning.

Not applied to parallel_driver_mv_reflects_table_updates: tables have no
upstream to probe (RTR no-ops), and the existing count-based poll on the MV
is already queryability-based.

Not applied to parallel_driver_strict_serializable_reads: it already opens
fresh connections with explicit SET REAL_TIME_RECENCY TO TRUE.
…tes concurrent races

Multiple parallel-driver invocations race the deterministic object-name pool
in _create_database_for_antithesis (role0..roleN, cluster-0..cluster-N, etc).
Setup statements run without IF NOT EXISTS / IF EXISTS guards in many places,
and there is no IF EXISTS form for DROP CLUSTER or DROP ROLE — so the loser
of any given race sees:

  * cluster 'cluster-0' already exists
  * unknown role 'role0'
  * unknown cluster 'cluster-0'
  * role "role0" cannot be dropped because some objects depend on it

These are the same concurrent-DDL outcomes the parallel_workload framework
already tolerates inside the worker loop via Action.errors_to_ignore at
DDL complexity. The setup phase had no equivalent tolerance, so any of these
escaped as a setup_failure and the always-zero-exit assertion fired:
  always(unexpected is None, "parallel workload: no unexpected SQL errors …")

Add _tolerate_setup_race that catches QueryError or Exception with any of the
expected race substrings and proceeds. Wrap every setup statement, including
db.drop/db.create, the cluster/role enumerate-and-drop loops, the
DROP/CREATE CONNECTION + SECRET statements, and the per-relation create loop.

The pattern list mirrors action.Action.errors_to_ignore for the DDL tier.
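A minimal sketch of that tolerance wrapper, assuming a context-manager shape (the driver's actual pattern list is longer; the substrings below are the four race outcomes enumerated above):

```python
from contextlib import contextmanager

# Substrings of the expected concurrent-DDL loser errors; mirrors
# action.Action.errors_to_ignore at the DDL tier (abridged).
_SETUP_RACE_PATTERNS = (
    "already exists",
    "unknown role",
    "unknown cluster",
    "cannot be dropped because some objects depend on it",
)


@contextmanager
def tolerate_setup_race():
    """Swallow the loser's error when concurrent invocations race a
    setup statement; re-raise anything that isn't a known race shape."""
    try:
        yield
    except Exception as e:
        if any(pat in str(e) for pat in _SETUP_RACE_PATTERNS):
            return  # another invocation won this race; proceed
        raise
```

Each setup statement then runs as `with tolerate_setup_race(): cur.execute(...)`.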
…nt-state view

The Sometimes assertions:
  * fault recovery: observed antithesis_cluster replica non-online at least once
  * kafka source resumes: observed antithesis_cluster replica non-online

both rely on `_replica_non_online()` returning True at least once across all
invocations in a run. The previous implementation queried
`mz_cluster_replica_statuses` (DISTINCT ON (replica_id, process_id) over the
underlying history shard), which shows only the latest tick per process. With
a 0.5s probe cadence and a 30s invocation budget, an Antithesis fault that
takes a clusterd offline-then-back-online within a sub-second window slips
between two consecutive polls — the SDK never sees a non-online status, the
Sometimes assertion never fires, and we get a 0-pass / N-fail finding even
though the fault recipe is correctly hitting the cluster.

Switch to `mz_internal.mz_cluster_replica_status_history` and filter on
`h.status = 'offline'`. This is the underlying audit log; any past offline
event remains visible from any later poll within the retention window, so we
record the fault even if the transition fully completed before the next probe.

Same change in both drivers (the helper was duplicated).
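The post-switch probe can be sketched roughly as follows; the join columns are assumptions based on the mz_internal catalog relations named above, and `cur` is any DB-API cursor.

```python
def replica_went_offline(cur, cluster_name: str) -> bool:
    """True if any replica of the cluster has ever recorded an 'offline'
    event in the status-history audit log (within its retention window),
    so a sub-second offline/online blip between two polls still counts."""
    cur.execute(
        """SELECT count(*) > 0
           FROM mz_internal.mz_cluster_replica_status_history h
           JOIN mz_cluster_replicas r ON r.id = h.replica_id
           JOIN mz_clusters c ON c.id = r.cluster_id
           WHERE c.name = %s AND h.status = 'offline'""",
        (cluster_name,),
    )
    return bool(cur.fetchone()[0])
```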
…al clusterds

Three coupled additions to make parallel_workload safe to run as multiple
concurrent invocations sharing one Materialize instance, with each
invocation's cluster routed to a dedicated external clusterd container.

  1. Database(seed_scoped_names: bool = False). When True, forwards a
     'name_scope' string to every Role and Cluster the framework creates,
     producing 'cluster-<seed>-<id>' and 'role-<seed>-<id>' rather than
     'cluster-<id>' / 'role<id>'. Schemas / tables / views / sources etc.
     don't need this — their fully-qualified names already flow through
     DB.name() which already embeds the seed. Default False so non-
     Antithesis consumers keep their existing name shapes.

     Role.__str__ now passes through pg8000.native.identifier() so the
     quoted-dashed names round-trip correctly; no-op for ASCII names.

  2. Database(pool_members: list[ClusterdPoolMember] | None = None) and
     a new ClusterdPoolMember dataclass (host + storagectl/computectl/
     compute/storage ports + workers). When set, the framework provisions
     unmanaged Cluster replicas with explicit STORAGECTL/STORAGE/
     COMPUTECTL/COMPUTE ADDRESSES pointed at the supplied member(s)
     instead of emitting managed SIZE/REPLICATION FACTOR. The is_pool_backed
     property on Cluster gates the rendering.

  3. CreateClusterAction / CreateClusterReplicaAction / DropClusterReplicaAction
     skip pool-backed clusters: there is no in-band allocator for grabbing
     additional pool members from a worker thread, and replication factor
     manipulation has no analogue in unmanaged-replica mode. The framework
     therefore only ever touches the pool members the caller pre-allocated.

These three pieces only make sense together: seed-scoping by itself
doesn't isolate the clusterd workload; the pool backend by itself collides
on global names; skipping the dynamic DDL by itself would just leave
clusters un-grown in a managed-cluster topology where that's the workload.
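A sketch of the pool-member shape and the unmanaged-replica rendering it gates, assuming the field names from the commit text; the port defaults and the exact option fragment are illustrative, not copied from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterdPoolMember:
    """One externally-provisioned clusterd container (illustrative ports)."""
    host: str
    storagectl_port: int = 2100
    computectl_port: int = 2101
    compute_port: int = 2102
    storage_port: int = 2103
    workers: int = 4


def render_unmanaged_replica(m: ClusterdPoolMember) -> str:
    # Explicit-address replica options instead of managed SIZE /
    # REPLICATION FACTOR; accepting these requires
    # unsafe_enable_unorchestrated_cluster_replicas on the server.
    return (
        f"STORAGECTL ADDRESSES ['{m.host}:{m.storagectl_port}'], "
        f"STORAGE ADDRESSES ['{m.host}:{m.storage_port}'], "
        f"COMPUTECTL ADDRESSES ['{m.host}:{m.computectl_port}'], "
        f"COMPUTE ADDRESSES ['{m.host}:{m.compute_port}'], "
        f"WORKERS {m.workers}"
    )
```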

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reserve a pool of pre-existing clusterd containers
(`clusterd-pool-{0..N-1}`) so each parallel-workload cluster can land on
its own container and Antithesis can fault-inject it in isolation.
Without the pool, parallel-workload's clusters all live as child
processes of environmentd under the materialized container's process
orchestrator, and Antithesis can only kill / pause / partition the
container as a unit.

Pool size is read from `ANTITHESIS_CLUSTERD_POOL_SIZE` (env), default 8.
Each member is identical to clusterd1 / clusterd2: 4 timely workers,
no scratch, restart=no. `workflow_default` brings them up before
materialized so the controller can reach them when CREATE CLUSTER
references their addresses.

`Materialized` already has `unsafe_enable_unorchestrated_cluster_replicas`
set so CREATE CLUSTER ... STORAGECTL ADDRESSES is accepted.

config/docker-compose.yaml regenerated via
  bin/pyactivate test/antithesis/export-compose.py
to match — the YAML is generated, not hand-edited.
… clusterd

Wires the antithesis parallel-workload driver to:

  * Claim one clusterd-pool-{i} container per invocation via a real
    allocator (fcntl.flock on /tmp/clusterd-pool-slots/{i}.lock; lock
    held for the lifetime of the invocation, released on normal return
    or exception). Slots are tried in randomized order so the slot a
    driver lands on doesn't correlate with the invocation seed. If
    every slot is held the driver tags a sometimes() and exits cleanly
    rather than running unisolated.

  * Construct ClusterdPoolMember(host='clusterd-pool-<slot>', workers=4)
    and pass to Database(pool_members=[member], seed_scoped_names=True).
    The initial cluster lands on its own clusterd container — Antithesis
    fault injection targets that container in isolation, which is the
    point of the whole change.

  * Scope setup-phase catalog sweeps to objects this invocation owns:
    'cluster-<seed>-%' and 'role-<seed>-%'. The previous 'c%' / 'r%'
    patterns would have torn down every concurrent invocation's
    still-running state. The shared connections (kafka_conn, csr_conn,
    aws_conn, minio) live outside any seed-scoped database; we never
    drop them (CREATE ... IF NOT EXISTS is idempotent and dropping
    would CASCADE through another invocation's in-flight sources).

  * Drop seed-scoped clusters / databases / roles in main()'s finally
    so each invocation leaves the catalog clean and frees its pool-
    slot's clusterd. The DROP CLUSTER on an unmanaged cluster
    re-arms the clusterd to accept a fresh controller connection via
    the same reconcile() path that handles environmentd restarts
    (storage_state::reconcile drops stale objects, transport::serve
    cancels the prior connection on the next connect).

Pool size is read from CLUSTERD_POOL_SIZE (env), matching the
ANTITHESIS_CLUSTERD_POOL_SIZE knob in test/antithesis/mzcompose.py.
Default 8.

v1 scope (documented for the next round of work):
  * MAX_INITIAL_CLUSTERS = 1 per invocation, REPLICATION FACTOR = 1.
    Multi-replica coverage stays in antithesis_cluster.
  * CreateClusterAction / CreateClusterReplicaAction /
    DropClusterReplicaAction are skipped in pool mode; no in-band
    allocator inside the framework yet.
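The slot allocator described above (one `fcntl.flock` per slot, held for the invocation's lifetime, slots tried in randomized order) can be sketched as:

```python
import fcntl
import os
import random
from contextlib import contextmanager

SLOT_DIR = "/tmp/clusterd-pool-slots"  # lock directory from the commit text


@contextmanager
def claim_pool_slot(pool_size: int):
    """Yield a free slot index, or None if every slot is held (the driver
    then tags a sometimes() and exits cleanly rather than run unisolated)."""
    os.makedirs(SLOT_DIR, exist_ok=True)
    slots = list(range(pool_size))
    random.shuffle(slots)  # so the slot doesn't correlate with the seed
    for slot in slots:
        fd = os.open(os.path.join(SLOT_DIR, f"{slot}.lock"),
                     os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)  # slot held by another invocation; try the next
            continue
        try:
            yield slot  # lock held while the invocation runs
            return
        finally:
            os.close(fd)  # closing the fd releases the flock
    yield None  # all slots held
```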

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-workload

Documents the pool-backed parallel-workload topology: pool of clusterd
containers, file-lock slot allocator, seed-scoped naming, drop-on-exit,
and the reconcile-path correctness argument for clusterd reuse across
DROP/CREATE CLUSTER cycles. Lists current failure modes (all-slots-held,
crash-before-cleanup, sizing) and v1 limitations the next round of work
will close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…not N single-replica clusters

The driver now claims best-effort up to PW_DESIRED_REPLICAS (default 2)
clusterd-pool slots per invocation; the framework consumes the whole
pool_members list into a single unmanaged cluster with one replica per
member instead of one cluster per member. Gives multi-replica fault
coverage to parallel-workload (previously only antithesis_cluster ran
multi-replica) when pool capacity allows, and degrades gracefully to a
single-replica cluster under contention.

The driver helper renamed from _claim_pool_slot (yields one slot or None)
to _claim_pool_slots (yields list of 0..desired slots) so the contention
fallback is just a shorter list rather than a special case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…el-workload setup and worker phases

Three fault-injection shapes were escaping the parallel-workload driver as
'unexpected' SQL errors and tripping the always() assertion:

  1. SETUP: "Failed to resolve hostname" during CREATE CONNECTION FOR
     KAFKA — materialized's broker-validation can't reach `kafka`
     because Antithesis paused the kafka container or partitioned DNS.
  2. SETUP: "connection timeout expired" / "server closed the
     connection unexpectedly" from the driver's psycopg.connect to
     `materialized:6875` — Antithesis paused the materialized container
     during setup.
  3. WORKER: "thread pw-worker-N exited early" with no captured cause
     — `Worker.run`'s initial psycopg.connect / websocket / SET
     statements run outside any try/except, so a fault that lands during
     worker startup kills the thread without populating
     `occurred_exception`. The driver's worker_failed payload then
     reports just the thread-exit-early string with no underlying cause.

None of these are SUT correctness issues — they're the expected cost of
fault injection landing at inconvenient moments.

The fix adds a second tolerance category, _SETUP_FAULT_PATTERNS, alongside
the existing _SETUP_RACE_PATTERNS, and:

  * Inside _tolerate_setup_race: swallow per-statement exceptions whose
    message matches either pattern set.
  * Around the whole setup phase: if setup_failure matches, demote out of
    the always() assertion and into a sometimes() for visibility.
  * Around worker thread death: if occurred_exception is None
    (fault-killed startup) or its message matches a fault pattern,
    demote out of the always() assertion and into a sometimes(). A
    worker that captures a non-fault exception still fails the run as
    before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous design (CREATE+DROP CLUSTER per parallel-workload
invocation, targeting a pool clusterd) hit a hard clusterd halt on the
second invocation against the same slot:

  WARN: halting process: new instance configuration not compatible with
  existing instance configuration: ... index_logs:
  IntrospectionSourceIndex(...876897) vs Some(IntrospectionSourceIndex(...876641))

InstanceConfig::compatible_with compares LoggingConfig, which includes
the per-cluster IntrospectionSourceIndex GlobalIds — those are freshly
allocated by every CREATE CLUSTER. Pointing a different cluster
identity at a clusterd that already saw a prior cluster's introspection
indexes trips this check. reconcile() handles environmentd restarts
against the same cluster identity, but not cluster-identity changes.

Pivot to permanent pool clusters: one long-lived pool_cluster_<i>
per clusterd-pool-<i>, bootstrapped by the workload-entrypoint once
at compose-up and never dropped. Each parallel-workload invocation
picks a slot at random and runs against the pool cluster for that
slot. The cluster identity stays stable per slot, so the only
reconnect events the pool clusterds see are environmentd restarts and
Antithesis-injected pauses of the pool clusterd itself — the path
reconcile is designed for.

Concurrent invocations may share a pool cluster: every workload object
is in a seed-scoped database (db-pw-<seed>-*) with seed-scoped roles,
so DDL/DML never collides across invocations. No coordination
required — no fcntl.flock, no slot-claim contextmanager, no "no slot
available, exit cleanly" fallback. Antithesis still faults containers,
not invocations, so the per-container fault domain is preserved; two
invocations witnessing the same fault simply give us more independent
reproductions per failure.

Framework changes:
  * Cluster.pre_existing_name: when set, create()/drop() are no-ops,
    name() returns the literal, is_pool_backed flips True for action
    skips.
  * Database.existing_cluster_name: replaces pool_members on the
    Antithesis path; wraps one pre-existing cluster.
  * ClusterReplica loses its pool_member field (no longer used).
  * action.CreateClusterAction checks existing_cluster_name (was
    pool_members).

Driver changes:
  * Slot pick is rng.randrange(CLUSTERD_POOL_SIZE). No coordination.
  * _drop_seed_scoped_objects stops dropping clusters. The seed-scoped
    database drop cascades through every workload-created object on
    the cluster, returning the bound clusterd to an idle baseline.
  * Setup-phase pre-create sweep no longer touches mz_clusters.

mzcompose / entrypoint changes:
  * Workload service env now exports ANTITHESIS_CLUSTERD_POOL_SIZE +
    CLUSTERD_POOL_SIZE so the bootstrap script and the driver agree
    on the slot count.
  * workload-entrypoint.sh loops POOL_SIZE times and CREATEs each
    pool_cluster_<i> on its clusterd-pool-<i> (idempotent across
    compose-up).

Multi-replica parallel-workload coverage is gone in this iteration —
each pool cluster has one replica. Multi-replica coverage stays in
antithesis_cluster; a future revision could pair pool clusterds into
2-replica pool clusters at the cost of doubling the pool footprint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…perty

Sibling property upsert-key-reflects-latest-value tests only freshly-
written keys within one invocation. This new property tests the
complementary case: a key that has been resident in the upsert source
for a long time (many invocations, many faults, possibly many clusterd
restarts) must still accept a fresh write and have that write reflected
in the source.

The bug class this catches is a long-resident-state rehydration
regression where the upsert operator's state-store remembers a key's
value with enough fidelity to serve reads but enough wrongness that
fresh writes are silently dropped — the user's pipeline appears stuck
with no error.

Implementation: parallel_driver_upsert_ancient_key_writable.py owns a
dedicated key ring (ancient-k<0..31>) so it never collides with the
sibling driver's per-invocation keys. Each invocation picks 5 ring
slots at random, snapshots their current values, produces fresh
'cross-<prefix>-<nonce>' values, waits for catchup, and asserts that
each key's reflected value changed (or, for first-touch ring slots,
that a row now exists).

The 'always' assertion is race-tolerant against concurrent invocations
of this driver writing to the same ring slot — the only forbidden
outcome is 'row still has the exact old value we tried to overwrite,
with no peer interference,' which means our write was silently lost.
A separate 'sometimes' clause records when our specific new value
reached the source as the win-the-race liveness signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers per clusterd: 4 -> 16. Single-process clusterds at workers=16
exercise the same intra-process concurrency surface as a 4-process
scale=4,workers=4 production deployment, giving us realistic per-shard
parallelism, scheduler contention, and Antithesis-thread-pause-fault
depth.

Pool size: 8 -> 2. The no-lock allocator already tolerates
oversubscription (concurrent invocations may share a pool cluster
because every workload object lives in a seed-scoped database), so a
smaller pool isn't a correctness concern. A pool of 2 keeps the
topology closer to production replica counts and makes each pool
cluster behave more like a busy production cluster.

Workers count is now plumbed through end-to-end:
  * mzcompose.py declares CLUSTERD_WORKERS=16 and uses it for every
    Clusterd(...) service AND exports it in the Workload service env.
  * workload-entrypoint.sh reads CLUSTERD_WORKERS and templates it into
    every CREATE CLUSTER REPLICAS' WORKERS clause (antithesis_cluster
    plus each pool_cluster_<i>). The controller reads WORKERS from this
    clause, not from clusterd's runtime config, so the two must stay
    in lockstep.

Total worker thread count goes from 4*8 + 4*2 = 40 (old: 8 pool + 2
antithesis) to 16*2 + 16*2 = 64 (new: 2 pool + 2 antithesis). Modest
memory increase, big throughput / parallelism gain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the non-transactional DML axis of the MySQL CDC source.
Materialize's MySQL source code path is engine-agnostic (binlog ROW
events look the same for MyISAM and InnoDB), so this exists to assert
the engine-agnostic contract holds in practice when the upstream is
MyISAM:

  * BEGIN/COMMIT around MyISAM statements is silently ignored — each
    statement commits immediately with its own GTID.
  * No rollback semantics: a statement killed mid-write leaves whatever
    rows it managed to insert committed.
  * Table-level locking instead of row-level.

A MyISAM-specific regression would surface as the new driver's
'mysql myisam: CDC source row has correct value after catchup' firing
false while the existing InnoDB-backed driver looks healthy.

What landed:
  * first_mysql_replica_setup.py creates antithesis.cdc_test_myisam
    (ENGINE=MyISAM) alongside the existing cdc_test (ENGINE=InnoDB) on
    the primary, and waits for both to replicate to the replica.
  * helper_mysql_source.py exposes MYSQL_TABLE_MYISAM /
    TABLE_NAME_MYISAM constants and an ensure_mysql_cdc_myisam_table()
    helper. ensure_mysql_cdc_source() now creates both subsources off
    the single mysql_cdc_source SOURCE.
  * parallel_driver_mysql_myisam.py mirrors the InnoDB sibling's shape
    against the MyISAM subsource with the 'myi-p<u64hex>' batch prefix
    so the two drivers don't interfere.
  * Property doc scratchbook/properties/mysql-myisam-cdc-no-data-loss.md
    captures the bug class, MyISAM-specific binlog semantics, and the
    assertion list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dge network

Two changes the Antithesis platform needs from the docker-compose YAML,
applied together at export time so they stay in lockstep:

1. Set `container_name` and `hostname` on every service (matching the
   service key). Per Antithesis docker best practices, triage reports
   attribute log lines and assertions by hostname; without an explicit
   hostname the platform infers one (possibly the container id) that's
   harder to recognize. The workload container is the highest-value
   case but the rule is uniform.

2. Define a named bridge network (`antithesis-net`) at the top level
   and put every service on it. Relying on docker-compose's auto-
   generated `default` network was leaving DNS resolution up to
   whatever the surrounding Antithesis orchestration decided; an
   earlier run on this stack failed with kafka unable to resolve
   `zookeeper` (UnknownHostException) during setup. Antithesis support
   pointed at the network shape as the likely cause and suggested
   declaring it explicitly. Not setting `internal: true` per Antithesis
   docker best practices — that would cut us off from the Antithesis-
   side instrumentation network.

Both transforms live in export-compose.py so they apply uniformly to
every present and future service. Sanity-check that no service key
contains an underscore (RFC-1123); all current keys already use
hyphens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@DAlperin DAlperin force-pushed the dov/antithesis-poc-upstream branch from 95d6209 to 8060eb2 Compare May 14, 2026 18:30
DAlperin and others added 9 commits May 14, 2026 16:41
…available

Data-driven export-compose transform: for every depends_on entry that
uses `condition: service_started` against a dependency that declares a
`healthcheck`, upgrade the condition to `service_healthy`. Dependencies
without a healthcheck (currently only clusterd) are left as
`service_started` since there's nothing to wait on.

Under the Antithesis platform, `service_started` proved unreliable as a
readiness gate during initial container startup. Docker fires it as
soon as the dependency's container process starts, before the
dependency's DNS entry is reliably resolvable. The previous run on the
fault-isolated topology saw kafka hit
`java.net.UnknownHostException: zookeeper: Name or service not known`
148+ times in a row before its retry loop landed on a successful
lookup, with the same cascade downstream (schema-registry ↔ kafka).
Both containers exited with code 1 from those retries, tripping the
"No unexpected container exits" property.

Upgraded edges:
  kafka            -> zookeeper       (zookeeper:2181 nc healthcheck)
  schema-registry  -> kafka           (kafka:9092 nc healthcheck)
  materialized     -> minio           (minio /minio/health/live curl)
  workload         -> schema-registry (schema-registry curl healthcheck)

Left alone:
  workload -> clusterd{1,2}           (no clusterd healthcheck)

Gating on the healthcheck (which probes the actual listen port)
eliminates the DNS-race shape because docker won't fire
`service_healthy` until the dependency is answering on its port —
and DNS is reliably resolvable by then.
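The data-driven transform can be sketched over the parsed compose mapping roughly as below (names are illustrative; the real transform lives in export-compose.py):

```python
def upgrade_depends_on(compose: dict) -> None:
    """For every depends_on edge with condition service_started whose
    dependency declares a healthcheck, upgrade to service_healthy.
    Dependencies without a healthcheck are left as service_started."""
    services = compose.get("services", {})
    for svc in services.values():
        deps = svc.get("depends_on")
        if not isinstance(deps, dict):
            continue  # list-form depends_on carries no condition
        for dep_name, spec in deps.items():
            if (spec.get("condition") == "service_started"
                    and "healthcheck" in services.get(dep_name, {})):
                spec["condition"] = "service_healthy"
```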

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Antithesis feedback noted that parallel_driver_parallel_workload pulls
one u64 from the SDK, seeds a stdlib `random.Random`, and then makes
every downstream decision deterministically off that seed — locking the
fuzzer out of all branches in the framework's action/expression
subtree.

Add `AntithesisRandom`, a `random.Random` subclass that overrides
`getrandbits()` and `random()` to draw from the Antithesis SDK on every
call. Plug it into `parallel_driver_parallel_workload` so action
selection, DDL choices, expression shape, sample sizes, and every other
in-framework `self.rng.*` call route through the SDK per draw. Each
worker thread gets its own instance.
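A sketch of that subclass, with `os.urandom` standing in for the Antithesis SDK draw so the sketch runs anywhere (the real class pulls each draw from the SDK instead):

```python
import os
import random


class AntithesisRandom(random.Random):
    """random.Random whose entropy comes from an external source on every
    call, so the fuzzer (not a fixed seed) drives each decision. All the
    derived methods (randrange, choice, shuffle, ...) route through the
    two overrides below."""

    def getrandbits(self, k: int) -> int:
        # draw k fresh bits per call (SDK draw in the real driver)
        nbytes = (k + 7) // 8
        return int.from_bytes(os.urandom(nbytes), "big") >> (nbytes * 8 - k)

    def random(self) -> float:
        # 53-bit float in [0, 1), matching random.Random's contract
        return self.getrandbits(53) / (1 << 53)

    def seed(self, *args, **kwargs):
        pass  # no internal state: every draw is external
```

Each worker thread constructs its own instance, e.g. `self.rng = AntithesisRandom()`.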

Also add `random_float(low, high)` in helper_random — needed by the
follow-up commit that swarms `TOMBSTONE_PROB`/`DROP_PROBABILITY` across
invocations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Antithesis feedback called out hardcoded probability constants
(TOMBSTONE_PROB, DROP_PROBABILITY) as missed swarm-testing opportunities
— every timeline ran the exact same workload mix instead of letting the
fuzzer drive the parameter.

Replace the three hardcoded constants with per-invocation draws from
helper_random.random_float() over sensible ranges:

  parallel_driver_upsert_latest_value:
      TOMBSTONE_PROB 0.15  ->  random_float(0.05, 0.50)
  singleton_driver_upsert_state_rehydration:
      TOMBSTONE_PROB 0.20  ->  random_float(0.05, 0.50) (fixed per run
      so cross-cycle stability of `expected` still tests rehydration)
  singleton_driver_catalog_recovery_consistency:
      DROP_PROBABILITY 0.20 ->  random_float(0.10, 0.50)

The draw happens once at the top of main() and is logged for triage.
Each timeline ends up with a different mix; the fuzzer is free to push
toward whichever extreme reveals a bug.
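The per-invocation draw has the following shape (a sketch of the `helper_random.random_float` described above; in the drivers the underlying rng draws from the Antithesis SDK per call):

```python
import random


def random_float(low: float, high: float, rng=random) -> float:
    """Uniform draw in [low, high)."""
    return low + (high - low) * rng.random()


# Drawn once at the top of main() and logged for triage, so each
# timeline runs a different workload mix.
TOMBSTONE_PROB = random_float(0.05, 0.50)
```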

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ator

Antithesis feedback: with every parallel driver requesting its own
`ANTITHESIS_STOP_FAULTS` window, the union of overlapping per-driver
quiet periods leaves the SUT mostly un-faulted. Faults should arrive
on a single coordinated cadence driven from a dedicated container, and
workloads should stay robust to whatever quiet/faulting transitions
the orchestrator picks — the catchup-then-assert pattern already there
fits that model.

Topology change: add a `fault-orchestrator` service backed by `bash:5`
running `test/antithesis/fault-orchestrator/pause_faults.sh`, adapted
from the Antithesis hands-on tutorial. It alternates
faults-OFF/faults-ON windows at randomised intervals
(START_DELAY=30, MIN_ON/MAX_ON/MIN_OFF/MAX_OFF=20-40), centralising the
cadence. Outside Antithesis (`ANTITHESIS_STOP_FAULTS` unset) the script
no-ops so local validate runs still work.

The script is loaded via `Path(...).read_text()` inside
`FaultOrchestrator(Service)`; every `$` is doubled to `$$` before
embedding into the compose YAML so docker-compose's parse-time
variable interpolation doesn't eat shell references like `${RANDOM}`
or `${ANTITHESIS_STOP_FAULTS}`. The on-disk .sh file stays plain bash
so shellcheck and direct execution still work.
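The escaping step amounts to one substitution at embed time (the function name and path are illustrative):

```python
from pathlib import Path


def embed_shell_script(path: str) -> str:
    """Read a bash script and double every '$' so docker-compose's
    parse-time variable interpolation leaves shell references like
    ${RANDOM} and ${ANTITHESIS_STOP_FAULTS} intact in the compose YAML."""
    return Path(path).read_text().replace("$", "$$")
```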

Driver-side: delete `helper_quiet.py` and every `request_quiet_period`
call site (9 drivers). Each driver's `wait_for_catchup` timeout (or
the equivalent FINAL_READ_TIMEOUT_S in the strict-serializable driver)
is bumped to span at least one MAX_OFF window plus catchup overhead
— concretely 90s for the short Kafka/MV/upsert drivers, 120s for
MySQL CDC paths, and 180s for the singleton rehydration driver which
must survive a clusterd kill landing inside a quiet window. Liveness
`sometimes(...)` anchor messages were renamed
"after quiet period" → "within catchup budget" to match the new
semantics; scratchbook docs that quoted the exact strings are updated
to match.

Regenerate test/antithesis/config/docker-compose.yaml via
export-compose.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or windows

The global fault-orchestrator alternates faults-ON/OFF windows of up
to MAX_ON / MAX_OFF seconds each (defaults 40s, set in the
FaultOrchestrator service). With the previous timeouts a single
connect attempt or producer flush could expire entirely inside one
40s faults-ON window — fast-failing before the orchestrator opened
the next quiet window and burning retry budget on TCP timeouts.

helper_pg:
  CONNECT_TIMEOUT_S  15 -> 30   (renamed from _CONNECT_TIMEOUT_S so
                                 the parallel-workload driver can
                                 reuse it instead of hardcoding 15)
  _RETRY_BUDGET_S    120 -> 180  (spans one full ON+OFF cycle + margin)

helper_mysql: same logic — same values. Replication adds primary→replica
hops so the budgets match helper_pg's. `wait_for_host`'s 5s probe stays
short: it runs at bootstrap before fault injection begins.
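The per-attempt-cap vs. overall-budget split described above can be sketched as follows (the `connect` callable and pause interval are hypothetical; the real helpers also log attempt counts and elapsed time):

```python
import time

CONNECT_TIMEOUT_S = 30  # per-attempt cap (renamed from _CONNECT_TIMEOUT_S)
RETRY_BUDGET_S = 180    # spans one full faults-ON + faults-OFF cycle, plus margin

def connect_with_budget(connect, budget_s=RETRY_BUDGET_S, pause_s=1.0):
    """Retry `connect` (a caller-supplied zero-arg callable) until it succeeds
    or the overall budget is spent; each call is expected to enforce its own
    per-attempt timeout, e.g. CONNECT_TIMEOUT_S."""
    deadline = time.monotonic() + budget_s
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except OSError as exc:
            if time.monotonic() + pause_s >= deadline:
                raise TimeoutError(f"giving up after {attempt} attempts") from exc
            time.sleep(pause_s)
```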

helper_kafka: new explicit librdkafka producer config —
`request.timeout.ms=60000`, `delivery.timeout.ms=180000`. New module-
level `ADMIN_TIMEOUT_S=90` for `admin.list_topics` and `create_topics`
result waits; new `FLUSH_TIMEOUT_S=90` exported for drivers so
`producer.flush(timeout=...)` waits past a single MAX_ON window before
declaring pending messages "skipping assertions" material.

The three direct psycopg.connect call sites in
parallel_driver_parallel_workload now use `CONNECT_TIMEOUT_S` instead of
the literal 15. The four Kafka-source drivers' `producer.flush(timeout=30)`
calls now use `FLUSH_TIMEOUT_S` from helper_kafka.

Probe timeouts are intentionally kept short — they exist to *measure*
unavailability, not wait through it:
  anytime_fault_recovery_exercised.PROBE_CONNECT_TIMEOUT_S = 2.0
  singleton_driver_catalog_recovery_consistency.PROBE_CONNECT_TIMEOUT_S = 2.0
  parallel_driver_strict_serializable_reads.PROBE_CONNECT_TIMEOUT_S = 5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ually runs

The FaultOrchestrator service was wired up with
`entrypoint: ["bash", "-s"]` and the script body passed as `command`.
But `bash -s` reads commands from stdin, and a detached docker container
has no stdin: bash hit end-of-file and exited immediately with no output,
the script string silently consumed as a positional parameter instead of
being executed.

Net effect: the orchestrator container started, exited cleanly, and
ANTITHESIS_STOP_FAULTS was never called. Antithesis fault injection
ran unconstrained for the entire run, with no quiet windows ever
opening. Every driver that needed more than one connection (the four
Kafka drivers do CREATE CONNECTION + admin metadata fetch + CREATE
SOURCE; the two MySQL drivers do primary writes + MZ reads; the
parallel-workload driver does multi-step setup; the strict-
serializable driver opens many fresh psycopg connects) effectively
starved.

Only `parallel_driver_mv_reflects_table_updates.py` ever reached its
"driver done" log line: it does one batched INSERT and then polls
materialize on a single retried connection, so a brief calm in the
faults occasionally let it through.

Fix: use `bash -c <script>`, which actually runs the script string.
Verified locally that the round-tripped YAML script body executes
under bash -c, including the no-op exit when ANTITHESIS_STOP_FAULTS
is unset.
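The failure mode reproduces outside docker too; a minimal Python repro (assuming bash is on PATH):

```python
import subprocess

script = 'echo RUNNING'

# entrypoint ["bash", "-s"] plus script-as-command: bash reads stdin
# (empty in a detached container) and never executes the script string.
broken = subprocess.run(
    ["bash", "-s", script],
    stdin=subprocess.DEVNULL, capture_output=True, text=True,
)

# The fix: bash -c executes the script string.
fixed = subprocess.run(["bash", "-c", script], capture_output=True, text=True)

print(repr(broken.stdout), broken.returncode)  # '' 0; exited cleanly, ran nothing
print(repr(fixed.stdout))                      # 'RUNNING\n'
```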

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Debugging the recent fault-orchestrator no-op required staring at logs
without a way to tell which of N concurrent driver invocations a given
line came from, and the helpers logged so little that "stuck at
ensure_kafka_connection" looked the same as "stuck at the CREATE SOURCE
broker validation" or "stuck on the kafka admin metadata fetch."

Two changes:

1. helper_logging.py mints a short hex `INVOCATION_ID` once per process
   and stamps it into every log record via the root formatter
   (`[inv=<id>]`). Every driver swaps its bare `logging.basicConfig`
   block for `helper_logging.setup_logging(name)`, so grepping
   `inv=a3f9b1c2` isolates one invocation's records across helpers,
   threads, and subprocesses.

2. Helpers grow start/done lifecycle logs with elapsed timings at every
   meaningful checkpoint:

   helper_pg.connect             start, established/giving up + elapsed
   helper_pg.execute_retry       per-statement, with truncated SQL summary
   helper_pg.query_retry         same shape, plus row count on success
   helper_pg.execute_internal    same shape
   helper_mysql._open            per-host start/established/giving up
   helper_kafka.make_producer    config snapshot at construction
   helper_kafka.ensure_topic     probing / list_topics done / creating / created
   helper_upsert_source          per-phase markers around ensure_kafka_connection
                                 and ensure_upsert_text_source
   helper_none_source            same
   helper_table_mv               per-phase markers
   helper_source_stats           wait_for_catchup start, progress, done/timeout

   Retry attempts get attempt count + per-attempt elapsed + budget used,
   so "still trying" vs. "burning the budget" is visible in one line.

`bin/pyactivate -m ruff check --fix` cleaned up the unused `import
logging` statements that fell out of the basicConfig removal.

LOG_LEVEL env var (default INFO) lets a triage user re-run with
`LOG_LEVEL=DEBUG` to surface the per-statement SQL summaries the helpers
emit at DEBUG without flooding INFO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er validation

A driver invocation failed after one attempt with:

  WARNING pg execute: giving up after 1 attempts (17.88s total) on
  CREATE SOURCE IF NOT EXISTS upsert_text_src IN CLUSTER antithesis_cluster
  FROM KAFKA CONNECTION antithesis_kafka_conn ...:
  Meta data fetch error: BrokerTransportFailure (Local: Broker transport failure)

Materialize validates `CREATE SOURCE ... FROM KAFKA CONNECTION` by doing a
broker metadata fetch as part of the planning path. Under the global fault-
orchestrator's faults-ON windows, that fetch fails — and materialize
surfaces the failure as a plain `psycopg.errors.InternalError`, *not*
`OperationalError`. Our `_retryable` predicate only recognised
OperationalError + InterfaceError, so the driver burned through one
attempt in 17s and aborted instead of waiting for the next quiet window.

Widen `_retryable` to also accept `InternalError` whose message matches
one of a small set of patterns identifying transient external
dependencies:

  * librdkafka surfaces:  BrokerTransportFailure, Meta data fetch error,
                          "Local: All broker connections are down",
                          "Local: Timed out"
  * schema-registry HTTP: "schema registry", Connection refused/reset
  * DNS partitioned:      Failed to resolve hostname, no route to host,
                          Temporary failure in name resolution
  * postgres/mysql source upstream:
                          could not translate host name,
                          could not connect to server

Patterns are intentionally narrow so a real catalog/schema error
("relation does not exist", "syntax error at or near...") still
propagates after one attempt rather than spinning silently.

Applies to execute_retry / query_retry / execute_internal_retry via the
shared predicate; `create_source_idempotent` keeps its separate
already-exists-tolerance path on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Almost no workloads have been finishing on recent builds. Dov noted
that Antithesis's deterministic hypervisor runs the whole fleet on a
single core, which makes the workers-per-clusterd bump from 4 to 16
(commit 86d1fbb, "bump clusterd workers to 16 and shrink pool
to 2") look suspicious:

  * Total Timely workers across the four clusterd containers went from
    40 to 64.
  * Per-process worker count went from 4 to 16 — sixteen work-stealing
    threads sharing one core would burn most wakeups on context-switch
    overhead and starve dependent steps.

Revert just `CLUSTERD_WORKERS = 16 -> 4` (keeping the pool size at 2;
pool size by itself shouldn't cause this). Regenerate the compose
yaml; every CREATE CLUSTER REPLICAS WORKERS clause now matches the
clusterd `workers=` argument again.

If a build at this commit shows workloads finishing again, the workers
bump was the regression and we stay at 4. If they still don't
finish, the regression is elsewhere (rpqlkrmq's container_name +
hostname + named-bridge-network is the next candidate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@DAlperin DAlperin force-pushed the dov/antithesis-poc-upstream branch from 7fe53c5 to 714e252 Compare May 15, 2026 01:18