Added a TLA+/TLC trace-validation baseline with trace export, runner integration, and experiment tooling. by baochunli · Pull Request #90 · iQua/days

baochunli · 2026-02-01T17:18:18Z

This PR added a reproducible TLA+/TLC trace-validation baseline for Days and integrated it into the LeanGuard runner + test-generation workflows so we can do an apples-to-apples comparison on the same canonicalized event traces (same ordering assumptions, same post-state “witness” fields, and comparable diagnostics).

Motivation

Established a baseline that used the standard, widely-available TLC workflow (released tla2tools.jar) rather than relying on research-only NDJSON ingestion modules, so the comparison stayed reproducible.
Matched LeanGuard’s “equal-information” checking style by validating against the same per-event post-state snapshots that the Lean checkers consume (instead of giving TLC extra inference power by omitting fields).

Changes

Introduced a trace export pipeline that canonicalized *_events.csv rows by (time_ns, event_id), rejected duplicate keys, and produced:
- lossless NDJSON (for diagnostics and precise “first failing row” reporting), and
- a generated TraceData TLA module (rescales large numeric fields to fit TLC’s 32-bit integer limits).
Added a standalone trace-export CLI that exported a single trace to NDJSON or to a generated TraceData module, enabling manual TLC debugging without the full runner.
Extended leanguard-run with:
- simulate-and-check vs check-only modes,
- deterministic-run guardrails for multi-threaded simulation runs (with an explicit opt-in when nondeterminism was acceptable),
- manifest-based trace discovery with a scan fallback,
- structured JSON summaries that captured per-checker argv/stdout/stderr/exit-code/runtime,
- optional semantic coverage collection (--coverage) and union/per-checker aggregation in the JSON summary.
Added an optional TLC execution path (--tlc-check) that:
- generated an isolated per-spec TLC workspace under the run’s log directory (spec + model config + generated TraceData),
- ran TLC via either java -cp <tla2tools.jar> tlc2.TLC ... or an explicit wrapper executable,
- applied trace-validation-recommended TLC settings by default (single worker + DFS queue optimization),
- parsed TLC output to compute the matched prefix length (via diameter/depth lines, with an l = ... fallback),
- reported the first failing event with (index, time_ns, event_id, kind) by cross-referencing the NDJSON export,
- exposed both command runtime and end-to-end runtime (including export + workspace preparation) to make benchmarks comparable.
Added optional peak RSS sampling (--measure-rss) for both LeanGuard checkers and TLC runs (via periodic RSS polling) so memory comparisons could be reported alongside runtimes.
Implemented protocol-specific TLA trace validators and model configurations for:
- DCQCN control-loop state updates,
- AQM witness validation (drop/ECN decisions),
- PFC pause/resume pairing and hysteresis rules,
- WFQ and DRR schedule/depart replay in scaled integer time,
- TCP CUBIC ACK growth and loss/timeout reductions (via a small fixed-point approximation with a bounded tolerance).
Added a coverage-driven test-generation CLI that:
- indexed seed configs into a local corpus (with basic protocol/feature tagging),
- fuzzed configs via controlled mutations, recorded the applied mutations, and bucketed cases as accepted/rejected,
- supported protocol-targeted campaigns (including DCQCN-specific “knob” mutations and optional calibration loops to hit requested coverpoints),
- replayed and minimized failing cases while preserving reproduction metadata,
- optionally required TLC acceptance in addition to LeanGuard acceptance for “accepted” cases.
Added experiment automation scripts that:
- benchmarked LeanGuard vs TLC across multiple scenarios and protocols,
- ran a small fault-injection agreement suite that validated accept/reject agreement and compared first-failure localization,
- exported the latest experiment JSON outputs to Markdown/CSV for inclusion in the writeup.
Added/updated documentation describing the baseline, the rescaling rules, TLC runner behavior, the coverage experiment plan, and end-to-end reproduction steps.
Updated ignore rules to keep local TLC jars, TLC-generated state directories, and Python bytecode out of version control.

How to run (examples)

Run LeanGuard checkers only

cargo run --bin leanguard-run -- \
  --config <scenario.toml> \
  --mode check-only \
  --checker-dir <path/to/lean/.lake/build/bin>

Run LeanGuard + TLC baseline

cargo run --bin leanguard-run -- \
  --config <scenario.toml> \
  --mode check-only \
  --checker-dir <path/to/lean/.lake/build/bin> \
  --tlc-check \
  --tlc-jar <path/to/tla2tools.jar>

Export a single trace for manual TLC runs

cargo run --bin days-trace-export -- --format tla --input <log_path>/<trace>_events.csv

Run test generation (seed index + fuzz)

cargo run --bin leanguard-testgen -- seed-index <seeds_dir>

cargo run --bin leanguard-testgen -- fuzz --budget 100

Notes / caveats

Stored TraceData values were rescaled to fit TLC’s integer range:
- nanoseconds → microseconds (rounded),
- bps → 10 Mbps units (rounded),
- ppb → permille units (rounded).
Added small tolerances in the specs where rescaling or float→int rounding made exact equality brittle (most notably for WFQ virtual time/finish time and for CUBIC congestion-avoidance updates).

Tests

Ran cargo test --features test -- --show-output.

• Those tables are LeanGuard (Lean checkers) vs. a TLA+/TLC trace-validation baseline on the same Days *_events.csv traces: checker_ms is the Lean checker runtime, tlc_cmd_ms is just the TLC process, and tlc_total_ms includes trace export + staging + TLC; the ratio is tlc_total/checker. What’s done - TLC baseline implemented for all 6 protocols you listed: dcqcn, aqm, pfc, wfq, drr, cubic (see task.md#L18). - CUBIC TLC spec fixed/extended so it accepts configs/cubic_simple.toml (fixed-point CA update + small cwnd_bytes tolerance; tla/CubicTrace.tla#L26). - Full benchmarks run (5 reps) and recorded in task.md#L108 (raw JSON: logs/bench_leanguard_vs_tlc_2026-01-29.json). Where the results are - Combined per-protocol table: task.md#L108. - TLC baseline overview: tla/README.md#L1. Re-run - Bench: python3 utils/bench_leanguard_vs_tlc.py --reps 5 - Tests (all passing on my run): cargo test --features test -- --show-output

Bench results (all protocols) - Updated results are in task.md:110 (raw data: logs/bench_leanguard_vs_tlc_2026-01-30.json). - We’re benchmarking 6 protocols: dcqcn, aqm, pfc, wfq, drr, cubic (AQM often appears alongside other configs because the switch emits aqm_events.csv in many runs). TLC “reject” is now principled (not diameter-only) - Added ProgressOk == IF l <= LenTrace THEN ENABLED Next ELSE TRUE to each TLC spec (e.g. tla/DcqcnTrace.tla:309) and enabled it via INVARIANT ProgressOk in each .cfg (e.g. tla/DcqcnTrace.cfg:2). - Updated leanguard-run to classify TLC invariant violations as reject and to recover the matched prefix even when TLC doesn’t print depth/ diameter (src/bin/leanguard-run.rs:790 and helpers around src/bin/leanguard-run.rs:922). Correctness / test status - Added runner-parsing coverage for invariant-based rejects and tool-wrapped output (tests/leanguard_run.rs:288, tests/leanguard_run.rs:401, tests/leanguard_run.rs:485). - cargo test --features test -- --show-output passes after these changes. What we still need (for “paper-grade” confidence) - Fault-injection agreement studies across protocols (Lean reject vs TLC reject + first-failure alignment) and memory reporting are still listed as in-progress in task.md:45.

…plumbing in preparation for coverage experiments.

…riments.

The Lean checkers now accepts: - --emit-coverpoint-catalog This prints the checker's coverpoint catalog as stable, sorted string names, one per line. Example: ```bash /Users/bli/Playground/days/lean/.lake/build/bin/dcqcn_check --emit-coverpoint-catalog /Users/bli/Playground/days/lean/.lake/build/bin/aqm_check --emit-coverpoint-catalog /Users/bli/Playground/days/lean/.lake/build/bin/pfc_check --emit-coverpoint-catalog ```

…N goal campaign. extended leanguard-testgen to support AQM goal calibration and targeted mutations in testgen.rs (line 809), testgen.rs (line 1802), testgen.rs (line 2192). New AQM result (seed dcqcn_simple.toml, goal ecn_threshold_drop_overflow): hit rate improves from 8% (iters=0) to 62% (iters=5) (testgen_dcqcn_calibration_summary.csv (line 6)), and the first time the campaign hits that coverpoint drops from 191 accepted tests (random) to 2 accepted tests (goal-calibrated) (experiments.tex (line 206)).

XindanZhang and others added 12 commits January 29, 2026 23:55

Propagate TLC baseline to testgen

b855cd2

Added fault-injection agreement and RSS sampling.

4ab186a

Ignore Python bytecode

d7a3ce6

Export experiment results

398e686

Document experiment reproduction

2c89e21

Added documentation for TLA+/TLC experiments; added coverage-related …

2ee29e7

…plumbing in preparation for coverage experiments.

Added experimental setup and procedures for the coverage-related expe…

78b6537

…riments.

Added local timings for leanguard-testgen.

1eb3ee1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added a TLA+/TLC trace-validation baseline with trace export, runner integration, and experiment tooling.#90

Added a TLA+/TLC trace-validation baseline with trace export, runner integration, and experiment tooling.#90
baochunli wants to merge 12 commits into
mainfrom
merge

baochunli commented Feb 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

baochunli commented Feb 1, 2026

Motivation

Changes

How to run (examples)

Run LeanGuard checkers only

Run LeanGuard + TLC baseline

Export a single trace for manual TLC runs

Run test generation (seed index + fuzz)

Notes / caveats

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants