Skip to content

Added a TLA+/TLC trace-validation baseline with trace export, runner integration, and experiment tooling.#90

Open
baochunli wants to merge 12 commits into
mainfrom
merge
Open

Added a TLA+/TLC trace-validation baseline with trace export, runner integration, and experiment tooling.#90
baochunli wants to merge 12 commits into
mainfrom
merge

Conversation

@baochunli
Copy link
Copy Markdown
Collaborator

This PR added a reproducible TLA+/TLC trace-validation baseline for Days and integrated it into the LeanGuard runner + test-generation workflows so we can do an apples-to-apples comparison on the same canonicalized event traces (same ordering assumptions, same post-state “witness” fields, and comparable diagnostics).

Motivation

  • Established a baseline that used the standard, widely-available TLC workflow (released tla2tools.jar) rather than relying on research-only NDJSON ingestion modules, so the comparison stayed reproducible.
  • Matched LeanGuard’s “equal-information” checking style by validating against the same per-event post-state snapshots that the Lean checkers consume (instead of giving TLC extra inference power by omitting fields).

Changes

  • Introduced a trace export pipeline that canonicalized *_events.csv rows by (time_ns, event_id), rejected duplicate keys, and produced:

    • lossless NDJSON (for diagnostics and precise “first failing row” reporting), and
    • a generated TraceData TLA module (rescales large numeric fields to fit TLC’s 32-bit integer limits).
  • Added a standalone trace-export CLI that exported a single trace to NDJSON or to a generated TraceData module, enabling manual TLC debugging without the full runner.

  • Extended leanguard-run with:

    • simulate-and-check vs check-only modes,
    • deterministic-run guardrails for multi-threaded simulation runs (with an explicit opt-in when nondeterminism was acceptable),
    • manifest-based trace discovery with a scan fallback,
    • structured JSON summaries that captured per-checker argv/stdout/stderr/exit-code/runtime,
    • optional semantic coverage collection (--coverage) and union/per-checker aggregation in the JSON summary.
  • Added an optional TLC execution path (--tlc-check) that:

    • generated an isolated per-spec TLC workspace under the run’s log directory (spec + model config + generated TraceData),
    • ran TLC via either java -cp <tla2tools.jar> tlc2.TLC ... or an explicit wrapper executable,
    • applied trace-validation-recommended TLC settings by default (single worker + DFS queue optimization),
    • parsed TLC output to compute the matched prefix length (via diameter/depth lines, with an l = ... fallback),
    • reported the first failing event with (index, time_ns, event_id, kind) by cross-referencing the NDJSON export,
    • exposed both command runtime and end-to-end runtime (including export + workspace preparation) to make benchmarks comparable.
  • Added optional peak RSS sampling (--measure-rss) for both LeanGuard checkers and TLC runs (via periodic RSS polling) so memory comparisons could be reported alongside runtimes.

  • Implemented protocol-specific TLA trace validators and model configurations for:

    • DCQCN control-loop state updates,
    • AQM witness validation (drop/ECN decisions),
    • PFC pause/resume pairing and hysteresis rules,
    • WFQ and DRR schedule/depart replay in scaled integer time,
    • TCP CUBIC ACK growth and loss/timeout reductions (via a small fixed-point approximation with a bounded tolerance).
  • Added a coverage-driven test-generation CLI that:

    • indexed seed configs into a local corpus (with basic protocol/feature tagging),
    • fuzzed configs via controlled mutations, recorded the applied mutations, and bucketed cases as accepted/rejected,
    • supported protocol-targeted campaigns (including DCQCN-specific “knob” mutations and optional calibration loops to hit requested coverpoints),
    • replayed and minimized failing cases while preserving reproduction metadata,
    • optionally required TLC acceptance in addition to LeanGuard acceptance for “accepted” cases.
  • Added experiment automation scripts that:

    • benchmarked LeanGuard vs TLC across multiple scenarios and protocols,
    • ran a small fault-injection agreement suite that validated accept/reject agreement and compared first-failure localization,
    • exported the latest experiment JSON outputs to Markdown/CSV for inclusion in the writeup.
  • Added/updated documentation describing the baseline, the rescaling rules, TLC runner behavior, the coverage experiment plan, and end-to-end reproduction steps.

  • Updated ignore rules to keep local TLC jars, TLC-generated state directories, and Python bytecode out of version control.

How to run (examples)

Run LeanGuard checkers only

cargo run --bin leanguard-run -- \
  --config <scenario.toml> \
  --mode check-only \
  --checker-dir <path/to/lean/.lake/build/bin>

Run LeanGuard + TLC baseline

cargo run --bin leanguard-run -- \
  --config <scenario.toml> \
  --mode check-only \
  --checker-dir <path/to/lean/.lake/build/bin> \
  --tlc-check \
  --tlc-jar <path/to/tla2tools.jar>

Export a single trace for manual TLC runs

cargo run --bin days-trace-export -- --format tla --input <log_path>/<trace>_events.csv

Run test generation (seed index + fuzz)

cargo run --bin leanguard-testgen -- seed-index <seeds_dir>

cargo run --bin leanguard-testgen -- fuzz --budget 100

Notes / caveats

  • Stored TraceData values were rescaled to fit TLC’s integer range:

    • nanoseconds → microseconds (rounded),
    • bps → 10 Mbps units (rounded),
    • ppb → permille units (rounded).
  • Added small tolerances in the specs where rescaling or float→int rounding made exact equality brittle (most notably for WFQ virtual time/finish time and for CUBIC congestion-avoidance updates).

Tests

  • Ran cargo test --features test -- --show-output.

XindanZhang and others added 12 commits January 29, 2026 23:55
• Those tables are LeanGuard (Lean checkers) vs. a TLA+/TLC trace-validation baseline on the same Days *_events.csv traces: checker_ms is the Lean checker runtime,
  tlc_cmd_ms is just the TLC process, and tlc_total_ms includes trace export + staging + TLC; the ratio is tlc_total/checker.

  What’s done

  - TLC baseline implemented for all 6 protocols you listed: dcqcn, aqm, pfc, wfq, drr, cubic (see task.md#L18).
  - CUBIC TLC spec fixed/extended so it accepts configs/cubic_simple.toml (fixed-point CA update + small cwnd_bytes tolerance; tla/CubicTrace.tla#L26).
  - Full benchmarks run (5 reps) and recorded in task.md#L108 (raw JSON: logs/bench_leanguard_vs_tlc_2026-01-29.json).

  Where the results are

  - Combined per-protocol table: task.md#L108.
  - TLC baseline overview: tla/README.md#L1.

  Re-run

  - Bench: python3 utils/bench_leanguard_vs_tlc.py --reps 5
  - Tests (all passing on my run): cargo test --features test -- --show-output
 Bench results (all protocols)

  - Updated results are in task.md:110 (raw data: logs/bench_leanguard_vs_tlc_2026-01-30.json).
  - We’re benchmarking 6 protocols: dcqcn, aqm, pfc, wfq, drr, cubic (AQM often appears alongside other configs because the switch emits
    aqm_events.csv in many runs).

  TLC “reject” is now principled (not diameter-only)

  - Added ProgressOk == IF l <= LenTrace THEN ENABLED Next ELSE TRUE to each TLC spec (e.g. tla/DcqcnTrace.tla:309) and enabled it via INVARIANT
    ProgressOk in each .cfg (e.g. tla/DcqcnTrace.cfg:2).
  - Updated leanguard-run to classify TLC invariant violations as reject and to recover the matched prefix even when TLC doesn’t print depth/
    diameter (src/bin/leanguard-run.rs:790 and helpers around src/bin/leanguard-run.rs:922).

  Correctness / test status

  - Added runner-parsing coverage for invariant-based rejects and tool-wrapped output (tests/leanguard_run.rs:288, tests/leanguard_run.rs:401,
    tests/leanguard_run.rs:485).
  - cargo test --features test -- --show-output passes after these changes.

  What we still need (for “paper-grade” confidence)

  - Fault-injection agreement studies across protocols (Lean reject vs TLC reject + first-failure alignment) and memory reporting are still
    listed as in-progress in task.md:45.
…plumbing in preparation for coverage experiments.
The Lean checkers now accepts:

- --emit-coverpoint-catalog

This prints the checker's coverpoint catalog as stable, sorted string
names, one per line.

Example:

 ```bash
   /Users/bli/Playground/days/lean/.lake/build/bin/dcqcn_check
--emit-coverpoint-catalog
   /Users/bli/Playground/days/lean/.lake/build/bin/aqm_check
--emit-coverpoint-catalog
   /Users/bli/Playground/days/lean/.lake/build/bin/pfc_check
--emit-coverpoint-catalog
 ```
…N goal campaign.

extended leanguard-testgen to support AQM goal calibration and targeted
mutations in testgen.rs (line 809), testgen.rs (line 1802), testgen.rs
(line 2192).

New AQM result (seed dcqcn_simple.toml, goal
ecn_threshold_drop_overflow): hit rate improves from 8% (iters=0) to 62%
(iters=5) (testgen_dcqcn_calibration_summary.csv (line 6)), and the
first time the campaign hits that coverpoint drops from 191 accepted
tests (random) to 2 accepted tests (goal-calibrated) (experiments.tex
(line 206)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants