Test & Benchmark Requirements

CRG Grade: B — ACHIEVED 2026-04-04

CRG B achieved 2026-04-04: Verified proofs from 6 diverse external repos with real output.

CRG B Evidence — External Targets

Target Repo	Language	What Was Tested	Result
echidna (own proofs)	Coq	`echidna verify` on proofs/coq/basic.v	VALID: 0 goals, 43 tactics, 0 theorems
nextgen-languages/wokelang	Coq	`echidna verify` on WokeLang.v	INVALID: proof rejected
nextgen-languages/my-lang	Coq	`echidna verify` on Syntax.v	INVALID: proof rejected
maa-framework/absolute-zero	Coq	`echidna verify` on CNOCategory.v	VALID: 0 goals, 81 tactics, 0 theorems
protocol-squisher	Lean	`echidna verify` on wheelbarrow_necessity.lean	VALID: 0 goals, 0 tactics, 5 theorems
nextgen-languages/tangle	Lean	`echidna verify` on Tangle.lean	VALID: 0 goals, 0 tactics, 10 theorems

Target Details

1. echidna own Coq proofs (baseline)

Command: echidna verify /var/mnt/eclipse/repos/verification-ecosystem/echidna/proofs/coq/basic.v
Key findings: Valid proof. 43 tactics detected. Baseline verification working correctly.

2. nextgen-languages/wokelang (Coq)

Command: echidna verify /var/mnt/eclipse/repos/nextgen-languages/wokelang/docs/proofs/verification/WokeLang.v
Key findings: Proof invalid. Echidna correctly rejected an incomplete/malformed proof file. Demonstrates false-positive resilience.

3. nextgen-languages/my-lang (Coq)

Command: echidna verify /var/mnt/eclipse/repos/nextgen-languages/my-lang/proofs/verification/coq/Syntax.v
Key findings: Proof invalid. Another correctly rejected proof. Confirms echidna does not blindly accept all Coq files.

4. maa-framework/absolute-zero (Coq — category theory)

Command: echidna verify /var/mnt/eclipse/repos/maa-framework/absolute-zero/proofs/coq/category/CNOCategory.v
Key findings: Valid proof with 81 tactics. Complex category theory proof verified successfully. Demonstrates handling of advanced mathematical content.

5. protocol-squisher (Lean — wheelbarrow necessity)

Command: echidna verify /var/mnt/eclipse/repos/protocol-squisher/proofs/lean/wheelbarrow_necessity.lean
Key findings: Valid proof. 5 theorems in context, 0 tactics (declarative style). Cross-prover support (Lean vs Coq) verified.

6. nextgen-languages/tangle (Lean — language proofs)

Command: echidna verify /var/mnt/eclipse/repos/nextgen-languages/tangle/proofs/Tangle.lean
Key findings: Valid proof. 10 theorems in context. Lean parser handles different proof styles correctly.

Current State (updated 2026-04-04)

Unit tests: 756 pass / 0 fail / 16 ignored
- 556 in main crate, 200 across integration/property/aspect/E2E suites
Integration tests: full (echidnabot + prover backends)
E2E tests: COMPLETE (tests/e2e_prover_test.rs — 10 tests)
P2P property tests: COMPLETE (tests/p2p_property_tests.rs — ~1,000 proptest cases)
Aspect tests: COMPLETE (tests/aspect_tests.rs — 12 tests)
Julia smoke tests: COMPLETE (tests/julia/smoke_test.jl — 63 tests)
Shell validation: COMPLETE (tests/shell/validate_scripts.sh — 68 checks)
Benchmarks: BASELINED (benches/proof_benchmarks.rs — 13 Criterion functions)
panic-attack scan: NEVER RUN

Benchmark Status (verified 2026-04-04)

The 1,035 .v files are Coq proof files from external_corpora/ — NOT runnable benchmarks.

external_corpora/CoqGym/ — 6,678 Coq proofs (training/ML corpus data)
proofs/coq/ — 16 project Coq proof files
These are NOT V-lang files and NOT benchmark runners

Real Criterion benchmarks exist in benches/proof_benchmarks.rs:

bench_proof_state_construction (5 goal-count variants)
bench_term_construction (4 depth variants)
bench_prover_creation (8 provers)
bench_prover_detection (8 file extensions)
bench_trust_computation (3 scenarios: max_trust, single_prover, dangerous)
bench_axiom_scanning (4 patterns)
bench_mutation_generation
bench_pareto_frontier (3 point-count variants)
bench_statistics_tracking
bench_ffi_kind_mapping Run with: cargo bench

Ignored Test Investigation (9 tests)

All 9 ignored tests are in tests/integration_v1_2.rs. They are legitimately ignored — each has a // Requires <X> binary comment:

2 tests: #[ignore] // Requires ACL2 binary
2 tests: #[ignore] // Requires PVS binary
2 tests: #[ignore] // Requires HOL4 binary
3 tests: #[ignore] // Requires prover binaries

No action needed. These test real prover binary invocation and must remain skipped until ACL2/PVS/HOL4 are installed in the CI environment. The ignore reason is clearly documented.

(Note: TEST-NEEDS.md said "9 ignored" but cargo test shows 16 ignored total. The additional 7 are in tests/test_neural_integration.rs requiring the Julia server running on port 8081.)

What's Covered

Point-to-Point (P2P) — tests/p2p_property_tests.rs

Dispatch determinism: same input → same trust level (proptest, 500 cases)
ProofState serialisation: arbitrary states survive JSON roundtrip (200 cases)
Reject danger caps trust at Level2 (proptest, 200 cases)
Safe danger >= Warning danger (proptest, 200 cases)
ProverFactory correct kind (proptest, 50 cases)
AxiomTracker stateless / idempotent (proptest, 200 cases)
ProverKind JSON roundtrip (proptest, 50 cases)
TrustLevel ordering consistent with value() (proptest, 200 cases)
DispatchConfig clone preserves fields (proptest, 100 cases)

End-to-End (E2E) — tests/e2e_prover_test.rs

Aspect Tests — tests/aspect_tests.rs

Julia Smoke Tests — tests/julia/smoke_test.jl

Syntax validation: 9 top-level scripts (api_server, run_server, train_models, etc.)
Syntax validation: inference/ sub-package
EchidnaML module structure
api/ sub-package syntax
scripts/*.jl corpus scripts (32 scripts)
Directory structure (JULIA_SRC, Project.toml, entrypoints)
Tokenizer unit smoke (5 properties)
BOW vectorizer unit smoke (4 properties)

Shell Validation — tests/shell/validate_scripts.sh

Syntax check (bash -n) on all scripts/*.sh
Syntax check on tests/*.sh
SPDX header check
Banned pattern check (docker/npm/HTTP)
--help smoke test for selected scripts

Findings from shell validation:

scripts/install-provers.sh — missing SPDX header
scripts/mvp-smoke.sh — missing SPDX header
scripts/expand_training_data.sh — HTTP URL (should use HTTPS)
scripts/test-integration.sh — HTTP URL
scripts/test_integration.sh — HTTP URL

Benchmarks — benches/proof_benchmarks.rs

Core type construction (ProofState, Term)
Prover creation (8 representative backends)
Prover detection from file extension
Trust computation (3 scenarios)
Axiom scanning (4 patterns)
Mutation generation
Pareto frontier (3 sizes)
Statistics tracking
FFI kind mapping roundtrip

Still Missing

V-lang prover (6,694 files — ZERO tests)

This is Coq formal proof code in external_corpora/ (training data)
The "V" extension here means Coq, not V-lang
Coq proofs are inherently self-verifying when compiled with coqc
No dedicated test infrastructure for the Coq corpus
Status: Out of scope for this blitz (requires Coq toolchain + significant effort)

Zig FFI (34 files — 7 test files)

7 test files exist
Coverage verification pending

Idris2 ABI (42 files)

Formal proofs are self-verifying (checked at type-check time)
No runtime integration tests yet
Status: Deferred — requires Idris2 toolchain in CI

ReScript UI (33 files — ZERO tests)

Dashboard components need render tests
Status: Deferred — requires Deno + ReScript test setup

Self-Tests

panic-attack assail on own repo
Echidnabot self-scan
Prover correctness validation suite

Known Issues

/api/verify false-positive on unparseable input (2026-04-05)

Bug: POST /api/verify returns valid: true for content that a prover's parse_string cannot interpret, even when that content is syntactically invalid for the target prover.

Reproduction:

curl -X POST http://localhost:8090/api/verify \
  -H 'Content-Type: application/json' \
  -d '{"prover":"Coq","content":"absolute garbage @#$ not coq"}'
# → {"valid":true,"goals_remaining":0,"tactics_used":0}

Also affected: empty content, cross-prover content (Lean syntax to Coq), anything the per-prover parse_string cannot extract structure from.

Root cause: the verification pipeline is:

parse_string(content) → ProofState (partial parse; empty state on unrecognised input, no failure propagated).
verify_proof(&state) calls state.export() to regenerate prover source from the ProofState, writes it to a temp file, then runs the real backend binary (coqc, lean, idris2 --check).
Empty ProofState → empty exported file → prover returns exit 0 → valid: true.

Impact: any downstream system treating valid: true as ground truth will accumulate false positives. Observed externally: the hyperpolymath proof_attempts learning loop was polluted with 29 false-success rows before the bug was spotted; fix was to bypass /api/verify and shell out to prover binaries directly in the batch driver.

Idris2 has the mirror-image bug (false-negative): parse_string parses main = putStrLn "ok" but export() regenerates the body as ?main_todo (a hole), so idris2 --check always fails. See the generated file at /tmp/echidna_idris2/Verify.idr after any Idris2 API call.

Fix direction:

parse_string should return Err on unrecognised input rather than silently producing an empty state.
verify_proof for each prover should either preserve original content (write content directly, not export(state)) or treat an empty ProofState as a parse failure.
Add regression tests: verify on garbage, empty, cross-prover content must all return valid: false.

Partial fix (2026-04-05): server-level guard in prove_handler and verify_handler now returns valid: false when parse_string produces an empty ProofState (no goals, theorems, definitions, axioms, or variables) on non-empty input. Verified live:

garbage to Coq/Lean: valid: false (was true)
cross-prover (SMT to Coq): valid: false
real Coq/Z3 proofs: still work correctly The deeper fix — making each prover backend preserve original content instead of round-tripping through ProofState — remains outstanding.

Priority

CRG C COMPLETE as of 2026-04-04
Next priority: Fix 5 shell issues found by validation
V-lang prover testing remains the largest correctness gap but is out of scope until the Coq corpus toolchain is set up in CI

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Test & Benchmark Requirements

CRG Grade: B — ACHIEVED 2026-04-04

CRG B Evidence — External Targets

Target Details

Current State (updated 2026-04-04)

Benchmark Status (verified 2026-04-04)

Ignored Test Investigation (9 tests)

What's Covered

Point-to-Point (P2P) — tests/p2p_property_tests.rs

End-to-End (E2E) — tests/e2e_prover_test.rs

Aspect Tests — tests/aspect_tests.rs

Julia Smoke Tests — tests/julia/smoke_test.jl

Shell Validation — tests/shell/validate_scripts.sh

Benchmarks — benches/proof_benchmarks.rs

Still Missing

V-lang prover (6,694 files — ZERO tests)

Zig FFI (34 files — 7 test files)

Idris2 ABI (42 files)

ReScript UI (33 files — ZERO tests)

Self-Tests

Known Issues

/api/verify false-positive on unparseable input (2026-04-05)

Priority

Uh oh!

FilesExpand file tree

TEST-NEEDS.md

Latest commit

History

TEST-NEEDS.md

File metadata and controls

Test & Benchmark Requirements

CRG Grade: B — ACHIEVED 2026-04-04

CRG B Evidence — External Targets

Target Details

Current State (updated 2026-04-04)

Benchmark Status (verified 2026-04-04)

Ignored Test Investigation (9 tests)

What's Covered

Point-to-Point (P2P) — tests/p2p_property_tests.rs

End-to-End (E2E) — tests/e2e_prover_test.rs

Aspect Tests — tests/aspect_tests.rs

Julia Smoke Tests — tests/julia/smoke_test.jl

Shell Validation — tests/shell/validate_scripts.sh

Benchmarks — benches/proof_benchmarks.rs

Still Missing

V-lang prover (6,694 files — ZERO tests)

Zig FFI (34 files — 7 test files)

Idris2 ABI (42 files)

ReScript UI (33 files — ZERO tests)

Self-Tests

Known Issues

/api/verify false-positive on unparseable input (2026-04-05)

Priority