Churn simulator: validate 80% completion at 30% churn over 72h #51

@jeremymanning

Description

Per spec T144c and whitepaper Phase 2 requirement: "Over a 72-hour continuous run, 80% of submitted test jobs complete correctly with 30% simulated node churn."

Requirements

  • Build configurable churn simulator: random node kill/rejoin at configurable rate
  • Integrate with multi-node test harness
  • Track job completion rates under churn
  • Validate checkpoint/resume works across node failures
  • Run for configurable duration (target: 72 hours for Phase 2)
  • Report completion rate, data loss events, and recovery metrics
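
The kill/rejoin requirement above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the issue does not define "30% churn", so the sketch takes it to mean that 30% of the currently live nodes are killed on each tick. The names `ChurnSimulator`, `downtime_ticks`, and `step` are hypothetical, and real node kill/rejoin hooks for the test harness are left out.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ChurnSimulator:
    """Toy churn model; it only tracks which nodes are up or down."""
    node_ids: list
    churn_rate: float = 0.30   # ASSUMPTION: fraction of live nodes killed per tick
    downtime_ticks: int = 2    # ASSUMPTION: ticks a killed node stays down
    seed: int = 0              # fixed seed so a run can be replayed
    down: dict = field(default_factory=dict)    # node_id -> ticks until rejoin
    events: list = field(default_factory=list)  # (tick, "kill" | "rejoin", node_id)

    def __post_init__(self):
        if not 0.0 <= self.churn_rate <= 1.0:
            raise ValueError("churn_rate must be between 0 and 1")
        self.rng = random.Random(self.seed)

    def step(self, tick):
        """Advance one tick and return the list of live node ids."""
        # Rejoin nodes whose downtime has elapsed.
        for node in list(self.down):
            self.down[node] -= 1
            if self.down[node] <= 0:
                del self.down[node]
                self.events.append((tick, "rejoin", node))
        # Kill a churn_rate fraction of the nodes that are currently live.
        live = [n for n in self.node_ids if n not in self.down]
        n_kill = round(self.churn_rate * len(live))
        for node in self.rng.sample(live, n_kill):
            self.down[node] = self.downtime_ticks
            self.events.append((tick, "kill", node))
        return [n for n in self.node_ids if n not in self.down]


if __name__ == "__main__":
    sim = ChurnSimulator(node_ids=[f"node-{i}" for i in range(20)])
    for tick in range(72):  # e.g. one tick per simulated hour
        live = sim.step(tick)
    print(len(sim.events), "kill/rejoin events recorded")
```

Seeding the generator keeps a failing run reproducible, and the `events` log could feed the recovery metrics listed above.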

Success Criteria

  • Simulator can kill/rejoin nodes at configurable rate (target: 30% churn)
  • Job completion rate ≥ 80% at 30% churn over 72 hours
  • Zero data loss events during churn
  • Checkpoint/resume works correctly across node failures
  • Metrics reported: completion rate, recovery time, data loss count
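
As a rough sketch of how the criteria above might be checked at the end of a run, the following aggregates raw counts into a pass/fail verdict. The `RunReport` class and its field names are hypothetical, not part of the spec. The thresholds (80% completion, zero data loss) come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class RunReport:
    """Hypothetical container for the metrics named in the success criteria."""
    jobs_submitted: int
    jobs_completed_correctly: int
    data_loss_events: int
    recovery_times_s: list = field(default_factory=list)  # seconds per recovery

    @property
    def completion_rate(self):
        if self.jobs_submitted == 0:
            return 0.0
        return self.jobs_completed_correctly / self.jobs_submitted

    def passes(self, min_completion=0.80):
        # Pass only with at least 80% completion AND zero data loss events.
        return (self.completion_rate >= min_completion
                and self.data_loss_events == 0)

    def summary(self):
        mean_recovery = (sum(self.recovery_times_s) / len(self.recovery_times_s)
                         if self.recovery_times_s else None)
        return {
            "completion_rate": self.completion_rate,
            "data_loss_count": self.data_loss_events,
            "mean_recovery_time_s": mean_recovery,
            "passed": self.passes(),
        }


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the output.
    report = RunReport(jobs_submitted=1000, jobs_completed_correctly=812,
                       data_loss_events=0, recovery_times_s=[4.0, 6.0])
    print(report.summary())
```

Treating data loss as a hard failure, independent of the completion rate, mirrors how the two criteria are listed separately.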

Testing (Principle V)

  • Run on 20+ node testbed with real hardware
  • At a 30% churn rate, verify that at least 80% of jobs complete
  • Verify zero data loss during churn
  • Measure checkpoint/resume latency under various churn rates
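
For the latency measurement across churn rates, one possible way to summarize the collected samples is shown below. It assumes the harness already records checkpoint/resume latencies in seconds, grouped by churn rate. The function names and the choice of median and nearest-rank 95th percentile are assumptions, not requirements from the spec.

```python
import math
from statistics import median


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, for 0 < p <= 100."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize_latencies(samples_by_rate):
    """Map each churn rate to the sample count, median, and p95 latency."""
    return {
        rate: {
            "n": len(samples),
            "median_s": median(samples),
            "p95_s": percentile(samples, 95),
        }
        for rate, samples in sorted(samples_by_rate.items())
        if samples  # skip churn rates with no recorded samples
    }


if __name__ == "__main__":
    # Made-up latencies in seconds, keyed by churn rate.
    observed = {0.10: [1.0, 1.2, 0.9], 0.30: [2.0, 4.0, 6.0]}
    for rate, stats in summarize_latencies(observed).items():
        print(f"churn={rate:.0%}", stats)
```

Reporting a tail percentile alongside the median helps show whether a few slow recoveries dominate at higher churn rates.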
