
fix(orchestration): graph state is not checkpointed per-tick — mid-execution crashes lose all progress #4747

@bug-ops

Description

DagScheduler holds the entire TaskGraph in memory. GraphPersistence::save() is called at only two points in plan.rs:

  1. Defensively before scheduler_result? (on scheduler error)
  2. Terminally after the final graph is assembled
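To make the gap concrete, here is a minimal model of the two save points listed above. It is a sketch, not the code in plan.rs: `Status`, `Store`, `Exit`, `run_plan`, and the `fail_at` / `kill_at` parameters are all hypothetical stand-ins invented for illustration, and the real `GraphPersistence::save()` signature may differ.

```rust
/// Simplified stand-in for per-task state; not the real zeph type.
#[derive(Clone, Debug, PartialEq)]
enum Status {
    Pending,
    Completed,
}

/// Hypothetical persistence slot. `saved == None` means nothing was ever written.
#[derive(Default)]
struct Store {
    saved: Option<Vec<Status>>,
    saves: usize,
}

impl Store {
    fn save(&mut self, graph: &[Status]) {
        self.saved = Some(graph.to_vec());
        self.saves += 1;
    }
}

/// How the modelled run ended.
#[derive(Debug, PartialEq)]
enum Exit {
    Finished,
    SchedulerError,
    Killed,
}

/// Models a plan run with only the two save points described in the issue.
/// `fail_at` injects a scheduler error; `kill_at` simulates SIGKILL at that task.
fn run_plan(
    graph: &mut [Status],
    store: &mut Store,
    fail_at: Option<usize>,
    kill_at: Option<usize>,
) -> Exit {
    let mut scheduler_result: Result<(), String> = Ok(());

    // Scheduler loop: note that there is no save call anywhere inside it.
    for i in 0..graph.len() {
        if kill_at == Some(i) {
            // A killed process runs no further code, so neither save point is reached.
            return Exit::Killed;
        }
        if fail_at == Some(i) {
            scheduler_result = Err(format!("task {i} failed"));
            break;
        }
        graph[i] = Status::Completed;
    }

    // Save point 1: defensive save, reached only once the loop has exited with an error.
    if scheduler_result.is_err() {
        store.save(graph);
        return Exit::SchedulerError;
    }

    // Save point 2: terminal save of the final graph.
    store.save(graph);
    Exit::Finished
}
```

In this model, calling `run_plan` with `kill_at = Some(3)` on a 10-task graph returns `Exit::Killed` with `store.saves == 0`, even though three tasks completed in memory. A scheduler error at the same position still reaches the defensive save.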

A process crash during execution (between task completions) loses all intermediate task state: TaskStatus transitions (Pending → Running → Completed/Failed), retry counts, predicate outcomes, and lineage chains.

Reproduction Steps

  1. Start a multi-task plan with 10+ tasks
  2. Kill the process mid-execution (e.g. SIGKILL after 3 tasks complete)
  3. Restart and run /plan list: the graph reflects only the last terminal save, not the tasks that completed before the kill

Expected Behavior

Completed/failed task states should survive a crash. At minimum, a per-tick snapshot after each task completion would reduce the replay window.
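One possible shape for the per-tick snapshot is sketched below, under stated assumptions. `TaskStatus` and `TaskGraph` here are simplified local stand-ins for the real types, and `SnapshotStore`, `MemoryStore`, `run_with_checkpoints`, `outcomes`, and `crash_after` are hypothetical names introduced only for this example. The point it illustrates is that saving after every terminal transition bounds the loss to the task that was in flight.

```rust
/// Simplified stand-in for TaskStatus; the variants mirror the issue text.
#[derive(Clone, Debug, PartialEq)]
enum TaskStatus {
    Pending,
    Running,
    Completed,
    Failed,
}

/// Simplified stand-in for TaskGraph: one status per task, in execution order.
#[derive(Clone, Debug, Default, PartialEq)]
struct TaskGraph {
    tasks: Vec<TaskStatus>,
}

/// Hypothetical persistence seam; the real GraphPersistence::save() may look different.
trait SnapshotStore {
    fn save(&mut self, graph: &TaskGraph);
}

/// In-memory store that exists only to make the sketch observable.
#[derive(Default)]
struct MemoryStore {
    last: Option<TaskGraph>,
    saves: usize,
}

impl SnapshotStore for MemoryStore {
    fn save(&mut self, graph: &TaskGraph) {
        self.last = Some(graph.clone());
        self.saves += 1;
    }
}

/// Runs tasks in order and checkpoints after each Completed/Failed transition.
/// `outcomes[i]` decides success of task i (missing entries count as success).
/// `crash_after` simulates a kill once that many tasks have finished.
/// Returns true if the loop ran to completion.
fn run_with_checkpoints(
    graph: &mut TaskGraph,
    store: &mut dyn SnapshotStore,
    outcomes: &[bool],
    crash_after: Option<usize>,
) -> bool {
    for i in 0..graph.tasks.len() {
        if crash_after == Some(i) {
            // Simulated crash: everything up to task i-1 is already persisted.
            return false;
        }
        graph.tasks[i] = TaskStatus::Running;
        let ok = outcomes.get(i).copied().unwrap_or(true);
        graph.tasks[i] = if ok {
            TaskStatus::Completed
        } else {
            TaskStatus::Failed
        };
        // Per-tick checkpoint: persist right after the terminal transition.
        store.save(graph);
    }
    true
}
```

With a 10-task graph and `crash_after = Some(3)`, the store in this sketch holds three saves and the last snapshot shows tasks 0 to 2 in their terminal states, so a resume would only need to replay from task 3. Saving the whole graph on every tick is the simplest option; whether that cost is acceptable for large graphs is a separate design question.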

Actual Behavior

All in-flight state, and all state for tasks completed since the last save, is lost. The defensive save at scheduler_result? fires only when the scheduler loop exits, not at individual task-completion boundaries.

Environment

  • Affected: crates/zeph-orchestration/src/scheduler/, crates/zeph-core/src/agent/plan.rs
  • All execution modes

Logs / Evidence

plan.rs:292 — defensive save before scheduler_result?
plan.rs:315 — terminal authoritative save after into_graph()
No save call inside the scheduler tick loop or per-task completion handler.

Metadata

Labels

P3 (Research: medium-high complexity), enhancement (New feature or request)
