implement native durable execution layer (zeph-durable, spec 064) #4707

@bug-ops

Spec

Specification: specs/064-durable-execution/spec.md (added in commit 65f3a525 on branch feat/durable-execution-layer).

The spec was produced via a spec-driven design chain with adversarial validation: architecture → critic + security + performance review (all returned significant findings) → revision → re-validation (all cleared) → spec review. It defines 14 invariants, 15 functional requirements, 7 measurable NFRs, and 10 criterion benchmarks.

Summary

zeph-durable is a new Layer-0 in-process durable execution crate that adds crash-resumable execution to Zeph without an external runtime (we evaluated and rejected Restate-as-core; it remains an optional backend). It journals execution control flow and replays it after a crash, turning today's "discard orphaned tool pairs" behavior into a resume.

Core:

  • Append-only journal with deterministic replay (StepId by call position + ReplayDivergence fingerprint guard).
  • DurableStep with per-EffectClass exactly-once semantics (two-phase EffectIntent/StepResult, per-class OnAmbiguous).
  • A background JournalWriter actor on a dedicated durable.db pool (fsync off the hot path).
  • Vault-keyed XChaCha20-Poly1305 payload cipher (AAD-bound).
  • Durable promises (resolver-token auth) and timers.
  • A sealed ExecutionBackend trait (always-on LocalBackend + optional restate backend in the server bundle).
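To make the replay model concrete, here is a minimal in-memory sketch of position-based StepId plus a fingerprint guard. It is an illustration under assumptions, not the spec's API: the `step`, `resume`, and `into_journal` methods, the string payloads, and the FNV-1a fingerprint are all hypothetical, and the real crate journals to durable.db rather than a `Vec`.

```rust
/// Position of a step within one execution (hypothetical representation).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StepId(pub usize);

/// One journaled step: where it ran, what it was, and what it returned.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JournalEntry {
    pub id: StepId,
    pub fingerprint: u64,
    pub result: String,
}

/// Raised when the code path on replay no longer matches the journal.
#[derive(Debug, PartialEq, Eq)]
pub struct ReplayDivergence {
    pub id: StepId,
    pub expected: u64,
    pub found: u64,
}

/// FNV-1a over the step label; chosen here only because it is stable across
/// runs. The actual fingerprint function is an assumption.
fn fingerprint(label: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in label.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

pub struct DurableContext {
    journal: Vec<JournalEntry>,
    cursor: usize,
}

impl DurableContext {
    /// Start (or restart after a crash) from an existing journal.
    pub fn resume(journal: Vec<JournalEntry>) -> Self {
        Self { journal, cursor: 0 }
    }

    /// Run a step: replay the recorded result if one exists at this call
    /// position, otherwise execute the effect and append it to the journal.
    pub fn step<F>(&mut self, label: &str, effect: F) -> Result<String, ReplayDivergence>
    where
        F: FnOnce() -> String,
    {
        let id = StepId(self.cursor);
        let found = fingerprint(label);
        if let Some(entry) = self.journal.get(self.cursor) {
            if entry.fingerprint != found {
                return Err(ReplayDivergence { id, expected: entry.fingerprint, found });
            }
            let recorded = entry.result.clone();
            self.cursor += 1;
            return Ok(recorded);
        }
        let result = effect();
        self.journal.push(JournalEntry { id, fingerprint: found, result: result.clone() });
        self.cursor += 1;
        Ok(result)
    }

    pub fn into_journal(self) -> Vec<JournalEntry> {
        self.journal
    }
}
```

On resume, steps that already have a journal entry return the recorded value without re-running their effect; execution continues live from the first position with no entry. A different label at an already-journaled position surfaces as `ReplayDivergence` instead of silently replaying the wrong result.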

Scope — implement in phases (separate PRs)

  1. zeph-durable core — journal, DurableContext/DurableStep, JournalWriter actor, EffectClass/OnAmbiguous, IdempotencyKey, ReplayDivergence, PayloadCipher, DurablePromise/DurableTimer, sealed ExecutionBackend + LocalBackend, schema in zeph-db/migrations, zeph durable CLI, config [durable], --init/--migrate-config.
  2. P1 — agent tool loop (highest reliability value). Explicit change in zeph-agent-tools/tier_loop.rs (LLM call + tool dispatch wrapped as durable steps). Triggers the LLM-serialization live-API gate — requires a live multi-turn + tool-call session test before merge.
  3. P2 — orchestration /plan resume replan-counter restore (narrow scope; not auto-crash-recovery).
  4. P3 — scheduler exactly-once job fire.
  5. P4 — subagent durable promise spawn/await.

Each adapter is opt-in via config and degrades to today's behavior when disabled.
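The sealed ExecutionBackend from phase 1, with an always-on LocalBackend and a feature-gated Restate backend, could follow the usual Rust sealed-trait pattern. The sketch below is an assumption about shape only: the `name` method, the `RestateBackend` type name, and `default_backend` are hypothetical and stand in for whatever the spec actually defines.

```rust
// Private module: downstream crates cannot name `Sealed`, so they cannot
// implement `ExecutionBackend` themselves.
mod sealed {
    pub trait Sealed {}
}

/// Backend trait; the single method here is a placeholder.
pub trait ExecutionBackend: sealed::Sealed {
    fn name(&self) -> &'static str;
}

/// Always compiled in.
pub struct LocalBackend;

impl sealed::Sealed for LocalBackend {}

impl ExecutionBackend for LocalBackend {
    fn name(&self) -> &'static str {
        "local"
    }
}

/// Only exists when the `restate` Cargo feature is enabled, so a default
/// build carries no Restate code or dependency.
#[cfg(feature = "restate")]
pub struct RestateBackend;

#[cfg(feature = "restate")]
impl sealed::Sealed for RestateBackend {}

#[cfg(feature = "restate")]
impl ExecutionBackend for RestateBackend {
    fn name(&self) -> &'static str {
        "restate"
    }
}

/// With no explicit selection, fall back to the local backend.
pub fn default_backend() -> Box<dyn ExecutionBackend> {
    Box::new(LocalBackend)
}
```

Sealing keeps the set of backends closed to this crate, which lets invariants about journaling be checked against a known list of implementations.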

Key acceptance criteria (testable)

  • Crash mid-turn (mid tool/LLM step) resumes from the first un-journaled step on restart; no double-emission of already-printed output; no double-persisted messages (invariant 057).
  • bench_step_run_exactly_once_n at N=5 completes in <= 5 ms on CI SSD (hot-path regression gate).
  • ExactlyOnceGuarded effect crashed in the ambiguous window resolves per OnAmbiguous (paid-LLM Skip; destructive Fail + audit); destructive step with unspecified policy is a construction-time error.
  • Journal payloads are AEAD-encrypted at rest; a tampered journal entry fails authentication on replay (no forged tool/LLM result injection).
  • durable.db lives on its own pool; durable schema is applied via zeph_db::run_migrations with no sqlx::migrate! in zeph-durable (single migration runner, invariant 031).
  • Restate backend only compiles under the restate feature; absent it, LocalBackend is the default and the binary has no Restate dependency.
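The ambiguous-window criterion above can be sketched as a small decision function. This is a hypothetical model, not the spec's types: `EffectKind`, `GuardedStep`, `Journaled`, `Recovery`, and `recover` are illustrative names, and the default of Skip for paid-LLM steps with no explicit policy is an assumption.

```rust
/// What to do when a crash leaves an intent without a matching result.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnAmbiguous {
    Skip,
    Fail,
}

/// Hypothetical effect kinds taken from the two examples in the criteria.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EffectKind {
    PaidLlm,
    Destructive,
}

/// Construction-time error: a destructive step must state its policy.
#[derive(Debug, PartialEq, Eq)]
pub struct MissingPolicy;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GuardedStep {
    pub kind: EffectKind,
    pub policy: OnAmbiguous,
}

impl GuardedStep {
    pub fn new(kind: EffectKind, policy: Option<OnAmbiguous>) -> Result<Self, MissingPolicy> {
        let policy = match (kind, policy) {
            (_, Some(explicit)) => explicit,
            // Assumed default: never re-bill an LLM call that may have run.
            (EffectKind::PaidLlm, None) => OnAmbiguous::Skip,
            (EffectKind::Destructive, None) => return Err(MissingPolicy),
        };
        Ok(Self { kind, policy })
    }
}

/// What the journal holds for a step after a restart (two-phase write).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Journaled {
    Nothing,    // no EffectIntent: the effect never started
    IntentOnly, // EffectIntent without StepResult: the ambiguous window
    Result,     // StepResult present: the effect completed
}

#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    Execute,
    Replay,
    Skip,
    FailWithAudit,
}

/// Decide how to handle one step on restart.
pub fn recover(step: GuardedStep, state: Journaled) -> Recovery {
    match (state, step.policy) {
        (Journaled::Nothing, _) => Recovery::Execute,
        (Journaled::Result, _) => Recovery::Replay,
        (Journaled::IntentOnly, OnAmbiguous::Skip) => Recovery::Skip,
        (Journaled::IntentOnly, OnAmbiguous::Fail) => Recovery::FailWithAudit,
    }
}
```

Only the intent-without-result state consults the policy; a step with no intent is safe to run, and a step with a result is safe to replay.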

Suggested next step

Run /rust-agents:team-develop new-feature on this issue, starting with phase 1 (zeph-durable core). The architect and developer will pick up specs/064-durable-execution/spec.md automatically. Implement phases 2-5 as follow-up PRs gated on phase 1.

Housekeeping (separate, out of scope)

specs/ has pre-existing duplicate numbers (two 057-*, two 058-*). An automated spec-number picker could re-collide; the autoskill duplicates should be renumbered in a separate task.

Labels: P2 (High value, medium complexity), epic (Milestone-level tracking issue), feature (New functionality)
