feat(kiloclaw): add unexpected stop recovery #1993

Merged: pandemicsyn merged 4 commits into main from florian/chore/recover-stopped on Apr 6, 2026

@pandemicsyn pandemicsyn commented Apr 3, 2026

Summary

Add unexpected-stop recovery support to KiloClaw and surface that lifecycle in the admin and user UI.

This changes the worker to move an unexpectedly stopped Fly machine into a dedicated recovering state, attempt a one-shot recovery by relocating onto a replacement volume and machine, retain the old volume only when snapshots exist, and clean that retained volume up later. Recovery is now triggered on the first observed Fly stopped state while the DO still believes the instance is running; Fly created no longer participates in this path.
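The trigger condition can be sketched as a small pure function. This is only an illustration, not the code in reconcile.ts: the type names DoStatus and FlyState, the function name shouldTriggerUnexpectedStopRecovery, and the exact set of state values are assumptions.

```typescript
// Hypothetical types for illustration; the real definitions live in the PR's source files.
type DoStatus = "running" | "recovering" | "stopped";
type FlyState = "created" | "started" | "stopped";

// Recovery fires only when the DO still believes the instance is running
// while Fly reports the machine as stopped. Fly "created" is ignored.
function shouldTriggerUnexpectedStopRecovery(doStatus: DoStatus, flyState: FlyState): boolean {
  return doStatus === "running" && flyState === "stopped";
}

console.log(shouldTriggerUnexpectedStopRecovery("running", "stopped")); // true
console.log(shouldTriggerUnexpectedStopRecovery("running", "created")); // false
```

Keeping the check pure makes the "first observed stopped state" rule easy to unit test without touching the Fly API.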

Old volumes without snapshots are deleted immediately after a successful cutover; old volumes with snapshots are retained for 7 days and then cleaned up automatically, with an admin override to delete them early. The PR also adds admin-facing visibility and cleanup controls for the retained recovery volume, plus user-facing action gating so that recovery is treated as a busy machine lifecycle state.
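A minimal sketch of the retention rule, under the assumption that the decision depends only on the snapshot count and the current time. The names RETENTION_MS, decideOldVolumeDisposition, and isRetainedVolumeDue are hypothetical and do not come from the PR.

```typescript
// 7 days in milliseconds, matching the retention window described above.
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical result shape: either delete right away or retain until a deadline.
type OldVolumeDisposition =
  | { kind: "delete_now" }
  | { kind: "retain"; deleteAfter: number };

// Called after a successful cutover to decide what happens to the old volume.
function decideOldVolumeDisposition(snapshotCount: number, nowMs: number): OldVolumeDisposition {
  if (snapshotCount === 0) return { kind: "delete_now" };
  return { kind: "retain", deleteAfter: nowMs + RETENTION_MS };
}

// A retained volume is due for cleanup once the window has elapsed,
// or earlier if an admin explicitly asks for deletion.
function isRetainedVolumeDue(deleteAfter: number, nowMs: number, adminOverride = false): boolean {
  return adminOverride || nowMs >= deleteAfter;
}
```

Storing an absolute deleteAfter timestamp, rather than a creation time plus a duration, would let a periodic alarm check the deadline with a single comparison.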

Architecturally, the unexpected-stop recovery flow is now extracted out of the instance DO class into a dedicated kiloclaw-instance/recovery.ts module, while index.ts remains the dispatcher/orchestrator. The recovery path was also hardened so timeout and failure cleanup is shared, alarms do not race active recovery by deleting the pending recovery volume, and retained-volume cleanup verifies sandbox ownership before force-destroying any attached machine.
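One way to picture the "alarms do not race active recovery" guard is a filter that shields the pending recovery volume from orphan cleanup while recovery is in flight. This is a sketch under assumptions: pendingRecoveryVolumeId appears in the logs for this PR, but the InstanceState shape and the volumesSafeToDelete helper are hypothetical.

```typescript
// Hypothetical slice of DO state; only the fields needed for the guard.
interface InstanceState {
  status: "running" | "recovering" | "stopped";
  pendingRecoveryVolumeId: string | null;
}

// Returns the candidate volume IDs the alarm may delete.
// While recovery is active, the pending recovery volume is never included.
function volumesSafeToDelete(state: InstanceState, orphanCandidates: string[]): string[] {
  return orphanCandidates.filter(
    (id) => !(state.status === "recovering" && id === state.pendingRecoveryVolumeId),
  );
}

const recoveringState: InstanceState = {
  status: "recovering",
  pendingRecoveryVolumeId: "vol_pending",
};
console.log(volumesSafeToDelete(recoveringState, ["vol_pending", "vol_old"])); // [ 'vol_old' ]
```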

This also adds explicit Analytics Engine lifecycle events for the new path: recovery started, recovery succeeded, and recovery failed.
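The three lifecycle events could be modeled as a closed set so that callers cannot emit an unknown event name. The event names below are the ones this PR emits; the RecoveryPhase type, the RecoveryEvent shape, and the buildRecoveryEvent helper are hypothetical, and the sketch only builds the payload rather than writing to Analytics Engine.

```typescript
type RecoveryPhase = "started" | "succeeded" | "failed";

// Event names as emitted by this PR.
const RECOVERY_EVENT_NAMES: Record<RecoveryPhase, string> = {
  started: "instance.unexpected_stop_recovery_started",
  succeeded: "instance.unexpected_stop_recovery_succeeded",
  failed: "instance.unexpected_stop_recovery_failed",
};

// Hypothetical payload shape, loosely modeled on the fields visible in the example logs.
interface RecoveryEvent {
  message: string;
  label: string;
  sandboxId: string;
  durationMs?: number;
}

// Builds the payload for one lifecycle event. A duration is attached only
// for terminal phases (an assumption; "started" has nothing to measure yet).
function buildRecoveryEvent(
  phase: RecoveryPhase,
  sandboxId: string,
  label: string,
  durationMs?: number,
): RecoveryEvent {
  const event: RecoveryEvent = { message: RECOVERY_EVENT_NAMES[phase], label, sandboxId };
  if (phase !== "started" && durationMs !== undefined) event.durationMs = durationMs;
  return event;
}
```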

Verification

Example run:

{"tag":"reconcile","reason":"alarm","action":"unexpected_stop_recovery_trigger","old_state":"running","new_state":"recovering","fly_state":"stopped","fail_count":1,"value":1}
{"tag":"kiloclaw_do","level":"info","message":"instance.unexpected_stop_recovery_started","status":"recovering","label":"alarm_stopped","value":1,"userId":"7c2f4f32-1ef0-43cb-84bb-0b51ad5eb7bf","sandboxId":"ki_f4c9f65e4c964f0b8afccd507426df09","flyMachineId":"2869664c776038","flyRegion":"dfw","flyAppName":"dev-inst-71d24ed862ce1c18dddc"}
[DO] buildUserEnvVars: minted fresh API key, expires: 2026-05-05T23:43:40.000Z
[DO] Created Fly Machine: 2861e46b955438 region: sjc
▲ [WARNING] {"tag":"kiloclaw_do","level":"warn","message":"unexpected stop recovery timed out waiting for replacement machine startup; reconcile will continue","error":{"name":"FlyApiError","message":"Fly API waitForState(started) failed (408): {\"error\":\"deadline_exceeded: machine failed to reach desired state, started, currently created\"}\n","stack":"FlyApiError: Fly API waitForState(started) failed (408): {\"error\":\"deadline_exceeded: machine failed to reach desired state, started, currently created\"}\n\n    at assertOk (file:///Users/syn/projects/cloud-alt/kiloclaw/.wrangler/tmp/dev-Kp3d85/index.js:41992:11)\n    at async waitForState (file:///Users/syn/projects/cloud-alt/kiloclaw/.wrangler/tmp/dev-Kp3d85/index.js:42054:3)\n    at async createNewMachine (file:///Users/syn/projects/cloud-alt/kiloclaw/.wrangler/tmp/dev-Kp3d85/index.js:43234:3)\n    at async runUnexpectedStopRecoveryInBackground (file:///Users/syn/projects/cloud-alt/kiloclaw/.wrangler/tmp/dev-Kp3d85/index.js:46094:7)\n    at async KiloClawInstance.recoverUnexpectedStopInBackground (file:///Users/syn/projects/cloud-alt/kiloclaw/.wrangler/tmp/dev-Kp3d85/index.js:47808:5)","status":408,"body":"{\"error\":\"deadline_exceeded: machine failed to reach desired state, started, currently created\"}\n"},"flyMachineId":"2861e46b955438","pendingRecoveryVolumeId":"vol_v8e3o0k61gk7z3lv","userId":"7c2f4f32-1ef0-43cb-84bb-0b51ad5eb7bf","sandboxId":"ki_f4c9f65e4c964f0b8afccd507426df09","flyRegion":"sjc","flyAppName":"dev-inst-71d24ed862ce1c18dddc"}


{"tag":"reconcile","reason":"alarm","action":"unexpected_stop_recovery_machine_started","machine_id":"2861e46b955438","old_state":"recovering","new_state":"running"}
[DO] Gateway health probe passed (state: running, root: 401 )
{"tag":"reconcile","reason":"unexpected_stop_recovery_immediate_cleanup","action":"delete_volume","fly_app_name":"dev-inst-71d24ed862ce1c18dddc","volume_id":"vol_vgn2lk8o0wk8gm04"}
{"tag":"kiloclaw_do","level":"info","message":"instance.unexpected_stop_recovery_succeeded","status":"running","label":"alarm_relocated","durationMs":145463,"userId":"7c2f4f32-1ef0-43cb-84bb-0b51ad5eb7bf","sandboxId":"ki_f4c9f65e4c964f0b8afccd507426df09","flyMachineId":"2861e46b955438","flyRegion":"sjc","flyAppName":"dev-inst-71d24ed862ce1c18dddc"}
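The run above shows the startup wait timing out with a 408 while the instance stays in recovering, after which the reconcile loop observes the replacement machine as started and completes the transition to running. A rough sketch of that handoff decision follows. The classifyStartupError helper and the StartupOutcome values are hypothetical, and treating every non-408 error as a recovery failure is an assumption rather than something the logs confirm.

```typescript
// Hypothetical outcomes for an error raised while waiting for the replacement machine.
type StartupOutcome = "handoff_to_reconcile" | "fail_recovery";

// A 408 from the startup wait means the machine may still come up, so the
// instance stays in "recovering" and reconcile finishes the job later.
function classifyStartupError(err: unknown): StartupOutcome {
  const status =
    typeof err === "object" && err !== null ? (err as { status?: unknown }).status : undefined;
  return status === 408 ? "handoff_to_reconcile" : "fail_recovery";
}

console.log(classifyStartupError({ name: "FlyApiError", status: 408 })); // handoff_to_reconcile
console.log(classifyStartupError(new Error("boom"))); // fail_recovery
```

Checking the status structurally, instead of with instanceof, keeps the sketch independent of how the real error class is defined.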

Visual Changes

Reviewer Notes

  • The worker logic change is split across reconcile.ts, index.ts, fly-machines.ts, and the new recovery.ts; recovery.ts is the best place to review the new lifecycle end-to-end.
  • New AE lifecycle events are emitted for instance.unexpected_stop_recovery_started, instance.unexpected_stop_recovery_succeeded, and instance.unexpected_stop_recovery_failed.
  • Unexpected-stop recovery is triggered only for Fly stopped while the DO still thinks the instance is running; Fly created is no longer treated as an unexpected-stop signal.
  • Old recovery source volumes are deleted immediately when they have no snapshots; if snapshots exist, the old volume is retained for 7 days, then deleted automatically by the alarm loop unless an admin deletes it earlier.
  • Retained recovery-volume cleanup verifies kiloclaw_sandbox_id before force-destroying an attached machine.
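The ownership check in the last note can be sketched as follows. The kiloclaw_sandbox_id key comes from this PR, but where it is stored (machine metadata is assumed here), the MachineInfo shape, and the canForceDestroyAttachedMachine helper are all hypothetical.

```typescript
// Hypothetical view of a machine attached to a retained volume.
interface MachineInfo {
  id: string;
  metadata: Record<string, string | undefined>;
}

// Force-destroy is allowed only when the attached machine is tagged with
// the same sandbox ID as the instance performing the cleanup.
function canForceDestroyAttachedMachine(machine: MachineInfo, expectedSandboxId: string): boolean {
  return machine.metadata["kiloclaw_sandbox_id"] === expectedSandboxId;
}

const attached: MachineInfo = { id: "machine_1", metadata: { kiloclaw_sandbox_id: "ki_owner" } };
console.log(canForceDestroyAttachedMachine(attached, "ki_owner")); // true
console.log(canForceDestroyAttachedMachine(attached, "ki_other")); // false
```

A machine with no sandbox ID at all fails the check, so untagged machines are never force-destroyed by this path in the sketch.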

@pandemicsyn pandemicsyn marked this pull request as ready for review April 6, 2026 00:05

kilo-code-bot bot commented Apr 6, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (22 files)
  • apps/web/src/app/(app)/claw/components/InstanceControls.tsx
  • apps/web/src/app/(app)/claw/components/SettingsTab.tsx
  • apps/web/src/app/(app)/claw/components/claw.types.ts
  • apps/web/src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx
  • apps/web/src/lib/kiloclaw/kiloclaw-internal-client.ts
  • apps/web/src/lib/kiloclaw/types.ts
  • apps/web/src/routers/admin-kiloclaw-instances-router.ts
  • packages/db/src/schema-types.ts
  • services/kiloclaw/src/config.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance.test.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/index.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/log.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/recovery.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/state.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/types.ts
  • services/kiloclaw/src/index.test.ts
  • services/kiloclaw/src/index.ts
  • services/kiloclaw/src/routes/platform.ts
  • services/kiloclaw/src/schemas/instance-config.ts
  • services/kiloclaw/src/utils/analytics.ts

Reviewed by gpt-5.4-20260305 · 497,812 tokens

@pandemicsyn pandemicsyn requested a review from a team April 6, 2026 13:16
@pandemicsyn pandemicsyn changed the title feat(kiloclaw): add stop recovery admin tooling feat(kiloclaw): add unexpect stop recovery Apr 6, 2026
@pandemicsyn pandemicsyn changed the title feat(kiloclaw): add unexpect stop recovery feat(kiloclaw): add unexpected stop recovery Apr 6, 2026

@RSO RSO left a comment

LGTM

@pandemicsyn pandemicsyn merged commit d6d8373 into main Apr 6, 2026
33 checks passed
@pandemicsyn pandemicsyn deleted the florian/chore/recover-stopped branch April 6, 2026 14:32
jrf0110 pushed a commit that referenced this pull request Apr 7, 2026
* feat(kiloclaw): add stop recovery admin tooling

* fix(kiloclaw): trigger recovery on first stopped alarm

* fix(kiloclaw): hand off recovery startup timeouts
jrf0110 pushed a commit that referenced this pull request Apr 7, 2026
* feat(kiloclaw): add stop recovery admin tooling

* fix(kiloclaw): trigger recovery on first stopped alarm

* fix(kiloclaw): hand off recovery startup timeouts
kilo-code-bot bot pushed a commit that referenced this pull request Apr 8, 2026
* feat(kiloclaw): add stop recovery admin tooling

* fix(kiloclaw): trigger recovery on first stopped alarm

* fix(kiloclaw): hand off recovery startup timeouts