feat(cli): add --recover (rerun) and --recover-from (run) [gated] by kumare3 · Pull Request #1237 · flyteorg/flyte-sdk

kumare3 · 2026-06-21T18:34:27Z

Adds the recover knob to the CLI, split out of #1236 (which deliberately omits these flags so they don't surface in the CLI docs while the backend is unfinished).

Draft — gated until the flyteidl2 RunSpec.recover field + actions-service support ship. The flags are wired but raise a clear NotImplementedError at submit until then. Stacks on #1236 (the rerun foundation that owns the recover field); merge after #1236 and once the backend lands.

Usage

# Recover from the run being rerun (fetched code, reuse its succeeded actions):
flyte rerun ul56wcvgqrb9vzhzz5l2 --recover

# Recover a fresh run with NEW local code from a prior run:
flyte run main.py main --recover-from ul56wcvgqrb9vzhzz5l2

Both map to with_runcontext(recover=...): --recover (bool) recovers from the run being rerun; --recover-from <run> (string) recovers a fresh run() from a named prior run. recover reuses the prior run's succeeded actions and re-runs only what failed or changed (remote-only).
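The flag-to-`recover` mapping above can be sketched with the standard library. This is only an illustration: the parser below uses `argparse` as a stand-in, and `build_parser` and `recover_value` are hypothetical names, not the SDK's actual CLI wiring.

```python
import argparse
from typing import Optional, Union


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the two CLI commands described in this PR."""
    parser = argparse.ArgumentParser(prog="flyte")
    sub = parser.add_subparsers(dest="command", required=True)

    # `flyte rerun <run> --recover`: boolean flag, recover from the run being rerun.
    rerun = sub.add_parser("rerun")
    rerun.add_argument("run_name")
    rerun.add_argument("--recover", action="store_true")

    # `flyte run <file> <task> --recover-from <run>`: names the prior run explicitly.
    run = sub.add_parser("run")
    run.add_argument("file")
    run.add_argument("task")
    run.add_argument("--recover-from", dest="recover_from", default=None)
    return parser


def recover_value(args: argparse.Namespace) -> Union[bool, str, None]:
    """Return the value that would be forwarded as with_runcontext(recover=...)."""
    if args.command == "rerun":
        return args.recover  # True only when --recover was passed
    return args.recover_from  # run name string, or None when the flag is absent
```

In this sketch, `rerun ... --recover` yields `True` and `run ... --recover-from <run>` yields the run name, matching the bool and string forms described above.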

What's here

flyte rerun --recover (bool) → with_runcontext(recover=True).rerun(run).
flyte run --recover-from <run> (string) → with_runcontext(recover="<run>").run(task).
Tests assert both flags are present and forwarded.

The SDK with_runcontext(recover=...) field and gating already land in #1236; this PR only adds the CLI surface.

🤖 Generated with Claude Code

Pin the exact RunSpec / CreateRunRequest that with_runcontext(...) builds in remote mode — every field _run_remote serializes (env_vars, labels, annotations, queue→cluster, interruptible, overwrite_cache, cache_lookup_scope, service_account, notifications, max_action_concurrency), plus the ConnectError mapping, dry-run path, and per-mode dispatch. This is the byte-for-byte oracle for the upcoming run/rerun/recover/debug unification: the extraction of _build_task_spec_from_template / _submit_remote / _apply_overrides must reproduce these unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
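As a rough illustration of the characterization ("oracle") idea in this commit, the sketch below pins the serialized output of one builder and checks that a refactored builder reproduces it exactly. `FakeRunSpec`, `snapshot`, and both builder functions are hypothetical stand-ins; the real tests work against the actual RunSpec / CreateRunRequest, not JSON.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FakeRunSpec:
    """Hypothetical stand-in for the real RunSpec proto."""
    env_vars: dict = field(default_factory=dict)
    labels: dict = field(default_factory=dict)
    interruptible: bool = False
    overwrite_cache: bool = False


def snapshot(spec: FakeRunSpec) -> bytes:
    # Sorted keys and fixed separators make the serialization deterministic.
    return json.dumps(asdict(spec), sort_keys=True, separators=(",", ":")).encode()


def build_spec_original(env_vars, labels) -> FakeRunSpec:
    # "Before" implementation: everything built inline.
    return FakeRunSpec(env_vars=dict(env_vars), labels=dict(labels), interruptible=True)


def _apply(spec: FakeRunSpec, env_vars, labels) -> FakeRunSpec:
    spec.env_vars.update(env_vars)
    spec.labels.update(labels)
    return spec


def build_spec_refactored(env_vars, labels) -> FakeRunSpec:
    # "After" implementation: same behaviour, routed through an extracted helper.
    return _apply(FakeRunSpec(interruptible=True), env_vars, labels)


# Pin the snapshot once, then require the refactor to reproduce it unchanged.
PINNED = snapshot(build_spec_original({"A": "1"}, {"team": "ml"}))
assert snapshot(build_spec_refactored({"A": "1"}, {"team": "ml"})) == PINNED
```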

Pull the local-TaskTemplate branch of _run_remote (image build, code-bundle cascade, serialization + task-spec translation) into a reusable _Runner._build_task_spec_from_template returning (task_spec, code_bundle, version). image_cache is folded into task_spec via the serialization context, so it is not returned. Heavy imports travel into the helper to keep `import flyte` cheap. This is the shared task-spec builder rerun-with-substitute-code will call, removing the future duplication of _replay._build_task_spec. Characterization tests (test_run_runspec_chars + test_union_run_basic) pass unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>

Split _run_remote's back half into single-responsibility helpers on _Runner:
- _build_env_dict: runtime env assembly (user env_vars + injected LOG/debug/sys-path keys), shared by fresh and inherited paths; returns a fresh dict (no longer mutates self._env_vars).
- _apply_overrides(base, *, task): the single place runner config maps onto a RunSpec. base=None builds a fresh spec (run/recover); base set deep-copies a prior run's spec and merges overrides by key (the rerun seam: env merge + explicit field overrides). Includes a gated recover block (raises until flyteidl2 RunSpec.recover ships).
- _submit_remote: the single network call site (upload_inputs + create_run + the ConnectError mapping). Consumes an already-built run_spec.
- _resolve_run_target: RunIdentifier vs ProjectIdentifier resolution.
- _to_cache_lookup_scope lifted to module scope.
Heavy imports travel into the helpers (import flyte stays cheap; verified via -X importtime). Characterization snapshot reproduced byte-for-byte; added unit tests for the inherited-merge path and recover gating. 141 run-path tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
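The env assembly and override-merge semantics described in this commit might look roughly like the following. This is a minimal sketch under assumptions: `FakeRunSpec` is a hypothetical stand-in for the RunSpec proto, the function signatures are simplified, and `LOG_LEVEL` is only an example of an injected key.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeRunSpec:
    """Hypothetical stand-in for the RunSpec proto."""
    env: Dict[str, str] = field(default_factory=dict)
    interruptible: Optional[bool] = None


def build_env_dict(user_env: Dict[str, str], injected: Dict[str, str]) -> Dict[str, str]:
    # Return a new dict so the caller's env_vars are never mutated.
    env = dict(user_env)
    env.update(injected)
    return env


def apply_overrides(
    base: Optional[FakeRunSpec],
    *,
    env: Dict[str, str],
    interruptible: Optional[bool] = None,
) -> FakeRunSpec:
    # base=None builds a fresh spec; otherwise deep-copy the prior run's spec.
    spec = FakeRunSpec() if base is None else copy.deepcopy(base)
    # Merge by key: overrides win, keys the caller did not set are inherited.
    spec.env.update(env)
    # Only explicitly provided overrides replace inherited fields.
    if interruptible is not None:
        spec.interruptible = interruptible
    return spec
```

Deep-copying the base keeps the prior run's spec untouched, so the same fetched spec can be reused for several reruns in this sketch.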

Add `recover: str | None` to _Runner / with_runcontext (keyword-only, default None → fully backwards-compatible) and a hidden `--recover-from` flag on `flyte run`. The value flows into _apply_overrides, which sets RunSpec.recover once flyteidl2 ships the field and otherwise raises a clear NotImplementedError (the field is absent today). recover composes with run/rerun since it lives in the shared override seam. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
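The gating described here (set the field if it exists, otherwise fail loudly) could be sketched as below. How the real code detects the missing field is not stated in the commit message; the `hasattr` check, the two spec classes, and `set_recover` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecWithoutRecover:
    """Stand-in for today's RunSpec, which has no `recover` field."""
    interruptible: bool = False


@dataclass
class SpecWithRecover:
    """Stand-in for a future RunSpec that carries `recover`."""
    interruptible: bool = False
    recover: Optional[str] = None


def set_recover(spec, recover_ref: Optional[str]):
    if recover_ref is None:
        # Default path: nothing requested, so existing callers are unaffected.
        return spec
    if not hasattr(spec, "recover"):
        # Fail at submit time rather than silently dropping the request.
        raise NotImplementedError(
            "recover is not supported yet: RunSpec has no `recover` field"
        )
    spec.recover = recover_ref
    return spec
```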

_Runner.rerun(run_name, action_name, task_template, inputs) re-runs a prior run on the shared foundation: fetch RunDetails → inherit its RunSpec (via _apply_overrides), source the task from action_details.pb2.task (or a substitute template), and either reuse the prior raw proto inputs (dataproxy.get_action_data) or convert new native kwargs against the fetched interface (guess_interface), then _submit_remote. Public surface: flyte.rerun("r1") same inputs; flyte.rerun("r1", x=2) changed inputs; flyte.rerun("r1", task_template=fixed) substitute code. flyte.replay kept as a deprecated thin alias (inputs=None). recover/debug compose via with_runcontext. Remote-only for now. Exported from flyte/__init__. Tests cover same-inputs inheritance+reuse, changed-inputs conversion against a real interface proto, the non-remote guard, and the replay alias. 147 run-path tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
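The two input paths (reuse the prior inputs, or convert new kwargs against the fetched interface) can be illustrated with a toy model. Everything here is hypothetical: the interface is modelled as a plain name-to-type dict, and `choose_inputs` is not an SDK function. Whether the real code merges new kwargs with the prior inputs is not stated in the commit message, so this sketch simply uses the new kwargs as given.

```python
from typing import Any, Dict, Optional


def choose_inputs(
    prior_inputs: Dict[str, Any],
    interface: Dict[str, type],
    new_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # No new kwargs: reuse the prior run's inputs unchanged (copied, not aliased).
    if not new_kwargs:
        return dict(prior_inputs)
    # New kwargs: validate names and convert values against the interface.
    converted: Dict[str, Any] = {}
    for name, value in new_kwargs.items():
        if name not in interface:
            raise TypeError(f"unknown input {name!r}")
        converted[name] = interface[name](value)
    return converted
```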

- recover: bool|str = False. True recovers from the run being rerun (rerun-only); a run-name string recovers from that named run (the only form valid on run() / flyte run --recover-from). _resolve_recover_ref maps True->rerun target and rejects True on a plain run(). _apply_overrides takes the resolved recover_ref.
- recover is remote-only: a truthy recover in local/hybrid mode raises ValueError up front in run() instead of being silently ignored.
- Delete flyte.replay (no alias); flyte.rerun is the verb.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
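The resolution rules in this commit can be summarized in a small function. This is a sketch, not the SDK's `_resolve_recover_ref`: the signature, the `mode` strings, and the choice of `ValueError` for `True` on a plain run are assumptions (the commit only says that case is rejected).

```python
from typing import Optional, Union


def resolve_recover_ref(
    recover: Union[bool, str, None],
    *,
    mode: str,
    rerun_target: Optional[str] = None,
) -> Optional[str]:
    # Falsy values (False, None, "") mean no recovery was requested.
    if not recover:
        return None
    # recover is remote-only: reject up front instead of silently ignoring it.
    if mode != "remote":
        raise ValueError("recover is remote-only")
    if recover is True:
        # True means "the run being rerun", so it needs a rerun target.
        if rerun_target is None:
            raise ValueError("recover=True is only valid on rerun; pass a run name to run()")
        return rerun_target
    # A string names the prior run to recover from.
    return recover
```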

Re-run an existing run with its own fetched code + exact inputs (no local code). Complements `flyte run --recover-from` (which supplies new local code): rerun takes the run name as a positional and exposes context options (--project/--domain/--name/--env/--label/--follow) plus a hidden `--recover` (reuse succeeded actions, coming soon). v1 reuses the prior inputs; changing inputs from the CLI is a follow-up (flyte.rerun(run, x=2) covers it programmatically). Registered in cli/main.py under "Run and stop tasks". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>

Re-run an existing run with THIS local code, reusing the prior run's inputs — the CLI equivalent of flyte.rerun(run, task_template=local_task). Routes the file-loaded TaskTemplate through _Runner.rerun(run, task_template=...) and suppresses the dynamic per-input options (inputs come from the prior run; required inputs aren't demanded). Orthogonal to the gated --recover-from; --rerun-from is live and remote-only (errors with --local). `flyte rerun <run>` stays the no-local-code (fetched) path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>

- Widen recover to bool|str|None so the CLI's str|None --recover-from type-checks (None already means "no recover" everywhere).
- "re-uses" -> "reuses" (codespell).
- ruff format.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>

`hidden=True` still surfaces options in `flyte gen docs`, so the gated recover flags leaked into the CLI reference. Remove `flyte rerun --recover` and `flyte run --recover-from` entirely, leaving TODOs to re-add them once flyteidl2 RunSpec.recover + backend support land. `--rerun-from` (live) and the Python `with_runcontext(recover=...)` field are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>

Expose the recover knob on the CLI: `flyte rerun <run> --recover` recovers from the run being rerun; `flyte run <file> <task> --recover-from <run>` recovers a fresh run (new local code) from a named prior run. Both map to with_runcontext(recover=...) and reuse a prior run's succeeded actions, re-running only what failed or changed. Gated until the flyteidl2 RunSpec.recover field + actions-service support ship (raises a clear NotImplementedError until then) — hence this PR is a draft, to land once the backend is ready. Stacks on #1236 (the rerun foundation that owns the recover field). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>

kumare3 and others added 11 commits June 20, 2026 23:06

Base automatically changed from rerun-debug-foundation to main June 22, 2026 18:28

kumare3 wants to merge 11 commits into main from recover-cli
