
feat(cli): add --recover (rerun) and --recover-from (run) [gated]#1237

Draft
kumare3 wants to merge 11 commits into main from recover-cli



@kumare3 kumare3 commented Jun 21, 2026

Contributor

Adds the recover knob to the CLI, split out of #1236 (which deliberately omits these flags so they don't surface in the CLI docs while the backend is unfinished).

Draft — gated until the flyteidl2 RunSpec.recover field + actions-service support ship. The flags are wired but raise a clear NotImplementedError at submit until then. Stacks on #1236 (the rerun foundation that owns the recover field); merge after #1236 and once the backend lands.

Usage

# Recover from the run being rerun (fetched code, reuse its succeeded actions):
flyte rerun ul56wcvgqrb9vzhzz5l2 --recover

# Recover a fresh run with NEW local code from a prior run:
flyte run main.py main --recover-from ul56wcvgqrb9vzhzz5l2

Both map to with_runcontext(recover=...): --recover (bool) recovers from the run being rerun; --recover-from <run> (string) recovers a fresh run() from a named prior run. recover reuses the prior run's succeeded actions and re-runs only what failed or changed (remote-only).

What's here

  • flyte rerun --recover (bool) → with_runcontext(recover=True).rerun(run).
  • flyte run --recover-from <run> (string) → with_runcontext(recover="<run>").run(task).
  • Tests assert both flags are present and forwarded.

The SDK with_runcontext(recover=...) field and gating already land in #1236; this PR only adds the CLI surface.
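As a rough illustration of the flag-to-SDK mapping listed above, the sketch below uses the standard-library argparse and a stand-in helper. The real CLI plumbing is not shown in this description, so `build_parser` and `recover_value_for` are hypothetical names, and the sketch only computes the value that would be handed to `with_runcontext(recover=...)`; it does not call flyte.

```python
import argparse
from typing import Optional, Union


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two flags; not the actual flyte CLI."""
    parser = argparse.ArgumentParser(prog="flyte-sketch")
    sub = parser.add_subparsers(dest="command", required=True)

    # `rerun <run> --recover`: a boolean flag.
    rerun = sub.add_parser("rerun")
    rerun.add_argument("run_name")
    rerun.add_argument("--recover", action="store_true")

    # `run <file> <task> --recover-from <run>`: a string option.
    run = sub.add_parser("run")
    run.add_argument("file")
    run.add_argument("task")
    run.add_argument("--recover-from", dest="recover_from", default=None)
    return parser


def recover_value_for(args: argparse.Namespace) -> Union[bool, str, None]:
    """Return the value that would be forwarded as with_runcontext(recover=...)."""
    if args.command == "rerun":
        # True means "recover from the run being rerun".
        return args.recover
    # A run name means "recover a fresh run from that prior run"; None means no recover.
    return args.recover_from
```

Under these assumptions, `rerun <run> --recover` forwards `True`, while `run <file> <task> --recover-from <run>` forwards the run name as a string.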

🤖 Generated with Claude Code

kumare3 and others added 11 commits June 20, 2026 23:06
Pin the exact RunSpec / CreateRunRequest that with_runcontext(...) builds in
remote mode — every field _run_remote serializes (env_vars, labels, annotations,
queue→cluster, interruptible, overwrite_cache, cache_lookup_scope, service_account,
notifications, max_action_concurrency), plus the ConnectError mapping, dry-run path,
and per-mode dispatch. This is the byte-for-byte oracle for the upcoming
run/rerun/recover/debug unification: the extraction of _build_task_spec_from_template
/ _submit_remote / _apply_overrides must reproduce these unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
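The characterization approach in the commit above can be pictured with a small snapshot test. This is only a sketch under stated assumptions: the actual tests pin protobuf messages, whereas this example uses a plain dict and deterministic JSON, and `build_spec`, `snapshot`, and `PINNED` are hypothetical names.

```python
import json
from typing import Any, Dict


def build_spec(env_vars: Dict[str, str], labels: Dict[str, str], interruptible: bool) -> Dict[str, Any]:
    """Hypothetical builder standing in for the code that will be refactored."""
    return {"env_vars": env_vars, "labels": labels, "interruptible": interruptible}


def snapshot(spec: Dict[str, Any]) -> bytes:
    # Sorted keys and fixed separators make the serialized bytes deterministic.
    return json.dumps(spec, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Pinned once from the pre-refactor behavior; any later change must reproduce it exactly.
PINNED = b'{"env_vars":{"LOG_LEVEL":"10"},"interruptible":true,"labels":{"team":"ml"}}'


def test_spec_is_unchanged() -> None:
    spec = build_spec({"LOG_LEVEL": "10"}, {"team": "ml"}, True)
    assert snapshot(spec) == PINNED
```

The point of pinning bytes rather than individual fields is that an accidental change to any serialized field fails the test, even one nobody thought to assert on.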
Pull the local-TaskTemplate branch of _run_remote (image build, code-bundle
cascade, serialization + task-spec translation) into a reusable
_Runner._build_task_spec_from_template returning (task_spec, code_bundle, version).
image_cache is folded into task_spec via the serialization context, so it is not
returned. Heavy imports travel into the helper to keep `import flyte` cheap.

This is the shared task-spec builder rerun-with-substitute-code will call, removing
the future duplication of _replay._build_task_spec. Characterization tests
(test_run_runspec_chars + test_union_run_basic) pass unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Split _run_remote's back half into single-responsibility helpers on _Runner:

- _build_env_dict: runtime env assembly (user env_vars + injected LOG/debug/sys-path
  keys), shared by fresh and inherited paths; returns a fresh dict (no longer mutates
  self._env_vars).
- _apply_overrides(base, *, task): the single place runner config maps onto a RunSpec.
  base=None builds a fresh spec (run/recover); base set deep-copies a prior run's spec
  and merges overrides by key (the rerun seam — env merge + explicit field overrides).
  Includes a gated recover block (raises until flyteidl2 RunSpec.recover ships).
- _submit_remote: the single network call site — upload_inputs + create_run + the
  ConnectError mapping. Consumes an already-built run_spec.
- _resolve_run_target: RunIdentifier vs ProjectIdentifier resolution.
- _to_cache_lookup_scope lifted to module scope.

Heavy imports travel into the helpers (import flyte stays cheap; verified via
-X importtime). Characterization snapshot reproduced byte-for-byte; added unit tests
for the inherited-merge path and recover gating. 141 run-path tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
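The override seam described in the commit above might look roughly like the following. This is a minimal sketch, assuming a dataclass in place of the real protobuf RunSpec; `SketchRunSpec` and this standalone `apply_overrides` are hypothetical and only mirror the stated behavior (fresh spec when base is None, deep copy plus merge-by-key otherwise, and a gated recover block).

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SketchRunSpec:
    """Stand-in for a run spec; the real RunSpec is a protobuf message."""
    env_vars: Dict[str, str] = field(default_factory=dict)
    overwrite_cache: bool = False


def apply_overrides(
    base: Optional[SketchRunSpec],
    *,
    env_vars: Optional[Dict[str, str]] = None,
    overwrite_cache: Optional[bool] = None,
    recover_ref: Optional[str] = None,
) -> SketchRunSpec:
    # base=None builds a fresh spec; otherwise deep-copy so the prior spec is untouched.
    spec = SketchRunSpec() if base is None else copy.deepcopy(base)
    # Env vars merge by key: overrides win, inherited keys survive.
    spec.env_vars.update(env_vars or {})
    # Explicit field overrides apply only when the caller actually set them.
    if overwrite_cache is not None:
        spec.overwrite_cache = overwrite_cache
    # Gated recover block: raise until the backend field exists.
    if recover_ref is not None:
        raise NotImplementedError("recover is not supported by the backend yet")
    return spec
```

Keeping the gate inside the shared seam means both the fresh-run and rerun paths hit the same error, which matches the commit's note that recover composes with both.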
Add `recover: str | None` to _Runner / with_runcontext (keyword-only, default None →
fully backwards-compatible) and a hidden `--recover-from` flag on `flyte run`. The
value flows into _apply_overrides, which sets RunSpec.recover once flyteidl2 ships the
field and otherwise raises a clear NotImplementedError (the field is absent today).
recover composes with run/rerun since it lives in the shared override seam.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
_Runner.rerun(run_name, action_name, task_template, inputs) re-runs a prior run on
the shared foundation: fetch RunDetails → inherit its RunSpec (via _apply_overrides),
source the task from action_details.pb2.task (or a substitute template), and either
reuse the prior raw proto inputs (dataproxy.get_action_data) or convert new native
kwargs against the fetched interface (guess_interface), then _submit_remote.

Public surface: flyte.rerun("r1") same inputs; flyte.rerun("r1", x=2) changed inputs;
flyte.rerun("r1", task_template=fixed) substitute code. flyte.replay kept as a
deprecated thin alias (inputs=None). recover/debug compose via with_runcontext.
Remote-only for now. Exported from flyte/__init__.

Tests cover same-inputs inheritance+reuse, changed-inputs conversion against a real
interface proto, the non-remote guard, and the replay alias. 147 run-path tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
- recover: bool|str = False. True recovers from the run being rerun (rerun-only);
  a run-name string recovers from that named run (the only form valid on run() /
  flyte run --recover-from). _resolve_recover_ref maps True->rerun target and rejects
  True on a plain run(). _apply_overrides takes the resolved recover_ref.
- recover is remote-only: a truthy recover in local/hybrid mode raises ValueError up
  front in run() instead of being silently ignored.
- Delete flyte.replay (no alias); flyte.rerun is the verb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
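The recover resolution rules in the commit above can be sketched as a single function. This is an illustration under assumptions, not the SDK's code: the commit places the remote-only check in run() and the True-to-target mapping in _resolve_recover_ref, while this hypothetical `resolve_recover_ref` folds both into one place, and the `mode` and `rerun_target` parameters are invented for the example.

```python
from typing import Optional, Union


def resolve_recover_ref(
    recover: Union[bool, str, None],
    *,
    rerun_target: Optional[str] = None,
    mode: str = "remote",
) -> Optional[str]:
    # Falsy values (False, None, "") all mean "no recover".
    if not recover:
        return None
    # A truthy recover outside remote mode is an error, not silently ignored.
    if mode != "remote":
        raise ValueError("recover is remote-only")
    if recover is True:
        # True is only meaningful on rerun, where it refers to the run being rerun.
        if rerun_target is None:
            raise ValueError("recover=True is only valid on rerun; pass a run name instead")
        return rerun_target
    # A string names the prior run to recover from.
    return recover
```

For example, `resolve_recover_ref(True, rerun_target="r1")` yields the rerun target, while `resolve_recover_ref(True)` on a plain run is rejected.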
Re-run an existing run with its own fetched code + exact inputs (no local code).
Complements `flyte run --recover-from` (which supplies new local code): rerun takes
the run name as a positional and exposes context options (--project/--domain/--name/
--env/--label/--follow) plus a hidden `--recover` (reuse succeeded actions, coming
soon). v1 reuses the prior inputs; changing inputs from the CLI is a follow-up
(flyte.rerun(run, x=2) covers it programmatically).

Registered in cli/main.py under "Run and stop tasks".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Re-run an existing run with THIS local code, reusing the prior run's inputs — the CLI
equivalent of flyte.rerun(run, task_template=local_task). Routes the file-loaded
TaskTemplate through _Runner.rerun(run, task_template=...) and suppresses the dynamic
per-input options (inputs come from the prior run; required inputs aren't demanded).
Orthogonal to the gated --recover-from; --rerun-from is live and remote-only (errors
with --local). `flyte rerun <run>` stays the no-local-code (fetched) path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
- Widen recover to bool|str|None so the CLI's str|None --recover-from type-checks
  (None already means "no recover" everywhere).
- "re-uses" -> "reuses" (codespell).
- ruff format.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
`hidden=True` still surfaces options in `flyte gen docs`, so the gated recover flags
leaked into the CLI reference. Remove `flyte rerun --recover` and `flyte run
--recover-from` entirely, leaving TODOs to re-add them once flyteidl2 RunSpec.recover +
backend support land. `--rerun-from` (live) and the Python `with_runcontext(recover=...)`
field are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Expose the recover knob on the CLI: `flyte rerun <run> --recover` recovers from the run
being rerun; `flyte run <file> <task> --recover-from <run>` recovers a fresh run (new
local code) from a named prior run. Both map to with_runcontext(recover=...) and reuse a
prior run's succeeded actions, re-running only what failed or changed.

Gated until the flyteidl2 RunSpec.recover field + actions-service support ship (raises a
clear NotImplementedError until then) — hence this PR is a draft, to land once the backend
is ready. Stacks on #1236 (the rerun foundation that owns the recover field).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Base automatically changed from rerun-debug-foundation to main June 22, 2026 18:28