Fix imports to be global not relative for patches #76

Open
joyliu-q wants to merge 29 commits into main from joy/improve-nan-miles-refactor

Contributor

@joyliu-q joyliu-q commented Apr 20, 2026

Will add other fixes if I need to make any to get this running out of the box!


nanjiangwill and others added 21 commits April 17, 2026 20:23
Adds memory, cloud, and region fields to ModalConfig and wires them
into the @app.function decorator in slime/modal_train.py so configs
can request per-container memory limits and pin cloud/region.

The 355B base config (glm47_355b_a32b) sets memory=(1024, 2 TiB) so
the trainer host has headroom for the FP32 delta baseline copy without
OOM. The non-colocated variant now inherits via `modal = _base.modal`
instead of redeclaring, and its docstring is refreshed (EP=16 is
unchanged from base, not reduced).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
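
A minimal sketch of how optional memory, cloud, and region fields might flow from a config object into decorator keyword arguments. `ModalConfigSketch`, `function_kwargs`, and `TIB_IN_MIB` are hypothetical names, not the repository's code, and the tuple is assumed to mean (request, limit) in MiB, so 2 TiB is written as `2 * 1024 * 1024`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class ModalConfigSketch:
    """Hypothetical stand-in for ModalConfig; models only the three new fields."""

    # Assumed to be MiB: a single request, or a (request, limit) pair.
    memory: Optional[Union[int, Tuple[int, int]]] = None
    cloud: Optional[str] = None
    region: Optional[str] = None

    def function_kwargs(self) -> Dict[str, Any]:
        """Return only the fields that were set, so unset ones keep the platform default."""
        fields = {"memory": self.memory, "cloud": self.cloud, "region": self.region}
        return {name: value for name, value in fields.items() if value is not None}


TIB_IN_MIB = 1024 * 1024

# Base config: request 1024 MiB, cap at 2 TiB (units are an assumption).
base_modal = ModalConfigSketch(memory=(1024, 2 * TIB_IN_MIB))

# A variant that reuses the base object instead of redeclaring it.
variant_modal = base_modal

decorator_kwargs = variant_modal.function_kwargs()
# In a launcher these could be splatted into the decorator, e.g.
# @app.function(**decorator_kwargs)
```

Filtering out unset fields means a config that does not pin a cloud or region passes nothing for them, rather than passing an explicit `None`.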
Plan generated via gen-plan with 3 Codex convergence rounds.
Covers miles/ Modal launcher mirroring slime/ contract,
Qwen3-4B LoRA smoke test, and 7 acceptance criteria.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the miles/ launcher directory mirroring the slime/ contract:
- configs/base.py: ModalConfig + MilesConfig with reflection-based CLI,
  JSON config field support, and NCCL_NVLS_ENABLE=1 in base env
- modal_train.py: 5 Modal functions (list_configs, download_model,
  prepare_dataset, convert_checkpoint, train)
- modal_helpers/utils.py: cluster context, command building, config prep
- configs/qwen3_4b_lora_smoke.py: Qwen3-4B LoRA verification config
  (num_rollout=2, bridge mode, 4x H200 colocated)

Docker image: radixark/miles:dev-202604101227
Volumes: huggingface-cache (shared), miles-data, miles-checkpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix prepare_data() to use zhuzilin/dapo-math-17k and zhuzilin/aime-2024
  (matching slime sibling configs), add os.makedirs before download
- Expand README with prerequisites, secrets, step-by-step workflow,
  config authoring guide, YAML/JSON fields, and patch injection docs
  (matching slime/README.md quality)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The train function now only attaches wandb-secret when the experiment
config has use_wandb=True. The smoke-test config (which disables WandB)
no longer depends on this secret being present.

Also documents required Modal secrets in README prerequisites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal evaluates decorator args locally and remotely, so conditional
secrets cause dependency mismatch errors. Always attach wandb-secret
unconditionally, matching slime's pattern.

Also picks up user config changes (H100, 8 GPUs per node).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- configs/kimi_k25_fullparam_smoke.py: 8x H200 colocated, bridge mode,
  freezes everything except layer 60 for smoke testing. Depends on the
  K2.5 weight-update fixes landed in the miles repo.
- configs/kimi_k25_lora.py: LoRA training on top of smoke config. Targets
  MLA attention projections and MLP linear layers (rank 32, alpha 32).
- test_bridge.py: single-GPU Modal script that exercises bridge
  registration, provider construction, and model instantiation. Used for
  fast iteration when debugging bridge/provider issues without a full
  multi-node run.
- configs/base.py: add memory and image_env fields to ModalConfig so
  configs can request per-container RAM (K2.5 needs ~1.8 TiB) and bake
  LD_LIBRARY_PATH into the image.
- modal_train.py: bump base image to radixark/miles:dev (rolling),
  always prepend system libs to LD_LIBRARY_PATH, and add a
  convert_kimi_int4_to_bf16 helper to prepare the bf16 checkpoint from
  moonshotai/Kimi-K2.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop sglang_pynccl_nonfatal.patch.py from image_run_commands (patch was
  removed from miles repo).
- Remove the commented-out INT4 QAT __init__ override and the unused
  _MILES_PR896_SHA constant.
- Drop the no-op prepare_data config-patching branch — the checkpoint on
  the volume is already in the right shape.
- Tighten image_run_commands comments to name the actual issue each
  step addresses instead of paraphrasing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Phase 1 target_modules: drop linear_q_proj (unused — K2.5 has q_lora_rank
  so MLA decomposes into down+up; linear_q_proj only exists when q_lora_rank
  is None). Drop linear_fc1/linear_fc2 — the sparse-expert + compressed-
  tensors + LoRA path is the least-tested. Keep MLA decomposition
  (q_down/q_up/kv_down/kv_up) + o_proj. Gives 5 × 60 = 300 adapters instead
  of ~138K when targeting all expert MLPs.
- Enable OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1 + OPEN_TRAINING_INT4_GROUP_SIZE=32
  so training forward sees INT4-quantized base weights, matching SGLang's
  INT4 inference. Without this the LoRA adapters would learn to compensate
  for a BF16↔INT4 numerical gap that doesn't exist at inference time. LoRA
  doesn't optimize the base weights so the dequant scratch buffer fits in
  H200 memory (unlike the full-param smoke where it OOMed).
- Drop the explicit save path; save/save_interval are unset in the parent
  smoke config, so no checkpoint is written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
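
The adapter count quoted above can be reproduced with a small calculation. This is a sketch that assumes one adapter per target module per layer; the module names reuse the `target_modules` spelling that appears elsewhere in this PR, and treating `o_proj` as `linear_proj` is an assumption.

```python
from typing import Sequence

# Phase 1 targets: the MLA decomposition plus the output projection.
# Mapping o_proj to linear_proj is an assumption made for this sketch.
PHASE1_TARGET_MODULES = [
    "linear_q_down_proj",
    "linear_q_up_proj",
    "linear_kv_down_proj",
    "linear_kv_up_proj",
    "linear_proj",
]

# Layer count implied by the commit message's "5 x 60".
NUM_LAYERS = 60


def count_adapters(target_modules: Sequence[str], num_layers: int) -> int:
    """Count adapters assuming one adapter per target module per layer."""
    return len(target_modules) * num_layers


phase1_adapters = count_adapters(PHASE1_TARGET_MODULES, NUM_LAYERS)
print(phase1_adapters)
```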
Switch target_modules back to the full general-case set:
  linear_q_down_proj, linear_q_up_proj, linear_kv_down_proj,
  linear_kv_up_proj, linear_proj, linear_fc1, linear_fc2

Per https://thinkingmachines.ai/blog/lora/ attention-only LoRA significantly
underperforms MLP-only, and MLP+attention ≈ MLP-only. If the full path
works end-to-end, narrower scopes follow trivially — so exercise the
hardest code path first: MLA attention LoRA + dense MLP LoRA + shared
expert LoRA + sparse expert (GroupedMLP) LoRA on compressed-tensors experts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

jvmncs commented Apr 20, 2026

I think we should make all these paths relative with something like pathlib.Path(__file__).
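
One way the suggestion could look in practice is sketched here. `root_from` is a hypothetical helper, and the example path is illustrative only; in a real module the anchor would be `__file__`.

```python
from pathlib import Path


def root_from(anchor_file: str, levels_up: int = 1) -> Path:
    """Return the directory `levels_up` levels above the given file.

    levels_up=1 is the file's own directory, levels_up=2 its parent, and so on.
    """
    return Path(anchor_file).resolve().parents[levels_up - 1]


# Inside a module this would typically be called as root_from(__file__, 2).
# A literal path is used here so the sketch runs anywhere.
example_root = root_from("/tmp/miles/configs/model_configuration.py", levels_up=2)
example_modal_train_path = example_root / "modal_train.py"
```

Because the result is anchored to the file's location rather than the current working directory, the same path is produced no matter where the command is launched from.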

nanjiangwill and others added 8 commits April 21, 2026 18:36
Introduces miles/configs/model_configuration.py with a ModelConfiguration
base class (model_name, model_path, download_model) and a KimiK25
subclass whose download_model() shells out via `modal run` to the
existing download_model and convert_kimi_int4_to_bf16 Modal app
functions. Adds miles/test_model_configuration.py covering import,
construction, the no-CLI-arg-leak guard (via _MILES_SKIP), and
NotImplementedError on the base.

Round 0 of the surge-kimi-plan RLCR loop (live residuals only; the
1-step smoke goal was met by an external 8x H200 MLA 8k x 256 run on
2026-04-21). Companion updates to _MILES_SKIP and the two experiment
scanners are already in 🆕 v0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
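
A rough sketch of the base-class pattern this commit describes, under stated assumptions. `KimiK25Sketch`, the `runner` parameter, and the constructor arguments are hypothetical; the Modal function names are taken from the commit message and may differ from the current code.

```python
import os
import subprocess
from typing import Callable, Dict, List


class ModelConfiguration:
    """Base class: holds identifying fields; subclasses supply download logic."""

    def __init__(self, model_name: str, model_path: str) -> None:
        self.model_name = model_name
        self.model_path = model_path

    def download_model(self) -> None:
        raise NotImplementedError("subclasses must implement download_model()")


class KimiK25Sketch(ModelConfiguration):
    """Hypothetical subclass that shells out to `modal run` for each step."""

    # Function names as given in the commit message (an assumption about the app).
    MODAL_FUNCTIONS = ("download_model", "convert_kimi_int4_to_bf16")

    def __init__(self, modal_train_path: str, model_path: str) -> None:
        super().__init__("moonshotai/Kimi-K2.5", model_path)
        self.modal_train_path = modal_train_path

    def commands(self) -> List[List[str]]:
        """Build one `modal run <file>::<function>` command per step."""
        return [
            ["modal", "run", f"{self.modal_train_path}::{name}"]
            for name in self.MODAL_FUNCTIONS
        ]

    def download_model(self, runner: Callable[..., object] = subprocess.run) -> None:
        # Select the experiment config through the environment, not CLI args.
        env: Dict[str, str] = {**os.environ, "EXPERIMENT_CONFIG": "kimi_k25"}
        for command in self.commands():
            runner(command, env=env, check=True)
```

Accepting the runner as a parameter lets a test substitute a recorder for `subprocess.run`, so the command list can be checked without invoking the Modal CLI.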
@joyliu-q joyliu-q force-pushed the joy/improve-nan-miles-refactor branch from 8cd4a8a to 2914c58 Compare April 21, 2026 20:30
@joyliu-q joyliu-q marked this pull request as ready for review April 21, 2026 22:18
_MODAL_TRAIN_PATH = _MILES_ROOT / "modal_train.py"


class KimiK25ModelConfiguration(ModelConfiguration):
Contributor Author

TODO: Add more model configs and port other model configs over to use this pattern.

@nanjiangwill nanjiangwill force-pushed the nan/miles-refactor branch 2 times, most recently from 4e5df13 to f19d3af Compare May 29, 2026 21:50
Base automatically changed from nan/miles-refactor to main May 29, 2026 22:32
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

# docs/agent-modal-training.md.
env = {**os.environ, "EXPERIMENT_CONFIG": "kimi_k25"}
subprocess.run(
    ["modal", "run", f"{_MODAL_TRAIN_PATH}::download_model"],
🔴 KimiK25ModelConfiguration.download_model references non-existent modal function

miles/configs/model_configuration.py:49 invokes modal run modal_train.py::download_model, but the download_model function was removed in this PR and replaced with prepare_model (miles/modal_train.py:127). When KimiK25ModelConfiguration.download_model() is called (e.g. from the test file or agent workflows), the subprocess will fail with a Modal "function not found" error because no function named download_model exists in the app.

Suggested change
- ["modal", "run", f"{_MODAL_TRAIN_PATH}::download_model"],
+ ["modal", "run", f"{_MODAL_TRAIN_PATH}::prepare_model"],
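
A mismatch like this could be caught before any subprocess runs by checking referenced function names against the launcher's source. The sketch below uses the standard `ast` module; `defined_functions`, `missing_entrypoints`, and the toy source string are hypothetical and stand in for reading the real `modal_train.py`.

```python
import ast
from typing import Iterable, List, Set


def defined_functions(source: str) -> Set[str]:
    """Collect names of all (sync or async) functions defined in the source.

    ast.walk visits nested definitions too, so this over-approximates
    what is callable at module level.
    """
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def missing_entrypoints(source: str, referenced: Iterable[str]) -> List[str]:
    """Return referenced function names that the source does not define."""
    return sorted(set(referenced) - defined_functions(source))


# Toy stand-in for the launcher after the rename described in the finding.
EXAMPLE_SOURCE = '''
def prepare_model():
    pass

def train():
    pass
'''

stale = missing_entrypoints(EXAMPLE_SOURCE, ["download_model"])
fixed = missing_entrypoints(EXAMPLE_SOURCE, ["prepare_model"])
```

A unit test asserting that the missing list is empty would fail on the stale `download_model` reference and pass once the call site names `prepare_model`.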

3 participants