Fix imports to be global not relative for patches by joyliu-q · Pull Request #76 · modal-labs/multinode-training-guide

joyliu-q · 2026-04-20T19:32:07Z

Will add other fixes if I need to make any to get this running out of the box!

Adds memory, cloud, and region fields to ModalConfig and wires them into the @app.function decorator in slime/modal_train.py so configs can request per-container memory limits and pin cloud/region. The 355B base config (glm47_355b_a32b) sets memory=(1024, 2 TiB) so the trainer host has headroom for the FP32 delta baseline copy without OOM. The non-colocated variant now inherits via `modal = _base.modal` instead of redeclaring, and its docstring is refreshed (EP=16 is unchanged from base, not reduced). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
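A minimal sketch of how such fields might be forwarded to the decorator, assuming a dataclass-style `ModalConfig`. Only the field names `memory`, `cloud`, and `region` come from the commit message; the helper `function_kwargs`, the MiB unit for the tuple, and the example region value are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class ModalConfig:
    # Hypothetical subset of the real config; only these three field names
    # are taken from the commit message.
    memory: Optional[Union[int, Tuple[int, int]]] = None  # MiB, or (request, limit)
    cloud: Optional[str] = None
    region: Optional[str] = None

    def function_kwargs(self) -> dict:
        """Return only the fields that were set, ready to splat into a decorator."""
        candidates = {"memory": self.memory, "cloud": self.cloud, "region": self.region}
        return {k: v for k, v in candidates.items() if v is not None}


# 2 TiB expressed in MiB (assumed unit); the region value is a placeholder.
cfg = ModalConfig(memory=(1024, 2 * 1024 * 1024), region="us-east")
kwargs = cfg.function_kwargs()
# Intended use (not executed here): @app.function(**kwargs)
```

Leaving unset fields out of the dict means configs that do not pin a cloud or region fall back to the decorator's defaults.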

Plan generated via gen-plan with 3 Codex convergence rounds. Covers miles/ Modal launcher mirroring slime/ contract, Qwen3-4B LoRA smoke test, and 7 acceptance criteria. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Implements the miles/ launcher directory mirroring the slime/ contract:

- configs/base.py: ModalConfig + MilesConfig with reflection-based CLI, JSON config field support, and NCCL_NVLS_ENABLE=1 in base env
- modal_train.py: 5 Modal functions (list_configs, download_model, prepare_dataset, convert_checkpoint, train)
- modal_helpers/utils.py: cluster context, command building, config prep
- configs/qwen3_4b_lora_smoke.py: Qwen3-4B LoRA verification config (num_rollout=2, bridge mode, 4x H200 colocated)

Docker image: radixark/miles:dev-202604101227
Volumes: huggingface-cache (shared), miles-data, miles-checkpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
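The "reflection-based CLI" mentioned for configs/base.py could look roughly like the sketch below, which walks dataclass fields and emits flags. This is an assumption about the general technique, not the actual miles implementation; the field names other than `num_rollout` and the helper `to_cli_args` are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MilesConfig:
    # Hypothetical fields for illustration only.
    num_rollout: int = 2
    colocate: bool = True
    save: Optional[str] = None


def to_cli_args(config) -> List[str]:
    """Turn dataclass fields into --kebab-case flags by reflection."""
    args: List[str] = []
    for f in fields(config):
        value = getattr(config, f.name)
        flag = "--" + f.name.replace("_", "-")
        if value is None or value is False:
            continue  # unset options and disabled switches emit nothing
        if value is True:
            args.append(flag)  # boolean switch, no value
        else:
            args.extend([flag, str(value)])
    return args


cli_args = to_cli_args(MilesConfig())
print(cli_args)
```

With this approach, adding a field to the config class is enough to expose it on the command line, with no separate argument table to maintain.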

- Fix prepare_data() to use zhuzilin/dapo-math-17k and zhuzilin/aime-2024 (matching slime sibling configs), add os.makedirs before download
- Expand README with prerequisites, secrets, step-by-step workflow, config authoring guide, YAML/JSON fields, and patch injection docs (matching slime/README.md quality)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The train function now only attaches wandb-secret when the experiment config has use_wandb=True. The smoke-test config (which disables WandB) no longer depends on this secret being present. Also documents required Modal secrets in README prerequisites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Modal evaluates decorator args locally and remotely, so conditional secrets cause dependency mismatch errors. Always attach wandb-secret unconditionally, matching slime's pattern. Also picks up user config changes (H100, 8 GPUs per node). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- configs/kimi_k25_fullparam_smoke.py: 8x H200 colocated, bridge mode, freezes everything except layer 60 for smoke testing. Depends on the K2.5 weight-update fixes landed in the miles repo.
- configs/kimi_k25_lora.py: LoRA training on top of smoke config. Targets MLA attention projections and MLP linear layers (rank 32, alpha 32).
- test_bridge.py: single-GPU Modal script that exercises bridge registration, provider construction, and model instantiation. Used for fast iteration when debugging bridge/provider issues without a full multi-node run.
- configs/base.py: add memory and image_env fields to ModalConfig so configs can request per-container RAM (K2.5 needs ~1.8 TiB) and bake LD_LIBRARY_PATH into the image.
- modal_train.py: bump base image to radixark/miles:dev (rolling), always prepend system libs to LD_LIBRARY_PATH, and add a convert_kimi_int4_to_bf16 helper to prepare the bf16 checkpoint from moonshotai/Kimi-K2.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop sglang_pynccl_nonfatal.patch.py from image_run_commands (patch was removed from miles repo).
- Remove the commented-out INT4 QAT __init__ override and the unused _MILES_PR896_SHA constant.
- Drop the no-op prepare_data config-patching branch: the checkpoint on the volume is already in the right shape.
- Tighten image_run_commands comments to name the actual issue each step addresses instead of paraphrasing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Phase 1 target_modules: drop linear_q_proj (unused: K2.5 has q_lora_rank so MLA decomposes into down+up; linear_q_proj only exists when q_lora_rank is None). Drop linear_fc1/linear_fc2, since the sparse-expert + compressed-tensors + LoRA path is the least-tested. Keep MLA decomposition (q_down/q_up/kv_down/kv_up) + o_proj. Gives 5 × 60 = 300 adapters instead of ~138K when targeting all expert MLPs.
- Enable OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1 + OPEN_TRAINING_INT4_GROUP_SIZE=32 so training forward sees INT4-quantized base weights, matching SGLang's INT4 inference. Without this the LoRA adapters would learn to compensate for a BF16↔INT4 numerical gap that doesn't exist at inference time. LoRA doesn't optimize the base weights so the dequant scratch buffer fits in H200 memory (unlike the full-param smoke where it OOMed).
- Drop the explicit save path; save/save_interval are unset in the parent smoke config, so no checkpoint is written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
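The adapter count quoted for Phase 1 can be checked with a few lines. The module spellings below are taken from the follow-up commit's list, and mapping `o_proj` to `linear_proj` is an assumption; the layer count of 60 is simply the figure used in the commit message's arithmetic.

```python
# Phase 1 target modules: MLA down/up projections plus the output projection.
# Exact spellings are assumed from the later "full general-case set" commit.
PHASE1_TARGET_MODULES = [
    "linear_q_down_proj",
    "linear_q_up_proj",
    "linear_kv_down_proj",
    "linear_kv_up_proj",
    "linear_proj",  # assumed to correspond to o_proj
]
NUM_LAYERS = 60  # layer count used in the commit message

# One LoRA adapter per targeted module per layer.
adapters = len(PHASE1_TARGET_MODULES) * NUM_LAYERS
print(adapters)  # 300
```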

Switch target_modules back to the full general-case set:

linear_q_down_proj, linear_q_up_proj, linear_kv_down_proj, linear_kv_up_proj, linear_proj, linear_fc1, linear_fc2

Per https://thinkingmachines.ai/blog/lora/ attention-only LoRA significantly underperforms MLP-only, and MLP+attention ≈ MLP-only. If the full path works end-to-end, narrower scopes follow trivially, so exercise the hardest code path first: MLA attention LoRA + dense MLP LoRA + shared expert LoRA + sparse expert (GroupedMLP) LoRA on compressed-tensors experts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jvmncs · 2026-04-20T20:47:00Z

I think we should make all these paths relative with something like pathlib.Path(__file__)
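A small sketch of the suggestion: anchor every path to the location of the source file rather than the current working directory. The helper name `resolve_from` and the example paths are hypothetical.

```python
from pathlib import Path


def resolve_from(anchor_file: str, *parts: str) -> Path:
    """Resolve `parts` against the directory containing `anchor_file`."""
    return Path(anchor_file).resolve().parent.joinpath(*parts)


# Inside a module you would pass __file__, for example:
#   _MILES_ROOT = resolve_from(__file__)
#   _MODAL_TRAIN_PATH = resolve_from(__file__, "modal_train.py")
# A literal path is used here so the snippet runs anywhere.
example = resolve_from("/repo/miles/modal_train.py", "configs", "base.py")
print(example)
```

Because the result does not depend on where the process was started, the same code works whether it is launched from the repository root, from miles/, or from a subprocess.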

Introduces miles/configs/model_configuration.py with a ModelConfiguration base class (model_name, model_path, download_model) and a KimiK25 subclass whose download_model() shells out via `modal run` to the existing download_model and convert_kimi_int4_to_bf16 Modal app functions. Adds miles/test_model_configuration.py covering import, construction, the no-CLI-arg-leak guard (via _MILES_SKIP), and NotImplementedError on the base. Round 0 of the surge-kimi-plan RLCR loop (live residuals only; the 1-step smoke goal was met by an external 8x H200 MLA 8k x 256 run on 2026-04-21). Companion updates to _MILES_SKIP and the two experiment scanners are already in 🆕 v0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
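The base-class contract described above (three members, with `download_model` raising `NotImplementedError` on the base) might be sketched as follows. The constructor signature and the `DummyModelConfiguration` subclass are assumptions for illustration; the real KimiK25 subclass shells out to `modal run` instead.

```python
class ModelConfiguration:
    """Base class: holds identifiers, leaves downloading to subclasses."""

    def __init__(self, model_name: str, model_path: str) -> None:
        self.model_name = model_name
        self.model_path = model_path

    def download_model(self) -> None:
        raise NotImplementedError(
            f"{type(self).__name__} must implement download_model()"
        )


class DummyModelConfiguration(ModelConfiguration):
    """Hypothetical subclass used only to show the override point."""

    def __init__(self) -> None:
        super().__init__(model_name="dummy", model_path="/models/dummy")
        self.downloaded = False

    def download_model(self) -> None:
        self.downloaded = True  # stand-in for the real download step
```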

joyliu-q · 2026-04-21T22:19:23Z

+_MODAL_TRAIN_PATH = _MILES_ROOT / "modal_train.py"
+
+
+class KimiK25ModelConfiguration(ModelConfiguration):


TODO: Add more model configs and port other model configs over to use this pattern.

devin-ai-integration

Devin Review found 1 potential issue.


devin-ai-integration · 2026-05-29T22:35:10Z

+        # docs/agent-modal-training.md.
+        env = {**os.environ, "EXPERIMENT_CONFIG": "kimi_k25"}
+        subprocess.run(
+            ["modal", "run", f"{_MODAL_TRAIN_PATH}::download_model"],


🔴 KimiK25ModelConfiguration.download_model references non-existent modal function

miles/configs/model_configuration.py:49 invokes modal run modal_train.py::download_model, but the download_model function was removed in this PR and replaced with prepare_model (miles/modal_train.py:127). When KimiK25ModelConfiguration.download_model() is called (e.g. from the test file or agent workflows), the subprocess will fail with a Modal "function not found" error because no function named download_model exists in the app.

Suggested change

-["modal", "run", f"{_MODAL_TRAIN_PATH}::download_model"],
+["modal", "run", f"{_MODAL_TRAIN_PATH}::prepare_model"],
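One way to make this kind of mismatch easier to spot is to build the command in a single helper that takes the function name as a parameter, so tests can assert on it without launching Modal. The helper `build_modal_run` is hypothetical; the `modal run <file>::<function>` form and the `EXPERIMENT_CONFIG` variable are the ones shown in the diff.

```python
import os
from pathlib import Path
from typing import Dict, List, Tuple


def build_modal_run(
    modal_train_path: Path, function_name: str, experiment_config: str
) -> Tuple[List[str], Dict[str, str]]:
    """Build the argv and environment for `modal run <file>::<function>`."""
    cmd = ["modal", "run", f"{modal_train_path}::{function_name}"]
    # Copy the current environment and add the config selector on top.
    env = {**os.environ, "EXPERIMENT_CONFIG": experiment_config}
    return cmd, env


cmd, env = build_modal_run(Path("miles/modal_train.py"), "prepare_model", "kimi_k25")
# To execute: subprocess.run(cmd, env=env, check=True)
```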


nanjiangwill and others added 21 commits April 17, 2026 20:23

remove miles v1

54852c2

Add miles v2 implementation plan and draft

dda1ce8


Enable WandB in miles smoke-test config

35f67aa

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kimi 2.5 working full param

5d5144c

working kimi 2.5 lora

f1a2ceb

working kimi 2.5 lora

138aa31

cleanup

2dbe593

cleanup

1180b38

cleanup

f4e3465

install megatron bridge

f4904ae

cleanup

e74993a

cleanup

5c32ddb

nanjiangwill and others added 8 commits April 21, 2026 18:36

working kimi 2.5 lora mla without oom

53ffe81

🐛 Fix imports to be global not relative for patches

fa4c243

🎉 Add path resolution

d5c9bd3

🔥 Remove some agent files

b775367

🎨 Directory detection

d4768ce

🆕 v0

5b29802

chore: gitignore .claude/ local state

2914c58

joyliu-q force-pushed the joy/improve-nan-miles-refactor branch from 8cd4a8a to 2914c58 Compare April 21, 2026 20:30

joyliu-q marked this pull request as ready for review April 21, 2026 22:18


nanjiangwill force-pushed the nan/miles-refactor branch 2 times, most recently from 4e5df13 to f19d3af Compare May 29, 2026 21:50

Base automatically changed from nan/miles-refactor to main May 29, 2026 22:32

devin-ai-integration Bot reviewed May 29, 2026


joyliu-q wants to merge 29 commits into main from joy/improve-nan-miles-refactor


3 participants
