Fix imports to be global not relative for patches #76
Adds memory, cloud, and region fields to ModalConfig and wires them into the @app.function decorator in slime/modal_train.py so configs can request per-container memory limits and pin cloud/region. The 355B base config (glm47_355b_a32b) sets memory=(1024, 2 TiB) so the trainer host has headroom for the FP32 delta baseline copy without OOM. The non-colocated variant now inherits via `modal = _base.modal` instead of redeclaring, and its docstring is refreshed (EP=16 is unchanged from base, not reduced). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
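A minimal sketch of how such fields might be forwarded to the decorator, assuming a dataclass-style config. The `gpu` field, the helper name `function_kwargs`, and the example values are hypothetical and not taken from slime/modal_train.py. Memory is assumed to be expressed in MiB as a (request, limit) pair, so 2 TiB becomes 2 * 1024 * 1024.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

MIB_PER_TIB = 1024 * 1024  # memory values below are assumed to be in MiB


@dataclass
class ModalConfig:
    # Hypothetical subset of the real config; only the new fields matter here.
    gpu: str = "H200"
    memory: Optional[Union[int, Tuple[int, int]]] = None  # request or (request, limit)
    cloud: Optional[str] = None
    region: Optional[str] = None

    def function_kwargs(self) -> dict:
        """Collect keyword arguments for the function decorator, skipping unset fields."""
        kwargs = {"gpu": self.gpu}
        for name in ("memory", "cloud", "region"):
            value = getattr(self, name)
            if value is not None:
                kwargs[name] = value
        return kwargs


# Mirrors the memory=(1024, 2 TiB) setting described for the 355B base config.
base = ModalConfig(memory=(1024, 2 * MIB_PER_TIB))
decorator_kwargs = base.function_kwargs()
```

Leaving unset fields out of the dictionary keeps the decorator's own defaults in effect for configs that do not pin a cloud or region.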
Plan generated via gen-plan with 3 Codex convergence rounds. Covers miles/ Modal launcher mirroring slime/ contract, Qwen3-4B LoRA smoke test, and 7 acceptance criteria. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the miles/ launcher directory mirroring the slime/ contract:
- configs/base.py: ModalConfig + MilesConfig with reflection-based CLI, JSON config field support, and NCCL_NVLS_ENABLE=1 in base env
- modal_train.py: 5 Modal functions (list_configs, download_model, prepare_dataset, convert_checkpoint, train)
- modal_helpers/utils.py: cluster context, command building, config prep
- configs/qwen3_4b_lora_smoke.py: Qwen3-4B LoRA verification config (num_rollout=2, bridge mode, 4x H200 colocated)

Docker image: radixark/miles:dev-202604101227
Volumes: huggingface-cache (shared), miles-data, miles-checkpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix prepare_data() to use zhuzilin/dapo-math-17k and zhuzilin/aime-2024 (matching slime sibling configs), add os.makedirs before download
- Expand README with prerequisites, secrets, step-by-step workflow, config authoring guide, YAML/JSON fields, and patch injection docs (matching slime/README.md quality)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The train function now only attaches wandb-secret when the experiment config has use_wandb=True. The smoke-test config (which disables WandB) no longer depends on this secret being present. Also documents required Modal secrets in README prerequisites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal evaluates decorator args locally and remotely, so conditional secrets cause dependency mismatch errors. Always attach wandb-secret unconditionally, matching slime's pattern. Also picks up user config changes (H100, 8 GPUs per node). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- configs/kimi_k25_fullparam_smoke.py: 8x H200 colocated, bridge mode, freezes everything except layer 60 for smoke testing. Depends on the K2.5 weight-update fixes landed in the miles repo.
- configs/kimi_k25_lora.py: LoRA training on top of smoke config. Targets MLA attention projections and MLP linear layers (rank 32, alpha 32).
- test_bridge.py: single-GPU Modal script that exercises bridge registration, provider construction, and model instantiation. Used for fast iteration when debugging bridge/provider issues without a full multi-node run.
- configs/base.py: add memory and image_env fields to ModalConfig so configs can request per-container RAM (K2.5 needs ~1.8 TiB) and bake LD_LIBRARY_PATH into the image.
- modal_train.py: bump base image to radixark/miles:dev (rolling), always prepend system libs to LD_LIBRARY_PATH, and add a convert_kimi_int4_to_bf16 helper to prepare the bf16 checkpoint from moonshotai/Kimi-K2.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop sglang_pynccl_nonfatal.patch.py from image_run_commands (patch was removed from miles repo).
- Remove the commented-out INT4 QAT __init__ override and the unused _MILES_PR896_SHA constant.
- Drop the no-op prepare_data config-patching branch; the checkpoint on the volume is already in the right shape.
- Tighten image_run_commands comments to name the actual issue each step addresses instead of paraphrasing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Phase 1 target_modules: drop linear_q_proj (unused: K2.5 has q_lora_rank, so MLA decomposes into down+up; linear_q_proj only exists when q_lora_rank is None). Drop linear_fc1/linear_fc2, because the sparse-expert + compressed-tensors + LoRA path is the least-tested. Keep MLA decomposition (q_down/q_up/kv_down/kv_up) + o_proj. Gives 5 × 60 = 300 adapters instead of ~138K when targeting all expert MLPs.
- Enable OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1 + OPEN_TRAINING_INT4_GROUP_SIZE=32 so training forward sees INT4-quantized base weights, matching SGLang's INT4 inference. Without this the LoRA adapters would learn to compensate for a BF16↔INT4 numerical gap that doesn't exist at inference time. LoRA doesn't optimize the base weights, so the dequant scratch buffer fits in H200 memory (unlike the full-param smoke, where it OOMed).
- Drop the explicit save path; save/save_interval are unset in the parent smoke config, so no checkpoint is written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
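The 5 × 60 figure can be reproduced with a short calculation. This is only an illustration: the list uses the shorthand names from the message above, the layer count is the one implied by that figure, and `count_adapters` is a hypothetical helper rather than anything in the miles repo.

```python
# Shorthand names for the MLA decomposition plus the output projection.
PHASE1_TARGETS = ["q_down", "q_up", "kv_down", "kv_up", "o_proj"]

# Layer count implied by the "5 x 60" figure in the commit message.
NUM_LAYERS = 60


def count_adapters(targets, num_layers):
    """One adapter per targeted module per layer (assumes every layer has each module)."""
    return len(targets) * num_layers


phase1_adapters = count_adapters(PHASE1_TARGETS, NUM_LAYERS)
print(phase1_adapters)  # 300
```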
Switch target_modules back to the full general-case set:
linear_q_down_proj, linear_q_up_proj, linear_kv_down_proj, linear_kv_up_proj, linear_proj, linear_fc1, linear_fc2

Per https://thinkingmachines.ai/blog/lora/, attention-only LoRA significantly underperforms MLP-only, and MLP+attention ≈ MLP-only. If the full path works end-to-end, narrower scopes follow trivially, so exercise the hardest code path first: MLA attention LoRA + dense MLP LoRA + shared expert LoRA + sparse expert (GroupedMLP) LoRA on compressed-tensors experts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I think we should make all these paths relative, with something like
Introduces miles/configs/model_configuration.py with a ModelConfiguration base class (model_name, model_path, download_model) and a KimiK25 subclass whose download_model() shells out via `modal run` to the existing download_model and convert_kimi_int4_to_bf16 Modal app functions. Adds miles/test_model_configuration.py covering import, construction, the no-CLI-arg-leak guard (via _MILES_SKIP), and NotImplementedError on the base. Round 0 of the surge-kimi-plan RLCR loop (live residuals only; the 1-step smoke goal was met by an external 8x H200 MLA 8k x 256 run on 2026-04-21). Companion updates to _MILES_SKIP and the two experiment scanners are already in 🆕 v0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
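A minimal sketch of the base-class contract described above, assuming a dataclass with the two named fields. The real class in miles/configs/model_configuration.py may be structured differently, and the example name and path below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelConfiguration:
    """Base class: subclasses are expected to override download_model()."""

    model_name: str
    model_path: str

    def download_model(self) -> None:
        # The base class deliberately has no download logic of its own.
        raise NotImplementedError(
            f"{type(self).__name__} must implement download_model()"
        )


# Hypothetical values, used only to exercise the base-class behaviour.
base = ModelConfiguration(model_name="example-model", model_path="/checkpoints/example")

try:
    base.download_model()
    raised = False
except NotImplementedError:
    raised = True
```

Raising from the base method means a subclass that forgets to override it fails on first use rather than silently doing nothing.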
8cd4a8a to 2914c58
    _MODAL_TRAIN_PATH = _MILES_ROOT / "modal_train.py"


    class KimiK25ModelConfiguration(ModelConfiguration):
TODO: Add more model configs and port other model configs over to use this pattern.
4e5df13 to f19d3af
    # docs/agent-modal-training.md.
    env = {**os.environ, "EXPERIMENT_CONFIG": "kimi_k25"}
    subprocess.run(
        ["modal", "run", f"{_MODAL_TRAIN_PATH}::download_model"],
🔴 KimiK25ModelConfiguration.download_model references non-existent modal function
miles/configs/model_configuration.py:49 invokes `modal run modal_train.py::download_model`, but the download_model function was removed in this PR and replaced with prepare_model (miles/modal_train.py:127). When KimiK25ModelConfiguration.download_model() is called (e.g. from the test file or agent workflows), the subprocess will fail with a Modal "function not found" error, because the app no longer contains a function named download_model.
Suggested change:
    -    ["modal", "run", f"{_MODAL_TRAIN_PATH}::download_model"],
    +    ["modal", "run", f"{_MODAL_TRAIN_PATH}::prepare_model"],
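One way to keep the function name from drifting again is to build the command in a single helper. The sketch below is an assumption about how that could look, not the PR's code: `build_modal_run` and the example path are hypothetical. It only constructs the argument list and environment, so nothing is executed.

```python
import os
from pathlib import Path
from typing import Dict, List, Tuple


def build_modal_run(
    modal_train_path: Path, function_name: str, experiment_config: str
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for `modal run <file>::<function>` without running it."""
    argv = ["modal", "run", f"{modal_train_path}::{function_name}"]
    # Copy the current environment and select the experiment config by name.
    env = {**os.environ, "EXPERIMENT_CONFIG": experiment_config}
    return argv, env


# Hypothetical path; the function name follows the suggested change above.
argv, env = build_modal_run(Path("miles/modal_train.py"), "prepare_model", "kimi_k25")
```

The result could then be passed along as `subprocess.run(argv, env=env, check=True)`, where `check=True` turns a "function not found" failure into an exception.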
Will add other fixes if I need to make any to get this running out of the box!