Wire slm.evaluate() into the train loop / studio / server #2

`slm.evaluate()` (#1) ships the SDK and CLI surface, but nothing inside the product calls it yet: it is a leaf reachable only via `slm.evaluate(...)` and `shadowlm eval`. The task-quality score it produces is the "eval gate" the product thesis depends on ("run the shadow until it does the job as well as the frontier, then switch"), so it should be wired into the loop:

- [ ] **`finetune()` eval-on-holdout** — pass an eval set, attach the task-quality score to `TrainingRun` alongside `eval_loss`. This is the actual eval gate.
- [ ] **Studio** — surface the score in Runs / Playground.
- [ ] **`/v1/evaluate` endpoint** — so the studio can evaluate.
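
A rough sketch of what the first item could look like, to make the gate concrete. Everything here is an assumption for illustration: the real `TrainingRun` fields, the `finetune()` signature, and the return type of `slm.evaluate()` may differ. The names `task_quality`, `attach_task_quality`, `passes_gate`, `frontier_score`, and `tolerance` are hypothetical, and the evaluator is passed in as a plain callable rather than calling `slm.evaluate()` directly.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


# Hypothetical shape: the real TrainingRun may carry different fields.
@dataclass
class TrainingRun:
    eval_loss: float
    task_quality: Optional[float] = None  # set only when an eval set is passed


def attach_task_quality(
    run: TrainingRun,
    eval_set: Optional[Sequence[dict]],
    evaluate: Callable[[Sequence[dict]], float],
) -> TrainingRun:
    """Score the holdout set and record the result on the run."""
    if eval_set:  # None or empty: leave task_quality unset
        run.task_quality = evaluate(eval_set)
    return run


def passes_gate(run: TrainingRun, frontier_score: float, tolerance: float = 0.0) -> bool:
    """True once the shadow scores within `tolerance` of the frontier."""
    if run.task_quality is None:
        return False  # no score means the gate cannot open
    return run.task_quality >= frontier_score - tolerance


# Usage with a stand-in evaluator that returns a fixed score.
run = attach_task_quality(
    TrainingRun(eval_loss=0.42),
    [{"input": "q", "expected": "a"}],
    lambda rows: 0.91,
)
print(run.task_quality, passes_gate(run, frontier_score=0.93, tolerance=0.05))
```

Keeping `task_quality` optional means runs without an eval set still produce a valid `TrainingRun`, and the gate stays closed by default instead of falling back to `eval_loss`.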

None of these blocks merging #1; this issue tracks turning the primitive into something the loop consumes.

