Align generation prompts with review rubric to cut the ~70% repair rate (font sizing, design excellence) #8643

## Observation

During the 2026-06-10 bulk regeneration (pp-basic + 7 old specs, 15 libraries each, model sonnet), the window 2026-06-09T22:00Z..2026-06-10T06:30Z shows:

- **112 impl-generate runs, 78 impl-repair runs** → roughly **70% of implementations needed at least one repair round** (78/112 ≈ 0.70, assuming at most one repair round per implementation; pp-basic: 6 of 7 new libs).
- **Zero technical failures** — every implementation eventually merged with final scores 84–93. The repairs were driven almost entirely by review feedback, not broken code.

That means the repair loop currently functions as a systematic second generation pass. Each round costs one extra Claude generate run plus one extra review run and adds ~20–30 min of latency per implementation. At catalog scale (327 specs × 15 libraries) a ~70% repair rate comes close to doubling LLM usage, all for predictable, recurring feedback.
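As a rough back-of-the-envelope check of the scale argument, the numbers above can be extrapolated as follows. This is only a sketch: it assumes the repair rate seen in this window holds across the whole catalog, that each repaired implementation needs exactly one extra round, and that the latency figure is cumulative pipeline time rather than wall-clock time.

```python
# Figures quoted in this issue.
GENERATE_RUNS = 112
REPAIR_RUNS = 78
SPECS = 327
LIBRARIES = 15
LATENCY_MIN = (20, 30)  # extra minutes per repair round (low, high)

# Assumption: one repair round per repaired implementation.
repair_rate = REPAIR_RUNS / GENERATE_RUNS

# Assumption: the window's repair rate carries over to the full catalog.
implementations = SPECS * LIBRARIES
extra_rounds = round(implementations * repair_rate)

# Cumulative added latency in hours, summed over all repair rounds.
extra_hours = tuple(extra_rounds * minutes / 60 for minutes in LATENCY_MIN)

print(f"repair rate: {repair_rate:.1%}")
print(f"implementations at catalog scale: {implementations}")
print(f"extra generate+review rounds: {extra_rounds}")
print(f"added latency: {extra_hours[0]:.0f} to {extra_hours[1]:.0f} hours")
```

Under these assumptions the catalog would incur about 3,400 extra generate-plus-review rounds on top of 4,905 first attempts.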

## The recurring rejection reasons

1. **Font sizes below style-guide targets (VQ-01)** — the most consistent pattern. Example: pp-basic/makie attempt 1 (PR #8528) was rejected at 85/100 with `titlesize=20` ≈ 40px effective on the 2400×2400 canvas vs the review rubric's ~67px title / ~53px axis-label targets. The reviewer even prescribed the exact fix (`titlesize=28`, labels 20, ticks 16). The generation prompts apparently don't state these numeric targets, but the review rubric scores against them.
2. **Design-excellence deductions (DE-01 sophistication, DE-03 storytelling)** — "correct but not elevated": missing visual hierarchy, no emphasis/annotation guiding the eye, no confidence band / deviation encoding. Reviews repeatedly suggest the same class of fixes.
3. **Occasionally missed spec-required features (SC-02)** — e.g. scatter-connected-temporal/muix attempt 1 (PR #8545) rejected at 81/100 with two spec-required features missing; final score after repair was 90.
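The font-size rejection in item 1 is mechanical enough to be checked before review. The sketch below is illustrative only: the factor of 2 px per font unit is inferred from the single data point in this issue (`titlesize=20` ≈ 40px on the 2400×2400 canvas), and the function names are hypothetical, not part of the existing pipeline.

```python
# Approximate rubric targets quoted in this issue for the 2400x2400 canvas.
TARGET_PX = {"title": 67, "axis_label": 53}

# Assumption: 1 Makie font unit is about 2 effective pixels on this canvas,
# inferred from "titlesize=20 is about 40px". Confirm against the rubric.
PX_PER_UNIT = 2.0


def effective_px(font_units, px_per_unit=PX_PER_UNIT):
    """Convert a library font size into effective canvas pixels."""
    return font_units * px_per_unit


def shortfall(role, font_units):
    """Return how many pixels a font size falls below its target (0 if it meets it)."""
    return max(0.0, TARGET_PX[role] - effective_px(font_units))


if __name__ == "__main__":
    for size in (20, 28):
        print(f"titlesize={size}: {effective_px(size):.0f}px, "
              f"short by {shortfall('title', size):.0f}px")
```

With the assumed factor, the prescribed `titlesize=28` maps to 56px, which is still under the ~67px target, so the exact scale factor or the rubric's tolerance would need to be confirmed before encoding numbers like these in the style guide.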

## Suggested direction (to be discussed)

- Add the rubric's **numeric pixel targets per canvas size** (3200×1800 / 2400×2400) to `prompts/default-style-guide.md`, ideally including the per-library translation (e.g. Makie `titlesize≈28`, CSS-px libs ~22px in the 1600×900 mount), so generation and review use the same numbers.
- Port the review rubric's DE expectations into the generation prompts as a short **design-excellence checklist** (one deliberate visual-hierarchy/storytelling element per plot).
- Add an explicit **self-check against the spec's required features** to the generation prompt before the snippet is committed.
- Optionally: record the repair-reason category in metadata so the effect of prompt changes on the repair rate is measurable.
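To make the last point concrete, here is a minimal sketch of how repair reasons could be tallied once they are recorded. The record layout and the `repair_reasons` field name are hypothetical, and the third record is a placeholder; only the rubric codes (VQ-01, SC-02) and the first two spec/library pairs come from this issue.

```python
from collections import Counter

# Hypothetical metadata records; field names are illustrative only.
records = [
    {"spec": "pp-basic", "library": "makie", "repair_reasons": ["VQ-01"]},
    {"spec": "scatter-connected-temporal", "library": "muix",
     "repair_reasons": ["SC-02"]},
    # Placeholder for an implementation that merged without a repair round.
    {"spec": "example-spec", "library": "example-lib", "repair_reasons": []},
]


def repair_rate(records):
    """Fraction of implementations with at least one recorded repair reason."""
    if not records:
        return 0.0
    repaired = sum(1 for r in records if r["repair_reasons"])
    return repaired / len(records)


def reason_counts(records):
    """Count how often each rubric code triggered a repair."""
    return Counter(code for r in records for code in r["repair_reasons"])


if __name__ == "__main__":
    print(f"repair rate: {repair_rate(records):.0%}")
    for code, count in reason_counts(records).most_common():
        print(f"{code}: {count}")
```

Comparing these two numbers before and after a prompt change would show whether the change actually moved the repair rate, and for which rubric category.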

## Reference data

- Run window: 2026-06-09T22:00Z – 2026-06-10T06:30Z (pp-basic, scatter-connected-temporal, line-yield-curve, acf-pacf, recurrence-basic, funnel-meta-analysis, line-load-duration, area-elevation-profile)
- Example PRs: #8528 (makie pp-basic, 85 REJECTED → 87 merged), #8545 (muix scatter-connected-temporal, 81 REJECTED → 90 merged), #8521/#8522 (highcharts/muix pp-basic, each one repair round)
- Final muix scores across the run: 84–91; highcharts: 84–91. The repairs work; they are just predictable.
