Align generation prompts with review rubric to cut the ~70% repair rate (font sizing, design excellence) #8643

@MarkusNeusinger

Observation

During the 2026-06-10 bulk regeneration (pp-basic + 7 old specs, 15 libraries each, model sonnet), the window 2026-06-09T22:00Z..2026-06-10T06:30Z shows:

  • 112 impl-generate runs and 78 impl-repair runs (78 / 112 ≈ 70%), i.e. roughly 70% of implementations needed at least one repair round, assuming one repair run per repaired implementation (pp-basic: 6 of 7 new libs).
  • Zero technical failures — every implementation eventually merged with final scores 84–93. The repairs were driven almost entirely by review feedback, not broken code.

That means the repair loop is currently functioning as a systematic second generation pass: each round costs one extra Claude generate run + one extra review run and adds ~20–30 min latency per implementation. At catalog scale (327 specs × 15 libraries) this roughly doubles LLM usage for predictable, recurring feedback.
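The arithmetic behind that estimate can be made explicit. The following is a rough sketch in Python that uses only the figures quoted above. It assumes that every repair round costs exactly one generate run plus one review run, and that the ratio observed in this window holds for the whole catalog; both are extrapolations, not measurements.

```python
# Figures quoted in the observation window above.
GENERATE_RUNS = 112
REPAIR_RUNS = 78
SPECS = 327
LIBRARIES_PER_SPEC = 15


def repair_ratio(generate_runs: int, repair_runs: int) -> float:
    """Repair runs per initial generate run."""
    return repair_runs / generate_runs


def extra_runs_at_scale(specs: int, libs: int, ratio: float) -> int:
    """Estimated extra LLM runs caused by repairs across the catalog.

    Assumption: each repair round costs one generate run and one review run.
    """
    implementations = specs * libs
    repair_rounds = implementations * ratio
    return round(repair_rounds * 2)


ratio = repair_ratio(GENERATE_RUNS, REPAIR_RUNS)
baseline_runs = SPECS * LIBRARIES_PER_SPEC * 2  # one generate + one review each
extra_runs = extra_runs_at_scale(SPECS, LIBRARIES_PER_SPEC, ratio)

print(f"repair ratio: {ratio:.1%}")
print(f"baseline runs: {baseline_runs}, estimated extra runs: {extra_runs}")
```

Under these assumptions the repair loop adds on the order of 70% on top of the baseline generate and review runs; implementations that need more than one repair round would push the overhead higher.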

The recurring rejection reasons

  1. Font sizes below style-guide targets (VQ-01): the most consistent pattern. Example: pp-basic/makie attempt 1 (PR #8528, "feat(makie): implement pp-basic") was rejected at 85/100 with titlesize=20, which is ≈ 40px effective on the 2400×2400 canvas, against the review rubric's targets of ~67px for the title and ~53px for axis labels. The reviewer even prescribed the exact fix (titlesize=28, labels 20, ticks 16). The generation prompts apparently don't state these numeric targets, yet the review rubric scores against them.
  2. Design-excellence deductions (DE-01 sophistication, DE-03 storytelling) — "correct but not elevated": missing visual hierarchy, no emphasis/annotation guiding the eye, no confidence band / deviation encoding. Reviews repeatedly suggest the same class of fixes.
  3. Occasionally missed spec-required features (SC-02), e.g. scatter-connected-temporal/muix attempt 1 (PR #8545, "feat(muix): implement scatter-connected-temporal") was rejected at 81/100 with two spec-required features missing; the final score after repair was 90.
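To illustrate the font-size gap in item 1, here is a minimal Python sketch that converts a library font size to effective pixels and reports the shortfall against the rubric targets. The scale factor of 2.0 is an assumption inferred from the single data point "titlesize=20 ≈ 40px" on the 2400×2400 canvas; the real conversion may differ per library and canvas. The function and constant names are hypothetical, and the targets are the approximate values quoted above.

```python
# Assumption: effective pixels = library font size * scale factor.
# The factor 2.0 is inferred from "titlesize=20 ≈ 40px" on the 2400x2400 canvas.
ASSUMED_SCALE = 2.0

# Approximate rubric targets quoted in this issue, in pixels.
TARGET_PX = {"title": 67, "axis_label": 53}


def effective_px(font_size: float, scale: float = ASSUMED_SCALE) -> float:
    """Convert a library font size to effective canvas pixels."""
    return font_size * scale


def shortfall_px(role: str, font_size: float, scale: float = ASSUMED_SCALE) -> float:
    """Pixels missing to reach the target for `role`; 0.0 if the target is met."""
    return max(0.0, TARGET_PX[role] - effective_px(font_size, scale))


print(effective_px(20))           # the rejected makie title size
print(shortfall_px("title", 20))  # gap to the ~67px title target
```

A check of this kind could run before review, provided the style guide states the same numeric targets and the per-library scale factors that the rubric uses.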

Suggested direction (to be discussed)

  • Add the rubric's numeric pixel targets per canvas size (3200×1800 / 2400×2400) to prompts/default-style-guide.md — ideally including the per-library translation (e.g. Makie titlesize≈28, CSS-px libs ~22px in the 1600×900 mount, etc.), so generation and review use the same numbers.
  • Port the review rubric's DE expectations into the generation prompts as a short design-excellence checklist (one deliberate visual-hierarchy/storytelling element per plot).
  • Add an explicit self-check against the spec's required features to the generation prompt before the snippet is committed.
  • Optionally: record the repair-reason category in metadata so the effect of prompt changes on the repair rate is measurable.
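The last two suggestions could look roughly like the Python sketch below: a set-based self-check for spec-required features, and a tally of repair reasons by category. Everything here is hypothetical: the category names, the function names, and the idea that a review exposes its rubric codes as a list of strings are illustration only, not the project's actual metadata schema. Only the rubric codes (VQ-01, DE-01, DE-03, SC-02) come from this issue.

```python
from collections import Counter

# Rubric codes mentioned in this issue, mapped to hypothetical category names.
CATEGORY_BY_CODE = {
    "VQ-01": "font_size",
    "DE-01": "design_excellence",
    "DE-03": "design_excellence",
    "SC-02": "missing_spec_feature",
}


def missing_features(required: set[str], implemented: set[str]) -> list[str]:
    """Spec-required features that the implementation does not declare."""
    return sorted(required - implemented)


def categorize(codes: list[str]) -> list[str]:
    """Unique repair-reason categories for the rubric codes cited in one review."""
    return sorted({CATEGORY_BY_CODE.get(code, "other") for code in codes})


def tally(reviews: list[list[str]]) -> Counter:
    """Count how many reviews cited each repair-reason category."""
    counts: Counter = Counter()
    for codes in reviews:
        counts.update(categorize(codes))
    return counts


# Illustrative input only, not real review data.
example_reviews = [["VQ-01"], ["VQ-01", "DE-01", "DE-03"], ["SC-02"]]
print(tally(example_reviews))
print(missing_features({"markers", "line", "time_axis"}, {"markers", "line"}))
```

Counting each category at most once per review keeps the tally comparable across reviews that cite several codes from the same family, so a drop in the font_size count after a prompt change would be directly visible.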

Reference data

Labels: enhancement (New feature or request), infrastructure (Workflow, backend, or frontend issue)
