Commit e6171f3

Add criterion scoring and raise typing basics
1 parent 158bea0 commit e6171f3

17 files changed

Lines changed: 623 additions & 268 deletions

Makefile

Lines changed: 5 additions & 2 deletions
@@ -1,4 +1,4 @@
-.PHONY: test embed-examples build check-generated fingerprint browser-layout-test seo-cache-lint verify-examples check-registry-integrity check-confusable-pairs check-broad-surface-tours check-footgun-coverage check-notes-supported check-quality-scores check-no-figure-rationales check-journey-outcomes quality-checks format-examples verify-python-version verify smoke-deployment dev deploy lint
+.PHONY: test embed-examples build check-generated fingerprint browser-layout-test seo-cache-lint verify-examples check-registry-integrity check-confusable-pairs check-broad-surface-tours check-footgun-coverage check-notes-supported score-example-criteria check-quality-scores check-no-figure-rationales check-journey-outcomes quality-checks format-examples verify-python-version verify smoke-deployment dev deploy lint
 
 test:
 	python3 -m unittest discover -s tests -v
@@ -38,6 +38,9 @@ check-footgun-coverage:
 check-notes-supported:
 	scripts/check_notes_supported.py
 
+score-example-criteria:
+	scripts/score_example_criteria.py --limit 12
+
 check-quality-scores:
 	scripts/check_quality_scores.py
 
@@ -47,7 +50,7 @@ check-no-figure-rationales:
 check-journey-outcomes:
 	scripts/check_journey_outcomes.py
 
-quality-checks: check-registry-integrity check-confusable-pairs check-broad-surface-tours check-footgun-coverage check-notes-supported check-quality-scores check-no-figure-rationales check-journey-outcomes
+quality-checks: check-registry-integrity check-confusable-pairs check-broad-surface-tours check-footgun-coverage check-notes-supported score-example-criteria check-quality-scores check-no-figure-rationales check-journey-outcomes
 
 format-examples:
 	scripts/format_examples.py

docs/lessons-learned.md

Lines changed: 1 addition & 0 deletions
@@ -115,4 +115,5 @@ git diff --check
 - **Quality debt must be tracked, not normalized away.** `docs/example-quality-rubric.md` sets a 9.0 target and `scripts/check_quality_scores.py` enforces the score registry: pages below the hard minimum need a concrete improvement backlog entry, stale backlog entries fail once a page clears the gate, and Hello World is the only standing waiver because first examples are traditionally tiny. A score below target is allowed only when the remaining work is named.
 - **No-figure decisions need a registry.** Some examples should not have figures, but that cannot be an invisible omission. `scripts/check_no_figure_rationales.py` validates `no_figure_rationales` so future constraint-shaped pages can opt out explicitly instead of shipping weak diagrams.
 - **Journey sections need outcome contracts.** `scripts/check_journey_outcomes.py` ties each journey section to learner outcomes and support examples so journey pages stay mental maps rather than catalog slices.
+- **Opaque scores hide the next move.** `scripts/score_example_criteria.py` breaks each page into rubric criteria so quality work can target decomposition, boundaries, source/result pairing, graph support, or practical payoff directly. `docs/quality-search.md` records the hill-climbing and simulated-annealing loop for escaping locally tidy but globally weak page shapes.
 - **Deployment smoke belongs beside CI.** `scripts/smoke_deployment.py` checks rendered Worker pages, runtime-boundary pages, journey pages, prototype review pages, and representative Dynamic Worker POST runs for HTTP failures, exception markers, and stale edited-code output. Build success is not enough; the deployed Worker must render and execute edited examples.

docs/quality-registries.toml

Lines changed: 0 additions & 44 deletions
@@ -196,30 +196,6 @@ expires = "never"
 # figure would distort the lesson. Current production attaches figures
 # to every example, so this registry is intentionally empty.
 
-[quality_improvement_backlog.constants]
-cause = "convention page needs stronger boundary against Final and ordinary variables"
-next_action = "add a cell contrasting naming convention with typing.Final and runtime rebinding"
-
-[quality_improvement_backlog.truthiness]
-cause = "truth-value protocol is under-linked to booleans and special methods"
-next_action = "add a boundary cell that predicts bool() for empty containers, None, and custom __bool__"
-
-[quality_improvement_backlog.virtual-environments]
-cause = "runtime-boundary page is constrained by Dynamic Workers and needs stronger standard-Python path"
-next_action = "teach venv creation/activation as unsupported Standard Python, then show local dependency evidence"
-
-[quality_improvement_backlog.literal-and-final]
-cause = "advanced type page compresses Literal and Final into one cell"
-next_action = "split value restriction from rebinding restriction and show runtime annotations boundary"
-
-[quality_improvement_backlog.paramspec]
-cause = "single-cell advanced typing page hides why ordinary Callable loses parameter shape"
-next_action = "add a decorator typed with Callable[..., T] before the ParamSpec-preserving version"
-
-[quality_improvement_backlog.number-parsing]
-cause = "parsing page lacks enough failure/recovery shape"
-next_action = "add cells for int base handling, ValueError recovery, and validation boundary"
-
 [quality_improvement_backlog.values]
 cause = "foundational page is graph-linked now but still needs a sharper object/type mental model"
 next_action = "add or revise a cell that connects value, type, and operation with a nearby See also path"
@@ -260,10 +236,6 @@ next_action = "add a cell that chooses a branch from a non-bool value and points
 cause = "match syntax is shown but shape-dispatch vs if/elif boundary could be clearer"
 next_action = "add a comparable if/elif or data-shape cell that makes match's payoff visible"
 
-[quality_improvement_backlog.while-loops]
-cause = "loop shape is shown but for-vs-while decision boundary is thin"
-next_action = "add a state-changing while cell beside an iterable for-loop alternative"
-
 [quality_improvement_backlog.lists]
 cause = "list operations are shown but sequence vs set/dict and mutation boundaries need sharpening"
 next_action = "add a cell contrasting append/index order with set membership or tuple immutability"
@@ -280,10 +252,6 @@ next_action = "add or sharpen cells for get/default, key membership, and safe de
 cause = "set uniqueness is shown but list-vs-set tradeoff and ordering boundary need emphasis"
 next_action = "add a cell comparing membership/duplicates with a list"
 
-[quality_improvement_backlog.slices]
-cause = "slice syntax is shown but off-by-one and copy-vs-view boundaries need stronger evidence"
-next_action = "add a cell showing adjacent slices meeting at the same boundary index"
-
 [quality_improvement_backlog.comprehensions]
 cause = "map/filter shape is shown but eager vs lazy and loop equivalence need stronger progression"
 next_action = "add a generator-expression contrast or explicit loop-equivalence cell"
@@ -304,18 +272,10 @@ next_action = "add a call-site contrast where unnamed booleans are ambiguous"
 cause = "collection of extra arguments is shown but forwarding boundary is underdeveloped"
 next_action = "add a wrapper cell that forwards *args and **kwargs to another callable"
 
-[quality_improvement_backlog.multiple-return-values]
-cause = "tuple return is shown but tuple/unpacking relationship needs a stronger explicit link"
-next_action = "add a cell showing the returned value is a tuple before unpacking"
-
 [quality_improvement_backlog.closures]
 cause = "closure memory is shown but late-binding footgun deserves more adjacent evidence"
 next_action = "add or sharpen loop-closure broken/fixed cells"
 
-[quality_improvement_backlog.recursion]
-cause = "recursive shape is shown but base-case and failure boundaries need more evidence"
-next_action = "add a base-case-first cell and a note on RecursionError/iteration alternative"
-
 [quality_improvement_backlog.lambdas]
 cause = "lambda syntax is shown but def-vs-lambda boundary is too light"
 next_action = "add a cell where lambda is useful as an argument and def is clearer for reuse"
@@ -336,10 +296,6 @@ next_action = "add a before/after cell changing a public attribute into a proper
 cause = "try/except structure is shown but bare-except and cleanup boundaries need sharper evidence"
 next_action = "add a cell contrasting specific exception handling with overbroad catching"
 
-[quality_improvement_backlog.enums]
-cause = "enum values are shown but raw constants/string alternatives need stronger contrast"
-next_action = "add a cell comparing Enum identity/name/value with plain strings"
-
 [quality_improvement_backlog.custom-exceptions]
 cause = "custom exception class is shown but when not to create one is underdeveloped"
 next_action = "add a boundary cell contrasting domain error with built-in ValueError"

docs/quality-search.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+# Rubric-driven quality search
+
+Python By Example now has two complementary scoring loops:
+
+1. `scripts/check_quality_scores.py` is the editorial gate. It enforces the curated score registry, hard-minimum waivers, stale backlog cleanup, weak journey-section tracking, and the 10-point rubric weight model.
+2. `scripts/score_example_criteria.py` is the search aid. It breaks each page into rubric criteria so rewrite work can target the weakest axis instead of treating the score as one opaque number.
+
+The criterion report is deliberately heuristic. It should suggest candidates, not replace editorial review.
+
+## Hill-climbing move types
+
+Use these moves when a page is already close and the weakest criterion is clear:
+
+- **Decompose one compressed cell** into setup, boundary, and payoff cells.
+- **Add a before/after contrast** when the feature exists to remove boilerplate or clarify a shape.
+- **Add a runtime/static boundary cell** for typing pages where runtime behavior differs from type-checker behavior.
+- **Add a failure/recovery cell** for parsing, exceptions, warnings, and validation examples.
+- **Add a standard-Python/Worker-boundary unsupported cell** for runtime features constrained by Dynamic Workers.
+- **Strengthen graph edges** with prerequisite, neighboring, and next-depth `see_also` links.
+- **Replace generic prose** with a concrete domain pressure: user input, package setup, protocol bytes, record shape, service logging, or state transition.
+
+## Escaping local maxima with simulated annealing
+
+Greedy hill-climbing tends to overfit the current page shape: it adds one more note or one more small cell even when the page needs a different structure. For pages stuck around 8.2-8.8, use a simulated-annealing review loop:
+
+1. **State**: the page markdown plus metadata, figure rationale, and graph edges.
+2. **Energy**: `10 - curated_score`, with penalties for weak criterion scores, unsupported runtime ambiguity, graph isolation, empty output evidence, and overlong code runs.
+3. **Neighbor moves**:
+   - split a cell;
+   - merge two repetitive cells;
+   - swap the first example domain;
+   - introduce a contrasting failure case;
+   - move from toy data to realistic data;
+   - convert a figure requirement into a no-figure rationale when the page is constraint-shaped;
+   - add/remove a `see_also` edge;
+   - rewrite the intro around “when to use this”.
+4. **Temperature**: start high enough to accept occasional worse rewrites, especially when they introduce a new structure. Cool after tests, verification, and rubric review pass.
+5. **Acceptance rule**: accept improvements always; accept a worse intermediate with probability based on score loss and temperature only if executable correctness and docs links remain valid.
+6. **Refinement**: after cooling, run `make verify`, the criterion report, and a manual rubric pass before updating the curated score.
+
+This gives the project permission to try non-local changes (different domains, different cell order, or a no-figure rationale) without normalizing failed experiments into production.
+
+## Wider-system unlocks
+
+Future improvements that create new quality headroom:
+
+- Store criterion-level editorial subscores in TOML once the heuristic report stabilizes.
+- Add an authoring command that proposes the top three rewrite moves for a slug from the criterion deficits.
+- Add browser snapshots for representative low-score shapes, not only layout smoke.
+- Track page archetypes (`foundational`, `protocol-boundary`, `static-typing`, `aggregator`, `runtime-constrained`) so rubrics can apply the right expectations.
+- Add a no-figure review path to avoid weak diagrams for constraint-shaped pages.
+- Let CI post a quality delta comment for PRs: scores changed, graph edges changed, weak criteria changed.
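
The energy and acceptance steps in `docs/quality-search.md` are described in prose only. A minimal sketch of how they could be written down follows, assuming the standard Metropolis criterion `exp(-delta / temperature)` as the "probability based on score loss and temperature"; the document does not name a specific formula. The `energy` and `accept` names, the penalty values, and the injectable `rng` parameter are hypothetical.

```python
import math
import random


def energy(curated_score, penalties=()):
    """Lower is better: distance from a perfect 10 plus any penalty terms."""
    return (10.0 - curated_score) + sum(penalties)


def accept(current, candidate, temperature, checks_pass, rng=random.random):
    """Metropolis-style acceptance, gated on correctness checks.

    current and candidate are energies. checks_pass stands in for
    "executable correctness and docs links remain valid".
    """
    if not checks_pass:
        return False  # never keep a rewrite that breaks tests or links
    delta = candidate - current
    if delta <= 0:
        return True  # improvements (and ties, by assumption) are accepted
    if temperature <= 0:
        return False  # fully cooled: only improvements survive
    return rng() < math.exp(-delta / temperature)


# Hypothetical page: curated score 8.5 with one 0.25 penalty for graph isolation.
print(energy(8.5, penalties=[0.25]))
```

Passing a fixed `rng` (for example `lambda: 0.0`) makes the rule deterministic for testing. In an editorial loop the "random draw" would more likely be a reviewer's judgment call, so this is a model of the policy rather than something to run unattended.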

scripts/score_example_criteria.py

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+#!/usr/bin/env python3
+"""Heuristic criterion-level scoring for example quality hill-climbing.
+
+This is not the editorial source of truth. It is a search aid: it breaks the
+rubric into observable criteria so the next rewrite can target the weakest
+axis instead of arguing about one opaque number.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+import tomllib
+from pathlib import Path
+from statistics import mean
+
+ROOT = Path(__file__).resolve().parents[1]
+sys.path.insert(0, str(ROOT))
+
+from src.example_loader import load_examples  # noqa: E402
+from src.marginalia import EXAMPLE_QUALITY_SCORES  # noqa: E402
+
+REGISTRY = ROOT / "docs" / "quality-registries.toml"
+GENERIC_PHRASES = [
+    "it exists to make a common boundary explicit",
+    "the example is small, deterministic",
+    "prefer simpler neighboring tools",
+]
+BOUNDARY_WORDS = re.compile(r"\b(prefer|instead|boundary|when|unless|except|error|raises?|static|runtime|unsupported|footgun|warning)\b", re.I)
+RATIONALE_WORDS = re.compile(r"\b(use|prefer|reach for|when|because|useful|right tool|fit|shape)\b", re.I)
+TOY_WORDS = re.compile(r"\b(foo|bar|baz|spam|eggs)\b", re.I)
+
+
+def clamp(value: float) -> float:
+    return max(0.0, min(1.0, value))
+
+
+def tokenise(text: str) -> set[str]:
+    return {token.lower() for token in re.findall(r"[a-zA-Z_]{3,}", text)}
+
+
+def criterion_scores(example: dict) -> dict[str, float]:
+    prose = "\n".join(example.get("explanation", []))
+    notes = "\n".join(example.get("notes", []))
+    code = example.get("code", "")
+    cells = example.get("cells", [])
+    normal_cells = [cell for cell in cells if cell.get("kind") == "cell"]
+    unsupported_cells = [cell for cell in cells if cell.get("kind") == "unsupported"]
+    outputs = [cell.get("output", "") for cell in normal_cells]
+    all_text = "\n".join([example.get("summary", ""), prose, notes])
+    cell_prose = [" ".join(cell.get("prose", [])) for cell in normal_cells]
+    distinct_cell_starts = len({text[:60] for text in cell_prose})
+    output_lines = sum(len(output.splitlines()) for output in outputs if output)
+    code_tokens = tokenise(code)
+    prose_tokens = tokenise(all_text)
+    overlap = len(code_tokens & prose_tokens)
+    generic_penalty = 0.2 * sum(phrase in all_text.lower() for phrase in GENERIC_PHRASES)
+
+    return {
+        "conceptual_payoff": clamp(0.45 + min(len(prose) / 900, 0.35) + min(overlap / 30, 0.2) - generic_penalty),
+        "rationale": clamp(0.35 + 0.35 * bool(RATIONALE_WORDS.search(all_text)) + min(len(example.get("notes", [])) / 6, 0.3) - generic_penalty),
+        "alternatives_and_boundaries": clamp(0.25 + 0.25 * bool(BOUNDARY_WORDS.search(all_text)) + 0.2 * bool(unsupported_cells) + min(notes.lower().count("prefer") / 2, 0.2)),
+        "executable_determinism": clamp(0.75 + 0.25 * bool(example.get("expected_output")) - 0.25 * bool(example.get("version_sensitive"))),
+        "python_idiom_and_accuracy": clamp(0.75 + 0.15 * bool(example.get("doc_url")) + 0.1 * ("print(" in code) - 0.25 * bool(TOY_WORDS.search(code))),
+        "literate_fit": clamp(0.35 + min(len(normal_cells) / 4, 0.35) + 0.3 * all(cell.get("prose") for cell in normal_cells)),
+        "source_result_pairing": clamp(0.35 + 0.45 * all(outputs) + min(output_lines / 8, 0.2)),
+        "concept_decomposition": clamp(0.25 + min(len(normal_cells) / 3, 0.55) + 0.2 * (len(normal_cells) >= 3)),
+        "progressive_walkthrough": clamp(0.35 + min(distinct_cell_starts / max(len(normal_cells), 1), 0.45) + 0.2 * (len(normal_cells) >= 2)),
+        "representative_coverage": clamp(0.3 + min(output_lines / 10, 0.25) + min(len(example.get("see_also", [])) / 4, 0.25) + 0.2 * (len(normal_cells) >= 3)),
+        "practical_usefulness": clamp(0.55 + 0.25 * (not bool(TOY_WORDS.search(code))) + 0.2 * bool(re.search(r"Ada|Grace|project|config|score|price|request|path|file|service|team", code))),
+        "editorial_progression": clamp(0.35 + min(len(example.get("explanation", [])) / 3, 0.25) + min(len(example.get("notes", [])) / 4, 0.25) + 0.15 * bool(example.get("see_also"))),
+    }
+
+
+def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
+    return round(sum(scores[name] * float(weight) for name, weight in weights.items()), 1)
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--json", action="store_true")
+    parser.add_argument("--below", type=float, default=9.0)
+    parser.add_argument("--limit", type=int, default=20)
+    args = parser.parse_args()
+
+    weights = tomllib.loads(REGISTRY.read_text())["score_model"]
+    _, examples = load_examples()
+    rows = []
+    for example in examples:
+        criteria = criterion_scores(example)
+        heuristic = weighted_score(criteria, weights)
+        curated, comment = EXAMPLE_QUALITY_SCORES[example["slug"]]
+        weakest = sorted(criteria.items(), key=lambda item: item[1])[:3]
+        rows.append({
+            "slug": example["slug"],
+            "curated": curated,
+            "heuristic": heuristic,
+            "delta": round(curated - heuristic, 1),
+            "comment": comment,
+            "weakest": weakest,
+            "criteria": criteria,
+        })
+
+    if args.json:
+        print(json.dumps(rows, indent=2, sort_keys=True))
+        return 0
+
+    selected = [row for row in rows if row["curated"] < args.below]
+    selected.sort(key=lambda row: (row["curated"], row["heuristic"]))
+    for row in selected[: args.limit]:
+        weak = ", ".join(f"{name}={score:.2f}" for name, score in row["weakest"])
+        print(f"{row['curated']:>3.1f} h={row['heuristic']:>3.1f} {row['slug']:<30} {weak}")
+    print(
+        f"criterion heuristic: examples={len(rows)} "
+        f"curated_avg={mean(row['curated'] for row in rows):.2f} "
+        f"heuristic_avg={mean(row['heuristic'] for row in rows):.2f}"
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
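
To see how the per-criterion values in `scripts/score_example_criteria.py` combine into one heuristic number, here is a self-contained sketch that copies `clamp` and `weighted_score` from the script and applies two of its formulas to a made-up page. The three weights are hypothetical stand-ins for the `score_model` table in `docs/quality-registries.toml`, which this commit does not show, and the page shape (two normal cells, four output lines) is invented for the example.

```python
def clamp(value: float) -> float:
    # Same helper as the script: keep each criterion in [0.0, 1.0].
    return max(0.0, min(1.0, value))


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    # Same helper as the script: weights are expected to sum to 10.
    return round(sum(scores[name] * float(weight) for name, weight in weights.items()), 1)


# Hypothetical weights; the real score_model has one weight per criterion.
weights = {
    "conceptual_payoff": 4.0,
    "source_result_pairing": 3.0,
    "concept_decomposition": 3.0,
}

# Made-up page: two normal cells, every cell has output, four output lines.
normal_cells = 2
output_lines = 4
all_cells_have_output = True

scores = {
    # Assumed to hit every cap in the script's formula.
    "conceptual_payoff": clamp(0.45 + 0.35 + 0.2),
    "source_result_pairing": clamp(0.35 + 0.45 * all_cells_have_output + min(output_lines / 8, 0.2)),
    # Only two cells, so the 0.2 bonus for three or more cells is not earned.
    "concept_decomposition": clamp(0.25 + min(normal_cells / 3, 0.55) + 0.2 * (normal_cells >= 3)),
}

weakest = min(scores, key=scores.get)
print(weighted_score(scores, weights), weakest)
```

With these inputs the weakest axis is `concept_decomposition`, which points at the "decompose one compressed cell" move from `docs/quality-search.md`: adding a third cell would earn the 0.2 bonus in that formula.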

src/asset_manifest.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 # Generated by scripts/fingerprint_assets.py. Do not edit by hand.
 ASSET_PATHS = {'SITE_CSS': '/site.57a55415849b.css', 'SYNTAX_JS': '/syntax-highlight.3b6c7f730d46.js', 'EDITOR_JS': '/editor.a4a7766e1b9b.js'}
-HTML_CACHE_VERSION = 'd56bf0e86233'
+HTML_CACHE_VERSION = 'b5738224e50a'
