feat(ci): org-control plane quality pass — security, tests, DRY, observability by dnplkndll · Pull Request #2 · ledoent/.github

dnplkndll · 2026-05-19T02:35:04Z

Quality / DRY / docs / coverage pass on the org-control plane. 8 commits, each addressing one self-review finding, organized for sequential review.

Commits in order

#	Commit	Concern	What changed
1	`docs(ci): clarify PAT needs Workflows:write …`	Docs	Original commit — flagged the `Workflows:` permission gap that broke distribution
2	`fix(digest): treat 404 "Branch not found" as skipped`	Correctness	`merge-upstream` returns 404 (not 422) when upstream lacks the branch; both now bucket as skipped
3	`sec: pin 3rd-party actions to commit SHAs`	Security	Replaced `@v3`/`@v4` mutable tags with commit SHAs in both `fork-sync-and-digest.yml` and the distributed `forward-port.yml` template. Added rationale for `pull_request_target` use in the template.
4	`ci: add concurrency lock + heredoc $GITHUB_OUTPUT`	Correctness	Workflow-level `concurrency:` so cron + manual dispatch can't race. Heredoc for subject output so it survives `=`, backticks, or newlines in the value.
5	`refactor(scripts): extract shared github HTTP helper …`	DRY + DX	New `_github.py` absorbs the duplicated `gh()` helper. `require_token()` replaces bare `os.environ["GH_TOKEN"]` reads — actionable exit-2 message instead of cryptic KeyError.
6	`feat(digest): surface distributor outcomes in the email body`	Observability	Splits the pipeline into collect → distribute → render. Distributor results now appear in the email body (not just the artifact tarball) — a 14/14 distributor failure no longer sends a "looks fine" digest. New `render_digest.py` script; `render` step uses `if: always()` so the email goes out even if the distributor crashes.
7	`test: pytest coverage for parser + sync state mapping + renderer`	Tests	19 tests across `tests/test_load_forks.py`, `tests/test_sync_branch_state.py`, `tests/test_render_digest.py`. Caught a real parser bug — comment-only lines inside an entry were silently truncating it (the OpenUpgrade entry's `# Default branch is …` lines were turning the entry into just `{repo: ledoent/OpenUpgrade}`). Fix moves the blank-line check ahead of comment-stripping. New `tests.yml` workflow runs the suite on every PR touching `.github/scripts/**` or `.github/forks.yml`.
8	`docs: repo-root README + clarify PAT resource-owner trap`	Docs	New top-level `README.md` for the org-control overview. Explicitly flags the "Resource owner MUST be the `ledoent` org" trap that cost us a day on initial setup. Rewrote `.github/scripts/README.md` as internals-focused with a stage-by-stage data-flow table.

Verified output

Ran the full pipeline locally against the live PAT — 38 sync results, 4 MIG PRs detected, 14-fork distributor row populated. Rendered digest:

subject: Ledoent digest 2026-05-20 — ⚠️ 1 sync fail — 4 MIG

Screenshot of the rendered HTML — note the new "Forward-port distribution" stat row and the actionable surfacing of a real 409 merge conflict on ledoent/OpenUpgrade@ledoent (the fork's ledoent branch has diverged from upstream and needs a manual merge; the old digest format would have made this look like a generic "failed" entry):

Upload note — screenshot lives locally at /tmp/digest-screenshot.png. gcloud auth expired during this session; once you've run gcloud auth login, push the asset with:
gcloud storage cp /tmp/digest-screenshot.png gs://ledo-pr-assets/oca-reviews/ledoent-dotgithub-pr2/digest-preview.png

Test results

$ python3 -m pytest .github/scripts/tests/ -v
============================== 19 passed in 0.18s ==============================

The new tests.yml workflow will gate future changes automatically.

Findings I deliberately did NOT address

Finding	Why deferred
Add PyYAML dependency	The hand-rolled parser is now covered by tests + has comments — keeps the workflow zero-deps
`defaults:` block in forks.yml to DRY the 18 OCA entries	Would require parser feature work; the duplication is mechanical and grep-friendly today
Retry / rate-limit handling in `_github.py`	One place to add it now if it becomes needed; no observed need yet
Plain-text email body alongside HTML	`action-send-mail` supports it but the recipient (you) reads HTML; YAGNI until that changes
Branch protection on `main`	Org-policy concern, not a code change for this PR

Action required (carried over)

The PAT pasted into chat is still in the secret. Rotate it at https://github.com/settings/personal-access-tokens — pick ci-main, click Regenerate token, then:

gh secret set LEDOENT_FORK_SYNC_TOKEN --repo ledoent/.github
# paste the new value

Safe to squash-merge after rotation. Tests pass, scripts re-tested locally against live API.

…hub/workflows/* First run of the digest workflow returned 403 "Resource not accessible by personal access token" on every fork for both the merge-upstream call and the distributor's file write. Root cause: fine-grained PATs separate Contents permission from Workflows permission, and writing files under .github/workflows/ requires the latter. Updates the docstring on the workflow file and the README so the PAT permissions list matches what's actually required to run end-to-end.

merge-upstream returns two API shapes for the same "branch doesn't exist anywhere yet" condition: 422 + "does not exist" — branch isn't on the fork 404 + "Branch not found" — branch isn't on the upstream either Previously only 422 was treated as skipped, so forks that don't yet have a 19.0 branch upstream (ddmrp, report-print-send) showed up in the failure bucket and made the daily subject line noisy. After this fix, both cases bucket into "skipped (branch n/a)" and the "Sync failures" header only appears when there's a real failure to look at.

@v4

Mutable tags like @v3/@v4 mean a maintainer (or attacker) can swap the underlying code without our consent — the workflow runs whatever that ref points to, with our PAT in env. Pin to commit SHA so updates become deliberate review events. Same pin treatment for forward-port.yml template (every fork's CI is running that file with write-scoped GITHUB_TOKEN). Also documents WHY the template uses `pull_request_target` rather than `pull_request`: the action needs to write back after merge, which requires the elevated permission set; PR-author code never executes in the job, so the usual security concern doesn't apply. Comment lets the next reader see the reasoning instead of guessing. Comments next to each pin show the version tag the SHA currently maps to, so the next bump can be a one-line review against the upstream changelog.

Two small robustness fixes: 1. Concurrency group on the workflow name so the daily cron + a manual dispatch can't race and double-call merge-upstream on the same branches. cancel-in-progress: false so a manual test fired during a cron run queues + completes against the synced state instead of leaving forks half-touched. 2. Subject line is now written to \$GITHUB_OUTPUT via heredoc. The old `echo "value=\$(cat …)" >> \$GITHUB_OUTPUT` form silently corrupts the variable if the value contains `=`, backticks, or a newline. Subjects don't currently include those, but a future render tweak adding `[12 sync fail = 4 forks]` would land as a hard-to-trace email-step failure.

…sing-token error Two related cleanups in one commit because both touch the same lines: 1. _github.py absorbs the gh() helper that was copy-pasted across fork_sync_digest.py and distribute_forward_port.py. Same behaviour, one source of truth — and a place to add retry / rate-limit handling later without forking the change. 2. require_token() replaces the bare os.environ["GH_TOKEN"] reads. The old form raises `KeyError: 'GH_TOKEN'` with no hint about the secret name or where to set it; the new form prints the exact `env:` block needed and exits 2. Spent ten minutes debugging that the first time, so worth a one-off helper. Both scripts continue to work exactly as before — the public surface (gh(method, path, body)) is unchanged.

Splits the pipeline into collect → distribute → render so the forward-port distributor's results land in the daily email, not just in the artifact tarball. Before: render lived inside fork_sync_digest.py. The distributor ran AFTER render, so its 14/14 failure on our first run looked like "digest OK" in the inbox. The actual failures were only visible if you downloaded the artifact and grep'd through JSON. After: 1. fork_sync_digest.py → collects sync + MIG data, writes sync-results.json + mig-buckets.json + forks-parsed.json. No render. 2. distribute_forward_port.py → writes forward-port-distribution.json (unchanged behaviour). 3. render_digest.py → reads all three JSONs, writes digest.{html,subject,exit}. The render step has `if: always()` so a crash in the distributor doesn't suppress the digest — the email still reports what synced and flags the dist failure in its own section. Subject line now tags both `⚠️ N sync fail` AND `⚠️ N dist fail` separately so the inbox view distinguishes the two failure modes at a glance. Verified locally against the live PAT: 36-entry sync, simulated 1-dist-failure → subject reads "Ledoent digest YYYY-MM-DD — ⚠️ 1 dist fail — 4 MIG" and the body contains a "Distributor failures" section with the per-fork detail.

19 tests covering the three pieces of behaviour most likely to silently regress: 1. load_forks() — the hand-rolled YAML parser. Tests cover the happy path, multi-branch lists, null upstreams, inline comments, blank-line termination, and a smoke test against the real forks.yml shipped in this repo. 2. sync_branch() — the response → state mapping for merge-upstream. Tests pin the 200-fast-forward / 200-none / 422-skip / 404-skip / 403-real-fail / 409-real-fail cases so a refactor can't accidentally re-classify "branch n/a" as failure. 3. render() — subject tagging and failure surfacing. Tests pin "⚠️ N sync fail" vs "⚠️ N dist fail" separation, MIG counts in the subject, and HTML escaping of failure messages. Running pytest exposed a real bug in load_forks(): comment-only lines inside an entry (like the OpenUpgrade entry's two indented `# Default branch is …` lines) were being misread as blank-line entry separators, truncating the entry to just `{repo: ledoent/OpenUpgrade}`. The upstream comment-stripping path was firing before the blank-line check. Fix moves the blank-line check ahead of comment-stripping so only TRULY empty lines terminate; comment-only lines are no-ops. Also adds .github/workflows/tests.yml so pytest runs on PRs that touch the scripts. No secrets or API calls — safe to run from forks.

Adds a top-level README so https://github.com/ledoent/.github gives a visitor the org-control overview without making them dig through .github/scripts/. Covers what runs daily, the secret list, and how to add a fork. The README explicitly calls out the PAT resource-owner trap: the token must be issued under the ledoent ORG, not a personal account — a fine-grained PAT scoped to "all repositories owned by you" (where you = a personal account) returns 403 on every ledoent/* repo, even when permissions are correct. That's the failure mode that cost us a day on the initial setup; flagging it in two places (root README + scripts/README.md) so the next person doesn't repeat it. The scripts/ README is rewritten as internals-focused since the overview lives at root now. New table maps each stage of the pipeline to its inputs/outputs so the data flow is greppable.

The fork's default branch on ledoent/OpenUpgrade is `ledoent`, which is a CI overlay (= 19.0 + lab CI customizations baked on top). It is NOT a copy of an upstream branch. merge-upstream was trying to merge OCA/OpenUpgrade@19.0 into this overlay, which is the wrong direction: upstream/19.0 ----A----B----C----D \\ ledoent --------+ci-overlay--+ ←← merging A..D in here conflicts with the overlay's intentional drift, returns 409 The actual workflow is the reverse: when 19.0 drifts too far, the human rebases the CI overlay on top of 19.0 manually (per docs/branch-model.md on the lab side). PRs always target OCA's 19.0, never the overlay. Removing "ledoent" from branches[] so only 19.0 gets auto-synced. The overlay can stay as the fork's default branch — the digest just won't touch it. Drops one false-failure entry from tomorrow's email.

dnplkndll added 2 commits May 18, 2026 22:25

dnplkndll changed the title ~~docs(ci): PAT needs Workflows:write for the distributor~~ ci: PAT scope docs + treat 404 branches as skipped May 20, 2026

dnplkndll added 6 commits May 19, 2026 22:05

dnplkndll changed the title ~~ci: PAT scope docs + treat 404 branches as skipped~~ feat(ci): org-control plane quality pass — security, tests, DRY, observability May 20, 2026

dnplkndll merged commit d66d28d into main May 20, 2026
1 check passed

dnplkndll deleted the chore/document-pat-workflows-scope branch May 20, 2026 14:47

dnplkndll mentioned this pull request May 20, 2026

feat(forks): preemptively track 20.0 so sync auto-activates #3

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ci): org-control plane quality pass — security, tests, DRY, observability#2

feat(ci): org-control plane quality pass — security, tests, DRY, observability#2
dnplkndll merged 9 commits into
mainfrom
chore/document-pat-workflows-scope

dnplkndll commented May 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dnplkndll commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Commits in order

Verified output

Test results

Findings I deliberately did NOT address

Action required (carried over)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dnplkndll commented May 19, 2026 •

edited

Loading