feat(ci): org-control plane quality pass — security, tests, DRY, observability#2
Merged
Merged
Conversation
…hub/workflows/* First run of the digest workflow returned 403 "Resource not accessible by personal access token" on every fork for both the merge-upstream call and the distributor's file write. Root cause: fine-grained PATs separate Contents permission from Workflows permission, and writing files under .github/workflows/ requires the latter. Updates the docstring on the workflow file and the README so the PAT permissions list matches what's actually required to run end-to-end.
merge-upstream returns two API shapes for the same "branch doesn't exist anywhere yet" condition: 422 + "does not exist" — branch isn't on the fork 404 + "Branch not found" — branch isn't on the upstream either Previously only 422 was treated as skipped, so forks that don't yet have a 19.0 branch upstream (ddmrp, report-print-send) showed up in the failure bucket and made the daily subject line noisy. After this fix, both cases bucket into "skipped (branch n/a)" and the "Sync failures" header only appears when there's a real failure to look at.
Mutable tags like @v3/@v4 mean a maintainer (or attacker) can swap the underlying code without our consent — the workflow runs whatever that ref points to, with our PAT in env. Pin to commit SHA so updates become deliberate review events. Same pin treatment for forward-port.yml template (every fork's CI is running that file with write-scoped GITHUB_TOKEN). Also documents WHY the template uses `pull_request_target` rather than `pull_request`: the action needs to write back after merge, which requires the elevated permission set; PR-author code never executes in the job, so the usual security concern doesn't apply. Comment lets the next reader see the reasoning instead of guessing. Comments next to each pin show the version tag the SHA currently maps to, so the next bump can be a one-line review against the upstream changelog.
Two small robustness fixes: 1. Concurrency group on the workflow name so the daily cron + a manual dispatch can't race and double-call merge-upstream on the same branches. cancel-in-progress: false so a manual test fired during a cron run queues + completes against the synced state instead of leaving forks half-touched. 2. Subject line is now written to \$GITHUB_OUTPUT via heredoc. The old `echo "value=\$(cat …)" >> \$GITHUB_OUTPUT` form silently corrupts the variable if the value contains `=`, backticks, or a newline. Subjects don't currently include those, but a future render tweak adding `[12 sync fail = 4 forks]` would land as a hard-to-trace email-step failure.
…sing-token error Two related cleanups in one commit because both touch the same lines: 1. _github.py absorbs the gh() helper that was copy-pasted across fork_sync_digest.py and distribute_forward_port.py. Same behaviour, one source of truth — and a place to add retry / rate-limit handling later without forking the change. 2. require_token() replaces the bare os.environ["GH_TOKEN"] reads. The old form raises `KeyError: 'GH_TOKEN'` with no hint about the secret name or where to set it; the new form prints the exact `env:` block needed and exits 2. Spent ten minutes debugging that the first time, so worth a one-off helper. Both scripts continue to work exactly as before — the public surface (gh(method, path, body)) is unchanged.
Splits the pipeline into collect → distribute → render so the
forward-port distributor's results land in the daily email, not
just in the artifact tarball.
Before: render lived inside fork_sync_digest.py. The distributor ran
AFTER render, so its 14/14 failure on our first run looked like
"digest OK" in the inbox. The actual failures were only visible if
you downloaded the artifact and grep'd through JSON.
After:
1. fork_sync_digest.py → collects sync + MIG data, writes
sync-results.json + mig-buckets.json
+ forks-parsed.json. No render.
2. distribute_forward_port.py
→ writes forward-port-distribution.json
(unchanged behaviour).
3. render_digest.py → reads all three JSONs, writes
digest.{html,subject,exit}.
The render step has `if: always()` so a crash in the distributor
doesn't suppress the digest — the email still reports what synced
and flags the dist failure in its own section. Subject line now
tags both `⚠️ N sync fail` AND `⚠️ N dist fail` separately so the
inbox view distinguishes the two failure modes at a glance.
Verified locally against the live PAT: 36-entry sync, simulated
1-dist-failure → subject reads "Ledoent digest YYYY-MM-DD — ⚠️ 1
dist fail — 4 MIG" and the body contains a "Distributor failures"
section with the per-fork detail.
19 tests covering the three pieces of behaviour most likely to
silently regress:
1. load_forks() — the hand-rolled YAML parser. Tests cover the
happy path, multi-branch lists, null upstreams, inline comments,
blank-line termination, and a smoke test against the real
forks.yml shipped in this repo.
2. sync_branch() — the response → state mapping for merge-upstream.
Tests pin the 200-fast-forward / 200-none / 422-skip / 404-skip /
403-real-fail / 409-real-fail cases so a refactor can't
accidentally re-classify "branch n/a" as failure.
3. render() — subject tagging and failure surfacing. Tests pin
"⚠️ N sync fail" vs "⚠️ N dist fail" separation, MIG counts in
the subject, and HTML escaping of failure messages.
Running pytest exposed a real bug in load_forks(): comment-only lines
inside an entry (like the OpenUpgrade entry's two indented `# Default
branch is …` lines) were being misread as blank-line entry separators,
truncating the entry to just `{repo: ledoent/OpenUpgrade}`. The
upstream comment-stripping path was firing before the blank-line
check. Fix moves the blank-line check ahead of comment-stripping so
only TRULY empty lines terminate; comment-only lines are no-ops.
Also adds .github/workflows/tests.yml so pytest runs on PRs that
touch the scripts. No secrets or API calls — safe to run from forks.
Adds a top-level README so https://github.com/ledoent/.github gives a visitor the org-control overview without making them dig through .github/scripts/. Covers what runs daily, the secret list, and how to add a fork. The README explicitly calls out the PAT resource-owner trap: the token must be issued under the ledoent ORG, not a personal account — a fine-grained PAT scoped to "all repositories owned by you" (where you = a personal account) returns 403 on every ledoent/* repo, even when permissions are correct. That's the failure mode that cost us a day on the initial setup; flagging it in two places (root README + scripts/README.md) so the next person doesn't repeat it. The scripts/ README is rewritten as internals-focused since the overview lives at root now. New table maps each stage of the pipeline to its inputs/outputs so the data flow is greppable.
The fork's default branch on ledoent/OpenUpgrade is `ledoent`, which
is a CI overlay (= 19.0 + lab CI customizations baked on top). It is
NOT a copy of an upstream branch.
merge-upstream was trying to merge OCA/OpenUpgrade@19.0 into this
overlay, which is the wrong direction:
upstream/19.0 ----A----B----C----D
\\
ledoent --------+ci-overlay--+ ←← merging A..D in here
conflicts with the overlay's
intentional drift, returns 409
The actual workflow is the reverse: when 19.0 drifts too far, the
human rebases the CI overlay on top of 19.0 manually (per
docs/branch-model.md on the lab side). PRs always target OCA's 19.0,
never the overlay.
Removing "ledoent" from branches[] so only 19.0 gets auto-synced.
The overlay can stay as the fork's default branch — the digest just
won't touch it. Drops one false-failure entry from tomorrow's email.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Quality / DRY / docs / coverage pass on the org-control plane. 8 commits, each addressing one self-review finding, organized for sequential review.
Commits in order
docs(ci): clarify PAT needs Workflows:write …Workflows:permission gap that broke distributionfix(digest): treat 404 "Branch not found" as skippedmerge-upstreamreturns 404 (not 422) when upstream lacks the branch; both now bucket as skippedsec: pin 3rd-party actions to commit SHAs@v3/@v4mutable tags with commit SHAs in bothfork-sync-and-digest.ymland the distributedforward-port.ymltemplate. Added rationale forpull_request_targetuse in the template.ci: add concurrency lock + heredoc $GITHUB_OUTPUTconcurrency:so cron + manual dispatch can't race. Heredoc for subject output so it survives=, backticks, or newlines in the value.refactor(scripts): extract shared github HTTP helper …_github.pyabsorbs the duplicatedgh()helper.require_token()replaces bareos.environ["GH_TOKEN"]reads — actionable exit-2 message instead of cryptic KeyError.feat(digest): surface distributor outcomes in the email bodyrender_digest.pyscript;renderstep usesif: always()so the email goes out even if the distributor crashes.test: pytest coverage for parser + sync state mapping + renderertests/test_load_forks.py,tests/test_sync_branch_state.py,tests/test_render_digest.py. Caught a real parser bug — comment-only lines inside an entry were silently truncating it (the OpenUpgrade entry's# Default branch is …lines were turning the entry into just{repo: ledoent/OpenUpgrade}). Fix moves the blank-line check ahead of comment-stripping. Newtests.ymlworkflow runs the suite on every PR touching.github/scripts/**or.github/forks.yml.docs: repo-root README + clarify PAT resource-owner trapREADME.mdfor the org-control overview. Explicitly flags the "Resource owner MUST be theledoentorg" trap that cost us a day on initial setup. Rewrote.github/scripts/README.mdas internals-focused with a stage-by-stage data-flow table.Verified output
Ran the full pipeline locally against the live PAT — 38 sync results, 4 MIG PRs detected, 14-fork distributor row populated. Rendered digest:
Screenshot of the rendered HTML — note the new "Forward-port distribution" stat row and the actionable surfacing of a real
409 merge conflictonledoent/OpenUpgrade@ledoent(the fork'sledoentbranch has diverged from upstream and needs a manual merge; the old digest format would have made this look like a generic "failed" entry):Test results
The new
tests.ymlworkflow will gate future changes automatically.Findings I deliberately did NOT address
defaults:block in forks.yml to DRY the 18 OCA entries_github.pyaction-send-mailsupports it but the recipient (you) reads HTML; YAGNI until that changesmainAction required (carried over)
The PAT pasted into chat is still in the secret. Rotate it at https://github.com/settings/personal-access-tokens — pick
ci-main, click Regenerate token, then:Safe to squash-merge after rotation. Tests pass, scripts re-tested locally against live API.