feat(ci): org-control plane quality pass — security, tests, DRY, observability#2

Merged
dnplkndll merged 9 commits into main from chore/document-pat-workflows-scope
May 20, 2026

@dnplkndll dnplkndll commented May 19, 2026

Quality / DRY / docs / coverage pass on the org-control plane. 8 commits, each addressing one self-review finding, organized for sequential review.

Commits in order

| # | Commit | Concern | What changed |
|---|--------|---------|--------------|
| 1 | docs(ci): clarify PAT needs Workflows:write … | Docs | Original commit — flagged the Workflows: permission gap that broke distribution |
| 2 | fix(digest): treat 404 "Branch not found" as skipped | Correctness | merge-upstream returns 404 (not 422) when upstream lacks the branch; both now bucket as skipped |
| 3 | sec: pin 3rd-party actions to commit SHAs | Security | Replaced @v3/@v4 mutable tags with commit SHAs in both fork-sync-and-digest.yml and the distributed forward-port.yml template. Added rationale for pull_request_target use in the template. |
| 4 | ci: add concurrency lock + heredoc $GITHUB_OUTPUT | Correctness | Workflow-level concurrency: so cron + manual dispatch can't race. Heredoc for subject output so it survives =, backticks, or newlines in the value. |
| 5 | refactor(scripts): extract shared github HTTP helper … | DRY + DX | New _github.py absorbs the duplicated gh() helper. require_token() replaces bare os.environ["GH_TOKEN"] reads — actionable exit-2 message instead of cryptic KeyError. |
| 6 | feat(digest): surface distributor outcomes in the email body | Observability | Splits the pipeline into collect → distribute → render. Distributor results now appear in the email body (not just the artifact tarball) — a 14/14 distributor failure no longer sends a "looks fine" digest. New render_digest.py script; render step uses if: always() so the email goes out even if the distributor crashes. |
| 7 | test: pytest coverage for parser + sync state mapping + renderer | Tests | 19 tests across tests/test_load_forks.py, tests/test_sync_branch_state.py, tests/test_render_digest.py. Caught a real parser bug — comment-only lines inside an entry were silently truncating it (the OpenUpgrade entry's # Default branch is … lines were turning the entry into just {repo: ledoent/OpenUpgrade}). Fix moves the blank-line check ahead of comment-stripping. New tests.yml workflow runs the suite on every PR touching .github/scripts/** or .github/forks.yml. |
| 8 | docs: repo-root README + clarify PAT resource-owner trap | Docs | New top-level README.md for the org-control overview. Explicitly flags the "Resource owner MUST be the ledoent org" trap that cost us a day on initial setup. Rewrote .github/scripts/README.md as internals-focused with a stage-by-stage data-flow table. |

Verified output

Ran the full pipeline locally against the live PAT — 38 sync results, 4 MIG PRs detected, 14-fork distributor row populated. Rendered digest:

subject: Ledoent digest 2026-05-20 — ⚠️ 1 sync fail — 4 MIG

Screenshot of the rendered HTML — note the new "Forward-port distribution" stat row and the actionable surfacing of a real 409 merge conflict on ledoent/OpenUpgrade@ledoent (the fork's ledoent branch has diverged from upstream and needs a manual merge; the old digest format would have made this look like a generic "failed" entry):

[Image: digest preview]

Upload note — screenshot lives locally at /tmp/digest-screenshot.png. gcloud auth expired during this session; once you've run gcloud auth login, push the asset with:

gcloud storage cp /tmp/digest-screenshot.png gs://ledo-pr-assets/oca-reviews/ledoent-dotgithub-pr2/digest-preview.png

Test results

$ python3 -m pytest .github/scripts/tests/ -v
============================== 19 passed in 0.18s ==============================

The new tests.yml workflow will gate future changes automatically.

Findings I deliberately did NOT address

| Finding | Why deferred |
|---------|--------------|
| Add PyYAML dependency | The hand-rolled parser is now covered by tests + has comments — keeps the workflow zero-deps |
| defaults: block in forks.yml to DRY the 18 OCA entries | Would require parser feature work; the duplication is mechanical and grep-friendly today |
| Retry / rate-limit handling in _github.py | One place to add it now if it becomes needed; no observed need yet |
| Plain-text email body alongside HTML | action-send-mail supports it but the recipient (you) reads HTML; YAGNI until that changes |
| Branch protection on main | Org-policy concern, not a code change for this PR |

Action required (carried over)

The PAT pasted into chat is still in the secret. Rotate it at https://github.com/settings/personal-access-tokens — pick ci-main, click Regenerate token, then:

gh secret set LEDOENT_FORK_SYNC_TOKEN --repo ledoent/.github
# paste the new value

Safe to squash-merge after rotation. Tests pass, scripts re-tested locally against live API.

dnplkndll added 2 commits May 18, 2026 22:25
…hub/workflows/*

First run of the digest workflow returned 403 "Resource not accessible by
personal access token" on every fork for both the merge-upstream call and
the distributor's file write. Root cause: fine-grained PATs separate
Contents permission from Workflows permission, and writing files under
.github/workflows/ requires the latter.

Updates the docstring on the workflow file and the README so the PAT
permissions list matches what's actually required to run end-to-end.

merge-upstream returns two API shapes for the same "branch doesn't
exist anywhere yet" condition:
  422 + "does not exist"   — branch isn't on the fork
  404 + "Branch not found" — branch isn't on the upstream either
Previously only 422 was treated as skipped, so forks that don't yet
have a 19.0 branch upstream (ddmrp, report-print-send) showed up in
the failure bucket and made the daily subject line noisy.

After this fix, both cases bucket into "skipped (branch n/a)" and the
"Sync failures" header only appears when there's a real failure to
look at.
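
The bucketing described above can be sketched roughly as follows. This is a minimal illustration, not the code in fork_sync_digest.py: the function name `classify_sync` and the bucket labels are hypothetical, and it assumes the parsed JSON response body is passed in as a dict carrying the `message` and `merge_type` fields.

```python
def classify_sync(status: int, body: dict) -> str:
    """Map a merge-upstream response to a digest bucket.

    Hypothetical helper: the bucket names are illustrative only.
    """
    message = body.get("message", "")
    if status == 200:
        # A successful call reports whether anything actually moved.
        return "up-to-date" if body.get("merge_type") == "none" else "synced"
    # Two API shapes for the same "branch doesn't exist yet" condition.
    if status == 422 and "does not exist" in message:
        return "skipped"  # branch isn't on the fork
    if status == 404 and "Branch not found" in message:
        return "skipped"  # branch isn't on the upstream either
    # Everything else (403, 409, ...) is a real failure worth surfacing.
    return "failed"
```

Keeping the mapping in one pure function makes each status code easy to pin with a test.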
@dnplkndll dnplkndll changed the title from "docs(ci): PAT needs Workflows:write for the distributor" to "ci: PAT scope docs + treat 404 branches as skipped" May 20, 2026
dnplkndll added 6 commits May 19, 2026 22:05
Mutable tags like @v3/@v4 mean a maintainer (or attacker) can swap
the underlying code without our consent — the workflow runs whatever
that ref points to, with our PAT in env. Pin to commit SHA so updates
become deliberate review events.

Same pin treatment for forward-port.yml template (every fork's CI is
running that file with write-scoped GITHUB_TOKEN).

Also documents WHY the template uses `pull_request_target` rather
than `pull_request`: the action needs to write back after merge,
which requires the elevated permission set; PR-author code never
executes in the job, so the usual security concern doesn't apply.
Comment lets the next reader see the reasoning instead of guessing.

Comments next to each pin show the version tag the SHA currently
maps to, so the next bump can be a one-line review against the
upstream changelog.
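
One way to keep the pins from regressing is a small lint over the workflow text. The sketch below is not part of this PR. The name `unpinned_actions` is hypothetical, and it scans `uses:` lines with a regex rather than parsing YAML, which assumes one `uses:` per line.

```python
import re

# Matches "uses: owner/repo@ref" with an optional leading "- " list marker.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
# A full-length commit SHA is exactly 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` target whose ref is not a 40-char commit SHA."""
    found = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        target = match.group(1).strip("'\"")
        # Local actions and docker images have no tag-to-SHA mapping to pin.
        if target.startswith(("./", "docker://")):
            continue
        _, _, ref = target.partition("@")
        if not SHA_RE.match(ref):
            found.append(target)
    return found
```

A trailing `# v4.x` comment after the SHA is ignored by the regex, so the version-tag comments described above stay compatible with the check.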
Two small robustness fixes:

1. Concurrency group on the workflow name so the daily cron + a
   manual dispatch can't race and double-call merge-upstream on the
   same branches. cancel-in-progress: false so a manual test fired
   during a cron run queues + completes against the synced state
   instead of leaving forks half-touched.

2. Subject line is now written to $GITHUB_OUTPUT via heredoc. The
   old `echo "value=$(cat …)" >> $GITHUB_OUTPUT` form silently
   corrupts the variable if the value contains `=`, backticks, or a
   newline. Subjects don't currently include those, but a future
   render tweak adding `[12 sync fail = 4 forks]` would land as a
   hard-to-trace email-step failure.
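
The workflow itself does this in shell, but the same delimiter format can be produced from Python, which makes the mechanics easier to see. This is a sketch under that assumption. `format_output` and `set_output` are hypothetical names; only the `name<<DELIMITER` layout is GitHub's documented multiline syntax.

```python
import os
import uuid


def format_output(name: str, value: str) -> str:
    """Render one output in GitHub's multiline (heredoc-style) format."""
    # A random delimiter means the value cannot terminate the block early.
    delimiter = f"ghadelim_{uuid.uuid4().hex}"
    return f"{name}<<{delimiter}\n{value}\n{delimiter}\n"


def set_output(name: str, value: str) -> None:
    """Append the output to the file the runner exposes as $GITHUB_OUTPUT."""
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as handle:
        handle.write(format_output(name, value))
```

Because the value sits between delimiter lines instead of after a `name=`, an `=`, a backtick, or a newline inside it is passed through untouched.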
…sing-token error

Two related cleanups in one commit because both touch the same lines:

1. _github.py absorbs the gh() helper that was copy-pasted across
   fork_sync_digest.py and distribute_forward_port.py. Same behaviour,
   one source of truth — and a place to add retry / rate-limit
   handling later without forking the change.

2. require_token() replaces the bare os.environ["GH_TOKEN"] reads.
   The old form raises `KeyError: 'GH_TOKEN'` with no hint about the
   secret name or where to set it; the new form prints the exact
   `env:` block needed and exits 2. Spent ten minutes debugging that
   the first time, so worth a one-off helper.

Both scripts continue to work exactly as before — the public surface
(gh(method, path, body)) is unchanged.
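
For readers without the diff open, the shape of such a helper might look like the sketch below. It is an assumption-laden illustration, not the contents of _github.py. Only the names `require_token`, `gh(method, path, body)`, `GH_TOKEN`, and the exit code 2 come from the commit message. The error wording, `build_request`, the return type of `gh`, and the use of `urllib` are all hypothetical.

```python
import json
import os
import sys
import urllib.request

API_ROOT = "https://api.github.com"


def require_token(var: str = "GH_TOKEN") -> str:
    """Return the token, or exit 2 with a hint instead of raising KeyError."""
    token = os.environ.get(var, "").strip()
    if not token:
        print(
            f"error: {var} is not set.\n"
            "Add it to the workflow step, for example:\n"
            "  env:\n"
            f"    {var}: ${{{{ secrets.LEDOENT_FORK_SYNC_TOKEN }}}}",
            file=sys.stderr,
        )
        sys.exit(2)
    return token


def build_request(method: str, path: str, body=None) -> urllib.request.Request:
    """Build (but do not send) an authenticated GitHub API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(API_ROOT + path, data=data, method=method)
    request.add_header("Authorization", f"Bearer {require_token()}")
    request.add_header("Accept", "application/vnd.github+json")
    return request


def gh(method: str, path: str, body=None):
    """Send the request; returns (status, parsed JSON or None).

    Note: urlopen raises HTTPError on 4xx/5xx, so a real helper would
    need to catch that to inspect 404/409/422 responses.
    """
    with urllib.request.urlopen(build_request(method, path, body)) as response:
        raw = response.read()
        return response.status, (json.loads(raw) if raw else None)
```

Separating request construction from sending also gives retry or rate-limit handling a single place to live later.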
Splits the pipeline into collect → distribute → render so the
forward-port distributor's results land in the daily email, not
just in the artifact tarball.

Before: render lived inside fork_sync_digest.py. The distributor ran
AFTER render, so its 14/14 failure on our first run looked like
"digest OK" in the inbox. The actual failures were only visible if
you downloaded the artifact and grep'd through JSON.

After:
  1. fork_sync_digest.py  → collects sync + MIG data, writes
                             sync-results.json + mig-buckets.json
                             + forks-parsed.json. No render.
  2. distribute_forward_port.py
                          → writes forward-port-distribution.json
                            (unchanged behaviour).
  3. render_digest.py     → reads all three JSONs, writes
                             digest.{html,subject,exit}.

The render step has `if: always()` so a crash in the distributor
doesn't suppress the digest — the email still reports what synced
and flags the dist failure in its own section. Subject line now
tags both `⚠️ N sync fail` AND `⚠️ N dist fail` separately so the
inbox view distinguishes the two failure modes at a glance.

Verified locally against the live PAT: 36-entry sync, simulated
1-dist-failure → subject reads "Ledoent digest YYYY-MM-DD — ⚠️ 1
dist fail — 4 MIG" and the body contains a "Distributor failures"
section with the per-fork detail.
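
The subject tagging can be sketched as below. `build_subject` is a hypothetical stand-in for whatever render_digest.py does internally. The subject layout is copied from the two examples quoted in this PR, and always including the MIG count (even when zero) is an assumption.

```python
def build_subject(date: str, sync_fail: int, dist_fail: int, mig: int) -> str:
    """Assemble the digest subject, tagging each failure mode separately."""
    parts = [f"Ledoent digest {date}"]
    if sync_fail:
        parts.append(f"⚠️ {sync_fail} sync fail")
    if dist_fail:
        parts.append(f"⚠️ {dist_fail} dist fail")
    # Assumption: the MIG count is always shown, even when it is zero.
    parts.append(f"{mig} MIG")
    return " — ".join(parts)
```

Each tag only appears when its count is non-zero, so a clean run produces a subject with no warning markers at all.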
19 tests covering the three pieces of behaviour most likely to
silently regress:

  1. load_forks() — the hand-rolled YAML parser. Tests cover the
     happy path, multi-branch lists, null upstreams, inline comments,
     blank-line termination, and a smoke test against the real
     forks.yml shipped in this repo.

  2. sync_branch() — the response → state mapping for merge-upstream.
     Tests pin the 200-fast-forward / 200-none / 422-skip / 404-skip /
     403-real-fail / 409-real-fail cases so a refactor can't
     accidentally re-classify "branch n/a" as failure.

  3. render() — subject tagging and failure surfacing. Tests pin
     "⚠️ N sync fail" vs "⚠️ N dist fail" separation, MIG counts in
     the subject, and HTML escaping of failure messages.
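
The HTML-escaping case in item 3 matters because API error messages end up inside the email markup. A minimal sketch of the idea, assuming failures arrive as `(repo, message)` pairs; `render_failures` and the list markup are hypothetical, not the actual renderer.

```python
import html


def render_failures(failures: list[tuple[str, str]]) -> str:
    """Render failure rows, escaping every API-supplied string."""
    items = "".join(
        f"<li><code>{html.escape(repo)}</code>: {html.escape(message)}</li>"
        for repo, message in failures
    )
    return f"<ul>{items}</ul>"
```

A test can then pin that a message containing `<` or `&` comes out as entities rather than as live markup.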

Running pytest exposed a real bug in load_forks(): comment-only lines
inside an entry (like the OpenUpgrade entry's two indented `# Default
branch is …` lines) were being misread as blank-line entry separators,
truncating the entry to just `{repo: ledoent/OpenUpgrade}`. The
upstream comment-stripping path was firing before the blank-line
check. Fix moves the blank-line check ahead of comment-stripping so
only TRULY empty lines terminate; comment-only lines are no-ops.
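
The ordering fix is easier to see in a toy parser. The sketch below is not load_forks(): it assumes a simplified format of `- key: value` entries separated by blank lines, and `load_entries` is a hypothetical name. What it illustrates is the order of the two checks.

```python
def load_entries(text: str) -> list[dict]:
    """Parse '- key: value' entries; only truly blank lines end an entry."""
    entries, current = [], None
    for raw in text.splitlines():
        # 1. Blank-line check FIRST, on the untouched line.
        if not raw.strip():
            if current is not None:
                entries.append(current)
                current = None
            continue
        # 2. Only then strip comments. A comment-only line becomes empty
        #    here and is skipped as a no-op instead of ending the entry.
        #    (Naive split: a '#' inside a value would be cut off too.)
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("- "):
            if current is not None:
                entries.append(current)
            current = {}
            line = line[2:].strip()
        if current is None:
            continue  # top-level keys before the first entry are ignored
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current is not None:
        entries.append(current)
    return entries
```

With the two checks swapped, a comment-only line is indistinguishable from a blank one after stripping, which is the truncation described above.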

Also adds .github/workflows/tests.yml so pytest runs on PRs that
touch the scripts. No secrets or API calls — safe to run from forks.

Adds a top-level README so https://github.com/ledoent/.github gives a
visitor the org-control overview without making them dig through
.github/scripts/. Covers what runs daily, the secret list, and how
to add a fork.

The README explicitly calls out the PAT resource-owner trap: the
token must be issued under the ledoent ORG, not a personal account —
a fine-grained PAT scoped to "all repositories owned by you" (where
you = a personal account) returns 403 on every ledoent/* repo, even
when permissions are correct. That's the failure mode that cost us a
day on the initial setup; flagging it in two places (root README +
scripts/README.md) so the next person doesn't repeat it.

The scripts/ README is rewritten as internals-focused since the
overview lives at root now. New table maps each stage of the
pipeline to its inputs/outputs so the data flow is greppable.
@dnplkndll dnplkndll changed the title from "ci: PAT scope docs + treat 404 branches as skipped" to "feat(ci): org-control plane quality pass — security, tests, DRY, observability" May 20, 2026
The fork's default branch on ledoent/OpenUpgrade is `ledoent`, which
is a CI overlay (= 19.0 + lab CI customizations baked on top). It is
NOT a copy of an upstream branch.

merge-upstream was trying to merge OCA/OpenUpgrade@19.0 into this
overlay, which is the wrong direction:

  upstream/19.0 ----A----B----C----D
                                    \
  ledoent       --------+ci-overlay--+ ←← merging A..D in here
                                       conflicts with the overlay's
                                       intentional drift, returns 409

The actual workflow is the reverse: when 19.0 drifts too far, the
human rebases the CI overlay on top of 19.0 manually (per
docs/branch-model.md on the lab side). PRs always target OCA's 19.0,
never the overlay.

Removing "ledoent" from branches[] so only 19.0 gets auto-synced.
The overlay can stay as the fork's default branch — the digest just
won't touch it. Drops one false-failure entry from tomorrow's email.
@dnplkndll dnplkndll merged commit d66d28d into main May 20, 2026
1 check passed
@dnplkndll dnplkndll deleted the chore/document-pat-workflows-scope branch May 20, 2026 14:47