ci: make the Docker test matrix resilient to registry flakes (#172) #173

Merged
pierre-warnier merged 1 commit into main from ci/harden-docker-matrix on Jun 10, 2026

Conversation

@pierre-warnier

Collaborator

Summary

The Test (debian/alpine/fedora) matrix fails intermittently on transient network errors unrelated to the code. On 2026-06-10, main CI for 27daa7b failed four times in a row: Docker Hub base-image pulls hit an i/o timeout (alpine, then fedora), and a crates.io download failed with curl error [55] Broken pipe. The runner-native Test job (no Docker pull, cached) passed every time, which isolates the cause to the matrix's network exposure.

Changes (.github/workflows/ci.yml only)

  • fail-fast: false on the matrix — one distro's flake no longer cancels the other two.
  • Retry docker compose build <target> (the Docker Hub pull) 3× with 30s backoff.
  • CARGO_NET_RETRY=10 — set workflow-wide for runner-native jobs and passed into the container (-e) so cargo retries crate downloads. A real test failure still fails fast (no retry wraps the test run).

No new third-party actions; a plain shell retry loop keeps the supply-chain surface unchanged.
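
For illustration, the retry described above could look roughly like the following. This is a minimal sketch, not the code from ci.yml: the `retry` helper name, its argument order, and the `$TARGET` variable are assumptions made for the example. It runs a command up to a fixed number of times with a fixed delay between attempts, and skips the sleep once no further attempt will follow.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper for illustration; not the workflow's actual code.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  while true; do
    # Run the command; stop at the first success.
    if "$@"; then
      return 0
    fi
    # Give up without sleeping once the last attempt has failed.
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n}/${attempts} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}

# Possible use in a CI step (TARGET is a placeholder for the matrix target):
#   retry 3 30 docker compose build "$TARGET"
```

Only the build step would be wrapped this way; the test run stays outside the loop so that a genuine test failure still fails on the first attempt.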

Validation

  • YAML parses; verified locally that docker compose run --rm -e CARGO_NET_RETRY=10 <svc> propagates the var into the container.
  • Note: the matrix is push-gated, so it does not execute on this PR — it runs on merge to main. The retry/back-off logic is plain shell and reviewed inline.

Closes #172.

The Test (debian/alpine/fedora) matrix failed four times in a row on
2026-06-10, each on a transient network error (Docker Hub base-image
pull i/o timeout; crates.io download broken pipe) unrelated to the code.

- fail-fast: false so one distro's flake no longer cancels the others.
- Retry 'docker compose build <target>' (the Docker Hub pull) with backoff.
- Pass CARGO_NET_RETRY into the container and set it for runner-native
  jobs so cargo retries crate downloads; real test failures still fail
  fast (no step-level retry around the test run).

No new third-party actions — a shell retry loop keeps the supply-chain
surface unchanged.

Closes #172.
Copilot AI review requested due to automatic review settings June 10, 2026 10:40

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens the Docker-based CI test matrix against transient network/registry failures (Docker Hub pulls and crates.io downloads) so that unrelated flakes don’t fail or prematurely cancel the overall matrix run.

Changes:

  • Set CARGO_NET_RETRY=10 at the workflow level to make cargo downloads more resilient.
  • Disable matrix fail-fast so one distro flake doesn’t cancel the other distro jobs.
  • Add a small shell retry loop with backoff around docker compose build <target> to mitigate transient Docker Hub pull failures.

Comment thread .github/workflows/ci.yml
Comment thread .github/workflows/ci.yml
@pierre-warnier pierre-warnier merged commit 4ce4184 into main Jun 10, 2026
8 checks passed
@pierre-warnier pierre-warnier deleted the ci/harden-docker-matrix branch June 10, 2026 10:42
pierre-warnier added a commit that referenced this pull request Jun 10, 2026
Address review feedback on #173:
- Don't sleep after the final build attempt (no retry follows it).
- Forward the workflow-wide CARGO_NET_RETRY into the container with a
  bare '-e CARGO_NET_RETRY' instead of re-hardcoding the value, keeping
  a single source of truth.
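
The second point of the follow-up commit can be sketched as below. This is an assumption-laden illustration rather than the committed change: the `run_in_container` wrapper, the `debian` service name, and the `cargo test` command are hypothetical. The value is defined once, and a bare `-e CARGO_NET_RETRY` (no `=value`) asks Compose to forward whatever the calling environment holds, as the commit message describes.

```shell
# Defined once; in CI this would come from the workflow-level env block.
export CARGO_NET_RETRY=10

# run_in_container SERVICE [COMMAND...]
# Hypothetical wrapper for illustration; SERVICE is a compose service name.
run_in_container() {
  local svc="$1"
  shift
  # Bare `-e NAME` forwards the caller's value instead of re-hardcoding it.
  docker compose run --rm -e CARGO_NET_RETRY "$svc" "$@"
}

# Possible use (service and command are placeholders):
#   run_in_container debian cargo test
```

Changing the retry count then means editing one place, and the container picks up the new value without further changes.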
Development

Successfully merging this pull request may close these issues.

CI: Docker test matrix is flaky on transient registry/network timeouts

2 participants