From d773f3fd9971dc4cfa9ff79ff4e2b180fe1bb148 Mon Sep 17 00:00:00 2001
From: Austin Kurpuis
Date: Mon, 4 May 2026 13:21:43 -0700
Subject: [PATCH] Cleanup stale tags

Co-authored-by: Copilot
---
 internal/demo/k8s/demo-loop-runner/README.md  | 116 ++++++++++--------
 .../scripts/cleanup-drained-versions.sh       |  12 ++
 .../scripts/run_release_once.sh               |  19 ---
 .../templates/cleanup-deployment.yaml         |   4 +
 4 files changed, 82 insertions(+), 69 deletions(-)

diff --git a/internal/demo/k8s/demo-loop-runner/README.md b/internal/demo/k8s/demo-loop-runner/README.md
index 68d5dfe..687cf1b 100644
--- a/internal/demo/k8s/demo-loop-runner/README.md
+++ b/internal/demo/k8s/demo-loop-runner/README.md
@@ -1,9 +1,10 @@
 # Demo Loop Runner
 
-Two independent Kubernetes CronJobs that drive the Temporal Worker Controller rainbow deployment demo:
+Three long-running Kubernetes Deployments that drive the Temporal Worker Controller rainbow deployment demo:
 
-1. **Release CronJob** (`rainbow-release`) — generates a new worker version, builds the container image via Kaniko, and deploys it via Skaffold/Helm
-2. **Traffic CronJob** (`rainbow-traffic`) — starts workflows on Temporal to create realistic in-flight traffic across versions
+1. **`rainbow-release`** — simulates releases by retagging the `latest` ECR image with a new random hex tag and patching the TemporalWorkerDeployment
+2. **`rainbow-traffic`** — starts workflows on Temporal to create realistic in-flight traffic across versions
+3. **`rainbow-cleanup`** — removes drained Temporal versions and their ECR tags once all their workflows have completed
 
 Scripts are deployed as ConfigMaps and mounted into the pods, so changes to behavior only require a Helm upgrade — no image rebuild.
 
@@ -11,46 +12,61 @@ Scripts are deployed as ConfigMaps and mounted into the pods, so changes to beha
 
 ```
 ┌────────────────────────────────────────────────────────────┐
-│  CronJob: rainbow-release (every 3 min)                    │
+│  Deployment: rainbow-release (every ~30s)                  │
 │  ConfigMap: rainbow-release-scripts                        │
 │    run_release_once.sh                                     │
-│      ├─ generate_version_cron.sh → mutate worker, commit   │
-│      ├─ build_version_kaniko.sh → Kaniko Job → ECR push    │
-│      └─ deploy_version_skaffold.sh → skaffold deploy       │
+│      ├─ retag ECR :latest → :                              │
+│      └─ kubectl patch TWD image → triggers rollout         │
 └────┬───────────────────────┬───────────────────────────────┘
      │                       │
      ▼                       ▼
   K8s API                 AWS ECR
 
 ┌────────────────────────────────────────────────────────────┐
-│  CronJob: rainbow-traffic (every 1 min)                    │
+│  Deployment: rainbow-traffic (every ~3s)                   │
 │  ConfigMap: rainbow-traffic-scripts                        │
-│    generate-traffic.sh → temporal workflow start           │
+│    generate-traffic.sh → temporal workflow start           │
 └────┬───────────────────────────────────────────────────────┘
      │
      ▼
   Temporal Cloud
+
+┌────────────────────────────────────────────────────────────┐
+│  Deployment: rainbow-cleanup (every ~120s)                 │
+│  ConfigMap: rainbow-cleanup-scripts                        │
+│    cleanup-drained-versions.sh                             │
+│      ├─ delete drained Temporal versions (0 workflows)     │
+│      └─ delete matching ECR tag after version removed      │
+└────┬───────────────────────┬───────────────────────────────┘
+     │                       │
+     ▼                       ▼
+  Temporal Cloud          AWS ECR
 ```
 
 ## How It Works
 
-### Release CronJob
+### Release
+
+Each iteration:
+1. Generates a random 6-character hex tag (e.g. `a3f9c1`)
+2. Fetches the manifest for `ECR_REPO:latest` and pushes it as `ECR_REPO:`
+3. Patches the TWD's container image to the new tag
+4. The controller detects the change and begins a rainbow rollout
 
-Each invocation:
-1. Clones the repo, reads version counter from `rainbow-version-state` ConfigMap
-2. Mutates `worker.go` with version-specific sleep duration and commits
-3. Launches a Kaniko Job to build and push the image to ECR
-4. Runs `skaffold deploy` to update the TemporalWorkerDeployment CR
-5. The controller detects the change and begins a rainbow rollout
+No git clone, no build, no Kaniko. A new ECR tag pointing at the same image is enough to trigger the controller.
 
-### Traffic CronJob
+### Traffic
 
-Each invocation:
-1. Counts running workflows via `temporal workflow count`
-2. Computes how many to start (respects `MAX_RUNNING_WORKFLOWS` cap)
-3. Starts workflows with `temporal workflow start` on the configured task queue
+Each iteration starts `WORKFLOWS_PER_RUN` workflows against the configured task queue. Workflows are long-running (30s sleep), so there is always live traffic spread across multiple versions during rollouts.
 
-Because releases and traffic are decoupled, traffic keeps flowing regardless of release state, and releases aren't blocked by traffic generation timing.
+### Cleanup
+
+Each iteration:
+1. Describes the Temporal worker deployment to list all versions
+2. For each version that is not current/target and has no running workflows:
+   - If a K8s Deployment still exists for it, skips (controller manages its lifecycle)
+   - If no K8s Deployment exists and 0 running workflows, deletes the Temporal version and its ECR tag
+3. Catch-all: terminates any workflows pinned to versions with no remaining K8s Deployment (true orphans)
 
 ## Configuration
 
@@ -64,34 +80,34 @@ All values via Helm:
 | `image.tag` | `latest` | Image tag |
 | `image.pullPolicy` | `Always` | Pull policy |
 
-### Release CronJob
+### Release
 
 | Value | Default | Description |
 |-------|---------|-------------|
-| `release.schedule` | `*/3 * * * *` | Cron schedule for releases |
-| `release.activeDeadlineSeconds` | `900` | Job timeout |
+| `release.intervalSeconds` | `30` | Sleep between releases |
 | `release.name` | `helloworld` | TemporalWorkerDeployment name |
-| `release.repoUrl` | — | Git repo to clone |
-| `release.repoRef` | `main` | Branch/tag |
-| `release.worker` | `helloworld` | Worker build arg |
-| `release.waitForTwdRollout` | `false` | Block until rollout completes |
+| `release.ecrRepo` | `025066239481.dkr.ecr.us-east-2.amazonaws.com/helloworld` | ECR repository URI |
+| `release.sourceTag` | `latest` | Tag to retag from |
 
-### Traffic CronJob
+### Traffic
 
 | Value | Default | Description |
 |-------|---------|-------------|
-| `traffic.schedule` | `* * * * *` | Cron schedule for traffic |
-| `traffic.activeDeadlineSeconds` | `120` | Job timeout |
+| `traffic.intervalSeconds` | `3` | Sleep between traffic runs |
 | `traffic.workflowType` | `HelloWorld` | Workflow type to start |
-| `traffic.workflowsPerRun` | `5` | Target workflows per invocation |
-| `traffic.maxNewWorkflowsPerRun` | `5` | Max new workflows per invocation |
-| `traffic.maxRunningWorkflows` | `10` | Global cap on running workflows |
+| `traffic.workflowsPerRun` | `6` | Workflows started per iteration |
+
+### Cleanup
+
+| Value | Default | Description |
+|-------|---------|-------------|
+| `cleanup.intervalSeconds` | `120` | Sleep between cleanup runs |
 
 ### AWS
 
 | Value | Default | Description |
 |-------|---------|-------------|
-| `aws.roleArn` | — | IAM role ARN for IRSA (ECR push) |
+| `aws.roleArn` | — | IAM role ARN for IRSA |
 | `aws.region` | `us-east-2` | AWS region |
 
 ### Temporal
@@ -109,7 +125,7 @@ All values via Helm:
 
 Built from `internal/demo/Dockerfile.release-job` — Alpine with tools only (no scripts):
 
-- `kubectl`, `temporal` CLI, `skaffold`, `helm`, `aws-cli`, `git`, `jq`
+- `kubectl`, `temporal` CLI, `aws-cli`, `jq`
 
 Scripts are mounted from ConfigMaps at `/opt/scripts/`.
 
@@ -117,13 +133,13 @@ Scripts are mounted from ConfigMaps at `/opt/scripts/`.
 
 Live in the chart at `scripts/`:
 
-| Script | CronJob | Purpose |
-|--------|---------|---------|
-| `run_release_once.sh` | release | Orchestrates one full release cycle |
-| `generate_version_cron.sh` | release | Mutates worker.go, commits, emits image tag |
-| `build_version_kaniko.sh` | release | Creates Kaniko Job, waits for completion |
-| `deploy_version_skaffold.sh` | release | Skaffold deploy with pre-built artifact |
-| `generate-traffic.sh` | traffic | Counts workflows, starts new ones up to cap |
+| Script | Deployment | Purpose |
+|--------|-----------|---------|
+| `run_release_once.sh` | release | Retags ECR image and patches TWD |
+| `generate-traffic.sh` | traffic | Starts workflows against the task queue |
+| `cleanup-drained-versions.sh` | cleanup | Deletes drained Temporal versions and ECR tags |
+
+> `build_version_kaniko.sh`, `generate_version_cron.sh`, and `deploy_version_skaffold.sh` are unused legacy scripts kept for reference.
 
 To change script behavior, edit the files and re-deploy the chart. No image rebuild needed.
 
@@ -146,13 +162,13 @@ helm upgrade --install demo-loop-runner internal/demo/k8s/demo-loop-runner \
 
 ## How Rainbow Deployments Emerge
 
-Because workflows sleep 150–240s and releases happen every 3 minutes, there's always version overlap:
+Because workflows sleep 30s and releases happen every ~30s, there is always version overlap:
 
-1. Version N deploys → workflows start on version N
-2. 3 min later, version N+1 deploys → controller begins ramping
-3. Version N's workflows still running (pinned to N)
-4. Traffic CronJob independently starts new workflows (routed to latest)
-5. Version N drains as its workflows complete
+1. Version N deploys → workflows start on version N (pinned to it)
+2. ~30s later, version N+1 deploys → controller begins ramping traffic
+3. Version N's in-flight workflows finish draining
+4. Traffic keeps starting new workflows independently (routed to the current version)
+5. Cleanup removes version N from Temporal and ECR once it's fully drained
 6. Meanwhile version N+2 deploys...
 
 The decoupled traffic ensures continuous workflow pressure independent of the release cadence.
diff --git a/internal/demo/k8s/demo-loop-runner/scripts/cleanup-drained-versions.sh b/internal/demo/k8s/demo-loop-runner/scripts/cleanup-drained-versions.sh
index 57dd15c..3f118ea 100644
--- a/internal/demo/k8s/demo-loop-runner/scripts/cleanup-drained-versions.sh
+++ b/internal/demo/k8s/demo-loop-runner/scripts/cleanup-drained-versions.sh
@@ -16,6 +16,8 @@ set -eu
 
 NAMESPACE="${NAMESPACE:-default}"
 RELEASE_NAME="${RELEASE_NAME:-helloworld}"
+AWS_REGION="${AWS_REGION:-us-east-2}"
+ECR_REPO="${ECR_REPO:-}"
 WORKER_DEPLOYMENT_NAME="${WORKER_DEPLOYMENT_NAME:-default/${RELEASE_NAME}}"
 MANAGER_IDENTITY="${MANAGER_IDENTITY:-temporal-worker-controller/temporal-system}"
 TIMESTAMP=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
@@ -121,6 +123,16 @@ for BUILD_ID in $ALL_VERSIONS; do
         --api-key "$TEMPORAL_API_KEY" \
         --tls 2>&1; then
       DELETED=$((DELETED + 1))
+      # Delete the corresponding ECR image tag now that the version is fully cleaned up
+      if [ -n "$ECR_REPO" ]; then
+        REPO_NAME=$(echo "$ECR_REPO" | cut -d/ -f2-)
+        aws ecr batch-delete-image \
+          --region "$AWS_REGION" \
+          --repository-name "$REPO_NAME" \
+          --image-ids "imageTag=$BUILD_ID" >/dev/null 2>&1 && \
+          echo "[$TIMESTAMP] Deleted ECR tag $BUILD_ID" || \
+          echo "[$TIMESTAMP] Warning: could not delete ECR tag $BUILD_ID (may not exist)"
+      fi
     else
       FAILED=$((FAILED + 1))
     fi
diff --git a/internal/demo/k8s/demo-loop-runner/scripts/run_release_once.sh b/internal/demo/k8s/demo-loop-runner/scripts/run_release_once.sh
index 7c3ffe7..7d8739e 100644
--- a/internal/demo/k8s/demo-loop-runner/scripts/run_release_once.sh
+++ b/internal/demo/k8s/demo-loop-runner/scripts/run_release_once.sh
@@ -34,25 +34,6 @@ if [ -z "$MANIFEST" ] || [ "$MANIFEST" = "None" ]; then
   exit 1
 fi
 
-# Clean up old tags to stay under ECR's 1000-tag-per-image limit.
-# Keep only the 20 most recent hex tags (plus latest/gatefix-*).
-OLD_TAGS=$(aws ecr describe-images \
-  --region "$AWS_REGION" \
-  --repository-name "$REPO_NAME" \
-  --image-ids imageTag="$SOURCE_TAG" \
-  --query 'imageDetails[0].imageTags' \
-  --output json 2>/dev/null | \
-  jq -r '.[] | select(. != "latest" and (startswith("gatefix") | not))' | \
-  tail -n +21) || true
-
-if [ -n "$OLD_TAGS" ]; then
-  echo "$OLD_TAGS" | jq -R -s 'split("\n") | map(select(. != "")) | map({"imageTag": .})' | \
-    aws ecr batch-delete-image \
-      --region "$AWS_REGION" \
-      --repository-name "$REPO_NAME" \
-      --image-ids file:///dev/stdin >/dev/null 2>&1 || true
-fi
-
 PUT_RESULT=$(aws ecr put-image \
   --region "$AWS_REGION" \
   --repository-name "$REPO_NAME" \
diff --git a/internal/demo/k8s/demo-loop-runner/templates/cleanup-deployment.yaml b/internal/demo/k8s/demo-loop-runner/templates/cleanup-deployment.yaml
index 4ee4db6..6fa6b75 100644
--- a/internal/demo/k8s/demo-loop-runner/templates/cleanup-deployment.yaml
+++ b/internal/demo/k8s/demo-loop-runner/templates/cleanup-deployment.yaml
@@ -51,6 +51,10 @@ spec:
                   key: {{ .Values.temporal.apiKey.secretKey }}
             - name: WORKER_DEPLOYMENT_NAME
               value: {{ .Values.temporal.workerDeployment | quote }}
+            - name: AWS_REGION
+              value: {{ .Values.aws.region | quote }}
+            - name: ECR_REPO
+              value: {{ .Values.release.ecrRepo | quote }}
           resources:
             requests:
               cpu: 50m