116 changes: 66 additions & 50 deletions internal/demo/k8s/demo-loop-runner/README.md
@@ -1,56 +1,72 @@
# Demo Loop Runner

Two independent Kubernetes CronJobs that drive the Temporal Worker Controller rainbow deployment demo:
Three long-running Kubernetes Deployments that drive the Temporal Worker Controller rainbow deployment demo:

1. **Release CronJob** (`rainbow-release`) — generates a new worker version, builds the container image via Kaniko, and deploys it via Skaffold/Helm
2. **Traffic CronJob** (`rainbow-traffic`) — starts workflows on Temporal to create realistic in-flight traffic across versions
1. **`rainbow-release`** — simulates releases by retagging the `latest` ECR image with a new random hex tag and patching the TemporalWorkerDeployment
2. **`rainbow-traffic`** — starts workflows on Temporal to create realistic in-flight traffic across versions
3. **`rainbow-cleanup`** — removes drained Temporal versions and their ECR tags once all their workflows have completed

Scripts are deployed as ConfigMaps and mounted into the pods, so changes to behavior only require a Helm upgrade — no image rebuild.

## Architecture

```
┌────────────────────────────────────────────────────────────┐
CronJob: rainbow-release (every 3 min)
Deployment: rainbow-release (every ~30s)
│ ConfigMap: rainbow-release-scripts │
│ run_release_once.sh │
│ ├─ generate_version_cron.sh → mutate worker, commit │
│ ├─ build_version_kaniko.sh → Kaniko Job → ECR push │
│ └─ deploy_version_skaffold.sh → skaffold deploy │
│ ├─ retag ECR :latest → :<random-hex> │
│ └─ kubectl patch TWD image → triggers rollout │
└────┬───────────────────────┬───────────────────────────────┘
│ │
▼ ▼
K8s API AWS ECR

┌────────────────────────────────────────────────────────────┐
CronJob: rainbow-traffic (every 1 min)
Deployment: rainbow-traffic (every ~3s)
│ ConfigMap: rainbow-traffic-scripts │
│ generate-traffic.sh → temporal workflow start
│ generate-traffic.sh → temporal workflow start │
└────┬───────────────────────────────────────────────────────┘
Temporal Cloud

┌────────────────────────────────────────────────────────────┐
│ Deployment: rainbow-cleanup (every ~120s) │
│ ConfigMap: rainbow-cleanup-scripts │
│ cleanup-drained-versions.sh │
│ ├─ delete drained Temporal versions (0 workflows) │
│ └─ delete matching ECR tag after version removed │
└────┬───────────────────────┬───────────────────────────────┘
│ │
▼ ▼
Temporal Cloud AWS ECR
```
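
Each Deployment amounts to a script run in a loop with a sleep between iterations. A minimal sketch of such a loop, under assumptions: the wrapper name `run_loop`, its optional iteration cap, and the `INTERVAL_SECONDS` variable are hypothetical and not taken from the chart, whose actual entrypoint is not shown here.

```shell
set -eu

# Hypothetical wrapper: run a command repeatedly, sleeping between iterations.
# $1 = command to run, $2 = sleep seconds, $3 = max iterations (0 = forever).
run_loop() {
  cmd=$1
  interval=$2
  max=${3:-0}
  i=0
  while [ "$max" -eq 0 ] || [ "$i" -lt "$max" ]; do
    # A failed iteration is logged and the loop continues,
    # so one bad run does not end the loop.
    "$cmd" || echo "iteration $i failed; continuing" >&2
    i=$((i + 1))
    sleep "$interval"
  done
}

# Example invocation (not executed here; the env var name is an assumption):
#   run_loop /opt/scripts/run_release_once.sh "${INTERVAL_SECONDS:-30}"
```

The iteration cap exists only to make the loop easy to exercise locally; a long-running pod would leave it at 0.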

## How It Works

### Release CronJob
### Release

Each iteration:
1. Generates a random 6-character hex tag (e.g. `a3f9c1`)
2. Fetches the manifest for `ECR_REPO:latest` and pushes it as `ECR_REPO:<hex>`
3. Patches the TWD's container image to the new tag
4. The controller detects the change and begins a rainbow rollout

Each invocation:
1. Clones the repo, reads version counter from `rainbow-version-state` ConfigMap
2. Mutates `worker.go` with version-specific sleep duration and commits
3. Launches a Kaniko Job to build and push the image to ECR
4. Runs `skaffold deploy` to update the TemporalWorkerDeployment CR
5. The controller detects the change and begins a rainbow rollout
No git clone, no build, no Kaniko. A new ECR tag pointing at the same image is enough to trigger the controller.
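
The release steps above can be sketched as follows. This is an illustration, not the contents of `run_release_once.sh`: the function names are hypothetical, the variables are assumed to be set by the pod environment, and the `kubectl patch` resource name and JSON path are assumptions about the TemporalWorkerDeployment CRD that should be checked against the installed version.

```shell
set -eu

# Generate a random 6-character lowercase hex tag from 3 random bytes.
gen_hex_tag() {
  head -c 3 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# One release iteration. Defined only; calling it requires aws and kubectl
# plus AWS_REGION, REPO_NAME, SOURCE_TAG, ECR_REPO, NAMESPACE and RELEASE_NAME.
release_once() {
  tag=$(gen_hex_tag)

  # Fetch the manifest behind the source tag (e.g. latest).
  manifest=$(aws ecr batch-get-image \
    --region "$AWS_REGION" \
    --repository-name "$REPO_NAME" \
    --image-ids "imageTag=$SOURCE_TAG" \
    --query 'images[0].imageManifest' \
    --output text)
  if [ -z "$manifest" ] || [ "$manifest" = "None" ]; then
    echo "no manifest found for $SOURCE_TAG" >&2
    return 1
  fi

  # Push the same manifest under the new tag; no layers are uploaded.
  aws ecr put-image \
    --region "$AWS_REGION" \
    --repository-name "$REPO_NAME" \
    --image-tag "$tag" \
    --image-manifest "$manifest" >/dev/null

  # Assumption: resource name and image path inside the TWD spec.
  kubectl -n "$NAMESPACE" patch temporalworkerdeployment "$RELEASE_NAME" \
    --type=json \
    -p "[{\"op\":\"replace\",\"path\":\"/spec/template/spec/containers/0/image\",\"value\":\"$ECR_REPO:$tag\"}]"
}
```

Because only a manifest is re-pushed, an iteration costs two ECR API calls and one Kubernetes API call, which is what makes a ~30s cadence practical.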

### Traffic CronJob
### Traffic

Each invocation:
1. Counts running workflows via `temporal workflow count`
2. Computes how many to start (respects `MAX_RUNNING_WORKFLOWS` cap)
3. Starts workflows with `temporal workflow start` on the configured task queue
Each iteration starts `WORKFLOWS_PER_RUN` workflows against the configured task queue. Workflows are long-running (30s sleep), so there is always live traffic spread across multiple versions during rollouts.
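
A traffic iteration could look like the sketch below. The function names and the environment variable names (`WORKFLOW_TYPE`, `TASK_QUEUE`, `TEMPORAL_NAMESPACE`, `TEMPORAL_ADDRESS`) are assumptions for illustration; only `WORKFLOWS_PER_RUN` and `TEMPORAL_API_KEY` appear in this change.

```shell
set -eu

WORKFLOWS_PER_RUN="${WORKFLOWS_PER_RUN:-6}"

# Start a single workflow. Requires the temporal CLI and the variables below.
start_one() {
  temporal workflow start \
    --type "$WORKFLOW_TYPE" \
    --task-queue "$TASK_QUEUE" \
    --namespace "$TEMPORAL_NAMESPACE" \
    --address "$TEMPORAL_ADDRESS" \
    --api-key "$TEMPORAL_API_KEY" \
    --tls >/dev/null
}

# Start $1 workflows and print how many starts succeeded.
start_batch() {
  count=$1
  started=0
  i=0
  while [ "$i" -lt "$count" ]; do
    if start_one; then
      started=$((started + 1))
    fi
    i=$((i + 1))
  done
  echo "$started"
}

# Example invocation (not executed here):
#   start_batch "$WORKFLOWS_PER_RUN"
```

Counting successes rather than aborting on the first failure keeps traffic flowing even if an individual start call is rejected.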

Because releases and traffic are decoupled, traffic keeps flowing regardless of release state, and releases aren't blocked by traffic generation timing.
### Cleanup

Each iteration:
1. Describes the Temporal worker deployment to list all versions
2. For each version that is not current/target and has no running workflows:
   - If a K8s Deployment still exists for it, skips it (the controller manages its lifecycle)
   - If no K8s Deployment exists and it has 0 running workflows, deletes the Temporal version and its ECR tag
3. Catch-all: terminates any workflows pinned to versions with no remaining K8s Deployment (true orphans)
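
The per-version decision described above can be written as a small pure function. This is a sketch of the rules as stated, not code from `cleanup-drained-versions.sh`; the function name, argument order, and action labels are hypothetical.

```shell
set -eu

# Decide what to do with one version.
# $1 = "yes" if the version is current or target
# $2 = "yes" if a K8s Deployment still exists for it
# $3 = number of running workflows on it
cleanup_action() {
  is_active=$1
  has_k8s_deployment=$2
  running=$3
  if [ "$is_active" = yes ]; then
    echo keep                 # current/target versions are never touched
  elif [ "$has_k8s_deployment" = yes ]; then
    echo skip                 # the controller manages its lifecycle
  elif [ "$running" -eq 0 ]; then
    echo delete               # drained: remove Temporal version and ECR tag
  else
    echo terminate-orphans    # workflows pinned to a version with no Deployment
  fi
}
```

For example, `cleanup_action no no 0` prints `delete`, while `cleanup_action no yes 0` prints `skip`.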

## Configuration

@@ -64,34 +80,34 @@ All values via Helm:
| `image.tag` | `latest` | Image tag |
| `image.pullPolicy` | `Always` | Pull policy |

### Release CronJob
### Release

| Value | Default | Description |
|-------|---------|-------------|
| `release.schedule` | `*/3 * * * *` | Cron schedule for releases |
| `release.activeDeadlineSeconds` | `900` | Job timeout |
| `release.intervalSeconds` | `30` | Sleep between releases |
| `release.name` | `helloworld` | TemporalWorkerDeployment name |
| `release.repoUrl` | — | Git repo to clone |
| `release.repoRef` | `main` | Branch/tag |
| `release.worker` | `helloworld` | Worker build arg |
| `release.waitForTwdRollout` | `false` | Block until rollout completes |
| `release.ecrRepo` | `025066239481.dkr.ecr.us-east-2.amazonaws.com/helloworld` | ECR repository URI |
| `release.sourceTag` | `latest` | Tag to retag from |

### Traffic CronJob
### Traffic

| Value | Default | Description |
|-------|---------|-------------|
| `traffic.schedule` | `* * * * *` | Cron schedule for traffic |
| `traffic.activeDeadlineSeconds` | `120` | Job timeout |
| `traffic.intervalSeconds` | `3` | Sleep between traffic runs |
| `traffic.workflowType` | `HelloWorld` | Workflow type to start |
| `traffic.workflowsPerRun` | `5` | Target workflows per invocation |
| `traffic.maxNewWorkflowsPerRun` | `5` | Max new workflows per invocation |
| `traffic.maxRunningWorkflows` | `10` | Global cap on running workflows |
| `traffic.workflowsPerRun` | `6` | Workflows started per iteration |

### Cleanup

| Value | Default | Description |
|-------|---------|-------------|
| `cleanup.intervalSeconds` | `120` | Sleep between cleanup runs |

### AWS

| Value | Default | Description |
|-------|---------|-------------|
| `aws.roleArn` | — | IAM role ARN for IRSA (ECR push) |
| `aws.roleArn` | — | IAM role ARN for IRSA |
| `aws.region` | `us-east-2` | AWS region |

### Temporal
@@ -109,21 +125,21 @@ All values via Helm:

Built from `internal/demo/Dockerfile.release-job` — Alpine with tools only (no scripts):

- `kubectl`, `temporal` CLI, `skaffold`, `helm`, `aws-cli`, `git`, `jq`
- `kubectl`, `temporal` CLI, `aws-cli`, `jq`

Scripts are mounted from ConfigMaps at `/opt/scripts/`.

## Scripts

Live in the chart at `scripts/`:

| Script | CronJob | Purpose |
|--------|---------|---------|
| `run_release_once.sh` | release | Orchestrates one full release cycle |
| `generate_version_cron.sh` | release | Mutates worker.go, commits, emits image tag |
| `build_version_kaniko.sh` | release | Creates Kaniko Job, waits for completion |
| `deploy_version_skaffold.sh` | release | Skaffold deploy with pre-built artifact |
| `generate-traffic.sh` | traffic | Counts workflows, starts new ones up to cap |
| Script | Deployment | Purpose |
|--------|-----------|---------|
| `run_release_once.sh` | release | Retags ECR image and patches TWD |
| `generate-traffic.sh` | traffic | Starts workflows against the task queue |
| `cleanup-drained-versions.sh` | cleanup | Deletes drained Temporal versions and ECR tags |

> `build_version_kaniko.sh`, `generate_version_cron.sh`, and `deploy_version_skaffold.sh` are unused legacy scripts kept for reference.

To change script behavior, edit the files and re-deploy the chart. No image rebuild needed.

@@ -146,13 +162,13 @@ helm upgrade --install demo-loop-runner internal/demo/k8s/demo-loop-runner \

## How Rainbow Deployments Emerge

Because workflows sleep 150–240s and releases happen every 3 minutes, there's always version overlap:
Because workflows sleep 30s and releases happen every ~30s, there is always version overlap:

1. Version N deploys → workflows start on version N
2. 3 min later, version N+1 deploys → controller begins ramping
3. Version N's workflows still running (pinned to N)
4. Traffic CronJob independently starts new workflows (routed to latest)
5. Version N drains as its workflows complete
1. Version N deploys → workflows start on version N (pinned to it)
2. ~30s later, version N+1 deploys → controller begins ramping traffic
3. Version N's in-flight workflows finish draining
4. Traffic keeps starting new workflows independently (routed to the current version)
5. Cleanup removes version N from Temporal and ECR once it's fully drained
6. Meanwhile version N+2 deploys...

The decoupled traffic ensures continuous workflow pressure independent of the release cadence.
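
As a rough worked estimate of that pressure with the new defaults, assume each workflow runs for about its 30s sleep and that starting a batch takes negligible time compared with the 3s interval. Both are simplifying assumptions, and the helper name is hypothetical.

```shell
set -eu

# Approximate steady-state number of running workflows:
#   workflows per run * workflow duration / interval between runs
# $1 = workflows per run, $2 = interval seconds, $3 = workflow duration seconds
estimate_running() {
  echo $(( $1 * $3 / $2 ))
}

# traffic.workflowsPerRun=6, traffic.intervalSeconds=3, 30s workflows
estimate_running 6 3 30   # prints 60
```

With a release every ~30s and workflows lasting about 30s, those in-flight workflows are typically spread over the current version and the one before it.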
@@ -16,6 +16,8 @@ set -eu

NAMESPACE="${NAMESPACE:-default}"
RELEASE_NAME="${RELEASE_NAME:-helloworld}"
AWS_REGION="${AWS_REGION:-us-east-2}"
ECR_REPO="${ECR_REPO:-}"
WORKER_DEPLOYMENT_NAME="${WORKER_DEPLOYMENT_NAME:-default/${RELEASE_NAME}}"
MANAGER_IDENTITY="${MANAGER_IDENTITY:-temporal-worker-controller/temporal-system}"
TIMESTAMP=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
@@ -121,6 +123,16 @@ for BUILD_ID in $ALL_VERSIONS; do
      --api-key "$TEMPORAL_API_KEY" \
      --tls 2>&1; then
    DELETED=$((DELETED + 1))
    # Delete the corresponding ECR image tag now that the version is fully cleaned up
    if [ -n "$ECR_REPO" ]; then
      REPO_NAME=$(echo "$ECR_REPO" | cut -d/ -f2-)
      aws ecr batch-delete-image \
        --region "$AWS_REGION" \
        --repository-name "$REPO_NAME" \
        --image-ids "imageTag=$BUILD_ID" >/dev/null 2>&1 && \
        echo "[$TIMESTAMP] Deleted ECR tag $BUILD_ID" || \
        echo "[$TIMESTAMP] Warning: could not delete ECR tag $BUILD_ID (may not exist)"
    fi
  else
    FAILED=$((FAILED + 1))
  fi
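
The `cut -d/ -f2-` step in this hunk strips the registry host from the ECR URI, leaving the repository name that `aws ecr` expects. A small standalone illustration, using the default `release.ecrRepo` value from the README; the helper name is hypothetical.

```shell
set -eu

# Drop everything up to and including the first "/" (the registry host).
repo_name_from_uri() {
  printf '%s\n' "$1" | cut -d/ -f2-
}

repo_name_from_uri "025066239481.dkr.ecr.us-east-2.amazonaws.com/helloworld"
# prints: helloworld
```

Using `-f2-` rather than `-f2` keeps nested repository names intact, so `host/team/app` yields `team/app`.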
19 changes: 0 additions & 19 deletions internal/demo/k8s/demo-loop-runner/scripts/run_release_once.sh
@@ -34,25 +34,6 @@ if [ -z "$MANIFEST" ] || [ "$MANIFEST" = "None" ]; then
exit 1
fi

# Clean up old tags to stay under ECR's 1000-tag-per-image limit.
# Keep only the 20 most recent hex tags (plus latest/gatefix-*).
OLD_TAGS=$(aws ecr describe-images \
--region "$AWS_REGION" \
--repository-name "$REPO_NAME" \
--image-ids imageTag="$SOURCE_TAG" \
--query 'imageDetails[0].imageTags' \
--output json 2>/dev/null | \
jq -r '.[] | select(. != "latest" and (startswith("gatefix") | not))' | \
tail -n +21) || true

if [ -n "$OLD_TAGS" ]; then
echo "$OLD_TAGS" | jq -R -s 'split("\n") | map(select(. != "")) | map({"imageTag": .})' | \
aws ecr batch-delete-image \
--region "$AWS_REGION" \
--repository-name "$REPO_NAME" \
--image-ids file:///dev/stdin >/dev/null 2>&1 || true
fi

PUT_RESULT=$(aws ecr put-image \
--region "$AWS_REGION" \
--repository-name "$REPO_NAME" \
@@ -51,6 +51,10 @@ spec:
                  key: {{ .Values.temporal.apiKey.secretKey }}
            - name: WORKER_DEPLOYMENT_NAME
              value: {{ .Values.temporal.workerDeployment | quote }}
            - name: AWS_REGION
              value: {{ .Values.aws.region | quote }}
            - name: ECR_REPO
              value: {{ .Values.release.ecrRepo | quote }}
          resources:
            requests:
              cpu: 50m