Add Job workload support to CRUD benchmarking framework by diamondpowell · Pull Request #1133 · Azure/telescope

diamondpowell · 2026-04-14T19:27:57Z

Summary

Adds Job workload support to the CRUD benchmarking framework, the third and final planned workload method. Unlike Deployments and StatefulSets, which run indefinitely, Jobs are run-to-completion workloads: success means the pod terminated cleanly (succeeded > 0), and a failed Job raises immediately.

Branch cleanup note: Rebased and squashed for reviewability. Independent of StatefulSet PR (#1132).

Changes

modules/python/crud/workload_templates/job.yml

New K8s manifest template using batch/v1 API with restartPolicy: Never
Uses a JOB_COMPLETIONS placeholder; parallelism is left unset, so Kubernetes applies its default of 1 (pods run one at a time until the completions count is reached)
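
As a rough illustration of the template approach described above, the sketch below renders a Job manifest by substituting placeholders in a text template. The template body is a hypothetical stand-in, not the actual job.yml from this PR; only batch/v1, restartPolicy: Never, the JOB_COMPLETIONS placeholder, and the nginx -t command come from the PR text. JOB_NAME and render_job_manifest are names assumed here for illustration.

```python
# Hypothetical stand-in for job.yml; the real template is not shown in this PR.
JOB_TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: JOB_NAME
spec:
  completions: JOB_COMPLETIONS
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: nginx
          image: nginx
          command: ["nginx", "-t"]
"""


def render_job_manifest(name: str, completions: int) -> str:
    """Substitute the placeholders in the template text (assumed helper)."""
    if completions < 1:
        raise ValueError("completions must be >= 1")
    return (
        JOB_TEMPLATE
        .replace("JOB_NAME", name)
        .replace("JOB_COMPLETIONS", str(completions))
    )


manifest = render_job_manifest("nginx-container-job-1", 3)
```

Because parallelism is omitted from the template, the rendered manifest relies on the cluster default rather than pinning a value.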

modules/python/crud/azure/node_pool_crud.py

Add create_job() — same loop pattern as other workloads
Uses complete condition instead of available/ready since Jobs terminate after completion
No wait_for_pods_ready — pods exit after job finishes

modules/python/crud/main.py

Add jobs subparser with --node-pool-name, --number-of-jobs, --completions, --manifest-dir
Add elif command == "jobs" routing in handle_workload_operations

modules/python/clients/kubernetes_client.py

Add _check_job_condition and _is_job_condition_met — checks completion_time + succeeded count
Add wait_for_job_completed with 5-min timeout and 30s polling
Add Job kind support to apply_manifest, update_manifest, delete_manifest
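
The completion check and polling loop could look roughly like the sketch below. It is not the code from kubernetes_client.py: the function names, the read_status callable, and the choice to raise on failure or timeout are assumptions for illustration. The status fields (completion_time, succeeded, failed, active) are those of the Kubernetes Python client's V1JobStatus, and the 300 s timeout and 30 s poll interval mirror the values stated above.

```python
import time
from types import SimpleNamespace


def is_job_complete(status) -> bool:
    """True when the Job has a completion time and at least one succeeded pod."""
    return status.completion_time is not None and (status.succeeded or 0) > 0


def is_job_failed(status) -> bool:
    """True when pods have failed and none are still active."""
    return (status.failed or 0) > 0 and (status.active or 0) == 0


def wait_for_job_completed(read_status, timeout_s=300, poll_s=30,
                           sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll read_status() until the Job completes (assumed behavior:
    raise on failure or timeout rather than return False)."""
    deadline = clock() + timeout_s
    while True:
        status = read_status()
        if is_job_complete(status):
            return True
        if is_job_failed(status):
            raise RuntimeError("job failed: pods failed and none are active")
        if clock() >= deadline:
            raise TimeoutError(f"job not complete after {timeout_s}s")
        sleep(poll_s)


# Usage with fake statuses: first poll is still running, second is complete.
statuses = iter([
    SimpleNamespace(completion_time=None, succeeded=None, failed=None, active=1),
    SimpleNamespace(completion_time="2026-04-14T19:30:00Z", succeeded=1,
                    failed=None, active=None),
])
result = wait_for_job_completed(lambda: next(statuses), sleep=lambda _s: None)
```

Injecting sleep and clock keeps the loop testable without waiting on real time, which is one way the timeout and polling tests listed in this PR could be exercised.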

steps/engine/crud/k8s/execute.yml

Add jobs script block calling python3 main.py jobs
Add number_of_jobs and completions parameters

steps/topology/k8s-crud-gpu/execute-crud.yml

Wire number_of_jobs and completions through to engine step

modules/python/clients/aks_client.py

Fix: set gpu_profile driver to "None" for non-GPU node pools

Changes since initial PR

modules/python/crud/azure/node_pool_crud.py

Extract _apply_job() helper from loop body — separates orchestration from execution
Pod labels now include workload type: nginx-container-deployment-1 vs nginx-container-job-1
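
The split between orchestration and execution might be sketched as follows. This is a simplified illustration, not the NodePoolCRUD code: create_jobs, job_name, and the apply_job callable are hypothetical, and only the label format nginx-container-job-{index} and the continue-on-failure behavior come from the PR text.

```python
def job_name(index: int) -> str:
    # Label format taken from the PR description: nginx-container-job-{index}
    return f"nginx-container-job-{index}"


def create_jobs(count: int, apply_job):
    """Apply `count` jobs in a loop; keep going when one fails.

    `apply_job` stands in for the extracted _apply_job helper and is
    expected to raise on failure (an assumption in this sketch).
    """
    succeeded, failed = [], []
    for index in range(1, count + 1):
        name = job_name(index)
        try:
            apply_job(name)
            succeeded.append(name)
        except Exception as exc:  # broad on purpose: one bad job must not stop the loop
            failed.append((name, str(exc)))
    return succeeded, failed


def fake_apply(name: str) -> None:
    """Fake helper that fails only for the second job."""
    if name.endswith("-2"):
        raise RuntimeError("job did not complete")


ok, bad = create_jobs(3, fake_apply)
```

Giving each job its own label keeps a Job's pods from being matched by a selector meant for a Deployment or for another Job.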

modules/python/crud/main.py

Docstring updated to list supported workload types: (deployment, job)
Uses workload_common_parser for shared args (--count, --replicas, --manifest-dir, --label-selector), with --completions added on job_parser
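
One way to share arguments between subcommands is argparse's parents mechanism, sketched below. The argument types and defaults, and the placement of --node-pool-name on the shared parser, are assumptions; the PR text only names the flags and the workload_common_parser and job_parser objects.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Shared workload arguments; add_help=False lets it act as a parent parser.
    workload_common_parser = argparse.ArgumentParser(add_help=False)
    workload_common_parser.add_argument("--node-pool-name", required=True)
    workload_common_parser.add_argument("--count", type=int, default=1)
    workload_common_parser.add_argument("--replicas", type=int, default=1)
    workload_common_parser.add_argument("--manifest-dir")
    workload_common_parser.add_argument("--label-selector")

    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("deployment", parents=[workload_common_parser])

    # Jobs inherit the shared flags and add --completions on top.
    job_parser = subparsers.add_parser("job", parents=[workload_common_parser])
    job_parser.add_argument("--completions", type=int, default=1)
    return parser


args = build_parser().parse_args(
    ["job", "--node-pool-name", "pool1", "--count", "2", "--completions", "3"]
)
```

With this layout the job subcommand still accepts --replicas, which matches the note below that the flag is inherited but unused by jobs.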

steps/engine/crud/k8s/execute.yml

Workload steps gated to Azure-only: ${{ if eq(parameters.cloud, 'azure') }}:
--manifest-dir passed conditionally: ${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"}
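
The shell expansion ${MANIFEST_DIR:+...} emits the flag only when MANIFEST_DIR is set and non-empty. For readers less familiar with that syntax, the same rule can be expressed in Python as a sketch; build_job_command is a hypothetical helper, and the exact command line used by the pipeline is an assumption.

```python
import os


def build_job_command(count: int, completions: int, env=None) -> list:
    """Build the main.py invocation, adding --manifest-dir only when set."""
    env = os.environ if env is None else env
    cmd = ["python3", "main.py", "job",
           "--count", str(count),
           "--completions", str(completions)]
    manifest_dir = env.get("MANIFEST_DIR", "")
    if manifest_dir:  # unset and empty both skip the flag, like ${VAR:+word}
        cmd += ["--manifest-dir", manifest_dir]
    return cmd


without_dir = build_job_command(2, 3, env={})
with_dir = build_job_command(2, 3, env={"MANIFEST_DIR": "/tmp/manifests"})
```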

Notes

--replicas is inherited from workload_common_parser but unused by jobs — jobs use --completions instead. Left as-is to keep the shared parser simple.

Tests

test_azure_node_pool_crud.py:

test_create_job_success
test_create_job_failure
test_create_job_no_client
test_create_job_partial_success

test_kubernetes_client.py:

test_wait_for_condition_job_success — Job completes successfully
test_wait_for_condition_job_timeout — fails within timeout
test_wait_for_condition_job_not_found — not found, returns failure

Dependencies

Based on test-refactor (PR #879) — must merge first. Independent of StatefulSet PR (#1132).

Add create_job() to NodePoolCRUD that deploys K8s Jobs onto node pools. Unlike deployments/statefulsets which run indefinitely, Jobs are run-to-completion workloads: success means the pod terminated cleanly (succeeded > 0), failure raises immediately (no self-healing).

- Add 'jobs' subcommand to handle_workload_operations() in main.py with --number-of-jobs and --completions args
- Add job.yml workload template with configurable completions and node affinity via label_selector
- Add _check_job_condition and _is_job_condition_met to kubernetes_client.py (checks completion_time + succeeded count)
- Add wait_for_job_completed with 5-min timeout and 30s polling
- Job kind support in apply/update/delete manifest methods

Add job execution step to the k8s CRUD engine pipeline between deployment and scale-down. Parameters (number_of_jobs, completions) flow from pipeline matrix → topology → engine step → main.py.

- Add jobs script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_jobs and completions through topology execute-crud.yml
- Jobs run after deployment, before scale-down + delete

Add comprehensive test coverage for create_job and job wait_for_condition:

- test_create_job_success: single job completes successfully
- test_create_job_failure: job fails to complete
- test_create_job_partial_success: continues on individual failures
- test_job_wait_for_condition: validates _check_job_condition and _is_job_condition_met for 'complete' and 'failed' states
- test_wait_for_job_completed: timeout and polling behavior
- Tests cover Job-specific semantics (succeeded count, completion_time, failed + no active pods)

- Extract _apply_job helper (matches _apply_deployment pattern)
- Use os.path for default template path instead of hardcoded string
- Use per-job labels to avoid selector collision
- Remove redundant outer try/except
- Use workload_common_parser for shared args (--count, --manifest-dir, etc.)
- Add hasattr guard for cloud provider compatibility
- Use args.count instead of args.number_of_jobs
- Rename subcommand from 'jobs' to 'job' (matches K8s resource type)
- Update pipeline YAML to use count parameter

- Wrap job pipeline step inside Azure cloud gate (matches deployment)
- Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)

The k8s-gpu-cluster-crud scenario already has terraform inputs. Custom scenario dirs will be reverted before merge.

- Add gpu_node_pool: '' to prevent GPU driver install on Standard_D4s_v3
- Add replicas: 10 for deployment step (topology passes it to engine)
- Include workload type in pod labels to prevent future collision:
  - deployment: nginx-container-deployment-{index}
  - job: nginx-container-job-{index}

diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from 84f388c to 1da0a77, April 21, 2026 04:08

diamondpowell force-pushed the dipowell/crud-jobs branch 3 times, most recently from 787e4d6 to 5a0978d, May 5, 2026 19:31

diamondpowell mentioned this pull request May 7, 2026

Improve deployment workload support in CRUD benchmarking framework #879

Merged

diamondpowell and others added 8 commits May 14, 2026 12:09

fix: resolve trailing whitespace and yamllint issues

c2cf983

fix: gate job step to Azure-only and use conditional manifest-dir

5e14829

- Wrap job pipeline step inside Azure cloud gate (matches deployment)
- Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)

fix: correct docstring to list supported workload types

3e9ac9f

docs: clarify nginx -t command choice in job template

697d614

diamondpowell force-pushed the dipowell/crud-jobs branch from 8aed904 to 697d614, May 14, 2026 18:22

diamondpowell added 6 commits May 14, 2026 14:54

test: populate pipeline test config for job validation

a4fa1cd

fix: use k8s-crud-gpu topology (k8s-crud doesn't exist)

72f2650

fix: use existing perf-eval scenario for pipeline test

9fe29a6

The k8s-gpu-cluster-crud scenario already has terraform inputs. Custom scenario dirs will be reverted before merge.

revert: restore pipeline test placeholder before merge

7d9c686

chore: remove inline comment from job template

c2c59b6

diamondpowell wants to merge 14 commits into main from dipowell/crud-jobs
