Add Job workload support to CRUD benchmarking framework#1133

Draft
diamondpowell wants to merge 14 commits into main from dipowell/crud-jobs

Contributor

@diamondpowell diamondpowell commented Apr 14, 2026

Summary

Adds Job workload support to the CRUD benchmarking framework, the third and final planned workload type. Unlike Deployments and StatefulSets, which run indefinitely, Jobs are run-to-completion workloads: a Job succeeds when its pod terminates cleanly (succeeded > 0), and a failure raises immediately.

Branch cleanup note: Rebased and squashed for reviewability. Independent of StatefulSet PR (#1132).

Changes

modules/python/crud/workload_templates/job.yml

  • New K8s manifest template using batch/v1 API with restartPolicy: Never
  • Uses JOB_COMPLETIONS placeholder; parallelism is not set, so Kubernetes applies its default of 1 and completions run one pod at a time
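
For reviewers who want a concrete picture of the template's shape, here is a minimal sketch that builds an equivalent batch/v1 Job manifest as a Python dict. It is not the contents of job.yml: the helper name `build_job_manifest`, the busybox image, the command, and the `nodeSelector` wiring are all assumptions made for illustration. Only the batch/v1 API version, `restartPolicy: Never`, the configurable completions, and the unset parallelism come from this PR.

```python
import json


def build_job_manifest(name: str, completions: int, node_selector: dict) -> dict:
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    labels = {"app": name}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Stands in for the JOB_COMPLETIONS placeholder in the template.
            "completions": completions,
            # spec.parallelism is deliberately left unset.
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Failed pods are not restarted in place.
                    "restartPolicy": "Never",
                    # Assumption: pods are pinned to the node pool by label.
                    "nodeSelector": dict(node_selector),
                    "containers": [
                        {
                            "name": "worker",
                            # Assumption: any image whose command exits 0.
                            "image": "busybox",
                            "command": ["sh", "-c", "echo done"],
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest("demo-job-1", 3, {"agentpool": "pool1"})
    print(json.dumps(manifest, indent=2))
```

The container must run a command that exits, otherwise the Job never reaches the complete condition.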

modules/python/crud/azure/node_pool_crud.py

  • Add create_job() — same loop pattern as other workloads
  • Uses complete condition instead of available/ready since Jobs terminate after completion
  • No wait_for_pods_ready — pods exit after job finishes

modules/python/crud/main.py

  • Add jobs subparser with --node-pool-name, --number-of-jobs, --completions, --manifest-dir
  • Add elif command == "jobs" routing in handle_workload_operations

modules/python/clients/kubernetes_client.py

  • Add _check_job_condition and _is_job_condition_met — checks completion_time + succeeded count
  • Add wait_for_job_completed with 5-min timeout and 30s polling
  • Add Job kind support to apply_manifest, update_manifest, delete_manifest
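
The completion check and polling loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in kubernetes_client.py: the function signatures, the injected `read_status` callable, and the choice to raise `TimeoutError` on timeout are hypothetical. The status fields (`succeeded`, `failed`, `active`, `completion_time`) follow the Job status object, and the 5-minute timeout and 30-second polling interval come from this PR.

```python
import time
from types import SimpleNamespace


def is_job_condition_met(status, condition_type: str) -> bool:
    """Check a Job status object against 'complete' or 'failed'."""
    # Counters are None in the API until the first pod reaches that state.
    succeeded = getattr(status, "succeeded", None) or 0
    failed = getattr(status, "failed", None) or 0
    active = getattr(status, "active", None) or 0
    if condition_type == "complete":
        return getattr(status, "completion_time", None) is not None and succeeded > 0
    if condition_type == "failed":
        # Failed pods and nothing still running: no retry is in flight.
        return failed > 0 and active == 0
    raise ValueError(f"unsupported condition: {condition_type}")


def wait_for_job_completed(read_status, timeout_seconds=300, poll_seconds=30,
                           sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll read_status() until the Job completes, fails, or times out."""
    deadline = clock() + timeout_seconds
    while True:
        status = read_status()
        if is_job_condition_met(status, "failed"):
            raise RuntimeError("job failed")  # failure raises immediately
        if is_job_condition_met(status, "complete"):
            return True
        if clock() >= deadline:
            # Assumption: timeout surfaces as an exception.
            raise TimeoutError("job did not complete in time")
        sleep(poll_seconds)


if __name__ == "__main__":
    running = SimpleNamespace(succeeded=None, failed=None, active=1, completion_time=None)
    done = SimpleNamespace(succeeded=1, failed=None, active=None,
                           completion_time="2026-01-01T00:00:00Z")
    statuses = iter([running, done])
    # Inject a no-op sleep so the demo does not actually wait 30 seconds.
    print(wait_for_job_completed(lambda: next(statuses), sleep=lambda s: None))
```

Injecting `sleep` and `clock` keeps the loop testable without real delays, which is one way to cover timeout and polling behavior in unit tests.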

steps/engine/crud/k8s/execute.yml

  • Add jobs script block calling python3 main.py jobs
  • Add number_of_jobs and completions parameters

steps/topology/k8s-crud-gpu/execute-crud.yml

  • Wire number_of_jobs and completions through to engine step

modules/python/clients/aks_client.py

  • Fix: set gpu_profile driver to "None" for non-GPU node pools

Changes since initial PR

modules/python/crud/azure/node_pool_crud.py

  • Extract _apply_job() helper from loop body — separates orchestration from execution
  • Pod labels now include workload type: nginx-container-deployment-1 vs nginx-container-job-1
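
The split between orchestration and execution might look like the sketch below. It is a simplified stand-in, not the code in node_pool_crud.py: the `apply` and `wait` callables, the `job-{index}` resource names, and the boolean return value are assumptions. The per-job label format and the continue-on-failure behavior are taken from this PR's description and tests.

```python
from typing import Callable, Dict


def job_labels(index: int) -> Dict[str, str]:
    """Per-job label that includes the workload type to avoid collisions."""
    return {"app": f"nginx-container-job-{index}"}


def apply_job(apply: Callable[[str, Dict[str, str]], None],
              wait: Callable[[str], None], index: int) -> str:
    """Execution: apply one Job and block until it completes."""
    name = f"job-{index}"  # hypothetical naming scheme
    apply(name, job_labels(index))
    # Jobs wait on the 'complete' condition; there is no pods-ready wait
    # because the pods exit once the Job finishes.
    wait(name)
    return name


def create_jobs(count: int, apply, wait) -> bool:
    """Orchestration: loop over jobs and keep going after a failure."""
    succeeded = 0
    for index in range(1, count + 1):
        try:
            apply_job(apply, wait, index)
            succeeded += 1
        except Exception as exc:
            print(f"job {index} failed: {exc}")
    # Assumption: overall success requires every job to succeed.
    return succeeded == count


if __name__ == "__main__":
    ok = create_jobs(2, lambda name, labels: print("apply", name, labels),
                     lambda name: None)
    print("all succeeded:", ok)
```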

modules/python/crud/main.py

  • Docstring updated to list supported workload types: (deployment, job)
  • Uses workload_common_parser for shared args (--count, --replicas, --manifest-dir, --label-selector), with --completions added on job_parser
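
As a rough illustration of the shared-parser layout, the sketch below wires a parent parser into two subcommands using argparse's `parents` mechanism. The names `workload_common_parser` and `job_parser` and the flag names come from this PR; the argument types, defaults, and the `build_parser` wrapper are assumptions, and the real main.py likely defines more arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Parent parser holding the shared workload flags; add_help=False avoids
    # a duplicate -h/--help when it is reused through parents=[...].
    workload_common_parser = argparse.ArgumentParser(add_help=False)
    workload_common_parser.add_argument("--count", type=int, default=1)
    workload_common_parser.add_argument("--replicas", type=int, default=1)
    workload_common_parser.add_argument("--manifest-dir", default=None)
    workload_common_parser.add_argument("--label-selector", default=None)

    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("deployment", parents=[workload_common_parser])

    # Jobs inherit the shared flags and add --completions on top.
    job_parser = subparsers.add_parser("job", parents=[workload_common_parser])
    job_parser.add_argument("--completions", type=int, default=1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["job", "--count", "2", "--completions", "3"])
    print(args.command, args.count, args.completions)
```

With this layout `--replicas` is still accepted by the job subcommand even though jobs ignore it, which is the trade-off recorded under Notes.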

steps/engine/crud/k8s/execute.yml

  • Workload steps gated to Azure-only: ${{ if eq(parameters.cloud, 'azure') }}:
  • --manifest-dir passed conditionally: ${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"}

Notes

  • --replicas is inherited from workload_common_parser but unused by jobs — jobs use --completions instead. Left as-is to keep the shared parser simple.

Tests

test_azure_node_pool_crud.py:

  • test_create_job_success
  • test_create_job_failure
  • test_create_job_no_client
  • test_create_job_partial_success

test_kubernetes_client.py:

  • test_wait_for_condition_job_success — Job completes successfully
  • test_wait_for_condition_job_timeout — fails within timeout
  • test_wait_for_condition_job_not_found — not found, returns failure

Dependencies

Based on test-refactor (PR #879), which must merge first. Independent of StatefulSet PR (#1132).

@diamondpowell diamondpowell force-pushed the dipowell/crud-jobs branch 2 times, most recently from 84f388c to 1da0a77 Compare April 21, 2026 04:08
@diamondpowell diamondpowell force-pushed the dipowell/crud-jobs branch 3 times, most recently from 787e4d6 to 5a0978d Compare May 5, 2026 19:31
diamondpowell and others added 8 commits May 14, 2026 12:09

Add create_job() to NodePoolCRUD that deploys K8s Jobs onto node pools.
Unlike deployments/statefulsets which run indefinitely, Jobs are
run-to-completion workloads — success means the pod terminated cleanly
(succeeded > 0), failure raises immediately (no self-healing).

- Add 'jobs' subcommand to handle_workload_operations() in main.py
  with --number-of-jobs and --completions args
- Add job.yml workload template with configurable completions and
  node affinity via label_selector
- Add _check_job_condition and _is_job_condition_met to
  kubernetes_client.py — checks completion_time + succeeded count
- Add wait_for_job_completed with 5-min timeout and 30s polling
- Job kind support in apply/update/delete manifest methods

Add job execution step to the k8s CRUD engine pipeline between
deployment and scale-down. Parameters (number_of_jobs, completions)
flow from pipeline matrix → topology → engine step → main.py.

- Add jobs script block to steps/engine/crud/k8s/execute.yml
- Pass number_of_jobs and completions through topology execute-crud.yml
- Jobs run after deployment, before scale-down + delete

Add comprehensive test coverage for create_job and job wait_for_condition:

- test_create_job_success: single job completes successfully
- test_create_job_failure: job fails to complete
- test_create_job_partial_success: continues on individual failures
- test_job_wait_for_condition: validates _check_job_condition and
  _is_job_condition_met for 'complete' and 'failed' states
- test_wait_for_job_completed: timeout and polling behavior
- Tests cover Job-specific semantics (succeeded count, completion_time,
  failed + no active pods)

- Extract _apply_job helper (matches _apply_deployment pattern)
- Use os.path for default template path instead of hardcoded string
- Use per-job labels to avoid selector collision
- Remove redundant outer try/except
- Use workload_common_parser for shared args (--count, --manifest-dir, etc.)
- Add hasattr guard for cloud provider compatibility
- Use args.count instead of args.number_of_jobs
- Rename subcommand from 'jobs' to 'job' (matches K8s resource type)
- Update pipeline YAML to use count parameter
- Wrap job pipeline step inside Azure cloud gate (matches deployment)
- Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)

The k8s-gpu-cluster-crud scenario already has terraform inputs.
Custom scenario dirs will be reverted before merge.
- Add gpu_node_pool: '' to prevent GPU driver install on Standard_D4s_v3
- Add replicas: 10 for deployment step (topology passes it to engine)
- Include workload type in pod labels to prevent future collision:
  - deployment: nginx-container-deployment-{index}
  - job: nginx-container-job-{index}