Add StatefulSet workload support to CRUD benchmarking framework by diamondpowell · Pull Request #1132 · Azure/telescope

diamondpowell · 2026-04-14T18:45:28Z

Summary

Adds StatefulSet workload support to the CRUD benchmarking framework — the second of three planned workload methods (deployment, statefulset, jobs). Measures K8s StatefulSet create/verify latency on AKS node pools.

Branch cleanup note: Rebased and squashed for reviewability. All commits are logically grouped. Also includes a fix for gpu_profile driver setting that caused node pool creation failure on non-GPU pools.

Changes

modules/python/crud/workload_templates/statefulset.yml

New K8s manifest template with headless Service (clusterIP: None) for stable pod DNS
Uses STATEFULSET_REPLICAS placeholder, configurable node affinity via label_selector

modules/python/crud/azure/node_pool_crud.py

Add create_statefulset() — same loop pattern as create_deployment
Uses ready condition (not available) since StatefulSets don't support the available condition type

modules/python/crud/main.py

Add statefulset subparser with --node-pool-name, --number-of-statefulsets, --replicas, --manifest-dir
Add elif command == "statefulset" routing in handle_workload_operations

modules/python/clients/kubernetes_client.py

Add _is_statefulset_ready and _check_statefulset_condition for readiness polling
Extend wait_for_condition to support StatefulSet resource type

steps/engine/crud/k8s/execute.yml

Add statefulset script block calling python3 main.py statefulset
Add number_of_statefulsets parameter

steps/topology/k8s-crud-gpu/execute-crud.yml

Wire number_of_statefulsets through to engine step

modules/python/clients/aks_client.py

Fix: set gpu_profile driver to "None" for non-GPU node pools (was incorrectly set to "Install", causing creation failures)

Changes since initial PR

modules/python/crud/azure/node_pool_crud.py

Extract _apply_statefulset() helper from loop body — separates orchestration (loop, error handling) from execution (template, apply, verify)
Pod labels now include workload type to prevent selector collision: nginx-container-deployment-1 vs nginx-container-statefulset-1. Without this, wait_for_pods_ready sees pods from both workloads (20 instead of 10) and fails on exact count check.

modules/python/crud/main.py

Docstring updated to list supported workload types: (deployment, statefulset)
Uses workload_common_parser for shared args (--count, --replicas, --manifest-dir, --label-selector)

steps/engine/crud/k8s/execute.yml

Workload steps gated to Azure-only: ${{ if eq(parameters.cloud, 'azure') }}: — AWS node_pool_crud.py doesn't have workload methods yet
--manifest-dir passed conditionally: ${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"} — avoids passing empty string when not set

Tests

test_azure_node_pool_crud.py:

test_create_statefulset_success — happy path
test_create_statefulset_failure — all fail to become ready
test_create_statefulset_no_client — returns early when k8s client unavailable
test_create_statefulset_partial_success — continues on failures, returns False

test_kubernetes_client.py:

test_wait_for_condition_statefulset_success
test_wait_for_condition_statefulset_timeout
test_wait_for_condition_statefulset_not_found

Dependencies

Based on test-refactor (PR #879) — must merge first.

Add create_statefulset() to NodePoolCRUD that deploys K8s StatefulSets onto node pools after provisioning. Follows the same pattern as create_deployment — multi-doc YAML manifest parsing, configurable replica count, and per-statefulset readiness validation via wait_for_condition. - Add 'statefulset' subcommand to handle_workload_operations() in main.py with --number-of-statefulsets and --replicas args - Add statefulset.yml workload template with configurable replicas and node affinity via label_selector - Add _is_statefulset_ready and _check_statefulset_condition to kubernetes_client.py for readiness polling

Add statefulset execution step to the k8s CRUD engine pipeline between deployment and scale-down. Parameters (number_of_statefulsets, replicas) flow from pipeline matrix → topology → engine step → main.py. - Add statefulset script block to steps/engine/crud/k8s/execute.yml - Pass number_of_statefulsets through topology execute-crud.yml

Add test coverage for create_statefulset and statefulset wait_for_condition: - test_create_statefulset_success: single statefulset with readiness check - test_create_statefulset_failure: statefulset fails to become ready - test_create_statefulset_partial_success: continues on individual failures - test_create_statefulset_no_client: returns early when k8s client unavailable - test_statefulset_wait_for_condition: validates _is_statefulset_ready and _check_statefulset_condition polling logic

- Extract _apply_statefulset helper (matches _apply_deployment pattern) - Use os.path for default template path instead of hardcoded string - Use per-statefulset labels to avoid selector collision - Remove redundant outer try/except - Use workload_common_parser for shared args (--count, --replicas, etc.) - Add hasattr guard for cloud provider compatibility - Use args.count instead of args.number_of_statefulsets - Update pipeline YAML to use count parameter

…-dir - Wrap statefulset pipeline step inside Azure cloud gate (matches deployment) - Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)

The k8s-gpu-cluster-crud scenario already has terraform inputs. Custom scenario dirs will be reverted before merge.

- Add gpu_node_pool: '' to pipeline test matrix to prevent GPU driver installation on non-GPU VM (Standard_D4s_v3) - Include workload type in pod labels to prevent collision when both deployment and statefulset run in the same pipeline: - deployment: nginx-container-deployment-{index} - statefulset: nginx-container-statefulset-{index} Previously both used nginx-container-{index}, causing wait_for_pods_ready to see 20 pods when expecting 10

Copilot

Pull request overview

Adds Kubernetes StatefulSet workload support to the CRUD benchmarking framework so Telescope can measure StatefulSet create/verify latency (AKS-focused), alongside the existing Deployment workload path.

Changes:

Introduces a parameterized StatefulSet + headless Service manifest template and a new statefulset CLI subcommand/dispatcher path.
Adds AKS NodePoolCRUD orchestration for creating/verifying multiple StatefulSets (including per-workload label isolation to avoid selector collisions).
Extends the Kubernetes client’s wait_for_condition to support StatefulSet readiness checks and adds unit tests for the new behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
steps/engine/crud/k8s/execute.yml	Adds an Azure-gated pipeline step to run StatefulSet workload operations via `main.py statefulset`.
modules/python/crud/workload_templates/statefulset.yml	New templated StatefulSet + headless Service manifest used by the CRUD workload runner.
modules/python/crud/main.py	Adds `statefulset` subcommand and routes it through the shared workload dispatcher and args.
modules/python/crud/azure/node_pool_crud.py	Implements StatefulSet apply/verify loop (`create_statefulset` + helper) and updates deployment labels to avoid cross-workload collisions.
modules/python/clients/kubernetes_client.py	Adds StatefulSet readiness polling support to `wait_for_condition`.
modules/python/tests/crud/test_azure_node_pool_crud.py	Adds unit tests covering StatefulSet creation success/failure/partial success and no-client behavior.
modules/python/tests/clients/test_kubernetes_client.py	Adds unit tests for StatefulSet readiness wait success/timeout/not-found cases.

vittoriasalim · 2026-05-17T23:17:51Z

+        """Check if a statefulset has all replicas ready."""
+        status = statefulset.status
+        spec_replicas = statefulset.spec.replicas or 0
+        return (
+            status is not None
+            and status.ready_replicas is not None
+            and status.ready_replicas == spec_replicas
+            and spec_replicas > 0


+        --result-dir "$RESULT_DIR" \
+        --node-pool-name "$POOL_NAME" \
+        --count "$COUNT" \
+        --replicas "$REPLICAS" \
+        ${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"} \


vittoriasalim · 2026-05-17T23:37:17Z

+      CLOUD: ${{ parameters.cloud }}
+      STEP_TIME_OUT: ${{ parameters.step_time_out }}
+      RESULT_DIR: $(System.DefaultWorkingDirectory)/$(RUN_ID)
+      GPU_NODE_POOL: ${{ parameters.gpu_node_pool }}


you can use the variable directly no need to state in parameters

Oh good call, I followed my existing pattern but using parameters directly is cleaner. Want me to update just this step or all of them while I'm at it?

vittoriasalim · 2026-05-17T23:37:50Z

  vm_size: ""
  create_node_count: ""
  scale_node_count: ""
  scale_step_up_count: ""
  pool_name: ""
  cloud: ""
  step_time_out: 600
  step_wait_time: 30
  gpu_node_pool: false
  count: 1
  replicas: 10
  manifest_dir: ""


can use the parameters directly

Yes, good call

xinWeiWei24 · 2026-05-18T01:24:39Z

+
+        logger.info("Successfully created and verified StatefulSet %d", statefulset_index)
+
+    def create_statefulset(


create_statefulset and _apply_statefulset share essentially the same logic as create_deployment and _apply_deployment. Consider consolidating them into a single internal helper.

Good point, I agree the consolidation makes sense. I actually already have this scoped out (extracting workload methods into a shared cloud-agnostic module so deployment/statefulset/jobs all use the same helper).

Would you prefer I do that refactor in this PR before merge, or as an immediate follow-up PR?

diamondpowell force-pushed the dipowell/crud-statefulset branch 3 times, most recently from 8a14575 to 1207d1d Compare April 21, 2026 01:58

diamondpowell requested a review from lokesh-keyan April 21, 2026 15:50

diamondpowell force-pushed the dipowell/crud-statefulset branch from 695cd4e to 61a2300 Compare May 4, 2026 20:58

diamondpowell mentioned this pull request May 4, 2026

Add Job workload support to CRUD benchmarking framework #1133

Draft

diamondpowell force-pushed the dipowell/crud-statefulset branch 3 times, most recently from 7ea865e to dea3f3f Compare May 5, 2026 19:21

diamondpowell mentioned this pull request May 7, 2026

Improve deployment workload support in CRUD benchmarking framework #879

Merged

diamondpowell and others added 7 commits May 14, 2026 11:56

test: configure pipeline for statefulset workload validation

aac5e7f

fix: gate statefulset step to Azure-only and use conditional manifest…

b8c17b6

…-dir - Wrap statefulset pipeline step inside Azure cloud gate (matches deployment) - Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)

fix: correct docstring to list supported workload types

2db2b31

diamondpowell force-pushed the dipowell/crud-statefulset branch from 77ed08a to 2db2b31 Compare May 14, 2026 18:22

diamondpowell added 6 commits May 14, 2026 14:53

fix: remove trailing whitespace in test file

8648753

test: populate pipeline test config for statefulset validation

19e5aaa

fix: use k8s-crud-gpu topology (k8s-crud doesn't exist)

6f54c8e

fix: use existing perf-eval scenario for pipeline test

4e5c53d

The k8s-gpu-cluster-crud scenario already has terraform inputs. Custom scenario dirs will be reverted before merge.

revert: restore pipeline test placeholder before merge

cd7bd00

diamondpowell marked this pull request as ready for review May 15, 2026 16:57

Copilot AI review requested due to automatic review settings May 15, 2026 16:57

diamondpowell requested review from alyssa1303, anson627, sumanthreddy29 and xinWeiWei24 as code owners May 15, 2026 16:57

diamondpowell requested review from liyu-ma, vittoriasalim and wonderyl as code owners May 15, 2026 16:57

Copilot started reviewing on behalf of diamondpowell May 15, 2026 17:01 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

vittoriasalim reviewed May 17, 2026

View reviewed changes

xinWeiWei24 reviewed May 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add StatefulSet workload support to CRUD benchmarking framework#1132

Add StatefulSet workload support to CRUD benchmarking framework#1132
diamondpowell wants to merge 13 commits into
mainfrom
dipowell/crud-statefulset

diamondpowell commented Apr 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

vittoriasalim May 17, 2026

Uh oh!

vittoriasalim May 17, 2026

Uh oh!

diamondpowell May 20, 2026 •

edited

Loading

Uh oh!

vittoriasalim May 17, 2026

Uh oh!

diamondpowell May 20, 2026

Uh oh!

xinWeiWei24 May 18, 2026

Uh oh!

diamondpowell May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		logger.info("Successfully created and verified StatefulSet %d", statefulset_index)

		def create_statefulset(

Conversation

diamondpowell commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Changes since initial PR

Tests

Dependencies

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

vittoriasalim May 17, 2026

Choose a reason for hiding this comment

Uh oh!

vittoriasalim May 17, 2026

Choose a reason for hiding this comment

Uh oh!

diamondpowell May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vittoriasalim May 17, 2026

Choose a reason for hiding this comment

Uh oh!

diamondpowell May 20, 2026

Choose a reason for hiding this comment

Uh oh!

xinWeiWei24 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

diamondpowell May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

diamondpowell commented Apr 14, 2026 •

edited

Loading

diamondpowell May 20, 2026 •

edited

Loading