Add StatefulSet workload support to CRUD benchmarking framework#1132
Add StatefulSet workload support to CRUD benchmarking framework#1132diamondpowell wants to merge 13 commits into
Conversation
8a14575 to
1207d1d
Compare
695cd4e to
61a2300
Compare
7ea865e to
dea3f3f
Compare
Add create_statefulset() to NodePoolCRUD that deploys K8s StatefulSets onto node pools after provisioning. Follows the same pattern as create_deployment — multi-doc YAML manifest parsing, configurable replica count, and per-statefulset readiness validation via wait_for_condition. - Add 'statefulset' subcommand to handle_workload_operations() in main.py with --number-of-statefulsets and --replicas args - Add statefulset.yml workload template with configurable replicas and node affinity via label_selector - Add _is_statefulset_ready and _check_statefulset_condition to kubernetes_client.py for readiness polling
Add statefulset execution step to the k8s CRUD engine pipeline between deployment and scale-down. Parameters (number_of_statefulsets, replicas) flow from pipeline matrix → topology → engine step → main.py. - Add statefulset script block to steps/engine/crud/k8s/execute.yml - Pass number_of_statefulsets through topology execute-crud.yml
Add test coverage for create_statefulset and statefulset wait_for_condition: - test_create_statefulset_success: single statefulset with readiness check - test_create_statefulset_failure: statefulset fails to become ready - test_create_statefulset_partial_success: continues on individual failures - test_create_statefulset_no_client: returns early when k8s client unavailable - test_statefulset_wait_for_condition: validates _is_statefulset_ready and _check_statefulset_condition polling logic
- Extract _apply_statefulset helper (matches _apply_deployment pattern) - Use os.path for default template path instead of hardcoded string - Use per-statefulset labels to avoid selector collision - Remove redundant outer try/except - Use workload_common_parser for shared args (--count, --replicas, etc.) - Add hasattr guard for cloud provider compatibility - Use args.count instead of args.number_of_statefulsets - Update pipeline YAML to use count parameter
…-dir
- Wrap statefulset pipeline step inside Azure cloud gate (matches deployment)
- Use ${MANIFEST_DIR:+--manifest-dir} conditional (matches deployment pattern)
77ed08a to
2db2b31
Compare
The k8s-gpu-cluster-crud scenario already has terraform inputs. Custom scenario dirs will be reverted before merge.
- Add gpu_node_pool: '' to pipeline test matrix to prevent GPU driver
installation on non-GPU VM (Standard_D4s_v3)
- Include workload type in pod labels to prevent collision when both
deployment and statefulset run in the same pipeline:
- deployment: nginx-container-deployment-{index}
- statefulset: nginx-container-statefulset-{index}
Previously both used nginx-container-{index}, causing
wait_for_pods_ready to see 20 pods when expecting 10
There was a problem hiding this comment.
Pull request overview
Adds Kubernetes StatefulSet workload support to the CRUD benchmarking framework so Telescope can measure StatefulSet create/verify latency (AKS-focused), alongside the existing Deployment workload path.
Changes:
- Introduces a parameterized StatefulSet + headless Service manifest template and a new
statefulsetCLI subcommand/dispatcher path. - Adds AKS NodePoolCRUD orchestration for creating/verifying multiple StatefulSets (including per-workload label isolation to avoid selector collisions).
- Extends the Kubernetes client’s
wait_for_conditionto support StatefulSet readiness checks and adds unit tests for the new behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| steps/engine/crud/k8s/execute.yml | Adds an Azure-gated pipeline step to run StatefulSet workload operations via main.py statefulset. |
| modules/python/crud/workload_templates/statefulset.yml | New templated StatefulSet + headless Service manifest used by the CRUD workload runner. |
| modules/python/crud/main.py | Adds statefulset subcommand and routes it through the shared workload dispatcher and args. |
| modules/python/crud/azure/node_pool_crud.py | Implements StatefulSet apply/verify loop (create_statefulset + helper) and updates deployment labels to avoid cross-workload collisions. |
| modules/python/clients/kubernetes_client.py | Adds StatefulSet readiness polling support to wait_for_condition. |
| modules/python/tests/crud/test_azure_node_pool_crud.py | Adds unit tests covering StatefulSet creation success/failure/partial success and no-client behavior. |
| modules/python/tests/clients/test_kubernetes_client.py | Adds unit tests for StatefulSet readiness wait success/timeout/not-found cases. |
| """Check if a statefulset has all replicas ready.""" | ||
| status = statefulset.status | ||
| spec_replicas = statefulset.spec.replicas or 0 | ||
| return ( | ||
| status is not None | ||
| and status.ready_replicas is not None | ||
| and status.ready_replicas == spec_replicas | ||
| and spec_replicas > 0 |
| --result-dir "$RESULT_DIR" \ | ||
| --node-pool-name "$POOL_NAME" \ | ||
| --count "$COUNT" \ | ||
| --replicas "$REPLICAS" \ | ||
| ${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"} \ |
| CLOUD: ${{ parameters.cloud }} | ||
| STEP_TIME_OUT: ${{ parameters.step_time_out }} | ||
| RESULT_DIR: $(System.DefaultWorkingDirectory)/$(RUN_ID) | ||
| GPU_NODE_POOL: ${{ parameters.gpu_node_pool }} |
There was a problem hiding this comment.
you can use the variable directly no need to state in parameters
There was a problem hiding this comment.
Oh good call, I followed my existing pattern but using parameters directly is cleaner. Want me to update just this step or all of them while I'm at it?
| vm_size: "" | ||
| create_node_count: "" | ||
| scale_node_count: "" | ||
| scale_step_up_count: "" | ||
| pool_name: "" | ||
| cloud: "" | ||
| step_time_out: 600 | ||
| step_wait_time: 30 | ||
| gpu_node_pool: false | ||
| count: 1 | ||
| replicas: 10 | ||
| manifest_dir: "" |
There was a problem hiding this comment.
can use the parameters directly
There was a problem hiding this comment.
Yes, good call
|
|
||
| logger.info("Successfully created and verified StatefulSet %d", statefulset_index) | ||
|
|
||
| def create_statefulset( |
There was a problem hiding this comment.
create_statefulset and _apply_statefulset share essentially the same logic as create_deployment and _apply_deployment. Consider consolidating them into a single internal helper.
There was a problem hiding this comment.
Good point, I agree the consolidation makes sense. I actually already have this scoped out (extracting workload methods into a shared cloud-agnostic module so deployment/statefulset/jobs all use the same helper).
Would you prefer I do that refactor in this PR before merge, or as an immediate follow-up PR?
Summary
Adds StatefulSet workload support to the CRUD benchmarking framework — the second of three planned workload methods (
deployment,statefulset,jobs). Measures K8s StatefulSet create/verify latency on AKS node pools.Branch cleanup note: Rebased and squashed for reviewability. All commits are logically grouped. Also includes a fix for
gpu_profiledriver setting that caused node pool creation failure on non-GPU pools.Changes
modules/python/crud/workload_templates/statefulset.ymlclusterIP: None) for stable pod DNSSTATEFULSET_REPLICASplaceholder, configurable node affinity via label_selectormodules/python/crud/azure/node_pool_crud.pycreate_statefulset()— same loop pattern ascreate_deploymentreadycondition (notavailable) since StatefulSets don't support theavailablecondition typemodules/python/crud/main.pystatefulsetsubparser with--node-pool-name,--number-of-statefulsets,--replicas,--manifest-direlif command == "statefulset"routing inhandle_workload_operationsmodules/python/clients/kubernetes_client.py_is_statefulset_readyand_check_statefulset_conditionfor readiness pollingwait_for_conditionto support StatefulSet resource typesteps/engine/crud/k8s/execute.ymlstatefulsetscript block callingpython3 main.py statefulsetnumber_of_statefulsetsparametersteps/topology/k8s-crud-gpu/execute-crud.ymlnumber_of_statefulsetsthrough to engine stepmodules/python/clients/aks_client.pygpu_profiledriver to"None"for non-GPU node pools (was incorrectly set to"Install", causing creation failures)Changes since initial PR
modules/python/crud/azure/node_pool_crud.py_apply_statefulset()helper from loop body — separates orchestration (loop, error handling) from execution (template, apply, verify)nginx-container-deployment-1vsnginx-container-statefulset-1. Without this,wait_for_pods_readysees pods from both workloads (20 instead of 10) and fails on exact count check.modules/python/crud/main.py(deployment, statefulset)workload_common_parserfor shared args (--count,--replicas,--manifest-dir,--label-selector)steps/engine/crud/k8s/execute.yml${{ if eq(parameters.cloud, 'azure') }}:— AWSnode_pool_crud.pydoesn't have workload methods yet--manifest-dirpassed conditionally:${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"}— avoids passing empty string when not setTests
test_azure_node_pool_crud.py:test_create_statefulset_success— happy pathtest_create_statefulset_failure— all fail to become readytest_create_statefulset_no_client— returns early when k8s client unavailabletest_create_statefulset_partial_success— continues on failures, returns Falsetest_kubernetes_client.py:test_wait_for_condition_statefulset_successtest_wait_for_condition_statefulset_timeouttest_wait_for_condition_statefulset_not_foundDependencies
Based on
test-refactor(PR #879) — must merge first.