feat(crud): add AKSMachineClient batch scale path (BatchPutMachine)#1185
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Enables the AKSMachineClient batch scale path (use_batch_api=True) by implementing BatchPutMachine-based chunked PUTs with bounded 429 retry/backoff, and adds unit tests for the new helpers and dispatch behavior.
Changes:
- Implements batch scale dispatch (
_scale_machine_batch) and supporting helpers (_array_split,_create_batch_machines,_make_batch_request). - Records per-chunk execution time metadata (
batch_command_execution_times) for observability. - Adds/updates tests covering batch dispatch and helper behaviors (including 429 retries).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| modules/python/clients/aks_machine_client.py | Implements the batch scale path, chunking logic, BatchPutMachine header construction, and bounded 429 retry/backoff. |
| modules/python/tests/test_aks_machine_client.py | Adds tests asserting batch dispatch, chunking semantics, header shape, and retry/raise behavior for batch requests. |
karenychen
commented
May 19, 2026
Adds the use_batch_api=True branch of scale_machine and the four batch helpers it dispatches to: _array_split, _scale_machine_batch, _create_batch_machines, _make_batch_request. Replaces the NotImplementedError stub landed in #1181. The batch path uses the AKS Machine API's BatchPutMachine HEADER pattern (NOT the non-existent ARM /$batch envelope): a single PUT to the first machine name in the chunk carrying a BatchPutMachine header whose JSON value contains vmSkus (wrapped in {value: [...]}) and batchMachines for the *additional* machines in the chunk. The batchMachines entries use the key 'machineName' (NOT 'name') — using the wrong key causes the API to silently ignore the entries and create only the URL-pointed machine. 429 responses are retried with bounded exponential backoff (_BATCH_429_MAX_RETRIES=3, 1s/2s/4s). Non-2xx non-429 responses raise immediately. Per-chunk wall time is captured in op.add_metadata('batch_command_execution_times', {chunk_<idx>: seconds}). Verification: - pylint repo-root on touched files: 10.00/10. - pytest modules/python/tests/test_aks_machine_client.py: 31 passed (13 new tests for the batch helpers + dispatch). - pytest --cov=. --cov-fail-under=80: 724 passed, 83.45% coverage.
c8a039c to
40d44f5
Compare
karenychen
commented
May 20, 2026
- Cap batch PUT timeout at _PER_REQUEST_TIMEOUT_CAP to preserve budget for
agentpool/readiness polling (mirrors individual PUT path).
- Reduce BatchPutMachine vmSkus payload to {name, resourceType} so the
header schema stays consistent across VM sizes instead of leaking
hard-coded standardDv3Family / vCPU / location values.
- Record a single command_execution_time metadata on both batch and
individual scale paths; keep per-chunk batch_command_execution_times
for granularity.
- Tighten _scale_machine_batch and _create_batch_machines docstrings
(BatchPutMachine pattern wording, drop redundant Kusto correlation
prose).
Changes:
- _make_batch_request now calls self._session.request instead of
module-level requests.request, so the batch path reuses the same
pooled TCP/TLS connections (and adapters/proxies) as the individual
make_request path. Avoids per-PUT handshakes during large scales.
- _make_batch_request gained chunk_idx and first_machine_name params;
errors and 429 retry warnings now carry chunk={idx} target={name}
{METHOD} {url} so failing slices are self-identifying in logs/Kusto.
- Tests updated to patch self.client._session.request instead of
clients.aks_machine_client.requests.request.
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
modules/python/clients/aks_machine_client.py:401
- scale_machine can now raise ValueError for invalid machine_workers or when scale_machine_count is not divisible by machine_workers in the batch path. The docstring's Raises section should document these ValueError cases so callers know about the stricter contract when use_batch_api=True.
Raises:
RuntimeError: fewer machines landed than requested, agentpool did
not reach Succeeded within ``timeout``, or P100 readiness did
not complete within ``readiness_wait_timeout``.
modules/python/clients/aks_machine_client.py:705
- The comment says the batch PUT timeout cap 'mirrors the cap applied at the individual machine PUT site', but _create_single_machine currently uses request.timeout directly (no _PER_REQUEST_TIMEOUT_CAP). Either apply the same min(request.timeout, _PER_REQUEST_TIMEOUT_CAP) cap for individual PUTs, or adjust this comment to avoid implying parity that doesn't exist.
# Cap per-request timeout so a single batch PUT (plus 429 backoff)
# cannot consume the whole operation budget that downstream
# provisioning/readiness waits depend on. Mirrors the cap applied at
# the individual machine PUT site.
put_timeout = min(request.timeout, _PER_REQUEST_TIMEOUT_CAP)
- scale_machine docstring: Flow bullet now covers both individual and batch modes; removed redundant 'When use_batch_api=True' paragraph - test docstring: '429 then 200' -> '429 then 2xx (201)' - Remove batch_command_execution_times entirely; _scale_machine_batch now returns only List[str]. Single command_execution_time covers both paths.
- _make_batch_request: loop is now 1 initial + _BATCH_429_MAX_RETRIES (3) retries so the implemented backoff (1s, 2s, 4s) matches the documented pattern. Test asserts call_count == 4. - Clarify put_timeout comment to point at put_machine (the per-machine PUT helper which already applies _PER_REQUEST_TIMEOUT_CAP).
- _BATCH_429_MAX_RETRIES now means total attempt budget (not retries on top of an initial call). Set to 5 with documented backoff 1s/2s/4s/8s. - _make_batch_request loops range(_BATCH_429_MAX_RETRIES) and asserts call_count == 5 in the exhausted-budget test.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
vittoriasalim
approved these changes
May 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Adds the AKS Machine API batch scale path (
use_batch_api=True) toAKSMachineClient, dispatching across multiple workers via theBatchPutMachineHEADER pattern. This is the final piece needed to scale machine-mode agent pools efficiently at 1000+ machine counts where N individual PUTs would exhaust per-machine RP overhead.Stack: builds on #1179 (scaffolding) + #1181 (non-batch scale path).
What
scale_machine(use_batch_api=True)now dispatches to a new_scale_machine_batchhelper instead of raisingNotImplementedError. The helper:scale_machine_count % machine_workers == 0(raisesValueErrorotherwise — see contract below).per_worker = scale_machine_count // machine_workersand slices the pre-builtnameslist byworker_idusing arithmetic indexing.ThreadPoolExecutor(max_workers=machine_workers); each PUT carries a contiguous slice of machines in itsBatchPutMachineheader.successful_machines; the other workers' results still surface.Contract: exact-multiple partitioning (ado-telescope parity)
scale_machine_countMUST be an exact multiple ofmachine_workerswhenuse_batch_api=True. This mirrors ado-telescope's batch contract verbatim: every worker handles exactly the same per-worker count, so chunk boundaries are byte-identical across runs and the resultingchunk_0/chunk_1/...Kusto correlation keys line up exactly with ado dashboards. Non-exact-multiple inputs raiseValueErrorat theMachineCRUDboundary instead of producing a short final chunk that would skew per-chunk latency dashboards.Background: the
machineName(NOTname) landmineThe
BatchPutMachineheader payload'sbatchMachinesarray uses the keymachineName— the wrong key (e.g.name) causes the API to silently ignore the extras and create only the URL-pointed machine. This was the root cause of an earlier staging build wheresucceeded=truewas reported but onlyworker_countmachines actually landed.429 retry budget
_make_batch_requestretries on HTTP 429 with bounded exponential backoff: up to_BATCH_429_MAX_RETRIEStotal attempts (5 by default) with sleeps of 1s, 2s, 4s, 8s between them. All other non-2xx responses raise immediately.Metadata enrichment
scale_machinerecords a singleop.add_metadata("command_execution_time", <wall_seconds>)covering both the individual and batch dispatch paths (uniform shape across modes for downstream dashboards).successful_machines/percentile_node_readiness_times/node_readiness_time/cluster_infosemantics are unchanged from #1181.Verification
pylint modules/python/clients/aks_machine_client.py modules/python/tests/test_aks_machine_client.py: 10.00/10pytest modules/python/tests/test_aks_machine_client.py: 27 passedpytest --cov=. --cov-fail-under=80: 720 passed, 83.43% totalStack
This is the final code-level slice. Remaining slices in the series:
feat(crud): add MachineCRUD orchestratorcrud/azure/machine_crud.py+ testsfeat(crud): add create-machine + scale-machine CLI subcommandscrud/main.pyedits + testsfeat(crud-machine): scenario terraform inputsfeat(crud-machine): pipeline templates + production pipeline