feat(crud): add AKSMachineClient batch scale path (BatchPutMachine) by karenychen · Pull Request #1185 · Azure/telescope

karenychen · 2026-05-19T19:19:13Z

Why

Adds the AKS Machine API batch scale path (use_batch_api=True) to AKSMachineClient, dispatching across multiple workers via the BatchPutMachine HEADER pattern. This is the final piece needed to scale machine-mode agent pools efficiently at 1000+ machine counts where N individual PUTs would exhaust per-machine RP overhead.

Stack: builds on #1179 (scaffolding) + #1181 (non-batch scale path).

What

scale_machine(use_batch_api=True) now dispatches to a new _scale_machine_batch helper instead of raising NotImplementedError. The helper:

Validates scale_machine_count % machine_workers == 0 (raises ValueError otherwise — see contract below).
Computes per_worker = scale_machine_count // machine_workers and slices the pre-built names list by worker_id using arithmetic indexing.
Submits one PUT per worker in a ThreadPoolExecutor (max_workers=machine_workers); each PUT carries a contiguous slice of machines in its BatchPutMachine header.
Isolates per-worker failures: an exception in one worker is logged and that worker's slice is excluded from successful_machines; the other workers' results still surface.

Contract: exact-multiple partitioning (ado-telescope parity)

scale_machine_count MUST be an exact multiple of machine_workers when use_batch_api=True. This mirrors ado-telescope's batch contract verbatim: every worker handles exactly the same per-worker count, so chunk boundaries are byte-identical across runs and the resulting chunk_0/chunk_1/... Kusto correlation keys line up exactly with ado dashboards. Non-exact-multiple inputs raise ValueError at the MachineCRUD boundary instead of producing a short final chunk that would skew per-chunk latency dashboards.

Background: the `machineName` (NOT `name`) landmine

The BatchPutMachine header payload's batchMachines array uses the key machineName — the wrong key (e.g. name) causes the API to silently ignore the extras and create only the URL-pointed machine. This was the root cause of an earlier staging build where succeeded=true was reported but only worker_count machines actually landed.

429 retry budget

_make_batch_request retries on HTTP 429 with bounded exponential backoff: up to _BATCH_429_MAX_RETRIES total attempts (5 by default) with sleeps of 1s, 2s, 4s, 8s between them. All other non-2xx responses raise immediately.

Metadata enrichment

scale_machine records a single op.add_metadata("command_execution_time", <wall_seconds>) covering both the individual and batch dispatch paths (uniform shape across modes for downstream dashboards). successful_machines / percentile_node_readiness_times / node_readiness_time / cluster_info semantics are unchanged from #1181.

Verification

pylint modules/python/clients/aks_machine_client.py modules/python/tests/test_aks_machine_client.py: 10.00/10
pytest modules/python/tests/test_aks_machine_client.py: 27 passed
Full module pytest --cov=. --cov-fail-under=80: 720 passed, 83.43% total

Stack

This is the final code-level slice. Remaining slices in the series:

#	PR title	Files	LOC
4	`feat(crud): add MachineCRUD orchestrator`	`crud/azure/machine_crud.py` + tests	~200
5	`feat(crud): add create-machine + scale-machine CLI subcommands`	`crud/main.py` edits + tests	~280
6	`feat(crud-machine): scenario terraform inputs`	tfvars + test inputs	~51
7	`feat(crud-machine): pipeline templates + production pipeline`	YAML	~170

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Enables the AKSMachineClient batch scale path (use_batch_api=True) by implementing BatchPutMachine-based chunked PUTs with bounded 429 retry/backoff, and adds unit tests for the new helpers and dispatch behavior.

Changes:

Implements batch scale dispatch (_scale_machine_batch) and supporting helpers (_array_split, _create_batch_machines, _make_batch_request).
Records per-chunk execution time metadata (batch_command_execution_times) for observability.
Adds/updates tests covering batch dispatch and helper behaviors (including 429 retries).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
modules/python/clients/aks_machine_client.py	Implements the batch scale path, chunking logic, BatchPutMachine header construction, and bounded 429 retry/backoff.
modules/python/tests/test_aks_machine_client.py	Adds tests asserting batch dispatch, chunking semantics, header shape, and retry/raise behavior for batch requests.

Adds the use_batch_api=True branch of scale_machine and the four batch helpers it dispatches to: _array_split, _scale_machine_batch, _create_batch_machines, _make_batch_request. Replaces the NotImplementedError stub landed in #1181. The batch path uses the AKS Machine API's BatchPutMachine HEADER pattern (NOT the non-existent ARM /$batch envelope): a single PUT to the first machine name in the chunk carrying a BatchPutMachine header whose JSON value contains vmSkus (wrapped in {value: [...]}) and batchMachines for the *additional* machines in the chunk. The batchMachines entries use the key 'machineName' (NOT 'name') — using the wrong key causes the API to silently ignore the entries and create only the URL-pointed machine. 429 responses are retried with bounded exponential backoff (_BATCH_429_MAX_RETRIES=3, 1s/2s/4s). Non-2xx non-429 responses raise immediately. Per-chunk wall time is captured in op.add_metadata('batch_command_execution_times', {chunk_<idx>: seconds}). Verification: - pylint repo-root on touched files: 10.00/10. - pytest modules/python/tests/test_aks_machine_client.py: 31 passed (13 new tests for the batch helpers + dispatch). - pytest --cov=. --cov-fail-under=80: 724 passed, 83.45% coverage.

- Cap batch PUT timeout at _PER_REQUEST_TIMEOUT_CAP to preserve budget for agentpool/readiness polling (mirrors individual PUT path). - Reduce BatchPutMachine vmSkus payload to {name, resourceType} so the header schema stays consistent across VM sizes instead of leaking hard-coded standardDv3Family / vCPU / location values. - Record a single command_execution_time metadata on both batch and individual scale paths; keep per-chunk batch_command_execution_times for granularity. - Tighten _scale_machine_batch and _create_batch_machines docstrings (BatchPutMachine pattern wording, drop redundant Kusto correlation prose).

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Changes: - _make_batch_request now calls self._session.request instead of module-level requests.request, so the batch path reuses the same pooled TCP/TLS connections (and adapters/proxies) as the individual make_request path. Avoids per-PUT handshakes during large scales. - _make_batch_request gained chunk_idx and first_machine_name params; errors and 429 retry warnings now carry chunk={idx} target={name} {METHOD} {url} so failing slices are self-identifying in logs/Kusto. - Tests updated to patch self.client._session.request instead of clients.aks_machine_client.requests.request.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

modules/python/clients/aks_machine_client.py:401

scale_machine can now raise ValueError for invalid machine_workers or when scale_machine_count is not divisible by machine_workers in the batch path. The docstring's Raises section should document these ValueError cases so callers know about the stricter contract when use_batch_api=True.

        Raises:
            RuntimeError: fewer machines landed than requested, agentpool did
                not reach Succeeded within ``timeout``, or P100 readiness did
                not complete within ``readiness_wait_timeout``.

modules/python/clients/aks_machine_client.py:705

The comment says the batch PUT timeout cap 'mirrors the cap applied at the individual machine PUT site', but _create_single_machine currently uses request.timeout directly (no _PER_REQUEST_TIMEOUT_CAP). Either apply the same min(request.timeout, _PER_REQUEST_TIMEOUT_CAP) cap for individual PUTs, or adjust this comment to avoid implying parity that doesn't exist.

        # Cap per-request timeout so a single batch PUT (plus 429 backoff)
        # cannot consume the whole operation budget that downstream
        # provisioning/readiness waits depend on. Mirrors the cap applied at
        # the individual machine PUT site.
        put_timeout = min(request.timeout, _PER_REQUEST_TIMEOUT_CAP)

- scale_machine docstring: Flow bullet now covers both individual and batch modes; removed redundant 'When use_batch_api=True' paragraph - test docstring: '429 then 200' -> '429 then 2xx (201)' - Remove batch_command_execution_times entirely; _scale_machine_batch now returns only List[str]. Single command_execution_time covers both paths.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

- _make_batch_request: loop is now 1 initial + _BATCH_429_MAX_RETRIES (3) retries so the implemented backoff (1s, 2s, 4s) matches the documented pattern. Test asserts call_count == 4. - Clarify put_timeout comment to point at put_machine (the per-machine PUT helper which already applies _PER_REQUEST_TIMEOUT_CAP).

- _BATCH_429_MAX_RETRIES now means total attempt budget (not retries on top of an initial call). Set to 5 with documented backoff 1s/2s/4s/8s. - _make_batch_request loops range(_BATCH_429_MAX_RETRIES) and asserts call_count == 5 in the exhausted-budget test.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings May 19, 2026 19:19

karenychen requested review from alyssa1303, anson627, liyu-ma, sumanthreddy29, vittoriasalim, wonderyl and xinWeiWei24 as code owners May 19, 2026 19:19

Copilot AI reviewed May 19, 2026

View reviewed changes

Copilot started reviewing on behalf of karenychen May 19, 2026 19:34 View session

karenychen commented May 19, 2026

View reviewed changes

Comment thread modules/python/clients/aks_machine_client.py Outdated

karenychen force-pushed the pr/crud-machine-batch-scale-path branch from c8a039c to 40d44f5 Compare May 19, 2026 21:27

karenychen commented May 20, 2026

View reviewed changes

Comment thread modules/python/clients/aks_machine_client.py

karenychen requested a review from Copilot May 20, 2026 21:56

Copilot started reviewing on behalf of karenychen May 20, 2026 21:58 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread modules/python/clients/aks_machine_client.py Outdated

Comment thread modules/python/clients/aks_machine_client.py

karenychen requested a review from Copilot May 20, 2026 22:46

Copilot started reviewing on behalf of karenychen May 20, 2026 22:48 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread modules/python/clients/aks_machine_client.py Outdated

Comment thread modules/python/tests/test_aks_machine_client.py

Comment thread modules/python/clients/aks_machine_client.py Outdated

karenychen requested a review from Copilot May 20, 2026 23:16

Copilot started reviewing on behalf of karenychen May 20, 2026 23:17 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread modules/python/clients/aks_machine_client.py Outdated

Comment thread modules/python/clients/aks_machine_client.py Outdated

Comment thread modules/python/clients/aks_machine_client.py

karenychen added 2 commits May 20, 2026 16:27

karenychen requested a review from Copilot May 20, 2026 23:32

Copilot started reviewing on behalf of karenychen May 20, 2026 23:34 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread modules/python/clients/aks_machine_client.py Outdated

Comment thread modules/python/clients/aks_machine_client.py Outdated

karenychen and others added 2 commits May 20, 2026 16:43

Potential fix for pull request finding

2f5a5c7

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

f8af723

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

karenychen requested a review from Copilot May 20, 2026 23:44

Copilot started reviewing on behalf of karenychen May 20, 2026 23:46 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread modules/python/clients/aks_machine_client.py

Comment thread modules/python/clients/aks_machine_client.py

vittoriasalim approved these changes May 21, 2026

View reviewed changes

karenychen merged commit 2779756 into main May 21, 2026
7 checks passed

karenychen deleted the pr/crud-machine-batch-scale-path branch May 21, 2026 04:09

Conversation

karenychen commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Contract: exact-multiple partitioning (ado-telescope parity)

Background: the machineName (NOT name) landmine

429 retry budget

Metadata enrichment

Verification

Stack

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

karenychen commented May 19, 2026 •

edited

Loading

Background: the `machineName` (NOT `name`) landmine