Skip to content

feat(crud): add AKSMachineClient batch scale path (BatchPutMachine)#1185

Merged
karenychen merged 8 commits into
mainfrom
pr/crud-machine-batch-scale-path
May 21, 2026
Merged

feat(crud): add AKSMachineClient batch scale path (BatchPutMachine)#1185
karenychen merged 8 commits into
mainfrom
pr/crud-machine-batch-scale-path

Conversation

@karenychen
Copy link
Copy Markdown
Contributor

@karenychen karenychen commented May 19, 2026

Why

Adds the AKS Machine API batch scale path (use_batch_api=True) to AKSMachineClient, dispatching across multiple workers via the BatchPutMachine HEADER pattern. This is the final piece needed to scale machine-mode agent pools efficiently at 1000+ machine counts where N individual PUTs would exhaust per-machine RP overhead.

Stack: builds on #1179 (scaffolding) + #1181 (non-batch scale path).

What

scale_machine(use_batch_api=True) now dispatches to a new _scale_machine_batch helper instead of raising NotImplementedError. The helper:

  1. Validates scale_machine_count % machine_workers == 0 (raises ValueError otherwise — see contract below).
  2. Computes per_worker = scale_machine_count // machine_workers and slices the pre-built names list by worker_id using arithmetic indexing.
  3. Submits one PUT per worker in a ThreadPoolExecutor (max_workers=machine_workers); each PUT carries a contiguous slice of machines in its BatchPutMachine header.
  4. Isolates per-worker failures: an exception in one worker is logged and that worker's slice is excluded from successful_machines; the other workers' results still surface.

Contract: exact-multiple partitioning (ado-telescope parity)

scale_machine_count MUST be an exact multiple of machine_workers when use_batch_api=True. This mirrors ado-telescope's batch contract verbatim: every worker handles exactly the same per-worker count, so chunk boundaries are byte-identical across runs and the resulting chunk_0/chunk_1/... Kusto correlation keys line up exactly with ado dashboards. Non-exact-multiple inputs raise ValueError at the MachineCRUD boundary instead of producing a short final chunk that would skew per-chunk latency dashboards.

Background: the machineName (NOT name) landmine

The BatchPutMachine header payload's batchMachines array uses the key machineName — the wrong key (e.g. name) causes the API to silently ignore the extras and create only the URL-pointed machine. This was the root cause of an earlier staging build where succeeded=true was reported but only worker_count machines actually landed.

429 retry budget

_make_batch_request retries on HTTP 429 with bounded exponential backoff: up to _BATCH_429_MAX_RETRIES total attempts (5 by default) with sleeps of 1s, 2s, 4s, 8s between them. All other non-2xx responses raise immediately.

Metadata enrichment

scale_machine records a single op.add_metadata("command_execution_time", <wall_seconds>) covering both the individual and batch dispatch paths (uniform shape across modes for downstream dashboards). successful_machines / percentile_node_readiness_times / node_readiness_time / cluster_info semantics are unchanged from #1181.

Verification

  • pylint modules/python/clients/aks_machine_client.py modules/python/tests/test_aks_machine_client.py: 10.00/10
  • pytest modules/python/tests/test_aks_machine_client.py: 27 passed
  • Full module pytest --cov=. --cov-fail-under=80: 720 passed, 83.43% total

Stack

This is the final code-level slice. Remaining slices in the series:

# PR title Files LOC
4 feat(crud): add MachineCRUD orchestrator crud/azure/machine_crud.py + tests ~200
5 feat(crud): add create-machine + scale-machine CLI subcommands crud/main.py edits + tests ~280
6 feat(crud-machine): scenario terraform inputs tfvars + test inputs ~51
7 feat(crud-machine): pipeline templates + production pipeline YAML ~170

Copilot AI review requested due to automatic review settings May 19, 2026 19:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Enables the AKSMachineClient batch scale path (use_batch_api=True) by implementing BatchPutMachine-based chunked PUTs with bounded 429 retry/backoff, and adds unit tests for the new helpers and dispatch behavior.

Changes:

  • Implements batch scale dispatch (_scale_machine_batch) and supporting helpers (_array_split, _create_batch_machines, _make_batch_request).
  • Records per-chunk execution time metadata (batch_command_execution_times) for observability.
  • Adds/updates tests covering batch dispatch and helper behaviors (including 429 retries).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
modules/python/clients/aks_machine_client.py Implements the batch scale path, chunking logic, BatchPutMachine header construction, and bounded 429 retry/backoff.
modules/python/tests/test_aks_machine_client.py Adds tests asserting batch dispatch, chunking semantics, header shape, and retry/raise behavior for batch requests.

Comment thread modules/python/clients/aks_machine_client.py
Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py Outdated
Adds the use_batch_api=True branch of scale_machine and the four batch
helpers it dispatches to: _array_split, _scale_machine_batch,
_create_batch_machines, _make_batch_request. Replaces the NotImplementedError
stub landed in #1181.

The batch path uses the AKS Machine API's BatchPutMachine HEADER pattern (NOT
the non-existent ARM /$batch envelope): a single PUT to the first machine
name in the chunk carrying a BatchPutMachine header whose JSON value contains
vmSkus (wrapped in {value: [...]}) and batchMachines for the *additional*
machines in the chunk. The batchMachines entries use the key 'machineName'
(NOT 'name') — using the wrong key causes the API to silently ignore the
entries and create only the URL-pointed machine.

429 responses are retried with bounded exponential backoff
(_BATCH_429_MAX_RETRIES=3, 1s/2s/4s). Non-2xx non-429 responses raise
immediately. Per-chunk wall time is captured in
op.add_metadata('batch_command_execution_times', {chunk_<idx>: seconds}).

Verification:
- pylint repo-root on touched files: 10.00/10.
- pytest modules/python/tests/test_aks_machine_client.py: 31 passed (13 new tests
  for the batch helpers + dispatch).
- pytest --cov=. --cov-fail-under=80: 724 passed, 83.45% coverage.
@karenychen karenychen force-pushed the pr/crud-machine-batch-scale-path branch from c8a039c to 40d44f5 Compare May 19, 2026 21:27
Comment thread modules/python/clients/aks_machine_client.py
- Cap batch PUT timeout at _PER_REQUEST_TIMEOUT_CAP to preserve budget for
  agentpool/readiness polling (mirrors individual PUT path).
- Reduce BatchPutMachine vmSkus payload to {name, resourceType} so the
  header schema stays consistent across VM sizes instead of leaking
  hard-coded standardDv3Family / vCPU / location values.
- Record a single command_execution_time metadata on both batch and
  individual scale paths; keep per-chunk batch_command_execution_times
  for granularity.
- Tighten _scale_machine_batch and _create_batch_machines docstrings
  (BatchPutMachine pattern wording, drop redundant Kusto correlation
  prose).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py
Changes:
- _make_batch_request now calls self._session.request instead of
  module-level requests.request, so the batch path reuses the same
  pooled TCP/TLS connections (and adapters/proxies) as the individual
  make_request path. Avoids per-PUT handshakes during large scales.
- _make_batch_request gained chunk_idx and first_machine_name params;
  errors and 429 retry warnings now carry chunk={idx} target={name}
  {METHOD} {url} so failing slices are self-identifying in logs/Kusto.
- Tests updated to patch self.client._session.request instead of
  clients.aks_machine_client.requests.request.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

modules/python/clients/aks_machine_client.py:401

  • scale_machine can now raise ValueError for invalid machine_workers or when scale_machine_count is not divisible by machine_workers in the batch path. The docstring's Raises section should document these ValueError cases so callers know about the stricter contract when use_batch_api=True.
        Raises:
            RuntimeError: fewer machines landed than requested, agentpool did
                not reach Succeeded within ``timeout``, or P100 readiness did
                not complete within ``readiness_wait_timeout``.

modules/python/clients/aks_machine_client.py:705

  • The comment says the batch PUT timeout cap 'mirrors the cap applied at the individual machine PUT site', but _create_single_machine currently uses request.timeout directly (no _PER_REQUEST_TIMEOUT_CAP). Either apply the same min(request.timeout, _PER_REQUEST_TIMEOUT_CAP) cap for individual PUTs, or adjust this comment to avoid implying parity that doesn't exist.
        # Cap per-request timeout so a single batch PUT (plus 429 backoff)
        # cannot consume the whole operation budget that downstream
        # provisioning/readiness waits depend on. Mirrors the cap applied at
        # the individual machine PUT site.
        put_timeout = min(request.timeout, _PER_REQUEST_TIMEOUT_CAP)

Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/tests/test_aks_machine_client.py
Comment thread modules/python/clients/aks_machine_client.py Outdated
- scale_machine docstring: Flow bullet now covers both individual and batch
  modes; removed redundant 'When use_batch_api=True' paragraph
- test docstring: '429 then 200' -> '429 then 2xx (201)'
- Remove batch_command_execution_times entirely; _scale_machine_batch now
  returns only List[str]. Single command_execution_time covers both paths.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py
- _make_batch_request: loop is now 1 initial + _BATCH_429_MAX_RETRIES (3)
  retries so the implemented backoff (1s, 2s, 4s) matches the documented
  pattern. Test asserts call_count == 4.
- Clarify put_timeout comment to point at put_machine (the per-machine
  PUT helper which already applies _PER_REQUEST_TIMEOUT_CAP).
- _BATCH_429_MAX_RETRIES now means total attempt budget (not retries on
  top of an initial call). Set to 5 with documented backoff 1s/2s/4s/8s.
- _make_batch_request loops range(_BATCH_429_MAX_RETRIES) and asserts
  call_count == 5 in the exhausted-budget test.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread modules/python/clients/aks_machine_client.py Outdated
Comment thread modules/python/clients/aks_machine_client.py Outdated
karenychen and others added 2 commits May 20, 2026 16:43
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread modules/python/clients/aks_machine_client.py
Comment thread modules/python/clients/aks_machine_client.py
@karenychen karenychen merged commit 2779756 into main May 21, 2026
7 checks passed
@karenychen karenychen deleted the pr/crud-machine-batch-scale-path branch May 21, 2026 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants