ClusterMesh scale: Phase 3 + 4 - scale tiers + parallel CL2 fan-out + add all scenarios #1168

Draft
skosuri1 wants to merge 115 commits into skosuri/clustermesh-scale from skosuri/clustermesh-scale-2

skosuri1 commented May 6, 2026

Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work — moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.

Phase 3 Deliverables

  • 20-node baseline cluster size (spec line 24). Current clusters are 3 nodes (2 default + 1 prompool) — sized for harness validation, not real scale measurement.
  • Cluster-count tiers: add azure-5.tfvars, azure-10.tfvars, azure-20.tfvars and corresponding pipeline matrix entries. For each tier: validate quota, validate peering count (N·(N-1) directional peerings in separate-VNet mode, i.e. 380 at N=20), tune CL2 timeouts, and document breaking points.
  • Parallel CL2 fan-out: replace sequential per-cluster CL2 with bounded concurrency (default 4). Requires async wrapping of utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming the AzDO agent has CPU/RAM headroom for N concurrent CL2 + Prometheus.
  • etcd PodMonitor capacity check at 20 clusters: 28 watchers per cluster × 20 = 560 watchers; verify that the Prometheus scrape budget holds.
  • Scaling-curve dashboards from cluster-attributed results (Kusto).
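
The parallel fan-out item above could be approached roughly as follows. This is a minimal sketch, not the code in this PR: `fan_out`, `FakeCl2`, and the `mesh-N` cluster names are hypothetical, and `FakeCl2` stands in for the synchronous `utils.run_cl2_command`, whose real signature is not shown here. It assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import threading
import time
from typing import Callable, Dict, Iterable, Union


async def fan_out(
    clusters: Iterable[str],
    run_one: Callable[[str], int],
    max_concurrency: int = 4,
) -> Dict[str, Union[int, Exception]]:
    """Run a blocking per-cluster callable with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(cluster: str):
        async with semaphore:
            try:
                # to_thread (Python 3.9+) runs the blocking call in a worker
                # thread so the event loop can keep scheduling other clusters.
                return cluster, await asyncio.to_thread(run_one, cluster)
            except Exception as exc:
                # Record the failure instead of cancelling the other clusters.
                return cluster, exc

    pairs = await asyncio.gather(*(worker(c) for c in clusters))
    return dict(pairs)


class FakeCl2:
    """Hypothetical stand-in for a synchronous CL2 run; tracks peak concurrency."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, cluster: str) -> int:
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.05)  # pretend to do work
        with self._lock:
            self._active -= 1
        if cluster == "mesh-3":
            raise RuntimeError("simulated CL2 failure")
        return 0  # simulated exit code


if __name__ == "__main__":
    fake = FakeCl2()
    names = [f"mesh-{i}" for i in range(1, 11)]
    results = asyncio.run(fan_out(names, fake))
    failed = [n for n, r in results.items() if isinstance(r, Exception)]
    print(f"peak concurrency: {fake.peak}, failed: {failed}")
```

The semaphore caps in-flight runs at the default of 4 regardless of how many clusters are queued, and a failure in one cluster is returned as a value so the remaining runs still complete. Whether threads or subprocesses are the better fit depends on how the real CL2 wrapper blocks, which this sketch does not address.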

Out of Scope (deferred to later phases / pre-merge of #1157)

skosuri and others added 30 commits May 6, 2026 13:59
…idn't fix root cause); fix n5 condition syntax
… referenced it but variables.tf didn't declare)
skosuri added 30 commits May 18, 2026 10:50
…Dv3 SKU + 24h deletion_delay + lower Fleet-bug debug gate to N>=3
…e); bump fleet destroy budget 10min->30min for N=100
…erings, service-cidr override) + dev-pipeline stage running pod-churn-combined
…olate from peered Kusto rows without schema change
…pod subnets (Azure AKS now requires explicit delegation in eastus2euap, build 67743)
….0/8, 200 subnets, 100 AKS at 10xD4_v3) + condition:false dev-pipeline stage
…gation on all 100 pod subnets (forgot in initial gen; matches commit 0c0677e for peered tfvars)
…n_apply_failure (skip if profile already exists)
…_SECONDS) so a stuck CL2 doesn't block all 100 workers; +2 tests
…red-VNet 100 concurrent creates hit Azure per-VNet subnet PUT serialization; build 67774 evidence)
…eededState; aks_wait_succeeded fail-fast on terminal Failed (build 67775 evidence: 17% fail rate at parallelism=8)
…8 evidence: VirtualNetworkNotInSucceededState leaves cluster half-created; AlreadyExists on retry blocks recovery)
… 'already exists' match (build 67798: 99/100 clusters succeeded, only mesh-72 blocked by terraform-retry hitting AlreadyExists with CamelCase-only regex)
…L2 plumbing + n2 4-cell smoke stage (condition:false)
… stages + single-scenario soft-fail in execute.yml
…ing + test_type_suffix mechanism for Kusto cell discrimination in share-infra mode