[AWS/EKS] Tune VPC CNI warm pool for kubernetes_node_scale in EksKarpenterCluster #6557
kiryl-filatau wants to merge 4 commits into GoogleCloudPlatform:master from
Conversation
I'm wondering what the heck this even does. I found: which seems like a reasonable explanation & discusses IP addresses. Could you provide a link in your description?
Second question:
```python
def _PostCreate(self):
  """Performs post-creation steps for the cluster."""
  super()._PostCreate()
  if 'kubernetes_node_scale' in FLAGS.benchmarks:
```
It is a bit more complex, but please add a flag in providers/aws/flags & reference it from here. For a few reasons:
- We want resources to be unaware of benchmarks. Rather than the resource knowing about a benchmark, the benchmark (or just the user calling the benchmark & setting flag values) should tell the resource (via a flag) what it wants it to do.
- We might want to enable this in other benchmarks — e.g. scaling to 1k or 5k pods would probably also benefit from this, right?
Done, added `--eks_tune_vpc_cni_for_scale` to `providers/aws/flags.py` and replaced the benchmark check with it. Thanks for the suggestion!
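For context, PKB defines provider flags with `absl.flags`; a minimal sketch of what the new flag definition might look like (the help text here is illustrative, not the actual wording in the PR):

```python
from absl import flags

# Sketch only: the flag name comes from the PR; the default and help
# text are assumptions for illustration.
flags.DEFINE_boolean(
    'eks_tune_vpc_cni_for_scale', False,
    'If True, tune the aws-node VPC CNI warm pool (WARM_ENI_TARGET=0, '
    'WARM_IP_TARGET=1, MINIMUM_IP_TARGET=1) after EKS cluster creation.')
```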
NOTE: should be merged only after PR#6512 is merged.
What

In `EksKarpenterCluster._PostCreate`, when the benchmark is `kubernetes_node_scale`, tune the VPC CNI warm-pool settings on the `aws-node` DaemonSet in `kube-system` and wait for the rollout to complete before the benchmark run starts.
Settings applied:
- `WARM_ENI_TARGET=0`
- `WARM_IP_TARGET=1`
- `MINIMUM_IP_TARGET=1`
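Outside PKB, the same tuning can be reproduced with plain kubectl. A small Python wrapper around the two steps (set the env vars on the DaemonSet, then wait for the rollout) might look like this; the 10-minute timeout is an arbitrary choice for illustration, not a value from the PR:

```python
import subprocess

# Warm-pool settings from the PR description.
WARM_POOL_ENV = {
    'WARM_ENI_TARGET': '0',
    'WARM_IP_TARGET': '1',
    'MINIMUM_IP_TARGET': '1',
}


def tune_vpc_cni_warm_pool(kubectl: str = 'kubectl') -> None:
    """Apply warm-pool env vars to aws-node and wait for the rollout."""
    env_args = [f'{k}={v}' for k, v in WARM_POOL_ENV.items()]
    # `kubectl set env` triggers a rolling restart of the DaemonSet.
    subprocess.run(
        [kubectl, '-n', 'kube-system', 'set', 'env',
         'daemonset/aws-node', *env_args],
        check=True)
    # Block until every aws-node pod is running with the new config.
    subprocess.run(
        [kubectl, '-n', 'kube-system', 'rollout', 'status',
         'daemonset/aws-node', '--timeout=10m'],
        check=True)
```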
Why
By default, the AWS VPC CNI pre-allocates a warm pool of ENIs and
secondary IPs on each node as soon as it joins the cluster. The number
of IPs reserved scales with the instance type — larger instances have
more ENI slots and more IPs per ENI, so each node can reserve 10–30+
IPs before a single pod is scheduled.
At 5k-node scale this becomes a hard blocker: the cumulative IP pre-allocation across all nodes exhausts the subnet address space before all nodes can be scheduled, causing the benchmark to fail with `InsufficientCapacityError` and `FailedScheduling` events.

Setting `WARM_ENI_TARGET=0`, `WARM_IP_TARGET=1`, `MINIMUM_IP_TARGET=1` instructs the CNI to keep only 1 IP warm per node instead of a full ENI's worth, which is sufficient for the `kubernetes_node_scale` workload (one pod per node). This is not a performance optimization; it is a prerequisite for the benchmark to complete successfully at this scale.
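The exhaustion argument above can be made concrete with back-of-envelope arithmetic. The per-node warm IP counts below are illustrative assumptions (the real number depends on the instance type's ENI slots and IPs per ENI), and a /16 subnet is assumed:

```python
# Back-of-envelope check on subnet exhaustion at 5k nodes.
NODES = 5_000
DEFAULT_WARM_IPS_PER_NODE = 30  # roughly one full ENI on a large instance
TUNED_WARM_IPS_PER_NODE = 1     # WARM_IP_TARGET=1 / MINIMUM_IP_TARGET=1

# Usable addresses in a /16 subnet (AWS reserves 5 addresses per subnet).
USABLE_IPS = 2 ** 16 - 5

default_demand = NODES * DEFAULT_WARM_IPS_PER_NODE  # 150,000 IPs
tuned_demand = NODES * TUNED_WARM_IPS_PER_NODE      # 5,000 IPs

print(default_demand > USABLE_IPS)  # True: defaults overrun even a /16
print(tuned_demand < USABLE_IPS)    # True: tuned demand fits easily
```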
See `WARM_IP_TARGET` and `MINIMUM_IP_TARGET` for reference, and this overview for a practical explanation.

Scope
The tuning block is guarded by `'kubernetes_node_scale' in FLAGS.benchmarks`, so it is a no-op for all other benchmarks. No existing behaviour is changed outside that gate.
Testing
Validated with two back-to-back 5k-node runs on EKS + Karpenter in `us-east-1`. Both runs completed with status `SUCCEEDED`.
Usage
```shell
python pkb.py \
  --benchmarks=kubernetes_node_scale \
  --eks_tune_vpc_cni_for_scale=True \
  ....
```