
[AWS/EKS] Tune VPC CNI warm pool for kubernetes_node_scale in EksKarpenterCluster #6557

Open

kiryl-filatau wants to merge 4 commits into GoogleCloudPlatform:master from kiryl-filatau:aws-5k-fix

Conversation


@kiryl-filatau (Collaborator) commented Mar 25, 2026

NOTE: should be merged only after PR#6512 is merged.

What

In EksKarpenterCluster._PostCreate, when the benchmark is
kubernetes_node_scale, tune the VPC CNI warm-pool settings on the
aws-node DaemonSet in kube-system and wait for the rollout to
complete before the benchmark run starts.
Settings applied:

  • WARM_ENI_TARGET=0
  • WARM_IP_TARGET=1
  • MINIMUM_IP_TARGET=1
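Applied by hand, the step is roughly equivalent to the following sketch. This is not the PR's actual code (PKB drives kubectl through its own wrappers); the helper names here are hypothetical, but the kubectl invocations match what the description says happens: set the three env vars on the aws-node DaemonSet in kube-system, then block on the rollout.

```python
# Sketch of the tuning step in plain Python (hypothetical helper names;
# the real change goes through PKB's kubectl wrappers, not subprocess).
import subprocess

CNI_ENV = {
    'WARM_ENI_TARGET': '0',
    'WARM_IP_TARGET': '1',
    'MINIMUM_IP_TARGET': '1',
}


def build_tune_commands():
    """Return the kubectl invocations that apply the warm-pool settings."""
    set_env = (
        ['kubectl', '-n', 'kube-system', 'set', 'env', 'daemonset/aws-node']
        + ['%s=%s' % (k, v) for k, v in sorted(CNI_ENV.items())]
    )
    # Block until every aws-node pod has restarted with the new env vars,
    # so the benchmark does not start mid-rollout.
    rollout = [
        'kubectl', '-n', 'kube-system', 'rollout', 'status',
        'daemonset/aws-node', '--timeout=10m',
    ]
    return [set_env, rollout]


def apply_tuning():
    for cmd in build_tune_commands():
        subprocess.run(cmd, check=True)
```

The `--timeout=10m` value is an assumption for the sketch; the PR's actual rollout wait may differ.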

Why

By default, the AWS VPC CNI pre-allocates a warm pool of ENIs and
secondary IPs on each node as soon as it joins the cluster. The number
of IPs reserved scales with the instance type — larger instances have
more ENI slots and more IPs per ENI, so each node can reserve 10–30+
IPs before a single pod is scheduled.
At 5k-node scale this becomes a hard blocker: the cumulative IP
pre-allocation across all nodes exhausts the subnet address space
before all nodes can be scheduled, causing the benchmark to fail with
InsufficientCapacityError and FailedScheduling events.
Setting WARM_ENI_TARGET=0, WARM_IP_TARGET=1, MINIMUM_IP_TARGET=1
instructs the CNI to keep only 1 IP warm per node instead of a full
ENI's worth, which is sufficient for the kubernetes_node_scale
workload (one pod per node). This is not a performance optimization —
it is a prerequisite for the benchmark to complete successfully at
this scale.
See the VPC CNI documentation on WARM_IP_TARGET and MINIMUM_IP_TARGET for reference, and the overview linked in the review thread for a practical explanation.
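A back-of-the-envelope calculation makes the exhaustion argument concrete. The per-instance figures below are illustrative only (an m5.large-class node with ~10 IPv4 addresses per ENI); real ENI/IP limits vary by instance type, which is exactly why larger instances reserve 10–30+ IPs each.

```python
# Illustrative arithmetic: why the default warm pool exhausts a subnet at
# 5k nodes, and why ~1 warm IP per node does not.
NODES = 5000
IPS_PER_ENI = 10                 # primary + secondary IPs on one ENI (illustrative)
SUBNET_CAPACITY = 2 ** 16 - 5    # a /16 subnet, minus AWS's 5 reserved addresses

# Default: WARM_ENI_TARGET=1 keeps a full spare ENI warm, so an idle node
# holds roughly two ENIs' worth of IPs (primary ENI + warm ENI).
default_demand = NODES * (2 * IPS_PER_ENI)

# Tuned: WARM_IP_TARGET=1 / MINIMUM_IP_TARGET=1 keeps ~1 warm IP per node
# on top of the node's primary IP — enough for one pod per node.
tuned_demand = NODES * 2
```

Under these assumptions the default demand (100,000 IPs) overruns the /16 (65,531 usable) well before all 5k nodes join, while the tuned demand (10,000 IPs) fits comfortably.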

Scope

The tuning block is guarded by 'kubernetes_node_scale' in FLAGS.benchmarks, so it is a no-op for all other benchmarks. No
existing behaviour is changed outside that gate.

Testing

Validated with two back-to-back 5k-node runs on EKS + Karpenter in
us-east-1. Both runs completed with status
SUCCEEDED.

Usage

python pkb.py \
--benchmarks=kubernetes_node_scale \
--eks_tune_vpc_cni_for_scale=True \
....

@hubatish (Collaborator)

I'm wondering what the heck this even does. I found:
https://medium.com/@GiteshWadhwa/optimizing-kubernetes-networking-understanding-warm-eni-target-warm-ip-target-and-14e74096b067

Which seems like a reasonable explanation & discusses IP addresses. Could you provide a link in your description?

@hubatish (Collaborator)

Second question:
We often like to run a somewhat naive set of benchmarks without optimizations. Is this such a premature optimization? Or is it justified/necessary, either because the benchmark fails without it or because this is indeed purely a networking thing? I know for AKS/GKE we set networking values with cidr during cluster creation / prior to scaling, so maybe even with this optimization EKS is still doing more work while scaling?

def _PostCreate(self):
  """Performs post-creation steps for the cluster."""
  super()._PostCreate()
  if 'kubernetes_node_scale' in FLAGS.benchmarks:
Collaborator


It is a bit more complex, but please add a flag in providers/aws/flags & reference it from here. For a few reasons:

  1. We want resources to be unaware of benchmarks. Rather than the resource knowing about a benchmark, the benchmark (or just the user calling the benchmark & setting flag values) should tell the resource (via flag) what it wants it to do.
  2. We might want to enable this in other benchmarks - like scale -> 1k or 5k pods would probably also benefit from this right?

Collaborator Author


Done, added --eks_tune_vpc_cni_for_scale to providers/aws/flags.py and replaced the benchmark check with it. Thanks for the suggestion!
