Skip to content

fix: [sc-301863] [AKS] Add node pool action is broken with "Automatic" mode#85

Open
amandineslx wants to merge 9 commits into
masterfrom
bug/dss14-sc-301863--aks-add-node-pool-action-is-broken-with
Open

fix: [sc-301863] [AKS] Add node pool action is broken with "Automatic" mode#85
amandineslx wants to merge 9 commits into
masterfrom
bug/dss14-sc-301863--aks-add-node-pool-action-is-broken-with

Conversation

@amandineslx

@amandineslx amandineslx commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

When creating a node pool on a cluster, the following applies:

  • there needs to be at least one System node pool
  • an invalid node pool mode is not accepted anymore by Azure

On DSS side, we offer an Automatic node pool mode that should behave as:

  • if there is already one System node pool in the cluster, the new Automatic one will be User
  • if there is no System node pool in the cluster, the new Automatic one needs to be System

@amandineslx amandineslx added this to the 14.6.0 milestone Apr 17, 2026
@amandineslx amandineslx requested a review from a team April 17, 2026 14:47
@amandineslx amandineslx self-assigned this Apr 17, 2026

@pjestin-dku pjestin-dku left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this way to fix the issue. This changes the behavior for both cluster creation and node pool addition.

While digging around I found this PR which added the "Automatic" mode, with a comment raising interesting issues: dc710a0#commitcomment-53684049

It seems to me we should never set Automatic as the mode (which is enforced here). However, from what the comment says, it looks like the first node pool was intended to have mode System, and the rest unset mode (i.e. User).

What do you think about moving this logic to the client, i.e. python-clusters/create-aks-cluster/cluster.py for cluster creation and python-runnables/add-node-pool/runnable.py for node pool addition?

@thtrunck thtrunck requested a review from vrutz April 21, 2026 07:55
@vrutz

vrutz commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

I'm not sure about this way to fix the issue. This changes the behavior for both cluster creation and node pool addition.

While digging around I found this PR which added the "Automatic" mode, with a comment raising interesting issues: dc710a0#commitcomment-53684049

It seems to me we should never set Automatic as the mode (which is enforced here). However, from what the comment says, it looks like the first node pool was intended to have mode System, and the rest unset mode (i.e. User).

What do you think about moving this logic to the client, i.e. python-clusters/create-aks-cluster/cluster.py for cluster creation and python-runnables/add-node-pool/runnable.py for node pool addition?

I don't see a good way to keep the automatic mode either which means we are in for a breaking change where the users will have to decide whether the node pools should be system or user.

Could a cluster be created without initial node pools? If yes, that may help our case. Bc then we can make the automatic mode always user in the macro since the cluster will have at least one node pool (system).
But with cluster creation, we could choose a random node pool as system... 🤷🏻‍♂️

@amandineslx amandineslx requested a review from pjestin-dku April 29, 2026 16:51

@pjestin-dku pjestin-dku left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
The string Automatic is no longer passed as a literal mode when adding a node pool. We are also changing the behavior when creating a cluster, I think the new behavior will be more flexible 👍
I tested by creating clusters with various node pool configurations: as expected one of the node pools is created as "System" if there is no other.
I also tried the add node pool action: as expected the new node pool is "User" only if the input mode was "Automatic"
Note that in case we are creating a cluster with one node pool in mode "Automatic", then another node pool in mode "System", then both node pools become "System". I don't think think this is an issue.

@thtrunck thtrunck self-requested a review May 5, 2026 07:54
@thtrunck thtrunck modified the milestones: 14.6.0, 14.6.2 May 6, 2026

@thtrunck thtrunck left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code for setting CriticalAddonsOnly=true:NoSchedule is weird we only do that is the mode is System.
You changed the resolution time (now in node_pool_builder.with_mode you passe the resolved mode instead of resolving later).
So this means that for a single automatic nodepool we will have the taint and we won't be able to do anything.

I say that with only looking at the code maybe I'm missing something but I'm provisioning an instance to actually test.

@amandineslx amandineslx requested a review from thtrunck May 18, 2026 10:31
Comment thread python-lib/dku_azure/utils.py Outdated
Comment on lines +98 to +104
def determine_node_pool_mode(input_node_pool_mode, is_existing_system_node_pool):
if input_node_pool_mode != "Automatic":
return input_node_pool_mode
if is_existing_system_node_pool:
return "User"
else:
return "System" No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this, the first node pool with Automatic mode will be set to System. But it could be that another later node pool is set to System. Would it not make more sense to first go through all of the node pools, check if one is explicitly System and use that, if not, use the first Automatic mode node pool and if they are all in User mode, throw an exception? (or let Azure throw it).

I'm not sure if there can be multiple System node pools in one cluster and that could clash

Comment thread python-clusters/create-aks-cluster/cluster.py Outdated

@vrutz vrutz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested with explicit System node pools, no explicit System node pools, using the macro for adding/removing node pools and the mode is correct

@mhoarau mhoarau modified the milestones: 14.6.2, 15.0.0 Jun 4, 2026
@thtrunck thtrunck removed their request for review June 11, 2026 07:53
@thtrunck thtrunck requested review from pjestin-dku and thtrunck June 11, 2026 07:54
@mhoarau mhoarau modified the milestones: 15.0.0, 14.7.2 Jun 11, 2026

@pjestin-dku pjestin-dku left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest changes look good.
I tested both cluster creation and node pool addition with different combinations, it works as expected 👍
I have a minor suggestion about code simplification.

Comment on lines +106 to +110
def is_explicit_system_node_pool(node_pool_modes):
for node_pool_mode in node_pool_modes:
if node_pool_mode == "System":
return True
return False No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, this can be done in a simpler way

Suggested change
def is_explicit_system_node_pool(node_pool_modes):
for node_pool_mode in node_pool_modes:
if node_pool_mode == "System":
return True
return False
def is_explicit_system_node_pool(node_pool_modes):
return "System" in node_pool_modes

@thtrunck thtrunck left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some comments but those were taken into account and tested by other reviewer so approving to not block the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants