feat(auto scaling): implement NiFi auto-scaling with graceful node decommissioning #915
Open
soenkeliebau wants to merge 6 commits into main
…ration

Add NiFi-specific scaling hooks that drive node offload, disconnect, and deletion via the NiFi REST API before the StatefulSet replica count is reduced. Supports both NiFi 1.x (offload-first) and 2.x (disconnect-first) scale-down sequences.

Key components:
- NifiScalingHooks implementing the ScalingHooks trait
- NifiApiClient for REST API calls (connect, cluster nodes, status updates)
- Credential resolution from Kubernetes Secrets
- Controller integration with StackableScaler reconciliation
- RBAC, Helm config, and generated files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Document how to configure horizontal auto-scaling for NiFi clusters using StackableScaler and HPA, including configuration steps, status inspection, scale-down decommission behavior, failure recovery via the retry annotation, and current limitations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Explain that the label is required, auto-injected by the mutating webhook in commons-operator, and harmless to set explicitly in manifests for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the `replicas: 0` convention with ReplicasConfig enum matching. Create StackableScaler and HPA via ClusterResources.add(). Switch from .watches() to .owns() for scaler event routing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…update

Part of the ReplicasConfig rewrite: replace the raw HorizontalPodAutoscalerSpec wrapper in HpaConfig with a custom struct exposing only user-relevant fields, add replicas preservation support, and align Label::managed_by usage with ClusterResources conventions.

- Preserve existing StackableScaler spec.replicas across reconciles by reading the current value via `client.get_opt::<StackableScaler>()` before building the scaler object. This prevents server-side apply from resetting externally-set replica counts (e.g. from the HPA).
- Pass `NIFI_CONTROLLER_NAME` to `build_scaler()` and `build_hpa_from_user_spec()` for correct `managed-by` label format.
- Register `.owns::<HorizontalPodAutoscaler>()` on the controller so that HPA changes trigger reconciliation.
- Add dedicated error variants (BuildHpa, GetExistingScaler, ApplyScaler, ApplyHpa) instead of reusing ApplyRoleGroupStatefulSet for scaler/HPA operations.
- Update RBAC roles to include create/delete/update verbs for StackableScaler and full CRUD for HorizontalPodAutoscaler.
- Update `hpa_config.spec` call sites to `hpa_config.as_ref()` to match the new flat HpaConfig struct from operator-rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds HPA-driven auto-scaling for NiFi clusters with graceful node decommissioning via the NiFi REST API. Scaling is configured per role group through the new `ReplicasConfig` enum, and the operator manages StackableScaler and HPA resources as implementation details.

- `NifiScalingHooks` -- implements the `ScalingHooks` trait with version-aware decommissioning sequences (offload-first for NiFi 1.x, disconnect-first for NiFi 2.x)
- `NifiApiClient` -- authenticated REST API client for NiFi cluster management:
  - `/controller/cluster` (list nodes)
  - `/controller/cluster/nodes/{id}` (set status, delete node)
- `ReplicasConfig`-based reconcile -- replaces the old integer-based `replicas` field:
  - `Fixed(n)`: static replica count, no scaler/HPA created
  - `Hpa(config)`: creates StackableScaler + HPA, runs state machine on each reconcile
  - `ExternallyScaled`: creates StackableScaler without HPA for user-managed scaling
  - `Auto`: returns explicit "not yet implemented" error
- Reads the existing StackableScaler `spec.replicas` before rebuilding to prevent overwriting HPA-managed values with initial defaults
- Registers `.owns()` for both `StackableScaler` and `HorizontalPodAutoscaler` so changes trigger NiFi cluster reconciliation
- RBAC rules for `stackablescalers` and `stackablescalers/status` in `autoscaling.stackable.tech`, plus full CRUD on `horizontalpodautoscalers` in `autoscaling`
- Documentation covering configuration, status inspection, scale-down behavior, failure recovery, and current limitations
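The version-aware ordering can be pictured with a small sketch. It assumes that "offload-first" and "disconnect-first" only describe which of the two steps runs first, and that node deletion always comes last. `NifiMajor`, `DecommissionStep`, `major_from_version`, and `decommission_sequence` are hypothetical names for illustration, not the operator's actual API.

```rust
/// Hypothetical major-version tag used only in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NifiMajor {
    V1,
    V2,
}

/// The three REST-driven steps named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecommissionStep {
    Offload,
    Disconnect,
    Delete,
}

/// Parse the major version from a product version string such as "2.4.0".
/// Returns `None` for anything other than major version 1 or 2.
pub fn major_from_version(version: &str) -> Option<NifiMajor> {
    let major: u32 = version.split('.').next()?.parse().ok()?;
    match major {
        1 => Some(NifiMajor::V1),
        2 => Some(NifiMajor::V2),
        _ => None,
    }
}

/// Order of steps for removing one node before the StatefulSet shrinks.
/// Assumption: deletion is always the final step in both sequences.
pub fn decommission_sequence(version: NifiMajor) -> [DecommissionStep; 3] {
    use DecommissionStep::*;
    match version {
        NifiMajor::V1 => [Offload, Disconnect, Delete],
        NifiMajor::V2 => [Disconnect, Offload, Delete],
    }
}
```

Keeping the sequence as plain data makes the ordering easy to unit-test without a running NiFi cluster; the code that actually issues the REST calls would simply iterate over the returned steps.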
fixes stackabletech/issues#667
User-facing configuration
The operator creates the StackableScaler and HPA automatically. Users never interact with
these resources directly.
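Which resources the operator manages for each `ReplicasConfig` variant, as listed in the summary, can be sketched as a single `match`. The enum below is a simplified stand-in: the real type lives in operator-rs and carries an HPA configuration struct, whereas the `max_replicas` field here is a hypothetical placeholder. `ManagedResources`, `plan_resources`, and `scaler_replicas` are likewise illustrative names.

```rust
/// Simplified stand-in for the `ReplicasConfig` enum described in this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReplicasConfig {
    Fixed(u16),
    Hpa { max_replicas: u16 }, // hypothetical field, for illustration only
    ExternallyScaled,
    Auto,
}

/// Which auxiliary resources the operator would create for a role group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ManagedResources {
    pub scaler: bool,
    pub hpa: bool,
}

/// Map a replicas configuration to the resources to create.
pub fn plan_resources(config: &ReplicasConfig) -> Result<ManagedResources, String> {
    match config {
        // Static replica count: neither scaler nor HPA.
        ReplicasConfig::Fixed(_) => Ok(ManagedResources { scaler: false, hpa: false }),
        // Operator-managed HPA: both resources.
        ReplicasConfig::Hpa { .. } => Ok(ManagedResources { scaler: true, hpa: true }),
        // User brings their own scaling mechanism: scaler only.
        ReplicasConfig::ExternallyScaled => Ok(ManagedResources { scaler: true, hpa: false }),
        // Explicit error rather than silent fallback.
        ReplicasConfig::Auto => Err("ReplicasConfig::Auto is not yet implemented".to_string()),
    }
}

/// Replica preservation: prefer the value already present on the existing
/// scaler (for example set by the HPA) over the initial default.
pub fn scaler_replicas(existing: Option<i32>, initial: i32) -> i32 {
    existing.unwrap_or(initial)
}
```

For example, `scaler_replicas(Some(5), 2)` keeps the externally set count of 5, while `scaler_replicas(None, 2)` falls back to the initial value on first creation.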
Authentication
Only `SingleUser` authentication is currently supported for the NiFi REST API calls during scaling. LDAP and OIDC configurations return an explicit `UnsupportedScalerAuthentication` error. This limitation is documented and will be addressed in a follow-up.
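A minimal sketch of this gate, assuming the authentication method is available as an enum at the point where scaling credentials are resolved. Only the error name `UnsupportedScalerAuthentication` comes from this PR; the `NifiAuthentication` and `ScalerError` types, the `secret_name` and `method` fields, and the function name are hypothetical.

```rust
/// Hypothetical view of the configured NiFi authentication method.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NifiAuthentication {
    SingleUser { secret_name: String },
    Ldap,
    Oidc,
}

/// Hypothetical error type; only the variant name is taken from the PR text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ScalerError {
    UnsupportedScalerAuthentication { method: &'static str },
}

/// Return the name of the Secret holding the credentials used for REST calls
/// during scaling, or an explicit error for methods not yet supported.
pub fn credentials_secret_for_scaling(
    auth: &NifiAuthentication,
) -> Result<&str, ScalerError> {
    match auth {
        NifiAuthentication::SingleUser { secret_name } => Ok(secret_name.as_str()),
        NifiAuthentication::Ldap => {
            Err(ScalerError::UnsupportedScalerAuthentication { method: "LDAP" })
        }
        NifiAuthentication::Oidc => {
            Err(ScalerError::UnsupportedScalerAuthentication { method: "OIDC" })
        }
    }
}
```

Failing early with a dedicated variant keeps the unsupported cases visible to the caller instead of surfacing later as an opaque REST authentication failure.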
Dependencies
- `ReplicasConfig`, `ScalingHooks` trait, `reconcile_scaler()`, builder helpers (see "feat(auto scaling): add StackableScaler CRD, state machine, ReplicasConfig, and scaling hooks", operator-rs#1181)
- See "feat (auto scale) : add StackableScaler CRD rollout and admission webhook", commons-operator#411 -- needed for testing, as it installs the StackableScaler CRD and admission webhook
- `../operator-rs/crates/stackable-operator` (development dependency)
- `reqwest` 0.12 (`rustls-tls`, `json`) for NiFi REST API calls

Test plan
- `cargo test --all-features` passes -- unit tests cover pod FQDN construction, API URL generation, and NiFi version detection
- `cargo clippy --all-targets --all-features -- -D warnings` clean
- … `replicas: { hpa: ... }` config, verify StackableScaler and HPA are created
- … (PreScaling no-op -> Scaling -> PostScaling no-op -> Idle)
- … via REST API -> StatefulSet scaled down -> state machine completes
- `Fixed(n)` config: no scaler/HPA created, behaves as before

Author
Reviewer
Acceptance
- `type/deprecation` label & add to the deprecation schedule
- `type/experimental` label & add to the experimental features tracker