
feat(auto scaling): implement NiFi auto-scaling with graceful node decommissioning #915

Open
soenkeliebau wants to merge 6 commits into main from feat/autoscale

Conversation

@soenkeliebau
Member

Summary

Adds HPA-driven auto-scaling for NiFi clusters with graceful node decommissioning via the
NiFi REST API. Scaling is configured per role group through the new ReplicasConfig enum,
and the operator manages StackableScaler and HPA resources as implementation details.

  • NifiScalingHooks -- implements the ScalingHooks trait with version-aware
    decommissioning sequences:
    • NiFi 1.x: CONNECTED -> OFFLOADING -> OFFLOADED -> DISCONNECTING -> DISCONNECTED -> DELETE
    • NiFi 2.x: CONNECTED -> DISCONNECTING -> DISCONNECTED -> OFFLOADING -> OFFLOADED -> DELETE
    • Scale-up is a no-op (NiFi nodes self-register on startup)
  • NifiApiClient -- authenticated REST API client for NiFi cluster management:
    • SingleUser credential resolution from Kubernetes Secrets
    • Bearer token authentication
    • Endpoints: /controller/cluster (list nodes), /controller/cluster/nodes/{id}
      (set status, delete node)
  • ReplicasConfig-based reconcile -- replaces the old integer-based replicas field:
    • Fixed(n): static replica count, no scaler/HPA created
    • Hpa(config): creates StackableScaler + HPA, runs state machine on each reconcile
    • ExternallyScaled: creates StackableScaler without HPA for user-managed scaling
    • Auto: returns explicit "not yet implemented" error
  • Replicas preservation -- reads existing StackableScaler's spec.replicas before
    rebuilding to prevent overwriting HPA-managed values with initial defaults
  • Watch registration -- .owns() for both StackableScaler and
    HorizontalPodAutoscaler so changes trigger NiFi cluster reconciliation
  • RBAC -- full CRUD on stackablescalers and stackablescalers/status in
    autoscaling.stackable.tech, plus full CRUD on horizontalpodautoscalers in autoscaling
  • Documentation -- comprehensive auto-scaling guide covering configuration, status
    inspection, scale-down behavior, failure recovery, and current limitations
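The two version-specific decommission orders above can be expressed as plain data. The following is a minimal illustrative sketch, not the operator's actual API: `NodeState` and `decommission_sequence` are hypothetical names chosen to mirror the description.

```rust
// Hypothetical sketch of the version-aware scale-down order described above.
// NiFi 1.x offloads before disconnecting; NiFi 2.x disconnects first.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Connected,
    Offloading,
    Offloaded,
    Disconnecting,
    Disconnected,
    Delete,
}

/// Returns the decommission sequence for a NiFi major version.
fn decommission_sequence(major: u8) -> Vec<NodeState> {
    use NodeState::*;
    match major {
        1 => vec![Connected, Offloading, Offloaded, Disconnecting, Disconnected, Delete],
        _ => vec![Connected, Disconnecting, Disconnected, Offloading, Offloaded, Delete],
    }
}

fn main() {
    println!("NiFi 1.x: {:?}", decommission_sequence(1));
    println!("NiFi 2.x: {:?}", decommission_sequence(2));
}
```

Encoding the order as data rather than branching at each step keeps the two version paths easy to compare and test side by side.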

fixes stackabletech/issues#667

User-facing configuration

```yaml
apiVersion: nifi.stackable.tech/v1alpha1
kind: NifiCluster
spec:
  nodes:
    roleGroups:
      default:
        replicas:
          hpa:
            maxReplicas: 10
            minReplicas: 3
            metrics:
              - type: Resource
                resource:
                  name: cpu
                  target:
                    type: Utilization
                    averageUtilization: 80
```

The operator creates the StackableScaler and HPA automatically. Users never interact with
these resources directly.
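The mapping from `ReplicasConfig` variant to managed resources can be sketched as follows. This is a simplified illustration; the variant payloads and the `managed_resources` helper are hypothetical, not the operator's exact types.

```rust
// Simplified model of the ReplicasConfig variants described above.
#[derive(Debug)]
enum ReplicasConfig {
    Fixed(u32),
    Hpa { min_replicas: u32, max_replicas: u32 },
    ExternallyScaled,
    Auto,
}

/// Which managed resources each variant implies: (StackableScaler, HPA).
/// `Auto` is reserved and returns an explicit error for now.
fn managed_resources(config: &ReplicasConfig) -> Result<(bool, bool), String> {
    match config {
        ReplicasConfig::Fixed(_) => Ok((false, false)),
        ReplicasConfig::Hpa { .. } => Ok((true, true)),
        ReplicasConfig::ExternallyScaled => Ok((true, false)),
        ReplicasConfig::Auto => Err("auto scaling mode not yet implemented".to_string()),
    }
}

fn main() {
    let cfg = ReplicasConfig::Hpa { min_replicas: 3, max_replicas: 10 };
    println!("{:?}", managed_resources(&cfg));
}
```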

Authentication

Only SingleUser authentication is currently supported for the NiFi REST API calls during
scaling. LDAP and OIDC configurations return an explicit UnsupportedScalerAuthentication
error. This limitation is documented and will be addressed in a follow-up.
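The gating described here amounts to a single match on the resolved authentication method. A minimal sketch, assuming hypothetical `AuthMethod` and `ScalerError` types that only mirror the behavior described above:

```rust
// Sketch of the authentication gating for scaler REST calls: only
// SingleUser yields credentials, everything else is rejected explicitly.
#[derive(Debug, Clone, PartialEq)]
enum AuthMethod {
    SingleUser { username: String, password: String },
    Ldap,
    Oidc,
}

#[derive(Debug, PartialEq)]
enum ScalerError {
    UnsupportedScalerAuthentication(String),
}

/// Resolve credentials for NiFi REST calls, rejecting unsupported methods.
fn resolve_scaler_credentials(auth: &AuthMethod) -> Result<(String, String), ScalerError> {
    match auth {
        AuthMethod::SingleUser { username, password } => {
            Ok((username.clone(), password.clone()))
        }
        other => Err(ScalerError::UnsupportedScalerAuthentication(format!("{:?}", other))),
    }
}

fn main() {
    let auth = AuthMethod::SingleUser { username: "admin".into(), password: "secret".into() };
    println!("{:?}", resolve_scaler_credentials(&auth));
}
```

Returning a dedicated error variant (rather than silently skipping scaling) makes the limitation visible in cluster status instead of failing later at the REST call.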

Dependencies

Test plan

  • cargo test --all-features passes -- unit tests cover pod FQDN construction, API URL
    generation, and NiFi version detection
  • cargo clippy --all-targets --all-features -- -D warnings clean
  • Manual integration test: deploy NiFi cluster with replicas: { hpa: ... } config,
    verify StackableScaler and HPA are created
  • Scale-up: HPA increases replicas -> new pods start -> state machine completes
    (PreScaling no-op -> Scaling -> PostScaling no-op -> Idle)
  • Scale-down: HPA decreases replicas -> pre_scale hook offloads/disconnects/deletes nodes
    via REST API -> StatefulSet scaled down -> state machine completes
  • NiFi 1.x vs 2.x: verify correct decommission sequence for each version
  • Mid-scaling HPA update blocked by admission webhook
  • Failed state recovery via retry annotation
  • Fixed(n) config: no scaler/HPA created, behaves as before
  • Reporting task service selector works with ReplicasConfig-based role groups

Author

  • Changes are OpenShift compatible
  • CRD changes approved
  • CRD documentation for all fields, following the style guide.
  • Helm chart can be installed and deployed operator works
  • Integration tests passed (for non-trivial changes)
  • Changes need to be "offline" compatible
  • Links to generated (nightly) docs added
  • Release note snippet added

Reviewer

  • Code contains useful comments
  • Code contains useful logging statements
  • (Integration-)Test cases added
  • Documentation added or updated. Follows the style guide.
  • Changelog updated
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Acceptance

  • Feature Tracker has been updated
  • Proper release label has been added
  • Links to generated (nightly) docs added
  • Release note snippet added
  • Add type/deprecation label & add to the deprecation schedule
  • Add type/experimental label & add to the experimental features tracker

soenkeliebau and others added 6 commits March 11, 2026 09:39
…ration

Add NiFi-specific scaling hooks that drive node offload, disconnect, and
deletion via the NiFi REST API before the StatefulSet replica count is
reduced. Supports both NiFi 1.x (offload-first) and 2.x (disconnect-first)
scale-down sequences.

Key components:
- NifiScalingHooks implementing the ScalingHooks trait
- NifiApiClient for REST API calls (connect, cluster nodes, status updates)
- Credential resolution from Kubernetes Secrets
- Controller integration with StackableScaler reconciliation
- RBAC, Helm config, and generated files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document how to configure horizontal auto-scaling for NiFi clusters
using StackableScaler and HPA, including configuration steps, status
inspection, scale-down decommission behavior, failure recovery via
the retry annotation, and current limitations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain that the label is required, auto-injected by the mutating
webhook in commons-operator, and harmless to set explicitly in
manifests for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace replicas: 0 convention with ReplicasConfig enum matching.
Create StackableScaler and HPA via ClusterResources.add().
Switch from .watches() to .owns() for scaler event routing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…update

Part of the ReplicasConfig rewrite: replace the raw
HorizontalPodAutoscalerSpec wrapper in HpaConfig with a custom struct
exposing only user-relevant fields, add replicas preservation support,
and align Label::managed_by usage with ClusterResources conventions.

- Preserve existing StackableScaler spec.replicas across reconciles by
  reading the current value via `client.get_opt::<StackableScaler>()`
  before building the scaler object. This prevents server-side apply
  from resetting externally-set replica counts (e.g. from the HPA).

- Pass `NIFI_CONTROLLER_NAME` to `build_scaler()` and
  `build_hpa_from_user_spec()` for correct `managed-by` label format.

- Register `.owns::<HorizontalPodAutoscaler>()` on the controller so
  that HPA changes trigger reconciliation.

- Add dedicated error variants (BuildHpa, GetExistingScaler,
  ApplyScaler, ApplyHpa) instead of reusing ApplyRoleGroupStatefulSet
  for scaler/HPA operations.

- Update RBAC roles to include create/delete/update verbs for
  StackableScaler and full CRUD for HorizontalPodAutoscaler.

- Update `hpa_config.spec` call sites to `hpa_config.as_ref()` to
  match the new flat HpaConfig struct from operator-rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Implement Shared AutoScaling Hook Functionality
