Skip to content

NIFI-15880: Allow Connectors to enter a Troublehshooting mode where the flow beco…#11175

Draft
markap14 wants to merge 7 commits into
apache:mainfrom
markap14:nifi-connectors-troubleshooting
Draft

NIFI-15880: Allow Connectors to enter a Troublehshooting mode where the flow beco…#11175
markap14 wants to merge 7 commits into
apache:mainfrom
markap14:nifi-connectors-troubleshooting

Conversation

@markap14
Copy link
Copy Markdown
Contributor

…mes modifiable and user-managed (temporarily) rather than Connector managed.

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@markap14 markap14 force-pushed the nifi-connectors-troubleshooting branch 8 times, most recently from 08ee450 to 2f5bc8f Compare April 27, 2026 21:22
@markap14 markap14 changed the title Allow Connectors to enter a Troublehshooting mode where the flow beco… NIFI-15880: Allow Connectors to enter a Troublehshooting mode where the flow beco… May 14, 2026
Comment on lines +736 to +739
if (!STOPPED_STATES.contains(processor.getScheduledState())) {
throw new IllegalStateException("Cannot end Troubleshooting mode for " + this + " because Processor " + processor.getIdentifier()
+ " is in state " + processor.getScheduledState() + "; it must be STOPPED or DISABLED.");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this works currently is a bit confusing. You can enter troubleshooting mode while the processor is running, but you can't end troubleshooting with it still running. This forces the user to have to manually stop all processors before they would be allowed to end troubleshooting. For large connectors, this will be a tedious process of manually stopping everything.

It might be more palatable if STOP was a valid action on the connector as a whole while in troubleshooting mode. But it is explicitly not allowed in the current state.

@markap14 markap14 force-pushed the nifi-connectors-troubleshooting branch 3 times, most recently from 9a6d871 to e6b3760 Compare May 19, 2026 17:24
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConnectorRepository.getConnectors() got a required ConnectorSyncMode parameter. But FlowController is still calling getConnectors() with no args. Results in a compile error.

markap14 added 7 commits May 27, 2026 13:59
…he flow becomes modifiable and user-managed (temporarily) rather than Connector managed.
…owable action; fix findProcessGroup for connector child groups

- Refactored verifyCanEndTroubleshooting to delegate to a new
  getReasonCannotEndTroubleshooting() method that returns Optional<String>.
  The verify method throws if the Optional is present; createEndTroubleshootingAction
  uses the same method to populate the reason field on the allowable action so
  callers see why the action is unavailable without having to attempt it.
- Replaced the throwing verifyAllComponentsStoppedAndDisabled helper with
  findReasonComponentsNotStopped, which returns the first reason found as an
  Optional<String>. The expensive flow-revert preflight only runs after the
  cheap component-state scan passes.
- Fixed StandardProcessGroup.findProcessGroup to use the unconditional
  flowManager.getGroup(id) instead of the connector-ID-filtered overload.
  Only the root managed process group carries a connectorId; child groups do
  not, so the filtered lookup always returned null for non-root groups,
  causing a 404 when navigating into a child group in Troubleshooting mode.
  The existing isOwner check already enforces hierarchy ownership.
…up via the Connector flow endpoint

Adds testGetFlowForChildProcessGroupInTroubleshooting to
ConnectorTroubleshootingIT to verify that the Connector's
/flow/process-groups/{groupId} endpoint returns the correct flow for a
child process group within the managed flow while in Troubleshooting mode.
…rebuild

- Use fully-qualified org.apache.nifi.controller.ScheduledState for local
  variables in findReasonComponentsNotStopped to resolve the naming conflict
  with the imported org.apache.nifi.flow.ScheduledState.
- Pass ConnectorSyncMode.LOCAL_ONLY to ConnectorRepository.getConnectors()
  in FlowController.findInputPortIncludingConnectorManaged and
  findOutputPortIncludingConnectorManaged; the no-arg overload was removed
  from the interface and these internal lookups do not need a provider sync.
DependentSecretVerifyConnector, RequiredSecretConnector, and
TrackingConnector did not implement the new abstract
Connector.getActiveFlow(FlowContext) method added to nifi-api.
Add a stub returning null to each, since none of the tests
exercised that method.
…rs in component DAOs

After rebasing onto main (NIFI-15919), ConnectorRepository.getConnector
and ConnectorRepository.getConnectors now require a ConnectorSyncMode.
Update the connector-managed component lookups added to StandardFunnelDAO,
StandardInputPortDAO, StandardLabelDAO, StandardOutputPortDAO, and
StandardRemoteProcessGroupDAO to pass ConnectorSyncMode.LOCAL_ONLY.
These are internal walks across in-memory connector state to locate a
component by id and have no need to consult the configuration provider.
…euedData

assertEquals(-1, getQueuedCount(...)) compared an int against an
OptionalInt, so the comparison was never meaningful. Replace it with
assertFalse(getQueuedCount(...).isPresent(), ...) which expresses the
intended check: after ending Troubleshooting the connection added during
Troubleshooting must no longer exist in the managed flow.
@markap14 markap14 force-pushed the nifi-connectors-troubleshooting branch from ba28a18 to 332ec7b Compare May 27, 2026 19:02
Copy link
Copy Markdown
Contributor

@kevdoran kevdoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bugs / functional concerns

1. Cluster flow sync of an already-TROUBLESHOOTING connector will fail with IllegalStateException

StandardConnectorNode.setName now throws if the connector is in TROUBLESHOOTING.

@Override
public void setName(final String name) {
    if (getCurrentState() == ConnectorState.TROUBLESHOOTING) {
        throw new IllegalStateException("Cannot rename " + this + " while it is in Troubleshooting mode; exit Troubleshooting mode before renaming the Connector.");
    }
    this.name = name;
}

But StandardConnectorRepository.syncConnector calls connector.setName(effectiveName) unconditionally and TROUBLESHOOTING is neither in isWaitableState nor isRejectableState:

        // Set name locally (no provider.save())
        connector.setName(effectiveName);

        final boolean wasRunning = currentState == ConnectorState.RUNNING;
        final boolean configChanged = isNewConnector || isConfigurationUpdated(connector, effectiveActiveConfig, effectiveWorkingConfig);
        final boolean restoringTroubleshooting = effectiveScheduledState == VersionedConnectorState.TROUBLESHOOTING;

On a fresh startup it's fine because the connector node is freshly created in STOPPED (then restoreTroubleshootingState() flips it at the end). However, in a clustered configuration when a node receives a sync after disconnect/reconnect (or any other path where syncConnector runs against an existing connector that is already TROUBLESHOOTING in memory), setName will throw and the exception will propagate out of VersionedFlowSynchronizer unchecked — flow sync fails hard instead of returning a ConnectorSyncResult.failed.

This is the same family of regressions that earlier rebases hit (see commit 8842121). Suggested fixes:

  • In syncConnector, skip the setName when currentState == TROUBLESHOOTING && effectiveScheduledState == VersionedConnectorState.TROUBLESHOOTING && Objects.equals(effectiveName, connector.getName()).
  • Or have the rename-guard live in ConnectorRepository.updateConnector (which is the public mutation path) and not in setName so that internal sync paths can still set the name on the node. The repository already has the same guard; the node-level one is duplicative.

This is the most concerning functional bug I found. Worth a system test that exercises clustered sync while the connector is in TROUBLESHOOTING (the new IT covers single-node restart in testEndTroubleshootingBlockedByQueuedData, but not multi-node sync).

2. verifyEnterTroubleshooting is run twice; providers can observe two calls per transition

StandardConnectorDAO.enterTroubleshootingrepository.enterTroubleshootingverifyEnterTroubleshooting(connector) (which calls configurationProvider.verifyEnterTroubleshooting). The REST resource also calls serviceFacade.verifyEnterConnectorTroubleshooting(id) as the verify phase of withWriteLock, which calls dao.verifyEnterTroubleshooting(id)repository.verifyEnterTroubleshooting(connector) (again calling the provider).

For the default Apache no-op this is harmless. For third-party ConnectorConfigurationProvider implementations that back this with an external round-trip (e.g., to refuse troubleshooting while the upstream system thinks the connector is in a transient state), it will fire twice per transition. Worth either (a) caching/idempotency in the provider implementation, or (b) the framework taking care to only verify once.

The same shape exists for endTroubleshooting: verifyCanEndTroubleshooting is run by the verify phase (cheaply now), and then connector.endTroubleshooting() calls it a second time, including the expensive initializationContext.verifyUpdateFlow(...) preflight. Not strictly a bug, but it is a redundant heavy call on the critical path.

3. getReasonCannotEndTroubleshooting runs the expensive preflight inside getAvailableActions

private ConnectorAction createEndTroubleshootingAction() {
    final Optional<String> reason = getReasonCannotEndTroubleshooting();
    return new StandardConnectorAction("END_TROUBLESHOOTING", "Exit Troubleshooting mode for the connector", reason.isEmpty(), reason.orElse(null));
}

getReasonCannotEndTroubleshooting does two things:

  1. A scan of every processor/port/RPG/CS in the managed flow tree.
  2. If that passes, it invokes the Connector's getActiveFlow(FlowContext) and runs verifyUpdateFlow(...) against the managed PG (a full flow-comparison preflight).

getAvailableActions is called by the GET connector endpoint (i.e., on every refresh of the Connector entity). If the user sits on the Connector while in TROUBLESHOOTING, the UI may refresh the entity frequently. Once everything is stopped, every refresh runs the full preflight on a non-trivial flow. Suggest computing reason only on demand (e.g., lazily, or return a coarse-grained "ready/not-ready" by checking component states only and reserve the preflight for the actual verify endpoint). At minimum gate the preflight behind a flag or compute it only when called from the REST verify path.

4. applyUpdate blocked by TROUBLESHOOTING surfaces an unfriendly IllegalStateException

@Override
public void applyUpdate(final ConnectorNode connector, final ConnectorUpdateContext context) throws FlowUpdateException {
    logger.debug("Applying update to {}", connector);

    if (connector.getCurrentState() == ConnectorState.TROUBLESHOOTING) {
        throw new IllegalStateException("Cannot apply an update to " + connector + " while it is in Troubleshooting mode; "
            + "exit Troubleshooting mode before applying updates.");
    }

    if (configurationProvider != null && !configurationProvider.shouldApplyUpdate(connector.getIdentifier())) {

This is potentially the most user-visible touchpoint for any external system orchestrating connectors via the REST API: external API clients that call apply-update will now fail with an HTTP error (depends on the global exception mapper for IllegalStateException, but it's typically 400 in NiFi) any time the connector happens to be in TROUBLESHOOTING. It would be more correct to:

  • Map this to HTTP 409 specifically. transitionStateForUpdating and setName should throw a typed exception (e.g., extending IllegalStateException but mapped at the REST layer to 409 with a stable code) so that API client retry policies can distinguish "transient — retry" from "user owns this, leave it alone".
  • Add an applyUpdate short-circuit verification method on the facade so that callers can ask first, rather than swallowing a failing POST.

This is more of a maintainability ask, but it has clear downstream impact for any client that drives connector updates programmatically.

5. StandardConnectorDAO.endTroubleshooting wraps FlowUpdateException with + e

@Override
public void endTroubleshooting(final String id) {
    final ConnectorNode connector = getConnector(id);
    try {
        getConnectorRepository().endTroubleshooting(connector);
    } catch (final FlowUpdateException e) {
        throw new IllegalStateException("Failed to exit troubleshooting mode for Connector " + id + ": " + e, e);
    }
}

+ e inlines e.toString() (class name + message) inside the IllegalStateException message AND e is also chained as the cause. Use e.getMessage() for the user-facing message, or omit the inlined message entirely. Minor, but it makes the error popup uglier.

6. restoreFlowPreservingIdentifiers uses mapAssetReferences(false)

// Sensitive property values in the proposed snapshot were encrypted using the same PropertyEncryptor when the snapshot
// was persisted (for example, when a Connector-managed flow is persisted in Troubleshooting mode). The currently loaded
// flow therefore must also be mapped with an equivalent SensitiveValueEncryptor so the comparison between "current" and
// "proposed" sensitive values operates on matching ciphertext; otherwise every sensitive property appears to differ and
// the decrypted value written back to the live component is the encrypted payload rather than the plaintext (or parameter
// reference) that was originally captured.
final FlowMappingOptions flowMappingOptions = new FlowMappingOptions.Builder()
    .mapSensitiveConfiguration(true)
    .mapPropertyDescriptors(true)
    .stateLookup(stateLookup)
    .sensitiveValueEncryptor(encryptor::encrypt)
    .componentIdLookup(ComponentIdLookup.VERSIONED_OR_GENERATE)
    .mapInstanceIdentifiers(true)
    .mapControllerServiceReferencesToVersionedId(true)
    .mapFlowRegistryClientId(false)
    .mapAssetReferences(false)
    .build();

Connectors backed by external configuration providers commonly rely on ASSET_REFERENCE translation for parameter/property bindings, and the managed PG may embed parameter or asset references (e.g., for controller services). If a user edits a controller service in Troubleshooting mode and changes an asset reference, will the restore-from-persisted-snapshot correctly preserve it? With mapAssetReferences(false) the comparator side won't normalize them. I am not 100% certain this is wrong, but it's worth confirming with @markap14 explicitly whether assets edited while in TROUBLESHOOTING survive a NiFi restart correctly (the new IT does not appear to cover this).

7. endConnectorTroubleshooting doesn't re-check cluster connectivity at mutation time

@Override
public void verifyEndConnectorTroubleshooting(final String id) {
    final List<NodeIdentifier> unconnectedNodes = getUnconnectedNodes();
    if (!unconnectedNodes.isEmpty()) {
        throw new IllegalStateException("Cannot exit Troubleshooting mode because the following cluster nodes are not CONNECTED: "
                + unconnectedNodes + ". All nodes must be CONNECTED before this operation may proceed.");
    }
    connectorDAO.verifyEndTroubleshooting(id);
}

The cluster check is only in verifyEnd…. A node can disconnect between verify and endConnectorTroubleshooting — small TOCTOU. Low risk because withWriteLock will replicate the action while still holding the lock, but worth a comment/defense in depth.

8. findOwningConnector walks getParent(), but child groups carry a back-reference too

Just a heads-up that flowManager.getGroup(id) is now used in findProcessGroup (good fix), but findOwningConnector walks parents in pure Java rather than asking the connector repository for the connector ID directly. For deep PG hierarchies inside a managed flow this is O(depth) per locate; combined with the for each connector: managedGroup.findFunnel(funnelId) pattern in the new DAO methods, you have O(connectors × tree-walk) per component lookup. Fine for the typical "one or a few Connectors" install, but if a deployment runs many connectors on a single node, this is a hotspot.


Maintainability concerns

  1. ProcessGroup.findOwningConnector() is added on the interface (ProcessGroup.java +21 lines). Every other ProcessGroup implementation (MockProcessGroup, etc.) now has to implement it. The implementation in StandardProcessGroup is sensible. Consider providing a default implementation that uses getConnectorIdentifier() + getParent() so that other implementers don't have to repeat the loop.

  2. ComponentDAO/AuthorizableLookup API explosion. Every get* method now has a (id, boolean includeConnectorManaged) overload across 6 DAOs and 5 AuthorizableLookup getters. The single boolean is fine for now but it's an API smell — most callers will pass false, the only callers that pass true are the Connector-scoped REST endpoints. An alternative would be a single ConnectorAwareLookup.getInsideManagedFlow(...) facade that doesn't pollute every interface. Worth a follow-up.

  3. VersionedComponentFlowMapper.mapConnector(ConnectorNode) (1-arg) is preserved as a backward-compat overload that silently does NOT serialize the managed PG even when the connector is in TROUBLESHOOTING. This is a latent footgun for any future caller. Either inline it, deprecate it with @Deprecated(forRemoval=true), or have it call the 2-arg overload with a ControllerServiceProvider resolved from the connector node itself.

  4. endTroubleshooting on the Repository does not notify the provider. enterTroubleshooting calls provider.verifyEnterTroubleshooting, but there is no provider.endTroubleshooting(connectorId) or similar lifecycle hook. For the Apache default this is fine, but third-party ConnectorConfigurationProvider implementations may want a way to observe transitions back out of TROUBLESHOOTING. If you anticipate any need for that signal, it's better to add the hook now — adding a lifecycle method later is awkward because the default behavior would need to remain no-op for backward compatibility.

  5. Javadoc references to ScheduledState left behind. ConnectorRepository.syncConnector's Javadoc still says ConnectorConfigurationProvider#getSyncDirective(String, org.apache.nifi.flow.ScheduledState) — should be VersionedConnectorState:

* <p>When a {@link ConnectorConfigurationProvider} is configured, the provider's
* {@link ConnectorConfigurationProvider#getSyncDirective(String, org.apache.nifi.flow.ScheduledState)}
* method is called to obtain a {@link ConnectorSyncDirective} that may override the
* working configuration, name, or ScheduledState from the versioned flow.</p>
  1. Typo in API Javadoc. ConnectorRepository.endTroubleshooting:
* @throws IllegalStateException if the Connector cannot be transitioned out of Troubleshooting mode because it doe snot adhere

"doe snot adhere" → "does not adhere".

  1. The reviewer's earlier comment from rfellows is still not addressed. rfellows observed that you can enter Troubleshooting with components running, but you cannot exit until everything is stopped/disabled, leaving the user to stop everything manually. A bulk "stop all components and exit" action would dramatically improve UX. The PR partially addressed this by surfacing the reason in END_TROUBLESHOOTING.reason, but didn't add the bulk action. For a connector with many processors this UX gap will be felt; worth either adding it now or filing a follow-up.

  2. DAO performance asymmetry. StandardProcessorDAO/StandardControllerServiceDAO use flowManager.getProcessorNode(id) / flowManager.getControllerServiceNode(id) (O(1) lookups). StandardFunnelDAO, StandardInputPortDAO, StandardOutputPortDAO, StandardLabelDAO, StandardRemoteProcessGroupDAO iterate connectorRepository.getConnectors(LOCAL_ONLY) and walk each managed group. For consistency it would be nicer to have FlowManager index these by id too, but I understand why it wasn't done in this PR — out of scope.

  3. StandardConnectorRepository.endTroubleshooting doesn't run a verifyEndTroubleshooting-style provider hook. Compare to enterTroubleshooting(connector) which calls verifyEnterTroubleshooting(connector) first. The DAO already calls connector.verifyCanEndTroubleshooting() separately, but the repo method ought to call its own verifyEndTroubleshooting(connector) (which doesn't exist yet) for symmetry — even if the provider hook is no-op. Otherwise the only entry point that verifies is the DAO, and any future caller that goes straight to the repo loses the check.

  4. .cursor/rules/code-style.mdc is committed as part of the PR. That's a 76-line code-style guide that probably shouldn't ship with an upstream Apache PR. (May just be a leftover.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants