NIFI-15880: Allow Connectors to enter a Troublehshooting mode where the flow beco…#11175
NIFI-15880: Allow Connectors to enter a Troublehshooting mode where the flow beco…#11175markap14 wants to merge 7 commits into
Conversation
08ee450 to
2f5bc8f
Compare
| if (!STOPPED_STATES.contains(processor.getScheduledState())) { | ||
| throw new IllegalStateException("Cannot end Troubleshooting mode for " + this + " because Processor " + processor.getIdentifier() | ||
| + " is in state " + processor.getScheduledState() + "; it must be STOPPED or DISABLED."); | ||
| } |
There was a problem hiding this comment.
The way this works currently is a bit confusing. You can enter troubleshooting mode while the processor is running, but you can't end troubleshooting with it still running. This forces the user to have to manually stop all processors before they would be allowed to end troubleshooting. For large connectors, this will be a tedious process of manually stopping everything.
It might be more palatable if STOP was a valid action on the connector as a whole while in troubleshooting mode. But it is explicitly not allowed in the current state.
9a6d871 to
e6b3760
Compare
There was a problem hiding this comment.
ConnectorRepository.getConnectors() got a required ConnectorSyncMode parameter. But FlowController is still calling getConnectors() with no args. Results in a compile error.
…he flow becomes modifiable and user-managed (temporarily) rather than Connector managed.
…owable action; fix findProcessGroup for connector child groups - Refactored verifyCanEndTroubleshooting to delegate to a new getReasonCannotEndTroubleshooting() method that returns Optional<String>. The verify method throws if the Optional is present; createEndTroubleshootingAction uses the same method to populate the reason field on the allowable action so callers see why the action is unavailable without having to attempt it. - Replaced the throwing verifyAllComponentsStoppedAndDisabled helper with findReasonComponentsNotStopped, which returns the first reason found as an Optional<String>. The expensive flow-revert preflight only runs after the cheap component-state scan passes. - Fixed StandardProcessGroup.findProcessGroup to use the unconditional flowManager.getGroup(id) instead of the connector-ID-filtered overload. Only the root managed process group carries a connectorId; child groups do not, so the filtered lookup always returned null for non-root groups, causing a 404 when navigating into a child group in Troubleshooting mode. The existing isOwner check already enforces hierarchy ownership.
…up via the Connector flow endpoint
Adds testGetFlowForChildProcessGroupInTroubleshooting to
ConnectorTroubleshootingIT to verify that the Connector's
/flow/process-groups/{groupId} endpoint returns the correct flow for a
child process group within the managed flow while in Troubleshooting mode.
…rebuild - Use fully-qualified org.apache.nifi.controller.ScheduledState for local variables in findReasonComponentsNotStopped to resolve the naming conflict with the imported org.apache.nifi.flow.ScheduledState. - Pass ConnectorSyncMode.LOCAL_ONLY to ConnectorRepository.getConnectors() in FlowController.findInputPortIncludingConnectorManaged and findOutputPortIncludingConnectorManaged; the no-arg overload was removed from the interface and these internal lookups do not need a provider sync.
DependentSecretVerifyConnector, RequiredSecretConnector, and TrackingConnector did not implement the new abstract Connector.getActiveFlow(FlowContext) method added to nifi-api. Add a stub returning null to each, since none of the tests exercised that method.
…rs in component DAOs After rebasing onto main (NIFI-15919), ConnectorRepository.getConnector and ConnectorRepository.getConnectors now require a ConnectorSyncMode. Update the connector-managed component lookups added to StandardFunnelDAO, StandardInputPortDAO, StandardLabelDAO, StandardOutputPortDAO, and StandardRemoteProcessGroupDAO to pass ConnectorSyncMode.LOCAL_ONLY. These are internal walks across in-memory connector state to locate a component by id and have no need to consult the configuration provider.
…euedData assertEquals(-1, getQueuedCount(...)) compared an int against an OptionalInt, so the comparison was never meaningful. Replace it with assertFalse(getQueuedCount(...).isPresent(), ...) which expresses the intended check: after ending Troubleshooting the connection added during Troubleshooting must no longer exist in the managed flow.
ba28a18 to
332ec7b
Compare
There was a problem hiding this comment.
Bugs / functional concerns
1. Cluster flow sync of an already-TROUBLESHOOTING connector will fail with IllegalStateException
StandardConnectorNode.setName now throws if the connector is in TROUBLESHOOTING.
@Override
public void setName(final String name) {
if (getCurrentState() == ConnectorState.TROUBLESHOOTING) {
throw new IllegalStateException("Cannot rename " + this + " while it is in Troubleshooting mode; exit Troubleshooting mode before renaming the Connector.");
}
this.name = name;
}But StandardConnectorRepository.syncConnector calls connector.setName(effectiveName) unconditionally and TROUBLESHOOTING is neither in isWaitableState nor isRejectableState:
// Set name locally (no provider.save())
connector.setName(effectiveName);
final boolean wasRunning = currentState == ConnectorState.RUNNING;
final boolean configChanged = isNewConnector || isConfigurationUpdated(connector, effectiveActiveConfig, effectiveWorkingConfig);
final boolean restoringTroubleshooting = effectiveScheduledState == VersionedConnectorState.TROUBLESHOOTING;On a fresh startup it's fine because the connector node is freshly created in STOPPED (then restoreTroubleshootingState() flips it at the end). However, in a clustered configuration when a node receives a sync after disconnect/reconnect (or any other path where syncConnector runs against an existing connector that is already TROUBLESHOOTING in memory), setName will throw and the exception will propagate out of VersionedFlowSynchronizer unchecked — flow sync fails hard instead of returning a ConnectorSyncResult.failed.
This is the same family of regressions that earlier rebases hit (see commit 8842121). Suggested fixes:
- In
syncConnector, skip thesetNamewhencurrentState == TROUBLESHOOTING && effectiveScheduledState == VersionedConnectorState.TROUBLESHOOTING && Objects.equals(effectiveName, connector.getName()). - Or have the rename-guard live in
ConnectorRepository.updateConnector(which is the public mutation path) and not insetNameso that internal sync paths can still set the name on the node. The repository already has the same guard; the node-level one is duplicative.
This is the most concerning functional bug I found. Worth a system test that exercises clustered sync while the connector is in TROUBLESHOOTING (the new IT covers single-node restart in testEndTroubleshootingBlockedByQueuedData, but not multi-node sync).
2. verifyEnterTroubleshooting is run twice; providers can observe two calls per transition
StandardConnectorDAO.enterTroubleshooting → repository.enterTroubleshooting → verifyEnterTroubleshooting(connector) (which calls configurationProvider.verifyEnterTroubleshooting). The REST resource also calls serviceFacade.verifyEnterConnectorTroubleshooting(id) as the verify phase of withWriteLock, which calls dao.verifyEnterTroubleshooting(id) → repository.verifyEnterTroubleshooting(connector) (again calling the provider).
For the default Apache no-op this is harmless. For third-party ConnectorConfigurationProvider implementations that back this with an external round-trip (e.g., to refuse troubleshooting while the upstream system thinks the connector is in a transient state), it will fire twice per transition. Worth either (a) caching/idempotency in the provider implementation, or (b) the framework taking care to only verify once.
The same shape exists for endTroubleshooting: verifyCanEndTroubleshooting is run by the verify phase (cheaply now), and then connector.endTroubleshooting() calls it a second time, including the expensive initializationContext.verifyUpdateFlow(...) preflight. Not strictly a bug, but it is a redundant heavy call on the critical path.
3. getReasonCannotEndTroubleshooting runs the expensive preflight inside getAvailableActions
private ConnectorAction createEndTroubleshootingAction() {
final Optional<String> reason = getReasonCannotEndTroubleshooting();
return new StandardConnectorAction("END_TROUBLESHOOTING", "Exit Troubleshooting mode for the connector", reason.isEmpty(), reason.orElse(null));
}getReasonCannotEndTroubleshooting does two things:
- A scan of every processor/port/RPG/CS in the managed flow tree.
- If that passes, it invokes the Connector's
getActiveFlow(FlowContext)and runsverifyUpdateFlow(...)against the managed PG (a full flow-comparison preflight).
getAvailableActions is called by the GET connector endpoint (i.e., on every refresh of the Connector entity). If the user sits on the Connector while in TROUBLESHOOTING, the UI may refresh the entity frequently. Once everything is stopped, every refresh runs the full preflight on a non-trivial flow. Suggest computing reason only on demand (e.g., lazily, or return a coarse-grained "ready/not-ready" by checking component states only and reserve the preflight for the actual verify endpoint). At minimum gate the preflight behind a flag or compute it only when called from the REST verify path.
4. applyUpdate blocked by TROUBLESHOOTING surfaces an unfriendly IllegalStateException
@Override
public void applyUpdate(final ConnectorNode connector, final ConnectorUpdateContext context) throws FlowUpdateException {
logger.debug("Applying update to {}", connector);
if (connector.getCurrentState() == ConnectorState.TROUBLESHOOTING) {
throw new IllegalStateException("Cannot apply an update to " + connector + " while it is in Troubleshooting mode; "
+ "exit Troubleshooting mode before applying updates.");
}
if (configurationProvider != null && !configurationProvider.shouldApplyUpdate(connector.getIdentifier())) {This is potentially the most user-visible touchpoint for any external system orchestrating connectors via the REST API: external API clients that call apply-update will now fail with an HTTP error (depends on the global exception mapper for IllegalStateException, but it's typically 400 in NiFi) any time the connector happens to be in TROUBLESHOOTING. It would be more correct to:
- Map this to HTTP 409 specifically.
transitionStateForUpdatingandsetNameshould throw a typed exception (e.g., extendingIllegalStateExceptionbut mapped at the REST layer to 409 with a stable code) so that API client retry policies can distinguish "transient — retry" from "user owns this, leave it alone". - Add an
applyUpdateshort-circuit verification method on the facade so that callers can ask first, rather than swallowing a failing POST.
This is more of a maintainability ask, but it has clear downstream impact for any client that drives connector updates programmatically.
5. StandardConnectorDAO.endTroubleshooting wraps FlowUpdateException with + e
@Override
public void endTroubleshooting(final String id) {
final ConnectorNode connector = getConnector(id);
try {
getConnectorRepository().endTroubleshooting(connector);
} catch (final FlowUpdateException e) {
throw new IllegalStateException("Failed to exit troubleshooting mode for Connector " + id + ": " + e, e);
}
}+ e inlines e.toString() (class name + message) inside the IllegalStateException message AND e is also chained as the cause. Use e.getMessage() for the user-facing message, or omit the inlined message entirely. Minor, but it makes the error popup uglier.
6. restoreFlowPreservingIdentifiers uses mapAssetReferences(false)
// Sensitive property values in the proposed snapshot were encrypted using the same PropertyEncryptor when the snapshot
// was persisted (for example, when a Connector-managed flow is persisted in Troubleshooting mode). The currently loaded
// flow therefore must also be mapped with an equivalent SensitiveValueEncryptor so the comparison between "current" and
// "proposed" sensitive values operates on matching ciphertext; otherwise every sensitive property appears to differ and
// the decrypted value written back to the live component is the encrypted payload rather than the plaintext (or parameter
// reference) that was originally captured.
final FlowMappingOptions flowMappingOptions = new FlowMappingOptions.Builder()
.mapSensitiveConfiguration(true)
.mapPropertyDescriptors(true)
.stateLookup(stateLookup)
.sensitiveValueEncryptor(encryptor::encrypt)
.componentIdLookup(ComponentIdLookup.VERSIONED_OR_GENERATE)
.mapInstanceIdentifiers(true)
.mapControllerServiceReferencesToVersionedId(true)
.mapFlowRegistryClientId(false)
.mapAssetReferences(false)
.build();Connectors backed by external configuration providers commonly rely on ASSET_REFERENCE translation for parameter/property bindings, and the managed PG may embed parameter or asset references (e.g., for controller services). If a user edits a controller service in Troubleshooting mode and changes an asset reference, will the restore-from-persisted-snapshot correctly preserve it? With mapAssetReferences(false) the comparator side won't normalize them. I am not 100% certain this is wrong, but it's worth confirming with @markap14 explicitly whether assets edited while in TROUBLESHOOTING survive a NiFi restart correctly (the new IT does not appear to cover this).
7. endConnectorTroubleshooting doesn't re-check cluster connectivity at mutation time
@Override
public void verifyEndConnectorTroubleshooting(final String id) {
final List<NodeIdentifier> unconnectedNodes = getUnconnectedNodes();
if (!unconnectedNodes.isEmpty()) {
throw new IllegalStateException("Cannot exit Troubleshooting mode because the following cluster nodes are not CONNECTED: "
+ unconnectedNodes + ". All nodes must be CONNECTED before this operation may proceed.");
}
connectorDAO.verifyEndTroubleshooting(id);
}The cluster check is only in verifyEnd…. A node can disconnect between verify and endConnectorTroubleshooting — small TOCTOU. Low risk because withWriteLock will replicate the action while still holding the lock, but worth a comment/defense in depth.
8. findOwningConnector walks getParent(), but child groups carry a back-reference too
Just a heads-up that flowManager.getGroup(id) is now used in findProcessGroup (good fix), but findOwningConnector walks parents in pure Java rather than asking the connector repository for the connector ID directly. For deep PG hierarchies inside a managed flow this is O(depth) per locate; combined with the for each connector: managedGroup.findFunnel(funnelId) pattern in the new DAO methods, you have O(connectors × tree-walk) per component lookup. Fine for the typical "one or a few Connectors" install, but if a deployment runs many connectors on a single node, this is a hotspot.
Maintainability concerns
-
ProcessGroup.findOwningConnector()is added on the interface (ProcessGroup.java+21 lines). Every otherProcessGroupimplementation (MockProcessGroup, etc.) now has to implement it. The implementation inStandardProcessGroupis sensible. Consider providing a default implementation that usesgetConnectorIdentifier()+getParent()so that other implementers don't have to repeat the loop. -
ComponentDAO/AuthorizableLookupAPI explosion. Everyget*method now has a(id, boolean includeConnectorManaged)overload across 6 DAOs and 5AuthorizableLookupgetters. The single boolean is fine for now but it's an API smell — most callers will passfalse, the only callers that passtrueare the Connector-scoped REST endpoints. An alternative would be a singleConnectorAwareLookup.getInsideManagedFlow(...)facade that doesn't pollute every interface. Worth a follow-up. -
VersionedComponentFlowMapper.mapConnector(ConnectorNode)(1-arg) is preserved as a backward-compat overload that silently does NOT serialize the managed PG even when the connector is in TROUBLESHOOTING. This is a latent footgun for any future caller. Either inline it, deprecate it with@Deprecated(forRemoval=true), or have it call the 2-arg overload with aControllerServiceProviderresolved from the connector node itself. -
endTroubleshootingon the Repository does not notify the provider.enterTroubleshootingcallsprovider.verifyEnterTroubleshooting, but there is noprovider.endTroubleshooting(connectorId)or similar lifecycle hook. For the Apache default this is fine, but third-partyConnectorConfigurationProviderimplementations may want a way to observe transitions back out of TROUBLESHOOTING. If you anticipate any need for that signal, it's better to add the hook now — adding a lifecycle method later is awkward because the default behavior would need to remain no-op for backward compatibility. -
Javadoc references to
ScheduledStateleft behind.ConnectorRepository.syncConnector's Javadoc still saysConnectorConfigurationProvider#getSyncDirective(String, org.apache.nifi.flow.ScheduledState)— should beVersionedConnectorState:
* <p>When a {@link ConnectorConfigurationProvider} is configured, the provider's
* {@link ConnectorConfigurationProvider#getSyncDirective(String, org.apache.nifi.flow.ScheduledState)}
* method is called to obtain a {@link ConnectorSyncDirective} that may override the
* working configuration, name, or ScheduledState from the versioned flow.</p>- Typo in API Javadoc.
ConnectorRepository.endTroubleshooting:
* @throws IllegalStateException if the Connector cannot be transitioned out of Troubleshooting mode because it doe snot adhere"doe snot adhere" → "does not adhere".
-
The reviewer's earlier comment from rfellows is still not addressed. rfellows observed that you can enter Troubleshooting with components running, but you cannot exit until everything is stopped/disabled, leaving the user to stop everything manually. A bulk "stop all components and exit" action would dramatically improve UX. The PR partially addressed this by surfacing the reason in
END_TROUBLESHOOTING.reason, but didn't add the bulk action. For a connector with many processors this UX gap will be felt; worth either adding it now or filing a follow-up. -
DAO performance asymmetry.
StandardProcessorDAO/StandardControllerServiceDAOuseflowManager.getProcessorNode(id)/flowManager.getControllerServiceNode(id)(O(1) lookups).StandardFunnelDAO,StandardInputPortDAO,StandardOutputPortDAO,StandardLabelDAO,StandardRemoteProcessGroupDAOiterateconnectorRepository.getConnectors(LOCAL_ONLY)and walk each managed group. For consistency it would be nicer to haveFlowManagerindex these by id too, but I understand why it wasn't done in this PR — out of scope. -
StandardConnectorRepository.endTroubleshootingdoesn't run averifyEndTroubleshooting-style provider hook. Compare toenterTroubleshooting(connector)which callsverifyEnterTroubleshooting(connector)first. The DAO already callsconnector.verifyCanEndTroubleshooting()separately, but the repo method ought to call its ownverifyEndTroubleshooting(connector)(which doesn't exist yet) for symmetry — even if the provider hook is no-op. Otherwise the only entry point that verifies is the DAO, and any future caller that goes straight to the repo loses the check. -
.cursor/rules/code-style.mdcis committed as part of the PR. That's a 76-line code-style guide that probably shouldn't ship with an upstream Apache PR. (May just be a leftover.)
…mes modifiable and user-managed (temporarily) rather than Connector managed.
Summary
NIFI-00000
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000NIFI-00000VerifiedstatusPull Request Formatting
mainbranchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
./mvnw clean install -P contrib-checkLicensing
LICENSEandNOTICEfilesDocumentation