Remove NACKs from IPC code by clumens · Pull Request #4041 · ClusterLabs/pacemaker

clumens · 2026-01-29T20:05:13Z

No description provided.

No other daemon behaves this way on unknown message, and this isn't old code. I just added it in 7d56d46 without mentioning any reasoning behind it. I can't find any clients that are expecting a NACK in this case so we should be good to just change this. Ref T991

nrwahl2

Looks good! Very minor feedback

lib/lrmd/lrmd_client.c

lib/fencing/st_client.c

lib/common/ipc_server.c

It's possible for execd to reply with an ACK message with an error code. This happens if execd receives an invalid message - this could be a programming error, or it could potentially be a problem in transmission. The client code was only incidentally handling these messages. Before, it would fall through to the "Invalid registration message" case and be handled that way. However, because it's part of the protocol for the server to respond with a NACK on errors, it seems more correct to me for us to explicitly handle this message first thing. Note that this doesn't change any behavior. Before the client would get a message it didn't expect and the IPC connection would fail. Now the client gets an error message and the IPC connection will fail. I just like the more defined behavior. Ref T991

It's confusing to have both ACKs and NACKs in the IPC protocol when both of those can also include an error code. Most daemons do not send a NACK anyway, and most (all?) clients aren't set up to handle it. In the interest of simplifying the IPC code a bit, I'm going to remove NACKs so clients can always expect an ACK and read the error code to know what to do. Ref T991

There's no doubt more that could be done along these lines, but for the moment this will have to do. I need to add some additional error handling into this function so it makes sense to unindent to make some room.

It's possible for based to reply with an ACK message with an error code. This happens if based receives an invalid message - this could be a programming error, or it could potentially be a problem in transmission. The client code was only incidentally handling these messages. Before, it would fall through to the "Reply to the CIB registration message has unknown type" case and be handled that way. However, because it's part of the protocol for the server to respond with a NACK on errors, it seems more correct to me for us to explicitly handle this message first thing. Note that this doesn't change any behavior. Before the client would get a message it didn't expect and the IPC connection would fail. Now the client gets an error message and the IPC connection will fail. I just like the more defined behavior. Ref T991

* Change the tag that pcmk__log_xml_trace logs with to be the same as in cib_native_signon. * Use g_clear_pointer when cleaning up the answer variable. * Unindent the code at the end of the function. I'm going to be adding additional error handling here too, so it's nice to have some extra room to do so.

This is just like previous patches, but applies to signing on to a remote CIB instead. Ref T991

It's confusing to have both ACKs and NACKs in the IPC protocol when both of those can also include an error code. Most daemons do not send a NACK anyway, and most (all?) clients aren't set up to handle it. In the interest of simplifying the IPC code a bit, I'm going to remove NACKs so clients can always expect an ACK and read the error code to know what to do. Ref T991

I'm going to be adding some additional error handling in this function, so it'll be nice to have some room to do so. This only unindents the first level of the end of this function. There's more to do, but it's easier to follow if it's broken up into two patches.

It's possible for fenced to reply with an ACK message with an error code. This happens if fenced receives an invalid message - this could be a programming error, or it could potentially be a problem in transmission. The client code was only incidentally handling these messages. Before, it would fall through to the "invalid reply type" case and be handled that way. However, because it's part of the protocol for the server to respond with a NACK on errors, it seems more correct to me for us to explicitly handle this message first thing. Note that this doesn't change any behavior. Before the client would get a message it didn't expect and the IPC connection would fail. Now the client gets an error message and the IPC connection will fail. I just like the more defined behavior. Ref T991

It's confusing to have both ACKs and NACKs in the IPC protocol when both of those can also include an error code. Most daemons do not send a NACK anyway, and most (all?) clients aren't set up to handle it. In the interest of simplifying the IPC code a bit, I'm going to remove NACKs so clients can always expect an ACK and read the error code to know what to do. Ref T991

I've been testing the handling of invalid messages by making sure execd_invalid_msg always returns true. This will trigger for the very first message a daemon receives, which makes it pretty easy to experiment with. In this case, the first message is register: (pcmk__remote_message_xml@remote.c:358) trace: [remote msg] <lrmd_command t="lrmd" lrmd_op="register" lrmd_clientname="pacemaker-remote-rhel9-scratch-4:3121" lrmd_protocol_ version="1.2" lrmd_is_ipc_provider="true" lrmd_remote_msg_id="1" lrmd_remote_msg_type="request"/> In real life, we could be getting an invalid message by something being garbled or a misbehaving client. Tracing through the code that got us to the above log message, you'll see that the remote message was almost certainly received via the TLS channel (it could also be a TCP socket, but I haven't seen anything use that). However, ACKs/NACKs are an IPC mechanism, which happens on the local system only and uses completely different communications channels. There is a way to proxy IPC so that it's retransmitted to another system, but we only proxy IPC if the received message has lrmd_op="ipc_fwd". In this case, the local system is the end point of the invalid message. We would also be originating the ACK/NACK, which would mean we'd need to be able to construct some sort of fake proxied message and put it in the TLS or TCP channel. I guess it's possible we could do this, but it seems like a lot of work when the client isn't even expecting it (see handle_remote_msg). This was just recently added in 7acc4f3 and has not been in any release, so there's no mixed version upgrade concerns here. Ref T991

That way, we don't have to free it if we're just going to return right away.

Due to the if/else block, in all these log messages, op will be "request". This means all these log messages say something along the lines of "Relayed request request 1 from...".

* Make the comment explaining why we're sending a NACK a little more verbose so we can understand it better in the future. * Return after sending the NACK, allowing the code under it to be unindented. There's tons more of this that could be done, but I'm not going to worry about that right now. Ref T991

…_ack_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

…_ack. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

…ck_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Fixes T991

clumens · 2026-03-30T19:34:12Z

Non-voting build failures are simply because this PR hasn't been rebased on main to pick up the Alpine Linux and log level fixes from a while back.

nrwahl2 · 2026-03-30T19:39:17Z

lib/cib/cib_remote.c


-    /* If we received a NACK in response, based thinks we originally
-     * sent an invalid message.
+    /* If we got an ACK in response, it came from handle_unknown_request in


Are you confident about this? I haven't traced it recently. But AFAIK, the code in cib_remote.c is just for clients administering the CIB from a remote machine and has nothing at all to do with Pacemaker Remote.

I would expect an ACK here to come from based, just like in cib_native.c.

Aside from the copy & pate error here, I think the right place for this check is cib_remote_callback_dispatch and cib_remove_command_dispatch, after reading the data but before doing anything with it. This is for the same reason as lib/cib/cib_native.c - the current spot is just relevant for the signon process, which only ever gets one response from the server.

...or maybe cib_remote_perform_op is the better spot based on the ability to return a useful return code. The dispatch functions appear to just be returning -1/0.

nrwahl2 · 2026-03-30T19:42:33Z

lib/cib/cib_native.c

+        /* If we got an ACK in response, it came from dispatch_common and means
+         * one of two things:
+         *
+         * (1) We originally sent a PCMK__VALUE_CIB_NOTIFY and based is ACKing


If a PCMK__VALUE_CIB_NOTIFY response falls through, wouldn't that result in a "Reply to CIB registration message has unknown type" message?

Looking at this some more, I don't think the lib/cib/cib_native.c portions of this PR are actually necessary, or perhaps this code is in the wrong spot. In cib_native_signon, we are only sending a CIB_OP_REGISTER message and there's only one possible response to that. I don't think there's any point in handling an ACK here.

cib_native_perform_op_delegate looks like the more correct spot for this. Let me know if this makes sense to you too.

clumens · 2026-04-02T18:07:07Z

I think it's going to be easiest to review the new code as an entirely new PR instead of trying to track what happened through the compare links.

clumens requested a review from nrwahl2 January 29, 2026 20:05

nrwahl2 added the review: in progress PRs that are currently being reviewed label Jan 29, 2026

nrwahl2 requested changes Mar 27, 2026

View reviewed changes

nrwahl2 removed the review: in progress PRs that are currently being reviewed label Mar 27, 2026

clumens added 19 commits March 27, 2026 12:02

Refactor: libcib: Unindent the end of cib_native_signon.

92107db

There's no doubt more that could be done along these lines, but for the moment this will have to do. I need to add some additional error handling into this function so it makes sense to unindent to make some room.

Low: libcib: Explicitly handle receiving a NACK from remote based.

8e682e5

This is just like previous patches, but applies to signing on to a remote CIB instead. Ref T991

Refactor: libstonith: Additional unindenting in stonith_api_signon.

663aa7f

Low: daemons: Delay creating an xmlNode in handle_register_request.

8b38617

That way, we don't have to free it if we're just going to return right away.

Refactor: liblrmd: Fix log messages in remote_proxy_cb.

714d247

Due to the if/else block, in all these log messages, op will be "request". This means all these log messages say something along the lines of "Relayed request request 1 from...".

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_create…

cbadda5

…_ack_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_create…

51e95cd

…_ack. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_send_a…

18111fb

…ck_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_send_ack.

314238d

All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Fixes T991

clumens force-pushed the remove-naks branch from e1411c8 to 314238d Compare March 30, 2026 16:41

nrwahl2 reviewed Mar 30, 2026

View reviewed changes

clumens closed this Apr 2, 2026

clumens deleted the remove-naks branch April 2, 2026 18:07

clumens mentioned this pull request Apr 2, 2026

Remove NACKs from IPC code #4079

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove NACKs from IPC code#4041

Remove NACKs from IPC code#4041
clumens wants to merge 20 commits intoClusterLabs:mainfrom
clumens:remove-naks

clumens commented Jan 29, 2026

Uh oh!

nrwahl2 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clumens commented Mar 30, 2026

Uh oh!

nrwahl2 Mar 30, 2026 •

edited

Loading

Uh oh!

clumens Apr 1, 2026 •

edited

Loading

Uh oh!

nrwahl2 Mar 30, 2026

Uh oh!

clumens Apr 1, 2026 •

edited

Loading

Uh oh!

clumens commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

clumens commented Jan 29, 2026

Uh oh!

nrwahl2 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clumens commented Mar 30, 2026

Uh oh!

nrwahl2 Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clumens Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nrwahl2 Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

clumens Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

clumens commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nrwahl2 Mar 30, 2026 •

edited

Loading

clumens Apr 1, 2026 •

edited

Loading

clumens Apr 1, 2026 •

edited

Loading