Remove NACKs from IPC code by clumens · Pull Request #4079 · ClusterLabs/pacemaker

clumens · 2026-04-02T19:01:21Z

This is an update of #4041. The problem with that PR is that it only modified the clients to deal with NACKs during connection to the server, and not during the rest of the IPC communication. This corrects that and also rebases on main to pick up the prep PR.

There's two places we can receive a NACK from execd: (1) During the handshake process. lrmd_handshake constructs the handshake message, then calls lrmd_send_xml to send that message and read its reply. The reply should be a registration message constructed by execd_process_signon, but if the command part of the message got screwed up in transmission, the server would see that as an unknown message and respond with a NACK. (2) During any other message sending. This is anything that passes through lrmd_send_command after the handshake has occurred. Again, this passes through lrmd_send_xml to send the command and read its reply. A NACK here would be received in response to a message the server doesn't understand. Essentially, anywhere that calls lrmd_send_xml needs to accept the possibility of receiving a NACK. I think this is going to be a common theme throughout the daemons. Ref T991

No other daemon behaves this way on unknown message, and this isn't old code. I just added it in 7d56d46 without mentioning any reasoning behind it. I can't find any clients that are expecting a NACK in this case so we should be good to just change this. Ref T991

It's confusing to have both ACKs and NACKs in the IPC protocol when both of those can also include an error code. Most daemons do not send a NACK anyway, and most (all?) clients aren't set up to handle it. In the interest of simplifying the IPC code a bit, I'm going to remove NACKs so clients can always expect an ACK and read the error code to know what to do. Note that we aren't checking the status code on the ACK at all. In execd, the only places we can receive an ACK are in the event of an error, so we never need to examine the status code to decide between success and failure. Ref T991

I've been testing the handling of invalid messages by making sure execd_invalid_msg always returns true. This will trigger for the very first message a daemon receives, which makes it pretty easy to experiment with. In this case, the first message is register: (pcmk__remote_message_xml@remote.c:358) trace: [remote msg] <lrmd_command t="lrmd" lrmd_op="register" lrmd_clientname="pacemaker-remote-rhel9-scratch-4:3121" lrmd_protocol_ version="1.2" lrmd_is_ipc_provider="true" lrmd_remote_msg_id="1" lrmd_remote_msg_type="request"/> In real life, we could be getting an invalid message by something being garbled or a misbehaving client. Tracing through the code that got us to the above log message, you'll see that the remote message was almost certainly received via the TLS channel (it could also be a TCP socket, but I haven't seen anything use that). However, ACKs/NACKs are an IPC mechanism, which happens on the local system only and uses completely different communications channels. There is a way to proxy IPC so that it's retransmitted to another system, but we only proxy IPC if the received message has lrmd_op="ipc_fwd". In this case, the local system is the end point of the invalid message. We would also be originating the ACK/NACK, which would mean we'd need to be able to construct some sort of fake proxied message and put it in the TLS or TCP channel. I guess it's possible we could do this, but it seems like a lot of work when the client isn't even expecting it (see handle_remote_msg). This was just recently added in 7acc4f3 and has not been in any release, so there's no mixed version upgrade concerns here. Ref T991

fenced deals with sending NACKs completely differently from execd. While execd can return a NACK to the client on a message that was received but is incorrect, fenced just sends a regular response (compare handle_unknown_request functions). However, fenced can return a NACK if it fails to put the client's message together. Thus, anywhere the client calls crm_ipc_send and cares about a response, we need to check for a NACK. Ref T991

It's confusing to have both ACKs and NACKs in the IPC protocol when both of those can also include an error code. Most daemons do not send a NACK anyway, and most (all?) clients aren't set up to handle it. In the interest of simplifying the IPC code a bit, I'm going to remove NACKs so clients can always expect an ACK and read the error code to know what to do. We need to be more careful with ACKs in fenced than we did in execd. fenced can send an ACK on success for STONITH_OP_NOTIFY, so anywhere we receive an ACK we also need to check its status to see if it's an error or not. Ref T991

based deals with sending NACKs identically to fenced. It only returns a NACK to the client if it fails to put the client's message together. Thus, anywhere the client calls crm_ipc_send (for local connections) or pcmk__send_remote_xml (for remote connections) and cares about a response, we need to check for a NACK.

It's confusing to have both ACKs and NACKs in the IPC protocol when both of those can also include an error code. Most daemons do not send a NACK anyway, and most (all?) clients aren't set up to handle it. In the interest of simplifying the IPC code a bit, I'm going to remove NACKs so clients can always expect an ACK and read the error code to know what to do. Just like with fenced, we need to be more careful with ACKs in based. based can send an ACK on both success and failure for PCMK__VALUE_CIB_NOTIFY, so anywhere we receive an ACK we also need to check it status to see if it's an error or not. Ref T991

…_ack_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

…_ack. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

…ck_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Fixes T991

clumens added 12 commits April 2, 2026 14:13

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_create…

c768ce5

…_ack_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_create…

b952443

…_ack. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_send_a…

b3e271f

…ck_as. All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Ref T991

Refactor: libcrmcommon: Remove the tag argument from pcmk__ipc_send_ack.

5ffa295

All ACKs are now created with PCMK__XE_ACK so this argument serves no purpose. Fixes T991

clumens requested a review from nrwahl2 April 2, 2026 19:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove NACKs from IPC code #4079

Remove NACKs from IPC code #4079
clumens wants to merge 12 commits intoClusterLabs:mainfrom
clumens:remove-naks-2

clumens commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

clumens commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant