Refactor STREAM_FORWARDING (#98) into client_req_stream API #104
Refactor STREAM_FORWARDING (#98) into client_req_stream API #104filimonov wants to merge 4 commits into
client_req_stream API #104Conversation
Replace the recipe-in-a-comment contract (callers hand-building
client_request messages, managing term+flag and rpc_client themselves)
with a typed NuRaft abstraction: raft_server::open_client_req_stream()
returns a client_req_stream fenced to the current leader's term.
Also fix a back-compat regression in asio_service.cxx: expected_term_ was
being applied to any non-zero client_request.term, not just when
STREAM_FORWARDING was set. Now gated on the flag.
Details:
- client_req_stream: is_broken(), append() -> cmd_result<ptr<buffer>>,
fail-fast contract when already broken. Requires async_replication;
refused otherwise, since pipelining has no throughput benefit in sync
mode.
- Local path: zero-cost pass-through to append_entries_ext.
- Remote path: owns one rpc_client, sends req_msg with
STREAM_FORWARDING_REQUEST, adapts the handler into a cmd_result.
- handle_cli_req: reject_cli_req helper for every accepted=false path,
routed above the store loop; three-phase banner makes the no-gap
invariant structural rather than comment-only. cb_func::ReturnNull
semantics are NOT changed — PreAppendLogLeader / AppendLogs retain
their original "return nullptr cancels the request" contract.
Combining stream forwarding with hooks that use ReturnNull is
documented as unsupported (see docs/stream_forwarding.md); note that
AppendLogs returning nullptr was always latent corruption (batch
stored + transport error to caller + retry = duplicate commit), but
fixing that is out of scope for this refactor and deferred to a
future version-bumped change.
- rpc_session: STREAM_FORWARDING teardown is decided inside
on_resp_ready AFTER write_resp_meta_ runs, so a meta-callback that
mutates accepted cannot desync the teardown decision from the
serialized response. No regression test added for the meta-callback
race — covering it needs a real asio harness with a registered meta
callback, which the fake-network unit-test infrastructure does not
offer.
- Drop dead conn_attempts_ / reconnect guard in rpc_client.
- Long recipe comment moved to docs/stream_forwarding.md.
Verified via ClickHouse build and test_keeper_multinode_simple (5/5 pass).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Is the api / contract good enough for an "official" nuraft feature? It's stripped down to just what is needed for KeeperRequestDispatcher2, for maximum simplicity.
- The caller must check is_broken() frequently. I guess most users would want a callback instead (people usually don't have a busy loop re-checking everything every 100us?). But it would be nontrivial to provide a callback because it'd require a new (?) mechanism to notify streams of term change.
- Tracking in-flight requests is up to the caller. Commit info is not propagated. The user has to have some whole separate mechanism for correlating in-flight requests with committed entries. Which probably has to pretty much look like KeeperRequestDispatcher2. Maybe it would be more appropriate to use streams with async replication turned off, but with pipelined appends similar to #101 , so that replies from leader are actually meaningful? Then streams would be usable without the commit callback dance... ah, no, that wouldn't guarantee that the committed entry was applied on the current follower yet. So maybe that + make the stream wait for commit before reporting success.
- Errors are not differentiated, the outcome of append is either "ok (... kinda ok, see previous item)" or "stream broke lol". Error responses are not sent from leader.
- User has to implement retry backoff for when stream breaks.
Feels like too many footguns for a user-facing production feature.
It may make sense for someone to spend the time to implement this feature in a more friendly and robust way (with much more code and hence higher probability of bugs and more testing). I don't want to be that someone :) . Until that happens, maybe it should stay a clickhouse-only thing instead of a very half-baked nuraft thing? Or is it better to add an unfriendly implementation and leave it to someone else to improve it? Idk. At least the doc should be made very clear about the limitations instead of being a marketing copy.
@antonio2368 , plz be the judge :)
| @@ -0,0 +1,102 @@ | |||
| /************************************************************************ | |||
| Copyright 2024-present ClickHouse, Inc. | |||
There was a problem hiding this comment.
(Huh, how does copyright work for open source contributions and forks? If we upstream this, will nuraft end up with some kind of mixed copyright? Does the license allow that? What exactly is the purpose of these comments? I wonder what would happen if we just have the comment say that this comment itself comes WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND regarding its accuracy or legal status :P )
There was a problem hiding this comment.
It short: it ok, and may stay like that. There are no issues even if it will be upstreamed :) They can add own copyrights nearby. The licence explicitly allows others to use code written by you
Example of how it may look:
https://github.com/llvm/llvm-project/blob/da6ca203a677f51764cc9dfa1bbc9660a5c199e3/polly/lib/External/isl/isl_aff.c#L2-L9
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND regarding its accuracy or legal status
I'm not sure if legal departments will share that sense of humor :)
| bs.put_i32(resp->get_result_code()); | ||
| } | ||
|
|
||
| // Teardown decision for STREAM_FORWARDING. Computed here, after |
There was a problem hiding this comment.
Stopping here is too late, we may have already received and processed the next message oh I see, the assumption is that if leader rejects an append it'll reject all subsequent appends with the same term. No transient rejections. A slightly stronger assumption than before: the if ((flags_ & STREAM_FORWARDING) && !resp->get_accepted()) thing above allowed transient synchronous rejections without breaking no-gap.
But late rejection through write_resp_meta_ should be considered incompatible with STREAM_FORWARDING, we may have already received and processed the next message.
I'd remove all this and revert to the if ((flags_ & STREAM_FORWARDING) && !resp->get_accepted()) thing above.
| * CANCELLED — async_replication disabled on the cluster config | ||
| * NOT_LEADER — no current leader or term==0 | ||
| * FAILED — leader not in cluster config, rpc_client creation | ||
| * failed, or transport does not support pipelining |
There was a problem hiding this comment.
What should KeeperDispatcher2 do if transport doesn't support pipelining? I think we should:
- allow creating a stream anyway,
- make the stream notice that transport doesn't support pipelining, and keep track of whether a request is in flight (just another atomic next to
broken_), - add method
is_ready()to it to indicate whetherappendcan be called right now; if transport supports pipelining,is_readyis always true, otherwise it's true if there's no request in flight, - make KeeperDispatcher2's loop check
is_ready(and busy-wait for it just like for everything else).
|
|
||
| bool client_req_stream::is_broken() const { | ||
| if (state_->broken_.load(std::memory_order_acquire)) return true; | ||
| auto srv = state_->srv_.lock(); |
There was a problem hiding this comment.
Comment says this function is just 2 atomic loads, but this adds something like +1 atomic CAS and +1 atomic dec (miiiight be contended). Maybe use a non-weak pointer here? (Maybe add a non-weak pointer outside state_, or maybe use a plain pointer and say that client_req_stream is not allowed to outlive the raft_server.)
| * observation, safe from any thread, cheap (two atomic loads). | ||
| */ | ||
| bool is_broken() const; | ||
|
|
There was a problem hiding this comment.
Need mark_as_broken(). KeeperAppendStream has markAsBroken(), and KeeperRequestDispatcher2 uses it, and it would be mildly inconvenient to add another atomic+check in (actually for that particular call site it wouldn't be a big deal to go without it, but still seems more natural to have it).KeeperRequestDispatcher2 if it weren't available
There was a problem hiding this comment.
What's odd. If owner of the stream thinks that the stream is bad - it should just destroy it and create a new one, why should it keep communicating with a broken stream, saying 'hey stream, bad news for you - i think you're broken' :)
| // One-shot latch so a partially-queued send can't let both the handler | ||
| // and the catch block complete `res`. | ||
| auto completed = cs_new<std::atomic<bool>>(false); | ||
| auto try_complete = |
There was a problem hiding this comment.
Nit: send isn't even supposed to throw. Drop this and just set state_->broken_ and rethrow, and let the caller deal with ambiguity about whether the callback will be called or not (clickhouse can probably just std::abort).
| * `accepted=false` with a non-OK `result_code`. Callers should | ||
| * distinguish `CANCELLED` (client-side short-circuit when already | ||
| * broken, no I/O) from anything else (leader's verbatim code on | ||
| * remote rejection, or `FAILED` on transport exception). |
There was a problem hiding this comment.
Should they distinguish it? Neither of the existing callers care.
"leader's verbatim code on remote rejection" - no, leader doesn't send response on error in STREAM_FORWARDING mode. This function doesn't meaningfully differentiate errors, there are just two outcomes: "ok, here's the log_idx from leader (but the entry is not committed, it may still get lost)", or "error of some kind, idk". Just list those outcomes in the comment.
(This and the required polling of is_broken make this whole feature quite user unfriendly. It feels weird to have it presented as a real nuraft feature.)
There was a problem hiding this comment.
+1. Most of nuraft code just return nulls + write something to log.
| ptr<std::exception> ret_err = | ||
| err ? std::static_pointer_cast<std::exception>(err) : nullptr; | ||
| cmd_result_code code = cmd_result_code::OK; | ||
|
|
||
| if (err || !resp) { | ||
| state->broken_.store(true, std::memory_order_release); | ||
| code = cmd_result_code::FAILED; | ||
| } else if (!resp->get_accepted()) { | ||
| state->broken_.store(true, std::memory_order_release); | ||
| code = resp->get_result_code(); | ||
| // Invariant: accepted=false always carries non-OK | ||
| // (reject_cli_req). OK here means protocol violation. | ||
| if (code == cmd_result_code::OK) code = cmd_result_code::FAILED; | ||
| } else { | ||
| res->accept(); | ||
| ret_buf = resp->get_ctx(); | ||
| } | ||
|
|
||
| try_complete(*completed, ret_buf, ret_err, code); |
There was a problem hiding this comment.
This amount of code seems to oversell how much meaningful error differentiation is possible here. Just do if (err || !resp || !resp->get_accepted() || resp->get_result_code() != cmd_result_code::OK) { broken_ and FAILED } else { get_ctx() and OK }?
Or at least: in the else, need to propagate resp->get_result_code()? I don't think accepted necessarily means success.
| if (!cfg || !cfg->is_async_replication()) { | ||
| p_wn("open_client_req_stream: refused - async_replication is disabled"); | ||
| return fail(cmd_result_code::CANCELLED); | ||
| } |
There was a problem hiding this comment.
Allow this, it doesn't break anything, we want KeeperRequestDispatcher2 to work in this case.
I actually was not planning to push it up, make it official etc. (i don't care). The problem that PR tries to solve: Keeper code get 'polluted' / poisoned with many nuraft internals. That tight coupling may become a big headache in case of eventual upgrades of nuraft or other changes in nuraft code. All nuraft-low level things should live closer to other nuraft code (in the same submodule at least). When you see nuraft code in context of other nuraft code - you know how it's used, how to review it / make it robust etc. Eventually you can transform it into a real feature. When you see some nuraft internals in ClickHouse repo you can't really review that / follow that. I'm ok with not 'overselling' it (actually my plan was just to move it 'as is' from keeper code to nuraft, but with AI it went a bit wider :) |
e644e20 to
608a804
Compare
|
Moved a simplified version of this into #98 |
It's a PR to #98 :)
Summary
Replace the ad-hoc "recipe in a comment" contract that Keeper followed to forward pipelined client requests to the leader with a typed NuRaft abstraction:
raft_server::open_client_req_stream()returns aclient_req_streamthat is single-channel, ordered, fail-closed, and term-fenced to the current leader. The local fast path delegates toappend_entries_ext; the remote path owns onerpc_clientand sendsclient_requestmessages with theSTREAM_FORWARDINGwire flag.Along the way, fixes a real back-compat bug on the server side and tightens several latent failure modes that the old hand-rolled code had.
Fixes and improvements vs. baseline
correctness fixes
expected_term_gating on the wire flag. Before this change, any non-zeroclient_request.termwas treated as a term fence on the leader, even for callers that didn't opt into STREAM_FORWARDING. That broke any externalraft_server_handler::process_requser whoseclient_requesthappened to carry a term. The check is now strictly gated on theSTREAM_FORWARDINGwire flag.Empty-batch no-op. Old Keeper code, if called with an empty
requests_for_sessions, forwarded an empty batch toappend_entries_ext, which returnsaccepted=false, OK— old code then treated that as fatal and broke the stream. The stream now short-circuits empty batches.Local-path exception safety. If
append_entries_extthrows synchronously (e.g.store_log_entrymid-batch fail), the old code let the exception propagate without latching the Keeperis_brokenflag, so subsequentisBroken()checks returnedfalseon an already-broken stream. The new path latchesbrokenin atry/catchbefore rethrowing.Remote-path send-throw + double-completion. If
rpc_client::sendthrew synchronously after partially queuing the handler, both the handler and the catch block could complete thecmd_result. A one-shot CAS latch now guarantees exactly one completion wins.Meta-callback teardown race. The STREAM_FORWARDING teardown-on-reject decision is now computed inside
on_resp_readyafterwrite_resp_meta_runs. Without this, a meta callback that mutatesresp->accepted_could desync the teardown from the serialized response and leave the session open on a rejection.API hygiene / robustness
Typed API.
raft_server::open_client_req_stream→client_req_stream. Keeper no longer hand-buildsreq_msg, flipsSTREAM_FORWARDING_REQUEST, or reaches intoasio_service::create_client/get_config()->get_server()->get_endpoint().async_replicationprecondition checked at open time. Old code silently serialized behindclient_req_timeout_in sync mode, collapsing throughput with no diagnostic.streaming_mode_precondition checked via newrpc_client::supports_pipelining()virtual (defaultfalse;asio_rpc_clientoverrides to returnstreaming_mode_). Without it, a second in-flightappendused to hit the busy-socket assertion inasio_rpc_client::register_req_send.Lifetime safety. Stream holds
wptr<raft_server>; a destroyed server makes the stream permanently broken rather than dereferencing stale state. In-flight completion callbacks share aptr<shared_state>so they can still latch broken after the stream is destroyed.Correct
srcon forwardedreq_msg. Old Keeper usedsrc=0, which is a legal server id and confused peer-id-sensitivecb_funchooks. The stream now stamps the local server id.Diagnostic factory return.
open_client_req_streamtakes an optionalcmd_result_code* out_errso callers can log the specific reason the stream could not be opened (CANCELLED/NOT_LEADER/FAILED). Keeper uses this for log lines.Invariant / maintainability
Structural no-gap invariant in
handle_cli_req. Newreject_cli_reqhelper is the sole funnel foraccepted=falseresponses, all of them in the pre-store phase. A three-phase banner (validate / append+store / completion) makes the structural rule visible, and a phase-3 runtime guard (p_ft+assert(false)) catches any future regression that would silently break STREAM_FORWARDING's close-on-reject correctness.supports_pipelining()transport capability. New virtual onrpc_client; alternate transports can cleanly opt in.Docs moved to
docs/stream_forwarding.md. Semantics, preconditions, wire-flag mechanics, no-gap invariant proof sketch, and an Operator Runbook for rolling upgrades are in a discoverable file instead of a 40-line comment inasio_service.cxx.Umbrella include.
nuraft.hxxnow exposesclient_req_streamso consumers get the full API via the standard entry point.Reduced Keeper ↔ NuRaft coupling. Keeper went from ~8 direct NuRaft internal symbols (
req_msg,msg_type,log_val_type,STREAM_FORWARDING_REQUEST,asio_service::create_client,get_config(),srv_config::get_endpoint(),rpc_client::send) down to one:raft_instance->open_client_req_stream(timeout_ms, &err).Unit tests. Three deterministic tests cover previously-uncovered behaviors:
async_replicationrefusal with reason-code check, local happy path, and empty-batch no-op.Consciously not fixed (documented scope)
Rolling-upgrade silent degrade. Against a peer that doesn't understand
STREAM_FORWARDINGthe flag is silently ignored and the no-gap guarantee doesn't hold during the upgrade window. A per-feature ACK bit was considered and explicitly declined (would reinvent capability negotiation per-feature, wrong abstraction for NuRaft).Feature follows the same deployment discipline as
async_replicationdid: precondition + operator runbook, not runtime enforcement. The correct long-term answer is a general peer-capability handshake, outside the scope of this change.cb_func::ReturnNullcombined with stream forwarding. Documented as unsupported.PreAppendLogLeader/AppendLogs/AppendLogFailedReturnNullmid-batch always closed the sessionwith a stored prefix — latent corruption on retry. Original NuRaft ABI retained as-is for this PR; fixing it cleanly needs a version bump/CHANGELOG path that this fork does not expose.
Mid-batch
store_log_entrythrow. The stored prefix will replicate; the stream breaks. No layer can fix this — callers need state-machine-level dedup if they want exactly-once on retry.Verification
client_req_stream_test3/3,raft_server_test21/21.test_keeper_multinode_simpleintegration suite: 5/5.Keeper consumer
That makes
src/Coordination/KeeperAppendStream.cpplook like that: