Skip to content

Harden ConnectionManager#close against non-StandardError during @sock.close#69

Merged
danmayer merged 1 commit intomainfrom
fiber_close_ensure_hardening
Apr 27, 2026
Merged

Harden ConnectionManager#close against non-StandardError during @sock.close#69
danmayer merged 1 commit intomainfrom
fiber_close_ensure_hardening

Conversation

@danmayer
Copy link
Copy Markdown

Summary

Follow-up to #68. Closes a second state-leak pathway on the same fiber-cancellation theme: when a non-StandardError (e.g. Async::Stop) is fired into a fiber that's already unwinding through Dalli's cleanup, it can escape @sock.close and leave the client in a dirty state that gets returned to the connection pool.

The bug

With #68 merged, Base#request now calls close from an ensure when a non-StandardError aborts a request. That fix is correct — but ConnectionManager#close is itself not fully hardened against non-StandardError escapes:

def close
  return unless @sock

  begin
    @sock.close
  rescue StandardError   # ← only StandardError
    nil
  end
  @sock = nil            # ← skipped if @sock.close raises non-StandardError
  @pid = nil             # ← skipped
  abort_request!         # ← skipped
end

If the scheduler fires a second Async::Stop into the fiber while it's inside this cleanup (e.g. a timeout that triggers cancellation of already-cancelling work), @sock.close can raise a non-StandardError. The rescue StandardError doesn't catch it, the three post-lines never run, and the exception propagates. The client goes back to the ConnectionPool with @request_in_progress == true and @sock still pointing at a half-closed FD.

The next fiber that checks out the client hits confirm_ready!close if request_in_progress?, which attempts the same close. If that is also interrupted, the dirty-client reuse loops until the scheduler stops re-cancelling — during which window cross-response byte pollution on the socket is possible, matching the production symptom of "reads occasionally returning unexpected data."

The fix

Move the state reset into an ensure, so @sock = nil / @pid = nil / abort_request! always run regardless of what @sock.close raises. The non-StandardError continues to propagate (we don't widen the rescue), so callers still see the cancellation.

def close
  return unless @sock

  begin
    @sock.close
  rescue StandardError
    nil
  ensure
    @sock = nil
    @pid = nil
    abort_request!
  end
end

Test

test/integration/test_fiber_concurrency.rb grows a second case that:

  1. Warms the connection via dc.set
  2. Overrides @sock.close to raise FiberCancellation (Exception subclass, stand-in for Async::Stop)
  3. Overrides read_line to raise FiberCancellation (first cancellation, triggers the ensure path from Clean up connection state on non-StandardError aborts #68)
  4. Asserts the client is not request_in_progress? and not connected? after the double-exception unwinds

Verified the test fails on main (pre-fix) with the exact state leak this PR targets:

@request_in_progress=true, @sock=#<Dalli::Socket::TCP:fd 7>

and passes with the ensure applied.

Test plan

….close

A second Async::Stop fired into a fiber that's already unwinding through
Base#request's ensure cleanup can escape @sock.close. The existing
rescue only catches StandardError, so the non-StandardError skips
@sock = nil / @pid = nil / abort_request! and the client is returned
to the pool with a half-closed socket and @request_in_progress == true.

Moves the state reset into an ensure so it runs regardless of what
@sock.close raises.
@danmayer danmayer merged commit 3c80c18 into main Apr 27, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants