fix(ffi): segfault when threadsafe JSCallback invoked from multiple native threads#28115
fix(ffi): segfault when threadsafe JSCallback invoked from multiple native threads#28115
Conversation
|
Updated 2:45 PM PT - Apr 29th, 2026
❌ @autofix-ci[bot], your commit 9c80f24 has 2 failures in
🧪 To try this PR locally: bunx bun-pr 28115That installs a local version of the PR into your bun-28115 --bun |
WalkthroughFFICallbackFunctionWrapper is made thread-safe: it now derives from ThreadSafeRefCounted, caches the ScriptExecutionContextIdentifier, uses Ref/leakRef for creation and deref for destruction, and the threadsafe callback path captures a Ref and cached context id before posting work. A regression test exercising multi-threaded callbacks was added. Changes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/regression/issue/28113.test.ts`:
- Around line 48-52: The test currently uses a fixed 2s sleep to wait for the
JSCallback to run (counter and callback / JSCallback setup); replace that sleep
with an awaited, deterministic loop that polls the counter until it equals the
expected value (or throws after a reasonable timeout) so the test awaits the
completion condition instead of waiting a fixed time; apply the same change to
the other occurrence referenced (around the second sleep at lines 56-57) and
ensure the wait has a clear timeout guard to fail fast if the callback never
runs.
- Around line 94-95: Remove the brittle negative crash-string assertions by
deleting the two lines that assert on stderr:
expect(stderr).not.toContain("Segmentation fault"); and
expect(stderr).not.toContain("Bus error"); in the test (where stderr is
asserted). Keep the existing exit-code and behavioral assertions (do not add
replacement checks) so the test relies on exit/status semantics instead of
searching output for panic strings.
- Line 4: The skip condition only excludes Windows ARM64 but the test uses
pthreads which are unavailable under MSVC on any Windows host; update the flag
used to skip the test (isFFIUnavailable) to treat all Windows builds as
unavailable (e.g., change the definition of isFFIUnavailable from "isWindows &&
isArm64" to simply "isWindows") so the pthread-based test is skipped on Windows;
modify the declaration of isFFIUnavailable referenced by the test to reflect
this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2554b051-4dd0-4ee0-873f-51662ce4259b
📒 Files selected for processing (2)
src/bun.js/bindings/JSFFIFunction.cpptest/regression/issue/28113.test.ts
|
@robobun adopt and verify |
|
✅ Pushed two fixes for review feedback in 0663aea:
|
|
Epic. The fix appears to work! ⭐ Thank you Claude Code bot. |
…Callback
FFI_Callback_threadsafe_call is the trampoline for JSCallback({ threadsafe: true })
and runs on arbitrary native threads. It was capturing the
FFICallbackFunctionWrapper by value in the postTaskTo lambda, which invokes
JSC::Strong<>'s copy constructor. That constructor calls HandleSet::allocate()
and writeBarrier() on the VM's HandleSet — a singly-linked free list with no
locking — racing with the JS thread and corrupting the strong-handle list.
Cache the ScriptExecutionContextIdentifier (a plain uint32_t) in the wrapper at
construction time (on the JS thread). Make the wrapper ThreadSafeRefCounted and
capture a Ref<> in the lambda instead of copying it, so no Strong<> is touched
off-thread and the wrapper survives a concurrent close() while tasks are still
queued.
The new test spawns 4 pthreads that each fire the callback 5000 times while the
JS thread churns HandleSet allocations via JSCallback create/close. Under
debug+ASAN the unfixed build reliably hits
'member call on null pointer of type JSC::HandleSet' in Strong.h from the
corrupted free list; the fixed build receives all 20000 callbacks.
Closes #28113
f8a231e to
436fcd4
Compare
|
Rebased onto current Same core fix — cached New test ( |
| // TinyCC (and all of bun:ffi) is disabled on Windows ARM64. | ||
| // On Windows x64 there is no system `cc`, so skip there too — the bug being | ||
| // covered (JSC::Strong<> copied on a non-JS thread) is platform-independent. | ||
| const canRun = !isWindows && !(isWindows && isArm64); |
There was a problem hiding this comment.
🟡 Nit: !isWindows && !(isWindows && isArm64) is logically equivalent to just !isWindows — the second clause can never affect the result (if isWindows is false the first clause already passes; if true it already fails). Consider simplifying to const canRun = !isWindows; and dropping the now-unused isArm64 import from harness.
Extended reasoning...
What
Line 8 of test/js/bun/ffi/ffi-threadsafe-callback.test.ts reads:
const canRun = !isWindows && !(isWindows && isArm64);This expression is a tautology over !isWindows. The second conjunct !(isWindows && isArm64) is dead code, and as a consequence isArm64 (imported on line 3) is effectively unused.
Step-by-step proof
Enumerate the two cases for isWindows:
isWindows = false→ first clause!isWindowsistrue. Second clause:isWindows && isArm64isfalse && X = false, so!(false) = true. Result:true && true = true. Same as!isWindows.isWindows = true→ first clause!isWindowsisfalse.&&short-circuits; the second clause is never evaluated. Result:false. Same as!isWindows.
In both cases the result equals !isWindows regardless of isArm64, so isArm64 contributes nothing and the import on line 3 is unused.
Why existing code doesn't prevent it
There's no lint rule catching tautological boolean sub-expressions here, and TypeScript's noUnusedLocals doesn't flag isArm64 because it is syntactically referenced — just in dead code.
Addressing the "documentary purpose" objection
One could argue the two-clause form mirrors the two-line comment above it (Windows ARM64 lacks TinyCC; Windows x64 lacks cc). But that argument doesn't hold up: the comment already fully documents both reasons, and the second boolean clause doesn't add independent information — it's a strict subset of the first (isWindows && isArm64 ⊆ isWindows). If anything, leaving it in is mildly misleading: a reader skimming the expression might assume there's some Windows-non-ARM64 case that can run, when there isn't. The comment is the right place for the rationale; the code should just say what it does.
Impact
Zero behavioral impact — the test skips on exactly the same platforms either way. This is purely a readability/cleanliness nit: a redundant clause and an unused import in a brand-new test file.
Fix
import { bunEnv, bunExe, isMacOS, isWindows, tempDir } from "harness";
// TinyCC (and all of bun:ffi) is disabled on Windows ARM64.
// On Windows x64 there is no system `cc`, so skip there too — the bug being
// covered (JSC::Strong<> copied on a non-JS thread) is platform-independent.
const canRun = !isWindows;|
Independently hit this and pushed a minimal variant to The test on my branch may be useful as an alternative/addition: it dlopen's |
Problem
FFI_Callback_threadsafe_callis the trampoline fornew JSCallback(fn, { threadsafe: true })and is invoked from arbitrary native threads — that's its entire purpose. It was capturingFFICallbackFunctionWrapperby value in thepostTaskTolambda:That copy invokes
JSC::Strong<>'s copy constructor on the calling native thread, which callsHandleSet::allocate()andwriteBarrier().HandleSetis a non-locked singly-linked free list + sentinel list owned by the VM. Mutating it from a non-JS thread races with the JS thread (which churns the same lists on everyStrong<>create/destroy and during GC marking), corrupting the handle lists.It also called
wrapper.globalObject.get()on the foreign thread to fish out the script execution context, reading aHandleSlotconcurrently with GC.Repro
— from
test/js/bun/ffi/ffi-threadsafe-callback.test.ts, which spawns 4 pthreads each firing a threadsafeJSCallback5000× while the JS thread creates/closes throwawayJSCallbacks to contend on the sameHandleSet. Under debug+ASAN the unfixed build fails 5/5 runs within ~1s.Fix
ScriptExecutionContextIdentifier(a plainuint32_t) in the wrapper at construction time (on the JS thread).FFICallbackFunctionWrapperThreadSafeRefCountedand capture aRef<>in the lambda instead of copying it. Creating aRefis just an atomic increment; theStrong<>members are never copied.FFICallbackFunctionWrapper_destroybecomesderef(), so the wrapper survives aclose()that races with already-queued tasks.The posted task still runs on the JS thread and dereferences
wrapperRef->m_functionthere, which is safe.Verification
bun bd test test/js/bun/ffi/ffi-threadsafe-callback.test.tsHandleSet/SentinelLinkedListAll existing
test/js/bun/ffi/*tests pass.Closes #28113