
fix(ffi): segfault when threadsafe JSCallback invoked from multiple native threads#28115

Open
robobun wants to merge 2 commits into main from
claude/fix-ffi-threadsafe-callback-segfault


@robobun
Collaborator

@robobun robobun commented Mar 14, 2026

Problem

FFI_Callback_threadsafe_call is the trampoline for new JSCallback(fn, { threadsafe: true }) and is invoked from arbitrary native threads — that's its entire purpose. It was capturing FFICallbackFunctionWrapper by value in the postTaskTo lambda:

WebCore::ScriptExecutionContext::postTaskTo(..., [argsVec = WTF::move(argsVec), wrapper](...) { ... });
//                                                                               ^^^^^^^ copy

That copy invokes JSC::Strong<>'s copy constructor on the calling native thread, which calls HandleSet::allocate() and writeBarrier(). The HandleSet is owned by the VM and consists of a singly-linked free list plus a sentinel list, neither of which is protected by a lock. Mutating it from a non-JS thread races with the JS thread (which churns the same lists on every Strong<> create/destroy and during GC marking) and corrupts the handle lists.

It also called wrapper.globalObject.get() on the foreign thread to fish out the script execution context, reading a HandleSlot concurrently with GC.

Repro

Strong.h:147:46: runtime error: member call on null pointer of type 'JSC::HandleSet'
SentinelLinkedList.h:212:11: runtime error: member call on null pointer of type 'WTF::BasicRawSentinelNode<JSC::HandleNode>'

These errors come from test/js/bun/ffi/ffi-threadsafe-callback.test.ts, which spawns 4 pthreads, each firing a threadsafe JSCallback 5000 times, while the JS thread creates and closes throwaway JSCallbacks to contend on the same HandleSet. Under debug+ASAN the unfixed build fails 5/5 runs within ~1s.

Fix

  • Cache ScriptExecutionContextIdentifier (a plain uint32_t) in the wrapper at construction time (on the JS thread).
  • Make FFICallbackFunctionWrapper ThreadSafeRefCounted and capture a Ref<> in the lambda instead of copying it. Creating a Ref is just an atomic increment; the Strong<> members are never copied.
  • FFICallbackFunctionWrapper_destroy becomes deref(), so the wrapper survives a close() that races with already-queued tasks.

The posted task still runs on the JS thread and dereferences wrapperRef->m_function there, which is safe.

Verification

bun bd test test/js/bun/ffi/ffi-threadsafe-callback.test.ts
before: 5/5 fail (UBSan null HandleSet / SentinelLinkedList)
after: 5/5 pass (~1.7s), all 20000 callbacks delivered

All existing test/js/bun/ffi/* tests pass.

Closes #28113

@robobun
Collaborator Author

robobun commented Mar 14, 2026

Updated 2:45 PM PT - Apr 29th, 2026

@autofix-ci[bot], your commit 9c80f24 has 2 failures in Build #49191 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 28115

That installs a local version of the PR into your bun-28115 executable, so you can run:

bun-28115 --bun

@coderabbitai
Contributor

coderabbitai Bot commented Mar 14, 2026

Walkthrough

FFICallbackFunctionWrapper is made thread-safe: it now derives from ThreadSafeRefCounted, caches the ScriptExecutionContextIdentifier, uses Ref/leakRef for creation and deref for destruction, and the threadsafe callback path captures a Ref and cached context id before posting work. A regression test exercising multi-threaded callbacks was added.

Changes

| Cohort / File(s) | Summary |
|---|---|
| FFI Callback Thread-Safety (src/bun.js/bindings/JSFFIFunction.cpp) | FFICallbackFunctionWrapper now derives from ThreadSafeRefCounted<...> and adds public WebCore::ScriptExecutionContextIdentifier m_contextIdentifier initialized from globalObject->scriptExecutionContext()->identifier(). Creation uses Ref<...> with leakRef(); FFICallbackFunctionWrapper_destroy calls deref(). FFI_Callback_threadsafe_call now captures a Ref<FFICallbackFunctionWrapper> (and caches contextId) when posting the task and accesses the function via the captured ref; added thread-safety/lifetime comments. |
| Regression Test (test/regression/issue/28113.test.ts) | New regression test that builds a native repro (pthreads-based), loads it via Bun FFI, registers a { threadsafe: true } JSCallback, and exercises multiple native threads (4 × 1000 callbacks) verifying the callback counter reaches expected value; test is skipped on Windows. |
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main fix: preventing a segfault in the threadsafe FFI callback path when invoked from multiple native threads. |
| Linked Issues check | ✅ Passed | The pull request code changes directly address the segfault issue #28113 by making FFICallbackFunctionWrapper thread-safe and avoiding JSC object access from non-JS threads. |
| Out of Scope Changes check | ✅ Passed | All changes (JSFFIFunction.cpp and the regression test) are directly scoped to fixing the threadsafe callback segfault and validating the fix. |
| Description check | ✅ Passed | The pull request provides a comprehensive description with detailed problem statement, root cause analysis, fix explanation, and verification results. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/regression/issue/28113.test.ts`:
- Around line 48-52: The test currently uses a fixed 2s sleep to wait for the
JSCallback to run (counter and callback / JSCallback setup); replace that sleep
with an awaited, deterministic loop that polls the counter until it equals the
expected value (or throws after a reasonable timeout) so the test awaits the
completion condition instead of waiting a fixed time; apply the same change to
the other occurrence referenced (around the second sleep at lines 56-57) and
ensure the wait has a clear timeout guard to fail fast if the callback never
runs.
- Around line 94-95: Remove the brittle negative crash-string assertions by
deleting the two lines that assert on stderr:
expect(stderr).not.toContain("Segmentation fault"); and
expect(stderr).not.toContain("Bus error"); in the test (where stderr is
asserted). Keep the existing exit-code and behavioral assertions (do not add
replacement checks) so the test relies on exit/status semantics instead of
searching output for panic strings.
- Line 4: The skip condition only excludes Windows ARM64 but the test uses
pthreads which are unavailable under MSVC on any Windows host; update the flag
used to skip the test (isFFIUnavailable) to treat all Windows builds as
unavailable (e.g., change the definition of isFFIUnavailable from "isWindows &&
isArm64" to simply "isWindows") so the pthread-based test is skipped on Windows;
modify the declaration of isFFIUnavailable referenced by the test to reflect
this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2554b051-4dd0-4ee0-873f-51662ce4259b

📥 Commits

Reviewing files that changed from the base of the PR and between 10bdb48 and a69e8cd.

📒 Files selected for processing (2)
  • src/bun.js/bindings/JSFFIFunction.cpp
  • test/regression/issue/28113.test.ts

@alii
Member

alii commented Mar 14, 2026

@robobun adopt and verify

@robobun
Collaborator Author

robobun commented Mar 14, 2026

✅ Pushed two fixes for review feedback in 0663aea:

  • StandaloneModuleGraph: only switch to .utf8 when output contains non-ASCII, preserving the zero-copy .latin1 path for pure-ASCII bundles.
  • OutputFile.toBunString: added defer buf.allocator.free(buf.bytes) before the cloneUTF8 return to avoid leaking the buffer.

@hsjoberg

hsjoberg commented Mar 14, 2026

Epic.

The fix appears to work! ⭐
I tested in my repro https://github.com/hsjoberg/bun-ffi-crash (detailed in #28113).
Even tested a more insane stress test hammering the callback via 128 worker threads.

Thank you Claude Code bot.

…Callback

FFI_Callback_threadsafe_call is the trampoline for JSCallback({ threadsafe: true })
and runs on arbitrary native threads. It was capturing the
FFICallbackFunctionWrapper by value in the postTaskTo lambda, which invokes
JSC::Strong<>'s copy constructor. That constructor calls HandleSet::allocate()
and writeBarrier() on the VM's HandleSet — a singly-linked free list with no
locking — racing with the JS thread and corrupting the strong-handle list.

Cache the ScriptExecutionContextIdentifier (a plain uint32_t) in the wrapper at
construction time (on the JS thread). Make the wrapper ThreadSafeRefCounted and
capture a Ref<> in the lambda instead of copying it, so no Strong<> is touched
off-thread and the wrapper survives a concurrent close() while tasks are still
queued.

The new test spawns 4 pthreads that each fire the callback 5000 times while the
JS thread churns HandleSet allocations via JSCallback create/close. Under
debug+ASAN the unfixed build reliably hits
'member call on null pointer of type JSC::HandleSet' in Strong.h from the
corrupted free list; the fixed build receives all 20000 callbacks.

Closes #28113
@robobun robobun force-pushed the claude/fix-ffi-threadsafe-callback-segfault branch from f8a231e to 436fcd4 Compare April 29, 2026 17:49
@robobun
Collaborator Author

robobun commented Apr 29, 2026

Rebased onto current main (was 6 weeks stale with conflicts) and replaced the regression test.

Same core fix — cached ScriptExecutionContextIdentifier, ThreadSafeRefCounted wrapper, Ref<> capture.

New test (test/js/bun/ffi/ffi-threadsafe-callback.test.ts + threadsafe-callback.c) — the previous test had the JS thread blocked inside pthread_join while the worker threads fired, so there was no contention on HandleSet from the JS side and the race wasn't reliably triggered. The new test keeps the JS thread actively churning JSCallback create/close (each a pair of Strong<> alloc/free) while 4 worker threads fire 20k callbacks total. Unfixed debug+ASAN: 5/5 UBSan member call on null pointer of type 'JSC::HandleSet'. Fixed: 5/5 pass in ~1.7s.

// TinyCC (and all of bun:ffi) is disabled on Windows ARM64.
// On Windows x64 there is no system `cc`, so skip there too — the bug being
// covered (JSC::Strong<> copied on a non-JS thread) is platform-independent.
const canRun = !isWindows && !(isWindows && isArm64);
Contributor


🟡 Nit: !isWindows && !(isWindows && isArm64) is logically equivalent to just !isWindows — the second clause can never affect the result (if isWindows is false the first clause already passes; if true it already fails). Consider simplifying to const canRun = !isWindows; and dropping the now-unused isArm64 import from harness.

Extended reasoning...

What

Line 8 of test/js/bun/ffi/ffi-threadsafe-callback.test.ts reads:

const canRun = !isWindows && !(isWindows && isArm64);

This expression is logically equivalent to !isWindows. The second conjunct !(isWindows && isArm64) never changes the result, so it is dead code, and as a consequence isArm64 (imported on line 3) is effectively unused.

Step-by-step proof

Enumerate the two cases for isWindows:

  1. isWindows = false → first clause !isWindows is true. Second clause: isWindows && isArm64 is false && X = false, so !(false) = true. Result: true && true = true. Same as !isWindows.
  2. isWindows = true → first clause !isWindows is false. && short-circuits; the second clause is never evaluated. Result: false. Same as !isWindows.

In both cases the result equals !isWindows regardless of isArm64, so isArm64 contributes nothing and the import on line 3 is unused.

Why existing code doesn't prevent it

There's no lint rule catching tautological boolean sub-expressions here, and TypeScript's noUnusedLocals doesn't flag isArm64 because it is syntactically referenced — just in dead code.

Addressing the "documentary purpose" objection

One could argue the two-clause form mirrors the two-line comment above it (Windows ARM64 lacks TinyCC; Windows x64 lacks cc). But that argument doesn't hold up: the comment already fully documents both reasons, and the second boolean clause doesn't add independent information, because it's a strict subset of the first (isWindows && isArm64 ⇒ isWindows). If anything, leaving it in is mildly misleading: a reader skimming the expression might assume there's some Windows-non-ARM64 case that can run, when there isn't. The comment is the right place for the rationale; the code should just say what it does.

Impact

Zero behavioral impact — the test skips on exactly the same platforms either way. This is purely a readability/cleanliness nit: a redundant clause and an unused import in a brand-new test file.

Fix

import { bunEnv, bunExe, isMacOS, isWindows, tempDir } from "harness";

// TinyCC (and all of bun:ffi) is disabled on Windows ARM64.
// On Windows x64 there is no system `cc`, so skip there too — the bug being
// covered (JSC::Strong<> copied on a non-JS thread) is platform-independent.
const canRun = !isWindows;

@robobun
Collaborator Author

robobun commented May 1, 2026

Independently hit this and pushed a minimal variant to farm/c5575d59/ffi-threadsafe-handleset before finding this PR — capture &wrapper by reference + WTF_MAKE_NONCOPYABLE(FFICallbackFunctionWrapper). The ThreadSafeRefCounted + cached m_contextIdentifier approach here is more thorough (survives close() racing queued tasks, and avoids reading Strong<>::get() off-thread entirely), so deferring to this one.

The test on my branch may be useful as an alternative/addition: it dlopen's pthread_create/pthread_join directly (no system cc required) and runs 256 batches of 8 concurrent pthreads through the callback. Under bun bd it fails 20/20 without the fix (HandleSet::writeBarrier / SentinelLinkedList assertions) and passes 20/20 in ~2s with it; release bun segfaults ~40% of runs without the fix.



Development

Successfully merging this pull request may close these issues.

Segfault when native code repeatedly invokes JSCallback({ threadsafe: true })

3 participants