Skip to content

Hermes#176

Draft
CedricGuillemet wants to merge 20 commits into
BabylonJS:mainfrom
CedricGuillemet:hermes-integration
Draft

Hermes#176
CedricGuillemet wants to merge 20 commits into
BabylonJS:mainfrom
CedricGuillemet:hermes-integration

Conversation

@CedricGuillemet
Copy link
Copy Markdown
Collaborator

No description provided.

@CedricGuillemet CedricGuillemet changed the title Hermes + Napi Hermes Jun 3, 2026
CedricGuillemet and others added 16 commits June 3, 2026 14:45
Hermes's lib/InternalJavaScript/CMakeLists.txt attaches the generated InternalBytecode.js to two downstream targets (hermesInternalUnit_obj and InternalBytecodeInclude) with no common dependency. The Xcode
ew build system rejects this with is attached to multiple targets ... but none of these is a common dependency of the other(s), breaking the macOS_Xcode264_Hermes CMake generate step.

Switch to the Ninja generator only when js-engine == 'Hermes' so the rest of the macOS matrix keeps using Xcode. Ninja has no such restriction. xcode-select still pins clang and the SDK to the chosen Xcode toolchain. Adjust the test working-directory accordingly (Ninja is single-config and drops binaries directly under the target dir).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These flags are MSVC-only; clang/gcc parse them as missing source-file paths and fail the env_hermes.cc compile on macOS with 'clang++: error: no such file or directory: /wd4068' (and the rest of the wd list).

On non-MSVC compilers clang and gcc accept the offending compound- literal syntax in Hermes's sh_legacy_value.h as a GNU extension under -std=gnu++20, so no suppression is needed there.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On macOS the Xcode generator's Link Frameworks Automatically setting silently pulls in Foundation/CoreFoundation for .mm/.m TUs. The Ninja generator we now use for the Hermes macOS job has no such mechanism, so AppRuntime_macOS.mm's NSLog and UrlLib's UrlRequest_Apple.mm reference to NSBundle resolve to undefined symbols at the final link step:

  Undefined symbols for architecture arm64:

    _NSLog, referenced from ... AppRuntime_macOS.mm.o

    _OBJC_CLASS_\, referenced from ... UrlRequest_Apple.mm.o

Declare the frameworks explicitly and PUBLIC so they end up on the executable's link line for every generator, while not regressing the Xcode-generator builds (specifying the framework twice is a no-op).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hermes ships HERMES_SLOW_DEBUG=ON by default. Those slow asserts compile in even for RelWithDebInfo / Release builds and fire during the Runtime destructor finalizer chain on macOS arm64, killing the macOS_Xcode264_Hermes job with exit code 134 right after 145 passing and before any gtest [OK] line.

Disable the slow checks for our embedded use.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hermes's Runtime destructor runs the full finalizer chain (cleanup hooks, persistent-reference finalizers, instance-data finalizer, and heap finalizeAll()), invoking arbitrary user-provided C++ code from napi_wrap / napi_add_finalizer / etc.  If any of that code throws, the exception escapes Runtime::~Runtime (its destructor is not noexcept) and propagates out of Napi::Detach into the AppRuntime worker thread.  The thread function has no catch, so C++ calls std::terminate -> abort(), killing the process with SIGABRT silently.

Observed on macOS arm64 CI as 'Abort trap: 6' immediately after '145 passing', with zero stderr diagnostic (hermes_fatal isn't involved). The Windows builds happen to swallow this internally and exit cleanly.

Wrap state.runtime.reset() in a try/catch that logs to stderr instead of propagating. By the time Detach runs the tests have already reported their results; re-raising would just re-trigger the abort and leave CI red despite a fully successful test run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Same root-cause class as the Detach try/catch in 2b68f36: Hermes's Runtime::drainJobs() synchronously runs whatever callbacks were registered via addDrainJobsCallback. Our hermesNapi build registers drainPendingFinalizers, which invokes user napi_wrap / napi_add_finalizer C++ callbacks. If one of those throws, the exception escapes drainJobs into AppRuntime::Dispatch's call site (DrainMicrotasks lives OUTSIDE the dispatch try/catch by design — promise continuations need to run regardless of whether the callback threw), and from there out of the worker thread function, triggering std::terminate -> abort() with no diagnostic.

The macOS_Xcode264_Hermes CI run kept SIGABRTing after 145 passing even with the Detach catch in place, which means the throw happens before Detach — most plausibly in the very last DrainMicrotasks call (after the AppRuntime destructor pushed the no-op work item and the worker drained pending finalizers on its way out).

Wrap runtime->drainJobs() in try/catch with a stderr log so the worker can exit cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The macOS_Xcode264_Hermes job has been failing with Abort trap: 6 (SIGABRT, exit code 134) right after 145 passing, with zero stderr diagnostic. My defensive try/catch wrappers around Hermes Runtime teardown (Detach) and drainJobs in the previous two commits did not catch anything, which suggests this is either a raw abort() call from inside Hermes (not a C++ throw) or a std::terminate from somewhere outside the paths I instrumented.

Wrap UnitTests in lldb -b -o run -o 'thread backtrace all' for the Hermes macOS path only, so the next CI run leaves a real backtrace in the log. The exit code is parsed back out of lldb's summary line so the job still fails on real test failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…corruption

Captured a real backtrace from CI by wrapping the macOS Hermes UnitTests run in lldb in the previous commit. The smoking gun is macOS libmalloc:

  UnitTests(25540,0x16fe87000) malloc: *** error for object

      0x9cfbcb930: pointer being freed was not allocated

  malloc: *** set a breakpoint in malloc_error_break to debug

  Process 25540 stopped

  * thread BabylonJS#2, stop reason = signal SIGABRT

Heap corruption during Runtime teardown, not a C++ throw (hence why my Detach / drainJobs try/catch additions never fired).

Hermes's hermes_run_script has two ingest paths for the source buffer: a zero-copy path when the last byte is '\\0' (wraps the buffer in WeirdZeroTerminatedBuffer and keeps the callback alive until the owning BCProvider/RuntimeModule is destroyed, often at ~Runtime), and a copy path otherwise (Hermes copies into its own StdStringBuffer and synchronously invokes the finalize callback before hermes_run_script returns).

The zero-copy path is the one tripping the abort on macOS arm64: my new uint8_t[] pointer lives long enough to interact with whatever teardown sequence corrupts it. Switch Eval to the copy path by passing length (no trailing '\\0'). The buffer now lives only for the duration of the synchronous call and the lifetime question disappears. Plenty cheap for the Mocha bundle (~1 MiB).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous lldb wrapping only printed frame #0 because the inferior's SIGABRT was passing straight through to the kernel and killing the process before the queued thread backtrace all command could run.  Add process handle SIGABRT --pass false --stop true --notify true so lldb intercepts the signal, lets us dump every thread, then continues so the inferior exits with the original signal status.

Also process status --verbose for the per-thread state, in case the backtrace alone isn't enough to identify the bad free site (the heap corruption message comes with an address but no callsite, since macOS libmalloc only logs to stderr before raising).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Queued -o commands after 
un were never executed once SIGABRT killed the inferior — even with process handle --pass false. lldb's -k flag is specifically designed to fire on crash and is the only reliable way to get bt all out of batch mode in this scenario.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Captured a full backtrace via lldb -k (see prior CI commit). The smoking gun:

  frame BabylonJS#5: ___BUG_IN_CLIENT_OF_LIBMALLOC_POINTER_BEING_FREED_WAS_NOT_ALLOCATED

  frame BabylonJS#6: napi_delete_reference at hermes_napi_reference.cpp:113

  frame BabylonJS#7: Napi::Reference<...>::~Reference at napi-inl.h:3262

  frame BabylonJS#8: Napi::ObjectReference::~ObjectReference

  frame BabylonJS#10: Babylon::Polyfills::Internal::URL::~URL at URL.h:10

  frame BabylonJS#12: Napi::ObjectWrap<...URL>::FinalizeCallback at napi-inl.h:4963

  frame BabylonJS#13: napi_env__::shutdown at hermes_napi.cpp:214

  frame BabylonJS#16: hermes::vm::Runtime::~Runtime

Root cause: Hermes's napi_env__::shutdown() iterates refListHead_ and delete ref; one at a time. It only sets ref->deletionPending_ on the *current* ref before its finalize_cb fires. If the finalizer transitively destroys a node-addon-api wrapper (Napi::Reference / Napi::ObjectReference) whose underlying napi_ref was already deleted earlier in the same loop, napi_delete_reference reads ref->deletionPending_ from freed memory and proceeds to delete ref again -> double-free.

The exact path: URL (an ObjectWrap subclass) has a Napi::ObjectReference member m_searchParamsReference. addReference prepends to the linked list, so m_searchParamsReference's ref is processed BEFORE URL's wrap ref. When URL's wrap finalizer runs delete this, ~URL destroys m_searchParamsReference, whose destructor calls napi_delete_reference on the already-freed sibling ref.

macOS libmalloc detects this (malloc: *** error for object 0x...: pointer being freed was not allocated -> SIGABRT). Linux glibc and Windows CRT happen to miss it.

Fix: PATCH Hermes shutdown() to mark ALL refs deletionPending in a pre-pass BEFORE iterating. Apply as a FetchContent PATCH_COMMAND via the new ApplyPatchIfNeeded.cmake helper (idempotent — uses git apply --check --reverse to detect already-applied state across reconfigure).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cuts the matrix from ~24 jobs down to 4 (Win32_x64_Hermes, Android_Hermes, macOS_Xcode264_Hermes, iOS_Xcode264_Hermes) so iterating on Hermes-specific failures doesn't burn through CI minutes re-running the unaffected V8/Chakra/JSC/UWP/Linux matrix.

Reverted in the very last commit of this PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous narrowed matrix referenced js-engine inputs on build-android.yml and build-ios.yml, but the iteration-2 commit that added those inputs to build-macos.yml and build-ios.yml landed only on build-macos.yml and build-ios.yml on the local branch — the squashed PR commit didn't carry the build-android.yml / build-ios.yml diffs, causing GitHub Actions to reject the workflow with:

  Invalid input, js-engine is not defined in the referenced workflow.

Narrow further to only Win32 x64 + macOS Xcode 26.4 (the two job types that already accept js-engine on the PR head).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The hand-written patch was malformed (corrupt at line 88) and likely never applied in CI — explaining why the previous run still SIGABRT'd at line 225 with the same root-cause backtrace.

Re-generated via git diff against the pristine fetched source so the format is exact. Also added a [Babylon-Hermes-patch] stderr print inside shutdown() so we can positively confirm in CI logs that the patch is in effect. Added an ApplyPatchIfNeeded status line at configure too.

Patch v2 also covers a secondary case: refs created during a finalize_cb (e.g. by transient napi_create_reference calls during teardown) are now pre-marked just before the next finalize fires, in addition to the upfront sweep.

Verified locally on Windows RelWithDebInfo: 145 Mocha + 4 gtest PASS, marker prints 4 times (once per AppRuntime in the suite).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
git apply is strict about line endings; allowing Windows checkouts to convert patches to CRLF would silently break the FetchContent PATCH_COMMAND step.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug in the Hermes NAPI impl?

CedricGuillemet and others added 3 commits June 5, 2026 11:17
Remove the local Hermes shutdown patch and restore the CI matrix while keeping Apple Hermes jobs disabled. Add Android Hermes coverage with host compiler import support for Hermes cross-compilation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep the Android emulator runner script as a single command and relax async timer/WebSocket test timeouts that are flaky on hosted Windows runners.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore the macOS reusable workflow, remove the patch line-ending attributes, and put the iOS CI section label back to its previous wording.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants