Reusable llama.cpp web bridge runtime (JS + WASM).
This repository provides:
- `src/llama_webgpu_core.cpp` (native bridge core)
- `js/llama_webgpu_bridge.js` (JS runtime wrapper)
- `CMakeLists.txt` for Emscripten builds
Requirements:
- Emscripten SDK (`emcmake`, `emcc`) in `PATH`
- llama.cpp source checkout at tag `b9116`, or a compatible checkout exposing `llama_state_save_file`/`llama_state_load_file` with the signatures used by `src/llama_webgpu_core.cpp`
Build command:
```sh
./scripts/build_bridge.sh
```

Useful environment variables:
- `LLAMA_CPP_DIR` (path to llama.cpp source)
- `BUILD_DIR` (cmake build dir)
- `OUT_DIR` (output directory; defaults to `dist/`)
- `WEBGPU_BRIDGE_BUILD_MEM64` (`1` to also build optional wasm64 core assets)
- `WEBGPU_BRIDGE_MEM64_MAX_MEMORY` (optional wasm64 max linear memory, in bytes)
- `WEBGPU_BRIDGE_PTHREADS` (`1`/`0`, defaults to `1`)
- `WEBGPU_BRIDGE_PTHREAD_POOL_SIZE` (defaults to `4`)
- `WEBGPU_BRIDGE_PTHREAD_POOL_SIZE_STRICT` (defaults to `0`)
Notes:
- wasm64 builds default to `WEBGPU_BRIDGE_MEM64_MAX_MEMORY=12884901888` (12 GiB).
- Large single-file remote model loading requires a cross-origin isolated page (COOP/COEP) so worker-thread runtime paths are available.
- pthread builds preallocate `WEBGPU_BRIDGE_PTHREAD_POOL_SIZE` workers and cap bridge-selected thread counts to that compiled pool size. `WEBGPU_BRIDGE_PTHREAD_POOL_SIZE_STRICT` defaults to `0` so an unexpected over-pool request does not hard-abort the wasm runtime, but it can be overridden for stricter local diagnostics.
Build outputs:
- `dist/llama_webgpu_bridge.js`
- `dist/llama_webgpu_bridge_worker.js`
- `dist/llama_webgpu_core.js`
- `dist/llama_webgpu_core.wasm`
Optional outputs (when `WEBGPU_BRIDGE_BUILD_MEM64=1`):
- `dist/llama_webgpu_core_mem64.js`
- `dist/llama_webgpu_core_mem64.wasm`
The bridge exposes llama.cpp session/state persistence through both direct-runtime and worker-backed `LlamaWebGpuBridge` instances.
API:
- `await bridge.stateSaveFile(path, tokens = [])` -> `true`
- `await bridge.stateLoadFile(path, tokenCapacity = bridge.getContextSize())` -> `{ tokens }`
- `await bridge.stateSaveBytes(tokens = [])` -> `Uint8Array`
- `await bridge.stateLoadBytes(bytes, tokenCapacity = bridge.getContextSize())` -> `{ tokens }`
`stateSave*` snapshots the current llama.cpp context; it does not tokenize or evaluate the supplied tokens. Save only after the prompt/prefix you want to restore has already been evaluated by the bridge, then pass the exact token sequence for that evaluated prompt/prefix:
```js
// After loadModelFromUrl(...) and after prompt/prefix evaluation:
const prefixTokens = await bridge.tokenize(prefixText, true);
await bridge.stateSaveFile('/prompt-state.bin', prefixTokens);

const restored = await bridge.stateLoadFile(
  '/prompt-state.bin',
  bridge.getContextSize(),
);
console.log(restored.tokens);

const bytes = await bridge.stateSaveBytes(prefixTokens);
await bridge.stateLoadBytes(bytes, bridge.getContextSize());
```

State files are opaque llama.cpp state/session files. They are tied to the same model, llama.cpp build, and compatible runtime/model-load parameters. Loading a state file from a different model/build can fail.
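Because none of that compatibility is validated on behalf of the application, one pattern is to store your own compatibility metadata next to each snapshot. A minimal sketch (the field names here are illustrative, not part of the bridge API):

```js
// Pair the opaque snapshot bytes with app-level metadata so a mismatched
// model or build can be rejected before calling stateLoadBytes.
const snapshot = {
  modelUrl,             // the URL that was passed to loadModelFromUrl(...)
  llamaCppTag: 'b9116', // the llama.cpp tag the bridge was built against
  bytes: await bridge.stateSaveBytes(prefixTokens),
};
```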
The `tokens` argument is stored in the llama.cpp state/session file and is returned by `stateLoad*`; it is not evaluated by `stateSave*` and is not validated against the KV cache. Passing the wrong token list can make later prompt-prefix reuse incorrect. Passing `[]` is allowed, but gives the bridge no restored prefix-token metadata to reuse.
`stateLoad*` requires `tokenCapacity` to be positive, large enough for the stored token list, and no larger than the active context size. If omitted, the JS API uses `bridge.getContextSize()`. Empty `stateLoadBytes` input is rejected. All four state methods require a loaded model.
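A defensive load wrapper, as a sketch under those rules (the guard logic is illustrative; the bridge raises its own errors for invalid input):

```js
// Load a snapshot, using the active context size as tokenCapacity: it is
// positive, no larger than the context, and covers any stored token list
// that fits the current context at all.
async function loadSnapshot(bridge, bytes) {
  if (!(bytes instanceof Uint8Array) || bytes.length === 0) {
    throw new Error('snapshot bytes must be a non-empty Uint8Array');
  }
  return bridge.stateLoadBytes(bytes, bridge.getContextSize());
}
```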
`stateSaveFile` and `stateLoadFile` operate on the active WASMFS instance. In a browser this filesystem is virtual and not durable by default, and worker-mode paths live inside the worker runtime. Use `stateSaveBytes` and `stateLoadBytes` when the application needs to persist snapshots in IndexedDB, OPFS, the Cache API, or another app-managed durable store.
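For example, a minimal OPFS round trip might look like this (a sketch assuming a browser that supports `navigator.storage.getDirectory()` and `createWritable()`; the file name is arbitrary):

```js
// Persist a snapshot to the origin-private file system (OPFS).
const bytes = await bridge.stateSaveBytes(prefixTokens);
const root = await navigator.storage.getDirectory();
const handle = await root.getFileHandle('prompt-state.bin', { create: true });
const writable = await handle.createWritable();
await writable.write(bytes);
await writable.close();

// Later: read the snapshot back and restore it into a loaded bridge.
const file = await (await root.getFileHandle('prompt-state.bin')).getFile();
const saved = new Uint8Array(await file.arrayBuffer());
const { tokens } = await bridge.stateLoadBytes(saved, bridge.getContextSize());
```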
State save/load is rejected while generation is active. On successful load the bridge restores the prompt token list returned as `{ tokens }`, so reissuing the same prompt can reuse the loaded KV state via the existing prompt-prefix reuse path.
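A sketch of that flow (here `generate` stands in for whatever completion entry point the application uses; it is an illustrative name, not part of the API documented above):

```js
// Restore the saved prompt state into a bridge with the same model loaded.
await bridge.stateLoadFile('/prompt-state.bin', bridge.getContextSize());

// Reissuing a prompt that starts with the saved prefix lets the
// prompt-prefix reuse path skip re-evaluating those tokens.
const reply = await bridge.generate(prefixText + userTurn); // hypothetical
```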
This repo includes a wasm build gate in `.github/workflows/ci.yml`. It builds against pinned llama.cpp tag `b9116`, uploads build artifacts, and runs the static CI reliability contract:

```sh
python3 scripts/verify_ci_reliability.py
```

The reliability contract protects the browser-smoke and workflow invariants that are easy to regress during agent-driven maintenance:
- both CI and publish workflows opt into `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24` to catch action-runtime deprecation issues early;
- the state-persistence browser smoke supports an integrity-checked tiny GGUF model round trip;
- the CI model cache path expands `~` before resolving so it matches the `actions/cache` directory;
- browser smoke failures upload `state-persistence-smoke-artifacts` with console logs, result JSON, and screenshots when available.
Run the model-backed smoke locally after building the bridge if a change touches state persistence, workers, browser smoke, or workflow diagnostics:
```sh
python3 scripts/state_persistence_browser_smoke.py \
  --dist-dir /path/to/webgpu_bridge_dist \
  --model-url https://huggingface.co/aladar/llama-2-tiny-random-GGUF/resolve/main/llama-2-tiny-random.gguf \
  --model-sha256 81f226c62d28ed4a1a9b9fa080fcd9f0cc40e0f9d5680036583ff98fbcd035cb \
  --model-cache-dir ~/.cache/llama-web-bridge/state-smoke-models \
  --artifacts-dir /tmp/llama-web-bridge-state-smoke
```

Do not commit downloaded GGUFs, Playwright screenshots, console logs, generated `dist/` assets, or Emscripten build/cache directories.
Published, versioned artifacts are consumed from `leehack/llama-web-bridge-assets`.

Publish workflow: `.github/workflows/publish_assets.yml`
Trigger modes:
- Automatic: push a `v*` tag in this repo (for example `v0.1.5`)
- Manual: run workflow dispatch with explicit inputs
Required repository secret:
`WEBGPU_BRIDGE_ASSETS_PAT` (token with write access to `leehack/llama-web-bridge-assets`)
Example publish:
- Create/push a release tag in this repo (for example `v0.1.5`)
- `Publish Bridge Assets` runs automatically and publishes the same tag to `leehack/llama-web-bridge-assets`
- The workflow also creates/updates the matching GitHub Release in `leehack/llama-web-bridge-assets`
Manual override example:
- Run the `Publish Bridge Assets` workflow
- Inputs:
  - `assets_tag`: `v0.1.5`
  - `assets_repo`: `leehack/llama-web-bridge-assets`
  - `llama_cpp_tag`: `b9116`
After publish, assets are CDN-available at:
- `https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@v0.1.1/llama_webgpu_bridge.js`
- `https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@v0.1.1/llama_webgpu_bridge_worker.js`
- `https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@v0.1.1/llama_webgpu_core.js`
- `https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@v0.1.1/llama_webgpu_core.wasm`
Note: CDN pinning relies on git tags in the assets repo.
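As a sketch of consuming a pinned build (assuming the bridge module exposes `LlamaWebGpuBridge` as a module export; the exact export shape and constructor options are not documented here):

```js
// Dynamically import the pinned bridge build from the CDN.
const mod = await import(
  'https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@v0.1.1/llama_webgpu_bridge.js'
);
const bridge = new mod.LlamaWebGpuBridge(/* options, if any */);
```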
- `AGENTS.md`: agent workflow and cross-repo handoff
- `CONTRIBUTING.md`: contributor setup/build/publish steps