Skip to content

Add data directory locking to NodeFS to prevent multi-process corruption#892

Open
reisepass wants to merge 1 commit into
electric-sql:mainfrom
reisepass:main
Open

Add data directory locking to NodeFS to prevent multi-process corruption#892
reisepass wants to merge 1 commit into
electric-sql:mainfrom
reisepass:main

Conversation

@reisepass

@reisepass reisepass commented Feb 16, 2026

Copy link
Copy Markdown

I'v been running into issues with pglite getting corrupted in persist to disk mode. The most human scenario is that you
have on pnpm dev running in one window and then you start for a quick test not remembering that you already had one open.
In my dev work this honestly happens daily so it was hard to use pglite practically as an sqlite replacement.

But simple fix just add a lock file.
This is nothing fancy since postgres assumes it has full control we just reject the second process trying to use the data
dir.

@dv2202

dv2202 commented Feb 19, 2026

Copy link
Copy Markdown

This makes a lot of sense. Accidental multi-process access during local dev is very real, especially when using pnpm dev in multiple terminals.

The lock file approach feels like a clean and pragmatic safeguard, especially since Postgres assumes exclusive control over the data directory.

Curious is there any plan to expose a clearer error message when the second process is rejected so it’s obvious what happened?

@SatyabratDivedi

Copy link
Copy Markdown

This is a really practical improvement. Locking the data dir should prevent a whole class of annoying dev-time corruption issues, and the partial initdb handling makes the setup much more robust. Great work!

@tdrz

tdrz commented Feb 19, 2026

Copy link
Copy Markdown
Collaborator

@reisepass Thank you for this!

^I would be fine also editing this out if someone doesn't like this behavior the Lock file is the important feature

I would prefer if you address only the lock file issue for the moment.

I just skimmed through the changes (sorry, really busy with other things), seems like there are a lot of other files (tests?) added. For the sake of simplicity, if the only thing addressed is the locking of folder access, please consolidate in a single test file or add to existing test files the minimum that tests the new functionality.

Curious is there any plan to expose a clearer error message when the second process is rejected so it’s obvious what happened?

Sounds reasonable.

@reisepass reisepass changed the title Add data directory locking and partial corrupt initdb cleanup Add data directory locking to NodeFS to prevent multi-process corruption Feb 22, 2026
@reisepass

Copy link
Copy Markdown
Author

@tdrz what do you think now its much fewer files as requested

@tdrz

tdrz commented Mar 17, 2026

Copy link
Copy Markdown
Collaborator

@tdrz what do you think now its much fewer files as requested

Thank you! Looks pretty good, I'll take a closer look this week. Thank you for your work and patience!

@tdrz tdrz self-requested a review March 17, 2026 20:16
@tdrz tdrz self-assigned this Mar 17, 2026
Comment thread packages/pglite/src/fs/nodefs.ts Outdated
@tdrz tdrz self-requested a review March 30, 2026 12:59
@reisepass

reisepass commented Apr 1, 2026

Copy link
Copy Markdown
Author

Hurray for CI, so

instantiation.test.ts , drop-database.test.ts and pgvector.test.ts do not call close() it seems.

@tdrz

tdrz commented Apr 1, 2026

Copy link
Copy Markdown
Collaborator

instantiation.test.ts , drop-database.test.ts and pgvector.test.ts do not call close() it seems.

Some do not call it intentionally:

'should drop postgres db and restart after unclean shutdown'

...
      // we don't close pg here on purpose
      pg = null
    }

    // pause for a bit for GC...
    await new Promise((resolve) => setTimeout(resolve, 10))

    const pg2 = await PGlite.create('./pgdata-test-drop-db2', {
      database: 'postgres',
    })

This test in particular could be adapted to delete the lock before creating pg2. I think that's a valid usage

But this also points out that probably same-process sequential reuse should be allowed

Probably, but needs careful consideration.

@reisepass

reisepass commented Apr 1, 2026

Copy link
Copy Markdown
Author

Also this points out that probably same-process sequential reuse should be allowed. While pglite is single threaded if you are running in the same nodeJS process i guess differetn pglite instances must take tunrs since nodejs is single threaded so multiple connections to the same file is allowed if in the same process

Postgres assumes exclusive control of its data directory; two PGlite
instances writing the same dataDir silently corrupt it. NodeFS now
takes a sibling lock file (dataDir.lock) holding the owner PID:

- Acquisition is atomic: exclusive create ('wx'), so two racing
  processes can never both succeed. A stale lock (holder PID is dead)
  is claimed by atomic rename before retrying, so a losing racer can
  never remove a winner's freshly created lock.
- Another live process holds the lock: throw with the PID and guidance.
  Reclaiming only happens when the holder is dead, so PID reuse can at
  worst cause a conservative refusal, never steal a live lock.
- Release verifies ownership: the lock file is only unlinked if it
  still holds this instance's token, and only after FS teardown.
- Another instance in this same process holds the dataDir: throw at the
  creation site. Deleting the lock file is the explicit override.
- Opt-in takeover for HMR-style dev servers, where module reloads
  create a fresh instance and the abandoned one can never be closed:
  new PGlite({ fs: new NodeFS(dataDir, { takeover: true }) }) closes
  the previous instance cleanly before claiming the directory.

Tests live in a single file (tests/nodefs-lock.test.js). Existing tests
that hold a dataDir now close their instances; the unclean-shutdown
test deletes the stale lock before reopening. test:clean also removes
the sibling .lock file.
@reisepass

Copy link
Copy Markdown
Author

been a while, but looks like upstread did not change too much for this to still be valid.

as per comment the unclean-shutdown test now deletes the stale lock before creating pg2, exactly as you suggested.

On same-process sequential reuse, I gave it the consideration:

  • Sequential reuse with close() works as before, zero lock friction.
  • Reuse after abandoning an instance without close(): I tried detecting "abandoned" via GC (WeakRef registry), but an unclosed PGlite instance is never collected - the wasm function table keeps it reachable until close(). So there is no reliable way to tell "abandoned" apart from "still in use" within a process.
  • And correcting my own single-threaded argument from above: taking turns on the JS thread is not enough. Each instance is a separate wasm postgres with its own shared-buffer/WAL state, so interleaved use against the same files still corrupts between turns.

So the default stays a loud error at creation time, and deleting the lock file (your test adaptation) is the explicit override. Details on the rest of the changes in the next comment.

@reisepass

Copy link
Copy Markdown
Author

Beyond that, since the last review:

  • Rebased on latest main and squashed to a single commit, so the diff is just the lock change (7 files).
  • PID reuse (your earlier concern): the liveness check is back, but reclaiming only happens when the holder PID is dead. PID reuse can only cause a conservative refusal (a reused PID that happens to be alive), it can never remove the lock of a running instance. Without auto-reclaim, every crash/Ctrl-C left a lock blocking the next open until removed by hand - that is also what broke instantiation.test.ts in CI (vitest workers leave locks behind).
  • Atomicity: acquisition is exclusive-create ('wx'), stale locks are removed under a small mkdir-based claim mutex, and close only deletes the lock if it still holds its own token. I stress tested with ~200 concurrent and SIGKILLed processes racing the same dir; earlier versions allowed two processes to own the dir at once.
  • HMR opt-in: new PGlite({ fs: new NodeFS(dataDir, { takeover: true }) }) cleanly closes the previous same-process instance and takes over - for dev servers where reloads abandon instances that can never be closed. Happy to drop this if you want the PR smaller.
  • The other test changes are just close() calls where instances were left open on a shared dataDir; lock tests are consolidated in tests/nodefs-lock.test.js.

@tdrz tdrz removed their request for review June 13, 2026 18:42
@tdrz tdrz removed their assignment Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants