Bug repro: RI.SET AOF ordering divergence under concurrent writes #1819

Draft
tiagonapoli wants to merge 1 commit into main from tiagonapoli/ri-set-aof-ordering-divergence

@tiagonapoli
Collaborator

Problem

`RI.SET` takes a shared lock rather than an exclusive one, because BfTree is internally thread-safe for point operations. However, AOF logging happens after the native insert, as a separate call that is not serialized with it. When multiple threads concurrently write the same field:

  1. BfTree serializes the inserts internally (last-writer-wins)
  2. The AOF enqueue is a separate step, so log order may not match BfTree's execution order
  3. On AOF replay (recovery), the replayed "last write" can differ from the primary's actual winner

This causes primary/replica divergence.

Test

The test (`RISetAofOrderingDivergenceTest`) runs 200 rounds. In each round, 8 workers write the same field, synchronized with a `Barrier`. After all rounds, it commits the AOF, recovers, and compares each field's recovered value against the primary's.

Sample output (first run):

```
DIVERGENCE round=61 field=field-0061 primary=v-0061-w02 recovered=v-0061-w04
DIVERGENCE round=72 field=field-0072 primary=v-0072-w01 recovered=v-0072-w02
DIVERGENCE round=143 field=field-0143 primary=v-0143-w02 recovered=v-0143-w00
DIVERGENCE round=194 field=field-0194 primary=v-0194-w01 recovered=v-0194-w04
DIVERGENCE round=199 field=field-0199 primary=v-0199-w03 recovered=v-0199-w02
Total rounds=200 workers=8 mismatches=5
```

About 2-5% of rounds diverge; the run above shows 5 of 200 (2.5%).
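
For readers who want the shape of the harness without the C# test, here is a toy Python analogue. It is a sketch under assumptions, not the actual test: a plain `dict` stands in for BfTree, a `list` stands in for the AOF, and `run_rounds` and its `serialize` flag are hypothetical names introduced only for illustration.

```python
import threading


def run_rounds(rounds=200, workers=8, serialize=False):
    """Toy analogue of the repro. Returns the fields whose replayed value
    differs from the live store. Nothing here is Garnet code."""
    store = {}                       # stand-in for BfTree (last writer wins)
    aof = []                         # stand-in for the AOF; append is the "enqueue"
    write_lock = threading.Lock()    # only used when serialize=True
    barrier = threading.Barrier(workers)

    def worker(w):
        for r in range(rounds):
            field = f"field-{r:04d}"
            value = f"v-{r:04d}-w{w:02d}"
            barrier.wait()           # release all workers into the round together
            if serialize:
                # Insert and log as one unit: log order matches insert order.
                with write_lock:
                    store[field] = value
                    aof.append((field, value))
            else:
                store[field] = value          # insert
                aof.append((field, value))    # log, not atomic with the insert

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # "Recovery": replay the log in order, the last record per field wins.
    recovered = {}
    for field, value in aof:
        recovered[field] = value
    return sorted(f for f in store if store[f] != recovered[f])


if __name__ == "__main__":
    mismatches = run_rounds()
    print(f"Total rounds=200 workers=8 mismatches={len(mismatches)}")
```

Because of CPython's coarse thread switching, this toy may report few or no mismatches on a given run. It illustrates the structure of the test (barrier, same-field writes, replay, compare) rather than the observed rate.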

Root cause

In `StorageSession.RangeIndexSet()`:

  1. Shared lock acquired via `ReadRangeIndex()`
  2. `BfTreeService.InsertByPtr()`: native insert (thread-safe internally)
  3. `ReplicateRangeIndexSet()`: AOF enqueue (separate, unserialized)
  4. Shared lock released

Steps 2 and 3 are not atomic with respect to other threads holding a shared lock on the same key, so two threads can perform their inserts in one order and their AOF enqueues in the opposite order.
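
One such interleaving, written out by hand, looks like the following. This is a single-threaded illustration with a `dict` and a `list` standing in for BfTree and the AOF. The schedule is hypothetical and was chosen to reproduce the round-61 values from the sample output.

```python
def replay(aof):
    """Recovery in miniature: apply records in log order, last one wins."""
    state = {}
    for field, value in aof:
        state[field] = value
    return state


store = {}   # stand-in for BfTree
aof = []     # stand-in for the AOF

field = "field-0061"

# Step 2 for each thread: the inserts land in the order w04, then w02.
store[field] = "v-0061-w04"          # worker 04 inserts first
store[field] = "v-0061-w02"          # worker 02 inserts last, so it is the winner

# Step 3 for each thread: the enqueues land in the opposite order.
aof.append((field, "v-0061-w02"))    # worker 02 logs first
aof.append((field, "v-0061-w04"))    # worker 04 logs last

recovered = replay(aof)
if store[field] != recovered[field]:
    print(f"DIVERGENCE field={field} primary={store[field]} recovered={recovered[field]}")
```

Both threads behaved correctly in isolation. The divergence comes only from the gap between each thread's insert and its enqueue.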

Possible fixes

  • Serialize same-field RI.SET writes (per-key exclusive lock or funnel through RMW)
  • Log AOF inside BfTree's internal critical section
  • Use a sequence number / logical clock to make AOF replay order-independent
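
As a rough illustration of the last option, the sketch below tags each write with a sequence number assigned at the same moment as the insert, and makes replay keep the highest-numbered record per field. It is a toy under assumptions: `SequencedStore` and `replay` are hypothetical names, and whether BfTree can expose an ordering token like this is not established by this PR.

```python
import itertools
import threading


class SequencedStore:
    """Toy store that hands back a sequence number assigned atomically
    with each insert. Hypothetical; not BfTree's API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seq = itertools.count(1)
        self.data = {}

    def insert(self, field, value):
        with self._lock:
            seq = next(self._seq)
            self.data[field] = value
            return seq


def replay(aof):
    """Order-independent replay: per field, the record with the highest
    sequence number wins, regardless of its position in the log."""
    latest = {}
    for seq, field, value in aof:
        if field not in latest or seq > latest[field][0]:
            latest[field] = (seq, value)
    return {field: value for field, (_, value) in latest.items()}


store = SequencedStore()
seq_w04 = store.insert("field-0061", "v-0061-w04")   # inserted first
seq_w02 = store.insert("field-0061", "v-0061-w02")   # inserted last: the winner

# The enqueues still race, so the log holds the records in the "wrong" order.
aof = [
    (seq_w02, "field-0061", "v-0061-w02"),
    (seq_w04, "field-0061", "v-0061-w04"),
]

recovered = replay(aof)
```

Here `recovered` matches `store.data` even though the log order is reversed. The cost is that every record carries an extra number and replay must track the latest sequence per field.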

RI.SET uses a shared lock (BfTree is internally thread-safe), so concurrent
writes to the same field execute in parallel. The AOF enqueue happens after
the native insert as a separate unserialized step, so the AOF log order may
not match BfTree's internal last-writer-wins order. On AOF replay (recovery),
the replayed last write may differ from the primary's actual winner, causing
primary/replica divergence.

The test uses 8 workers writing the same field per round across 200 rounds,
then recovers from AOF and compares. Empirically ~2-5% of rounds diverge.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tiagonapoli force-pushed the tiagonapoli/ri-set-aof-ordering-divergence branch from a67003c to d8d4365 on May 21, 2026 23:32