Chainbase internal optimizations: session allocs, spinlock, snapshot preallocate #283

Open
heifner wants to merge 7 commits into master from feature/chainbase-internals

heifner commented Apr 3, 2026

Summary

  • Eliminate per-transaction heap allocations in undo sessions — Replace vector<unique_ptr<abstract_session>> (18 heap allocations per transaction) with a lightweight database* + bool pair. The abstract_session / session_impl virtual dispatch layer was redundant since database::undo() and database::squash() already iterate _index_list with the same dispatch through abstract_index.

  • Remove null terminator from shared_cow_string: shared_blob stores binary data (KV keys, values, ABI blobs) where null termination is unnecessary. Saves 1 byte per allocation; with 8-byte slab bucket rounding this saves 8 bytes per allocation that crosses a bucket boundary (e.g., 24-byte keys: 33 -> 32 bytes).

  • Replace std::mutex with spinlock in small_size_allocator — Reduces per-bucket overhead from ~40 bytes (pthread_mutex_t) to 1 byte (atomic_flag), saving ~5KB across 128 buckets in shared memory. Uncontended spinlock is ~7-10ns vs ~25ns for mutex, saving ~15ns per alloc+dealloc cycle.

  • Preallocate chainbase node storage during snapshot loading — Expose per-section row count from snapshot readers, then batch-allocate node storage upfront before the row creation loop. Avoids repeated get_some() calls to the segment manager during row-by-row insertion. Covers all index loading paths (controller, KV, authorization, resource limits).

heifner added 7 commits April 3, 2026 17:12
Replace vector<unique_ptr<abstract_session>> with a lightweight
database* + bool pair. The abstract_session / session_impl virtual
dispatch layer was redundant — database::undo() and database::squash()
already iterate _index_list with the same virtual dispatch through
abstract_index.

Removes 18 heap allocations per transaction (1 vector + 17 session_impl
objects for each registered index type).
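To make the shape of this change concrete, here is a minimal sketch of a session that holds only a database pointer and a flag. The names `toy_database`, `toy_session`, and `session_demo` are hypothetical stand-ins, not the chainbase classes, and the real `database::undo()` / `database::squash()` walk `_index_list` rather than bumping counters.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the database. The real undo()/squash() iterate
// _index_list and dispatch through abstract_index; here they only count calls.
struct toy_database {
    int undo_calls = 0;
    int squash_calls = 0;
    void undo()   { ++undo_calls; }
    void squash() { ++squash_calls; }
};

// A session reduced to a pointer plus a flag: no vector and no per-index
// objects, so constructing one performs no heap allocation.
class toy_session {
public:
    explicit toy_session(toy_database& db) : _db(&db), _apply(true) {}
    toy_session(toy_session&& other) noexcept : _db(other._db), _apply(other._apply) {
        other._apply = false;  // the moved-from session must not undo
    }
    toy_session(const toy_session&) = delete;
    toy_session& operator=(const toy_session&) = delete;
    toy_session& operator=(toy_session&&) = delete;

    ~toy_session() { if (_apply) _db->undo(); }  // roll back unless released

    void push()   { _apply = false; }  // keep the changes
    void squash() { if (_apply) { _db->squash(); _apply = false; } }
    void undo()   { if (_apply) { _db->undo();   _apply = false; } }

private:
    toy_database* _db;
    bool          _apply;
};

// Usage: an abandoned session undoes once; push() suppresses the undo, and a
// moved-from session hands responsibility to its successor.
inline void session_demo() {
    toy_database db;
    { toy_session s(db); }
    assert(db.undo_calls == 1);
    { toy_session s(db); s.push(); }
    assert(db.undo_calls == 1);
    { toy_session a(db); toy_session b(std::move(a)); }
    assert(db.undo_calls == 2);
}
```

Because all per-index work already happens inside the database calls, the session itself needs no knowledge of how many index types are registered.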

shared_cow_string is used as shared_blob for binary data (KV keys,
values, ABI blobs) where null termination is unnecessary. No c_str()
method exists — all access is via data() + size().

Saves 1 byte per allocation, which with 8-byte slab bucket rounding
saves 8 bytes per allocation that crosses a bucket boundary (e.g.,
24-byte keys: 33 -> 32 bytes, fitting in a smaller bucket).
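The rounding arithmetic can be checked with a short sketch. The helper names `bucket_size` and `saved_by_trimming_one` are hypothetical; the only inputs taken from the commit message are the 8-byte granularity and the 33 -> 32 byte example.

```cpp
#include <cstddef>

// Round a request up to the next multiple of the bucket granularity
// (8 bytes, per the commit message).
constexpr std::size_t bucket_size(std::size_t bytes, std::size_t granularity = 8) {
    return (bytes + granularity - 1) / granularity * granularity;
}

// The commit's example: a 33-byte request spills into the 40-byte bucket,
// while a 32-byte request fits the 32-byte bucket exactly.
static_assert(bucket_size(33) == 40, "33 bytes rounds up to 40");
static_assert(bucket_size(32) == 32, "32 bytes needs no rounding");

// Bytes saved by trimming one byte from a request of `bytes` (bytes >= 1).
constexpr std::size_t saved_by_trimming_one(std::size_t bytes) {
    return bucket_size(bytes) - bucket_size(bytes - 1);
}

// Only requests sitting one byte past a boundary benefit.
static_assert(saved_by_trimming_one(33) == 8, "crosses a bucket boundary");
static_assert(saved_by_trimming_one(34) == 0, "stays in the same bucket");
```

In this model the saving is all-or-nothing: one request size in every eight gains a full bucket step, and the rest are unchanged.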

Reduces per-bucket overhead from ~40 bytes (pthread_mutex_t) to 1 byte
(atomic_flag), saving ~5KB across 128 buckets in shared memory.
Uncontended spinlock is ~7-10ns vs ~25ns for mutex, saving ~15ns per
alloc+dealloc cycle.

Expose per-section row_count from snapshot readers via
section_reader::row_count(), then call preallocate() before the
row creation loop for all index types. This batch-allocates node
storage from the segment manager upfront, avoiding repeated
get_some() calls during row-by-row insertion.

Covers controller_index_set, kv_database_index_set,
authorization_index_set, and resource_index_set loading paths.
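The effect of preallocating can be illustrated with a toy pool that counts how often it has to go back to the segment manager. Everything here is a hypothetical model: `toy_node_pool`, `load_rows`, and the refill batch of 4 are assumptions chosen for the illustration, and only the names `get_some()` and `preallocate()` echo the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node pool that counts trips to the segment manager.
struct toy_node_pool {
    static constexpr std::size_t default_batch = 4;  // assumed refill size
    std::size_t free_nodes = 0;
    std::size_t segment_calls = 0;

    // Stand-in for get_some(): one segment-manager request for n nodes.
    void get_some(std::size_t n) { ++segment_calls; free_nodes += n; }

    // Reserve enough nodes for n upcoming rows in a single request.
    void preallocate(std::size_t n) {
        if (n > free_nodes) get_some(n - free_nodes);
    }

    // Take one node, refilling in small batches when the pool runs dry.
    void allocate_node() {
        if (free_nodes == 0) get_some(default_batch);
        --free_nodes;
    }
};

// Load row_count rows and report how many segment-manager calls were made.
inline std::size_t load_rows(toy_node_pool& pool, std::size_t row_count,
                             bool preallocate_first) {
    if (preallocate_first) pool.preallocate(row_count);
    for (std::size_t i = 0; i < row_count; ++i) pool.allocate_node();
    return pool.segment_calls;
}

inline void preallocate_demo() {
    toy_node_pool lazy, eager;
    assert(load_rows(lazy, 100, false) == 25);  // 100 rows / 4 per refill
    assert(load_rows(eager, 100, true) == 1);   // one upfront request
}
```

The saving depends on knowing the row count before the loop starts, which is why the commit exposes it from the section reader first.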

The session destructor calls undo(), which throws if _read_only_mode is true.
This causes a crash when nodeop receives SIGTERM during a read window
while a block-building session is still alive. Add undo_from_session()
and squash_from_session() that bypass the read-only guard so RAII
cleanup always succeeds regardless of database mode.
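A reduced sketch of the guard bypass follows. The types `guarded_db` and `raii_session` are hypothetical; only the method name `undo_from_session()` and the read-only check come from the commit message. Since destructors are implicitly `noexcept` in C++11 and later, an exception escaping one calls `std::terminate`, which is one way a throwing `undo()` in a destructor becomes a crash.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical database with a read-only guard on the public undo().
struct guarded_db {
    bool read_only = false;
    int  undone = 0;

    // Public entry point: refuses to modify state in read-only mode.
    void undo() {
        if (read_only) throw std::logic_error("undo() called in read-only mode");
        undo_from_session();
    }

    // Cleanup path for the session destructor: skips the guard, never throws.
    void undo_from_session() noexcept { ++undone; }
};

// RAII session whose destructor uses the non-throwing path.
class raii_session {
public:
    explicit raii_session(guarded_db& db) : _db(&db) {}
    raii_session(const raii_session&) = delete;
    raii_session& operator=(const raii_session&) = delete;
    ~raii_session() { if (_apply) _db->undo_from_session(); }
    void push() { _apply = false; }  // keep the changes, skip the undo

private:
    guarded_db* _db;
    bool        _apply = true;
};

inline void read_only_demo() {
    guarded_db db;
    db.read_only = true;

    bool threw = false;
    try { db.undo(); } catch (const std::logic_error&) { threw = true; }
    assert(threw);           // the guarded call still rejects explicit use

    { raii_session s(db); }  // destructor runs during the read window
    assert(db.undone == 1);  // cleanup succeeded without throwing
}
```

Keeping the guard on the public method means callers are still checked, while only the destructor path is exempt.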

The bare `while (_flag.test_and_set(acquire));` busy-wait can livelock
under ASan or on heavily-loaded CI runners: when the holder thread is
preempted, the spinner burns its entire time slice on the atomic flag
and the holder cannot make progress.

Use TTAS to avoid cache-line ping-pong on test_and_set, then pause for
short waits (x86 PAUSE / ARM YIELD) and yield to the scheduler after
~16 spins so the holder can run.
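The strategy described above (test, then test-and-set, pause briefly, then yield) might look roughly like the following. This is a sketch under assumptions, not the code in the PR: it uses `std::atomic<bool>` in place of `atomic_flag` so the read-only test compiles before C++20, and `ttas_spinlock`, `cpu_relax`, and `spin_limit` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }                             // x86 PAUSE
#elif defined(__aarch64__)
inline void cpu_relax() { __asm__ __volatile__("yield" ::: "memory"); }  // ARM YIELD
#else
inline void cpu_relax() {}                                           // no hint available
#endif

class ttas_spinlock {
public:
    void lock() noexcept {
        for (unsigned spins = 0;; ++spins) {
            if (try_lock()) return;
            if (spins < spin_limit) cpu_relax();   // short wait: stay on the CPU
            else std::this_thread::yield();        // long wait: let the holder run
        }
    }

    bool try_lock() noexcept {
        // Test first with a plain load so waiters share the cache line
        // read-only; attempt the atomic exchange only when the lock looks free.
        return !_locked.load(std::memory_order_relaxed) &&
               !_locked.exchange(true, std::memory_order_acquire);
    }

    void unlock() noexcept { _locked.store(false, std::memory_order_release); }

private:
    static constexpr unsigned spin_limit = 16;  // ~16 spins, per the commit message
    std::atomic<bool> _locked{false};
};

// Usage: the class is Lockable, so std::lock_guard works with it.
inline void spinlock_demo() {
    ttas_spinlock lock;
    {
        std::lock_guard<ttas_spinlock> guard(lock);
        assert(!lock.try_lock());  // already held
    }
    assert(lock.try_lock());       // released by the guard
    lock.unlock();
}
```

Yielding after the spin budget is what addresses the preempted-holder case: the waiter gives up its time slice instead of burning it on the flag.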

Two fixes to the billing-accumulation loop:

1. Move the threshold check to the top of the loop body so we break
   before calling push_trx when billing has already crossed the limit.
   With the check at the bottom, push_trx can throw tx_cpu_usage_exceeded
   on the last iteration instead of sysio_assert_message_exception,
   tripping the BOOST_CHECK_THROW and failing the test.

2. Increase num_itrs from 1000 to 5000.  delta_per_action guarantees
   >= 1 us billed to `other` per transaction, so the 1000 us threshold
   needs at most 1000 iterations.  The original bound was exact; 5x
   headroom covers rounding variation on ASan / sys-vm-oc CI builds.