Add StringIndex: a generic open-addressed string set#11660

Draft
dougqh wants to merge 6 commits into master from dougqh/tagset

@dougqh dougqh commented Jun 17, 2026

Contributor

Draft — /techdebt + review before ready. Split out of a larger tag-id effort so the generic data structure stands on its own.

What

StringIndex (datadog.trace.util, alongside the custom Hashtable) — a flat, allocation-free, open-addressed string set / index:

  • Support: the static algorithm over raw int[] hashes / String[] names. When those arrays are held in static final fields, the refs fold to constants (the hot path).
  • Data: a build-time carrier {int[] hashes, String[] names} (pull into your own fields).
  • An instance wrapper for convenience (of/contains/indexOf).

2×-oversized (load factor ≤ 0.5), linear probe + wraparound, hash gates equals, interned == fast path, 0 = empty sentinel. Generic — no payload baked in; it just knows names. The headline capability is indexOf, which assigns each known string a stable dense slot; consumers attach a parallel array (e.g. long[] ids) indexed by that slot. Membership (contains) falls out as indexOf >= 0.
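
To make the description above concrete, here is a minimal sketch of such a table. It is not the PR's code: the class name `StringIndexSketch`, the `hashOf` remapping of a zero hash to 1, and the power-of-two sizing are all assumptions chosen for illustration. The slot returned by `indexOf` is a position in the table, so a parallel payload array would be sized to the table length.

```java
/** Illustrative sketch only: names, hashing, and sizing here are assumptions, not the PR's code. */
final class StringIndexSketch {
  final int[] hashes;   // 0 marks an empty slot
  final String[] names; // names[i] is meaningful only where hashes[i] != 0

  private StringIndexSketch(int[] hashes, String[] names) {
    this.hashes = hashes;
    this.names = names;
  }

  /** Remaps a hash of 0 (e.g. "".hashCode()) so 0 can stay the empty sentinel. */
  static int hashOf(String s) {
    int h = s.hashCode();
    return h == 0 ? 1 : h;
  }

  /** Builds a power-of-two table at least 2x the input size (load factor <= 0.5). */
  static StringIndexSketch of(String... input) {
    int capacity = 2;
    while (capacity < input.length * 2) capacity <<= 1;
    int[] hashes = new int[capacity];
    String[] names = new String[capacity];
    int mask = capacity - 1;
    for (String name : input) {
      int h = hashOf(name);
      int i = h & mask;
      // Linear probe with wraparound until an empty slot or the same name is found.
      while (hashes[i] != 0 && !(hashes[i] == h && names[i].equals(name))) {
        i = (i + 1) & mask;
      }
      hashes[i] = h;
      names[i] = name;
    }
    return new StringIndexSketch(hashes, names);
  }

  /** Static lookup over raw arrays; returns the slot, or -1 when the key is absent. */
  static int indexOf(int[] hashes, String[] names, String key) {
    int h = hashOf(key);
    int mask = hashes.length - 1;
    for (int i = h & mask; ; i = (i + 1) & mask) {
      int slotHash = hashes[i];
      if (slotHash == 0) return -1; // an empty slot always exists at load factor <= 0.5
      if (slotHash == h) {          // the hash comparison gates the costlier equals
        String candidate = names[i];
        if (candidate == key || candidate.equals(key)) return i; // == catches interned keys
      }
    }
  }

  int indexOf(String key) { return indexOf(hashes, names, key); }

  boolean contains(String key) { return indexOf(key) >= 0; }
}
```

With this sketch, a consumer could allocate `long[] ids = new long[index.hashes.length]` and read `ids[index.indexOf(name)]` after checking that the slot is non-negative.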

(Renamed from TagSet — the structure is more general than tags; a fixed name→id / membership index is just one of its uses.)

Why / benchmarks

The motivating use is a fast, declared name→id table — a "pit of success" alternative to hand-rolled string switches and per-name caches — but the structure is general. Shipped with benchmarks (Apple M1 Max, 8 threads, Java 8):

  • SetBenchmark (membership): StringIndex ≈ HashSet (~2.0–2.25B ops/s), ~2.5–3× array/sortedArray/treeSet; the static Support path is ~12% over the instance and ~12% over HashSet on hits, with much tighter variance.
  • KeyOfBenchmark (name→id): StringIndex ≥ a hand-written string switch (at 16 cases; the gap widens at scale) and ~20% over HashMap.

StringIndexTest covers hashing/zero-sentinel, probe + wraparound, table-full, and the parallel-payload usage.

Follow-ups

  • A JOL retained-footprint comparison (StringIndex vs HashSet across sizes) will land in this PR — flat parallel arrays should beat HashSet's per-element Node objects, especially for small fixed sets. (Footprint, not alloc-rate — different axis.)
  • First real consumer to validate the indexOf→parallel-array ergonomics: TagInterceptor's hot-path tag-name switch.
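
As a rough illustration of the indexOf→parallel-array pattern a consumer might adopt, the sketch below resolves a name to an int id two ways: with a hand-written string switch, and with a fixed open-addressed table plus a parallel `int[]`. The tag names, the ids, and the class `TagIdSketch` are hypothetical and not taken from TagInterceptor; the table arrays live in static final fields, as the description suggests for the hot path.

```java
/** Illustrative only: the tag names and ids below are hypothetical, not taken from TagInterceptor. */
final class TagIdSketch {
  static final int UNKNOWN = 0;

  /** Baseline: a hand-written string switch. */
  static int idBySwitch(String name) {
    switch (name) {
      case "error": return 1;
      case "http.method": return 2;
      case "http.url": return 3;
      case "span.kind": return 4;
      default: return UNKNOWN;
    }
  }

  private static final String[] DECLARED = {"error", "http.method", "http.url", "span.kind"};
  private static final int CAPACITY = 8; // power of two, 2x the four declared names
  private static final int MASK = CAPACITY - 1;
  private static final int[] HASHES = new int[CAPACITY];      // 0 = empty slot
  private static final String[] NAMES = new String[CAPACITY];
  private static final int[] IDS = new int[CAPACITY];         // parallel payload, no boxing

  static {
    for (int n = 0; n < DECLARED.length; n++) {
      int h = hashOf(DECLARED[n]);
      int i = h & MASK;
      while (HASHES[i] != 0) i = (i + 1) & MASK; // linear probe with wraparound
      HASHES[i] = h;
      NAMES[i] = DECLARED[n];
      IDS[i] = n + 1;
    }
  }

  /** Remaps a zero hash so 0 can remain the empty sentinel (an assumption of this sketch). */
  private static int hashOf(String s) {
    int h = s.hashCode();
    return h == 0 ? 1 : h;
  }

  /** Table lookup: probe for the slot, then one load from the parallel array. */
  static int idByIndex(String name) {
    int h = hashOf(name);
    for (int i = h & MASK; HASHES[i] != 0; i = (i + 1) & MASK) {
      if (HASHES[i] == h && (NAMES[i] == name || NAMES[i].equals(name))) return IDS[i];
    }
    return UNKNOWN;
  }
}
```

Both methods return the same id for every declared name and `UNKNOWN` otherwise; the table version keeps the name list declarative, so adding a name does not mean editing a switch.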

🤖 Generated with Claude Code

TagSet (datadog.trace.util) — a flat, allocation-free, open-addressed string
set: Support (static algorithm over raw int[] hashes / String[] names — the hot
path, folds to constants when held in static finals), Data (build-time carrier),
and a convenience instance wrapper. 2x-oversized (load factor <= 0.5), linear
probe, interned-== fast path, 0 = empty sentinel. Consumers attach a parallel
payload array indexed by the returned slot.

Benchmarks: SetBenchmark (membership vs HashSet/array/sortedArray/treeSet) and
KeyOfBenchmark (name->id resolution vs HashMap and a hand-written switch).
TagSetTest covers hashing, probing/wraparound, table-full, and parallel payload.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dougqh dougqh added the comp: core (Tracer core), tag: ai generated (Largely based on code generated by an AI or LLM), tag: no release notes (Changes to exclude from release notes), and type: enhancement (Enhancements and improvements) labels Jun 17, 2026
dougqh and others added 4 commits June 17, 2026 07:14
…hmark, not a generic-set one)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…hmark

Adds get_tagSetMap / get_tagSetMap_sameKey as a build-once, read-only map
option in the map menu: keys in a TagSet, values in a parallel int[] (no
boxing), get is Support.indexOf plus one array load. Fastest get in the
benchmark and the tightest error bars; results captured in the javadoc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The shared static sharedLookupIndex was written by every thread under
@Threads(8), so its cache-line ping-pong capped the get_* benchmarks near a
~1.4B ops/s contention ceiling -- the same artifact SetBenchmark fixed by
moving its rotation counter to per-thread @State(Scope.Thread). Apply the same
fix here; off the ceiling the get_* numbers rise and the differences between
map options stop being compressed.
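
The shape of that fix can be sketched in plain Java. The JMH annotations are left out so the block compiles without the JMH dependency; in the benchmark, the per-thread holder would be the class annotated with @State(Scope.Thread). The class and method names here are hypothetical.

```java
/** Illustrative sketch: contrasts a shared rotation counter with a per-thread one. */
final class RotationSketch {
  static final String[] KEYS = {"a", "b", "c", "d"}; // length is a power of two
  static final int MASK = KEYS.length - 1;

  // Contended variant: one static field written by every benchmark thread,
  // so the cache line holding it bounces between cores.
  static int sharedLookupIndex;

  static String nextKeyShared() {
    int i = sharedLookupIndex;          // racy, but always in range because writes are masked
    sharedLookupIndex = (i + 1) & MASK;
    return KEYS[i];
  }

  // Uncontended variant: each thread owns one instance, so writes stay core-local.
  static final class PerThread {
    private int index;

    String nextKey() {
      int i = index;
      index = (i + 1) & MASK;
      return KEYS[i];
    }
  }
}
```

Each `PerThread` instance rotates through the keys independently, which removes the shared write from the measured path.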

Replaces the two stale, unlabeled Java-21 result blocks (and the interim
contention-capped get block) with a single env-stamped table from one run on a
modern, representative JVM (Apple M1 Max, macOS 26.4.1, Zulu 17.0.7+7-LTS, 8
threads, 2 forks). Header conclusions re-checked against these numbers: the
fixed TagSet map leads HashMap ~30%/~50% (rotating/same-key), forEach matches
the fastest map iterators, and create's HashMap-vs-LinkedHashMap edge is ~1.4x.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The structure is more general than tags: a flat, open-addressed String set
whose indexOf returns a stable dense slot for parallel value arrays. "Tag" was
just one prospective consumer. Renames the class, test, and benchmark
references (Support/Data nested types and the SetBenchmark / map-benchmark
usages); leaves the unrelated propagation TagSetChanges in dd-trace-core alone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dougqh dougqh changed the title Add TagSet: a generic open-addressed string set Add StringIndex: a generic open-addressed string set Jun 23, 2026
…edMapBenchmark

StringIndex's benchmark integration is moving to the dedicated benchmark PRs
(set overhaul #11721, map overhaul #11679) and will be folded in there later.
Revert both benchmark files to master so this PR is purely the StringIndex data
structure + tests. Avoids the #11679/#11721 deletions-vs-edits conflicts too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>