Add StringIndex: a generic open-addressed string set by dougqh · Pull Request #11660 · DataDog/dd-trace-java

dougqh · 2026-06-17T11:03:26Z

Draft — /techdebt + review before ready. Split out of a larger tag-id effort so the generic data structure stands on its own.

What

StringIndex (datadog.trace.util, alongside the custom Hashtable) is a flat, allocation-free, open-addressed string set / index with three parts:

Support: the static algorithm over raw int[] hashes / String[] names. When those arrays are held in static final fields, the references fold to constants (the hot path).
Data: a build-time carrier {int[] hashes, String[] names} (pull into your own fields).
An instance wrapper for convenience (of/contains/indexOf).

The table is 2× oversized (load factor ≤ 0.5) and uses linear probing with wraparound. A slot's stored hash must match before equals is called, interned strings take an == fast path, and 0 is the empty-slot sentinel. The structure is generic: no payload is baked in; it only knows names. The headline capability is indexOf, which assigns each known string a stable dense slot; consumers attach a parallel array (e.g. long[] ids) indexed by that slot. Membership (contains) falls out as indexOf >= 0.
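To make the probing scheme and the parallel-array pattern concrete, here is a minimal sketch written from the description above. It is not the code in this PR: the class name StringIndexSketch, the hash mixing, the power-of-two sizing, and the example tag names are all assumptions made for illustration. It also assumes the returned slot is the table position, so the payload array is sized to the table; the description does not say whether the real slots are table positions or compacted ordinals.

```java
// Illustrative sketch only: names, hash mixing, and sizing are assumptions, not the PR's code.
final class StringIndexSketch {
  private final int[] hashes;   // 0 means "empty slot"
  private final String[] names; // names[i] is set only where hashes[i] != 0

  private StringIndexSketch(int[] hashes, String[] names) {
    this.hashes = hashes;
    this.names = names;
  }

  // Spread the hashCode and remap 0 so a real entry never looks like the empty sentinel.
  static int hash(String s) {
    int h = s.hashCode();
    h ^= (h >>> 16);
    return h == 0 ? 1 : h;
  }

  // Build once: table size is a power of two >= 2 * n, so the load factor stays <= 0.5.
  static StringIndexSketch of(String... input) {
    int size = 2;
    while (size < 2 * input.length) size <<= 1;
    int[] hashes = new int[size];
    String[] names = new String[size];
    int mask = size - 1;
    for (String name : input) {
      int h = hash(name);
      int slot = h & mask;
      while (hashes[slot] != 0) {
        if (hashes[slot] == h && names[slot].equals(name)) break; // duplicate input
        slot = (slot + 1) & mask; // linear probe with wraparound
      }
      hashes[slot] = h;
      names[slot] = name;
    }
    return new StringIndexSketch(hashes, names);
  }

  // Static lookup over raw arrays, in the spirit of the "Support" path described above.
  static int indexOf(int[] hashes, String[] names, String key) {
    int h = hash(key);
    int mask = hashes.length - 1;
    int slot = h & mask;
    for (int probes = 0; probes < hashes.length; probes++) {
      int stored = hashes[slot];
      if (stored == 0) return -1; // hit an empty slot: key is absent
      if (stored == h) {          // hash gates the more expensive equals call
        String candidate = names[slot];
        if (candidate == key || candidate.equals(key)) return slot; // == fast path first
      }
      slot = (slot + 1) & mask;
    }
    return -1; // table full and no match
  }

  int indexOf(String key) { return indexOf(hashes, names, key); }

  boolean contains(String key) { return indexOf(key) >= 0; }

  int capacity() { return hashes.length; }

  public static void main(String[] args) {
    // Hypothetical tag names and ids, chosen only for the demo.
    StringIndexSketch index = StringIndexSketch.of("http.method", "http.url", "error");
    long[] ids = new long[index.capacity()]; // parallel payload, indexed by slot
    ids[index.indexOf("http.method")] = 101L;
    ids[index.indexOf("http.url")] = 102L;
    ids[index.indexOf("error")] = 103L;

    int slot = index.indexOf(new String("http.url")); // non-interned copy takes the equals path
    System.out.println(slot >= 0 ? ids[slot] : -1L);  // prints 102
    System.out.println(index.contains("missing"));    // prints false
  }
}
```

Because the payload lives in a caller-owned array, the index itself stays payload-free, and a lookup costs one probe sequence plus one array load.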

(Renamed from TagSet — the structure is more general than tags; a fixed name→id / membership index is just one of its uses.)

Why / benchmarks

The motivating use is a fast, declared name→id table — a "pit of success" alternative to hand-rolled string switches and per-name caches — but the structure is general. Shipped with benchmarks (Apple M1 Max, 8 threads, Java 8):

SetBenchmark (membership): StringIndex ≈ HashSet (~2.0–2.25B ops/s) and is ~2.5–3× faster than array/sortedArray/treeSet; the static Support path is ~12% faster than the instance wrapper and ~12% faster than HashSet on hits, with much tighter variance.
KeyOfBenchmark (name→id): StringIndex is ≈ 2× as fast as a hand-written string switch (at 16 cases; the gap widens at scale) and ~20% faster than HashMap.

StringIndexTest covers hashing/zero-sentinel, probe + wraparound, table-full, and the parallel-payload usage.

Follow-ups

A JOL retained-footprint comparison (StringIndex vs HashSet across sizes) will land in this PR — flat parallel arrays should beat HashSet's per-element Node objects, especially for small fixed sets. (Footprint, not alloc-rate — different axis.)
First real consumer to validate the indexOf→parallel-array ergonomics: TagInterceptor's hot-path tag-name switch.

🤖 Generated with Claude Code

TagSet (datadog.trace.util): a flat, allocation-free, open-addressed string set made of Support (static algorithm over raw int[] hashes / String[] names; the hot path, folds to constants when held in static finals), Data (build-time carrier), and a convenience instance wrapper. 2x-oversized (load factor <= 0.5), linear probe, interned-== fast path, 0 = empty sentinel. Consumers attach a parallel payload array indexed by the returned slot.

Benchmarks: SetBenchmark (membership vs HashSet/array/sortedArray/treeSet) and KeyOfBenchmark (name->id resolution vs HashMap and a hand-written switch).

TagSetTest covers hashing, probing/wraparound, table-full, and parallel payload.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Drop KeyOfBenchmark from the TagSet PR (it's a keyOf-integration benchmark, not a generic-set one)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…hmark

Adds get_tagSetMap / get_tagSetMap_sameKey as a build-once, read-only map option in the map menu: keys in a TagSet, values in a parallel int[] (no boxing); get is Support.indexOf plus one array load. Fastest get in the benchmark and the tightest error bars; results captured in the javadoc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@threads

The shared static sharedLookupIndex was written by every thread under @threads(8), so its cache-line ping-pong capped the get_* benchmarks near a ~1.4B ops/s contention ceiling -- the same artifact SetBenchmark fixed by moving its rotation counter to per-thread @State(Scope.Thread). Apply the same fix here; off the ceiling the get_* numbers rise and the differences between map options stop being compressed.

Replaces the two stale, unlabeled Java-21 result blocks (and the interim contention-capped get block) with a single env-stamped table from one run on a modern, representative JVM (Apple M1 Max, macOS 26.4.1, Zulu 17.0.7+7-LTS, 8 threads, 2 forks).

Header conclusions re-checked against these numbers: the fixed TagSet map leads HashMap ~30%/~50% (rotating/same-key), forEach matches the fastest map iterators, and create's HashMap-vs-LinkedHashMap edge is ~1.4x.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The structure is more general than tags: a flat, open-addressed String set whose indexOf returns a stable dense slot for parallel value arrays. "Tag" was just one prospective consumer.

Renames the class, test, and benchmark references (Support/Data nested types and the SetBenchmark / map-benchmark usages); leaves the unrelated propagation TagSetChanges in dd-trace-core alone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…edMapBenchmark

StringIndex's benchmark integration is moving to the dedicated benchmark PRs (set overhaul #11721, map overhaul #11679) and will be folded in there later. Revert both benchmark files to master so this PR is purely the StringIndex data structure + tests. Avoids the #11679/#11721 deletions-vs-edits conflicts too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dougqh added labels on Jun 17, 2026: comp: core (Tracer core), tag: ai generated (Largely based on code generated by an AI or LLM), tag: no release notes (Changes to exclude from release notes), type: enhancement (Enhancements and improvements)

dougqh and others added 4 commits June 17, 2026 07:14

Drop KeyOfBenchmark from the TagSet PR (it's a keyOf-integration benchmark, not a generic-set one)

39c91c0

dougqh changed the title ~~Add TagSet: a generic open-addressed string set~~ Add StringIndex: a generic open-addressed string set Jun 23, 2026

dougqh mentioned this pull request Jun 23, 2026

Overhaul set benchmarks: split Immutable / SingleThreaded, add Set.copyOf #11721

Draft

dougqh wants to merge 6 commits into master from dougqh/tagset
