Add dependency-free diarization error rate (DER) scoring by alexkroman · Pull Request #187 · AssemblyAI/cli

alexkroman · 2026-06-16T19:40:08Z

Adds a new aai_cli.core.der module that implements NIST diarization error rate (DER) scoring in pure Python, with no external dependencies beyond the standard library.

Summary

DER is the standard metric for evaluating speaker diarization: the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. This implementation complements the existing WER (word error rate) scoring in aai_cli.core.wer by measuring speaker attribution accuracy.
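
For reference, the metric can be written out as a formula. The symbol names below are illustrative and are not taken from the module:

```latex
\[
\mathrm{DER} \;=\;
\frac{T_{\text{missed}} + T_{\text{false alarm}} + T_{\text{confusion}}}
     {T_{\text{reference}}}
\]
% Worked example: 100 s of reference speech with 5 s missed,
% 3 s false alarm, and 2 s speaker confusion gives
% DER = (5 + 3 + 2) / 100 = 0.10.
```

Because false-alarm time is not bounded by the reference speech time, DER can exceed 1.0.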

Key Changes

New module aai_cli/core/der.py: Implements DER scoring with:
- Segment dataclass: represents a span of speech attributed to a speaker
- Score dataclass: holds the three NIST error components (missed, false_alarm, confusion) plus reference speech time, with computed properties for total errors and DER rate
- score() function: computes DER by partitioning the timeline at segment boundaries, tallying errors per atomic interval, and finding the optimal one-to-one speaker mapping via exhaustive search (feasible because diarization files have few speakers)
- pooled() function: aggregates scores across multiple files for corpus-level DER
- Helper functions for timeline analysis, speaker mapping, and cooccurrence tracking
Comprehensive test suite tests/test_der.py: 14 tests covering:
- Immutability of value types
- Perfect diarization (zero error)
- Speaker label remapping (labels are arbitrary, optimal mapping recovers correct attribution)
- Missed speech, false alarms, and speaker confusion scenarios
- Overlapping speakers (counted per speaker, not wall-clock time)
- Optimal vs. greedy speaker assignment
- Corpus-level pooling
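
To make the shape of the value types concrete, here is a minimal sketch of what frozen Segment and Score dataclasses and a pooled() aggregator could look like. The field names, property names, and the zero-reference behaviour are assumptions for illustration; the actual definitions in aai_cli/core/der.py may differ.

```python
from __future__ import annotations

from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A span of speech attributed to one speaker (field names are assumed)."""

    speaker: str
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class Score:
    """The three error components plus reference speech time, in seconds."""

    missed: float
    false_alarm: float
    confusion: float
    reference: float

    @property
    def errors(self) -> float:
        # Total error time is the sum of the three components.
        return self.missed + self.false_alarm + self.confusion

    @property
    def der(self) -> float:
        # Assumption: report 0.0 when there is no reference speech at all.
        return self.errors / self.reference if self.reference > 0 else 0.0


def pooled(scores: Iterable[Score]) -> Score:
    """Sum components across files so the corpus-level DER is time-weighted."""
    items = list(scores)
    return Score(
        missed=sum(s.missed for s in items),
        false_alarm=sum(s.false_alarm for s in items),
        confusion=sum(s.confusion for s in items),
        reference=sum(s.reference for s in items),
    )
```

Pooling sums the components before dividing, so a long file weighs more than a short one. For two files scoring 2/10 and 3/30, the pooled DER is 5/40 = 0.125, whereas the unweighted mean of the per-file rates would be 0.15.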

Implementation Details

No external dependencies: Uses only Python stdlib (itertools, dataclasses, collections.abc)
Optimal speaker mapping: Exhaustive permutation search over one-to-one mappings between reference and hypothesis speakers (factorial complexity is acceptable for typical diarization speaker counts of 2–10)
Timeline partitioning: Segments are split at every boundary so each atomic interval has a fixed set of active speakers, enabling accurate per-interval error tallying
Weighted by speaker-time: Reference speech time is counted per concurrent speaker, so overlapping speech is properly attributed
Immutable value types: Both Segment and Score are frozen dataclasses for safety
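
The partitioning and mapping steps described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code from aai_cli/core/der.py: segments are plain `(speaker, start, end)` tuples, the result is a plain dict, and the helper name `active` is hypothetical. It relies on the fact that missed and false-alarm time do not depend on the speaker mapping, so minimising confusion is the same as maximising correctly attributed time.

```python
from itertools import permutations


def active(segs, lo, hi):
    """Speakers whose segments cover the whole atomic interval [lo, hi)."""
    return {spk for spk, start, end in segs if start <= lo and hi <= end}


def score(ref, hyp):
    """Sketch of DER components for lists of (speaker, start, end) tuples."""
    # 1. Cut the shared timeline at every segment boundary.
    cuts = sorted({t for _, start, end in [*ref, *hyp] for t in (start, end)})

    missed = false_alarm = matchable = ref_time = 0.0
    cooc = {}  # (ref speaker, hyp speaker) -> co-active time
    for lo, hi in zip(cuts, cuts[1:]):
        dur = hi - lo
        r, h = active(ref, lo, hi), active(hyp, lo, hi)
        ref_time += dur * len(r)  # counted per concurrent speaker
        missed += dur * max(0, len(r) - len(h))
        false_alarm += dur * max(0, len(h) - len(r))
        matchable += dur * min(len(r), len(h))
        for a in r:
            for b in h:
                cooc[a, b] = cooc.get((a, b), 0.0) + dur

    # 2. Exhaustive one-to-one mapping: permute the larger speaker set
    #    against the smaller one and keep the best total co-active time.
    ref_spk = sorted({s for s, _, _ in ref})
    hyp_spk = sorted({s for s, _, _ in hyp})
    flip = len(ref_spk) > len(hyp_spk)
    small, large = (hyp_spk, ref_spk) if flip else (ref_spk, hyp_spk)
    best = 0.0
    for perm in permutations(large, len(small)):
        pairs = zip(perm, small) if flip else zip(small, perm)
        best = max(best, sum(cooc.get(p, 0.0) for p in pairs))

    # 3. Matchable time that the best mapping cannot explain is confusion.
    return {
        "missed": missed,
        "false_alarm": false_alarm,
        "confusion": matchable - best,
        "reference": ref_time,
    }
```

For example, a reference of `[("A", 0, 10), ("B", 10, 20)]` scored against a hypothesis of `[("x", 0, 20)]` has 20 s of matchable time, but a one-to-one mapping can pair `x` with only one reference speaker, so 10 s is counted as confusion.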

https://claude.ai/code/session_011qBbDEWrtpPVQVaVKAHnMx

Add aai_cli/core/der.py, a pure-Python diarization error rate scorer mirroring core/wer.py's shape (frozen Score, score, pooled). It computes the standard NIST/pyannote DER — missed / false-alarm / speaker-confusion time over reference speech time — by partitioning the shared timeline at every segment boundary and optimally mapping speaker labels via exact permutation search (diarization speaker counts are small). No new dependency: pyannote.metrics pulls numpy/scipy/pandas, and the lighter PyPI options still pull numpy or compile a C++ extension, whereas the current eval stack (jiwer) has neither numpy nor scipy. Not yet wired into `assembly eval` — DER needs reference speaker timing (RTTM-style segments) that the current text-only dataset path doesn't carry; that integration is a separate change.

Kill two mutation-gate survivors on core/der.py: assert Segment is frozen, and cover the case where reference and hypothesis each carry an unmatched speaker (so the optimal mapping must weigh a non-co-occurring pair). These tests were validated by the full gate but missed the prior commit's stale staging snapshot.

claude added 2 commits June 16, 2026 19:26

alexkroman enabled auto-merge June 16, 2026 19:40

alexkroman added this pull request to the merge queue Jun 16, 2026

Merged via the queue into main with commit f14616d Jun 16, 2026
19 checks passed

alexkroman deleted the claude/sweet-knuth-ozbgwo branch June 16, 2026 19:48

alexkroman merged 2 commits into main from claude/sweet-knuth-ozbgwo
