Add dependency-free diarization error rate (DER) scoring #187
Merged
Add aai_cli/core/der.py, a pure-Python diarization error rate scorer mirroring core/wer.py's shape (frozen Score, score, pooled). It computes the standard NIST/pyannote DER — missed / false-alarm / speaker-confusion time over reference speech time — by partitioning the shared timeline at every segment boundary and optimally mapping speaker labels via exact permutation search (diarization speaker counts are small). No new dependency: pyannote.metrics pulls numpy/scipy/pandas, and the lighter PyPI options still pull numpy or compile a C++ extension, whereas the current eval stack (jiwer) has neither numpy nor scipy. Not yet wired into `assembly eval` — DER needs reference speaker timing (RTTM-style segments) that the current text-only dataset path doesn't carry; that integration is a separate change.
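The approach described above (cut the timeline at every boundary, tally errors per atomic interval, then search speaker permutations) can be sketched roughly as follows. This is an illustrative sketch, not the code in `core/der.py`: the `Segment` field names, the `der_sketch` name, and the tuple return value are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Segment:
    # Hypothetical field names; the real Segment may be shaped differently.
    speaker: str
    start: float
    end: float


def active(segments, lo, hi):
    """Speakers with a segment covering the whole atomic interval [lo, hi)."""
    return {s.speaker for s in segments if s.start <= lo and hi <= s.end}


def der_sketch(reference, hypothesis):
    """Return (missed, false_alarm, confusion, reference_time) in seconds."""
    # Partition the shared timeline at every segment boundary, so each
    # interval has a fixed set of active reference and hypothesis speakers.
    cuts = sorted({t for s in (*reference, *hypothesis) for t in (s.start, s.end)})
    intervals = [
        (hi - lo, active(reference, lo, hi), active(hypothesis, lo, hi))
        for lo, hi in zip(cuts, cuts[1:])
    ]

    # Missed and false-alarm time depend only on speaker counts per interval.
    ref_time = sum(d * len(r) for d, r, _ in intervals)
    missed = sum(d * max(0, len(r) - len(h)) for d, r, h in intervals)
    false_alarm = sum(d * max(0, len(h) - len(r)) for d, r, h in intervals)
    matched = sum(d * min(len(r), len(h)) for d, r, h in intervals)

    # Pad the shorter speaker list with None so surplus speakers stay unmapped.
    ref_spk = sorted({s.speaker for s in reference})
    hyp_spk = sorted({s.speaker for s in hypothesis})
    size = max(len(ref_spk), len(hyp_spk))
    ref_pad = ref_spk + [None] * (size - len(ref_spk))
    hyp_pad = hyp_spk + [None] * (size - len(hyp_spk))

    # Exhaustive search: keep the one-to-one mapping with the most correct time.
    best_correct = 0.0
    for perm in permutations(hyp_pad):
        mapping = {
            h: r for r, h in zip(ref_pad, perm) if r is not None and h is not None
        }
        correct = sum(
            d * len(r & {mapping[x] for x in h if x in mapping})
            for d, r, h in intervals
        )
        best_correct = max(best_correct, correct)

    # Confusion is the matched time that the best mapping still gets wrong.
    confusion = matched - best_correct
    return missed, false_alarm, confusion, ref_time


ref = [Segment("A", 0.0, 10.0), Segment("B", 10.0, 20.0)]
hyp = [Segment("x", 0.0, 12.0), Segment("y", 12.0, 20.0)]
missed, false_alarm, confusion, ref_time = der_sketch(ref, hyp)
der = (missed + false_alarm + confusion) / ref_time
```

In this toy case the hypothesis switches speakers two seconds late, so the only error is two seconds of confusion out of twenty seconds of reference speech. The permutation loop is factorial in the speaker count, which is why it only suits the small speaker counts the commit message mentions.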
Kill two mutation-gate survivors on core/der.py: assert Segment is frozen, and cover the case where reference and hypothesis each carry an unmatched speaker (so the optimal mapping must weigh a non-co-occurring pair). These tests were validated by the full gate but missed the prior commit's stale staging snapshot.
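As a rough illustration of the first of those tests, a frozenness check might look like the sketch below. The `Segment` fields and the `is_frozen` helper are hypothetical and are not taken from `tests/test_der.py`. A mutant that flips `frozen=True` to `frozen=False` would make the check return `False`, which is what lets a test of this shape kill it.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical field names, for illustration only.
    speaker: str
    start: float
    end: float


@dataclass
class MutableSegment:
    # Stand-in for the mutant: the same shape without frozen=True.
    speaker: str
    start: float
    end: float


def is_frozen(instance, field, value):
    """Return True if assigning to `field` raises FrozenInstanceError."""
    try:
        setattr(instance, field, value)
    except FrozenInstanceError:
        return True
    return False


frozen_result = is_frozen(Segment("A", 0.0, 1.0), "start", 5.0)
mutant_result = is_frozen(MutableSegment("A", 0.0, 1.0), "start", 5.0)
```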
Adds a new `aai_cli.core.der` module that implements NIST diarization error rate (DER) scoring in pure Python, with no external dependencies beyond the standard library.

## Summary

DER is the standard metric for evaluating speaker diarization: the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. This implementation complements the existing WER (word error rate) scoring in `aai_cli.core.wer` by measuring speaker attribution accuracy.

## Key Changes

- New module `aai_cli/core/der.py`: implements DER scoring with:
  - `Segment` dataclass: represents a span of speech attributed to a speaker
  - `Score` dataclass: holds the three NIST error components (`missed`, `false_alarm`, `confusion`) plus reference speech time, with computed properties for total errors and DER rate
  - `score()` function: computes DER by partitioning the timeline at segment boundaries, tallying errors per atomic interval, and finding the optimal one-to-one speaker mapping via exhaustive search (feasible because diarization files have few speakers)
  - `pooled()` function: aggregates scores across multiple files for corpus-level DER
- Comprehensive test suite `tests/test_der.py`: 14 tests

## Implementation Details

- Standard library only (`itertools`, `dataclasses`, `collections.abc`)
- `Segment` and `Score` are frozen dataclasses for safety

https://claude.ai/code/session_011qBbDEWrtpPVQVaVKAHnMx
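To make the corpus-level aggregation concrete, here is a minimal sketch of a frozen `Score` and a `pooled()` that sums components before dividing. The field name `reference`, the property names `errors` and `der`, and the zero-reference fallback are assumptions for illustration; the actual module may differ.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    missed: float
    false_alarm: float
    confusion: float
    reference: float  # hypothetical name for total reference speech time

    @property
    def errors(self) -> float:
        """Total error time across the three components."""
        return self.missed + self.false_alarm + self.confusion

    @property
    def der(self) -> float:
        # Assumption: report 0.0 when there is no reference speech at all.
        return self.errors / self.reference if self.reference > 0 else 0.0


def pooled(scores: Iterable[Score]) -> Score:
    """Sum each component so the pooled DER is weighted by speech time."""
    items = list(scores)
    return Score(
        missed=sum(s.missed for s in items),
        false_alarm=sum(s.false_alarm for s in items),
        confusion=sum(s.confusion for s in items),
        reference=sum(s.reference for s in items),
    )


short_file = Score(missed=1.0, false_alarm=0.0, confusion=1.0, reference=10.0)
long_file = Score(missed=0.0, false_alarm=0.0, confusion=0.0, reference=30.0)
corpus = pooled([short_file, long_file])
mean_of_rates = (short_file.der + long_file.der) / 2
```

Pooling the components gives a different number from averaging per-file rates: here the short file has a DER of 0.2 and the long file 0.0, so the unweighted mean is 0.1, while the pooled DER is 2 seconds of error over 40 seconds of reference speech.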