
fix(sortformer): consume BNNS-fixed v3 models + config-mismatch guard (#726) #728

Open
Alex-Wengg wants to merge 6 commits into main from fix/sortformer-bnns-crash-726

Conversation

@Alex-Wengg Alex-Wengg commented Jun 21, 2026

Member

Summary

Fixes the Sortformer BNNS graph-compile crash from #726 and hardens the config + device-compat path.

The root-level Sortformer CoreML models had chunk_pre_encoder_embs_out as both a graph input and output in the head submodel, which the macOS 26 / newer BNNS compiler rejects:

BNNS Graph Compile: Function main has tensor chunk_pre_encoder_embs_out as both an input and output.

Root cause was a conversion-toolchain artifact (torch 2.9.x folded the identity that kept input/output distinct). The models were rebuilt clean (torch 2.7 + coremltools 9.0), verified no-alias + ANE-loadable + numerically matched to the PyTorch reference (100% speaker-argmax parity), and uploaded to FluidInference/diar-streaming-sortformer-coreml/v3/.
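
The "no-alias" check mentioned above can be sketched as a set intersection over a model's input and output tensor names. This is a minimal illustration, not the verification script used for the rebuild: `aliasedTensorNames` is a hypothetical helper, and every name in the lists except `chunk_pre_encoder_embs_out` is a placeholder. With Core ML, the two lists could presumably be taken from the keys of `modelDescription.inputDescriptionsByName` and `outputDescriptionsByName`.

```swift
import Foundation

/// Returns tensor names that appear as both a graph input and a graph output.
/// Hypothetical helper for illustration; not part of FluidAudio.
func aliasedTensorNames(inputs: [String], outputs: [String]) -> [String] {
    Set(inputs).intersection(outputs).sorted()
}

// Placeholder name lists, loosely modeled on the head submodel described above.
let headInputs = ["chunk", "chunk_pre_encoder_embs_out"]
let headOutputs = ["speaker_preds", "chunk_pre_encoder_embs_out"]

let aliased = aliasedTensorNames(inputs: headInputs, outputs: headOutputs)
if aliased.isEmpty {
    print("no input/output aliasing")
} else {
    print("aliased tensors: \(aliased)")
}
```

A model that passes this check has an empty intersection, which is the property the newer BNNS compiler enforces according to the error message above.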

Changes

  • Point downloads at the fixed models: default v3/fp16 set.
  • Runtime precision selection (addresses reporter feedback that modelsSubdirectory was a let): adds ModelNames.Sortformer.ModelPrecision { .fp16, .palettized } and a mutable SortformerConfig.precision (default .fp16), which bundle(for:) honors. Switch to the 6-bit, ~2.5× smaller set without editing source via var c: SortformerConfig = .highContextV2_1; c.precision = .palettized, or pass sortformer --palettized.
  • A14 compute-unit auto-fallback (addresses reporter feedback that the large fp16 high-context model hangs for minutes on .all): SortformerModels.recommendedComputeUnits(for:) routes the large fp16 high-context variants to .cpuOnly on devices with less than 8 GB of RAM, while the ~330 MB palettized high-context head, which loads fine on the ANE, keeps .all. The recommendation is threaded through load/loadFromHuggingFace/initialize; all three default to auto and remain overridable. Also fixes load() ignoring its compute-unit argument.
  • efficientV2_1 variant: chunk_len=25, ~2 s output latency, ~4× the RTFx of the default streaming config at near-identical per-inference cost.
  • Config-mismatch guard: SortformerDiarizer validates the diarizer SortformerConfig against the model's embedded metadata (chunk_len/contexts/fifo_len/spkcache_len) on init and logs a clear error on mismatch. spkcacheUpdatePeriod is excluded since the host clamps it.
  • CLI ergonomics: sortformer --config fast|efficient|low|high [--palettized]; sortformer-benchmark --collar/--onset/--offset.
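
As a rough illustration of the runtime precision selection in the list above, the sketch below uses stand-in types that mirror the names in this PR. The real `ModelPrecision` and `SortformerConfig` live in FluidAudio and have more fields; `modelPath` is a hypothetical helper, and whether every variant exists under `v3/palettized/` is an assumption.

```swift
import Foundation

// Stand-in types mirroring the names in this PR; not the FluidAudio definitions.
enum ModelPrecision: String {
    case fp16, palettized
    /// Subdirectory inside the model repo, e.g. "v3/fp16".
    var subdirectory: String { "v3/\(rawValue)" }
}

struct SortformerConfig {
    var chunkLen: Int
    var precision: ModelPrecision = .fp16  // fp16 stays the default

    // chunk_len=25 is the only preset value stated in the PR description.
    static let efficientV2_1 = SortformerConfig(chunkLen: 25)
}

/// Hypothetical helper: builds the repo-relative path for a model file.
func modelPath(fileName: String, config: SortformerConfig) -> String {
    "\(config.precision.subdirectory)/\(fileName)"
}

// Because precision is a mutable property, callers can flip it without editing source.
var config = SortformerConfig.efficientV2_1
config.precision = .palettized
print(modelPath(fileName: "SortformerEfficient_v2.1.mlmodelc", config: config))
```

Keeping the precision on the config rather than in a global constant means two diarizers in the same process can load different precisions side by side.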

Benchmarks

RAM (highContextV2_1): fp16 ~2.4 GB → palettized ~330 MB (reporter-confirmed).

Offline throughput vs Argmax (M5 Pro, ComputeUnit.ALL, 30.72 s windows, median of 120). Argmax's Sortformer is an offline batch model (no streaming state), so we exported ours as a single fused offline graph and benchmarked it head-to-head against their 3-model chain:

| | model-exec (mel → preds) | end-to-end (incl. mel) |
| --- | --- | --- |
| Argmax (3 calls) | 14.57 ms · 2108× | 16.41 ms · 1872× |
| FluidAudio (fused) | 10.65 ms · 2884× | 12.49 ms · 2459× |

FluidAudio is 1.3–1.4× faster offline: one fused GPU graph vs their ANE→GPU split. The ">10× faster" figure sometimes cited for Argmax compares their offline model against our highContextV2_1 streaming config (the slowest and largest variant), which is an apples-to-oranges comparison. For low-latency streaming throughput, use .efficientV2_1 (~215× RTFx); Argmax ships no streaming Sortformer. Repro script: mobius#73 offline_argmax_bench.py. Full writeup: Documentation/Diarization/Sortformer.md#benchmarks.
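
The RTFx and speedup figures follow directly from the window length and the per-window latencies in the table. The sketch below recomputes them; `rtfx` is a hypothetical helper, and results can differ from the table by a unit in the last digit because the published latencies are rounded.

```swift
import Foundation

/// Real-time factor: seconds of audio processed per second of compute.
func rtfx(audioSeconds: Double, latencyMs: Double) -> Double {
    audioSeconds / (latencyMs / 1000.0)
}

let window = 30.72  // seconds of audio per window, from the benchmark setup

let argmaxExec = rtfx(audioSeconds: window, latencyMs: 14.57)
let fluidExec = rtfx(audioSeconds: window, latencyMs: 10.65)

// Speedup is the ratio of latencies for the same window.
let speedupExec = 14.57 / 10.65
let speedupEndToEnd = 16.41 / 12.49

print(String(format: "model-exec RTFx: Argmax %.1f, FluidAudio %.1f", argmaxExec, fluidExec))
print(String(format: "speedup: %.2fx model-exec, %.2fx end-to-end", speedupExec, speedupEndToEnd))
```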

Validation

  • All CI green (build/test macOS + iOS, swift-format, Sortformer benchmark).
  • Numerical parity vs NeMo PyTorch reference = 100% speaker-argmax agreement across all variants.
  • Full AMI-SDM DER (forced-alignment GT, collar 0.25): highContext ~26.5%, default streaming ~29.0%. 6-bit palettization costs +0.9 pp on average for the streaming configs and more on high-context, which is why fp16 stays the default.

…tch guard (#726)

The root-level Sortformer CoreML models hit a BNNS graph-compile crash on newer
BNNS ("tensor chunk_pre_encoder_embs_out as both an input and output"). The fixed
rebuild lives at v3/fp16/ in the HF repo; point ModelNames there so downloads pick
up the working models.

- ModelNames.Sortformer.modelsSubdirectory = "v3/fp16" (BNNS-fixed set); v3/palettized
  is the 6-bit, ~2.5x-smaller set for RAM-constrained devices.
- Add efficientV2_1 variant (chunk_len=25, ~2s latency, ~4x RTFx of fast) + config preset.
- SortformerDiarizer now validates the diarizer config against the model's embedded
  metadata on init and logs a clear error on mismatch (a mismatch silently produced
  incorrect/slow diarization — #726). spkcacheUpdatePeriod excluded (host-clamped).
- CLI: `sortformer --config fast|efficient|low|high`; `sortformer-benchmark --collar`,
  `--onset`, `--offset` (the hardcoded collar=0 / onset=0.5 skewed reported DER).
@github-actions

github-actions Bot commented Jun 21, 2026

Supertonic3 Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download (incl. VectorEstimatorVariants/ int4 buckets) | |
| Model load | |
| Synthesis pipeline (--ve-variant int4) | |
| Output WAV | ✅ (364.7 KB) |

Runtime: 0m17s

Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.

@github-actions

github-actions Bot commented Jun 21, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 10.4% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 9.16x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 16.918 | 14.8 | Fetching diarization models |
| Model Compile | 7.250 | 6.3 | CoreML compilation |
| Audio Load | 0.064 | 0.1 | Loading audio file |
| Segmentation | 28.610 | 25.0 | VAD + speech detection |
| Embedding | 114.228 | 99.7 | Speaker embedding extraction |
| Clustering (VBx) | 0.154 | 0.1 | Hungarian algorithm + VBx clustering |
| Total | 114.618 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 143.0s processing • Test runtime: 2m 30s • 06/24/2026, 10:07 AM EST

@github-actions

github-actions Bot commented Jun 21, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
| --- | --- | --- |
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 5.75x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 80.1s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Avg Chunk Time | 0.080s | Average chunk processing time |
| Max Chunk Time | 0.160s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m31s • 06/24/2026, 10:01 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions Bot commented Jun 21, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 22.50x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 12.495 | 26.8 | Fetching diarization models |
| Model Compile | 5.355 | 11.5 | CoreML compilation |
| Audio Load | 0.075 | 0.2 | Loading audio file |
| Segmentation | 13.983 | 30.0 | Detecting speech regions |
| Embedding | 23.305 | 50.0 | Extracting speaker voices |
| Clustering | 9.322 | 20.0 | Grouping same speakers |
| Total | 46.631 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at 150x real-time (RTFx)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 46.6s diarization time • Test runtime: 2m 34s • 06/24/2026, 10:08 AM EST

@github-actions

github-actions Bot commented Jun 21, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| DER | 30.3% | <35% | |
| Miss Rate | 28.2% | - | - |
| False Alarm | 0.9% | - | - |
| Speaker Error | 1.2% | - | - |
| RTFx | 21.0x | >1.0x | |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 3m 7s • 2026-06-24T14:00:56.857Z
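
The DER in this table is the sum of its three reported components, which is the standard decomposition (missed speech, false alarm, and speaker confusion, each as a share of reference speech time). A small sketch, with `der` as a hypothetical helper:

```swift
import Foundation

/// DER as the sum of its three error components, each in percent of reference speech.
func der(miss: Double, falseAlarm: Double, speakerError: Double) -> Double {
    miss + falseAlarm + speakerError
}

// Component values taken from the ES2004a table above.
let total = der(miss: 28.2, falseAlarm: 0.9, speakerError: 1.2)
let missShare = 28.2 / total

print(String(format: "DER = %.1f%%", total))
print(String(format: "missed speech accounts for %.0f%% of the error", missShare * 100))
```

Read this way, almost all of the error in this run is missed speech rather than speaker confusion.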

@github-actions

github-actions Bot commented Jun 21, 2026

PocketTTS Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (157.5 KB) |

Runtime: 0m9s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

@github-actions

github-actions Bot commented Jun 21, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 571.5x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 721.0x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%
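
The F1 column is consistent with the reported precision and recall. As a quick check, F1 is the harmonic mean of the two; `f1Score` below is a hypothetical helper, not the benchmark's own code.

```swift
import Foundation

/// F1 is the harmonic mean of precision and recall (both as fractions in 0...1).
func f1Score(precision: Double, recall: Double) -> Double {
    guard precision + recall > 0 else { return 0 }
    return 2 * precision * recall / (precision + recall)
}

// Precision 86.2% and recall 100.0%, from the table above.
let f1 = f1Score(precision: 0.862, recall: 1.0)
print(String(format: "F1 = %.1f%%", f1 * 100))
```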

@github-actions

github-actions Bot commented Jun 21, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 5.08x | |
| test-other | 1.19% | 0.00% | 3.75x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.80% | 0.00% | 5.55x | |
| test-other | 1.00% | 0.00% | 3.61x | |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.68x | Streaming real-time factor |
| Avg Chunk Time | 1.343s | Average time to process each chunk |
| Max Chunk Time | 1.599s | Maximum chunk processing time |
| First Token | 1.575s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.64x | Streaming real-time factor |
| Avg Chunk Time | 1.453s | Average time to process each chunk |
| Max Chunk Time | 1.921s | Maximum chunk processing time |
| First Token | 1.460s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 6m12s • 06/24/2026, 10:04 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@beta-devin-ai-integration beta-devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

🚩 requiredModels now downloads ALL 7 variants including efficientV2_1

Sortformer.requiredModels (Sources/FluidAudio/ModelNames.swift:725) returns Set(Variant.allCases.map(\.fileName)) which now includes all 7 variants. Any code path that downloads the full required set (i.e., when variant is nil) will now also attempt to download v3/fp16/SortformerEfficient_v2.1.mlmodelc. This is fine as long as that model file exists in the HuggingFace repo at https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml/tree/main/v3/fp16/. If it hasn't been uploaded yet, full-set downloads would fail.

(Refers to lines 724-727)
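
To illustrate why adding an enum case silently widens the download set, here is a minimal sketch of the `CaseIterable` pattern the comment describes, plus a hypothetical pre-flight check against a remote file listing. Only two of the seven variants are shown, the high-context file name is an assumption, and `missingFiles` is not part of FluidAudio.

```swift
import Foundation

// Illustrative subset; the real Variant enum has 7 cases per the review comment.
enum Variant: CaseIterable {
    case efficientV2_1
    case highContextV2_1

    var fileName: String {
        switch self {
        case .efficientV2_1: return "SortformerEfficient_v2.1.mlmodelc"
        case .highContextV2_1: return "SortformerHighContext_v2.1.mlmodelc"  // assumed name
        }
    }
}

// Every new case is picked up automatically, with no change at the call site.
let requiredModels = Set(Variant.allCases.map(\.fileName))

/// Hypothetical check: files in `required` that are absent from the remote listing.
func missingFiles(required: Set<String>, remote: Set<String>) -> [String] {
    required.subtracting(remote).sorted()
}

// Placeholder remote listing in which the new variant has not been uploaded yet.
let remoteListing: Set<String> = ["SortformerHighContext_v2.1.mlmodelc"]
print(missingFiles(required: requiredModels, remote: remoteListing))
```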

Comment on lines +125 to +148
private func validateConfigMatch(_ models: SortformerModels) {
    guard let embedded = models.embeddedConfig else { return }
    let current = SortformerModels.EmbeddedConfig(
        chunkLen: config.chunkLen,
        chunkLeftContext: config.chunkLeftContext,
        chunkRightContext: config.chunkRightContext,
        fifoLen: config.fifoLen,
        spkcacheLen: config.spkcacheLen
    )
    guard current != embedded else { return }
    logger.error(
        """
        Sortformer config mismatch — diarizer config does not match the loaded model. \
        This produces incorrect and much slower diarization (issue #726). \
        diarizer(chunkLen=\(current.chunkLen), leftCtx=\(current.chunkLeftContext), \
        rightCtx=\(current.chunkRightContext), fifoLen=\(current.fifoLen), \
        spkcacheLen=\(current.spkcacheLen)) \
        vs model(chunkLen=\(embedded.chunkLen), leftCtx=\(embedded.chunkLeftContext), \
        rightCtx=\(embedded.chunkRightContext), fifoLen=\(embedded.fifoLen), \
        spkcacheLen=\(embedded.spkcacheLen)). \
        Construct SortformerDiarizer with the SortformerConfig matching the model variant.
        """
    )
}

🚩 validateConfigMatch only warns, never fails — silent mismatch possible in production

The validateConfigMatch method at Sources/FluidAudio/Diarizer/Sortformer/SortformerDiarizer.swift:125-148 logs an error but does not throw or prevent initialization when a config mismatch is detected. This is likely intentional (backward compatibility, older models without metadata), but it means a misconfigured diarizer will silently produce incorrect results in production where log output may not be monitored. If the embedded metadata is present AND mismatched, this is almost certainly a programmer error. Consider whether this should throw in a future iteration.
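
One possible shape for the throwing variant suggested here is sketched below. It is not the PR's implementation: `EmbeddedConfig` is a stand-in for `SortformerModels.EmbeddedConfig`, `ConfigError` and `validateStrict` are hypothetical names, and all numeric values except chunkLen 25 are placeholders. Models without embedded metadata still pass, which preserves the backward-compatibility behavior noted above.

```swift
import Foundation

// Stand-in for SortformerModels.EmbeddedConfig, with the same five fields.
struct EmbeddedConfig: Equatable {
    var chunkLen: Int
    var chunkLeftContext: Int
    var chunkRightContext: Int
    var fifoLen: Int
    var spkcacheLen: Int
}

// Hypothetical error type carrying both sides of the mismatch.
enum ConfigError: Error, Equatable {
    case mismatch(expected: EmbeddedConfig, actual: EmbeddedConfig)
}

/// Throws when metadata is present and disagrees; models without metadata pass.
func validateStrict(current: EmbeddedConfig, embedded: EmbeddedConfig?) throws {
    guard let embedded = embedded else { return }
    guard current == embedded else {
        throw ConfigError.mismatch(expected: embedded, actual: current)
    }
}

// Placeholder values for illustration only.
let modelConfig = EmbeddedConfig(
    chunkLen: 25, chunkLeftContext: 1, chunkRightContext: 1, fifoLen: 40, spkcacheLen: 188)
var diarizerConfig = modelConfig
diarizerConfig.chunkLen = 6

do {
    try validateStrict(current: diarizerConfig, embedded: modelConfig)
    print("config matches")
} catch {
    print("refusing to initialize: \(error)")
}
```

Failing at init moves the problem from a log line to the call site, at the cost of a source-breaking change for callers whose initializer is not already throwing.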

…t fallback (#726)

Addresses two follow-ups from the #726 reporter:

- modelsSubdirectory was a let constant, so switching to the smaller
  palettized models (2.4GB -> 330MB RAM) required editing source. Add
  ModelNames.Sortformer.ModelPrecision and a mutable SortformerConfig.precision
  so callers select fp16 (default) vs palettized at runtime; bundle(for:)
  honors it. CLI: --palettized.

- The ~2.4GB fp16 high-context head triggers a multi-minute ANE compile
  hang on RAM-constrained devices (A14). Add recommendedComputeUnits(for:):
  large fp16 high-context variants on <8GB devices load with .cpuOnly,
  everything else (incl. the ANE-friendly palettized head) keeps .all.
  Wired through load/loadFromHuggingFace/initialize; computeUnits remains
  overridable. Also fixes load() ignoring its compute-unit argument.
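
The routing rule in the second bullet can be sketched as a pure function of variant, precision, and device memory. This is an assumption-laden illustration: `ComputeUnits` stands in for Core ML's `MLComputeUnits`, the parameter list differs from the real `recommendedComputeUnits(for:)`, and the exact 8 GiB comparison is a guess at how "<8GB" is measured (devices typically report slightly less than their nominal RAM).

```swift
import Foundation

// Stand-ins for illustration; the real API works with Core ML's MLComputeUnits.
enum ComputeUnits { case all, cpuOnly }
enum Precision { case fp16, palettized }

/// Routes large fp16 high-context models to CPU on devices with under 8 GiB of RAM.
/// Everything else, including the palettized high-context head, keeps `.all`.
func recommendedComputeUnits(
    isHighContext: Bool,
    precision: Precision,
    physicalMemoryBytes: UInt64
) -> ComputeUnits {
    let eightGiB: UInt64 = 8 * 1024 * 1024 * 1024
    let lowMemory = physicalMemoryBytes < eightGiB
    return (isHighContext && precision == .fp16 && lowMemory) ? .cpuOnly : .all
}

// ProcessInfo reports the memory of the machine running this sketch.
let thisDevice = ProcessInfo.processInfo.physicalMemory
print(recommendedComputeUnits(isHighContext: true, precision: .fp16, physicalMemoryBytes: thisDevice))
```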
… benchmark (#726)

Refresh the Model Variants table for the v3 model set (fp16/palettized paths,
efficientV2_1, config-mismatch note). Document precision selection (RAM/DER
trade), the A14 compute-unit auto-fallback, and the offline head-to-head vs
Argmax: FluidAudio's fused offline graph is 1.3-1.4x faster (10.65ms/2884x vs
14.57ms/2108x encoder-only on M5 Pro), with the >10x Argmax claim explained as
a streaming-vs-offline mismatch.
@Alex-Wengg Alex-Wengg force-pushed the fix/sortformer-bnns-crash-726 branch from a6ff757 to 32a5759 Compare June 24, 2026 01:12
…stitching)

Add OfflineSortformerDiarizer backed by the fused offline Sortformer model
(mel -> speaker_preds, 30.72s window, no streaming state) — one CoreML call per
window, the fastest batch path (1.3-1.4x faster than Argmax offline, #726).

- OfflineSortformerConfig / OfflineSortformerModels.runOffline (2-input graph,
  distinct from the streaming 6-input runMainModel)
- Long audio tiled into overlapping windows; SortformerSpeakerStitcher recovers
  the cross-window speaker permutation (brute-force 4! over the overlap) so IDs
  stay globally consistent
- ModelNames.Sortformer.offlineBundle(precision:) -> v3/{fp16,palettized}/SortformerOffline_v2.1.mlmodelc
- CLI: sortformer --offline [--palettized]
- Tests for the stitcher, config, and bundle paths

Validated end-to-end: 288.6s audio -> 13 windows in 1.02s (281.9x RTFx),
consistent speaker IDs across all window boundaries.
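
The cross-window stitching step can be sketched as a brute-force search over all 4! = 24 speaker permutations, scoring each against the frames the two windows share. This is a minimal illustration rather than SortformerSpeakerStitcher itself: the squared-error cost, the function names, and the `[frame][speaker]` layout are all assumptions.

```swift
import Foundation

/// All permutations of 0..<n (n = 4 gives the 24 candidates).
func permutations(_ n: Int) -> [[Int]] {
    if n == 0 { return [[]] }
    var result: [[Int]] = []
    for p in permutations(n - 1) {
        // Insert the new element at every position of each smaller permutation.
        for i in 0...p.count {
            var q = p
            q.insert(n - 1, at: i)
            result.append(q)
        }
    }
    return result
}

/// Finds `perm` such that speaker `s` in the previous window best matches speaker
/// `perm[s]` in the next window, by squared error over the overlapping frames.
/// `prev` and `next` hold [frame][speaker] activity probabilities for the same frames.
func bestPermutation(prev: [[Double]], next: [[Double]]) -> [Int] {
    let speakers = prev.first?.count ?? 0
    var best = Array(0..<speakers)
    var bestCost = Double.infinity
    for perm in permutations(speakers) {
        var cost = 0.0
        for (a, b) in zip(prev, next) {
            for s in 0..<speakers {
                let d = a[s] - b[perm[s]]
                cost += d * d
            }
        }
        if cost < bestCost {
            bestCost = cost
            best = perm
        }
    }
    return best
}

// Toy overlap: the next window sees the same speakers with slots 0/1 and 2/3 swapped.
let prevOverlap = [[0.9, 0.1, 0.5, 0.3], [0.2, 0.7, 0.6, 0.4]]
let nextOverlap = [[0.1, 0.9, 0.3, 0.5], [0.7, 0.2, 0.4, 0.6]]
let mapping = bestPermutation(prev: prevOverlap, next: nextOverlap)
print(mapping)
```

With only four speaker slots, exhaustive search is cheap enough that an assignment solver is unnecessary. As the later commit notes, this kind of stitching only aligns labels across windows and cannot repair confusion inside a window.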
Document OfflineSortformerDiarizer whole-file throughput (fused model + speaker
stitching) alongside the existing offline-vs-Argmax model-exec numbers.
… docs

Add --offline to sortformer-benchmark (whole-file fused path) and document the
measured limitation: offline matches streaming on detection (identical Miss/FA)
and is ~1400x RTFx, but speaker confusion on long multi-speaker audio gives
~56% DER vs ~26% streaming on AMI-SDM (no spkcache; cross-window stitching can't
recover within-window confusion). Scope offline to short/few-speaker/throughput;
streaming stays the long-form path.