fix(sortformer): consume BNNS-fixed v3 models + config-mismatch guard (#726) by Alex-Wengg · Pull Request #728 · FluidInference/FluidAudio

Alex-Wengg · 2026-06-21T22:09:38Z

Summary

Fixes the Sortformer BNNS graph-compile crash from #726 and hardens the config + device-compat path.

The root-level Sortformer CoreML models had chunk_pre_encoder_embs_out as both a graph input and output in the head submodel, which the macOS 26 / newer BNNS compiler rejects:

BNNS Graph Compile: Function main has tensor chunk_pre_encoder_embs_out as both an input and output.

Root cause was a conversion-toolchain artifact (torch 2.9.x folded the identity that kept input/output distinct). The models were rebuilt clean (torch 2.7 + coremltools 9.0), verified no-alias + ANE-loadable + numerically matched to the PyTorch reference (100% speaker-argmax parity), and uploaded to FluidInference/diar-streaming-sortformer-coreml/v3/.

Changes

Point downloads at the fixed models: default v3/fp16 set.
Runtime precision selection (addresses reporter feedback — modelsSubdirectory was a let): ModelNames.Sortformer.ModelPrecision { .fp16, .palettized } + mutable SortformerConfig.precision (default .fp16); bundle(for:) honors it. Flip to the 6-bit, ~2.5× smaller set without editing source — var c = .highContextV2_1; c.precision = .palettized — or sortformer --palettized.
A14 compute-unit auto-fallback (addresses reporter feedback — large fp16 high-context hangs for minutes on .all): SortformerModels.recommendedComputeUnits(for:) routes the large fp16 high-context variants to .cpuOnly on <8 GB devices, while the ~330 MB palettized high-context head (loads fine on ANE) keeps .all. Threaded through load/loadFromHuggingFace/initialize (all default to auto, still overridable). Also fixes load() ignoring its compute-unit argument.
efficientV2_1 variant: chunk_len=25, ~2 s output latency, ~4× the RTFx of the default streaming config at near-identical per-inference cost.
Config-mismatch guard: SortformerDiarizer validates the diarizer SortformerConfig against the model's embedded metadata (chunk_len/contexts/fifo_len/spkcache_len) on init and logs a clear error on mismatch. spkcacheUpdatePeriod is excluded since the host clamps it.
CLI ergonomics: sortformer --config fast|efficient|low|high [--palettized]; sortformer-benchmark --collar/--onset/--offset.
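The precision switch and the compute-unit fallback described in the list above can be sketched as follows. This is a minimal illustration, not FluidAudio source: `SketchConfig`, `ComputeUnits`, `isHighContext`, and the exact 8 GiB threshold are hypothetical stand-ins, and a real device may report slightly less physical memory than its marketed size.

```swift
import Foundation

// Hypothetical stand-ins; names echo the PR description, bodies are assumptions.
enum ModelPrecision: String {
    case fp16
    case palettized
}

enum ComputeUnits {
    case all
    case cpuOnly
}

struct SketchConfig {
    var isHighContext: Bool
    var precision: ModelPrecision = .fp16  // fp16 stays the default

    // Subdirectory of the model repo holding this precision's files.
    var modelsSubdirectory: String { "v3/\(precision.rawValue)" }
}

// Assumed reading of "<8 GB devices" from the description.
let lowMemoryThreshold: UInt64 = 8 * 1024 * 1024 * 1024

// Only the large fp16 high-context head is routed to the CPU on small devices;
// every other combination keeps all compute units.
func recommendedComputeUnits(
    for config: SketchConfig,
    physicalMemory: UInt64 = ProcessInfo.processInfo.physicalMemory
) -> ComputeUnits {
    let isLargeHead = config.isHighContext && config.precision == .fp16
    return (isLargeHead && physicalMemory < lowMemoryThreshold) ? .cpuOnly : .all
}

var config = SketchConfig(isHighContext: true)
let sixGB: UInt64 = 6 * 1024 * 1024 * 1024
print(recommendedComputeUnits(for: config, physicalMemory: sixGB))  // cpuOnly

config.precision = .palettized
print(config.modelsSubdirectory)  // v3/palettized
print(recommendedComputeUnits(for: config, physicalMemory: sixGB))  // all
```

Passing `physicalMemory` as a parameter keeps the routing rule testable on any machine; callers that want the device default simply omit it.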

Benchmarks

RAM (highContextV2_1): fp16 ~2.4 GB → palettized ~330 MB (reporter-confirmed).

Offline throughput vs Argmax (M5 Pro, ComputeUnit.ALL, 30.72 s windows, median of 120). Argmax's Sortformer is an offline batch model (no streaming state), so we exported ours as a single fused offline graph and benchmarked it head-to-head against their 3-model chain:

	model-exec (mel → preds)	end-to-end (incl. mel)
Argmax (3 calls)	14.57 ms · 2108×	16.41 ms · 1872×
FluidAudio (fused)	10.65 ms · 2884×	12.49 ms · 2459×

FluidAudio is 1.3–1.4× faster offline — one fused GPU graph vs their ANE→GPU split. The ">10× faster" sometimes cited for Argmax compares their offline model against our highContextV2_1 streaming config (slowest/largest variant) — apples-to-oranges. For low-latency streaming throughput use .efficientV2_1 (~215× RTFx); Argmax ships no streaming Sortformer. Repro script: mobius#73 offline_argmax_bench.py. Full writeup: Documentation/Diarization/Sortformer.md#benchmarks.
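The RTFx figures in the table follow directly from the 30.72 s window and the measured latencies. The check below is a small sketch of that arithmetic; the `rtfx` helper is a hypothetical name, and truncating to an integer is an assumption that happens to reproduce the reported values.

```swift
import Foundation

/// Real-time factor: seconds of audio processed per second of wall-clock time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

let windowSeconds = 30.72

// Model-exec column: 14.57 ms (Argmax) vs 10.65 ms (FluidAudio).
let argmaxExec = rtfx(audioSeconds: windowSeconds, processingSeconds: 14.57e-3)
let fluidExec = rtfx(audioSeconds: windowSeconds, processingSeconds: 10.65e-3)
print(Int(argmaxExec), Int(fluidExec))  // 2108 2884

// Speedup is the latency ratio, for model-exec and end-to-end respectively.
print(String(format: "%.2f", 14.57 / 10.65))  // 1.37
print(String(format: "%.2f", 16.41 / 12.49))  // 1.31
```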

Validation

All CI green (build/test macOS + iOS, swift-format, Sortformer benchmark).
Numerical parity vs NeMo PyTorch reference = 100% speaker-argmax agreement across all variants.
Full AMI-SDM DER (forced-alignment GT, collar 0.25): highContext ~26.5%, default streaming ~29.0%. 6-bit palettization adds +0.9 pp DER on average for streaming and more on high-context, which is why fp16 stays the default.

…tch guard (#726)

The root-level Sortformer CoreML models hit a BNNS graph-compile crash on newer BNNS ("tensor chunk_pre_encoder_embs_out as both an input and output"). The fixed rebuild lives at v3/fp16/ in the HF repo; point ModelNames there so downloads pick up the working models.

- ModelNames.Sortformer.modelsSubdirectory = "v3/fp16" (BNNS-fixed set); v3/palettized is the 6-bit, ~2.5x-smaller set for RAM-constrained devices.
- Add efficientV2_1 variant (chunk_len=25, ~2s latency, ~4x RTFx of fast) + config preset.
- SortformerDiarizer now validates the diarizer config against the model's embedded metadata on init and logs a clear error on mismatch (a mismatch silently produced incorrect/slow diarization, see #726). spkcacheUpdatePeriod excluded (host-clamped).
- CLI: `sortformer --config fast|efficient|low|high`; `sortformer-benchmark --collar`, `--onset`, `--offset` (the hardcoded collar=0 / onset=0.5 skewed reported DER).

github-actions · 2026-06-21T22:13:56Z

Supertonic3 Smoke Test ✅

Check	Result
Build	✅
Model download (incl. `VectorEstimatorVariants/` int4 buckets)	✅
Model load	✅
Synthesis pipeline (`--ve-variant int4`)	✅
Output WAV	✅ (364.7 KB)

Runtime: 0m17s

Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.

github-actions · 2026-06-21T22:15:22Z

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric	Value	Target	Status	Description
DER	10.4%	<20%	✅	Diarization Error Rate (lower is better)
RTFx	9.16x	>1.0x	✅	Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage	Time (s)	%	Description
Model Download	16.918	14.8	Fetching diarization models
Model Compile	7.250	6.3	CoreML compilation
Audio Load	0.064	0.1	Loading audio file
Segmentation	28.610	25.0	VAD + speech detection
Embedding	114.228	99.7	Speaker embedding extraction
Clustering (VBx)	0.154	0.1	Hungarian algorithm + VBx clustering
Total	114.618	100	Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method	DER	Mode	Description
FluidAudio (Offline)	10.4%	VBx Batch	On-device CoreML with optimal clustering
FluidAudio (Streaming)	17.7%	Chunk-based	First-occurrence speaker mapping
Research baseline	18-30%	Various	Standard dataset performance

Pipeline Details:

Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
Segmentation: VAD-based voice activity detection
Embeddings: WeSpeaker-compatible speaker embeddings
Clustering: PowerSet with VBx refinement
Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 143.0s processing • Test runtime: 2m 30s • 06/24/2026, 10:07 AM EST

github-actions · 2026-06-21T22:15:30Z

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric	Value	Description
WER (Avg)	7.03%	Average Word Error Rate
WER (Med)	4.17%	Median Word Error Rate
RTFx	5.75x	Real-time factor (higher = faster)
Total Audio	470.6s	Total audio duration processed
Total Time	80.1s	Total processing time

Streaming Metrics

Metric	Value	Description
Avg Chunk Time	0.080s	Average chunk processing time
Max Chunk Time	0.160s	Maximum chunk processing time
EOU Detections	0	Total End-of-Utterance detections

Test runtime: 1m31s • 06/24/2026, 10:01 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

github-actions · 2026-06-21T22:16:46Z

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric	Value	Target	Status	Description
DER	15.1%	<30%	✅	Diarization Error Rate (lower is better)
JER	24.9%	<25%	✅	Jaccard Error Rate
RTFx	22.50x	>1.0x	✅	Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage	Time (s)	%	Description
Model Download	12.495	26.8	Fetching diarization models
Model Compile	5.355	11.5	CoreML compilation
Audio Load	0.075	0.2	Loading audio file
Segmentation	13.983	30.0	Detecting speech regions
Embedding	23.305	50.0	Extracting speaker voices
Clustering	9.322	20.0	Grouping same speakers
Total	46.631	100	Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method	DER	Notes
FluidAudio	15.1%	On-device CoreML
Research baseline	18-30%	Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

M2 MacBook Air (2022): runs at 150 RTFx (150x real time)
Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 46.6s diarization time • Test runtime: 2m 34s • 06/24/2026, 10:08 AM EST

github-actions · 2026-06-21T22:18:56Z

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric	Value	Target	Status
DER	30.3%	<35%	✅
Miss Rate	28.2%	-	-
False Alarm	0.9%	-	-
Speaker Error	1.2%	-	-
RTFx	21.0x	>1.0x	✅
Speakers	4/4	-	-

Sortformer High-Latency • ES2004a • Runtime: 3m 7s • 2026-06-24T14:00:56.857Z

github-actions · 2026-06-21T22:20:24Z

PocketTTS Smoke Test ✅

Check	Result
Build	✅
Model download	✅
Model load	✅
Synthesis pipeline	✅
Output WAV	✅ (157.5 KB)

Runtime: 0m9s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU; audio quality and performance may differ from Apple Silicon.

github-actions · 2026-06-21T22:23:43Z

VAD Benchmark Results

Performance Comparison

Dataset	Accuracy	Precision	Recall	F1-Score	RTFx	Files
MUSAN	92.0%	86.2%	100.0%	92.6%	571.5x faster	50
VOiCES	92.0%	86.2%	100.0%	92.6%	721.0x faster	50

Dataset Details

MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions · 2026-06-21T22:26:56Z

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.57%	0.00%	5.08x	✅
test-other	1.19%	0.00%	3.75x	✅

Parakeet v2 (English-optimized)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.80%	0.00%	5.55x	✅
test-other	1.00%	0.00%	3.61x	✅

Streaming (v3)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.68x	Streaming real-time factor
Avg Chunk Time	1.343s	Average time to process each chunk
Max Chunk Time	1.599s	Maximum chunk processing time
First Token	1.575s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming (v2)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.64x	Streaming real-time factor
Avg Chunk Time	1.453s	Average time to process each chunk
Max Chunk Time	1.921s	Maximum chunk processing time
First Token	1.460s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 6m12s • 06/24/2026, 10:04 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

beta-devin-ai-integration

Devin Review found 2 potential issues.

beta-devin-ai-integration · 2026-06-21T22:48:32Z

🚩 requiredModels now downloads ALL 7 variants including efficientV2_1

Sortformer.requiredModels (Sources/FluidAudio/ModelNames.swift:725) returns Set(Variant.allCases.map(\.fileName)) which now includes all 7 variants. Any code path that downloads the full required set (i.e., when variant is nil) will now also attempt to download v3/fp16/SortformerEfficient_v2.1.mlmodelc. This is fine as long as that model file exists in the HuggingFace repo at https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml/tree/main/v3/fp16/. If it hasn't been uploaded yet, full-set downloads would fail.

(Refers to lines 724-727)


beta-devin-ai-integration · 2026-06-21T22:48:33Z

+    private func validateConfigMatch(_ models: SortformerModels) {
+        guard let embedded = models.embeddedConfig else { return }
+        let current = SortformerModels.EmbeddedConfig(
+            chunkLen: config.chunkLen,
+            chunkLeftContext: config.chunkLeftContext,
+            chunkRightContext: config.chunkRightContext,
+            fifoLen: config.fifoLen,
+            spkcacheLen: config.spkcacheLen
+        )
+        guard current != embedded else { return }
+        logger.error(
+            """
+            Sortformer config mismatch — diarizer config does not match the loaded model. \
+            This produces incorrect and much slower diarization (issue #726). \
+            diarizer(chunkLen=\(current.chunkLen), leftCtx=\(current.chunkLeftContext), \
+            rightCtx=\(current.chunkRightContext), fifoLen=\(current.fifoLen), \
+            spkcacheLen=\(current.spkcacheLen)) \
+            vs model(chunkLen=\(embedded.chunkLen), leftCtx=\(embedded.chunkLeftContext), \
+            rightCtx=\(embedded.chunkRightContext), fifoLen=\(embedded.fifoLen), \
+            spkcacheLen=\(embedded.spkcacheLen)). \
+            Construct SortformerDiarizer with the SortformerConfig matching the model variant.
+            """
+        )
+    }


🚩 validateConfigMatch only warns, never fails — silent mismatch possible in production

The validateConfigMatch method at Sources/FluidAudio/Diarizer/Sortformer/SortformerDiarizer.swift:125-148 logs an error but does not throw or prevent initialization when a config mismatch is detected. This is likely intentional (backward compatibility, older models without metadata), but it means a misconfigured diarizer will silently produce incorrect results in production where log output may not be monitored. If the embedded metadata is present AND mismatched, this is almost certainly a programmer error. Consider whether this should throw in a future iteration.
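If a throwing variant is ever adopted, it might look like the sketch below. This is an assumption about a possible future shape, not the PR's behavior (the PR only logs): `ConfigValidationError` is a hypothetical type, `EmbeddedConfig` here is a standalone mirror of the five compared fields, and every numeric value except `chunkLen: 25` is made up for illustration.

```swift
// Standalone mirror of the five fields compared in the diff above.
struct EmbeddedConfig: Equatable {
    var chunkLen: Int
    var chunkLeftContext: Int
    var chunkRightContext: Int
    var fifoLen: Int
    var spkcacheLen: Int
}

// Hypothetical error type; nothing like it exists in the PR.
enum ConfigValidationError: Error, Equatable {
    case mismatch(diarizer: EmbeddedConfig, model: EmbeddedConfig)
}

/// Throws only when metadata is present and differs. Models without embedded
/// metadata (nil) still pass, matching the early return of the logging version.
func validateConfigMatch(current: EmbeddedConfig, embedded: EmbeddedConfig?) throws {
    guard let embedded = embedded, current != embedded else { return }
    throw ConfigValidationError.mismatch(diarizer: current, model: embedded)
}

// Illustrative values only.
let model = EmbeddedConfig(
    chunkLen: 25, chunkLeftContext: 1, chunkRightContext: 1, fifoLen: 40, spkcacheLen: 188)
var wrong = model
wrong.chunkLen = 6

do {
    try validateConfigMatch(current: wrong, embedded: model)
    print("config matches")
} catch {
    print("refusing to initialize: \(error)")
}
```

Keeping the nil-metadata path non-throwing preserves compatibility with older models, which is the concern the review comment raises.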


…t fallback (#726)

Addresses two follow-ups from the #726 reporter:

- modelsSubdirectory was a let constant, so switching to the smaller palettized models (2.4GB -> 330MB RAM) required editing source. Add ModelNames.Sortformer.ModelPrecision and a mutable SortformerConfig.precision so callers select fp16 (default) vs palettized at runtime; bundle(for:) honors it. CLI: --palettized.
- The ~2.4GB fp16 high-context head triggers a multi-minute ANE compile hang on RAM-constrained devices (A14). Add recommendedComputeUnits(for:): large fp16 high-context variants on <8GB devices load with .cpuOnly, everything else (incl. the ANE-friendly palettized head) keeps .all. Wired through load/loadFromHuggingFace/initialize; computeUnits remains overridable. Also fixes load() ignoring its compute-unit argument.

… benchmark (#726)

Refresh the Model Variants table for the v3 model set (fp16/palettized paths, efficientV2_1, config-mismatch note). Document precision selection (RAM/DER trade), the A14 compute-unit auto-fallback, and the offline head-to-head vs Argmax: FluidAudio's fused offline graph is 1.3-1.4x faster (10.65ms/2884x vs 14.57ms/2108x encoder-only on M5 Pro), with the >10x Argmax claim explained as a streaming-vs-offline mismatch.

…stitching)

Add OfflineSortformerDiarizer backed by the fused offline Sortformer model (mel -> speaker_preds, 30.72s window, no streaming state): one CoreML call per window, the fastest batch path (1.3-1.4x faster than Argmax offline, #726).

- OfflineSortformerConfig / OfflineSortformerModels.runOffline (2-input graph, distinct from the streaming 6-input runMainModel)
- Long audio tiled into overlapping windows; SortformerSpeakerStitcher recovers the cross-window speaker permutation (brute-force 4! over the overlap) so IDs stay globally consistent
- ModelNames.Sortformer.offlineBundle(precision:) -> v3/{fp16,palettized}/SortformerOffline_v2.1.mlmodelc
- CLI: sortformer --offline [--palettized]
- Tests for the stitcher, config, and bundle paths

Validated end-to-end: 288.6s audio -> 13 windows in 1.02s (281.9x RTFx), consistent speaker IDs across all window boundaries.
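The brute-force permutation recovery mentioned in this commit can be illustrated with a short sketch. The commit only states that all 4! orderings are tried over the overlap; the squared-error cost, the function names, and the `[frame][speaker]` layout below are assumptions, not the actual SortformerSpeakerStitcher implementation.

```swift
/// All permutations of 0..<n (n! entries; 24 for four speaker slots).
func permutations(_ n: Int) -> [[Int]] {
    if n == 0 { return [[]] }
    var result: [[Int]] = []
    for perm in permutations(n - 1) {
        // Insert the new element at every possible position.
        for position in 0...perm.count {
            var next = perm
            next.insert(n - 1, at: position)
            result.append(next)
        }
    }
    return result
}

/// Returns `mapping` where current-window slot `mapping[s]` is matched to
/// previous-window slot `s`, minimizing squared error over the overlap frames.
/// Both inputs are laid out as [frame][speaker] activity probabilities.
func bestSpeakerPermutation(previous: [[Float]], current: [[Float]]) -> [Int] {
    precondition(previous.count == current.count, "overlap frame counts must match")
    let speakers = previous.first?.count ?? 0
    var best = Array(0..<speakers)
    var bestCost = Float.infinity
    for candidate in permutations(speakers) {
        var cost: Float = 0
        for (prevFrame, curFrame) in zip(previous, current) {
            for s in 0..<speakers {
                let diff = prevFrame[s] - curFrame[candidate[s]]
                cost += diff * diff
            }
        }
        if cost < bestCost {
            bestCost = cost
            best = candidate
        }
    }
    return best
}

// Two overlap frames; the current window emits the same speakers in shuffled slots.
let previous: [[Float]] = [[0.9, 0.1, 0.0, 0.3], [0.8, 0.2, 0.0, 0.4]]
let current: [[Float]] = [[0.1, 0.3, 0.9, 0.0], [0.2, 0.4, 0.8, 0.0]]
print(bestSpeakerPermutation(previous: previous, current: current))  // [2, 0, 3, 1]
```

With only four speaker slots, exhaustive search over 24 candidates is cheap, so an assignment solver would add complexity without a measurable benefit. As the later commit in this PR notes, this kind of stitching aligns slots across windows but cannot undo confusion inside a single window.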

Document OfflineSortformerDiarizer whole-file throughput (fused model + speaker stitching) alongside the existing offline-vs-Argmax model-exec numbers.

… docs

Add --offline to sortformer-benchmark (whole-file fused path) and document the measured limitation: offline matches streaming on detection (identical Miss/FA) and is ~1400x RTFx, but speaker confusion on long multi-speaker audio gives ~56% DER vs ~26% streaming on AMI-SDM (no spkcache; cross-window stitching can't recover within-window confusion). Scope offline to short/few-speaker/throughput; streaming stays the long-form path.

Alex-Wengg mentioned this pull request Jun 21, 2026

Sortformer BNNS Graph Error #726

Open

beta-devin-ai-integration Bot reviewed Jun 21, 2026


Alex-Wengg added 2 commits June 23, 2026 19:18

Alex-Wengg force-pushed the fix/sortformer-bnns-crash-726 branch from a6ff757 to 32a5759 Compare June 24, 2026 01:12

Alex-Wengg added 3 commits June 24, 2026 01:10

docs(sortformer): offline diarizer end-to-end benchmark (~283x RTFx)

c0c0126


Alex-Wengg wants to merge 6 commits into main from fix/sortformer-bnns-crash-726
