texterrors scores ASR or transcription output against a reference and helps you inspect what went wrong.
It supports:
- WER and CER
- standard and character-aware alignment
- detailed aligned error reports
- colored output for inspecting alignments
- comparison of multiple hypothesis files against the same reference
- per-group metrics such as per-speaker WER
- keyword and OOV evaluation
- OOV-CER for targeted analysis of out-of-vocabulary words, as in this paper
- oracle WER
- simple entity accuracy
- per-entity diagnostic TSV output
- aggregate JSON output for scripting
Example of colored detailed output (--usecolor):
For more background on the motivation for the tool, see this post.
Requires Python 3.9 or newer.
pip install texterrorsThis installs both the Python package and the texterrors command-line tool.
If your files are ark-like text with an utterance ID as the first field, use --isark.
Compute aggregate WER only:
texterrors --isark -s ref hypWrite a detailed report to a file:
texterrors --isark --cer -c -o detailed_report.txt ref hypIf you use --usecolor, view the output with less -R.
Compare several systems against the same reference:
texterrors --isark -s ref hyp1 hyp2 hyp3This prints a comparison table with one row per hypothesis file.
Write aggregate-only JSON instead of the normal text report:
texterrors --isark --output-format json ref hypMeasure simple entity accuracy:
texterrors --isark -w ref hypWrite one TSV row per entity occurrence for diagnostics:
texterrors --isark --entity-details entity_details.tsv ref hypBy default, texterrors expects one reference line and one hypothesis line per utterance.
Useful input flags:
--isark: each line starts with an utterance ID--isctm: input is CTM-like and includes timing fields
The default output is a human-readable text report. Unless you pass --skip-detailed, it includes per-utterance aligned detail as well as overall summary statistics.
Useful output options:
--skip-detailed: show only aggregate statistics--out,-o: write the text report to a file--output-format json: write aggregate statistics and top-error summaries as JSON, without per-utterance detail--entity-details FILE: write a TSV with one row per simple-entity occurrence
--entity-details is meant for compact diagnostics. The TSV records the normalized compact hypothesis output aligned to the entity span, not the original surface form with exact spacing and casing.
--cer: compute CER in addition to WER--utt-group-map FILE: report metrics by group, for example by speaker--keywords-f FILE: restrict keyword precision and recall analysis to terms in a file--oov-list-f FILE: compute OOV-CER for words in a file--oracle-wer: pick the lowest-edit-distance hypothesis when multiple hypotheses are available per utterance--freq-sort: sort error summaries by frequency instead of raw count--num-top-errors N: control how many top insertions, deletions, and substitutions are shown
--simple-entity-accuracy uses reference-side casing cues to identify likely entity terms and then scores whether those terms were recognized in the hypothesis.
Entity matching is case-insensitive and ignores whitespace, so examples like GenAI and Gen AI count as the same entity for this metric.
If you want to inspect misses, use --entity-details FILE. That TSV is useful for spotting whether an entity matched cleanly, was substituted, or was effectively deleted.
If you want to use the library directly from Python:
from texterrors import align_texts
ref_aligned, hyp_aligned, cost = align_texts(
["speedbird", "eight", "six", "two"],
["hello", "speedbird", "six", "two"],
use_chardiff=True,
)A small benchmark harness lives in benchmarks/alignment_benchmark.py.
Run it from the repo root:
.venv/bin/python benchmarks/alignment_benchmark.py --repeat 7You can also point it at other ark-like files:
.venv/bin/python benchmarks/alignment_benchmark.py --ref my_ref.txt --hyp my_hyp.txtIf you want to build the extension locally:
uv venv
env UV_CACHE_DIR=/tmp/uv-cache uv pip install --python .venv/bin/python -r requirements.txt
.venv/bin/cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DPython_EXECUTABLE=$PWD/.venv/bin/python -Dnanobind_DIR=$PWD/.venv/lib/python3.12/site-packages/nanobind/cmake
.venv/bin/cmake --build build --config Release
.venv/bin/cmake --install build --config Release --prefix $PWD--use-chardiff enables character-aware alignment. This often gives more intuitive alignments when words are similar, but it can also make WER slightly higher than standard token-only alignment.
If you want behavior closer to a plain token-alignment scorer, leave --use-chardiff off.
For example, a plain token alignment might force a one-to-one mapping:
| test | sentence | okay | words | ending | now |
|---|---|---|---|---|---|
| test | a | sentenc | ok | endin | now |
Character-aware alignment may instead align it like this:
| test | - | sentence | okay | words | ending | now |
|---|---|---|---|---|---|---|
| test | a | sentenc | ok | - | endin | now |
That can increase WER because it exposes an insertion and deletion that the token-only alignment hides.
