This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

FireRedASR is an industrial-grade Automatic Speech Recognition (ASR) system specializing in Chinese (Mandarin and dialects) and English. It provides two model variants:
- FireRedASR-AED (1.1B params): Attention-based Encoder-Decoder for balanced performance
- FireRedASR-LLM (8.3B params): Encoder-Adapter-LLM framework for SOTA performance
**Important**: When working in this project, Claude should respond in Chinese by default, even when the user asks in English. Technical terms and code-related content should be given in both Chinese and English (English in parentheses).
## Environment Setup

```bash
# Create conda environment
conda create --name fireredasr python=3.10
conda activate fireredasr
pip install -r requirements.txt

# Set paths (required for CLI tools)
export PATH=$PWD/fireredasr/:$PWD/fireredasr/utils/:$PATH
export PYTHONPATH=$PWD/:$PYTHONPATH
```
## Audio Preparation

```bash
# Convert to required format (16kHz, mono, PCM WAV)
ffmpeg -i input_audio -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav

# Batch conversion
for file in data/raw_input/*.mp3; do
  ffmpeg -i "$file" -ar 16000 -ac 1 -acodec pcm_s16le -f wav "data/formated_input/$(basename "$file" .mp3).wav"
done
```

## Running Inference

```bash
# Using example scripts
bash examples/inference_fireredasr_aed.sh
bash examples/inference_fireredasr_llm.sh

# Direct CLI usage
speech2text.py --wav_path examples/wav/BAC009S0764W0121.wav \
    --asr_type "aed" \
    --model_dir pretrained_models/FireRedASR-AED-L
```
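For programmatic use, the model factory in `fireredasr/models/fireredasr.py` can be called from Python. The sketch below is a minimal example under assumptions: the `FireRedAsr.from_pretrained(asr_type, model_dir)` constructor, the `transcribe(...)` method, and the decoding options mirror the upstream README, but verify the exact signatures in `speech2text.py` before relying on them.

```python
# Minimal sketch of programmatic inference; method names (from_pretrained,
# transcribe) and the decoding-option keys are assumptions -- check
# fireredasr/models/fireredasr.py for the real API.
from fireredasr.models.fireredasr import FireRedAsr

batch_uttid = ["BAC009S0764W0121"]
batch_wav_path = ["examples/wav/BAC009S0764W0121.wav"]

model = FireRedAsr.from_pretrained("aed", "pretrained_models/FireRedASR-AED-L")
results = model.transcribe(
    batch_uttid,
    batch_wav_path,
    {"use_gpu": 1, "beam_size": 3, "nbest": 1, "decode_max_len": 0},
)
print(results)  # expected: one entry per utterance with id and recognized text
```

Swap `"aed"` and the model directory for the LLM variant.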
## Evaluation

```bash
# Calculate Word Error Rate
wer.py --print_sentence_wer 1 --do_tn 0 --rm_special 0 \
    --ref reference.txt --hyp hypothesis.txt
```
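The reference and hypothesis files are plain text. A common layout for such scoring scripts is one utterance per line, keyed by utterance id; this is an assumption about `wer.py`, so inspect the script if scoring fails:

```
utt_id_1 transcript text for utterance one
utt_id_2 transcript text for utterance two
```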
## Architecture

- **Model Layer** (`fireredasr/models/`)
  - `fireredasr.py`: Factory pattern for model instantiation
  - `fireredasr_aed.py`: AED architecture with Conformer encoder + Transformer decoder
  - `fireredasr_llm.py`: LLM architecture integrating Qwen2-7B-Instruct
  - `module/`: Neural network building blocks (attention, convolution, transformers)
- **Data Processing** (`fireredasr/data/`)
  - Feature extraction with Kaldi's fbank (see the sketch after this list)
  - CMVN normalization
  - Batch collation with padding
- **Tokenization** (`fireredasr/tokenizer/`)
  - Character-level tokenization for Chinese
  - BPE tokenization support
  - Special token handling for both AED and LLM variants
- **CLI Interface** (`fireredasr/speech2text.py`)
  - Unified entry point for all ASR operations
  - Supports multiple input formats (single file, batch, directory, scp)
  - VAD-based splitting for long audio (>60s for AED, >30s for LLM)
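To illustrate the Data Processing layer, here is a minimal sketch of 80-dim fbank extraction with per-utterance CMVN. It uses `torchaudio`'s Kaldi-compatible frontend as a stand-in; the repository's own extractor in `fireredasr/data/` may differ in options and in how CMVN statistics are applied.

```python
# Sketch only: torchaudio's Kaldi-compatible fbank as a stand-in for the
# repository's feature extraction in fireredasr/data/.
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

def extract_fbank(wav_path: str) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)  # expects 16kHz mono PCM
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=80,        # 80-dim fbank, matching the models
        frame_length=25.0,      # ms
        frame_shift=10.0,       # ms
        sample_frequency=sample_rate,
    )
    # Per-utterance CMVN: zero mean, unit variance over time
    feats = (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)
    return feats  # shape: (num_frames, 80)

feats = extract_fbank("examples/wav/BAC009S0764W0121.wav")
print(feats.shape)
```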
## Model Loading Process

- Load YAML config from the pretrained model directory
- Initialize model architecture based on the config
- Load pretrained weights from the checkpoint
- Apply PEFT adapters if using the LLM variant
- Set up the tokenizer with its vocabulary
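The sequence above, sketched with standard calls. File names (`config.yaml`, `model.pth.tar`) are assumptions about the pretrained-model directory, and the construction steps are left as comments because the real logic lives in `fireredasr/models/fireredasr.py`.

```python
# Illustrative loading sequence; file names are assumptions and the
# model/tokenizer construction steps are only outlined in comments.
import os
import yaml
import torch

model_dir = "pretrained_models/FireRedASR-AED-L"

# 1. Load YAML config from the pretrained model directory (file name assumed)
with open(os.path.join(model_dir, "config.yaml")) as f:
    config = yaml.safe_load(f)

# 2. Initialize the model architecture from `config`
#    (AED or LLM classes defined under fireredasr/models/)

# 3. Load pretrained weights from the checkpoint (file name assumed)
checkpoint = torch.load(os.path.join(model_dir, "model.pth.tar"), map_location="cpu")
# model.load_state_dict(checkpoint, strict=False)

# 4. Apply PEFT adapters if using the LLM variant

# 5. Set up the tokenizer from the vocabulary files shipped in model_dir
```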
## Inference Pipeline

- Raw audio → 16kHz PCM conversion (ffmpeg)
- VAD segmentation for long audio (`vad_split.py`)
- Feature extraction (80-dim fbank features)
- Model inference with beam search
- Optional LLM refinement (`refine_asr_output/`)
## Important Notes

- **No formal test framework**: Testing is done via example scripts and WER evaluation
- **GPU required**: Models need a CUDA-enabled GPU for reasonable performance
- **Memory requirements**: AED needs ~8GB VRAM, LLM needs ~32GB VRAM
- **Audio limitations**: Max 60s (AED) or 30s (LLM) per segment
- **Dependencies**: Requires PyTorch ≥2.0.0, Transformers ≥4.46.3, Kaldi tools
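Given the VRAM figures above, a quick check before loading can avoid an out-of-memory failure. This is a generic sketch using standard PyTorch calls, not a utility shipped with the repository, and the LLM directory name is assumed.

```python
# Pick a model variant based on available GPU memory (rough heuristic).
import torch

assert torch.cuda.is_available(), "FireRedASR needs a CUDA-enabled GPU"
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3

if total_gb >= 32:
    asr_type, model_dir = "llm", "pretrained_models/FireRedASR-LLM-L"  # dir name assumed
elif total_gb >= 8:
    asr_type, model_dir = "aed", "pretrained_models/FireRedASR-AED-L"
else:
    raise RuntimeError(f"Only {total_gb:.1f} GB VRAM; AED needs ~8GB, LLM ~32GB")
print(f"Using {asr_type} model from {model_dir}")
```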
## Adding a New Model

- Create a new model class in `fireredasr/models/`
- Register it in the factory function in `fireredasr.py`
- Add corresponding tokenizer support
- Update CLI arguments in `speech2text.py`
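A sketch of what registering a new variant in the factory might look like. The existing factory's exact shape is not reproduced here, so the class names, the `from_pretrained` constructor, and the `"my_variant"` value are hypothetical; mirror whatever `fireredasr.py` actually does for `"aed"` and `"llm"`.

```python
# Hypothetical sketch of extending the factory in fireredasr/models/fireredasr.py
from fireredasr.models.fireredasr_aed import FireRedAsrAed   # existing class, name assumed
from fireredasr.models.my_new_model import MyNewModel        # your new model class

def create_model(asr_type: str, model_dir: str):
    """Factory: map --asr_type values to model classes."""
    if asr_type == "aed":
        return FireRedAsrAed.from_pretrained(model_dir)   # constructor assumed
    if asr_type == "my_variant":                           # new --asr_type value
        return MyNewModel.from_pretrained(model_dir)
    raise ValueError(f"Unknown asr_type: {asr_type}")
```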
## Processing Long Audio

- Use VAD splitting: `vad_split.py --input long_audio.wav --output_dir segments/`
- Process segments individually
- Concatenate results (see the sketch after this list)
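A sketch of the split, decode, and concatenate loop in Python, reusing the assumed `FireRedAsr` API from the earlier example. Segment file naming, temporal sort order, and the `"text"` result field are assumptions; adjust to `vad_split.py`'s actual output and the real result structure.

```python
# Sketch: transcribe VAD segments in order and join the text.
# Assumes segments/ was produced by vad_split.py and that segment file names
# sort in temporal order; the FireRedAsr API is the same assumption as above.
import glob
from fireredasr.models.fireredasr import FireRedAsr

model = FireRedAsr.from_pretrained("aed", "pretrained_models/FireRedASR-AED-L")
segment_paths = sorted(glob.glob("segments/*.wav"))

texts = []
for path in segment_paths:
    results = model.transcribe(
        [path], [path],
        {"use_gpu": 1, "beam_size": 3, "nbest": 1, "decode_max_len": 0},
    )
    texts.append(results[0]["text"])  # field name assumed

print("".join(texts))  # concatenated transcript for the long recording
```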
## Fine-tuning

- Use PEFT/LoRA for parameter-efficient fine-tuning
- Modify adapter configurations in the model configs
- Leverage existing training scripts (if available in future updates)
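A minimal sketch of a LoRA configuration with the Hugging Face `peft` library, as it might be applied to the Qwen2 LLM component. The target modules and hyperparameters are illustrative assumptions, not the repository's shipped adapter config.

```python
# Illustrative LoRA setup with Hugging Face peft; values are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
lora_cfg = LoraConfig(
    r=8,                      # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # sanity check: only adapter params are trainable
```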