CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

FireRedASR is an industrial-grade Automatic Speech Recognition system specializing in Chinese (Mandarin and dialects) and English. It provides two model variants:

  • FireRedASR-AED (1.1B params): Attention-based Encoder-Decoder for balanced performance
  • FireRedASR-LLM (8.3B params): Encoder-Adapter-LLM framework for SOTA performance

Language Settings

Important: When working in this project, Claude should reply in Chinese by default, even when the user asks in English. Technical terms and code-related content should be given in both Chinese and English (with the English in parentheses).

Essential Commands

Environment Setup

# Create conda environment
conda create --name fireredasr python=3.10
conda activate fireredasr
pip install -r requirements.txt

# Set paths (required for CLI tools)
export PATH=$PWD/fireredasr/:$PWD/fireredasr/utils/:$PATH
export PYTHONPATH=$PWD/:$PYTHONPATH

Audio Preprocessing

# Convert to required format (16kHz, mono, PCM WAV)
ffmpeg -i input_audio -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav

# Batch conversion
for file in data/raw_input/*.mp3; do
    ffmpeg -i "$file" -ar 16000 -ac 1 -acodec pcm_s16le -f wav "data/formated_input/$(basename "$file" .mp3).wav"
done
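The same batch conversion can be driven from Python when shell globbing is inconvenient. This is a sketch, not part of the repo: `ffmpeg_cmd` and `convert_dir` are illustrative helper names, and the flags simply mirror the ffmpeg invocation above.

```python
import subprocess
from pathlib import Path

def ffmpeg_cmd(src: Path, dst: Path) -> list:
    """Build the ffmpeg command for 16 kHz mono s16le PCM WAV output."""
    return ["ffmpeg", "-i", str(src),
            "-ar", "16000", "-ac", "1",
            "-acodec", "pcm_s16le", "-f", "wav", str(dst)]

def convert_dir(src_dir: str, dst_dir: str, ext: str = ".mp3") -> None:
    """Convert every matching file in src_dir into dst_dir as 16 kHz WAV."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob(f"*{ext}")):
        subprocess.run(ffmpeg_cmd(src, out / (src.stem + ".wav")), check=True)
```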

Running Inference

# Using example scripts
bash examples/inference_fireredasr_aed.sh
bash examples/inference_fireredasr_llm.sh

# Direct CLI usage
speech2text.py --wav_path examples/wav/BAC009S0764W0121.wav \
    --asr_type "aed" \
    --model_dir pretrained_models/FireRedASR-AED-L

Evaluation

# Calculate Word Error Rate
wer.py --print_sentence_wer 1 --do_tn 0 --rm_special 0 \
    --ref reference.txt --hyp hypothesis.txt
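The repo's wer.py is the authoritative scorer; purely as a reference for what the metric computes, word error rate is the token-level Levenshtein distance divided by the reference length. A minimal sketch (illustrative names, not the repo's API):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Token-level Levenshtein distance (substitutions + insertions + deletions)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # substitution or match
            prev = cur
    return dp[-1]

def compute_wer(ref: str, hyp: str) -> float:
    """WER = edit distance over whitespace tokens / reference length."""
    ref_toks, hyp_toks = ref.split(), hyp.split()
    return edit_distance(ref_toks, hyp_toks) / max(len(ref_toks), 1)
```

Note that for Chinese, wer.py effectively scores at the character level (and `--do_tn` controls text normalization); this sketch only shows the whitespace-token case.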

High-Level Architecture

Core Components

  1. Model Layer (fireredasr/models/)

    • fireredasr.py: Factory pattern for model instantiation
    • fireredasr_aed.py: AED architecture with Conformer encoder + Transformer decoder
    • fireredasr_llm.py: LLM architecture integrating Qwen2-7B-Instruct
    • module/: Neural network building blocks (attention, convolution, transformers)
  2. Data Processing (fireredasr/data/)

    • Feature extraction with Kaldi's fbank
    • CMVN normalization
    • Batch collation with padding
  3. Tokenization (fireredasr/tokenizer/)

    • Character-level tokenization for Chinese
    • BPE tokenization support
    • Special token handling for both AED and LLM variants
  4. CLI Interface (fireredasr/speech2text.py)

    • Unified entry point for all ASR operations
    • Supports multiple input formats (single file, batch, directory, scp)
    • VAD-based splitting for long audio (>60s for AED, >30s for LLM)
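The batch collation step in item 2 (padding variable-length feature sequences into one batch) can be sketched in plain Python. The real code pads 80-dim fbank tensors; `pad_batch` and the zero pad value here are illustrative, but the shape logic is the same:

```python
def pad_batch(feats: list, pad_value: float = 0.0):
    """Pad variable-length feature sequences (frames x dims) to the batch max.

    Returns the padded batch plus the original lengths, which the model
    needs in order to mask out padded frames during attention.
    """
    max_len = max(len(f) for f in feats)
    dim = len(feats[0][0])
    lengths = [len(f) for f in feats]
    padded = [f + [[pad_value] * dim] * (max_len - len(f)) for f in feats]
    return padded, lengths
```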

Model Loading Flow

  1. Load YAML config from pretrained model directory
  2. Initialize model architecture based on config
  3. Load pretrained weights from checkpoint
  4. Apply PEFT adapters if using LLM variant
  5. Set up tokenizer with vocabulary
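The five steps above can be sketched as one loader function. Everything here is a stand-in: `AsrModel`, `load_model`, and the config keys are illustrative, and the real factory lives in fireredasr/models/fireredasr.py.

```python
from dataclasses import dataclass, field

@dataclass
class AsrModel:
    """Stand-in for the real AED/LLM model classes."""
    asr_type: str
    config: dict = field(default_factory=dict)
    adapters: bool = False

def load_model(model_dir: str, asr_type: str) -> AsrModel:
    """Mirror the five loading steps (stubbed, no file I/O)."""
    # 1. in the real code: config = yaml.safe_load(open(model_dir/"config"))
    config = {"model_dir": model_dir, "asr_type": asr_type}
    # 2. initialize the architecture selected by the config
    model = AsrModel(asr_type, config)
    # 3. restore weights: model.load_state_dict(torch.load(ckpt)) in real code
    # 4. the LLM variant additionally attaches PEFT/LoRA adapters
    if asr_type == "llm":
        model.adapters = True
    # 5. tokenizer is built from the vocabulary shipped in model_dir
    return model
```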

Audio Processing Pipeline

  1. Raw audio → 16kHz PCM conversion (ffmpeg)
  2. VAD segmentation for long audio (vad_split.py)
  3. Feature extraction (80-dim fbank features)
  4. Model inference with beam search
  5. Optional LLM refinement (refine_asr_output/)
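Step 2 matters because of the per-segment caps (60s for AED, 30s for LLM). The real vad_split.py uses a proper voice-activity model; a toy energy-threshold splitter shows the idea, with every name and the threshold value being illustrative:

```python
def split_on_silence(samples: list, sr: int,
                     max_sec: float = 30.0, thresh: float = 0.01):
    """Greedy split: prefer cutting at silence, never exceed max_sec per segment."""
    max_len = int(max_sec * sr)
    segments, start = [], 0
    while start < len(samples):
        end = min(start + max_len, len(samples))
        if end < len(samples):
            # look for a quiet sample in the last second before the forced cut,
            # so segments end at silence instead of mid-word where possible
            window = range(max(start + 1, end - sr), end)
            quiet = [i for i in window if abs(samples[i]) < thresh]
            if quiet:
                end = quiet[-1]
        segments.append(samples[start:end])
        start = end
    return segments
```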

Key Development Considerations

  • No formal test framework: Testing done via example scripts and WER evaluation
  • GPU required: Models need CUDA-enabled GPU for reasonable performance
  • Memory requirements: AED needs ~8GB VRAM, LLM needs ~32GB VRAM
  • Audio limitations: Max 60s (AED) or 30s (LLM) per segment
  • Dependencies: Requires PyTorch ≥2.0.0, Transformers ≥4.46.3, Kaldi tools

Common Tasks

Adding New Model Variant

  1. Create new model class in fireredasr/models/
  2. Register in factory function in fireredasr.py
  3. Add corresponding tokenizer support
  4. Update CLI arguments in speech2text.py
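Step 2's factory registration can be sketched with a decorator-based registry. The repo's fireredasr.py uses a factory function rather than this exact pattern, and `register_model`, `from_pretrained`, and the class names below are illustrative:

```python
MODEL_REGISTRY = {}

def register_model(name: str):
    """Class decorator that adds an ASR model variant to the factory."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("aed")
class AedVariant:
    pass

@register_model("my_variant")   # a hypothetical new variant
class MyVariant:
    pass

def from_pretrained(asr_type: str, model_dir: str):
    """Factory entry point: look up the class, then build it from model_dir."""
    try:
        cls = MODEL_REGISTRY[asr_type]
    except KeyError:
        raise ValueError(f"unknown asr_type {asr_type!r}; "
                         f"choices: {sorted(MODEL_REGISTRY)}")
    return cls()  # real code would also load config and weights here
```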

Processing Long Audio Files

  1. Use VAD splitting: vad_split.py --input long_audio.wav --output_dir segments/
  2. Process segments individually
  3. Concatenate results
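Step 3's concatenation is simple but order matters: if segment filenames are not zero-padded, a plain string sort puts `segment_10` before `segment_2`. A sketch that sorts on the embedded index; `merge_transcripts` and the filename pattern are assumptions, not the repo's API:

```python
import re

def merge_transcripts(results: dict) -> str:
    """Join per-segment transcripts {filename: text} in numeric segment order.

    Joined without a separator, which suits Chinese output; use ' '.join
    for space-delimited English instead.
    """
    def seg_index(name: str) -> int:
        m = re.search(r"(\d+)", name)
        return int(m.group(1)) if m else 0
    return "".join(results[name] for name in sorted(results, key=seg_index))
```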

Fine-tuning Models

  • Use PEFT/LoRA for parameter-efficient fine-tuning
  • Modify adapter configurations in model configs
  • Use the official training scripts if and when they are released in future updates