Skip to content

Latest commit

 

History

History
96 lines (66 loc) · 4.05 KB

File metadata and controls

96 lines (66 loc) · 4.05 KB

Reference Matching (K-mer Containment)

Overview

The match command finds the closest reference genome for a set of FASTQ reads. It uses k-mer containment scoring — measuring what fraction of the query's k-mers are present in each reference genome.

Algorithm

FASTQ reads ──→ Count query k-mers ──→ Filter noise ──→ Score references (streaming) ──→ Best match

Step 1: Query K-mer Counting

All FASTQ files are processed in parallel (one rayon task per file):

  1. Parse FASTQ with needletail (handles .gz transparently)
  2. Extract canonical k-mers (k=31 by default) via bit_kmers(k, canonical=true)
  3. Count occurrences in a thread-local FxHashMap<u64, u32>
  4. Merge thread-local maps via parallel reduction

Step 2: Noise Filtering

Singleton k-mers (count=1) are likely sequencing errors:

  • Only applied when total unique k-mers > 100,000
  • Default threshold: min_kmer_count = 2
  • Typical reduction: ~90-93% of unique k-mers removed
  • Effect: faster scoring and fewer false matches

Step 3: Streaming Reference Scoring

References are read from a multi-FASTA file in adaptive batches (batch_size = num_threads, clamped to 4–64):

Multi-FASTA ──→ Read batch of N refs ──→ Score batch in parallel ──→ Update best
                                              │                         │
                                              └── Free batch memory ────┘
                                                                        │
                ──→ Read next batch ──────────────────────────────────────┘

Why streaming? With 500 reference genomes of ~4.4 Mb each, loading all references would require ~2.2 GB for sequences alone, plus ~17 GB for k-mer sets. Streaming processes num_threads references at a time, keeping memory bounded.

Scoring a Single Reference

For each reference in the batch (parallelized with rayon):

  1. Extract all canonical k-mers from the reference sequence
  2. Sort and deduplicate (only unique k-mers matter)
  3. For each unique reference k-mer, look it up in the query k-mer map
  4. If found, add the query count to shared_kmer_count

Score Calculation

Shared_Kmer_Fraction = Σ(query_count for shared k-mers) / Σ(all query k-mer counts)

This is a weighted containment score: k-mers that appear more frequently in the reads contribute more. A perfect match (all query k-mers present in the reference) gives a score of 1.0.

Using weighted counts (instead of just presence/absence) makes the score robust to coverage variation — highly covered regions contribute proportionally.

Memory Model

Component Memory Notes
Query k-mers ~50–200 MB All retained k-mers with counts
Reference batch batch_size × ~4.4 MB Raw sequences
Reference k-mers batch_size × ~34 MB Sorted unique k-mer arrays (temporary)
Total ~300–500 MB typical Independent of total number of references

Parameters

Parameter Default Description
--kmer-size 31 Longer k = more specific matching
--min-kmer-count 2 Noise filter threshold
--early-stop-confidence 0.0 (disabled) Stop scoring early if leader is clear
--early-stop-min-kmers 1,000,000 Minimum k-mers scored before early stop
--strict-percentages true Use weighted query counts for score

Output

Query_Files              Best_Match_Reference    Shared_Kmer_Fraction
reads_R1.fastq.gz,...    GCF_000253355.1         0.5440

Only the single best match is reported. The score represents the fraction of query sequencing depth attributable to k-mers shared with that reference.

Use Cases

  1. Pre-mapping reference selection: Find the closest reference before read alignment to maximize mapping quality
  2. Lineage verification: Cross-check ML-predicted lineage against known reference genomes
  3. Contamination detection: Low scores may indicate mixed samples or wrong species