STAR-compatible optional tags on unmapped records #80
Merged
fix: STAR-compatible optional tags on unmapped records + mismatch reason classification
Summary
Two related correctness fixes to unmapped-read handling that have no effect on alignment decisions:

1. Missing optional tags on unmapped records. STAR emits NH:i:0, HI:i:0, AS:i:0, nM:i:0, and uT:A: on every unmapped record. rustar-aligner emitted none of them. uT:A: in particular is parsed by MultiQC and similar QC tools to break down unmapping categories; without it those tools fall back to flag-only counting and lose the fine-grained breakdown.

2. Mismatch-filtered reads misclassified as TooShort. When all transcript candidates were removed solely by the mismatch count/rate filter (--outFilterMismatchNmax / --outFilterMismatchNoverLmax), Log.final.out recorded them under "too short" instead of "too many mismatches", inflating the former and keeping the latter at zero. STAR distinguishes these two cases in ReadAlign_mappedFilter.cpp:20–30.

Closes the residual part of #48.
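The reclassification in the second fix can be illustrated with a minimal sketch. It is not the code from this PR: the function name `classify_filtered`, the map type `HashMap<&str, u32>`, and the `score_min` key are assumptions made for illustration. Only the `mismatch_max` / `mismatch_rate` keys and the `UnmappedReason` variant names come from the description. The sketch also assumes that an empty map falls back to `TooShort`.

```rust
use std::collections::HashMap;

/// Reasons an unmapped read is reported under (variant names follow the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UnmappedReason {
    Other,
    TooShort,
    TooManyMismatches,
    TooManyLoci,
}

/// Hypothetical helper: decide the reason for a read whose transcript
/// candidates were all removed by the quality filters.
///
/// `filter_reasons` maps a filter name to how many candidates it rejected.
/// The map type is an assumption; the real structure may differ.
fn classify_filtered(filter_reasons: &HashMap<&str, u32>) -> UnmappedReason {
    let mut any_fired = false;
    for (name, count) in filter_reasons {
        if *count == 0 {
            continue; // this filter rejected nothing
        }
        any_fired = true;
        // Any non-mismatch filter firing keeps the previous classification.
        if !matches!(*name, "mismatch_max" | "mismatch_rate") {
            return UnmappedReason::TooShort;
        }
    }
    if any_fired {
        // Only mismatch filters fired.
        UnmappedReason::TooManyMismatches
    } else {
        // Assumption: with no recorded reason, fall back to TooShort.
        UnmappedReason::TooShort
    }
}
```

Under these assumptions, a map containing only `mismatch_max` yields `TooManyMismatches`, while adding any other key with a non-zero count (here the made-up `score_min`) yields `TooShort` again.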
Changes
src/align/read_align.rs
Replaced the catch-all Some(UnmappedReason::TooShort) at the end of the quality-filter block with logic that inspects the existing filter_reasons map: if only mismatch filters fired (mismatch_max / mismatch_rate) the reason is TooManyMismatches; otherwise TooShort.

src/io/sam.rs
- insert_unmapped_tags(record, attrs, reason) inserts NH:i:0, HI:i:0, AS:i:0, nM:i:0 (gated on outSAMattributes as for mapped records) and uT:A: (always emitted, matching STAR's unconditional behaviour). uT:A: values: 0=other, 1=too short, 2=too many mismatches, 3=too many loci.
- build_unmapped_record: rg_id: Option<&str> replaced by params: &Parameters + unmapped_reason: UnmappedReason; RG tag derived internally as other builders do.
- build_paired_unmapped_records: gains unmapped_reason: UnmappedReason parameter.
- build_half_mapped_records: unmapped-mate section calls insert_unmapped_tags with UnmappedReason::Other.

src/lib.rs
All four call sites updated. The now-unused rg_id_owned binding at the top of run_single_pass removed.

src/io/bam.rs
Three test call sites updated to the new build_unmapped_record signature.

Test plan
- cargo test: 436 tests passing, 0 failures
- cargo clippy --all-targets: 0 warnings
- cargo fmt --check: clean
- test_unmapped_record_tags_emitted: verifies NH/HI/AS present and all four uT:A: values (0–3) on build_unmapped_record output
- test_unmapped_reason_mismatch_classification: verifies the filter_reasons → UnmappedReason mapping for all cases (mismatch-only, score-only, mixed, empty)
drift is from batch 1's Gsj/sjdb changes, not this branch)
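
For reference, the tag layout described for insert_unmapped_tags under Changes can be sketched as plain SAM text fields. This is an illustrative sketch only: `unmapped_tag_fields` and `ut_code` are hypothetical names, and the real function works on a record and an attribute set rather than on strings. The uT:A: codes (0 to 3) and the gating of NH/HI/AS/nM on the requested attributes follow the description above.

```rust
/// Reasons an unmapped read is reported under (variant names follow the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UnmappedReason {
    Other,
    TooShort,
    TooManyMismatches,
    TooManyLoci,
}

impl UnmappedReason {
    /// Character written after `uT:A:`, using the codes listed in the PR:
    /// 0=other, 1=too short, 2=too many mismatches, 3=too many loci.
    fn ut_code(self) -> char {
        match self {
            UnmappedReason::Other => '0',
            UnmappedReason::TooShort => '1',
            UnmappedReason::TooManyMismatches => '2',
            UnmappedReason::TooManyLoci => '3',
        }
    }
}

/// Hypothetical text-only stand-in for the tag insertion step.
///
/// `requested` plays the role of the outSAMattributes selection: NH, HI, AS
/// and nM are emitted (always with value 0) only if requested, while uT is
/// emitted unconditionally.
fn unmapped_tag_fields(requested: &[&str], reason: UnmappedReason) -> Vec<String> {
    let mut fields = Vec::new();
    for tag in ["NH", "HI", "AS", "nM"] {
        if requested.contains(&tag) {
            fields.push(format!("{}:i:0", tag));
        }
    }
    fields.push(format!("uT:A:{}", reason.ut_code()));
    fields
}
```

With all four attributes requested and reason `TooManyMismatches`, the sketch produces the fields `NH:i:0`, `HI:i:0`, `AS:i:0`, `nM:i:0`, `uT:A:2`. With no attributes requested it produces only the `uT:A:` field.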