Skip to content

erinyoung/update_mash_dist

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

247 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RefSeq Prokaryotic Mash Reference

Zenodo

This repository provides an automated, high-concurrency pipeline for generating and distributing Mash sketch files for all representative prokaryotic genomes in RefSeq.

The Automated Pipeline

To bypass GitHub Action storage limits and execution timeouts, the process is split into three sequential stages:

  1. ID Generation : Queries NCBI for the latest representative accessions and splits them into 100 manageable chunks.
  2. Distributed Sketching : Uses a job matrix to process all 100 chunks in parallel. Each job downloads genomes, generates sketches via Mash, and verifies integrity.
  3. Consolidation & Release : Merges all chunked sketches into a single master reference, verifies the final accession count, updates the repository via Pull Request, and publishes the new version to Zenodo.

Download

The consolidated reference is updated following RefSeq releases and is available on Zenodo:

Download Latest RefSeq Mash Sketches


Usage

Sketches are generated using Mash v2.3. Mash must be installed in the environment to use these files.

1. Setup

wget https://zenodo.org/records/19211013/files/RefSeqSketches_latest.msh.gz
gunzip RefSeqSketches_latest.msh.gz

2. Fast Sequence Screening

Use mash screen to identify which RefSeq genomes are present in a raw sequencing read set (FastQ) or assembly (Fasta).

# -p 8 enables multithreading
mash screen -p 8 RefSeqSketches_latest.msh query_genome.fasta | sort -gr | head -n 10

3. Genomic Distance Estimation

Use mash dist to calculate Mash distances (an approximation of ANI) between your query and the entire reference set.

mash dist -p 8 RefSeqSketches_latest.msh query_genome.fasta > distances.tab

# Sort by distance (Column 3). 0.0 = identical; 1.0 = no similarity.
sort -gk3 distances.tab | head -n 10

Pipeline Logic (for Replication)

The logic follows these core bash routines.

Stage 1: Taxonomy Filtering

NCBI Datasets CLI extracts representative genomes, specifically cleaning names to ensure compatibility with Mash headers.

./datasets summary genome taxon 2 --reference --as-json-lines | \
  ./dataformat tsv genome --fields accession,organism-name --elide-header | \
  sed 's/\[//g; s/\]//g; s/["'\'']//g; s/endosymbiont of /endosymbiont_of_/g' | \
  grep ^"G" > ids.txt

Stage 2: Download Genomes

This repository uses GitHub accessions to split ids.txt into several files, which are processed in parallel. This is what it would behave like if not split.

cut -f 1 ids.txt > accessions.txt
datasets download genome accession --no-progressbar --inputfile accessions.txt --dehydrated --filename genomes.zip
unzip -n genomes.zip -d genomes
# dehydrate and rehydrate is more stable
datasets rehydrate --no-progressbar --directory genomes

Stage 3: Add mash sketches

This repository uses GitHub accessions to split ids.txt into several files, which are processed in parallel. This is what it would behave like if not split.

ORIG_WORKSPACE=$(pwd)
DATASET_WORKSPACE=$(pwd)/tmp_datasets_workspace
CHUNK_FILE="ids.txt"

cd $ORIG_WORKSPACE
mkdir -p tmp_mash_workspace tmp_datasets_workspace

# getting absolute paths for mash files
NEW_MASH=$ORIG_WORKSPACE/new
TEMP_MASH=$ORIG_WORKSPACE/temp
FINAL_MASH=$ORIG_WORKSPACE/fin
         
while read line
do
  cd $ORIG_WORKSPACE
  rm -rf $DATASET_WORKSPACE/*
            
  id=$(echo "$line" | awk '{print $1}')
  ge=$(echo "$line" | awk '{print $2}')
  sp=$(echo "$line" | awk '{print $3}')

  # getting values for line
  echo "Looking for reference for"
  echo "Genome accession: $id"
  echo "Genus: $ge"
  echo "Species: $sp"

  # getting approximate line number for sanity
  wc -l $CHUNK_FILE
  grep -n $id $CHUNK_FILE

  GENOME_PATH=""
  # first looking to see if it downloaded in the batch (common)
  if [ -d "genomes/ncbi_dataset/data/${id}" ]
  then
    GENOME_PATH=$(find genomes/ncbi_dataset/data/${id} -name "*_genomic.fna" | head -n 1)
  fi

  # if didn't download as batch, try again as individual
  if [ ! -n "$GENOME_PATH" ]
  then
    cd $DATASET_WORKSPACE
    rm -rf $DATASET_WORKSPACE/*
             
    MAX_RETRIES=10
    attempt=0
    success=false
    echo "Accession $id was not found. Re-attempting download."
              
    until [ $attempt -ge $MAX_RETRIES ]
    do
      echo "Downloading attempt number $attempt"
      datasets download genome accession "${id}" --no-progressbar --no-progressbar --filename ncbi_dataset.zip < /dev/null
      if [ -f ncbi_dataset.zip ]
      then
        unzip ncbi_dataset.zip
        GENOME_PATH=$(find ncbi_dataset/data -name "*_genomic.fna" | head -n 1)
      fi
                
      if [ -n "$GENOME_PATH" ]
      then
        success=true
        break
      else
        rm -rf $DATASET_WORKSPACE/*
        attempt=$((attempt+1))
        sleep $((attempt * 10))s
      fi
    done
  fi

  if [ -n "$GENOME_PATH" ]
  then
    if [ ! -f "$FINAL_MASH.msh" ]
    then
      mash sketch "$GENOME_PATH" -o $FINAL_MASH -I "${ge}_${sp}_${id}"
    else
      mash sketch "$GENOME_PATH" -o $NEW_MASH -I "${ge}_${sp}_${id}"
      mash paste $TEMP_MASH $FINAL_MASH.msh $NEW_MASH.msh
      mv $TEMP_MASH.msh $FINAL_MASH.msh
      rm $NEW_MASH.msh
    fi
  else
    echo "Could not download the genome for $id"
    echo "FAILURE"
    # comment out the exit to press on
    exit 1
  fi
done < $CHUNK_FILE

Step 4. Verification

mash info -t fin.msh | cut -f 3 | grep G | rev | cut -f 1-2 -d _ | rev | sort | uniq > final_accessions.txt
cat ids.txt | grep -vf final_accessions.txt > missing_accessions.txt
          
ORIGINAL_COUNT=$(wc -l < ids.txt)
MASH_COUNT=$(wc -l < final_accessions.txt)
MISSING_COUNT=$(wc -l < missing_accessions.txt)
          
echo "- **Total Accessions Expected:** $ORIGINAL_COUNT"
echo "- **Total Accessions Sketched:** $MASH_COUNT"
echo "- **Missing Accessions:** $MISSING_COUNT"

Testing

wget -q --no-check-certificate https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/048/245/GCF_019048245.1_ASM1904824v1/GCF_019048245.1_ASM1904824v1_genomic.fna.gz
gunzip GCF_019048245.1_ASM1904824v1_genomic.fna.gz
mash screen -p 4 ${{ matrix.chunk }}.msh GCF_019048245.1_ASM1904824v1_genomic.fna | sort -gr > test.txt
head test.txt

The output should look something like this

Loading test.msh...
   15766794 distinct hashes.
Streaming from GCF_019048245.1_ASM1904824v1_genomic.fna...
   Estimated distinct k-mers in mixture: 4900608
Summing shared...
Computing coverage medians...
Writing output...
1	1000/1000	1	0	Enterobacter_hormaechei_GCF_019048245.1	[3 seqs] NZ_CP077308.1 Enterobacter hormaechei strain FDAARGOS 1433 chromosome, complete genome [...]
0.94428	300/1000	1	0	Enterobacter_intestinihominis_GCF_048568405.1	[3 seqs] NZ_CP183772.1 Enterobacter intestinihominis strain JNQH618 chromosome, complete genome [...]
0.939555	270/1000	1	0	Enterobacter_quasihormaechei_GCF_004331385.1	[51 seqs] NZ_SJON01000001.1 Enterobacter quasihormaechei strain WCHEQ120003 1, whole genome shotgun sequence [...]
0.920851	177/1000	1	0	Enterobacter_pasteurii_GCF_014930725.1	[48 seqs] NZ_JADBRO010000001.1 Enterobacter pasteurii strain P40RS contig_0001, whole genome shotgun sequence [...]
0.908727	134/1000	1	0	Enterobacter_chuandaensis_GCF_039718975.1	[2 seqs] NZ_CP097173.1 Enterobacter chuandaensis strain E20191216 chromosome, complete genome [...]
0.906747	128/1000	1	0	Enterobacter_quasimori_GCF_018597345.1	[30 seqs] NZ_JAHEVU010000001.1 Enterobacter quasimori strain 120130 contig00001, whole genome shotgun sequence [...]
0.906409	127/1000	1	0	Enterobacter_nematophilus_GCF_026344075.1	[19 seqs] NZ_JAPKNE010000001.1 Enterobacter nematophilus strain E-TC7 Strain6_1a_1, whole genome shotgun sequence [...]
0.906067	126/1000	1	0	Enterobacter_bugandensis_GCF_900324475.1	NZ_LT992502.1 Enterobacter bugandensis isolate EB-247 chromosome I
0.903965	120/1000	1	0	Enterobacter_oligotrophicus_GCF_009176645.1	NZ_AP019007.1 Enterobacter oligotrophicus strain CCA6 chromosome, complete genome
0.901759	114/1000	1	0	Enterobacter_vonholyi_GCF_040834095.1	[2 seqs] NZ_CP162152.1 Enterobacter vonholyi strain STK80-C chromosome, complete genome [...]

Maintenance

  • Trigger: The workflow is currently set to workflow_dispatch but can be automated to track the RefSeq release cycle.
  • Infrastructure: Runs on ubuntu-latest GitHub runners.
  • Storage: Large artifacts are stored in Zenodo; only the master ID list and compressed metadata are kept in this repository.

About

mash works best when given a mash dist file with a bunch of references.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors