This repository provides an automated, high-concurrency pipeline for generating and distributing Mash sketch files for all representative prokaryotic genomes in RefSeq.
To work within GitHub Actions storage limits and execution timeouts, the process is split into three sequential stages:
- ID Generation: Queries NCBI for the latest representative accessions and splits them into 100 manageable chunks.
- Distributed Sketching: Uses a job matrix to process all 100 chunks in parallel. Each job downloads genomes, generates sketches via Mash, and verifies integrity.
- Consolidation & Release: Merges all chunked sketches into a single master reference, verifies the final accession count, updates the repository via Pull Request, and publishes the new version to Zenodo.
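The chunking in stage 1 can be sketched with GNU coreutils. This is an illustrative sketch, not the repository's exact command; `ids.txt` contents and the `chunk_` prefix are stand-ins:

```shell
# Sketch of the stage-1 chunking step, assuming GNU split.
# -n l/100 splits into 100 pieces on line boundaries (no line is cut in half);
# --numeric-suffixes=1 -a 3 yields chunk_001 ... chunk_100.
seq 1 1000 > ids.txt        # stand-in for the real accession list
split --numeric-suffixes=1 -n l/100 -a 3 ids.txt chunk_
ls chunk_* | wc -l          # -> 100
```

Each `chunk_NNN` file then becomes one entry in the job matrix.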
The consolidated reference is updated following RefSeq releases and is available on Zenodo:
Download Latest RefSeq Mash Sketches
Sketches are generated using Mash v2.3. Mash must be installed in the environment to use these files.
wget https://zenodo.org/records/19211013/files/RefSeqSketches_latest.msh.gz
gunzip RefSeqSketches_latest.msh.gz

Use mash screen to identify which RefSeq genomes are present in a raw sequencing read set (FASTQ) or an assembly (FASTA).
# -p 8 enables multithreading
mash screen -p 8 RefSeqSketches_latest.msh query_genome.fasta | sort -gr | head -n 10

Use mash dist to calculate Mash distances (an approximation of ANI) between your query and the entire reference set.
mash dist -p 8 RefSeqSketches_latest.msh query_genome.fasta > distances.tab
# Sort by distance (Column 3). 0.0 = identical; 1.0 = no similarity.
sort -gk3 distances.tab | head -n 10

The pipeline logic follows these core bash routines.
The NCBI Datasets CLI extracts the representative genome accessions, cleaning organism names to ensure compatibility with Mash sketch headers.
./datasets summary genome taxon 2 --reference --as-json-lines | \
./dataformat tsv genome --fields accession,organism-name --elide-header | \
sed 's/\[//g; s/\]//g; s/["'\'']//g; s/endosymbiont of /endosymbiont_of_/g' | \
grep ^"G" > ids.txt

This repository uses GitHub Actions to split ids.txt into several files, which are processed in parallel. The following shows how the batch download would behave without splitting.
cut -f 1 ids.txt > accessions.txt
datasets download genome accession --no-progressbar --inputfile accessions.txt --dehydrated --filename genomes.zip
unzip -n genomes.zip -d genomes
# downloading dehydrated and then rehydrating is more stable for large batches
datasets rehydrate --no-progressbar --directory genomes

Each parallel job then runs the following sketching routine over its chunk of ids.txt.
ORIG_WORKSPACE=$(pwd)
DATASET_WORKSPACE=$(pwd)/tmp_datasets_workspace
CHUNK_FILE="ids.txt"
cd $ORIG_WORKSPACE
mkdir -p tmp_mash_workspace tmp_datasets_workspace
# getting absolute paths for mash files
NEW_MASH=$ORIG_WORKSPACE/new
TEMP_MASH=$ORIG_WORKSPACE/temp
FINAL_MASH=$ORIG_WORKSPACE/fin
while read -r line
do
cd $ORIG_WORKSPACE
rm -rf $DATASET_WORKSPACE/*
id=$(echo "$line" | awk '{print $1}')
ge=$(echo "$line" | awk '{print $2}')
sp=$(echo "$line" | awk '{print $3}')
# getting values for line
echo "Looking for reference for"
echo "Genome accession: $id"
echo "Genus: $ge"
echo "Species: $sp"
# getting approximate line number for sanity
wc -l $CHUNK_FILE
grep -n $id $CHUNK_FILE
GENOME_PATH=""
# first looking to see if it downloaded in the batch (common)
if [ -d "genomes/ncbi_dataset/data/${id}" ]
then
GENOME_PATH=$(find genomes/ncbi_dataset/data/${id} -name "*_genomic.fna" | head -n 1)
fi
# if didn't download as batch, try again as individual
if [ -z "$GENOME_PATH" ]
then
cd $DATASET_WORKSPACE
rm -rf $DATASET_WORKSPACE/*
MAX_RETRIES=10
attempt=0
success=false
echo "Accession $id was not found. Re-attempting download."
until [ $attempt -ge $MAX_RETRIES ]
do
echo "Downloading attempt number $attempt"
datasets download genome accession "${id}" --no-progressbar --filename ncbi_dataset.zip < /dev/null
if [ -f ncbi_dataset.zip ]
then
unzip ncbi_dataset.zip
GENOME_PATH=$(find ncbi_dataset/data -name "*_genomic.fna" | head -n 1)
fi
if [ -n "$GENOME_PATH" ]
then
success=true
break
else
rm -rf $DATASET_WORKSPACE/*
attempt=$((attempt+1))
sleep $((attempt * 10)) # linear backoff between retries
fi
done
fi
if [ -n "$GENOME_PATH" ]
then
if [ ! -f "$FINAL_MASH.msh" ]
then
mash sketch "$GENOME_PATH" -o $FINAL_MASH -I "${ge}_${sp}_${id}"
else
mash sketch "$GENOME_PATH" -o $NEW_MASH -I "${ge}_${sp}_${id}"
mash paste $TEMP_MASH $FINAL_MASH.msh $NEW_MASH.msh
mv $TEMP_MASH.msh $FINAL_MASH.msh
rm $NEW_MASH.msh
fi
else
echo "Could not download the genome for $id"
echo "FAILURE"
# comment out the exit to press on
exit 1
fi
done < $CHUNK_FILE

After all chunks finish, the consolidation stage extracts the accession list from the merged sketch and compares it against the expected IDs.

mash info -t fin.msh | cut -f 3 | grep G | rev | cut -f 1-2 -d _ | rev | sort | uniq > final_accessions.txt
cat ids.txt | grep -vf final_accessions.txt > missing_accessions.txt
ORIGINAL_COUNT=$(wc -l < ids.txt)
MASH_COUNT=$(wc -l < final_accessions.txt)
MISSING_COUNT=$(wc -l < missing_accessions.txt)
echo "- **Total Accessions Expected:** $ORIGINAL_COUNT"
echo "- **Total Accessions Sketched:** $MASH_COUNT"
echo "- **Missing Accessions:** $MISSING_COUNT"

Each sketch is smoke-tested by screening a known genome against it:

wget -q --no-check-certificate https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/048/245/GCF_019048245.1_ASM1904824v1/GCF_019048245.1_ASM1904824v1_genomic.fna.gz
gunzip GCF_019048245.1_ASM1904824v1_genomic.fna.gz
mash screen -p 4 ${{ matrix.chunk }}.msh GCF_019048245.1_ASM1904824v1_genomic.fna | sort -gr > test.txt
head test.txt

The output should look something like this:
Loading test.msh...
15766794 distinct hashes.
Streaming from GCF_019048245.1_ASM1904824v1_genomic.fna...
Estimated distinct k-mers in mixture: 4900608
Summing shared...
Computing coverage medians...
Writing output...
1 1000/1000 1 0 Enterobacter_hormaechei_GCF_019048245.1 [3 seqs] NZ_CP077308.1 Enterobacter hormaechei strain FDAARGOS 1433 chromosome, complete genome [...]
0.94428 300/1000 1 0 Enterobacter_intestinihominis_GCF_048568405.1 [3 seqs] NZ_CP183772.1 Enterobacter intestinihominis strain JNQH618 chromosome, complete genome [...]
0.939555 270/1000 1 0 Enterobacter_quasihormaechei_GCF_004331385.1 [51 seqs] NZ_SJON01000001.1 Enterobacter quasihormaechei strain WCHEQ120003 1, whole genome shotgun sequence [...]
0.920851 177/1000 1 0 Enterobacter_pasteurii_GCF_014930725.1 [48 seqs] NZ_JADBRO010000001.1 Enterobacter pasteurii strain P40RS contig_0001, whole genome shotgun sequence [...]
0.908727 134/1000 1 0 Enterobacter_chuandaensis_GCF_039718975.1 [2 seqs] NZ_CP097173.1 Enterobacter chuandaensis strain E20191216 chromosome, complete genome [...]
0.906747 128/1000 1 0 Enterobacter_quasimori_GCF_018597345.1 [30 seqs] NZ_JAHEVU010000001.1 Enterobacter quasimori strain 120130 contig00001, whole genome shotgun sequence [...]
0.906409 127/1000 1 0 Enterobacter_nematophilus_GCF_026344075.1 [19 seqs] NZ_JAPKNE010000001.1 Enterobacter nematophilus strain E-TC7 Strain6_1a_1, whole genome shotgun sequence [...]
0.906067 126/1000 1 0 Enterobacter_bugandensis_GCF_900324475.1 NZ_LT992502.1 Enterobacter bugandensis isolate EB-247 chromosome I
0.903965 120/1000 1 0 Enterobacter_oligotrophicus_GCF_009176645.1 NZ_AP019007.1 Enterobacter oligotrophicus strain CCA6 chromosome, complete genome
0.901759 114/1000 1 0 Enterobacter_vonholyi_GCF_040834095.1 [2 seqs] NZ_CP162152.1 Enterobacter vonholyi strain STK80-C chromosome, complete genome [...]

- Trigger: The workflow is currently set to workflow_dispatch but can be automated to track the RefSeq release cycle.
- Infrastructure: Runs on ubuntu-latest GitHub runners.
- Storage: Large artifacts are stored in Zenodo; only the master ID list and compressed metadata are kept in this repository.
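Automating the trigger would only require extending the workflow's `on:` block. A hypothetical sketch, keeping manual dispatch and adding a cron schedule (the specific schedule below is an assumption, not the repository's configuration):

```yaml
# Hypothetical trigger block for the workflow file.
on:
  workflow_dispatch:        # keep the current manual trigger
  schedule:
    # Assumed cadence: run early in the month, every other month,
    # roughly tracking the RefSeq release cycle.
    - cron: "0 3 5 1,3,5,7,9,11 *"
```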