Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
77 changes: 76 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,11 @@ For this version, the `--read_type` argument accepts any minimap2 options string
conda install -c bioconda mafft iqtree minimap2 samtools
```

For the coalescent species tree (step 4), [ASTER](https://github.com/chaoszhang/ASTER) (the C++ implementation of ASTRAL-III) is required. [ClipKIT](https://github.com/JLSteenwyk/ClipKIT) is optional and used only when `--trim` is passed.
```
conda install -c bioconda aster clipkit
```

Then, you can install the read2tree package after downlaoding the package from this GitHub repo using

```
Expand Down Expand Up @@ -79,7 +84,9 @@ cat marker_genes/*.fna > dna_ref.fa

### output

The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as default.
The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as default (concatenation/supermatrix approach).

For a coalescent-based species tree that accounts for incomplete lineage sorting and differing gene tree histories, run the optional **step 4** after step 3 (see below).


### Single species mode
Expand Down Expand Up @@ -121,6 +128,74 @@ Tunable filters (only active with `--meta`):

**Note on false positives:** metagenomic mode is permissive by design and may include species that share marker reads only by chance. We recommend tuning `--meta_min_markers` and `--meta_marker_fraction` to your dataset (e.g. `50` and `0.5` as a starting point for typical microbial communities) to reduce false positives.

#### step4 (optional: coalescent species tree)

Step 3 builds a supermatrix tree by concatenating all OG alignments. If you want a **coalescent-based species tree** instead, which better handles incomplete lineage sorting and the different evolutionary histories of individual genes, run step 4 after step 3:

```
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24
```

Step 4 does the following automatically:
1. Filters the per-OG alignments from step 3 by taxon occupancy (`--min_samples`, default 10) and gap fraction (`--max_gap`, default 0.80).
2. Runs IQ-TREE on each passing alignment in parallel (`-m LG+F+G`, `-alrt 1000`, `--abayes`, `-fast`) to infer individual gene trees.
3. Collects all gene trees and passes them to [ASTER](https://github.com/chaoszhang/ASTER) to produce the final coalescent species tree.

**ASTER binary.** The ASTER suite provides several binaries, all installed by `conda install aster`. By default, step 4 auto-detects the first available in your PATH (`astral3`,`astral-pro3`,`astral-pro2`,`wastral`,`astral4`). Use `--astral_binary` to opt into a specific estimator:

| Binary | When to use |
|---|---|
| `astral3` | Default. Standard ASTRAL-III for single-copy orthologs. |
| `astral-pro3` | Gene trees with paralogs (multi-copy gene families). |
| `wastral` | Recommended for noisy gene trees. Weights each quartet by gene tree branch support, so poorly supported splits contribute less to the species tree. IQ-TREE's `--abayes` supports (already included in step 4) provide the weighting signal. |
| `astral4` | Large datasets with substantial missing taxa, or when substitution-rate branch lengths on internal nodes are needed for downstream rate analyses. |

```
# weighted ASTRAL - better accuracy when gene tree support is variable
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --astral_binary wastral

# ASTRAL-IV - better robustness under missing data
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --astral_binary astral4
```

**Choosing `--min_samples` for large datasets.** The default of 10 is intentionally permissive so the tool works out of the box for small test datasets. For studies with many samples, the occupancy threshold has a large effect on how many OGs survive filtering and on the quality of the resulting gene trees. A useful empirical guideline is to require at least 30–40% taxon occupancy. For example, `--min_samples 100` for a dataset of ~300 samples. In practice, applying a meaningful occupancy threshold together with `--max_gap 0.80` can reduce the number of OGs from tens of thousands to a few hundred; this is expected and desirable, as the surviving alignments are well-sampled across the tree and produce far more reliable gene trees for ASTRAL than a large set of sparse, gap-heavy alignments would. If the filtered set is very small (fewer than ~50 OGs), consider relaxing `--min_samples` slightly rather than `--max_gap`, since taxon occupancy drives gene tree resolution more than column-level gap content.

Optionally, pass `--trim` to run [ClipKIT](https://github.com/JLSteenwyk/ClipKIT) column-trimming on each alignment before gene tree inference:
```
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --trim
```

**IQ-TREE and ASTER options.** Step 4 uses sensible defaults for per-gene tree inference (`-m LG+F+G -alrt 1000 --abayes -fast`). For cases where these need to be adjusted, three pass-through arguments are available:

| Argument | Default | Purpose |
|---|---|---|
| `--iqtree_model` | `LG+F+G` | Substitution model for per-gene IQ-TREE runs (e.g. `WAG+G`, `LG+G`, `TEST` for ModelFinder) |
| `--iqtree_args` | none | Extra flags appended verbatim to every per-gene IQ-TREE call (e.g. `"-B 1000"`) |
| `--astral_args` | none | Extra flags appended verbatim to the ASTER call; see the ASTER documentation for available options |
| `--no_fast` | off | Disable the `-fast` flag to run a full ML tree search per gene. Required when using bootstrap via `--iqtree_args` (e.g. `--iqtree_args "-B 1000"`), as `-fast` and bootstrap are incompatible in IQ-TREE |
| `--dna` | off | Use DNA alignments from `06_align_merge_dna` instead of amino acid alignments. Recommended for closely related species. Default model switches to `GTR+G` |

```
# Use WAG+G model instead of LG+F+G
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --iqtree_model WAG+G

# Run ModelFinder per gene (much slower, but selects best-fit model for each OG)
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --iqtree_model TEST

# Full ML search with ultrafast bootstrap (--no_fast required when using -B)
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --no_fast --iqtree_args "-B 1000"

# DNA-based coalescent tree (recommended for closely related species)
read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --dna
```

Key output files written to `output/` (suffix is `aa` by default, `dna` when `--dna` is used):
- `07_astral_filtered_aa/` or `07_astral_filtered_dna/` - per-OG FASTA alignments that passed filtering
- `07_astral_trimmed_aa/` or `07_astral_trimmed_dna/` - ClipKIT-trimmed alignments (only when `--trim` is used)
- `08_gene_trees_aa/` or `08_gene_trees_dna/` - individual IQ-TREE gene tree files
- `gene_trees_merge_aa.nwk` or `gene_trees_merge_dna.nwk` - all gene trees concatenated into one file (input to ASTER)
- `astral_tree_merge_aa.nwk` or `astral_tree_merge_dna.nwk` - the final coalescent species tree in Newick format

### bootstraping

To have bootstrap values a metric for quality of internal nodes, you can run the following
Expand Down
2 changes: 2 additions & 0 deletions environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,3 +23,5 @@ dependencies:
- samtools
- filelock
- pysam
- aster
- clipkit
248 changes: 248 additions & 0 deletions read2tree/CoalescentInference.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,248 @@
#!/usr/bin/env python
'''
CoalescentInference: filter per-OG alignments, infer per-gene trees with IQ-TREE,
and run ASTER to produce a coalescent species tree (step 4astral).

-- Step 4 of the read2tree pipeline.
'''
import os
import glob
import time
import logging
from multiprocessing import Pool
from Bio import SeqIO, AlignIO

from read2tree.wrappers.treebuilders.iqtree import Iqtree, get_gene_tree_options, get_gene_tree_dna_options
from read2tree.wrappers.treebuilders.base_treebuilder import DataType
from read2tree.wrappers.treebuilders.aster import Aster
from read2tree.wrappers.options import StringOption
from read2tree.wrappers import WrapperError

logger = logging.getLogger(__name__)


def _run_gene_tree(task):
"""
Module-level worker for multiprocessing pool.
Runs IQ-TREE on a single alignment file and writes the treefile.
Returns the Newick tree string, or None on failure.
"""
alignment_file, gene_trees_folder, iqtree_model, iqtree_extra, no_fast, seq_type = task
og_name = os.path.basename(alignment_file).rsplit('.', 1)[0]
treefile = os.path.join(gene_trees_folder, og_name + '.treefile')
if os.path.exists(treefile) and os.path.getsize(treefile) > 0:
with open(treefile, 'r') as fh:
return fh.read().strip()
try:
if seq_type == 'dna':
iqtree_wrapper = Iqtree(alignment_file, datatype=DataType.DNA)
iqtree_wrapper.options = get_gene_tree_dna_options()
else:
iqtree_wrapper = Iqtree(alignment_file, datatype=DataType.PROTEIN)
iqtree_wrapper.options = get_gene_tree_options()
if iqtree_model:
iqtree_wrapper.options.options['-m'].set_value(iqtree_model)
if no_fast:
iqtree_wrapper.options.options['-fast'].active = False
if iqtree_extra:
iqtree_wrapper.options.options['_extra'] = StringOption('', iqtree_extra, active=True)
tree = iqtree_wrapper()
if tree:
with open(treefile, 'w') as fh:
fh.write(tree.strip() + '\n')
return tree
except Exception as e:
logger.error('Gene tree failed for {}: {}'.format(og_name, e))
return None


class CoalescentInference(object):
"""
Orchestrates the coalescent species tree pipeline (step 4astral):

1. Filter per-OG alignments from 06_align_merge_aa by gap fraction and
taxon occupancy, writing clean FASTA to 07_astral_filtered_aa.
2. Optionally trim filtered alignments with ClipKIT (--trim flag),
writing results to 07_astral_trimmed_aa.
3. Run IQ-TREE on each alignment in parallel using multiprocessing,
writing individual gene treefiles to 08_gene_trees.
4. Concatenate gene trees into a single treefile and run ASTER to
produce the coalescent species tree.
"""

def __init__(self, args):
self.args = args
self._species_name = 'merge'
self._seq_type = 'dna' if getattr(args, 'dna', False) else 'aa'
self._filtered_folder = self._make_output_path('07_astral_filtered_' + self._seq_type)
self._trimmed_folder = self._make_output_path('07_astral_trimmed_' + self._seq_type) if args.trim else None
self._gene_trees_folder = self._make_output_path('08_gene_trees_' + self._seq_type)
self.elapsed_time = 0
self.tree = None
self._run()

def _run(self):
start = time.time()

filtered_files = self._filter_alignments()
if not filtered_files:
logger.error('{}: No alignments passed filtering for step 4astral.'.format(self._species_name))
return

if self.args.trim:
input_files = self._trim_alignments(filtered_files)
if not input_files:
logger.error('{}: No alignments remain after ClipKIT trimming.'.format(self._species_name))
return
else:
input_files = filtered_files

gene_tree_file = self._infer_gene_trees(input_files)
self.tree = self._infer_species_tree(gene_tree_file)

end = time.time()
self.elapsed_time = end - start
logger.info('{}: Step 4astral coalescent inference took {:.2f}s.'.format(
self._species_name, self.elapsed_time))

def _make_output_path(self, prefix):
path = os.path.join(self.args.output_path, prefix)
if not os.path.exists(path):
os.makedirs(path)
return path

def _filter_alignments(self):
"""
Filter per-OG alignments from 06_align_merge_aa by gap fraction and occupancy.

The .fa files produced by step 3combine are phylip-relaxed format despite their
extension; this method reads them correctly and writes passing alignments as
standard FASTA to 07_astral_filtered_aa.

:return: list of paths to filtered FASTA files
"""
input_folder = os.path.join(self.args.output_path, '06_align_merge_' + self._seq_type)
log_path = os.path.join(self._filtered_folder, 'filtering_summary.txt')
filtered_files = []
total = 0
passed = 0
dropped = 0

with open(log_path, 'w') as log:
log.write('OG_Name\tOriginal_Sequences\tSequences_Passing_Gap_Filter\tStatus\n')
for filepath in sorted(glob.glob(os.path.join(input_folder, '*.fa'))):
total += 1
og_name = os.path.basename(filepath)
orig_count = 0
valid_records = []
try:
alignment = AlignIO.read(filepath, 'phylip-relaxed')
for record in alignment:
orig_count += 1
seq_str = str(record.seq).upper()
gap_count = seq_str.count('-') + seq_str.count('X') + seq_str.count('N')
if len(seq_str) > 0 and (gap_count / len(seq_str)) <= self.args.max_gap:
valid_records.append(record)

if len(valid_records) >= self.args.min_samples:
out_path = os.path.join(self._filtered_folder,
og_name.replace('.fa', '.fasta'))
SeqIO.write(valid_records, out_path, 'fasta')
filtered_files.append(out_path)
passed += 1
status = 'KEPT'
else:
dropped += 1
status = 'DROPPED (Low Occupancy)'
except Exception as e:
dropped += 1
status = 'ERROR: {}'.format(e)

log.write('{}\t{}\t{}\t{}\n'.format(og_name, orig_count, len(valid_records), status))

log.write('\n=== TOTALS ===\n')
log.write('Total evaluated: {}\nKept: {}\nDropped/failed: {}\n'.format(
total, passed, dropped))
if total > 0:
log.write('Retention rate: {:.2f}%\n'.format((passed / total) * 100))

logger.info('{}: Alignment filtering kept {} of {} OGs.'.format(
self._species_name, passed, total))
return filtered_files

def _trim_alignments(self, filtered_files):
"""
Run ClipKIT on each filtered alignment.

:param filtered_files: list of FASTA alignment paths
:return: list of trimmed FASTA paths that are non-empty after trimming
"""
from read2tree.wrappers.aligners.clipkit import Clipkit
trimmed_files = []
for fasta_file in filtered_files:
og_name = os.path.basename(fasta_file)
trimmed_path = os.path.join(self._trimmed_folder, og_name)
try:
clipkit_wrapper = Clipkit(fasta_file, trimmed_path)
result = clipkit_wrapper()
if result:
trimmed_files.append(result)
else:
logger.warning('{}: ClipKIT produced empty output for {}, skipping.'.format(
self._species_name, og_name))
except WrapperError as e:
logger.error('{}: ClipKIT failed for {}: {}'.format(self._species_name, og_name, e))
logger.info('{}: ClipKIT trimming kept {} of {} alignments.'.format(
self._species_name, len(trimmed_files), len(filtered_files)))
return trimmed_files

def _infer_gene_trees(self, alignment_files):
"""
Run IQ-TREE on each alignment in parallel using multiprocessing.Pool.
Collects all gene trees into a single Newick file for ASTER.

:param alignment_files: list of FASTA alignment paths
:return: path to the concatenated gene treefile
"""
iqtree_model = getattr(self.args, 'iqtree_model', None)
iqtree_extra = getattr(self.args, 'iqtree_args', None)
no_fast = getattr(self.args, 'no_fast', False)
tasks = [(f, self._gene_trees_folder, iqtree_model, iqtree_extra, no_fast, self._seq_type) for f in alignment_files]
logger.info('{}: Running per-gene IQ-TREE on {} alignments with {} workers.'.format(
self._species_name, len(tasks), self.args.threads))

p = Pool(self.args.threads)
results = p.map(_run_gene_tree, tasks)
p.close()
p.join()

trees = [t for t in results if t is not None]
gene_tree_file = os.path.join(self.args.output_path,
'gene_trees_{}_{}.nwk'.format(self._species_name, self._seq_type))
with open(gene_tree_file, 'w') as fh:
for tree in trees:
fh.write(tree.strip() + '\n')

logger.info('{}: {} of {} gene trees successfully inferred.'.format(
self._species_name, len(trees), len(tasks)))
return gene_tree_file

def _infer_species_tree(self, gene_tree_file):
"""
Run ASTER on the collected gene trees to estimate a coalescent species tree.

:param gene_tree_file: path to file containing one gene tree (Newick) per line
:return: coalescent species tree in Newick format
"""
species_tree_file = os.path.join(self.args.output_path,
'astral_tree_{}_{}.nwk'.format(self._species_name, self._seq_type))
aster_wrapper = Aster(gene_tree_file, species_tree_file,
binary=getattr(self.args, 'astral_binary', None))
aster_wrapper.options.options['-t'].set_value(self.args.threads)
astral_extra = getattr(self.args, 'astral_args', None)
if astral_extra:
aster_wrapper.options.options['_extra'] = StringOption('', astral_extra, active=True)
tree = aster_wrapper()
logger.info('{}: Coalescent species tree written to {}'.format(
self._species_name, species_tree_file))
return tree
Loading