diff --git a/README.md b/README.md index 13282ea4..95d2e3e2 100644 --- a/README.md +++ b/README.md @@ -51,6 +51,11 @@ For this version, the `--read_type` argument accepts any minimap2 options string conda install -c bioconda mafft iqtree minimap2 samtools ``` +For the coalescent species tree (step 4), [ASTER](https://github.com/chaoszhang/ASTER) (the C++ implementation of ASTRAL-III) is required. [ClipKIT](https://github.com/JLSteenwyk/ClipKIT) is optional and used only when `--trim` is passed. +``` +conda install -c bioconda aster clipkit +``` + Then, you can install the read2tree package after downlaoding the package from this GitHub repo using ``` @@ -79,7 +84,9 @@ cat marker_genes/*.fna > dna_ref.fa ### output -The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as default. +The output of Read2Tree is the concatenated alignments as a fasta file where each record corresponds to one species. We also provide the option `--tree` for inferring the species tree using IQTREE as default (concatenation/supermatrix approach). + +For a coalescent-based species tree that accounts for incomplete lineage sorting and differing gene tree histories, run the optional **step 4** after step 3 (see below). ### Single species mode @@ -121,6 +128,74 @@ Tunable filters (only active with `--meta`): **Note on false positives:** metagenomic mode is permissive by design and may include species that share marker reads only by chance. We recommend tuning `--meta_min_markers` and `--meta_marker_fraction` to your dataset (e.g. `50` and `0.5` as a starting point for typical microbial communities) to reduce false positives. +#### step4 (optional: coalescent species tree) + +Step 3 builds a supermatrix tree by concatenating all OG alignments. If you want a **coalescent-based species tree** instead, which better handles incomplete lineage sorting and the different evolutionary histories of individual genes, run step 4 after step 3: + +``` +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 +``` + +Step 4 does the following automatically: +1. Filters the per-OG alignments from step 3 by taxon occupancy (`--min_samples`, default 10) and gap fraction (`--max_gap`, default 0.80). +2. Runs IQ-TREE on each passing alignment in parallel (`-m LG+F+G`, `-alrt 1000`, `--abayes`, `-fast`) to infer individual gene trees. +3. Collects all gene trees and passes them to [ASTER](https://github.com/chaoszhang/ASTER) to produce the final coalescent species tree. + +**ASTER binary.** The ASTER suite provides several binaries, all installed by `conda install aster`. By default, step 4 auto-detects the first available in your PATH (`astral3`,`astral-pro3`,`astral-pro2`,`wastral`,`astral4`). Use `--astral_binary` to opt into a specific estimator: + +| Binary | When to use | +|---|---| +| `astral3` | Default. Standard ASTRAL-III for single-copy orthologs. | +| `astral-pro3` | Gene trees with paralogs (multi-copy gene families). | +| `wastral` | Recommended for noisy gene trees. Weights each quartet by gene tree branch support, so poorly supported splits contribute less to the species tree. IQ-TREE's `--abayes` supports (already included in step 4) provide the weighting signal. | +| `astral4` | Large datasets with substantial missing taxa, or when substitution-rate branch lengths on internal nodes are needed for downstream rate analyses. | + +``` +# weighted ASTRAL - better accuracy when gene tree support is variable +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --astral_binary wastral + +# ASTRAL-IV - better robustness under missing data +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --astral_binary astral4 +``` + +**Choosing `--min_samples` for large datasets.** The default of 10 is intentionally permissive so the tool works out of the box for small test datasets. For studies with many samples, the occupancy threshold has a large effect on how many OGs survive filtering and on the quality of the resulting gene trees. A useful empirical guideline is to require at least 30–40% taxon occupancy. For example, `--min_samples 100` for a dataset of ~300 samples. In practice, applying a meaningful occupancy threshold together with `--max_gap 0.80` can reduce the number of OGs from tens of thousands to a few hundred; this is expected and desirable, as the surviving alignments are well-sampled across the tree and produce far more reliable gene trees for ASTRAL than a large set of sparse, gap-heavy alignments would. If the filtered set is very small (fewer than ~50 OGs), consider relaxing `--min_samples` slightly rather than `--max_gap`, since taxon occupancy drives gene tree resolution more than column-level gap content. + +Optionally, pass `--trim` to run [ClipKIT](https://github.com/JLSteenwyk/ClipKIT) column-trimming on each alignment before gene tree inference: +``` +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --trim +``` + +**IQ-TREE and ASTER options.** Step 4 uses sensible defaults for per-gene tree inference (`-m LG+F+G -alrt 1000 --abayes -fast`). For cases where these need to be adjusted, three pass-through arguments are available: + +| Argument | Default | Purpose | +|---|---|---| +| `--iqtree_model` | `LG+F+G` | Substitution model for per-gene IQ-TREE runs (e.g. `WAG+G`, `LG+G`, `TEST` for ModelFinder) | +| `--iqtree_args` | none | Extra flags appended verbatim to every per-gene IQ-TREE call (e.g. `"-B 1000"`) | +| `--astral_args` | none | Extra flags appended verbatim to the ASTER call; see the ASTER documentation for available options | +| `--no_fast` | off | Disable the `-fast` flag to run a full ML tree search per gene. Required when using bootstrap via `--iqtree_args` (e.g. `--iqtree_args "-B 1000"`), as `-fast` and bootstrap are incompatible in IQ-TREE | +| `--dna` | off | Use DNA alignments from `06_align_merge_dna` instead of amino acid alignments. Recommended for closely related species. Default model switches to `GTR+G` | + +``` +# Use WAG+G model instead of LG+F+G +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --iqtree_model WAG+G + +# Run ModelFinder per gene (much slower, but selects best-fit model for each OG) +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --iqtree_model TEST + +# Full ML search with ultrafast bootstrap (--no_fast required when using -B) +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --no_fast --iqtree_args "-B 1000" + +# DNA-based coalescent tree (recommended for closely related species) +read2tree --step 4astral --standalone_path marker_genes --dna_reference dna_ref.fa --output_path output --threads 24 --dna +``` + +Key output files written to `output/` (suffix is `aa` by default, `dna` when `--dna` is used): +- `07_astral_filtered_aa/` or `07_astral_filtered_dna/` - per-OG FASTA alignments that passed filtering +- `07_astral_trimmed_aa/` or `07_astral_trimmed_dna/` - ClipKIT-trimmed alignments (only when `--trim` is used) +- `08_gene_trees_aa/` or `08_gene_trees_dna/` - individual IQ-TREE gene tree files +- `gene_trees_merge_aa.nwk` or `gene_trees_merge_dna.nwk` - all gene trees concatenated into one file (input to ASTER) +- `astral_tree_merge_aa.nwk` or `astral_tree_merge_dna.nwk` - the final coalescent species tree in Newick format + ### bootstraping To have bootstrap values a metric for quality of internal nodes, you can run the following diff --git a/environment.yml b/environment.yml index 5355b124..c798783c 100644 --- a/environment.yml +++ b/environment.yml @@ -23,3 +23,5 @@ dependencies: - samtools - filelock - pysam + - aster + - clipkit diff --git a/read2tree/CoalescentInference.py b/read2tree/CoalescentInference.py new file mode 100644 index 00000000..f954a088 --- /dev/null +++ b/read2tree/CoalescentInference.py @@ -0,0 +1,248 @@ +#!/usr/bin/env python +''' + CoalescentInference: filter per-OG alignments, infer per-gene trees with IQ-TREE, + and run ASTER to produce a coalescent species tree (step 4astral). + + -- Step 4 of the read2tree pipeline. +''' +import os +import glob +import time +import logging +from multiprocessing import Pool +from Bio import SeqIO, AlignIO + +from read2tree.wrappers.treebuilders.iqtree import Iqtree, get_gene_tree_options, get_gene_tree_dna_options +from read2tree.wrappers.treebuilders.base_treebuilder import DataType +from read2tree.wrappers.treebuilders.aster import Aster +from read2tree.wrappers.options import StringOption +from read2tree.wrappers import WrapperError + +logger = logging.getLogger(__name__) + + +def _run_gene_tree(task): + """ + Module-level worker for multiprocessing pool. + Runs IQ-TREE on a single alignment file and writes the treefile. + Returns the Newick tree string, or None on failure. + """ + alignment_file, gene_trees_folder, iqtree_model, iqtree_extra, no_fast, seq_type = task + og_name = os.path.basename(alignment_file).rsplit('.', 1)[0] + treefile = os.path.join(gene_trees_folder, og_name + '.treefile') + if os.path.exists(treefile) and os.path.getsize(treefile) > 0: + with open(treefile, 'r') as fh: + return fh.read().strip() + try: + if seq_type == 'dna': + iqtree_wrapper = Iqtree(alignment_file, datatype=DataType.DNA) + iqtree_wrapper.options = get_gene_tree_dna_options() + else: + iqtree_wrapper = Iqtree(alignment_file, datatype=DataType.PROTEIN) + iqtree_wrapper.options = get_gene_tree_options() + if iqtree_model: + iqtree_wrapper.options.options['-m'].set_value(iqtree_model) + if no_fast: + iqtree_wrapper.options.options['-fast'].active = False + if iqtree_extra: + iqtree_wrapper.options.options['_extra'] = StringOption('', iqtree_extra, active=True) + tree = iqtree_wrapper() + if tree: + with open(treefile, 'w') as fh: + fh.write(tree.strip() + '\n') + return tree + except Exception as e: + logger.error('Gene tree failed for {}: {}'.format(og_name, e)) + return None + + +class CoalescentInference(object): + """ + Orchestrates the coalescent species tree pipeline (step 4astral): + + 1. Filter per-OG alignments from 06_align_merge_aa by gap fraction and + taxon occupancy, writing clean FASTA to 07_astral_filtered_aa. + 2. Optionally trim filtered alignments with ClipKIT (--trim flag), + writing results to 07_astral_trimmed_aa. + 3. Run IQ-TREE on each alignment in parallel using multiprocessing, + writing individual gene treefiles to 08_gene_trees. + 4. Concatenate gene trees into a single treefile and run ASTER to + produce the coalescent species tree. + """ + + def __init__(self, args): + self.args = args + self._species_name = 'merge' + self._seq_type = 'dna' if getattr(args, 'dna', False) else 'aa' + self._filtered_folder = self._make_output_path('07_astral_filtered_' + self._seq_type) + self._trimmed_folder = self._make_output_path('07_astral_trimmed_' + self._seq_type) if args.trim else None + self._gene_trees_folder = self._make_output_path('08_gene_trees_' + self._seq_type) + self.elapsed_time = 0 + self.tree = None + self._run() + + def _run(self): + start = time.time() + + filtered_files = self._filter_alignments() + if not filtered_files: + logger.error('{}: No alignments passed filtering for step 4astral.'.format(self._species_name)) + return + + if self.args.trim: + input_files = self._trim_alignments(filtered_files) + if not input_files: + logger.error('{}: No alignments remain after ClipKIT trimming.'.format(self._species_name)) + return + else: + input_files = filtered_files + + gene_tree_file = self._infer_gene_trees(input_files) + self.tree = self._infer_species_tree(gene_tree_file) + + end = time.time() + self.elapsed_time = end - start + logger.info('{}: Step 4astral coalescent inference took {:.2f}s.'.format( + self._species_name, self.elapsed_time)) + + def _make_output_path(self, prefix): + path = os.path.join(self.args.output_path, prefix) + if not os.path.exists(path): + os.makedirs(path) + return path + + def _filter_alignments(self): + """ + Filter per-OG alignments from 06_align_merge_aa by gap fraction and occupancy. + + The .fa files produced by step 3combine are phylip-relaxed format despite their + extension; this method reads them correctly and writes passing alignments as + standard FASTA to 07_astral_filtered_aa. + + :return: list of paths to filtered FASTA files + """ + input_folder = os.path.join(self.args.output_path, '06_align_merge_' + self._seq_type) + log_path = os.path.join(self._filtered_folder, 'filtering_summary.txt') + filtered_files = [] + total = 0 + passed = 0 + dropped = 0 + + with open(log_path, 'w') as log: + log.write('OG_Name\tOriginal_Sequences\tSequences_Passing_Gap_Filter\tStatus\n') + for filepath in sorted(glob.glob(os.path.join(input_folder, '*.fa'))): + total += 1 + og_name = os.path.basename(filepath) + orig_count = 0 + valid_records = [] + try: + alignment = AlignIO.read(filepath, 'phylip-relaxed') + for record in alignment: + orig_count += 1 + seq_str = str(record.seq).upper() + gap_count = seq_str.count('-') + seq_str.count('X') + seq_str.count('N') + if len(seq_str) > 0 and (gap_count / len(seq_str)) <= self.args.max_gap: + valid_records.append(record) + + if len(valid_records) >= self.args.min_samples: + out_path = os.path.join(self._filtered_folder, + og_name.replace('.fa', '.fasta')) + SeqIO.write(valid_records, out_path, 'fasta') + filtered_files.append(out_path) + passed += 1 + status = 'KEPT' + else: + dropped += 1 + status = 'DROPPED (Low Occupancy)' + except Exception as e: + dropped += 1 + status = 'ERROR: {}'.format(e) + + log.write('{}\t{}\t{}\t{}\n'.format(og_name, orig_count, len(valid_records), status)) + + log.write('\n=== TOTALS ===\n') + log.write('Total evaluated: {}\nKept: {}\nDropped/failed: {}\n'.format( + total, passed, dropped)) + if total > 0: + log.write('Retention rate: {:.2f}%\n'.format((passed / total) * 100)) + + logger.info('{}: Alignment filtering kept {} of {} OGs.'.format( + self._species_name, passed, total)) + return filtered_files + + def _trim_alignments(self, filtered_files): + """ + Run ClipKIT on each filtered alignment. + + :param filtered_files: list of FASTA alignment paths + :return: list of trimmed FASTA paths that are non-empty after trimming + """ + from read2tree.wrappers.aligners.clipkit import Clipkit + trimmed_files = [] + for fasta_file in filtered_files: + og_name = os.path.basename(fasta_file) + trimmed_path = os.path.join(self._trimmed_folder, og_name) + try: + clipkit_wrapper = Clipkit(fasta_file, trimmed_path) + result = clipkit_wrapper() + if result: + trimmed_files.append(result) + else: + logger.warning('{}: ClipKIT produced empty output for {}, skipping.'.format( + self._species_name, og_name)) + except WrapperError as e: + logger.error('{}: ClipKIT failed for {}: {}'.format(self._species_name, og_name, e)) + logger.info('{}: ClipKIT trimming kept {} of {} alignments.'.format( + self._species_name, len(trimmed_files), len(filtered_files))) + return trimmed_files + + def _infer_gene_trees(self, alignment_files): + """ + Run IQ-TREE on each alignment in parallel using multiprocessing.Pool. + Collects all gene trees into a single Newick file for ASTER. + + :param alignment_files: list of FASTA alignment paths + :return: path to the concatenated gene treefile + """ + iqtree_model = getattr(self.args, 'iqtree_model', None) + iqtree_extra = getattr(self.args, 'iqtree_args', None) + no_fast = getattr(self.args, 'no_fast', False) + tasks = [(f, self._gene_trees_folder, iqtree_model, iqtree_extra, no_fast, self._seq_type) for f in alignment_files] + logger.info('{}: Running per-gene IQ-TREE on {} alignments with {} workers.'.format( + self._species_name, len(tasks), self.args.threads)) + + p = Pool(self.args.threads) + results = p.map(_run_gene_tree, tasks) + p.close() + p.join() + + trees = [t for t in results if t is not None] + gene_tree_file = os.path.join(self.args.output_path, + 'gene_trees_{}_{}.nwk'.format(self._species_name, self._seq_type)) + with open(gene_tree_file, 'w') as fh: + for tree in trees: + fh.write(tree.strip() + '\n') + + logger.info('{}: {} of {} gene trees successfully inferred.'.format( + self._species_name, len(trees), len(tasks))) + return gene_tree_file + + def _infer_species_tree(self, gene_tree_file): + """ + Run ASTER on the collected gene trees to estimate a coalescent species tree. + + :param gene_tree_file: path to file containing one gene tree (Newick) per line + :return: coalescent species tree in Newick format + """ + species_tree_file = os.path.join(self.args.output_path, + 'astral_tree_{}_{}.nwk'.format(self._species_name, self._seq_type)) + aster_wrapper = Aster(gene_tree_file, species_tree_file, + binary=getattr(self.args, 'astral_binary', None)) + aster_wrapper.options.options['-t'].set_value(self.args.threads) + astral_extra = getattr(self.args, 'astral_args', None) + if astral_extra: + aster_wrapper.options.options['_extra'] = StringOption('', astral_extra, active=True) + tree = aster_wrapper() + logger.info('{}: Coalescent species tree written to {}'.format( + self._species_name, species_tree_file)) + return tree diff --git a/read2tree/main.py b/read2tree/main.py index 2afcdb52..ef1196a7 100644 --- a/read2tree/main.py +++ b/read2tree/main.py @@ -25,6 +25,7 @@ from read2tree.Aligner import Aligner # from read2tree.Progress import Progress from read2tree.TreeInference import TreeInference +from read2tree.CoalescentInference import CoalescentInference from read2tree.parser import OMAOutputParser import argparse import glob @@ -172,8 +173,59 @@ def parse_args(argv, exe_name, desc): help='[Default is false] Compute tree, otherwise just ' 'output concatenated alignment!') + arg_parser.add_argument('--min_samples', type=int, default=10, + help='[Default is 10] Minimum number of sequences per OG ' + 'after gap filtering. Used by step 4astral.') + + arg_parser.add_argument('--max_gap', type=float, default=0.80, + help='[Default is 0.80] Maximum allowed fraction of gaps ' + '(-, X, N) per sequence. Used by step 4astral.') + + arg_parser.add_argument('--trim', action='store_true', + help='[Default is false] Run ClipKIT trimming on filtered ' + 'alignments before per-gene IQ-TREE. Requires clipkit ' + 'in PATH. Used by step 4astral.') + + arg_parser.add_argument('--astral_binary', default=None, + help='[Default is auto-detect] Name or path of the ASTER ' + 'binary to use for coalescent species tree estimation ' + '(e.g. astral3, astral-pro3, astral-pro2). If not set, ' + 'the first available binary is used. Used by step 4astral.') + + arg_parser.add_argument('--iqtree_model', default=None, + help='[Default is LG+F+G] Substitution model passed to IQ-TREE ' + 'for per-gene tree inference in step 4astral. Overrides the ' + 'built-in default (e.g. --iqtree_model WAG+G, ' + '--iqtree_model LG+G, --iqtree_model TEST). ' + 'Used by step 4astral.') + + arg_parser.add_argument('--iqtree_args', default=None, + help='[Default is none] Extra IQ-TREE flags appended verbatim to ' + 'every per-gene tree invocation in step 4astral. Use quotes ' + 'for multiple flags (e.g. --iqtree_args "-bb 1000 -redo"). ' + 'Used by step 4astral.') + + arg_parser.add_argument('--astral_args', default=None, + help='[Default is none] Extra flags appended verbatim to the ASTER ' + 'command in step 4astral. Use quotes for multiple flags. ' + 'See ASTER documentation for available options. ' + 'Used by step 4astral.') + + arg_parser.add_argument('--no_fast', action='store_true', + help='[Default is false] Disable the -fast flag for per-gene ' + 'IQ-TREE runs in step 4astral, enabling a full ML tree ' + 'search. Required when using bootstrap via --iqtree_args ' + '(e.g. --iqtree_args "-B 1000"). Used by step 4astral.') + + arg_parser.add_argument('--dna', action='store_true', + help='[Default is false] Use DNA alignments from ' + '06_align_merge_dna instead of amino acid alignments ' + 'for step 4astral. Recommended for closely related ' + 'species. Default IQ-TREE model switches to GTR+G. ' + 'Used by step 4astral.') + arg_parser.add_argument('--step', default="all", - help='[Default is all 1marker 2map 3combine ') + help='[Default is all 1marker 2map 3combine 4astral') # arg_parser.add_argument('--merge_all_mappings', action='store_true', # help='[Default is off] In case multiple species were mapped to ' @@ -247,7 +299,7 @@ def parse_args(argv, exe_name, desc): if args.species_name: _species_name = args.species_name - if args.step == "3combine": # todo why is needed? + if args.step == "3combine" or args.step == "4astral": _species_name = 'merge' args.reads = _reads @@ -445,6 +497,22 @@ def main(argv, exe_name, desc=''): logger.info(' ------- Read2Tree finished -*- -------') + if args.step == "4astral": + _seq_type = 'dna' if args.dna else 'aa' + input_align_folder = os.path.join(args.output_path, '06_align_merge_' + _seq_type) + if not os.path.exists(input_align_folder): + logger.error( + 'Step 4astral requires completed step 3combine output. ' + 'Folder not found: {}'.format(input_align_folder)) + sys.exit() + logger.info('{}: ------- Read2Tree step 4astral (coalescent species tree) -------'.format( + args.species_name)) + coalescent = CoalescentInference(args) + if coalescent.tree: + logger.info(str(coalescent.tree)) + print("done- 4astral") + logger.info(' ------- Read2Tree step 4astral finished -*- -------') + print("done- main ") # TODO: Check whether all the necessary binaries are available diff --git a/read2tree/wrappers/aligners/__init__.py b/read2tree/wrappers/aligners/__init__.py index 363b16b8..ed9f0fe7 100644 --- a/read2tree/wrappers/aligners/__init__.py +++ b/read2tree/wrappers/aligners/__init__.py @@ -2,4 +2,5 @@ from .muscle import Muscle from .prographmsa import ProGraphMSA from .probcons import ProbCons +from .clipkit import Clipkit from .base_aligner import AlignmentInput, DataType, WrapperError \ No newline at end of file diff --git a/read2tree/wrappers/aligners/clipkit.py b/read2tree/wrappers/aligners/clipkit.py new file mode 100644 index 00000000..9bbec9e9 --- /dev/null +++ b/read2tree/wrappers/aligners/clipkit.py @@ -0,0 +1,71 @@ +import os +import time +import logging +from ..abstract_cli import AbstractCLI +from ..options import StringOption, FloatOption, OptionSet +from read2tree.wrappers import WrapperError + +logger = logging.getLogger(__name__) + + +class ClipkitCLI(AbstractCLI): + @property + def _default_exe(self): + return 'clipkit' + + +class Clipkit(object): + """ + Wrapper for ClipKIT alignment trimmer. + + Takes an input alignment file and writes a trimmed alignment to output_file. + Returns the output file path on success, None if the output is empty. + + :Example: + + :: + + clipkit_wrapper = Clipkit('alignment.fasta', 'alignment_trimmed.fasta') + result = clipkit_wrapper() + time_taken = clipkit_wrapper.elapsed_time + """ + + def __init__(self, input_file, output_file, binary=None): + self.input_file = input_file + self.output_file = output_file + self.options = get_default_options() + self.elapsed_time = None + self.stdout = None + self.stderr = None + self.result = None + try: + self.cli = ClipkitCLI(executable=binary) + except IOError as err: + raise WrapperError('Error searching for clipkit binary: {}'.format(err)) + + def __call__(self, *args, **kwargs): + start = time.time() + output, error = self._call(self.input_file, self.output_file) + self.stdout = output + self.stderr = error + self.result = self.output_file if os.path.exists(self.output_file) and os.path.getsize(self.output_file) > 0 else None + end = time.time() + self.elapsed_time = end - start + return self.result + + def _call(self, input_file, output_file): + self.cli('{} {} -o {}'.format(input_file, self.command(), output_file), wait=True) + return self.cli.get_stdout(), self.cli.get_stderr() + + def command(self): + return str(self.options) + + def _init_cli(self, binary): + return ClipkitCLI(executable=binary) + + +def get_default_options(): + return OptionSet([ + StringOption('-m', 'gappy', active=True), + FloatOption('-g', 0.8, active=True), + ]) diff --git a/read2tree/wrappers/treebuilders/__init__.py b/read2tree/wrappers/treebuilders/__init__.py index 1fa1f09b..4ff07347 100644 --- a/read2tree/wrappers/treebuilders/__init__.py +++ b/read2tree/wrappers/treebuilders/__init__.py @@ -2,3 +2,4 @@ from .raxml import Raxml from .iqtree import Iqtree from .fasttree import Fasttree +from .aster import Aster diff --git a/read2tree/wrappers/treebuilders/aster.py b/read2tree/wrappers/treebuilders/aster.py new file mode 100644 index 00000000..efd48d5b --- /dev/null +++ b/read2tree/wrappers/treebuilders/aster.py @@ -0,0 +1,79 @@ +import os +import time +import logging +from ..abstract_cli import AbstractCLI +from ..options import IntegerOption, OptionSet +from read2tree.wrappers import WrapperError + +logger = logging.getLogger(__name__) +logger.addHandler(logging.StreamHandler()) +logger.setLevel(logging.INFO) + + +class AsterCLI(AbstractCLI): + @property + def _default_exe(self): + return ['astral3', 'astral-pro3', 'astral-pro2', 'wastral', 'astral4'] + + +class Aster(object): + """ + Wrapper for the ASTER suite of coalescent species tree estimators. + + Takes a file of gene trees (one Newick tree per line) and writes a species + tree to output_file using the selected ASTER algorithm. Returns the + species tree in Newick format. + + :Example: + + :: + + aster_wrapper = Aster('gene_trees.nwk', 'species_tree.nwk') + aster_wrapper.options.options['-t'].set_value(8) + result = aster_wrapper() + time_taken = aster_wrapper.elapsed_time + """ + + def __init__(self, gene_tree_file, output_file, binary=None): + self.gene_tree_file = gene_tree_file + self.output_file = output_file + self.options = get_default_options() + self.elapsed_time = None + self.stdout = None + self.stderr = None + self.result = None + try: + self.cli = AsterCLI(executable=binary) + except IOError as err: + raise WrapperError('Error searching for ASTER binary: {}'.format(err)) + + def __call__(self, *args, **kwargs): + start = time.time() + output, error = self._call(self.gene_tree_file, self.output_file) + self.stdout = output + self.stderr = error + self.result = self._read_result(self.output_file) + end = time.time() + self.elapsed_time = end - start + return self.result + + def _call(self, gene_tree_file, output_file): + self.cli('{} -i {} -o {}'.format(self.command(), gene_tree_file, output_file), wait=True) + return self.cli.get_stdout(), self.cli.get_stderr() + + def command(self): + return str(self.options) + + def _read_result(self, output_file): + try: + with open(output_file, 'r') as fh: + return fh.read().strip() + except IOError: + logger.error('Error reading ASTER output: {}'.format(output_file)) + return None + + +def get_default_options(): + return OptionSet([ + IntegerOption('-t', 1, active=True), + ]) diff --git a/read2tree/wrappers/treebuilders/iqtree.py b/read2tree/wrappers/treebuilders/iqtree.py index 4875a9cc..3d824573 100644 --- a/read2tree/wrappers/treebuilders/iqtree.py +++ b/read2tree/wrappers/treebuilders/iqtree.py @@ -147,3 +147,40 @@ def get_default_options(): # Bootstrap + ML tree + consensus tree (>=100) IntegerOption('-b', 0, active=False) ]) + + +def get_gene_tree_options(): + """Options for per-gene protein tree inference in the coalescent pipeline (step 4astral). + + Uses -T (IQ-TREE 2/3 thread flag) rather than the legacy -nt flag. + --abayes computes aBayes posterior branch supports alongside SH-aLRT values. + Both are annotated as node labels on the output tree, which wASTRAL uses for + quartet weighting when --astral_binary wastral is selected. + """ + return OptionSet([ + IntegerOption('-T', 1, active=True), + StringOption('-m', 'LG+F+G', active=True), + StringOption('-st', 'AA', active=True), + StringOption('-mem', '4G', active=True), + IntegerOption('-alrt', 1000, active=True), + FlagOption('-fast', True, active=True), + FlagOption('--abayes', True, active=True), + ]) + + +def get_gene_tree_dna_options(): + """Options for per-gene DNA tree inference in the coalescent pipeline (step 4astral). + + Uses GTR+G as the default model for nucleotide alignments. Recommended for + closely related species where DNA-level variation is more informative than + amino acid sequences. + """ + return OptionSet([ + IntegerOption('-T', 1, active=True), + StringOption('-m', 'GTR+G', active=True), + StringOption('-st', 'DNA', active=True), + StringOption('-mem', '4G', active=True), + IntegerOption('-alrt', 1000, active=True), + FlagOption('-fast', True, active=True), + FlagOption('--abayes', True, active=True), + ])