Too many edges for number of nodes #327

@rbdavid

Convergence Ratio analysis runs are failing with the error below. After some digging, the root cause is that the BLAST parquet file still contains self-alignments by the time the convergence ratio is calculated, which produces convergence ratio values greater than 1 (which should not be possible).

This continues the work I started a couple of months ago in #258.

The flow needs to be:

  • Condense sequences. (ALL_BY_ALL workflow)
  • Run the all-by-all BLAST calculations. (ALL_BY_ALL workflow)
  • Run blastreduce to keep only the upper triangle of the all-by-all BLAST results. Keep self-alignments here. (ALL_BY_ALL workflow)
  • Restore from the condensed sequence set to the full sequence set. Remove self-alignments here. (ALL_BY_ALL workflow)
    • a DuckDB call happens that outputs condensed.out.
    • a restore_condensed_sequences.py call happens that outputs 1.out. This is likely where the self-alignments need to be removed.
    • a transcode_restored_blast.py call happens that outputs the 1.out.parquet file.
  • Calculate convergence ratio. (REPORTING workflow)
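
To illustrate why leftover self-alignments break the last step, here is a minimal Python sketch. It assumes the convergence ratio is the number of edges divided by the maximum number of edges between distinct nodes, n(n-1)/2; that definition, the `(query, subject, score)` row layout, and the function names `drop_self_alignments` and `convergence_ratio` are all hypothetical stand-ins, not the project's actual code or the contents of restore_condensed_sequences.py.

```python
from typing import Iterable, List, Tuple

# Hypothetical row layout for a restored BLAST result: (query_id, subject_id, score).
Edge = Tuple[str, str, float]


def drop_self_alignments(edges: Iterable[Edge]) -> List[Edge]:
    """Keep only alignments between two different sequences."""
    return [edge for edge in edges if edge[0] != edge[1]]


def convergence_ratio(num_edges: int, num_nodes: int) -> float:
    """Edges divided by the maximum possible edges among distinct nodes.

    Assumed definition: max edges = n * (n - 1) / 2 (upper triangle, no diagonal).
    """
    max_edges = num_nodes * (num_nodes - 1) // 2
    if max_edges == 0:
        return 0.0
    if num_edges > max_edges:
        # A ratio above 1 means the input still holds rows that are not
        # valid edges, such as self-alignments.
        raise ValueError(
            f"Too many edges ({num_edges}) for number of nodes ({num_nodes})"
        )
    return num_edges / max_edges


# Upper triangle of a 3-sequence all-by-all result, diagonal included.
edges = [
    ("A", "A", 200.0), ("A", "B", 150.0), ("A", "C", 90.0),
    ("B", "B", 210.0), ("B", "C", 120.0),
    ("C", "C", 190.0),
]
nodes = {seq_id for query, subject, _ in edges for seq_id in (query, subject)}

filtered = drop_self_alignments(edges)
ratio = convergence_ratio(len(filtered), len(nodes))
```

With the diagonal left in, the six rows against three nodes exceed the three possible edges and the check raises; after filtering, the three remaining edges give a ratio of exactly 1.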
