182 - Sort fasta before axa step #339
nilsoberg
left a comment
It looks good. The only thing I would check -- assuming you haven't yet -- is that the seqkit call doesn't run out of memory with large sequence sets.
…of processors to be used
That kinda bleeds into what I was envisioning as my next major task, which is to revamp the nextflow configuration files to better control how memory- and compute-hungry process blocks are handled and how resources are allocated. As seqkit notes in their documentation,
Past work on duckdb analyses of blast output (w/in all_by_all_blast(), blastreduce(), and restore_condensed() process blocks) should standardize the lexicographical sorting, which should solve the concerns of the original post of #182. But there's additional sorting we can do to ensure a standardized creation of the input fasta shards to the all_by_all_blast() process. This change standardizes the organization of those fasta files and load-balances the sequences across those shards. This load balancing should result in a more even distribution of computationally expensive (long) sequences and so avoid instances where a random set of all_by_all_blast() instances take much longer than the rest.

Closes #182
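
For readers who want a concrete picture of the load-balancing idea, here is a minimal Python sketch. It is not the code in this PR, which relies on seqkit. It is a generic greedy longest-first assignment, and the function name `balance_shards`, the sequence IDs, and the lengths are all hypothetical.

```python
import heapq


def balance_shards(lengths, n_shards):
    """Assign sequence IDs to shards, longest first, always filling the lightest shard.

    `lengths` maps a sequence ID to its length (hypothetical input format).
    """
    # Min-heap of (total length assigned so far, shard index).
    heap = [(0, i) for i in range(n_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(n_shards)]

    # Sort by length descending, then by ID, so the result is deterministic.
    ordered = sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq_id, length in ordered:
        total, idx = heapq.heappop(heap)  # lightest shard so far
        shards[idx].append(seq_id)
        heapq.heappush(heap, (total + length, idx))
    return shards


# Made-up example data: two long sequences and three short ones.
lengths = {"seqA": 900, "seqB": 850, "seqC": 300, "seqD": 250, "seqE": 100}
shards = balance_shards(lengths, 2)
print(shards)
```

With this toy input the two long sequences land in different shards, so neither shard carries both of the expensive ones. Sorting ties by ID keeps the shard contents reproducible from run to run, which is the same motivation as the standardized sorting described above.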