Difference in edge number between web-based and code-based EFI-EST: is it due to the blast-num-matches parameter? #194

ycchenchn · 2025-06-11T12:46:50Z


Hi,

I am using both the web-based EFI-EST and the code-based (GitHub) version for SSN analysis in family mode. I noticed that, with the same input family and similar parameters (e.g., family ID, database, e-value, domain, taxonomy filter, etc.), the number of edges in the SSN generated by the code-based version (with the default blast-num-matches=250) is significantly lower than that from the web-based version. When I increase the blast-num-matches parameter in the code-based version, the number of edges increases and gets closer to the web-based result.

Does the web-based EFI-EST use a different (larger or dynamic) value for the maximum number of BLAST matches per sequence?
What is the default value for this parameter in the web-based version? Is it fixed or determined by the family size?
If it is determined by the family size, how should I set the blast-num-matches parameter in the code-based version to match the web-based results?
Is there any documentation or recommendation for this parameter?

Thank you very much for your help!

rbdavid · 2025-06-12T15:20:39Z

Maintainer

Yes, the EST v2.0 tool uses a default blast-num-matches value of 250. As far as I know, the default value for the equivalent parameter in the v1.0 EST tool (currently hosted on EFI-Web) is 1,000,000. Due to this change in default values, you will see drastically different results between the EFI-Web results and the v2.0 tools being run on the command line for sequence sets that are larger than 250 sequences. Setting the command line tool's value to 1 million should recapture the EFI-Web results. But there are other reasons why your local and EFI-Web results may be different. If results still differ, let us know which version of the EFI database you are using for your local runs. The public EFI-Web tools are currently using version 105.

Unfortunately, the change in default values is not well documented. The reasoning behind the change is:

Recording 1 million alignment scores per query sequence can result in very large data sizes, especially as the number of query sequences grows.
If the query sequences have high sequence similarity (i.e. they are densely packed in the 2D projected space visualized by the SSN) or a sufficiently low alignment score threshold is applied after the all-to-all BLAST, the resulting number of edges between sequence nodes becomes intractable to visualize in Cytoscape. Not only do the files become too large for even a high-end desktop computer to load, but the information content of the SSN also begins to be drowned out.
Decreasing the blast-num-matches parameter to 250 results in a minor loss of "local" structure (sequences that are similar but not in the top 250 hits) with, ideally, negligible effects on the global structure in the visualized SSN. That is, the local shape and connectivity within a cluster will be different, but the separation of clusters into hypothesized iso-functional groups should be maintained.
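
To make the trade-off above concrete, here is a minimal toy sketch of what a per-query hit cap does to an edge list. It is not the EFI-EST implementation: the function names (`toy_scores`, `build_edges`, `count_clusters`), the made-up scores, and the cap of 2 are all assumptions chosen for illustration. Each query keeps only its best-scoring hits, and an edge survives if either endpoint kept it.

```python
def toy_scores(cluster_names=("A", "B"), size=6):
    """Hypothetical all-vs-all hits: every pair inside a cluster scores,
    and there are no hits between clusters."""
    scores = {}
    for name in cluster_names:
        for i in range(size):
            scores[f"{name}{i}"] = [
                (f"{name}{j}", 100 - 10 * abs(i - j))
                for j in range(size) if j != i
            ]
    return scores


def build_edges(scores, max_hits):
    """Keep at most max_hits best hits per query; return undirected edges."""
    edges = set()
    for query, hits in scores.items():
        best = sorted(hits, key=lambda hit: hit[1], reverse=True)[:max_hits]
        edges.update(frozenset((query, subject)) for subject, _ in best)
    return edges


def count_clusters(nodes, edges):
    """Count connected components with a small union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for edge in edges:
        a, b = tuple(edge)
        parent[find(a)] = find(b)
    return len({find(node) for node in nodes})


scores = toy_scores()
full = build_edges(scores, max_hits=1_000_000)  # effectively uncapped
capped = build_edges(scores, max_hits=2)        # stand-in for a small cap

print("uncapped:", len(full), "edges,", count_clusters(scores, full), "clusters")
print("capped:  ", len(capped), "edges,", count_clusters(scores, capped), "clusters")
```

In this toy case the cap removes roughly half of the edges while both clusters stay intact, which is the behavior described above. Real families may behave differently, particularly when clusters are much larger than the cap.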

I hope this helps. Please let us know if you still cannot reproduce the EFI-Web results. Thanks for testing the code!

nilsoberg · 2025-06-13T05:31:51Z

Maintainer

@ycchenchn Could I ask you to share how you are running this on the command line? The exact command and parameters would be useful.

ycchenchn · 2025-06-17T10:09:29Z

Author

@rbdavid Thank you very much for your detailed explanation regarding the blast-num-matches parameter and the differences between the web-based and code-based versions of EST. Your insights were very helpful in understanding the underlying reasons.

Following your suggestion, I adjusted the blast-num-matches parameter from the default 250 to 1,000,000 in my local code-based run. I also increased the allocated computational resources. This change allowed me to largely replicate the "total number of edges" observed in the web-based results. There is a slight numerical difference (web-based: 43.73 million; code-based: 43.89 million), which I attribute to the different InterPro database versions used (web-based v105 vs. my local v104).

Regarding the impact of blast-num-matches=250 versus 1,000,000, my findings support your explanation. While the total number of edges in the unfiltered SSN decreased by an order of magnitude (from approximately 43.89 million with 1,000,000 matches to 4.56 million with 250 matches), the effect on the filtered SSN was less dramatic. After applying an alignment_score=30 filter, the number of SSN edges dropped only from 2.6 million (for 1,000,000 blast matches) to 2.07 million (for 250 blast matches). This suggests that decreasing blast-num-matches to 250 has a relatively minor impact on the overall "global structure" of the filtered SSN, as you hypothesized.
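
For readers who want the ratios behind these figures, the short sketch below recomputes them from the edge counts quoted in this comment. The variable and function names are illustrative only, and the counts are the rounded values reported here, so the results are approximate.

```python
# Edge counts (in edges) as reported in this comment, keyed by blast-num-matches.
edges_unfiltered = {1_000_000: 43.89e6, 250: 4.56e6}
edges_filtered = {1_000_000: 2.6e6, 250: 2.07e6}  # after alignment_score=30


def reduction(counts):
    """Return (fold change, fraction of edges removed) when capping at 250."""
    full, capped = counts[1_000_000], counts[250]
    return full / capped, 1 - capped / full


fold_unfiltered, removed_unfiltered = reduction(edges_unfiltered)
fold_filtered, removed_filtered = reduction(edges_filtered)

# Relative gap between the web-based (43.73M) and local (43.89M) edge totals.
web_vs_local = (43.89e6 - 43.73e6) / 43.73e6

print(f"unfiltered: {fold_unfiltered:.1f}x fewer edges ({removed_unfiltered:.0%} removed)")
print(f"filtered:   {fold_filtered:.2f}x fewer edges ({removed_filtered:.0%} removed)")
print(f"web vs local difference: {web_vs_local:.2%}")
```

With these inputs the unfiltered SSN loses about 90% of its edges under the 250 cap, the filtered SSN about 20%, and the web-based and local totals differ by well under 1%.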

Thank you again for your assistance and for developing these valuable tools.

ycchenchn · 2025-06-17T10:13:41Z

Author

@nilsoberg Sure, here are the exact command-line commands I used for my runs:

To generate parameters for EST family mode:
```
python bin/create_est_nextflow_params.py family --families <family ID> --fasta-db data/efi/blastdb/uniref50.fasta --sequence-version uniref50 --output-dir <output dir> --efi-config efi.config --efi-db data/efi/efi_db.sqlite --nextflow-config conf/est/docker.config
```

To run the EST pipeline:
```
bash <output dir>/run_nextflow.sh
```

To generate parameters for SSN generation (after EST output):
```
python bin/create_generatessn_nextflow_params.py auto --est-output-dir <output dir> --filter-parameter alignment_score --filter-min-val 30 --min-length 0 --max-length 50000 --ssn-name <ssn name> --ssn-title <ssn title> --efi-config efi.config --efi-db data/efi/efi_db.sqlite --nextflow-config conf/generatessn/docker.config
```

To run the SSN generation pipeline:
```
bash <output dir>/ssn/run_nextflow.sh
```

nilsoberg · 2025-06-18T22:44:08Z

Maintainer

@ycchenchn Thanks for the information. We are interested in real-world benchmarks. Could I ask if you are running the software on a PC or on a cluster? Do you know the approximate number of sequences you are using in the computations?

ycchenchn · 2025-06-19T13:14:58Z

Author

@nilsoberg Thank you for your follow-up. Sorry for not specifying the family ID earlier. The family I am working with is IPR002123. The full size is 219,292 sequences, the UniRef90 size is 115,176, and the UniRef50 size is 26,830.
I am running the software on a cluster, not a PC. The basic configuration of the cluster I am using includes 56 CPUs and 251GB of RAM.
When I set blast-num-matches to 1,000,000, I encountered an error during the blastreduce step. To address this, I modified the following parameters:

In params.yml:

duckdb_memory_limit from 8GB to 128GB
duckdb_threads from 1 to 16

In conf/est/docker.config (for the blastreduce process):

cpus from 2 to 16
memory from 16GB to 64GB

However, I am not sure how much this configuration exceeds the actual requirements. According to the Nextflow report, the blastreduce step used 14.762 GB of virtual memory (vmem), 11.842 GB of resident memory (rss), with peak values of 15.634 GB (peak_vmem) and 12.582 GB (peak_rss). The total run time is 1h2m35s. I can provide the full Nextflow report if needed.
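
As a rough way to put these numbers in context, the sketch below estimates an upper bound on the number of BLAST hit rows for each cap and compares the reported peak resident memory against the limits mentioned above. It assumes the run used the UniRef50 set of 26,830 sequences and ignores self-hits; the function names are hypothetical, and comparing process-level RSS with `duckdb_memory_limit` is only a loose indication, not a statement about how DuckDB accounts for memory.

```python
def max_hit_rows(num_sequences, blast_num_matches):
    """Upper bound on recorded hits: each query keeps at most
    blast_num_matches hits and cannot hit more than the other sequences.
    Simplification: self-hits are ignored."""
    per_query = min(blast_num_matches, num_sequences - 1)
    return num_sequences * per_query


def usage_fraction(peak_gb, limit_gb):
    """Fraction of a memory limit consumed at peak."""
    return peak_gb / limit_gb


num_sequences = 26_830  # UniRef50 size of IPR002123 (assumed to be the input)
rows_250 = max_hit_rows(num_sequences, 250)
rows_1m = max_hit_rows(num_sequences, 1_000_000)

peak_rss_gb = 12.582  # from the Nextflow report for blastreduce
limits_gb = {
    "process memory (new)": 64,
    "process memory (original)": 16,
    "duckdb_memory_limit (original)": 8,
}

print(f"max hit rows at cap 250:       {rows_250:,}")
print(f"max hit rows at cap 1,000,000: {rows_1m:,}")
for label, limit in limits_gb.items():
    print(f"{label}: peak RSS is {usage_fraction(peak_rss_gb, limit):.0%} of {limit} GB")
```

Under these assumptions the peak resident memory is about a fifth of the new 64 GB allocation, below the original 16 GB process allocation, and above the original 8 GB `duckdb_memory_limit`. Whether that last limit caused the blastreduce error is not established in this thread.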

ycchenchn · 2025-07-07T10:59:28Z

Author

@nilsoberg @rbdavid Hi, I have two follow-up questions regarding the differences between the web-based and code-based SSN pipelines:

According to the official documentation (pipelines/est/index.rst), the --exclude-fragments parameter should be supported in the EST step to filter out UniProt-defined fragment sequences. However, after checking the code (specifically create_est_nextflow_params.py and shared_args.py), I couldn't find any implementation or handling of the exclude-fragments parameter. Is there a recommended way to enable this feature, or is there an updated version of the code that supports it? If not, are there any suggested workarounds for fragment filtering in the current pipeline?
I noticed that the web-based SSN tool is already using UniProt: 2025-02 and InterPro: 105, while the GitHub repository still provides UniProt: 2025-01 and InterPro: 104. We would like to further confirm whether some differences in results are caused by differences in database versions. Is it possible to update the GitHub version to match the latest database releases? If there is a regular update schedule, could you please share it?

Thank you very much for your help!

nilsoberg · 2025-07-10T16:36:11Z

Maintainer

Could you email me at noberg@illinois.edu? I would like to help you out and it might be best to do that over email.

ycchenchn · 2025-07-17T06:01:52Z

Author

> Could you email me at noberg@illinois.edu? I would like to help you out and it might be best to do that over email.

Thank you for your reply! I have sent you an email as requested. Looking forward to your help.
