Replies: 1 comment
A pipeline for developing custom databases is on our roadmap. Our team is currently focused on delivering another project, so we can't provide in-depth guidance. However, I can give you a general idea of what would be necessary. Doing this yourself requires some knowledge of sqlite3 and the ability to read Perl code.

Short answers to your questions: […]

A bit longer explanation: The tools assume that IDs are UniProt while providing support for alternative formats (see […]). If you open the file in sqlite3, you will see that the structure isn't very complicated, and it sounds like you are mostly interested in the […] Line 354 in 369d7ad. (You could in theory add additional fields to this file to incorporate them into the SSN.)

The script we use to build databases (which will eventually be replaced with a robust build system) is located in another, older repository that is not as well developed as the current one: https://github.com/EnzymeFunctionInitiative/EFITools/blob/manual_merge/scripts/db_tools/builddb.pl

It calls a number of scripts in https://github.com/EnzymeFunctionInitiative/EFITools/tree/manual_merge/scripts/db_tools

If you start the process and then run into issues, feel free to reach out and we can try to help.
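As a starting point for opening the file in sqlite3, here is a minimal Python sketch that lists every table and its columns in a SQLite file, so you can see which fields the annotation database actually contains. It uses only the standard-library `sqlite3` module. The function name `describe_sqlite` is made up for illustration, and nothing here assumes particular table or column names in the EFI database.

```python
import sqlite3


def describe_sqlite(path):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite file.

    Note: sqlite3.connect() creates an empty file if `path` does not exist,
    so double-check the path before running this.
    """
    conn = sqlite3.connect(path)
    try:
        # sqlite_master holds one row per schema object (tables, indexes, ...).
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # Quote the table name so unusual names do not break the PRAGMA.
            quoted = '"' + table.replace('"', '""') + '"'
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
            schema[table] = [(col[1], col[2]) for col in columns]
        return schema
    finally:
        conn.close()


def print_schema(schema):
    """Print the mapping returned by describe_sqlite in a readable form."""
    for table, columns in schema.items():
        print(table)
        for name, declared_type in columns:
            print(f"  {name} {declared_type}")
```

For example, `print_schema(describe_sqlite("efi_db.sqlite"))` would print the layout of the database, assuming that is where your copy of the file lives. Comparing that output with what `builddb.pl` writes should show which tables a custom build would need to populate.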
Hi,
We are currently utilizing the code-based EFI-EST tool for Sequence Similarity Network (SSN) analysis in FASTA mode. We appreciate its capabilities and flexibility.
We understand that the default configuration of the code-based tool uses UniProt Version 2025_02 and InterPro Version 104 as its primary databases for sequence comparisons and annotation retrieval.
Our research requires us to analyze sequences that are from a homemade database or publicly available from NCBI (e.g., GenBank/RefSeq), and these sequences are not necessarily curated within UniProt. Our goal is to perform SSN analysis on these specific sets of sequences.
We've noted that while the online EFI-EST tool's FASTA mode does accept custom sequence input, the resulting SSN nodes for these sequences often lack detailed Taxonomy information. This appears to be because the tool primarily retrieves such comprehensive metadata (including taxonomy) by cross-referencing against its internal databases (derived from UniProt/InterPro) using identified UniProt/UniRef IDs. If a user-provided sequence does not correspond to an ID for which comprehensive metadata is available in these pre-built databases, the detailed taxonomy may be missing.
Therefore, our primary questions regarding the code-based version are:
1. […] `--fasta-db`, e.g., `data/efi/blastdb/uniref50.fasta`) with our own custom FASTA file for BLAST comparisons when running the SSN analysis? If so, are there any specific formatting requirements for our custom FASTA headers, or recommended steps to configure this?
2. […] `efi_db.sqlite` provides annotations for the default database?

Any guidance, examples, or pointers to relevant documentation for using custom sequence databases and integrating external metadata would be immensely helpful.
Thank you very much for your time and support!