Question: Using custom sequence databases (FASTA mode) with code-based SSN analysis #196

ycchenchn · 2025-06-17T11:13:59Z

Hi,

We are currently utilizing the code-based EFI-EST tool for Sequence Similarity Network (SSN) analysis in FASTA mode. We appreciate its capabilities and flexibility.

We understand that the default configuration of the code-based tool uses UniProt Version 2025_02 and InterPro Version 104 as its primary databases for sequence comparisons and annotation retrieval.

Our research requires us to analyze sequences that are from a homemade database or publicly available from NCBI (e.g., GenBank/RefSeq), and these sequences are not necessarily curated within UniProt. Our goal is to perform SSN analysis on these specific sets of sequences.

We've noted that while the online EFI-EST tool's FASTA mode does accept custom sequence input, the resulting SSN nodes for these sequences often lack detailed Taxonomy information. This appears to be because the tool primarily retrieves such comprehensive metadata (including taxonomy) by cross-referencing against its internal databases (derived from UniProt/InterPro) using identified UniProt/UniRef IDs. If a user-provided sequence does not correspond to an ID for which comprehensive metadata is available in these pre-built databases, the detailed taxonomy may be missing.

Therefore, our primary questions regarding the code-based version are:

1. Can we substitute the default FASTA sequence database (specified by --fasta-db, e.g., data/efi/blastdb/uniref50.fasta) with our own custom FASTA file for BLAST comparisons when running the SSN analysis? If so, are there any specific formatting requirements for our custom FASTA headers, or recommended steps to configure this?
2. If we use a custom FASTA database (containing sequences not in UniProt), how can we effectively incorporate associated metadata, such as Taxonomy information, for these custom sequences into the generated SSN nodes? Is there a mechanism or a specific file format (e.g., a tab-separated file) that the code-based tool can accept to parse and integrate these custom annotations (like Organism, Taxonomy ID, etc.) into the XGMML output, similar to how the efi_db.sqlite provides annotations for the default database?

Any guidance, examples, or pointers to relevant documentation for using custom sequence databases and integrating external metadata would be immensely helpful.

Thank you very much for your time and support!

nilsoberg (Maintainer) · 2025-06-18T23:15:56Z

A pipeline for developing custom databases is on our roadmap. Our team is currently focused on delivering another project, so we can't provide in-depth guidance. However, I can give you a general idea of what would be necessary. Doing this yourself requires some knowledge of sqlite3 and the ability to read Perl code.

Short answers to your questions:

1. Yes, but it won't work unless the IDs in the FASTA file match the IDs in the annotations table.
2. There is no automatic process or current support for custom metadata.

A bit longer explanation:

The tools assume that IDs are UniProt IDs, while also supporting alternative input formats (see lib/EFI/IdMapping/Util.pm). There is nothing special about UniProt; it is simply the convention that we use. You could develop a custom naming system that assigns a number to every sequence in your custom FASTA dataset (internally the tools support GI IDs, which are numeric), then create metadata in a separate file that can be loaded into the efi_db.sqlite file.
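The renaming idea could be sketched roughly as follows. This is a minimal illustration in Python and not part of the EFI tools: the function name `renumber_fasta`, the starting number, and the made-up example headers are all assumptions for the example. Whether purely numeric IDs are accepted at every stage of the pipeline should be checked against lib/EFI/IdMapping/Util.pm before relying on this.

```python
import io


def renumber_fasta(lines, start=1):
    """Replace each FASTA header with a sequential numeric ID.

    Returns (new_lines, mapping), where mapping goes from the numeric ID
    (as a string) to the original header text, so that metadata can be
    keyed by the new ID later on.
    """
    new_lines = []
    mapping = {}
    next_id = start
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id = str(next_id)
            mapping[seq_id] = line[1:].strip()  # keep the original header
            new_lines.append(">" + seq_id)
            next_id += 1
        elif line:
            new_lines.append(line)  # sequence lines are passed through
    return new_lines, mapping


# Made-up example input; the headers and sequences are placeholders.
example = """>custom_seq_A putative hydrolase
MKTAYIAKQR
QISFVKSHFS
>custom_seq_B hypothetical protein
MSDNELLKAV
"""

new_lines, mapping = renumber_fasta(io.StringIO(example))
print("\n".join(new_lines))
for seq_id, header in mapping.items():
    print(seq_id, header, sep="\t")
```

Keeping the ID-to-header mapping in a tab-separated file makes it possible to trace SSN nodes back to the original sequence names.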

If you open the file in sqlite3, you will see that the structure isn't very complicated, and it sounds like you are mostly interested in the annotations and taxonomy tables. In theory you could insert the appropriate metadata into the annotations table with IDs that match the IDs in your FASTA dataset. sqlite3 can import data from a tab-separated or CSV file, but you could also insert rows using a script in Python or another language. The annotations are stored in a column as a JSON string, which allows a variable amount of metadata to be included (i.e. not all sequences need the same information). The available metadata values can be seen in lib/EFI/Annotations.pm; the ones supported by the SSN are those with the json_name attribute.
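As a rough sketch of the scripted route, the Python below inserts JSON metadata keyed by custom IDs and then checks which FASTA IDs still lack an annotation row. Several things here are assumptions rather than the real layout: the two-column table definition, the column name `metadata`, and the JSON keys `organism` and `taxonomy_id` are placeholders. The actual schema should be inspected first (for example with `.schema annotations` in sqlite3), and the real key names are the json_name values in lib/EFI/Annotations.pm. An in-memory database stands in for efi_db.sqlite so the example runs on its own.

```python
import json
import sqlite3

# Stand-in for efi_db.sqlite; work on a backup copy of the real file.
conn = sqlite3.connect(":memory:")

# ASSUMED schema for illustration only. Only the "accession" column name is
# taken from lib/EFI/Annotations.pm; "metadata" is a hypothetical name.
conn.execute(
    "CREATE TABLE annotations (accession TEXT PRIMARY KEY, metadata TEXT)"
)

# Placeholder metadata keyed by the custom numeric IDs used in the FASTA file.
# Entries may carry different fields because each one is stored as JSON.
custom_metadata = {
    "1": {"organism": "Example organism A", "taxonomy_id": 111111},
    "2": {"organism": "Example organism B"},
}

rows = [(seq_id, json.dumps(fields)) for seq_id, fields in custom_metadata.items()]
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO annotations (accession, metadata) VALUES (?, ?)", rows
    )

# The FASTA IDs must match the annotation IDs, so report any that do not.
fasta_ids = {"1", "2", "3"}
known = {row[0] for row in conn.execute("SELECT accession FROM annotations")}
missing = sorted(fasta_ids - known)
print("IDs without annotations:", missing)
```

Running a check like the last step before building the SSN helps catch mismatches between the FASTA headers and the annotations table early.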

EST/lib/EFI/Annotations.pm, line 354 in 369d7ad:

    push @fields, {name => "accession", field_type => "db", type_spec => "VARCHAR(10)", display => "", db_primary_col => 1, index_name => "uniprot_accession_idx", primary_key => 1};

(You could in theory add additional fields to this file to incorporate them into the SSN.)

The script we use to build databases (which will eventually be replaced with a robust build system) is located in another, older repository that is not as well developed as the current repository:

https://github.com/EnzymeFunctionInitiative/EFITools/blob/manual_merge/scripts/db_tools/builddb.pl

It calls a number of scripts in https://github.com/EnzymeFunctionInitiative/EFITools/tree/manual_merge/scripts/db_tools

If you start the process and then run into issues, feel free to reach out and we can try to help.
