Design Discussion - How FT CDS blocks are handled. #7

rbdavid · 2025-03-20T19:54:41Z

rbdavid
Mar 20, 2025
Maintainer

This discussion thread is being opened to document our current rationale for how each FT CDS block within the EMBL flat files is handled. Feature Table "blocks" of text have a defined format, described here. For the purpose of EFI, we are only interested in coding sequence (CDS) blocks in the feature table. There are two major sections of an FT CDS block that we are interested in: (a) the location and (b) the feature qualifiers.

The location information for the protein starts on the same line as the FT CDS delimiter and may continue onto subsequent lines if the nucleobase range(s) encoding the protein are sufficiently long. Work is ongoing to parse the location information so that it yields accurate start and stop indices for a given protein (see #5 for further discussion of the parsing challenges). The start and stop indices are used to determine the gene length, which in turn scales the arrow length in the GND visualization. Conversations within the EFI group suggest that exact scaling of genes' arrows and intron spaces is not essential for the intended usage of the GND visualizations; the goal is to get spacing and scaling approximately right without missing any important CDS blocks.
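As a rough illustration of the "approximately right" goal, here is a minimal sketch that reduces a location string to a single (start, stop) span by taking the smallest and largest local coordinates. The function name `approximate_span` is hypothetical and this is not the parser discussed in #5; it assumes the location has already been joined into one string, and it ignores strand, partial-end markers, and intron boundaries.

```python
import re

# Coordinates that point into another entry, e.g. "J00194.1:100..202",
# live in a different sequence and are dropped before taking min/max.
REMOTE_REF = re.compile(r"[A-Za-z_][\w.]*:[<>]?\d+(?:\.\.[<>]?\d+)?")


def approximate_span(location: str) -> tuple[int, int]:
    """Return (start, stop) as the smallest and largest local coordinates.

    Operators such as join() and complement() and the partial markers
    '<' and '>' are ignored, so the result is only an outer bound.
    """
    local = REMOTE_REF.sub("", location)
    coords = [int(n) for n in re.findall(r"\d+", local)]
    if not coords:
        raise ValueError(f"no local coordinates found in {location!r}")
    return min(coords), max(coords)


def approximate_length(location: str) -> int:
    """Length of the outer span in nucleotides, introns included."""
    start, stop = approximate_span(location)
    return stop - start + 1
```

For example, `approximate_span("complement(join(2691..4571,4918..5163))")` returns `(2691, 5163)`. Because the span includes any introns, the arrow would be drawn over the whole region the CDS occupies, which matches the stated tolerance for approximate scaling.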

The feature qualifiers section consists of all lines following the location string. Each individual qualifier is denoted by a / character followed by the qualifier key. The most important keys are /db_xref and /protein_id. Specifically, the protein_id is used in a "foreignId" search query in the EFI DB to grab any associated "uniprotId" strings. Similarly, the db_xref lines are parsed to get any "UniProtKB" accession IDs that may be associated with the CDS entry. Other qualifiers that may or should be of interest are:

/pseudo and /pseudogene, which indicate that the CDS feature does not get translated into a functional protein. These entries are currently being parsed but never written to the final tab file (and so never incorporated into the ENA table in EFI DB). They are hidden neighbors that won't ever be visualized in the GND.
/translation, which provides the amino acid sequence encoded by the nucleobase sequence range provided in the location string. This sequence string could be used to get a translated length of the gene from which a length in DNA sequence space could be approximated.
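The qualifier handling described above could be sketched as follows. All function names here are hypothetical, and the sketch assumes the leading "FT" prefix and indentation have already been stripped from each line. It also assumes that a line starting with / always opens a new qualifier, which can misfire on free-text values that wrap onto a line beginning with a slash. The length estimate assumes three nucleotides per residue plus one stop codon, so it ignores introns and partial CDS entries.

```python
import re

QUALIFIER = re.compile(r"^/(\w+)(?:=(.*))?$")


def parse_qualifiers(lines):
    """Map each qualifier key to the list of its values, in file order."""
    qualifiers = {}
    key = None
    for line in lines:
        text = line.strip()
        match = QUALIFIER.match(text)
        if match:
            key, value = match.group(1), match.group(2)
            # Value-less flags such as /pseudo are stored as empty strings.
            qualifiers.setdefault(key, []).append(value or "")
        elif key is not None:
            # Continuation line: sequences are joined without spaces.
            joiner = "" if key == "translation" else " "
            qualifiers[key][-1] += joiner + text
    return {k: [v.strip('"') for v in vals] for k, vals in qualifiers.items()}


def uniprot_accessions(qualifiers):
    """Accessions from db_xref values such as 'UniProtKB/TrEMBL:Q9XYZ1'."""
    return [
        xref.split(":", 1)[1]
        for xref in qualifiers.get("db_xref", [])
        if xref.startswith("UniProtKB") and ":" in xref
    ]


def is_pseudo(qualifiers):
    """True when the block carries /pseudo or /pseudogene."""
    return "pseudo" in qualifiers or "pseudogene" in qualifiers


def approx_nt_length(qualifiers):
    """Estimate DNA length from /translation, or None if it is absent."""
    translation = qualifiers.get("translation")
    return 3 * len(translation[0]) + 3 if translation else None
```

The protein_id value returned by `parse_qualifiers` would then feed the "foreignId" lookup, while `uniprot_accessions` covers the db_xref route.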

Big picture stuff:
The current workflow gathers all FT CDS blocks, parsing lines for important information that is saved in Locus objects' instance attributes, and stashes each Locus in a dict where the associated key is the count value. Once the full set of CDS blocks is gathered for a chromosome Record object, this dict of Locus objects is looped over and processed. It is only at this stage that a locus's information is written to file or skipped, depending on whether a UniProt accession ID is associated with the Locus object. The count value is of utmost importance because it is used to determine the ordering of genes on the chromosome, which is then used to determine the local neighborhood for the GNN analysis and GND visualization. For instances where a CDS block does not map to a UniProt ID, that gene count/position will be absent from the GNN and GND analyses.
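A minimal sketch of this gather-then-filter flow is shown below. The attribute names on `Locus`, the `writable_rows` helper, and the tab-separated column layout are all assumptions made for illustration; they are not taken from the actual code base.

```python
from dataclasses import dataclass, field


@dataclass
class Locus:
    """Hypothetical stand-in for the Locus object described above."""
    count: int                      # position of the CDS block in the record
    start: int
    stop: int
    protein_id: str = ""
    uniprot_ids: list[str] = field(default_factory=list)
    pseudo: bool = False


def writable_rows(loci: dict[int, Locus]) -> list[str]:
    """Return tab-separated rows for loci that map to a UniProt ID.

    Loci are visited in count order. Pseudogenes and loci without a
    UniProt accession are skipped, so their counts never reach the file.
    """
    rows = []
    for count in sorted(loci):
        locus = loci[count]
        if locus.pseudo or not locus.uniprot_ids:
            continue
        for accession in locus.uniprot_ids:
            rows.append(
                "\t".join([accession, str(count), str(locus.start), str(locus.stop)])
            )
    return rows
```

With three loci counted 1, 2, and 3, where only the second lacks a UniProt ID, the output rows carry counts 1 and 3. The missing count 2 is the gap that downstream neighborhood logic for the GNN and GND has to tolerate.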
