This discussion thread documents our current rationale for how each FT CDS block within the EMBL flat files is handled. Feature Table "blocks" of text have a defined format, described here. For the purposes of EFI, we are only interested in the coding sequence (CDS) blocks in the feature table. An FT CDS block has two major sections of interest: (a) the location and (b) the feature qualifiers.
The location information for the protein starts on the same line as the FT CDS delimiter and may continue onto subsequent lines if the string describing the nucleobase range(s) encoding the protein is long enough. Work is ongoing to parse the location information so that accurate start and stop indices are obtained for a given protein (see #5 for further discussion of the parsing challenges). The start and stop indices are used to determine the gene length, which in turn scales the arrow length in the GND visualization. Conversations within the EFI group suggest that exact scaling of genes' arrows and intron spaces is not essential for the intended usage of the GND visualizations; the priority is to get spacing and scaling approximately right without missing any important CDS blocks.
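As a rough illustration of the "approximately right" approach described above, the sketch below takes the smallest and largest positions found in a location string as the start and stop indices. The function name `location_bounds` is hypothetical and this is not the project's actual parser; it assumes the location lines have already been concatenated into one string, and it ignores cases such as features that span the origin of a circular chromosome or joins that mix local and remote segments.

```python
import re
from typing import Tuple


def location_bounds(location: str) -> Tuple[int, int, bool]:
    """Return (start, stop, is_complement) for an EMBL location string.

    Hypothetical helper: takes the outermost positions and ignores
    intron structure, partial-end markers (< and >), and origin-spanning joins.
    """
    # Remove whitespace left over from joining continuation lines.
    loc = "".join(location.split())
    # Drop remote-entry prefixes such as "J00194.1:" so that digits in the
    # accession are not mistaken for sequence positions.
    loc = re.sub(r"[A-Za-z][A-Za-z0-9_]*(?:\.\d+)?:", "", loc)
    positions = [int(p) for p in re.findall(r"\d+", loc)]
    if not positions:
        raise ValueError(f"no positions found in location: {location!r}")
    return min(positions), max(positions), loc.startswith("complement(")


start, stop, is_complement = location_bounds(
    "complement(join(2691..4571,\n 4918..5163))"
)
gene_length = stop - start + 1  # length used to scale the arrow
```

Taking the outermost positions keeps every CDS block, at the cost of folding intron space into the arrow length, which matches the stated priority of not missing blocks over exact scaling.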
The feature qualifiers section consists of all lines following the location string. Each qualifier is denoted by a / character followed by the qualifier key. The most important keys are /db_xref and /protein_id. The protein_id is used in a "foreignId" search query in the EFI DB to retrieve any associated "uniprotId" strings. Similarly, the db_xref lines are parsed to get any "UniProtKB" accession IDs associated with the CDS entry. Other qualifiers that may or should be of interest are:
/pseudo and /pseudogene, which indicate that the CDS feature is not translated into a functional protein. These entries are currently parsed but never written to the final tab file (and so never incorporated into the ENA table in EFI DB). They are hidden neighbors that will never be visualized in the GND.
/translation, which provides the amino acid sequence encoded by the nucleobase sequence range given in the location string. This sequence string could be used to get a translated length of the gene, from which a length in DNA sequence space could be approximated (three nucleotides per residue).
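The qualifier handling above could be sketched roughly as follows. All function names are hypothetical and the accession values are made up for illustration. The sketch assumes it is given only the qualifier lines of a single CDS block, that a line whose text starts with / opens a new qualifier, and that the DNA length estimate counts one extra codon for the stop.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_qualifiers(ft_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect qualifier values from the lines that follow the location."""
    qualifiers = defaultdict(list)
    key = None
    for raw in ft_lines:
        # Strip the leading "FT" line code and the column padding.
        text = raw[2:].strip() if raw.startswith("FT") else raw.strip()
        if text.startswith("/"):
            # New qualifier; valueless ones such as /pseudo get an empty value.
            key, _, value = text[1:].partition("=")
            qualifiers[key].append(value)
        elif key is not None:
            # Continuation line: sequences are joined without a space.
            sep = "" if key == "translation" else " "
            qualifiers[key][-1] += sep + text
    return {k: [v.strip('"') for v in vals] for k, vals in qualifiers.items()}


def uniprot_accessions(qualifiers: Dict[str, List[str]]) -> List[str]:
    """Pull accessions out of db_xref values that start with "UniProtKB"."""
    return [
        xref.split(":", 1)[1]
        for xref in qualifiers.get("db_xref", [])
        if xref.startswith("UniProtKB") and ":" in xref
    ]


def is_pseudo(qualifiers: Dict[str, List[str]]) -> bool:
    """True when the block carries /pseudo or /pseudogene."""
    return "pseudo" in qualifiers or "pseudogene" in qualifiers


def approx_dna_length(qualifiers: Dict[str, List[str]]) -> int:
    """Estimate nucleotide length from /translation (assumes one stop codon)."""
    translation = qualifiers.get("translation", [""])[0]
    return 3 * (len(translation) + 1) if translation else 0
```

An estimate derived from /translation covers only the coding nucleotides, so for a CDS with introns it would be shorter than the span between the start and stop indices.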
Big picture stuff:
The current workflow gathers all FT CDS blocks, parsing lines for important information that is saved in Locus objects' instance attributes, and stashes each Locus in a dict where the associated key is the count value. Once the full set of CDS blocks has been gathered for a chromosome Record object, this dict of Locus objects is looped over and processed. It is only at this stage that a locus's information is either written to file or skipped, based on whether a UniProt accession ID is associated with the Locus object. The count value is of utmost importance because it is used to determine the ordering of genes on the chromosome, which in turn determines the local neighborhood for the GNN analysis and GND visualization. When a CDS block does not map to a UniProt ID, that gene count/position will be absent from the GNN and GND analyses.
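A minimal sketch of that final processing step is shown below. The `Locus` fields, the `tab_rows` function, and the column layout are assumptions made for illustration and do not reflect the actual classes or tab file format. The point it shows is that the count is assigned to every CDS block during parsing, so a skipped locus leaves a gap in the written positions rather than shifting its neighbors.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Locus:
    # Hypothetical attribute names; the real Locus class may differ.
    count: int
    start: int
    stop: int
    protein_id: str = ""
    uniprot_ids: List[str] = field(default_factory=list)
    is_pseudo: bool = False


def tab_rows(record_id: str, loci: Dict[int, Locus]) -> List[str]:
    """Build tab-separated rows for loci that map to a UniProt accession."""
    rows = []
    for count in sorted(loci):
        locus = loci[count]
        # Pseudogenes and unmapped loci keep their count but are not written.
        if locus.is_pseudo or not locus.uniprot_ids:
            continue
        for accession in locus.uniprot_ids:
            rows.append(
                "\t".join(
                    [record_id, accession, str(count),
                     str(locus.start), str(locus.stop)]
                )
            )
    return rows
```

For example, with loci at counts 1, 2, and 3 where only 1 and 3 have UniProt IDs, the rows would carry counts 1 and 3, and position 2 would simply be absent from the neighborhood.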