# string_help.py
# Help texts about STRING, keyed by topic name.
HELP_TOPICS = {
"gsea": (
"GSEA cannot be performed directly by the agent, but it is available in STRING. "
"Use the 'Proteins with Values/Ranks' option on the STRING input page. "
"This requires providing the **complete set of proteins from your experiment** "
"(no cut-offs or subsets) together with values such as p-values, fold-changes, "
"or t-statistics. STRING combines the KS (Kolmogorov–Smirnov) and AFC (Aggregate Fold Change) tests to achieve high sensitivity, "
"and evaluates your experiment against a broad set of functional categories. "
"Results include enrichment tables, visualizations, and mapping of proteins to functions."
),
"large_input": (
"Large input sets are not handled well directly by the agent and may cause timeouts. "
"Please direct the user to the STRING web interface instead:\n\n"
"- **Proteins with values/ranks**: If the user has quantitative values (e.g. p-values, fold changes, ranks), "
"tell them to use the 'Proteins with Values/Ranks' input box. They must provide the *full experiment* without cut-offs "
"to enable GSEA-like analysis.\n"
"- **Selected protein sets**: For network visualization, fewer than 300 proteins is optimal. "
"Above this, networks become 'hairballs'. Suggest raising the confidence cutoff to reduce network density. "
"STRING can visualize up to ~2000 proteins, but beyond that, visualization is disabled and only enrichment (ORA) is available.\n"
"- **Very large or proteome-wide networks**: For larger analyses, advise the user to use the Cytoscape STRING app "
"(https://apps.cytoscape.org/apps/stringapp), which supports visualization, clustering, and analysis of large-scale datasets.\n\n"
"In summary: use the web interface or Cytoscape for large sets; the agent is best suited for smaller queries."
),
"scores": (
"STRING interaction scores range from 0 to 1000 (roughly corresponding to probabilities from 0 to 1). "
"Common thresholds: 400 = medium confidence, 700 = high confidence.\n\n"
"The combined score integrates evidence from multiple channels (experiments, databases, co-expression, text mining, etc.). "
"Each channel is benchmarked and equally weighted; weaker channels naturally give lower scores. It is not recommended to remove channels, "
"as this reduces biological signal. Channels also cannot be removed by the agent — only through the STRING web interface (settings tab).\n\n"
"The combination uses a Bayesian scheme: a prior is removed from each channel score, the prior-corrected scores are combined multiplicatively "
"(as one minus the product of the complements, 1 - score, across channels), and the prior is added back once. The result is a probability-like confidence score.\n\n"
"For details see: von Mering et al., Nucleic Acids Res. 2005.\n\n"
"For details about the meaning of the lines in the network, refer to the topic 'line_colors'."
),
"missing_proteins": (
"STRING accepts many identifiers (gene symbols, UniProt, Ensembl). "
"If a protein still cannot be found:\n"
"- You can query the protein name 'random' to display an example network in the chosen species.\n"
"- Alternatively, try searching by a functional term for that species.\n\n"
"Common reasons for missing proteins:\n"
"1. In bacteria, plasmid-encoded proteins are sometimes not included in STRING.\n"
"2. In human, proteins such as VEGFA or VDR may be absent because they were not annotated as 'protein coding' "
"in the Ensembl release used for STRING v12.\n\n"
"If you suspect this, check the older STRING v11.5 at https://version-11-5.string-db.org."
),
"missing_species": (
"If the species cannot be found in STRING (e.g. `string_query_species` does not return the correct match), "
"direct the user to use the **Add species** functionality on the STRING input page. "
"By uploading a complete species proteome, STRING will build its interaction network and predict protein functions. "
"These predicted functions include assignments to Gene Ontology terms and KEGG pathways. "
"Once uploaded, the user can explore and analyze the proteome through the web interface, download results in bulk, "
"or provide species identifiers (starting with `STRG`) to this chat interface for further queries."
),
"proteome_annotation": (
"Direct the user to use the **Add species** functionality on the STRING input page. "
"By uploading a complete species proteome, STRING will build its interaction network and predict protein functions. "
"These predicted functions include assignments to Gene Ontology terms and KEGG pathways. "
"Once uploaded, the user can explore and analyze the proteome through the web interface, download results in bulk, "
"or provide species identifiers (starting with `STRG`) to this chat interface for further queries."
),
"regulatory_networks": (
"Regulatory or directed networks are not available in STRING at this time. "
"All STRING links are **undirected** and represent functional or physical associations, "
"not regulatory direction.\n\n"
"Apologies for the inconvenience — regulatory network support is planned for a future STRING release."
),
"how_to_use_string": (
"Do not describe the usage of the MCP / Agent, but focus on general STRING usage.\n\n"
"STRING is a database for exploring protein–protein interactions and functional enrichment. "
"It is designed to reveal how proteins work together in biological pathways, complexes, or cellular processes.\n\n"
"To begin, provide a single protein or a set of proteins of interest, for example from your experiment. "
"STRING will retrieve known and predicted interaction partners and display them as a network.\n\n"
"Beyond visualization, STRING analyzes your input to find functional patterns. Under the *Analysis* tab, "
"you will see enrichment results for pathways, Gene Ontology terms, protein domains, and other annotation sources. "
"These enrichments help identify common biological processes shared by your proteins.\n\n"
"STRING also offers clustering (MCL or k-means), which groups proteins into modules based on network connectivity. "
"These clusters can represent protein complexes, signaling pathways, or co-regulated functional units.\n\n"
"At the STRING input page, above each input box, you will find example protein sets. "
"You can click these to explore STRING’s capabilities before submitting your own data.\n\n"
"For additional guidance, visit the full help pages:\n"
"https://string-db.org/cgi/help?"
),
"line_colors": (
"STRING networks can be visualized in two modes: **Confidence** and **Evidence**.\n\n"
"**Confidence view**:\n"
"- All edges use a single color.\n"
"- Line **thickness** reflects the confidence score (0–1000).\n\n"
"**Evidence view** (default):\n"
"Edges are colored according to the type of supporting evidence. All edges have equal thickness.\n\n"
"**Known interactions**:\n"
"- From curated databases — grey / blue-grey\n"
"- Experimentally determined — violet\n\n"
"**Predicted interactions**:\n"
"- Gene neighborhood — dark green\n"
"- Gene fusions — red\n"
"- Gene co-occurrence — dark blue\n\n"
"**Others**:\n"
"- Textmining — light green (lime)\n"
"- Co-expression — black\n"
"- Protein homology — light blue\n\n"
"**Note:** Protein homology is shown for reference only and is *not included* in the combined confidence score."
),
}
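
The 'scores' topic above describes the combined score only in words: remove a prior from each channel, combine multiplicatively, add the prior back once. A minimal sketch of that arithmetic follows. It is an illustration under stated assumptions, not STRING's actual implementation: the names `PRIOR`, `remove_prior`, `combine_channel_scores`, and `confidence_label` are hypothetical, the prior value of 0.041 is an assumption that should be checked against the STRING documentation, and any further corrections STRING may apply to individual channels are not modeled.

```python
from math import prod

# Assumed prior probability of an interaction; hypothetical value for
# illustration, to be checked against the STRING documentation.
PRIOR = 0.041


def remove_prior(score, prior=PRIOR):
    """Strip the prior from a single channel score given on a 0..1 scale."""
    score = max(score, prior)  # scores below the prior carry no evidence
    return (score - prior) / (1.0 - prior)


def combine_channel_scores(channel_scores, prior=PRIOR):
    """Combine per-channel scores (0..1000) into one score (0..1000).

    Steps: remove the prior from each channel, multiply the complements
    (1 - score), then add the prior back once.
    """
    no_prior = [remove_prior(s / 1000.0, prior) for s in channel_scores]
    one_minus = prod(1.0 - s for s in no_prior)
    combined = (1.0 - one_minus) * (1.0 - prior) + prior
    return round(combined * 1000)


def confidence_label(score):
    """Map a 0..1000 score onto the thresholds named in the 'scores' topic."""
    if score >= 700:
        return "high"
    if score >= 400:
        return "medium"
    return "below medium"


if __name__ == "__main__":
    # Two hypothetical channel scores for the same protein pair.
    combined = combine_channel_scores([500, 500])
    print(combined, confidence_label(combined))
```

Under this scheme a single channel passes through unchanged, and each additional channel with evidence above the prior can only raise the combined score, which is consistent with the advice above against removing channels.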