# string_help.py (meringlab/string-mcp, master branch)


# Help texts for the STRING agent, keyed by topic name.
HELP_TOPICS = {
    "gsea": (
        "Gene set enrichment analysis (GSEA) cannot be performed directly by the agent, but it is available in STRING. "
        "Use the 'Proteins with Values/Ranks' option on the STRING input page. "
        "This requires providing the **complete set of proteins from your experiment** "
        "(no cut-offs or subsets) together with values such as p-values, fold-changes, "
        "or t-statistics. STRING combines the KS (Kolmogorov–Smirnov) and AFC (Aggregate Fold Change) tests to achieve high sensitivity, "
        "and evaluates your experiment against a broad set of functional categories. "
        "Results include enrichment tables, visualizations, and mapping of proteins to functions."
    ),
    "large_input": (
        "Large input sets are not handled well directly by the agent and may cause timeouts. "
        "Please direct the user to the STRING web interface instead:\n\n"
        "- **Proteins with values/ranks**: If the user has quantitative values (e.g. p-values, fold changes, ranks), "
        "tell them to use the 'Proteins with Values/Ranks' input box. They must provide the *full experiment* without cut-offs "
        "to enable GSEA-like analysis.\n"
        "- **Selected protein sets**: For network visualization, fewer than 300 proteins is optimal. "
        "Above this, networks become 'hairballs'. Suggest raising the confidence cutoff to reduce network density. "
        "STRING can visualize up to ~2000 proteins, but beyond that, visualization is disabled and only enrichment (ORA) is available.\n"
        "- **Very large or proteome-wide networks**: For larger analyses, advise the user to use the Cytoscape STRING app "
        "(https://apps.cytoscape.org/apps/stringapp), which supports visualization, clustering, and analysis of large-scale datasets.\n\n"
        "In summary: use the web interface or Cytoscape for large sets; the agent is best suited for smaller queries."
    ),
    "scores": (
        "STRING interaction scores range from 0 to 1000 (roughly corresponding to probabilities from 0 to 1). "
        "Common thresholds: 400 = medium confidence, 700 = high confidence.\n\n"
        "The combined score integrates evidence from multiple channels (experiments, databases, co-expression, text mining, etc.). "
        "Each channel is benchmarked and equally weighted; weaker channels naturally give lower scores. It is not recommended to remove channels, "
        "as this reduces biological signal. The agent also cannot remove channels; this is possible only through the STRING web interface (settings tab).\n\n"
        "The combination uses a Bayesian scheme: a prior is removed from each channel, scores are combined multiplicatively, "
        "and the prior is added back once. The result is a probability-like confidence score.\n\n"
        "For details see: von Mering et al., Nucleic Acids Res. 2005.\n\n"

        "For the meaning of the line colors in the network, refer to the topic 'line_colors'."
    ),
    "missing_proteins": (
        "STRING accepts many identifiers (gene symbols, UniProt, Ensembl). "
        "If a protein still cannot be found:\n"
        "- You can query the protein name 'random' to display an example network in the chosen species.\n"
        "- Alternatively, try searching by a functional term for that species.\n\n"
        "Common reasons for missing proteins:\n"
        "1. In bacteria, some plasmid-encoded proteins are not included in STRING.\n"
        "2. In human, proteins such as VEGFA or VDR may be absent because they were not annotated as 'protein coding' "
        "in the Ensembl release used for STRING v12.\n\n"
        "If you suspect this, check the older STRING v11.5 at https://version-11-5.string-db.org."
    ),
    "missing_species": (
        "If the species cannot be found in STRING (e.g. `string_query_species` does not return the correct match), "
        "direct the user to use the **Add species** functionality on the STRING input page. "
        "By uploading a complete species proteome, STRING will build its interaction network and predict protein functions. "
        "These predicted functions include assignments to Gene Ontology terms and KEGG pathways. "
        "Once uploaded, the user can explore and analyze the proteome through the web interface, download results in bulk, "
        "or provide species identifiers (starting with `STRG`) to this chat interface for further queries."
    ),
    "proteome_annotation": (
        "Direct the user to use the **Add species** functionality on the STRING input page. "
        "By uploading a complete species proteome, STRING will build its interaction network and predict protein functions. "
        "These predicted functions include assignments to Gene Ontology terms and KEGG pathways. "
        "Once uploaded, the user can explore and analyze the proteome through the web interface, download results in bulk, "
        "or provide species identifiers (starting with `STRG`) to this chat interface for further queries."
    ),
    "regulatory_networks": (
        "Regulatory or directed networks are not available in STRING at this time. "
        "All STRING links are **undirected** and represent functional or physical associations, "
        "not regulatory direction.\n\n"
        "Apologies for the inconvenience. Regulatory network support is planned for a future STRING release."
    ),
    "how_to_use_string": (
        "Do not describe the usage of the MCP / Agent, but focus on general STRING usage.\n\n"
        "STRING is a database for exploring protein–protein interactions and functional enrichment. "
        "It is designed to reveal how proteins work together in biological pathways, complexes, or cellular processes.\n\n"

        "To begin, provide a single protein or a set of proteins, either ones you are interested in or ones from your experiment. "
        "STRING will retrieve known and predicted interaction partners and display them as a network.\n\n"

        "Beyond visualization, STRING analyzes your input to find functional patterns. Under the *Analysis* tab, "
        "you will see enrichment results for pathways, Gene Ontology terms, protein domains, and other annotation sources. "
        "These enrichments help identify common biological processes shared by your proteins.\n\n"

        "STRING also offers clustering (MCL or k-means), which groups proteins into modules based on network connectivity. "
        "These clusters can represent protein complexes, signaling pathways, or co-regulated functional units.\n\n"

        "At the STRING input page, above each input box, you will find example protein sets. "
        "You can click these to explore STRING’s capabilities before submitting your own data.\n\n"

        "For additional guidance visit the full help pages:\n"
        "https://string-db.org/cgi/help?"
    ),
    "line_colors": (
        "STRING networks can be visualized in two modes: **Confidence** and **Evidence**.\n\n"

        "**Confidence view**:\n"
        "- All edges use a single color.\n"
        "- Line **thickness** reflects the confidence score (0–1000).\n\n"

        "**Evidence view** (default):\n"
        "Edges are colored according to the type of supporting evidence. All edges have equal thickness.\n\n"

        "**Known interactions**:\n"
        "- From curated databases — grey / blue-grey\n"
        "- Experimentally determined — violet\n\n"

        "**Predicted interactions**:\n"
        "- Gene neighborhood — dark green\n"
        "- Gene fusions — red\n"
        "- Gene co-occurrence — dark blue\n\n"

        "**Others**:\n"
        "- Textmining — light green (lime)\n"
        "- Co-expression — black\n"
        "- Protein homology — light blue\n\n"

        "**Note:** Protein homology is shown for reference only and is *not included* in the combined confidence score."
    ),
}
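

# Illustrative sketch, not part of the original module. The "scores" topic above
# describes the combination scheme in words: a prior is removed from each channel,
# the channel scores are combined multiplicatively, and the prior is added back once.
# The function below is one way that description could be written out.
# Assumptions (none of these are stated in the help text): the function name
# `combine_channel_scores` is hypothetical, the prior value 0.041 is assumed,
# and prior-corrected scores are clamped at zero.
def combine_channel_scores(channel_scores, prior=0.041):
    """Combine per-channel scores (0 to 1000) into one score (0 to 1000)."""
    if not 0.0 <= prior < 1.0:
        raise ValueError("prior must be in [0, 1)")

    # Probability that no channel supports the association.
    no_support = 1.0
    for score in channel_scores:
        probability = score / 1000.0
        # Remove the prior from this channel, clamping at zero.
        without_prior = max(0.0, (probability - prior) / (1.0 - prior))
        # Multiplicative combination of the complementary probabilities.
        no_support *= 1.0 - without_prior

    combined = 1.0 - no_support
    # Add the prior back exactly once.
    combined = combined * (1.0 - prior) + prior
    return round(combined * 1000)


# Under these assumptions a single channel keeps its score, and two
# medium-confidence channels combine to a score above either input,
# e.g. combine_channel_scores([400, 400]).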