
Fix/clean english kg prompts #183

Closed

yogurtss wants to merge 24 commits into InternScience:main from yogurtss:fix/clean-english-kg-prompts

Conversation

@yogurtss

No description provided.

yogurtss and others added 24 commits March 14, 2026 23:10

Add tree-aware pipeline operators, example config, and integration test

Support relation type, evidence and confidence in KG extraction; filter by confidence/evidence

Improve DRAM-focused VQA generation quality pipeline
- parse markdown into text, table, and image components for structure analysis
- attach table captions above html tables when available
- attach image captions below markdown images and preserve note text
- add fixture, verification notes, and focused tests for structure analysis

feat(tree_atomic): preserve pre-segmented tree nodes

feat(tree_pipeline): add evidence-grounded tree VQA and KG extraction

Refine KG extraction prompts for semiconductor memory domain
@github-actions github-actions bot added documentation Improvements or additions to documentation core examples tests labels Mar 21, 2026
@yogurtss yogurtss closed this Mar 21, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands GraphGen's capabilities by introducing a dedicated data platform for visualizing generated outputs, enhancing the tree pipeline to intelligently parse and structure markdown content, and strengthening the knowledge graph construction with robust evidence grounding mechanisms. These changes aim to improve the quality and traceability of generated QA pairs, particularly for VQA tasks, and are complemented by comprehensive documentation outlining future development plans.

Highlights

  • New Data Platform: A new data platform has been introduced, featuring a FastAPI backend and a React frontend, designed for local visualization and exploration of GraphGen's generated QA and VQA results, including interactive sub-graph and evidence span displays.
  • Enhanced Tree Pipeline for Markdown Parsing: The tree pipeline now includes services to intelligently parse structured markdown documents, splitting them into modality-aware components (text, table, image) and extracting associated captions, notes, and metadata, ensuring better preservation of document structure.
  • Stricter KG Grounding and Evidence Validation: Knowledge Graph (KG) builders have been updated to enforce stricter evidence grounding, allowing configuration of confidence thresholds and requiring explicit evidence spans for entities and relations, with validation against source text.
  • Improved VQA Generation Quality Controls: The VQA generator now incorporates advanced quality controls, including filtering of empty or uncertain QA pairs, deduplication, and explicit grounding of generated questions and answers to context keywords, along with enhanced DRAM-centric guidance in prompts.
  • Comprehensive VLM/VQA Documentation: Extensive new documentation has been added, outlining the GraphGen VLM/VQA roadmap, detailed plans for KG grounding, multimodal alignment, question depth optimization, and evaluation benchmarks, providing a clear strategic direction.
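
The confidence-threshold and evidence-span filtering described in the KG grounding highlight might look roughly like the sketch below. All names here (`Relation`, `filter_relations`, `min_confidence`) are illustrative, not GraphGen's actual API:

```python
# Hypothetical sketch of confidence/evidence filtering for extracted relations.
# Names are illustrative and do not reflect GraphGen's real data model.
from dataclasses import dataclass


@dataclass
class Relation:
    head: str
    tail: str
    rel_type: str
    confidence: float
    evidence: str  # span that should appear verbatim in the source text


def filter_relations(relations, source_text, min_confidence=0.7):
    """Keep relations that meet the confidence threshold and whose
    evidence span is actually grounded in the source text."""
    kept = []
    for rel in relations:
        if rel.confidence < min_confidence:
            continue
        if rel.evidence and rel.evidence not in source_text:
            continue  # evidence does not validate against the source
        kept.append(rel)
    return kept
```

The key idea is that a relation survives only if both checks pass, so low-confidence or unverifiable extractions never reach the generated QA pairs.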

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant set of features and improvements far beyond what the title 'Fix/clean english kg prompts' suggests. It adds a comprehensive local data platform with a FastAPI backend and a React frontend for visualizing and exploring generated data. Furthermore, it implements a new tree-based processing pipeline for structured markdown, enhancing the system's ability to handle complex documents. The knowledge graph extraction and generation processes have been substantially refactored to enforce stricter evidence grounding and quality control, with updated prompts and more robust data models. While the overall changes are excellent and well-implemented, I have a few suggestions to improve security and robustness in the new data platform backend.

Comment on lines +13 to +19
```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

medium

Using allow_origins=["*"] is convenient for local development but poses a security risk if this application is ever deployed in a more open environment, since it allows any origin to make requests. For a local-only tool this may be acceptable, but it is better to restrict it to the specific frontend origin (e.g., http://localhost:5173 or http://127.0.0.1:5173), which prevents malicious sites from making requests to the backend on the user's behalf.

Suggested change

```diff
 app.add_middleware(
     CORSMiddleware,
-    allow_origins=["*"],
+    allow_origins=["http://localhost:5173", "http://127.0.0.1:5173"],
     allow_credentials=True,
     allow_methods=["*"],
     allow_headers=["*"],
 )
```
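
If a hard-coded origin list feels too rigid, one option is to derive the allowed origins from an environment variable so the local default stays strict but remains overridable. This is only a sketch; the variable name `DATA_PLATFORM_ORIGINS` and the helper are hypothetical, not part of this PR:

```python
# Sketch: strict local defaults plus an opt-in override via an environment
# variable. DATA_PLATFORM_ORIGINS is an illustrative name, not a real setting.
import os

DEFAULT_ORIGINS = ["http://localhost:5173", "http://127.0.0.1:5173"]


def allowed_origins():
    """Return the default local origins, plus any comma-separated extras
    supplied through the DATA_PLATFORM_ORIGINS environment variable."""
    raw = os.environ.get("DATA_PLATFORM_ORIGINS", "")
    extra = [origin.strip() for origin in raw.split(",") if origin.strip()]
    return DEFAULT_ORIGINS + extra
```

The result of `allowed_origins()` would then be passed to `allow_origins` in place of the wildcard.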

Comment on lines +82 to +92
```python
def _parse_json_summary(value: Any) -> dict[str, Any] | None:
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None
        if isinstance(parsed, dict):
            return parsed
    return None
```

medium

The JSONDecodeError is silently ignored here. While this prevents a crash, it also hides potential data corruption or formatting issues in the sub_graph_summary field. It would be beneficial to log these errors to aid in debugging data quality problems.

Suggested change

```diff
 def _parse_json_summary(value: Any) -> dict[str, Any] | None:
     if isinstance(value, dict):
         return value
     if isinstance(value, str):
         try:
             parsed = json.loads(value)
         except json.JSONDecodeError:
+            # Consider logging the error to help debug data quality issues.
             return None
         if isinstance(parsed, dict):
             return parsed
     return None
```

Comment on lines +227 to +233
```python
payload = json.loads(line)
sample = self._normalize_sample(
    payload=payload,
    run_id=run_id,
    source_file=jsonl_file,
    line_number=line_number,
)
```

medium

The call to json.loads(line) is not wrapped in a try...except block. If any line in a .jsonl file is malformed, this will raise a JSONDecodeError and stop the processing of the entire file. To make the scanning process more robust, it's advisable to wrap this call in a try...except block and log any parsing errors, allowing the process to continue with the remaining valid lines.

```python
try:
    payload = json.loads(line)
except json.JSONDecodeError:
    # Consider logging the error and the problematic line number
    continue
sample = self._normalize_sample(
    payload=payload,
    run_id=run_id,
    source_file=jsonl_file,
    line_number=line_number,
)
```
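
The tolerant-scanning pattern can also be factored into a small standalone helper, sketched below. The function name `iter_jsonl` is illustrative and not part of this PR's code:

```python
# Self-contained sketch of tolerant JSONL scanning, mirroring the
# reviewer's suggestion. iter_jsonl is a hypothetical helper name.
import json


def iter_jsonl(lines):
    """Yield (line_number, payload) for each valid JSON line,
    silently skipping blank or malformed lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield line_number, json.loads(line)
        except json.JSONDecodeError:
            continue  # skip and move on; a real implementation might log here
```

Keeping the line number in the yielded tuple preserves traceability back to the source file even when some lines are dropped.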


Labels

core, documentation (Improvements or additions to documentation), examples, tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant