
Fix/clean english kg prompts #183

Closed

yogurtss wants to merge 24 commits into InternScience:main from yogurtss:fix/clean-english-kg-prompts

Conversation

@yogurtss

No description provided.

yogurtss and others added 24 commits March 14, 2026 23:10

Add tree-aware pipeline operators, example config, and integration test

Support relation type, evidence and confidence in KG extraction; filter by confidence/evidence

Improve DRAM-focused VQA generation quality pipeline
- parse markdown into text, table, and image components for structure analysis
- attach table captions above html tables when available
- attach image captions below markdown images and preserve note text
- add fixture, verification notes, and focused tests for structure analysis

feat(tree_atomic): preserve pre-segmented tree nodes

feat(tree_pipeline): add evidence-grounded tree VQA and KG extraction

Refine KG extraction prompts for semiconductor memory domain
@github-actions github-actions bot added documentation Improvements or additions to documentation core examples tests labels Mar 21, 2026
@yogurtss yogurtss closed this Mar 21, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands GraphGen's capabilities by introducing a dedicated data platform for visualizing generated outputs, enhancing the tree pipeline to intelligently parse and structure markdown content, and strengthening the knowledge graph construction with robust evidence grounding mechanisms. These changes aim to improve the quality and traceability of generated QA pairs, particularly for VQA tasks, and are complemented by comprehensive documentation outlining future development plans.

Highlights

  • New Data Platform: A new data platform has been introduced, featuring a FastAPI backend and a React frontend, designed for local visualization and exploration of GraphGen's generated QA and VQA results, including interactive sub-graph and evidence span displays.
  • Enhanced Tree Pipeline for Markdown Parsing: The tree pipeline now includes services to intelligently parse structured markdown documents, splitting them into modality-aware components (text, table, image) and extracting associated captions, notes, and metadata, ensuring better preservation of document structure.
  • Stricter KG Grounding and Evidence Validation: Knowledge Graph (KG) builders have been updated to enforce stricter evidence grounding, allowing configuration of confidence thresholds and requiring explicit evidence spans for entities and relations, with validation against source text.
  • Improved VQA Generation Quality Controls: The VQA generator now incorporates advanced quality controls, including filtering of empty or uncertain QA pairs, deduplication, and explicit grounding of generated questions and answers to context keywords, along with enhanced DRAM-centric guidance in prompts.
  • Comprehensive VLM/VQA Documentation: Extensive new documentation has been added, outlining the GraphGen VLM/VQA roadmap, detailed plans for KG grounding, multimodal alignment, question depth optimization, and evaluation benchmarks, providing a clear strategic direction.
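
The confidence-threshold and evidence-span filtering described in the KG grounding highlight might look roughly like the sketch below. All names here (`Relation`, `filter_relations`, `min_confidence`) are illustrative, not GraphGen's actual API:

```python
# Hypothetical sketch of confidence/evidence filtering for extracted relations.
# Names are illustrative and do not reflect GraphGen's real data model.
from dataclasses import dataclass


@dataclass
class Relation:
    head: str
    tail: str
    rel_type: str
    confidence: float
    evidence: str  # span that should appear verbatim in the source text


def filter_relations(relations, source_text, min_confidence=0.7):
    """Keep relations that meet the confidence threshold and whose
    evidence span is actually grounded in the source text."""
    kept = []
    for rel in relations:
        if rel.confidence < min_confidence:
            continue
        if rel.evidence and rel.evidence not in source_text:
            continue  # evidence does not validate against the source
        kept.append(rel)
    return kept
```

The key idea is that a relation survives only if both checks pass, so low-confidence or unverifiable extractions never reach the generated QA pairs.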

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant set of features and improvements far beyond what the title 'Fix/clean english kg prompts' suggests. It adds a comprehensive local data platform with a FastAPI backend and a React frontend for visualizing and exploring generated data. Furthermore, it implements a new tree-based processing pipeline for structured markdown, enhancing the system's ability to handle complex documents. The knowledge graph extraction and generation processes have been substantially refactored to enforce stricter evidence grounding and quality control, with updated prompts and more robust data models. While the overall changes are excellent and well-implemented, I have a few suggestions to improve security and robustness in the new data platform backend.

Comment on lines +13 to +19
```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

medium

Using allow_origins=["*"] is convenient for local development but poses a security risk if this application is ever deployed in a more open environment, since it allows any origin to make requests. For a local-only tool this may be acceptable, but it is better to restrict it to the specific frontend origin (e.g., http://localhost:5173 or http://127.0.0.1:5173), which prevents malicious sites from making requests to the backend on the user's behalf.

Suggested change

```diff
 app.add_middleware(
     CORSMiddleware,
-    allow_origins=["*"],
+    allow_origins=["http://localhost:5173", "http://127.0.0.1:5173"],
     allow_credentials=True,
     allow_methods=["*"],
     allow_headers=["*"],
 )
```
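
If a hard-coded origin list feels too rigid, one option is to derive the allowed origins from an environment variable so the local default stays strict but remains overridable. This is only a sketch; the variable name `DATA_PLATFORM_ORIGINS` and the helper are hypothetical, not part of this PR:

```python
# Sketch: strict local defaults plus an opt-in override via an environment
# variable. DATA_PLATFORM_ORIGINS is an illustrative name, not a real setting.
import os

DEFAULT_ORIGINS = ["http://localhost:5173", "http://127.0.0.1:5173"]


def allowed_origins():
    """Return the default local origins, plus any comma-separated extras
    supplied through the DATA_PLATFORM_ORIGINS environment variable."""
    raw = os.environ.get("DATA_PLATFORM_ORIGINS", "")
    extra = [origin.strip() for origin in raw.split(",") if origin.strip()]
    return DEFAULT_ORIGINS + extra
```

The result of `allowed_origins()` would then be passed to `allow_origins` in place of the wildcard.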

Comment on lines +82 to +92
```python
def _parse_json_summary(value: Any) -> dict[str, Any] | None:
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None
        if isinstance(parsed, dict):
            return parsed
    return None
```

medium

The JSONDecodeError is silently ignored here. While this prevents a crash, it also hides potential data corruption or formatting issues in the sub_graph_summary field. It would be beneficial to log these errors to aid in debugging data quality problems.

Suggested change

```diff
 def _parse_json_summary(value: Any) -> dict[str, Any] | None:
     if isinstance(value, dict):
         return value
     if isinstance(value, str):
         try:
             parsed = json.loads(value)
         except json.JSONDecodeError:
+            # Consider logging the error to help debug data quality issues.
             return None
         if isinstance(parsed, dict):
             return parsed
     return None
```

Comment on lines +227 to +233
```python
payload = json.loads(line)
sample = self._normalize_sample(
    payload=payload,
    run_id=run_id,
    source_file=jsonl_file,
    line_number=line_number,
)
```

medium

The call to json.loads(line) is not wrapped in a try...except block. If any line in a .jsonl file is malformed, this will raise a JSONDecodeError and stop the processing of the entire file. To make the scanning process more robust, it's advisable to wrap this call in a try...except block and log any parsing errors, allowing the process to continue with the remaining valid lines.

```python
try:
    payload = json.loads(line)
except json.JSONDecodeError:
    # Consider logging the error and the problematic line number
    continue
sample = self._normalize_sample(
    payload=payload,
    run_id=run_id,
    source_file=jsonl_file,
    line_number=line_number,
)
```
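
The tolerant-scanning pattern can also be factored into a small standalone helper, sketched below. The function name `iter_jsonl` is illustrative and not part of this PR's code:

```python
# Self-contained sketch of tolerant JSONL scanning, mirroring the
# reviewer's suggestion. iter_jsonl is a hypothetical helper name.
import json


def iter_jsonl(lines):
    """Yield (line_number, payload) for each valid JSON line,
    silently skipping blank or malformed lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield line_number, json.loads(line)
        except json.JSONDecodeError:
            continue  # skip and move on; a real implementation might log here
```

Keeping the line number in the yielded tuple preserves traceability back to the source file even when some lines are dropped.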


Labels

core, documentation (Improvements or additions to documentation), examples, tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant