Conversation
…to-graphgen Add tree-aware pipeline operators, example config, and integration test
…er-adding-tree Support relation type, evidence and confidence in KG extraction; filter by confidence/evidence
…process Improve DRAM-focused VQA generation quality pipeline
…op-process Codex-generated pull request
- parse markdown into text, table, and image components for structure analysis
- attach table captions above HTML tables when available
- attach image captions below markdown images and preserve note text
- add fixture, verification notes, and focused tests for structure analysis
…n-tree_atomic_config.yaml feat(tree_atomic): preserve pre-segmented tree nodes
…eline feat(tree_pipeline): add evidence-grounded tree VQA and KG extraction
…extraction-prompt Refine KG extraction prompts for semiconductor memory domain
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands GraphGen's capabilities by introducing a dedicated data platform for visualizing generated outputs, enhancing the tree pipeline to intelligently parse and structure markdown content, and strengthening knowledge graph construction with robust evidence-grounding mechanisms. These changes aim to improve the quality and traceability of generated QA pairs, particularly for VQA tasks, and are complemented by comprehensive documentation outlining future development plans.
Code Review
This pull request introduces a significant set of features and improvements far beyond what the title 'Fix/clean english kg prompts' suggests. It adds a comprehensive local data platform with a FastAPI backend and a React frontend for visualizing and exploring generated data. Furthermore, it implements a new tree-based processing pipeline for structured markdown, enhancing the system's ability to handle complex documents. The knowledge graph extraction and generation processes have been substantially refactored to enforce stricter evidence grounding and quality control, with updated prompts and more robust data models. While the overall changes are excellent and well-implemented, I have a few suggestions to improve security and robustness in the new data platform backend.
```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
Using allow_origins=["*"] is convenient for local development but poses a security risk if this application is ever deployed in a more open environment, as it allows any origin to make requests. For a local-only tool this might be acceptable, but it's better to restrict it to the specific frontend origin (e.g., http://localhost:5173 or http://127.0.0.1:5173). This would prevent other malicious sites from making requests to the backend on the user's behalf.
Suggested change:

```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173", "http://127.0.0.1:5173"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
```python
def _parse_json_summary(value: Any) -> dict[str, Any] | None:
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None
        if isinstance(parsed, dict):
            return parsed
    return None
```
The JSONDecodeError is silently ignored here. While this prevents a crash, it also hides potential data corruption or formatting issues in the sub_graph_summary field. It would be beneficial to log these errors to aid in debugging data quality problems.
Suggested change:

```python
def _parse_json_summary(value: Any) -> dict[str, Any] | None:
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            # Consider logging the error to help debug data quality issues.
            return None
        if isinstance(parsed, dict):
            return parsed
    return None
```
```python
payload = json.loads(line)
sample = self._normalize_sample(
    payload=payload,
    run_id=run_id,
    source_file=jsonl_file,
    line_number=line_number,
)
```
The call to json.loads(line) is not wrapped in a try...except block. If any line in a .jsonl file is malformed, this will raise a JSONDecodeError and stop the processing of the entire file. To make the scanning process more robust, it's advisable to wrap this call in a try...except block and log any parsing errors, allowing the process to continue with the next valid lines.
Suggested change:

```python
try:
    payload = json.loads(line)
except json.JSONDecodeError:
    # Consider logging the error and the problematic line number
    continue
sample = self._normalize_sample(
    payload=payload,
    run_id=run_id,
    source_file=jsonl_file,
    line_number=line_number,
)
```