Fix/clean english kg prompts #183
Closed
yogurtss wants to merge 24 commits into InternScience:main from yogurtss:fix/clean-english-kg-prompts
Commits
- 0b9c1d4: add MoRora files for future develop (yogurtss)
- ecc062d: feat: add tree-based operator pipeline and sample generate config (yogurtss)
- f84a201: Merge pull request #2 from yogurtss/codex/plan-migration-from-modora-… (yogurtss)
- ad91b44: Constrain KG relations with evidence and memory-focused prompts (yogurtss)
- 7378d9a: Merge pull request #3 from yogurtss/codex/improve-kg-enhancements-aft… (yogurtss)
- 8f6e90b: Improve DRAM-focused VQA generation quality pipeline (yogurtss)
- 1b48de7: Merge pull request #4 from yogurtss/codex/refine-vqa-data-generation-… (yogurtss)
- df33bb5: docs: add multi-hop pipeline flowchart and diagram prompt (yogurtss)
- 496cb17: Merge pull request #5 from yogurtss/codex/create-flowchart-for-multih… (yogurtss)
- 07845c6: feat(tree_pipeline): split markdown text table image blocks (yogurtss)
- 0fb6d6e: feat(tree_atomic): preserve pre-segmented tree nodes (yogurtss)
- 5525661: Merge pull request #6 from yogurtss/codex/integrate-modora-features-i… (yogurtss)
- 09eb98c: feat(tree_pipeline): add grounded tree vqa support (yogurtss)
- c3863ba: Merge pull request #7 from yogurtss/codex/add-vqa-support-to-tree-pip… (yogurtss)
- d008876: Refine memory-domain KG extraction prompts (yogurtss)
- 4f37afc: Merge pull request #8 from yogurtss/codex/update-entity-and-relation-… (yogurtss)
- 62cdddc: chore: update project state and add VLM VQA planning docs (yogurtss)
- 71cccd9: docs: reorganize documentation structure (yogurtss)
- 22b4a84: fix(tree_pipeline): preserve section nodes and stable tree parenting (yogurtss)
- 002a5ad: Add sub-graph metadata to generated QA outputs (yogurtss)
- 093894e: Add GraphGen data platform v1 (yogurtss)
- 72f8ce9: Remove frontend dependencies from git (yogurtss)
- 2ecefa6: Revert VQA length filtering (yogurtss)
- cbeb69e: fix: clean english kg extraction prompt examples (yogurtss)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| # GraphGen Data Platform | ||
|
|
||
| A standalone data platform for browsing GraphGen's generated results, with a focus on: | ||
|
|
||
| - Importing GraphGen output directories such as `cache` | ||
| - Browsing Question / Answer pairs | ||
| - Previewing VQA images | ||
| - Visualizing `sub_graph` | ||
| - Displaying the `evidence_span` on nodes and edges | ||
|
|
||
| ## Directory structure | ||
|
|
||
| - `data_platform/backend` | ||
| Python + FastAPI backend that scans `cache/output/<run_id>/generate/*.jsonl` | ||
| - `data_platform/frontend` | ||
| React + Vite frontend that provides the three-pane workspace and the interactive graph | ||
|
|
||
| ## Starting the backend | ||
|
|
||
| Run from the project root: | ||
|
|
||
| ```bash | ||
| uvicorn data_platform.backend.main:app --reload | ||
| ``` | ||
|
|
||
| Listens on `http://127.0.0.1:8000` by default. | ||
|
|
||
| ## Starting the frontend | ||
|
|
||
| In another terminal, run: | ||
|
|
||
| ```bash | ||
| cd data_platform/frontend | ||
| npm install | ||
| npm run dev | ||
| ``` | ||
|
|
||
| Listens on `http://127.0.0.1:5173` by default and forwards `/api/*` requests to the backend through the Vite proxy. | ||
|
|
||
| ## Usage | ||
|
|
||
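| 1. Start the backend and the frontend. | ||
| 2. Open the frontend page. | ||
| 3. In the import box at the top left, enter a GraphGen output directory, for example `cache`. | ||
| 4. After importing, select a run. | ||
| 5. Browse samples in the middle pane; view the image, graph, and evidence in the right pane. |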
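
The README says the backend scans `cache/output/<run_id>/generate/*.jsonl`, but the store module that does this is not visible in this diff. The following is only a minimal sketch of what such a scan might look like, assuming one JSON object per line; `scan_runs` is a hypothetical helper, not the PR's `DataPlatformStore.scan`.

```python
from __future__ import annotations

import json
from pathlib import Path


def scan_runs(root: Path) -> dict[str, list[dict]]:
    """Collect JSONL records per run from <root>/output/<run_id>/generate/*.jsonl.

    Hypothetical helper for illustration; the real store may differ.
    """
    output_dir = root / "output"
    if not output_dir.is_dir():
        raise FileNotFoundError(f"No output directory under {root}")

    runs: dict[str, list[dict]] = {}
    for run_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        generate_dir = run_dir / "generate"
        if not generate_dir.is_dir():
            continue  # run without generated samples
        records: list[dict] = []
        for jsonl_path in sorted(generate_dir.glob("*.jsonl")):
            with jsonl_path.open(encoding="utf-8") as handle:
                for line in handle:
                    line = line.strip()
                    if line:  # skip blank lines between records
                        records.append(json.loads(line))
        if records:
            runs[run_dir.name] = records
    return runs
```

Calling `scan_runs(Path("cache"))` from the project root would return a mapping from run id to its parsed records, which is roughly the information the run selector in the UI needs.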
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| """GraphGen local data platform package.""" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| """Backend package for the GraphGen data platform.""" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from pathlib import Path | ||
|
|
||
| from fastapi import FastAPI, HTTPException, Query | ||
| from fastapi.middleware.cors import CORSMiddleware | ||
| from fastapi.responses import FileResponse | ||
|
|
||
| from .models import RunRecord, SamplePage, SampleRecord, ScanRequest, ScanResponse | ||
| from .store import DataPlatformStore | ||
|
|
||
| app = FastAPI(title="GraphGen Data Platform API", version="0.1.0") | ||
| app.add_middleware( | ||
| CORSMiddleware, | ||
| allow_origins=["*"], | ||
| allow_credentials=True, | ||
| allow_methods=["*"], | ||
| allow_headers=["*"], | ||
| ) | ||
|
|
||
| store = DataPlatformStore(base_dir=Path.cwd()) | ||
|
|
||
|
|
||
| @app.get("/api/health") | ||
| def healthcheck() -> dict[str, str]: | ||
| return {"status": "ok"} | ||
|
|
||
|
|
||
| @app.post("/api/imports/scan", response_model=ScanResponse) | ||
| def scan_imports(request: ScanRequest) -> ScanResponse: | ||
| try: | ||
| runs, sample_count = store.scan(request.root_path) | ||
| except FileNotFoundError as exc: | ||
| raise HTTPException(status_code=404, detail=str(exc)) from exc | ||
| except ValueError as exc: | ||
| raise HTTPException(status_code=400, detail=str(exc)) from exc | ||
|
|
||
| return ScanResponse( | ||
| root_path=request.root_path, | ||
| run_count=len(runs), | ||
| sample_count=sample_count, | ||
| runs=runs, | ||
| ) | ||
|
|
||
|
|
||
| @app.get("/api/runs", response_model=list[RunRecord]) | ||
| def list_runs() -> list[RunRecord]: | ||
| return store.list_runs() | ||
|
|
||
|
|
||
| @app.get("/api/runs/{run_id}/samples", response_model=SamplePage) | ||
| def list_samples( | ||
| run_id: str, | ||
| page: int = Query(default=1, ge=1), | ||
| page_size: int = Query(default=20, ge=1, le=100), | ||
| search: str | None = None, | ||
| has_image: bool | None = None, | ||
| has_graph: bool | None = None, | ||
| ) -> SamplePage: | ||
| try: | ||
| return store.list_samples( | ||
| run_id, | ||
| page=page, | ||
| page_size=page_size, | ||
| search=search, | ||
| has_image=has_image, | ||
| has_graph=has_graph, | ||
| ) | ||
| except KeyError as exc: | ||
| raise HTTPException(status_code=404, detail=f"Run not found: {run_id}") from exc | ||
|
|
||
|
|
||
| @app.get("/api/samples/{sample_id}", response_model=SampleRecord) | ||
| def get_sample(sample_id: str) -> SampleRecord: | ||
| try: | ||
| return store.get_sample(sample_id) | ||
| except KeyError as exc: | ||
| raise HTTPException(status_code=404, detail=f"Sample not found: {sample_id}") from exc | ||
|
|
||
|
|
||
| @app.get("/api/assets") | ||
| def get_asset(path: str = Query(..., min_length=1)) -> FileResponse: | ||
| asset_path = Path(path).resolve() | ||
| if not store.is_asset_allowed(str(asset_path)): | ||
| raise HTTPException(status_code=403, detail="Asset path is not indexed") | ||
| if not asset_path.exists() or not asset_path.is_file(): | ||
| raise HTTPException(status_code=404, detail="Asset not found") | ||
| return FileResponse(asset_path) | ||
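
The `/api/assets` handler resolves the requested path and then defers to `store.is_asset_allowed`, whose implementation is not shown in this diff. One plausible reading is an allow-list of paths recorded during the scan. The sketch below illustrates that idea only; `AssetIndex` and `register` are hypothetical names, not the PR's code.

```python
from __future__ import annotations

from pathlib import Path


class AssetIndex:
    """Hypothetical allow-list of asset paths discovered while scanning."""

    def __init__(self) -> None:
        self._allowed: set[str] = set()

    def register(self, path: str | Path) -> None:
        # Store the resolved absolute path so later lookups compare like with like.
        self._allowed.add(str(Path(path).resolve()))

    def is_asset_allowed(self, path: str | Path) -> bool:
        # Resolving first collapses ".." segments, so a traversal attempt
        # only passes if it lands exactly on an indexed file.
        return str(Path(path).resolve()) in self._allowed
```

Checking membership in an exact set, rather than testing for a directory prefix, means a request can only ever reach files that a scan explicitly indexed.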
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from typing import Any, Literal | ||
|
|
||
| from pydantic import BaseModel, Field | ||
|
|
||
|
|
||
| class EvidenceItem(BaseModel): | ||
| kind: Literal["node", "edge"] | ||
| label: str | ||
| evidence_span: str | ||
| source_id: str | None = None | ||
| description: str | None = None | ||
|
|
||
|
|
||
| class RunStats(BaseModel): | ||
| question_texts: list[str] = Field(default_factory=list) | ||
| answer_texts: list[str] = Field(default_factory=list) | ||
| entity_type_counts: dict[str, int] = Field(default_factory=dict) | ||
| relation_type_counts: dict[str, int] = Field(default_factory=dict) | ||
| evidence_coverage: float = 0.0 | ||
|
|
||
|
|
||
| class RunRecord(BaseModel): | ||
| run_id: str | ||
| root_path: str | ||
| config_path: str | None = None | ||
| generated_at: int | None = None | ||
| sample_count: int = 0 | ||
| task_type: str = "unknown" | ||
| has_image: bool = False | ||
| has_sub_graph: bool = False | ||
| stats: RunStats = Field(default_factory=RunStats) | ||
|
|
||
|
|
||
| class SampleListItem(BaseModel): | ||
| sample_id: str | ||
| run_id: str | ||
| question: str | ||
| answer_preview: str | ||
| image_path: str | None = None | ||
| node_count: int = 0 | ||
| edge_count: int = 0 | ||
| has_graph: bool = False | ||
|
|
||
|
|
||
| class SampleRecord(BaseModel): | ||
| sample_id: str | ||
| run_id: str | ||
| source_file: str | ||
| trace_id: str | None = None | ||
| question: str | ||
| answer: str | ||
| image_path: str | None = None | ||
| sub_graph: dict[str, Any] | None = None | ||
| sub_graph_summary: dict[str, Any] | None = None | ||
| evidence_items: list[EvidenceItem] = Field(default_factory=list) | ||
| raw_record: dict[str, Any] | ||
| graph_parse_error: str | None = None | ||
|
|
||
|
|
||
| class SamplePage(BaseModel): | ||
| items: list[SampleListItem] | ||
| total: int | ||
| page: int | ||
| page_size: int | ||
|
|
||
|
|
||
| class ScanRequest(BaseModel): | ||
| root_path: str | ||
|
|
||
|
|
||
| class ScanResponse(BaseModel): | ||
| root_path: str | ||
| run_count: int | ||
| sample_count: int | ||
| runs: list[RunRecord] |
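
`SamplePage` carries `items`, `total`, `page`, and `page_size`, and the `list_samples` endpoint constrains `page` to at least 1 and `page_size` to 1..100. How the store slices its samples is not visible in this diff, so the following is only a sketch of how those fields might be populated; `paginate` is a hypothetical helper that returns a plain dict rather than the pydantic model.

```python
from __future__ import annotations

from typing import Any


def paginate(items: list[Any], page: int = 1, page_size: int = 20) -> dict[str, Any]:
    """Return a SamplePage-shaped dict for a 1-indexed page of items."""
    # Same bounds as the Query(...) constraints on the endpoint.
    if page < 1 or not 1 <= page_size <= 100:
        raise ValueError("page must be >= 1 and page_size must be in 1..100")
    start = (page - 1) * page_size
    return {
        "items": items[start : start + page_size],
        "total": len(items),  # count before slicing, so the UI can show page totals
        "page": page,
        "page_size": page_size,
    }
```

A page past the end of the data yields an empty `items` list while `total` stays unchanged, so a client can detect that it has run out of samples.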
Using `allow_origins=["*"]` is convenient for local development but poses a security risk if this application is ever deployed in a more open environment, as it allows any origin to make requests. For a local-only tool this might be acceptable, but it's better to restrict it to the specific frontend origin (e.g., `http://localhost:5173` or `http://127.0.0.1:5173`). This would prevent other malicious sites from making requests to the backend on the user's behalf.
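
One way to act on this suggestion is to build the middleware options from an explicit origin list. This is a sketch under assumptions, not the fix adopted in the PR: `build_cors_options` and `DEFAULT_FRONTEND_ORIGINS` are hypothetical names, and the origins are the Vite dev-server addresses mentioned above. The dictionary keys match the keyword arguments already passed to `CORSMiddleware` in `main.py`.

```python
from __future__ import annotations

from typing import Any

# Assumed local frontend origins (the Vite dev server on port 5173).
DEFAULT_FRONTEND_ORIGINS = ("http://localhost:5173", "http://127.0.0.1:5173")


def build_cors_options(
    origins: tuple[str, ...] = DEFAULT_FRONTEND_ORIGINS,
) -> dict[str, Any]:
    """Return CORSMiddleware keyword arguments restricted to explicit origins."""
    if not origins or "*" in origins:
        raise ValueError("List explicit origins instead of a wildcard")
    return {
        "allow_origins": list(origins),
        "allow_credentials": True,
        # The API in this diff only exposes GET and POST routes.
        "allow_methods": ["GET", "POST"],
        "allow_headers": ["*"],
    }
```

In `main.py` the result could then be passed as `app.add_middleware(CORSMiddleware, **build_cors_options())`, keeping the allowed origins in one place.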