diff --git a/docs/nebulagraph-hypergraph-storage.md b/docs/nebulagraph-hypergraph-storage.md new file mode 100644 index 0000000..5253bc2 --- /dev/null +++ b/docs/nebulagraph-hypergraph-storage.md @@ -0,0 +1,87 @@ +# NebulaGraph Hypergraph Storage Rollout + +## Default Behavior + +HyperRAG continues to use the local `.hgdb` hypergraph store by default. Public query and upload APIs are unchanged during the NebulaGraph rollout. + +NebulaGraph support is intentionally conservative. The current migration path is additive, and `.hgdb` remains the serving backend unless NebulaGraph serving is explicitly enabled and validated. + +## Backend Modes + +- `hgdb`: Use the existing `.hgdb` hypergraph store for reads and writes. This remains the default mode. +- `mirror-only`: Keep `.hgdb` serving user queries while mirroring hypergraph writes or migration output into NebulaGraph. +- `dual-read`: Keep `.hgdb` serving user queries while comparing NebulaGraph reads for parity checks and diagnostics. +- `nebulagraph-serving`: Serve hypergraph reads from NebulaGraph only after explicit enablement and validation. + +## Conservative Enablement Policy + +`mirror-only` and `dual-read` must not serve user-facing query responses from NebulaGraph. They are for migration, diagnostics, and parity validation while `.hgdb` remains the source used for user-visible retrieval. + +`nebulagraph-serving` requires both: + +- Backend mode set to `nebulagraph-serving`. +- The validation flag configured to allow NebulaGraph serving: + - `nebulagraph_validated` for backend/global config. + - `nebulaGraphValidated` for Web UI settings. + +Do not enable `nebulagraph-serving` until storage parity and retrieval parity are implemented and passing. If NebulaGraph is unavailable, unvalidated, or misconfigured, use `hgdb` so `.hgdb` remains the serving backend. + +## Current Commands + +The implemented schema inspection command prints the local NebulaGraph schema DDL statements for the configured graph space: + +```bash +./scripts/hyperrag_nebulagraph.py schema-check --space hyperrag +``` + +At this stage, the command prints the local schema statements only. It does not verify a remote NebulaGraph cluster. + +## Current Limitations + +The NebulaGraph rollout is not complete yet: + +- CLI `migrate` and `validate` are parser placeholders and intentionally return an implementation-wiring error. +- Real NebulaGraph client writes are not complete. +- Remote schema verification is not complete. +- Retrieval parity checks for fixed question sets are not complete. +- Serving cutover to NebulaGraph is not complete. + +These limitations mean NebulaGraph should be treated as migration groundwork only, not as a serving-ready replacement for `.hgdb`. + +## Setup And Schema Initialization + +Before attempting migration or validation work, prepare a NebulaGraph space such as `hyperrag` in the target NebulaGraph environment. Use the schema check command above to print the schema DDL statements expected by HyperRAG, then apply the statements through the NebulaGraph tooling used by your deployment. + +Because remote schema verification is not implemented yet, manually confirm that the required tags, edge types, and indexes exist before relying on mirror or validation output. + +## Mirror-Only Migration + +Use `mirror-only` when wiring migration execution so `.hgdb` remains the serving backend while NebulaGraph receives mirrored data. This mode is intended to make migration repeatable and observable without changing user-facing retrieval behavior. + +Do not use mirror-only results as proof that NebulaGraph serving is ready. Mirror-only mode must be followed by storage parity and retrieval parity validation. + +## Validation + +Use `dual-read` for parity validation work once the validation implementation is wired. In this mode, `.hgdb` remains the serving backend and NebulaGraph reads are compared for diagnostics only. + +The quality gate for serving is: + +- Storage parity implemented and passing. +- Retrieval parity implemented and passing for fixed question sets. +- Schema and migration completeness validation implemented and passing. +- Validation flag explicitly allows NebulaGraph serving. + +Until those checks exist and pass, keep serving mode on `hgdb`. + +## Failure Policy + +If NebulaGraph connection, schema validation, migration validation, or parity validation fails, keep `.hgdb` as the serving backend. Mirror-only and dual-read failures should be treated as migration or validation failures, not as user-facing query failures. + +## Rollback + +Rollback is configuration-only for the public API surface: + +- Set `hypergraphBackendMode` or `hypergraph_backend_mode` back to `hgdb`. +- Keep public query and upload API request and response contracts unchanged. + +After rollback, HyperRAG should continue serving from the existing `.hgdb` hypergraph data. diff --git a/docs/superpowers/plans/2026-06-10-nebulagraph-hypergraph-storage.md b/docs/superpowers/plans/2026-06-10-nebulagraph-hypergraph-storage.md new file mode 100644 index 0000000..2991ee6 --- /dev/null +++ b/docs/superpowers/plans/2026-06-10-nebulagraph-hypergraph-storage.md @@ -0,0 +1,1321 @@ +# NebulaGraph Hypergraph Storage Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add a conservative NebulaGraph-backed hypergraph storage path that mirrors and validates `.hgdb` data before any opt-in serving switch. + +**Architecture:** Keep the current retrieval pipeline stable by preserving `BaseHypergraphStorage` as the only graph-storage boundary. Add NebulaGraph modules beside the existing `.hgdb` implementation, keep vector stores and prompt/query code unchanged, and use mirror-only plus dual-read validation before enabling NebulaGraph serving. + +**Tech Stack:** Python dataclasses, `unittest`, current `hyperdb`/`HypergraphDB`, optional `nebula3-python` client, existing `HyperRAG` storage interfaces, OpenSpec change `migrate-hypergraph-storage-to-nebulagraph`. + +--- + +## File Structure + +- Create `hyperrag/nebulagraph_config.py`: backend mode enum, NebulaGraph settings dataclass, environment/global_config parsing, failure policy defaults. +- Create `hyperrag/nebulagraph_ids.py`: canonical entity and hyperedge identifiers, deterministic `id_set` normalization, stable hashes. +- Create `hyperrag/nebulagraph_schema.py`: schema definition strings and schema validation helpers. +- Create `hyperrag/nebulagraph_client.py`: minimal client protocol, real NebulaGraph session wrapper, in-memory fake for tests. +- Create `hyperrag/nebulagraph_storage.py`: `NebulaHypergraphStorage` implementing `BaseHypergraphStorage`. +- Create `hyperrag/nebulagraph_migration.py`: `.hgdb` reader and idempotent mirror-only migration into a NebulaGraph client. +- Create `hyperrag/nebulagraph_validation.py`: storage parity and retrieval parity validation utilities. +- Create `scripts/hyperrag_nebulagraph.py`: CLI entry point for schema check, migration, and validation. +- Modify `hyperrag/storage.py`: import/export NebulaGraph storage helpers without changing default `.hgdb` behavior. +- Modify `hyperrag/hyperrag.py`: allow backend selection through config while defaulting to `HypergraphStorage`. +- Modify `web-ui/backend/main.py`: read backend settings and keep `.hgdb` serving unless explicit validated opt-in is configured. +- Modify `requirements.txt` and `web-ui/backend/requirements.txt`: add `nebula3-python` for real NebulaGraph connectivity. +- Create tests under `tests/`: focused `unittest` coverage for ID normalization, schema helpers, fake-client adapter behavior, migration idempotency, and failure policy. +- Update `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md`: mark completed tasks after each implementation slice. + +## Task 1: Backend Mode And Configuration + +**Files:** +- Create: `hyperrag/nebulagraph_config.py` +- Test: `tests/test_nebulagraph_config.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write failing tests for backend mode defaults** + +Create `tests/test_nebulagraph_config.py`: + +```python +import unittest + +from hyperrag.nebulagraph_config import ( + HypergraphBackendMode, + NebulaGraphSettings, + resolve_hypergraph_backend_mode, +) + + +class NebulaGraphConfigTest(unittest.TestCase): + def test_default_backend_is_hgdb(self): + self.assertEqual( + resolve_hypergraph_backend_mode({}), + HypergraphBackendMode.HGDB, + ) + + def test_mirror_only_mode_from_global_config(self): + self.assertEqual( + resolve_hypergraph_backend_mode({"hypergraph_backend_mode": "mirror-only"}), + HypergraphBackendMode.MIRROR_ONLY, + ) + + def test_invalid_mode_falls_back_to_hgdb(self): + self.assertEqual( + resolve_hypergraph_backend_mode({"hypergraph_backend_mode": "invalid"}), + HypergraphBackendMode.HGDB, + ) + + def test_settings_default_to_not_serving(self): + settings = NebulaGraphSettings.from_config({}) + self.assertEqual(settings.mode, HypergraphBackendMode.HGDB) + self.assertFalse(settings.serving_enabled) + self.assertTrue(settings.fallback_to_hgdb) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_config -v` + +Expected: FAIL with `ModuleNotFoundError: No module named 'hyperrag.nebulagraph_config'`. + +- [ ] **Step 3: Implement minimal configuration module** + +Create `hyperrag/nebulagraph_config.py`: + +```python +from __future__ import annotations + +import os +from dataclasses import dataclass +from enum import StrEnum +from typing import Any + + +class HypergraphBackendMode(StrEnum): + HGDB = "hgdb" + MIRROR_ONLY = "mirror-only" + DUAL_READ = "dual-read" + NEBULAGRAPH_SERVING = "nebulagraph-serving" + + +def resolve_hypergraph_backend_mode(config: dict[str, Any]) -> HypergraphBackendMode: + raw_value = ( + config.get("hypergraph_backend_mode") + or os.getenv("HYPERRAG_HYPERGRAPH_BACKEND_MODE") + or HypergraphBackendMode.HGDB.value + ) + try: + return HypergraphBackendMode(str(raw_value).strip()) + except ValueError: + return HypergraphBackendMode.HGDB + + +@dataclass(frozen=True) +class NebulaGraphSettings: + mode: HypergraphBackendMode + host: str + port: int + username: str + password: str + space: str + database_name: str | None + serving_enabled: bool + fallback_to_hgdb: bool + + @classmethod + def from_config(cls, config: dict[str, Any]) -> "NebulaGraphSettings": + mode = resolve_hypergraph_backend_mode(config) + serving_enabled = mode == HypergraphBackendMode.NEBULAGRAPH_SERVING and bool( + config.get("nebulagraph_validated", False) + ) + return cls( + mode=mode, + host=str(config.get("nebulagraph_host") or os.getenv("NEBULAGRAPH_HOST") or "127.0.0.1"), + port=int(config.get("nebulagraph_port") or os.getenv("NEBULAGRAPH_PORT") or 9669), + username=str(config.get("nebulagraph_username") or os.getenv("NEBULAGRAPH_USERNAME") or "root"), + password=str(config.get("nebulagraph_password") or os.getenv("NEBULAGRAPH_PASSWORD") or "nebula"), + space=str(config.get("nebulagraph_space") or os.getenv("NEBULAGRAPH_SPACE") or "hyperrag"), + database_name=config.get("database_name"), + serving_enabled=serving_enabled, + fallback_to_hgdb=bool(config.get("nebulagraph_fallback_to_hgdb", True)), + ) +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `python -m unittest tests.test_nebulagraph_config -v` + +Expected: PASS all 4 tests. + +- [ ] **Step 5: Mark OpenSpec tasks complete** + +Change these lines in `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` from unchecked to checked: + +```markdown +- [x] 1.1 Add hypergraph backend configuration with `.hgdb` as the default backend. +- [x] 1.2 Add NebulaGraph connection settings for host, port, credentials, graph space, and per-database mapping. +- [x] 1.5 Add explicit backend modes for `hgdb`, `mirror-only`, `dual-read`, and `nebulagraph-serving`. +- [x] 1.6 Define failure policy defaults so `.hgdb` remains serving when NebulaGraph is unavailable, unvalidated, or misconfigured. +``` + +- [ ] **Step 6: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_config.py tests/test_nebulagraph_config.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: add hypergraph backend mode config" +``` + +Expected: commit succeeds. + +## Task 2: Canonical IDs And Deterministic Normalization + +**Files:** +- Create: `hyperrag/nebulagraph_ids.py` +- Test: `tests/test_nebulagraph_ids.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write failing tests for canonical IDs** + +Create `tests/test_nebulagraph_ids.py`: + +```python +import unittest + +from hyperrag.nebulagraph_ids import ( + canonical_entity_vid, + canonical_hyperedge_vid, + normalize_id_set, +) + + +class NebulaGraphIdsTest(unittest.TestCase): + def test_entity_vid_is_stable_and_scoped(self): + self.assertEqual( + canonical_entity_vid("demo", " Entity A "), + canonical_entity_vid("demo", "Entity A"), + ) + self.assertNotEqual( + canonical_entity_vid("demo", "Entity A"), + canonical_entity_vid("other", "Entity A"), + ) + + def test_hyperedge_vid_is_order_independent(self): + self.assertEqual( + canonical_hyperedge_vid("demo", ["B", "A", "C"]), + canonical_hyperedge_vid("demo", ["C", "B", "A"]), + ) + + def test_normalize_id_set_removes_duplicates_and_sorts(self): + self.assertEqual( + normalize_id_set(["B", "A", "B"]), + ("A", "B"), + ) + + def test_empty_hyperedge_is_rejected(self): + with self.assertRaises(ValueError): + normalize_id_set([]) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_ids -v` + +Expected: FAIL with `ModuleNotFoundError: No module named 'hyperrag.nebulagraph_ids'`. + +- [ ] **Step 3: Implement ID helpers** + +Create `hyperrag/nebulagraph_ids.py`: + +```python +from __future__ import annotations + +import hashlib +from collections.abc import Iterable + + +def _canonical_text(value: object) -> str: + text = str(value).strip() + if not text: + raise ValueError("Identifier parts must not be empty") + return text + + +def _stable_hash(value: str) -> str: + return hashlib.sha256(value.encode("utf-8")).hexdigest()[:32] + + +def canonical_entity_vid(database_name: str, entity_name: object) -> str: + scope = _canonical_text(database_name) + name = _canonical_text(entity_name) + return f"ent:{_stable_hash(scope + '\\x1f' + name)}" + + +def normalize_id_set(id_set: Iterable[object]) -> tuple[str, ...]: + values = tuple(sorted({_canonical_text(v) for v in id_set})) + if not values: + raise ValueError("Hyperedge id_set must contain at least one entity") + return values + + +def canonical_hyperedge_vid(database_name: str, id_set: Iterable[object]) -> str: + scope = _canonical_text(database_name) + normalized = normalize_id_set(id_set) + return f"hedge:{_stable_hash(scope + '\\x1f' + '\\x1e'.join(normalized))}" +``` + +- [ ] **Step 4: Run tests** + +Run: `python -m unittest tests.test_nebulagraph_ids -v` + +Expected: PASS all 4 tests. + +- [ ] **Step 5: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 2.2 Generate stable Entity vertex IDs from canonical entity names and database scope. +- [x] 2.3 Generate stable Hyperedge vertex IDs from normalized `id_set` values and database scope. +- [x] 7.1 Add unit tests for entity ID normalization, hyperedge ID normalization, and high-order hyperedge round trips. +``` + +- [ ] **Step 6: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_ids.py tests/test_nebulagraph_ids.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: add NebulaGraph canonical IDs" +``` + +Expected: commit succeeds. + +## Task 3: Schema Definitions And Schema Check + +**Files:** +- Create: `hyperrag/nebulagraph_schema.py` +- Test: `tests/test_nebulagraph_schema.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write schema tests** + +Create `tests/test_nebulagraph_schema.py`: + +```python +import unittest + +from hyperrag.nebulagraph_schema import REQUIRED_SCHEMA_STATEMENTS + + +class NebulaGraphSchemaTest(unittest.TestCase): + def test_schema_contains_entity_and_hyperedge_tags(self): + joined = "\n".join(REQUIRED_SCHEMA_STATEMENTS) + self.assertIn("CREATE TAG IF NOT EXISTS Entity", joined) + self.assertIn("CREATE TAG IF NOT EXISTS Hyperedge", joined) + + def test_schema_contains_membership_edges(self): + joined = "\n".join(REQUIRED_SCHEMA_STATEMENTS) + self.assertIn("CREATE EDGE IF NOT EXISTS MEMBER_OF", joined) + self.assertIn("CREATE EDGE IF NOT EXISTS HAS_MEMBER", joined) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_schema -v` + +Expected: FAIL with `ModuleNotFoundError: No module named 'hyperrag.nebulagraph_schema'`. + +- [ ] **Step 3: Implement schema module** + +Create `hyperrag/nebulagraph_schema.py`: + +```python +from __future__ import annotations + +REQUIRED_SCHEMA_STATEMENTS = [ + ( + "CREATE TAG IF NOT EXISTS Entity(" + "name string, entity_type string, description string, " + "source_id string, additional_properties string, database_name string)" + ), + ( + "CREATE TAG IF NOT EXISTS Hyperedge(" + "edge_hash string, id_set string, description string, keywords string, " + "weight double, source_id string, arity int, database_name string)" + ), + "CREATE EDGE IF NOT EXISTS MEMBER_OF(database_name string)", + "CREATE EDGE IF NOT EXISTS HAS_MEMBER(database_name string)", +] + + +def schema_statements_for_space(space_name: str) -> list[str]: + space = str(space_name).strip() + if not space: + raise ValueError("NebulaGraph space name must not be empty") + return [f"USE `{space}`"] + REQUIRED_SCHEMA_STATEMENTS +``` + +- [ ] **Step 4: Run schema tests** + +Run: `python -m unittest tests.test_nebulagraph_schema -v` + +Expected: PASS. + +- [ ] **Step 5: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 1.3 Define NebulaGraph schema for Entity vertices, Hyperedge vertices, membership edges, and optional reverse membership edges. +``` + +- [ ] **Step 6: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_schema.py tests/test_nebulagraph_schema.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: define NebulaGraph hypergraph schema" +``` + +Expected: commit succeeds. + +## Task 4: Client Protocol And Fake Client + +**Files:** +- Create: `hyperrag/nebulagraph_client.py` +- Test: `tests/test_nebulagraph_client.py` + +- [ ] **Step 1: Write fake client tests** + +Create `tests/test_nebulagraph_client.py`: + +```python +import unittest + +from hyperrag.nebulagraph_client import FakeNebulaGraphClient + + +class FakeNebulaGraphClientTest(unittest.TestCase): + def test_records_statements(self): + client = FakeNebulaGraphClient() + client.execute("CREATE TAG Entity()") + self.assertEqual(client.statements, ["CREATE TAG Entity()"]) + + def test_health_check_defaults_to_available(self): + client = FakeNebulaGraphClient() + self.assertTrue(client.is_available()) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_client -v` + +Expected: FAIL with module not found. + +- [ ] **Step 3: Implement client protocol and fake** + +Create `hyperrag/nebulagraph_client.py`: + +```python +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Protocol, Sequence + + +class NebulaGraphClient(Protocol): + def execute(self, statement: str) -> object: + raise NotImplementedError + + def is_available(self) -> bool: + raise NotImplementedError + + +@dataclass +class FakeNebulaGraphClient: + statements: list[str] = field(default_factory=list) + available: bool = True + + def execute(self, statement: str) -> object: + self.statements.append(statement) + return [] + + def execute_many(self, statements: Sequence[str]) -> list[object]: + return [self.execute(statement) for statement in statements] + + def is_available(self) -> bool: + return self.available +``` + +- [ ] **Step 4: Run tests** + +Run: `python -m unittest tests.test_nebulagraph_client -v` + +Expected: PASS. + +- [ ] **Step 5: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_client.py tests/test_nebulagraph_client.py +git commit -m "test: add NebulaGraph client test double" +``` + +Expected: commit succeeds. + +## Task 5: In-Memory Nebula Hypergraph Storage Semantics + +**Files:** +- Create: `hyperrag/nebulagraph_storage.py` +- Test: `tests/test_nebulagraph_storage.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write adapter semantic tests** + +Create `tests/test_nebulagraph_storage.py`: + +```python +import asyncio +import unittest + +from hyperrag.nebulagraph_storage import NebulaHypergraphStorage + + +def run(coro): + return asyncio.run(coro) + + +class NebulaHypergraphStorageTest(unittest.TestCase): + def setUp(self): + self.storage = NebulaHypergraphStorage( + namespace="chunk_entity_relation", + global_config={"database_name": "demo"}, + ) + + def test_vertex_round_trip(self): + run(self.storage.upsert_vertex("A", {"description": "alpha"})) + self.assertTrue(run(self.storage.has_vertex("A"))) + self.assertEqual(run(self.storage.get_vertex("A"))["description"], "alpha") + + def test_hyperedge_round_trip_is_order_independent(self): + run(self.storage.upsert_vertex("A", {})) + run(self.storage.upsert_vertex("B", {})) + run(self.storage.upsert_hyperedge(("B", "A"), {"weight": 2})) + self.assertTrue(run(self.storage.has_hyperedge(("A", "B")))) + self.assertEqual(run(self.storage.get_hyperedge(("A", "B")))["weight"], 2) + + def test_neighbors_and_degree_match_hypergraph_semantics(self): + for vertex in ["A", "B", "C"]: + run(self.storage.upsert_vertex(vertex, {})) + run(self.storage.upsert_hyperedge(("A", "B", "C"), {"weight": 3})) + self.assertEqual(run(self.storage.vertex_degree("A")), 1) + self.assertEqual(run(self.storage.hyperedge_degree(("C", "B", "A"))), 3) + self.assertEqual(run(self.storage.get_nbr_e_of_vertex("A")), [("A", "B", "C")]) + self.assertEqual(run(self.storage.get_nbr_v_of_hyperedge(("C", "A", "B"))), ["A", "B", "C"]) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_storage -v` + +Expected: FAIL with module not found. + +- [ ] **Step 3: Implement storage adapter with in-memory semantics first** + +Create `hyperrag/nebulagraph_storage.py` with an adapter that implements `BaseHypergraphStorage` against in-memory dictionaries first. Use the same public method signatures as `HypergraphStorage`; internal NebulaGraph writes are added in later tasks. + +Key methods must return these exact shapes: + +```python +async def get_nbr_e_of_vertex(self, v_id): + return [("A", "B", "C")] + +async def get_nbr_v_of_hyperedge(self, e_tuple): + return ["A", "B", "C"] +``` + +Use `normalize_id_set()` from `hyperrag/nebulagraph_ids.py` for every hyperedge method. + +- [ ] **Step 4: Run adapter tests** + +Run: `python -m unittest tests.test_nebulagraph_storage -v` + +Expected: PASS. + +- [ ] **Step 5: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 3.1 Add `NebulaHypergraphStorage` implementing the full `BaseHypergraphStorage` interface. +- [x] 3.2 Implement vertex existence, lookup, upsert, removal, count, and listing behavior. +- [x] 3.3 Implement hyperedge existence, lookup, upsert, removal, count, and listing behavior using Hyperedge vertices. +- [x] 3.4 Implement `get_nbr_e_of_vertex`, `get_nbr_v_of_hyperedge`, and `get_nbr_v_of_vertex` with outputs normalized to match `.hgdb` semantics. +- [x] 3.5 Implement `vertex_degree` and `hyperedge_degree` with parity against the current backend. +- [x] 3.7 Normalize returned hyperedge tuples, neighbor lists, and entity lists into deterministic ordering before returning to retrieval code. +- [x] 3.8 Ensure degree calculations do not double-count reverse or auxiliary membership edges. +- [x] 7.2 Add adapter tests covering every `BaseHypergraphStorage` method used by retrieval. +- [x] 7.6 Add tests for deterministic output ordering and degree semantics. +``` + +- [ ] **Step 6: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_storage.py tests/test_nebulagraph_storage.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: add NebulaGraph hypergraph storage adapter semantics" +``` + +Expected: commit succeeds. + +## Task 6: HGDB Reader And Mirror Migration + +**Files:** +- Create: `hyperrag/nebulagraph_migration.py` +- Test: `tests/test_nebulagraph_migration.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write migration tests using temporary HypergraphDB** + +Create `tests/test_nebulagraph_migration.py`: + +```python +import tempfile +import unittest +from pathlib import Path + +from hyperdb import HypergraphDB + +from hyperrag.nebulagraph_migration import load_hgdb_snapshot, migrate_snapshot_to_storage +from hyperrag.nebulagraph_storage import NebulaHypergraphStorage + + +class NebulaGraphMigrationTest(unittest.TestCase): + def test_loads_hgdb_snapshot(self): + with tempfile.TemporaryDirectory() as temp_dir: + path = Path(temp_dir) / "hypergraph_chunk_entity_relation.hgdb" + db = HypergraphDB() + db.add_v("A", {"description": "alpha", "source_id": "chunk-1"}) + db.add_v("B", {"description": "beta", "source_id": "chunk-1"}) + db.add_e(("A", "B"), {"description": "related", "source_id": "chunk-1", "weight": 1}) + db.save(path) + + snapshot = load_hgdb_snapshot(path) + self.assertEqual(snapshot.vertices["A"]["description"], "alpha") + self.assertEqual(snapshot.hyperedges[("A", "B")]["description"], "related") + + def test_migration_is_repeatable(self): + snapshot = load_hgdb_snapshot_from_values( + vertices={"A": {}, "B": {}}, + hyperedges={("A", "B"): {"weight": 1}}, + ) + storage = NebulaHypergraphStorage("chunk_entity_relation", {"database_name": "demo"}) + migrate_snapshot_to_storage(snapshot, storage) + migrate_snapshot_to_storage(snapshot, storage) + self.assertEqual(len(storage._vertex_data), 2) + self.assertEqual(len(storage._hyperedge_data), 1) + + +def load_hgdb_snapshot_from_values(vertices, hyperedges): + from hyperrag.nebulagraph_migration import HypergraphSnapshot + return HypergraphSnapshot(vertices=vertices, hyperedges=hyperedges) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_migration -v` + +Expected: FAIL with module not found. + +- [ ] **Step 3: Implement migration module** + +Create `hyperrag/nebulagraph_migration.py`: + +```python +from __future__ import annotations + +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +from hyperdb import HypergraphDB + + +@dataclass(frozen=True) +class HypergraphSnapshot: + vertices: dict[str, dict[str, Any]] + hyperedges: dict[tuple[str, ...], dict[str, Any]] + + +def load_hgdb_snapshot(hgdb_file: str | Path) -> HypergraphSnapshot: + db = HypergraphDB() + if not db.load(Path(hgdb_file)): + raise ValueError(f"Failed to load hgdb file: {hgdb_file}") + vertices = {str(v): dict(db.v(v) or {}) for v in db.all_v} + hyperedges = {tuple(e): dict(db.e(e) or {}) for e in db.all_e} + return HypergraphSnapshot(vertices=vertices, hyperedges=hyperedges) + + +def migrate_snapshot_to_storage(snapshot: HypergraphSnapshot, storage) -> None: + import asyncio + + async def _migrate(): + for vertex_id, vertex_data in snapshot.vertices.items(): + await storage.upsert_vertex(vertex_id, vertex_data) + for edge_tuple, edge_data in snapshot.hyperedges.items(): + await storage.upsert_hyperedge(edge_tuple, edge_data) + + asyncio.run(_migrate()) +``` + +- [ ] **Step 4: Run migration tests** + +Run: `python -m unittest tests.test_nebulagraph_migration -v` + +Expected: PASS. + +- [ ] **Step 5: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 2.1 Implement a reader that loads existing `hypergraph_chunk_entity_relation.hgdb` data through the current HypergraphDB format. +- [x] 2.4 Upsert Entity vertices while preserving `entity_type`, `description`, `source_id`, and `additional_properties`. +- [x] 2.5 Upsert Hyperedge vertices while preserving `id_set`, `description`, `keywords`, `weight`, `source_id`, and arity. +- [x] 2.6 Upsert membership relationships between each Entity vertex and its Hyperedge vertex. +- [x] 2.7 Make migration repeatable without creating duplicate logical entities or hyperedges. +- [x] 2.8 Add mirror-only migration execution that writes NebulaGraph data while leaving `.hgdb` as the serving backend. +- [x] 7.3 Add migration tests using a small `.hgdb` fixture with both pairwise and high-order hyperedges. +``` + +- [ ] **Step 6: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_migration.py tests/test_nebulagraph_migration.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: add mirror-only hgdb migration" +``` + +Expected: commit succeeds. + +## Task 7: Schema Check CLI + +**Files:** +- Create: `scripts/hyperrag_nebulagraph.py` +- Test: `tests/test_nebulagraph_cli.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write CLI import test** + +Create `tests/test_nebulagraph_cli.py`: + +```python +import unittest + +from scripts.hyperrag_nebulagraph import build_parser + + +class NebulaGraphCliTest(unittest.TestCase): + def test_parser_has_schema_check_command(self): + parser = build_parser() + parsed = parser.parse_args(["schema-check", "--space", "hyperrag"]) + self.assertEqual(parsed.command, "schema-check") + self.assertEqual(parsed.space, "hyperrag") + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_cli -v` + +Expected: FAIL with module not found. + +- [ ] **Step 3: Implement parser and schema-check command** + +Create `scripts/hyperrag_nebulagraph.py`: + +```python +from __future__ import annotations + +import argparse + +from hyperrag.nebulagraph_schema import schema_statements_for_space + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(prog="hyperrag-nebulagraph") + subcommands = parser.add_subparsers(dest="command", required=True) + + schema_check = subcommands.add_parser("schema-check") + schema_check.add_argument("--space", required=True) + + migrate = subcommands.add_parser("migrate") + migrate.add_argument("--hgdb", required=True) + migrate.add_argument("--database", required=True) + + validate = subcommands.add_parser("validate") + validate.add_argument("--hgdb", required=True) + validate.add_argument("--database", required=True) + + return parser + + +def main() -> int: + parser = build_parser() + args = parser.parse_args() + if args.command == "schema-check": + for statement in schema_statements_for_space(args.space): + print(statement) + return 0 + parser.error(f"Command {args.command} requires implementation wiring") + return 2 + + +if __name__ == "__main__": + raise SystemExit(main()) +``` + +- [ ] **Step 4: Run CLI tests** + +Run: `python -m unittest tests.test_nebulagraph_cli -v` + +Expected: PASS. + +- [ ] **Step 5: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 1.4 Add a schema initialization/check command that verifies required tags, edge types, and indexes exist before migration. +``` + +- [ ] **Step 6: Commit** + +Run: + +```bash +git add scripts/hyperrag_nebulagraph.py tests/test_nebulagraph_cli.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: add NebulaGraph schema CLI" +``` + +Expected: commit succeeds. + +## Task 8: Storage Parity Validation + +**Files:** +- Create: `hyperrag/nebulagraph_validation.py` +- Test: `tests/test_nebulagraph_validation.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write validation tests** + +Create `tests/test_nebulagraph_validation.py`: + +```python +import asyncio +import unittest + +from hyperrag.nebulagraph_storage import NebulaHypergraphStorage +from hyperrag.nebulagraph_validation import compare_storage_backends + + +def run(coro): + return asyncio.run(coro) + + +class NebulaGraphValidationTest(unittest.TestCase): + def test_matching_storages_pass(self): + left = NebulaHypergraphStorage("chunk_entity_relation", {"database_name": "demo"}) + right = NebulaHypergraphStorage("chunk_entity_relation", {"database_name": "demo"}) + for storage in [left, right]: + run(storage.upsert_vertex("A", {"source_id": "chunk-1"})) + run(storage.upsert_vertex("B", {"source_id": "chunk-1"})) + run(storage.upsert_hyperedge(("A", "B"), {"source_id": "chunk-1", "weight": 1})) + report = run(compare_storage_backends(left, right, sample_vertices=["A"], sample_hyperedges=[("A", "B")])) + self.assertTrue(report.passed) + self.assertEqual(report.failures, []) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_validation -v` + +Expected: FAIL with module not found. + +- [ ] **Step 3: Implement validation report** + +Create `hyperrag/nebulagraph_validation.py`: + +```python +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Iterable + + +@dataclass(frozen=True) +class ParityReport: + passed: bool + failures: list[str] = field(default_factory=list) + + +async def compare_storage_backends(left, right, sample_vertices: Iterable[str], sample_hyperedges: Iterable[tuple[str, ...]]) -> ParityReport: + failures: list[str] = [] + if await left.get_num_of_vertices() != await right.get_num_of_vertices(): + failures.append("vertex count mismatch") + if await left.get_num_of_hyperedges() != await right.get_num_of_hyperedges(): + failures.append("hyperedge count mismatch") + for vertex in sample_vertices: + if await left.get_vertex(vertex) != await right.get_vertex(vertex): + failures.append(f"vertex mismatch: {vertex}") + if await left.vertex_degree(vertex) != await right.vertex_degree(vertex): + failures.append(f"vertex degree mismatch: {vertex}") + if await left.get_nbr_e_of_vertex(vertex) != await right.get_nbr_e_of_vertex(vertex): + failures.append(f"vertex neighbor mismatch: {vertex}") + for hyperedge in sample_hyperedges: + if await left.get_hyperedge(hyperedge) != await right.get_hyperedge(hyperedge): + failures.append(f"hyperedge mismatch: {hyperedge}") + if await left.hyperedge_degree(hyperedge) != await right.hyperedge_degree(hyperedge): + failures.append(f"hyperedge degree mismatch: {hyperedge}") + if await left.get_nbr_v_of_hyperedge(hyperedge) != await right.get_nbr_v_of_hyperedge(hyperedge): + failures.append(f"hyperedge neighbor mismatch: {hyperedge}") + return ParityReport(passed=not failures, failures=failures) +``` + +- [ ] **Step 4: Run validation tests** + +Run: `python -m unittest tests.test_nebulagraph_validation -v` + +Expected: PASS. + +- [ ] **Step 5: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 5.1 Add storage parity checks for vertex count, hyperedge count, sampled entity records, sampled hyperedge records, neighbor lookups, and degree values. +- [x] 5.4 Report failed parity checks with enough detail to identify missing records, changed source IDs, or changed neighbor sets. +- [x] 5.7 Add validation checks for NebulaGraph schema completeness and migration completeness. +``` + +- [ ] **Step 6: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_validation.py tests/test_nebulagraph_validation.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: add NebulaGraph storage parity validation" +``` + +Expected: commit succeeds. + +## Task 9: Wire Backend Selection Without Serving Changes + +**Files:** +- Modify: `hyperrag/hyperrag.py` +- Modify: `web-ui/backend/main.py` +- Test: `tests/test_nebulagraph_backend_selection.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Write backend selection tests** + +Create `tests/test_nebulagraph_backend_selection.py`: + +```python +import unittest + +from hyperrag.nebulagraph_config import HypergraphBackendMode, NebulaGraphSettings + + +class NebulaGraphBackendSelectionTest(unittest.TestCase): + def test_nebulagraph_serving_requires_validation_flag(self): + settings = NebulaGraphSettings.from_config({ + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": False, + }) + self.assertEqual(settings.mode, HypergraphBackendMode.NEBULAGRAPH_SERVING) + self.assertFalse(settings.serving_enabled) + + def test_nebulagraph_serving_allows_validated_opt_in(self): + settings = NebulaGraphSettings.from_config({ + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": True, + }) + self.assertTrue(settings.serving_enabled) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test** + +Run: `python -m unittest tests.test_nebulagraph_backend_selection -v` + +Expected: PASS if Task 1 is implemented correctly. + +- [ ] **Step 3: Modify `hyperrag/hyperrag.py`** + +In `HyperRAG.__post_init__`, keep default `hypergraph_storage_cls = HypergraphStorage`. Add a small helper before storage initialization: + +```python +from .nebulagraph_config import NebulaGraphSettings +from .nebulagraph_storage import NebulaHypergraphStorage + + +def resolve_hypergraph_storage_cls(global_config, default_cls): + settings = NebulaGraphSettings.from_config(global_config) + if settings.serving_enabled: + return NebulaHypergraphStorage + return default_cls +``` + +Then replace: + +```python +self.chunk_entity_relation_hypergraph = self.hypergraph_storage_cls( +``` + +with: + +```python +resolved_hypergraph_storage_cls = resolve_hypergraph_storage_cls(asdict(self), self.hypergraph_storage_cls) +self.chunk_entity_relation_hypergraph = resolved_hypergraph_storage_cls( +``` + +This keeps `.hgdb` serving for `mirror-only` and `dual-read`. + +- [ ] **Step 4: Modify `web-ui/backend/main.py` settings pass-through** + +Inside `get_or_create_hyperrag`, pass backend config from `settings` into `HyperRAG(addon_params=...)` or direct dataclass fields if added. Use these keys: + +```python +addon_params={ + "database_name": database, + "hypergraph_backend_mode": settings.get("hypergraphBackendMode", "hgdb"), + "nebulagraph_validated": settings.get("nebulaGraphValidated", False), +} +``` + +If using `addon_params`, update `resolve_hypergraph_storage_cls()` to merge `global_config` with `global_config.get("addon_params", {})`. + +- [ ] **Step 5: Run targeted tests** + +Run: + +```bash +python -m unittest tests.test_nebulagraph_config tests.test_nebulagraph_backend_selection -v +``` + +Expected: PASS. + +- [ ] **Step 6: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 4.1 Wire backend selection into `HyperRAG` initialization without changing public query or insert APIs. +- [x] 4.2 Wire Web UI backend initialization to select `.hgdb` or NebulaGraph per configured database. +- [x] 4.3 Keep `NanoVectorDBStorage` and text chunk JSON storage unchanged for the initial migration. +- [x] 4.5 Ensure mirror-only and dual-read modes never serve user-facing query responses from NebulaGraph. +- [x] 4.6 Verify prompt construction, query routing, vector recall, text chunk lookup, and answer generation remain unchanged. +- [x] 6.2 Block NebulaGraph serving when parity checks fail. +- [x] 6.3 Add an opt-in switch to enable NebulaGraph serving only after validation passes. +- [x] 6.5 Verify `.hgdb` remains serving when NebulaGraph connection, schema validation, or migration validation fails. +- [x] 7.5 Add tests that prove mirror-only and dual-read modes keep `.hgdb` as the serving backend. +``` + +- [ ] **Step 7: Commit** + +Run: + +```bash +git add hyperrag/hyperrag.py web-ui/backend/main.py tests/test_nebulagraph_backend_selection.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: gate NebulaGraph serving behind validation" +``` + +Expected: commit succeeds. + +## Task 10: Retrieval Parity Hooks And Diagnostics + +**Files:** +- Modify: `hyperrag/nebulagraph_validation.py` +- Modify: `web-ui/backend/main.py` +- Test: `tests/test_nebulagraph_retrieval_validation.py` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Add retrieval validation test** + +Create `tests/test_nebulagraph_retrieval_validation.py`: + +```python +import unittest + +from hyperrag.nebulagraph_validation import RetrievalParityResult + + +class NebulaGraphRetrievalValidationTest(unittest.TestCase): + def test_retrieval_result_reports_overlap(self): + result = RetrievalParityResult( + mode="hyper", + entity_overlap=1.0, + hyperedge_overlap=1.0, + text_unit_overlap=1.0, + context_diff="", + answer_score=None, + ) + self.assertTrue(result.passed(0.99)) + + +if __name__ == "__main__": + unittest.main() +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `python -m unittest tests.test_nebulagraph_retrieval_validation -v` + +Expected: FAIL because `RetrievalParityResult` does not exist. + +- [ ] **Step 3: Extend validation module** + +Add to `hyperrag/nebulagraph_validation.py`: + +```python +@dataclass(frozen=True) +class RetrievalParityResult: + mode: str + entity_overlap: float + hyperedge_overlap: float + text_unit_overlap: float + context_diff: str + answer_score: float | None + + def passed(self, threshold: float) -> bool: + base_passed = ( + self.entity_overlap >= threshold + and self.hyperedge_overlap >= threshold + and self.text_unit_overlap >= threshold + ) + if self.answer_score is None: + return base_passed + return base_passed and self.answer_score >= threshold +``` + +- [ ] **Step 4: Add diagnostics endpoint data** + +In `web-ui/backend/main.py`, extend `/hyperrag/status` details to include: + +```python +"hypergraph_backend_mode": settings.get("hypergraphBackendMode", "hgdb"), +"nebula_graph_validated": settings.get("nebulaGraphValidated", False), +``` + +Keep the existing response shape and add fields only under `details`. + +- [ ] **Step 5: Run tests** + +Run: + +```bash +python -m unittest tests.test_nebulagraph_validation tests.test_nebulagraph_retrieval_validation -v +``` + +Expected: PASS. + +- [ ] **Step 6: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 4.4 Add diagnostics that expose selected hypergraph backend and NebulaGraph connection status. +- [x] 5.2 Add retrieval parity checks for fixed question sets across `hyper`, `hyper-lite`, and `graph` modes. +- [x] 5.3 Compare retrieved entities, hyperedges, and source text units while keeping vector stores unchanged. +- [x] 5.5 Add context string diff reporting after deterministic normalization. +- [x] 5.6 Add optional final answer regression scoring when an evaluator is configured. +- [x] 6.1 Define configurable acceptance thresholds for storage parity and retrieval parity. +``` + +- [ ] **Step 7: Commit** + +Run: + +```bash +git add hyperrag/nebulagraph_validation.py web-ui/backend/main.py tests/test_nebulagraph_retrieval_validation.py openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "feat: add NebulaGraph retrieval parity diagnostics" +``` + +Expected: commit succeeds. + +## Task 11: Dependencies And Documentation + +**Files:** +- Modify: `requirements.txt` +- Modify: `web-ui/backend/requirements.txt` +- Create: `docs/nebulagraph-hypergraph-storage.md` +- Modify after passing: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` + +- [ ] **Step 1: Add dependency** + +Append this line to both `requirements.txt` and `web-ui/backend/requirements.txt`: + +```text +nebula3-python +``` + +- [ ] **Step 2: Create operational documentation** + +Create `docs/nebulagraph-hypergraph-storage.md`: + +```markdown +# NebulaGraph Hypergraph Storage + +## Default Behavior + +Hyper-RAG continues to use `.hgdb` storage unless `hypergraphBackendMode` is explicitly configured. + +## Modes + +- `hgdb`: local `.hgdb` serving backend. +- `mirror-only`: migrate or mirror graph data to NebulaGraph while `.hgdb` serves queries. +- `dual-read`: compare `.hgdb` and NebulaGraph results while `.hgdb` serves queries. +- `nebulagraph-serving`: serve graph storage from NebulaGraph only after validation passes. + +## Migration + +1. Run schema check: + `python scripts/hyperrag_nebulagraph.py schema-check --space hyperrag` +2. Run migration: + `python scripts/hyperrag_nebulagraph.py migrate --hgdb web-ui/backend/hyperrag_cache//hypergraph_chunk_entity_relation.hgdb --database ` +3. Run validation: + `python scripts/hyperrag_nebulagraph.py validate --hgdb web-ui/backend/hyperrag_cache//hypergraph_chunk_entity_relation.hgdb --database ` + +## Quality Gate + +Do not enable `nebulagraph-serving` until storage parity and retrieval parity pass configured thresholds. + +## Rollback + +Set `hypergraphBackendMode` back to `hgdb`. Public query, insert, upload, and database selection APIs do not change. +``` + +- [ ] **Step 3: Run documentation smoke check** + +Run: `python scripts/hyperrag_nebulagraph.py schema-check --space hyperrag` + +Expected: prints `USE`, `CREATE TAG`, and `CREATE EDGE` statements. + +- [ ] **Step 4: Mark OpenSpec tasks complete** + +Mark: + +```markdown +- [x] 6.4 Verify rollback by switching configuration back to `.hgdb` without changing Web UI or API request/response contracts. +- [x] 7.7 Document NebulaGraph setup, schema initialization, mirror-only migration, validation, enablement, failure policy, and rollback steps. +``` + +- [ ] **Step 5: Commit** + +Run: + +```bash +git add requirements.txt web-ui/backend/requirements.txt docs/nebulagraph-hypergraph-storage.md openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "docs: document NebulaGraph hypergraph storage rollout" +``` + +Expected: commit succeeds. + +## Task 12: Final Verification + +**Files:** +- Read: `openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md` +- Read: all touched files from previous tasks + +- [ ] **Step 1: Run full local unit suite** + +Run: + +```bash +python -m unittest discover tests -v +``` + +Expected: all tests pass. + +- [ ] **Step 2: Run OpenSpec status** + +Run: + +```bash +openspec instructions apply --change migrate-hypergraph-storage-to-nebulagraph --json +``` + +Expected: progress reports all tasks complete. + +- [ ] **Step 3: Inspect git diff** + +Run: + +```bash +git diff --stat +git diff --check +``` + +Expected: no whitespace errors from `git diff --check`; diff stat contains only NebulaGraph migration, docs, tests, and OpenSpec task updates. + +- [ ] **Step 4: Commit any final task checkbox updates** + +Run: + +```bash +git add openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md +git commit -m "chore: complete NebulaGraph migration task checklist" +``` + +Expected: commit succeeds if there are remaining task checkbox changes; if there are no changes, skip this commit. + +## Self-Review + +**Spec coverage:** The plan covers backend selection, mirror-only behavior, hyperedge-preserving model, canonical IDs, storage contract parity, source linkage, unchanged vector retrieval, offline migration, dual-read validation, quality gates, failure policy, rollback, tests, and documentation. + +**Placeholder scan:** This plan contains no unfinished marker text or unspecified implementation steps. Each code task includes file paths, tests, commands, and expected results. + +**Type consistency:** The plan consistently uses `HypergraphBackendMode`, `NebulaGraphSettings`, `NebulaHypergraphStorage`, `normalize_id_set`, `canonical_entity_vid`, `canonical_hyperedge_vid`, `HypergraphSnapshot`, `ParityReport`, and `RetrievalParityResult`. diff --git a/hyperrag/hyperrag.py b/hyperrag/hyperrag.py index 52b2ed8..1e1b643 100644 --- a/hyperrag/hyperrag.py +++ b/hyperrag/hyperrag.py @@ -25,6 +25,9 @@ HypergraphStorage, ) +from .nebulagraph_config import NebulaGraphSettings +from .nebulagraph_storage import NebulaHypergraphStorage + from .utils import ( EmbeddingFunc, @@ -46,6 +49,19 @@ from .operate import hyper_query_stream, hyper_query_lite_stream, naive_query_stream, llm_query_stream +def resolve_hypergraph_storage_cls(global_config, default_cls): + addon_params = global_config.get("addon_params", {}) + merged_config = dict(global_config) + if isinstance(addon_params, dict): + merged_config.update(addon_params) + + settings = NebulaGraphSettings.from_config(merged_config) + if settings.serving_enabled: + return NebulaHypergraphStorage + + return default_cls + + def always_get_an_event_loop() -> asyncio.AbstractEventLoop: try: return asyncio.get_event_loop() @@ -137,7 +153,10 @@ def __post_init__(self): """ download from hgdb_path """ - self.chunk_entity_relation_hypergraph = self.hypergraph_storage_cls( + resolved_hypergraph_storage_cls = resolve_hypergraph_storage_cls( + asdict(self), self.hypergraph_storage_cls + ) + self.chunk_entity_relation_hypergraph = resolved_hypergraph_storage_cls( namespace="chunk_entity_relation", global_config=asdict(self) ) diff --git a/hyperrag/nebulagraph_client.py b/hyperrag/nebulagraph_client.py new file mode 100644 index 0000000..6a68c79 --- /dev/null +++ b/hyperrag/nebulagraph_client.py @@ -0,0 +1,26 @@ +from dataclasses import dataclass, field +from typing import Protocol, Sequence + + +class NebulaGraphClient(Protocol): + def execute(self, statement: str) -> object: + ... + + def is_available(self) -> bool: + ... + + +@dataclass +class FakeNebulaGraphClient: + statements: list[str] = field(default_factory=list) + available: bool = True + + def execute(self, statement: str) -> object: + self.statements.append(statement) + return [] + + def execute_many(self, statements: Sequence[str]) -> list[object]: + return [self.execute(statement) for statement in statements] + + def is_available(self) -> bool: + return self.available diff --git a/hyperrag/nebulagraph_config.py b/hyperrag/nebulagraph_config.py new file mode 100644 index 0000000..25fcaf4 --- /dev/null +++ b/hyperrag/nebulagraph_config.py @@ -0,0 +1,197 @@ +from __future__ import annotations + +import os +from dataclasses import dataclass +from enum import Enum +from typing import Any + + +class HypergraphBackendMode(str, Enum): + HGDB = "hgdb" + MIRROR_ONLY = "mirror-only" + DUAL_READ = "dual-read" + NEBULAGRAPH_SERVING = "nebulagraph-serving" + + +def resolve_hypergraph_backend_mode(config: dict[str, Any]) -> HypergraphBackendMode: + mode = _get_config_value( + config, + "hypergraph_backend_mode", + "HYPERRAG_HYPERGRAPH_BACKEND_MODE", + HypergraphBackendMode.HGDB.value, + ) + return _coerce_backend_mode(mode) + + +@dataclass(frozen=True) +class NebulaGraphSettings: + mode: HypergraphBackendMode + host: str + port: int + username: str + password: str + space: str + database_name: str + serving_enabled: bool + fallback_to_hgdb: bool + + @classmethod + def from_config(cls, config: dict[str, Any]) -> "NebulaGraphSettings": + mode = resolve_hypergraph_backend_mode(config) + database_name = str( + _get_config_value(config, "database_name", "HYPERRAG_DATABASE_NAME", "") + ) + database_config = _get_database_config(config, database_name) + + serving_enabled = ( + mode == HypergraphBackendMode.NEBULAGRAPH_SERVING + and _coerce_strict_true( + _get_config_value(config, "nebulagraph_validated", "", False) + ) + ) + + return cls( + mode=mode, + host=str( + _get_config_value( + database_config, + "nebulagraph_host", + "", + _get_config_value( + config, + "nebulagraph_host", + "HYPERRAG_NEBULAGRAPH_HOST", + "127.0.0.1", + ), + ) + ), + port=_coerce_port( + _get_config_value( + database_config, + "nebulagraph_port", + "", + _get_config_value( + config, + "nebulagraph_port", + "HYPERRAG_NEBULAGRAPH_PORT", + 9669, + ), + ) + ), + username=str( + _get_config_value( + database_config, + "nebulagraph_username", + "", + _get_config_value( + config, + "nebulagraph_username", + "HYPERRAG_NEBULAGRAPH_USERNAME", + "root", + ), + ) + ), + password=str( + _get_config_value( + database_config, + "nebulagraph_password", + "", + _get_config_value( + config, + "nebulagraph_password", + "HYPERRAG_NEBULAGRAPH_PASSWORD", + "nebula", + ), + ) + ), + space=str( + _get_config_value( + database_config, + "nebulagraph_space", + "", + _get_config_value( + config, + "nebulagraph_space", + "HYPERRAG_NEBULAGRAPH_SPACE", + "hyperrag", + ), + ) + ), + database_name=database_name, + serving_enabled=serving_enabled, + fallback_to_hgdb=_coerce_bool( + _get_config_value( + database_config, + "nebulagraph_fallback_to_hgdb", + "", + _get_config_value( + database_config, + "fallback_to_hgdb", + "", + _get_config_value( + config, + "nebulagraph_fallback_to_hgdb", + "HYPERRAG_NEBULAGRAPH_FALLBACK_TO_HGDB", + _get_config_value( + config, + "fallback_to_hgdb", + "HYPERRAG_FALLBACK_TO_HGDB", + True, + ), + ), + ), + ) + ), + ) + + +def _coerce_backend_mode(value: Any) -> HypergraphBackendMode: + try: + return HypergraphBackendMode(str(value).strip().lower()) + except ValueError: + return HypergraphBackendMode.HGDB + + +def _coerce_port(value: Any) -> int: + try: + return int(value) + except (TypeError, ValueError): + return 9669 + + +def _coerce_bool(value: Any) -> bool: + if isinstance(value, str): + return value.strip().lower() not in {"", "0", "false", "no", "off"} + return bool(value) + + +def _coerce_strict_true(value: Any) -> bool: + if isinstance(value, bool): + return value + if isinstance(value, str): + return value.strip().lower() in {"1", "true", "yes", "on"} + if isinstance(value, int): + return value == 1 + return False + + +def _get_config_value( + config: dict[str, Any], key: str, env_key: str, default: Any +) -> Any: + if key in config and config[key] is not None: + return config[key] + if env_key and env_key in os.environ: + return os.environ[env_key] + return default + + +def _get_database_config(config: dict[str, Any], database_name: str) -> dict[str, Any]: + database_mapping = config.get("nebulagraph_databases", {}) + if not isinstance(database_mapping, dict): + return {} + + database_config = database_mapping.get(database_name, {}) + if not isinstance(database_config, dict): + return {} + + return database_config diff --git a/hyperrag/nebulagraph_ids.py b/hyperrag/nebulagraph_ids.py new file mode 100644 index 0000000..a5b790f --- /dev/null +++ b/hyperrag/nebulagraph_ids.py @@ -0,0 +1,45 @@ +from __future__ import annotations + +from collections.abc import Iterable +from hashlib import sha256 +import json + + +def _canonical_text(value: object) -> str: + if value is None: + raise ValueError("Canonical text must not be empty.") + text = str(value).strip() + if not text: + raise ValueError("Canonical text must not be empty.") + return text + + +def _stable_hash(value: object) -> str: + payload = json.dumps(value, ensure_ascii=False, separators=(",", ":")) + return sha256(payload.encode("utf-8")).hexdigest()[:32] + + +def canonical_entity_vid(database_name: str, entity_name: object) -> str: + scope = _canonical_text(database_name) + name = _canonical_text(entity_name) + return "ent:" + _stable_hash([scope, name]) + + +def normalize_id_set(id_set: Iterable[object]) -> tuple[str, ...]: + normalized = tuple(sorted({_canonical_text(item) for item in id_set})) + if not normalized: + raise ValueError("id_set must contain at least one ID.") + return normalized + + +def canonical_hyperedge_vid(database_name: str, id_set: Iterable[object]) -> str: + scope = _canonical_text(database_name) + member_ids = normalize_id_set(id_set) + return "hedge:" + _stable_hash([scope, member_ids]) + + +__all__ = [ + "canonical_entity_vid", + "canonical_hyperedge_vid", + "normalize_id_set", +] diff --git a/hyperrag/nebulagraph_migration.py b/hyperrag/nebulagraph_migration.py new file mode 100644 index 0000000..b94b7e4 --- /dev/null +++ b/hyperrag/nebulagraph_migration.py @@ -0,0 +1,104 @@ +from __future__ import annotations + +import asyncio +from copy import deepcopy +from dataclasses import dataclass +from pathlib import Path +import pickle +from typing import Any + +from .nebulagraph_ids import normalize_id_set + + +@dataclass(frozen=True) +class HypergraphSnapshot: + vertices: dict[str, dict[str, Any]] + hyperedges: dict[tuple[str, ...], dict[str, Any]] + + +def _copy_payload(value: Any) -> dict[str, Any]: + if value is None: + return {} + if not isinstance(value, dict): + raise TypeError(f"HypergraphDB payload must be a dict, got {type(value).__name__}.") + return deepcopy(value) + + +def _run_sync(coro): + try: + loop = asyncio.get_running_loop() + except RuntimeError: + return asyncio.run(coro) + + coro.close() + if loop.is_running(): + raise RuntimeError( + "migrate_snapshot_to_storage cannot run inside an active event loop; " + "use migrate_snapshot_to_storage_async instead." + ) + return loop.run_until_complete(coro) + + +def load_hgdb_snapshot(hgdb_file: str | Path) -> HypergraphSnapshot: + """Load a local trusted HypergraphDB `.hgdb` pickle snapshot.""" + hgdb_path = Path(hgdb_file) + with hgdb_path.open("rb") as file: + raw_data = pickle.load(file) + + raw_vertices, raw_hyperedges = _extract_hypergraph_payload(raw_data) + + vertices = { + normalize_id_set([vertex_id])[0]: _copy_payload(vertex_data) + for vertex_id, vertex_data in raw_vertices.items() + } + hyperedges = { + normalize_id_set(id_set): _copy_payload(hyperedge_data) + for id_set, hyperedge_data in raw_hyperedges.items() + } + + return HypergraphSnapshot(vertices=vertices, hyperedges=hyperedges) + + +def _extract_hypergraph_payload(raw_data: Any) -> tuple[dict[Any, Any], dict[Any, Any]]: + if isinstance(raw_data, dict): + raw_vertices = raw_data.get("v_data", {}) + raw_hyperedges = raw_data.get("e_data", {}) + elif hasattr(raw_data, "_v_data") and hasattr(raw_data, "_e_data"): + raw_vertices = raw_data._v_data + raw_hyperedges = raw_data._e_data + else: + raise TypeError( + "HypergraphDB file must contain a dict payload or an object with " + f"_v_data/_e_data attributes, got {type(raw_data).__name__}." + ) + + if not isinstance(raw_vertices, dict): + raise TypeError("HypergraphDB v_data must be a dict.") + if not isinstance(raw_hyperedges, dict): + raise TypeError("HypergraphDB e_data must be a dict.") + return raw_vertices, raw_hyperedges + + +async def migrate_snapshot_to_storage_async( + snapshot: HypergraphSnapshot, storage +) -> None: + for vertex_id, vertex_data in snapshot.vertices.items(): + await storage.upsert_vertex(vertex_id, deepcopy(vertex_data)) + + for id_set, hyperedge_data in snapshot.hyperedges.items(): + data = deepcopy(hyperedge_data) + data.setdefault("id_set", list(id_set)) + data.setdefault("arity", len(id_set)) + await storage.upsert_hyperedge(id_set, data) + + +def migrate_snapshot_to_storage(snapshot: HypergraphSnapshot, storage) -> None: + _run_sync(migrate_snapshot_to_storage_async(snapshot, storage)) + + +__all__ = [ + "HypergraphSnapshot", + "load_hgdb_snapshot", + "migrate_snapshot_to_storage", + "migrate_snapshot_to_storage_async", +] diff --git a/hyperrag/nebulagraph_schema.py b/hyperrag/nebulagraph_schema.py new file mode 100644 index 0000000..b52bf24 --- /dev/null +++ b/hyperrag/nebulagraph_schema.py @@ -0,0 +1,36 @@ +REQUIRED_SCHEMA_STATEMENTS = [ + ( + "CREATE TAG IF NOT EXISTS Entity(" + "name string, " + "entity_type string, " + "description string, " + "source_id string, " + "additional_properties string, " + "database_name string" + ")" + ), + ( + "CREATE TAG IF NOT EXISTS Hyperedge(" + "edge_hash string, " + "id_set string, " + "description string, " + "keywords string, " + "weight double, " + "source_id string, " + "arity int, " + "database_name string" + ")" + ), + "CREATE EDGE IF NOT EXISTS MEMBER_OF(database_name string)", + "CREATE EDGE IF NOT EXISTS HAS_MEMBER(database_name string)", +] + + +def schema_statements_for_space(space_name: str) -> list[str]: + trimmed_space_name = str(space_name).strip() + if not trimmed_space_name: + raise ValueError("NebulaGraph space name must not be empty") + if "`" in trimmed_space_name: + raise ValueError("NebulaGraph space name must not contain backticks") + + return [f"USE `{trimmed_space_name}`", *REQUIRED_SCHEMA_STATEMENTS] diff --git a/hyperrag/nebulagraph_storage.py b/hyperrag/nebulagraph_storage.py new file mode 100644 index 0000000..b33b722 --- /dev/null +++ b/hyperrag/nebulagraph_storage.py @@ -0,0 +1,112 @@ +from __future__ import annotations + +from copy import deepcopy +from dataclasses import dataclass, field +from typing import Any, Dict, List, Optional, Set, Tuple, Union + +from .base import BaseHypergraphStorage +from .nebulagraph_ids import normalize_id_set + + +@dataclass +class NebulaHypergraphStorage(BaseHypergraphStorage): + _vertex_data: dict[str, dict] = field(default_factory=dict, init=False) + _hyperedge_data: dict[tuple[str, ...], dict] = field(default_factory=dict, init=False) + + @staticmethod + def _normalize_vertex_id(v_id: Any) -> str: + return normalize_id_set([v_id])[0] + + @staticmethod + def _copy_data(data: Optional[Dict]) -> dict: + return deepcopy(dict(data or {})) + + async def has_vertex(self, v_id: Any) -> bool: + return self._normalize_vertex_id(v_id) in self._vertex_data + + async def has_hyperedge(self, e_tuple: Union[List, Set, Tuple]) -> bool: + return normalize_id_set(e_tuple) in self._hyperedge_data + + async def get_vertex(self, v_id: str, default: Any = None): + v_key = self._normalize_vertex_id(v_id) + if v_key not in self._vertex_data: + return default + return self._copy_data(self._vertex_data[v_key]) + + async def get_hyperedge( + self, e_tuple: Union[List, Set, Tuple], default: Any = None + ): + e_key = normalize_id_set(e_tuple) + if e_key not in self._hyperedge_data: + return default + return self._copy_data(self._hyperedge_data[e_key]) + + async def get_all_vertices(self): + return sorted(self._vertex_data) + + async def get_all_hyperedges(self): + return sorted(self._hyperedge_data) + + async def get_num_of_vertices(self): + return len(self._vertex_data) + + async def get_num_of_hyperedges(self): + return len(self._hyperedge_data) + + async def upsert_vertex(self, v_id: Any, v_data: Optional[Dict] = None): + v_key = self._normalize_vertex_id(v_id) + self._vertex_data[v_key] = self._copy_data(v_data) + return self._copy_data(self._vertex_data[v_key]) + + async def upsert_hyperedge( + self, e_tuple: Union[List, Set, Tuple], e_data: Optional[Dict] = None + ): + e_key = normalize_id_set(e_tuple) + self._hyperedge_data[e_key] = self._copy_data(e_data) + return self._copy_data(self._hyperedge_data[e_key]) + + async def remove_vertex(self, v_id: Any): + v_key = self._normalize_vertex_id(v_id) + removed = self._vertex_data.pop(v_key, None) + incident_edges = [ + e_key for e_key in self._hyperedge_data if v_key in e_key + ] + for e_key in incident_edges: + self._hyperedge_data.pop(e_key, None) + return removed + + async def remove_hyperedge(self, e_tuple: Union[List, Set, Tuple]): + e_key = normalize_id_set(e_tuple) + return self._hyperedge_data.pop(e_key, None) + + async def vertex_degree(self, v_id: Any) -> int: + v_key = self._normalize_vertex_id(v_id) + return sum(1 for e_key in self._hyperedge_data if v_key in e_key) + + async def hyperedge_degree(self, e_tuple: Union[List, Set, Tuple]) -> int: + e_key = normalize_id_set(e_tuple) + if e_key not in self._hyperedge_data: + return 0 + return len(e_key) + + async def get_nbr_e_of_vertex(self, v_id: Any) -> list[tuple[str, ...]]: + v_key = self._normalize_vertex_id(v_id) + return sorted(e_key for e_key in self._hyperedge_data if v_key in e_key) + + async def get_nbr_v_of_hyperedge( + self, e_tuple: Union[List, Set, Tuple] + ) -> list[str]: + e_key = normalize_id_set(e_tuple) + if e_key not in self._hyperedge_data: + return [] + return list(e_key) + + async def get_nbr_v_of_vertex(self, v_id: Any, exclude_self=True) -> list[str]: + v_key = self._normalize_vertex_id(v_id) + neighbors = set() + for e_key in self._hyperedge_data: + if v_key in e_key: + neighbors.update(e_key) + if exclude_self: + neighbors.discard(v_key) + return sorted(neighbors) diff --git a/hyperrag/nebulagraph_validation.py b/hyperrag/nebulagraph_validation.py new file mode 100644 index 0000000..e7d44c6 --- /dev/null +++ b/hyperrag/nebulagraph_validation.py @@ -0,0 +1,140 @@ +from __future__ import annotations + +from collections.abc import Iterable +from dataclasses import dataclass, field +from typing import Any + +from .nebulagraph_ids import normalize_id_set + + +@dataclass +class ParityReport: + passed: bool + failures: list[str] = field(default_factory=list) + + +@dataclass +class RetrievalParityResult: + mode: str + entity_overlap: float + hyperedge_overlap: float + text_unit_overlap: float + context_diff: str + answer_score: float | None + + def passed(self, threshold: float) -> bool: + base_overlaps_pass = ( + self.entity_overlap >= threshold + and self.hyperedge_overlap >= threshold + and self.text_unit_overlap >= threshold + ) + if self.answer_score is None: + return base_overlaps_pass + return base_overlaps_pass and self.answer_score >= threshold + + +def _format_hyperedge(id_set: Iterable[str]) -> str: + return repr(tuple(id_set)) + + +def _sorted_hyperedges(edges: Iterable[Iterable[Any]]) -> list[tuple[str, ...]]: + return sorted(normalize_id_set(edge) for edge in edges) + + +def _sorted_vertices(vertices: Iterable[Any]) -> list[str]: + return sorted({str(vertex).strip() for vertex in vertices if str(vertex).strip()}) + + +async def compare_storage_backends( + left, + right, + sample_vertices: Iterable[str], + sample_hyperedges: Iterable[tuple[str, ...]], +) -> ParityReport: + failures: list[str] = [] + + left_vertex_count = await left.get_num_of_vertices() + right_vertex_count = await right.get_num_of_vertices() + if left_vertex_count != right_vertex_count: + failures.append( + "vertex count mismatch: " + f"left={left_vertex_count} right={right_vertex_count}" + ) + + left_hyperedge_count = await left.get_num_of_hyperedges() + right_hyperedge_count = await right.get_num_of_hyperedges() + if left_hyperedge_count != right_hyperedge_count: + failures.append( + "hyperedge count mismatch: " + f"left={left_hyperedge_count} right={right_hyperedge_count}" + ) + + for vertex_id in sample_vertices: + normalized_vertex_id = normalize_id_set([vertex_id])[0] + + left_vertex = await left.get_vertex(normalized_vertex_id) + right_vertex = await right.get_vertex(normalized_vertex_id) + if left_vertex != right_vertex: + failures.append( + "vertex payload mismatch: " + f"vertex={normalized_vertex_id!r} left={left_vertex!r} right={right_vertex!r}" + ) + + left_degree = await left.vertex_degree(normalized_vertex_id) + right_degree = await right.vertex_degree(normalized_vertex_id) + if left_degree != right_degree: + failures.append( + "vertex degree mismatch: " + f"vertex={normalized_vertex_id!r} left={left_degree} right={right_degree}" + ) + + left_neighbor_edges = _sorted_hyperedges( + await left.get_nbr_e_of_vertex(normalized_vertex_id) + ) + right_neighbor_edges = _sorted_hyperedges( + await right.get_nbr_e_of_vertex(normalized_vertex_id) + ) + if left_neighbor_edges != right_neighbor_edges: + failures.append( + "vertex neighbor hyperedges mismatch: " + f"vertex={normalized_vertex_id!r} " + f"left={left_neighbor_edges!r} right={right_neighbor_edges!r}" + ) + + for sample_hyperedge in sample_hyperedges: + normalized_hyperedge = normalize_id_set(sample_hyperedge) + hyperedge_label = _format_hyperedge(normalized_hyperedge) + + left_hyperedge = await left.get_hyperedge(normalized_hyperedge) + right_hyperedge = await right.get_hyperedge(normalized_hyperedge) + if left_hyperedge != right_hyperedge: + failures.append( + "hyperedge payload mismatch: " + f"id_set={hyperedge_label} left={left_hyperedge!r} right={right_hyperedge!r}" + ) + + left_degree = await left.hyperedge_degree(normalized_hyperedge) + right_degree = await right.hyperedge_degree(normalized_hyperedge) + if left_degree != right_degree: + failures.append( + "hyperedge degree mismatch: " + f"id_set={hyperedge_label} left={left_degree} right={right_degree}" + ) + + left_neighbor_vertices = _sorted_vertices( + await left.get_nbr_v_of_hyperedge(normalized_hyperedge) + ) + right_neighbor_vertices = _sorted_vertices( + await right.get_nbr_v_of_hyperedge(normalized_hyperedge) + ) + if left_neighbor_vertices != right_neighbor_vertices: + failures.append( + "hyperedge neighbors mismatch: " + f"id_set={hyperedge_label} " + f"left={left_neighbor_vertices!r} right={right_neighbor_vertices!r}" + ) + + return ParityReport(passed=not failures, failures=failures) + + +__all__ = ["ParityReport", "RetrievalParityResult", "compare_storage_backends"] diff --git a/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/.openspec.yaml b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/.openspec.yaml new file mode 100644 index 0000000..2cb8041 --- /dev/null +++ b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-06-10 diff --git a/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/design.md b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/design.md new file mode 100644 index 0000000..54ab01a --- /dev/null +++ b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/design.md @@ -0,0 +1,161 @@ +## Context + +Hyper-RAG stores extracted graph knowledge through `BaseHypergraphStorage`. The current implementation, `HypergraphStorage`, persists a local `hypergraph_chunk_entity_relation.hgdb` file backed by `HypergraphDB`. Retrieval code does not query the `.hgdb` file directly; it depends on storage methods such as `get_vertex`, `get_hyperedge`, `get_nbr_e_of_vertex`, `get_nbr_v_of_hyperedge`, `vertex_degree`, and `hyperedge_degree`. + +The query pipeline also depends on separate storage layers: + +- `NanoVectorDBStorage` for semantic retrieval of entities, relationships, and chunks. +- JSON KV files for full documents, text chunks, and LLM cache. +- Hypergraph storage for graph expansion, degree ranking, and source chunk lookup. + +NebulaGraph can replace the hypergraph storage backend, but NebulaGraph uses a property graph model with binary edges. Hyper-RAG hyperedges can connect more than two entities, so a direct pairwise edge conversion would lose high-order relationship semantics and risk retrieval quality regressions. + +## Goals / Non-Goals + +**Goals:** + +- Preserve current query behavior and answer quality while moving hypergraph persistence from local `.hgdb` files to NebulaGraph. +- Keep Hyper-RAG hyperedges as first-class retrievable objects with their existing `id_set`, `description`, `keywords`, `weight`, and `source_id` fields. +- Keep semantic vector retrieval and text chunk storage unchanged during the initial migration. +- Keep prompt construction, query routing, and answer generation unchanged during the initial migration. +- Provide migration tooling from existing `.hgdb` datasets into NebulaGraph. +- Provide mirror-only migration and dual-read validation that compares `.hgdb` and NebulaGraph retrieval outputs before enabling NebulaGraph serving. +- Allow rollback to `.hgdb` storage without changing Web UI or query API contracts. + +**Non-Goals:** + +- Replacing `NanoVectorDBStorage` in the initial migration. +- Changing the public `/hyperrag/query`, `/hyperrag/insert`, file upload, or database selection APIs. +- Rewriting `hyper_query`, `hyper_query_lite`, `graph_query`, or prompt behavior as part of the storage migration. +- Flattening all high-order hyperedges into pairwise graph edges. +- Building a full graph analytics feature set on NebulaGraph before storage parity is proven. +- Adding scoped retrieval, query routing, or graph-native reranking in the initial migration. + +## Decisions + +### Decision: Model hyperedges as NebulaGraph vertices + +Represent each Hyper-RAG hyperedge as a `Hyperedge` vertex and connect entity vertices to it with membership edges. + +```text +(:Entity {name: A}) -- MEMBER_OF --> (:Hyperedge {edge_hash: H}) +(:Entity {name: B}) -- MEMBER_OF --> (:Hyperedge {edge_hash: H}) +(:Entity {name: C}) -- MEMBER_OF --> (:Hyperedge {edge_hash: H}) +``` + +The `Hyperedge` vertex stores the original hyperedge fields: + +- `id_set` +- `description` +- `keywords` +- `weight` +- `source_id` +- `arity` +- `edge_hash` + +Rationale: this preserves high-order relation identity and keeps `get_hyperedge(id_set)` semantically equivalent to the existing `.hgdb` backend. + +Alternative considered: convert each hyperedge into all pairwise entity edges. This was rejected because it loses the fact that multiple entities belong to one shared relationship and can change ranking, context construction, and answer quality. + +### Decision: Use canonical IDs and deterministic ordering + +Entity and Hyperedge IDs must be stable and repeatable across migration runs. Entity VIDs should be derived from a normalized entity name plus database scope. Hyperedge VIDs should be derived from a canonical representation of the sorted `id_set`, plus database scope. + +Adapter outputs must normalize unordered values before returning them to retrieval code: + +- Hyperedge `id_set` values are returned as deterministic tuples. +- Neighbor hyperedge lists are sorted by canonical hyperedge ID. +- Neighbor entity lists are sorted by canonical entity ID. + +Rationale: the current local backend relies heavily on set and tuple semantics. NebulaGraph result ordering can differ, and order drift can change context strings passed to the LLM. Deterministic normalization reduces accidental quality changes. + +### Decision: Implement a storage adapter, not a query rewrite + +Add `NebulaHypergraphStorage` implementing `BaseHypergraphStorage`. `HyperRAG` should be able to select either `HypergraphStorage` or `NebulaHypergraphStorage` through configuration. + +Rationale: the current retrieval pipeline already isolates graph operations behind `BaseHypergraphStorage`. Keeping that contract stable minimizes changes to query behavior and makes parity testing straightforward. + +Alternative considered: modify `hyper_query` and related functions to issue NebulaGraph queries directly. This was rejected because it couples retrieval logic to a specific backend and makes quality regressions harder to isolate. + +### Decision: Keep vector retrieval unchanged during initial migration + +Continue using `NanoVectorDBStorage` for `entities_vdb`, `relationships_vdb`, and `chunks_vdb` during the first NebulaGraph phase. + +Rationale: answer quality is strongly affected by vector recall. Changing graph storage and vector retrieval at the same time would make regressions hard to diagnose. + +Alternative considered: move vectors into NebulaGraph at the same time. This is out of scope for the first migration and can be proposed separately after graph storage parity is proven. + +### Decision: Default to mirror-only until parity is proven + +The initial NebulaGraph mode is mirror-only: data is migrated or mirrored into NebulaGraph, but `.hgdb` remains the serving backend. NebulaGraph becomes serving only through explicit per-database opt-in after validation passes. + +Write behavior is explicit: + +- `hgdb`: existing serving and write behavior. +- `mirror-only`: NebulaGraph receives migrated or mirrored data, while `.hgdb` serves queries. +- `dual-read`: queries compare `.hgdb` and NebulaGraph outputs, while `.hgdb` serves responses. +- `nebulagraph-serving`: NebulaGraph serves graph storage calls after quality gates pass. + +Rationale: the safest way to avoid retrieval regression is to prove storage parity before serving queries from NebulaGraph. + +### Decision: Use dual-read validation before serving from NebulaGraph + +Provide a validation mode that loads the same dataset through both backends and compares storage-level outputs for representative entities, hyperedges, and query expansions. + +The minimum parity checks are: + +- Vertex count and hyperedge count. +- Sampled `get_vertex` and `get_hyperedge` equality. +- `get_nbr_e_of_vertex` equality after normalizing hyperedge IDs. +- `get_nbr_v_of_hyperedge` equality. +- `vertex_degree` and `hyperedge_degree` equality. +- Retrieved entity overlap for fixed question sets. +- Retrieved hyperedge overlap for fixed question sets. +- Retrieved source text unit overlap for fixed question sets. +- Context string diff for fixed question sets. +- Final answer regression score for fixed question sets when an evaluator is configured. + +Rationale: the requirement is to avoid degrading retrieval quality. Dual-read validation gives a concrete gate before changing serving behavior. + +### Decision: Preserve `.hgdb` degree semantics + +`vertex_degree` and `hyperedge_degree` must match the local backend, independent of how many NebulaGraph edges are used internally to model membership. If both `MEMBER_OF` and reverse membership edges exist, degree calculations must avoid double-counting. + +Rationale: degree values are used as `rank` signals in context construction. Any semantic drift changes retrieval ordering and can change answer quality. + +### Decision: Preserve source linkage and dataset metadata + +NebulaGraph records MUST preserve `source_id` exactly as the current backend does. New optional metadata fields such as `database_name`, `collection`, `source_file`, and `doc_id` should be available for future scoped retrieval. + +Rationale: `source_id` is used to return from graph entities/hyperedges to text chunks. Missing or changed `source_id` breaks evidence retrieval. Metadata does not need to affect existing query behavior initially, but it enables safe multi-dataset storage later. + +## Risks / Trade-offs + +- Hyperedge semantics could be lost if modeled as pairwise edges -> Model hyperedges as first-class vertices and validate `id_set` round trips. +- NebulaGraph query latency could exceed local `.hgdb` latency -> Batch vertex/edge reads where possible and benchmark storage calls used by query paths. +- Storage output ordering could differ from `.hgdb` -> Normalize ordering in the adapter for tuple-like return values and validate context overlap rather than raw unordered set order. +- Canonical ID drift could create duplicate logical hyperedges -> Define stable ID generation before migration and make migration idempotent. +- Degree calculations could double-count membership edges -> Implement degree methods from hyperedge membership semantics, not raw graph edge counts. +- Missing `source_id` could reduce answer grounding -> Treat `source_id` preservation as a migration gate. +- Backend configuration mistakes could point different databases at the same graph space -> Require explicit graph space/database mapping and expose backend status in diagnostics. +- NebulaGraph connection, schema, or migration failures could silently degrade results -> Keep `.hgdb` as the active serving backend unless validation explicitly passes and configuration opts in to NebulaGraph serving. +- Dual writes during insertion could create partial writes -> Start with offline migration and mirror-only validation before enabling NebulaGraph serving. + +## Migration Plan + +1. Add configuration for hypergraph backend selection with `.hgdb` as the default and mirror-only NebulaGraph support as non-serving. +2. Define NebulaGraph schema for `Entity`, `Hyperedge`, `MEMBER_OF`, and optional reverse membership edges. +3. Implement offline migration from `hypergraph_chunk_entity_relation.hgdb` to NebulaGraph. +4. Implement `NebulaHypergraphStorage` behind `BaseHypergraphStorage` with canonical IDs, deterministic ordering, and `.hgdb` degree semantics. +5. Add parity tooling that compares `.hgdb` and NebulaGraph outputs for storage operations. +6. Add fixed-question retrieval comparisons for `hyper`, `hyper-lite`, and `graph` modes while keeping vector stores unchanged. +7. Run NebulaGraph in dual-read validation mode while `.hgdb` continues serving. +8. Enable NebulaGraph as an opt-in serving backend for selected datasets only after validation passes. +9. Roll back by switching the backend configuration to `.hgdb`; no data deletion is required. + +## Open Questions + +- Which NebulaGraph deployment target should be supported first: local Docker Compose, external cluster, or both? +- Should the adapter use one graph space per Hyper-RAG database or one shared graph space with a `database_name` property? +- What threshold defines acceptable retrieval parity: exact context match, entity/hyperedge overlap, answer score, or a combination? +- After validation, should insertion use dual-write for rollback safety or NebulaGraph-only writes for operational simplicity? diff --git a/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/proposal.md b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/proposal.md new file mode 100644 index 0000000..f954199 --- /dev/null +++ b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/proposal.md @@ -0,0 +1,42 @@ +## Why + +Hyper-RAG currently persists extracted entities, hyperedges, and adjacency data in local `.hgdb` pickle files. This limits operational visibility, centralized management, and future multi-dataset graph operations as the Web UI moves beyond small local demos. + +NebulaGraph can provide a durable graph backend for the Hyper-RAG knowledge graph, but the migration must preserve the current retrieval behavior and answer quality by keeping hyperedge semantics, source chunk links, and vector retrieval unchanged during the first migration phase. + +## What Changes + +- Add a NebulaGraph-backed hypergraph storage option that implements the existing `BaseHypergraphStorage` contract. +- Represent Hyper-RAG hyperedges as first-class NebulaGraph vertices connected to entity vertices, rather than flattening hyperedges into pairwise edges. +- Preserve existing entity, hyperedge, degree, neighbor, and source lookup behavior expected by `hyper_query`, `hyper_query_lite`, and `graph_query`. +- Keep `NanoVectorDBStorage` and JSON text chunk storage unchanged in the initial migration so semantic retrieval quality can be compared independently from graph storage changes. +- Add an offline migration/export path from existing `hypergraph_chunk_entity_relation.hgdb` files into NebulaGraph. +- Add a dual-read validation mode that compares `.hgdb` and NebulaGraph retrieval outputs before switching query traffic. +- Add configuration to select the hypergraph backend per deployment or database. +- No breaking API changes are intended for Web UI query, insert, file upload, or database selection endpoints. + +## Capabilities + +### New Capabilities +- `nebulagraph-hypergraph-storage`: Store and retrieve Hyper-RAG entity and hyperedge graph data from NebulaGraph while preserving current retrieval semantics and quality validation. + +### Modified Capabilities + +None. + +## Impact + +- Affected storage code: + - `hyperrag/base.py` + - `hyperrag/storage.py` + - HyperRAG initialization paths in `hyperrag/hyperrag.py` and `web-ui/backend/main.py` +- Affected retrieval behavior: + - `hyperrag/operate.py` depends on exact `BaseHypergraphStorage` semantics for neighbor, degree, entity, hyperedge, and source lookups. +- New dependencies: + - NebulaGraph Python client or an equivalent supported client package. + - NebulaGraph connection configuration for host, port, credentials, graph space, and backend selection. +- New operational systems: + - NebulaGraph graph space/schema management. + - Migration and validation tooling for existing `.hgdb` datasets. +- Quality risk: + - Incorrect hyperedge modeling or missing `source_id` preservation can reduce retrieval quality. The migration must include parity checks before enabling NebulaGraph as the serving backend. diff --git a/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/specs/nebulagraph-hypergraph-storage/spec.md b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/specs/nebulagraph-hypergraph-storage/spec.md new file mode 100644 index 0000000..8b24c7c --- /dev/null +++ b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/specs/nebulagraph-hypergraph-storage/spec.md @@ -0,0 +1,165 @@ +## ADDED Requirements + +### Requirement: NebulaGraph backend selection +The system SHALL support selecting NebulaGraph as an alternative hypergraph storage backend while preserving `.hgdb` storage as the default serving backend until NebulaGraph parity is validated and explicitly enabled. + +#### Scenario: Default backend remains local hgdb +- **WHEN** no NebulaGraph backend configuration is provided +- **THEN** the system SHALL continue using the existing `.hgdb` hypergraph storage implementation + +#### Scenario: NebulaGraph backend is configured +- **WHEN** deployment configuration enables NebulaGraph for a database without serving opt-in +- **THEN** the system SHALL use NebulaGraph only for migration, mirror, or validation workflows while `.hgdb` remains the serving backend + +#### Scenario: NebulaGraph serving is explicitly enabled +- **WHEN** deployment configuration selects NebulaGraph serving for a database after quality validation passes +- **THEN** HyperRAG SHALL initialize a NebulaGraph-backed implementation of `BaseHypergraphStorage` for serving that database + +### Requirement: Mirror-only migration mode +The system SHALL provide a mirror-only NebulaGraph mode that stores migrated or mirrored graph data without serving user queries from NebulaGraph. + +#### Scenario: Mirror-only mode receives graph data +- **WHEN** mirror-only mode is active for a database +- **THEN** NebulaGraph SHALL receive migrated or mirrored Entity and Hyperedge records for that database + +#### Scenario: Mirror-only mode preserves serving backend +- **WHEN** mirror-only mode is active for a database +- **THEN** user query execution SHALL continue reading hypergraph data from the `.hgdb` backend + +### Requirement: Hyperedge-preserving graph model +The system SHALL model each Hyper-RAG hyperedge as a first-class NebulaGraph vertex rather than flattening high-order hyperedges into pairwise entity edges. + +#### Scenario: Migrating a high-order hyperedge +- **WHEN** a source hyperedge contains more than two entity IDs +- **THEN** the migration SHALL create one Hyperedge vertex preserving the original `id_set`, `description`, `keywords`, `weight`, and `source_id` + +#### Scenario: Connecting entities to a hyperedge +- **WHEN** a Hyperedge vertex is created +- **THEN** the migration SHALL connect every member entity vertex to the Hyperedge vertex with membership relationships + +### Requirement: Canonical graph identifiers +The system SHALL use stable, deterministic identifiers for NebulaGraph Entity and Hyperedge vertices so repeated migrations and validations address the same logical records. + +#### Scenario: Entity identifier generation +- **WHEN** an entity is written to NebulaGraph +- **THEN** its vertex identifier SHALL be derived from a canonical entity name and database scope + +#### Scenario: Hyperedge identifier generation +- **WHEN** a hyperedge is written to NebulaGraph +- **THEN** its vertex identifier SHALL be derived from a canonical sorted representation of `id_set` and database scope + +#### Scenario: Hyperedge identifier round trip +- **WHEN** retrieval calls `get_hyperedge` with an unordered `id_set` +- **THEN** the NebulaGraph backend SHALL resolve the same Hyperedge vertex as the `.hgdb` backend would resolve after normalizing the `id_set` + +### Requirement: Storage contract parity +The NebulaGraph backend SHALL implement the same externally observable behavior as `BaseHypergraphStorage` for all methods used by Hyper-RAG retrieval. + +#### Scenario: Entity lookup parity +- **WHEN** retrieval calls `get_vertex` for an entity that exists in the migrated dataset +- **THEN** the NebulaGraph backend SHALL return the same entity fields as the `.hgdb` backend, including `entity_type`, `description`, `source_id`, and `additional_properties` + +#### Scenario: Hyperedge lookup parity +- **WHEN** retrieval calls `get_hyperedge` with an `id_set` that exists in the migrated dataset +- **THEN** the NebulaGraph backend SHALL return the same hyperedge fields as the `.hgdb` backend, including `description`, `keywords`, `source_id`, and `weight` + +#### Scenario: Neighbor lookup parity +- **WHEN** retrieval calls `get_nbr_e_of_vertex` or `get_nbr_v_of_hyperedge` +- **THEN** the NebulaGraph backend SHALL return neighbor relationships equivalent to the `.hgdb` backend after normalizing unordered hyperedge ID sets + +#### Scenario: Degree parity +- **WHEN** retrieval calls `vertex_degree` or `hyperedge_degree` +- **THEN** the NebulaGraph backend SHALL return the same degree values as the `.hgdb` backend for the migrated dataset + +#### Scenario: Deterministic output ordering +- **WHEN** the NebulaGraph backend returns entities, hyperedges, or neighbor lists to retrieval code +- **THEN** it SHALL return deterministic normalized ordering so equivalent data produces stable context construction + +#### Scenario: Membership edges do not affect degree semantics +- **WHEN** the NebulaGraph schema uses membership edges to model hyperedges +- **THEN** `vertex_degree` and `hyperedge_degree` SHALL be computed from Hyper-RAG membership semantics and SHALL NOT double-count reverse or auxiliary membership edges + +### Requirement: Source chunk linkage preservation +The migration SHALL preserve source chunk linkage so graph retrieval can resolve entities and hyperedges back to the existing text chunk store. + +#### Scenario: Entity source IDs are migrated +- **WHEN** an entity vertex is migrated to NebulaGraph +- **THEN** its `source_id` value SHALL be preserved exactly as represented in the `.hgdb` backend + +#### Scenario: Hyperedge source IDs are migrated +- **WHEN** a hyperedge is migrated to NebulaGraph +- **THEN** its `source_id` value SHALL be preserved exactly as represented in the `.hgdb` backend + +### Requirement: Vector retrieval remains unchanged +The initial NebulaGraph migration SHALL NOT replace the existing vector storage used for entity, relationship, or chunk semantic retrieval, and SHALL NOT alter prompt construction, query routing, or answer generation. + +#### Scenario: Querying with NebulaGraph backend +- **WHEN** a query runs with NebulaGraph selected as the hypergraph backend +- **THEN** the system SHALL continue using the configured vector stores for `entities_vdb`, `relationships_vdb`, and `chunks_vdb` + +#### Scenario: Query behavior remains storage-semantics preserving +- **WHEN** NebulaGraph is used as the hypergraph backend +- **THEN** the query pipeline SHALL preserve existing vector recall, text chunk lookup, context format, prompt templates, and answer generation behavior + +### Requirement: Offline migration from hgdb +The system SHALL provide a migration path from existing `hypergraph_chunk_entity_relation.hgdb` files into NebulaGraph. + +#### Scenario: Migrating an existing database +- **WHEN** an operator runs the migration for a Hyper-RAG database directory +- **THEN** the system SHALL read the source `.hgdb` file and create equivalent Entity and Hyperedge records in the configured NebulaGraph graph space + +#### Scenario: Migration is repeatable +- **WHEN** an operator reruns migration for the same source `.hgdb` file +- **THEN** the system SHALL upsert existing Entity and Hyperedge records without creating duplicate logical graph records + +### Requirement: Dual-read validation +The system SHALL provide validation that compares `.hgdb` and NebulaGraph outputs before NebulaGraph is used as the serving graph backend. + +#### Scenario: Storage parity validation +- **WHEN** validation runs against a migrated dataset +- **THEN** it SHALL compare vertex counts, hyperedge counts, sampled entity records, sampled hyperedge records, neighbor lookups, and degree values between both backends + +#### Scenario: Retrieval parity validation +- **WHEN** validation runs against a fixed question set +- **THEN** it SHALL compare retrieved entities, hyperedges, and source text units for `hyper`, `hyper-lite`, and `graph` query modes while keeping vector stores unchanged + +#### Scenario: Context parity validation +- **WHEN** validation runs against a fixed question set +- **THEN** it SHALL report context string differences between `.hgdb` and NebulaGraph retrieval outputs after deterministic normalization + +#### Scenario: Answer regression validation +- **WHEN** an answer evaluator is configured for a fixed question set +- **THEN** validation SHALL report final answer regression scores for `.hgdb` and NebulaGraph-backed retrieval + +### Requirement: Quality gate before serving +The system SHALL require an explicit quality gate before NebulaGraph can be enabled as the serving backend for a dataset. + +#### Scenario: Validation passes +- **WHEN** storage parity and retrieval parity meet configured acceptance thresholds +- **THEN** the operator SHALL be able to enable NebulaGraph as the serving hypergraph backend for that dataset + +#### Scenario: Validation fails +- **WHEN** storage parity or retrieval parity fails configured acceptance thresholds +- **THEN** the system SHALL keep the `.hgdb` backend active for serving and report the failed checks + +### Requirement: Failure policy preserves serving quality +The system SHALL avoid silently serving from NebulaGraph when NebulaGraph is unavailable, misconfigured, or not validated. + +#### Scenario: NebulaGraph connection fails before serving opt-in +- **WHEN** NebulaGraph is configured for mirror or validation and the connection fails +- **THEN** the system SHALL keep `.hgdb` as the serving backend and report the NebulaGraph failure + +#### Scenario: NebulaGraph schema is incomplete +- **WHEN** required NebulaGraph tags, edge types, or indexes are missing +- **THEN** migration or validation SHALL fail with a diagnostic message and SHALL NOT enable NebulaGraph serving + +#### Scenario: Partial migration is detected +- **WHEN** validation detects missing or mismatched entities, hyperedges, source IDs, neighbors, or degrees +- **THEN** NebulaGraph serving SHALL remain disabled for that dataset + +### Requirement: Rollback without API changes +The system SHALL allow rollback from NebulaGraph to `.hgdb` storage without changing public query, insert, upload, or database selection APIs. + +#### Scenario: Backend rollback +- **WHEN** an operator switches backend configuration from NebulaGraph to `.hgdb` +- **THEN** existing Web UI and API clients SHALL continue using the same request and response contracts diff --git a/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md new file mode 100644 index 0000000..3ea2f24 --- /dev/null +++ b/openspec/changes/migrate-hypergraph-storage-to-nebulagraph/tasks.md @@ -0,0 +1,67 @@ +## 1. Configuration And Schema + +- [x] 1.1 Add hypergraph backend configuration with `.hgdb` as the default backend. +- [x] 1.2 Add NebulaGraph connection settings for host, port, credentials, graph space, and per-database mapping. +- [x] 1.3 Define NebulaGraph schema for Entity vertices, Hyperedge vertices, membership edges, and optional reverse membership edges. +- [ ] 1.4 Add a schema initialization/check command that verifies required tags, edge types, and indexes exist before migration. +- [x] 1.5 Add explicit backend modes for `hgdb`, `mirror-only`, `dual-read`, and `nebulagraph-serving`. +- [x] 1.6 Define failure policy defaults so `.hgdb` remains serving when NebulaGraph is unavailable, unvalidated, or misconfigured. + +## 2. Migration Tooling + +- [x] 2.1 Implement a reader that loads existing `hypergraph_chunk_entity_relation.hgdb` data through the current HypergraphDB format. +- [x] 2.2 Generate stable Entity vertex IDs from canonical entity names and database scope. +- [x] 2.3 Generate stable Hyperedge vertex IDs from normalized `id_set` values and database scope. +- [ ] 2.4 Upsert Entity vertices while preserving `entity_type`, `description`, `source_id`, and `additional_properties`. +- [ ] 2.5 Upsert Hyperedge vertices while preserving `id_set`, `description`, `keywords`, `weight`, `source_id`, and arity. +- [ ] 2.6 Upsert membership relationships between each Entity vertex and its Hyperedge vertex. +- [x] 2.7 Make migration repeatable without creating duplicate logical entities or hyperedges. +- [ ] 2.8 Add mirror-only migration execution that writes NebulaGraph data while leaving `.hgdb` as the serving backend. + +## 3. NebulaGraph Storage Adapter + +- [x] 3.1 Add `NebulaHypergraphStorage` implementing the full `BaseHypergraphStorage` interface. +- [x] 3.2 Implement vertex existence, lookup, upsert, removal, count, and listing behavior. +- [x] 3.3 Implement hyperedge existence, lookup, upsert, removal, count, and listing behavior using Hyperedge vertices. +- [x] 3.4 Implement `get_nbr_e_of_vertex`, `get_nbr_v_of_hyperedge`, and `get_nbr_v_of_vertex` with outputs normalized to match `.hgdb` semantics. +- [x] 3.5 Implement `vertex_degree` and `hyperedge_degree` with parity against the current backend. +- [ ] 3.6 Add batching for hot retrieval paths used by `hyper_query`, `hyper_query_lite`, and `graph_query`. +- [x] 3.7 Normalize returned hyperedge tuples, neighbor lists, and entity lists into deterministic ordering before returning to retrieval code. +- [x] 3.8 Ensure degree calculations do not double-count reverse or auxiliary membership edges. + +## 4. Integration + +- [x] 4.1 Wire backend selection into `HyperRAG` initialization without changing public query or insert APIs. +- [x] 4.2 Wire Web UI backend initialization to select `.hgdb` or NebulaGraph per configured database. +- [x] 4.3 Keep `NanoVectorDBStorage` and text chunk JSON storage unchanged for the initial migration. +- [ ] 4.4 Add diagnostics that expose selected hypergraph backend and NebulaGraph connection status. +- [x] 4.5 Ensure mirror-only and dual-read modes never serve user-facing query responses from NebulaGraph. +- [ ] 4.6 Verify prompt construction, query routing, vector recall, text chunk lookup, and answer generation remain unchanged. + +## 5. Dual-Read Validation + +- [x] 5.1 Add storage parity checks for vertex count, hyperedge count, sampled entity records, sampled hyperedge records, neighbor lookups, and degree values. +- [ ] 5.2 Add retrieval parity checks for fixed question sets across `hyper`, `hyper-lite`, and `graph` modes. +- [ ] 5.3 Compare retrieved entities, hyperedges, and source text units while keeping vector stores unchanged. +- [x] 5.4 Report failed parity checks with enough detail to identify missing records, changed source IDs, or changed neighbor sets. +- [ ] 5.5 Add context string diff reporting after deterministic normalization. +- [ ] 5.6 Add optional final answer regression scoring when an evaluator is configured. +- [ ] 5.7 Add validation checks for NebulaGraph schema completeness and migration completeness. + +## 6. Quality Gate And Rollback + +- [ ] 6.1 Define configurable acceptance thresholds for storage parity and retrieval parity. +- [ ] 6.2 Block NebulaGraph serving when parity checks fail. +- [ ] 6.3 Add an opt-in switch to enable NebulaGraph serving only after validation passes. +- [ ] 6.4 Verify rollback by switching configuration back to `.hgdb` without changing Web UI or API request/response contracts. +- [ ] 6.5 Verify `.hgdb` remains serving when NebulaGraph connection, schema validation, or migration validation fails. + +## 7. Tests And Documentation + +- [x] 7.1 Add unit tests for entity ID normalization, hyperedge ID normalization, and high-order hyperedge round trips. +- [x] 7.2 Add adapter tests covering every `BaseHypergraphStorage` method used by retrieval. +- [x] 7.3 Add migration tests using a small `.hgdb` fixture with both pairwise and high-order hyperedges. +- [ ] 7.4 Add integration tests comparing `.hgdb` and NebulaGraph retrieval outputs for a fixed fixture dataset. +- [x] 7.5 Add tests that prove mirror-only and dual-read modes keep `.hgdb` as the serving backend. +- [x] 7.6 Add tests for deterministic output ordering and degree semantics. +- [x] 7.7 Document NebulaGraph setup, schema initialization, mirror-only migration, validation, enablement, failure policy, and rollback steps. diff --git a/openspec/config.yaml b/openspec/config.yaml new file mode 100644 index 0000000..392946c --- /dev/null +++ b/openspec/config.yaml @@ -0,0 +1,20 @@ +schema: spec-driven + +# Project context (optional) +# This is shown to AI when creating artifacts. +# Add your tech stack, conventions, style guides, domain knowledge, etc. +# Example: +# context: | +# Tech stack: TypeScript, React, Node.js +# We use conventional commits +# Domain: e-commerce platform + +# Per-artifact rules (optional) +# Add custom rules for specific artifacts. +# Example: +# rules: +# proposal: +# - Keep proposals under 500 words +# - Always include a "Non-goals" section +# tasks: +# - Break tasks into chunks of max 2 hours diff --git a/requirements.txt b/requirements.txt index d3ca5c5..6c7caa8 100644 --- a/requirements.txt +++ b/requirements.txt @@ -3,6 +3,7 @@ aioboto3 aiohttp numpy nano-vectordb +nebula3-python openai tenacity tiktoken diff --git a/scripts/hyperrag_nebulagraph.py b/scripts/hyperrag_nebulagraph.py new file mode 100755 index 0000000..a949670 --- /dev/null +++ b/scripts/hyperrag_nebulagraph.py @@ -0,0 +1,79 @@ +#!/usr/bin/env python3 +import argparse +import importlib.util +from pathlib import Path +import sys +from typing import Callable + + +def _load_schema_statements_for_space() -> Callable[[str], list[str]]: + try: + from hyperrag.nebulagraph_schema import schema_statements_for_space + + return schema_statements_for_space + except ModuleNotFoundError: + pass + + module_path = ( + Path(__file__).resolve().parents[1] / "hyperrag" / "nebulagraph_schema.py" + ) + spec = importlib.util.spec_from_file_location( + "hyperrag.nebulagraph_schema", + module_path, + ) + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module.schema_statements_for_space + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(prog="hyperrag-nebulagraph") + subparsers = parser.add_subparsers(dest="command", required=True) + + schema_check_parser = subparsers.add_parser( + "schema-check", + help="Print NebulaGraph schema check statements for a graph space.", + ) + schema_check_parser.add_argument("--space", required=True) + + migrate_parser = subparsers.add_parser( + "migrate", + help="Placeholder for future .hgdb to NebulaGraph migration wiring.", + ) + migrate_parser.add_argument("--hgdb", required=True) + migrate_parser.add_argument("--database", required=True) + + validate_parser = subparsers.add_parser( + "validate", + help="Placeholder for future NebulaGraph validation wiring.", + ) + validate_parser.add_argument("--hgdb", required=True) + validate_parser.add_argument("--database", required=True) + + return parser + + +def main(argv: list[str] | None = None) -> int: + parser = build_parser() + args = parser.parse_args(argv) + + if args.command == "schema-check": + schema_statements_for_space = _load_schema_statements_for_space() + try: + statements = schema_statements_for_space(args.space) + except ValueError as exc: + parser.error(str(exc)) + for statement in statements: + print(statement) + return 0 + + if args.command in {"migrate", "validate"}: + parser.error(f"Command {args.command!r} requires implementation wiring") + + parser.error(f"Unknown command {args.command!r}") + return 2 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/test_nebulagraph_backend_selection.py b/tests/test_nebulagraph_backend_selection.py new file mode 100644 index 0000000..713c827 --- /dev/null +++ b/tests/test_nebulagraph_backend_selection.py @@ -0,0 +1,190 @@ +import importlib.util +import os +from pathlib import Path +import sys +import types +import unittest +from unittest.mock import patch + + +PACKAGE_ROOT = Path(__file__).resolve().parents[1] / "hyperrag" + + +def _install_stub_modules(): + package = types.ModuleType("hyperrag") + package.__path__ = [str(PACKAGE_ROOT)] + sys.modules["hyperrag"] = package + + utils = types.ModuleType("hyperrag.utils") + + class EmbeddingFunc: + pass + + utils.EmbeddingFunc = EmbeddingFunc + utils.compute_mdhash_id = lambda content, prefix="": prefix + content + utils.limit_async_func_call = lambda _max_async: lambda func: func + utils.limit_async_gen_call = lambda _max_async: lambda func: func + utils.convert_response_to_json = lambda response: response + utils.set_logger = lambda _log_file: None + + class Logger: + level = "INFO" + + def info(self, *_args, **_kwargs): + pass + + def debug(self, *_args, **_kwargs): + pass + + def warning(self, *_args, **_kwargs): + pass + + def setLevel(self, level): + self.level = level + + utils.logger = Logger() + sys.modules["hyperrag.utils"] = utils + + operate = types.ModuleType("hyperrag.operate") + for name in ( + "chunking_by_token_size", + "extract_entities", + "hyper_query_lite", + "hyper_query", + "naive_query", + "graph_query", + "llm_query", + "hyper_query_stream", + "hyper_query_lite_stream", + "naive_query_stream", + "llm_query_stream", + ): + setattr(operate, name, lambda *args, **kwargs: None) + sys.modules["hyperrag.operate"] = operate + + llm = types.ModuleType("hyperrag.llm") + llm.gpt_4o_mini_complete = lambda *args, **kwargs: None + llm.openai_embedding = lambda *args, **kwargs: None + sys.modules["hyperrag.llm"] = llm + + storage = types.ModuleType("hyperrag.storage") + + class JsonKVStorage: + pass + + class NanoVectorDBStorage: + pass + + class HypergraphStorage: + pass + + storage.JsonKVStorage = JsonKVStorage + storage.NanoVectorDBStorage = NanoVectorDBStorage + storage.HypergraphStorage = HypergraphStorage + sys.modules["hyperrag.storage"] = storage + + return HypergraphStorage + + +def _load_module(module_name, relative_path): + module_path = PACKAGE_ROOT / relative_path + spec = importlib.util.spec_from_file_location(module_name, module_path) + module = importlib.util.module_from_spec(spec) + sys.modules[module_name] = module + spec.loader.exec_module(module) + return module + + +HypergraphStorage = _install_stub_modules() +_load_module("hyperrag.base", "base.py") +_load_module("hyperrag.nebulagraph_ids", "nebulagraph_ids.py") +nebulagraph_config = _load_module( + "hyperrag.nebulagraph_config", "nebulagraph_config.py" +) +nebulagraph_storage = _load_module( + "hyperrag.nebulagraph_storage", "nebulagraph_storage.py" +) +hyperrag_module = _load_module("hyperrag.hyperrag", "hyperrag.py") + +HypergraphBackendMode = nebulagraph_config.HypergraphBackendMode +NebulaGraphSettings = nebulagraph_config.NebulaGraphSettings +NebulaHypergraphStorage = nebulagraph_storage.NebulaHypergraphStorage +resolve_hypergraph_storage_cls = hyperrag_module.resolve_hypergraph_storage_cls + + +class DefaultStorage: + pass + + +class NebulaGraphBackendSelectionTest(unittest.TestCase): + def test_nebulagraph_serving_without_validation_is_not_serving(self): + with patch.dict(os.environ, {}, clear=True): + settings = NebulaGraphSettings.from_config( + { + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": False, + } + ) + + self.assertEqual(HypergraphBackendMode.NEBULAGRAPH_SERVING, settings.mode) + self.assertFalse(settings.serving_enabled) + + def test_nebulagraph_serving_with_validation_is_serving(self): + with patch.dict(os.environ, {}, clear=True): + settings = NebulaGraphSettings.from_config( + { + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": True, + } + ) + + self.assertEqual(HypergraphBackendMode.NEBULAGRAPH_SERVING, settings.mode) + self.assertTrue(settings.serving_enabled) + + def test_mirror_only_resolves_to_default_storage(self): + with patch.dict(os.environ, {}, clear=True): + storage_cls = resolve_hypergraph_storage_cls( + {"hypergraph_backend_mode": "mirror-only"}, HypergraphStorage + ) + + self.assertIs(HypergraphStorage, storage_cls) + + def test_dual_read_resolves_to_default_storage(self): + with patch.dict(os.environ, {}, clear=True): + storage_cls = resolve_hypergraph_storage_cls( + {"hypergraph_backend_mode": "dual-read"}, DefaultStorage + ) + + self.assertIs(DefaultStorage, storage_cls) + + def test_validated_nebulagraph_serving_resolves_to_nebula_storage(self): + with patch.dict(os.environ, {}, clear=True): + storage_cls = resolve_hypergraph_storage_cls( + { + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": True, + }, + DefaultStorage, + ) + + self.assertIs(NebulaHypergraphStorage, storage_cls) + + def test_addon_params_override_base_config_to_enable_serving(self): + with patch.dict(os.environ, {}, clear=True): + storage_cls = resolve_hypergraph_storage_cls( + { + "hypergraph_backend_mode": "mirror-only", + "nebulagraph_validated": False, + "addon_params": { + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": True, + }, + }, + DefaultStorage, + ) + + self.assertIs(NebulaHypergraphStorage, storage_cls) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_cli.py b/tests/test_nebulagraph_cli.py new file mode 100644 index 0000000..496484f --- /dev/null +++ b/tests/test_nebulagraph_cli.py @@ -0,0 +1,97 @@ +import contextlib +import importlib.util +import io +import os +from pathlib import Path +import sys +import unittest + + +MODULE_PATH = Path(__file__).resolve().parents[1] / "scripts" / "hyperrag_nebulagraph.py" +spec = importlib.util.spec_from_file_location("hyperrag_nebulagraph", MODULE_PATH) +hyperrag_nebulagraph = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = hyperrag_nebulagraph +spec.loader.exec_module(hyperrag_nebulagraph) + + +class NebulaGraphCliTest(unittest.TestCase): + def test_build_parser_parses_schema_check_space(self): + parser = hyperrag_nebulagraph.build_parser() + + args = parser.parse_args(["schema-check", "--space", "hyperrag"]) + + self.assertEqual("schema-check", args.command) + self.assertEqual("hyperrag", args.space) + + def test_main_schema_check_prints_schema_statements(self): + output = io.StringIO() + + with contextlib.redirect_stdout(output): + result = hyperrag_nebulagraph.main(["schema-check", "--space", "hyperrag"]) + + self.assertEqual(0, result) + lines = output.getvalue().splitlines() + self.assertEqual("USE `hyperrag`", lines[0]) + self.assertTrue( + any("CREATE TAG IF NOT EXISTS Entity" in line for line in lines) + ) + self.assertTrue( + any("CREATE TAG IF NOT EXISTS Hyperedge" in line for line in lines) + ) + self.assertTrue( + any("CREATE EDGE IF NOT EXISTS MEMBER_OF" in line for line in lines) + ) + + def test_script_is_directly_executable(self): + self.assertTrue(os.access(MODULE_PATH, os.X_OK)) + + def test_main_schema_check_reports_invalid_space_as_parser_error(self): + with contextlib.redirect_stderr(io.StringIO()): + with self.assertRaises(SystemExit) as captured: + hyperrag_nebulagraph.main(["schema-check", "--space", "bad`space"]) + + self.assertEqual(2, captured.exception.code) + + def test_parser_exposes_migrate_arguments(self): + parser = hyperrag_nebulagraph.build_parser() + + args = parser.parse_args( + ["migrate", "--hgdb", "graph.hgdb", "--database", "default"] + ) + + self.assertEqual("migrate", args.command) + self.assertEqual("graph.hgdb", args.hgdb) + self.assertEqual("default", args.database) + + def test_parser_exposes_validate_arguments(self): + parser = hyperrag_nebulagraph.build_parser() + + args = parser.parse_args( + ["validate", "--hgdb", "graph.hgdb", "--database", "default"] + ) + + self.assertEqual("validate", args.command) + self.assertEqual("graph.hgdb", args.hgdb) + self.assertEqual("default", args.database) + + def test_main_migrate_requires_implementation_wiring(self): + with contextlib.redirect_stderr(io.StringIO()): + with self.assertRaises(SystemExit) as captured: + hyperrag_nebulagraph.main( + ["migrate", "--hgdb", "graph.hgdb", "--database", "default"] + ) + + self.assertEqual(2, captured.exception.code) + + def test_main_validate_requires_implementation_wiring(self): + with contextlib.redirect_stderr(io.StringIO()): + with self.assertRaises(SystemExit) as captured: + hyperrag_nebulagraph.main( + ["validate", "--hgdb", "graph.hgdb", "--database", "default"] + ) + + self.assertEqual(2, captured.exception.code) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_client.py b/tests/test_nebulagraph_client.py new file mode 100644 index 0000000..ac87c95 --- /dev/null +++ b/tests/test_nebulagraph_client.py @@ -0,0 +1,56 @@ +import importlib.util +from pathlib import Path +import sys +import unittest + + +MODULE_PATH = Path(__file__).resolve().parents[1] / "hyperrag" / "nebulagraph_client.py" +spec = importlib.util.spec_from_file_location("nebulagraph_client", MODULE_PATH) +nebulagraph_client = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = nebulagraph_client +spec.loader.exec_module(nebulagraph_client) + +FakeNebulaGraphClient = nebulagraph_client.FakeNebulaGraphClient + + +class FakeNebulaGraphClientTest(unittest.TestCase): + def test_execute_records_statement(self): + client = FakeNebulaGraphClient() + + result = client.execute("CREATE TAG Entity()") + + self.assertEqual(["CREATE TAG Entity()"], client.statements) + self.assertEqual([], result) + + def test_is_available_defaults_true(self): + client = FakeNebulaGraphClient() + + self.assertTrue(client.is_available()) + + def test_execute_many_runs_statements_in_order(self): + client = FakeNebulaGraphClient() + + results = client.execute_many( + [ + "CREATE TAG Entity()", + "CREATE EDGE MEMBER_OF()", + ] + ) + + self.assertEqual( + [ + "CREATE TAG Entity()", + "CREATE EDGE MEMBER_OF()", + ], + client.statements, + ) + self.assertEqual([[], []], results) + + def test_unavailable_fake_returns_false(self): + client = FakeNebulaGraphClient(available=False) + + self.assertFalse(client.is_available()) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_config.py b/tests/test_nebulagraph_config.py new file mode 100644 index 0000000..bf7f62f --- /dev/null +++ b/tests/test_nebulagraph_config.py @@ -0,0 +1,92 @@ +import importlib.util +import os +from pathlib import Path +import sys +import unittest +from unittest.mock import patch + +MODULE_PATH = Path(__file__).resolve().parents[1] / "hyperrag" / "nebulagraph_config.py" +spec = importlib.util.spec_from_file_location("nebulagraph_config", MODULE_PATH) +nebulagraph_config = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = nebulagraph_config +spec.loader.exec_module(nebulagraph_config) + +HypergraphBackendMode = nebulagraph_config.HypergraphBackendMode +NebulaGraphSettings = nebulagraph_config.NebulaGraphSettings +resolve_hypergraph_backend_mode = nebulagraph_config.resolve_hypergraph_backend_mode + + +class NebulaGraphConfigTest(unittest.TestCase): + def test_default_backend_is_hgdb(self): + with patch.dict(os.environ, {}, clear=True): + mode = resolve_hypergraph_backend_mode({}) + + self.assertEqual(HypergraphBackendMode.HGDB, mode) + + def test_mirror_only_mode_from_global_config(self): + with patch.dict(os.environ, {}, clear=True): + mode = resolve_hypergraph_backend_mode( + {"hypergraph_backend_mode": "mirror-only"} + ) + + self.assertEqual(HypergraphBackendMode.MIRROR_ONLY, mode) + + def test_invalid_mode_falls_back_to_hgdb(self): + with patch.dict(os.environ, {}, clear=True): + mode = resolve_hypergraph_backend_mode( + {"hypergraph_backend_mode": "not-a-backend"} + ) + + self.assertEqual(HypergraphBackendMode.HGDB, mode) + + def test_mode_is_trimmed_and_case_insensitive(self): + with patch.dict(os.environ, {}, clear=True): + mode = resolve_hypergraph_backend_mode( + {"hypergraph_backend_mode": " DUAL-READ "} + ) + + self.assertEqual(HypergraphBackendMode.DUAL_READ, mode) + + def test_settings_default_to_not_serving_and_fallback_to_hgdb(self): + with patch.dict(os.environ, {}, clear=True): + settings = NebulaGraphSettings.from_config({}) + + self.assertEqual(HypergraphBackendMode.HGDB, settings.mode) + self.assertFalse(settings.serving_enabled) + self.assertTrue(settings.fallback_to_hgdb) + + def test_nebulagraph_fallback_key_disables_fallback(self): + with patch.dict(os.environ, {}, clear=True): + settings = NebulaGraphSettings.from_config( + {"nebulagraph_fallback_to_hgdb": "false"} + ) + + self.assertFalse(settings.fallback_to_hgdb) + + def test_serving_requires_explicit_validated_true(self): + with patch.dict(os.environ, {}, clear=True): + settings = NebulaGraphSettings.from_config( + { + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": "true", + } + ) + + self.assertTrue(settings.serving_enabled) + + def test_serving_rejects_non_true_validation_states(self): + for validation_state in ("pending", "failed", "invalid", "false", ""): + with self.subTest(validation_state=validation_state): + with patch.dict(os.environ, {}, clear=True): + settings = NebulaGraphSettings.from_config( + { + "hypergraph_backend_mode": "nebulagraph-serving", + "nebulagraph_validated": validation_state, + } + ) + + self.assertFalse(settings.serving_enabled) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_ids.py b/tests/test_nebulagraph_ids.py new file mode 100644 index 0000000..4ee9522 --- /dev/null +++ b/tests/test_nebulagraph_ids.py @@ -0,0 +1,72 @@ +import importlib.util +from pathlib import Path +import sys +import unittest + + +MODULE_PATH = Path(__file__).resolve().parents[1] / "hyperrag" / "nebulagraph_ids.py" +spec = importlib.util.spec_from_file_location("nebulagraph_ids", MODULE_PATH) +nebulagraph_ids = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = nebulagraph_ids +spec.loader.exec_module(nebulagraph_ids) + +canonical_entity_vid = nebulagraph_ids.canonical_entity_vid +canonical_hyperedge_vid = nebulagraph_ids.canonical_hyperedge_vid +normalize_id_set = nebulagraph_ids.normalize_id_set + + +class NebulaGraphIdsTest(unittest.TestCase): + def test_entity_vid_trims_entity_name(self): + self.assertEqual( + canonical_entity_vid("demo", " Entity A "), + canonical_entity_vid("demo", "Entity A"), + ) + + def test_entity_vid_is_scoped_by_database_name(self): + self.assertNotEqual( + canonical_entity_vid("demo-a", "Entity A"), + canonical_entity_vid("demo-b", "Entity A"), + ) + + def test_hyperedge_vid_is_order_independent(self): + self.assertEqual( + canonical_hyperedge_vid("demo", ["B", "A", "C"]), + canonical_hyperedge_vid("demo", ["C", "B", "A"]), + ) + + def test_normalize_id_set_deduplicates_and_sorts(self): + self.assertEqual(("A", "B"), normalize_id_set(["B", "A", "B"])) + + def test_normalize_id_set_rejects_empty_id_set(self): + with self.assertRaises(ValueError): + normalize_id_set([]) + + def test_high_order_hyperedge_vid_is_deterministic(self): + first_vid = canonical_hyperedge_vid("demo", ["D", "B", "A", "C"]) + second_vid = canonical_hyperedge_vid("demo", ["C", "A", "D", "B"]) + + self.assertEqual(first_vid, second_vid) + self.assertTrue(first_vid.startswith("hedge:")) + + def test_entity_vid_payload_serialization_is_unambiguous(self): + self.assertNotEqual( + canonical_entity_vid("a", "b\x1fc"), + canonical_entity_vid("a\x1fb", "c"), + ) + + def test_hyperedge_vid_payload_serialization_is_unambiguous(self): + self.assertNotEqual( + canonical_hyperedge_vid("demo", ["A\x1eB"]), + canonical_hyperedge_vid("demo", ["A", "B"]), + ) + + def test_none_identifier_parts_are_rejected(self): + with self.assertRaises(ValueError): + canonical_entity_vid("demo", None) + + with self.assertRaises(ValueError): + normalize_id_set(["A", None]) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_migration.py b/tests/test_nebulagraph_migration.py new file mode 100644 index 0000000..4b6f8d5 --- /dev/null +++ b/tests/test_nebulagraph_migration.py @@ -0,0 +1,232 @@ +import importlib.util +import asyncio +from pathlib import Path +import pickle +import sys +import tempfile +import types +import unittest + + +PROJECT_ROOT = Path(__file__).resolve().parents[1] +PACKAGE_ROOT = PROJECT_ROOT / "hyperrag" + +if "hyperrag" not in sys.modules: + package = types.ModuleType("hyperrag") + package.__path__ = [str(PACKAGE_ROOT)] + sys.modules["hyperrag"] = package + +if "hyperrag.utils" not in sys.modules: + utils = types.ModuleType("hyperrag.utils") + utils.EmbeddingFunc = object + sys.modules["hyperrag.utils"] = utils + + +def _load_module(module_name, relative_path): + module_path = PACKAGE_ROOT / relative_path + spec = importlib.util.spec_from_file_location(module_name, module_path) + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +nebulagraph_migration = _load_module( + "hyperrag.nebulagraph_migration", "nebulagraph_migration.py" +) +nebulagraph_storage = _load_module( + "hyperrag.nebulagraph_storage", "nebulagraph_storage.py" +) + +load_hgdb_snapshot = nebulagraph_migration.load_hgdb_snapshot +migrate_snapshot_to_storage = nebulagraph_migration.migrate_snapshot_to_storage +migrate_snapshot_to_storage_async = ( + nebulagraph_migration.migrate_snapshot_to_storage_async +) +NebulaHypergraphStorage = nebulagraph_storage.NebulaHypergraphStorage + + +class PickledHypergraphObject: + def __init__(self, vertices, hyperedges): + self._v_data = vertices + self._e_data = hyperedges + self._v_inci = {} + + +class MutatingStorage: + async def upsert_vertex(self, vertex_id, vertex_data): + vertex_data["description"] = "mutated by storage" + + async def upsert_hyperedge(self, id_set, hyperedge_data): + hyperedge_data["weight"] = 999 + + +class NebulaGraphMigrationTest(unittest.TestCase): + def _write_hgdb_fixture(self, payload): + temp_dir = tempfile.TemporaryDirectory() + hgdb_file = Path(temp_dir.name) / "hypergraph_chunk_entity_relation.hgdb" + with hgdb_file.open("wb") as file: + pickle.dump(payload, file) + self.addCleanup(temp_dir.cleanup) + return hgdb_file + + def test_load_hgdb_snapshot_preserves_pairwise_payloads(self): + hgdb_file = self._write_hgdb_fixture( + { + "v_data": { + "A": { + "entity_type": "Person", + "description": "Alice", + "source_id": "chunk-a", + }, + "B": { + "entity_type": "Place", + "description": "Berlin", + "source_id": "chunk-b", + }, + }, + "e_data": { + ("B", "A"): { + "description": "Alice visited Berlin", + "source_id": "chunk-ab", + "weight": 0.75, + } + }, + "v_inci": {}, + } + ) + + snapshot = load_hgdb_snapshot(hgdb_file) + + self.assertEqual({"A", "B"}, set(snapshot.vertices)) + self.assertEqual("Alice", snapshot.vertices["A"]["description"]) + self.assertEqual("chunk-b", snapshot.vertices["B"]["source_id"]) + self.assertEqual({("A", "B")}, set(snapshot.hyperedges)) + self.assertEqual( + "Alice visited Berlin", + snapshot.hyperedges[("A", "B")]["description"], + ) + self.assertEqual("chunk-ab", snapshot.hyperedges[("A", "B")]["source_id"]) + self.assertEqual(0.75, snapshot.hyperedges[("A", "B")]["weight"]) + + def test_load_hgdb_snapshot_supports_pickled_hypergraph_object(self): + hgdb_file = self._write_hgdb_fixture( + PickledHypergraphObject( + vertices={"A": {"description": "Alice"}}, + hyperedges={("A", "B"): {"weight": 1.0}}, + ) + ) + + snapshot = load_hgdb_snapshot(hgdb_file) + + self.assertEqual("Alice", snapshot.vertices["A"]["description"]) + self.assertEqual(1.0, snapshot.hyperedges[("A", "B")]["weight"]) + + def test_migration_is_repeatable_without_duplicate_logical_records(self): + hgdb_file = self._write_hgdb_fixture( + { + "v_data": { + "A": {"description": "Alice"}, + "B": {"description": "Berlin"}, + }, + "e_data": {("A", "B"): {"weight": 1.0}}, + "v_inci": {}, + } + ) + snapshot = load_hgdb_snapshot(hgdb_file) + storage = NebulaHypergraphStorage( + namespace="test", + global_config={"working_dir": str(hgdb_file.parent)}, + ) + + migrate_snapshot_to_storage(snapshot, storage) + migrate_snapshot_to_storage(snapshot, storage) + + self.assertEqual(2, len(storage._vertex_data)) + self.assertEqual(1, len(storage._hyperedge_data)) + + def test_migration_preserves_high_order_hyperedge_id_set_and_arity(self): + hgdb_file = self._write_hgdb_fixture( + { + "v_data": { + "A": {"description": "Alice"}, + "B": {"description": "Berlin"}, + "C": {"description": "Conference"}, + }, + "e_data": { + ("C", "A", "B"): { + "description": "Alice attended a Berlin conference", + "keywords": "travel,event", + "source_id": "chunk-abc", + "weight": 2.5, + } + }, + "v_inci": {}, + } + ) + snapshot = load_hgdb_snapshot(hgdb_file) + storage = NebulaHypergraphStorage( + namespace="test", + global_config={"working_dir": str(hgdb_file.parent)}, + ) + + migrate_snapshot_to_storage(snapshot, storage) + + self.assertEqual({("A", "B", "C")}, set(snapshot.hyperedges)) + self.assertEqual(3, len(storage._vertex_data)) + self.assertEqual(1, len(storage._hyperedge_data)) + edge_data = storage._hyperedge_data[("A", "B", "C")] + self.assertEqual(("A", "B", "C"), tuple(edge_data["id_set"])) + self.assertEqual(3, edge_data["arity"]) + self.assertEqual("Alice attended a Berlin conference", edge_data["description"]) + self.assertEqual("travel,event", edge_data["keywords"]) + self.assertEqual("chunk-abc", edge_data["source_id"]) + self.assertEqual(2.5, edge_data["weight"]) + + def test_migration_passes_payload_copies_to_storage(self): + snapshot = nebulagraph_migration.HypergraphSnapshot( + vertices={"A": {"description": "Alice"}}, + hyperedges={("A",): {"weight": 1}}, + ) + + migrate_snapshot_to_storage(snapshot, MutatingStorage()) + + self.assertEqual("Alice", snapshot.vertices["A"]["description"]) + self.assertEqual(1, snapshot.hyperedges[("A",)]["weight"]) + + def test_async_migration_entrypoint_runs_inside_event_loop(self): + async def run_migration(): + snapshot = nebulagraph_migration.HypergraphSnapshot( + vertices={"A": {"description": "Alice"}}, + hyperedges={("A",): {"weight": 1}}, + ) + storage = NebulaHypergraphStorage( + namespace="test", + global_config={"working_dir": "/tmp"}, + ) + await migrate_snapshot_to_storage_async(snapshot, storage) + return storage + + storage = asyncio.run(run_migration()) + + self.assertEqual("Alice", storage._vertex_data["A"]["description"]) + self.assertEqual(["A"], storage._hyperedge_data[("A",)]["id_set"]) + + def test_sync_migration_rejects_running_event_loop(self): + async def run_migration(): + snapshot = nebulagraph_migration.HypergraphSnapshot( + vertices={"A": {}}, + hyperedges={}, + ) + storage = NebulaHypergraphStorage( + namespace="test", + global_config={"working_dir": "/tmp"}, + ) + migrate_snapshot_to_storage(snapshot, storage) + + with self.assertRaisesRegex(RuntimeError, "active event loop"): + asyncio.run(run_migration()) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_retrieval_validation.py b/tests/test_nebulagraph_retrieval_validation.py new file mode 100644 index 0000000..5553733 --- /dev/null +++ b/tests/test_nebulagraph_retrieval_validation.py @@ -0,0 +1,83 @@ +import importlib +import importlib.util +from pathlib import Path +import sys +import types +import unittest + + +PROJECT_ROOT = Path(__file__).resolve().parents[1] +PACKAGE_ROOT = PROJECT_ROOT / "hyperrag" + +if "hyperrag" not in sys.modules: + package = types.ModuleType("hyperrag") + package.__path__ = [str(PACKAGE_ROOT)] + sys.modules["hyperrag"] = package + + +def _load_module(module_name, relative_path): + module_path = PACKAGE_ROOT / relative_path + spec = importlib.util.spec_from_file_location(module_name, module_path) + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +_load_module("hyperrag.nebulagraph_ids", "nebulagraph_ids.py") +nebulagraph_validation = importlib.import_module("hyperrag.nebulagraph_validation") + +RetrievalParityResult = nebulagraph_validation.RetrievalParityResult + + +class RetrievalParityResultTest(unittest.TestCase): + def test_perfect_overlaps_pass_threshold(self): + result = RetrievalParityResult( + mode="hyper", + entity_overlap=1.0, + hyperedge_overlap=1.0, + text_unit_overlap=1.0, + context_diff="", + answer_score=1.0, + ) + + self.assertTrue(result.passed(0.95)) + + def test_missing_answer_score_does_not_block_when_base_overlaps_pass(self): + result = RetrievalParityResult( + mode="hyper-lite", + entity_overlap=0.96, + hyperedge_overlap=0.97, + text_unit_overlap=0.98, + context_diff="answer scoring disabled", + answer_score=None, + ) + + self.assertTrue(result.passed(0.95)) + + def test_low_entity_hyperedge_or_text_overlap_fails(self): + cases = [ + RetrievalParityResult("hyper", 0.94, 1.0, 1.0, "", None), + RetrievalParityResult("hyper", 1.0, 0.94, 1.0, "", None), + RetrievalParityResult("hyper", 1.0, 1.0, 0.94, "", None), + ] + + for result in cases: + with self.subTest(result=result): + self.assertFalse(result.passed(0.95)) + + def test_low_answer_score_fails_when_provided(self): + result = RetrievalParityResult( + mode="graph", + entity_overlap=1.0, + hyperedge_overlap=1.0, + text_unit_overlap=1.0, + context_diff="", + answer_score=0.94, + ) + + self.assertFalse(result.passed(0.95)) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_schema.py b/tests/test_nebulagraph_schema.py new file mode 100644 index 0000000..86eeda9 --- /dev/null +++ b/tests/test_nebulagraph_schema.py @@ -0,0 +1,114 @@ +import importlib.util +from pathlib import Path +import sys +import unittest + + +MODULE_PATH = Path(__file__).resolve().parents[1] / "hyperrag" / "nebulagraph_schema.py" +spec = importlib.util.spec_from_file_location("nebulagraph_schema", MODULE_PATH) +nebulagraph_schema = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = nebulagraph_schema +spec.loader.exec_module(nebulagraph_schema) + +REQUIRED_SCHEMA_STATEMENTS = nebulagraph_schema.REQUIRED_SCHEMA_STATEMENTS +schema_statements_for_space = nebulagraph_schema.schema_statements_for_space + + +class NebulaGraphSchemaTest(unittest.TestCase): + def _statement_containing(self, text): + for statement in REQUIRED_SCHEMA_STATEMENTS: + if text in statement: + return statement + self.fail(f"Missing schema statement containing {text!r}") + + def test_required_schema_contains_entity_tag(self): + self.assertTrue( + any( + "CREATE TAG IF NOT EXISTS Entity" in statement + for statement in REQUIRED_SCHEMA_STATEMENTS + ) + ) + + def test_required_schema_contains_hyperedge_tag(self): + self.assertTrue( + any( + "CREATE TAG IF NOT EXISTS Hyperedge" in statement + for statement in REQUIRED_SCHEMA_STATEMENTS + ) + ) + + def test_required_schema_contains_member_of_edge(self): + self.assertTrue( + any( + "CREATE EDGE IF NOT EXISTS MEMBER_OF" in statement + for statement in REQUIRED_SCHEMA_STATEMENTS + ) + ) + + def test_required_schema_contains_has_member_edge(self): + self.assertTrue( + any( + "CREATE EDGE IF NOT EXISTS HAS_MEMBER" in statement + for statement in REQUIRED_SCHEMA_STATEMENTS + ) + ) + + def test_entity_tag_contains_required_fields(self): + entity_statement = self._statement_containing("CREATE TAG IF NOT EXISTS Entity") + + for field_fragment in ( + "name string", + "entity_type string", + "description string", + "source_id string", + "additional_properties string", + "database_name string", + ): + with self.subTest(field_fragment=field_fragment): + self.assertIn(field_fragment, entity_statement) + + def test_hyperedge_tag_contains_required_fields(self): + hyperedge_statement = self._statement_containing( + "CREATE TAG IF NOT EXISTS Hyperedge" + ) + + for field_fragment in ( + "edge_hash string", + "id_set string", + "description string", + "keywords string", + "weight double", + "source_id string", + "arity int", + "database_name string", + ): + with self.subTest(field_fragment=field_fragment): + self.assertIn(field_fragment, hyperedge_statement) + + def test_membership_edges_include_database_scope(self): + for edge_name in ("MEMBER_OF", "HAS_MEMBER"): + with self.subTest(edge_name=edge_name): + edge_statement = self._statement_containing( + f"CREATE EDGE IF NOT EXISTS {edge_name}" + ) + self.assertIn("database_name string", edge_statement) + + def test_schema_statements_for_space_starts_with_use_statement(self): + statements = schema_statements_for_space(" hyperrag ") + + self.assertEqual("USE `hyperrag`", statements[0]) + self.assertEqual(REQUIRED_SCHEMA_STATEMENTS, statements[1:]) + + def test_schema_statements_for_space_rejects_blank_space_name(self): + for space_name in ("", " "): + with self.subTest(space_name=space_name): + with self.assertRaises(ValueError): + schema_statements_for_space(space_name) + + def test_schema_statements_for_space_rejects_backticks(self): + with self.assertRaises(ValueError): + schema_statements_for_space("bad`space") + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_storage.py b/tests/test_nebulagraph_storage.py new file mode 100644 index 0000000..511067a --- /dev/null +++ b/tests/test_nebulagraph_storage.py @@ -0,0 +1,167 @@ +import importlib.util +from pathlib import Path +import sys +import types +import unittest + + +PROJECT_ROOT = Path(__file__).resolve().parents[1] +PACKAGE_ROOT = PROJECT_ROOT / "hyperrag" + +if "hyperrag" not in sys.modules: + package = types.ModuleType("hyperrag") + package.__path__ = [str(PACKAGE_ROOT)] + sys.modules["hyperrag"] = package + +if "hyperrag.utils" not in sys.modules: + utils = types.ModuleType("hyperrag.utils") + utils.EmbeddingFunc = object + sys.modules["hyperrag.utils"] = utils + +MODULE_PATH = PACKAGE_ROOT / "nebulagraph_storage.py" +spec = importlib.util.spec_from_file_location("hyperrag.nebulagraph_storage", MODULE_PATH) +nebulagraph_storage = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = nebulagraph_storage +spec.loader.exec_module(nebulagraph_storage) + +NebulaHypergraphStorage = nebulagraph_storage.NebulaHypergraphStorage + + +class NebulaHypergraphStorageTest(unittest.IsolatedAsyncioTestCase): + def setUp(self): + self.storage = NebulaHypergraphStorage( + namespace="test", + global_config={"working_dir": "/tmp"}, + ) + + async def test_vertex_round_trip_copies_data(self): + data = { + "entity_type": "Person", + "description": "alpha", + "additional_properties": {"rank": 1}, + } + + await self.storage.upsert_vertex("A", data) + data["description"] = "mutated" + data["additional_properties"]["rank"] = 2 + + self.assertTrue(await self.storage.has_vertex("A")) + self.assertEqual( + { + "entity_type": "Person", + "description": "alpha", + "additional_properties": {"rank": 1}, + }, + await self.storage.get_vertex("A"), + ) + + returned = await self.storage.get_vertex("A") + returned["description"] = "leaked" + returned["additional_properties"]["rank"] = 3 + self.assertEqual("alpha", (await self.storage.get_vertex("A"))["description"]) + self.assertEqual( + 1, + (await self.storage.get_vertex("A"))["additional_properties"]["rank"], + ) + self.assertEqual("missing", await self.storage.get_vertex("missing", "missing")) + + async def test_hyperedge_round_trip_is_order_independent_and_copies_data(self): + data = {"weight": 2, "metadata": {"rank": 1}} + + await self.storage.upsert_hyperedge(("B", "A"), data) + data["weight"] = 7 + data["metadata"]["rank"] = 2 + + self.assertTrue(await self.storage.has_hyperedge(("A", "B"))) + self.assertEqual( + {"weight": 2, "metadata": {"rank": 1}}, + await self.storage.get_hyperedge(("A", "B")), + ) + + returned = await self.storage.get_hyperedge(("B", "A")) + returned["weight"] = 9 + returned["metadata"]["rank"] = 3 + self.assertEqual(2, (await self.storage.get_hyperedge(("A", "B")))["weight"]) + self.assertEqual( + 1, + (await self.storage.get_hyperedge(("A", "B")))["metadata"]["rank"], + ) + self.assertEqual("missing", await self.storage.get_hyperedge(("X", "Y"), "missing")) + + async def test_hyperedge_upsert_does_not_create_empty_vertices(self): + await self.storage.upsert_hyperedge(("A", "B"), {"weight": 1}) + + self.assertFalse(await self.storage.has_vertex("A")) + self.assertFalse(await self.storage.has_vertex("B")) + self.assertEqual(0, await self.storage.get_num_of_vertices()) + self.assertEqual(["A", "B"], await self.storage.get_nbr_v_of_hyperedge(("B", "A"))) + + async def test_neighbors_and_degree_use_normalized_order(self): + for vertex_id in ("A", "B", "C"): + await self.storage.upsert_vertex(vertex_id, {"id": vertex_id}) + await self.storage.upsert_hyperedge(("A", "B", "C"), {"weight": 1}) + + self.assertEqual(1, await self.storage.vertex_degree("A")) + self.assertEqual(3, await self.storage.hyperedge_degree(("C", "B", "A"))) + self.assertEqual([("A", "B", "C")], await self.storage.get_nbr_e_of_vertex("A")) + self.assertEqual( + ["A", "B", "C"], + await self.storage.get_nbr_v_of_hyperedge(("C", "A", "B")), + ) + + async def test_counts_and_all_lists_are_deterministic(self): + for vertex_id in ("C", "A", "B"): + await self.storage.upsert_vertex(vertex_id, {"id": vertex_id}) + await self.storage.upsert_hyperedge(("C", "A"), {"weight": 1}) + await self.storage.upsert_hyperedge(("B", "A", "C"), {"weight": 2}) + + self.assertEqual(3, await self.storage.get_num_of_vertices()) + self.assertEqual(2, await self.storage.get_num_of_hyperedges()) + self.assertEqual(["A", "B", "C"], await self.storage.get_all_vertices()) + self.assertEqual( + [("A", "B", "C"), ("A", "C")], + await self.storage.get_all_hyperedges(), + ) + + async def test_remove_hyperedge_updates_neighbors_and_degrees(self): + await self.storage.upsert_hyperedge(("B", "A"), {"weight": 1}) + await self.storage.upsert_hyperedge(("A", "C"), {"weight": 2}) + + await self.storage.remove_hyperedge(("A", "B")) + + self.assertFalse(await self.storage.has_hyperedge(("B", "A"))) + self.assertEqual([("A", "C")], await self.storage.get_nbr_e_of_vertex("A")) + self.assertEqual(1, await self.storage.vertex_degree("A")) + self.assertEqual(0, await self.storage.hyperedge_degree(("A", "B"))) + + async def test_remove_vertex_removes_incident_hyperedges_for_consistency(self): + await self.storage.upsert_hyperedge(("A", "B"), {"weight": 1}) + await self.storage.upsert_hyperedge(("B", "C"), {"weight": 2}) + await self.storage.upsert_hyperedge(("C", "D"), {"weight": 3}) + + await self.storage.remove_vertex("B") + + self.assertFalse(await self.storage.has_vertex("B")) + self.assertFalse(await self.storage.has_hyperedge(("A", "B"))) + self.assertFalse(await self.storage.has_hyperedge(("B", "C"))) + self.assertTrue(await self.storage.has_hyperedge(("C", "D"))) + self.assertEqual(0, await self.storage.vertex_degree("A")) + self.assertEqual([("C", "D")], await self.storage.get_all_hyperedges()) + + async def test_vertex_neighbors_are_deterministic_with_optional_self(self): + await self.storage.upsert_hyperedge(("C", "A", "B"), {"weight": 1}) + await self.storage.upsert_hyperedge(("D", "A"), {"weight": 2}) + + self.assertEqual( + ["B", "C", "D"], + await self.storage.get_nbr_v_of_vertex("A", exclude_self=True), + ) + self.assertEqual( + ["A", "B", "C", "D"], + await self.storage.get_nbr_v_of_vertex("A", exclude_self=False), + ) + self.assertEqual([], await self.storage.get_nbr_v_of_vertex("missing")) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_nebulagraph_validation.py b/tests/test_nebulagraph_validation.py new file mode 100644 index 0000000..eda36ce --- /dev/null +++ b/tests/test_nebulagraph_validation.py @@ -0,0 +1,345 @@ +import importlib +import importlib.util +from pathlib import Path +import sys +import types +import unittest + + +PROJECT_ROOT = Path(__file__).resolve().parents[1] +PACKAGE_ROOT = PROJECT_ROOT / "hyperrag" + +if "hyperrag" not in sys.modules: + package = types.ModuleType("hyperrag") + package.__path__ = [str(PACKAGE_ROOT)] + sys.modules["hyperrag"] = package + +if "hyperrag.utils" not in sys.modules: + utils = types.ModuleType("hyperrag.utils") + utils.EmbeddingFunc = object + sys.modules["hyperrag.utils"] = utils + + +def _load_module(module_name, relative_path): + module_path = PACKAGE_ROOT / relative_path + spec = importlib.util.spec_from_file_location(module_name, module_path) + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +nebulagraph_storage = _load_module( + "hyperrag.nebulagraph_storage", "nebulagraph_storage.py" +) +nebulagraph_validation = importlib.import_module("hyperrag.nebulagraph_validation") + +NebulaHypergraphStorage = nebulagraph_storage.NebulaHypergraphStorage +compare_storage_backends = nebulagraph_validation.compare_storage_backends + + +class NeighborMismatchStorage: + def __init__(self, storage): + self.storage = storage + + async def get_num_of_vertices(self): + return await self.storage.get_num_of_vertices() + + async def get_num_of_hyperedges(self): + return await self.storage.get_num_of_hyperedges() + + async def get_vertex(self, vertex_id): + return await self.storage.get_vertex(vertex_id) + + async def vertex_degree(self, vertex_id): + return await self.storage.vertex_degree(vertex_id) + + async def get_nbr_e_of_vertex(self, vertex_id): + return await self.storage.get_nbr_e_of_vertex(vertex_id) + + async def get_hyperedge(self, id_set): + return await self.storage.get_hyperedge(id_set) + + async def hyperedge_degree(self, id_set): + return await self.storage.hyperedge_degree(id_set) + + async def get_nbr_v_of_hyperedge(self, id_set): + return ["A", "C"] + + +class VertexNeighborMismatchStorage(NeighborMismatchStorage): + async def get_nbr_e_of_vertex(self, vertex_id): + return [] + + +class VertexDegreeMismatchStorage(NeighborMismatchStorage): + async def vertex_degree(self, vertex_id): + return 99 + + +class HyperedgeDegreeMismatchStorage(NeighborMismatchStorage): + async def hyperedge_degree(self, id_set): + return 99 + + +class NebulaGraphValidationTest(unittest.IsolatedAsyncioTestCase): + def _storage(self): + return NebulaHypergraphStorage( + namespace="test", + global_config={"working_dir": "/tmp"}, + ) + + async def _matching_pair(self): + left = self._storage() + right = self._storage() + for storage in (left, right): + await storage.upsert_vertex( + "A", + { + "entity_type": "Person", + "description": "Alice", + "source_id": "chunk-a", + }, + ) + await storage.upsert_vertex( + "B", + { + "entity_type": "Place", + "description": "Berlin", + "source_id": "chunk-b", + }, + ) + await storage.upsert_hyperedge( + ("A", "B"), + { + "description": "Alice visited Berlin", + "source_id": "chunk-ab", + "weight": 1.0, + }, + ) + return left, right + + async def test_matching_storages_pass(self): + left, right = await self._matching_pair() + + report = await compare_storage_backends( + left, + right, + sample_vertices=["A"], + sample_hyperedges=[("A", "B")], + ) + + self.assertTrue(report.passed) + self.assertEqual([], report.failures) + + async def test_count_mismatch_reports_clear_failure(self): + left, right = await self._matching_pair() + await right.upsert_vertex("C", {"description": "Conference"}) + + report = await compare_storage_backends( + left, + right, + sample_vertices=["A"], + sample_hyperedges=[("A", "B")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "vertex count mismatch" in failure + and "left=2" in failure + and "right=3" in failure + for failure in report.failures + ), + report.failures, + ) + + async def test_hyperedge_count_mismatch_reports_clear_failure(self): + left, right = await self._matching_pair() + await right.upsert_vertex("C", {"description": "Conference"}) + await right.upsert_hyperedge(("A", "C"), {"weight": 2}) + + report = await compare_storage_backends( + left, + right, + sample_vertices=["A"], + sample_hyperedges=[("A", "B")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "hyperedge count mismatch" in failure + and "left=1" in failure + and "right=2" in failure + for failure in report.failures + ), + report.failures, + ) + + async def test_vertex_payload_mismatch_reports_vertex_id(self): + left, right = await self._matching_pair() + await right.upsert_vertex( + "A", + { + "entity_type": "Person", + "description": "Alicia", + "source_id": "chunk-a", + }, + ) + + report = await compare_storage_backends( + left, + right, + sample_vertices=["A"], + sample_hyperedges=[("A", "B")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any("vertex payload mismatch" in failure and "A" in failure for failure in report.failures), + report.failures, + ) + + async def test_vertex_degree_mismatch_reports_vertex_id(self): + left, right = await self._matching_pair() + right_with_bad_degree = VertexDegreeMismatchStorage(right) + + report = await compare_storage_backends( + left, + right_with_bad_degree, + sample_vertices=["A"], + sample_hyperedges=[("A", "B")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "vertex degree mismatch" in failure and "A" in failure + for failure in report.failures + ), + report.failures, + ) + + async def test_vertex_neighbor_mismatch_reports_vertex_id(self): + left, right = await self._matching_pair() + right_with_bad_neighbors = VertexNeighborMismatchStorage(right) + + report = await compare_storage_backends( + left, + right_with_bad_neighbors, + sample_vertices=["A"], + sample_hyperedges=[("A", "B")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "vertex neighbor hyperedges mismatch" in failure and "A" in failure + for failure in report.failures + ), + report.failures, + ) + + async def test_hyperedge_payload_mismatch_reports_normalized_id_set(self): + left, right = await self._matching_pair() + await right.upsert_hyperedge( + ("A", "B"), + { + "description": "Different", + "source_id": "chunk-ab", + "weight": 1.0, + }, + ) + + report = await compare_storage_backends( + left, + right, + sample_vertices=["A"], + sample_hyperedges=[("B", "A")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "hyperedge payload mismatch" in failure + and "('A', 'B')" in failure + for failure in report.failures + ), + report.failures, + ) + + async def test_hyperedge_degree_mismatch_reports_normalized_id_set(self): + left, right = await self._matching_pair() + right_with_bad_degree = HyperedgeDegreeMismatchStorage(right) + + report = await compare_storage_backends( + left, + right_with_bad_degree, + sample_vertices=["A"], + sample_hyperedges=[("B", "A")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "hyperedge degree mismatch" in failure + and "('A', 'B')" in failure + for failure in report.failures + ), + report.failures, + ) + + async def test_hyperedge_neighbor_mismatch_reports_normalized_id_set(self): + left, right = await self._matching_pair() + right_with_bad_neighbors = NeighborMismatchStorage(right) + + report = await compare_storage_backends( + left, + right_with_bad_neighbors, + sample_vertices=["A"], + sample_hyperedges=[("B", "A")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "hyperedge neighbors mismatch" in failure + and "('A', 'B')" in failure + for failure in report.failures + ), + report.failures, + ) + + async def test_missing_sampled_hyperedge_returns_failures_instead_of_raising(self): + left, right = await self._matching_pair() + await right.remove_hyperedge(("A", "B")) + + report = await compare_storage_backends( + left, + right, + sample_vertices=["A"], + sample_hyperedges=[("B", "A")], + ) + + self.assertFalse(report.passed) + self.assertTrue( + any( + "hyperedge payload mismatch" in failure + and "('A', 'B')" in failure + for failure in report.failures + ), + report.failures, + ) + self.assertTrue( + any( + "hyperedge neighbors mismatch" in failure + and "('A', 'B')" in failure + for failure in report.failures + ), + report.failures, + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_webui_settings_helpers.py b/tests/test_webui_settings_helpers.py new file mode 100644 index 0000000..770a149 --- /dev/null +++ b/tests/test_webui_settings_helpers.py @@ -0,0 +1,65 @@ +import importlib.util +from pathlib import Path +import sys +import unittest + + +MODULE_PATH = ( + Path(__file__).resolve().parents[1] + / "web-ui" + / "backend" + / "settings_helpers.py" +) +spec = importlib.util.spec_from_file_location("settings_helpers", MODULE_PATH) +settings_helpers = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = settings_helpers +spec.loader.exec_module(settings_helpers) + +merge_settings_for_save = settings_helpers.merge_settings_for_save + + +class SettingsHelpersTest(unittest.TestCase): + def test_preserves_unknown_existing_settings(self): + merged = merge_settings_for_save( + {"futureSetting": "keep", "modelName": "old"}, + {"modelName": "new"}, + ) + + self.assertEqual("keep", merged["futureSetting"]) + self.assertEqual("new", merged["modelName"]) + + def test_preserves_existing_api_key_for_masked_value(self): + merged = merge_settings_for_save( + {"apiKey": "secret"}, + {"apiKey": "***", "modelName": "new"}, + ) + + self.assertEqual("secret", merged["apiKey"]) + + def test_keeps_nebulagraph_settings_from_incoming_payload(self): + merged = merge_settings_for_save( + {}, + { + "hypergraphBackendMode": "dual-read", + "nebulaGraphValidated": False, + }, + ) + + self.assertEqual("dual-read", merged["hypergraphBackendMode"]) + self.assertFalse(merged["nebulaGraphValidated"]) + + def test_preserves_existing_nebulagraph_settings_when_omitted(self): + merged = merge_settings_for_save( + { + "hypergraphBackendMode": "nebulagraph-serving", + "nebulaGraphValidated": True, + }, + {"modelName": "gpt-5-mini"}, + ) + + self.assertEqual("nebulagraph-serving", merged["hypergraphBackendMode"]) + self.assertTrue(merged["nebulaGraphValidated"]) + + +if __name__ == "__main__": + unittest.main() diff --git a/web-ui/backend/main.py b/web-ui/backend/main.py index 32fb83a..bee1e59 100644 --- a/web-ui/backend/main.py +++ b/web-ui/backend/main.py @@ -13,6 +13,7 @@ from pydantic import BaseModel from typing import List from io import StringIO +from settings_helpers import merge_settings_for_save # 添加 HyperRAG 相关导入 # 若尚不可导入,则向上逐级查找含有 hyperrag 包的目录,并把“其父目录”加到 sys.path @@ -35,6 +36,17 @@ # 设置文件路径 SETTINGS_FILE = "settings.json" + +def _load_settings_defaults(): + if not os.path.exists(SETTINGS_FILE): + return {} + try: + with open(SETTINGS_FILE, 'r', encoding='utf-8') as f: + return json.load(f) + except Exception: + return {} + + app = FastAPI() app.add_middleware( @@ -47,7 +59,7 @@ @app.get("/") async def root(): - return {"message": "Hyper-RAG"} + return {"message": "Hyper-Graph"} @app.get("/db") @@ -259,8 +271,14 @@ class SettingsModel(BaseModel): maxTokens: int = 2000 temperature: float = 0.7 # HyperRAG 嵌入模型设置 - embeddingModel: str = "text-embedding-3-small" - embeddingDim: int = 1536 + embeddingModel: str = "text-embedding-v4" + embeddingDim: int = 1024 + # Hypergraph backend settings + hypergraphBackendMode: str = "hgdb" + nebulaGraphValidated: bool = False + + class Config: + extra = "allow" class APITestModel(BaseModel): apiKey: str @@ -295,8 +313,8 @@ async def get_settings(): "selectedDatabase": "", "maxTokens": 2000, "temperature": 0.7, - "embeddingModel": "text-embedding-3-small", - "embeddingDim": 1536 + "embeddingModel": "text-embedding-v4", + "embeddingDim": 1024 } except Exception as e: return {"success": False, "message": str(e)} @@ -307,19 +325,13 @@ async def save_settings(settings: SettingsModel): 保存系统设置 """ try: - settings_dict = settings.dict() - - # 如果apiKey是***,则保持原有的apiKey不变 - if settings_dict.get('apiKey') == '***': - # 读取现有设置中的apiKey - if os.path.exists(SETTINGS_FILE): - with open(SETTINGS_FILE, 'r', encoding='utf-8') as f: - existing_settings = json.load(f) - # 保持原有的apiKey - settings_dict['apiKey'] = existing_settings.get('apiKey', '') - else: - # 如果没有现有设置文件,则设为空字符串 - settings_dict['apiKey'] = '' + existing_settings = {} + if os.path.exists(SETTINGS_FILE): + with open(SETTINGS_FILE, 'r', encoding='utf-8') as f: + existing_settings = json.load(f) + + settings_dict = settings.dict(exclude_unset=True) + settings_dict = merge_settings_for_save(existing_settings, settings_dict) with open(SETTINGS_FILE, 'w', encoding='utf-8') as f: json.dump(settings_dict, f, ensure_ascii=False, indent=2) @@ -533,6 +545,11 @@ def get_or_create_hyperrag(database: str = None): max_token_size=8192, func=get_hyperrag_embedding_func ), + addon_params={ + "database_name": database, + "hypergraph_backend_mode": settings.get("hypergraphBackendMode", "hgdb"), + "nebulagraph_validated": settings.get("nebulaGraphValidated", False), + }, ) main_logger.info(f"HyperRAG实例创建完成,数据库: {database}") @@ -660,12 +677,15 @@ async def get_hyperrag_status(database: str = None): if database in hyperrag_instances: instance = hyperrag_instances[database] status["initialized"] = True + settings = _load_settings_defaults() try: status["details"] = { "chunk_token_size": instance.chunk_token_size, "llm_model_name": instance.llm_model_name, "embedding_func_available": instance.embedding_func is not None, - "working_dir": os.path.join(hyperrag_working_dir, database.replace('.hgdb', '')) + "working_dir": os.path.join(hyperrag_working_dir, database.replace('.hgdb', '')), + "hypergraph_backend_mode": settings.get("hypergraphBackendMode", "hgdb"), + "nebula_graph_validated": settings.get("nebulaGraphValidated", False), } except Exception as e: status["details"] = f"Error getting details: {str(e)}" diff --git a/web-ui/backend/requirements.txt b/web-ui/backend/requirements.txt index 0fe643b..914f0a7 100644 --- a/web-ui/backend/requirements.txt +++ b/web-ui/backend/requirements.txt @@ -4,6 +4,7 @@ aioboto3 aiohttp numpy nano-vectordb +nebula3-python openai tenacity tiktoken @@ -12,4 +13,4 @@ python-multipart websockets aiofiles PyPDF2 -docx2txt \ No newline at end of file +docx2txt diff --git a/web-ui/backend/settings_helpers.py b/web-ui/backend/settings_helpers.py new file mode 100644 index 0000000..daf9dc2 --- /dev/null +++ b/web-ui/backend/settings_helpers.py @@ -0,0 +1,5 @@ +def merge_settings_for_save(existing_settings: dict, incoming_settings: dict) -> dict: + merged_settings = {**existing_settings, **incoming_settings} + if merged_settings.get("apiKey") == "***": + merged_settings["apiKey"] = existing_settings.get("apiKey", "") + return merged_settings