Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 54 additions & 0 deletions docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,60 @@ PhyslibSearch is split into two largely independent parts: an **indexing pipelin

---

## Design decisions

The pipeline looks the way it does because **the indexed corpus is not text — it is formal Lean code**. Generic RAGs (like [gbrain](https://github.com/garrytan/gbrain)) assume the corpus is already prose, so they chunk by paragraph, embed once, and retrieve. That collapses here: a user's plain-English query lives in a completely different distribution than a terse Lean signature, so direct nearest-neighbour search in a single embedding space returns noise. Every non-obvious choice in PhyslibSearch exists to close that gap.

### Per-declaration chunking

The unit of retrieval is a single Lean declaration (theorem, definition, instance), not a file, module, or fixed-size window. This is because Lean's semantic unit *is* the declaration — splitting mid-file would break signatures and proofs, while grouping by file would dilute the embedding across unrelated lemmas. jixia already exposes declaration-level structure (name, signature, value, dependencies), so symbol-level chunking comes for free without a custom splitter. See the `declaration` table below and [`database/jixia_db.py`](../database/jixia_db.py).

### Informalize formal code before embedding

Rather than embedding the raw Lean signature, PhyslibSearch uses Gemini (`GEMINI_MODEL`, quality-tuned) to generate a **natural-language name and description** for every declaration, following the [Herald](https://arxiv.org/abs/2410.10878v2) "formalization reversal" idea. The final text that gets embedded is a concatenation of the formal and informal forms:

```text
{kind} {name} {signature}
{informal_name}: {informal_description}
```

(see [`database/vector_db.py`](../database/vector_db.py), line 43). This hybrid embedding lets the vector store match both exact-Lean queries (`mem_cons_self`) and natural-language queries ("an element is in the list it was prepended to"). A pure-formal embedding would miss the second; a pure-informal embedding would miss the first.

### Dependency-aware, topological informalization

The informalization pass is not independent per declaration. Declarations are translated in **topological order over the dependency graph** (the `level` table; see [`database/informalize.py`](../database/informalize.py)) so that when Gemini translates a theorem, it has already-translated descriptions of every symbol that theorem references, plus the ±2 neighbouring declarations in the same module (`find_neighbor`). This replicates how a mathematician reads a library: you understand a lemma in terms of things you already understand. A generic RAG has no analogue because prose has no typed dependency graph.

### HyDE query expansion

Even with hybrid embeddings, a short user query ("Schrödinger equation") is much shorter and less structured than a full indexed document. [HyDE](https://arxiv.org/abs/2212.10496) closes this final gap: Gemini (`GEMINI_FAST_MODEL`, latency-tuned) rewrites the query into a plausible hypothetical Lean declaration, and *that* is what gets embedded ([`query_expansion.py`](../query_expansion.py)). The hypothetical declaration lives in the same distribution as the indexed documents, so nearest-neighbour search becomes meaningful. HyDE is toggleable — for short formal queries where the distribution gap is already small, users can skip it.

### Dual store: PostgreSQL + ChromaDB

The pipeline is graph-heavy: dependency edges, topological levels, neighbour lookups, joins between declarations and their informalizations. That's what relational stores are good at, and the indexing pipeline relies on SQL throughout. ChromaDB, meanwhile, only needs to store vectors and return nearest IDs — it would be wasteful to replicate the full record schema inside a vector DB. At query time Chroma returns IDs, and Postgres assembles the human-readable records via the `record` view (see [`engine.py`](../engine.py)).

### Task-type-aware embeddings + model tiering

Three different Gemini knobs serve three different jobs:

- `GEMINI_MODEL` (default `gemini-2.5-pro`) — offline informalization. Quality matters, latency doesn't; this is a one-time cost per corpus.
- `GEMINI_FAST_MODEL` (default `gemini-2.5-flash`) — real-time HyDE expansion on every query. Latency matters.
- `GEMINI_EMBEDDING_MODEL` — invoked with `task_type="RETRIEVAL_DOCUMENT"` during indexing and `"RETRIEVAL_QUERY"` at query time ([`database/embedding.py`](../database/embedding.py)), letting Gemini tune each embedding for its role in the asymmetric retrieval setup.

### Why this differs from a gbrain-style RAG

| Aspect | Generic notes RAG | PhyslibSearch |
|---|---|---|
| Corpus | Prose the user already wrote | Formal Lean declarations |
| Chunking | By paragraph / section | By declaration (jixia-provided) |
| Embedding input | Raw text | Hybrid (formal + Herald-style informal) |
| Indexing pass | Single embedding pass | Dependency-ordered informalization, then embedding |
| Query expansion | Often unnecessary | HyDE — queries rewritten into Lean syntax |
| Stores | Vector store only | Postgres (graph + records) + Chroma (vectors) |

In short: a generic RAG can treat the corpus as a bag of text because its users already speak the corpus's language. For PhyslibSearch the whole point is that users *don't* — they speak physics, not Lean — so the pipeline's job is to translate both sides (documents → informal via Herald, queries → formal via HyDE) until they meet in a shared embedding space.

---

## Database schema

All tables live in the default PostgreSQL schema. PhyslibSearch adds its own `physlibsearch` schema for operational tables.
Expand Down
Loading