A small, polished RAG (Retrieval-Augmented Generation) document Q&A service built with FastAPI and the OpenAI API. Upload PDFs, text, or markdown, then ask questions and get answers grounded in your documents — with inline citations back to the source passages.
This is a deliberately compact project meant to teach the whole RAG loop end to end without hiding anything behind a heavyweight vector database. Every moving part — chunking, embedding, similarity search, prompt construction — is plain, readable Python you can step through.
Stack: FastAPI · Pydantic v2 · OpenAI embeddings + chat · NumPy vector search · zero external services.
- Upload & index
.pdf,.txt, and.mdfiles via a REST endpoint or web UI. - Cited answers — every response lists the exact passages it used, with cosine-similarity scores.
- Transparent retrieval — a ~150-line NumPy vector store you can actually read, persisted to a single JSON file.
- Clean architecture — ingestion, embeddings, storage, retrieval, and the LLM call are each isolated and independently testable.
- Auto docs — interactive OpenAPI explorer at
/docsfor free. - Tested without an API key — the pure logic (chunking, vector math, HTTP layer) is covered by
pytestwith the OpenAI calls mocked. - Dockerized and ready to deploy.
upload ask
┌──────────┐ ─────► ┌───────────┐ ┌───────────┐
│ document │ │ chunk + │ │ embed │
│ (.pdf…) │ │ embed │ │ question │
└──────────┘ └─────┬─────┘ └─────┬─────┘
│ │
▼ ▼
┌───────────────────────────────┐
│ JSON vector store (NumPy) │
│ cosine-similarity search │
└───────────────┬───────────────┘
│ top-k chunks
▼
┌──────────────────────────┐
│ build numbered context │
│ → OpenAI chat completion │
│ → answer + citations │
└──────────────────────────┘
| Layer | File | Responsibility |
|---|---|---|
| Config | app/config.py |
Typed settings from .env (pydantic-settings) |
| Ingestion | app/ingestion.py |
Parse files, normalize, chunk with overlap |
| Embeddings | app/embeddings.py |
Thin OpenAI embeddings wrapper (batched) |
| Vector store | app/vectorstore.py |
Persisted cosine-similarity search |
| LLM | app/llm.py |
Grounded chat completion + system prompt |
| RAG | app/rag.py |
Orchestrates ingest & ask |
| API | app/main.py |
FastAPI routes, DI, error handling |
| UI | app/static/index.html |
Single-file upload + chat front end |
# 1. Clone & install
pip install -r requirements.txt
# 2. Configure
cp .env.example .env
# edit .env and set OPENAI_API_KEY=sk-...
# 3. Run
uvicorn app.main:app --reloadThen open:
- http://localhost:8000 — the web UI (upload a doc, ask a question)
- http://localhost:8000/docs — interactive API docs
Try it with the included sample:
curl -F "file=@sample_docs/refund_policy.md" http://localhost:8000/documents
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"question": "How long do I have to request a refund?"}'| Method | Path | Description |
|---|---|---|
POST |
/documents |
Upload & index a file (multipart file) |
GET |
/documents |
List indexed documents & chunk counts |
DELETE |
/documents/{id} |
Remove a document from the index |
POST |
/ask |
Ask a question → cited answer |
GET |
/health |
Liveness check |
make test # pytest — no API key needed; OpenAI calls are faked
make lint # ruff check + format --check
make format # ruff format + autofix
make ci # lint + test (what CI runs)Coverage includes chunking edge cases, cosine-similarity ranking, store
persistence, the RAG orchestration (with stubbed embedder + LLM), and the HTTP
layer (via FastAPI dependency overrides). CI runs on Python 3.11 and 3.12 — see
.github/workflows/ci.yml.
docker build -t askdocs .
docker run --rm -p 8000:8000 --env-file .env -v $(pwd)/data:/app/data askdocsAll settings come from environment variables / .env (see .env.example):
| Variable | Default | Meaning |
|---|---|---|
OPENAI_API_KEY |
— | Required. Your OpenAI key |
EMBEDDING_MODEL |
text-embedding-3-small |
Embedding model |
CHAT_MODEL |
gpt-4o-mini |
Answer-generation model |
CHUNK_SIZE |
800 |
Target characters per chunk |
CHUNK_OVERLAP |
150 |
Overlap between chunks |
TOP_K |
4 |
Chunks retrieved per question |
DATA_DIR |
./data |
Where the index persists |
- Swap the JSON store for pgvector, FAISS, or Chroma (only
vectorstore.pychanges). - Add streaming answers with Server-Sent Events.
- Re-rank retrieved chunks with a cross-encoder.
- Per-user document collections + auth.
- Evaluation harness (faithfulness / answer-relevance scoring).
{ "answer": "You may request a full refund within 30 days of purchase, as long as the product is unused and in its original packaging [1].", "citations": [ { "filename": "refund_policy.md", "chunk_index": 0, "score": 0.83, "snippet": "Customers may request a full refund within 30 days…" } ] }