Commit 080ce3a

Author: molty3000
v1.1.0: text embedding and disk persistence
Embedding:
- Embedder interface for swappable backends
- RandomProjections: sparse JL projection (Achlioptas 2003)
- Built-in tokenizer (split on non-letter/digit, lowercase, min 2 chars)
- Deterministic output (fixed seed 42), L2-normalized

Persistence:
- Store.Save/Load using encoding/gob (compact binary)
- Store.SaveJSON/LoadJSON for human-readable export
- Full roundtrip preserves IDs, vectors, and metric

Stats: 40 tests, 96.8% coverage, 0 deps
1 parent c4f9082 commit 080ce3a

8 files changed

Lines changed: 826 additions & 97 deletions

File tree

AGENTS.md

Lines changed: 56 additions & 44 deletions
@@ -5,65 +5,77 @@ Zero-dependency vector similarity library. Pure Go.
55
## Project Structure
66

77
```
8-
pkg/vector/ ← library code (the thing people import)
9-
vector.go Vector type, Dot, Norm, Normalize, Add, Sub, Scale, Equal, EqualEps, Clone, Dims
10-
similarity.go Metric enum, Cosine, CosineDist, Euclidean, Manhattan, Distance
11-
store.go Store: in-memory brute-force NN search (Add, Get, Remove, Search, Len)
12-
cmd/go-vector/ ← minimal CLI demo
13-
docs/ ← GitHub Pages landing page
14-
index.html Dark-themed single-page site
15-
.nojekyll GitHub Pages raw HTML flag
8+
pkg/vector/ ← library code
9+
vector.go Vector type, Dot, Norm, Normalize, Add, Sub, Scale, Equal, EqualEps, Clone, Dims
10+
similarity.go Metric enum, Cosine, CosineDist, Euclidean, Manhattan, Distance
11+
store.go Store: NN search + Gob/JSON persistence (Save/Load/SaveJSON/LoadJSON)
12+
embedder.go Embedder interface
13+
random_projections.go RandomProjections: sparse JL projection + tokenizer
14+
cmd/go-vector/ ← minimal CLI demo
15+
docs/ ← GitHub Pages landing page
16+
index.html Dark-themed single-page site
17+
.nojekyll GitHub Pages raw HTML flag
1618
```
1719

1820
## Conventions
1921

20-
- **Zero dependencies** — never add to go.mod. stdlib only (`math`, `sort`).
22+
- **Zero dependencies** — never add to go.mod. stdlib only: `math`, `sort`, `encoding/gob`, `encoding/json`, `os`, `strings`, `unicode`, `math/rand`.
2123
- **Vector = []float32** — no struct, no interface, just a slice.
22-
- **Mismatched lengths → zero** — Dot/Norm/Cosine/Euclidean/Manhattan all return 0 for mismatched-length inputs rather than panicking. Add/Sub return nil.
23-
- **Clone on output** — Get() and Search() return copies. Store.Add() clones on insert. No internal backing arrays ever leak.
24-
- **Zero-alloc distance** — Cosine and Euclidean compute in a single pass without intermediate allocations. All distance functions have 0 allocs in benchmarks.
25-
- **Tests in `_test.go` files** — package `vector`, no separate test package.
26-
- **Benchmarks** — `bench_test.go` covers all operations at 768/1536 dims. Run with `-benchmem`.
24+
- **Mismatched lengths → zero** — return zero/nil rather than panicking.
25+
- **Clone on output** — Get() and Search() return copies. Store.Add() clones on insert.
26+
- **Zero-alloc distance** — Cosine and Euclidean compute in a single pass.
27+
- **Gob persistence** — `storeData` internal struct bridges unexported fields to encoder.
28+
- **Deterministic embeddings** — RandomProjections uses fixed seed 42.
29+
- **Tests in `_test.go` files** — package `vector`.
30+
31+
## Random Projections
32+
33+
Sparse random projection (Achlioptas 2003):
34+
- Entries: {-1, 0, +1} × sqrt(3/D) with probabilities {1/6, 2/3, 1/6}
35+
- Vocabulary built from corpus via `Fit()`
36+
- Tokenizer: split on non-letter/digit, lowercase, min 2 chars
37+
- Output always L2-normalized
38+
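The entry distribution listed above can be sketched as follows. This is a minimal illustration, not the library's code: `sparseEntry` and `newProjection` are hypothetical names, and it assumes `D` in `sqrt(3/D)` is the output dimensionality. Drawing one of six equally likely outcomes gives the stated probabilities {1/6, 2/3, 1/6}.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// sparseEntry draws one matrix entry: -scale with probability 1/6,
// +scale with probability 1/6, and 0 with probability 2/3.
func sparseEntry(rng *rand.Rand, scale float32) float32 {
	switch rng.Intn(6) {
	case 0:
		return -scale
	case 1:
		return scale
	default: // outcomes 2..5: four of six, i.e. probability 2/3
		return 0
	}
}

// newProjection builds a rows x dims matrix (one row per vocabulary token,
// assumed layout). A fixed seed makes repeated calls produce identical output.
func newProjection(rows, dims int, seed int64) [][]float32 {
	rng := rand.New(rand.NewSource(seed))
	scale := float32(math.Sqrt(3.0 / float64(dims)))
	m := make([][]float32, rows)
	for i := range m {
		m[i] = make([]float32, dims)
		for j := range m[i] {
			m[i][j] = sparseEntry(rng, scale)
		}
	}
	return m
}

func main() {
	for _, row := range newProjection(4, 8, 42) {
		fmt.Println(row)
	}
}
```

Because roughly two thirds of the entries are zero, a real implementation could skip them during projection; this sketch stores them densely for clarity.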
39+
## Persistence
40+
41+
```go
42+
// storeData bridges unexported Store fields to encoder
43+
type storeData struct {
44+
Vectors []Vector
45+
IDs []string
46+
Metric Metric
47+
}
48+
```
49+
50+
`init()` registers `gob.Register(Vector{})` so `[]float32` serializes correctly.
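A rough sketch of the bridge pattern described above, assuming `Metric` is an integer enum and using an in-memory buffer instead of a file. `encodeStore` and `decodeStore` are hypothetical helpers, not functions from the package. The sketch does not call `gob.Register`, since every field here has a concrete type; the package's own `init()` is not reproduced.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

type Vector []float32

// Metric is assumed to be an integer enum.
type Metric int

// storeData mirrors the struct shown above: exported fields so gob can see them.
type storeData struct {
	Vectors []Vector
	IDs     []string
	Metric  Metric
}

// encodeStore serializes the snapshot to gob bytes.
func encodeStore(d storeData) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(d); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeStore restores a snapshot from gob bytes.
func decodeStore(b []byte) (storeData, error) {
	var d storeData
	err := gob.NewDecoder(bytes.NewReader(b)).Decode(&d)
	return d, err
}

func main() {
	in := storeData{Vectors: []Vector{{0.6, 0.8}}, IDs: []string{"doc1"}, Metric: 1}
	raw, err := encodeStore(in)
	if err != nil {
		panic(err)
	}
	out, err := decodeStore(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.IDs, out.Vectors, out.Metric)
}
```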
2751

28-
## Build & Test (in Docker container)
52+
## Build & Test
2953

3054
```bash
31-
# Full CI
3255
docker exec -i projects-dev bash -c 'export PATH=$PATH:/usr/local/go/bin && cd /workspace/go-vector && make ci'
33-
34-
# Tests with coverage
35-
docker exec -i projects-dev bash -c 'export PATH=$PATH:/usr/local/go/bin && cd /workspace/go-vector && make test-cover'
36-
37-
# Benchmarks
56+
# Benchmark
3857
docker exec -i projects-dev bash -c 'export PATH=$PATH:/usr/local/go/bin && cd /workspace/go-vector && go test ./pkg/vector/ -bench=. -benchmem'
3958
```
4059

41-
Target: >95% coverage. All code paths exercised.
42-
43-
## Adding a New Metric
44-
45-
1. Add constant to `Metric` enum in `similarity.go`
46-
2. Add case to `Distance()` switch
47-
3. Update `Ascending()` if needed
48-
4. Add direct function (e.g., `Chebyshev()`) — zero-alloc, single-pass
49-
5. Add test case in `TestMetricAscending` and `TestDistance`
50-
6. Add store integration test and benchmark
60+
Target: >95% coverage. 40 tests, 96.8% currently.
5161

52-
## Security Principles
62+
## Adding a New Embedder
5363

54-
- **No panics.** Return zero/nil for invalid input.
55-
- **No unsafe.** Pure Go, no CGo, no syscalls, no I/O.
56-
- **Clone hygiene.** Internal state is never exposed through return values.
57-
- **Float32 awareness.** Document overflow limits. `MaxSafeDims` constant defines safe dimensionality.
58-
- **Thread safety docs.** Store is safe for concurrent reads but not writes — documented, not enforced.
64+
1. Implement `Embedder` interface (`Embed(text) (Vector, error)`, `Dims() int`)
65+
2. Add constructor function
66+
3. Add test file with Fit/Embed/determinism/similarity tests
67+
4. Document in README under Text Embedding section
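As an illustration of steps 1 and 2, here is a small sketch of a custom embedder. `HashEmbedder` is a hypothetical example that is not part of the package; it buckets whitespace-separated tokens with FNV hashing and L2-normalizes the counts. `Vector` and `Embedder` are redeclared locally so the block compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
	"math"
	"strings"
)

type Vector []float32

// Embedder matches the interface described in step 1.
type Embedder interface {
	Embed(text string) (Vector, error)
	Dims() int
}

// HashEmbedder is a hypothetical backend: token counts in hashed buckets.
type HashEmbedder struct{ dims int }

// NewHashEmbedder is the constructor from step 2.
func NewHashEmbedder(dims int) *HashEmbedder { return &HashEmbedder{dims: dims} }

func (h *HashEmbedder) Dims() int { return h.dims }

func (h *HashEmbedder) Embed(text string) (Vector, error) {
	if h.dims <= 0 {
		return nil, errors.New("dims must be positive")
	}
	v := make(Vector, h.dims)
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		hs := fnv.New32a()
		hs.Write([]byte(tok))
		v[int(hs.Sum32()%uint32(h.dims))]++
	}
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	if sum == 0 {
		return nil, errors.New("text contains no tokens")
	}
	inv := float32(1 / math.Sqrt(sum))
	for i := range v {
		v[i] *= inv // scale to unit L2 norm
	}
	return v, nil
}

// Compile-time check that HashEmbedder satisfies Embedder.
var _ Embedder = (*HashEmbedder)(nil)

func main() {
	e := NewHashEmbedder(8)
	v, err := e.Embed("machine learning is fascinating")
	fmt.Println(v, err)
}
```

The same input always maps to the same vector, which covers the determinism test mentioned in step 3.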
5968

60-
## Performance Rules
69+
## Adding a New Metric
6170

62-
- Distance functions must be zero-allocation (verified by `-benchmem`)
63-
- Single-pass computation where possible (Cosine, Euclidean already are)
64-
- Brute-force search is O(n·d) — acceptable for the zero-dep target
65-
- Benchmark before and after any metric or store change
71+
1. Add constant to `Metric` enum in `similarity.go`
72+
2. Add case to `Distance()` switch, update `Ascending()` if needed
73+
3. Add direct function — zero-alloc, single-pass
74+
4. Add test + benchmark
6675

67-
## Landing Page
76+
## Security
6877

69-
The `docs/` directory is deployed via GitHub Pages (Settings → Pages → Source: main / /docs). Uses the standard BackendStack21 dark theme: Outfit for body, Monaspace Neon for code. No build step — just raw HTML + CSS.
78+
- No panics. Return zero/nil for invalid input.
79+
- No unsafe. Pure Go, no CGo, no syscalls.
80+
- Clone hygiene. Internal state never exposed.
81+
- Persistence: `Load` replaces all data. Caller handles atomicity.
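Since atomicity is left to the caller, one common approach is to write to a temporary file in the same directory and rename it over the target. The sketch below assumes a save function of the form `func(path string) error`; whether `Store.Save` has exactly that signature is not shown in this diff, so `atomicSave` takes a callback instead. The helper name is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicSave lets save write to a temp file next to path, then renames the
// temp file over path. On POSIX systems a rename within one filesystem is
// atomic, so readers see either the old file or the complete new one.
func atomicSave(path string, save func(tmpPath string) error) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	tmpPath := tmp.Name()
	if err := tmp.Close(); err != nil {
		os.Remove(tmpPath)
		return err
	}
	if err := save(tmpPath); err != nil {
		os.Remove(tmpPath) // do not leave a partial file behind
		return err
	}
	if err := os.Rename(tmpPath, path); err != nil {
		os.Remove(tmpPath)
		return err
	}
	return nil
}

func main() {
	// Stand-in for a store save call; writes a fixed payload.
	err := atomicSave("vectors.db", func(tmp string) error {
		return os.WriteFile(tmp, []byte("payload"), 0o644)
	})
	fmt.Println("saved:", err == nil)
}
```

The sketch does not fsync the file or its directory, which a durability-sensitive caller may also want.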

README.md

Lines changed: 84 additions & 42 deletions
@@ -1,6 +1,6 @@
11
# go-vector
22

3-
Zero-dependency vector similarity library for Go. Pure Go `[]float32` vectors, four distance metrics, brute-force nearest-neighbor search. No CGo, no BLAS, no third-party imports — just `math` and `sort`.
3+
Zero-dependency vector similarity library for Go. Pure Go `[]float32` vectors, four distance metrics, text embedding via random projections, and disk-backed persistence. No CGo, no BLAS, no third-party imports.
44

55
## Install
66

@@ -36,6 +36,45 @@ func main() {
3636
}
3737
```
3838

39+
## Text Embedding
40+
41+
```go
42+
// Fit a random projection embedder on your corpus
43+
rp := vector.NewRandomProjections(256)
44+
rp.Fit([]string{
45+
"machine learning is fascinating",
46+
"deep neural networks transform AI",
47+
"the weather today is sunny",
48+
})
49+
50+
// Embed text into a 256-dim vector
51+
v, _ := rp.Embed("learning about machine intelligence")
52+
// v is a normalized Vector suitable for cosine similarity search
53+
54+
// Use with the store
55+
store := vector.NewStore(vector.CosineDistance)
56+
store.Add("doc1", v)
57+
store.Search(rp.MustEmbed("AI and learning"), 5)
58+
```
59+
60+
The `Embedder` interface lets you swap backends: bring your own OpenAI, Ollama, or sentence-transformers adapter. The built-in `RandomProjections` is zero-dependency and deterministic.
61+
62+
## Persistence
63+
64+
```go
65+
// Save to disk (gob — compact binary)
66+
store.Save("/data/vectors.db")
67+
store.SaveJSON("/data/vectors.json") // human-readable alternative
68+
69+
// Restore later
70+
restored := vector.NewStore(vector.CosineDistance)
71+
restored.Load("/data/vectors.db")
72+
73+
// Full roundtrip — metric and all data preserved
74+
```
75+
76+
Gob-encoded stores are compact (~4 bytes per float32 + overhead). For a 10K × 1536d store, expect ~60 MB on disk and ~200ms save/load times.
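The ~60 MB figure follows from the raw float32 payload alone: 10,000 vectors × 1,536 dimensions × 4 bytes = 61,440,000 bytes. A small sketch of that arithmetic (`rawVectorBytes` is a hypothetical helper; IDs and gob framing add overhead on top of this):

```go
package main

import "fmt"

// rawVectorBytes returns the size of n float32 vectors of the given
// dimensionality, counting 4 bytes per component and nothing else.
func rawVectorBytes(n, dims int) int64 {
	return int64(n) * int64(dims) * 4
}

func main() {
	b := rawVectorBytes(10_000, 1536)
	fmt.Printf("%d bytes = %.1f MB\n", b, float64(b)/1e6) // 61440000 bytes = 61.4 MB
}
```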
77+
3978
## API
4079

4180
### Vector Type
@@ -44,18 +83,16 @@ func main() {
4483

4584
### Core Operations
4685

47-
| Function | Returns | Description |
48-
|----------|---------|-------------|
49-
| `Dims(v Vector) int` | `int` | Dimensionality |
50-
| `Dot(a, b Vector) float32` | `float32` | Dot product (0 if lengths differ) |
51-
| `Norm(v Vector) float32` | `float32` | L2 / Euclidean norm |
52-
| `Normalize(v Vector) Vector` | `Vector` | Unit vector (nil for zero vector) |
53-
| `Add(a, b Vector) Vector` | `Vector` | Element-wise sum (nil if lengths differ) |
54-
| `Sub(a, b Vector) Vector` | `Vector` | Element-wise difference (nil if lengths differ) |
55-
| `Scale(v Vector, s float32) Vector` | `Vector` | Scalar multiplication |
56-
| `Equal(a, b Vector) bool` | `bool` | Approximate equality (ε = 1e-6) |
57-
| `EqualEps(a, b Vector, eps float32) bool` | `bool` | Custom epsilon equality |
58-
| `Clone(v Vector) Vector` | `Vector` | Deep copy |
86+
- `Dims(v Vector) int` — dimensionality
87+
- `Dot(a, b Vector) float32` — dot product (0 if lengths differ)
88+
- `Norm(v Vector) float32` — L2 norm
89+
- `Normalize(v Vector) Vector` — unit vector (nil for zero vector)
90+
- `Add(a, b Vector) Vector` — element-wise sum (nil if lengths differ)
91+
- `Sub(a, b Vector) Vector` — element-wise difference (nil if lengths differ)
92+
- `Scale(v Vector, s float32) Vector` — scalar multiplication
93+
- `Equal(a, b Vector) bool` — approximate equality (ε = 1e-6)
94+
- `EqualEps(a, b Vector, eps float32) bool` — custom epsilon
95+
- `Clone(v Vector) Vector` — deep copy
5996

6097
### Distance Metrics
6198

@@ -66,34 +103,46 @@ vector.ManhattanDistance // L1 distance → [0, ∞), lower = more similar
66103
vector.DotProductSimilarity // dot product → (−∞, ∞), higher = more similar
67104
```
68105

69-
Direct functions available:
70-
- `Cosine(a, b)` — cosine similarity [−1, 1]
71-
- `CosineDist(a, b)` — 1 − cosine [0, 2]
72-
- `Euclidean(a, b)` — L2 distance (zero-alloc)
73-
- `Manhattan(a, b)` — L1 distance
74-
- `Distance(a, b, metric)` — metric-dispatch version
106+
Direct functions: `Cosine`, `CosineDist`, `Euclidean`, `Manhattan`, `Distance`.
75107

76108
### Vector Store
77109

78110
```go
79111
store := vector.NewStore(vector.CosineDistance)
80112

81-
store.Add(id string, v Vector) // insert (clones input)
82-
store.Search(query Vector, k int) // top-k nearest neighbors
83-
store.Get(id string) Vector // lookup by id (clone)
84-
store.Remove(id string) bool // remove by id
85-
store.Len() int // count
113+
store.Add(id, v) // insert (clones input)
114+
store.Search(query, k) // top-k nearest neighbors
115+
store.Get(id) // lookup by id (clone)
116+
store.Remove(id) // remove by id
117+
store.Len() // count
118+
store.Save(path) // gob-encode to file
119+
store.Load(path) // restore from gob file
120+
store.SaveJSON(path) // JSON export
121+
store.LoadJSON(path) // JSON import
122+
```
123+
124+
### Text Embedding
125+
126+
```go
127+
type Embedder interface {
128+
Embed(text string) (Vector, error)
129+
Dims() int
130+
}
86131
```
87132

88-
`Search()` returns `[]SearchResult` sorted by distance:
89-
- Distance metrics (Cosine, Euclidean, Manhattan): ascending order
90-
- DotProductSimilarity: descending order
133+
**Built-in: `RandomProjections`**
91134

92-
Results include **cloned** vectors — mutations won't corrupt store state.
135+
Johnson-Lindenstrauss sparse random projection (Achlioptas 2003). Projects tokenized text into a fixed-size normalized vector. Deterministic (fixed seed), zero dependencies, ~10µs per embed.
136+
137+
- `NewRandomProjections(outputDim int)` — create embedder
138+
- `Fit(corpus []string)` — build vocabulary and projection matrix
139+
- `Embed(text string) (Vector, error)` — embed text (L2-normalized output)
140+
- `VocabSize() int` — number of unique tokens in vocabulary
141+
- `Dims() int` — output dimensionality
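The tokenizer rule stated in the commit message (split on non-letter/digit, lowercase, min 2 chars) could look roughly like this. `tokenize` is a hypothetical name, and the sketch assumes "2 chars" means two runes rather than two bytes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// tokenize lowercases text, splits on anything that is not a letter or
// digit, and drops tokens shorter than two runes.
func tokenize(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tokens := fields[:0] // filter in place, reusing the backing array
	for _, f := range fields {
		if utf8.RuneCountInString(f) >= 2 {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

func main() {
	fmt.Println(tokenize("Go-Vector v1.1.0: a JL projection!"))
	// prints [go vector v1 jl projection]
}
```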
93142

94143
## Performance
95144

96-
All benchmarks at 1536 dimensions (typical embedding size) on AMD EPYC.
145+
All benchmarks at 1536 dimensions on AMD EPYC.
97146

98147
```
99148
BenchmarkDot-6 7.5 µs 0 allocs
@@ -105,9 +154,7 @@ BenchmarkStoreSearch1000-6 28.5 ms 74 KB
105154
BenchmarkStoreSearch10000-6 315 ms 185 KB
106155
```
107156

108-
Distance functions are **zero-allocation**. Cosine and Euclidean compute in a single pass over the data.
109-
110-
Search is brute-force O(n·d). At 1536d, expect ~3ms per 100 vectors. Suitable for datasets up to ~100K vectors.
157+
Distance functions are **zero-allocation**. Cosine and Euclidean compute in a single pass.
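To illustrate what single-pass, zero-allocation means here, the sketch below accumulates the dot product and both squared norms in one loop over the inputs. It is an independent illustration rather than the package's `Cosine`; returning 0 for empty or zero-norm inputs is an assumption modeled on the mismatched-length convention.

```go
package main

import (
	"fmt"
	"math"
)

// cosine computes cosine similarity in one pass with no heap allocations.
func cosine(a, b []float32) float32 {
	if len(a) != len(b) || len(a) == 0 {
		return 0 // mismatched or empty input: return zero instead of panicking
	}
	var dot, na, nb float32
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0 // assumed behavior for zero vectors
	}
	return dot / float32(math.Sqrt(float64(na)*float64(nb)))
}

func main() {
	fmt.Println(cosine([]float32{3, 4}, []float32{3, 4})) // prints 1
	fmt.Println(cosine([]float32{1, 0}, []float32{0, 1})) // prints 0
}
```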
111158

112159
## Security
113160

@@ -116,23 +163,18 @@ Search is brute-force O(n·d). At 1536d, expect ~3ms per 100 vectors. Suitable f
116163
| Attack surface | 🟢 Minimal — pure float32 math, no CGo, no syscalls, no I/O |
117164
| Panics | 🟢 None — all edge cases return zero/nil |
118165
| Memory safety | 🟢 All outputs cloned, no shared backing arrays |
119-
| Float overflow | 🟡 Documented — define `MaxSafeDims = 1M`; normalize large-magnitude vectors |
166+
| Persistence safety | 🟡 `Load` replaces all data; atomicity is caller's responsibility |
167+
| Float overflow | 🟡 Documented — `MaxSafeDims = 1M`; normalize large-magnitude vectors |
120168
| Thread safety | 🟡 Store is read-safe but not write-safe — guard with `sync.Mutex` |
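One way to follow the thread-safety advice in the table is a thin wrapper that takes a lock around each call. The sketch uses `sync.RWMutex` rather than a plain `sync.Mutex` so that concurrent reads stay possible. `SafeStore`, the `store` interface, and `mapStore` are hypothetical stand-ins; the real `Store` method signatures may differ.

```go
package main

import (
	"fmt"
	"sync"
)

type Vector []float32

// store is the subset of methods the wrapper needs (assumed signatures).
type store interface {
	Add(id string, v Vector)
	Len() int
}

// SafeStore serializes writes and allows concurrent reads.
type SafeStore struct {
	mu    sync.RWMutex
	inner store
}

func NewSafeStore(inner store) *SafeStore { return &SafeStore{inner: inner} }

func (s *SafeStore) Add(id string, v Vector) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.inner.Add(id, v)
}

func (s *SafeStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.inner.Len()
}

// mapStore is a toy in-memory stand-in for the real store.
type mapStore struct{ m map[string]Vector }

func (m *mapStore) Add(id string, v Vector) { m.m[id] = v }
func (m *mapStore) Len() int                { return len(m.m) }

func main() {
	s := NewSafeStore(&mapStore{m: map[string]Vector{}})
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Add(fmt.Sprintf("doc%d", i), Vector{float32(i)})
		}(i)
	}
	wg.Wait()
	fmt.Println(s.Len()) // prints 100
}
```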
121169

122-
`SearchResult.Vector` is a `[]float32` — it's a clone from the store, but if your application mutates float32 slices returned by Search, clone them again. The store's internal state is never exposed.
123-
124-
### Float32 Precision
125-
126-
Dot products on high-dimensional vectors with large magnitudes (>1e19) can overflow float32 (±3.4e38). For typical embedding use (normalized vectors, dims < 10K), this is not a concern. If your vectors have large unnormalized magnitudes, normalize before insertion.
127-
128170
## Design
129171

130-
- **Zero dependencies** — `go.mod` has no `require` block. `math` + `sort` only.
172+
- **Zero dependencies** — `go.mod` has no `require` block
131173
- **Type alias** — `Vector` is `[]float32`, interoperable with any `[]float32` data
132174
- **Brute-force search** — O(n·d) per query; pair with an approximate index for n > 100K
133-
- **Clone safety** — `Get()`, `Search()`, and `Add()` all clone — no accidental mutation
175+
- **Clone safety** — `Get()`, `Search()`, and `Add()` all clone
134176
- **Graceful degradation** — mismatched lengths return zero/nil, never panic
135-
- **Single-pass** — Cosine and Euclidean compute in one pass without intermediate allocations
177+
- **Deterministic embeddings** — fixed seed (42) for reproducible results
136178

137179
## License
138180
