Improve search: multi-term AND + relevance ranking (FTS spike)#95

Open
rdhyee wants to merge 1 commit into isamplesorg:main from rdhyee:feature/fts-spike
Conversation

rdhyee (Contributor) commented Apr 9, 2026

Summary

Closes #84. The FTS spike is complete: immediate search improvements are shipped here, and the future path is documented below.

Shipped now (zero new dependencies):

  • Multi-term search: "pottery Cyprus" requires BOTH words to match (was OR on the full phrase)
  • Relevance ranking: when searching, results are sorted by score (label match = 3 pts, place = 2 pts, description = 1 pt)
  • When not searching, results remain random for exploration variety
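
The AND matching and weighted scoring described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this PR: the names `score_sample` and `search`, the dict keys (`label`, `place_name`, `description`), the case-insensitive substring matching, and the choice to sum weights per term are all assumptions.

```python
# Hypothetical sketch of multi-term AND matching with field-weighted scoring.
# Field names and matching rules are assumptions, not taken from the PR's code.
FIELD_WEIGHTS = {"label": 3, "place_name": 2, "description": 1}


def score_sample(sample, query):
    """Return a relevance score, or None if any query term matches no field."""
    terms = query.lower().split()
    if not terms:
        return None
    total = 0
    for term in terms:
        # Add the weight of every field that contains this term.
        term_score = sum(
            weight
            for field, weight in FIELD_WEIGHTS.items()
            if term in (sample.get(field) or "").lower()
        )
        if term_score == 0:
            return None  # AND semantics: every term must match somewhere
        total += term_score
    return total


def search(samples, query):
    """Keep only samples matching all terms, highest score first."""
    scored = [(score_sample(s, query), s) for s in samples]
    hits = [(sc, s) for sc, s in scored if sc is not None]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in hits]
```

Under these assumptions, a sample with "pottery" in its label and "Cyprus" in its place name scores 3 + 2 = 5, while a sample that matches only one of the two words is dropped.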

FTS spike findings:

  • Built offline DuckDB FTS index with tools/build_fts_index.py
  • Full index (label + description + place_name): 358 MB — too large for auto-download
  • Lite index (label + place_name only): 211 MB — still substantial
  • BM25 scoring works well (Porter stemming, English stopwords)
  • ATTACH over HTTP in DuckDB-WASM is supported, but downloading a 211-358 MB index is impractical
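
For readers unfamiliar with BM25, the scoring idea can be sketched in plain Python. This is the common Okapi BM25 formula with whitespace tokenization and no stemming or stopword removal. It is not DuckDB's implementation, and the function name `bm25_scores` and the `k1`/`b` defaults are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each document against the query with a basic Okapi BM25.

    Assumption: lowercase whitespace tokens, no stemming, no stopwords.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0
    terms = query.lower().split()
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in terms}

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            # Longer-than-average documents are penalized via b.
            norm = tf[t] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Unlike the fixed 3/2/1 field weights, this rewards repeated and rarer terms and normalizes for document length, which is why a full index is needed to compute it.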

Recommended next steps (not in this PR):

  1. Explore pre-tokenized search parquet (inverted index as parquet, much smaller)
  2. Consider on-demand FTS loading behind an "Enhanced Search" toggle
  3. Evaluate DuckDB text analytics functions (stemming without full index)
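
As a rough illustration of the first idea, a pre-tokenized inverted index maps each token to the ids of the samples containing it, and a multi-term query becomes an intersection of those id lists. The sketch below uses only the standard library; the names `build_inverted_index` and `lookup_all` are hypothetical, tokenization is a plain lowercase whitespace split, and writing the resulting (token, ids) rows to parquet is left out.

```python
from collections import defaultdict


def build_inverted_index(samples):
    """Map each lowercase token to the sorted list of sample ids containing it.

    `samples` is assumed to be a dict of {sample_id: searchable text}.
    """
    index = defaultdict(set)
    for sample_id, text in samples.items():
        for token in text.lower().split():
            index[token].add(sample_id)
    # Each (token, ids) pair could become one row of a search parquet file.
    return {token: sorted(ids) for token, ids in index.items()}


def lookup_all(index, query):
    """Return ids matching every query term by intersecting posting lists."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

Because only tokens and ids are stored, with no positions or scoring statistics, such a table may be considerably smaller than a full FTS index, though that would need to be measured on the real data.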

Test plan

  • Search "pottery" → results ranked by relevance (label matches first)
  • Search "pottery Cyprus" → only samples matching BOTH words
  • Search "basalt" → geological samples with label matches at top
  • Clear search → results return to random sampling
  • Verify tools/build_fts_index.py runs successfully with local parquet

🤖 Generated with Claude Code

Search improvements (immediate):
- Multi-term search: "pottery Cyprus" requires BOTH words to match
- Relevance ranking: label matches weighted 3x, place 2x, description 1x
- Results sorted by relevance score when searching (random for browsing)

FTS spike (future path, documented):
- Added tools/build_fts_index.py to build DuckDB FTS index offline
- Tested: 358 MB full index, 211 MB lite — too large for auto-download
- BM25 scoring works correctly (Porter stemming, stopwords)
- Next step: explore smaller index strategies or on-demand loading

Closes isamplesorg#84 (spike complete — findings documented in PR)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Explore DuckDB FTS extension for full-text search in Explorer
