Add cross-filtering to Explorer facet counts #94
rdhyee wants to merge 1 commit into isamplesorg:main from
Conversation
When any filter is active, facet counts now reflect the intersection of all OTHER active filters. For example, selecting SESAR as the source updates the material/context/specimen counts to show only what exists in SESAR data.

- Uses parallel GROUP BY queries via DuckDB-WASM.
- Counts update via DOM manipulation to avoid resetting checkbox selections.
- Zero-count facet values are dimmed for visual clarity.
- When no filters are active, pre-computed summaries are used (instant).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
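The cross-filter rule above can be sketched in a few lines. This is a toy Python model with made-up facet names and records, not the actual implementation (which runs one GROUP BY query per facet in parallel via DuckDB-WASM):

```python
from collections import Counter

# Toy sample records; field names and values are illustrative only.
samples = [
    {"source": "SESAR", "material": "rock", "context": "marine"},
    {"source": "SESAR", "material": "soil", "context": "terrestrial"},
    {"source": "GEOME", "material": "rock", "context": "marine"},
]

def facet_counts(samples, active_filters, facet):
    """Count values of `facet` over rows matching all OTHER active filters."""
    others = {f: v for f, v in active_filters.items() if f != facet}
    rows = [s for s in samples
            if all(s[f] == v for f, v in others.items())]
    return Counter(r[facet] for r in rows)

# Selecting source=SESAR: material counts are filtered by source,
# while the source facet itself still counts all rows.
material = facet_counts(samples, {"source": "SESAR"}, "material")
source = facet_counts(samples, {"source": "SESAR"}, "source")
```

Excluding the facet's own filter from its count is what keeps the other checkboxes in a dimension selectable after one is checked.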
Pre-cached cross-filter strategy (from Eric Kansa via Slack)

Note: the analysis below was generated by Claude Code based on Eric's suggestion and the current Explorer architecture.

Eric suggested pre-caching facet counts for filtered subsets, similar to how Open Context uses Django caching with a cache-warming script. Here's the full analysis for our dataset:

Our facet dimensions
Combinatorics

Single-value-per-facet (Eric's model): 5 × 11 × 9 × 9 = 4,455 combinations. Each combination stores counts for all ~30 facet values across all dimensions. That's ~130K rows, trivially small as a parquet file, probably under 1 MB.

Multi-value-per-facet (checkboxes allow this): each facet has 2^n subsets, so 2^4 × 2^10 × 2^8 × 2^8 = 2^30 ≈ 1 billion combinations. Obviously not pre-cacheable.

Practical middle ground: pre-cache the single-value combinations (covering the most common interaction pattern, where the user clicks one checkbox at a time), and fall back to on-the-fly computation for multi-value selections. This is exactly Eric's "not in cache → calculate on the fly" pattern.

Comparison with current PR approach
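The combinatorics can be checked directly. Reading the exponents back, the counts appear to assume each dimension offers an "unfiltered" choice on top of its actual values (i.e. 4, 10, 8, and 8 values per dimension); that reading is an assumption:

```python
# Single-value-per-facet: each dimension is either unfiltered or set to
# exactly one value, giving n + 1 choices for a dimension with n values
# (4, 10, 8, 8 values here -- an assumption read off the exponents below).
single_value = 5 * 11 * 9 * 9             # 4,455 combinations
rows = single_value * 30                  # ~30 facet values stored per combo

# Multi-value-per-facet: any subset of each dimension's values can be checked.
multi_value = 2**4 * 2**10 * 2**8 * 2**8  # 2**30, about 1.07 billion
```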
How we'd build the pre-cache

We already have a pipeline that generates

```sql
-- For each combination of single-value filters, compute cross-filtered counts
-- Example: "given source=SESAR, what are the material counts?"
SELECT
  'SESAR' as filter_source,
  NULL as filter_material,
  NULL as filter_context,
  NULL as filter_specimen,
  'material' as facet_type,
  has_material_category as facet_value,
  COUNT(*) as count
FROM samples
WHERE n = 'SESAR'
  AND otype = 'MaterialSampleRecord'
  AND latitude IS NOT NULL
GROUP BY has_material_category
```

Multiply that pattern across all 4,455 combinations. DuckDB can generate the entire file in under a minute locally. The resulting file schema:

In the browser, lookup is a simple filtered read: DuckDB-WASM with HTTP range requests would resolve it in milliseconds since the file is tiny.

Recommendation

The hybrid approach is the clear winner:
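The "not in cache → calculate on the fly" lookup reduces to a keyed check with a fallback. A minimal Python sketch with hypothetical keys and made-up counts (the real version would be a filtered parquet read via DuckDB-WASM):

```python
# Hypothetical pre-cache keyed by the single-value filter tuple
# (filter_source, filter_material, filter_context, filter_specimen);
# values map (facet_type, facet_value) -> count. All numbers are made up.
precache = {
    ("SESAR", None, None, None): {("material", "rock"): 120,
                                  ("material", "soil"): 45},
}

def lookup_counts(filters, compute_on_the_fly):
    key = (filters.get("source"), filters.get("material"),
           filters.get("context"), filters.get("specimen"))
    cached = precache.get(key)
    if cached is not None:               # cache hit: answer instantly
        return cached
    return compute_on_the_fly(filters)   # cache miss: run GROUP BY queries

# Single-value selection hits the cache; a multi-value selection
# (two filters at once) misses and falls back to live computation.
hit = lookup_counts({"source": "SESAR"}, lambda f: {})
miss = lookup_counts({"source": "SESAR", "material": "rock"},
                     lambda f: {("context", "marine"): 7})
```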
This mirrors how Open Context does it with Django caching, just with parquet files instead of a database cache layer. The "cache warming script" is a DuckDB query that runs offline.
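As a sketch of that offline cache-warming step, here is the loop in Python using the stdlib sqlite3 module as a stand-in for DuckDB, over a toy two-column table (all table, column, and value names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (source TEXT, material TEXT)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [("SESAR", "rock"), ("SESAR", "soil"), ("GEOME", "rock")])

# One cache entry per single-value filter choice; None means "unfiltered".
cache = {}
for src in [None, "SESAR", "GEOME"]:
    where = "WHERE source = ?" if src is not None else ""
    params = (src,) if src is not None else ()
    rows = conn.execute(
        f"SELECT material, COUNT(*) FROM samples {where} GROUP BY material",
        params).fetchall()
    cache[src] = dict(rows)
```

The full version would loop over the product of all four dimensions (plus the unfiltered choice for each) and write the result out as parquet.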
Summary
Test plan
🤖 Generated with Claude Code