
Add cross-filtering to Explorer facet counts#94

Open
rdhyee wants to merge 1 commit into isamplesorg:main from rdhyee:feature/cross-filtering

Conversation

@rdhyee
Contributor

@rdhyee rdhyee commented Apr 9, 2026

Summary

  • When any filter is active, facet counts update to reflect the intersection of all other active filters (standard faceted search behavior)
  • Selecting SESAR as source → material/context/specimen counts show only what exists in SESAR
  • 4 parallel GROUP BY queries via DuckDB-WASM, each excluding its own dimension
  • DOM manipulation updates count labels without re-rendering checkboxes (preserves selections)
  • Zero-count facet values dimmed for visual clarity
  • When no filters active, pre-computed 2KB summaries used (instant, unchanged)
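The per-dimension query construction described above ("each excluding its own dimension") can be sketched in a few lines. This is a hypothetical sketch, not the PR's actual JavaScript: the facet-to-column mapping follows the SQL later in this thread (`n` for source, `has_material_category` for material); the other `has_*` column names are assumptions.

```python
# Hypothetical facet-to-column mapping; `n` and `has_material_category`
# appear in the SQL sketch below, the rest are assumed names.
FACET_COLUMNS = {
    "source": "n",
    "material": "has_material_category",
    "context": "has_context_category",
    "specimen": "has_specimen_category",
}

def build_facet_queries(active_filters):
    """For each facet, build a GROUP BY query that applies every
    active filter EXCEPT the facet's own (standard cross-filtering)."""
    queries = {}
    for facet, column in FACET_COLUMNS.items():
        clauses = [
            f"{FACET_COLUMNS[other]} IN ({', '.join('?' for _ in values)})"
            for other, values in active_filters.items()
            if other != facet and values
        ]
        where = " AND ".join(clauses) or "TRUE"
        queries[facet] = (
            f"SELECT {column} AS facet_value, COUNT(*) AS count "
            f"FROM samples WHERE {where} GROUP BY {column}"
        )
    return queries

queries = build_facet_queries({"source": ["SESAR"]})
# The material/context/specimen queries filter on source, but the
# source query does not filter on itself, so its counts stay full.
```

The four resulting queries are independent, which is what allows them to run in parallel against DuckDB-WASM.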

Test plan

  • Load Explorer with no filters — counts should match pre-computed summaries
  • Check SESAR source → material counts should drop (no archaeology materials)
  • Check SESAR + Rock material → context/specimen counts narrow further
  • Clear all filters → counts restore to pre-computed values
  • Verify checkbox selections persist when counts update
  • Zero-count items should appear dimmed

🤖 Generated with Claude Code

When any filter is active, facet counts now reflect the intersection
of all OTHER active filters. For example, selecting SESAR as source
updates material/context/specimen counts to show only what exists
in SESAR data. Uses parallel GROUP BY queries via DuckDB-WASM.

Counts update via DOM manipulation to avoid resetting checkbox
selections. Zero-count facet values are dimmed for visual clarity.
When no filters are active, pre-computed summaries are used (instant).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented Apr 9, 2026

Pre-cached cross-filter strategy (from Eric Kansa via Slack)

Note: analysis below generated by Claude Code based on Eric's suggestion and the current Explorer architecture.

Eric suggested pre-caching facet counts for filtered subsets, similar to how Open Context uses Django caching with a cache-warming script. Here's the full analysis for our dataset:

Our facet dimensions

| Facet | Values | States (any + each value) |
|---|---|---|
| Source | 4 (SESAR, OpenContext, GEOME, Smithsonian) | 5 |
| Material | ~10 | 11 |
| Context (Sampled Feature) | ~8 | 9 |
| Specimen Type | ~8 | 9 |

Combinatorics

Single-value-per-facet (Eric's model): 5 × 11 × 9 × 9 = 4,455 combinations. Each combination stores counts for all ~30 facet values across all dimensions. That's ~130K rows — trivially small as a parquet file, probably under 1 MB.

Multi-value-per-facet (checkboxes allow this): each facet has 2^n subsets. That's 2^4 × 2^10 × 2^8 × 2^8 = 2^30 ≈ 1 billion combinations. Obviously not pre-cacheable.
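The arithmetic for both models can be checked in a few lines:

```python
# Approximate value counts per facet: source, material, context, specimen.
facet_value_counts = [4, 10, 8, 8]

# Single value (or "any") per facet: n + 1 states each.
single = 1
for n in facet_value_counts:
    single *= n + 1  # 5 * 11 * 9 * 9

# Arbitrary checkbox subsets per facet: 2**n states each.
multi = 1
for n in facet_value_counts:
    multi *= 2 ** n  # 2**(4+10+8+8) = 2**30
```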

Practical middle ground: pre-cache the single-value combinations (covers the most common interaction pattern — user clicks one checkbox at a time), and fall back to on-the-fly for multi-value selections. This is exactly Eric's "not in cache → calculate on the fly" pattern.

Comparison with current PR approach

| Approach | Latency | Extra download | Maintenance | Multi-value support |
|---|---|---|---|---|
| Pre-cached file (Eric's pattern) | Instant (~0ms lookup) | ~1 MB parquet | Rebuild when data changes | Falls back to on-the-fly |
| On-the-fly GROUP BY (this PR) | 1-3s per change | None | Zero | Works for any combination |
| Hybrid (pre-cache + on-the-fly fallback) | Instant for common, 1-3s for complex | ~1 MB parquet | Rebuild when data changes | Full coverage |

How we'd build the pre-cache

We already have a pipeline that generates isamples_202601_facet_summaries.parquet (2KB, the unfiltered counts). The pre-cache would be a natural extension:

```sql
-- For each combination of single-value filters, compute cross-filtered counts
-- Example: "given source=SESAR, what are the material counts?"
SELECT
  'SESAR' as filter_source,
  NULL as filter_material,
  NULL as filter_context,
  NULL as filter_specimen,
  'material' as facet_type,
  has_material_category as facet_value,
  COUNT(*) as count
FROM samples
WHERE n = 'SESAR'
  AND otype = 'MaterialSampleRecord'
  AND latitude IS NOT NULL
GROUP BY has_material_category
```

Multiply that pattern across all 4,455 combinations. DuckDB can generate the entire file in under a minute locally.
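Enumerating the combinations for the warming script is a single `itertools.product` over each facet's values plus a `None` "any" state. The value lists below are placeholders (the real ones would come from the unfiltered facet summaries):

```python
from itertools import product

# Placeholder facet values; real lists come from the facet summaries.
# None means "any" (no filter on that facet).
facets = {
    "source": ["SESAR", "OPENCONTEXT", "GEOME", "SMITHSONIAN"],
    "material": [f"material_{i}" for i in range(10)],
    "context": [f"context_{i}" for i in range(8)],
    "specimen": [f"specimen_{i}" for i in range(8)],
}

combinations = list(product(*([None] + vals for vals in facets.values())))
# 5 * 11 * 9 * 9 = 4455 single-value filter combinations; for each one,
# the warming script runs one cross-filtered GROUP BY per facet dimension
# and appends the rows to the pre-cache parquet file.
```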

The resulting file schema:

```
filter_source       | VARCHAR (nullable — NULL means "any")
filter_material     | VARCHAR (nullable)
filter_context      | VARCHAR (nullable)
filter_specimen     | VARCHAR (nullable)
facet_type          | VARCHAR (source/material/context/object_type)
facet_value         | VARCHAR
count               | BIGINT
```

In the browser, lookup is a simple filtered read — DuckDB-WASM with HTTP range requests would resolve it in milliseconds since the file is tiny.
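The lookup semantics reduce to exact-match filtering on the four `filter_*` columns, with `NULL` standing for "any". A pure-Python sketch of that logic (in practice it would be a DuckDB-WASM `SELECT ... WHERE filter_source = ? AND filter_material IS NULL ...` over the parquet file; `lookup_counts` and the sample row are illustrative only):

```python
def lookup_counts(rows, source=None, material=None, context=None, specimen=None):
    """rows: dicts following the pre-cache schema above.
    Returns the count rows for the given single-value selections;
    None on a parameter matches rows where that filter column is NULL."""
    active = {"filter_source": source, "filter_material": material,
              "filter_context": context, "filter_specimen": specimen}
    return [r for r in rows
            if all(r[col] == val for col, val in active.items())]

# Illustrative pre-cache row: material counts given source=SESAR.
rows = [
    {"filter_source": "SESAR", "filter_material": None,
     "filter_context": None, "filter_specimen": None,
     "facet_type": "material", "facet_value": "Rock", "count": 12345},
]
hits = lookup_counts(rows, source="SESAR")
```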

Recommendation

The hybrid approach is the clear winner:

  1. Ship the pre-cache file alongside the existing facet summaries on data.isamples.org
  2. Use it for instant single-value lookups (covers 90%+ of user interactions)
  3. Fall back to on-the-fly GROUP BY for multi-value or text search combinations
  4. Regenerate the pre-cache whenever we update the main parquet files

This mirrors how Open Context does it with Django caching, just with parquet files instead of a database cache layer. The "cache warming script" is a DuckDB query that runs offline.

