
Add cross-filtering to Explorer facet counts#94

Open
rdhyee wants to merge 1 commit into isamplesorg:main from rdhyee:feature/cross-filtering

Conversation

@rdhyee
Contributor

@rdhyee rdhyee commented Apr 9, 2026

Summary

  • When any filter is active, facet counts update to reflect the intersection of all other active filters (standard faceted search behavior)
  • Selecting SESAR as source → material/context/specimen counts show only what exists in SESAR
  • 4 parallel GROUP BY queries via DuckDB-WASM, each excluding its own dimension
  • DOM manipulation updates count labels without re-rendering checkboxes (preserves selections)
  • Zero-count facet values dimmed for visual clarity
  • When no filters active, pre-computed 2KB summaries used (instant, unchanged)
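The per-dimension query construction described above ("each excluding its own dimension") can be sketched in a few lines. This is a hypothetical sketch, not the PR's actual JavaScript: the facet-to-column mapping follows the SQL later in this thread (`n` for source, `has_material_category` for material); the other `has_*` column names are assumptions.

```python
# Hypothetical facet-to-column mapping; `n` and `has_material_category`
# appear in the SQL sketch below, the rest are assumed names.
FACET_COLUMNS = {
    "source": "n",
    "material": "has_material_category",
    "context": "has_context_category",
    "specimen": "has_specimen_category",
}

def build_facet_queries(active_filters):
    """For each facet, build a GROUP BY query that applies every
    active filter EXCEPT the facet's own (standard cross-filtering)."""
    queries = {}
    for facet, column in FACET_COLUMNS.items():
        clauses = [
            f"{FACET_COLUMNS[other]} IN ({', '.join('?' for _ in values)})"
            for other, values in active_filters.items()
            if other != facet and values
        ]
        where = " AND ".join(clauses) or "TRUE"
        queries[facet] = (
            f"SELECT {column} AS facet_value, COUNT(*) AS count "
            f"FROM samples WHERE {where} GROUP BY {column}"
        )
    return queries

queries = build_facet_queries({"source": ["SESAR"]})
# The material/context/specimen queries filter on source, but the
# source query does not filter on itself, so its counts stay full.
```

The four resulting queries are independent, which is what allows them to run in parallel against DuckDB-WASM.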

Test plan

  • Load Explorer with no filters — counts should match pre-computed summaries
  • Check SESAR source → material counts should drop (no archaeology materials)
  • Check SESAR + Rock material → context/specimen counts narrow further
  • Clear all filters → counts restore to pre-computed values
  • Verify checkbox selections persist when counts update
  • Zero-count items should appear dimmed

🤖 Generated with Claude Code

When any filter is active, facet counts now reflect the intersection
of all OTHER active filters. For example, selecting SESAR as source
updates material/context/specimen counts to show only what exists
in SESAR data. Uses parallel GROUP BY queries via DuckDB-WASM.

Counts update via DOM manipulation to avoid resetting checkbox
selections. Zero-count facet values are dimmed for visual clarity.
When no filters are active, pre-computed summaries are used (instant).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rdhyee
Copy link
Copy Markdown
Contributor Author

rdhyee commented Apr 9, 2026

Pre-cached cross-filter strategy (from Eric Kansa via Slack)

Note: analysis below generated by Claude Code based on Eric's suggestion and the current Explorer architecture.

Eric suggested pre-caching facet counts for filtered subsets, similar to how Open Context uses Django caching with a cache-warming script. Here's the full analysis for our dataset:

Our facet dimensions

| Facet | Values | States (any + each value) |
|---|---|---|
| Source | 4 (SESAR, OpenContext, GEOME, Smithsonian) | 5 |
| Material | ~10 | 11 |
| Context (Sampled Feature) | ~8 | 9 |
| Specimen Type | ~8 | 9 |

Combinatorics

Single-value-per-facet (Eric's model): 5 × 11 × 9 × 9 = 4,455 combinations. Each combination stores counts for all ~30 facet values across all dimensions. That's ~130K rows — trivially small as a parquet file, probably under 1 MB.

Multi-value-per-facet (checkboxes allow this): each facet has 2^n subsets. That's 2^4 × 2^10 × 2^8 × 2^8 = 2^30 ≈ 1 billion combinations. Obviously not pre-cacheable.
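The arithmetic for both models can be checked in a few lines:

```python
# Approximate value counts per facet: source, material, context, specimen.
facet_value_counts = [4, 10, 8, 8]

# Single value (or "any") per facet: n + 1 states each.
single = 1
for n in facet_value_counts:
    single *= n + 1  # 5 * 11 * 9 * 9

# Arbitrary checkbox subsets per facet: 2**n states each.
multi = 1
for n in facet_value_counts:
    multi *= 2 ** n  # 2**(4+10+8+8) = 2**30
```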

Practical middle ground: pre-cache the single-value combinations (covers the most common interaction pattern — user clicks one checkbox at a time), and fall back to on-the-fly for multi-value selections. This is exactly Eric's "not in cache → calculate on the fly" pattern.

Comparison with current PR approach

| Approach | Latency | Extra download | Maintenance | Multi-value support |
|---|---|---|---|---|
| Pre-cached file (Eric's pattern) | Instant (~0ms lookup) | ~1 MB parquet | Rebuild when data changes | Falls back to on-the-fly |
| On-the-fly GROUP BY (this PR) | 1-3s per change | None | Zero | Works for any combination |
| Hybrid (pre-cache + on-the-fly fallback) | Instant for common, 1-3s for complex | ~1 MB parquet | Rebuild when data changes | Full coverage |

How we'd build the pre-cache

We already have a pipeline that generates isamples_202601_facet_summaries.parquet (2KB, the unfiltered counts). The pre-cache would be a natural extension:

```sql
-- For each combination of single-value filters, compute cross-filtered counts
-- Example: "given source=SESAR, what are the material counts?"
SELECT
  'SESAR' as filter_source,
  NULL as filter_material,
  NULL as filter_context,
  NULL as filter_specimen,
  'material' as facet_type,
  has_material_category as facet_value,
  COUNT(*) as count
FROM samples
WHERE n = 'SESAR'
  AND otype = 'MaterialSampleRecord'
  AND latitude IS NOT NULL
GROUP BY has_material_category
```

Multiply that pattern across all 4,455 combinations. DuckDB can generate the entire file in under a minute locally.
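Enumerating the combinations for the warming script is a single `itertools.product` over each facet's values plus a `None` "any" state. The value lists below are placeholders (the real ones would come from the unfiltered facet summaries):

```python
from itertools import product

# Placeholder facet values; real lists come from the facet summaries.
# None means "any" (no filter on that facet).
facets = {
    "source": ["SESAR", "OPENCONTEXT", "GEOME", "SMITHSONIAN"],
    "material": [f"material_{i}" for i in range(10)],
    "context": [f"context_{i}" for i in range(8)],
    "specimen": [f"specimen_{i}" for i in range(8)],
}

combinations = list(product(*([None] + vals for vals in facets.values())))
# 5 * 11 * 9 * 9 = 4455 single-value filter combinations; for each one,
# the warming script runs one cross-filtered GROUP BY per facet dimension
# and appends the rows to the pre-cache parquet file.
```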

The resulting file schema:

```
filter_source       | VARCHAR (nullable — NULL means "any")
filter_material     | VARCHAR (nullable)
filter_context      | VARCHAR (nullable)
filter_specimen     | VARCHAR (nullable)
facet_type          | VARCHAR (source/material/context/object_type)
facet_value         | VARCHAR
count               | BIGINT
```

In the browser, lookup is a simple filtered read — DuckDB-WASM with HTTP range requests would resolve it in milliseconds since the file is tiny.
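The lookup semantics reduce to exact-match filtering on the four `filter_*` columns, with `NULL` standing for "any". A pure-Python sketch of that logic (in practice it would be a DuckDB-WASM `SELECT ... WHERE filter_source = ? AND filter_material IS NULL ...` over the parquet file; `lookup_counts` and the sample row are illustrative only):

```python
def lookup_counts(rows, source=None, material=None, context=None, specimen=None):
    """rows: dicts following the pre-cache schema above.
    Returns the count rows for the given single-value selections;
    None on a parameter matches rows where that filter column is NULL."""
    active = {"filter_source": source, "filter_material": material,
              "filter_context": context, "filter_specimen": specimen}
    return [r for r in rows
            if all(r[col] == val for col, val in active.items())]

# Illustrative pre-cache row: material counts given source=SESAR.
rows = [
    {"filter_source": "SESAR", "filter_material": None,
     "filter_context": None, "filter_specimen": None,
     "facet_type": "material", "facet_value": "Rock", "count": 12345},
]
hits = lookup_counts(rows, source="SESAR")
```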

Recommendation

The hybrid approach is the clear winner:

  1. Ship the pre-cache file alongside the existing facet summaries on data.isamples.org
  2. Use it for instant single-value lookups (covers 90%+ of user interactions)
  3. Fall back to on-the-fly GROUP BY for multi-value or text search combinations
  4. Regenerate the pre-cache whenever we update the main parquet files

This mirrors how Open Context does it with Django caching, just with parquet files instead of a database cache layer. The "cache warming script" is a DuckDB query that runs offline.

