Date: 2025-12-10 Status: Recovery session - resumed after interrupted session, confirmed cleanup plan NOT executed
Session was interrupted before /wrap. Recovered context from SESSION_SUMMARY.md and dev-journal. Confirmed that the extensive repo cleanup plan was documented but NOT executed - archive/ directory doesn't exist yet.
Previous session (Dec 9) completed query performance profiling. This session was brief - just recovery and status check.
- Session Recovery: Recovered context from interrupted session via /resume
- Status Verification: Confirmed `archive/` dir not created, cleanup plan not executed
- Uncommitted Changes Identified: Found staged changes in 3 repos (website, python, export_client)
- Query Profiler Created: `scripts/profile_queries.py` benchmarks all key Cesium queries
- Performance Baseline Established: Remote R2 parquet query times measured
- Bottlenecks Identified: `list_contains()` JOINs and full-table scans are the culprits
- Optimization Strategy Defined: Two-tier data architecture with pre-computed artifacts
- Repo Inventory Documented: Full assessment of 14 repos with cleanup recommendations
| Query | Time | Verdict |
|---|---|---|
| Locations (cold) | 3,875ms | Too slow for initial load |
| Locations (warm) | 1,598ms | Still slow even cached |
| Point selection (direct) | 4,341ms | Unacceptable for click |
| Point selection (site-mediated) | 578ms | Borderline |
| Entity counts | 158ms | Fast enough |
| Classification | SKIPPED | Machine-killer (minutes+, GB memory) |
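The table above reports wall-clock times; a minimal sketch of the kind of harness `scripts/profile_queries.py` would use (the lambda is a stand-in workload — a real run would execute a DuckDB query against the remote parquet):

```python
import time

def benchmark(label, fn, runs=3):
    """Time fn() over several runs; the first run approximates 'cold',
    the minimum approximates 'warm' (caches populated)."""
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    print(f"{label}: cold {times_ms[0]:.0f}ms, warm {min(times_ms):.0f}ms")
    return times_ms

# Stand-in workload; a real run would call e.g. con.execute(locations_sql)
times = benchmark("entity counts", lambda: sum(range(100_000)))
```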
Root Causes:
- Locations: Scanning 19.5M rows for 5.98M geocodes, returning 47 columns when 3 are needed
- Point selection: `list_contains()` on arrays requires a full table scan - no index
- Classification: LEFT JOINs with `list_contains()` = exponential complexity
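The point-selection cost can be seen in miniature with plain Python on hypothetical stand-in data: a membership test against a list-valued column must touch every row, while a pre-computed inverted lookup (the same idea as the proposed `location_samples_lookup.parquet` artifact) answers directly:

```python
# Hypothetical miniature of the geo -> samples problem (stand-in data)
rows = [
    {"geo_pid": f"geo:{i}", "sample_pids": [f"samp:{i*2}", f"samp:{i*2+1}"]}
    for i in range(10_000)
]

def find_geo_by_scan(sample_pid):
    # What list_contains() forces DuckDB to do: scan every row's array
    return [r["geo_pid"] for r in rows if sample_pid in r["sample_pids"]]

# Pre-computed inverted index: sample_pid -> geo_pids (built once at ETL time)
index = {}
for r in rows:
    for s in r["sample_pids"]:
        index.setdefault(s, []).append(r["geo_pid"])

assert find_geo_by_scan("samp:42") == index["samp:42"]  # same answer, no scan
```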
| File | Description | Keep/Regenerate |
|---|---|---|
| `scripts/profile_queries.py` | Query benchmarking tool | Keep |
| `/tmp/query_profile_results.txt` | Latest profiling output | Regenerate |
| File | Description | Keep/Regenerate |
|---|---|---|
| `/tmp/zenodo_narrow_strict.parquet` | Narrow PQG (709MB) | Keep - on R2 |
| `/tmp/zenodo_wide_strict.parquet` | Wide PQG (242MB) | Keep - on R2 |
| `~/.claude/skills/gemini/SKILL.md` | Gemini skill doc | Keep |
- Wide: https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202512_wide.parquet
- Narrow: https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202512_narrow.parquet
From Eric's email (confirmed Dec 10) - three-part plan:
- Use PostgreSQL dump Dave can recover
- Create comprehensive iSamples Central PQG export
- Archive to Zenodo for preservation
Requirements for parquet-powered iSamples Central:
| Feature | Implementation Notes |
|---|---|
| Global Cesium map | Use H3 hexagonal indexing (https://h3geo.org/) to aggregate locations for fast rendering |
| Faceted filtering | Facets with counts: object type, material type, collection |
| Map updates on filter | Filtering facets should update world map dynamically |
| Click → sample table | Point click shows sample records (like OpenContext demo) |
| Links to source | Sample results link back to home collections |
| Full-text search | Search updates world map (stretch goal?) |
Key insight from Eric: May need even MORE denormalized parquet than "PQG wide" - specifically designed for these UI needs.
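A sketch of what H3-style aggregation buys: collapsing millions of points into per-cell counts for rendering. Real code would use the h3-py library (`h3.latlng_to_cell(lat, lon, res)` in v4); here a simple degree-grid rounding stands in for H3 cells, and the points are hypothetical:

```python
from collections import Counter

# Hypothetical sample locations (lat, lon)
points = [(37.77, -122.42), (37.78, -122.41), (40.71, -74.01)] * 1000

def grid_cell(lat, lon, res=0):
    # Stand-in for an H3 cell id: snap coordinates to a coarse grid
    step = 1.0 / (2 ** res)
    return (round(lat / step) * step, round(lon / step) * step)

# Aggregate: one row per cell with a count, instead of millions of raw points
cells = Counter(grid_cell(lat, lon, res=1) for lat, lon in points)

# Each cell becomes one clustered marker with its count
for (lat, lon), n in sorted(cells.items()):
    print(f"cell({lat:.1f},{lon:.1f}): {n} samples")
```

Writing this aggregation out per resolution is what would populate the proposed `locations_h3.parquet` artifact.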
- Most records lack thumbnails
- Use collection logos as stand-ins
- Use NounProject icons (Eric has account) for sample object types
- Icons from: https://isamples.org/models/generated/vocabularies/material_sample_object_type.html
Aligns with Eric's Part 2 - simplified parquet for frontend
Recommended artifacts:
- `locations_h3.parquet` (~1-5MB) - NEW, based on Eric's suggestion
  - H3 hexagonal aggregation at multiple resolutions
  - h3_index, count, representative_lat, representative_lon
  - For fast initial map render with clustering
- `locations_summary.parquet` (~5-10MB)
  - Pre-filtered: pid, latitude, longitude, location_type
  - Only 5.98M rows × 4 columns
  - Target: <500ms initial load
- `facets_precomputed.parquet` (~1MB) - NEW, for Eric's faceting
  - Pre-aggregated counts by: object_type, material_type, collection
  - Enables instant facet rendering
- `location_samples_lookup.parquet` (~50MB?)
  - Pre-computed: geo_pid → [sample_pids, sample_labels, source_url]
  - Eliminates `list_contains()` JOINs at query time
  - Target: <100ms point selection
- Keep the full wide parquet for detail drill-down only
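Of these, `facets_precomputed.parquet` is the simplest to illustrate: it is just a GROUP BY materialized at ETL time. A stdlib sketch of the shape, with hypothetical records (production would run a DuckDB aggregation and write the result to parquet):

```python
from collections import Counter

# Hypothetical sample records (stand-ins for rows in the wide parquet)
samples = [
    {"object_type": "Core", "material_type": "Rock", "collection": "SESAR"},
    {"object_type": "Core", "material_type": "Rock", "collection": "SESAR"},
    {"object_type": "Sherd", "material_type": "Ceramic", "collection": "OpenContext"},
]

# Pre-aggregate counts for each facet dimension, done once at ETL time
facets = {
    dim: Counter(s[dim] for s in samples)
    for dim in ("object_type", "material_type", "collection")
}

# The UI can now render facet counts instantly - no scan of the wide parquet
print(facets["object_type"])          # Counter({'Core': 2, 'Sherd': 1})
print(facets["collection"]["SESAR"])  # 2
```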
- Merge `parquet_cesium_wide.qmd` and `parquet_cesium_isamples_wide.qmd`
- Update to use optimized artifacts
- Consider React SPA for production quality
- Features: search, filter by source, map exploration, export
- Add collection logos/NounProject icons per Eric's suggestion
- Normalize `sample_identifier_col` → `sample_identifier`
- Add column order tests
Goal: Pivot fully to parquet workflows while preserving API code for potential future revival.
Strategy - "Soft Deprecation":
- Don't delete API client code - move to `_legacy/` or mark with deprecation warnings
- Update all tutorials/examples to use parquet-first patterns
- Add clear banners/callouts: "iSamples Central API is offline - using parquet archive"
- Keep API code importable but not in default examples
Repositories affected:
| Repo | Action |
|---|---|
| `isamples-python` | Mark `IsbClient`, `IsbClient2`, `ISamplesBulkHandler` as deprecated; keep in codebase |
| `isamplesorg.github.io` | Remove/archive API-dependent tutorials; focus on parquet demos |
| `pqg` | Already parquet-native - no changes needed |
Code preservation pattern:
```python
# In isamples-python/src/isamples_client/isbclient.py
import warnings

class IsbClient:
    """
    DEPRECATED: iSamples Central API is offline as of 2025.
    Use parquet workflows instead - see examples/basic/geoparquet.ipynb
    This class is preserved for potential future API revival.
    """
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "IsbClient is deprecated - iSamples Central API offline. "
            "Use parquet workflows: examples/basic/geoparquet.ipynb",
            DeprecationWarning,
            stacklevel=2,
        )
        ...
```

Documentation updates:
- README.md: Lead with parquet, mention API as "archived"
- CLAUDE.md: Already notes API offline - strengthen language
- Tutorials: Archive API-dependent ones, create new parquet-only versions
Parquet format focus (per Eric's direction):
- PQG narrow format: Full fidelity, archival
- PQG wide format: Query-optimized, entity-centric
- Frontend-optimized: H3 aggregated, pre-computed facets (new)
Inventory completed Dec 9, 2025 - Assessment of all iSamples repos:
| Repo | Last Commit | 6-Mo Commits | Size | Notes |
|---|---|---|---|---|
| `isamplesorg.github.io` | Dec 6 | 71 | 1.4G | Primary website, Cesium demos |
| `isamples-python` | Dec 4 | 30 | 997M | Python client, Jupyter examples |
| `pqg` | Dec 6 | 21 | 18G | Property graph framework |
| Repo | Last Commit | Notes |
|---|---|---|
| `export_client` | Dec 5 | CLI for batch downloads |
| `isamplesorg-metadata` | Nov 14 | LinkML schemas, vocabularies |
| Repo | Last Commit | Size | Notes |
|---|---|---|---|
| `isamples_inabox` | Feb 2023 | 19M | Original server (PostgreSQL/Solr/FastAPI) |
| `isamples_docker` | Mar 2022 | 340M | Docker deployment - obsolete |
| `isamples_docker_upstream` | Mar 2023 | 357M | Docker mirror - obsolete |
| `isamples-ansible` | Mar 2023 | 381M | Ansible deployment - obsolete |
| `noid-generation` | Oct 2023 | 168M | NOID identifier tool |
| `noid-1` | Oct 2021 | 372K | Original NOID Python port |
| `noidy` | Apr 2023 | 284K | NOID variant |
| `pynoid` | Apr 2023 | 192K | NOID alternative |
| `ezid` | May 2023 | 93M | EZID identifier service |
| `ezid-client-tools` | Jun 2023 | 1.6M | EZID client tools |
| `opencontext_rdhyee` | Mar 2023 | 373M | Exploratory OC work |
Keep (essential docs):
- `CLAUDE.md`, `SESSION_SUMMARY.md` - Active guidance
- `EDGE_TYPE_FLOW.md`, `PQG_LEARNING_GUIDE.md` - Valuable reference

Archive/Delete (Oct 2025 scratch files):
- `test_*.py`, `test_*.js` - Exploratory test scripts
- `*_output.txt` - Test outputs (regenerable)
- `find_pkap_geos.py`, `investigate_path1.py` - One-off scripts
- `package.json`, `node_modules/` - Minimal npm setup (not needed)
- `GEMINI.md` - Empty placeholder
- `IMPLEMENTATION_SUMMARY.md`, `BILLING_UPDATE.md`, `QUERY_COMPARISON.md`, `AGENTS.md` - Possibly stale
Suggested cleanup action:

```shell
cd /Users/raymondyee/C/src/iSamples
mkdir -p archive
mv isamples_inabox isamples_docker isamples_docker_upstream isamples-ansible archive/
mv noid-generation noid-1 noidy pynoid ezid ezid-client-tools archive/
mv opencontext_rdhyee archive/
# Consider: rm -rf node_modules package.json package-lock.json
```

Space recovery potential: ~1.7GB from archiving legacy repos
Most Active Files (commits since Jun 2025):
| File | Commits | Status |
|---|---|---|
| `tutorials/parquet_cesium.qmd` | 27 | ACTIVE - main Cesium demo |
| `_quarto.yml` | 9 | Config |
| `tutorials/zenodo_isamples_analysis.qmd` | 7 | ACTIVE |
| `index.qmd` | 6 | Homepage |
| `tutorials/parquet_cesium_wide.qmd` | 2 | ACTIVE - wide format demo |
| `tutorials/parquet_cesium_isamples_wide.qmd` | 1 | ACTIVE - full iSamples demo |
Space Hogs:
- `assets/oc_isamples_pqg.parquet` - 691MB (duplicated in docs/assets!)
- `docs/assets/` - 695MB (duplicate of assets/)
Cleanup Opportunities:
```shell
# Remove duplicate parquet (use R2 URL instead)
rm assets/oc_isamples_pqg.parquet
# Or add to .gitignore and reference R2 URL in tutorials
```

Files to consider archiving:
- `PERFORMANCE_OPTIMIZATION_PLAN.md`, `OPTIMIZATION_SUMMARY.md`, `LAZY_LOADING_IMPLEMENTATION.md` - One-off planning docs
Most Active Files:
| File | Commits | Status |
|---|---|---|
| `examples/basic/oc_parquet_analysis_enhanced.ipynb` | 13 | ACTIVE |
| `examples/basic/geoparquet.ipynb` | 5 | ACTIVE - main parquet demo |
| `examples/basic/isample-archive.ipynb` | 4 | ACTIVE |
| `README.md`, `CLAUDE.md`, `pyproject.toml` | 4 each | Config/docs |
| `src/isamples_client/isbclient.py` | 1 | API client (TO DEPRECATE) |
Space Hogs:
- `examples/basic/oc_isamples_pqg.parquet` - 691MB
- `examples/basic/oc_isamples_pqg_wide.parquet` - 275MB
Cleanup Opportunities:
```shell
# Add parquet files to .gitignore, document R2 URLs instead
echo "*.parquet" >> .gitignore
# Or keep one canonical copy and symlink
```

Files to consider archiving:
- `PQG_INTEGRATION_PLAN.md`, `ISAMPLES_MODEL_ACTION_PLAN.md` - Planning docs (may be stale)
- `examples/spatial/` - Check if still relevant
- Multiple `*_output.txt` files
Most Active Files:
| File | Commits | Status |
|---|---|---|
| `pqg/sql_converter.py` | 8 | ACTIVE - core converter |
| `pqg/pqg_singletable.py` | 4 | ACTIVE - main implementation |
| `README.md` | 4 | Docs |
| `pqg/typed_edges.py` | 2 | ACTIVE - typed edge support |
| `pqg/schemas/*.py` | 2 each | ACTIVE - schema validation |
Space Hogs (CRITICAL):
- `.git/` - 17GB (likely large parquet commits in history)
- `.venv/` - 690MB (normal for DuckDB/PyArrow)
Cleanup Opportunities:
```shell
# Check git history for large files
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n -r | head -20
# Consider: git filter-repo to remove large parquet files from history
# Or: fresh clone without history
```

Root cause investigation needed: Why is `.git` 17GB? Likely committed large parquet files that were later removed.
Most Active Files:
| File | Commits | Status |
|---|---|---|
| `isamples_export_client/pqg_converter.py` | 4 | ACTIVE |
| `README.md` | 2 | Docs |
Status: Clean, well-organized. No cleanup needed.
Most Active Files:
| File | Commits | Status |
|---|---|---|
| `src/docs/*.md` | 1 each | Documentation updates |
Status: Foundational schema repo. Stable. No cleanup needed.
- pqg `.git` cleanup - 17GB is excessive; investigate and consider `git filter-repo` or a fresh clone
- Remove duplicate parquets - `assets/oc_isamples_pqg.parquet` duplicated in website repo
- Add `.gitignore` for parquet - reference R2 URLs instead of committing 691MB files
- Archive planning docs - move stale `*_PLAN.md` files to `archive/` in each repo
- Clean root-level scratch files - test scripts, output files in `/Users/raymondyee/C/src/iSamples/`
```shell
# Safe mode (skips classification query)
~/.pyenv/versions/myenv/bin/python scripts/profile_queries.py --remote-only

# Full mode (WARNING: high memory/CPU)
~/.pyenv/versions/myenv/bin/python scripts/profile_queries.py --full

# Local only (if file downloaded)
curl -o /tmp/isamples_202512_wide.parquet https://pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/isamples_202512_wide.parquet
~/.pyenv/versions/myenv/bin/python scripts/profile_queries.py --local-only
```

- R2 Credentials: Stored in 1Password, use `op run --env-file=...` pattern
- Gemini CLI: `/opt/homebrew/bin/gemini`
- Codex CLI: `/opt/homebrew/bin/codex exec "prompt" -o /tmp/output.txt`
- Artifact storage: Upload optimized parquet files to R2? Or generate on-demand?
- Pre-compute strategy: Run classification once during ETL vs compute lazily?
- Location type: Should `location_type` be pre-computed (blue/purple/orange classification)?
- Read this SESSION_SUMMARY.md
- Review profiling results: `/tmp/query_profile_results.txt`
- Next action: Create `locations_summary.parquet` generation script
- Public URLs above are live and working
Last Updated: 2025-12-09 by Claude Code (Opus 4.5) Repository: isamplesorg.github.io (fork at rdhyee/isamplesorg.github.io) Focus: Query performance optimization, intermediary artifact design Next Action: Generate optimized parquet artifacts Session Status: IN PROGRESS