
[Hackathon] feat(frontend): rich dataset file preview with type detection #5112

Open
kunwp1 wants to merge 4 commits into apache:main from kunwp1:feat/dataset-file-rich-preview

Conversation

@kunwp1 (Contributor) commented May 16, 2026

Summary

The dataset file previewer (UserDatasetFileRendererComponent) previously identified files purely by extension and showed "Preview of the file type is currently not supported" for anything outside a small allow-list. This PR makes it identify-and-describe a much wider set of formats and surface rich per-format metadata.

What changed

  • Magic-byte detection: replaces extension-only guessing. Uses the file-type library (MIT) for ~100 common formats, plus hand-rolled signatures for Parquet (PAR1), Arrow (ARROW1), HDF5 (\x89HDF\r\n\x1a\n), NumPy .npy (\x93NUMPY), GGUF (GGUF), and Python pickle (\x80\x02..\x05). Extension-based refinement disambiguates ZIP containers (PyTorch .pt/.pth, Keras .keras, NumPy .npz) and gzipped R .rds. Text sniffing adds FASTA, FASTQ, VCF on top of the existing JSON / CSV / Markdown heuristics.

  • Lightweight header parsing for ML formats:

    • NumPy .npy → dtype, shape, byte-order, Fortran/C order
    • Safetensors → tensor count, total parameters, dtype breakdown, largest tensor, __metadata__
    • GGUF → version, tensor count, metadata KV count
  • Rich metadata per type displayed as a metadata strip above the preview:

    • CSV / XLSX: inferred column types (integer / double / boolean / date / string) and null counts shown directly under each column header in the data table; row & column counts; sheet count for XLSX
    • JSON: top-level type, item/key count, max nesting depth, per-key types
    • PDF: version, page count, /Info dictionary (Title, Author, Creator, Producer), encryption flag — rendered in <iframe>
    • Images: dimensions, aspect ratio (async via <img>.onload)
    • Video / audio: duration + resolution (async via loadedmetadata)
    • FASTA: total bases, GC content (skipped for proteins), min/max/avg sequence length
    • VCF: sample count parsed from #CHROM header, distinct chromosomes
    • Single-cell / R: AnnData (.h5ad), Seurat (.h5seurat, .rds), Loom — identification + "how to load" hint
  • Memory-safe rendering: text/CSV/JSON parsing is bounded at 10 MB (getPreviewSlice) to avoid browser OOM on large files. A warning banner appears when truncation occurs; truncation-affected stats (sequenceCountIsExact, variantCountIsExact) flip accordingly. turnOffAllDisplay now clears textContent / tableContent / currentFile so switching files reclaims memory. Per-MIME size cap raised to 1 GB from the prior 1–50 MB.

  • Async safety: ChangeDetectorRef injected and markForCheck() called from media loadedmetadata / <img>.onload callbacks, preserving the existing default change-detection strategy while supporting an eventual OnPush migration.
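For illustration, the manual-signature layer described above can be sketched roughly as follows. This is a minimal sketch, not the component's actual code: `MAGIC_SIGNATURES` and `detectByMagicBytes` are hypothetical names, and in the PR the file-type library handles the ~100 common formats before these hand-rolled signatures are consulted.

```typescript
// Hypothetical sketch of the hand-rolled signature table from this PR.
// Byte sequences mirror the magic strings listed above (PAR1, ARROW1,
// \x89HDF\r\n\x1a\n, \x93NUMPY, GGUF).
const MAGIC_SIGNATURES: { type: string; bytes: number[] }[] = [
  { type: "parquet", bytes: [0x50, 0x41, 0x52, 0x31] },                         // "PAR1"
  { type: "arrow",   bytes: [0x41, 0x52, 0x52, 0x4f, 0x57, 0x31] },             // "ARROW1"
  { type: "hdf5",    bytes: [0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a] }, // \x89HDF\r\n\x1a\n
  { type: "npy",     bytes: [0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59] },             // "\x93NUMPY"
  { type: "gguf",    bytes: [0x47, 0x47, 0x55, 0x46] },                         // "GGUF"
];

function detectByMagicBytes(buf: Uint8Array): string | undefined {
  for (const { type, bytes } of MAGIC_SIGNATURES) {
    if (buf.length >= bytes.length && bytes.every((b, i) => buf[i] === b)) {
      return type;
    }
  }
  // No signature matched: the caller falls back to extension refinement
  // and text sniffing, as described above.
  return undefined;
}
```

The extension-based refinement step then runs only on ambiguous container matches (e.g. a ZIP signature plus a `.pt` extension).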

Screen.Recording.2026-05-16.at.12.38.35.PM.mov

Files changed

  • frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/user-dataset-file-renderer/user-dataset-file-renderer.component.ts — detection logic, parsers, render dispatch, metadata getter
  • …/user-dataset-file-renderer.component.html — metadata strip, PDF iframe, truncation banner, column-type tags on table headers
  • …/user-dataset-file-renderer.component.scss — metadata pill / column tag styles
  • …/user-dataset-file-renderer.component.spec.ts — 28 new tests (30 total)
  • frontend/package.json, frontend/yarn.lock — adds file-type@22.0.1 (MIT)

Test plan

  • yarn ng test --include="**/user-dataset-file-renderer.component.spec.ts" --watch=false: 30 / 30 passing (existing 2 retained, 28 new covering magic-byte detection, extension refinement, NumPy/Safetensors/GGUF header parsing, and column type inference)
  • Frontend visual review: open various file types in the dataset previewer and verify the metadata strip + column type tags render
  • Before/after screenshots / GIFs (not included in this draft; per AGENTS.md these should be added before merge)
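For reviewers unfamiliar with the column-type-inference tests mentioned above, the heuristic can be sketched like this. The helper names (`inferCellType`, `inferColumnType`) and the exact regexes are assumptions for illustration; the real heuristics in the component may differ.

```typescript
// Hypothetical sketch of per-column type inference for the CSV/XLSX
// metadata strip: infer a type per cell, then widen across the column.
type ColumnType = "integer" | "double" | "boolean" | "date" | "string";

function inferCellType(cell: string): ColumnType {
  if (/^-?\d+$/.test(cell)) return "integer";
  if (/^-?\d*\.\d+([eE][+-]?\d+)?$/.test(cell)) return "double";
  if (/^(true|false)$/i.test(cell)) return "boolean";
  if (/^\d{4}-\d{2}-\d{2}/.test(cell) && !Number.isNaN(Date.parse(cell))) return "date";
  return "string";
}

function inferColumnType(values: string[]): { type: ColumnType; nulls: number } {
  const nonNull = values.filter(v => v.trim() !== "");
  const types = new Set(nonNull.map(inferCellType));
  let type: ColumnType;
  if (types.size === 1) type = [...types][0];
  // Mixed integer/double columns widen to double; any other mix is string.
  else if (types.size === 2 && types.has("integer") && types.has("double")) type = "double";
  else type = "string";
  return { type, nulls: values.length - nonNull.length };
}
```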

Notes for reviewers

  • This is exploratory hackathon work; a tracking issue should be filed before merge per AGENTS.md.
  • The 1 GB preview limit still triggers a full file download from the dataset service. A follow-up could add HTTP Range request support so identify-only formats (Parquet, HDF5, pickle, model containers) fetch only the first 64 KB.
  • HDF5 sub-types (.h5ad / .h5seurat / .loom) are distinguished by extension because they share identical magic bytes; deep parsing would need an HDF5 reader (e.g. h5wasm) which is intentionally not included.
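The Range-request follow-up could look roughly like this. This is a sketch under the assumption that the dataset service would honor RFC 7233 `Range` headers; the set of identify-only formats and both function names are hypothetical.

```typescript
// Hypothetical: formats where the preview only needs the header bytes,
// not the full file body.
const IDENTIFY_ONLY = new Set(["parquet", "hdf5", "pickle", "gguf", "safetensors"]);

function shouldFetchHeaderOnly(detectedType: string): boolean {
  return IDENTIFY_ONLY.has(detectedType);
}

// "bytes=0-65535" requests the first 64 KB; Range end offsets are inclusive.
function buildRangeHeader(maxBytes: number): { Range: string } {
  return { Range: `bytes=0-${maxBytes - 1}` };
}
```

A server that ignores the header would respond 200 with the full body instead of 206 Partial Content, so the client would still need to slice the response defensively.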

🤖 Generated with Claude Code

Replace extension-based file type guessing in the dataset previewer
with magic-byte detection (file-type library + manual signatures for
Parquet, Arrow, HDF5, NumPy .npy, GGUF, Python pickle), then extract
rich per-format metadata (CSV/XLSX column types and null counts,
JSON schema, PDF /Info, NumPy shape/dtype/byte-order, Safetensors
tensor breakdown and __metadata__, GGUF version, FASTA GC content
and sequence stats, VCF samples and chromosomes). PDF, AnnData,
Seurat, Loom, ML model containers, and bioinformatics text formats
now render meaningfully instead of "preview not supported."

Memory-safe rendering for large files: text/CSV/JSON content is
sliced to the first 10 MB before parsing to avoid browser OOM, with
a warning banner when truncation occurs; cached content is cleared
on file switch. Preview size cap raised to 1 GB.
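A minimal sketch of that 10 MB bounding step: the PR names `getPreviewSlice`, but this body is an assumed shape for illustration, not the actual implementation.

```typescript
// Assumed sketch: cap text/CSV/JSON parsing at the first 10 MB of a blob.
const MAX_PREVIEW_BYTES = 10 * 1024 * 1024;

function getPreviewSlice(blob: Blob): { slice: Blob; truncated: boolean } {
  const truncated = blob.size > MAX_PREVIEW_BYTES;
  return {
    slice: truncated ? blob.slice(0, MAX_PREVIEW_BYTES) : blob,
    // `truncated` would drive the warning banner and flip the
    // sequenceCountIsExact / variantCountIsExact flags described above.
    truncated,
  };
}
```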
@github-actions github-actions Bot added the feature, dependencies, and frontend labels May 16, 2026
@kunwp1 kunwp1 changed the title from "feat(frontend): rich dataset file preview with type detection" to "[Hackathon] feat(frontend): rich dataset file preview with type detection" May 16, 2026
kunwp1 added 3 commits May 16, 2026 12:12
Above 50 MB, skip the full-blob download from the dataset service and
show only the extension-based type identification + a "how to load"
hint. The dominant source of preview lag was the network download, not
the parsing — for a 500 MB Parquet file we used to fetch 500 MB just to
read its first 4 magic bytes.

Also drop the redundant "Size" pill from the metadata strip; size is
already visible in the dataset file listing and in the truncation
banner context.

Creates a new empty workflow and navigates to the editor when the user
clicks the button on a previewed file. The file path is copied to the
clipboard and a notification suggests which scan operator to drag in
(CSV → "CSV File Scan", etc.).

Empty-workflow + clipboard handoff is used instead of pre-populating
the operator JSON because hand-constructed OperatorPredicates skip the
operator-metadata schema validation, leading to workflows the editor
can't load. The same UX outcome with far higher reliability.

The dataset file renderer now passes addOp + fileName as query params
when navigating to the editor. The workspace component reads them
after the workflow finishes loading and adds the operator via
WorkflowUtilService.getNewOperatorPredicate() (so it goes through
schema validation), then strips the query params so a refresh doesn't
re-add. The file path is set via setOperatorProperty rather than
mutating the readonly operatorProperties dict directly.

Replaces the prior clipboard-handoff fallback. Unmapped file types
still get an empty workflow with no auto-add.
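The type-to-operator handoff could be sketched as follows. The map contents beyond "CSV File Scan" (the one mapping the commit message names) and the helper name are assumptions for illustration.

```typescript
// Hypothetical mapping from detected file type to the scan operator the
// workspace component would auto-add after the workflow loads.
const SCAN_OPERATOR_BY_TYPE: Record<string, string> = {
  csv: "CSV File Scan", // named in the commit message above; others assumed
};

function buildEditorQueryParams(
  fileType: string,
  filePath: string
): { addOp: string; fileName: string } | undefined {
  const addOp = SCAN_OPERATOR_BY_TYPE[fileType];
  // Unmapped types get an empty workflow with no auto-add: no query params.
  return addOp ? { addOp, fileName: filePath } : undefined;
}
```

On the workspace side, the params would be read once after load, the operator added via WorkflowUtilService.getNewOperatorPredicate() so it passes schema validation, and the params stripped so a refresh does not re-add.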