
[Hackathon] feat(frontend): rich dataset file preview with type detection #5112

Open
kunwp1 wants to merge 4 commits into apache:main from kunwp1:feat/dataset-file-rich-preview

Conversation

@kunwp1 (Contributor) commented May 16, 2026

Summary

The dataset file previewer (UserDatasetFileRendererComponent) previously identified files purely by extension and showed "Preview of the file type is currently not supported" for anything outside a small allow-list. This PR makes it identify-and-describe a much wider set of formats and surface rich per-format metadata.

What changed

  • Magic-byte detection: replaces extension-only guessing. Uses the file-type library (MIT) for ~100 common formats, plus hand-rolled signatures for Parquet (PAR1), Arrow (ARROW1), HDF5 (\x89HDF\r\n\x1a\n), NumPy .npy (\x93NUMPY), GGUF (GGUF), and Python pickle (\x80\x02..\x05). Extension-based refinement disambiguates ZIP containers (PyTorch .pt/.pth, Keras .keras, NumPy .npz) and gzipped R .rds. Text sniffing adds FASTA, FASTQ, VCF on top of the existing JSON / CSV / Markdown heuristics.

  • Lightweight header parsing for ML formats:

    • NumPy .npy → dtype, shape, byte-order, Fortran/C order
    • Safetensors → tensor count, total parameters, dtype breakdown, largest tensor, __metadata__
    • GGUF → version, tensor count, metadata KV count
  • Rich metadata per type displayed as a metadata strip above the preview:

    • CSV / XLSX: inferred column types (integer / double / boolean / date / string) and null counts shown directly under each column header in the data table; row & column counts; sheet count for XLSX
    • JSON: top-level type, item/key count, max nesting depth, per-key types
    • PDF: version, page count, /Info dictionary (Title, Author, Creator, Producer), encryption flag — rendered in <iframe>
    • Images: dimensions, aspect ratio (async via <img>.onload)
    • Video / audio: duration + resolution (async via loadedmetadata)
    • FASTA: total bases, GC content (skipped for proteins), min/max/avg sequence length
    • VCF: sample count parsed from #CHROM header, distinct chromosomes
    • Single-cell / R: AnnData (.h5ad), Seurat (.h5seurat, .rds), Loom — identification + "how to load" hint
  • Memory-safe rendering: text/CSV/JSON parsing is bounded at 10 MB (getPreviewSlice) to avoid browser OOM on large files. A warning banner appears when truncation occurs; truncation-affected stats (sequenceCountIsExact, variantCountIsExact) flip accordingly. turnOffAllDisplay now clears textContent / tableContent / currentFile so switching files reclaims memory. Per-MIME size cap raised to 1 GB from the prior 1–50 MB.

  • Async safety: ChangeDetectorRef injected and markForCheck() called from media loadedmetadata / <img>.onload callbacks, preserving the existing default change-detection strategy while supporting an eventual OnPush migration.
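For illustration, the manual-signature layer described above can be sketched roughly as follows. This is a minimal sketch, not the component's actual code: `MAGIC_SIGNATURES` and `detectByMagicBytes` are hypothetical names, and in the PR the file-type library handles the ~100 common formats before these hand-rolled signatures are consulted.

```typescript
// Hypothetical sketch of the hand-rolled signature table from this PR.
// Byte sequences mirror the magic strings listed above (PAR1, ARROW1,
// \x89HDF\r\n\x1a\n, \x93NUMPY, GGUF).
const MAGIC_SIGNATURES: { type: string; bytes: number[] }[] = [
  { type: "parquet", bytes: [0x50, 0x41, 0x52, 0x31] },                         // "PAR1"
  { type: "arrow",   bytes: [0x41, 0x52, 0x52, 0x4f, 0x57, 0x31] },             // "ARROW1"
  { type: "hdf5",    bytes: [0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a] }, // \x89HDF\r\n\x1a\n
  { type: "npy",     bytes: [0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59] },             // "\x93NUMPY"
  { type: "gguf",    bytes: [0x47, 0x47, 0x55, 0x46] },                         // "GGUF"
];

function detectByMagicBytes(buf: Uint8Array): string | undefined {
  for (const { type, bytes } of MAGIC_SIGNATURES) {
    if (buf.length >= bytes.length && bytes.every((b, i) => buf[i] === b)) {
      return type;
    }
  }
  // No signature matched: the caller falls back to extension refinement
  // and text sniffing, as described above.
  return undefined;
}
```

The extension-based refinement step then runs only on ambiguous container matches (e.g. a ZIP signature plus a `.pt` extension).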

Screen.Recording.2026-05-16.at.12.38.35.PM.mov

Files changed

  • frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/user-dataset-file-renderer/user-dataset-file-renderer.component.ts — detection logic, parsers, render dispatch, metadata getter
  • …/user-dataset-file-renderer.component.html — metadata strip, PDF iframe, truncation banner, column-type tags on table headers
  • …/user-dataset-file-renderer.component.scss — metadata pill / column tag styles
  • …/user-dataset-file-renderer.component.spec.ts — 28 new tests (30 total)
  • frontend/package.json, frontend/yarn.lock — adds file-type@22.0.1 (MIT)

Test plan

  • yarn ng test --include="**/user-dataset-file-renderer.component.spec.ts" --watch=false: 30 / 30 passing (existing 2 retained, 28 new covering magic-byte detection, extension refinement, NumPy/Safetensors/GGUF header parsing, and column type inference)
  • Frontend visual review: open various file types in the dataset previewer and verify the metadata strip + column type tags render
  • Before/after screenshots / GIFs (not included in this draft; per AGENTS.md these should be added before merge)
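For reviewers unfamiliar with the column-type-inference tests mentioned above, the heuristic can be sketched like this. The helper names (`inferCellType`, `inferColumnType`) and the exact regexes are assumptions for illustration; the real heuristics in the component may differ.

```typescript
// Hypothetical sketch of per-column type inference for the CSV/XLSX
// metadata strip: infer a type per cell, then widen across the column.
type ColumnType = "integer" | "double" | "boolean" | "date" | "string";

function inferCellType(cell: string): ColumnType {
  if (/^-?\d+$/.test(cell)) return "integer";
  if (/^-?\d*\.\d+([eE][+-]?\d+)?$/.test(cell)) return "double";
  if (/^(true|false)$/i.test(cell)) return "boolean";
  if (/^\d{4}-\d{2}-\d{2}/.test(cell) && !Number.isNaN(Date.parse(cell))) return "date";
  return "string";
}

function inferColumnType(values: string[]): { type: ColumnType; nulls: number } {
  const nonNull = values.filter(v => v.trim() !== "");
  const types = new Set(nonNull.map(inferCellType));
  let type: ColumnType;
  if (types.size === 1) type = [...types][0];
  // Mixed integer/double columns widen to double; any other mix is string.
  else if (types.size === 2 && types.has("integer") && types.has("double")) type = "double";
  else type = "string";
  return { type, nulls: values.length - nonNull.length };
}
```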

Notes for reviewers

  • This is exploratory hackathon work; a tracking issue should be filed before merge per AGENTS.md.
  • The 1 GB preview limit still triggers a full file download from the dataset service. A follow-up could add HTTP Range request support so identify-only formats (Parquet, HDF5, pickle, model containers) fetch only the first 64 KB.
  • HDF5 sub-types (.h5ad / .h5seurat / .loom) are distinguished by extension because they share identical magic bytes; deep parsing would need an HDF5 reader (e.g. h5wasm) which is intentionally not included.
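The Range-request follow-up could look roughly like this. This is a sketch under the assumption that the dataset service would honor RFC 7233 `Range` headers; the set of identify-only formats and both function names are hypothetical.

```typescript
// Hypothetical: formats where the preview only needs the header bytes,
// not the full file body.
const IDENTIFY_ONLY = new Set(["parquet", "hdf5", "pickle", "gguf", "safetensors"]);

function shouldFetchHeaderOnly(detectedType: string): boolean {
  return IDENTIFY_ONLY.has(detectedType);
}

// "bytes=0-65535" requests the first 64 KB; Range end offsets are inclusive.
function buildRangeHeader(maxBytes: number): { Range: string } {
  return { Range: `bytes=0-${maxBytes - 1}` };
}
```

A server that ignores the header would respond 200 with the full body instead of 206 Partial Content, so the client would still need to slice the response defensively.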

🤖 Generated with Claude Code

Replace extension-based file type guessing in the dataset previewer
with magic-byte detection (file-type library + manual signatures for
Parquet, Arrow, HDF5, NumPy .npy, GGUF, Python pickle), then extract
rich per-format metadata (CSV/XLSX column types and null counts,
JSON schema, PDF /Info, NumPy shape/dtype/byte-order, Safetensors
tensor breakdown and __metadata__, GGUF version, FASTA GC content
and sequence stats, VCF samples and chromosomes). PDF, AnnData,
Seurat, Loom, ML model containers, and bioinformatics text formats
now render meaningfully instead of "preview not supported."

Memory-safe rendering for large files: text/CSV/JSON content is
sliced to the first 10 MB before parsing to avoid browser OOM, with
a warning banner when truncation occurs; cached content is cleared
on file switch. Preview size cap raised to 1 GB.
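A minimal sketch of that 10 MB bounding step: the PR names `getPreviewSlice`, but this body is an assumed shape for illustration, not the actual implementation.

```typescript
// Assumed sketch: cap text/CSV/JSON parsing at the first 10 MB of a blob.
const MAX_PREVIEW_BYTES = 10 * 1024 * 1024;

function getPreviewSlice(blob: Blob): { slice: Blob; truncated: boolean } {
  const truncated = blob.size > MAX_PREVIEW_BYTES;
  return {
    slice: truncated ? blob.slice(0, MAX_PREVIEW_BYTES) : blob,
    // `truncated` would drive the warning banner and flip the
    // sequenceCountIsExact / variantCountIsExact flags described above.
    truncated,
  };
}
```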
@github-actions github-actions Bot added the feature, dependencies, and frontend labels May 16, 2026
@kunwp1 kunwp1 changed the title from "feat(frontend): rich dataset file preview with type detection" to "[Hackathon] feat(frontend): rich dataset file preview with type detection" May 16, 2026
kunwp1 added 3 commits May 16, 2026 12:12
Above 50 MB, skip the full-blob download from the dataset service and
show only the extension-based type identification + a "how to load"
hint. The dominant source of preview lag was the network download, not
the parsing — for a 500 MB Parquet file we used to fetch 500 MB just to
read its first 4 magic bytes.

Also drop the redundant "Size" pill from the metadata strip; size is
already visible in the dataset file listing and in the truncation
banner context.

Creates a new empty workflow and navigates to the editor when the user
clicks the button on a previewed file. The file path is copied to the
clipboard and a notification suggests which scan operator to drag in
(CSV → "CSV File Scan", etc.).

Empty-workflow + clipboard handoff is used instead of pre-populating
the operator JSON because hand-constructed OperatorPredicates skip the
operator-metadata schema validation, leading to workflows the editor
can't load. The same UX outcome with far higher reliability.

The dataset file renderer now passes addOp + fileName as query params
when navigating to the editor. The workspace component reads them
after the workflow finishes loading and adds the operator via
WorkflowUtilService.getNewOperatorPredicate() (so it goes through
schema validation), then strips the query params so a refresh doesn't
re-add. The file path is set via setOperatorProperty rather than
mutating the readonly operatorProperties dict directly.

Replaces the prior clipboard-handoff fallback. Unmapped file types
still get an empty workflow with no auto-add.
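The type-to-operator handoff could be sketched as follows. The map contents beyond "CSV File Scan" (the one mapping the commit message names) and the helper name are assumptions for illustration.

```typescript
// Hypothetical mapping from detected file type to the scan operator the
// workspace component would auto-add after the workflow loads.
const SCAN_OPERATOR_BY_TYPE: Record<string, string> = {
  csv: "CSV File Scan", // named in the commit message above; others assumed
};

function buildEditorQueryParams(
  fileType: string,
  filePath: string
): { addOp: string; fileName: string } | undefined {
  const addOp = SCAN_OPERATOR_BY_TYPE[fileType];
  // Unmapped types get an empty workflow with no auto-add: no query params.
  return addOp ? { addOp, fileName: filePath } : undefined;
}
```

On the workspace side, the params would be read once after load, the operator added via WorkflowUtilService.getNewOperatorPredicate() so it passes schema validation, and the params stripped so a refresh does not re-add.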