[Hackathon] feat(frontend): rich dataset file preview with type detection (#5112)
Open — kunwp1 wants to merge 4 commits
Conversation
Replace extension-based file type guessing in the dataset previewer with magic-byte detection (file-type library + manual signatures for Parquet, Arrow, HDF5, NumPy .npy, GGUF, Python pickle), then extract rich per-format metadata (CSV/XLSX column types and null counts, JSON schema, PDF /Info, NumPy shape/dtype/byte-order, Safetensors tensor breakdown and __metadata__, GGUF version, FASTA GC content and sequence stats, VCF samples and chromosomes). PDF, AnnData, Seurat, Loom, ML model containers, and bioinformatics text formats now render meaningfully instead of "preview not supported." Memory-safe rendering for large files: text/CSV/JSON content is sliced to the first 10 MB before parsing to avoid browser OOM, with a warning banner when truncation occurs; cached content is cleared on file switch. Preview size cap raised to 1 GB.
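A minimal TypeScript sketch of the manual-signature side of the detection. Names and data shapes here are hypothetical; the actual component also consults the file-type library for common formats, and the pickle signature (whose second byte varies across protocol versions) is omitted for brevity.

```typescript
// Hand-rolled magic-byte signatures for formats the file-type library
// does not cover. Each signature is matched against the file header.
type Signature = { name: string; bytes: number[]; offset?: number };

const SIGNATURES: Signature[] = [
  { name: "parquet", bytes: [0x50, 0x41, 0x52, 0x31] },                   // "PAR1"
  { name: "arrow", bytes: [0x41, 0x52, 0x52, 0x4f, 0x57, 0x31] },         // "ARROW1"
  { name: "hdf5", bytes: [0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a] },
  { name: "npy", bytes: [0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59] },           // "\x93NUMPY"
  { name: "gguf", bytes: [0x47, 0x47, 0x55, 0x46] },                      // "GGUF"
];

function detectBySignature(header: Uint8Array): string | undefined {
  for (const sig of SIGNATURES) {
    const off = sig.offset ?? 0;
    if (
      header.length >= off + sig.bytes.length &&
      sig.bytes.every((b, i) => header[off + i] === b)
    ) {
      return sig.name;
    }
  }
  return undefined; // fall back to file-type / extension refinement
}
```

Only the first few bytes of the file are needed, which is what makes the header-only path for large files (below) cheap.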
Above 50 MB, skip the full-blob download from the dataset service and show only the extension-based type identification + a "how to load" hint. The dominant source of preview lag was the network download, not the parsing — for a 500 MB Parquet file we used to fetch 500 MB just to read its first 4 magic bytes. Also drop the redundant "Size" pill from the metadata strip; size is already visible in the dataset file listing and in the truncation banner context.
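The size-gated behavior can be sketched as a small pure function. The constant names and the "none above 1 GB" branch are assumptions inferred from the thresholds stated above, not the component's actual identifiers.

```typescript
// Hypothetical sketch: pick a preview strategy from the file size so
// large files never trigger a full-blob download from the dataset service.
const FULL_PREVIEW_LIMIT = 50 * 1024 * 1024;   // full download + parse below this
const PREVIEW_HARD_CAP = 1024 * 1024 * 1024;   // 1 GB overall preview cap

type PreviewStrategy = "full" | "metadata-only" | "none";

function choosePreviewStrategy(sizeBytes: number): PreviewStrategy {
  if (sizeBytes > PREVIEW_HARD_CAP) return "none";
  if (sizeBytes > FULL_PREVIEW_LIMIT) return "metadata-only"; // extension id + "how to load" hint
  return "full";
}
```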
Creates a new empty workflow and navigates to the editor when the user clicks the button on a previewed file. The file path is copied to the clipboard and a notification suggests which scan operator to drag in (CSV → "CSV File Scan", etc.). Empty-workflow + clipboard handoff is used instead of pre-populating the operator JSON because hand-constructed OperatorPredicates skip the operator-metadata schema validation, leading to workflows the editor can't load. The same UX outcome with far higher reliability.
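The extension-to-operator suggestion can be sketched as a lookup table. Only the CSV mapping ("CSV File Scan") is named in the PR; the table shape and helper name are hypothetical.

```typescript
// Hypothetical mapping from a previewed file's extension to the scan
// operator suggested in the notification.
const SCAN_OPERATOR_BY_EXTENSION: Record<string, string> = {
  csv: "CSV File Scan",
};

function suggestScanOperator(fileName: string): string | undefined {
  const ext = fileName.split(".").pop()?.toLowerCase() ?? "";
  return SCAN_OPERATOR_BY_EXTENSION[ext]; // undefined → empty workflow, no hint
}
```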
The dataset file renderer now passes addOp + fileName as query params when navigating to the editor. The workspace component reads them after the workflow finishes loading and adds the operator via WorkflowUtilService.getNewOperatorPredicate() (so it goes through schema validation), then strips the query params so a refresh doesn't re-add. The file path is set via setOperatorProperty rather than mutating the readonly operatorProperties dict directly. Replaces the prior clipboard-handoff fallback. Unmapped file types still get an empty workflow with no auto-add.
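The one-shot handoff can be sketched as a pure helper: read the params once, then navigate with them cleared so a refresh does not re-add the operator. The param names addOp and fileName come from the PR; the helper itself and its return shape are hypothetical (the real code goes through Angular's Router/ActivatedRoute and WorkflowUtilService).

```typescript
// Hypothetical sketch of consuming the addOp/fileName query params once.
interface AddOpRequest {
  operatorType: string;
  filePath: string;
}

function consumeAddOpParams(
  params: Record<string, string | undefined>
): { request?: AddOpRequest; cleared: Record<string, string | undefined> } {
  const { addOp, fileName, ...rest } = params;
  const request =
    addOp && fileName ? { operatorType: addOp, filePath: fileName } : undefined;
  // "cleared" is what the workspace navigates with after the add,
  // so reloading the URL cannot re-trigger the operator insertion.
  return { request, cleared: rest };
}
```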
Summary
The dataset file previewer (UserDatasetFileRendererComponent) previously identified files purely by extension and showed "Preview of the file type is currently not supported" for anything outside a small allow-list. This PR makes it identify and describe a much wider set of formats and surface rich per-format metadata.

What changed
Magic-byte detection: replaces extension-only guessing. Uses the file-type library (MIT) for ~100 common formats, plus hand-rolled signatures for Parquet (PAR1), Arrow (ARROW1), HDF5 (\x89HDF\r\n\x1a\n), NumPy .npy (\x93NUMPY), GGUF (GGUF), and Python pickle (\x80\x02..\x05). Extension-based refinement disambiguates ZIP containers (PyTorch .pt/.pth, Keras .keras, NumPy .npz) and gzipped R .rds. Text sniffing adds FASTA, FASTQ, and VCF on top of the existing JSON / CSV / Markdown heuristics.

Lightweight header parsing for ML formats:
- .npy → dtype, shape, byte order, Fortran/C order
- Safetensors → per-tensor breakdown and __metadata__
- GGUF → version

Rich metadata per type, displayed as a metadata strip above the preview:

- CSV/XLSX: inferred column types (integer/double/boolean/date/string) and null counts shown directly under each column header in the data table; row & column counts; sheet count for XLSX
- JSON: inferred schema
- PDF: /Info dictionary (Title, Author, Creator, Producer) and encryption flag, rendered in an <iframe>
- Images and media: metadata read via <img>.onload and loadedmetadata
- FASTA: GC content and sequence stats
- VCF: samples from the #CHROM header, distinct chromosomes
- AnnData (.h5ad), Seurat (.h5seurat, .rds), Loom: identification + a "how to load" hint
Memory-safe rendering: text/CSV/JSON parsing is bounded at 10 MB (getPreviewSlice) to avoid browser OOM on large files. A warning banner appears when truncation occurs; truncation-affected stats (sequenceCountIsExact, variantCountIsExact) flip accordingly. turnOffAllDisplay now clears textContent / tableContent / currentFile so switching files reclaims memory. The per-MIME size cap is raised to 1 GB from the prior 1–50 MB.

Async safety: ChangeDetectorRef is injected and markForCheck() is called from the media loadedmetadata and <img>.onload callbacks, preserving the existing default change-detection strategy while supporting an eventual OnPush migration.

Demo: Screen.Recording.2026-05-16.at.12.38.35.PM.mov
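The 10 MB preview bound can be sketched as follows. getPreviewSlice is named in the PR, but its exact signature is assumed here; the point is that the caller learns whether truncation occurred so it can show the banner and flip the ...IsExact flags.

```typescript
// Sketch of the bounded preview slice (signature assumed).
const PREVIEW_SLICE_BYTES = 10 * 1024 * 1024; // 10 MB

function getPreviewSlice(content: Uint8Array): {
  slice: Uint8Array;
  truncated: boolean;
} {
  if (content.byteLength <= PREVIEW_SLICE_BYTES) {
    return { slice: content, truncated: false };
  }
  // subarray is a view, not a copy, so slicing itself costs no memory.
  return { slice: content.subarray(0, PREVIEW_SLICE_BYTES), truncated: true };
}
```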
Files changed
- frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/user-dataset-file-renderer/user-dataset-file-renderer.component.ts: detection logic, parsers, render dispatch, metadata getter
- …/user-dataset-file-renderer.component.html: metadata strip, PDF iframe, truncation banner, column-type tags on table headers
- …/user-dataset-file-renderer.component.scss: metadata pill / column tag styles
- …/user-dataset-file-renderer.component.spec.ts: 28 new tests (30 total)
- frontend/package.json, frontend/yarn.lock: file-type@22.0.1 (MIT)

Test plan
yarn ng test --include="**/user-dataset-file-renderer.component.spec.ts" --watch=false: 30/30 passing (the existing 2 tests retained, plus 28 new tests covering magic-byte detection, extension refinement, NumPy/Safetensors/GGUF header parsing, and column type inference)

Notes for reviewers

HDF5-based formats (.h5ad / .h5seurat / .loom) are distinguished by extension because they share identical magic bytes; deep parsing would need an HDF5 reader (e.g. h5wasm), which is intentionally not included.

🤖 Generated with Claude Code