
[TRACKER] Billion-Scale Data Generation #2026

@jinsolp

Description

The goal is to create synthetic datasets that mimic the distribution of real-world datasets at scale.
Main Related Issue: #1699

Current Status

  • We were able to approximate the 960M x 1024 Falcon dataset
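For a sense of scale, here is a rough back-of-the-envelope calculation of the raw size of a 960M x 1024 matrix. It assumes "960M" means exactly 960 million rows and that values are stored as float32; neither is stated above. The chunk size of 1M rows is purely illustrative.

```python
# Rough size estimate for a 960M x 1024 dataset.
# Assumptions (not stated in this issue): 960M = 960_000_000 rows, float32 values.
N_ROWS = 960_000_000
N_DIMS = 1024
BYTES_PER_VALUE = 4  # float32 (assumed)

total_bytes = N_ROWS * N_DIMS * BYTES_PER_VALUE
total_tb = total_bytes / 1e12     # decimal terabytes
total_tib = total_bytes / 2**40   # binary tebibytes

# Hypothetical chunking: generate the data 1M rows at a time.
CHUNK_ROWS = 1_000_000
n_chunks = -(-N_ROWS // CHUNK_ROWS)  # ceiling division
chunk_gb = CHUNK_ROWS * N_DIMS * BYTES_PER_VALUE / 1e9

print(f"total: {total_tb:.2f} TB ({total_tib:.2f} TiB)")
print(f"{n_chunks} chunks of {chunk_gb:.3f} GB each")
```

Under these assumptions the full matrix is close to 4 TB, so any generator would likely need to produce and write the data in chunks rather than hold it in memory.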

TODOs

Evaluate how to choose parameters when scaling up:

  • Investigate good PCA heuristics
  • Investigate how sample size and number of clusters need to scale to approximate a large dataset
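The TODO items mention PCA, sample size, and number of clusters, which suggests a pipeline that fits a compact model on a sample and then draws new rows from it. The sketch below is only an illustrative guess at such a pipeline, not the method used in this tracker: it runs PCA via SVD, clusters the reduced sample with plain k-means, and samples from a diagonal Gaussian per cluster. The function names `fit_generator` and `generate`, and all parameter names, are hypothetical.

```python
import numpy as np


def fit_generator(sample, n_components, n_clusters, n_iters=20, seed=0):
    """Fit PCA + k-means + per-cluster diagonal Gaussians on a sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    mean = sample.mean(axis=0)
    centered = sample - mean

    # PCA via SVD: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]        # (n_components, dim)
    reduced = centered @ components.T     # (n_samples, n_components)

    def assign(centers):
        # Squared Euclidean distance from every point to every center.
        dists = ((reduced[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    # Plain Lloyd's k-means in the reduced space.
    centers = reduced[rng.choice(len(reduced), n_clusters, replace=False)]
    for _ in range(n_iters):
        labels = assign(centers)
        for k in range(n_clusters):
            members = reduced[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    labels = assign(centers)

    # Mixture weights and per-cluster, per-dimension standard deviations.
    weights = np.bincount(labels, minlength=n_clusters) / len(reduced)
    stds = np.zeros_like(centers)
    for k in range(n_clusters):
        members = reduced[labels == k]
        if len(members) > 1:
            stds[k] = members.std(axis=0)

    return {"mean": mean, "components": components,
            "centers": centers, "stds": stds, "weights": weights}


def generate(model, n_rows, seed=0, dtype=np.float32):
    """Draw n_rows synthetic vectors in the original space (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(model["weights"]), size=n_rows, p=model["weights"])
    noise = rng.standard_normal((n_rows, model["centers"].shape[1]))
    reduced = model["centers"][labels] + noise * model["stds"][labels]
    # Project back from PCA space to the original dimensionality.
    return (reduced @ model["components"] + model["mean"]).astype(dtype)
```

In a sketch like this, `n_components`, `n_clusters`, and the sample size are exactly the knobs the TODOs ask about. Calling `generate` repeatedly with different seeds would produce the dataset chunk by chunk.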

Status

In Progress
