The goal is to create synthetic datasets that mimic the distribution of real-world datasets at scale.

Main Related Issue: https://github.com/rapidsai/cuvs/issues/1699

### Current Status
- We were able to approximate the 960M x 1024 Falcon dataset

<img width="655" height="347" alt="Image" src="https://github.com/user-attachments/assets/6271b3e9-0b21-4dd7-a6b6-b3448fba45f9" />

### TODOs
Evaluate how to choose parameters for scale-up:
- [ ] Investigate good PCA heuristics
- [ ] Investigate how sample size and number of clusters need to scale to approximate a large dataset