Create zip files for discover pipeline #80

Merged: ivyleavedtoadflax merged 2 commits into main from feature/data-prep-pipeline on Nov 11, 2025

@ivyleavedtoadflax
Contributor

  • feat: add data-prep pipeline for zipping validation datasets
  • refactor: use zip -j flag and cleanup intermediate files

Add DVC pipeline to prepare validation datasets for distribution:
- zip_discover_qa: packages discover QA validation dataset (6,229 lines)
- zip_discover_embedding: packages discover embedding validation dataset (6,222 lines)

Both stages copy from data/discover/validation to data/zips and create compressed archives.

- Replace the -r flag with -j to exclude directory paths from zip archives
- Add a cleanup step to remove intermediate .jsonl files after zipping
- Prevents read-only permission issues on subsequent runs

Zip files now contain only the filename, without the data/zips/ path prefix.
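
For readers unfamiliar with the effect of `zip -j`, here is a minimal Python sketch of the same copy, zip, and cleanup sequence. It is an illustration only and not the pipeline's actual code, which calls the `zip` CLI from DVC stages. The function name `package_dataset`, its parameters, and the `sample.jsonl` file name are hypothetical.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_dataset(src: Path, staging_dir: Path, archive_name: str) -> Path:
    """Copy src into staging_dir, zip it without directory paths, remove the copy."""
    staging_dir.mkdir(parents=True, exist_ok=True)

    # Intermediate copy, analogous to copying the .jsonl file into data/zips.
    staged = staging_dir / src.name
    shutil.copyfile(src, staged)

    archive = staging_dir / archive_name
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname=staged.name stores only the filename, similar to `zip -j`.
        zf.write(staged, arcname=staged.name)

    # Cleanup step: drop the intermediate copy once it has been archived.
    staged.unlink()
    return archive


if __name__ == "__main__":
    # Hypothetical demo data; the real datasets are not reproduced here.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "validation" / "sample.jsonl"
        src.parent.mkdir(parents=True)
        src.write_text('{"example": true}\n')

        archive = package_dataset(src, root / "zips", "sample.zip")
        with zipfile.ZipFile(archive) as zf:
            print(zf.namelist())
```

Passing `arcname` explicitly is what keeps the staging directory out of the archive; without it, `ZipFile.write` would record the file's path relative to the filesystem root.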
ivyleavedtoadflax merged commit ef4d10a into main on Nov 11, 2025
1 check passed