diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..acb3bf2 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,264 @@ +# GSMA Dataset Creation - Agent Context + +## Project Overview + +GSMA synthetic dataset creation pipeline for telecom documentation. Scrapes documents, processes them into chunks, generates synthetic Q&A pairs, validates quality, and publishes training datasets. + +## Architecture + +- **Language**: Python 3.12+ with strict type hints +- **CLI**: typer with DVC pipeline integration +- **Testing**: pytest with TDD methodology +- **Processing**: chonkie (chunking), FAISS (similarity), SetFit (filtering) + +## Main Pipelines + +### PRD Pipeline (`pipelines/prd/dvc.yaml`) **[CONSOLIDATED]** +Unified end-to-end pipeline for technical specifications (32 DOCX documents). Consolidates the former chunker, questions, similarity, filters, and validation pipelines into a single pipeline. + +**15 Stages**: +1. **process_documents**: DOCX → Markdown (data/raw → data/processed) +2. **create_late_chunks** (5×): Late chunking at 500/1000/2000/3000/4000 tokens +3. **generate_questions** (5×): Synthetic Q&A with Cerebras GPT-OSS-120B (5/10/20/30/40 questions per chunk size) +4. **data_combiner**: Merge all chunks + questions with working group classification +5. **similarity_hasher**: Add SHA-256 content hashes +6. **similarity_ranker**: FAISS IVFFlat top-K (k=20, threshold=0.3) +7. **overlap_detector**: Character offset-based text overlaps (min 50 chars) +8. **explode_questions**: Question-centric format (min-similarity: 0.35, max: 0.95) +9. **apply_question_filter**: External reference classifier +10. **apply_chunk_filter**: Procedures classifier + keyword exclusion (`prd@gsma.com`) +11. **filter_questions_by_chunk_quality**: Combined quality filtering (min prob: 0.5) +12. **validate_requests**: LLM validation with Qwen 235B via Cerebras (50 concurrent, 50k limit) +13. 
**create_validation_dataset**: Dual format (embedding + QA, max 3 positives/negatives) +14. **upload_embedding_dataset**: → mantisnlp/gsma_prd_synthetic_embedding +15. **upload_qa_dataset**: → mantisnlp/gsma_prd_synthetic_qa + +**Data Paths**: `data/prd/` (chunks, questions, parquets, validation) +**Metrics Paths**: `metrics/prd/` (all stage metrics) + +**Usage**: `dvc repro pipelines/prd/dvc.yaml` or `cd pipelines/prd && dvc repro` + +**Key Features**: +- Variables: `data_prefix: data/prd`, `metrics_prefix: metrics/prd` +- Cerebras provider for question generation and validation +- Keyword filter for GSMA template boilerplate +- Min-similarity-score: 0.35 (validation pipeline setting) + +### Discover Pipeline (`pipelines/discover/dvc.yaml`) +End-to-end pipeline for reports/whitepapers (304 PDF/DOCX documents): +1. **Scrape**: Playwright automation (PRD + Discover pages, bypasses Cloudflare) +2. **Deduplicate**: Hash-based deduplication across sources +3. **Process**: PDF/DOCX → Markdown (PyMuPDF for PDFs) +4. **Chunk → Validation**: Same as PRD pipeline +5. 
**Datasets**: Dual format → HuggingFace Hub + +**Usage**: `dvc repro discover` or `cd pipelines/discover && dvc repro` + +**Outputs**: +- `mantisnlp/gsma_discover_synthetic_embedding`: Contrastive learning format +- `mantisnlp/gsma_discover_synthetic_qa`: RAG/QA format + +### Annotation Pipeline (`pipelines/annotation/dvc.yaml`) +Human validation workflow with subgroup-based tasks (TSG, FASG, NG, RCS, eSim): +- Downloads datasets from HuggingFace Hub +- Adds working group/subgroup classifications +- Creates Argilla workspaces with domain-specific samples +- Credentials: username=subgroup, password=subgroup-gsma + +**Usage**: `dvc repro annotation:add_subgroups` (runs only the subgroup-classification stage; the five upload stages are frozen) + +**Argilla CLI Commands** (`gsma argilla [command]`): +- `upload`: Upload validation dataset to Argilla +- `delete`: Delete dataset from Argilla +- `download`: Download annotated dataset from Argilla +- `upload-by-subgroup`: Upload filtered dataset for subgroup annotation +- `delete-workspace`: Delete workspace and associated user +- `add-users`: Create multiple users with random secure passwords (e.g., 10 users for a workshop) +- `add-user`: Create single user with multi-workspace support +- `add-to-workspace`: Add existing user to multiple workspaces +- `list-users`: List all users in a workspace +- `list-workspaces`: List all available workspaces +- `list-datasets`: List all datasets in a workspace +- `track-progress`: Monitor annotation progress (optimized for instant response) +- `delete-user`: Remove user from Argilla + +**User Management Examples**: +```bash +# Create 10 users with random secure passwords +gsma argilla add-users -w TSG --count 10 --output-csv users.csv + +# Create user in multiple workspaces +gsma argilla add-user -u alice -p secret123 -w TSG -w FASG -w NG + +# Track annotation progress +gsma argilla track-progress -w TSG +``` + +## Key Components + +### Chunking (`gsma_dataset_creation/chunker.py`) +- Late chunking with embedded context preservation +- 
Configurable chunk sizes (500-4000 tokens) +- Embedding model: sentence-transformers/all-MiniLM-L6-v2 +- CLI: `gsma chunk --chunker late --chunker-config '{...}'` + +### Question Generation (`gsma_dataset_creation/qa_generator.py`) +- Synthetic Q&A using OpenRouter API (Cerebras provider) +- Concurrent processing with rate limiting +- Resumable (auto-resumes from last processed chunk) +- CLI: `gsma questions generate-from-chunks --num-questions 5` + +### Similarity Analysis (`gsma_dataset_creation/similarity/`) +- Data combination with working group classification +- SHA-256 content hashing for deduplication +- FAISS IVFFlat similarity ranking (top-K) +- Character offset-based overlap detection +- CLI: `gsma similarity combine`, `hash`, `rank`, `detect-overlaps` + +### Quality Filtering (`gsma_dataset_creation/filters_cli.py`) +- **Chunk filter**: Procedures classifier (filters legal/procedural content) +- **Question filter**: External reference classifier (filters unavailable content) +- **Keyword filter**: Exclude matches (e.g., "prd@gsma.com" boilerplate) +- Pre-trained SetFit models in `models/filters/` +- CLI: `gsma filters apply-chunk-filter`, `apply-question-filter` + +### Validation (`gsma_dataset_creation/validation_cli.py`) +- Individual evaluation (no batching as of PR #71) +- SQLite checkpointing for resumability +- Concurrent processing with asyncio +- Error categorization (rate limits, server errors, timeouts) +- CLI: `gsma validation explode-questions`, `validate-requests` + +### Dataset Creation (`gsma_dataset_creation/datasets_cli.py`) +- **Embedding format**: Contrastive learning (Question, Positive_Chunks, Negative_Chunks, Answer, Metadata) +- **QA format**: RAG training (Question, Content, Answer, Metadata) +- Quality-based filtering (max positives/negatives) +- Comprehensive histogram metrics +- CLI: `gsma datasets create-from-validation` + +## Dataset Formats + +### Embedding Format +```json +{ + "Question": "What is 5G network slicing?", + 
"Positive_Chunks": ["5G network slicing allows..."], + "Negative_Chunks": ["eSIM technology...", "WiFi 6..."], + "Answer": "5G network slicing allows...", + "Metadata": { + "source_document": "TS.23 v7.0.md", + "working_group": "TSG", + "chunk_id": "TS.23 v7.0.md_500_15" + } +} +``` + +### QA Format +```json +{ + "Question": "What is 5G network slicing?", + "Content": ["5G network slicing allows..."], + "Answer": "5G network slicing allows...", + "Metadata": { + "source_document": "TS.23 v7.0.md", + "working_group": "TSG", + "chunk_id": "TS.23 v7.0.md_500_15" + } +} +``` + +## Environment Variables + +- `OPENROUTER_API_KEY`: Cerebras/OpenRouter API access (question generation, validation) +- `ARGILLA_API_URL`: Argilla service URL (https://mantisnlp-annotate.hf.space) +- `ARGILLA_API_KEY`: Argilla API authentication +- `HUGGINGFACE_TOKEN`: HuggingFace Hub operations (dataset upload/download) + +## Critical Rules + +**NEVER use `git push --no-verify` or `git commit --no-verify`** +- All pre-commit hooks must pass +- Fix issues rather than bypassing checks + +**NEVER use `git add .` - always add files individually** +- Use `git add specific_file.py` +- Review each file before staging + +**ALWAYS update AGENTS.md with significant changes** +- Update after each PR merge +- Preserve existing entries unless obsolete +- This file is the project's living memory + +## Deprecated Pipelines + +The following individual pipelines have been **consolidated into `pipelines/prd/dvc.yaml`** and deleted: +- ~~`pipelines/chunker/`~~ → Stages 1-2 in PRD pipeline +- ~~`pipelines/questions/`~~ → Stage 3 in PRD pipeline +- ~~`pipelines/similarity/`~~ → Stages 4-7 in PRD pipeline +- ~~`pipelines/filters/`~~ → Stages 9-11 in PRD pipeline +- ~~`pipelines/validation/`~~ → Stages 8, 12-15 in PRD pipeline + +**Remaining pipelines**: +- `pipelines/prd/` - Consolidated PRD pipeline (primary) +- `pipelines/discover/` - Separate discover document pipeline +- `pipelines/annotation/` - Human 
annotation workflow +- `pipelines/datasets/` - Legacy question-based dataset creation + +## Recent Changes + +**Pipeline Consolidation** (Oct 2025): Consolidated PRD pipeline +- Created unified `pipelines/prd/dvc.yaml` with 15 stages +- Consolidated chunker, questions, similarity, filters, validation pipelines +- Data migrated to `data/prd/`, metrics to `metrics/prd/` +- Added keyword exclusion filter (`--exclude-matches "prd@gsma.com"`) +- Added Cerebras provider to validation stage +- Used `dvc commit --force` to register existing outputs (no re-execution of expensive stages) +- Deleted deprecated individual pipeline directories + +**PR #75** (Oct 2025): Prepare discover pipeline for validation +- Added working group classification to discover data combiner +- Fixed variable shadowing bug in explode-questions (question_id collisions) +- Added keyword exclusion filter (default: "prd@gsma.com" boilerplate) +- Increased min-similarity-score from 0.35 to 0.45 +- Added error type breakdown to validation metrics (rate limits, server errors, etc.) 
+- Removed arbitrary character limit on reasoning field (was 800 chars) +- Configured dual-format dataset creation (embedding + QA) +- Separate HuggingFace uploads for each format + +**PR #73** (Oct 2025): Complete discover question generation pipeline +- Full end-to-end discover pipeline (23 stages) +- PDF processing support via PyMuPDF +- Metrics standardization with subdirectory structure + +**PR #71** (Oct 2025): Refactor validation to individual evaluation +- Removed batching (was 3-10 candidates per API call) +- Individual chunk evaluation against questions +- Simplified validation logic and checkpoint management + +**PR #70** (Oct 2025): Discover pipeline expansion +- Added scraping, deduplication, processing, chunking, questions, similarity, filtering, validation, dataset creation +- Configurable metrics output path for chunk command +- PDF and DOCX support in document processor + +**Earlier Features**: +- Annotation pipeline with subgroup-based Argilla tasks +- Document scraping with Playwright (Cloudflare bypass) +- Batched validation with SQLite checkpointing (deprecated in #71) +- Quality filtering with SetFit classifiers +- Working group/subgroup classification +- Dual dataset formats (embedding + QA) + +## Performance Notes + +**PRD Pipeline** (~1000 documents): 9-17 hours total +- Document processing: ~5 min +- Chunking: ~10 min +- Question generation: 2-4 hours (Cerebras speed-dependent) +- Similarity: ~30 min +- Filtering: ~15 min +- Validation: 6-12 hours (concurrency-dependent) +- Dataset creation: ~5 min + +**Resumability**: All long-running stages support resumability via checkpoints or skip-existing logic. + +**Concurrent Processing**: Validation supports up to 50 concurrent requests (asyncio). 
diff --git a/CLAUDE.md b/CLAUDE.md deleted file mode 100644 index b69e4ce..0000000 --- a/CLAUDE.md +++ /dev/null @@ -1,783 +0,0 @@ -# GSMA Dataset Creation - Claude Code Context - -## Project Overview - -GSMA dataset creation pipeline for extracting, processing, and chunking telecom documentation into training datasets. - -## Current Architecture - -- **Language**: Python 3.12+ with type hints -- **CLI Framework**: typer for command-line interface -- **Testing**: pytest with TDD methodology -- **Data Processing**: chonkie for text chunking -- **Pipeline**: DVC for reproducible data processing - -## Existing Components - -### Chunking (`gsma_dataset_creation/chunker.py`) -- `chunk_document()`: Process single document into 300-token chunks -- `chunk_documents()`: Batch process directory of documents -- Uses chonkie TokenChunker with 30-token overlap -- Outputs JSON format with metadata - -### CLI (`gsma_dataset_creation/cli.py`) -- `chunk` command: Create chunks from processed documents -- Integration with DVC pipeline stages - -## New Feature: Q/A Generation - -### Overview -Generate question-answer pairs from document chunks for embedding training using OpenRouter API. 
- -### Technical Approach -- **LLM Integration**: OpenRouter API with OpenAI client library -- **Default Model**: openai/gpt-5-mini (configured in `model_defaults.py`) -- **Rate Limiting**: Built-in client limits + --limit parameter -- **Input**: Chunked documents (JSON format) -- **Output**: Q/A pairs (JSON format) - -### Key Components (In Development) -- `qa_generator.py`: Core Q/A generation logic -- CLI extension: `generate-qa` command -- OpenRouter integration with multiple model support - -### API Design -```python -def generate_qa_for_chunk( - chunk_content: str, - chunk_metadata: dict, - num_questions: int = 1 -) -> QAGenerationResult - -def generate_qa_for_directory( - input_dir: Path, - output_dir: Path, - num_questions: int = 1, - limit: Optional[int] = None -) -> QABatchResult -``` - -### CLI Usage -```bash -uv run gsma generate-qa data/chunked data/qa --limit 20 --model openai/gpt-5-mini -``` - -## Development Workflow - -1. **TDD Approach**: Write tests first, then implementation -2. **Test Order**: Contract → Integration → Unit -3. **Git**: Atomic commits with descriptive messages -4. 
**DVC**: Pipeline integration for reproducible processing - -## Critical Rules - -**NEVER use `git push --no-verify` or `git commit --no-verify`** -- All pre-commit hooks must pass before any push -- All tests must pass before any push -- Code quality and safety checks are non-negotiable -- If hooks fail, fix the issues rather than bypassing them - -**NEVER use `git add .` - always add files individually** -- Use `git add specific_file.py` instead of `git add .` -- Be intentional and explicit about which files are staged -- Review each file before adding it to ensure it should be committed -- This prevents accidentally committing unintended changes or files - -**ALWAYS update CLAUDE.md with each significant change or PR** -- Update the "Recent Changes" section to reflect new features, fixes, or improvements -- Update configuration details if new environment variables, DVC stages, or dependencies are added -- Keep the documentation current so future Claude instances understand the project state -- This file serves as the project's living memory and context -- **Preserve existing entries unless no longer relevant** - only remove work that has been superseded or deprecated -- This ensures all meaningful completed work remains visible across branches and merges - -## New Feature: HuggingFace Dataset Creation with Working Group Classification - -### Overview -Creates training datasets from hard negatives with automatic working group classification for domain-specific analysis. - -### Technical Approach -- **Working Group Mapping**: JSON mapping of document filenames to working groups (TSG, FASG, eSim, etc.) 
-- **Dual Output Formats**: Both embedding training and simple Q&A dataset formats -- **Unified Processing**: Combines all chunk sizes (300-3000 tokens) into single datasets -- **Simple Classification**: Direct dictionary lookup with graceful handling of unclassified documents - -### Key Components -- `hf_dataset_creator.py`: Core dataset creation logic with working group classification -- CLI commands: `create-datasets`, `create-embedding-dataset`, `create-qa-dataset` -- DVC integration: Unified dataset creation pipeline stages - -### Dataset Formats - -#### Embedding Training Format -```json -{ - "Question": "What is 5G network slicing?", - "Positive_Chunks": ["5G network slicing allows operators to create multiple virtual networks..."], - "Negative_Chunks": ["eSIM technology enables programmable SIM cards...", "WiFi 6 provides faster wireless connectivity..."], - "Answer": "5G network slicing allows operators to create multiple virtual networks on a shared physical infrastructure.", - "Source_Document": "TS.23 v7.0.md", - "Working_Group": "TSG", - "Similarity_Scores": [0.4863, 0.3275], - "Chunk_ID": "TS.23 v7.0.md_15", - "Chunk_Position": 15, - "Question_Type": "analytical", - "Generation_Timestamp": "2025-09-24T18:08:43.528098Z" -} -``` - -#### Simple Q&A Format -```json -{ - "Question": "What is 5G network slicing?", - "Answer": "5G network slicing allows operators to create multiple virtual networks on a shared physical infrastructure.", - "Chunk": "5G network slicing allows operators to create multiple virtual networks...", - "Source_Document": "TS.23 v7.0.md", - "Working_Group": "TSG", - "Chunk_ID": "TS.23 v7.0.md_15", - "Question_Type": "analytical" -} -``` - -### CLI Usage -```bash -# Create both dataset formats -uv run gsma create-datasets \ - --input-dirs data/hard_negatives_late_300 data/hard_negatives_late_500 data/hard_negatives_late_1000 data/hard_negatives_late_2000 data/hard_negatives_late_3000 \ - --working-groups-mapping 
data/working_groups_mapping.json \ - --embedding-output data/embedding_dataset_all_chunks.json \ - --qa-output data/qa_dataset_all_chunks.json - -# Create embedding training dataset only -uv run gsma create-embedding-dataset \ - --input-dirs data/hard_negatives_late_300 data/hard_negatives_late_500 \ - --working-groups-mapping data/working_groups_mapping.json \ - --output data/embedding_dataset_all_chunks.json - -# Create Q&A dataset only -uv run gsma create-qa-dataset \ - --input-dirs data/hard_negatives_late_300 data/hard_negatives_late_500 \ - --working-groups-mapping data/working_groups_mapping.json \ - --output data/qa_dataset_all_chunks.json -``` - -### DVC Pipeline Integration -```yaml -# Run unified dataset creation -dvc repro create_unified_datasets - -# Run individual formats -dvc repro create_embedding_dataset_only -dvc repro create_qa_dataset_only -``` - -## PRD Pipeline (Unified End-to-End Pipeline) - -### Overview -The PRD pipeline (`pipelines/prd/dvc.yaml`) is a unified end-to-end pipeline that combines all stages from raw document processing through to HuggingFace dataset upload. It consolidates the functionality of the individual pipelines (chunker, questions, similarity, validation) into a single cohesive workflow. 
- -### Pipeline Architecture - -**Total Stages**: 21 stages (17 logical stages, with 2 foreach loops creating 5 iterations each) - -#### Stage 1: Document Processing (1 stage) -- `process_documents`: Convert raw DOCX files to processed Markdown -- Input: `data/raw` -- Output: `data/processed` - -#### Stage 2: Chunking (5 stages via foreach) -- `create_late_chunks`: Create late chunking with embedded context preservation -- Chunk sizes: 500, 1000, 2000, 3000, 4000 tokens -- Embedding model: sentence-transformers/all-MiniLM-L6-v2 -- Input: `data/processed` -- Outputs: `data/chunked_late_{500,1000,2000,3000,4000}` - -#### Stage 3: Question Generation (5 stages via foreach) -- `generate_questions`: Generate synthetic questions using Cerebras GPT-OSS-120B -- Questions per chunk: 5, 10, 20, 30, 40 (scaled with chunk size) -- Provider: Cerebras (high-speed inference) -- Concurrency: 20 parallel requests -- Input: `data/chunked_late_*` -- Outputs: `data/questions_gpt-oss-120b_late_{500,1000,2000,3000,4000}` - -#### Stage 4: Similarity Analysis (4 stages) -- `data_combiner`: Merge all chunks + questions with working group classification -- `similarity_hasher`: Add SHA-256 content hashes for deduplication -- `similarity_ranker`: Compute FAISS IVFFlat top-K (k=20) similarity with threshold=0.3 -- `overlap_detector`: Detect character offset-based text overlaps (min 50 chars) -- Output: `data/enriched_chunks.parquet` - -#### Stage 5: Quality Filtering (5 stages) -- `apply_chunk_filter`: Apply procedures classifier (filters legal/procedural content) -- `filter_chunks`: Remove low-quality chunks (probability threshold ≥ 0.5) -- `explode_questions`: Transform to question-centric format with similarity range [0.35, 0.95] -- `apply_question_filter`: Apply external reference classifier (filters questions about unavailable content) -- `filter_questions_by_chunk_quality`: Combined filtering (min probability 0.5 for both questions and chunks) -- Filter models: Pre-trained SetFit models 
in `models/filters/` -- Output: `data/validation/questions_filtered.parquet` - -#### Stage 6: Validation (2 stages) -- `batch_candidates`: Group candidates into batches (3 per batch, randomized order) -- `validate_batched_requests`: LLM validation using Qwen 235B - - Concurrent processing: 50 parallel requests - - Limit: 50,000 questions - - SQLite checkpointing for resumability -- Output: `data/validation/validation_results.parquet` - -#### Stage 7: Dataset Creation & Upload (3 stages) -- `create_validation_dataset`: Transform validation results to HuggingFace dataset formats (both embedding and QA) - - **Embedding Format**: Contrastive learning with positive/negative chunks (max 3 positives, 3 negatives per question) - - **QA Format**: RAG/QA training with single source chunk per question - - Combined metrics JSON with both embedding and QA statistics - - Comprehensive histogram metrics (chunk counts, quality scores, similarity scores) - - Outputs: HuggingFace datasets (Arrow format) + JSONL files (without Metadata field) -- `upload_embedding_dataset`: Push embedding dataset to HuggingFace Hub (mantisnlp/gsma_prd_synthetic_embedding) -- `upload_qa_dataset`: Push QA dataset to HuggingFace Hub (mantisnlp/gsma_prd_synthetic_qa) -- Outputs: - - `data/validation/validation_dataset_embedding` - - `data/validation/validation_dataset_embedding.jsonl` - - `data/validation/validation_dataset_qa` - - `data/validation/validation_dataset_qa.jsonl` - -### Usage - -```bash -# Run complete pipeline from project root -dvc repro prd - -# Run from pipeline directory -cd pipelines/prd -dvc repro - -# Run specific stage -dvc repro prd:validate_batched_requests - -# View pipeline DAG (run from pipelines/prd/) -cd pipelines/prd && dvc dag -``` - -### Dependencies - -**Required Data Files**: -- `data/raw`: Raw DOCX documents -- `data/working_groups_mapping.json`: Working group classifications -- `models/filters/question-filter-run-5000-2025-10-08_22-47-46/model`: External reference 
filter -- `models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model`: Procedures filter - -**Environment Variables**: -- `OPENROUTER_API_KEY`: For Cerebras API access (question generation) -- `HUGGINGFACE_TOKEN`: For dataset upload (optional, if using authenticated uploads) - -### Metrics - -All stages generate metrics in the `metrics/` directory: -- Document processing: `metrics/document_processing_metrics.json` -- Chunking: `metrics/chunk_metrics_chunked_late_{size}.json` -- Question generation: `metrics/generate_questions_gpt-oss-120b_late_{size}.json` -- Similarity: `metrics/data_combiner.json`, `metrics/similarity_hasher.json`, `metrics/similarity_calculator.json`, `metrics/overlap_detector.json` -- Filtering: `metrics/filters/*.json` -- Validation: `metrics/validation_results.json` -- Dataset creation: `metrics/dataset_creation_from_validation.json` - -### Important Notes - -**Pipeline Conflicts**: The PRD pipeline consolidates outputs from the individual pipelines (chunker, questions, similarity, validation). 
Due to DVC's constraint that outputs cannot be tracked in multiple pipeline files: -- Use the PRD pipeline for complete end-to-end runs -- Individual pipelines remain available for selective execution or debugging -- Cannot run PRD pipeline and individual pipelines simultaneously -- If you need to run individual pipelines, temporarily rename or move `pipelines/prd/dvc.yaml` - -**Resumability**: All long-running stages support resumability: -- Question generation: Automatically resumes from last processed chunk -- Validation: SQLite checkpointing in `.dvc/.tmp/validation_checkpoints` - -**Performance**: Estimated runtime for complete pipeline on ~1000 documents: -- Document processing: ~5 minutes -- Chunking: ~10 minutes (all sizes) -- Question generation: ~2-4 hours (depends on Cerebras availability) -- Similarity analysis: ~30 minutes -- Quality filtering: ~15 minutes -- Validation: ~6-12 hours (depends on concurrency and question count) -- Dataset creation: ~5 minutes -- Total: ~9-17 hours - -## Document Scraping Pipeline (Discover Pipeline) - -### Overview -Automated scraping of GSMA documents using Playwright browser automation to bypass Cloudflare protection. - -**Two data sources:** -1. **PRD Page**: Permanent Reference Documents (technical specifications, TS documents) -2. 
**Discover Page**: Reports, whitepapers, factsheets (Algolia-powered search with 300+ results) - -### Architecture -- **Module**: `gsma_dataset_creation/scraper/` - - `models.py`: Data models (ScraperMetrics, DocumentDownload) - - `browser.py`: Shared browser setup with anti-detection settings - - `prd_scraper.py`: PRD page scraper (default URL in function) - - `discover_scraper.py`: Discover page scraper (default URL in function) -- **Browser Automation**: Playwright with anti-detection settings (non-headless required for Cloudflare) -- **Pagination**: Automatic navigation through paginated results (up to 50 pages) -- **Resumability**: Skips already-downloaded files by default -- **Metrics**: Structured JSON output with download stats, timing, errors - -### Usage - -**CLI Commands:** -```bash -# Scrape PRD documents (uses default URL) -gsma scrape prd --output-dir data/prd_documents --metrics-output metrics/scrape_prd.json - -# Scrape discover page (uses default URL with technical topic filters) -gsma scrape discover --output-dir data/discover_documents --metrics-output metrics/scrape_discover.json - -# Override with custom discover URL -gsma scrape discover --url "https://www.gsma.com/discover/?custom_filters" --output-dir data/custom -``` - -**DVC Pipeline:** -```bash -cd pipelines/discover -dvc repro # Run both stages -dvc repro scrape_prd # Run PRD scraper only -dvc repro scrape_discover # Run discover scraper only -dvc metrics show # View download metrics -``` - -### Configuration - -**Default URLs** (overridable via `--url` CLI option): -- **PRD**: `https://www.gsma.com/get-involved/working-groups/permanent-reference-documents/` -- **Discover**: `https://www.gsma.com/discover/` (filtered to: Documents & reports, eSIM, Identity, IoT, Networks, Security, Spectrum) - -**Browser Settings:** -- Headless: false (required to bypass Cloudflare) -- User Agent: Chrome 120 on Windows 10 -- Viewport: 1920x1080 -- Anti-detection: 
`--disable-blink-features=AutomationControlled` - -**Rate Limiting:** -- 1-2 second delays between requests (configurable via `--delay`) -- Respectful crawling to avoid server overload - -### Outputs - -**Downloaded Files:** -- `data/prd_documents/`: ~32 technical specifications (PDF, DOCX, XLSX) -- `data/discover_documents/`: ~304 reports, whitepapers, factsheets (PDF primarily) -- Total: ~660MB - -**Metrics** (`metrics/scrape_*.json`): -```json -{ - "source_url": "https://...", - "pages_visited": 26, - "total_links_found": 304, - "files_downloaded": 303, - "files_skipped": 1, - "files_failed": 0, - "total_size_mb": 658.4, - "duration_seconds": 3245.6, - "errors": [] -} -``` - -### DVC Pipeline Structure - -**Pipeline**: `pipelines/discover/dvc.yaml` - -**Stages:** -1. `scrape_prd`: Download PRD technical specifications -2. `scrape_discover`: Download discover page documents - -**Dependencies:** -- Scraper module code changes trigger re-runs -- Manual re-runs: `dvc repro -f scrape_prd` (force re-scrape) - -**Caching Strategy:** -- **Documents** (`data/*/`): Not cached (large external files, `cache: false`) -- **Metrics** (`metrics/*.json`): Cached (small, important for tracking, `cache: true`) - -### Notes - -- Browser runs in **visible mode** (not headless) to bypass Cloudflare detection -- Scraper automatically waits for Cloudflare challenges to resolve (5-10 seconds) -- Already-downloaded files are skipped (resumable) -- Failed downloads are logged but don't stop the scraper -- Pagination navigates through all available pages (discover page has 26 pages @ 12 results/page) - -## Recent Changes -- **Chunk CLI Metrics Parameter** (Feature #012.1 / PR #70): Added configurable metrics output path to chunk command - - Added `--metrics-output` parameter to `gsma chunk` CLI command (optional, with backward compatibility) - - Updated `chunk_documents()` function in chunker.py to accept `metrics_output` parameter - - Implemented conditional logic: uses provided path if 
given, otherwise defaults to `./metrics/chunk_metrics_{output_dir_name}.json` - - Creates parent directories automatically when custom path is provided - - Maintains full backward compatibility - existing code without the parameter works unchanged - - **Purpose**: Enables DVC pipelines using `wdir: ../..` pattern to control metrics file locations - - **Usage**: `uv run gsma chunk input output --chunker late --chunker-config '{}' --metrics-output metrics/discover/chunk_metrics.json` - - **DVC Integration**: All discover pipeline chunking stages now use `--metrics-output ${metrics_prefix}/chunk_metrics_chunked_${item.name}.json` - - Tests verified: All 39 CLI and processor tests pass, including CLI chunker contract tests -- **Discover Pipeline Expansion** (Feature #012 / PR #70): Complete end-to-end processing pipeline for discover documents - - Added 23 new stages to `pipelines/discover/dvc.yaml` mirroring PRD pipeline architecture - - **Complete Pipeline Flow**: Scraped discover documents → Deduplicated → Processed (DOCX/PDF to Markdown) → Chunked → Questions → Similarity → Filtered → Validated → HuggingFace Dataset → Uploaded - - **Stage Breakdown**: - 1. Scraping & Deduplication (4 stages): `scrape_prd`, `scrape_discover`, `dedup_prd`, `dedup_discover` - 2. Document Processing (1 stage): `process_discover_documents` - Convert PDFs/DOCX to Markdown with PDF support via PyMuPDF - 3. Chunking (5 stages via foreach): `create_discover_late_chunks` at 500/1000/2000/3000/4000 tokens using late chunking - 4. Question Generation (5 stages via foreach): `generate_discover_questions` with 5/10/20/30/40 questions per chunk using Cerebras GPT-OSS-120B - 5. Similarity Analysis (4 stages): `discover_data_combiner`, `discover_similarity_hasher`, `discover_similarity_ranker`, `discover_overlap_detector` - 6. Quality Filtering (3 stages): `discover_explode_questions`, `discover_apply_question_filter`, `discover_apply_chunk_filter`, `discover_filter_questions_by_chunk_quality` - 7. 
Validation (2 stages): `discover_batch_candidates`, `discover_validate_batched_requests` using Qwen 235B - 8. Dataset Creation & Upload (2 stages): `discover_create_validation_dataset`, `discover_upload_hf_dataset` to mantisnlp/gsma_discover_synthetic - - **PDF Processing Support**: Added PyMuPDF dependency and PDF conversion capability to handle 368 PDF files in discover pipeline - - Created `convert_pdf_to_markdown()` function in converter.py using PyMuPDF - - Added unified `convert_document_to_markdown()` router function - - Updated processor.py to handle both .docx and .pdf files (filters to ['.docx', '.doc', '.pdf']) - - Updated CLI help text to reflect PDF support - - **Metrics Standardization**: All discover pipeline metrics use `metrics/discover/` subdirectory structure for consistency with validation pipeline - - **Usage**: Run `dvc repro discover` or `cd pipelines/discover && dvc repro` for complete end-to-end execution on discover documents - - **HuggingFace Output**: Dataset uploaded to mantisnlp/gsma_discover_synthetic (separate from PRD dataset) - - **Key Features**: Same architecture as PRD pipeline but processes discover-scraped documents (reports, whitepapers, factsheets) instead of PRD technical specifications -- **Document Scraping Pipeline** (Feature #011): Automated GSMA document collection using Playwright - - Created `gsma_dataset_creation/scraper/` module with browser automation infrastructure - - **Two scrapers**: PRD page (32 technical specifications) and Discover page (304 reports/whitepapers) - - **Key features**: Cloudflare bypass (visible browser mode), automatic pagination, resumable downloads, structured metrics - - **CLI commands**: `gsma scrape prd` and `gsma scrape discover` with configurable options - - **DVC pipeline**: `pipelines/discover/dvc.yaml` with 2 stages (scrape_prd, scrape_discover) - - **Architecture**: Shared browser setup (browser.py), data models (models.py), separate scrapers for each source - - **Metrics**: JSON 
output tracking pages visited, files downloaded/skipped/failed, total size, duration, errors - - **Output**: ~336 files (~660MB total) in data/prd_documents/ and data/discover_documents/ - - **Default URLs**: PRD page and discover page with technical topic filters (eSIM, Identity, IoT, Networks, Security, Spectrum) - - **Caching strategy**: Documents not cached (large external files), metrics cached for tracking - -## Annotation Pipeline (Human Validation) - -### Overview -The Annotation pipeline (`pipelines/annotation/dvc.yaml`) enables human validation of synthetic Q&A pairs through Argilla. It adds subgroup classification to datasets and creates workspace-specific annotation tasks for domain experts. - -### Pipeline Architecture - -**Total Stages**: 6 stages (1 data preparation + 5 frozen upload stages) - -#### Stage 1: Add Subgroups -- `add_subgroups`: Downloads dataset from HuggingFace Hub and adds subgroup classifications -- Input: `mantisnlp/gsma_prd_synthetic` (HuggingFace Hub), `data/working_groups_mapping.json` -- Output: `data/gsma_prd_synthetic_with_subgroups` -- **Key Features**: - - Normalizes document names (removes extensions) for flexible matching across .md, .docx, .pdf formats - - Adds "subgroup" field to dataset Metadata - - Handles unclassified documents gracefully - -#### Stage 2: Upload by Subgroup (5 frozen stages) -Five frozen stages for uploading subgroup-specific annotation tasks: -- `upload_tsg_annotation`: Technical Specification Group (100 samples) -- `upload_fasg_annotation`: Fraud and Security Assurance Group (100 samples) -- `upload_ng_annotation`: Network Group (200 samples - larger due to corpus size) -- `upload_rcs_annotation`: Rich Communication Services (100 samples) -- `upload_esim_annotation`: eSIM specifications (100 samples) - -**Each upload stage**: -1. Filters dataset by subgroup -2. Randomly samples specified number of Q&A pairs -3. Creates workspace named after subgroup (if doesn't exist) -4. 
Creates annotator user credentials (username=subgroup, password=subgroup-gsma) -5. Automatically adds mattupson and louis as additional annotators -6. Creates one annotation record per Question-Answer pair (deduplicated by question_id) -7. Uploads to Argilla with proper metadata - -### Key Components - -#### Subgroup Classification (`gsma_dataset_creation/subgroup_adder.py`) -- `normalize_document_name()`: Remove file extensions for flexible matching -- `build_document_to_subgroup_mapping()`: Reverse mapping from documents to subgroups -- `add_subgroup_column()`: Add subgroup to Metadata for each dataset record -- `add_subgroup_to_dataset()`: Main entry point, supports Dataset or DatasetDict -- `get_subgroup_statistics()`: Get distribution of subgroups in dataset - -#### Argilla Upload (`gsma_dataset_creation/validation/argilla_subgroup_uploader.py`) -- `create_workspace_and_user()`: Automatic workspace/user provisioning -- `explode_chunks_and_answers()`: Transform Q&A dataset to annotation records (one per Question-Answer pair, deduplicated by question_id) -- `upload_subgroup_dataset_to_argilla()`: Main upload orchestrator - -#### CLI Commands (`gsma_dataset_creation/argilla_cli.py`) -- `upload`: Upload validation dataset to Argilla -- `delete`: Delete dataset from Argilla -- `download`: Download annotated dataset from Argilla -- `upload-by-subgroup`: Upload filtered dataset for subgroup annotation -- `delete-workspace`: Delete workspace and associated user -- `add-users`: Create multiple users with random secure passwords -- `add-user`: Create single user with multi-workspace support -- `add-to-workspace`: Add existing user to multiple workspaces -- `list-users`: List all users in a workspace -- `list-workspaces`: List all available workspaces -- `list-datasets`: List all datasets in a workspace -- `track-progress`: Monitor annotation progress -- `delete-user`: Remove user from Argilla - -### CLI Usage - -```bash -# Add subgroups to dataset -uv run gsma 
add-subgroup-to-dataset \ - --dataset-repo mantisnlp/gsma_prd_synthetic \ - --working-groups data/working_groups_mapping.json \ - --output data/gsma_prd_synthetic_with_subgroups - -# Upload subgroup for annotation -uv run gsma argilla upload-by-subgroup \ - --dataset-path data/gsma_prd_synthetic_with_subgroups \ - --subgroup TSG \ - --sample-size 100 \ - --dataset-name-prefix gsma_annotation_tsg - -# Delete workspace and user -uv run gsma argilla delete-workspace TSG --force - -# Create multiple users with random secure passwords -uv run gsma argilla add-users \ - --workspace tsg-wg \ - --count 10 \ - --output-csv data/tsg_users.csv - -# Create single user in multiple workspaces -uv run gsma argilla add-user \ - --username alice \ - --password secret123 \ - --workspace TSG \ - --workspace FASG \ - --workspace NG -``` - -### DVC Pipeline Usage - -```bash -# Run from project root -dvc repro annotation:add_subgroups - -# Unfreeze and run upload stage -dvc unfreeze annotation:upload_tsg_annotation -dvc repro annotation:upload_tsg_annotation - -# View pipeline -cd pipelines/annotation && dvc dag -``` - -### Annotation Workflow - -1. **Preparation**: Run `add_subgroups` stage to prepare dataset with classifications -2. **Upload**: Unfreeze and run desired `upload_*_annotation` stages -3. **Access Argilla**: Annotators login to https://mantisnlp-annotate.hf.space - - Credentials: username=subgroup (lowercase), password=subgroup-gsma - - Example: TSG workspace → username=tsg, password=tsg-gsma -4. **Annotate**: Evaluate each Question-Answer pair quality (Good/Partially Good/Bad) -5. 
**Download**: Retrieve annotated results using `gsma argilla download` - -### User Management - -#### Creating Multiple Users for a Workspace - -The `add-users` command creates multiple annotator accounts with predictable naming for easy distribution: - -```bash -uv run gsma argilla add-users \ - --workspace tsg-wg \ - --count 10 \ - --output-csv data/tsg_users.csv -``` - -**Features**: -- **Predictable naming**: Users follow pattern `{workspace}-user-{number}` (e.g., `tsg-wg-user-1`, `tsg-wg-user-2`) -- **Secure passwords**: Cryptographically random 8-character alphanumeric passwords (generated via `secrets` module) -- **CSV export**: Credentials saved to CSV file for easy distribution to annotators -- **Skip existing**: Automatically skips users that already exist -- **Workspace validation**: Requires workspace to exist before creating users -- **No auto-add**: Does NOT automatically add mattupson or louis (only creates specified users) - -**CSV Output Format**: -```csv -username,password -tsg-wg-user-1,aB3xK9mQ -tsg-wg-user-2,pL7wE2nR -``` - -**Use Cases**: -- Create annotation accounts for external domain experts -- Set up user accounts for workshops or training sessions -- Provision users for specific annotation campaigns - -**Note**: Workspace must exist before running this command. Create workspace first using `upload-by-subgroup` or manually in Argilla. - -### Annotation Guidelines - -Annotators evaluate Question-Answer pairs on: -- **Accuracy**: Is the answer factually correct based on the source document? -- **Completeness**: Does the answer fully address all aspects of the question? -- **Relevance**: Does the answer stay focused on the question asked? -- **Clarity**: Is the answer clear and understandable? 
- -### Dependencies - -**Environment Variables**: -- `ARGILLA_API_URL`: Argilla service URL (e.g., https://mantisnlp-annotate.hf.space) -- `ARGILLA_API_KEY`: API key for authentication -- `HUGGINGFACE_TOKEN`: For downloading datasets from Hub (optional if public) - -**Required Files**: -- `data/working_groups_mapping.json`: Document-to-subgroup mapping - -### Testing - -**Comprehensive test coverage** (519 tests): -- `tests/test_subgroup_adder.py`: 243 tests covering normalization, mapping, column addition, statistics -- `tests/validation/test_argilla_subgroup_uploader.py`: 276 tests covering workspace/user creation, explosion, upload logic, edge cases - -## Recent Changes -- **Annotation Pipeline** (Feature #011): Human validation workflow with subgroup-based annotation tasks - - Created `pipelines/annotation/dvc.yaml` with 6 stages (1 data prep + 5 frozen upload stages for subgroups: TSG, FASG, NG, RCS, eSim) - - **Subgroup Classification**: Added `gsma_dataset_creation/subgroup_adder.py` module - - Normalizes document names (removes extensions) for flexible matching across .md, .docx, .pdf formats - - Adds "subgroup" field to dataset Metadata via `add_subgroup_to_dataset()` - - CLI command: `add-subgroup-to-dataset` downloads from HuggingFace Hub and adds classifications - - **Argilla Integration**: Added `gsma_dataset_creation/validation/argilla_subgroup_uploader.py` - - `upload-by-subgroup` CLI command: Filters dataset by subgroup, samples Q&A pairs, uploads to Argilla - - Automatic workspace/user provisioning per subgroup (username=subgroup, password=subgroup-gsma) - - Automatically adds mattupson and louis as additional annotators to all workspaces - - Creates one annotation record per Question-Answer pair (deduplicated by question_id) - - `delete-workspace` CLI command: Clean up workspace and associated user - - **Annotation Workflow**: Domain experts evaluate Question-Answer pairs on accuracy, completeness, relevance, clarity - - **Comprehensive testing**: 
519 tests total - - `tests/test_subgroup_adder.py`: 243 tests for normalization, mapping, column addition, statistics - - `tests/validation/test_argilla_subgroup_uploader.py`: 276 tests for workspace creation, explosion, upload logic, edge cases - - **DVC Pipeline**: Run `dvc repro annotation:add_subgroups`, then unfreeze/run individual `upload_*_annotation` stages as needed - - **Environment Variables**: Requires `ARGILLA_API_URL` and `ARGILLA_API_KEY` - - **Documentation**: Comprehensive README at `pipelines/annotation/README.md` with workflow, customization, troubleshooting - -## Recent Changes -- **Annotation Pipeline** (Feature #011): Human validation workflow with subgroup-based annotation tasks -- **Document Scraping Pipeline** (Feature #011): Automated GSMA document collection using Playwright -- **PRD Pipeline** (Feature #010): Unified end-to-end pipeline combining all stages - - Created `pipelines/prd/dvc.yaml` with 21 stages (17 logical stages with foreach loops) - - **Complete Pipeline Flow**: Raw documents → Processed → Chunked → Questions → Similarity → Filtered → Validated → HuggingFace Datasets (Embedding + QA) → Uploaded - - **Stage Breakdown**: - 1. Document Processing (1 stage): Convert raw DOCX to processed Markdown - 2. Chunking (5 stages via foreach): Create chunks at 500/1000/2000/3000/4000 tokens using late chunking - 3. Question Generation (5 stages via foreach): Generate 5/10/20/30/40 questions per chunk using Cerebras GPT-OSS-120B - 4. Similarity Analysis (4 stages): Combine data → Hash → Rank (FAISS IVFFlat, k=20, threshold=0.3) → Detect overlaps - 5. Quality Filtering (5 stages): Apply chunk filter (procedures) → Filter chunks (p≥0.5) → Explode questions → Apply question filter (external refs) → Combined filtering - 6. Validation (2 stages): Batch candidates (3 per batch, randomized) → Validate with LLM (Qwen 235B, 50 concurrent, 50k limit) - 7. 
Dataset Creation & Upload (3 stages): Create both embedding and QA HuggingFace datasets (max 3 positives, 3 negatives) → Upload embedding dataset to mantisnlp/gsma_prd_synthetic_embedding → Upload QA dataset to mantisnlp/gsma_prd_synthetic_qa - - **Usage**: Run `dvc repro prd` or `cd pipelines/prd && dvc repro` for complete end-to-end execution - - **Dependencies**: Pre-trained filter models in `models/filters/`, working groups mapping in `data/working_groups_mapping.json` - - **Metrics**: Comprehensive metrics collected at each stage in `metrics/` directory (combined metrics for both dataset formats) - - **Important**: PRD pipeline consolidates outputs from individual pipelines (chunker, questions, similarity, validation). Use PRD pipeline for complete runs; individual pipelines remain for selective execution but cannot run simultaneously with PRD due to shared outputs. -- **Filters Pipeline** (Feature #009): Quality classification for dataset filtering - - Added `gsma filters` CLI with 5 commands: classify-external-references, classify-procedures, classify-procedures-simple, upload-to-argilla, download-from-argilla - - **External Reference Classifier**: Detects questions referring to unavailable external content (documents, tables, figures, excerpts) using GPT-5-mini + SetFit distillation - - **Working Group Procedures Classifier**: Filters non-technical content (legal text, procedures, frontmatter) using GPT-5-mini + SetFit distillation - - **Argilla Integration**: Upload classifications for manual annotation/validation, download annotated results - - Dependencies: argilla, instructor, sieves, setfit, outlines (filters extra group in pyproject.toml) - - Models distilled to SetFit (sentence-transformers/all-MiniLM-L6-v2 base) and saved to models/filters/ - - Uses shared data: data/validation/questions_with_candidates.parquet, data/enriched_chunks.parquet (no duplication) - - DVC pipeline: pipelines/filters/dvc.yaml with 5 stages (2 frozen for manual Argilla 
operations) -- **Validation Dataset Creator** (Feature #008): Transform validation results into dual HuggingFace training dataset formats - - Added `gsma datasets create-from-validation` command with support for both embedding and QA formats - - **Dual Dataset Formats**: - - **Embedding Format**: Contrastive learning with Question, Positive_Chunks (list), Negative_Chunks (list), Answer, Metadata - - **QA Format**: RAG/QA training with Question, Content (single-item list with source chunk), Answer, Metadata - - Merges multi-batch validation results by question_id - - Always includes original positive_chunk_id in Positive Chunks (never re-validated, already known correct) - - **Quality-based filtering**: `--max-positives` and `--max-negatives` parameters limit chunks per question (embedding format only), selecting top quality candidates - - **Combined Metrics**: Single JSON file with nested `embedding_metrics` and `qa_metrics` fields for both formats - - **Comprehensive histogram metrics**: Tracks distributions of positive/negative chunk counts, quality scores, and similarity scores for dataset analysis (embedding format) - - Handles edge cases: deduplication across batches, missing chunk content, inconsistent LLM validation results (is_answerable=True with quality=0) - - **Single-pass streaming**: Optimized to calculate metrics and generate dataset in one streaming pass (no separate metrics calculation pass) - - Memory-efficient processing: loads full validation results but streams question processing, writes JSONL during streaming pass - - **Output Formats** (for each dataset type): - - HuggingFace dataset directory (Arrow format with Metadata field, loadable with `load_from_disk(...)`) - - Optional JSONL output (same format without Metadata field for compatibility) - - JSONL written during streaming pass for memory efficiency (no additional memory overhead) - - **CLI Usage**: - ```bash - # Create both formats - uv run gsma datasets create-from-validation \ - 
--input data/validation/validation_results.parquet \ - --enriched-chunks data/enriched_chunks.parquet \ - --embedding-output data/validation/validation_dataset_embedding \ - --embedding-jsonl-output data/validation/validation_dataset_embedding.jsonl \ - --qa-output data/validation/validation_dataset_qa \ - --qa-jsonl-output data/validation/validation_dataset_qa.jsonl \ - --max-positives 3 --max-negatives 3 \ - --metrics-output metrics/dataset_creation_from_validation.json - - # Backward compatible (creates embedding format only) - uv run gsma datasets create-from-validation \ - --input data/validation/validation_results.parquet \ - --enriched-chunks data/enriched_chunks.parquet \ - --output data/validation/validation_dataset \ - --jsonl-output data/validation/validation_dataset.jsonl - ``` - - **Metrics output**: - - Combined structure: `embedding_metrics` and `qa_metrics` in single JSON - - Embedding metrics: Summary counts, chunk averages, comprehensive histograms (chunk counts, quality scores, similarity scores) - - QA metrics: Total questions, questions skipped (missing content) - - Provides insights into dataset quality, LLM confidence, and semantic relevance - - **DVC Pipeline Integration**: - - Stage: `create_validation_dataset` after `validate_batched_requests` - - Creates both embedding and QA formats with default limits (3 positives, 3 negatives) - - Two separate upload stages: `upload_embedding_dataset` and `upload_qa_dataset` - - HuggingFace repos: `mantisnlp/gsma_prd_synthetic_embedding` and `mantisnlp/gsma_prd_synthetic_qa` - - Full test coverage: 29 tests total (20 embedding tests + 9 QA tests) covering transformation logic, multi-batch merging, filtering, histogram generation, JSONL export, both format structures, combined metrics, edge cases, and end-to-end integration -- **Batched Validation Pipeline** (Feature #007): Implemented multi-candidate validation with 4-5× cost reduction - - Added `gsma validation` CLI commands: `explode-questions`, 
`batch-candidates`, `validate-batched-requests` - - SQLite checkpointing for resumable processing (auto-resumes after crashes, detects input changes via mtime) - - Randomized candidate order within batches to avoid position bias - - Independent chunk evaluation (not comparative) - each candidate evaluated against question independently - - Batches up to 10 candidates per API call (vs 1 candidate per call previously) - - Smart checkpoint management: resumes from checkpoint if input unchanged, reinitializes if input newer or --force used - - Output only written when 100% complete (all requests completed or permanently failed after 3 retries) - - **Performance Optimizations**: - - Content deduplication: Store only IDs in validation_requests_batched.parquet (~1.2GB → ~50MB), lookup content on-demand from enriched_chunks.parquet - - Concurrent processing: Implemented asyncio.gather() for true concurrent API calls (20x speedup: ~1 req/s → ~20 req/s with --max-concurrent 20) - - Database write serialization: Added asyncio.Lock to prevent SQLite "database is locked" errors during concurrent writes - - Removed redundant status/attempts columns from batched parquet (checkpoint manages these) - - Fixed --force flag: Properly clears old checkpoint data with DELETE before loading new data - - Export with --limit: Support partial exports when limited batch completes (not just 100% completion) - - Metrics output: Added --metrics-output for DVC metrics tracking (success rate, completion stats) -- **Chunk ID Collision Fix** (Issue #53): Fixed chunk_id collision across multiple chunk sizes in data_combiner - - Added `_extract_chunk_size_from_path()` helper to extract chunk size from directory names - - Modified `load_qa_data()` and `load_chunker_data()` to append chunk_size to chunk_id - - Chunk ID format changed from `{document}_{position}` to `{document}_{chunk_size}_{position}` - - Example: `IG.05 v1.0.md_0` → `IG.05 v1.0.md_500_0` (500-token chunks) - - Fallback uses MD5 hash of 
directory path for uniqueness when chunk_size not extractable - - Fixes Q&A merging issue where 75 Q&As from different chunk sizes merged into single chunk -- **HuggingFace Dataset Creation**: Implemented unified dataset creation with working group classification from hard negatives -- **Working Group Classification**: Added automatic classification of documents to working groups (TSG, FASG, eSim, etc.) via simple dictionary lookup -- **Dual Dataset Formats**: Created both embedding training format (Question/Positive/Negative chunks) and simple Q&A format -- **Comprehensive Testing**: Added 50+ tests covering data models, classification logic, file operations, and integration scenarios -- **DVC Integration**: Added pipeline stages for unified dataset creation across all chunk sizes -- **CLI Commands**: Added `create-datasets`, `create-embedding-dataset`, and `create-qa-dataset` commands with full validation -- **Multiple Chunk Size Testing**: Implemented DVC foreach for parallel testing of 500/1000/2000 token chunk sizes -- **Token-based Filtering**: Added `--filter-min-tokens` parameter to discard short chunks with comprehensive metrics -- **Document Filtering**: Added preamble removal functionality to filter GSMA copyright/TOC content -- **Metrics Collection**: Automatic saving of chunking and filtering statistics to `./metrics` directory -- Added comprehensive chunking functionality with 124 passing tests -- Created SDD documentation for Q/A generation feature -- Designed OpenRouter-compatible API integration - -## Configuration - -**Environment Variables**: -- `OPENROUTER_API_KEY`: For OpenRouter API access (question generation) -- `ARGILLA_API_URL`: Argilla service URL for human annotation (e.g., https://mantisnlp-annotate.hf.space) -- `ARGILLA_API_KEY`: Argilla API key for authentication -- `HUGGINGFACE_TOKEN`: For HuggingFace Hub operations (dataset download/upload) - -**DVC Pipelines**: -- PRD pipeline: `pipelines/prd/dvc.yaml` (20 stages, end-to-end) -- 
Annotation pipeline: `pipelines/annotation/dvc.yaml` (6 stages, human validation) -- Filters pipeline: `pipelines/filters/dvc.yaml` (5 stages, quality classification) -- Individual pipelines: chunker, questions, similarity, validation (selective execution) - -**Logging**: Structured logging with configurable levels (DEBUG, INFO, WARNING, ERROR) - -**Metrics**: Automatic collection in `./metrics` directory for all pipeline operations diff --git a/README.md b/README.md index b0f62c7..bc13f38 100644 --- a/README.md +++ b/README.md @@ -4,8 +4,49 @@ Pipelines for creating QA triplets from GSMA data for the [Open Telecom LLMs pro ## Overview +This repository contains end-to-end data processing pipelines that transform GSMA technical specifications and reports into high-quality synthetic question-answer datasets for training telecom-focused language models. The pipeline processes hundreds of documents through stages including document conversion, semantic chunking, synthetic Q&A generation using large language models, similarity analysis, quality filtering, and LLM-based validation. The resulting datasets are published to HuggingFace Hub in both contrastive learning format (for embedding models) and Q&A format (for retrieval-augmented generation). The repository includes three main pipelines: PRD (technical specifications), Discover (reports and whitepapers), and Annotation (human validation workflows with domain expert workspaces). + image +## Notes + +This project was developed fairly rapidly, and we now consider some of the design decisions we made at the outset to be sub-optimal. Owing to time constraints, and an unwillingness to recreate time-consuming and expensive steps (such as question creation and validation), there is some technical debt that would ideally be resolved in a longer project. Overall, in retrospect, DVC was not a good fit for this workflow, as it contains very time-consuming and expensive processes which are unlikely to be reproduced. 
Iterative (the `dvc` developers) now have a tool called [datachain](https://datachain.ai/) which might have been a better fit, and was designed to resolve many of the shortcomings we experienced with `dvc`, which is otherwise well suited for creating reproducible AI (machine learning) pipelines. + +### Only re-run the whole pipeline if a new dataset is required + +Running the complete `dvc` pipelines will recreate the questions data and then the validation, both of which take a considerable amount of time, and owing to the non-deterministic nature of LLMs, the questions created will be different from the data delivered to the Open Telco project so far. For this reason I (@ivyleavedtoadflax) recommend not attempting to re-run the whole pipeline, unless it is to completely recreate the datasets that were produced for the project, which is probably not desirable. + +In addition, the existing PRD pipeline will register as being in need of reproduction if you run `dvc status` or `dvc repro --dry`. This is because we made changes to the pipeline components as we went, but did not recreate the PRD pipeline from the start as this would have invalidated results that had already been created and annotated. + +### Task Order + +Some tasks (such as assigning a working group) would be better completed as part of the pipeline and incorporated at an early stage. We did not do this as it would have required recreation of the questions data, as we did not receive the working group mapping until after this had been produced. + +### Data Format + +Initially we worked with a simple `json` file format with one document per chunk. Later this became a bottleneck, and we switched to parquet files. To avoid having to re-run the question creation task, we did not implement parquet at the beginning of the pipeline, so you will see the initial stages of the pipeline using `json` files, and the later stages using `parquet`. Ideally we would have used `parquet` throughout. 
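As a rough illustration of the `json` to `parquet` hand-over described under Data Format, the sketch below gathers per-chunk JSON files into rows and writes them to a single parquet file. It is not the repository's code: the function names `load_chunk_records` and `write_parquet`, the added `source_file` column, and the assumption that each file holds either one chunk object or a list of chunk objects are all hypothetical, and the parquet step assumes `pandas` and `pyarrow` are installed.

```python
import json
from pathlib import Path


def load_chunk_records(chunk_dir: Path) -> list[dict]:
    """Read every *.json file in chunk_dir into a flat list of row dicts."""
    records: list[dict] = []
    for path in sorted(chunk_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            payload = json.load(handle)
        # Assumption: a file holds either one chunk object or a list of them.
        chunks = payload if isinstance(payload, list) else [payload]
        for chunk in chunks:
            # Keep the originating file name so rows stay traceable.
            records.append({"source_file": path.name, **chunk})
    return records


def write_parquet(records: list[dict], output_path: Path) -> None:
    """Write the rows to one parquet file (requires pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the loader works without pandas

    output_path.parent.mkdir(parents=True, exist_ok=True)
    pd.DataFrame.from_records(records).to_parquet(output_path, index=False)
```

A call such as `write_parquet(load_chunk_records(Path("data/prd/chunks_500")), Path("chunks_500.parquet"))` would then produce one file per chunk size; both paths here are illustrative only.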
+ +### Binary classifiers + +The `filter` stages of the workflows were implemented with binary classifiers distilled from larger models. We used the framework [sieves](https://sieves.ai/) to do this. Sieves is moving rapidly, and so as not to add instability to the dependencies of this project, we did not attempt to implement the distillation process in this repository. Instead we used a sieves script to train the models externally in their own virtual environment, and simply loaded the models in the pipeline for inference. + +An example script (using the latest version of sieves) is included in the `examples/` folder. + +### Limitations of OpenRouter + +In the validation stage we need to send several hundred thousand requests to `qwen` for validation, which is slow and expensive. I recommend fixing the provider to `Cerebras`, which is optimised for high-throughput inference at reasonable cost. Running with 50 concurrent tasks seems to work without issue. Running with 100 generated many `429` errors. Somewhere in between might be optimal. + +### Managing the validation step + +The validation step requires several hundred thousand API calls. In order to manage this process effectively, we implemented an SQLite job tracker that tracks the success of the API calls in an ephemeral database stored in `.dvc/.tmp/validation_checkpoints/requests.db`. This ensures that we can track successful and failed tasks for repetition, something which was not achievable with `dvc` alone. + +If you wish to run this step with `dvc` the approach is to: + +1. Delete the database prior to running the `validate_requests` stage for the first time. Setting the `--force` parameter on `uv run gsma validation validate-requests` will have the same effect. +1. Set a request limit (`50,000` is reasonable) to reduce memory overhead +1. Run the `validate_requests` stage multiple times until no requests remain in the queue. 
If you run the stage with `dvc` it will show the stage as completed after the first run, since it has no knowledge of the checkpoints database, so you will need to run `dvc repro -sf pipelines/prd/validate_requests` to force re-running that stage. Passing the `-i` parameter will make it interactive and allow you to confirm the run prior to execution. +1. You can monitor progress of the job by running `uv run scripts/check_validation_progress.py` + ## Setup 1. Install dependencies: @@ -18,6 +59,8 @@ uv sync # Set up your Mantis AWS key export AWS_ACCESS_KEY_ID=your_key export AWS_SECRET_ACCESS_KEY=your_secret + +# This may fail for some artefacts. See note below for DVC limitations. ``` 3. Pull latest data: @@ -29,105 +72,239 @@ dvc pull ### Running the Complete Pipeline -Process all data through the full pipeline (deduplication → conversion → chunking): + ```bash dvc repro ``` -### Manual Commands +## Data Structure + +``` +data/ +├── raw/ # Original source documents (DOCX, PDF) +├── prd/ # PRD pipeline outputs +│ ├── processed/ # Markdown files from DOCX conversion +│ ├── chunks_*/ # Chunked data at different token sizes (500, 1000, 2000, 3000, 4000) +│ ├── questions_*/ # Generated Q&A pairs per chunk size +│ ├── combined/ # Merged chunks + questions with working group classification +│ ├── similarity/ # Similarity analysis results (hashes, rankings, overlaps) +│ ├── exploded/ # Question-centric format with positive/negative chunks +│ ├── filtered/ # Quality-filtered questions and chunks +│ └── validation/ # LLM validation results and final datasets +├── discover/ # Discover pipeline outputs (similar structure to PRD) +└── gsma_prd_synthetic_with_subgroups/ # Annotated dataset with subgroup classifications +``` + +## Pipeline Stages -#### Deduplication -Remove duplicate files based on MD5 hash comparison: -```bash -# Dry run (preview only) -uv run gsma deduplicate data/raw +### PRD Pipeline (`pipelines/prd/dvc.yaml`) -# Actually remove duplicates -uv run gsma 
deduplicate data/raw --execute +End-to-end pipeline for technical specifications: -# Custom file pattern -uv run gsma deduplicate data/raw --pattern "**/*.docx" --execute -``` +1. **process_documents**: + - Converts DOCX → Markdown + - Removes GSMA template boilerplate + - Input: `data/raw/` → Output: `data/prd/processed/` + +2. **create_late_chunks** (5 stages): + - Creates late chunks at 500/1000/2000/3000/4000 tokens + - Uses sentence-transformers/all-MiniLM-L6-v2 embeddings + - Output: `data/prd/chunks_{size}/` + +3. **generate_questions** (5 stages): + - Generates 5/10/20/30/40 synthetic Q&A pairs per chunk size + - Uses Cerebras GPT-OSS-120B via OpenRouter + - Output: `data/prd/questions_{size}/` + +4. **data_combiner**: + - Merges all chunks + questions with working group classification + - Output: `data/prd/combined/` + +5. **similarity_hasher**: + - Adds SHA-256 content hashes for deduplication + - Output: `data/prd/similarity/hashed/` + +6. **similarity_ranker**: + - FAISS IVFFlat similarity search (top-K=20, threshold=0.3) + - Output: `data/prd/similarity/ranked/` + +7. **overlap_detector**: + - Character offset-based text overlap detection (min 50 chars) + - Output: `data/prd/similarity/overlaps/` + +8. **explode_questions**: + - Question-centric format (min-similarity: 0.35, max: 0.95) + - Output: `data/prd/exploded/` + +9. **apply_question_filter**: + - External reference classifier (filters unavailable content) + - Output: `data/prd/filtered/questions/` + +10. **apply_chunk_filter**: + - Procedures classifier + keyword exclusion + - Filters: legal/procedural content, "prd@gsma.com" boilerplate + - Output: `data/prd/filtered/chunks/` + +11. **filter_questions_by_chunk_quality**: + - Combined quality filtering (min probability: 0.5) + - Output: `data/prd/filtered/combined/` + +12. **validate_requests**: + - LLM validation with Qwen 235B via Cerebras + - 50 concurrent requests, 50k question limit + - Output: `data/prd/validation/validated/` + +13. 
**create_validation_dataset**: + - Dual format: embedding (contrastive) + QA (RAG) + - Max 3 positives/negatives per question + - Output: `data/prd/validation/datasets/` + +14. **upload_embedding_dataset**: + - Uploads to HuggingFace: `mantisnlp/gsma_prd_synthetic_embedding` -#### Document Conversion -Convert DOCX files to Markdown: +15. **upload_qa_dataset**: + - Uploads to HuggingFace: `mantisnlp/gsma_prd_synthetic_qa` + +### Discover Pipeline (`pipelines/discover/dvc.yaml`) + +Similar structure for reports/whitepapers (304 PDF/DOCX documents): +- Includes web scraping with Playwright +- PDF processing via PyMuPDF +- Same chunking → validation → dataset creation workflow +- Outputs: `mantisnlp/gsma_discover_synthetic_embedding` and `mantisnlp/gsma_discover_synthetic_qa` + +### Annotation Pipeline (`pipelines/annotation/dvc.yaml`) + +Human validation workflow with subgroup-based tasks: +1. **add_subgroups**: Adds working group/subgroup classifications to datasets +2. **upload_*_annotation**: Creates Argilla workspaces for domain experts (TSG, FASG, NG, RCS, eSIM) + +### Running Tests ```bash -uv run gsma process data/deduplicated data/processed +uv run pytest tests/ -v +``` -# With debug logging -uv run gsma process data/deduplicated data/processed --log-level DEBUG +### Install in Development Mode +```bash +uv sync ``` -#### Document Chunking -Create 300-token chunks from Markdown files: +## CLI Commands + +The `gsma` CLI provides comprehensive tools for document processing, question generation, validation, filtering, and annotation management. 
+ +### Document Processing ```bash -uv run gsma chunk data/processed data/chunked +# Convert DOCX to Markdown +uv run gsma process -# With custom options -uv run gsma chunk data/processed data/chunked --log-level DEBUG --pattern "*.md" --limit 10 +# Remove duplicate files +uv run gsma deduplicate [--execute] + +# Create chunks from Markdown files +uv run gsma chunk [--chunker late] [--chunk-size 500] ``` -## Data Structure +### Question Generation +```bash +# Generate synthetic Q&A pairs from chunks +uv run gsma questions generate-from-chunks \ + --num-questions 5 \ + --model cerebras/llama3.1-70b -``` -data/ -├── raw/ # Original DOCX files (DVC tracked) -├── deduplicated/ # Raw data with duplicates removed (pipeline stage 1) -├── processed/ # Markdown files (pipeline stage 2) -└── chunked/ # 300-token JSON chunks (pipeline stage 3) +# Combine questions with chunks +uv run gsma questions combine-questions ``` -## Pipeline Stages +### Similarity Analysis +```bash +# Combine data with working group classification +uv run gsma similarity combine -1. **process_documents**: - - Processes `data/raw` → `data/processed` - - Removes duplicate files using MD5 hash comparison - - Converts DOCX files to Markdown format - - Creates flattened directory structure with duplicate name handling +# Add SHA-256 content hashes +uv run gsma similarity hash -2. 
**create_chunks**: - - Processes `data/processed` → `data/chunked` - - Creates 300-token chunks from Markdown files using chonkie TokenChunker - - Generates JSON files with chunk metadata - - File naming: `document.md` → `document_chunks.json` +# FAISS similarity ranking +uv run gsma similarity rank --top-k 20 -### Running Tests -```bash -uv run pytest tests/ -v +# Detect text overlaps +uv run gsma similarity detect-overlaps ``` -### Type Checking +### Quality Filtering ```bash -uv run mypy gsma_dataset_creation/ +# Apply chunk quality filter (procedures classifier) +uv run gsma filters apply-chunk-filter + +# Apply question filter (external reference classifier) +uv run gsma filters apply-question-filter + +# Filter questions by chunk quality +uv run gsma filters filter-questions-by-chunk-quality ``` -### Install in Development Mode +### Validation ```bash -uv pip install -e . +# Explode questions to question-centric format +uv run gsma validation explode-questions + +# Validate Q&A pairs with LLM +uv run gsma validation validate-requests \ + --model cerebras/qwen-2.5-235b \ + --max-concurrent 50 ``` -## CLI Commands +### Dataset Creation +```bash +# Create datasets from validation results +uv run gsma datasets create-from-validation -### Available Commands -- `uv run gsma process` - Convert DOCX to Markdown with deduplication -- `uv run gsma chunk` - Create 300-token chunks from Markdown files -- `uv run gsma deduplicate` - Remove duplicate files (standalone mode) +# Upload to HuggingFace Hub +uv run gsma datasets upload +``` -### Help +### Argilla Annotation Management ```bash -uv run gsma --help -uv run gsma process --help -uv run gsma chunk --help -uv run gsma deduplicate --help +# Upload dataset for annotation +uv run gsma argilla upload --dataset-path -w + +# Upload by subgroup +uv run gsma argilla upload-by-subgroup \ + --dataset-path \ + --subgroup TSG \ + --sample-size 1000 + +# User management +uv run gsma argilla add-users -w TSG --count 10 --output-csv 
users.csv +uv run gsma argilla add-user -u alice -p secret123 -w TSG -w FASG +uv run gsma argilla list-users -w TSG +uv run gsma argilla list-workspaces +uv run gsma argilla list-datasets -w TSG + +# Track annotation progress +uv run gsma argilla track-progress -w TSG + +# Download annotated results +uv run gsma argilla download --output-path -w + +# Cleanup +uv run gsma argilla delete-user -u username +uv run gsma argilla delete-workspace TSG ``` -### Chunk Command Options +### Subgroup Classification ```bash -# Basic usage -uv run gsma chunk INPUT_DIR OUTPUT_DIR +# Add subgroup classifications to dataset +uv run gsma add-subgroup-to-dataset \ + --dataset-repo mantisnlp/gsma_prd_synthetic \ + --working-groups data/working_groups_mapping.json \ + --output data/gsma_prd_synthetic_with_subgroups +``` -# Available options: ---log-level LEVEL # Set logging level (DEBUG, INFO, WARNING, ERROR) ---pattern PATTERN # File pattern for input files (default: **/*.md) ---limit N # Limit processing to first N files (for testing) +### Help +```bash +# Get help for any command +uv run gsma --help +uv run gsma --help +uv run gsma --help ``` diff --git a/data/working_groups_mapping.json.dvc b/data/working_groups_mapping.json.dvc index be87e00..8a5b4d9 100644 --- a/data/working_groups_mapping.json.dvc +++ b/data/working_groups_mapping.json.dvc @@ -1,5 +1,5 @@ outs: -- md5: 2adbc799e706c013db55d1fb3338878c - size: 10629 +- md5: b70592bdeac5c03634a60d09d0a8fbc7 + size: 9901 hash: md5 path: working_groups_mapping.json diff --git a/examples/sieves_distillation_example.py b/examples/sieves_distillation_example.py new file mode 100644 index 0000000..d738700 --- /dev/null +++ b/examples/sieves_distillation_example.py @@ -0,0 +1,266 @@ +#!/usr/bin/env python3 + +"""Classify working group procedure chunks and distill into a small model. + +This script classifies document chunks as TRUE if they contain: +- Working group procedures (chair election, removal, etc.) 
+- Legal details (liability caps, etc.) +- Frontmatter content (index, TOC, etc.) + +Then distills the results into a smaller specialized model. +""" + +import dataclasses +import json +import os + +import instructor +import openai +import pandas as pd +from dotenv import load_dotenv + +from sieves import Doc, Model, Pipeline, tasks +from sieves.engines.types import GenerationSettings +from sieves.tasks.predictive.classification.core import FewshotExampleSingleLabel +from sieves.tasks.postprocessing.distillation import Distillation, DistillationFramework + +# Load environment variables +load_dotenv() + +#class Chunk(pydantic.BaseModel, frozen=True): +# """Classification result for working group procedure content.""" +# +# is_legal_content: bool +# is_GSMA_working_group_procedure: bool +# confidence: Literal["high", "medium", "low"] +# reasoning: str + +if __name__ == '__main__': + print("\n" + "="*60) + print("GSMA Working Group Procedures Classification") + print("="*60 + "\n") + + # Load data + print("📂 Loading data...") + INPUT_FILE = "~/Documents/mantis/GSMA/data/enriched_chunks.parquet" + df = pd.read_parquet(INPUT_FILE) + print(f" ✓ Loaded {len(df):,} total chunks from {INPUT_FILE}") + + sample_df = df.sample(n=2000) # Classification for review + chunks = sample_df["chunk_content"].tolist() + truncated_chunks = [chunk[0:200] for chunk in chunks] + docs = [Doc(text=chunk_text) for chunk_text in truncated_chunks] + print(f" ✓ Sampled {len(docs)} chunks (truncated to 200 chars)\n") + + # Setup OpenAI client with Instructor (async) + print("🔌 Setting up OpenAI client...") + openai_client = openai.AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]) + client = instructor.from_openai(openai_client) + MODEL = "gpt-5-mini" # Fast and cheap for classification + + # Create Instructor model for sieves + from sieves.engines.instructor_ import Model + model = Model(name=MODEL, client=client) + print(f" ✓ Model created: {MODEL}\n") + + # Binary classification: technical 
content vs other + label_descriptions = { + "technical content": "Technical specifications, API documentation, protocols, standards, implementation details, test procedures, system architecture, technical requirements", + "other": "Legal text, working group procedures, document frontmatter, administrative content, table of contents, legal obligations" + } + + + # Few-shot examples for binary classification + fewshot_examples = [ + # Technical content examples + FewshotExampleSingleLabel( + text="The API endpoint supports HTTP GET requests with the following parameters: id (required), format (optional). Response format is JSON.", + reasoning="Describes technical API specifications and implementation details", + label="technical content", + confidence=1.0 + ), + FewshotExampleSingleLabel( + text="The protocol defines three message types: INIT, ACK, and DATA. Each message must include a 16-bit checksum in the header.", + reasoning="Describes technical protocol specifications and message structure", + label="technical content", + confidence=1.0 + ), + FewshotExampleSingleLabel( + text="Test Procedure B09: Verify that the device sends an acknowledgment within 500ms of receiving the request packet.", + reasoning="Describes technical test procedure and timing requirements", + label="technical content", + confidence=1.0 + ), + # Other content examples + FewshotExampleSingleLabel( + text="The Signatory shall not be liable for any indirect, consequential, or punitive damages arising from this agreement. 
Maximum liability is capped at $100,000.", + reasoning="Contains legal liability limitations and obligations", + label="other", + confidence=1.0 + ), + FewshotExampleSingleLabel( + text="The Chair shall be elected by a simple majority vote of the Working Group members present at the Annual General Meeting.", + reasoning="Describes working group governance procedure for chair election", + label="other", + confidence=1.0 + ), + FewshotExampleSingleLabel( + text="Table of Contents\n1. Introduction\n2. Scope\n3. Definitions\n4. Technical Requirements", + reasoning="Document frontmatter containing table of contents", + label="other", + confidence=1.0 + ), + ] + + # Create classification pipeline + print("⚙️ Creating classification pipeline...") + print(" Labels:") + for label, desc in label_descriptions.items(): + print(f" • {label}: {desc}") + print(f" Few-shot examples: {len(fewshot_examples)}") + print() + + # Create binary classification task + classifier = tasks.Classification( + task_id="procedures_classifier", + labels=["technical content", "other"], + model=model, + generation_settings=GenerationSettings( + batch_size=50, # Larger batches for speed + strict_mode=True, # Required for distillation - raises errors on failures + inference_kwargs={ + "max_tokens": None, # Disable default max_tokens for GPT-5 + "max_completion_tokens": 1000 # GPT-5 uses this - increased for longer outputs + }, + ), + label_descriptions=label_descriptions, + fewshot_examples=fewshot_examples, + multi_label=False, # Binary single-label classification + ) + + print(" ✓ Classification task created\n") + + # Create distillation task + print("🔬 Setting up distillation...") + OUTPUT_PATH = "working-group-procedures-model-2000" + distiller = Distillation( + target_task_id="procedures_classifier", + base_model_id="sentence-transformers/all-MiniLM-L6-v2", # Smaller model (80MB vs 420MB) + framework=DistillationFramework.setfit, + output_path=OUTPUT_PATH, + train_frac=0.9, + val_frac=0.1, + 
train_kwargs={ + "num_epochs": 1, + "batch_size": 32, # Reduced for memory safety + "num_iterations": 10, # Reduce iterations for speed + }, + ) + print(f" ✓ Distillation configured (framework: SetFit, output: {OUTPUT_PATH})\n") + + # Create pipeline with both tasks + pipe = classifier + distiller + + # Run classification + distillation + print("🔍 Running classification and distillation...") + print(f" Processing {len(docs)} documents...\n") + + docs = list(pipe(docs)) + + # Extract and save results + print("\n💾 Saving results...") + out = [] + for doc in docs: + classification_result = doc.results.get("procedures_classifier") + + # Extract label from result (can be string or [label, confidence] list) + if isinstance(classification_result, str): + predicted_label = classification_result + elif isinstance(classification_result, list) and len(classification_result) > 0: + predicted_label = classification_result[0] + else: + predicted_label = None + + out.append({ + "text": doc.text, + "predicted_label": predicted_label, + }) + + with open("out.jsonl", "w") as f: + for record in out: + f.write(json.dumps(record) + "\n") + print(f" ✓ Wrote {len(out)} records to out.jsonl") + + # Save classifier config and docs with results + print("💾 Saving classifier config and docs...") + config = classifier.serialize() + config.dump("classifier_config.yml") + + with open("working_group_procedures.jsonl", "w") as f: + for doc in docs: + doc_dict = dataclasses.asdict(doc) + # Remove images field if present (can't serialize PIL images) + doc_dict.pop("images", None) + f.write(json.dumps(doc_dict) + "\n") + print(f" ✓ Saved config and {len(docs)} docs with results") + + # Evaluate distilled model on validation set + print("\n📊 Evaluating distilled model on validation set...") + from setfit import SetFitModel + distilled_model = SetFitModel.from_pretrained(OUTPUT_PATH) + + # Get validation samples (last 10% of the data) + val_size = int(len(truncated_chunks) * 0.1) + val_chunks = 
truncated_chunks[-val_size:] + val_labels = [out[i]["predicted_label"] for i in range(len(out) - val_size, len(out))] + + # Run predictions on validation set + val_predictions = distilled_model.predict(val_chunks) + + # Get label mapping from model (matches the order in labels list) + if hasattr(distilled_model.model_head, 'id2label'): + id2label = distilled_model.model_head.id2label + else: + # Default mapping based on label order in Classification task + id2label = {0: "technical content", 1: "other"} + + val_pred_labels = [id2label[int(p)] for p in val_predictions] + + # Calculate metrics + correct = sum(1 for pred, true in zip(val_pred_labels, val_labels) if pred == true) + accuracy = correct / len(val_labels) if val_labels else 0 + + print(f" Validation size: {len(val_labels)}") + print(f" Accuracy: {accuracy:.2%} ({correct}/{len(val_labels)})") + + # Per-label metrics + from collections import defaultdict + label_stats = defaultdict(lambda: {"correct": 0, "total": 0}) + for pred, true in zip(val_pred_labels, val_labels): + label_stats[true]["total"] += 1 + if pred == true: + label_stats[true]["correct"] += 1 + + print(f"\n Per-label accuracy:") + for label in ["technical content", "other"]: + if label_stats[label]["total"] > 0: + acc = label_stats[label]["correct"] / label_stats[label]["total"] + print(f" {label:20s}: {acc:.2%} ({label_stats[label]['correct']}/{label_stats[label]['total']})") + + # Print summary statistics + print("\n" + "="*60) + print("Summary") + print("="*60) + label_counts = {"technical content": 0, "other": 0} + for record in out: + label = record.get("predicted_label") + if label in label_counts: + label_counts[label] += 1 + + print("\nLabel distribution:") + for label, count in label_counts.items(): + percentage = (count / len(out) * 100) if out else 0 + print(f" {label:30s}: {count:3d} ({percentage:5.1f}%)") + + print(f"\n✅ Distilled model saved to: {OUTPUT_PATH}/") + print("="*60) diff --git a/memory/constitution.md 
b/memory/constitution.md deleted file mode 100644 index a4670ff..0000000 --- a/memory/constitution.md +++ /dev/null @@ -1,50 +0,0 @@ -# [PROJECT_NAME] Constitution - - -## Core Principles - -### [PRINCIPLE_1_NAME] - -[PRINCIPLE_1_DESCRIPTION] - - -### [PRINCIPLE_2_NAME] - -[PRINCIPLE_2_DESCRIPTION] - - -### [PRINCIPLE_3_NAME] - -[PRINCIPLE_3_DESCRIPTION] - - -### [PRINCIPLE_4_NAME] - -[PRINCIPLE_4_DESCRIPTION] - - -### [PRINCIPLE_5_NAME] - -[PRINCIPLE_5_DESCRIPTION] - - -## [SECTION_2_NAME] - - -[SECTION_2_CONTENT] - - -## [SECTION_3_NAME] - - -[SECTION_3_CONTENT] - - -## Governance - - -[GOVERNANCE_RULES] - - -**Version**: [CONSTITUTION_VERSION] | **Ratified**: [RATIFICATION_DATE] | **Last Amended**: [LAST_AMENDED_DATE] - diff --git a/memory/constitution_update_checklist.md b/memory/constitution_update_checklist.md deleted file mode 100644 index adcf844..0000000 --- a/memory/constitution_update_checklist.md +++ /dev/null @@ -1,85 +0,0 @@ -# Constitution Update Checklist - -When amending the constitution (`/memory/constitution.md`), ensure all dependent documents are updated to maintain consistency. 
- -## Templates to Update - -### When adding/modifying ANY article: -- [ ] `/templates/plan-template.md` - Update Constitution Check section -- [ ] `/templates/spec-template.md` - Update if requirements/scope affected -- [ ] `/templates/tasks-template.md` - Update if new task types needed -- [ ] `/.claude/commands/plan.md` - Update if planning process changes -- [ ] `/.claude/commands/tasks.md` - Update if task generation affected -- [ ] `/CLAUDE.md` - Update runtime development guidelines - -### Article-specific updates: - -#### Article I (Library-First): -- [ ] Ensure templates emphasize library creation -- [ ] Update CLI command examples -- [ ] Add llms.txt documentation requirements - -#### Article II (CLI Interface): -- [ ] Update CLI flag requirements in templates -- [ ] Add text I/O protocol reminders - -#### Article III (Test-First): -- [ ] Update test order in all templates -- [ ] Emphasize TDD requirements -- [ ] Add test approval gates - -#### Article IV (Integration Testing): -- [ ] List integration test triggers -- [ ] Update test type priorities -- [ ] Add real dependency requirements - -#### Article V (Observability): -- [ ] Add logging requirements to templates -- [ ] Include multi-tier log streaming -- [ ] Update performance monitoring sections - -#### Article VI (Versioning): -- [ ] Add version increment reminders -- [ ] Include breaking change procedures -- [ ] Update migration requirements - -#### Article VII (Simplicity): -- [ ] Update project count limits -- [ ] Add pattern prohibition examples -- [ ] Include YAGNI reminders - -## Validation Steps - -1. **Before committing constitution changes:** - - [ ] All templates reference new requirements - - [ ] Examples updated to match new rules - - [ ] No contradictions between documents - -2. 
**After updating templates:** - - [ ] Run through a sample implementation plan - - [ ] Verify all constitution requirements addressed - - [ ] Check that templates are self-contained (readable without constitution) - -3. **Version tracking:** - - [ ] Update constitution version number - - [ ] Note version in template footers - - [ ] Add amendment to constitution history - -## Common Misses - -Watch for these often-forgotten updates: -- Command documentation (`/commands/*.md`) -- Checklist items in templates -- Example code/commands -- Domain-specific variations (web vs mobile vs CLI) -- Cross-references between documents - -## Template Sync Status - -Last sync check: 2025-07-16 -- Constitution version: 2.1.1 -- Templates aligned: ❌ (missing versioning, observability details) - ---- - -*This checklist ensures the constitution's principles are consistently applied across all project documentation.* diff --git a/pipelines/chunker/dvc.lock b/pipelines/chunker/dvc.lock deleted file mode 100644 index 88a941b..0000000 --- a/pipelines/chunker/dvc.lock +++ /dev/null @@ -1,177 +0,0 @@ -schema: '2.0' -stages: - process_documents: - cmd: uv run gsma process data/raw data/processed --log-level INFO - deps: - - path: data/raw - hash: md5 - md5: 8ec3c1c0ae7420e286a24114ee21d188.dir - size: 262572056 - nfiles: 270 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 98f0613663b314661b12ab8dc17ecb00 - size: 51648 - - path: gsma_dataset_creation/converter.py - hash: md5 - md5: 969c5d7815534d4001d2f381410c4c60 - size: 1487 - - path: gsma_dataset_creation/processor.py - hash: md5 - md5: dfd772bd77258ed343d6292c98510977 - size: 7738 - outs: - - path: data/processed - hash: md5 - md5: 72c80941549afe1a3d8792d26ed6fe1e.dir - size: 16688411 - nfiles: 248 - - path: metrics/document_processing_metrics.json - hash: md5 - md5: 0d6ddfe77d33458a0e9bfbd2690e637e - size: 424 - create_late_chunks@0: - cmd: "uv run gsma chunk data/processed data/chunked_late_500 --chunker late --chunker-config\ 
- \ '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\", \"chunk_size\"\ - : 500, \"min_characters_per_chunk\": 24}' --filter-min-tokens 0 --limit-docs\ - \ 1000 --log-level INFO" - deps: - - path: data/processed - hash: md5 - md5: 72c80941549afe1a3d8792d26ed6fe1e.dir - size: 16688411 - nfiles: 248 - - path: gsma_dataset_creation/chunker.py - hash: md5 - md5: 0c10ba0b0885709ae7bff0ead3570316 - size: 17857 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 8cf0155fb46d8d64779a3d5102876312 - size: 52349 - outs: - - path: data/chunked_late_500 - hash: md5 - md5: cdf4ce62598c452b97deaff3762174bd.dir - size: 112905314 - nfiles: 248 - - path: metrics/chunk_metrics_chunked_late_500.json - hash: md5 - md5: 8b1f09589021cf6885959d230268c7f3 - size: 768 - create_late_chunks@1: - cmd: "uv run gsma chunk data/processed data/chunked_late_1000 --chunker late --chunker-config\ - \ '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\", \"chunk_size\"\ - : 1000, \"min_characters_per_chunk\": 24}' --filter-min-tokens 0 --limit-docs\ - \ 1000 --log-level INFO" - deps: - - path: data/processed - hash: md5 - md5: 72c80941549afe1a3d8792d26ed6fe1e.dir - size: 16688411 - nfiles: 248 - - path: gsma_dataset_creation/chunker.py - hash: md5 - md5: 0c10ba0b0885709ae7bff0ead3570316 - size: 17857 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 8cf0155fb46d8d64779a3d5102876312 - size: 52349 - outs: - - path: data/chunked_late_1000 - hash: md5 - md5: 7cf69d9840ef5521665ef13dc0753c5f.dir - size: 64371559 - nfiles: 248 - - path: metrics/chunk_metrics_chunked_late_1000.json - hash: md5 - md5: 14915e93d2df0ace61543a93faffd0d9 - size: 770 - create_late_chunks@2: - cmd: "uv run gsma chunk data/processed data/chunked_late_2000 --chunker late --chunker-config\ - \ '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\", \"chunk_size\"\ - : 2000, \"min_characters_per_chunk\": 24}' --filter-min-tokens 0 --limit-docs\ - \ 1000 --log-level INFO" - deps: - - 
path: data/processed - hash: md5 - md5: 72c80941549afe1a3d8792d26ed6fe1e.dir - size: 16688411 - nfiles: 248 - - path: gsma_dataset_creation/chunker.py - hash: md5 - md5: 0c10ba0b0885709ae7bff0ead3570316 - size: 17857 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 8cf0155fb46d8d64779a3d5102876312 - size: 52349 - outs: - - path: data/chunked_late_2000 - hash: md5 - md5: ad55506022dcd9f89bf999e65e841e2e.dir - size: 41293426 - nfiles: 248 - - path: metrics/chunk_metrics_chunked_late_2000.json - hash: md5 - md5: 1366ce3d1773deaed3a5e33709fec16e - size: 757 - create_late_chunks@3: - cmd: "uv run gsma chunk data/processed data/chunked_late_3000 --chunker late --chunker-config\ - \ '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\", \"chunk_size\"\ - : 3000, \"min_characters_per_chunk\": 24}' --filter-min-tokens 0 --limit-docs\ - \ 1000 --log-level INFO" - deps: - - path: data/processed - hash: md5 - md5: 72c80941549afe1a3d8792d26ed6fe1e.dir - size: 16688411 - nfiles: 248 - - path: gsma_dataset_creation/chunker.py - hash: md5 - md5: 0c10ba0b0885709ae7bff0ead3570316 - size: 17857 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 8cf0155fb46d8d64779a3d5102876312 - size: 52349 - outs: - - path: data/chunked_late_3000 - hash: md5 - md5: ab6fc68856dfe51186a2cc9bfd61d29b.dir - size: 33715035 - nfiles: 248 - - path: metrics/chunk_metrics_chunked_late_3000.json - hash: md5 - md5: 9c88aba4104403423bb90bdda02952d9 - size: 767 - create_late_chunks@4: - cmd: "uv run gsma chunk data/processed data/chunked_late_4000 --chunker late --chunker-config\ - \ '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\", \"chunk_size\"\ - : 4000, \"min_characters_per_chunk\": 24}' --filter-min-tokens 0 --limit-docs\ - \ 1000 --log-level INFO" - deps: - - path: data/processed - hash: md5 - md5: 72c80941549afe1a3d8792d26ed6fe1e.dir - size: 16688411 - nfiles: 248 - - path: gsma_dataset_creation/chunker.py - hash: md5 - md5: 0c10ba0b0885709ae7bff0ead3570316 - 
size: 17857 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 8cf0155fb46d8d64779a3d5102876312 - size: 52349 - outs: - - path: data/chunked_late_4000 - hash: md5 - md5: 6c696d0ef870d7ba2aec72155bbfed66.dir - size: 29820279 - nfiles: 248 - - path: metrics/chunk_metrics_chunked_late_4000.json - hash: md5 - md5: 1e20cebad49f1722b72bff1226f4020b - size: 769 diff --git a/pipelines/chunker/dvc.yaml b/pipelines/chunker/dvc.yaml deleted file mode 100644 index ce455bf..0000000 --- a/pipelines/chunker/dvc.yaml +++ /dev/null @@ -1,67 +0,0 @@ -stages: - process_documents: - wdir: ../.. - cmd: uv run gsma process data/raw data/processed --log-level INFO - deps: - - data/raw - - gsma_dataset_creation/converter.py - - gsma_dataset_creation/processor.py - - gsma_dataset_creation/cli.py - outs: - - data/processed - metrics: - - metrics/document_processing_metrics.json - desc: "Convert raw DOCX files to Markdown with streaming deduplication and flattened structure" - frozen: true - - create_late_chunks: - foreach: - - size: 500 - name: late_500 - type: late - embedding_model: "sentence-transformers/all-MiniLM-L6-v2" - min_chars: 24 - min_tokens: 0 - - size: 1000 - name: late_1000 - type: late - embedding_model: "sentence-transformers/all-MiniLM-L6-v2" - min_chars: 24 - min_tokens: 0 - - size: 2000 - name: late_2000 - type: late - embedding_model: "sentence-transformers/all-MiniLM-L6-v2" - min_chars: 24 - min_tokens: 0 - - size: 3000 - name: late_3000 - type: late - embedding_model: "sentence-transformers/all-MiniLM-L6-v2" - min_chars: 24 - min_tokens: 0 - - size: 4000 - name: late_4000 - type: late - embedding_model: "sentence-transformers/all-MiniLM-L6-v2" - min_chars: 24 - min_tokens: 0 - do: - wdir: ../.. 
- cmd: >- - uv run gsma chunk data/processed data/chunked_${item.name} - --chunker ${item.type} - --chunker-config '{"embedding_model": "${item.embedding_model}", "chunk_size": ${item.size}, "min_characters_per_chunk": ${item.min_chars}}' - --filter-min-tokens ${item.min_tokens} - --limit-docs 1000 - --log-level INFO - deps: - - data/processed - - gsma_dataset_creation/chunker.py - - gsma_dataset_creation/cli.py - outs: - - data/chunked_${item.name} - metrics: - - metrics/chunk_metrics_chunked_${item.name}.json - desc: "Create late chunks from processed Markdown files with embedded context preservation" - frozen: true diff --git a/pipelines/datasets/dvc.lock b/pipelines/datasets/dvc.lock deleted file mode 100644 index a5f9b10..0000000 --- a/pipelines/datasets/dvc.lock +++ /dev/null @@ -1,122 +0,0 @@ -schema: '2.0' -stages: - create_datasets_from_questions_include_unclassified: - cmd: uv run gsma create-datasets-from-questions -i data/questions_gpt-oss-120b_late_500 - -i data/questions_gpt-oss-120b_late_1000 -i data/questions_gpt-oss-120b_late_2000 - -i data/questions_gpt-oss-120b_late_3000 -i data/questions_gpt-oss-120b_late_4000 - -w data/working_groups_mapping.json -e data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - -q data/qa_dataset_from_questions_all_chunks_no_unclassified.json --log-level - INFO --exclude-unclassified - deps: - - path: data/questions_gpt-oss-120b_late_1000 - hash: md5 - md5: e5aa8dfa448ab8afb2a546e9e0ed0ef2.dir - size: 46638867 - nfiles: 4001 - - path: data/questions_gpt-oss-120b_late_2000 - hash: md5 - md5: 589b801c932365ffcf109218f59d3bec.dir - size: 38981806 - nfiles: 1830 - - path: data/questions_gpt-oss-120b_late_3000 - hash: md5 - md5: 7fca8c90a24f32b89d4526eae1a99f7d.dir - size: 32217471 - nfiles: 1066 - - path: data/questions_gpt-oss-120b_late_4000 - hash: md5 - md5: c1578e42cf46e4eeefcefd25d0de812c.dir - size: 25730277 - nfiles: 658 - - path: data/questions_gpt-oss-120b_late_500 - hash: md5 - md5: 
73a0a7facc7360db9b11a57e542c5fca.dir - size: 52133974 - nfiles: 8230 - - path: data/working_groups_mapping.json - hash: md5 - md5: b70592bdeac5c03634a60d09d0a8fbc7 - size: 9901 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: ab0eefec416b225e71137f63fd3dcead - size: 72979 - - path: gsma_dataset_creation/hf_dataset_creator.py - hash: md5 - md5: d265b35cddf227be90d7e49e00ca1d70 - size: 36320 - outs: - - path: data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - hash: md5 - md5: 4480366fe17815d50de752f9741200be - size: 871405063 - - path: data/qa_dataset_from_questions_all_chunks_no_unclassified.json - hash: md5 - md5: 6b2285362280896cfe01e80c2b7d8179 - size: 854945423 - create_hf_datasets_from_questions: - cmd: uv run gsma create-hf-datasets --embedding-json data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - --qa-json data/qa_dataset_from_questions_all_chunks_no_unclassified.json --output-dir - data/hf_datasets --dataset-name gsma_telecom_qa_questions --log-level INFO - deps: - - path: data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - hash: md5 - md5: 4480366fe17815d50de752f9741200be - size: 871405063 - - path: data/qa_dataset_from_questions_all_chunks_no_unclassified.json - hash: md5 - md5: 6b2285362280896cfe01e80c2b7d8179 - size: 854945423 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: ab0eefec416b225e71137f63fd3dcead - size: 72979 - - path: gsma_dataset_creation/hf_dataset_creator.py - hash: md5 - md5: d265b35cddf227be90d7e49e00ca1d70 - size: 36320 - outs: - - path: data/hf_datasets/gsma_telecom_qa_questions_embedding - hash: md5 - md5: ff43761c7a80f843eca832b0343f29c5.dir - size: 828682250 - nfiles: 4 - - path: data/hf_datasets/gsma_telecom_qa_questions_qa - hash: md5 - md5: cfb698fe3ac1a75947c13ba1be92fd9e.dir - size: 823548401 - nfiles: 4 - upload_hf_qa_dataset_to_hub: - cmd: uv run gsma upload-hf-dataset --dataset-path data/hf_datasets/gsma_telecom_qa_questions_qa - --repo-name 
"mantisnlp/telecom-questions" --log-level INFO - deps: - - path: data/hf_datasets/gsma_telecom_qa_questions_qa - hash: md5 - md5: cfb698fe3ac1a75947c13ba1be92fd9e.dir - size: 823548401 - nfiles: 4 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: ab0eefec416b225e71137f63fd3dcead - size: 72979 - - path: gsma_dataset_creation/hf_dataset_creator.py - hash: md5 - md5: d265b35cddf227be90d7e49e00ca1d70 - size: 36320 - upload_hf_embedding_dataset_to_hub: - cmd: uv run gsma upload-hf-dataset --dataset-path data/hf_datasets/gsma_telecom_qa_questions_embedding - --repo-name "mantisnlp/telecom-embedding" --log-level INFO - deps: - - path: data/hf_datasets/gsma_telecom_qa_questions_embedding - hash: md5 - md5: ff43761c7a80f843eca832b0343f29c5.dir - size: 828682250 - nfiles: 4 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: ab0eefec416b225e71137f63fd3dcead - size: 72979 - - path: gsma_dataset_creation/hf_dataset_creator.py - hash: md5 - md5: d265b35cddf227be90d7e49e00ca1d70 - size: 36320 diff --git a/pipelines/datasets/dvc.yaml b/pipelines/datasets/dvc.yaml deleted file mode 100644 index 6892b3b..0000000 --- a/pipelines/datasets/dvc.yaml +++ /dev/null @@ -1,77 +0,0 @@ -stages: - # Dataset creation from question generation results - create_datasets_from_questions_include_unclassified: - wdir: ../.. 
- cmd: >- - uv run gsma create-datasets-from-questions - -i data/questions_gpt-oss-120b_late_500 - -i data/questions_gpt-oss-120b_late_1000 - -i data/questions_gpt-oss-120b_late_2000 - -i data/questions_gpt-oss-120b_late_3000 - -i data/questions_gpt-oss-120b_late_4000 - -w data/working_groups_mapping.json - -e data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - -q data/qa_dataset_from_questions_all_chunks_no_unclassified.json - --log-level INFO - --exclude-unclassified - deps: - - data/questions_gpt-oss-120b_late_500 - - data/questions_gpt-oss-120b_late_1000 - - data/questions_gpt-oss-120b_late_2000 - - data/questions_gpt-oss-120b_late_3000 - - data/questions_gpt-oss-120b_late_4000 - - data/working_groups_mapping.json - - gsma_dataset_creation/hf_dataset_creator.py - - gsma_dataset_creation/cli.py - outs: - - data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - - data/qa_dataset_from_questions_all_chunks_no_unclassified.json - desc: "Create unified HuggingFace datasets from all question generation results with working group classification (including unclassified)" - frozen: true - - create_hf_datasets_from_questions: - wdir: ../.. - cmd: >- - uv run gsma create-hf-datasets - --embedding-json data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - --qa-json data/qa_dataset_from_questions_all_chunks_no_unclassified.json - --output-dir data/hf_datasets - --dataset-name gsma_telecom_qa_questions - --log-level INFO - deps: - - data/embedding_dataset_from_questions_all_chunks_no_unclassified.json - - data/qa_dataset_from_questions_all_chunks_no_unclassified.json - - gsma_dataset_creation/hf_dataset_creator.py - - gsma_dataset_creation/cli.py - outs: - - data/hf_datasets/gsma_telecom_qa_questions_embedding - - data/hf_datasets/gsma_telecom_qa_questions_qa - desc: "Convert question-based JSON datasets to separate HuggingFace Dataset formats" - - upload_hf_embedding_dataset_to_hub: - wdir: ../.. 
- cmd: >- - uv run gsma upload-hf-dataset - --dataset-path data/hf_datasets/gsma_telecom_qa_questions_embedding - --repo-name "mantisnlp/telecom-embedding" - --log-level INFO - deps: - - data/hf_datasets/gsma_telecom_qa_questions_embedding - - gsma_dataset_creation/hf_dataset_creator.py - - gsma_dataset_creation/cli.py - desc: "Upload embedding HuggingFace dataset to the Hub for public access" - frozen: true - - upload_hf_qa_dataset_to_hub: - wdir: ../.. - cmd: >- - uv run gsma upload-hf-dataset - --dataset-path data/hf_datasets/gsma_telecom_qa_questions_qa - --repo-name "mantisnlp/telecom-questions" - --log-level INFO - deps: - - data/hf_datasets/gsma_telecom_qa_questions_qa - - gsma_dataset_creation/hf_dataset_creator.py - - gsma_dataset_creation/cli.py - desc: "Upload Q&A HuggingFace dataset to the Hub for public access" - frozen: true diff --git a/pipelines/discover/dvc.lock b/pipelines/discover/dvc.lock index b4ee838..4d81b2a 100644 --- a/pipelines/discover/dvc.lock +++ b/pipelines/discover/dvc.lock @@ -90,8 +90,8 @@ stages: nfiles: 32 - path: gsma_dataset_creation/deduplicator.py hash: md5 - md5: f15def5709f3700c91776f3668f84a4b - size: 9252 + md5: 06230ba08c01bb7d94ef38015eee163a + size: 10498 - path: gsma_dataset_creation/deduplicator_cli.py hash: md5 md5: 11b2ab3f0e5e23a1e92d2bb0570f147b @@ -118,8 +118,8 @@ stages: nfiles: 395 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/converter.py hash: md5 md5: 480a5175b1837b844ec9b0fa0984b742 @@ -160,8 +160,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: e6b40090ef9e0769028d46ab4914a927 - size: 73984 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/chunked_late_500 hash: md5 @@ -190,8 +190,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: e6b40090ef9e0769028d46ab4914a927 - size: 73984 + 
md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/chunked_late_1000 hash: md5 @@ -220,8 +220,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: e6b40090ef9e0769028d46ab4914a927 - size: 73984 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/chunked_late_2000 hash: md5 @@ -250,8 +250,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: e6b40090ef9e0769028d46ab4914a927 - size: 73984 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/chunked_late_3000 hash: md5 @@ -280,8 +280,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: e6b40090ef9e0769028d46ab4914a927 - size: 73984 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/chunked_late_4000 hash: md5 @@ -299,13 +299,13 @@ stages: deps: - path: data/discover/chunked_late_500 hash: md5 - md5: 175353b73ae76f22e5bd4fdae316d3dd.dir - size: 135191636 - nfiles: 324 + md5: b22c25feaf52276350da8e06e16e6020.dir + size: 156678601 + nfiles: 387 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -327,13 +327,13 @@ stages: deps: - path: data/discover/chunked_late_1000 hash: md5 - md5: fccf22169412f8b10991655c31a62544.dir - size: 78883980 - nfiles: 324 + md5: 0c2b91c08676872ec94e3cb5037b8468.dir + size: 90956959 + nfiles: 387 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -355,13 +355,13 @@ stages: deps: - path: data/discover/chunked_late_2000 hash: md5 - md5: 46044cbe9cfca877f817a33c91ee75bf.dir - size: 51339755 - 
nfiles: 324 + md5: 5a421656b098b38936f19ab6a4c8579c.dir + size: 58875482 + nfiles: 387 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -388,8 +388,8 @@ stages: nfiles: 31 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/converter.py hash: md5 md5: 480a5175b1837b844ec9b0fa0984b742 @@ -430,8 +430,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/prd/chunked_late_500 hash: md5 @@ -460,8 +460,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/prd/chunked_late_1000 hash: md5 @@ -490,8 +490,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/prd/chunked_late_2000 hash: md5 @@ -520,8 +520,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/prd/chunked_late_3000 hash: md5 @@ -550,8 +550,8 @@ stages: size: 18366 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 outs: - path: data/discover/prd/chunked_late_4000 hash: md5 @@ -574,8 +574,8 @@ stages: nfiles: 30 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 
1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -602,8 +602,8 @@ stages: nfiles: 30 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -630,8 +630,8 @@ stages: nfiles: 30 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -658,8 +658,8 @@ stages: nfiles: 30 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -686,8 +686,8 @@ stages: nfiles: 30 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -939,8 +939,8 @@ stages: size: 179197851 - path: gsma_dataset_creation/filters_cli.py hash: md5 - md5: 064c49522df7a0fd6f5171860cdd3fcc - size: 28927 + md5: 8c2156d8461b8d364c72f13a12a83177 + size: 28949 - path: models/filters/question-filter-run-5000-2025-10-08_22-47-46/model hash: md5 md5: 381770c24cd08b42666d94a8cb5d7aee.dir @@ -967,8 +967,8 @@ stages: size: 136719041 - path: gsma_dataset_creation/filters_cli.py hash: md5 - md5: 064c49522df7a0fd6f5171860cdd3fcc - size: 28927 + md5: 8c2156d8461b8d364c72f13a12a83177 + size: 28949 - path: models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model hash: md5 
md5: 9fc986e758f8b52124ef2b85da3f6341.dir @@ -999,8 +999,8 @@ stages: size: 180405441 - path: gsma_dataset_creation/filters_cli.py hash: md5 - md5: 064c49522df7a0fd6f5171860cdd3fcc - size: 28927 + md5: 8c2156d8461b8d364c72f13a12a83177 + size: 28949 outs: - path: data/discover/validation/questions_filtered.parquet hash: md5 @@ -1059,8 +1059,8 @@ stages: nfiles: 387 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f @@ -1087,8 +1087,8 @@ stages: nfiles: 387 - path: gsma_dataset_creation/cli.py hash: md5 - md5: 1379e093e7227d8a2efed762e3088089 - size: 74660 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 - path: gsma_dataset_creation/question_generator.py hash: md5 md5: 8156ae8a7fb54e0762363d17fba92a7f diff --git a/pipelines/discover/dvc.yaml b/pipelines/discover/dvc.yaml index 428f5e5..b38bf8a 100644 --- a/pipelines/discover/dvc.yaml +++ b/pipelines/discover/dvc.yaml @@ -19,7 +19,7 @@ stages: metrics: - metrics/scrape_prd.json: cache: true # Cache metrics for tracking - frozen: true + frozen: false scrape_discover: wdir: ../.. @@ -37,7 +37,7 @@ stages: metrics: - ${metrics_prefix}/scrape_discover.json: cache: true # Cache metrics for tracking - frozen: true + frozen: false dedup_prd: wdir: ../.. @@ -57,7 +57,7 @@ stages: metrics: - metrics/dedup_prd.json: cache: true - frozen: true + frozen: false dedup_discover: wdir: ../.. @@ -76,7 +76,7 @@ stages: metrics: - ${metrics_prefix}/dedup_discover.json: cache: true - frozen: true + frozen: false process_prd_documents: wdir: ../.. @@ -92,7 +92,7 @@ stages: metrics: - ${metrics_prefix}/prd_document_processing_metrics.json desc: "Convert deduplicated PRD DOCX to Markdown (English only)" - frozen: true + frozen: false process_discover_documents: wdir: ../.. 
@@ -108,7 +108,7 @@ stages: metrics: - ${metrics_prefix}/document_processing_metrics.json desc: "Convert deduplicated discover PDFs/DOCX to Markdown (English only)" - frozen: true + frozen: false create_prd_late_chunks: foreach: [500, 1000, 2000, 3000, 4000] @@ -130,7 +130,7 @@ stages: metrics: - ${metrics_prefix}/prd/chunk_metrics_chunked_late_${item}.json desc: "Create PRD late chunks from processed Markdown files (${item} tokens)" - frozen: true + frozen: false create_discover_late_chunks: foreach: [500, 1000, 2000, 3000, 4000] @@ -152,7 +152,7 @@ stages: metrics: - ${metrics_prefix}/chunk_metrics_chunked_late_${item}.json desc: "Create discover late chunks from processed Markdown files (${item} tokens)" - frozen: true + frozen: false generate_prd_questions: foreach: @@ -188,7 +188,7 @@ stages: metrics: - ${metrics_prefix}/prd/generate_questions_gpt-oss-120b_late_${key}.json desc: "Generate questions from PRD late_${key} chunks using gpt-oss-120b" - frozen: true + frozen: false generate_discover_questions: foreach: @@ -224,7 +224,7 @@ stages: metrics: - ${metrics_prefix}/generate_questions_gpt-oss-120b_late_${key}.json desc: "Generate questions from discover late_${key} chunks using gpt-oss-120b" - frozen: true + frozen: false discover_data_combiner: wdir: ../.. @@ -282,7 +282,7 @@ stages: metrics: - ${metrics_prefix}/data_combiner.json desc: "Combine PRD and discover chunker and QA data across chunk sizes" - frozen: true + frozen: false discover_similarity_hasher: wdir: ../.. @@ -300,7 +300,7 @@ stages: metrics: - ${metrics_prefix}/similarity_hasher.json desc: "Add SHA-256 hashes to discover combined chunk data" - frozen: true + frozen: false discover_similarity_ranker: wdir: ../.. @@ -323,7 +323,7 @@ stages: metrics: - ${metrics_prefix}/similarity_calculator.json desc: "Compute FAISS-based top-K similarity for discover chunks" - frozen: true + frozen: false discover_overlap_detector: wdir: ../.. 
@@ -342,7 +342,7 @@ stages: metrics: - ${metrics_prefix}/overlap_detector.json desc: "Detect character offset-based overlap for discover chunks" - frozen: true + frozen: false discover_explode_questions: wdir: ../.. @@ -362,7 +362,7 @@ stages: metrics: - ${metrics_prefix}/explode_questions_metrics.json desc: "Extract discover questions into question-centric format" - frozen: true + frozen: false discover_apply_question_filter: wdir: ../.. @@ -381,7 +381,7 @@ stages: metrics: - ${metrics_prefix}/question_filter_metrics.json desc: "Apply external reference filter to discover questions" - frozen: true + frozen: false discover_apply_chunk_filter: wdir: ../.. @@ -402,7 +402,7 @@ stages: metrics: - ${metrics_prefix}/chunk_filter_metrics.json desc: "Apply procedures filter and keyword exclusion to discover chunks" - frozen: true + frozen: false discover_filter_questions_by_chunk_quality: wdir: ../.. @@ -423,7 +423,7 @@ stages: metrics: - ${metrics_prefix}/question_chunk_filtering_metrics.json desc: "Filter discover questions by combined quality" - frozen: true + frozen: false discover_validate_requests: wdir: ../.. @@ -450,7 +450,7 @@ stages: metrics: - ${metrics_prefix}/validation_results.json desc: "Validate discover requests using LLM (individual validation)" - frozen: true + frozen: false discover_create_validation_dataset: wdir: ../.. @@ -479,7 +479,7 @@ stages: metrics: - ${metrics_prefix}/dataset_creation_from_validation.json desc: "Create HuggingFace dataset from discover validation results" - frozen: true + frozen: false upload_embedding_dataset: wdir: ../.. @@ -490,7 +490,7 @@ stages: deps: - ${data_prefix}/validation/validation_dataset_embedding desc: "Upload discover HuggingFace dataset to Hub" - frozen: true + frozen: false upload_qa_dataset: wdir: ../.. 
@@ -501,4 +501,4 @@ stages: deps: - ${data_prefix}/validation/validation_dataset_qa desc: "Upload discover HuggingFace dataset to Hub" - frozen: true + frozen: false diff --git a/pipelines/filters/dvc.lock b/pipelines/filters/dvc.lock deleted file mode 100644 index 74b706f..0000000 --- a/pipelines/filters/dvc.lock +++ /dev/null @@ -1,105 +0,0 @@ -schema: '2.0' -stages: - apply_question_filter: - cmd: uv run gsma filters apply-question-filter --input data/validation/questions_with_candidates.parquet - --output data/filters/questions_with_filter.parquet --model-path models/filters/question-filter-run-5000-2025-10-08_22-47-46/model - --metrics-output metrics/filters/question_filter_metrics.json - deps: - - path: data/validation/questions_with_candidates.parquet - hash: md5 - md5: fcd85bda6273b751d78d9792c253e202 - size: 216927340 - - path: gsma_dataset_creation/filters_cli.py - hash: md5 - md5: ff23de61429c81cb3d03a7a733e9e130 - size: 21368 - - path: models/filters/question-filter-run-5000-2025-10-08_22-47-46/model - hash: md5 - md5: 381770c24cd08b42666d94a8cb5d7aee.dir - size: 364872364 - nfiles: 37 - outs: - - path: data/filters/questions_with_filter.parquet - hash: md5 - md5: 558d7f6f9f21cbcc326d332203a1f4e4 - size: 218651749 - - path: metrics/filters/question_filter_metrics.json - hash: md5 - md5: 387f0c9acc6758dde1a873d1c249c430 - size: 363 - apply_chunk_filter: - cmd: uv run gsma filters apply-chunk-filter --input data/enriched_chunks.parquet - --output data/filters/enriched_chunks_with_filter.parquet --model-path models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model - --truncate-length 200 --metrics-output metrics/filters/chunk_filter_metrics.json - deps: - - path: data/enriched_chunks.parquet - hash: md5 - md5: f38b00fb8d08e7ae72d7343af230b0eb - size: 112412055 - - path: gsma_dataset_creation/filters_cli.py - hash: md5 - md5: ff23de61429c81cb3d03a7a733e9e130 - size: 21368 - - path: models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model - hash: 
md5 - md5: 9fc986e758f8b52124ef2b85da3f6341.dir - size: 365271032 - nfiles: 37 - outs: - - path: data/filters/enriched_chunks_with_filter.parquet - hash: md5 - md5: c4128900cf3c342fc8c3c35547e541d6 - size: 112561994 - - path: metrics/filters/chunk_filter_metrics.json - hash: md5 - md5: 90c155bcc2af3d01815955980d9211f9 - size: 356 - filter_chunks: - cmd: uv run gsma filters filter-chunks --input data/filters/enriched_chunks_with_filter.parquet - --output data/enriched_chunks_filtered.parquet --min-probability 0.5 --metrics-output - metrics/filters/chunk_filtering_metrics.json - deps: - - path: data/filters/enriched_chunks_with_filter.parquet - hash: md5 - md5: c4128900cf3c342fc8c3c35547e541d6 - size: 112561994 - - path: gsma_dataset_creation/filters_cli.py - hash: md5 - md5: c441c87c4804ee2d0eaace60f38b204c - size: 21321 - outs: - - path: data/enriched_chunks_filtered.parquet - hash: md5 - md5: bab3694f3ac230e65299f49bbd61c1ae - size: 102016121 - - path: metrics/filters/chunk_filtering_metrics.json - hash: md5 - md5: 90d16fa60e910d2d79816782ad826b0c - size: 175 - filter_questions_by_chunk_quality: - cmd: uv run gsma filters filter-questions-by-chunk-quality --input-questions data/filters/questions_with_filter.parquet - --input-chunks data/filters/enriched_chunks_with_filter.parquet --output data/validation/questions_filtered.parquet - --min-question-probability 0.5 --min-chunk-probability 0.5 --metrics-output - metrics/filters/question_chunk_filtering_metrics.json - deps: - - path: data/filters/enriched_chunks_with_filter.parquet - hash: md5 - md5: c4128900cf3c342fc8c3c35547e541d6 - size: 112561994 - - path: data/filters/questions_with_filter.parquet - hash: md5 - md5: 558d7f6f9f21cbcc326d332203a1f4e4 - size: 218651749 - - path: gsma_dataset_creation/filters_cli.py - hash: md5 - md5: c441c87c4804ee2d0eaace60f38b204c - size: 21321 - outs: - - path: data/validation/questions_filtered.parquet - hash: md5 - md5: 820df07535a94ed6a2efdb3c6e46c249 - size: 150739737 - - 
path: metrics/filters/question_chunk_filtering_metrics.json - hash: md5 - md5: de250d9b5cad236674b90d40db6f7366 - size: 382 diff --git a/pipelines/filters/dvc.yaml b/pipelines/filters/dvc.yaml deleted file mode 100644 index 9f89a28..0000000 --- a/pipelines/filters/dvc.yaml +++ /dev/null @@ -1,75 +0,0 @@ -#stages: -# apply_question_filter: -# wdir: ../.. -# cmd: >- -# uv run gsma filters apply-question-filter -# --input data/validation/questions_with_candidates.parquet -# --output data/filters/questions_with_filter.parquet -# --model-path models/filters/question-filter-run-5000-2025-10-08_22-47-46/model -# --metrics-output metrics/filters/question_filter_metrics.json -# deps: -# - data/validation/questions_with_candidates.parquet -# - models/filters/question-filter-run-5000-2025-10-08_22-47-46/model -# - gsma_dataset_creation/filters_cli.py -# outs: -# - data/filters/questions_with_filter.parquet -# metrics: -# - metrics/filters/question_filter_metrics.json -# desc: "Apply external reference filter to all questions and add low_quality_probability column" -# frozen: true -# -# apply_chunk_filter: -# wdir: ../.. -# cmd: >- -# uv run gsma filters apply-chunk-filter -# --input data/enriched_chunks.parquet -# --output data/filters/enriched_chunks_with_filter.parquet -# --model-path models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model -# --truncate-length 200 -# --metrics-output metrics/filters/chunk_filter_metrics.json -# deps: -# - data/enriched_chunks.parquet -# - models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model -# - gsma_dataset_creation/filters_cli.py -# outs: -# - data/filters/enriched_chunks_with_filter.parquet -# metrics: -# - metrics/filters/chunk_filter_metrics.json -# desc: "Apply procedures filter to all chunks and add low_quality_probability column" -# -# filter_chunks: -# wdir: ../.. 
-# cmd: >- -# uv run gsma filters filter-chunks -# --input data/filters/enriched_chunks_with_filter.parquet -# --output data/enriched_chunks_filtered.parquet -# --min-probability 0.5 -# --metrics-output metrics/filters/chunk_filtering_metrics.json -# deps: -# - data/filters/enriched_chunks_with_filter.parquet -# - gsma_dataset_creation/filters_cli.py -# outs: -# - data/enriched_chunks_filtered.parquet -# metrics: -# - metrics/filters/chunk_filtering_metrics.json -# desc: "Filter chunks to remove low-quality content based on probability threshold" -# -# filter_questions_by_chunk_quality: -# wdir: ../.. -# cmd: >- -# uv run gsma filters filter-questions-by-chunk-quality -# --input-questions data/filters/questions_with_filter.parquet -# --input-chunks data/filters/enriched_chunks_with_filter.parquet -# --output data/validation/questions_filtered.parquet -# --min-question-probability 0.5 -# --min-chunk-probability 0.5 -# --metrics-output metrics/filters/question_chunk_filtering_metrics.json -# deps: -# - data/filters/questions_with_filter.parquet -# - data/filters/enriched_chunks_with_filter.parquet -# - gsma_dataset_creation/filters_cli.py -# outs: -# - data/validation/questions_filtered.parquet -# metrics: -# - metrics/filters/question_chunk_filtering_metrics.json -# desc: "Filter questions by removing low-quality content based on both question and chunk quality" diff --git a/pipelines/prd/dvc.lock b/pipelines/prd/dvc.lock new file mode 100644 index 0000000..ed3dfdf --- /dev/null +++ b/pipelines/prd/dvc.lock @@ -0,0 +1,498 @@ +schema: '2.0' +stages: + process_documents: + cmd: uv run gsma process data/raw data/processed --metrics-output metrics/prd/document_processing_metrics.json + --allowed-languages en --log-level INFO + deps: + - path: data/raw + hash: md5 + md5: 8ec3c1c0ae7420e286a24114ee21d188.dir + size: 262572056 + nfiles: 270 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + - path: 
gsma_dataset_creation/converter.py + hash: md5 + md5: 480a5175b1837b844ec9b0fa0984b742 + size: 3577 + - path: gsma_dataset_creation/language_utils.py + hash: md5 + md5: ca1400c6987f72c9a6b456a26de22265 + size: 3187 + - path: gsma_dataset_creation/processor.py + hash: md5 + md5: 2ac2a27d46453a310cf51b1d275bb5f8 + size: 10666 + outs: + - path: data/processed + hash: md5 + md5: 72c80941549afe1a3d8792d26ed6fe1e.dir + size: 16688411 + nfiles: 248 + - path: metrics/prd/document_processing_metrics.json + hash: md5 + md5: 0d6ddfe77d33458a0e9bfbd2690e637e + size: 424 + create_late_chunks@0: + cmd: "uv run gsma chunk data/processed data/prd/chunked_late_500 --chunker late\ + \ --chunker-config '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\"\ + , \"chunk_size\": 500, \"min_characters_per_chunk\": 24}' --filter-min-tokens\ + \ 0 --metrics-output metrics/prd/chunk_metrics_chunked_late_500.json --log-level\ + \ INFO" + deps: + - path: data/processed + hash: md5 + md5: 72c80941549afe1a3d8792d26ed6fe1e.dir + size: 16688411 + nfiles: 248 + - path: gsma_dataset_creation/chunker.py + hash: md5 + md5: 291e7ce2d3e55de6f6a67ee2699e9136 + size: 18366 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + outs: + - path: data/prd/chunked_late_500 + hash: md5 + md5: cdf4ce62598c452b97deaff3762174bd.dir + size: 112905314 + nfiles: 248 + - path: metrics/prd/chunk_metrics_chunked_late_500.json + hash: md5 + md5: 8b1f09589021cf6885959d230268c7f3 + size: 768 + create_late_chunks@1: + cmd: "uv run gsma chunk data/processed data/prd/chunked_late_1000 --chunker late\ + \ --chunker-config '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\"\ + , \"chunk_size\": 1000, \"min_characters_per_chunk\": 24}' --filter-min-tokens\ + \ 0 --metrics-output metrics/prd/chunk_metrics_chunked_late_1000.json --log-level\ + \ INFO" + deps: + - path: data/processed + hash: md5 + md5: 72c80941549afe1a3d8792d26ed6fe1e.dir + size: 
16688411 + nfiles: 248 + - path: gsma_dataset_creation/chunker.py + hash: md5 + md5: 291e7ce2d3e55de6f6a67ee2699e9136 + size: 18366 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + outs: + - path: data/prd/chunked_late_1000 + hash: md5 + md5: 7cf69d9840ef5521665ef13dc0753c5f.dir + size: 64371559 + nfiles: 248 + - path: metrics/prd/chunk_metrics_chunked_late_1000.json + hash: md5 + md5: 14915e93d2df0ace61543a93faffd0d9 + size: 770 + create_late_chunks@2: + cmd: "uv run gsma chunk data/processed data/prd/chunked_late_2000 --chunker late\ + \ --chunker-config '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\"\ + , \"chunk_size\": 2000, \"min_characters_per_chunk\": 24}' --filter-min-tokens\ + \ 0 --metrics-output metrics/prd/chunk_metrics_chunked_late_2000.json --log-level\ + \ INFO" + deps: + - path: data/processed + hash: md5 + md5: 72c80941549afe1a3d8792d26ed6fe1e.dir + size: 16688411 + nfiles: 248 + - path: gsma_dataset_creation/chunker.py + hash: md5 + md5: 291e7ce2d3e55de6f6a67ee2699e9136 + size: 18366 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + outs: + - path: data/prd/chunked_late_2000 + hash: md5 + md5: ad55506022dcd9f89bf999e65e841e2e.dir + size: 41293426 + nfiles: 248 + - path: metrics/prd/chunk_metrics_chunked_late_2000.json + hash: md5 + md5: 1366ce3d1773deaed3a5e33709fec16e + size: 757 + create_late_chunks@3: + cmd: "uv run gsma chunk data/processed data/prd/chunked_late_3000 --chunker late\ + \ --chunker-config '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\"\ + , \"chunk_size\": 3000, \"min_characters_per_chunk\": 24}' --filter-min-tokens\ + \ 0 --metrics-output metrics/prd/chunk_metrics_chunked_late_3000.json --log-level\ + \ INFO" + deps: + - path: data/processed + hash: md5 + md5: 72c80941549afe1a3d8792d26ed6fe1e.dir + size: 16688411 + nfiles: 248 + - path: gsma_dataset_creation/chunker.py + hash: 
md5 + md5: 291e7ce2d3e55de6f6a67ee2699e9136 + size: 18366 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + outs: + - path: data/prd/chunked_late_3000 + hash: md5 + md5: ab6fc68856dfe51186a2cc9bfd61d29b.dir + size: 33715035 + nfiles: 248 + - path: metrics/prd/chunk_metrics_chunked_late_3000.json + hash: md5 + md5: 9c88aba4104403423bb90bdda02952d9 + size: 767 + create_late_chunks@4: + cmd: "uv run gsma chunk data/processed data/prd/chunked_late_4000 --chunker late\ + \ --chunker-config '{\"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\"\ + , \"chunk_size\": 4000, \"min_characters_per_chunk\": 24}' --filter-min-tokens\ + \ 0 --metrics-output metrics/prd/chunk_metrics_chunked_late_4000.json --log-level\ + \ INFO" + deps: + - path: data/processed + hash: md5 + md5: 72c80941549afe1a3d8792d26ed6fe1e.dir + size: 16688411 + nfiles: 248 + - path: gsma_dataset_creation/chunker.py + hash: md5 + md5: 291e7ce2d3e55de6f6a67ee2699e9136 + size: 18366 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + outs: + - path: data/prd/chunked_late_4000 + hash: md5 + md5: 6c696d0ef870d7ba2aec72155bbfed66.dir + size: 29820279 + nfiles: 248 + - path: metrics/prd/chunk_metrics_chunked_late_4000.json + hash: md5 + md5: 1e20cebad49f1722b72bff1226f4020b + size: 769 + generate_questions@500: + cmd: uv run gsma generate-questions data/prd/chunked_late_500 data/prd/questions_gpt-oss-120b_late_500 + --num-questions 5 --model "openai/gpt-oss-120b" --max-concurrent 20 --credit-check-interval + 1000 --log-level INFO --provider Cerebras --metrics-file metrics/prd/generate_questions_gpt-oss-120b_late_500.json + deps: + - path: data/prd/chunked_late_500 + hash: md5 + md5: cdf4ce62598c452b97deaff3762174bd.dir + size: 112905314 + nfiles: 248 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + - path: 
gsma_dataset_creation/question_generator.py + hash: md5 + md5: 8156ae8a7fb54e0762363d17fba92a7f + size: 37313 + outs: + - path: data/prd/questions_gpt-oss-120b_late_500 + hash: md5 + md5: 73a0a7facc7360db9b11a57e542c5fca.dir + size: 52133974 + nfiles: 8230 + - path: metrics/prd/generate_questions_gpt-oss-120b_late_500.json + hash: md5 + md5: 8ff4996eb2b587fd87b0b465ec619281 + size: 1514 + generate_questions@1000: + cmd: uv run gsma generate-questions data/prd/chunked_late_1000 data/prd/questions_gpt-oss-120b_late_1000 + --num-questions 10 --model "openai/gpt-oss-120b" --max-concurrent 20 --credit-check-interval + 1000 --log-level INFO --provider Cerebras --metrics-file metrics/prd/generate_questions_gpt-oss-120b_late_1000.json + deps: + - path: data/prd/chunked_late_1000 + hash: md5 + md5: 7cf69d9840ef5521665ef13dc0753c5f.dir + size: 64371559 + nfiles: 248 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + - path: gsma_dataset_creation/question_generator.py + hash: md5 + md5: 8156ae8a7fb54e0762363d17fba92a7f + size: 37313 + outs: + - path: data/prd/questions_gpt-oss-120b_late_1000 + hash: md5 + md5: e5aa8dfa448ab8afb2a546e9e0ed0ef2.dir + size: 46638867 + nfiles: 4001 + - path: metrics/prd/generate_questions_gpt-oss-120b_late_1000.json + hash: md5 + md5: 6b223524d14a0b7de6fb252a18640ee0 + size: 1499 + generate_questions@2000: + cmd: uv run gsma generate-questions data/prd/chunked_late_2000 data/prd/questions_gpt-oss-120b_late_2000 + --num-questions 20 --model "openai/gpt-oss-120b" --max-concurrent 20 --credit-check-interval + 1000 --log-level INFO --provider Cerebras --metrics-file metrics/prd/generate_questions_gpt-oss-120b_late_2000.json + deps: + - path: data/prd/chunked_late_2000 + hash: md5 + md5: ad55506022dcd9f89bf999e65e841e2e.dir + size: 41293426 + nfiles: 248 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + - path: 
gsma_dataset_creation/question_generator.py + hash: md5 + md5: 8156ae8a7fb54e0762363d17fba92a7f + size: 37313 + outs: + - path: data/prd/questions_gpt-oss-120b_late_2000 + hash: md5 + md5: 589b801c932365ffcf109218f59d3bec.dir + size: 38981806 + nfiles: 1830 + - path: metrics/prd/generate_questions_gpt-oss-120b_late_2000.json + hash: md5 + md5: 71cedda9ae476bc3d8456323083c6ffd + size: 1513 + generate_questions@3000: + cmd: uv run gsma generate-questions data/prd/chunked_late_3000 data/prd/questions_gpt-oss-120b_late_3000 + --num-questions 30 --model "openai/gpt-oss-120b" --max-concurrent 20 --credit-check-interval + 1000 --log-level INFO --provider Cerebras --metrics-file metrics/prd/generate_questions_gpt-oss-120b_late_3000.json + deps: + - path: data/prd/chunked_late_3000 + hash: md5 + md5: ab6fc68856dfe51186a2cc9bfd61d29b.dir + size: 33715035 + nfiles: 248 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + - path: gsma_dataset_creation/question_generator.py + hash: md5 + md5: 8156ae8a7fb54e0762363d17fba92a7f + size: 37313 + outs: + - path: data/prd/questions_gpt-oss-120b_late_3000 + hash: md5 + md5: 7fca8c90a24f32b89d4526eae1a99f7d.dir + size: 32217471 + nfiles: 1066 + - path: metrics/prd/generate_questions_gpt-oss-120b_late_3000.json + hash: md5 + md5: 07f417dabcf481c3cae998b571c68e5b + size: 1526 + generate_questions@4000: + cmd: uv run gsma generate-questions data/prd/chunked_late_4000 data/prd/questions_gpt-oss-120b_late_4000 + --num-questions 40 --model "openai/gpt-oss-120b" --max-concurrent 20 --credit-check-interval + 1000 --log-level INFO --provider Cerebras --metrics-file metrics/prd/generate_questions_gpt-oss-120b_late_4000.json + deps: + - path: data/prd/chunked_late_4000 + hash: md5 + md5: 6c696d0ef870d7ba2aec72155bbfed66.dir + size: 29820279 + nfiles: 248 + - path: gsma_dataset_creation/cli.py + hash: md5 + md5: 2751ea55263e999be3819eee9a8f325e + size: 82130 + - path: 
gsma_dataset_creation/question_generator.py + hash: md5 + md5: 8156ae8a7fb54e0762363d17fba92a7f + size: 37313 + outs: + - path: data/prd/questions_gpt-oss-120b_late_4000 + hash: md5 + md5: c1578e42cf46e4eeefcefd25d0de812c.dir + size: 25730277 + nfiles: 658 + - path: metrics/prd/generate_questions_gpt-oss-120b_late_4000.json + hash: md5 + md5: d76ede81c66d391e5521fd811598ebf1 + size: 1509 + data_combiner: + cmd: uv run gsma similarity data-combiner --chunker-dirs data/prd/chunked_late_500 + --chunker-dirs data/prd/chunked_late_1000 --chunker-dirs data/prd/chunked_late_2000 + --chunker-dirs data/prd/chunked_late_3000 --chunker-dirs data/prd/chunked_late_4000 + --qa-dirs data/prd/questions_gpt-oss-120b_late_500 --qa-dirs data/prd/questions_gpt-oss-120b_late_1000 + --qa-dirs data/prd/questions_gpt-oss-120b_late_2000 --qa-dirs data/prd/questions_gpt-oss-120b_late_3000 + --qa-dirs data/prd/questions_gpt-oss-120b_late_4000 --working-groups data/working_groups_mapping.json + --output data/prd/combined_chunks.parquet --metrics-output metrics/prd/data_combiner.json + --logger-level INFO + deps: + - path: data/prd/chunked_late_1000 + hash: md5 + md5: 7cf69d9840ef5521665ef13dc0753c5f.dir + size: 64371559 + nfiles: 248 + - path: data/prd/chunked_late_2000 + hash: md5 + md5: ad55506022dcd9f89bf999e65e841e2e.dir + size: 41293426 + nfiles: 248 + - path: data/prd/chunked_late_3000 + hash: md5 + md5: ab6fc68856dfe51186a2cc9bfd61d29b.dir + size: 33715035 + nfiles: 248 + - path: data/prd/chunked_late_4000 + hash: md5 + md5: 6c696d0ef870d7ba2aec72155bbfed66.dir + size: 29820279 + nfiles: 248 + - path: data/prd/chunked_late_500 + hash: md5 + md5: cdf4ce62598c452b97deaff3762174bd.dir + size: 112905314 + nfiles: 248 + - path: data/prd/questions_gpt-oss-120b_late_1000 + hash: md5 + md5: e5aa8dfa448ab8afb2a546e9e0ed0ef2.dir + size: 46638867 + nfiles: 4001 + - path: data/prd/questions_gpt-oss-120b_late_2000 + hash: md5 + md5: 589b801c932365ffcf109218f59d3bec.dir + size: 38981806 + nfiles: 
1830 + - path: data/prd/questions_gpt-oss-120b_late_3000 + hash: md5 + md5: 7fca8c90a24f32b89d4526eae1a99f7d.dir + size: 32217471 + nfiles: 1066 + - path: data/prd/questions_gpt-oss-120b_late_4000 + hash: md5 + md5: c1578e42cf46e4eeefcefd25d0de812c.dir + size: 25730277 + nfiles: 658 + - path: data/prd/questions_gpt-oss-120b_late_500 + hash: md5 + md5: 73a0a7facc7360db9b11a57e542c5fca.dir + size: 52133974 + nfiles: 8230 + - path: data/working_groups_mapping.json + hash: md5 + md5: b70592bdeac5c03634a60d09d0a8fbc7 + size: 9901 + - path: gsma_dataset_creation/similarity/data_combiner.py + hash: md5 + md5: 72bb3424212238249cc6bea8e25b6843 + size: 19985 + outs: + - path: data/prd/combined_chunks.parquet + hash: md5 + md5: e7bb746dfd09176236fd5b32cb00fd8a + size: 106961632 + - path: metrics/prd/data_combiner.json + hash: md5 + md5: 64bc79c7c3bbdf350893313a05e32048 + size: 721 + similarity_hasher: + cmd: uv run gsma similarity hasher --input data/prd/combined_chunks.parquet --output + data/prd/hashed_chunks.parquet --metrics-output metrics/prd/similarity_hasher.json + --logger-level INFO + deps: + - path: data/prd/combined_chunks.parquet + hash: md5 + md5: e7bb746dfd09176236fd5b32cb00fd8a + size: 106961632 + - path: gsma_dataset_creation/similarity/hashing.py + hash: md5 + md5: fae2341f4110086046a2a3939187a3d8 + size: 12555 + outs: + - path: data/prd/hashed_chunks.parquet + hash: md5 + md5: 599c6322827da8ecbdf553bafa18b65f + size: 108402760 + - path: metrics/prd/similarity_hasher.json + hash: md5 + md5: b76d7919548bf1b1dde3d7712a4b7d17 + size: 371 + similarity_ranker: + cmd: uv run gsma similarity ranker --input data/prd/hashed_chunks.parquet --output + data/prd/similarity_chunks.parquet --similarity-matrix data/prd/similarity_matrix.npz + --k 20 --threshold 0.3 --faiss-index-type IVFFlat --metrics-output metrics/prd/similarity_calculator.json + --logger-level INFO + deps: + - path: data/prd/hashed_chunks.parquet + hash: md5 + md5: 599c6322827da8ecbdf553bafa18b65f + size: 
108402760 + - path: gsma_dataset_creation/similarity/similarity_calculator.py + hash: md5 + md5: f62609fe22d08df9df63bb0716f44912 + size: 15006 + outs: + - path: data/prd/similarity_chunks.parquet + hash: md5 + md5: 5a591c2856e8c8fb78d6376108f9150f + size: 111960860 + - path: data/prd/similarity_matrix.npz + hash: md5 + md5: affd9c9497e8b2460eccfcb8af364169 + size: 577 + - path: metrics/prd/similarity_calculator.json + hash: md5 + md5: 0ced64da2e7e652d6a9e5eff40a8b745 + size: 540 + overlap_detector: + cmd: uv run gsma similarity overlap-detector --input data/prd/similarity_chunks.parquet + --output data/prd/enriched_chunks.parquet --min-overlap-chars 50 --metrics-output + metrics/prd/overlap_detector.json --logger-level INFO + deps: + - path: data/prd/similarity_chunks.parquet + hash: md5 + md5: 5a591c2856e8c8fb78d6376108f9150f + size: 111960860 + - path: gsma_dataset_creation/similarity/overlap_detector.py + hash: md5 + md5: 80b7b89ec4e09e2857fba06b5806ec7e + size: 8782 + outs: + - path: data/prd/enriched_chunks.parquet + hash: md5 + md5: f38b00fb8d08e7ae72d7343af230b0eb + size: 112412055 + - path: metrics/prd/overlap_detector.json + hash: md5 + md5: 588f2ab3713ae386e40c6e3c0190a70e + size: 506 + explode_questions: + cmd: uv run gsma validation explode-questions --input data/prd/enriched_chunks.parquet + --output data/prd/validation/questions_with_candidates.parquet --min-similarity-score + 0.35 --max-similarity-score 0.95 --metrics-output metrics/prd/explode_questions_metrics.json + --logger-level INFO + deps: + - path: data/prd/enriched_chunks.parquet + hash: md5 + md5: f38b00fb8d08e7ae72d7343af230b0eb + size: 112412055 + - path: gsma_dataset_creation/validation_cli.py + hash: md5 + md5: 073150691d5683a6ea30db54c1d323c7 + size: 34369 + outs: + - path: data/prd/validation/questions_with_candidates.parquet + hash: md5 + md5: 5c17bfdba81cc86d4289e8d8e33831c3 + size: 223536814 + - path: metrics/prd/explode_questions_metrics.json + hash: md5 + md5: 
8a80554c91d9fca8acb82f023de02f11 + size: 3 diff --git a/pipelines/prd/dvc.yaml b/pipelines/prd/dvc.yaml new file mode 100644 index 0000000..0d4fc18 --- /dev/null +++ b/pipelines/prd/dvc.yaml @@ -0,0 +1,330 @@ +vars: + - data_prefix: data/prd + - metrics_prefix: metrics/prd + +stages: + process_documents: + wdir: ../.. + cmd: uv run gsma process data/raw data/processed --metrics-output ${metrics_prefix}/document_processing_metrics.json --allowed-languages en --log-level INFO + deps: + - data/raw + - gsma_dataset_creation/converter.py + - gsma_dataset_creation/processor.py + - gsma_dataset_creation/cli.py + - gsma_dataset_creation/language_utils.py + outs: + - data/processed + metrics: + - ${metrics_prefix}/document_processing_metrics.json + desc: "Convert raw DOCX files to Markdown (English only)" + + create_late_chunks: + foreach: + - size: 500 + name: late_500 + - size: 1000 + name: late_1000 + - size: 2000 + name: late_2000 + - size: 3000 + name: late_3000 + - size: 4000 + name: late_4000 + do: + wdir: ../.. + cmd: >- + uv run gsma chunk data/processed ${data_prefix}/chunked_${item.name} + --chunker late + --chunker-config '{"embedding_model": "sentence-transformers/all-MiniLM-L6-v2", "chunk_size": ${item.size}, "min_characters_per_chunk": 24}' + --filter-min-tokens 0 + --metrics-output ${metrics_prefix}/chunk_metrics_chunked_${item.name}.json + --log-level INFO + deps: + - data/processed + - gsma_dataset_creation/chunker.py + - gsma_dataset_creation/cli.py + outs: + - ${data_prefix}/chunked_${item.name} + metrics: + - ${metrics_prefix}/chunk_metrics_chunked_${item.name}.json + desc: "Create late chunks from processed Markdown files (${item.size} tokens)" + + generate_questions: + foreach: + 500: + questions_per_chunk: 5 + 1000: + questions_per_chunk: 10 + 2000: + questions_per_chunk: 20 + 3000: + questions_per_chunk: 30 + 4000: + questions_per_chunk: 40 + do: + wdir: ../.. 
+ cmd: >- + uv run gsma generate-questions + ${data_prefix}/chunked_late_${key} + ${data_prefix}/questions_gpt-oss-120b_late_${key} + --num-questions ${item.questions_per_chunk} + --model "openai/gpt-oss-120b" + --max-concurrent 20 + --credit-check-interval 1000 + --log-level INFO + --provider Cerebras + --metrics-file ${metrics_prefix}/generate_questions_gpt-oss-120b_late_${key}.json + deps: + - ${data_prefix}/chunked_late_${key} + - gsma_dataset_creation/question_generator.py + - gsma_dataset_creation/cli.py + outs: + - ${data_prefix}/questions_gpt-oss-120b_late_${key} + metrics: + - ${metrics_prefix}/generate_questions_gpt-oss-120b_late_${key}.json + desc: "Generate questions from late_${key} chunks using gpt-oss-120b" + + data_combiner: + wdir: ../.. + cmd: >- + uv run gsma similarity data-combiner + --chunker-dirs ${data_prefix}/chunked_late_500 + --chunker-dirs ${data_prefix}/chunked_late_1000 + --chunker-dirs ${data_prefix}/chunked_late_2000 + --chunker-dirs ${data_prefix}/chunked_late_3000 + --chunker-dirs ${data_prefix}/chunked_late_4000 + --qa-dirs ${data_prefix}/questions_gpt-oss-120b_late_500 + --qa-dirs ${data_prefix}/questions_gpt-oss-120b_late_1000 + --qa-dirs ${data_prefix}/questions_gpt-oss-120b_late_2000 + --qa-dirs ${data_prefix}/questions_gpt-oss-120b_late_3000 + --qa-dirs ${data_prefix}/questions_gpt-oss-120b_late_4000 + --working-groups data/working_groups_mapping.json + --output ${data_prefix}/combined_chunks.parquet + --metrics-output ${metrics_prefix}/data_combiner.json + --logger-level INFO + deps: + - ${data_prefix}/chunked_late_500 + - ${data_prefix}/chunked_late_1000 + - ${data_prefix}/chunked_late_2000 + - ${data_prefix}/chunked_late_3000 + - ${data_prefix}/chunked_late_4000 + - ${data_prefix}/questions_gpt-oss-120b_late_500 + - ${data_prefix}/questions_gpt-oss-120b_late_1000 + - ${data_prefix}/questions_gpt-oss-120b_late_2000 + - ${data_prefix}/questions_gpt-oss-120b_late_3000 + - ${data_prefix}/questions_gpt-oss-120b_late_4000 + - 
data/working_groups_mapping.json + - gsma_dataset_creation/similarity/data_combiner.py + outs: + - ${data_prefix}/combined_chunks.parquet + metrics: + - ${metrics_prefix}/data_combiner.json + desc: "Combine chunker and QA data across chunk sizes with working group classification" + + similarity_hasher: + wdir: ../.. + cmd: >- + uv run gsma similarity hasher + --input ${data_prefix}/combined_chunks.parquet + --output ${data_prefix}/hashed_chunks.parquet + --metrics-output ${metrics_prefix}/similarity_hasher.json + --logger-level INFO + deps: + - ${data_prefix}/combined_chunks.parquet + - gsma_dataset_creation/similarity/hashing.py + outs: + - ${data_prefix}/hashed_chunks.parquet + metrics: + - ${metrics_prefix}/similarity_hasher.json + desc: "Add SHA-256 hashes to combined chunk data" + + similarity_ranker: + wdir: ../.. + cmd: >- + uv run gsma similarity ranker + --input ${data_prefix}/hashed_chunks.parquet + --output ${data_prefix}/similarity_chunks.parquet + --similarity-matrix ${data_prefix}/similarity_matrix.npz + --k 20 + --threshold 0.3 + --faiss-index-type IVFFlat + --metrics-output ${metrics_prefix}/similarity_calculator.json + --logger-level INFO + deps: + - ${data_prefix}/hashed_chunks.parquet + - gsma_dataset_creation/similarity/similarity_calculator.py + outs: + - ${data_prefix}/similarity_chunks.parquet + - ${data_prefix}/similarity_matrix.npz + metrics: + - ${metrics_prefix}/similarity_calculator.json + desc: "Compute FAISS-based top-K similarity relationships for chunks" + + overlap_detector: + wdir: ../.. 
+ cmd: >- + uv run gsma similarity overlap-detector + --input ${data_prefix}/similarity_chunks.parquet + --output ${data_prefix}/enriched_chunks.parquet + --min-overlap-chars 50 + --metrics-output ${metrics_prefix}/overlap_detector.json + --logger-level INFO + deps: + - ${data_prefix}/similarity_chunks.parquet + - gsma_dataset_creation/similarity/overlap_detector.py + outs: + - ${data_prefix}/enriched_chunks.parquet + metrics: + - ${metrics_prefix}/overlap_detector.json + desc: "Detect character offset-based overlap relationships with simplified schema" + + explode_questions: + wdir: ../.. + cmd: >- + uv run gsma validation explode-questions + --input ${data_prefix}/enriched_chunks.parquet + --output ${data_prefix}/validation/questions_with_candidates.parquet + --min-similarity-score 0.35 + --max-similarity-score 0.95 + --metrics-output ${metrics_prefix}/explode_questions_metrics.json + --logger-level INFO + deps: + - ${data_prefix}/enriched_chunks.parquet + - gsma_dataset_creation/validation_cli.py + outs: + - ${data_prefix}/validation/questions_with_candidates.parquet + metrics: + - ${metrics_prefix}/explode_questions_metrics.json + desc: "Extract questions from enriched chunks into question-centric format" + + apply_question_filter: + wdir: ../.. 
+ cmd: >- + uv run gsma filters apply-question-filter + --input ${data_prefix}/validation/questions_with_candidates.parquet + --output ${data_prefix}/filters/questions_with_filter.parquet + --model-path models/filters/question-filter-run-5000-2025-10-08_22-47-46/model + --metrics-output ${metrics_prefix}/question_filter_metrics.json + deps: + - ${data_prefix}/validation/questions_with_candidates.parquet + - models/filters/question-filter-run-5000-2025-10-08_22-47-46/model + - gsma_dataset_creation/filters_cli.py + outs: + - ${data_prefix}/filters/questions_with_filter.parquet + metrics: + - ${metrics_prefix}/question_filter_metrics.json + desc: "Apply external reference filter to all questions and add low_quality_probability column" + + apply_chunk_filter: + wdir: ../.. + cmd: >- + uv run gsma filters apply-chunk-filter + --input ${data_prefix}/enriched_chunks.parquet + --output ${data_prefix}/filters/enriched_chunks_with_filter.parquet + --model-path models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model + --truncate-length 200 + --exclude-matches "prd@gsma.com" + --metrics-output ${metrics_prefix}/chunk_filter_metrics.json + deps: + - ${data_prefix}/enriched_chunks.parquet + - models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model + - gsma_dataset_creation/filters_cli.py + outs: + - ${data_prefix}/filters/enriched_chunks_with_filter.parquet + metrics: + - ${metrics_prefix}/chunk_filter_metrics.json + desc: "Apply procedures filter and keyword exclusion to all chunks" + + filter_questions_by_chunk_quality: + wdir: ../.. 
+ cmd: >- + uv run gsma filters filter-questions-by-chunk-quality + --input-questions ${data_prefix}/filters/questions_with_filter.parquet + --input-chunks ${data_prefix}/filters/enriched_chunks_with_filter.parquet + --output ${data_prefix}/validation/questions_filtered.parquet + --min-question-probability 0.5 + --min-chunk-probability 0.5 + --metrics-output ${metrics_prefix}/question_chunk_filtering_metrics.json + deps: + - ${data_prefix}/filters/questions_with_filter.parquet + - ${data_prefix}/filters/enriched_chunks_with_filter.parquet + - gsma_dataset_creation/filters_cli.py + outs: + - ${data_prefix}/validation/questions_filtered.parquet + metrics: + - ${metrics_prefix}/question_chunk_filtering_metrics.json + desc: "Filter questions by removing low-quality content based on both question and chunk quality" + + validate_requests: + wdir: ../.. + cmd: >- + uv run gsma validation validate-requests + --input ${data_prefix}/validation/questions_filtered.parquet + --enriched-chunks ${data_prefix}/filters/enriched_chunks_with_filter.parquet + --output ${data_prefix}/validation/validation_results.parquet + --checkpoint-dir .dvc/.tmp/validation_checkpoints + --max-concurrent 50 + --limit 50000 + --metrics-output ${metrics_prefix}/validation_results.json + --logger-level INFO + --model qwen/qwen3-235b-a22b-2507 + --provider Cerebras + deps: + - ${data_prefix}/validation/questions_filtered.parquet + - ${data_prefix}/filters/enriched_chunks_with_filter.parquet + - gsma_dataset_creation/validation_cli.py + - gsma_dataset_creation/validation/validator.py + - gsma_dataset_creation/validation/request_tracker.py + outs: + - ${data_prefix}/validation/validation_results.parquet + metrics: + - ${metrics_prefix}/validation_results.json + desc: "Validate requests using LLM with SQLite checkpointing (individual validation)" + + create_validation_dataset: + wdir: ../.. 
+ cmd: >- + uv run gsma datasets create-from-validation + --input ${data_prefix}/validation/validation_results.parquet + --enriched-chunks ${data_prefix}/filters/enriched_chunks_with_filter.parquet + --embedding-output ${data_prefix}/validation/validation_dataset_embedding + --embedding-jsonl-output ${data_prefix}/validation/validation_dataset_embedding.jsonl + --qa-output ${data_prefix}/validation/validation_dataset_qa + --qa-jsonl-output ${data_prefix}/validation/validation_dataset_qa.jsonl + --max-positives 3 + --max-negatives 3 + --metrics-output ${metrics_prefix}/dataset_creation_from_validation.json + --logger-level INFO + deps: + - ${data_prefix}/validation/validation_results.parquet + - ${data_prefix}/filters/enriched_chunks_with_filter.parquet + - gsma_dataset_creation/datasets_cli.py + - gsma_dataset_creation/datasets/validation_dataset_creator.py + outs: + - ${data_prefix}/validation/validation_dataset_embedding + - ${data_prefix}/validation/validation_dataset_embedding.jsonl + - ${data_prefix}/validation/validation_dataset_qa + - ${data_prefix}/validation/validation_dataset_qa.jsonl + metrics: + - ${metrics_prefix}/dataset_creation_from_validation.json + desc: "Create HuggingFace embedding and QA datasets with JSONL from validation results" + + upload_embedding_dataset: + wdir: ../.. + cmd: >- + uv run gsma upload-hf-dataset + --dataset-path ${data_prefix}/validation/validation_dataset_embedding + --repo-name mantisnlp/gsma_prd_synthetic_embedding + deps: + - ${data_prefix}/validation/validation_dataset_embedding + desc: "Upload embedding training dataset to HuggingFace Hub" + + upload_qa_dataset: + wdir: ../.. 
+ cmd: >- + uv run gsma upload-hf-dataset + --dataset-path ${data_prefix}/validation/validation_dataset_qa + --repo-name mantisnlp/gsma_prd_synthetic_qa + deps: + - ${data_prefix}/validation/validation_dataset_qa + desc: "Upload QA dataset to HuggingFace Hub" diff --git a/pipelines/qa_filter/dvc.yaml b/pipelines/qa_filter/dvc.yaml deleted file mode 100644 index 996ed1f..0000000 --- a/pipelines/qa_filter/dvc.yaml +++ /dev/null @@ -1,24 +0,0 @@ -#stages: -# train_filter: -# cmd: uv run gsma train-filter data/qa_combined.hf models/qa_filter.pt --log-level INFO -# deps: -# - data/qa_combined.hf -# - gsma_dataset_creation/qa_filter_trainer.py -# - gsma_dataset_creation/qa_filter_models.py -# - gsma_dataset_creation/cli.py -# outs: -# - models/qa_filter.pt -# desc: "Train consistency-based QA filter using E5 approach with early stopping" -# frozen: true -# -# filter_qa: -# cmd: uv run gsma filter-qa data/qa_combined.hf data/qa_filtered.hf models/qa_filter.pt --log-level INFO -# deps: -# - data/qa_combined.hf -# - models/qa_filter.pt -# - gsma_dataset_creation/qa_filter.py -# - gsma_dataset_creation/cli.py -# outs: -# - data/qa_filtered.hf -# desc: "Filter QA pairs using trained consistency model, keeping top-2 ranked pairs" -# frozen: true diff --git a/pipelines/questions/dvc.lock b/pipelines/questions/dvc.lock deleted file mode 100644 index 23fc418..0000000 --- a/pipelines/questions/dvc.lock +++ /dev/null @@ -1,251 +0,0 @@ -schema: '2.0' -stages: - generate_questions@0: - cmd: uv run gsma generate-questions data/chunked_late_500 data/questions_gpt-oss-120b_late_500 - --num-questions 5 --model "openai/gpt-oss-120b" --limit-docs 1000 --limit-chunks - 1000 --max-concurrent 20 --credit-check-interval 1000 --log-level INFO --provider - Cerebras --metrics-file metrics/generate_questions_gpt-oss-120b_late_500.json - deps: - - path: data/chunked_late_500 - hash: md5 - md5: cdf4ce62598c452b97deaff3762174bd.dir - size: 112905314 - nfiles: 248 - - path: 
gsma_dataset_creation/cli.py - hash: md5 - md5: 98f0613663b314661b12ab8dc17ecb00 - size: 51648 - - path: gsma_dataset_creation/question_generator.py - hash: md5 - md5: 16718f9f50383900bd23e6d8394abcce - size: 34560 - outs: - - path: data/questions_gpt-oss-120b_late_500 - hash: md5 - md5: 73a0a7facc7360db9b11a57e542c5fca.dir - size: 52133974 - nfiles: 8230 - - path: metrics/generate_questions_gpt-oss-120b_late_500.json - hash: md5 - md5: 8ff4996eb2b587fd87b0b465ec619281 - size: 1514 - generate_hard_negatives@300: - cmd: uv run gsma hard-negatives data/questions_gpt-oss-120_late_300 --output-dir - data/hard_negatives_late_gpt-oss-120_300 --model "openai/gpt-oss-120b" --num-negatives - 3 --max-candidates 100 --similarity-threshold 0.3 --llm-validation --llm-max-concurrent - 2 --limit 5 --log-level DEBUG - deps: - - path: data/questions_gpt-oss-120_late_300 - hash: md5 - md5: 540a55b1d77587ff31578cc321a78b52.dir - size: 591190 - nfiles: 329 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: fe5036edbdf311b062aab6b3156a308b - size: 50756 - - path: gsma_dataset_creation/hard_negatives.py - hash: md5 - md5: 4bffddf3c0fc01e8802a31b94ee68131 - size: 30405 - outs: - - path: data/hard_negatives_late_gpt-oss-120_300 - hash: md5 - md5: 38f3b06c157be5aebdde8e36f22a8edf.dir - size: 16541 - nfiles: 4 - generate_questions_gpt-5-mini@0: - cmd: uv run gsma generate-questions data/chunked_late_300 data/questions_gpt-5-mini_late_300 - --num-questions 1 --model "openai/gpt-oss-120b" --limit-docs 5 --limit-chunks - 1000 --max-concurrent 20 --credit-check-interval 1000 --log-level DEBUG --metrics-file - metrics/generate_questions_gpt-5-mini_late_300.json - deps: - - path: data/chunked_late_300 - hash: md5 - md5: 369b24c0b9e6c95685d2f14ba3486b13.dir - size: 182168867 - nfiles: 248 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: fe5036edbdf311b062aab6b3156a308b - size: 50756 - - path: gsma_dataset_creation/question_generator.py - hash: md5 - md5: 
6092b529da9a8c1ee72f2de61bd6eaaa - size: 34228 - outs: - - path: data/questions_gpt-5-mini_late_300 - hash: md5 - md5: d6714d774b2c5d8b6080dd04c6f458be.dir - size: 363082 - nfiles: 201 - - path: metrics/generate_questions_gpt-5-mini_late_300.json - hash: md5 - md5: ce08c59a5b3be2c762f5630fcd32396d - size: 1490 - generate_questions@1: - cmd: uv run gsma generate-questions data/chunked_late_1000 data/questions_gpt-oss-120b_late_1000 - --num-questions 10 --model "openai/gpt-oss-120b" --limit-docs 1000 --limit-chunks - 1000 --max-concurrent 20 --credit-check-interval 1000 --log-level INFO --provider - Cerebras --metrics-file metrics/generate_questions_gpt-oss-120b_late_1000.json - deps: - - path: data/chunked_late_1000 - hash: md5 - md5: 7cf69d9840ef5521665ef13dc0753c5f.dir - size: 64371559 - nfiles: 248 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 98f0613663b314661b12ab8dc17ecb00 - size: 51648 - - path: gsma_dataset_creation/question_generator.py - hash: md5 - md5: 16718f9f50383900bd23e6d8394abcce - size: 34560 - outs: - - path: data/questions_gpt-oss-120b_late_1000 - hash: md5 - md5: e5aa8dfa448ab8afb2a546e9e0ed0ef2.dir - size: 46638867 - nfiles: 4001 - - path: metrics/generate_questions_gpt-oss-120b_late_1000.json - hash: md5 - md5: 6b223524d14a0b7de6fb252a18640ee0 - size: 1499 - generate_hard_negatives@0: - cmd: uv run gsma hard-negatives data/questions_gpt-oss-120b_late_500 --output-dir - data/hard_negatives_gpt-oss-120b_late_500 --model "openai/gpt-oss-120b" --num-negatives - 3 --max-candidates 10 --similarity-threshold 0.5 --max-similarity-threshold - 0.95 --llm-validation --llm-max-concurrent 20 --limit 5 --log-level INFO - deps: - - path: data/questions_gpt-oss-120b_late_500 - hash: md5 - md5: 73a0a7facc7360db9b11a57e542c5fca.dir - size: 52133974 - nfiles: 8230 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 98f0613663b314661b12ab8dc17ecb00 - size: 51648 - - path: gsma_dataset_creation/hard_negatives.py - hash: md5 - md5: 
07ac025daa1200c7ebbd6ac423a0e88a - size: 30973 - outs: - - path: data/hard_negatives_gpt-oss-120b_late_500 - hash: md5 - md5: 27bc8c0d5390e0cd0ceb47478f924a37.dir - size: 160278 - nfiles: 5 - generate_hard_negatives@1: - cmd: uv run gsma hard-negatives data/questions_gpt-5-mini_late_300 --output-dir - data/hard_negatives_gpt-5-mini_late_300 --model "openai/gpt-oss-120b" --num-negatives - 3 --max-candidates 100 --similarity-threshold 0.3 --llm-validation --llm-max-concurrent - 2 --limit 5 --log-level DEBUG - deps: - - path: data/questions_gpt-5-mini_late_300 - hash: md5 - md5: 7694e63bc420ac41d22ffab9891fa90a.dir - size: 628785 - nfiles: 329 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: fe5036edbdf311b062aab6b3156a308b - size: 50756 - - path: gsma_dataset_creation/hard_negatives.py - hash: md5 - md5: 4bffddf3c0fc01e8802a31b94ee68131 - size: 30405 - outs: - - path: data/hard_negatives_gpt-5-mini_late_300 - hash: md5 - md5: b976c7b0cf4b9786faca209bb660a5ac.dir - size: 29299 - nfiles: 5 - generate_questions@2: - cmd: uv run gsma generate-questions data/chunked_late_2000 data/questions_gpt-oss-120b_late_2000 - --num-questions 20 --model "openai/gpt-oss-120b" --limit-docs 1000 --limit-chunks - 1000 --max-concurrent 20 --credit-check-interval 1000 --log-level INFO --provider - Cerebras --metrics-file metrics/generate_questions_gpt-oss-120b_late_2000.json - deps: - - path: data/chunked_late_2000 - hash: md5 - md5: ad55506022dcd9f89bf999e65e841e2e.dir - size: 41293426 - nfiles: 248 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 98f0613663b314661b12ab8dc17ecb00 - size: 51648 - - path: gsma_dataset_creation/question_generator.py - hash: md5 - md5: 16718f9f50383900bd23e6d8394abcce - size: 34560 - outs: - - path: data/questions_gpt-oss-120b_late_2000 - hash: md5 - md5: 589b801c932365ffcf109218f59d3bec.dir - size: 38981806 - nfiles: 1830 - - path: metrics/generate_questions_gpt-oss-120b_late_2000.json - hash: md5 - md5: 71cedda9ae476bc3d8456323083c6ffd - 
size: 1513 - generate_questions@3: - cmd: uv run gsma generate-questions data/chunked_late_3000 data/questions_gpt-oss-120b_late_3000 - --num-questions 30 --model "openai/gpt-oss-120b" --limit-docs 1000 --limit-chunks - 1000 --max-concurrent 20 --credit-check-interval 1000 --log-level INFO --provider - Cerebras --metrics-file metrics/generate_questions_gpt-oss-120b_late_3000.json - deps: - - path: data/chunked_late_3000 - hash: md5 - md5: ab6fc68856dfe51186a2cc9bfd61d29b.dir - size: 33715035 - nfiles: 248 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 98f0613663b314661b12ab8dc17ecb00 - size: 51648 - - path: gsma_dataset_creation/question_generator.py - hash: md5 - md5: 16718f9f50383900bd23e6d8394abcce - size: 34560 - outs: - - path: data/questions_gpt-oss-120b_late_3000 - hash: md5 - md5: 7fca8c90a24f32b89d4526eae1a99f7d.dir - size: 32217471 - nfiles: 1066 - - path: metrics/generate_questions_gpt-oss-120b_late_3000.json - hash: md5 - md5: 07f417dabcf481c3cae998b571c68e5b - size: 1526 - generate_questions@4: - cmd: uv run gsma generate-questions data/chunked_late_4000 data/questions_gpt-oss-120b_late_4000 - --num-questions 40 --model "openai/gpt-oss-120b" --limit-docs 1000 --limit-chunks - 1000 --max-concurrent 20 --credit-check-interval 1000 --log-level INFO --provider - Cerebras --metrics-file metrics/generate_questions_gpt-oss-120b_late_4000.json - deps: - - path: data/chunked_late_4000 - hash: md5 - md5: 6c696d0ef870d7ba2aec72155bbfed66.dir - size: 29820279 - nfiles: 248 - - path: gsma_dataset_creation/cli.py - hash: md5 - md5: 98f0613663b314661b12ab8dc17ecb00 - size: 51648 - - path: gsma_dataset_creation/question_generator.py - hash: md5 - md5: 16718f9f50383900bd23e6d8394abcce - size: 34560 - outs: - - path: data/questions_gpt-oss-120b_late_4000 - hash: md5 - md5: c1578e42cf46e4eeefcefd25d0de812c.dir - size: 25730277 - nfiles: 658 - - path: metrics/generate_questions_gpt-oss-120b_late_4000.json - hash: md5 - md5: d76ede81c66d391e5521fd811598ebf1 - 
size: 1509 diff --git a/pipelines/questions/dvc.yaml b/pipelines/questions/dvc.yaml deleted file mode 100644 index c7f0e4d..0000000 --- a/pipelines/questions/dvc.yaml +++ /dev/null @@ -1,58 +0,0 @@ -stages: - generate_questions: - foreach: - - model: gpt-oss-120b - model_name: openai/gpt-oss-120b - provider: "--provider Cerebras" - size: 500 - name: late_500 - questions_per_chunk: 5 - - model: gpt-oss-120b - model_name: openai/gpt-oss-120b - provider: "--provider Cerebras" - size: 1000 - name: late_1000 - questions_per_chunk: 10 - - model: gpt-oss-120b - model_name: openai/gpt-oss-120b - provider: "--provider Cerebras" - size: 2000 - name: late_2000 - questions_per_chunk: 20 - - model: gpt-oss-120b - model_name: openai/gpt-oss-120b - provider: "--provider Cerebras" - size: 3000 - name: late_3000 - questions_per_chunk: 30 - - model: gpt-oss-120b - model_name: openai/gpt-oss-120b - provider: "--provider Cerebras" - size: 4000 - name: late_4000 - questions_per_chunk: 40 - do: - wdir: ../.. 
- cmd: >- - uv run gsma generate-questions - data/chunked_${item.name} - data/questions_${item.model}_${item.name} - --num-questions ${item.questions_per_chunk} - --model "${item.model_name}" - --limit-docs 1000 - --limit-chunks 1000 - --max-concurrent 20 - --credit-check-interval 1000 - --log-level INFO - ${item.provider} - --metrics-file metrics/generate_questions_${item.model}_${item.name}.json - deps: - - data/chunked_${item.name} - - gsma_dataset_creation/question_generator.py - - gsma_dataset_creation/cli.py - outs: - - data/questions_${item.model}_${item.name} - metrics: - - metrics/generate_questions_${item.model}_${item.name}.json - desc: "Generate questions from ${item.name} chunks using ${item.model}" - frozen: true diff --git a/pipelines/similarity/dvc.lock b/pipelines/similarity/dvc.lock deleted file mode 100644 index 74e8290..0000000 --- a/pipelines/similarity/dvc.lock +++ /dev/null @@ -1,181 +0,0 @@ -schema: '2.0' -stages: - overlap_detector: - cmd: uv run gsma similarity overlap-detector --input data/similarity_chunks.parquet - --output data/enriched_chunks.parquet --min-overlap-chars 50 --metrics-output - metrics/overlap_detector.json --logger-level INFO - deps: - - path: data/similarity_chunks.parquet - hash: md5 - md5: 5a591c2856e8c8fb78d6376108f9150f - size: 111960860 - - path: gsma_dataset_creation/similarity/overlap_detector.py - hash: md5 - md5: 80b7b89ec4e09e2857fba06b5806ec7e - size: 8782 - outs: - - path: data/enriched_chunks.parquet - hash: md5 - md5: f38b00fb8d08e7ae72d7343af230b0eb - size: 112412055 - - path: metrics/overlap_detector.json - hash: md5 - md5: 588f2ab3713ae386e40c6e3c0190a70e - size: 506 - data_combiner: - cmd: uv run gsma similarity data-combiner --chunker-dirs data/chunked_late_500 - --chunker-dirs data/chunked_late_1000 --chunker-dirs data/chunked_late_2000 - --chunker-dirs data/chunked_late_3000 --chunker-dirs data/chunked_late_4000 - --qa-dirs data/questions_gpt-oss-120b_late_500 --qa-dirs 
data/questions_gpt-oss-120b_late_1000 - --qa-dirs data/questions_gpt-oss-120b_late_2000 --qa-dirs data/questions_gpt-oss-120b_late_3000 - --qa-dirs data/questions_gpt-oss-120b_late_4000 --working-groups data/working_groups_mapping.json - --output data/combined_chunks.parquet --metrics-output metrics/data_combiner.json - --logger-level INFO - deps: - - path: data/chunked_late_1000 - hash: md5 - md5: 7cf69d9840ef5521665ef13dc0753c5f.dir - size: 64371559 - nfiles: 248 - - path: data/chunked_late_2000 - hash: md5 - md5: ad55506022dcd9f89bf999e65e841e2e.dir - size: 41293426 - nfiles: 248 - - path: data/chunked_late_3000 - hash: md5 - md5: ab6fc68856dfe51186a2cc9bfd61d29b.dir - size: 33715035 - nfiles: 248 - - path: data/chunked_late_4000 - hash: md5 - md5: 6c696d0ef870d7ba2aec72155bbfed66.dir - size: 29820279 - nfiles: 248 - - path: data/chunked_late_500 - hash: md5 - md5: cdf4ce62598c452b97deaff3762174bd.dir - size: 112905314 - nfiles: 248 - - path: data/questions_gpt-oss-120b_late_1000 - hash: md5 - md5: e5aa8dfa448ab8afb2a546e9e0ed0ef2.dir - size: 46638867 - nfiles: 4001 - - path: data/questions_gpt-oss-120b_late_2000 - hash: md5 - md5: 589b801c932365ffcf109218f59d3bec.dir - size: 38981806 - nfiles: 1830 - - path: data/questions_gpt-oss-120b_late_3000 - hash: md5 - md5: 7fca8c90a24f32b89d4526eae1a99f7d.dir - size: 32217471 - nfiles: 1066 - - path: data/questions_gpt-oss-120b_late_4000 - hash: md5 - md5: c1578e42cf46e4eeefcefd25d0de812c.dir - size: 25730277 - nfiles: 658 - - path: data/questions_gpt-oss-120b_late_500 - hash: md5 - md5: 73a0a7facc7360db9b11a57e542c5fca.dir - size: 52133974 - nfiles: 8230 - - path: data/working_groups_mapping.json - hash: md5 - md5: b70592bdeac5c03634a60d09d0a8fbc7 - size: 9901 - - path: gsma_dataset_creation/similarity/data_combiner.py - hash: md5 - md5: 72bb3424212238249cc6bea8e25b6843 - size: 19985 - outs: - - path: data/combined_chunks.parquet - hash: md5 - md5: e7bb746dfd09176236fd5b32cb00fd8a - size: 106961632 - - path: 
metrics/data_combiner.json - hash: md5 - md5: 64bc79c7c3bbdf350893313a05e32048 - size: 721 - similarity_hasher: - cmd: uv run gsma similarity hasher --input data/combined_chunks.parquet --output - data/hashed_chunks.parquet --metrics-output metrics/similarity_hasher.json --logger-level - INFO - deps: - - path: data/combined_chunks.parquet - hash: md5 - md5: e7bb746dfd09176236fd5b32cb00fd8a - size: 106961632 - - path: gsma_dataset_creation/similarity/hashing.py - hash: md5 - md5: fae2341f4110086046a2a3939187a3d8 - size: 12555 - outs: - - path: data/hashed_chunks.parquet - hash: md5 - md5: 599c6322827da8ecbdf553bafa18b65f - size: 108402760 - - path: metrics/similarity_hasher.json - hash: md5 - md5: b76d7919548bf1b1dde3d7712a4b7d17 - size: 371 - similarity_calculator: - cmd: uv run python gsma_dataset_creation/similarity_calculator.py --input data/hashed_chunks.parquet - --output data/similarity_chunks.parquet --similarity-matrix data/similarity_matrix.npz - --k 20 --threshold 0.3 --faiss-index-type IVFFlat --metrics-output metrics/similarity_calculator.json - --verbose - deps: - - path: data/hashed_chunks.parquet - hash: md5 - md5: 174cc7f50029c8f9cf591dd715506d9b - size: 172534307 - - path: gsma_dataset_creation/similarity/similarity_calculator.py - hash: md5 - md5: 095521e401e091581c26c8ff726f0821 - size: 14724 - - path: gsma_dataset_creation/similarity_calculator.py - hash: md5 - md5: dacc004e5a3ab3a4c09517162435ea2d - size: 10565 - outs: - - path: data/similarity_chunks.parquet - hash: md5 - md5: 551a7c5083e12631efd6c1c679f073af - size: 176052244 - - path: data/similarity_matrix.npz - hash: md5 - md5: 1ab4b1778ca5376d254079bdf834d651 - size: 578 - - path: metrics/similarity_calculator.json - hash: md5 - md5: 39dbf8d225f58bc91cd49f89d568d8be - size: 540 - similarity_ranker: - cmd: uv run gsma similarity ranker --input data/hashed_chunks.parquet --output - data/similarity_chunks.parquet --similarity-matrix data/similarity_matrix.npz - --k 20 --threshold 0.3 
--faiss-index-type IVFFlat --metrics-output metrics/similarity_calculator.json - --logger-level INFO - deps: - - path: data/hashed_chunks.parquet - hash: md5 - md5: 599c6322827da8ecbdf553bafa18b65f - size: 108402760 - - path: gsma_dataset_creation/similarity/similarity_calculator.py - hash: md5 - md5: f62609fe22d08df9df63bb0716f44912 - size: 15006 - outs: - - path: data/similarity_chunks.parquet - hash: md5 - md5: 5a591c2856e8c8fb78d6376108f9150f - size: 111960860 - - path: data/similarity_matrix.npz - hash: md5 - md5: affd9c9497e8b2460eccfcb8af364169 - size: 577 - - path: metrics/similarity_calculator.json - hash: md5 - md5: 0ced64da2e7e652d6a9e5eff40a8b745 - size: 540 diff --git a/pipelines/similarity/dvc.yaml b/pipelines/similarity/dvc.yaml deleted file mode 100644 index cde2573..0000000 --- a/pipelines/similarity/dvc.yaml +++ /dev/null @@ -1,94 +0,0 @@ -stages: - data_combiner: - wdir: ../.. - cmd: >- - uv run gsma similarity data-combiner - --chunker-dirs data/chunked_late_500 - --chunker-dirs data/chunked_late_1000 - --chunker-dirs data/chunked_late_2000 - --chunker-dirs data/chunked_late_3000 - --chunker-dirs data/chunked_late_4000 - --qa-dirs data/questions_gpt-oss-120b_late_500 - --qa-dirs data/questions_gpt-oss-120b_late_1000 - --qa-dirs data/questions_gpt-oss-120b_late_2000 - --qa-dirs data/questions_gpt-oss-120b_late_3000 - --qa-dirs data/questions_gpt-oss-120b_late_4000 - --working-groups data/working_groups_mapping.json - --output data/combined_chunks.parquet - --metrics-output metrics/data_combiner.json - --logger-level INFO - deps: - - data/chunked_late_500 - - data/chunked_late_1000 - - data/chunked_late_2000 - - data/chunked_late_3000 - - data/chunked_late_4000 - - data/questions_gpt-oss-120b_late_500 - - data/questions_gpt-oss-120b_late_1000 - - data/questions_gpt-oss-120b_late_2000 - - data/questions_gpt-oss-120b_late_3000 - - data/questions_gpt-oss-120b_late_4000 - - data/working_groups_mapping.json - - 
gsma_dataset_creation/similarity/data_combiner.py - outs: - - data/combined_chunks.parquet - metrics: - - metrics/data_combiner.json - desc: "Combine chunker and QA data across chunk sizes with working group classification" - - similarity_hasher: - wdir: ../.. - cmd: >- - uv run gsma similarity hasher - --input data/combined_chunks.parquet - --output data/hashed_chunks.parquet - --metrics-output metrics/similarity_hasher.json - --logger-level INFO - deps: - - data/combined_chunks.parquet - - gsma_dataset_creation/similarity/hashing.py - outs: - - data/hashed_chunks.parquet - metrics: - - metrics/similarity_hasher.json - desc: "Add SHA-256 hashes to combined chunk data" - - similarity_ranker: - wdir: ../.. - cmd: >- - uv run gsma similarity ranker - --input data/hashed_chunks.parquet - --output data/similarity_chunks.parquet - --similarity-matrix data/similarity_matrix.npz - --k 20 - --threshold 0.3 - --faiss-index-type IVFFlat - --metrics-output metrics/similarity_calculator.json - --logger-level INFO - deps: - - data/hashed_chunks.parquet - - gsma_dataset_creation/similarity/similarity_calculator.py - outs: - - data/similarity_chunks.parquet - - data/similarity_matrix.npz - metrics: - - metrics/similarity_calculator.json - desc: "Compute FAISS-based top-K similarity relationships for chunks" - - overlap_detector: - wdir: ../.. 
- cmd: >- - uv run gsma similarity overlap-detector - --input data/similarity_chunks.parquet - --output data/enriched_chunks.parquet - --min-overlap-chars 50 - --metrics-output metrics/overlap_detector.json - --logger-level INFO - deps: - - data/similarity_chunks.parquet - - gsma_dataset_creation/similarity/overlap_detector.py - outs: - - data/enriched_chunks.parquet - metrics: - - metrics/overlap_detector.json - desc: "Detect character offset-based overlap relationships with simplified schema" diff --git a/pipelines/validation/README.md b/pipelines/validation/README.md deleted file mode 100644 index e93a40d..0000000 --- a/pipelines/validation/README.md +++ /dev/null @@ -1,99 +0,0 @@ -# Validation Pipeline - -DVC pipeline for batched validation of Q&A pairs against candidate chunks. - -## Overview - -This pipeline validates whether candidate chunks (from similarity matching) can actually answer the questions generated from positive chunks. Uses batched LLM evaluation with SQLite checkpointing for resumable processing. - -## Stages - -### 1. explode-questions -Transforms enriched chunks (chunk-centric) into questions with candidates (question-centric). - -**Input**: `data/enriched_chunks.parquet` -**Output**: `data/validation/questions_with_candidates.parquet` -**Features**: -- Extracts questions from nested structure -- Filters candidates by similarity score (≥ 0.3) -- Creates parallel arrays for candidate metadata - -### 2. batch-candidates -Groups candidates into batched validation requests (max 10 per batch). - -**Input**: `data/validation/questions_with_candidates.parquet` -**Output**: `data/validation/validation_requests_batched.parquet` -**Features**: -- Randomizes candidate order to avoid position bias -- Estimates tokens for cost tracking -- Tracks original candidate order for debugging - -### 3. validate-batched-requests -Validates candidates using LLM with checkpoint-based resumption. 
- -**Input**: `data/validation/validation_requests_batched.parquet` -**Output**: `data/validation/validation_results.parquet` -**Features**: -- SQLite checkpoint for crash recovery -- Smart mtime detection (auto-reinitialize if input changed) -- Async validation with concurrency control -- Retry logic (max 3 attempts per request) -- Output only written when 100% complete - -## Running the Pipeline - -```bash -# Run full validation pipeline -cd pipelines/validation -dvc repro - -# Run individual stages -dvc repro explode_questions -dvc repro batch_candidates -dvc repro validate_batched_requests - -# Force rerun (ignore cache) -dvc repro --force - -# Resume after interruption (uses checkpoint) -dvc repro validate_batched_requests -``` - -## Cost Optimization - -This pipeline achieves **4-5× cost reduction** vs. single-candidate validation: -- Batches up to 10 candidates per API call -- Shares question/answer context across candidates -- Randomizes order to avoid position bias - -## Checkpoint Management - -The `validate-batched-requests` stage uses SQLite checkpoints stored in `.dvc/.tmp/validation_checkpoints/`: -- Automatically resumes if interrupted (Ctrl+C) -- Detects input file changes via mtime -- Use `--force` flag to ignore checkpoint and restart - -## Configuration - -Edit `dvc.yaml` to adjust: -- `--max-candidates-per-batch`: Batch size (1-10) -- `--max-concurrent`: API concurrency (default: 20) -- `--model`: LLM model (default: openai/gpt-4o-mini) -- `--min-similarity-score`: Candidate filter threshold (default: 0.3) - -## Output Schema - -`validation_results.parquet` contains: -- Question metadata (question_id, question, answer, etc.) -- Candidate metadata (chunk_ids, similarity_scores, etc.) -- Validation results (is_answerable, reasoning, quality_scores) -- Model metadata (model_used, tokens_used, timestamp) - -See `ValidationRequestRecord` model in `gsma_dataset_creation/validation/models.py` for full schema. 
- -## Environment Requirements - -Set `OPENROUTER_API_KEY` environment variable: -```bash -export OPENROUTER_API_KEY="sk-or-v1-..." -``` diff --git a/pipelines/validation/dvc.lock b/pipelines/validation/dvc.lock deleted file mode 100644 index 16d22ca..0000000 --- a/pipelines/validation/dvc.lock +++ /dev/null @@ -1,161 +0,0 @@ -schema: '2.0' -stages: - explode_questions: - cmd: uv run gsma validation explode-questions --input data/enriched_chunks.parquet - --output data/validation/questions_with_candidates.parquet --min-similarity-score - 0.6 --max-similarity-score 0.9 --metrics-output metrics/explode_questions.json - --logger-level INFO - deps: - - path: data/enriched_chunks.parquet - hash: md5 - md5: f38b00fb8d08e7ae72d7343af230b0eb - size: 112412055 - - path: gsma_dataset_creation/validation_cli.py - hash: md5 - md5: cc0058ea4aef40b34b561fb556ab5d56 - size: 28440 - outs: - - path: data/validation/questions_with_candidates.parquet - hash: md5 - md5: fcd85bda6273b751d78d9792c253e202 - size: 216927340 - - path: metrics/explode_questions.json - hash: md5 - md5: 4dd160c58147e39d8bb67117c05a77b3 - size: 1072 - batch_candidates: - cmd: uv run gsma validation batch-candidates --input data/validation/questions_with_candidates.parquet - --enriched-chunks data/enriched_chunks.parquet --output data/validation/validation_requests_batched.parquet - --max-candidates-per-batch 1 --randomize --logger-level INFO - deps: - - path: data/enriched_chunks.parquet - hash: md5 - md5: f38b00fb8d08e7ae72d7343af230b0eb - size: 112412055 - - path: data/validation/questions_with_candidates.parquet - hash: md5 - md5: fcd85bda6273b751d78d9792c253e202 - size: 216927340 - - path: gsma_dataset_creation/validation/candidate_batcher.py - hash: md5 - md5: f7162e58dd24817972b36dc05c52d246 - size: 11269 - - path: gsma_dataset_creation/validation_cli.py - hash: md5 - md5: cc0058ea4aef40b34b561fb556ab5d56 - size: 28440 - outs: - - path: data/validation/validation_requests_batched.parquet - hash: md5 - 
md5: 07b663f4aa43e7452b5bbb09f9325e02 - size: 45152020 - validate_batched_requests: - cmd: uv run gsma validation validate-batched-requests --input data/validation/validation_requests_batched.parquet - --enriched-chunks data/enriched_chunks.parquet --output data/validation/validation_results.parquet - --checkpoint-dir .dvc/.tmp/validation_checkpoints --max-concurrent 50 --limit - 1000 --force --metrics-output metrics/validation_results.json --logger-level - INFO --model qwen/qwen3-235b-a22b-2507 - deps: - - path: data/enriched_chunks.parquet - hash: md5 - md5: f38b00fb8d08e7ae72d7343af230b0eb - size: 112412055 - - path: data/validation/validation_requests_batched.parquet - hash: md5 - md5: 07b663f4aa43e7452b5bbb09f9325e02 - size: 45152020 - - path: gsma_dataset_creation/validation/request_tracker.py - hash: md5 - md5: 9d26e1f5562f40ef2f51422a1314130c - size: 18390 - - path: gsma_dataset_creation/validation/validator.py - hash: md5 - md5: a479662940a5eacbdea559687be8e3d9 - size: 5516 - - path: gsma_dataset_creation/validation_cli.py - hash: md5 - md5: cc0058ea4aef40b34b561fb556ab5d56 - size: 28440 - outs: - - path: data/validation/validation_results.parquet - hash: md5 - md5: 4a60ad38214d159574f66feb080f836d - size: 132153 - - path: metrics/validation_results.json - hash: md5 - md5: 03af36f287138c5e4317f0bedf8b8f8d - size: 217 - create_validation_dataset: - cmd: uv run gsma datasets create-from-validation --input data/validation/validation_results.parquet - --enriched-chunks data/filters/enriched_chunks_with_filter.parquet --embedding-output - data/validation/validation_dataset_embedding --embedding-jsonl-output data/validation/validation_dataset_embedding.jsonl - --qa-output data/validation/validation_dataset_qa --qa-jsonl-output data/validation/validation_dataset_qa.jsonl - --max-positives 3 --max-negatives 3 --metrics-output metrics/dataset_creation_from_validation.json - --logger-level INFO - deps: - - path: data/enriched_chunks.parquet - hash: md5 - md5: 
f38b00fb8d08e7ae72d7343af230b0eb - size: 112412055 - - path: data/validation/validation_results.parquet - hash: md5 - md5: 5f98fa0f00256044120b97b13a12be96 - size: 92152160 - - path: gsma_dataset_creation/datasets/validation_dataset_creator.py - hash: md5 - md5: 124492c518e2698dac8163295d787911 - size: 42923 - - path: gsma_dataset_creation/datasets_cli.py - hash: md5 - md5: 6f5f3f0dd5c1ac288e7ff43650df50e4 - size: 10247 - outs: - - path: data/validation/validation_dataset_embedding - hash: md5 - md5: 88223ea01b585dfb6b05e803dd5fa910.dir - size: 2233894968 - nfiles: 7 - - path: data/validation/validation_dataset_embedding.jsonl - hash: md5 - md5: ddc52bd5150d20032b0a624b336eab89 - size: 2257818924 - - path: data/validation/validation_dataset_qa - hash: md5 - md5: 702c0453605c372e9d4f238b5c96055d.dir - size: 423295114 - nfiles: 3 - - path: data/validation/validation_dataset_qa.jsonl - hash: md5 - md5: 0d9524474890cb4a057c77e31088d307 - size: 421247669 - - path: metrics/dataset_creation_from_validation.json - hash: md5 - md5: 63fc361595a9fdf4da53a6e3bc4d5aeb - size: 2124 - upload_hf_dataset: - cmd: uv run gsma upload-hf-dataset --dataset-path data/validation/validation_dataset - --repo-name mantisnlp/gsma_prd_synthetic - deps: - - path: data/validation/validation_dataset - hash: md5 - md5: e3025e14ed722130645935d3fb206e3d.dir - size: 2233894968 - nfiles: 7 - upload_qa_dataset: - cmd: uv run gsma upload-hf-dataset --dataset-path data/validation/validation_dataset_qa - --repo-name mantisnlp/gsma_prd_synthetic_qa - deps: - - path: data/validation/validation_dataset_qa - hash: md5 - md5: 702c0453605c372e9d4f238b5c96055d.dir - size: 423295114 - nfiles: 3 - upload_embedding_dataset: - cmd: uv run gsma upload-hf-dataset --dataset-path data/validation/validation_dataset_embedding - --repo-name mantisnlp/gsma_prd_synthetic_embedding - deps: - - path: data/validation/validation_dataset_embedding - hash: md5 - md5: 88223ea01b585dfb6b05e803dd5fa910.dir - size: 2233894968 - 
nfiles: 7 diff --git a/pipelines/validation/dvc.yaml b/pipelines/validation/dvc.yaml deleted file mode 100644 index c3e8801..0000000 --- a/pipelines/validation/dvc.yaml +++ /dev/null @@ -1,156 +0,0 @@ -stages: - explode_questions: - wdir: ../.. - cmd: >- - uv run gsma validation explode-questions - --input data/enriched_chunks.parquet - --output data/validation/questions_with_candidates.parquet - --min-similarity-score 0.35 - --max-similarity-score 0.95 - --metrics-output metrics/validation/explode_questions_metrics.json - --logger-level INFO - deps: - - data/enriched_chunks.parquet - - gsma_dataset_creation/validation_cli.py - outs: - - data/validation/questions_with_candidates.parquet - metrics: - - metrics/validation/explode_questions_metrics.json: - cache: true - desc: "Extract questions from enriched chunks into question-centric format" - - apply_question_filter: - wdir: ../.. - cmd: >- - uv run gsma filters apply-question-filter - --input data/validation/questions_with_candidates.parquet - --output data/filters/questions_with_filter.parquet - --model-path models/filters/question-filter-run-5000-2025-10-08_22-47-46/model - --metrics-output metrics/filters/question_filter_metrics.json - deps: - - data/validation/questions_with_candidates.parquet - - models/filters/question-filter-run-5000-2025-10-08_22-47-46/model - - gsma_dataset_creation/filters_cli.py - outs: - - data/filters/questions_with_filter.parquet - metrics: - - metrics/filters/question_filter_metrics.json: - cache: true - desc: "Apply external reference filter to all questions and add low_quality_probability column" - - apply_chunk_filter: - wdir: ../.. 
- cmd: >- - uv run gsma filters apply-chunk-filter - --input data/enriched_chunks.parquet - --output data/filters/enriched_chunks_with_filter.parquet - --model-path models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model - --truncate-length 200 - --metrics-output metrics/filters/chunk_filter_metrics.json - deps: - - data/enriched_chunks.parquet - - models/filters/chunk-filter-run-5000-2025-10-08_19-03-29/model - - gsma_dataset_creation/filters_cli.py - outs: - - data/filters/enriched_chunks_with_filter.parquet - metrics: - - metrics/filters/chunk_filter_metrics.json: - cache: true - desc: "Apply procedures filter to all chunks and add low_quality_probability column" - - filter_questions_by_chunk_quality: - wdir: ../.. - cmd: >- - uv run gsma filters filter-questions-by-chunk-quality - --input-questions data/filters/questions_with_filter.parquet - --input-chunks data/filters/enriched_chunks_with_filter.parquet - --output data/validation/questions_filtered.parquet - --min-question-probability 0.5 - --min-chunk-probability 0.5 - --metrics-output metrics/filters/question_chunk_filtering_metrics.json - deps: - - data/filters/questions_with_filter.parquet - - data/filters/enriched_chunks_with_filter.parquet - - gsma_dataset_creation/filters_cli.py - outs: - - data/validation/questions_filtered.parquet - metrics: - - metrics/filters/question_chunk_filtering_metrics.json: - cache: true - desc: "Filter questions by removing low-quality content based on both question and chunk quality" - - validate_requests: - wdir: ../.. 
- cmd: >- - uv run gsma validation validate-requests - --input data/validation/questions_filtered.parquet - --enriched-chunks data/filters/enriched_chunks_with_filter.parquet - --output data/validation/validation_results.parquet - --checkpoint-dir .dvc/.tmp/validation_checkpoints - --max-concurrent 50 - --limit 50000 - --metrics-output metrics/validation_results.json - --logger-level INFO - --model qwen/qwen3-235b-a22b-2507 - deps: - - data/validation/questions_filtered.parquet - - data/filters/enriched_chunks_with_filter.parquet - - gsma_dataset_creation/validation_cli.py - - gsma_dataset_creation/validation/validator.py - - gsma_dataset_creation/validation/request_tracker.py - outs: - - data/validation/validation_results.parquet - metrics: - - metrics/validation_results.json: - cache: true - desc: "Validate requests using LLM with SQLite checkpointing (individual validation)" - frozen: false - - create_validation_dataset: - wdir: ../.. - cmd: >- - uv run gsma datasets create-from-validation - --input data/validation/validation_results.parquet - --enriched-chunks data/filters/enriched_chunks_with_filter.parquet - --embedding-output data/validation/validation_dataset_embedding - --embedding-jsonl-output data/validation/validation_dataset_embedding.jsonl - --qa-output data/validation/validation_dataset_qa - --qa-jsonl-output data/validation/validation_dataset_qa.jsonl - --max-positives 3 - --max-negatives 3 - --metrics-output metrics/dataset_creation_from_validation.json - --logger-level INFO - deps: - - data/validation/validation_results.parquet - - data/enriched_chunks.parquet - - gsma_dataset_creation/datasets_cli.py - - gsma_dataset_creation/datasets/validation_dataset_creator.py - outs: - - data/validation/validation_dataset_embedding - - data/validation/validation_dataset_embedding.jsonl - - data/validation/validation_dataset_qa - - data/validation/validation_dataset_qa.jsonl - metrics: - - metrics/dataset_creation_from_validation.json: - cache: true - desc: 
"Create HuggingFace embedding and QA datasets with JSONL from validation results" - - upload_embedding_dataset: - wdir: ../.. - cmd: >- - uv run gsma upload-hf-dataset - --dataset-path data/validation/validation_dataset_embedding - --repo-name mantisnlp/gsma_prd_synthetic_embedding - deps: - - data/validation/validation_dataset_embedding - desc: "Upload embedding training dataset to HuggingFace Hub" - - upload_qa_dataset: - wdir: ../.. - cmd: >- - uv run gsma upload-hf-dataset - --dataset-path data/validation/validation_dataset_qa - --repo-name mantisnlp/gsma_prd_synthetic_qa - deps: - - data/validation/validation_dataset_qa - desc: "Upload QA dataset to HuggingFace Hub" diff --git a/scripts/check-task-prerequisites.sh b/scripts/check-task-prerequisites.sh deleted file mode 100755 index 68d5770..0000000 --- a/scripts/check-task-prerequisites.sh +++ /dev/null @@ -1,62 +0,0 @@ -#!/usr/bin/env bash -# Check that implementation plan exists and find optional design documents -# Usage: ./check-task-prerequisites.sh [--json] - -set -e - -JSON_MODE=false -for arg in "$@"; do - case "$arg" in - --json) JSON_MODE=true ;; - --help|-h) echo "Usage: $0 [--json]"; exit 0 ;; - esac -done - -# Source common functions -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -source "$SCRIPT_DIR/common.sh" - -# Get all paths -eval $(get_feature_paths) - -# Check if on feature branch -check_feature_branch "$CURRENT_BRANCH" || exit 1 - -# Check if feature directory exists -if [[ ! -d "$FEATURE_DIR" ]]; then - echo "ERROR: Feature directory not found: $FEATURE_DIR" - echo "Run /specify first to create the feature structure." - exit 1 -fi - -# Check for implementation plan (required) -if [[ ! -f "$IMPL_PLAN" ]]; then - echo "ERROR: plan.md not found in $FEATURE_DIR" - echo "Run /plan first to create the plan." 
- exit 1 -fi - -if $JSON_MODE; then - # Build JSON array of available docs that actually exist - docs=() - [[ -f "$RESEARCH" ]] && docs+=("research.md") - [[ -f "$DATA_MODEL" ]] && docs+=("data-model.md") - ([[ -d "$CONTRACTS_DIR" ]] && [[ -n "$(ls -A "$CONTRACTS_DIR" 2>/dev/null)" ]]) && docs+=("contracts/") - [[ -f "$QUICKSTART" ]] && docs+=("quickstart.md") - # join array into JSON - json_docs=$(printf '"%s",' "${docs[@]}") - json_docs="[${json_docs%,}]" - printf '{"FEATURE_DIR":"%s","AVAILABLE_DOCS":%s}\n' "$FEATURE_DIR" "$json_docs" -else - # List available design documents (optional) - echo "FEATURE_DIR:$FEATURE_DIR" - echo "AVAILABLE_DOCS:" - - # Use common check functions - check_file "$RESEARCH" "research.md" - check_file "$DATA_MODEL" "data-model.md" - check_dir "$CONTRACTS_DIR" "contracts/" - check_file "$QUICKSTART" "quickstart.md" -fi - -# Always succeed - task generation should work with whatever docs are available diff --git a/scripts/create-new-feature.sh b/scripts/create-new-feature.sh deleted file mode 100755 index 65a4afd..0000000 --- a/scripts/create-new-feature.sh +++ /dev/null @@ -1,96 +0,0 @@ -#!/usr/bin/env bash -# Create a new feature with branch, directory structure, and template -# Usage: ./create-new-feature.sh "feature description" -# ./create-new-feature.sh --json "feature description" - -set -e - -JSON_MODE=false - -# Collect non-flag args -ARGS=() -for arg in "$@"; do - case "$arg" in - --json) - JSON_MODE=true - ;; - --help|-h) - echo "Usage: $0 [--json] "; exit 0 ;; - *) - ARGS+=("$arg") ;; - esac -done - -FEATURE_DESCRIPTION="${ARGS[*]}" -if [ -z "$FEATURE_DESCRIPTION" ]; then - echo "Usage: $0 [--json] " >&2 - exit 1 -fi - -# Get repository root -REPO_ROOT=$(git rev-parse --show-toplevel) -SPECS_DIR="$REPO_ROOT/specs" - -# Create specs directory if it doesn't exist -mkdir -p "$SPECS_DIR" - -# Find the highest numbered feature directory -HIGHEST=0 -if [ -d "$SPECS_DIR" ]; then - for dir in "$SPECS_DIR"/*; do - if [ -d "$dir" ]; 
then - dirname=$(basename "$dir") - number=$(echo "$dirname" | grep -o '^[0-9]\+' || echo "0") - number=$((10#$number)) - if [ "$number" -gt "$HIGHEST" ]; then - HIGHEST=$number - fi - fi - done -fi - -# Generate next feature number with zero padding -NEXT=$((HIGHEST + 1)) -FEATURE_NUM=$(printf "%03d" "$NEXT") - -# Create branch name from description -BRANCH_NAME=$(echo "$FEATURE_DESCRIPTION" | \ - tr '[:upper:]' '[:lower:]' | \ - sed 's/[^a-z0-9]/-/g' | \ - sed 's/-\+/-/g' | \ - sed 's/^-//' | \ - sed 's/-$//') - -# Extract 2-3 meaningful words -WORDS=$(echo "$BRANCH_NAME" | tr '-' '\n' | grep -v '^$' | head -3 | tr '\n' '-' | sed 's/-$//') - -# Final branch name -BRANCH_NAME="${FEATURE_NUM}-${WORDS}" - -# Create and switch to new branch -git checkout -b "$BRANCH_NAME" - -# Create feature directory -FEATURE_DIR="$SPECS_DIR/$BRANCH_NAME" -mkdir -p "$FEATURE_DIR" - -# Copy template if it exists -TEMPLATE="$REPO_ROOT/templates/spec-template.md" -SPEC_FILE="$FEATURE_DIR/spec.md" - -if [ -f "$TEMPLATE" ]; then - cp "$TEMPLATE" "$SPEC_FILE" -else - echo "Warning: Template not found at $TEMPLATE" >&2 - touch "$SPEC_FILE" -fi - -if $JSON_MODE; then - printf '{"BRANCH_NAME":"%s","SPEC_FILE":"%s","FEATURE_NUM":"%s"}\n' \ - "$BRANCH_NAME" "$SPEC_FILE" "$FEATURE_NUM" -else - # Output results for the LLM to use (legacy key: value format) - echo "BRANCH_NAME: $BRANCH_NAME" - echo "SPEC_FILE: $SPEC_FILE" - echo "FEATURE_NUM: $FEATURE_NUM" -fi diff --git a/scripts/get-feature-paths.sh b/scripts/get-feature-paths.sh deleted file mode 100644 index 030ecc3..0000000 --- a/scripts/get-feature-paths.sh +++ /dev/null @@ -1,23 +0,0 @@ -#!/usr/bin/env bash -# Get paths for current feature branch without creating anything -# Used by commands that need to find existing feature files - -set -e - -# Source common functions -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -source "$SCRIPT_DIR/common.sh" - -# Get all paths -eval $(get_feature_paths) - -# Check if on feature branch 
-check_feature_branch "$CURRENT_BRANCH" || exit 1
-
-# Output paths (don't create anything)
-echo "REPO_ROOT: $REPO_ROOT"
-echo "BRANCH: $CURRENT_BRANCH"
-echo "FEATURE_DIR: $FEATURE_DIR"
-echo "FEATURE_SPEC: $FEATURE_SPEC"
-echo "IMPL_PLAN: $IMPL_PLAN"
-echo "TASKS: $TASKS"
diff --git a/scripts/setup-plan.sh b/scripts/setup-plan.sh
deleted file mode 100755
index 1ec77de..0000000
--- a/scripts/setup-plan.sh
+++ /dev/null
@@ -1,44 +0,0 @@
-#!/usr/bin/env bash
-# Setup implementation plan structure for current branch
-# Returns paths needed for implementation plan generation
-# Usage: ./setup-plan.sh [--json]
-
-set -e
-
-JSON_MODE=false
-for arg in "$@"; do
-    case "$arg" in
-        --json) JSON_MODE=true ;;
-        --help|-h) echo "Usage: $0 [--json]"; exit 0 ;;
-    esac
-done
-
-# Source common functions
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-source "$SCRIPT_DIR/common.sh"
-
-# Get all paths
-eval $(get_feature_paths)
-
-# Check if on feature branch
-check_feature_branch "$CURRENT_BRANCH" || exit 1
-
-# Create specs directory if it doesn't exist
-mkdir -p "$FEATURE_DIR"
-
-# Copy plan template if it exists
-TEMPLATE="$REPO_ROOT/templates/plan-template.md"
-if [ -f "$TEMPLATE" ]; then
-    cp "$TEMPLATE" "$IMPL_PLAN"
-fi
-
-if $JSON_MODE; then
-    printf '{"FEATURE_SPEC":"%s","IMPL_PLAN":"%s","SPECS_DIR":"%s","BRANCH":"%s"}\n' \
-        "$FEATURE_SPEC" "$IMPL_PLAN" "$FEATURE_DIR" "$CURRENT_BRANCH"
-else
-    # Output all paths for LLM use
-    echo "FEATURE_SPEC: $FEATURE_SPEC"
-    echo "IMPL_PLAN: $IMPL_PLAN"
-    echo "SPECS_DIR: $FEATURE_DIR"
-    echo "BRANCH: $CURRENT_BRANCH"
-fi
diff --git a/templates/agent-file-template.md b/templates/agent-file-template.md
deleted file mode 100644
index f734997..0000000
--- a/templates/agent-file-template.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# [PROJECT NAME] Development Guidelines
-
-Auto-generated from all feature plans.
Last updated: [DATE] - -## Active Technologies -[EXTRACTED FROM ALL PLAN.MD FILES] - -## Project Structure -``` -[ACTUAL STRUCTURE FROM PLANS] -``` - -## Commands -[ONLY COMMANDS FOR ACTIVE TECHNOLOGIES] - -## Code Style -[LANGUAGE-SPECIFIC, ONLY FOR LANGUAGES IN USE] - -## Recent Changes -[LAST 3 FEATURES AND WHAT THEY ADDED] - - - diff --git a/templates/plan-template.md b/templates/plan-template.md deleted file mode 100644 index 3936299..0000000 --- a/templates/plan-template.md +++ /dev/null @@ -1,237 +0,0 @@ -# Implementation Plan: [FEATURE] - -**Branch**: `[###-feature-name]` | **Date**: [DATE] | **Spec**: [link] -**Input**: Feature specification from `/specs/[###-feature-name]/spec.md` - -## Execution Flow (/plan command scope) -``` -1. Load feature spec from Input path - → If not found: ERROR "No feature spec at {path}" -2. Fill Technical Context (scan for NEEDS CLARIFICATION) - → Detect Project Type from context (web=frontend+backend, mobile=app+api) - → Set Structure Decision based on project type -3. Evaluate Constitution Check section below - → If violations exist: Document in Complexity Tracking - → If no justification possible: ERROR "Simplify approach first" - → Update Progress Tracking: Initial Constitution Check -4. Execute Phase 0 → research.md - → If NEEDS CLARIFICATION remain: ERROR "Resolve unknowns" -5. Execute Phase 1 → contracts, data-model.md, quickstart.md, agent-specific template file (e.g., `CLAUDE.md` for Claude Code, `.github/copilot-instructions.md` for GitHub Copilot, or `GEMINI.md` for Gemini CLI). -6. Re-evaluate Constitution Check section - → If new violations: Refactor design, return to Phase 1 - → Update Progress Tracking: Post-Design Constitution Check -7. Plan Phase 2 → Describe task generation approach (DO NOT create tasks.md) -8. STOP - Ready for /tasks command -``` - -**IMPORTANT**: The /plan command STOPS at step 7. 
Phases 2-4 are executed by other commands: -- Phase 2: /tasks command creates tasks.md -- Phase 3-4: Implementation execution (manual or via tools) - -## Summary -[Extract from feature spec: primary requirement + technical approach from research] - -## Technical Context -**Language/Version**: [e.g., Python 3.11, Swift 5.9, Rust 1.75 or NEEDS CLARIFICATION] -**Primary Dependencies**: [e.g., FastAPI, UIKit, LLVM or NEEDS CLARIFICATION] -**Storage**: [if applicable, e.g., PostgreSQL, CoreData, files or N/A] -**Testing**: [e.g., pytest, XCTest, cargo test or NEEDS CLARIFICATION] -**Target Platform**: [e.g., Linux server, iOS 15+, WASM or NEEDS CLARIFICATION] -**Project Type**: [single/web/mobile - determines source structure] -**Performance Goals**: [domain-specific, e.g., 1000 req/s, 10k lines/sec, 60 fps or NEEDS CLARIFICATION] -**Constraints**: [domain-specific, e.g., <200ms p95, <100MB memory, offline-capable or NEEDS CLARIFICATION] -**Scale/Scope**: [domain-specific, e.g., 10k users, 1M LOC, 50 screens or NEEDS CLARIFICATION] - -## Constitution Check -*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.* - -**Simplicity**: -- Projects: [#] (max 3 - e.g., api, cli, tests) -- Using framework directly? (no wrapper classes) -- Single data model? (no DTOs unless serialization differs) -- Avoiding patterns? (no Repository/UoW without proven need) - -**Architecture**: -- EVERY feature as library? (no direct app code) -- Libraries listed: [name + purpose for each] -- CLI per library: [commands with --help/--version/--format] -- Library docs: llms.txt format planned? - -**Testing (NON-NEGOTIABLE)**: -- RED-GREEN-Refactor cycle enforced? (test MUST fail first) -- Git commits show tests before implementation? -- Order: Contract→Integration→E2E→Unit strictly followed? -- Real dependencies used? (actual DBs, not mocks) -- Integration tests for: new libraries, contract changes, shared schemas? 
-- FORBIDDEN: Implementation before test, skipping RED phase - -**Observability**: -- Structured logging included? -- Frontend logs → backend? (unified stream) -- Error context sufficient? - -**Versioning**: -- Version number assigned? (MAJOR.MINOR.BUILD) -- BUILD increments on every change? -- Breaking changes handled? (parallel tests, migration plan) - -## Project Structure - -### Documentation (this feature) -``` -specs/[###-feature]/ -├── plan.md # This file (/plan command output) -├── research.md # Phase 0 output (/plan command) -├── data-model.md # Phase 1 output (/plan command) -├── quickstart.md # Phase 1 output (/plan command) -├── contracts/ # Phase 1 output (/plan command) -└── tasks.md # Phase 2 output (/tasks command - NOT created by /plan) -``` - -### Source Code (repository root) -``` -# Option 1: Single project (DEFAULT) -src/ -├── models/ -├── services/ -├── cli/ -└── lib/ - -tests/ -├── contract/ -├── integration/ -└── unit/ - -# Option 2: Web application (when "frontend" + "backend" detected) -backend/ -├── src/ -│ ├── models/ -│ ├── services/ -│ └── api/ -└── tests/ - -frontend/ -├── src/ -│ ├── components/ -│ ├── pages/ -│ └── services/ -└── tests/ - -# Option 3: Mobile + API (when "iOS/Android" detected) -api/ -└── [same as backend above] - -ios/ or android/ -└── [platform-specific structure] -``` - -**Structure Decision**: [DEFAULT to Option 1 unless Technical Context indicates web/mobile app] - -## Phase 0: Outline & Research -1. **Extract unknowns from Technical Context** above: - - For each NEEDS CLARIFICATION → research task - - For each dependency → best practices task - - For each integration → patterns task - -2. **Generate and dispatch research agents**: - ``` - For each unknown in Technical Context: - Task: "Research {unknown} for {feature context}" - For each technology choice: - Task: "Find best practices for {tech} in {domain}" - ``` - -3. 
**Consolidate findings** in `research.md` using format: - - Decision: [what was chosen] - - Rationale: [why chosen] - - Alternatives considered: [what else evaluated] - -**Output**: research.md with all NEEDS CLARIFICATION resolved - -## Phase 1: Design & Contracts -*Prerequisites: research.md complete* - -1. **Extract entities from feature spec** → `data-model.md`: - - Entity name, fields, relationships - - Validation rules from requirements - - State transitions if applicable - -2. **Generate API contracts** from functional requirements: - - For each user action → endpoint - - Use standard REST/GraphQL patterns - - Output OpenAPI/GraphQL schema to `/contracts/` - -3. **Generate contract tests** from contracts: - - One test file per endpoint - - Assert request/response schemas - - Tests must fail (no implementation yet) - -4. **Extract test scenarios** from user stories: - - Each story → integration test scenario - - Quickstart test = story validation steps - -5. **Update agent file incrementally** (O(1) operation): - - Run `/scripts/update-agent-context.sh [claude|gemini|copilot]` for your AI assistant - - If exists: Add only NEW tech from current plan - - Preserve manual additions between markers - - Update recent changes (keep last 3) - - Keep under 150 lines for token efficiency - - Output to repository root - -**Output**: data-model.md, /contracts/*, failing tests, quickstart.md, agent-specific file - -## Phase 2: Task Planning Approach -*This section describes what the /tasks command will do - DO NOT execute during /plan* - -**Task Generation Strategy**: -- Load `/templates/tasks-template.md` as base -- Generate tasks from Phase 1 design docs (contracts, data model, quickstart) -- Each contract → contract test task [P] -- Each entity → model creation task [P] -- Each user story → integration test task -- Implementation tasks to make tests pass - -**Ordering Strategy**: -- TDD order: Tests before implementation -- Dependency order: Models before services 
before UI
-- Mark [P] for parallel execution (independent files)
-
-**Estimated Output**: 25-30 numbered, ordered tasks in tasks.md
-
-**IMPORTANT**: This phase is executed by the /tasks command, NOT by /plan
-
-## Phase 3+: Future Implementation
-*These phases are beyond the scope of the /plan command*
-
-**Phase 3**: Task execution (/tasks command creates tasks.md)
-**Phase 4**: Implementation (execute tasks.md following constitutional principles)
-**Phase 5**: Validation (run tests, execute quickstart.md, performance validation)
-
-## Complexity Tracking
-*Fill ONLY if Constitution Check has violations that must be justified*
-
-| Violation | Why Needed | Simpler Alternative Rejected Because |
-|-----------|------------|-------------------------------------|
-| [e.g., 4th project] | [current need] | [why 3 projects insufficient] |
-| [e.g., Repository pattern] | [specific problem] | [why direct DB access insufficient] |
-
-
-## Progress Tracking
-*This checklist is updated during execution flow*
-
-**Phase Status**:
-- [ ] Phase 0: Research complete (/plan command)
-- [ ] Phase 1: Design complete (/plan command)
-- [ ] Phase 2: Task planning complete (/plan command - describe approach only)
-- [ ] Phase 3: Tasks generated (/tasks command)
-- [ ] Phase 4: Implementation complete
-- [ ] Phase 5: Validation passed
-
-**Gate Status**:
-- [ ] Initial Constitution Check: PASS
-- [ ] Post-Design Constitution Check: PASS
-- [ ] All NEEDS CLARIFICATION resolved
-- [ ] Complexity deviations documented
-
----
-*Based on Constitution v2.1.1 - See `/memory/constitution.md`*
diff --git a/templates/spec-template.md b/templates/spec-template.md
deleted file mode 100644
index 94c268c..0000000
--- a/templates/spec-template.md
+++ /dev/null
@@ -1,116 +0,0 @@
-# Feature Specification: [FEATURE NAME]
-
-**Feature Branch**: `[###-feature-name]`
-**Created**: [DATE]
-**Status**: Draft
-**Input**: User description: "$ARGUMENTS"
-
-## Execution Flow (main)
-```
-1. 
Parse user description from Input
-   → If empty: ERROR "No feature description provided"
-2. Extract key concepts from description
-   → Identify: actors, actions, data, constraints
-3. For each unclear aspect:
-   → Mark with [NEEDS CLARIFICATION: specific question]
-4. Fill User Scenarios & Testing section
-   → If no clear user flow: ERROR "Cannot determine user scenarios"
-5. Generate Functional Requirements
-   → Each requirement must be testable
-   → Mark ambiguous requirements
-6. Identify Key Entities (if data involved)
-7. Run Review Checklist
-   → If any [NEEDS CLARIFICATION]: WARN "Spec has uncertainties"
-   → If implementation details found: ERROR "Remove tech details"
-8. Return: SUCCESS (spec ready for planning)
-```
-
----
-
-## ⚡ Quick Guidelines
-- ✅ Focus on WHAT users need and WHY
-- ❌ Avoid HOW to implement (no tech stack, APIs, code structure)
-- 👥 Written for business stakeholders, not developers
-
-### Section Requirements
-- **Mandatory sections**: Must be completed for every feature
-- **Optional sections**: Include only when relevant to the feature
-- When a section doesn't apply, remove it entirely (don't leave as "N/A")
-
-### For AI Generation
-When creating this spec from a user prompt:
-1. **Mark all ambiguities**: Use [NEEDS CLARIFICATION: specific question] for any assumption you'd need to make
-2. **Don't guess**: If the prompt doesn't specify something (e.g., "login system" without auth method), mark it
-3. **Think like a tester**: Every vague requirement should fail the "testable and unambiguous" checklist item
-4. **Common underspecified areas**:
-   - User types and permissions
-   - Data retention/deletion policies
-   - Performance targets and scale
-   - Error handling behaviors
-   - Integration requirements
-   - Security/compliance needs
-
----
-
-## User Scenarios & Testing *(mandatory)*
-
-### Primary User Story
-[Describe the main user journey in plain language]
-
-### Acceptance Scenarios
-1. 
**Given** [initial state], **When** [action], **Then** [expected outcome]
-2. **Given** [initial state], **When** [action], **Then** [expected outcome]
-
-### Edge Cases
-- What happens when [boundary condition]?
-- How does system handle [error scenario]?
-
-## Requirements *(mandatory)*
-
-### Functional Requirements
-- **FR-001**: System MUST [specific capability, e.g., "allow users to create accounts"]
-- **FR-002**: System MUST [specific capability, e.g., "validate email addresses"]
-- **FR-003**: Users MUST be able to [key interaction, e.g., "reset their password"]
-- **FR-004**: System MUST [data requirement, e.g., "persist user preferences"]
-- **FR-005**: System MUST [behavior, e.g., "log all security events"]
-
-*Example of marking unclear requirements:*
-- **FR-006**: System MUST authenticate users via [NEEDS CLARIFICATION: auth method not specified - email/password, SSO, OAuth?]
-- **FR-007**: System MUST retain user data for [NEEDS CLARIFICATION: retention period not specified]
-
-### Key Entities *(include if feature involves data)*
-- **[Entity 1]**: [What it represents, key attributes without implementation]
-- **[Entity 2]**: [What it represents, relationships to other entities]
-
----
-
-## Review & Acceptance Checklist
-*GATE: Automated checks run during main() execution*
-
-### Content Quality
-- [ ] No implementation details (languages, frameworks, APIs)
-- [ ] Focused on user value and business needs
-- [ ] Written for non-technical stakeholders
-- [ ] All mandatory sections completed
-
-### Requirement Completeness
-- [ ] No [NEEDS CLARIFICATION] markers remain
-- [ ] Requirements are testable and unambiguous
-- [ ] Success criteria are measurable
-- [ ] Scope is clearly bounded
-- [ ] Dependencies and assumptions identified
-
----
-
-## Execution Status
-*Updated by main() during processing*
-
-- [ ] User description parsed
-- [ ] Key concepts extracted
-- [ ] Ambiguities marked
-- [ ] User scenarios defined
-- [ ] Requirements generated
-- 
[ ] Entities identified
-- [ ] Review checklist passed
-
----
diff --git a/templates/tasks-template.md b/templates/tasks-template.md
deleted file mode 100644
index e10f9c6..0000000
--- a/templates/tasks-template.md
+++ /dev/null
@@ -1,127 +0,0 @@
-# Tasks: [FEATURE NAME]
-
-**Input**: Design documents from `/specs/[###-feature-name]/`
-**Prerequisites**: plan.md (required), research.md, data-model.md, contracts/
-
-## Execution Flow (main)
-```
-1. Load plan.md from feature directory
-   → If not found: ERROR "No implementation plan found"
-   → Extract: tech stack, libraries, structure
-2. Load optional design documents:
-   → data-model.md: Extract entities → model tasks
-   → contracts/: Each file → contract test task
-   → research.md: Extract decisions → setup tasks
-3. Generate tasks by category:
-   → Setup: project init, dependencies, linting
-   → Tests: contract tests, integration tests
-   → Core: models, services, CLI commands
-   → Integration: DB, middleware, logging
-   → Polish: unit tests, performance, docs
-4. Apply task rules:
-   → Different files = mark [P] for parallel
-   → Same file = sequential (no [P])
-   → Tests before implementation (TDD)
-5. Number tasks sequentially (T001, T002...)
-6. Generate dependency graph
-7. Create parallel execution examples
-8. Validate task completeness:
-   → All contracts have tests?
-   → All entities have models?
-   → All endpoints implemented?
-9. Return: SUCCESS (tasks ready for execution)
-```
-
-## Format: `[ID] [P?] 
Description`
-- **[P]**: Can run in parallel (different files, no dependencies)
-- Include exact file paths in descriptions
-
-## Path Conventions
-- **Single project**: `src/`, `tests/` at repository root
-- **Web app**: `backend/src/`, `frontend/src/`
-- **Mobile**: `api/src/`, `ios/src/` or `android/src/`
-- Paths shown below assume single project - adjust based on plan.md structure
-
-## Phase 3.1: Setup
-- [ ] T001 Create project structure per implementation plan
-- [ ] T002 Initialize [language] project with [framework] dependencies
-- [ ] T003 [P] Configure linting and formatting tools
-
-## Phase 3.2: Tests First (TDD) ⚠️ MUST COMPLETE BEFORE 3.3
-**CRITICAL: These tests MUST be written and MUST FAIL before ANY implementation**
-- [ ] T004 [P] Contract test POST /api/users in tests/contract/test_users_post.py
-- [ ] T005 [P] Contract test GET /api/users/{id} in tests/contract/test_users_get.py
-- [ ] T006 [P] Integration test user registration in tests/integration/test_registration.py
-- [ ] T007 [P] Integration test auth flow in tests/integration/test_auth.py
-
-## Phase 3.3: Core Implementation (ONLY after tests are failing)
-- [ ] T008 [P] User model in src/models/user.py
-- [ ] T009 [P] UserService CRUD in src/services/user_service.py
-- [ ] T010 [P] CLI --create-user in src/cli/user_commands.py
-- [ ] T011 POST /api/users endpoint
-- [ ] T012 GET /api/users/{id} endpoint
-- [ ] T013 Input validation
-- [ ] T014 Error handling and logging
-
-## Phase 3.4: Integration
-- [ ] T015 Connect UserService to DB
-- [ ] T016 Auth middleware
-- [ ] T017 Request/response logging
-- [ ] T018 CORS and security headers
-
-## Phase 3.5: Polish
-- [ ] T019 [P] Unit tests for validation in tests/unit/test_validation.py
-- [ ] T020 Performance tests (<200ms)
-- [ ] T021 [P] Update docs/api.md
-- [ ] T022 Remove duplication
-- [ ] T023 Run manual-testing.md
-
-## Dependencies
-- Tests (T004-T007) before implementation (T008-T014)
-- T008 blocks T009, T015
-- T016 blocks 
T018
-- Implementation before polish (T019-T023)
-
-## Parallel Example
-```
-# Launch T004-T007 together:
-Task: "Contract test POST /api/users in tests/contract/test_users_post.py"
-Task: "Contract test GET /api/users/{id} in tests/contract/test_users_get.py"
-Task: "Integration test registration in tests/integration/test_registration.py"
-Task: "Integration test auth in tests/integration/test_auth.py"
-```
-
-## Notes
-- [P] tasks = different files, no dependencies
-- Verify tests fail before implementing
-- Commit after each task
-- Avoid: vague tasks, same file conflicts
-
-## Task Generation Rules
-*Applied during main() execution*
-
-1. **From Contracts**:
-   - Each contract file → contract test task [P]
-   - Each endpoint → implementation task
-
-2. **From Data Model**:
-   - Each entity → model creation task [P]
-   - Relationships → service layer tasks
-
-3. **From User Stories**:
-   - Each story → integration test [P]
-   - Quickstart scenarios → validation tasks
-
-4. **Ordering**:
-   - Setup → Tests → Models → Services → Endpoints → Polish
-   - Dependencies block parallel execution
-
-## Validation Checklist
-*GATE: Checked by main() before returning*
-
-- [ ] All contracts have corresponding tests
-- [ ] All entities have model tasks
-- [ ] All tests come before implementation
-- [ ] Parallel tasks truly independent
-- [ ] Each task specifies exact file path
-- [ ] No task modifies same file as another [P] task
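
The last item of the removed template's Validation Checklist (no [P] task modifies the same file as another [P] task) lends itself to a mechanical check. The following is a minimal Python sketch, not part of the deleted files: it assumes task lines follow the `- [ ] T004 [P] ... in path/to/file.py` shape used in the template's examples, and the names `TASK_RE`, `PATH_RE`, and `parallel_conflicts` are hypothetical. Tasks that name no file after " in " are ignored.

```python
import re
from collections import defaultdict

# Assumed task line shape (taken from the template's examples):
#   "- [ ] T004 [P] Contract test POST /api/users in tests/contract/test_users_post.py"
TASK_RE = re.compile(r"^- \[[ xX]\] (?P<id>T\d{3})(?P<parallel> \[P\])? (?P<desc>.+)$")
# Assumption: the target file is the last token, introduced by " in ".
PATH_RE = re.compile(r"\bin (?P<path>\S+\.\w+)$")


def parallel_conflicts(lines):
    """Return {path: [task ids]} for files named by more than one [P] task."""
    by_path = defaultdict(list)
    for line in lines:
        task = TASK_RE.match(line.strip())
        if task is None or not task.group("parallel"):
            continue  # not a task line, or a sequential task
        target = PATH_RE.search(task.group("desc"))
        if target is not None:
            by_path[target.group("path")].append(task.group("id"))
    return {path: ids for path, ids in by_path.items() if len(ids) > 1}
```

An empty result means the parallel tasks that name a file are independent at the file level; it says nothing about logical dependencies between tasks, which the template tracks separately under Dependencies.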