264 changes: 264 additions & 0 deletions AGENTS.md
@@ -0,0 +1,264 @@
# GSMA Dataset Creation - Agent Context

## Project Overview

A pipeline for creating synthetic datasets from GSMA telecom documentation. It scrapes documents, processes them into chunks, generates synthetic Q&A pairs, validates their quality, and publishes training datasets.

## Architecture

- **Language**: Python 3.12+ with strict type hints
- **CLI**: typer with DVC pipeline integration
- **Testing**: pytest with TDD methodology
- **Processing**: chonkie (chunking), FAISS (similarity), SetFit (filtering)

## Main Pipelines

### PRD Pipeline (`pipelines/prd/dvc.yaml`) **[CONSOLIDATED]**
Unified end-to-end pipeline for technical specifications (32 DOCX documents). It replaces the former chunker, questions, similarity, filters, and validation pipelines.

**15 Stages**:
1. **process_documents**: DOCX → Markdown (data/raw → data/processed)
2. **create_late_chunks** (5×): Late chunking at 500/1000/2000/3000/4000 tokens
3. **generate_questions** (5×): Synthetic Q&A with Cerebras GPT-OSS-120B (5/10/20/30/40 questions per chunk, one setting per chunk size)
4. **data_combiner**: Merge all chunks + questions with working group classification
5. **similarity_hasher**: Add SHA-256 content hashes
6. **similarity_ranker**: FAISS IVFFlat top-K (k=20, threshold=0.3)
7. **overlap_detector**: Character offset-based text overlaps (min 50 chars)
8. **explode_questions**: Question-centric format (min-similarity: 0.35, max: 0.95)
9. **apply_question_filter**: External reference classifier
10. **apply_chunk_filter**: Procedures classifier + keyword exclusion (`prd@gsma.com`)
11. **filter_questions_by_chunk_quality**: Combined quality filtering (min prob: 0.5)
12. **validate_requests**: LLM validation with Qwen 235B via Cerebras (50 concurrent, 50k limit)
13. **create_validation_dataset**: Dual format (embedding + QA, max 3 positives/negatives)
14. **upload_embedding_dataset**: → mantisnlp/gsma_prd_synthetic_embedding
15. **upload_qa_dataset**: → mantisnlp/gsma_prd_synthetic_qa
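
As a point of reference for stage 6, the sketch below ranks candidate vectors by cosine similarity with a brute-force scan and keeps the top `k` results at or above a threshold. It is not the pipeline's FAISS IVFFlat index, which approximates this search by probing only some clusters. The function names and the reading of `threshold=0.3` as a minimum similarity are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(
    query: list[float],
    candidates: list[list[float]],
    k: int = 20,
    threshold: float = 0.3,
) -> list[tuple[int, float]]:
    """Return up to k (index, score) pairs with similarity >= threshold."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(candidates)]
    kept = [(i, score) for i, score in scored if score >= threshold]
    # Highest similarity first; ties broken by index for determinism.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:k]
```

An exact scan like this is useful mainly as a correctness baseline when tuning an approximate index.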

**Data Paths**: `data/prd/` (chunks, questions, parquets, validation)
**Metrics Paths**: `metrics/prd/` (all stage metrics)

**Usage**: `dvc repro pipelines/prd/dvc.yaml` or `cd pipelines/prd && dvc repro`

**Key Features**:
- Variables: `data_prefix: data/prd`, `metrics_prefix: metrics/prd`
- Cerebras provider for question generation and validation
- Keyword filter for GSMA template boilerplate
- Min-similarity-score: 0.35 (validation pipeline setting)
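
The similarity window used by `explode_questions` can be illustrated with a small sketch. It assumes that each chunk row carries its questions and a list of ranked neighbours, and that neighbours outside the 0.35 to 0.95 window are dropped as candidates. The field names and the `question_id` scheme are hypothetical, not the project's actual schema.

```python
def explode_questions(
    chunks: list[dict],
    min_similarity: float = 0.35,
    max_similarity: float = 0.95,
) -> list[dict]:
    """Turn chunk-centric rows into one row per question.

    Each input chunk is a dict with hypothetical keys:
    chunk_id, questions (list of str), neighbours (list of (chunk_id, score)).
    """
    rows = []
    for chunk in chunks:
        # Keep only neighbours whose score falls inside the similarity window.
        candidates = [
            neighbour_id
            for neighbour_id, score in chunk["neighbours"]
            if min_similarity <= score <= max_similarity
        ]
        for position, question in enumerate(chunk["questions"]):
            rows.append(
                {
                    "question_id": f'{chunk["chunk_id"]}_q{position}',
                    "question": question,
                    "source_chunk_id": chunk["chunk_id"],
                    "candidate_chunk_ids": candidates,
                }
            )
    return rows
```

Under this reading, the upper bound excludes near-duplicates of the source chunk and the lower bound excludes chunks too unrelated to be worth validating.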

### Discover Pipeline (`pipelines/discover/dvc.yaml`)
End-to-end pipeline for reports/whitepapers (304 PDF/DOCX documents):
1. **Scrape**: Playwright automation (PRD + Discover pages, bypasses Cloudflare)
2. **Deduplicate**: Hash-based deduplication across sources
3. **Process**: PDF/DOCX → Markdown (PyMuPDF for PDFs)
4. **Chunk → Validation**: Same as PRD pipeline
5. **Datasets**: Dual format → HuggingFace Hub
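
A minimal sketch of the hash-based deduplication step, assuming that documents scraped from different sources count as duplicates when their bytes are identical. The choice of SHA-256 is borrowed from the PRD pipeline's content hashing and may not match what the discover pipeline uses; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def deduplicate(paths: list[Path]) -> list[Path]:
    """Keep the first path seen for each distinct content hash."""
    seen: set[str] = set()
    unique: list[Path] = []
    for path in paths:
        key = file_digest(path)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```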

**Usage**: `dvc repro discover` or `cd pipelines/discover && dvc repro`

**Outputs**:
- `mantisnlp/gsma_discover_synthetic_embedding`: Contrastive learning format
- `mantisnlp/gsma_discover_synthetic_qa`: RAG/QA format

### Annotation Pipeline (`pipelines/annotation/dvc.yaml`)
Human validation workflow with subgroup-based tasks (TSG, FASG, NG, RCS, eSim):
- Downloads datasets from HuggingFace Hub
- Adds working group/subgroup classifications
- Creates Argilla workspaces with domain-specific samples
- Credentials: username=subgroup, password=subgroup-gsma

**Usage**: `dvc repro annotation:add_subgroups` (the upload stages are frozen, so `dvc repro` skips them)

**Argilla CLI Commands** (`gsma argilla [command]`):
- `upload`: Upload validation dataset to Argilla
- `delete`: Delete dataset from Argilla
- `download`: Download annotated dataset from Argilla
- `upload-by-subgroup`: Upload filtered dataset for subgroup annotation
- `delete-workspace`: Delete workspace and associated user
- `add-users`: Create multiple users with random secure passwords (e.g., 10 users for a workshop)
- `add-user`: Create single user with multi-workspace support
- `add-to-workspace`: Add existing user to multiple workspaces
- `list-users`: List all users in a workspace
- `list-workspaces`: List all available workspaces
- `list-datasets`: List all datasets in a workspace
- `track-progress`: Monitor annotation progress (optimized for instant response)
- `delete-user`: Remove user from Argilla

**User Management Examples**:
```bash
# Create 10 users with random secure passwords
gsma argilla add-users -w TSG --count 10 --output-csv users.csv

# Create user in multiple workspaces
gsma argilla add-user -u alice -p secret123 -w TSG -w FASG -w NG

# Track annotation progress
gsma argilla track-progress -w TSG
```
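
How `add-users` generates passwords is not shown in this file. The sketch below is one common approach using the standard-library `secrets` module, with the result written as CSV text. The function names, CSV columns, and username scheme are all hypothetical.

```python
import csv
import io
import secrets
import string


def generate_password(length: int = 16) -> str:
    """Draw each character from a cryptographically secure RNG."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def build_users_csv(workspace: str, count: int) -> str:
    """Return CSV text with one generated user per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["username", "password", "workspace"])
    for index in range(1, count + 1):
        # Hypothetical naming scheme: <workspace>-user-<n>.
        username = f"{workspace.lower()}-user-{index}"
        writer.writerow([username, generate_password(), workspace])
    return buffer.getvalue()
```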

## Key Components

### Chunking (`gsma_dataset_creation/chunker.py`)
- Late chunking with embedded context preservation
- Configurable chunk sizes (500-4000 tokens)
- Embedding model: sentence-transformers/all-MiniLM-L6-v2
- CLI: `gsma chunk <input> <output> --chunker late --chunker-config '{...}'`
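
Late chunking embeds the whole document at token level first and only then pools the token vectors per chunk, so each chunk vector reflects document-wide context. The project delegates this to chonkie; the sketch below shows only the pooling step on toy vectors. The function name and the fixed-size spans are assumptions.

```python
def late_chunk_embeddings(
    token_vectors: list[list[float]], chunk_size: int
) -> list[list[float]]:
    """Mean-pool contextual token vectors into one vector per chunk.

    token_vectors: per-token embeddings from a single pass over the whole
    document, so every vector already reflects document context.
    chunk_size: number of tokens per chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pooled = []
    for start in range(0, len(token_vectors), chunk_size):
        span = token_vectors[start:start + chunk_size]
        dims = len(span[0])
        pooled.append(
            [sum(vec[d] for vec in span) / len(span) for d in range(dims)]
        )
    return pooled
```

The contrast with conventional chunking is the order of operations: there, text is split first and each piece is embedded in isolation.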

### Question Generation (`gsma_dataset_creation/qa_generator.py`)
- Synthetic Q&A using OpenRouter API (Cerebras provider)
- Concurrent processing with rate limiting
- Resumable (auto-resumes from last processed chunk)
- CLI: `gsma questions generate-from-chunks <input> <output> --num-questions 5`
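
Resuming from the last processed chunk can be sketched as a skip-existing loop. This assumes a JSONL output file with one record per chunk and a `generate` callback standing in for the API call; both are illustrative choices, not the actual layout of `qa_generator.py`.

```python
import json
from collections.abc import Callable
from pathlib import Path


def processed_chunk_ids(output_path: Path) -> set[str]:
    """Collect chunk IDs already present in a JSONL output file."""
    if not output_path.exists():
        return set()
    done: set[str] = set()
    with output_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                done.add(json.loads(line)["chunk_id"])
    return done


def generate_resumable(
    chunks: list[dict],
    output_path: Path,
    generate: Callable[[dict], list[str]],
) -> int:
    """Append results for chunks not yet in the output; return the count."""
    done = processed_chunk_ids(output_path)
    written = 0
    with output_path.open("a", encoding="utf-8") as handle:
        for chunk in chunks:
            if chunk["chunk_id"] in done:
                continue
            record = {"chunk_id": chunk["chunk_id"], "questions": generate(chunk)}
            handle.write(json.dumps(record) + "\n")
            # Flush after each record so progress survives an interruption.
            handle.flush()
            written += 1
    return written
```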

### Similarity Analysis (`gsma_dataset_creation/similarity/`)
- Data combination with working group classification
- SHA-256 content hashing for deduplication
- FAISS IVFFlat similarity ranking (top-K)
- Character offset-based overlap detection
- CLI: `gsma similarity combine`, `hash`, `rank`, `detect-overlaps`
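
Character offset-based overlap detection reduces to interval intersection. The sketch below assumes each chunk records half-open `start` and `end` character offsets into its source document, and reports pairs that share at least 50 characters. The field names are hypothetical.

```python
def overlap_length(span_a: tuple[int, int], span_b: tuple[int, int]) -> int:
    """Number of characters shared by two half-open (start, end) spans."""
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


def detect_overlaps(
    chunks: list[dict], min_chars: int = 50
) -> list[tuple[str, str, int]]:
    """Return (id_a, id_b, shared_chars) for same-document chunk pairs
    whose character ranges overlap by at least min_chars."""
    overlaps = []
    for i, first in enumerate(chunks):
        for second in chunks[i + 1:]:
            # Offsets are only comparable within the same source document.
            if first["source_document"] != second["source_document"]:
                continue
            shared = overlap_length(
                (first["start"], first["end"]),
                (second["start"], second["end"]),
            )
            if shared >= min_chars:
                overlaps.append((first["chunk_id"], second["chunk_id"], shared))
    return overlaps
```

The pairwise loop is quadratic in the number of chunks; sorting by start offset within each document would be the usual optimisation for large inputs.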

### Quality Filtering (`gsma_dataset_creation/filters_cli.py`)
- **Chunk filter**: Procedures classifier (filters legal/procedural content)
- **Question filter**: External reference classifier (filters unavailable content)
- **Keyword filter**: Exclude matches (e.g., "prd@gsma.com" boilerplate)
- Pre-trained SetFit models in `models/filters/`
- CLI: `gsma filters apply-chunk-filter`, `apply-question-filter`
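
The keyword exclusion and probability threshold can be combined in a single predicate. In this sketch, `keep_probability` stands in for the SetFit classifier's score, which is not computed here. Case-insensitive substring matching and the direction of the 0.5 threshold are assumptions.

```python
def passes_chunk_filter(
    text: str,
    keep_probability: float,
    exclude_matches: tuple[str, ...] = ("prd@gsma.com",),
    min_probability: float = 0.5,
) -> bool:
    """Keep a chunk only if no excluded keyword appears and the
    classifier score is at or above the minimum probability."""
    lowered = text.lower()
    if any(keyword.lower() in lowered for keyword in exclude_matches):
        return False
    return keep_probability >= min_probability
```

Checking keywords first is cheap and avoids scoring chunks that would be dropped anyway.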

### Validation (`gsma_dataset_creation/validation_cli.py`)
- Individual evaluation (no batching as of PR #71)
- SQLite checkpointing for resumability
- Concurrent processing with asyncio
- Error categorization (rate limits, server errors, timeouts)
- CLI: `gsma validation explode-questions`, `validate-requests`

### Dataset Creation (`gsma_dataset_creation/datasets_cli.py`)
- **Embedding format**: Contrastive learning (Question, Positive_Chunks, Negative_Chunks, Answer, Metadata)
- **QA format**: RAG training (Question, Content, Answer, Metadata)
- Quality-based filtering (max positives/negatives)
- Comprehensive histogram metrics
- CLI: `gsma datasets create-from-validation`
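
A minimal sketch of how one validated question could be turned into both formats, with positives and negatives capped at three each. It assumes the QA format's `Content` holds the same chunks as `Positive_Chunks`, and that the input is already sorted best-first. The function name and input shape are hypothetical.

```python
def build_records(
    question: str,
    answer: str,
    validated_chunks: list[tuple[str, bool]],
    metadata: dict,
    max_positives: int = 3,
    max_negatives: int = 3,
) -> tuple[dict, dict]:
    """Build embedding-format and QA-format records.

    validated_chunks: (text, is_relevant) pairs, assumed sorted best-first
    so that truncation keeps the strongest examples.
    """
    positives = [text for text, ok in validated_chunks if ok][:max_positives]
    negatives = [text for text, ok in validated_chunks if not ok][:max_negatives]
    embedding_record = {
        "Question": question,
        "Positive_Chunks": positives,
        "Negative_Chunks": negatives,
        "Answer": answer,
        "Metadata": metadata,
    }
    qa_record = {
        "Question": question,
        "Content": positives,
        "Answer": answer,
        "Metadata": metadata,
    }
    return embedding_record, qa_record
```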

## Dataset Formats

### Embedding Format
```json
{
  "Question": "What is 5G network slicing?",
  "Positive_Chunks": ["5G network slicing allows..."],
  "Negative_Chunks": ["eSIM technology...", "WiFi 6..."],
  "Answer": "5G network slicing allows...",
  "Metadata": {
    "source_document": "TS.23 v7.0.md",
    "working_group": "TSG",
    "chunk_id": "TS.23 v7.0.md_500_15"
  }
}
```

### QA Format
```json
{
  "Question": "What is 5G network slicing?",
  "Content": ["5G network slicing allows..."],
  "Answer": "5G network slicing allows...",
  "Metadata": {
    "source_document": "TS.23 v7.0.md",
    "working_group": "TSG",
    "chunk_id": "TS.23 v7.0.md_500_15"
  }
}
```
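
The example `chunk_id` values suggest a `<source_document>_<chunk_size>_<index>` layout. That layout is inferred from the examples rather than documented, so the parser below is a sketch under that assumption. It splits from the right so that underscores inside the document name survive.

```python
def parse_chunk_id(chunk_id: str) -> tuple[str, int, int]:
    """Split '<source_document>_<chunk_size>_<index>' from the right."""
    source_document, chunk_size, index = chunk_id.rsplit("_", 2)
    return source_document, int(chunk_size), int(index)
```

For the ID shown above, this yields the document `TS.23 v7.0.md`, a chunk size of 500 tokens, and index 15.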

## Environment Variables

- `OPENROUTER_API_KEY`: Cerebras/OpenRouter API access (question generation, validation)
- `ARGILLA_API_URL`: Argilla service URL (https://mantisnlp-annotate.hf.space)
- `ARGILLA_API_KEY`: Argilla API authentication
- `HUGGINGFACE_TOKEN`: HuggingFace Hub operations (dataset upload/download)

## Critical Rules

**NEVER use `git push --no-verify` or `git commit --no-verify`**
- All pre-commit hooks must pass
- Fix issues rather than bypassing checks

**NEVER use `git add .` - always add files individually**
- Use `git add specific_file.py`
- Review each file before staging

**ALWAYS update AGENTS.md with significant changes**
- Update after each PR merge
- Preserve existing entries unless obsolete
- This file is the project's living memory

## Deprecated Pipelines

The following individual pipelines have been **consolidated into `pipelines/prd/dvc.yaml`** and deleted:
- ~~`pipelines/chunker/`~~ → Stages 1-2 in PRD pipeline
- ~~`pipelines/questions/`~~ → Stage 3 in PRD pipeline
- ~~`pipelines/similarity/`~~ → Stages 4-7 in PRD pipeline
- ~~`pipelines/filters/`~~ → Stages 9-11 in PRD pipeline
- ~~`pipelines/validation/`~~ → Stages 8, 12-15 in PRD pipeline

**Remaining pipelines**:
- `pipelines/prd/` - Consolidated PRD pipeline (primary)
- `pipelines/discover/` - Separate discover document pipeline
- `pipelines/annotation/` - Human annotation workflow
- `pipelines/datasets/` - Legacy question-based dataset creation

## Recent Changes

**Pipeline Consolidation** (Oct 2025): Consolidated PRD pipeline
- Created unified `pipelines/prd/dvc.yaml` with 15 stages
- Consolidated chunker, questions, similarity, filters, validation pipelines
- Data migrated to `data/prd/`, metrics to `metrics/prd/`
- Added keyword exclusion filter (`--exclude-matches "prd@gsma.com"`)
- Added Cerebras provider to validation stage
- Used `dvc commit --force` to register existing outputs (no re-execution of expensive stages)
- Deleted deprecated individual pipeline directories

**PR #75** (Oct 2025): Prepare discover pipeline for validation
- Added working group classification to discover data combiner
- Fixed variable shadowing bug in explode-questions (question_id collisions)
- Added keyword exclusion filter (default: "prd@gsma.com" boilerplate)
- Increased min-similarity-score from 0.35 to 0.45
- Added error type breakdown to validation metrics (rate limits, server errors, etc.)
- Removed arbitrary character limit on reasoning field (was 800 chars)
- Configured dual-format dataset creation (embedding + QA)
- Separate HuggingFace uploads for each format

**PR #73** (Oct 2025): Complete discover question generation pipeline
- Full end-to-end discover pipeline (23 stages)
- PDF processing support via PyMuPDF
- Metrics standardization with subdirectory structure

**PR #71** (Oct 2025): Refactor validation to individual evaluation
- Removed batching (was 3-10 candidates per API call)
- Individual chunk evaluation against questions
- Simplified validation logic and checkpoint management

**PR #70** (Oct 2025): Discover pipeline expansion
- Added scraping, deduplication, processing, chunking, questions, similarity, filtering, validation, dataset creation
- Configurable metrics output path for chunk command
- PDF and DOCX support in document processor

**Earlier Features**:
- Annotation pipeline with subgroup-based Argilla tasks
- Document scraping with Playwright (Cloudflare bypass)
- Batched validation with SQLite checkpointing (deprecated in #71)
- Quality filtering with SetFit classifiers
- Working group/subgroup classification
- Dual dataset formats (embedding + QA)

## Performance Notes

**PRD Pipeline** (~1000 documents): 9-17 hours total
- Document processing: ~5 min
- Chunking: ~10 min
- Question generation: 2-4 hours (Cerebras speed-dependent)
- Similarity: ~30 min
- Filtering: ~15 min
- Validation: 6-12 hours (concurrency-dependent)
- Dataset creation: ~5 min
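
The 9-17 hour total follows from summing the per-stage estimates. The short check below copies those estimates in minutes and adds up the lower and upper bounds; the dictionary keys are illustrative labels, not stage names from `dvc.yaml`.

```python
# (min_minutes, max_minutes) per stage, copied from the estimates above.
STAGE_MINUTES = {
    "document_processing": (5, 5),
    "chunking": (10, 10),
    "question_generation": (120, 240),
    "similarity": (30, 30),
    "filtering": (15, 15),
    "validation": (360, 720),
    "dataset_creation": (5, 5),
}


def total_hours(stages: dict[str, tuple[int, int]]) -> tuple[float, float]:
    """Sum the lower and upper bounds and convert minutes to hours."""
    low = sum(minimum for minimum, _ in stages.values())
    high = sum(maximum for _, maximum in stages.values())
    return low / 60, high / 60
```

The bounds come to 545 and 1025 minutes, roughly 9.1 and 17.1 hours, with validation and question generation accounting for almost all of the range.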

**Resumability**: All long-running stages can resume after an interruption, either from checkpoints or by skipping outputs that already exist.

**Concurrent Processing**: Validation supports up to 50 concurrent requests (asyncio).