diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..acb3bf2
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,264 @@
+# GSMA Dataset Creation - Agent Context
+
+## Project Overview
+
+GSMA synthetic dataset creation pipeline for telecom documentation. Scrapes documents, processes them into chunks, generates synthetic Q&A pairs, validates quality, and publishes training datasets.
+
+## Architecture
+
+- **Language**: Python 3.12+ with strict type hints
+- **CLI**: typer with DVC pipeline integration
+- **Testing**: pytest with TDD methodology
+- **Processing**: chonkie (chunking), FAISS (similarity), SetFit (filtering)
+
+## Main Pipelines
+
+### PRD Pipeline (`pipelines/prd/dvc.yaml`) **[CONSOLIDATED]**
+Unified end-to-end pipeline for technical specifications (32 DOCX documents). It replaces the former chunker, questions, similarity, filters, and validation pipelines.
+
+**15 Stages**:
+1. **process_documents**: DOCX → Markdown (data/raw → data/processed)
+2. **create_late_chunks** (5×): Late chunking at 500/1000/2000/3000/4000 tokens
+3. **generate_questions** (5×): Synthetic Q&A with Cerebras GPT-OSS-120B (5/10/20/30/40 questions per chunk for the respective chunk sizes)
+4. **data_combiner**: Merge all chunks + questions with working group classification
+5. **similarity_hasher**: Add SHA-256 content hashes
+6. **similarity_ranker**: FAISS IVFFlat top-K (k=20, threshold=0.3)
+7. **overlap_detector**: Character offset-based text overlaps (min 50 chars)
+8. **explode_questions**: Question-centric format (min-similarity: 0.35, max: 0.95)
+9. **apply_question_filter**: External reference classifier
+10. **apply_chunk_filter**: Procedures classifier + keyword exclusion (`prd@gsma.com`)
+11. **filter_questions_by_chunk_quality**: Combined quality filtering (min prob: 0.5)
+12. **validate_requests**: LLM validation with Qwen 235B via Cerebras (50 concurrent, 50k limit)
+13. **create_validation_dataset**: Dual format (embedding + QA, max 3 positives/negatives)
+14. **upload_embedding_dataset**: → mantisnlp/gsma_prd_synthetic_embedding
+15. **upload_qa_dataset**: → mantisnlp/gsma_prd_synthetic_qa
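+
+Stage 7 works on character offsets rather than on the text itself. A minimal sketch of that idea, assuming each chunk records `start`/`end` offsets into its source document. The `Chunk` class, its field names, and `find_overlaps` are hypothetical and not the pipeline's actual API:
+
+```python
+from dataclasses import dataclass
+from itertools import combinations
+
+
+@dataclass(frozen=True)
+class Chunk:
+    """Hypothetical chunk record; the field names are assumptions."""
+    chunk_id: str
+    doc_id: str
+    start: int  # inclusive character offset in the source document
+    end: int    # exclusive character offset
+
+
+def overlap_chars(a: Chunk, b: Chunk) -> int:
+    """Number of characters shared by two chunks of the same document."""
+    if a.doc_id != b.doc_id:
+        return 0
+    return max(0, min(a.end, b.end) - max(a.start, b.start))
+
+
+def find_overlaps(chunks: list[Chunk], min_chars: int = 50) -> list[tuple[str, str, int]]:
+    """Return (id_a, id_b, shared_chars) for pairs sharing at least min_chars."""
+    pairs = []
+    for a, b in combinations(chunks, 2):
+        shared = overlap_chars(a, b)
+        if shared >= min_chars:
+            pairs.append((a.chunk_id, b.chunk_id, shared))
+    return pairs
+
+
+chunks = [
+    Chunk("c500-0", "doc1", 0, 500),
+    Chunk("c1000-0", "doc1", 0, 1000),
+    Chunk("c500-1", "doc1", 480, 980),
+    Chunk("other", "doc2", 0, 500),
+]
+overlaps = find_overlaps(chunks)
+```
+
+The pairwise loop is quadratic, which is fine for a sketch; grouping chunks by document and sorting by `start` would likely scale better on the full corpus.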
+
+**Data Paths**: `data/prd/` (chunks, questions, parquets, validation)
+**Metrics Paths**: `metrics/prd/` (all stage metrics)
+
+**Usage**: `dvc repro pipelines/prd/dvc.yaml` or `cd pipelines/prd && dvc repro`
+
+**Key Features**:
+- Variables: `data_prefix: data/prd`, `metrics_prefix: metrics/prd`
+- Cerebras provider for question generation and validation
+- Keyword filter for GSMA template boilerplate
+- Min-similarity-score: 0.35 (validation pipeline setting)
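+
+One possible reading of the similarity bounds, sketched under assumptions: a question-chunk pair is kept only when its score lies inside the 0.35 to 0.95 band, with both ends inclusive. The row layout and the `keep_pair` helper are hypothetical:
+
+```python
+MIN_SIMILARITY = 0.35  # lower bound quoted in this document
+MAX_SIMILARITY = 0.95  # upper bound quoted for explode_questions
+
+
+def keep_pair(score: float) -> bool:
+    """Assumed rule: keep a pair whose score lies inside the band."""
+    return MIN_SIMILARITY <= score <= MAX_SIMILARITY
+
+
+# Hypothetical (question_id, chunk_id, similarity) rows.
+rows = [("q1", "chunk-a", 0.97), ("q1", "chunk-b", 0.62), ("q1", "chunk-c", 0.20)]
+kept = [row for row in rows if keep_pair(row[2])]
+```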
+
+### Discover Pipeline (`pipelines/discover/dvc.yaml`)
+End-to-end pipeline for reports/whitepapers (304 PDF/DOCX documents):
+1. **Scrape**: Playwright automation (PRD + Discover pages, bypasses Cloudflare)
+2. **Deduplicate**: Hash-based deduplication across sources
+3. **Process**: PDF/DOCX → Markdown (PyMuPDF for PDFs)
+4. **Chunk → Validation**: Same as PRD pipeline
+5. **Datasets**: Dual format → HuggingFace Hub
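+
+The deduplication step can be pictured as hashing file contents and keeping the first file seen per hash. This is a sketch, not the pipeline's code: the choice of SHA-256 for this step and the `file_sha256` and `deduplicate` helpers are assumptions.
+
+```python
+import hashlib
+from pathlib import Path
+
+
+def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
+    """Hash a file in blocks so large PDFs are not loaded into memory at once."""
+    digest = hashlib.sha256()
+    with path.open("rb") as handle:
+        for block in iter(lambda: handle.read(block_size), b""):
+            digest.update(block)
+    return digest.hexdigest()
+
+
+def deduplicate(paths: list[Path]) -> list[Path]:
+    """Keep the first path for each distinct content hash, preserving order."""
+    seen: set[str] = set()
+    unique: list[Path] = []
+    for path in paths:
+        key = file_sha256(path)
+        if key not in seen:
+            seen.add(key)
+            unique.append(path)
+    return unique
+```
+
+Byte-level hashing only catches exact copies; the same report exported as both PDF and DOCX would still count as two documents.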
+
+**Usage**: `dvc repro discover` or `cd pipelines/discover && dvc repro`
+
+**Outputs**:
+- `mantisnlp/gsma_discover_synthetic_embedding`: Contrastive learning format
+- `mantisnlp/gsma_discover_synthetic_qa`: RAG/QA format
+
+### Annotation Pipeline (`pipelines/annotation/dvc.yaml`)
+Human validation workflow with subgroup-based tasks (TSG, FASG, NG, RCS, eSim):
+- Downloads datasets from HuggingFace Hub
+- Adds working group/subgroup classifications
+- Creates Argilla workspaces with domain-specific samples
+- Credentials: username=subgroup, password=subgroup-gsma
+
+**Usage**: `dvc repro annotation:add_subgroups` (the upload stages are frozen)
+
+**Argilla CLI Commands** (`gsma argilla [command]`):
+- `upload`: Upload validation dataset to Argilla
+- `delete`: Delete dataset from Argilla
+- `download`: Download annotated dataset from Argilla
+- `upload-by-subgroup`: Upload filtered dataset for subgroup annotation
+- `delete-workspace`: Delete workspace and associated user
+- `add-users`: Create multiple users with random secure passwords (e.g., 10 users for a workshop)
+- `add-user`: Create single user with multi-workspace support
+- `add-to-workspace`: Add existing user to multiple workspaces
+- `list-users`: List all users in a workspace
+- `list-workspaces`: List all available workspaces
+- `list-datasets`: List all datasets in a workspace
+- `track-progress`: Monitor annotation progress (optimized for instant response)
+- `delete-user`: Remove user from Argilla
+
+**User Management Examples**:
+```bash
+# Create 10 users with random secure passwords
+gsma argilla add-users -w TSG --count 10 --output-csv users.csv
+
+# Create user in multiple workspaces
+gsma argilla add-user -u alice -p secret123 -w TSG -w FASG -w NG
+
+# Track annotation progress
+gsma argilla track-progress -w TSG
+```
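+
+For orientation, generating a batch of users with random passwords and a CSV export could look roughly like the following. This only illustrates the idea behind `add-users`; it does not call Argilla, and the username scheme, CSV columns, and helper names are all hypothetical:
+
+```python
+import csv
+import io
+import secrets
+
+FIELDS = ["username", "password", "workspace"]  # assumed CSV columns
+
+
+def make_users(workspace: str, count: int, password_bytes: int = 16) -> list[dict[str, str]]:
+    """Build user rows with cryptographically random, URL-safe passwords."""
+    return [
+        {
+            "username": f"{workspace.lower()}-user-{index:02d}",
+            "password": secrets.token_urlsafe(password_bytes),
+            "workspace": workspace,
+        }
+        for index in range(1, count + 1)
+    ]
+
+
+def to_csv(rows: list[dict[str, str]]) -> str:
+    """Serialize the rows to CSV text with a header line."""
+    buffer = io.StringIO()
+    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
+    writer.writeheader()
+    writer.writerows(rows)
+    return buffer.getvalue()
+
+
+users = make_users("TSG", 10)
+csv_text = to_csv(users)
+```
+
+`secrets` is used instead of `random` because it draws from the operating system's cryptographic source, which is what password generation calls for.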
+
+## Key Components
+
+### Chunking (`gsma_dataset_creation/chunker.py`)
+- Late chunking with embedded context preservation
+- Configurable chunk sizes (500-4000 tokens)
+- Embedding model: sentence-transformers/all-MiniLM-L6-v2
+- CLI: `gsma chunk