A data de-identification and anonymization toolkit that combines multiple anonymization strategies with cryptographic techniques and machine learning-based entity recognition.
The repository includes a Dev Container configuration that sets up
the full development environment automatically: Python 3.12, uv, all dependencies (including
the GPU group), and pre-commit hooks.
Prerequisites:
- Docker
- VS Code with the Dev Containers extension
Steps:
- Clone the repository and open it in VS Code:
git clone https://github.com/susom/tide2.git
cd tide2
- When VS Code detects .devcontainer/devcontainer.json, click Reopen in Container (or run the command Dev Containers: Reopen in Container from the command palette).
- The virtual environment at /opt/tide2-core/.venv is activated by default in all terminals.
The Dev Container includes these VS Code extensions pre-installed: Python, Ruff, Jupyter, Docker, and TOML support.
If you prefer to develop outside the Dev Container:
# Install uv (https://docs.astral.sh/uv/getting-started/installation/)
# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install
git clone https://github.com/susom/tide2.git
cd tide2
uv python install 3.12.8
uv sync
# Activate the virtual environment before running any Python commands
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows (PowerShell)
The tutorial notebook walks you through the de-identification pipeline step by step.
In the Dev Container (or local VS Code):
- Open notebooks/tide2_pipeline.ipynb (the Jupyter extension is pre-installed in the Dev Container)
- When prompted for a kernel, select the .venv (Python 3.12) environment
Jupyter in the browser (local installation):
uv sync --group dev # Install Jupyter (dev dependency group)
source .venv/bin/activate
jupyter notebook notebooks/tide2_pipeline.ipynb
View the notebook on GitHub: TIDE 2.0 Pipeline Tutorial
Troubleshooting:
- Run from the repo root: launch Jupyter from the tide2/ directory so that relative paths resolve correctly.
- GCP credentials are not required: the notebook downloads the transformer model from HuggingFace Hub by default. Set project_id and bucket_name in the Configuration cell only if you want to use GCS-hosted weights.
- Kernel crashes: if the Jupyter kernel crashes repeatedly, restart Jupyter (Ctrl+C, then re-launch) and run the cells from the top.
TIDE 2.0 includes a Streamlit visualizer for comparing original and de-identified text side by side.
Launch it with:
tide2-visualizer
To stop the visualizer, press Ctrl+C in the terminal (works on macOS, Linux, and Windows).
TIDE 2.0 is a Python package for anonymizing sensitive data in healthcare and research contexts. It identifies and anonymizes personally identifiable information (PII) while maintaining data utility for analysis and research.
- Transformer-based NER: HuggingFace transformer models with direct batch inference (bypasses HF pipeline), BIO token aggregation, and chunk-to-document reassembly
- Regex recognizers: Phone, URL/IP, Email, SSN, Address — replacements for Presidio defaults (10-100x faster)
- Healthcare-specific: MRN, Accession Number, HAR code recognizers
- Known values detection: Aho-Corasick based matching against patient databases
- Specialized: Base64 image detection, genetic sequence detection, LLM-based JSON recognizer
- Cached results: Pre-computed NER results from GPU batch processing via CachedResultsTransformerRecognizer
- Presidio Integration: Built on Microsoft's Presidio framework
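To illustrate the BIO token aggregation mentioned in the list above, here is a minimal sketch. It is not the TIDE 2.0 implementation: the `Span` class, the `aggregate_bio` function, and the `(label, start, end)` input shape are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    entity_type: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def aggregate_bio(tokens):
    """Merge (bio_label, char_start, char_end) predictions into entity spans.

    The input shape is an assumption made for this sketch.
    """
    spans = []
    current = None
    for label, start, end in tokens:
        if label == "O":
            current = None  # an outside token closes any open span
            continue
        prefix, _, entity = label.partition("-")
        # Extend the open span only for an I- tag of the same entity type.
        if prefix == "I" and current is not None and current.entity_type == entity:
            current.end = end
        else:
            # A B- tag, or a stray I- tag, starts a new span.
            current = Span(entity, start, end)
            spans.append(current)
    return spans


text = "John Smith called 555-1234"
tokens = [
    ("B-NAME", 0, 4),
    ("I-NAME", 5, 10),
    ("O", 11, 17),
    ("B-PHONE", 18, 26),
]
spans = aggregate_bio(tokens)
print([(s.entity_type, text[s.start:s.end]) for s in spans])
# [('NAME', 'John Smith'), ('PHONE', '555-1234')]
```

Working on character offsets rather than token strings means the merged span can be sliced straight out of the original text, which is what offset-based extraction relies on.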
- HIPS (Healthcare Identity Protection System): Cryptographic deterministic anonymization for names, locations, and alphanumeric identifiers
- Accession number hashing: SHA256-based, compatible with BigQuery UDF
- Faker Integration: Realistic fake data generation
- Date Jittering: Deterministic, privacy-preserving date shifts derived from patient keys
- Age Grouping: Age range categorization
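As a rough illustration of SHA256-based accession number hashing, the sketch below uses only the standard library. The function name `hash_accession`, the salt handling, and the sample values are hypothetical; the actual TIDE 2.0 scheme may differ.

```python
import hashlib


def hash_accession(accession: str, salt: str = "") -> str:
    """Return the lowercase hex SHA-256 digest of salt + accession.

    Prepending a salt is an assumption of this sketch, not a documented
    detail of TIDE 2.0.
    """
    payload = (salt + accession).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


token = hash_accession("ACC-0001", salt="hypothetical-salt")
print(len(token))  # 64 hex characters
```

Under these assumptions, a BigQuery expression such as `TO_HEX(SHA256(CONCAT(salt, accession)))` should produce the same string, since both sides hash the UTF-8 bytes and emit lowercase hex.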
- Format-Preserving Encryption (FPE): Maintains data format during encryption
- Key Management: Key generation, storage, and derivation utilities
- Deterministic date jitter: Batch-capable date shift derivation from cryptographic keys
- String Selection: HMAC-based cached string selection
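The idea behind deriving a deterministic date shift from a key can be sketched as follows. This is an illustration only: `derive_shift_days`, `jitter_dates`, the `b"date-jitter"` label, and the 30-day window are assumptions, not the library's API.

```python
import hashlib
import hmac
from datetime import date, timedelta


def derive_shift_days(patient_key: bytes, max_days: int = 30) -> int:
    """Derive a stable day offset in [-max_days, max_days] from a key."""
    digest = hmac.new(patient_key, b"date-jitter", hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")
    return value % (2 * max_days + 1) - max_days


def jitter_dates(dates, patient_key: bytes, max_days: int = 30):
    """Shift every date for one patient by the same derived offset."""
    shift = timedelta(days=derive_shift_days(patient_key, max_days))
    return [d + shift for d in dates]


key = b"hypothetical-per-patient-key"
admit, discharge = date(2024, 3, 1), date(2024, 3, 8)
shifted = jitter_dates([admit, discharge], key)
print(shifted[1] - shifted[0])  # intervals are preserved: 7 days, 0:00:00
```

Because the offset depends only on the key, the same patient always gets the same shift across batches and runs, and intervals between that patient's events are unchanged.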
- Runner module: Single-node job runner with local and VM modes via the tide2-runner CLI
- Ray actors: RecognizerActor, AnonymizerActor, TransformerInferenceActor, BIOAggregationActor, ReassemblyActor for ray.data.map_batches
- Two-stage GPU/CPU pipeline: GPU inference returns raw BIO tokens; CPU actors aggregate them concurrently via Ray Data streaming
- Direct inference: Bypasses HuggingFace pipeline dispatch loop with batch tokenize → single GPU forward pass → offset-based extraction
- Adaptive GPU batching: Auto-computes batch size from model config and free GPU memory; adjusts based on text lengths with VRAM-aware budgets (override via --short-seq-budget)
- OOM recovery: Automatic batch splitting on CUDA out-of-memory errors
- Fault tolerance: Actor restarts, task retries, graceful shutdown
- YAML config: All CLI arguments can be specified in a YAML config file (--config)
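The OOM recovery item above can be pictured as a recursive batch split. The sketch below is a hedged illustration that runs without a GPU: `OutOfMemory` and `fake_infer` are stand-ins invented for the example (with PyTorch, the caught exception would typically be `torch.cuda.OutOfMemoryError`), and `run_with_split` is not the runner's actual function.

```python
class OutOfMemory(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (hypothetical)."""


def run_with_split(batch, infer):
    """Run `infer` on `batch`, halving the batch whenever it runs out of memory."""
    try:
        return infer(batch)
    except OutOfMemory:
        if len(batch) <= 1:
            raise  # a single item that does not fit cannot be split further
        mid = len(batch) // 2
        # Process each half independently and keep the original order.
        return run_with_split(batch[:mid], infer) + run_with_split(batch[mid:], infer)


def fake_infer(batch, limit=3):
    # Pretend the device can hold at most `limit` texts at once.
    if len(batch) > limit:
        raise OutOfMemory(f"batch of {len(batch)} is too large")
    return [text.upper() for text in batch]


texts = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(run_with_split(texts, fake_infer))
# ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
```

Halving keeps the number of retries logarithmic in the batch size, and re-raising on a single item avoids looping forever on an input that can never fit.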
- Text processing: Text chunking, BIO aggregation, span reconstruction, deduplication
- String parsers: Name parsing/classification, address parsing, format detection
- Span metrics: Gold vs ML evaluation, O(n log n) conflict resolution
- GCS cache: Auto-download models from GCS to ~/.cache/tide2/
- Model compilation: torch.compile with mega-cache support for faster inference startup
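One way to get O(n log n) conflict resolution between overlapping spans is to sort once and sweep. The sketch below shows that pattern; the tuple layout and the tie-break order (earlier start, then longer span, then higher score) are assumptions, and the rules in span_metrics.py may differ.

```python
def resolve_overlaps(spans):
    """Keep a non-overlapping subset of (start, end, label, score) spans.

    Sorting dominates the cost, so the whole pass is O(n log n).
    """
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]), -s[3]))
    kept = []
    last_end = 0
    for span in ordered:
        start, end = span[0], span[1]
        # Keep a span only if it starts at or after the end of the last kept one.
        if start >= last_end:
            kept.append(span)
            last_end = end
    return kept


candidates = [
    (0, 10, "NAME", 0.90),
    (5, 10, "NAME", 0.80),    # nested inside the first span
    (18, 21, "ZIP", 0.40),    # shorter span at the same start as the next one
    (18, 26, "PHONE", 0.99),
]
print(resolve_overlaps(candidates))
# [(0, 10, 'NAME', 0.9), (18, 26, 'PHONE', 0.99)]
```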
- tide2-runner: Ray-based single-node job runner with six job types: recognizer, anonymizer, transformer, reassembly, pipeline (full end-to-end), and llm-recognizer. Supports YAML config files (--config) and dry-run mode (--dry-run).
- tide2-visualizer: Streamlit app for side-by-side PHI comparison and entity editing.
- GCS: input/output and model caching.
- BigQuery: input/output of notes and recognizer/anonymizer results (e.g. via ARRAY_AGG-grouped chunk columns) for the runner and visualizer.
- Automatic Caching: Download and cache models from GCS automatically ($TIDE_CACHE_DIR).
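For orientation, the sketch below shows how a cache location could be resolved. It assumes that $TIDE_CACHE_DIR, when set, overrides the default ~/.cache/tide2/ directory, and it uses a hypothetical bucket/key layout inside the cache; neither detail is confirmed by this README, and `cache_root` and `cached_path` are illustrative names.

```python
import os
from pathlib import Path


def cache_root() -> Path:
    """Return $TIDE_CACHE_DIR if set, else ~/.cache/tide2 (assumed precedence)."""
    override = os.environ.get("TIDE_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "tide2"


def cached_path(gcs_uri: str) -> Path:
    """Map gs://bucket/key to <cache root>/bucket/key (hypothetical layout)."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {gcs_uri}")
    return cache_root() / gcs_uri[len("gs://"):]


os.environ["TIDE_CACHE_DIR"] = "/tmp/tide2-cache"
print(cached_path("gs://bucket/models/deid-v2/config.json"))
```

A download helper would then check whether the returned path exists before fetching the object from GCS.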
# Run recognition locally
tide2-runner run recognizer -i ./data/input -o ./data/output
# Run with more resources (e.g. on a large VM), reading/writing from GCS
tide2-runner run recognizer -i gs://bucket/input -o gs://bucket/output \
--num-cpus 224 --num-actors 200
# Run transformer NER on GPU
tide2-runner run transformer -i ./data/input -o ./data/transformer_output \
--model StanfordAIMI/stanford-deidentifier-v2 --batch-size 2048
# Run transformer with YAML config
tide2-runner run transformer --config config.yaml
# Run the full pipeline (transformer -> recognizer -> anonymizer)
tide2-runner run pipeline -i ./data/input.parquet -o ./data/output \
--model StanfordAIMI/stanford-deidentifier-v2
# On macOS, use the --object-store-gb option to set the Ray object store size (in GB)
tide2-runner run pipeline -i ./data/input.parquet -o ./data/output \
--model StanfordAIMI/stanford-deidentifier-v2 --object-store-gb 2
# Run anonymization
tide2-runner run anonymizer -i ./data/recognized -o ./data/anonymized \
--salt /path/to/salt.bin --key /path/to/key.bin
# Launch the Streamlit PHI visualizer
tide2-visualizer
Several targets are built from a single multi-stage Dockerfile:
- production-cpu: slim CPU-only image (no CUDA, no gpu dependency group). Used by recognizer, anonymizer, and BigQuery tasks.
- production-gpu: GPU image based on nvidia/cuda:13.0.2-cudnn-runtime-ubuntu24.04, includes the gpu dependency group (torch, transformers, spacy). Used by transformer inference.
- development: Dev Container target with git, gcloud, build tools, and the full dev environment.
- test: extends development and runs the test suite (used by make test-docker).
Build and push the GPU image (requires DOCKER_REGISTRY and DOCKER_IMAGE_GPU in .env):
make docker # build + push the GPU image (alias for docker-gpu)
make docker-gpu # build + push the GPU image
make test-docker # build the test target and run the suite in Docker
- dev: Development tools (pytest, pytest-cov, ty, ruff, pre-commit), Jupyter, and the evaluation libraries (scikit-learn, scipy, tqdm, umap-learn)
- evaluation: Evaluation/analysis libraries (scikit-learn, scipy, tqdm, umap-learn)
- test: Minimal test dependencies (pytest, pytest-cov)
- gpu: ML inference stack (torch, transformers, spacy, presidio-analyzer[transformers])
- docs: API documentation generation (pdoc)
Install an optional group as an extra with uv sync --extra <name>, or all extras with uv sync --all-extras. (These same sets are also defined as [dependency-groups], usable with uv sync --group <name>.)
Note: GCP, CLI, and Presidio dependencies ship in the main package by default. The transformer/NER ML stack (torch, transformers, spacy) lives in the gpu extra — install it with uv sync --extra gpu before running transformer or pipeline jobs.
tide2/
├── recognizers/ # PII detection (Presidio EntityRecognizer subclasses)
├── anonymizers/ # PII replacement (Presidio Operator subclasses)
├── transformers/ # Core NER inference engine (TransformerCore)
│ ├── core.py # Model loading, direct inference, BIO aggregation
│ └── config.py # Model configuration management
├── actors/ # Ray actors for distributed batch processing
│ ├── transformer.py # GPU inference actor + CPU BIO aggregation actor
│ ├── recognizer.py # CPU recognizer actor
│ ├── anonymizer.py # CPU anonymizer actor
│ ├── reassembly.py # Chunk-to-document reassembly actor
│ └── llm_recognizer.py # LLM-based recognizer actor
├── cryptographic/ # FPE, key management, date jitter derivation
├── string_parsers/ # Name/address parsing, format detection
├── runner/ # Ray-based single-node job runner + CLI
│ ├── local_runner.py # LocalJobRunner: transformer/recognizer/anonymizer/reassembly/pipeline/llm
│ ├── cli.py # tide2-runner CLI with YAML config support
│ ├── transformer.py # Document chunking and reassembly logic
│ ├── fault_tolerance.py # Actor restarts, graceful shutdown
│ └── utils.py # Runner utilities
├── cli/ # Streamlit visualizer
├── utils/
│ ├── gcs_resource_manager.py # GCS auto-download and caching
│ ├── gcs_connector.py # GCS file I/O
│ ├── span_metrics.py # Evaluation metrics and conflict resolution
│ ├── text_processing.py # Chunking, BIO aggregation, span reconstruction
│ ├── serialization.py # RecognizerResult <-> dict conversions
│ ├── llm_model.py # LLM client utilities
│ ├── batch_columns.py # Batch column constants
│ ├── constants.py # Shared constants
│ └── resource_utils.py # Resource path helpers
└── resources/ # Config files (model configs, name lists, etc.)
# Run all unit tests (coverage report prints automatically)
uv run pytest
# Run without coverage (faster, useful when debugging)
uv run pytest --no-cov
# Run a specific test file
uv run pytest tests/test_masking_anonymizer.py
# Skip slow integration tests
uv run pytest -m "not integration"
Coverage is configured in pyproject.toml and runs automatically with pytest. Three reports are generated on each run:
- Terminal: line-by-line missing coverage printed to stdout
- HTML: detailed report at htmlcov/index.html
- XML: coverage.xml (Cobertura format)
API documentation is hosted via GitHub Pages: https://susom.github.io/tide2/
To build or preview docs locally (generated with pdoc):
# Install docs dependencies (pdoc); also needs the gpu extra so all modules import
uv sync --extra docs --extra gpu
# Live preview (opens a local server with hot reload)
make docs-serve
# Generate static HTML to docs/
make docs
Deployment to GitHub Pages is automated: the .github/workflows/docs.yml
workflow runs make docs on every push to main and publishes the docs/ directory as a
Pages artifact.
- Examples: Check the notebooks/ directory for usage examples
- Tests: Test suite in the tests/ directory
- Dev Container: Recommended — provides the full environment with no manual setup (requires Docker and VS Code with the Dev Containers extension)
- Python: 3.12 (required, >=3.12,<3.13), constrained to 3.12 for compatibility with the spacy/thinc C-extension stack and other pinned dependencies.
- Package Manager: uv (not pip or poetry)
- Virtual Environment: .venv/ (activated automatically in the Dev Container; must be activated manually for local installs)
- Core Dependencies: Presidio, Ray (>=2.54), Cryptography, Faker, Google Cloud libraries; torch/transformers/spacy in the optional gpu extra
- Cryptographic operations use standard libraries (the cryptography package, maintained as pyca/cryptography)
- Format-preserving encryption maintains data format during encryption
- Key management supports generation, storage, and rotation
- Anonymization strategies are designed to prevent re-identification
See the Dev Container setup above for the development environment. Guidelines:
- Code style and testing requirements are enforced by pre-commit hooks (installed automatically in the Dev Container)
- Run pytest to verify changes before submitting pull requests
This project is licensed under the MIT License - see the LICENSE file for details.
If you use TIDE 2.0 in your research, please cite:
@software{tide2,
title={TIDE 2.0: Data De-identification and Anonymization Toolkit},
author={TIDE 2.0 Team},
year={2025},
url={https://github.com/susom/tide2}
}
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Development: See the Contributing section above
Synthetic Data Notice: All sample data included in this repository (under notebooks/sample_data/) is entirely synthetic and fabricated. No real patient data is included. See notebooks/sample_data/README.md for details.
Note: This toolkit is designed for research and development purposes. Please ensure compliance with relevant privacy laws and regulations (HIPAA, GDPR, etc.) when using in production environments.
