Commit 0191a61

feat(pii): build & own combined PII (analyzer + anonymizer) image (#5176)
* feat(presidio): build & own combined analyzer+anonymizer image

  Replace the stock mcr.microsoft.com/presidio-* sidecar images with a single image we build and push to ECR/GHCR. A thin FastAPI service constructs one AnalyzerEngine + one AnonymizerEngine at startup and serves both on port 3000 (/health, /supportedentities, /analyze, /anonymize) so the app needs one PRESIDIO_URL. English only; pinned presidio 2.2.362 + en_core_web_lg 3.8.0. Bakes in the native check-digit VIN recognizer and registers 12 English recognizers Presidio ships but does not load by default (UK_NINO, AU_*, IN_*, SG_*), taking the supported English set from 19 to 32.

* feat(presidio): add multi-language support (es/it/pl/fi)

  Configure a multi-language spaCy NLP engine (en/es/it/pl/fi lg models) and explicitly register the national-id recognizers Presidio ships but does not load by default: ES_NIF/NIE, IT_FISCAL_CODE/DRIVER_LICENSE/VAT_CODE/PASSPORT/IDENTITY_CARD, PL_PESEL, FI_PERSONAL_IDENTITY_CODE. Verified the NLP-engine + explicit-registration path detects in-language (Finnish id, score 1.0).

* improvement(presidio): address review feedback

  - Register VIN under all served languages, not just en (Bugbot: VIN missed for non-English language routing).
  - Bump HEALTHCHECK start-period to 180s, since five lg models load at import (Bugbot).
  - Drop --no-cache-dir so the pip cache mount actually works (Greptile).
  - Pydantic request models for /analyze + /anonymize so missing 'text' returns 422 not 500; default operator 'type' to 'replace' instead of KeyError->500 (Greptile).

* refactor(pii): rename presidio image artifacts to pii

  Rename the image/repo/secret/files from 'presidio' to 'pii' for clarity: the service does PII detection + anonymization (and backs the guardrails block's block/mask), not just redaction, and 'pii' matches existing pii-* naming.

  docker/presidio.Dockerfile -> docker/pii.Dockerfile
  docker/presidio/ -> docker/pii/
  ghcr.io/simstudioai/presidio -> .../pii
  ECR_PRESIDIO secret -> ECR_PII (infra side already renamed)

  No behavior change; paths/identifiers only.

* refactor(pii): move service to apps/pii, make image ECR-only

  - Move server.py + requirements.txt from docker/pii/ to apps/pii/ (source belongs under apps/, matching app/realtime; Dockerfile stays in docker/). Add a minimal @sim/pii package.json so the apps/* bun workspace glob accepts the Python service.
  - Repoint docker/pii.Dockerfile COPY paths to apps/pii/; rename the container user presidio -> pii.
  - Drop GHCR for pii: it's a private ECS sidecar pulled from ECR, never published. Removed it from the arm64/manifest (GHCR-only) jobs and guarded the build-amd64 tag step to skip GHCR when no ghcr_image is set.
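
The review-feedback item about defaulting the operator 'type' can be illustrated without Presidio installed. The following is a minimal sketch under stated assumptions, not the service code itself: `normalize_operators` is a hypothetical helper name, and it returns plain `(type, params)` tuples where the real service would construct Presidio operator configs. It mirrors the two behaviours named in the message: the legacy `anonymizers` key takes precedence over `operators`, and a missing `type` falls back to `replace` rather than raising.

```python
from typing import Any, Optional

# Hypothetical alias for illustration: entity name -> (operator type, params).
Normalized = dict[str, tuple[str, dict[str, Any]]]


def normalize_operators(
    anonymizers: Optional[dict[str, dict[str, Any]]],
    operators: Optional[dict[str, dict[str, Any]]],
) -> Optional[Normalized]:
    """Pick whichever operator map was supplied and default each 'type'."""
    raw = anonymizers or operators
    if not raw:
        # Nothing supplied: the caller can fall back to engine defaults.
        return None
    normalized: Normalized = {}
    for entity, raw_cfg in raw.items():
        cfg = dict(raw_cfg)  # copy so the caller's dict is not mutated
        op_type = cfg.pop("type", "replace")  # missing 'type' is not an error
        normalized[entity] = (op_type, cfg)
    return normalized


if __name__ == "__main__":
    print(normalize_operators(None, {"PERSON": {"new_value": "<NAME>"}}))
```

Copying each config before popping `type` keeps the request body untouched, which matters if the same payload is logged or reused afterwards.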
1 parent ccc6954 commit 0191a61

7 files changed

Lines changed: 296 additions & 5 deletions

.github/workflows/ci.yml

Lines changed: 9 additions & 3 deletions
@@ -88,6 +88,8 @@ jobs:
             ecr_repo_secret: ECR_MIGRATIONS
           - dockerfile: ./docker/realtime.Dockerfile
             ecr_repo_secret: ECR_REALTIME
+          - dockerfile: ./docker/pii.Dockerfile
+            ecr_repo_secret: ECR_PII
     steps:
       - name: Checkout code
         uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
@@ -115,7 +117,7 @@ jobs:
         id: ecr-repo
         run: echo "name=$ECR_REPO" >> $GITHUB_OUTPUT
         env:
-          ECR_REPO: ${{ matrix.ecr_repo_secret == 'ECR_APP' && secrets.ECR_APP || matrix.ecr_repo_secret == 'ECR_MIGRATIONS' && secrets.ECR_MIGRATIONS || matrix.ecr_repo_secret == 'ECR_REALTIME' && secrets.ECR_REALTIME || '' }}
+          ECR_REPO: ${{ matrix.ecr_repo_secret == 'ECR_APP' && secrets.ECR_APP || matrix.ecr_repo_secret == 'ECR_MIGRATIONS' && secrets.ECR_MIGRATIONS || matrix.ecr_repo_secret == 'ECR_REALTIME' && secrets.ECR_REALTIME || matrix.ecr_repo_secret == 'ECR_PII' && secrets.ECR_PII || '' }}
 
       - name: Build and push
         uses: useblacksmith/build-push-action@fb9e3e6a9299c78462bfadd0d93352c316adc9b8 # v2
@@ -153,6 +155,10 @@ jobs:
           - dockerfile: ./docker/realtime.Dockerfile
             ghcr_image: ghcr.io/simstudioai/realtime
             ecr_repo_secret: ECR_REALTIME
+          # pii is ECR-only (private ECS sidecar); no ghcr_image, so the tag
+          # step below skips GHCR for it.
+          - dockerfile: ./docker/pii.Dockerfile
+            ecr_repo_secret: ECR_PII
     steps:
       - name: Checkout code
         uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6
@@ -188,7 +194,7 @@ jobs:
         id: ecr-repo
         run: echo "name=$ECR_REPO" >> $GITHUB_OUTPUT
         env:
-          ECR_REPO: ${{ matrix.ecr_repo_secret == 'ECR_APP' && secrets.ECR_APP || matrix.ecr_repo_secret == 'ECR_MIGRATIONS' && secrets.ECR_MIGRATIONS || matrix.ecr_repo_secret == 'ECR_REALTIME' && secrets.ECR_REALTIME || '' }}
+          ECR_REPO: ${{ matrix.ecr_repo_secret == 'ECR_APP' && secrets.ECR_APP || matrix.ecr_repo_secret == 'ECR_MIGRATIONS' && secrets.ECR_MIGRATIONS || matrix.ecr_repo_secret == 'ECR_REALTIME' && secrets.ECR_REALTIME || matrix.ecr_repo_secret == 'ECR_PII' && secrets.ECR_PII || '' }}
 
       - name: Generate tags
         id: meta
@@ -206,7 +212,7 @@ jobs:
 
           TAGS="${ECR_IMAGE}"
 
-          if [ "${{ github.ref }}" = "refs/heads/main" ]; then
+          if [ "${{ github.ref }}" = "refs/heads/main" ] && [ -n "$GHCR_IMAGE" ]; then
             GHCR_AMD64="${GHCR_IMAGE}:latest-amd64"
             GHCR_SHA="${GHCR_IMAGE}:${{ github.sha }}-amd64"
             TAGS="${TAGS},$GHCR_AMD64,$GHCR_SHA"
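
The effect of the added `-n "$GHCR_IMAGE"` guard can be restated as a small function. This is a hedged sketch, not workflow code: `build_tags` and its parameter names are hypothetical, and the ECR image reference is simply passed in because the hunk above does not show how `ECR_IMAGE` is computed.

```python
def build_tags(ecr_image: str, ghcr_image: str, ref: str, sha: str) -> str:
    """Return the comma-separated tag list the guarded shell step would produce.

    GHCR tags are appended only when building main AND a GHCR target exists,
    so an ECR-only matrix entry (empty ghcr_image) gets the ECR tag alone.
    """
    tags = [ecr_image]
    if ref == "refs/heads/main" and ghcr_image:
        tags.append(f"{ghcr_image}:latest-amd64")
        tags.append(f"{ghcr_image}:{sha}-amd64")
    return ",".join(tags)


if __name__ == "__main__":
    # Placeholder registry and SHA values, for illustration only.
    print(build_tags("registry.example/pii:abc123", "", "refs/heads/main", "abc123"))
```

Without the guard, an empty `GHCR_IMAGE` on main would yield tags such as `:latest-amd64` with no repository in front, which is the case the diff avoids.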

.github/workflows/images.yml

Lines changed: 5 additions & 2 deletions
@@ -26,6 +26,9 @@ jobs:
           - dockerfile: ./docker/realtime.Dockerfile
             ghcr_image: ghcr.io/simstudioai/realtime
             ecr_repo_secret: ECR_REALTIME
+          # pii is ECR-only (private ECS sidecar); no ghcr_image.
+          - dockerfile: ./docker/pii.Dockerfile
+            ecr_repo_secret: ECR_PII
     outputs:
       registry: ${{ steps.login-ecr.outputs.registry }}

@@ -80,8 +83,8 @@ jobs:
           # Build tags list
           TAGS="${ECR_IMAGE}"
 
-          # Add GHCR tags only for main branch
-          if [ "${{ github.ref }}" = "refs/heads/main" ]; then
+          # Add GHCR tags only for main branch (and only for images with a GHCR target)
+          if [ "${{ github.ref }}" = "refs/heads/main" ] && [ -n "$GHCR_IMAGE" ]; then
             GHCR_AMD64="${GHCR_IMAGE}:latest-amd64"
             GHCR_SHA="${GHCR_IMAGE}:${{ github.sha }}-amd64"
             TAGS="${TAGS},$GHCR_AMD64,$GHCR_SHA"

apps/pii/package.json

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+{
+  "name": "@sim/pii",
+  "version": "0.0.0",
+  "private": true,
+  "description": "PII detection + anonymization service (Microsoft Presidio, FastAPI). Python service built as a container image (docker/pii.Dockerfile); not part of the JS/turbo build."
+}

apps/pii/requirements.txt

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Pinned for reproducible image builds. Bump deliberately.
+presidio-analyzer==2.2.362
+presidio-anonymizer==2.2.362
+spacy==3.8.14
+fastapi==0.138.0
+uvicorn[standard]==0.49.0
+
+# The spaCy models (en/es/it/pl/fi lg, ~2.2GB total) are fetched + pinned in the
+# Dockerfile via curl-with-retry rather than here; a direct pip wheel URL
+# truncates on flaky networks and fails wheel validation.

apps/pii/server.py

Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
+"""Combined Presidio REST service: analyzer + anonymizer on one port.
+
+Constructs one warm AnalyzerEngine (multi-language NLP + a native check-digit
+VIN recognizer) and one AnonymizerEngine at startup, exposing stock-compatible
+endpoints so a single PRESIDIO_URL serves both.
+"""
+
+from typing import Any
+
+from fastapi import FastAPI
+from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer, RecognizerResult
+from presidio_analyzer.nlp_engine import NlpEngineProvider
+from presidio_analyzer.predefined_recognizers import (
+    AuAbnRecognizer,
+    AuAcnRecognizer,
+    AuMedicareRecognizer,
+    AuTfnRecognizer,
+    EsNieRecognizer,
+    EsNifRecognizer,
+    FiPersonalIdentityCodeRecognizer,
+    InAadhaarRecognizer,
+    InPanRecognizer,
+    InPassportRecognizer,
+    InVehicleRegistrationRecognizer,
+    InVoterRecognizer,
+    ItDriverLicenseRecognizer,
+    ItFiscalCodeRecognizer,
+    ItIdentityCardRecognizer,
+    ItPassportRecognizer,
+    ItVatCodeRecognizer,
+    PlPeselRecognizer,
+    SgFinRecognizer,
+    SgUenRecognizer,
+    UkNinoRecognizer,
+)
+from presidio_anonymizer import AnonymizerEngine
+from presidio_anonymizer.entities import OperatorConfig
+from pydantic import BaseModel
+
+# Languages served. Each needs its spaCy model installed in the image. The
+# es/it/pl/fi predefined recognizers (ES_NIF, IT_FISCAL_CODE, PL_PESEL, ...)
+# do not auto-load; they are registered explicitly via EXTRA_RECOGNIZERS below.
+NLP_CONFIGURATION = {
+    "nlp_engine_name": "spacy",
+    "models": [
+        {"lang_code": "en", "model_name": "en_core_web_lg"},
+        {"lang_code": "es", "model_name": "es_core_news_lg"},
+        {"lang_code": "it", "model_name": "it_core_news_lg"},
+        {"lang_code": "pl", "model_name": "pl_core_news_lg"},
+        {"lang_code": "fi", "model_name": "fi_core_news_lg"},
+    ],
+}
+SUPPORTED_LANGUAGES = [m["lang_code"] for m in NLP_CONFIGURATION["models"]]
+
+# Predefined recognizers Presidio ships but does NOT load into the default
+# registry, so they must be added explicitly. Each carries its own
+# supported_language, so it fires under that language once its NLP model is
+# loaded. en: UK/AU/IN/SG locale ids; es/it/pl/fi: national ids.
+EXTRA_RECOGNIZERS = [
+    UkNinoRecognizer,
+    AuAbnRecognizer,
+    AuAcnRecognizer,
+    AuTfnRecognizer,
+    AuMedicareRecognizer,
+    InPanRecognizer,
+    InAadhaarRecognizer,
+    InVehicleRegistrationRecognizer,
+    InVoterRecognizer,
+    InPassportRecognizer,
+    SgFinRecognizer,
+    SgUenRecognizer,
+    EsNifRecognizer,
+    EsNieRecognizer,
+    ItFiscalCodeRecognizer,
+    ItDriverLicenseRecognizer,
+    ItVatCodeRecognizer,
+    ItPassportRecognizer,
+    ItIdentityCardRecognizer,
+    PlPeselRecognizer,
+    FiPersonalIdentityCodeRecognizer,
+]
+
+
+class VinRecognizer(PatternRecognizer):
+    """VIN (17 chars, A-Z/0-9 excluding I/O/Q) with ISO 3779 check-digit
+    validation (position 9). Validation rejects the large majority of arbitrary
+    17-char codes (request ids, SKUs, tokens) that merely look like a VIN. Some
+    non-North-American VINs omit the check digit and are skipped; this is an
+    intentional bias toward precision.
+    """
+
+    _TRANSLIT = {
+        **{str(d): d for d in range(10)},
+        "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8,
+        "J": 1, "K": 2, "L": 3, "M": 4, "N": 5, "P": 7, "R": 9,
+        "S": 2, "T": 3, "U": 4, "V": 5, "W": 6, "X": 7, "Y": 8, "Z": 9,
+    }
+    _WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]
+
+    def validate_result(self, pattern_text: str):
+        vin = pattern_text.upper()
+        if len(vin) != 17:
+            return False
+        try:
+            total = sum(self._TRANSLIT[c] * w for c, w in zip(vin, self._WEIGHTS))
+        except KeyError:
+            return False
+        check = total % 11
+        expected = "X" if check == 10 else str(check)
+        return vin[8] == expected
+
+
+def build_analyzer() -> AnalyzerEngine:
+    nlp_engine = NlpEngineProvider(nlp_configuration=NLP_CONFIGURATION).create_engine()
+    analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=SUPPORTED_LANGUAGES)
+    # VIN is language-agnostic, so register it under every served language:
+    # a recognizer only fires for the language the caller routes to.
+    vin_pattern = Pattern(name="vin", regex=r"\b[A-HJ-NPR-Z0-9]{17}\b", score=0.7)
+    for language in SUPPORTED_LANGUAGES:
+        analyzer.registry.add_recognizer(
+            VinRecognizer(
+                supported_entity="VIN",
+                patterns=[vin_pattern],
+                context=["vin", "vehicle", "chassis"],
+                supported_language=language,
+            )
+        )
+    for recognizer_cls in EXTRA_RECOGNIZERS:
+        analyzer.registry.add_recognizer(recognizer_cls())
+    return analyzer
+
+
+analyzer = build_analyzer()
+anonymizer = AnonymizerEngine()
+
+app = FastAPI(title="Sim Presidio", docs_url=None, redoc_url=None)
+
+
+class AnalyzeRequest(BaseModel):
+    text: str
+    language: str = "en"
+    entities: list[str] | None = None
+    score_threshold: float | None = None
+    return_decision_process: bool = False
+
+
+class AnonymizeRequest(BaseModel):
+    text: str
+    analyzer_results: list[dict[str, Any]] = []
+    anonymizers: dict[str, dict[str, Any]] | None = None
+    operators: dict[str, dict[str, Any]] | None = None
+
+
+@app.get("/health")
+def health() -> dict[str, str]:
+    return {"status": "ok"}
+
+
+@app.get("/supportedentities")
+def supported_entities(language: str = "en") -> list[str]:
+    return analyzer.get_supported_entities(language)
+
+
+@app.post("/analyze")
+def analyze(req: AnalyzeRequest) -> list[dict[str, Any]]:
+    results = analyzer.analyze(
+        text=req.text,
+        language=req.language,
+        entities=req.entities or None,
+        score_threshold=req.score_threshold,
+        return_decision_process=req.return_decision_process,
+    )
+    return [r.to_dict() for r in results]
+
+
+@app.post("/anonymize")
+def anonymize(req: AnonymizeRequest) -> dict[str, Any]:
+    analyzer_results = [
+        RecognizerResult(
+            entity_type=r["entity_type"],
+            start=r["start"],
+            end=r["end"],
+            score=r.get("score", 1.0),
+        )
+        for r in req.analyzer_results
+    ]
+    raw_operators = req.anonymizers or req.operators
+    operators = None
+    if raw_operators:
+        operators = {}
+        for entity, raw_cfg in raw_operators.items():
+            op_cfg = dict(raw_cfg)
+            op_type = op_cfg.pop("type", "replace")
+            operators[entity] = OperatorConfig(op_type, op_cfg)
+    result = anonymizer.anonymize(
+        text=req.text,
+        analyzer_results=analyzer_results,
+        operators=operators,
+    )
+    return {
+        "text": result.text,
+        "items": [
+            {
+                "operator": item.operator,
+                "entity_type": item.entity_type,
+                "start": item.start,
+                "end": item.end,
+                "text": item.text,
+            }
+            for item in result.items
+        ],
+    }
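
The check-digit rule used by `VinRecognizer.validate_result` can be exercised on its own, without Presidio or spaCy installed. The sketch below reuses the transliteration table and weights from the diff; the function names `vin_check_digit` and `is_valid_vin` are hypothetical and exist only for this illustration.

```python
from typing import Optional

# Letter-to-number transliteration; I, O and Q are absent because they are
# not valid VIN characters, so looking them up raises KeyError.
TRANSLIT = {
    **{str(d): d for d in range(10)},
    "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8,
    "J": 1, "K": 2, "L": 3, "M": 4, "N": 5, "P": 7, "R": 9,
    "S": 2, "T": 3, "U": 4, "V": 5, "W": 6, "X": 7, "Y": 8, "Z": 9,
}
# Position 9 (index 8) has weight 0: it holds the check digit itself.
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def vin_check_digit(vin: str) -> Optional[str]:
    """Return the expected position-9 character, or None for malformed input."""
    vin = vin.upper()
    if len(vin) != 17:
        return None
    try:
        total = sum(TRANSLIT[c] * w for c, w in zip(vin, WEIGHTS))
    except KeyError:
        return None
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def is_valid_vin(vin: str) -> bool:
    """True when the character at position 9 matches the computed check digit."""
    expected = vin_check_digit(vin)
    return expected is not None and vin.upper()[8] == expected


if __name__ == "__main__":
    for candidate in ("1M8GDM9AXKP042788", "1M8GDM9A1KP042788"):
        print(candidate, is_valid_vin(candidate))
```

For the first sample the weighted sum is 351, and 351 mod 11 is 10, so the expected check character is `X`; the second sample differs only at position 9 and is rejected.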

bun.lock

Lines changed: 6 additions & 0 deletions

docker/pii.Dockerfile

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# ========================================
+# Combined Presidio service (analyzer + anonymizer) on a single port (3000)
+# ========================================
+FROM python:3.12-slim-bookworm AS base
+
+WORKDIR /app
+
+# build-essential for any sdist that compiles native deps (e.g. blis/thinc).
+RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
+    --mount=type=cache,target=/var/lib/apt,sharing=locked \
+    apt-get update && apt-get install -y --no-install-recommends \
+    build-essential curl ca-certificates \
+    && rm -rf /var/lib/apt/lists/*
+
+# Pinned Python deps. Separate layer so source edits don't reinstall them.
+COPY apps/pii/requirements.txt ./requirements.txt
+RUN --mount=type=cache,target=/root/.cache/pip \
+    pip install -r requirements.txt
+
+# Pinned spaCy models (en + es/it/pl/fi, ~2.2GB total). Downloaded with
+# retries/resume, since the large wheels truncate on flaky networks if pip
+# fetches the URLs directly.
+ARG SPACY_MODELS="en_core_web_lg-3.8.0 es_core_news_lg-3.8.0 it_core_news_lg-3.8.0 pl_core_news_lg-3.8.0 fi_core_news_lg-3.8.0"
+RUN --mount=type=cache,target=/root/.cache/pip \
+    for model in ${SPACY_MODELS}; do \
+        whl="${model}-py3-none-any.whl"; \
+        curl -fL --retry 5 --retry-delay 5 --retry-all-errors -C - \
+            -o "/tmp/${whl}" \
+            "https://github.com/explosion/spacy-models/releases/download/${model}/${whl}" || exit 1; \
+    done && \
+    pip install /tmp/*.whl && \
+    rm /tmp/*.whl
+
+COPY apps/pii/server.py ./server.py
+
+RUN groupadd -g 1001 pii && \
+    useradd -u 1001 -g pii pii && \
+    chown -R pii:pii /app
+USER pii
+
+EXPOSE 3000
+
+# start-period is generous: five large spaCy models load at import before
+# /health responds. Tune against measured cold-start once built.
+HEALTHCHECK --interval=30s --timeout=5s --start-period=180s --retries=3 \
+    CMD curl -fsS http://localhost:3000/health || exit 1
+
+CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "3000"]
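
The HEALTHCHECK comment leaves the start-period to be tuned against a measured cold start. One possible way to take that measurement is sketched below. It is an assumption-laden illustration, not part of the commit: `http_probe` and `measure_cold_start` are hypothetical helpers, and the script assumes the container has just been started with port 3000 published on localhost.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def measure_cold_start(
    probe: Callable[[], bool],
    deadline_s: float = 180.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the probe succeeds; return elapsed seconds, or None on timeout.

    The clock and sleep functions are injectable so the loop can be tested
    without waiting in real time.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)


if __name__ == "__main__":
    # Assumed URL: matches the HEALTHCHECK target when port 3000 is published.
    elapsed = measure_cold_start(lambda: http_probe("http://localhost:3000/health"))
    print("not healthy within deadline" if elapsed is None else f"healthy after {elapsed:.1f}s")
```

Running this a few times right after `docker run` gives a rough distribution of cold-start times; the start-period can then be set with some headroom above the slowest observed value.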
