From 7b0db8b9cd94449e535a95a1919f8fba5c3e7c3e Mon Sep 17 00:00:00 2001 From: "David Muto (pseudomuto)" Date: Fri, 26 Jun 2026 12:23:07 -0400 Subject: [PATCH] [rfc]: End-to-end Encryption Add an RFC proposing how the Temporal Proxy will handle end-to-end encryption of workflow data (payloads, memo, and header fields). The proposal centralizes encryption at the proxy so operators manage and rotate keys in one place, transparently to workflow developers, rather than every team standing up its own codec server or building encryption into each Worker. --- rfc/encryption.md | 522 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 522 insertions(+) create mode 100644 rfc/encryption.md diff --git a/rfc/encryption.md b/rfc/encryption.md new file mode 100644 index 0000000..7d43882 --- /dev/null +++ b/rfc/encryption.md @@ -0,0 +1,522 @@ +# End to End Encryption of Workflow Data + +This RFC outlines a proposal for how the Temporal Proxy will handle encrypting customer data. This includes, at a +minimum, workflow payloads and any memo and header fields. + +Questions and feedback can be directed to your Temporal account rep, the proxy team, or the community Slack channel. + +- **Slack:** [#proxy](https://temporalio.slack.com/archives/C0BD5MZF6UD) +- **Email:** + +## Background + +To ensure that PII or other private information is never shared outside the Worker pool, all payload data must be +encrypted both in transit and at rest. For some customers this is mandatory, but it will be designed as an opt-in +feature of the proxy. + +Today, customers who need this must stand up and operate their own proxy or codec server (or build encryption into every +Worker), then coordinate key management and rotation across many teams. That is friction during onboarding and an +ongoing operational burden. By centralizing encryption at the proxy, operators manage and rotate keys in one place, +transparently to workflow developers. + +## Goals and non-goals + +There are a number of goals related to encryption, but most notably it must be: + +- **Secure:** modern, authenticated encryption with regular key rotation. +- **Transparent:** no special Worker or client configuration; nothing changes for workflow developers. +- **Customer-controlled:** customers own their keys; key material never leaves their cloud provider (or their own KMS service). +- **Reliable under rotation:** key rotation must be graceful, and the system must never encrypt something it cannot later decrypt. +- **A fail-closed design:** unencrypted data is never proxied or otherwise visible outside the cluster hosting the proxy (assuming encryption is enabled). + +Non-goals for the initial release: + +- Encrypting details that need to be plaintext (e.g. search attributes). +- DEK rotation based on number of uses + +## At a glance + +- **Scheme:** envelope encryption, where payloads are sealed with a short-lived Data Encryption Key (DEK), which is itself wrapped by a customer-managed Key Encryption Key (KEK). +- **Payload cipher:** AES-256-GCM with a random 96-bit nonce per message. +- **KEK sources:** AWS KMS, GCP KMS, Azure Key Vault, an external (often sidecar'ed) gRPC service, plus a `testing://` scheme for local development. +- **Granularity:** A default KEK (enforces fail-closed), plus optional namespace-level overrides as needed. +- **DEK rotation:** automatic, time-based, pre-rotated off the hot path. +- **Decryption:** always attempted on any encrypted payload that references a known KEK regardless of whether encryption is currently enabled. + +> [!NOTE] +> **Treat KEK identifiers as immutable.** Once you have encrypted a payload with a given identifier, that identifier +> must remain reachable forever, or the payload becomes indecipherable. See [Decommissioning a +> KEK](#decommissioning-a-kek). + +## How it works + +The process involves two distinct types of keys: **Key Encryption Keys (KEKs)** and **Data Encryption Keys (DEKs)**. See +[Envelope Encryption](https://en.wikipedia.org/wiki/Hybrid_cryptosystem#Envelope_encryption) for background info. + +For our purposes, KEKs are entirely customer-managed/owned, and DEKs are ephemeral, short-lived (configurable), and +managed by the proxy. A locally generated DEK does the actual (symmetric) encryption, while the customer-managed KEK +wraps (encrypts) the DEK. Both the encrypted DEK and the ciphertext travel together in the payload, so the proxy only +needs the KEK to reconstruct the DEK at decrypt time. The proxy never holds KEK material itself, rather it asks the KMS +(or your gRPC service) to wrap and unwrap the DEK on its behalf. + +```mermaid +flowchart LR + PT[plaintext] --> SEAL["DEK.Encrypt (AES-256-GCM, random nonce)"] + DEK[("random 256-bit DEK")] --> SEAL + SEAL --> CT["nonce | ciphertext+tag"] + + DEK --> WRAP["KEK.Encrypt (cloud KMS or gRPC)"] + KEK[("customer-managed KEK (AWS / GCP / Azure / your gRPC)")] --> WRAP + WRAP --> EDEK["KEK id + encrypted DEK"] + + CT --> P[[Payload]] + EDEK --> P +``` + +When using the proxy, all encryption/decryption is completely transparent and managed by the proxy without any special +Worker/client config. The proxy implements `WorkflowService` (and others as necessary) and, when appropriate, encrypts +[Payloads] before forwarding them upstream, decrypting them before returning results downstream. + +> [!NOTE] +> [Memo] and [Header] field values are already Payload objects and will therefore be encrypted in the same manner +> without any special attention. + +[Memo]: https://github.com/temporalio/api/blob/8e0453c3a17693d7abb78853c3ccccf0c632e782/temporal/api/common/v1/message.proto#L53 +[Header]: https://github.com/temporalio/api/blob/8e0453c3a17693d7abb78853c3ccccf0c632e782/temporal/api/common/v1/message.proto#L59 +[Payload]: https://github.com/temporalio/api/blob/8e0453c3a17693d7abb78853c3ccccf0c632e782/temporal/api/common/v1/message.proto#L32 +[Payloads]: https://github.com/temporalio/api/blob/8e0453c3a17693d7abb78853c3ccccf0c632e782/temporal/api/common/v1/message.proto#L32 + +### Why envelope encryption? + +In two words: **cost and latency.** Cloud KMS calls are slow and rate-limited. Encrypting every payload directly against +the KMS would burn quota and add a network hop per payload. The envelope means we only call the KMS once per DEK +rotation (per namespace), not once per payload. + +We also limit the **blast radius**. A DEK is scoped to one namespace and rotates on a short schedule. If a DEK ever +leaks, exposure is bounded to whatever was encrypted with that DEK during its lifetime. KEKs never leave their source +(KMS or your gRPC service) and therefore cannot be leaked by this system. + +### Cryptographic details + +| Property | Value | +| ---------- | ------------------------------------------------- | +| Cipher | AES-256-GCM | +| DEK size | 256 bits (32 bytes), `crypto/rand` | +| Nonce size | 96 bits (12 bytes), `crypto/rand`, per encryption | + +GCM is authenticated encryption (AEAD); the auth tag is part of the ciphertext and is verified transparently on open. A +tampered payload fails to decrypt rather than returning garbage. + +## Key mapping + +Proxy operators define KEKs along with associated DEK lifetime settings, and map them to namespaces. + +A **default** key is required whenever encryption is enabled. It guarantees fail-closed behaviour: any namespace without +its own `overrides` entry, including ones created after the proxy started, falls back to the default rather than +silently transmitting plaintext. There is no configuration in which encryption is on but a namespace slips through +unencrypted. Enabling encryption (`enabled: true`) without a `default` is a startup error. + +The default alone satisfies the fail-closed goal: point it at one KEK and every namespace is covered. Per-namespace +overrides exist for when a single shared key isn't what you want, and there are a few common reasons to reach for them: + +- **Blast radius.** A KEK scopes the damage of a mishap. With one shared key, an accidental deletion or a revoked + permission takes down encryption for every namespace at once; per-namespace keys contain that to a single tenant. +- **Rate limits and throughput.** Every DEK wrap/unwrap is a request against a specific KMS key, and cloud providers + enforce per-key (and per-account) quotas. Spreading busy namespaces across separate keys, or even separate providers, + keeps any one key off its ceiling. +- **Data residency and tenancy.** Namespaces often map to different customers, teams, or jurisdictions. Per-namespace + keys let you keep an EU tenant's key in an EU KMS, give a customer custody of their own key, or satisfy a contract + that requires physically separate key material. +- **Independent lifecycle.** Rotation cadence, decommissioning, and access can all be managed per key. You can rotate or + revoke one namespace's key, or migrate it to a different provider, without touching anyone else's data. + +The tradeoff is operational: more keys mean more to provision, rotate, and audit. A sensible default could be to start +with `default` only and add overrides where one of the above actually applies. + +When encryption is turned off (`enabled: false`), the configured keys and policies are still loaded, but only the +decrypt path uses them; nothing new is encrypted. Defining keys while leaving encryption disabled is a valid +"decrypt-only" posture (e.g. after disabling encryption but still needing to open older payloads), and the proxy logs a +warning at startup so the half-configured state is obvious. + +A simple `namespace → key` mapping would create a problem when _changing_ keys (not rotating key material, but moving to +an entirely new key, e.g. a different cloud provider). Given _ns1_ using _dek1_ (wrapped by _kek1_), updating _ns1_ to +use _kek2_ would make old messages indecipherable; a rollback creates the same issue for anything encrypted between +release and rollback. + +```mermaid +flowchart TD + OLD["existing payload
keyid = kek1
(DEK wrapped by kek1)"] --> DEC{"decrypt: is kek1
still configured?"} + SWAP["config changed:
ns1 → kek2 only
(kek1 dropped)"] -.->|"removes kek1"| DEC + DEC -->|"no (single-key swap)"| FAIL["cannot unwrap dek1
payload indecipherable"] + DEC -->|"yes (kept as decrypt-only)"| OK["unwrap dek1 → decrypt"] +``` + +For this reason we use a one-to-many `namespace → keys` mapping (same for the default), with one key designated +**active** (used for wrapping DEKs) and the others kept reachable for decryption only: + +```yaml +encryption: + ... + default: + # ACTIVE KEK (used for all new encryption) + uri: testing://-ynIaZzFbAjp9VPgu0Ohk9YeQSLS9ta0m9mtnOnGZqo= + # OLD ONES (decrypt only) + decryptOnlyURIs: + - testing://vRQL-YpK_U44lLNwgAsufmOI-2_lCbQV1LDSdFmcHZ8= + # How long (max) a DEK stays valid for encryption + duration: 1h + # How far ahead of expiry to pre-rotate the DEK + renewBefore: 15m +``` + +This bundle of settings (the active key, any decrypt-only keys, and the DEK timing) is what we call a **KeyPolicy**. The +`default` is a KeyPolicy, and so is each per-namespace entry under `overrides`; they share the same shape, differing only +in what they apply to. The complete set of fields is in the [Field reference](#field-reference). + +## Encryption + +[Payload] encryption occurs automatically based on the proxy configuration. It is implemented as an interceptor that +walks the request/response objects, similar to how namespace translation works. + +Payloads have their `Data` field set to the encrypted bytes of the original payload, and their `Metadata` set to include +the encrypted DEK and a reference to the KEK used to decrypt it (along with the nonce and an encoding marker). + +### Encryption process + +```mermaid +flowchart TD + PT["plaintext payload
(forwarding upstream)"] --> EN{"encryption
enabled?"} + EN -->|"no"| PASS["forward unencrypted"] + EN -->|"yes"| POL["resolve KeyPolicy
(namespace override,
else default)"] + POL --> DEK["ensure current DEK
(from cache; pre-rotate
if within renewBefore)"] + DEK --> SEAL["DEK.Encrypt
(AES-256-GCM, random nonce)"] + SEAL --> WRAP["attach metadata:
keyid + encrypted DEK + nonce"] + WRAP --> UP["forward upstream"] +``` + +> [!NOTE] +> Anytime we encrypt something we must be careful never to encrypt something we cannot decrypt (e.g. storing the +> wrong materials in the payload). Places to watch: concurrent calls to KMS (rotation may have happened upstream), DEK +> caching (prevents unnecessary network calls and rate limits), and KMS calls (delays, timeouts, rate limits, outages). + +## Decryption + +Similar to encryption, the proxy handles decryption transparently, requiring no Worker/client configuration. To support +both encrypted and plaintext payloads/data blobs, the proxy only attempts to decrypt payloads that reference a known +KEK. + +### Decryption process + +```mermaid +flowchart TD + IN["payload
(returning downstream)"] --> KNOWN{"references a known
KEK id?"} + KNOWN -->|"no"| PASS["pass through unchanged"] + KNOWN -->|"yes"| UNWRAP["KEK.Decrypt(encrypted DEK)
→ DEK (cached)"] + UNWRAP --> OPEN["DEK.Decrypt(nonce, ciphertext)"] + OPEN --> OUT["plaintext → return to caller"] +``` + +A decrypted-DEK LRU cache (`cacheSize`) avoids redundant KMS calls, important for busy namespaces where decrypting DEKs +on every message would otherwise hit provider rate limits. + +## Key rotation + +There are two independent layers of rotation. + +### DEK rotation (automatic) + +DEKs rotate on the schedule defined by `duration` and `renewBefore`. You don't need to do anything: a background loop +rotates any namespace whose DEK is inside the `renewBefore` window before expiry. The first encryption call after +rotation triggers a single, coalesced KMS call to wrap the new DEK. + +Old payloads remain decryptable: their encrypted DEK is still in the payload metadata, and the KEK that wrapped it is +still in the registry. Nothing about DEK rotation invalidates older data. + +### KEK rotation (operator-driven) + +All supported cloud providers support both manual and automatic (recommended) **key-material rotation**: the provider +generates new key material for future encrypt calls while retaining all old versions for decrypt. This is completely +transparent and requires nothing on our part; we recommend enabling automatic rotation where possible. + +To **change keys entirely** (e.g. switch providers) without losing access to existing data, promote a new key to active +and mark the old one decrypt-only in the same config update: + +```yaml +encryption: + default: + # The new active KEK - used for all new encryption. + uri: gcpkms://projects/my-project/locations/global/keyRings/codec/cryptoKeys/v2 + # The old KEK - kept reachable so existing payloads can still be opened. + decryptOnlyURIs: + - gcpkms://projects/my-project/locations/global/keyRings/codec/cryptoKeys/v1 +``` + +On reload the proxy dials the new KEK as active and the old KEK as decrypt-only. New DEKs are wrapped with the new KEK; +old payloads still carry the old `keyid`, and the decrypt path looks them up against the decrypt-only entry. There is no +limit on how many decrypt-only keys a policy can carry. + +> [!IMPORTANT] +> Once used, a KEK must remain available at least until after the Temporal Workflow retention period, so +> previously encrypted values stay available to the UI, Workers, etc. + +## Decommissioning a KEK + +> [!CAUTION] +> A KEK can be safely removed **only after every payload it ever wrapped is gone from every system the proxy +> might decrypt for**: Temporal history, visibility, archival storage, and any downstream copies. The proxy has no way +> to know this for you. + +Deleting a KEK from the cloud provider before the workflow retention period has lapsed **will result in data loss** and +is **not recoverable**. We cannot decrypt data without the original key material used to wrap the DEK, and neither can +anyone else, including the cloud provider. + +If a payload carries `encryption/keyid = X` and `X` is in neither the active nor the decrypt-only list, decryption fails with +`unknown key: X`. Re-adding the KEK to the configuration restores access; if the underlying KMS key itself has been +destroyed, the payload is unrecoverable. + +### Checklist before removing a KEK + +1. **Audit existing data.** Confirm no live workflow history, visibility record, or archive references the KEK identifier you intend to remove. KEK identifiers are visible in payload metadata at `encryption/keyid`: for cloud KMS this is the full URI; for gRPC endpoints it's the configured `id`. +2. **Keep the KEK as decrypt-only for a grace period.** Move it to `decryptOnlyURIs` / `decryptOnlyEndpoints` first and watch error rates. Only then remove the entry. +3. **Don't destroy the KMS key.** Removing the KEK from the proxy config doesn't destroy the underlying KMS key, which is your last line of defence. + +> [!NOTE] +> It is often safer to remove encrypt **IAM permissions** for a key than to delete it. This lets old messages be +> decrypted while guaranteeing the key is never used for new encryption, useful during a config rollout where old proxy +> versions error on encrypt until they're rolled. Always exercise extreme caution when deleting a KMS key. + +We strongly recommend provisioning keys with something like Terraform and setting `prevent_destroy = true` where +supported ([GCP +example](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/kms_crypto_key#example-usage---kms-crypto-key-basic)) +to guard against accidental deletes. + +## Configuration reference + +This section is the config-level detail: where KEKs come from and exactly how to declare them. It covers the external +gRPC server option, the supported cloud KMS providers, and the full annotated configuration schema. + +### External server {#external-server} + +The proxy can delegate KEK operations to an external gRPC service you run, instead of (or alongside) a cloud KMS. This +is the right path when you: + +- Can't give the proxy direct credentials to a cloud provider (air-gapped networks, on-prem KMS, etc.). +- Need routing logic the built-in cloud backends can't express (e.g. per-tenant accounts with separate credentials). +- Have your own PKI infrastructure you need to integrate with. + +The RPC service looks roughly like this: + +```protobuf +service KMSService { + rpc Encrypt(EncryptRequest) returns (EncryptResponse) {} + rpc Decrypt(DecryptRequest) returns (DecryptResponse) {} +} + +message EncryptRequest { + bytes plaintext = 1; + string namespace = 2; +} + +message EncryptResponse { + bytes ciphertext = 1; +} + +message DecryptRequest { + bytes ciphertext = 1; + string namespace = 2; +} + +message DecryptResponse { + bytes plaintext = 1; +} +``` + +The proxy learns about external servers through the top-level `externalServers` map. Encryption policies then reference +a server by name via the `endpoint.server` field. mTLS is supported via a `tls` block on each `externalServers` entry. + +```yaml +externalServers: + kmsserver: + address: 127.0.0.1:50051 + # tls: ... # optional; omit for plaintext + +encryption: + ... + default: + endpoint: + server: kmsserver # reference server by name + id: kms-default + duration: 30m + renewBefore: 5m +``` + +> [!TIP] +> **Run the external server as a sidecar.** Every encrypt/decrypt that isn't served from the DEK cache turns into a +> round trip to the KEK service, so the network distance to it sits on the hot path. Deploying the gRPC service as a +> sidecar in the same Pod (reachable over `127.0.0.1`, as above) keeps that hop on the loopback interface, avoiding +> cross-node or cross-zone latency and keeping KEK material off the network entirely. A shared, remotely-hosted KEK +> service works, but pays a real latency tax under cache-miss pressure (e.g. during DEK rotation or a cache-miss storm). + +### Supported KMS providers + +- [AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html) +- [GCP KMS](https://docs.cloud.google.com/kms/docs/key-management-service) +- [Azure Key Vault](https://learn.microsoft.com/en-us/azure/key-vault/general/overview) +- An [external gRPC service](#external-server) you run +- Static keys (`testing://`) for local development only + +Supported KMS URI schemes: + +- `awskms:///arn:aws:kms:REGION:ACCOUNT:key/KEY-ID?region=REGION` +- `gcpkms://projects/PROJECT/locations/LOCATION/keyRings/RING/cryptoKeys/KEY` +- `azurekeyvault://VAULT.vault.azure.net/keys/KEY-NAME/KEY-VERSION` +- `testing://...` (local development only; provides no real security) + +The proxy authenticates to each KMS through that provider's standard credential chain, so it uses whatever ambient +credentials the host environment supplies and never needs static keys in its config. On Kubernetes, for example, bind +the proxy's service account to a cloud identity (Workload Identity on Azure/GCP, IRSA or Pod Identity on AWS) so +permissions are granted by the provider. Off Kubernetes, the same SDKs resolve credentials from the host instead: VM or +instance metadata (EC2 instance profiles, GCE service accounts, Azure managed identity), environment variables, or a +mounted credentials file. Either way the proxy holds no KEK material; it only needs permission to call encrypt and +decrypt on the specific crypto keys. + +### Field reference + +Encryption is configured under the top-level `encryption` block, which references gRPC KEK services declared in the +top-level `externalServers` map (see [External server](#external-server) above). A fully annotated example: + +```yaml +# gRPC KEK services the proxy can dial, keyed by name. Optional; only needed +# when a KeyPolicy uses an `endpoint` instead of a cloud KMS `uri`. A policy +# references an entry here via `endpoint.server`, and the dialed connection is +# shared across every policy that names it. +externalServers: + kmsserver: + # Address of the gRPC service. Required. Often a localhost sidecar. + address: 127.0.0.1:50051 + # mTLS settings for the connection. Optional; omit for plaintext. + # tls: ... + +encryption: + # Whether to encrypt NEW payloads. Optional; defaults to false. + # Decryption is attempted on any encrypted payload regardless of this flag, + # so flipping it off does NOT strand data already on the wire. + enabled: true + + # Size of the decrypted-DEK LRU. Optional; defaults to 100. + # A value <= 0 disables the cache. + cacheSize: 100 + + # Fallback KeyPolicy for any namespace not listed in `overrides`. + # REQUIRED when `enabled: true` - this is what makes encryption fail-closed, + # so no namespace is ever transmitted in plaintext. Optional only when + # `enabled: false` (decrypt-only posture). + default: + # ---- KeyPolicy ---- + # `uri` and `endpoint` are mutually exclusive (XOR) - set exactly one. + + # Cloud KMS URI. Validated against the supported schemes: awskms, gcpkms, + # azurekeyvault, plus `testing` for local development. + uri: gcpkms://projects/my-project/locations/global/keyRings/codec/cryptoKeys/v2 + + # Pointer to a gRPC KEK service. Mutually exclusive with `uri`. + # endpoint: + # server: kmsserver # required: name of an entry in `externalServers`; + # # the dialed connection is shared across every + # # policy that references it. + # id: kms-default # required: stable identifier for the KEK, written + # # into every payload it encrypts. Must be globally + # # unique across all active and decrypt-only endpoints + # # and must not change across restarts. + + # Cloud KMS URIs no longer used for new encryption but still reachable + # to decrypt older payloads. Optional. + decryptOnlyURIs: + - gcpkms://projects/my-project/locations/global/keyRings/codec/cryptoKeys/v1 + + # gRPC endpoints with the same semantics as `decryptOnlyURIs`. Optional. + # decryptOnlyEndpoints: + # - server: kmsserver + # id: kms-old + + # How long a DEK stays valid before rotation. Optional. + duration: 1h + + # How far ahead of expiry to pre-rotate. Optional; must be < duration. + renewBefore: 15m + + # Per-namespace policies, keyed by namespace. Each value is a KeyPolicy + # (same shape as `default`) and takes precedence over `default`. Optional. + overrides: + my-namespace: + uri: awskms:///arn:aws:kms:us-east-1:123456789012:key/abcd-1234?region=us-east-1 + duration: 30m + renewBefore: 5m +``` + +> [!IMPORTANT] +> `KeyPolicy.uri` and `KeyPolicy.endpoint` are mutually exclusive. Setting both fails validation. Setting neither (with +> `decryptOnly*` empty) marks the policy as "not configured" and the proxy skips wiring it. This is allowed for an +> `overrides` entry (that namespace just falls back to `default`), but the `default` policy itself **must** have an +> active `uri` or `endpoint` whenever `enabled: true`; otherwise startup fails. + +> [!IMPORTANT] +> **gRPC `KeyEndpoint.id` values must be globally unique** across active and decrypt-only endpoints, and must remain +> stable across restarts. Renaming an id is equivalent to losing the key. Reusing an id across endpoints is rejected at +> startup. + +## Security commentary + +We believe we're in a good place with respect to best practices: regular rotation of key material, AES-256-GCM/AEAD, +short-lived DEKs, and plaintext keys never leaving the process. + +The one sharp edge is irreversible key deletion: deleting a KMS key destroys all data encrypted under it, and by design +Temporal cannot recover it. This is covered in detail under [Decommissioning a KEK](#decommissioning-a-kek); the proxy's +responsibility is to fail gracefully (no panics) when a key is missing, and to make the risk obvious through the docs and +guardrails (e.g. `prevent_destroy = true`). + +We'll also test and document the minimum permissions (per cloud provider) needed to provide encryption capabilities, and +include those in the setup docs. + +## Observability + +### Metrics + +| Metric | Type | Labels | +| -------------------------- | --------- | ----------------------------------------------- | +| `encryption_ops_total` | Counter | operation (encrypt/decrypt), namespace, result | +| `encryption_duration_secs` | Histogram | operation, namespace | +| `kms_calls_total` | Counter | provider (aws/azure/gcp/rpc), operation, result | +| `kms_call_duration_secs` | Histogram | provider, operation | +| `dek_cache_hits_total` | Counter | namespace | +| `dek_cache_misses_total` | Counter | namespace | +| `dek_cache_size` | Gauge | - | + +These metrics either lack a namespace label or combine it only with low-cardinality labels, keeping series counts +well-bounded regardless of deployment scale. + +### Examples of using these metrics + +#### DEK cache hit rate + +A drop here is the leading indicator of KMS rate-limit pressure; every cache miss is a KMS call. Split by namespace; if +a namespace's hit rate tanks, check its traffic spike and DEK TTL config. + +**PromQL:** `rate(dek_cache_hits_total[5m]) / (rate(dek_cache_hits_total[5m]) + rate(dek_cache_misses_total[5m]))`. + +#### KMS error rate + +KMS errors cascade into encryption failures and then request failures. Separating by provider distinguishes a provider +outage from a permissions/config problem; correlate with DEK cache hit rate, since a cache-miss storm during a KMS +outage is the worst case. + +**PromQL:** `rate(kms_calls_total{result="error"}[5m]) / rate(kms_calls_total[5m])` by provider. + +## Operational notes + +- **Configuration is validated at startup.** The proxy checks the entire encryption config before serving traffic and refuses to start on any problem it can detect up front: a missing `default` while encryption is enabled, both `uri` and `endpoint` set on a policy, a malformed or unsupported KMS URI, `renewBefore` not less than `duration`, duplicate gRPC `id` values, and similar. The goal is to surface foreseeable mistakes as a clear startup failure rather than as a runtime error mid-request. +- **A default key is mandatory when encryption is on.** `enabled: true` without `encryption.default` is a startup error. This is deliberate: with encryption enabled, there is no path by which a namespace transmits plaintext: it either matches an `overrides` entry or falls back to the default. +- **Disabling encryption does not lose data.** Setting `enabled: false` stops new payloads from being encrypted but doesn't stop decryption. As long as the KEKs are still configured, the proxy keeps opening older payloads. Keys defined while `enabled: false` are loaded for decryption only, and the proxy logs a warning at startup to flag that half-configured state. +- **Restart on config change.** The proxy reads encryption configuration at startup. Adding a KEK, or moving one to decrypt-only, requires a restart. +- **`testing://` is not for production.** It's exposed for local development only and provides no real security; the key material is in the URL. +- **Per-payload timeout.** Each encrypt/decrypt is bounded to one second; the total deadline scales linearly with the number of payloads in a batch.