A self-hosted mini Svix/Hookdeck: a reliability buffer that sits between webhook senders (Stripe, GitHub, …) and your application. It catches webhooks durably, verifies signatures, deduplicates retries, and fans them out to subscriber endpoints with exponential-backoff retries, dead-letter queues, and replay.
Built on AWS serverless (Lambda, API Gateway, DynamoDB, SNS→SQS) with Terraform.
Runs for ~$0/month, is account-agnostic (clone and apply into any AWS account), and is
designed to be torn down with terraform destroy between sessions.

    sender ──→ API GW ──→ Lambda (api) ──→ DynamoDB
    (signed POST)         verify HMAC,  │
                          dedupe, store └─→ SNS topic ──┬─→ SQS deliveries ─→ Lambda ─→ subscriber URLs
                                            (fanout)    │     └ DLQ                     (POST per matching sub,
                                                        │                                signed; demo receiver:
                                                        │                                ?mode=ok|fail|flaky)
                                                        ├─→ SQS audit ─→ Lambda ─→ S3 (raw JSON archive)
                                                        │     └ DLQ
                                                        └─→ SQS metrics ─→ Lambda ─→ DynamoDB counters
                                                              └ DLQ                  (atomic ADD)

One accepted event fans out to three independent consumers — each with its own
queue, DLQ, retry policy, and IAM scope, all stamped from the same
queue-consumer Terraform module. A delivery outage never blocks the audit
trail or the metrics. DLQ depth alarms notify an SNS alerts topic (email),
and parked messages are recoverable via the replay API.
make bootstrap # one-time: create the S3 remote-state bucket (survives destroys)
make init # connect the dev stack to the state bucket (name derived from your account)
make apply # deploy (~60s)
make smoke # 13-step end-to-end test
make destroy # tear it all down (state bucket survives for next session)

To remove everything including the state bucket (e.g. when walking away from the project),
use make destroy-all (dev stack first, then bucket).
Optional: put alert_email = "you@example.com" in terraform/envs/dev/terraform.tfvars
(gitignored) before applying to receive DLQ alarm emails — AWS sends a confirmation
link on first apply.
Requires: Terraform ≥ 1.10, AWS credentials, curl, openssl, python3.
| Route | Purpose |
|---|---|
| POST /sources | Register a sender. Returns id + signing secret (shown once). |
| GET /sources | List sources (without secrets). |
| POST /in/{source_id} | Ingest endpoint: paste into the sender's webhook config. |
| GET /sources/{source_id}/events | Recent events for a source, newest first. |
| POST /sources/{source_id}/subscriptions | Register a subscriber: {url, events} where events is a glob like payment.* (default *). |
| GET /sources/{source_id}/subscriptions | List subscriptions (without secrets). |
| GET /sources/{source_id}/events/{event_id}/deliveries | Per-subscription delivery status for an event. |
| GET /sources/{source_id}/metrics | Counts of events received, per day per type. |
| POST /sources/{source_id}/events/{event_id}/replay | Re-queue one event for delivery (already-delivered subscriptions are skipped). |
| POST /replay/dlq | Bulk recovery: redrive everything parked in the deliveries DLQ. |
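
The events filter on a subscription can be pictured with Python's fnmatch. This is only a sketch: it assumes shell-style glob rules, and the helper name matching_subscriptions and the dict shape are hypothetical, not taken from the relay's code.

```python
from fnmatch import fnmatchcase


def matching_subscriptions(event_type, subscriptions):
    """Return the subscriptions whose `events` glob matches event_type.

    Assumption: globs follow shell-style rules (fnmatch). A missing
    `events` field falls back to "*", the documented default.
    """
    return [
        sub for sub in subscriptions
        if fnmatchcase(event_type, sub.get("events", "*"))
    ]


subs = [
    {"id": "a", "events": "payment.*"},
    {"id": "b", "events": "refund.*"},
    {"id": "c"},  # no filter: receives everything
]
matched = [sub["id"] for sub in matching_subscriptions("payment.succeeded", subs)]
print(matched)
```
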
Ingest requests must carry an x-relay-signature header: the hex-encoded HMAC-SHA256 of the
raw body, keyed with the source secret. An optional x-idempotency-key header controls
deduplication (it defaults to a hash of the body). Duplicates are acked with
200 {"duplicate": true}, so sender retries never see errors.
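
For senders written in Python, the signature scheme above can be sketched with the standard library. The function names are hypothetical, and the SHA-256 default for the idempotency key is an assumption: the text only says "a hash of the body". Header names are assumed to be lowercased already.

```python
import hashlib
import hmac


def sign(secret: str, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body, keyed with the source secret."""
    return hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()


def verify(secret: str, raw_body: bytes, signature: str) -> bool:
    """Constant-time comparison, so timing does not leak the expected value."""
    expected = sign(secret, raw_body)
    return hmac.compare_digest(expected.encode(), signature.encode())


def idempotency_key(headers: dict, raw_body: bytes) -> str:
    """Use the caller's key if present, else a hash of the body.

    Assumption: SHA-256 is used for the fallback hash.
    """
    return headers.get("x-idempotency-key") or hashlib.sha256(raw_body).hexdigest()


body = b'{"type":"payment.succeeded","amount":4200}'
signature = sign("example-secret", body)
print(verify("example-secret", body, signature))
```

Sign the exact bytes that go on the wire: re-serializing the JSON before signing can change whitespace or key order and invalidate the signature.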
Simulate a sender:
API=$(make url)
SRC=$(curl -s -X POST "$API/sources" -d '{"name":"stripe"}')
./scripts/send-event.sh "$API" <src_id> <secret> '{"type":"payment.succeeded","amount":4200}'

The deployed stack includes a demo receiver Lambda whose behavior switches on a query param, so every failure mode is reproducible:
API=$(make url)
RECEIVER=$(terraform -chdir=terraform/envs/dev output -raw receiver_url)
# Subscribe a *flaky* endpoint: fails the first delivery attempt, then recovers
curl -s -X POST "$API/sources/<src_id>/subscriptions" \
-d "{\"url\": \"${RECEIVER}?mode=flaky\"}"
# Send an event, then watch the delivery go failed → delivered
./scripts/send-event.sh "$API" <src_id> <secret>
curl -s "$API/sources/<src_id>/events/<event_id>/deliveries"
# mode=fail exhausts all 3 attempts (with exponential backoff: 20s, 40s)
# and parks the message in the DLQ:
aws sqs get-queue-attributes \
--queue-url "$(terraform -chdir=terraform/envs/dev output -raw deliveries_dlq_url)" \
--attribute-names ApproximateNumberOfMessages
# ...which fires the DLQ CloudWatch alarm (email arrives if alert_email is set).
# Recover after fixing the endpoint:
curl -s -X POST "$API/replay/dlq" # bulk redrive
curl -s -X POST "$API/sources/<src_id>/events/<event_id>/replay" # or one event

Outbound deliveries are signed too (x-relay-signature with the subscription's
secret) and carry x-relay-event-id / x-relay-attempt headers.
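
The retry schedule and outbound headers described above can be sketched as follows. The 3-attempt limit and the 20s/40s delays come from the walkthrough; the doubling formula, the function names, and the assumption that outbound signatures use the same hex HMAC-SHA256 scheme as ingest are illustrative guesses, not the worker's actual code.

```python
import hashlib
import hmac

MAX_ATTEMPTS = 3   # after the third failure the message parks in the DLQ
BASE_DELAY_S = 20  # first retry delay; doubles on each further failure


def backoff_seconds(failed_attempt: int):
    """Delay before the next try after `failed_attempt` (1-based) fails.

    Returns None once attempts are exhausted (the message goes to the DLQ).
    """
    if failed_attempt >= MAX_ATTEMPTS:
        return None
    return BASE_DELAY_S * 2 ** (failed_attempt - 1)


def delivery_headers(sub_secret: str, body: bytes, event_id: str, attempt: int) -> dict:
    """Headers attached to an outbound POST.

    Assumption: the signature is hex HMAC-SHA256 of the body, as on ingest.
    """
    signature = hmac.new(sub_secret.encode(), body, hashlib.sha256).hexdigest()
    return {
        "x-relay-signature": signature,
        "x-relay-event-id": event_id,
        "x-relay-attempt": str(attempt),
    }


schedule = [backoff_seconds(n) for n in (1, 2, 3)]
print(schedule)
```

A subscriber can use x-relay-event-id to make its own handler idempotent, since delivery is at-least-once.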
- Single-table DynamoDB: the SRC#<id> partition holds the source META, its events (EVT#<ts>#<id>, time-ordered), DEDUP#<key> idempotency markers (24h TTL), SUB#<id> subscriptions, DLV#<event_id>#<sub_id> delivery records, and MET#<day>#<type> counters.
- Exactly-once ingest: a DynamoDB transaction writes the event and a conditional dedup marker atomically; a sender retry cancels the transaction (checked against CancellationReasons, not blindly) and returns the original event id.
- At-least-once delivery, no cross-talk: a failed delivery makes the worker raise, so SQS redrives the whole message with exponential backoff; DLV records mark which subscriptions already succeeded, so retries skip them and one flaky endpoint never causes duplicate deliveries to healthy ones. After 3 attempts the message parks in the DLQ, the depth alarm fires, and the replay API recovers it.
- Replay bypasses the topic: replays go straight onto the deliveries queue, not SNS, so re-delivering an event never re-archives it or double-counts metrics.
- Secrets: source/subscription signing secrets are returned only at creation and never listed. Stored in DynamoDB for HMAC use; production would envelope-encrypt them with KMS (skipped here: customer-managed keys cost $1/mo each).
- Cost discipline: everything is pay-per-request, Lambdas stay out of any VPC (no NAT Gateway), log retention is 7 days, state bucket is the only persistent resource.
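
The ingest transaction described in the list above might look roughly like this. It is a sketch under assumptions: the attribute names pk, sk, and ttl, the helper names, and the item layout are hypothetical. Only the key prefixes, the 24h TTL, and the CancellationReasons check come from the notes.

```python
DEDUP_TTL_S = 24 * 60 * 60  # 24h TTL on idempotency markers


def ingest_transaction(table, source_id, event_id, ts, dedup_key, payload, now):
    """Build TransactItems: a conditional dedup marker plus the event itself."""
    pk = {"S": f"SRC#{source_id}"}
    return [
        {   # index 0: fails the whole transaction if the marker already exists
            "Put": {
                "TableName": table,
                "Item": {
                    "pk": pk,
                    "sk": {"S": f"DEDUP#{dedup_key}"},
                    "event_id": {"S": event_id},
                    "ttl": {"N": str(now + DEDUP_TTL_S)},
                },
                "ConditionExpression": "attribute_not_exists(pk)",
            }
        },
        {   # index 1: the event record, time-ordered by sort key
            "Put": {
                "TableName": table,
                "Item": {
                    "pk": pk,
                    "sk": {"S": f"EVT#{ts}#{event_id}"},
                    "payload": {"S": payload},
                },
            }
        },
    ]


def is_duplicate(error_response, dedup_index=0):
    """True only if the dedup marker's condition caused the cancellation."""
    if error_response.get("Error", {}).get("Code") != "TransactionCanceledException":
        return False
    reasons = error_response.get("CancellationReasons", [])
    return (
        len(reasons) > dedup_index
        and reasons[dedup_index].get("Code") == "ConditionalCheckFailed"
    )


items = ingest_transaction("relay", "src1", "evt1", "2024-01-01T00:00:00Z", "k1", "{}", 1000)
print(items[1]["Put"]["Item"]["sk"]["S"])
```

With boto3, the list would be passed as client.transact_write_items(TransactItems=items), and is_duplicate would receive e.response from the caught ClientError. Checking the reason code at the marker's index is what separates a genuine sender retry from an unrelated cancellation such as a transaction conflict.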
- make venv && make test: unit tests for the storage layer against an in-memory DynamoDB (moto); no AWS account needed, runs in ~1s.
- make smoke: 13-step end-to-end test against a deployed stack, covering signed ingest, dedupe, fanout to all three consumers, filtering, and replay.
- Core: sources, signed ingest, idempotent event store
- Subscriptions + SNS→SQS fanout (deliveries/audit/metrics) + delivery worker + DLQs + demo receiver
- Exponential backoff, DLQ alarms → email, replay endpoints, CloudWatch dashboard, moto unit tests
- GitHub Actions CI/CD via OIDC (plan on PR, apply on merge, smoke test post-deploy)