The LLM Gateway is a LiteLLM proxy deployed on ECS Fargate that sits between the Strands agents (on AgentCore) and Amazon Bedrock. It provides centralized model routing, automatic fallback, rate limiting, and spend tracking for all five agents (Supervisor, Chunking, Extraction, Mapping, Validation).
Agents call the gateway using the OpenAI-compatible API at https://llm-gateway.document-processing.internal/v1/chat/completions (via CloudFront VPC Origin). The gateway translates requests to Bedrock and tracks usage per agent via virtual API keys.
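Because the gateway speaks the OpenAI chat-completions protocol, any HTTP client can call it. A minimal stdlib-only sketch (the virtual key name is a placeholder; real keys are issued via the LiteLLM Admin UI):

```python
# Sketch: calling the gateway's OpenAI-compatible endpoint with stdlib only.
# The virtual key value below is a placeholder, not a real credential.
import json
import urllib.request

GATEWAY_URL = "https://llm-gateway.document-processing.internal/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Payload for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def call_gateway(payload: dict, virtual_key: str) -> dict:
    """POST the payload with the agent's virtual API key."""
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {virtual_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires network access to the gateway):
# call_gateway(build_request("claude-fast", "ping"), "sk-agent-key")
```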
┌──────────────────────────────────────────┐
│ AgentCore Runtime (VPC) │
│ │
│ Supervisor Chunking Extraction(n) │
│ Mapping Validation │
└──────────────┬────────────────────────────┘
│ OpenAI-compatible API
▼
┌──────────────────────────────────────────┐
│ Internal ALB (port 80) │
└──────────────┬────────────────────────────┘
│
┌──────────────▼────────────────────────────┐
│ ECS Fargate Service │
│ ┌─────────────────────────────┐ │
│ │ LiteLLM Proxy (port 4000) │ │
│ │ Admin UI (/ui) │ │
│ │ Health App (port 8001) │ │
│ └─────────────────────────────┘ │
│ 1 vCPU / 2 GB / ARM64 (Graviton) │
│ Auto-scale: 1 → 4 tasks │
└──────┬───────────────────┬────────────────┘
│ │
┌────────────▼──────┐ ┌─────────▼──────────────┐
│ Aurora Serverless │ │ Amazon Bedrock │
│ v2 PostgreSQL │ │ - Claude Sonnet 4.5 │
│ (LiteLLM spend │ │ - Claude Haiku 4.5 │
│ tracking only) │ │ │
└───────────────────┘ └────────────────────────┘
| Alias | Bedrock Model | Use Case |
|---|---|---|
| `claude-primary` | `us.anthropic.claude-sonnet-4-5-20251101-v1:0` | Supervisor, Extraction, Mapping agents — complex reasoning |
| `claude-fallback` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` | Automatic fallback on Sonnet throttling/errors |
| `claude-fast` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` | Chunking, Validation agents — simpler tasks, lower cost |
The `us.` prefix enables Bedrock cross-region inference routing for better availability.
`usage-based-routing` — distributes load across model endpoints by tracking active request count. This suits the parallel Extraction agent pattern, where multiple agents issue concurrent LLM calls.
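The concurrent-call pattern the router is tuned for can be sketched as a simple fan-out. The LLM call is injected as a coroutine (a hypothetical signature, not a LiteLLM API) so the sketch runs without a live gateway:

```python
# Sketch: Extraction agents fanning out concurrent LLM calls.
# `call_llm(model, chunk)` is an injected coroutine — a hypothetical
# signature standing in for the real gateway call.
import asyncio

async def extract_all(chunks: list[str], call_llm) -> list[str]:
    """Issue one LLM call per chunk concurrently; the gateway's
    usage-based router spreads these across model endpoints."""
    tasks = [call_llm("claude-primary", chunk) for chunk in chunks]
    return await asyncio.gather(*tasks)
```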
claude-primary → claude-fallback (on error/throttle)
claude-primary → claude-fallback (on context window exceeded)
| Model | TPM | RPM |
|---|---|---|
| `claude-primary` | 80,000 | 50 |
| `claude-fallback` | 80,000 | 100 |
| `claude-fast` | 80,000 | 100 |
Set conservatively below Bedrock's default quota (~100k TPM/region). When hit, LiteLLM returns 429 and the router automatically falls back.
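When every model in the fallback chain is throttled, the 429 reaches the caller, so agents may still want their own backoff. A generic sketch (the request function is injected, and `RateLimited` is an illustrative stand-in for an HTTP 429, not a LiteLLM type):

```python
# Sketch: client-side exponential backoff for gateway 429s.
# `RateLimited` is an illustrative stand-in for an HTTP 429 error.
import time

class RateLimited(Exception):
    """Raised by `send()` when the gateway returns 429."""

def call_with_backoff(send, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry `send()` on RateLimited with delays base, 2*base, 4*base, ..."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts — surface the 429
            time.sleep(base_delay * 2 ** attempt)
```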
LiteLLM configuration — model definitions, routing, fallback, rate limits, and general settings. All secrets referenced via `os.environ/VAR_NAME` (LiteLLM's own env var syntax).
model_list:
# Primary model — Claude Sonnet (complex reasoning tasks)
- model_name: claude-primary
litellm_params:
model: bedrock/us.anthropic.claude-sonnet-4-5-20251101-v1:0
aws_region_name: us-east-1
tpm: 80000
rpm: 50
# Fallback model — Claude Haiku (automatic fallback)
- model_name: claude-fallback
litellm_params:
model: bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
aws_region_name: us-east-1
tpm: 80000
rpm: 100
# Fast model — Claude Haiku (chunking, validation)
- model_name: claude-fast
litellm_params:
model: bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
aws_region_name: us-east-1
tpm: 80000
rpm: 100
router_settings:
routing_strategy: usage-based-routing
num_retries: 3
retry_after: 5
timeout: 600
fallbacks:
- claude-primary:
- claude-fallback
context_window_fallbacks:
- claude-primary:
- claude-fallback
allowed_fails: 2
cooldown_time: 60
general_settings:
master_key: os.environ/LITELLM_ADMIN_KEY
database_url: os.environ/DATABASE_URL
store_model_in_db: false
max_parallel_requests: 50
request_timeout: 600
litellm_settings:
drop_params: true
set_verbose: false
json_logs: true
num_retries: 3
  callbacks: callbacks.metrics_logger.MetricsLogger

Key settings explained:
- `drop_params: true` — silently drops OpenAI-only params that Bedrock doesn't support
- `store_model_in_db: false` — models defined in YAML only, not the DB
- `request_timeout: 600` — 10-minute timeout for long extraction tasks
- `json_logs: true` — structured logging for CloudWatch
- `callbacks` — custom `MetricsLogger` callback that emits structured JSON for CloudWatch metric filters
Custom LiteLLM callback that emits a structured JSON line to stdout after every LLM completion. These lines are captured by CloudWatch Logs and parsed by metric filters.
Emitted JSON format:
{
"litellm_metric": true,
"model": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
"provider_model": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
"prompt_tokens": 10164,
"completion_tokens": 32,
"total_tokens": 10196,
"response_cost": 0.0340692,
"response_time_ms": 1493,
"status_code": 200,
"timestamp": "2026-04-26T23:59:41.929045+00:00"
}

Key implementation details:
- Extends `litellm.integrations.custom_logger.CustomLogger`
- Implements both sync (`log_success_event`) and async (`async_log_success_event`) methods — the proxy uses async
- Configured in `litellm_config.yaml` via `litellm_settings.callbacks`
- Failure events also emitted with `status_code` and `error` fields
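The shape of the emitted line can be sketched with a small stdlib-only helper. This is not the actual `MetricsLogger` implementation — the real callback extends LiteLLM's `CustomLogger` and pulls these fields from the response object — but it shows how the record is assembled:

```python
# Sketch: assembling the metric line the callback writes to stdout.
# Stdlib-only; the real MetricsLogger extends litellm's CustomLogger.
import json
from datetime import datetime, timezone

def build_metric_line(model: str, usage: dict, cost: float,
                      elapsed_ms: int, status_code: int = 200) -> str:
    """One JSON line per completion; `litellm_metric: true` marks it
    for the CloudWatch metric filter patterns."""
    record = {
        "litellm_metric": True,
        "model": model,
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["prompt_tokens"] + usage["completion_tokens"],
        "response_cost": cost,
        "response_time_ms": elapsed_ms,
        "status_code": status_code,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)
```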
FROM ghcr.io/berriai/litellm:main-stable
WORKDIR /app
COPY config/litellm_config.yaml /app/config.yaml
COPY callbacks/ /app/callbacks/
EXPOSE 4000
CMD ["--config", "/app/config.yaml", \
"--num_workers", "2", \
     "--run_gunicorn"]

Notes:
- `main-stable` tag — avoids surprise breakage vs `main-latest`
- Config baked into image at `/app/config.yaml` — required for Fargate (no volume mounts)
- `callbacks/` directory copied to `/app/callbacks/` for the custom metrics logger
- 2 gunicorn workers — sufficient for 1 vCPU task
# LiteLLM Gateway — Environment Variables
# Copy to .env for local development. Never commit .env.
# Admin API key (must start with sk-)
LITELLM_ADMIN_KEY=sk-<GENERATE_A_RANDOM_KEY>
# Encryption key for stored credentials (set once, never change)
LITELLM_SALT_KEY=<GENERATE_A_32_CHAR_RANDOM_STRING>
# PostgreSQL connection string (LiteLLM internal DB)
DATABASE_URL=postgresql://litellm:<DB_PASSWORD>@localhost:5432/litellm
# Admin UI credentials (accessible at localhost:4000/ui during local development)
UI_USERNAME=admin
UI_PASSWORD=<CHOOSE_A_STRONG_PASSWORD>
# AWS region
AWS_REGION_NAME=us-east-1
# AWS credentials (local dev only — on ECS, use IAM task role)
# AWS_ACCESS_KEY_ID=
# AWS_SECRET_ACCESS_KEY=
# AWS_SESSION_TOKEN=

version: "3.9"
services:
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: litellm
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-changeme}
POSTGRES_DB: litellm
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U litellm"]
interval: 5s
timeout: 5s
retries: 5
litellm:
build:
context: .
dockerfile: Dockerfile
env_file: .env
ports:
- "4000:4000"
- "8001:8001"
depends_on:
postgres:
condition: service_healthy
environment:
DATABASE_URL: postgresql://litellm:${POSTGRES_PASSWORD:-changeme}@postgres:5432/litellm
SEPARATE_HEALTH_APP: "1"
volumes:
  postgres_data:

Note: The `DATABASE_URL` in the environment block overrides .env to use the Docker network hostname (`postgres` instead of `localhost`).
llm-gateway/infra/
├── bin/
│ └── app.ts # CDK App entry point
├── lib/
│ ├── constants.ts # Shared tags, region, app name
│ ├── database-stack.ts # Aurora Serverless v2, Secrets Manager
│ ├── gateway-stack.ts # ECS Fargate, ALB, IAM, auto-scaling
│ └── observability-stack.ts # CloudWatch metric filters + dashboard
├── cdk.json
├── package.json
└── tsconfig.json
VPC (from vpc/infra) → DatabaseStack → GatewayStack → ObservabilityStack
Exports: vpc, ecsSecurityGroup, albSecurityGroup, dbSecurityGroup
| Resource | Configuration |
|---|---|
| VPC | 2 AZs, 2 public subnets (ALB), 2 private subnets with NAT (ECS + Aurora) |
| `alb-sg` | Inbound: port 80 from VPC CIDR (internal ALB) |
| `ecs-sg` | Inbound: port 4000 + 8001 from `alb-sg` only |
| `db-sg` | Inbound: port 5432 from `ecs-sg` only |
Security group rules:
alb-sg ← VPC CIDR :80
ecs-sg ← alb-sg :4000, :8001
db-sg ← ecs-sg :5432
Exports: dbSecret, litellmAdminKeySecret, litellmSaltKeySecret, uiCredentialsSecret
| Resource | Configuration |
|---|---|
| Aurora Serverless v2 Cluster | PostgreSQL 16, 0.5–1 ACU (dev), private subnets, db-sg |
| DB credentials secret | Auto-generated by CDK (Credentials.fromGeneratedSecret) |
| `DATABASE_URL` secret | Assembled by CDK custom resource: `postgresql://<user>:<password>@<host>:5432/litellm` |
| `litellmAdminKeySecret` | Auto-generated 32-char key (prefixed `sk-`) |
| `litellmSaltKeySecret` | Auto-generated 32-char key |
| `uiCredentialsSecret` | Username `admin`, auto-generated password |
Aurora Serverless v2 settings:
- Min capacity: 0.5 ACU (~$0.06/hour when near-idle)
- Max capacity: 1 ACU for dev (scale up for prod)
- Auto-pause: Not available on v2 but near-zero scaling achieves similar cost savings
- Encryption at rest: enabled (AWS-managed key)
- Deletion protection: off for dev
- Database name:
litellm
Exports: serviceUrl (ALB DNS name)
- Container Insights enabled for CloudWatch metrics
| Setting | Value |
|---|---|
| CPU | 1024 (1 vCPU) |
| Memory | 2048 MB |
| Architecture | ARM64 (Graviton — cheaper) |
| Task Role | Bedrock InvokeModel + InvokeModelWithResponseStream |
bedrock:InvokeModel
bedrock:InvokeModelWithResponseStream
→ arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*
→ arn:aws:bedrock:us-east-1:*:inference-profile/us.anthropic.claude-*
Plus secretsmanager:GetSecretValue for all four secrets (granted via CDK .grantRead()).
| Setting | Value |
|---|---|
| Image | Built from llm-gateway/Dockerfile, pushed to ECR by CDK |
| Port mappings | 4000 (proxy + UI), 8001 (health) |
| Environment | AWS_REGION_NAME=us-east-1, SEPARATE_HEALTH_APP=1 |
| Secrets (from Secrets Manager) | LITELLM_ADMIN_KEY, LITELLM_SALT_KEY, DATABASE_URL, UI_USERNAME, UI_PASSWORD |
| Health check | curl -f http://localhost:8001/health, interval 30s, start period 60s |
| Logging | CloudWatch log group /document-processing/llm-gateway, 2-week retention |
Start period of 60s is critical — LiteLLM runs Prisma DB migrations on first start.
| Setting | Value |
|---|---|
| Desired count | 1 (dev) |
| Subnets | Private with NAT egress |
| Circuit breaker | Enabled with rollback |
| ECS Exec | Enabled (for debugging via aws ecs execute-command) |
| Setting | Value |
|---|---|
| Min tasks | 1 |
| Max tasks | 4 |
| Scale-out trigger | CPU > 70%, cooldown 60s |
| Scale-in cooldown | 300s |
| Setting | Value |
|---|---|
| Scheme | Internal (not internet-facing) |
| Listener | Port 80 → target group port 4000 |
| Health check | Port 8001, path /health, healthy codes 200 |
| Deregistration delay | 30s |
Depends on: GatewayStack (receives logGroup, ecsClusterName, ecsServiceName, albFullName)
Creates 8 CloudWatch metric filters on the gateway log group and a comprehensive dashboard. See Section 8 for details.
Deployed independently — can be torn down and redeployed without affecting the running gateway.
const app = new cdk.App()
VPC + Security Groups (imported from vpc/infra exports)
│
DatabaseStack
└─ exports secrets
│
GatewayStack
└─ ECS service, ALB, IAM
└─ exports logGroup, clusterName, serviceName, albFullName
│
ObservabilityStack
└─ 8 metric filters, CloudWatch dashboard
Tags.of(app).add('env', 'dev')
Tags.of(app).add('application', 'document-processing')
Agents on AgentCore connect to the gateway via the internal ALB DNS name. Recommended: create a Route 53 private hosted zone record:
llm-gateway.document-processing.internal → ALB DNS
Agent configuration (in Strands agent setup):
OPENAI_BASE_URL=https://llm-gateway.document-processing.internal/v1

Each agent is assigned a virtual API key via the LiteLLM Admin UI for per-agent spend tracking.
cd llm-gateway
cp .env.example .env
# Edit .env with AWS credentials for local Bedrock access
docker-compose up --build

Verify:
- `curl http://localhost:8001/health` → 200
- `http://localhost:4000/ui` → Admin UI loads
- Test chat completion:
  curl -X POST http://localhost:4000/v1/chat/completions -H "Authorization: Bearer <YOUR_ADMIN_KEY>" -H "Content-Type: application/json" -d '{"model":"claude-fast","messages":[{"role":"user","content":"hello"}]}'
cd llm-gateway/infra
npm install
npx cdk bootstrap
npx cdk deploy --all --require-approval broadening

Deploy order (handled automatically by CDK dependency chain):
1. `LlmGatewayNetworkingStack` — VPC, subnets, security groups
2. `LlmGatewayDatabaseStack` — Aurora Serverless v2, secrets (~5 min)
3. `LlmGatewayStack` — ECR image build/push, ECS service, ALB (~5 min)
# Check ECS service health
aws ecs describe-services --cluster llm-gateway --services llm-gateway-service --region us-east-1
# Get ALB DNS from stack output
aws cloudformation describe-stacks --stack-name LlmGatewayStack --query "Stacks[0].Outputs"
# Test via SSM port forward (if no VPN)
aws ssm start-session --target <ecs-task-id> --document-name AWS-StartPortForwardingSession --parameters portNumber=4000,localPortNumber=4000
# Health check
curl http://localhost:4000/health
# Bedrock connectivity test
curl -X POST http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer <admin-key-from-secrets-manager>" \
-H "Content-Type: application/json" \
  -d '{"model":"claude-fast","messages":[{"role":"user","content":"ping"}]}'

| Signal | Source | Destination |
|---|---|---|
| LLM request metrics (tokens, cost, latency) | Custom callback → CloudWatch Logs | CloudWatch Metrics (DocumentProcessing/LlmGateway) |
| LLM spend per agent/model | LiteLLM SpendLogs (Aurora) | LiteLLM Admin UI (/ui) |
| Container CPU/Memory | ECS Container Insights | CloudWatch Dashboard |
| Request count, latency, 5xx | ALB metrics (AWS/ApplicationELB) | CloudWatch Dashboard |
| Application logs (JSON) | LiteLLM stdout | CloudWatch Logs (/document-processing/llm-gateway) |
The MetricsLogger callback (callbacks/metrics_logger.py) emits a structured JSON line to stdout for every LLM completion. The litellm_metric: true field acts as a marker for metric filter patterns.
Configured in litellm_config.yaml:
litellm_settings:
  callbacks: callbacks.metrics_logger.MetricsLogger

All metrics are in the `DocumentProcessing/LlmGateway` namespace with a `Model` dimension. Filter patterns use `$.litellm_metric IS TRUE` as the base selector.
| Metric | Filter | Value | Unit |
|---|---|---|---|
| `RequestCount` | `litellm_metric IS TRUE` | 1 | Count |
| `InputTokens` | `litellm_metric IS TRUE AND status_code = 200` | `$.prompt_tokens` | Count |
| `OutputTokens` | `litellm_metric IS TRUE AND status_code = 200` | `$.completion_tokens` | Count |
| `TotalTokens` | `litellm_metric IS TRUE AND status_code = 200` | `$.total_tokens` | Count |
| `ResponseCostUSD` | `litellm_metric IS TRUE AND status_code = 200` | `$.response_cost` | None |
| `ResponseTimeMs` | `litellm_metric IS TRUE` | `$.response_time_ms` | Milliseconds |
| `RequestErrors` | `litellm_metric IS TRUE AND status_code >= 400` | 1 | Count |
| `ApplicationErrors` | `level = "ERROR"` | 1 | Count |
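In CloudWatch Logs JSON filter-pattern syntax, a success-only filter like the `InputTokens` one would look roughly like this (a sketch of the pattern shape; the exact patterns live in the ObservabilityStack):

```
{ $.litellm_metric IS TRUE && $.status_code = 200 }
```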
Dashboard name: LlmGateway-document-processing
| Row | Left Widget | Right Widget |
|---|---|---|
| KPIs | Total Requests (math: sonnet + haiku) | Input Tokens, Output Tokens, Est. Cost |
| Traffic | Request Rate by Model | Error Rate |
| Latency | Avg Response Time by Model | Latency Percentiles p50/p95/p99 (Logs Insights) |
| Tokens | Avg Tokens per Request by Model | Total Tokens Over Time (stacked) |
| Cost | Cost per Request by Model | Cost Breakdown Table (Logs Insights) |
| ECS | CPU Utilization % | Memory Utilization % |
| ALB | Request Count & 5xx Errors | Target Response Time (avg + p99) |
Console URL: https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards/dashboard/LlmGateway-document-processing
| Concern | Mitigation |
|---|---|
| Bedrock credentials | IAM Task Role — no hardcoded keys, boto3 auto-discovers credentials |
| Secrets | Stored in Secrets Manager, injected as ECS secrets at task launch |
| Network access | Internal ALB — not internet-facing, only reachable within VPC |
| Admin UI access | Protected by UI_USERNAME / UI_PASSWORD; external access via SSM port forward only |
| DB access | Aurora in private subnets, db-sg allows only ECS task traffic on port 5432 |
| API authentication | All proxy requests require Authorization: Bearer <virtual-key> header |
| Resource | Estimated Monthly Cost |
|---|---|
| ECS Fargate (1 task, 1 vCPU / 2 GB, ARM64) | ~$30 |
| Aurora Serverless v2 (0.5 ACU avg) | ~$44 |
| ALB (internal, low traffic) | ~$18 |
| NAT Gateway (1 AZ) | ~$32 |
| CloudWatch Logs | ~$2 |
| Total | ~$126/month |
Bedrock model costs are usage-based and billed separately per token.