A Kubernetes-native, AWS-native platform-of-platforms. Each team's agent workloads are declared as a Tenant CR; the operator provisions the per-tenant IRSA, KMS grants, S3 prefixes, agentgateway routes, kagent runtime, KEDA scaling, budget kill-switch, and Argo-Workflows eval pipeline. Eight personas (sales-ops, support, finance, ops, founder, eng, marketing, legal) are first-class users with their own onboarding playbooks + agentctl scaffolding.
AI clients / agents start here: AGENTS.md. For the stack-wide view, see the Platform Reference.
Bedrock for model access, kagent for the agent runtime, agentgateway for the model/tool data plane, DRA for accelerator scheduling.
Sits on top of landing-zone (Terragrunt org/account/cluster scaffolding) and eks-gitops (general-purpose ArgoCD addons).
| Persona | Start here |
|---|---|
| You're an engineer onboarding a team | docs/onboarding/eng.md |
| You're a non-eng team lead | pick your role in docs/onboarding/ — playbooks for sales-ops / support / finance / ops / founder / marketing / legal |
| You're SRE on-call | docs/runbooks/ — alert-driven + scenario-driven |
| You want the architecture | docs/architecture/ — overview + flow diagrams + multi-cluster, plus docs/adr/ |
| You're picking apart the CRDs | browsable index at docs/crd-reference/ (regenerated from godoc on every make manifests) |
| You want to see the model in action | examples/blank-tenant/ — minimum-viable Platform CR set + smoke-test eval |
| Layer | What's in it |
|---|---|
| terraform/ | OpenTofu/Terragrunt components: bedrock (invocation logging + Guardrails), model-artifacts (S3 + KMS), agent-iam (operator IRSA + tenant role factory), agent-egress (PrivateLink + WAF), accelerator-pools (NVIDIA + Neuron), kill-switch (EventBridge + Step Functions), cost-pipeline (CUR + Athena + Glue Crawler + invocation-cost-publisher Lambda), eval-runtime (eval-runner IRSA + Workflow infra). |
| operators/ | Go (kubebuilder v4): one binary, six reconcilers (tenant, platform, gateway, runtime, budget, eval), per-reconciler leader election. Owns per-tenant AWS state via in-cluster IRSA. Also ships agentctl CLI. |
| charts/ | Helm: operator (CRDs + Deployment + RBAC + cert-manager-issued webhook cert), bedrock-egress, tenant (opinionated Platform CR scaffold). |
| gitops/ | ArgoCD ApplicationSets layered on top of eks-gitops: agentgateway, kagent, KEDA, Argo Workflows + Rollouts, GPU operator, Neuron device plugin, DRA driver, eval-runtime kustomize, operator chart. Per-environment values (dev/staging/production). PrometheusRule + AlertmanagerConfig (operator-slo). Grafana dashboards. |
| examples/ | blank-tenant (smoke-test single-agent Platform), agent-fleet (KEDA + ToolServer snippet), bedrock-rag (RAG snippet). |
| docs/ | onboarding/ (per-persona playbooks), runbooks/ (alert + scenario playbooks), architecture/ (overview + flow diagrams + multi-cluster), adr/ (Architecture Decision Records), crd-reference/ (CRD index). |
All under agents.stxkxs.io/v1alpha1. Composed on top of kagent's Agent/ModelConfig/ToolServer.
| Kind | Scope | Owns |
|---|---|---|
| Tenant | Cluster | Aggregate budget + readiness + suspension across a tenant's Platforms |
| Platform | Namespaced | Tenant workload namespace, IRSA role, KMS grant, S3 bucket policy, ArgoCD AppProject |
| ModelGateway | Namespaced | agentgateway Route per ModelRoute (Bedrock backend + Guardrail attachment) |
| AgentFleet | Namespaced | kagent Agent + ModelConfig per agent, KEDA ScaledObject (SQS or CPU), NetworkPolicy |
| BudgetPolicy | Namespaced | Hourly Athena CUR aggregation + CloudWatch in-flight estimate; kill-switch event at 120% |
| EvalSuite | Namespaced | Argo Workflow/CronWorkflow against the fleet; status writeback by the runner template |
# Prereqs: tofu >=1.11, terragrunt, kubectl, helm, argocd CLI, pnpm >=11, go >=1.24
git clone git@github.com:nanohype/eks-agent-platform.git
cd eks-agent-platform
pnpm install
task --list
# Validate everything locally
task ci
# Substrate (per environment)
task tofu:apply ENVIRONMENT=dev COMPONENT=bedrock
task tofu:apply ENVIRONMENT=dev COMPONENT=agent-iam
task tofu:apply ENVIRONMENT=dev COMPONENT=model-artifacts
task tofu:apply ENVIRONMENT=dev COMPONENT=cost-pipeline
task tofu:apply ENVIRONMENT=dev COMPONENT=kill-switch
task tofu:apply ENVIRONMENT=dev COMPONENT=eval-runtime
# Cluster-side (point eks-gitops at gitops/applicationsets/)
task gitops:validate
# Onboard a tenant (persona-flexed scaffolding)
agentctl tenant init my-team --persona support --slack '#my-team' \
| kubectl apply -f -
agentctl tenant get my-team

The operator chart is pulled from oci://ghcr.io/nanohype/eks-agent-platform/charts/operator. On a fresh fork the OCI registry is empty until you cut the first charts-v* release tag. Until then:
helm install operator ./charts/operator \
-n eks-agent-platform --create-namespace \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="$(aws ssm get-parameter --name /eks-agent-platform/dev/agent-iam/operator_role_arn --query Parameter.Value --output text)" \
--set config.environment=dev

Or cut a release: git tag charts-v0.1.0 && git push origin charts-v0.1.0 (triggers .github/workflows/release.yaml).
- BudgetReconciler ticks hourly, queries the CUR Athena table + CloudWatch in-flight metric, computes percent-of-budget.
- At ≥ 120% with KillSwitchEnabled: true, the reconciler publishes a BudgetBreach event to the kill-switch EventBridge bus.
- The kill-switch Step Functions state machine detaches the baseline policy from the tenant IRSA role AND tags the role with agents.stxkxs.io/suspended=true.
- On its next reconcile (≤60s), the operator's PlatformReconciler sees the suspension tag, sets Platform.status.phase = Suspended, and AgentFleetReconciler tears down kagent Agents + KEDA ScaledObject so no pods can serve traffic.
- Slack #incidents + PagerDuty fire (PlatformSuspended alert from operator-slo).
- Recovery: ops removes the IAM tag; next reconcile sees the cleared tag, reattaches the baseline, fleet scales back up. No CR mutation required.
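The budget check in the first two steps can be sketched in Go as below. This is an illustration under stated assumptions, not the operator's actual code: the `budgetInput` type, its field names, and both functions are hypothetical, and adding the CloudWatch in-flight estimate to the CUR total is one plausible reading of "percent-of-budget". Amounts are held in integer cents so the 120% comparison is exact.

```go
package main

import (
	"errors"
	"fmt"
)

// killSwitchThresholdPercent is the 120% trip point from the flow above.
const killSwitchThresholdPercent = 120

// budgetInput is a hypothetical shape for one hourly evaluation.
type budgetInput struct {
	SettledCents      int64 // spend aggregated from the CUR Athena table
	InFlightCents     int64 // CloudWatch estimate of spend not yet in CUR
	LimitCents        int64 // budget ceiling for the period
	KillSwitchEnabled bool
}

// percentOfBudget returns settled plus in-flight spend as a percentage of
// the limit. Summing the two sources is an assumption of this sketch.
func percentOfBudget(in budgetInput) (float64, error) {
	if in.LimitCents <= 0 {
		return 0, errors.New("budget limit must be positive")
	}
	spend := in.SettledCents + in.InFlightCents
	return float64(spend) * 100 / float64(in.LimitCents), nil
}

// shouldPublishBreach reports whether a BudgetBreach event would be sent:
// the kill switch is enabled and spend is at or above 120% of the limit.
// The comparison uses integer arithmetic to avoid rounding at the boundary.
func shouldPublishBreach(in budgetInput) (bool, error) {
	if in.LimitCents <= 0 {
		return false, errors.New("budget limit must be positive")
	}
	spend := in.SettledCents + in.InFlightCents
	over := spend*100 >= in.LimitCents*killSwitchThresholdPercent
	return in.KillSwitchEnabled && over, nil
}

func main() {
	in := budgetInput{
		SettledCents:      90000, // $900 settled
		InFlightCents:     30000, // $300 estimated in flight
		LimitCents:        100000,
		KillSwitchEnabled: true,
	}
	pct, _ := percentOfBudget(in)
	breach, _ := shouldPublishBreach(in)
	// prints: 120.0% of budget, publish BudgetBreach: true
	fmt.Printf("%.1f%% of budget, publish BudgetBreach: %t\n", pct, breach)
}
```

With KillSwitchEnabled set to false the same spend produces no event, which matches the condition stated in the flow.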
Full sequence + recovery in docs/runbooks/platform-suspended.md. Threat model: docs/adr/0003-threat-model.md.
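Suspension and recovery both hinge on a single IAM role tag, so the reconcile decision reduces to a pure function of the role's tags. The Go sketch below illustrates that idea only. The `desiredState` type, the `reconcileSuspension` function, and the `Ready` phase name are hypothetical; only the tag key and the `Suspended` phase come from the flow described above.

```go
package main

import "fmt"

// suspendedTagKey is the IAM role tag set by the kill-switch state machine.
const suspendedTagKey = "agents.stxkxs.io/suspended"

const (
	phaseSuspended = "Suspended" // phase named in the kill-switch flow
	phaseReady     = "Ready"     // hypothetical name for the healthy phase
)

// desiredState is a hypothetical summary of what the reconcilers converge on.
type desiredState struct {
	Phase          string
	AttachBaseline bool // baseline policy attached to the tenant IRSA role
	RunFleet       bool // kagent Agents + KEDA ScaledObject present
}

// isSuspended reports whether the role carries suspended=true.
// Reading from a nil map is safe in Go and yields the empty string.
func isSuspended(roleTags map[string]string) bool {
	return roleTags[suspendedTagKey] == "true"
}

// reconcileSuspension derives the desired state from the role tags alone,
// which is why recovery needs no CR mutation: clearing the tag is enough.
func reconcileSuspension(roleTags map[string]string) desiredState {
	if isSuspended(roleTags) {
		return desiredState{Phase: phaseSuspended}
	}
	return desiredState{Phase: phaseReady, AttachBaseline: true, RunFleet: true}
}

func main() {
	tags := map[string]string{suspendedTagKey: "true"}
	fmt.Printf("%+v\n", reconcileSuspension(tags))

	delete(tags, suspendedTagKey) // ops removes the IAM tag
	fmt.Printf("%+v\n", reconcileSuspension(tags))
}
```

Because the function depends only on the tag, repeated reconciles are idempotent: the same tags always yield the same desired state.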
This repo deliberately does not own:
- Org, account, network, EKS cluster, baseline IAM → landing-zone
- General-purpose cluster addons (cert-manager, cilium, kyverno, observability stack) → eks-gitops
- Cluster bootstrap (ArgoCD install, app-of-apps wiring) → aws-eks (CDK)
This repo does own everything between "an empty EKS cluster with ArgoCD running" and "non-technical teams can onboard a tenant with agentctl tenant init".
Conventional commits enforced via commitlint. task ci runs the full lint + test matrix locally. See CONTRIBUTING.md.
Apache-2.0.