The reliability layer your AI agents are missing.
Your agent is processing 1,000 customer records. It reaches record 847 — and the process dies.
Without Aetheris: start over from record 1. Re-run 847 LLM calls. Pay twice. Pray nothing was written twice.
With Aetheris: restart. It resumes from record 847. Zero duplicates. Zero data loss.
Every production AI agent eventually hits the same three walls:
| Failure mode | What happens today |
|---|---|
| Process crash mid-task | Restart from the beginning; re-run all LLM calls |
| Retry after tool failure | Email sent twice, order created twice, payment charged twice |
| "Why did the AI do that?" | No visibility, no audit trail, no replay |
Aetheris is an open-source runtime that solves all three — without requiring you to rewrite your agent.
Requirements: Go 1.26.1+, Git
```bash
git clone https://github.com/Colin4k1024/Aetheris.git
cd Aetheris
make run-embedded                       # starts with embedded SQLite, no external services
curl http://localhost:8080/api/health   # {"status":"ok", ...}
```

From Python (`pip install aetheris`):
```python
from aetheris import AetherisClient

client = AetherisClient("http://localhost:8080")
job = client.run("my-agent", "Summarize the Q3 earnings report")
result = job.wait()
print(result.output)
```

From any language: Aetheris exposes a REST API. Wrap your existing agent with two config lines:
```yaml
# configs/api.embedded.yaml
agents:
  agents:
    my_python_agent:
      type: "external_http"
      external:
        url: "http://localhost:9000/invoke"
        timeout: "120s"
```

Then submit a job:
```bash
curl -X POST http://localhost:8080/api/agents/my_python_agent/message \
  -H "Idempotency-Key: task-001" \
  -H "Content-Type: application/json" \
  -d '{"message": "Process customer batch #42"}'
```

Every job step is checkpointed. If the worker dies, the next worker picks up from the last checkpoint, not the beginning.
```
Job progress:  ████████████████████░░░░░░░░░░  (step 16/25)
Worker crash!  💀
Restart:       ████████████████████            (resumes at step 16)
```
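The resume behavior can be modeled in a few lines. This is a minimal sketch, not Aetheris's implementation: the `run_job` function, the `CrashError` exception, and the JSON checkpoint file are all hypothetical names and formats chosen for illustration.

```python
import json
import os
import tempfile


class CrashError(RuntimeError):
    """Simulated worker crash (hypothetical, for illustration only)."""


def run_job(steps, checkpoint_path, crash_at=None):
    """Run zero-argument callables in order, checkpointing after each one."""
    start, results = 0, []
    # A previous worker may have left a checkpoint behind: resume from it.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        start, results = state["next_step"], state["results"]

    for i in range(start, len(steps)):
        if crash_at is not None and i == crash_at:
            raise CrashError(f"worker died before step {i}")
        results.append(steps[i]())
        # Persist progress so a restart skips every completed step.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_step": i + 1, "results": results}, f)
    return results


calls = []  # records which steps actually executed


def make_step(n):
    def step():
        calls.append(n)
        return n * n
    return step


steps = [make_step(n) for n in range(25)]
checkpoint = os.path.join(tempfile.mkdtemp(), "job.json")

try:
    run_job(steps, checkpoint, crash_at=16)  # first worker dies at step 16
except CrashError:
    pass

final = run_job(steps, checkpoint)  # second worker resumes at step 16
```

After both runs, `calls` contains each step index once: the second worker starts at step 16 rather than step 0. Note that this toy version can still repeat a step if the crash lands between running the step and writing the checkpoint, which is the gap a ledger for side effects is meant to close.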
External API calls (payments, emails, order creation) are wrapped in an invocation ledger. Even if a step is retried, each side effect executes at most once: the ledger returns the recorded result instead of repeating the call.
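As a rough illustration of the ledger idea, the sketch below keys each side effect by an idempotency key and returns the stored result on retry. The `InvocationLedger` class, its `invoke` method, and the key format are assumptions made for this example; a real ledger would be durable rather than an in-memory dict.

```python
class InvocationLedger:
    """In-memory stand-in for a durable ledger (hypothetical class)."""

    def __init__(self):
        self._results = {}

    def invoke(self, key, fn, *args, **kwargs):
        # If this key already completed, return the recorded result
        # instead of running the side effect again.
        if key in self._results:
            return self._results[key]
        result = fn(*args, **kwargs)
        self._results[key] = result
        return result


sent = []  # stands in for the outside world


def send_email(to, body):
    sent.append((to, body))
    return {"status": "sent", "to": to}


ledger = InvocationLedger()
key = "job-1:step-3:send_email"  # illustrative key format
first = ledger.invoke(key, send_email, "a@example.com", "hi")
retry = ledger.invoke(key, send_email, "a@example.com", "hi")
```

Here the retry returns the same result as the first call, and `sent` holds a single email.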
```
# Without Aetheris: retry → email sent twice
# With Aetheris:    retry → ledger returns cached result, email sent once
```

Every LLM call, tool invocation, and checkpoint is appended to an immutable event log. You can replay any job from any point without re-calling LLMs or external APIs.
```bash
aetheris trace <job-id>    # view the full decision timeline
aetheris replay <job-id>   # replay without side effects
```

Aetheris works with any agent, in any language. You don't need to change your agent code.
For split API/Worker deployments, load the same external_http agent definition into both processes so the API can accept /api/agents/:id/message and the Worker can execute the job.
```python
# Your existing LangChain agent, unchanged
from langchain_openai import ChatOpenAI
from langchain.agents import create_react_agent

agent = create_react_agent(ChatOpenAI(), tools, prompt)  # tools and prompt: your existing definitions

# Expose it as an HTTP endpoint (one function)
from aetheris.integrations.langchain import serve

serve(agent, port=9000)  # Aetheris will call this endpoint durably
```

→ Full LangChain integration guide
```yaml
# Add to configs/api.embedded.yaml
agents:
  agents:
    my_agent:
      type: "external_http"
      external:
        url: "http://your-agent:9000/invoke"
```

Your agent receives a job envelope with `message`, `job_id`, and `idempotency_key`. It returns `{"answer": "...", "final": true}`.
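For a sense of what the agent side of this contract might look like, here is a minimal sketch of an `/invoke` endpoint using only the Python standard library. It assumes the three envelope fields arrive as top-level JSON keys in a POST body; the helper names `handle_envelope`, `InvokeHandler`, and `make_server` are hypothetical, and the echo logic is a placeholder for your real agent.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_envelope(envelope):
    """Turn a job envelope into the response body described above."""
    message = envelope.get("message", "")
    # Placeholder: replace this line with a call into your real agent.
    answer = f"echo: {message}"
    return {"answer": answer, "final": True}


class InvokeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/invoke":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            envelope = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        body = json.dumps(handle_envelope(envelope)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=9000):
    """Build (but do not start) a server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), InvokeHandler)
```

Calling `make_server().serve_forever()` would start listening on port 9000, matching the URL in the config above. A production agent would likely also use `job_id` and `idempotency_key` for logging and deduplication; this sketch ignores them.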
```yaml
# Built-in via AgentFactory, config-driven
# configs/agents.yaml
agents:
  my_eino_agent:
    type: "react"
    llm: "default"
    tools: ["web_search", "calculator"]
```

```
Your Agent (Python/JS/Go/any)
        │
        ▼
Aetheris API ──── idempotency key ──▶ Invocation Ledger
        │                               (at-most-once)
        ▼
Durable Worker ──── checkpoint ──────▶ Event Store
        │                               (crash recovery)
        ▼
Trace & Replay API ───────────────────────────────▶ Audit
```
The runtime is event-sourced: every state transition is an append-only event. This enables deterministic replay — the same job can be re-run at any time without re-calling LLMs or APIs.
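The sketch below illustrates why an append-only log makes replay possible without new LLM calls: the live run records every response, and the replay reads them back. The `EventLog` class, the `run` and `replay` functions, and the event shape are hypothetical and much simpler than a real event store.

```python
class EventLog:
    """Append-only list of events (hypothetical, in-memory)."""

    def __init__(self):
        self._events = []

    def append(self, kind, payload):
        self._events.append(
            {"seq": len(self._events), "kind": kind, "payload": payload}
        )

    def events(self):
        return tuple(self._events)  # callers cannot append through this view


def run(prompts, llm, log):
    """Live run: call the LLM and record every response as an event."""
    outputs = []
    for prompt in prompts:
        output = llm(prompt)
        log.append("llm_call", {"prompt": prompt, "output": output})
        outputs.append(output)
    return outputs


def replay(log):
    """Replay: rebuild outputs purely from recorded events, no LLM calls."""
    return [
        e["payload"]["output"] for e in log.events() if e["kind"] == "llm_call"
    ]


llm_calls = []  # tracks how often the fake model is invoked


def fake_llm(prompt):
    llm_calls.append(prompt)
    return prompt.upper()


log = EventLog()
live = run(["summarize q3", "list risks"], fake_llm, log)
replayed = replay(log)
```

The replayed outputs match the live ones, and `fake_llm` is invoked only during the live run.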
| | Aetheris | LangGraph Platform | Temporal |
|---|---|---|---|
| Open source + self-hosted | ✅ | ❌ (cloud only) | ✅ |
| No infrastructure for local dev | ✅ (embedded SQLite) | ❌ | ❌ (requires server) |
| At-most-once tool execution | ✅ built-in | ||
| Works with any agent framework | ✅ | ❌ LangGraph only | ❌ requires SDK |
| LLM decision audit trail | ✅ | ✅ | ❌ |
| Deterministic replay | ✅ | ❌ | ❌ |
See the current black-box adapter boundary in 2 minutes:
```bash
cd examples/crash_recovery
pip install aetheris
python demo.py
# Starts a local external_http demo agent and submits one durable batch job
```

The example shows durable submission and trace visibility around one external HTTP call. For true per-step checkpoint resume inside the work itself, use native Aetheris tools/workflows instead of a single `external_http` call.
| Path | Purpose |
|---|---|
| cmd/api | HTTP API service |
| cmd/worker | Background job worker |
| cmd/cli | CLI: aetheris trace/replay/jobs/chat |
| configs | Runtime configs (embedded, Docker, production) |
| examples | Working examples for each integration pattern |
| sdk/python | Python SDK (pip install aetheris) |
| docs | Guides, API reference, design notes |
| internal/agent | Core runtime engine |
| Goal | Link |
|---|---|
| Get started in 5 minutes | docs/guides/quickstart.md |
| Connect an existing HTTP agent | docs/adapters/external-http-agent.md |
| Connect a LangChain agent | docs/adapters/langchain.md |
| Understand crash recovery | docs/guides/runtime-guarantees.md |
| Deploy to production (Docker) | docs/guides/deployment.md |
| API reference | docs/reference/api.md |
Apache 2.0 — free to use, self-host, and modify.