initializ · initializ-mk · Jun 5, 2026 · Jun 5, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,40 @@
 
 ### Added
 
+- **Token usage and execution duration emission (issue #87, FWS-3).**
+  Every `llm_call` audit event now carries `input_tokens`,
+  `output_tokens`, `model`, `provider`, `duration_ms`, and `request_id`
+  captured directly from provider response metadata (Anthropic, OpenAI,
+  Ollama via the OpenAI-compatible path, OpenAI Responses). Field
+  naming aligns with OTel GenAI semantic conventions
+  (`gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`) so audit
+  consumers can correlate to OTel traces without a translation table.
+  When a provider returns no usage (some self-hosted Ollama setups),
+  the event flags `tokens_unavailable: true` rather than silent zeros.
+  Each `tool_exec` event gains `duration_ms` plus structured arg-shape
+  metadata (`args_size`, `result_size`) — raw arg values are not
+  emitted (payload stripping is FWS-8's concern). A new
+  `invocation_complete` event closes every A2A invocation with total
+  wall-clock duration and aggregated `input_tokens_total` /
+  `output_tokens_total` / `llm_call_count`. A2A responses now carry
+  the same totals inline as `X-Forge-Tokens-In`, `X-Forge-Tokens-Out`,
+  `X-Forge-Duration-Ms`, `X-Forge-Model`, `X-Forge-Provider` headers
+  so orchestrators can enforce cost ceilings during parallel workflow
+  execution without subscribing to the audit stream. Headers populate
+  regardless of OTel-tracing state. Cost calculation is deliberately
+  not in Forge — Forge emits tokens, the platform applies price tables.
+  The new emitters route through `AuditLogger.EmitFromContext` so
+  workflow-correlation fields (FWS-2) auto-tag every `llm_call` /
+  `tool_exec` / `invocation_complete` event when the inbound request
+  carried orchestrator headers. Schema additivity: existing audit
+  consumers reading the pre-FWS-3 shape continue to work unchanged. See
+  `docs/security/audit-logging.md#token-usage-and-execution-duration`.
+
+  Internal API change as part of this work: `llm.UsageInfo` field
+  names were renamed `PromptTokens` → `InputTokens` and
+  `CompletionTokens` → `OutputTokens` (JSON tags too) to align with
+  OTel GenAI semconv. The type is internal to `forge-core/llm` and not
+  consumed outside that package, so no external callers are affected.
 - **Workflow correlation ID threading (issue #86, FWS-2).** Forge agents
   now extract orchestration headers — `X-Workflow-ID`,
   `X-Workflow-Stage-ID`, `X-Workflow-Step-ID`, `X-Invocation-Caller` —

diff --git a/docs/security/audit-logging.md b/docs/security/audit-logging.md
@@ -17,7 +17,9 @@ All runtime security events are emitted as structured NDJSON to stderr with corr
 | `tool_exec` | Tool execution start/end (with tool name) |
 | `egress_allowed` | Outbound request allowed (with domain, mode) |
 | `egress_blocked` | Outbound request blocked (with domain, mode) |
-| `llm_call` | LLM API call completed (with token count) |
+| `llm_call` | LLM API call completed (with `input_tokens`, `output_tokens`, `model`, `provider`, `duration_ms`, `request_id`). See [Token usage and duration](#token-usage-and-execution-duration). |
+| `llm_call_cancelled` | Streaming LLM call cancelled mid-flight; carries partial token counts captured up to cancellation. |
+| `invocation_complete` | A2A invocation finished (auth → dispatch → engine → response). Carries `duration_ms` (wall-clock) plus aggregated `input_tokens_total` / `output_tokens_total` / `llm_call_count` / `model` / `provider`. |
 | `guardrail_check` | Guardrail evaluation result |
 | `auth_verify` | Inbound request authenticated successfully (with `provider`, `user_id`, `org_id`, `token_kind`) |
 | `auth_fail` | Inbound request rejected (with `reason`, `token_kind`) |
@@ -39,6 +41,53 @@ The `source` field distinguishes in-process enforcer events from subprocess prox
 
 When the inbound A2A request carries the orchestrator's correlation headers (`X-Workflow-ID`, `X-Workflow-Stage-ID`, `X-Workflow-Step-ID`, `X-Invocation-Caller`), every audit event emitted during that invocation is tagged with the matching `workflow_id` / `stage_id` / `step_id` / `invocation_caller` fields. Header names are vendor-neutral so any A2A-compatible orchestrator can populate them. Direct A2A invocations (no orchestrator) omit the fields entirely — emitted JSON is byte-identical to the pre-correlation shape. See [Workflow correlation IDs](workflow-correlation.md) for the full reference, including outbound propagation for agent-to-agent flows.
 
+### Token usage and execution duration
+
+Every `llm_call` audit event carries the normalized token counts the provider returned in its response metadata, plus the wall-clock time spent in the provider call. Field naming aligns with [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (`gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`) so audit consumers can correlate Forge audit events with OTel traces without a translation table.
+
+```json
+{
+  "ts": "2026-06-04T15:21:09Z",
+  "event": "llm_call",
+  "correlation_id": "9b3d…",
+  "task_id": "task-42",
+  "model": "claude-sonnet-4-6",
+  "provider": "anthropic",
+  "input_tokens": 1240,
+  "output_tokens": 387,
+  "duration_ms": 2150,
+  "request_id": "msg_01H8…"
+}
+```
+
+| Field | Source | Notes |
+|---|---|---|
+| `input_tokens` | Provider response usage | Maps to `gen_ai.usage.input_tokens` |
+| `output_tokens` | Provider response usage | Maps to `gen_ai.usage.output_tokens` |
+| `tokens_unavailable` | Audit emitter | `true` when both counts are zero — some self-hosted Ollama setups don't return usage; billing consumers must distinguish "not measured" from "zero tokens used" |
+| `model` | Runtime model config | The model identifier the executor was configured with |
+| `provider` | Runtime model config | One of `anthropic`, `openai`, `ollama`, `custom` |
+| `duration_ms` | Captured at call site | Wall-clock time spent in `client.Chat`, in milliseconds |
+| `request_id` | Provider response | Opaque provider call ID (Anthropic `id`, OpenAI `id`) — debug-correlation handle only, never used for billing |
+
+Each `tool_exec` event (phase=end) carries `duration_ms` for the tool execution plus structured arg-shape metadata (`args_size`, `result_size`) — raw arg values are deliberately not included (payload stripping is FWS-8's concern). One `invocation_complete` event closes each A2A invocation with the total wall-clock duration and aggregated token totals across all LLM calls in the invocation.
+
+Workflow correlation fields (`workflow_id` / `stage_id` / `step_id` / `invocation_caller` from FWS-2) also auto-tag every `llm_call` / `tool_exec` / `invocation_complete` event when the inbound request carried orchestrator headers — billing and audit consumers can attribute cost not just to a task but to a specific workflow run / stage / step.
+
+A2A response headers carry the same per-invocation totals inline so an orchestrator can ceiling-check cost during parallel workflow execution without subscribing to the audit stream:
+
+| Header | Value |
+|---|---|
+| `X-Forge-Tokens-In` | Sum of `input_tokens` across all LLM calls in the invocation |
+| `X-Forge-Tokens-Out` | Sum of `output_tokens` across all LLM calls in the invocation |
+| `X-Forge-Duration-Ms` | Wall-clock invocation duration (auth → dispatch → engine → response) |
+| `X-Forge-Model` | Most-recently-used model |
+| `X-Forge-Provider` | Most-recently-used provider |
+
+Headers populate regardless of whether OTel tracing is enabled — they're the orchestration channel, not the observability channel.
+
+**Cost calculation is deliberately not in Forge.** Forge emits token counts; the platform applies price tables to compute dollar amounts. Price tables change frequently and shouldn't require agent redeploys.
+
 ### Authentication events
 
 Every inbound request to `/tasks` emits exactly one of `auth_verify` or `auth_fail`.

diff --git a/forge-cli/runtime/forge_usage_headers.go b/forge-cli/runtime/forge_usage_headers.go
@@ -0,0 +1,44 @@
+package runtime
+
+import (
+	"net/http"
+	"strconv"
+
+	coreruntime "github.com/initializ/forge/forge-core/runtime"
+)
+
+// A2A response header names for per-invocation cost telemetry. These
+// are the inline channel for orchestrator real-time cost enforcement
+// during parallel workflow execution — the orchestrator can ceiling-check
+// against running totals before the next stage dispatches. They populate
+// regardless of whether OTel tracing is enabled. See issue #87 / FWS-3.
+const (
+	HeaderForgeTokensIn   = "X-Forge-Tokens-In"
+	HeaderForgeTokensOut  = "X-Forge-Tokens-Out"
+	HeaderForgeDurationMs = "X-Forge-Duration-Ms"
+	HeaderForgeModel      = "X-Forge-Model"
+	HeaderForgeProvider   = "X-Forge-Provider"
+)
+
+// applyForgeUsageHeaders stamps the X-Forge-* invocation-usage headers
+// onto the given http.Header from a usage snapshot. Headers are omitted
+// for snapshots with zero LLM calls (e.g. guardrail-failed invocations
+// that never reached the LLM) so the response shape mirrors what
+// actually happened.
+func applyForgeUsageHeaders(h http.Header, snap coreruntime.LLMUsageSnapshot) {
+	if snap.LLMCallCount == 0 {
+		// Still stamp duration so orchestrators always see a wall-clock
+		// figure even for short-circuited invocations.
+		h.Set(HeaderForgeDurationMs, strconv.FormatInt(snap.InvocationDuration.Milliseconds(), 10))
+		return
+	}
+	h.Set(HeaderForgeTokensIn, strconv.Itoa(snap.InputTokens))
+	h.Set(HeaderForgeTokensOut, strconv.Itoa(snap.OutputTokens))
+	h.Set(HeaderForgeDurationMs, strconv.FormatInt(snap.InvocationDuration.Milliseconds(), 10))
+	if snap.PrimaryModel != "" {
+		h.Set(HeaderForgeModel, snap.PrimaryModel)
+	}
+	if snap.PrimaryProvider != "" {
+		h.Set(HeaderForgeProvider, snap.PrimaryProvider)
+	}
+}
diff --git a/forge-cli/runtime/forge_usage_headers_test.go b/forge-cli/runtime/forge_usage_headers_test.go
@@ -0,0 +1,80 @@
+package runtime
+
+import (
+	"net/http"
+	"testing"
+	"time"
+
+	coreruntime "github.com/initializ/forge/forge-core/runtime"
+)
+
+// Regression tests for issue #87 / FWS-3 — X-Forge-* response header
+// emission. Headers are the inline channel for orchestrator real-time
+// cost enforcement; they populate regardless of whether OTel tracing
+// is enabled.
+
+func TestApplyForgeUsageHeaders_StampsAllFields(t *testing.T) {
+	h := http.Header{}
+	applyForgeUsageHeaders(h, coreruntime.LLMUsageSnapshot{
+		InputTokens:        450,
+		OutputTokens:       180,
+		InvocationDuration: 1234 * time.Millisecond,
+		PrimaryModel:       "claude-sonnet-4-6",
+		PrimaryProvider:    "anthropic",
+		LLMCallCount:       3,
+	})
+
+	if h.Get(HeaderForgeTokensIn) != "450" {
+		t.Errorf("X-Forge-Tokens-In = %q, want 450", h.Get(HeaderForgeTokensIn))
+	}
+	if h.Get(HeaderForgeTokensOut) != "180" {
+		t.Errorf("X-Forge-Tokens-Out = %q, want 180", h.Get(HeaderForgeTokensOut))
+	}
+	if h.Get(HeaderForgeDurationMs) != "1234" {
+		t.Errorf("X-Forge-Duration-Ms = %q, want 1234", h.Get(HeaderForgeDurationMs))
+	}
+	if h.Get(HeaderForgeModel) != "claude-sonnet-4-6" {
+		t.Errorf("X-Forge-Model = %q, want claude-sonnet-4-6", h.Get(HeaderForgeModel))
+	}
+	if h.Get(HeaderForgeProvider) != "anthropic" {
+		t.Errorf("X-Forge-Provider = %q, want anthropic", h.Get(HeaderForgeProvider))
+	}
+}
+
+func TestApplyForgeUsageHeaders_NoLLMCalls_StillStampsDuration(t *testing.T) {
+	// Short-circuited invocation (guardrail-failed before LLM dispatch):
+	// orchestrator still wants a wall-clock figure, but token fields
+	// would mislead — emit duration only.
+	h := http.Header{}
+	applyForgeUsageHeaders(h, coreruntime.LLMUsageSnapshot{
+		InvocationDuration: 5 * time.Millisecond,
+		LLMCallCount:       0,
+	})
+
+	if h.Get(HeaderForgeDurationMs) != "5" {
+		t.Errorf("X-Forge-Duration-Ms must still be stamped on short-circuited invocations, got %q", h.Get(HeaderForgeDurationMs))
+	}
+	if h.Get(HeaderForgeTokensIn) != "" || h.Get(HeaderForgeTokensOut) != "" {
+		t.Errorf("token headers must NOT be stamped when no LLM calls happened, got in=%q out=%q",
+			h.Get(HeaderForgeTokensIn), h.Get(HeaderForgeTokensOut))
+	}
+}
+
+func TestApplyForgeUsageHeaders_OmitsModelProviderWhenAbsent(t *testing.T) {
+	// Edge case: LLM call happened but provider/model were empty (no
+	// runtime attribution available). Stamp tokens + duration only —
+	// don't stamp empty model/provider values.
+	h := http.Header{}
+	applyForgeUsageHeaders(h, coreruntime.LLMUsageSnapshot{
+		InputTokens:        50,
+		OutputTokens:       25,
+		InvocationDuration: 100 * time.Millisecond,
+		LLMCallCount:       1,
+	})
+	if _, present := h[http.CanonicalHeaderKey(HeaderForgeModel)]; present {
+		t.Errorf("X-Forge-Model must be omitted when PrimaryModel is empty")
+	}
+	if _, present := h[http.CanonicalHeaderKey(HeaderForgeProvider)]; present {
+		t.Errorf("X-Forge-Provider must be omitted when PrimaryProvider is empty")
+	}
+}