Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 34 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,40 @@

### Added

- **Token usage and execution duration emission (issue #87, FWS-3).**
Every `llm_call` audit event now carries `input_tokens`,
`output_tokens`, `model`, `provider`, `duration_ms`, and `request_id`
captured directly from provider response metadata (Anthropic, OpenAI,
Ollama via the OpenAI-compatible path, OpenAI Responses). Field
naming aligns with OTel GenAI semantic conventions
(`gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`) so audit
consumers can correlate to OTel traces without a translation table.
When a provider returns no usage (some self-hosted Ollama setups),
the event flags `tokens_unavailable: true` rather than silent zeros.
Each `tool_exec` event gains `duration_ms` plus structured arg-shape
metadata (`args_size`, `result_size`) — raw arg values are not
emitted (payload stripping is FWS-8's concern). A new
`invocation_complete` event closes every A2A invocation with total
wall-clock duration and aggregated `input_tokens_total` /
`output_tokens_total` / `llm_call_count`. A2A responses now carry
the same totals inline as `X-Forge-Tokens-In`, `X-Forge-Tokens-Out`,
`X-Forge-Duration-Ms`, `X-Forge-Model`, `X-Forge-Provider` headers
so orchestrators can enforce cost ceilings during parallel workflow
execution without subscribing to the audit stream. Headers populate
regardless of OTel-tracing state. Cost calculation is deliberately
not in Forge — Forge emits tokens, the platform applies price tables.
The new emitters route through `AuditLogger.EmitFromContext` so
workflow-correlation fields (FWS-2) auto-tag every `llm_call` /
`tool_exec` / `invocation_complete` event when the inbound request
carried orchestrator headers. Schema additivity: existing audit
consumers reading the pre-FWS-3 shape continue to work unchanged. See
`docs/security/audit-logging.md#token-usage-and-execution-duration`.

Internal API change as part of this work: `llm.UsageInfo` field
names were renamed `PromptTokens` → `InputTokens` and
`CompletionTokens` → `OutputTokens` (JSON tags too) to align with
OTel GenAI semconv. The type is internal to `forge-core/llm` and not
consumed outside that package, so no external callers are affected.
- **Workflow correlation ID threading (issue #86, FWS-2).** Forge agents
now extract orchestration headers — `X-Workflow-ID`,
`X-Workflow-Stage-ID`, `X-Workflow-Step-ID`, `X-Invocation-Caller` —
Expand Down
51 changes: 50 additions & 1 deletion docs/security/audit-logging.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,9 @@ All runtime security events are emitted as structured NDJSON to stderr with corr
| `tool_exec` | Tool execution start/end (with tool name) |
| `egress_allowed` | Outbound request allowed (with domain, mode) |
| `egress_blocked` | Outbound request blocked (with domain, mode) |
| `llm_call` | LLM API call completed (with token count) |
| `llm_call` | LLM API call completed (with `input_tokens`, `output_tokens`, `model`, `provider`, `duration_ms`, `request_id`). See [Token usage and duration](#token-usage-and-execution-duration). |
| `llm_call_cancelled` | Streaming LLM call cancelled mid-flight; carries partial token counts captured up to cancellation. |
| `invocation_complete` | A2A invocation finished (auth → dispatch → engine → response). Carries `duration_ms` (wall-clock) plus aggregated `input_tokens_total` / `output_tokens_total` / `llm_call_count` / `model` / `provider`. |
| `guardrail_check` | Guardrail evaluation result |
| `auth_verify` | Inbound request authenticated successfully (with `provider`, `user_id`, `org_id`, `token_kind`) |
| `auth_fail` | Inbound request rejected (with `reason`, `token_kind`) |
Expand All @@ -39,6 +41,53 @@ The `source` field distinguishes in-process enforcer events from subprocess prox

When the inbound A2A request carries the orchestrator's correlation headers (`X-Workflow-ID`, `X-Workflow-Stage-ID`, `X-Workflow-Step-ID`, `X-Invocation-Caller`), every audit event emitted during that invocation is tagged with the matching `workflow_id` / `stage_id` / `step_id` / `invocation_caller` fields. Header names are vendor-neutral so any A2A-compatible orchestrator can populate them. Direct A2A invocations (no orchestrator) omit the fields entirely — emitted JSON is byte-identical to the pre-correlation shape. See [Workflow correlation IDs](workflow-correlation.md) for the full reference, including outbound propagation for agent-to-agent flows.

### Token usage and execution duration

Every `llm_call` audit event carries the normalized token counts the provider returned in its response metadata, plus the wall-clock time spent in the provider call. Field naming aligns with [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (`gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`) so audit consumers can correlate Forge audit events with OTel traces without a translation table.

```json
{
"ts": "2026-06-04T15:21:09Z",
"event": "llm_call",
"correlation_id": "9b3d…",
"task_id": "task-42",
"model": "claude-sonnet-4-6",
"provider": "anthropic",
"input_tokens": 1240,
"output_tokens": 387,
"duration_ms": 2150,
"request_id": "msg_01H8…"
}
```

| Field | Source | Notes |
|---|---|---|
| `input_tokens` | Provider response usage | Maps to `gen_ai.usage.input_tokens` |
| `output_tokens` | Provider response usage | Maps to `gen_ai.usage.output_tokens` |
| `tokens_unavailable` | Audit emitter | `true` when both counts are zero — some self-hosted Ollama setups don't return usage; billing consumers must distinguish "not measured" from "zero tokens used" |
| `model` | Runtime model config | The model identifier the executor was configured with |
| `provider` | Runtime model config | One of `anthropic`, `openai`, `ollama`, `custom` |
| `duration_ms` | Captured at call site | Wall-clock time spent in `client.Chat`, in milliseconds |
| `request_id` | Provider response | Opaque provider call ID (Anthropic `id`, OpenAI `id`) — debug-correlation handle only, never used for billing |

Each `tool_exec` event (phase=end) carries `duration_ms` for the tool execution plus structured arg-shape metadata (`args_size`, `result_size`) — raw arg values are deliberately not included (payload stripping is FWS-8's concern). One `invocation_complete` event closes each A2A invocation with the total wall-clock duration and aggregated token totals across all LLM calls in the invocation.

Workflow correlation fields (`workflow_id` / `stage_id` / `step_id` / `invocation_caller` from FWS-2) also auto-tag every `llm_call` / `tool_exec` / `invocation_complete` event when the inbound request carried orchestrator headers — billing and audit consumers can attribute cost not just to a task but to a specific workflow run / stage / step.

A2A response headers carry the same per-invocation totals inline so an orchestrator can ceiling-check cost during parallel workflow execution without subscribing to the audit stream:

| Header | Value |
|---|---|
| `X-Forge-Tokens-In` | Sum of `input_tokens` across all LLM calls in the invocation |
| `X-Forge-Tokens-Out` | Sum of `output_tokens` across all LLM calls in the invocation |
| `X-Forge-Duration-Ms` | Wall-clock invocation duration (auth → dispatch → engine → response) |
| `X-Forge-Model` | Most-recently-used model |
| `X-Forge-Provider` | Most-recently-used provider |

Headers populate regardless of whether OTel tracing is enabled — they're the orchestration channel, not the observability channel.

**Cost calculation is deliberately not in Forge.** Forge emits token counts; the platform applies price tables to compute dollar amounts. Price tables change frequently and shouldn't require agent redeploys.

### Authentication events

Every inbound request to `/tasks` emits exactly one of `auth_verify` or `auth_fail`.
Expand Down
44 changes: 44 additions & 0 deletions forge-cli/runtime/forge_usage_headers.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
package runtime

import (
"net/http"
"strconv"

coreruntime "github.com/initializ/forge/forge-core/runtime"
)

// A2A response header names for per-invocation cost telemetry. These
// are the inline channel for orchestrator real-time cost enforcement
// during parallel workflow execution — the orchestrator can ceiling-check
// against running totals before the next stage dispatches. They populate
// regardless of whether OTel tracing is enabled. See issue #87 / FWS-3.
const (
HeaderForgeTokensIn = "X-Forge-Tokens-In"
HeaderForgeTokensOut = "X-Forge-Tokens-Out"
HeaderForgeDurationMs = "X-Forge-Duration-Ms"
HeaderForgeModel = "X-Forge-Model"
HeaderForgeProvider = "X-Forge-Provider"
)

// applyForgeUsageHeaders stamps the X-Forge-* invocation-usage headers
// onto the given http.Header from a usage snapshot. Headers are omitted
// for snapshots with zero LLM calls (e.g. guardrail-failed invocations
// that never reached the LLM) so the response shape mirrors what
// actually happened.
func applyForgeUsageHeaders(h http.Header, snap coreruntime.LLMUsageSnapshot) {
if snap.LLMCallCount == 0 {
// Still stamp duration so orchestrators always see a wall-clock
// figure even for short-circuited invocations.
h.Set(HeaderForgeDurationMs, strconv.FormatInt(snap.InvocationDuration.Milliseconds(), 10))
return
}
h.Set(HeaderForgeTokensIn, strconv.Itoa(snap.InputTokens))
h.Set(HeaderForgeTokensOut, strconv.Itoa(snap.OutputTokens))
h.Set(HeaderForgeDurationMs, strconv.FormatInt(snap.InvocationDuration.Milliseconds(), 10))
if snap.PrimaryModel != "" {
h.Set(HeaderForgeModel, snap.PrimaryModel)
}
if snap.PrimaryProvider != "" {
h.Set(HeaderForgeProvider, snap.PrimaryProvider)
}
}
80 changes: 80 additions & 0 deletions forge-cli/runtime/forge_usage_headers_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
package runtime

import (
"net/http"
"testing"
"time"

coreruntime "github.com/initializ/forge/forge-core/runtime"
)

// Regression tests for issue #87 / FWS-3 — X-Forge-* response header
// emission. Headers are the inline channel for orchestrator real-time
// cost enforcement; they populate regardless of whether OTel tracing
// is enabled.

func TestApplyForgeUsageHeaders_StampsAllFields(t *testing.T) {
h := http.Header{}
applyForgeUsageHeaders(h, coreruntime.LLMUsageSnapshot{
InputTokens: 450,
OutputTokens: 180,
InvocationDuration: 1234 * time.Millisecond,
PrimaryModel: "claude-sonnet-4-6",
PrimaryProvider: "anthropic",
LLMCallCount: 3,
})

if h.Get(HeaderForgeTokensIn) != "450" {
t.Errorf("X-Forge-Tokens-In = %q, want 450", h.Get(HeaderForgeTokensIn))
}
if h.Get(HeaderForgeTokensOut) != "180" {
t.Errorf("X-Forge-Tokens-Out = %q, want 180", h.Get(HeaderForgeTokensOut))
}
if h.Get(HeaderForgeDurationMs) != "1234" {
t.Errorf("X-Forge-Duration-Ms = %q, want 1234", h.Get(HeaderForgeDurationMs))
}
if h.Get(HeaderForgeModel) != "claude-sonnet-4-6" {
t.Errorf("X-Forge-Model = %q, want claude-sonnet-4-6", h.Get(HeaderForgeModel))
}
if h.Get(HeaderForgeProvider) != "anthropic" {
t.Errorf("X-Forge-Provider = %q, want anthropic", h.Get(HeaderForgeProvider))
}
}

func TestApplyForgeUsageHeaders_NoLLMCalls_StillStampsDuration(t *testing.T) {
// Short-circuited invocation (guardrail-failed before LLM dispatch):
// orchestrator still wants a wall-clock figure, but token fields
// would mislead — emit duration only.
h := http.Header{}
applyForgeUsageHeaders(h, coreruntime.LLMUsageSnapshot{
InvocationDuration: 5 * time.Millisecond,
LLMCallCount: 0,
})

if h.Get(HeaderForgeDurationMs) != "5" {
t.Errorf("X-Forge-Duration-Ms must still be stamped on short-circuited invocations, got %q", h.Get(HeaderForgeDurationMs))
}
if h.Get(HeaderForgeTokensIn) != "" || h.Get(HeaderForgeTokensOut) != "" {
t.Errorf("token headers must NOT be stamped when no LLM calls happened, got in=%q out=%q",
h.Get(HeaderForgeTokensIn), h.Get(HeaderForgeTokensOut))
}
}

func TestApplyForgeUsageHeaders_OmitsModelProviderWhenAbsent(t *testing.T) {
// Edge case: LLM call happened but provider/model were empty (no
// runtime attribution available). Stamp tokens + duration only —
// don't stamp empty model/provider values.
h := http.Header{}
applyForgeUsageHeaders(h, coreruntime.LLMUsageSnapshot{
InputTokens: 50,
OutputTokens: 25,
InvocationDuration: 100 * time.Millisecond,
LLMCallCount: 1,
})
if _, present := h[http.CanonicalHeaderKey(HeaderForgeModel)]; present {
t.Errorf("X-Forge-Model must be omitted when PrimaryModel is empty")
}
if _, present := h[http.CanonicalHeaderKey(HeaderForgeProvider)]; present {
t.Errorf("X-Forge-Provider must be omitted when PrimaryProvider is empty")
}
}
Loading
Loading