feat(runtime): token usage + execution duration emission (closes #87, FWS-3) by initializ-mk · Pull Request #99 · initializ/forge

initializ-mk · 2026-06-05T02:07:03Z

Summary

Emits per-LLM-call token counts (input_tokens / output_tokens — OTel-aligned naming), model, provider, duration_ms, and request_id on every llm_call audit event. Captured directly from provider response metadata across all four providers (Anthropic, OpenAI, Ollama via OpenAI-compatible, OpenAI Responses).
Emits per-invocation totals as A2A response headers (X-Forge-Tokens-In, X-Forge-Tokens-Out, X-Forge-Duration-Ms, X-Forge-Model, X-Forge-Provider) so orchestrators can enforce cost ceilings inline during parallel workflow execution without subscribing to the audit stream.
Emits a new invocation_complete audit event with wall-clock duration + aggregated token totals at every A2A request boundary.
tokens_unavailable=true flag distinguishes "provider did not report usage" (some self-hosted Ollama setups) from "you used zero tokens" so downstream billing doesn't undercount.
Tool execution events gain duration_ms plus structured arg-shape metadata (args_size, result_size). Raw arg values are deliberately not emitted — that's FWS-8's payload-stripping concern.

Pre-work inventory (per issue body)

Confirmed Architecture A before coding:

| Inventory check | Result |
| --- | --- |
| `Client` interface | ✓ `forge-core/llm/client.go` |
| Normalized response type | ✓ `ChatResponse` with `Usage UsageInfo` already present |
| Polymorphic runtime call site | ✓ `forge-core/runtime/loop.go:245` calls `e.client.Chat` on `llm.Client` |
| Anthropic / OpenAI / OpenAI Responses populate Usage | ✓ |
| Ollama populates Usage | Wraps OpenAIClient → handled at audit-emit site via `tokens_unavailable` |

Decision-tree Row 1 → original S (3–5 days) estimate held.

Architectural notes

Shared call-site instrumentation. AuditLogger.EmitLLMCall is the single capture point for token/duration/model/provider/request_id. The OTel tracing initiative (FORGE_OTEL_TRACING.md) can hook into the same point to populate gen_ai.usage.* span attributes without re-doing per-provider extraction. Same data, captured once, fanned out to multiple emission targets with independent failure domains.
Field-name alignment with OTel GenAI semconv. Audit emits input_tokens / output_tokens (matching gen_ai.usage.input_tokens / gen_ai.usage.output_tokens). Aligned once at FWS-3, then Forge's audit schema stays Forge-owned and shouldn't churn with upstream OTel renames — consumers correlate via the trace_id/span_id cross-link the OTel work adds later.
Schema additivity guarantee. All new fields are *int / *int64 + omitempty, so pre-FWS-3 audit consumers parsing session_start / session_end / etc. see byte-identical JSON shape.
No cost calculation in Forge. Forge emits token counts; the platform applies price tables. Price tables change frequently and shouldn't require agent redeploys.
A2A headers are the orchestration channel, not the observability channel. They populate regardless of OTel-tracing state.
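
The additivity guarantee and the `tokens_unavailable` distinction both follow from how Go's `encoding/json` treats pointer fields with `omitempty`. The sketch below uses a hypothetical, trimmed-down `auditEvent` struct, not Forge's actual `AuditEvent`; the `event` JSON key and the plain `bool` type for `tokens_unavailable` are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// auditEvent is a hypothetical stand-in for Forge's AuditEvent, reduced to
// the fields needed to show the pointer + omitempty behavior.
type auditEvent struct {
	Event             string `json:"event"`
	InputTokens       *int   `json:"input_tokens,omitempty"`
	OutputTokens      *int   `json:"output_tokens,omitempty"`
	DurationMs        *int64 `json:"duration_ms,omitempty"`
	TokensUnavailable bool   `json:"tokens_unavailable,omitempty"`
}

func intPtr(v int) *int       { return &v }
func int64Ptr(v int64) *int64 { return &v }

// encode marshals an event to its JSON string form.
func encode(e auditEvent) string {
	b, err := json.Marshal(e)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Non-LLM event: nil pointers are omitted, so the shape is unchanged.
	fmt.Println(encode(auditEvent{Event: "session_start"}))
	// → {"event":"session_start"}

	// Measured zero: a non-nil pointer to 0 is still emitted.
	fmt.Println(encode(auditEvent{
		Event:        "llm_call",
		InputTokens:  intPtr(0),
		OutputTokens: intPtr(0),
		DurationMs:   int64Ptr(12),
	}))
	// → {"event":"llm_call","input_tokens":0,"output_tokens":0,"duration_ms":12}

	// Provider reported no usage: counts are absent and the flag is set.
	fmt.Println(encode(auditEvent{
		Event:             "llm_call",
		DurationMs:        int64Ptr(12),
		TokensUnavailable: true,
	}))
	// → {"event":"llm_call","duration_ms":12,"tokens_unavailable":true}
}
```

With plain `int` fields and `omitempty`, a measured zero would be dropped and become indistinguishable from "not reported", which is why pointer types matter here.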

Wiring

| Layer | File |
| --- | --- |
| OTel-aligned `UsageInfo` field names | `forge-core/llm/types.go` + 4 providers |
| `AuditEvent` extension + `EmitLLMCall` / `EmitToolExec` / `EmitInvocationComplete` | `forge-core/runtime/audit.go` |
| LLM call timing + provider/model on `HookContext` | `forge-core/runtime/hooks.go` + `loop.go` |
| Tool-exec timing + arg-shape metadata | `loop.go` + audit hook in `runner.go` |
| Per-invocation `LLMUsageAccumulator` (thread-safe) | `forge-core/runtime/usage_accumulator.go` (new) |
| `invocation_complete` emission + `X-Forge-*` headers | `forge-cli/runtime/runner.go` + `forge_usage_headers.go` (new) |
| JSON-RPC `tasks/send` simplified to delegate to `executeTask` | `forge-cli/runtime/runner.go` (~120 lines deleted) |
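
For readers unfamiliar with the pattern, a thread-safe per-invocation accumulator could look roughly like the following. This is a sketch under assumptions, not the contents of `usage_accumulator.go`: the `usageAccumulator` type, its method names, and the mutex-based design are all hypothetical. The `runConcurrent` helper mirrors the idea of a concurrent-add regression check.

```go
package main

import (
	"fmt"
	"sync"
)

// usageAccumulator sums token usage across the LLM calls of one invocation.
// Hypothetical type; a mutex guards all three counters so they stay
// consistent with each other under concurrent Add calls.
type usageAccumulator struct {
	mu        sync.Mutex
	inTotal   int64
	outTotal  int64
	callCount int
}

// Add records the usage of a single LLM call.
func (a *usageAccumulator) Add(in, out int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.inTotal += in
	a.outTotal += out
	a.callCount++
}

// Snapshot returns the totals accumulated so far.
func (a *usageAccumulator) Snapshot() (in, out int64, calls int) {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.inTotal, a.outTotal, a.callCount
}

// runConcurrent performs n concurrent Add(10, 3) calls and returns the
// accumulator once every goroutine has finished.
func runConcurrent(n int) *usageAccumulator {
	acc := &usageAccumulator{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			acc.Add(10, 3)
		}()
	}
	wg.Wait()
	return acc
}

func main() {
	in, out, calls := runConcurrent(500).Snapshot()
	fmt.Println(in, out, calls) // prints "5000 1500 500"
}
```

Running a check like this under `go test -race` is what would surface an unsynchronized counter; the snapshot values correspond to the `input_tokens_total`, `output_tokens_total`, and `llm_call_count` fields on `invocation_complete`.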

Tests

forge-core/runtime/audit_llm_test.go — 6 tests: full usage, tokens_unavailable Ollama path, cancelled → llm_call_cancelled, OTel naming check, backward-compat omission for non-LLM events, tool_exec + invocation_complete shape
forge-core/runtime/usage_accumulator_test.go — 8 tests including a 500-call concurrent-add race regression
forge-cli/runtime/forge_usage_headers_test.go — 3 tests: full stamping, short-circuited invocation, missing model/provider omission
forge-core/llm/providers/usage_extraction_test.go — Anthropic / OpenAI / Ollama-no-usage wire-shape tests

Docs

docs/security/audit-logging.md — new event-types rows (llm_call_cancelled, invocation_complete), expanded llm_call description, new "Token usage and execution duration" section with field table + header table + design notes
CHANGELOG.md — Unreleased entry above the FWS-1 entry, with the internal UsageInfo rename called out

Test plan

go test -race -count=1 ./forge-core/... ./forge-cli/runtime/... ./forge-cli/server/... — all 28 packages pass
golangci-lint run across forge-core/... + forge-cli/... — 0 issues
gofmt -l clean
CI green on push

Out of scope (deliberately)

True streaming llm_call_cancelled emission — the event constant and EmitLLMCall(args.Cancelled) path exist, but ExecuteStream currently wraps non-streaming Chat so the path doesn't fire today. Ready for whenever Forge adopts true client-side streaming.
Embedding call audit events — embedder.go already uses UsageInfo (now OTel-aligned); per-call audit emission for embeddings is a follow-up that mirrors the llm_call pattern.
Cost calculation. By design.

Closes #87.

… FWS-3)

Every llm_call audit event now carries OTel-aligned token counts (input_tokens / output_tokens), model, provider, duration_ms, and a provider-specific request_id captured at the LLM call site for the four supported providers (Anthropic, OpenAI, Ollama via the OpenAI-compatible path, OpenAI Responses). When a provider returns no usage metadata (some self-hosted Ollama setups), the emitter flags tokens_unavailable=true rather than emitting silent zeros, so billing consumers can distinguish "not measured" from "zero tokens used."

Each tool_exec event gains duration_ms plus structured arg-shape metadata (args_size, result_size). Raw arg values are not emitted; that's FWS-8's payload-stripping concern, not FWS-3's.

A new invocation_complete audit event closes every A2A invocation with the wall-clock duration and aggregated input_tokens_total / output_tokens_total / llm_call_count.

A2A REST responses carry the same per-invocation totals inline as X-Forge-Tokens-In / X-Forge-Tokens-Out / X-Forge-Duration-Ms / X-Forge-Model / X-Forge-Provider headers so an orchestrator can ceiling-check cost during parallel workflow execution without subscribing to the audit stream. Headers populate regardless of whether OTel tracing is enabled: they're the orchestration channel, not the observability channel.

Cost calculation is deliberately not in Forge. Forge emits token counts; the platform applies price tables to compute dollar amounts. Price tables change frequently and shouldn't require agent redeploys.

Schema additivity: all new fields use *int / *int64 pointers + the omitempty JSON tag, so pre-FWS-3 audit consumers parsing without these fields see byte-identical shape for session_start / session_end / etc.

Internal API rename: the llm.UsageInfo fields PromptTokens → InputTokens and CompletionTokens → OutputTokens (JSON tags too) now align with the OTel GenAI semconv. The type is internal to forge-core/llm and not consumed outside that package.

Bonus simplification: JSON-RPC tasks/send now delegates to executeTask (~120 lines of duplicated audit/guardrail logic removed), so both JSON-RPC and REST paths share the same usage-accumulator wiring.

See docs/security/audit-logging.md#token-usage-and-execution-duration for the full event shape and header contract.

initializ-mk force-pushed the feat/issue-87-token-usage-duration branch from 6ef1701 to f23d770 · June 5, 2026 03:56

initializ-mk wants to merge 1 commit into main from feat/issue-87-token-usage-duration
