feat(runtime): token usage + execution duration emission (closes #87, FWS-3)#99

Open
initializ-mk wants to merge 1 commit into main from feat/issue-87-token-usage-duration

Conversation

@initializ-mk
Contributor

Summary

  • Emits per-LLM-call token counts (input_tokens / output_tokens — OTel-aligned naming), model, provider, duration_ms, and request_id on every llm_call audit event. Captured directly from provider response metadata across all four providers (Anthropic, OpenAI, Ollama via OpenAI-compatible, OpenAI Responses).
  • Emits per-invocation totals as A2A response headers (X-Forge-Tokens-In, X-Forge-Tokens-Out, X-Forge-Duration-Ms, X-Forge-Model, X-Forge-Provider) so orchestrators can enforce cost ceilings inline during parallel workflow execution without subscribing to the audit stream.
  • Emits a new invocation_complete audit event with wall-clock duration + aggregated token totals at every A2A request boundary.
  • tokens_unavailable=true flag distinguishes "provider did not report usage" (some self-hosted Ollama setups) from "you used zero tokens" so downstream billing doesn't undercount.
  • Tool execution events gain duration_ms plus structured arg-shape metadata (args_size, result_size). Raw arg values are deliberately not emitted — that's FWS-8's payload-stripping concern.

Pre-work inventory (per issue body)

Confirmed Architecture A before coding:

| Inventory check | Result |
| --- | --- |
| Client interface | forge-core/llm/client.go |
| Normalized response type | ChatResponse with Usage UsageInfo already present |
| Polymorphic runtime call site | forge-core/runtime/loop.go:245 calls e.client.Chat on llm.Client |
| Anthropic / OpenAI / OpenAI Responses populate Usage | |
| Ollama populates Usage | Wraps OpenAIClient → handled at audit-emit site via tokens_unavailable |

Decision-tree Row 1 → original S (3–5 days) estimate held.

Architectural notes

  • Shared call-site instrumentation. AuditLogger.EmitLLMCall is the single capture point for token/duration/model/provider/request_id. The OTel tracing initiative (FORGE_OTEL_TRACING.md) can hook into the same point to populate gen_ai.usage.* span attributes without re-doing per-provider extraction. Same data, captured once, fanned out to multiple emission targets with independent failure domains.
  • Field-name alignment with OTel GenAI semconv. Audit emits input_tokens / output_tokens (matching gen_ai.usage.input_tokens / gen_ai.usage.output_tokens). The names are aligned once, at FWS-3; after that, Forge's audit schema stays Forge-owned and shouldn't churn with upstream OTel renames. Consumers correlate via the trace_id/span_id cross-link the OTel work adds later.
  • Schema additivity guarantee. All new fields are *int / *int64 + omitempty, so pre-FWS-3 audit consumers parsing session_start / session_end / etc. see byte-identical JSON shape.
  • No cost calculation in Forge. Forge emits token counts; the platform applies price tables. Price tables change frequently and shouldn't require agent redeploys.
  • A2A headers are the orchestration channel, not the observability channel. They populate regardless of OTel-tracing state.
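
The schema-additivity note and the tokens_unavailable flag can be sketched together as follows. This is a simplified, hypothetical `AuditEvent` rather than the struct in forge-core/runtime/audit.go, and it assumes a nil `*Usage` stands in for "provider did not report usage"; the real detection logic may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AuditEvent is a simplified, hypothetical stand-in for the real audit event.
// New fields are pointers with omitempty, so events that never set them
// marshal exactly as they would have before the fields existed.
type AuditEvent struct {
	Event             string `json:"event"`
	InputTokens       *int   `json:"input_tokens,omitempty"`
	OutputTokens      *int   `json:"output_tokens,omitempty"`
	DurationMs        *int64 `json:"duration_ms,omitempty"`
	TokensUnavailable bool   `json:"tokens_unavailable,omitempty"`
}

// Usage mirrors the OTel-aligned field names (assumed shape for this sketch).
type Usage struct {
	InputTokens  int
	OutputTokens int
}

// llmCallEvent builds an llm_call event. A nil usage is flagged as
// unavailable instead of being reported as zero tokens.
func llmCallEvent(u *Usage, durationMs int64) AuditEvent {
	ev := AuditEvent{Event: "llm_call", DurationMs: &durationMs}
	if u == nil {
		ev.TokensUnavailable = true
		return ev
	}
	in, out := u.InputTokens, u.OutputTokens
	ev.InputTokens, ev.OutputTokens = &in, &out
	return ev
}

func mustJSON(ev AuditEvent) string {
	b, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// A non-LLM event carries none of the new keys.
	fmt.Println(mustJSON(AuditEvent{Event: "session_start"}))
	// A measured call, including a genuine zero, emits explicit counts.
	fmt.Println(mustJSON(llmCallEvent(&Usage{InputTokens: 120, OutputTokens: 0}, 830)))
	// An unmeasured call emits the flag and no counts.
	fmt.Println(mustJSON(llmCallEvent(nil, 910)))
}
```

Because the counts are pointers, a real zero (non-nil pointer to 0) still serializes as `"output_tokens":0`, while an unmeasured call omits the key entirely. That is the distinction a plain `int` with omitempty could not express.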

Wiring

| Layer | File |
| --- | --- |
| OTel-aligned UsageInfo field names | forge-core/llm/types.go + 4 providers |
| AuditEvent extension + EmitLLMCall / EmitToolExec / EmitInvocationComplete | forge-core/runtime/audit.go |
| LLM call timing + provider/model on HookContext | forge-core/runtime/hooks.go + loop.go |
| Tool-exec timing + arg-shape metadata | loop.go + audit hook in runner.go |
| Per-invocation LLMUsageAccumulator (thread-safe) | forge-core/runtime/usage_accumulator.go (new) |
| invocation_complete emission + X-Forge-* headers | forge-cli/runtime/runner.go + forge_usage_headers.go (new) |
| JSON-RPC tasks/send simplified to delegate to executeTask | forge-cli/runtime/runner.go (~120 lines deleted) |

Tests

  • forge-core/runtime/audit_llm_test.go — 6 tests: full usage, tokens_unavailable Ollama path, cancelled → llm_call_cancelled, OTel naming check, backward-compat omission for non-LLM events, tool_exec + invocation_complete shape
  • forge-core/runtime/usage_accumulator_test.go — 8 tests including a 500-call concurrent-add race regression
  • forge-cli/runtime/forge_usage_headers_test.go — 3 tests: full stamping, short-circuited invocation, missing model/provider omission
  • forge-core/llm/providers/usage_extraction_test.go — Anthropic / OpenAI / Ollama-no-usage wire-shape tests

Docs

  • docs/security/audit-logging.md — new event-types rows (llm_call_cancelled, invocation_complete), expanded llm_call description, new "Token usage and execution duration" section with field table + header table + design notes
  • CHANGELOG.md — Unreleased entry above the FWS-1 entry, with the internal UsageInfo rename called out

Test plan

  • go test -race -count=1 ./forge-core/... ./forge-cli/runtime/... ./forge-cli/server/... — all 28 packages pass
  • golangci-lint run across forge-core/... + forge-cli/... — 0 issues
  • gofmt -l clean
  • CI green on push

Out of scope (deliberately)

  • True streaming llm_call_cancelled emission — the event constant and EmitLLMCall(args.Cancelled) path exist, but ExecuteStream currently wraps non-streaming Chat so the path doesn't fire today. Ready for whenever Forge adopts true client-side streaming.
  • Embedding call audit events — embedder.go already uses UsageInfo (now OTel-aligned); per-call audit emission for embeddings is a follow-up that mirrors the llm_call pattern.
  • Cost calculation. By design.

Closes #87.

… FWS-3)

Every llm_call audit event now carries OTel-aligned token counts
(input_tokens / output_tokens), model, provider, duration_ms, and a
provider-specific request_id captured at the LLM call site for the
four supported providers (Anthropic, OpenAI, Ollama via the OpenAI-
compatible path, OpenAI Responses).

When a provider returns no usage metadata (some self-hosted Ollama
setups), the emitter flags tokens_unavailable=true rather than
emitting silent zeros, so billing consumers can distinguish "not
measured" from "zero tokens used."

Each tool_exec event gains duration_ms plus structured arg-shape
metadata (args_size, result_size). Raw arg values are not emitted —
that's FWS-8's payload-stripping concern, not FWS-3's.

A new invocation_complete audit event closes every A2A invocation
with the wall-clock duration and aggregated input_tokens_total /
output_tokens_total / llm_call_count.

A2A REST responses carry the same per-invocation totals inline as
X-Forge-Tokens-In / X-Forge-Tokens-Out / X-Forge-Duration-Ms /
X-Forge-Model / X-Forge-Provider headers so an orchestrator can
ceiling-check cost during parallel workflow execution without
subscribing to the audit stream. Headers populate regardless of
whether OTel tracing is enabled — they're the orchestration channel,
not the observability channel.

Cost calculation is deliberately not in Forge. Forge emits token
counts; the platform applies price tables to compute dollar amounts.
Price tables change frequently and shouldn't require agent redeploys.

Schema additivity: all new fields use *int / *int64 pointers + the
omitempty JSON tag, so pre-FWS-3 audit consumers parsing without these
fields see byte-identical shape for session_start / session_end / etc.

Internal API rename: llm.UsageInfo field names PromptTokens →
InputTokens and CompletionTokens → OutputTokens (JSON tags too) align
with the OTel GenAI semconv. The type is internal to forge-core/llm
and not consumed outside that package.

Bonus simplification: JSON-RPC tasks/send now delegates to executeTask
(~120 lines of duplicated audit/guardrail logic removed), so both
JSON-RPC and REST paths share the same usage-accumulator wiring.

See docs/security/audit-logging.md#token-usage-and-execution-duration
for the full event shape and header contract.
initializ-mk force-pushed the feat/issue-87-token-usage-duration branch from 6ef1701 to f23d770 on June 5, 2026 03:56
Development

Successfully merging this pull request may close these issues.

FWS-3 — Token usage and execution duration emission (per LLM call + per invocation)
