feat: vLLM (in-process) backend never populates mot.usage #696

@planetf1

Summary

The in-process vLLM backend (mellea/backends/vllm.py) never sets mot.usage, so callers always receive None for token counts regardless of whether the generation succeeded.

Affected code

VLLMBackend.post_processing records tool calls, the generate log, and telemetry metadata, but contains no usage-population step.

The processing method accumulates only the decoded text from vllm.RequestOutput.outputs[0].text; the token ID arrays are discarded.

How other backends handle this

Every other backend that can compute token counts does so unconditionally in its post-processing step:

| Backend | Source of counts |
| --- | --- |
| HuggingFace | GenerateDecoderOnlyOutput.sequences shape |
| OpenAI / LiteLLM | usage field in API response |
| Ollama | prompt_eval_count / eval_count in response |
| WatsonX | usage field in API response |

vllm.RequestOutput exposes both prompt_token_ids and outputs[0].token_ids, so counts can be derived without any extra API call.

Expected behaviour

mot.usage should be set to {"prompt_tokens": N, "completion_tokens": M, "total_tokens": N+M} after every successful vLLM generation, consistent with other backends.
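
A minimal sketch of how the counts could be derived, assuming the final vllm.RequestOutput is available at the point where usage is populated. The helper name usage_from_request_output and the two stand-in dataclasses are hypothetical and exist only so the example runs without vLLM installed; the attribute names prompt_token_ids and outputs[0].token_ids are the ones referenced above. Where and how the result gets attached to mot.usage inside VLLMBackend is not shown here and would depend on the backend's existing post-processing flow.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class FakeCompletionOutput:
    """Hypothetical stand-in for one entry of RequestOutput.outputs."""
    text: str
    token_ids: Sequence[int]


@dataclass
class FakeRequestOutput:
    """Hypothetical stand-in exposing only the fields this sketch reads."""
    prompt_token_ids: Optional[Sequence[int]]
    outputs: List[FakeCompletionOutput]


def usage_from_request_output(result: Any) -> Dict[str, int]:
    """Build a usage dict from a RequestOutput-like object.

    Assumption: `result` has `prompt_token_ids` (possibly None) and a
    non-empty `outputs` list whose first entry has `token_ids`.
    """
    # Treat a missing prompt token list as zero prompt tokens.
    prompt_tokens = len(result.prompt_token_ids or [])
    # Only the first completion is counted, matching the text that is kept.
    completion_tokens = len(result.outputs[0].token_ids)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }


example = FakeRequestOutput(
    prompt_token_ids=[101, 102, 103],
    outputs=[FakeCompletionOutput(text="hi", token_ids=[7, 8])],
)
usage = usage_from_request_output(example)
```

In the backend, the equivalent call would presumably run once on the last RequestOutput of a generation, so that the token ID arrays are read before they are discarded.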

Notes
