## Summary

The in-process vLLM backend (`mellea/backends/vllm.py`) never sets `mot.usage`, so callers always receive `None` for token counts regardless of whether the generation succeeded.
## Affected code

- `VLLMBackend.post_processing` records tool calls, the generate log, and telemetry metadata, but contains no usage-population step.
- The `processing` method accumulates only the decoded text from `vllm.RequestOutput.outputs[0].text`; the token ID arrays are discarded.
## How other backends handle this

Every other backend that can compute token counts does so unconditionally in its post-processing step:
| Backend | Source of counts |
| --- | --- |
| HuggingFace | `GenerateDecoderOnlyOutput.sequences` shape |
| OpenAI / LiteLLM | `usage` field in API response |
| Ollama | `prompt_eval_count` / `eval_count` in response |
| WatsonX | `usage` field in API response |
`vllm.RequestOutput` exposes both `prompt_token_ids` and `outputs[0].token_ids`, so counts can be derived without any extra API call.
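Given those two fields, the counts fall out of plain `len()` calls. The sketch below uses minimal stand-in dataclasses in place of the real `vllm.RequestOutput` / `CompletionOutput` classes (only the field names mentioned above are relied on), so it runs without vLLM installed:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins mirroring only the vllm.RequestOutput fields
# used here; the real classes live in the vllm package.
@dataclass
class CompletionOutput:
    text: str
    token_ids: list[int]

@dataclass
class RequestOutput:
    prompt_token_ids: list[int]
    outputs: list[CompletionOutput] = field(default_factory=list)

def usage_from_request_output(result: RequestOutput) -> dict[str, int]:
    """Derive token counts from the token ID arrays, no extra API call."""
    prompt_tokens = len(result.prompt_token_ids)
    completion_tokens = len(result.outputs[0].token_ids)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }

out = RequestOutput(
    prompt_token_ids=[1, 2, 3],
    outputs=[CompletionOutput(text="hi", token_ids=[4, 5])],
)
print(usage_from_request_output(out))
# → {'prompt_tokens': 3, 'completion_tokens': 2, 'total_tokens': 5}
```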
## Expected behaviour

`mot.usage` should be set to `{"prompt_tokens": N, "completion_tokens": M, "total_tokens": N + M}` after every successful vLLM generation, consistent with the other backends.
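A minimal sketch of the missing population step. `set_usage` and the `mot` stand-in (a `SimpleNamespace` with a writable `usage` attribute, in place of the real model-output object) are hypothetical names for illustration; only the dict shape comes from the expected behaviour above:

```python
from types import SimpleNamespace

def set_usage(mot, prompt_token_ids: list[int], completion_token_ids: list[int]) -> None:
    """Populate mot.usage in the same shape the other backends emit."""
    n = len(prompt_token_ids)
    m = len(completion_token_ids)
    mot.usage = {
        "prompt_tokens": n,
        "completion_tokens": m,
        "total_tokens": n + m,
    }

# Stand-in for the model output object; before the fix, usage stays None.
mot = SimpleNamespace(usage=None)
set_usage(mot, prompt_token_ids=[1, 2, 3, 4], completion_token_ids=[5, 6])
print(mot.usage)
# → {'prompt_tokens': 4, 'completion_tokens': 2, 'total_tokens': 6}
```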
## Notes
`generate_from_raw` (batch path, line ~462) also fails to set usage; the same fix is needed there.