test: intermittent failures from vllm tests on LSF cluster #699

@planetf1

Description

I'm seeing intermittent failures from the vLLM tests on an LSF cluster when run with

uv run --all-extras --all-groups pytest --isolate-heavy -v

For example:

==== 723 passed, 142 skipped, 2 xfailed, 90 warnings in 1572.83s (0:26:12) =====

when all worked well, and

FAILED test/backends/test_openai_vllm.py::test_instruct - openai.NotFoundErro...
FAILED test/backends/test_openai_vllm.py::test_multiturn - openai.NotFoundErr...
FAILED test/backends/test_openai_vllm.py::test_chat - openai.NotFoundError: E...
FAILED test/backends/test_openai_vllm.py::test_chat_stream - openai.NotFoundE...
FAILED test/backends/test_openai_vllm.py::test_format - openai.NotFoundError:...
FAILED test/backends/test_openai_vllm.py::test_generate_from_raw - openai.Not...
FAILED test/backends/test_openai_vllm.py::test_generate_from_raw_with_format
= 7 failed, 716 passed, 142 skipped, 2 xfailed, 90 warnings in 1409.38s (0:23:29) =

at other times.

Across multiple runs, the success rate seems to be roughly 50-75%.

On further investigation, the underlying error in all of these cases is:

E               openai.NotFoundError: Error code: 404 - {'error': {'message': 'The model `ibm-granite/granite-4.0-micro` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}}

Question to pursue: how is the vLLM server initialized when the tests are run with uv on a GPU-enabled cluster? Clearly we sometimes get access to a vLLM environment that has the right model loaded, and other times we don't.
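
As a possible diagnostic (not part of the test suite), something like the sketch below could be run before the tests to confirm which models the vLLM OpenAI-compatible endpoint is actually serving; the base URL, environment variable names, and placeholder API key are assumptions, only the model name comes from the error above.

# Hypothetical pre-flight check: list the models exposed by the vLLM server
# and fail fast if the expected model is missing, instead of hitting a 404
# mid-test. Base URL / env var names are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
)

served = [m.id for m in client.models.list()]
print("Models served by vLLM:", served)
assert "ibm-granite/granite-4.0-micro" in served, (
    "Expected model is not loaded on this vLLM instance"
)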
