Skip to content

[BUG] Qwen VL models with text prompt longer than max_seq_len 4096 length error #401

@iguy0

Description

@iguy0

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Describe the bug

When prompting Qwen VL models with a long(>4096 max_seq_len ) enough prompt the call fails with the following error:

Dec 06 23:53:16 ailab llama-swap[3174452]: models-local/qwen3-vl-32b-instruct-exl3. Skipping inline model load.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.673 INFO:     Received chat completion request
Dec 06 23:53:16 ailab llama-swap[3174452]: 96b5accf70144d28907c306816d5513e
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:    Traceback (most recent call last):
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File
Dec 06 23:53:16 ailab llama-swap[3174452]: "/home/user1/projects/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 437,
Dec 06 23:53:16 ailab llama-swap[3174452]: in generate_chat_completion
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        generations = await
Dec 06 23:53:16 ailab llama-swap[3174452]: asyncio.gather(*gen_tasks)
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:
Dec 06 23:53:16 ailab llama-swap[3174452]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File
Dec 06 23:53:16 ailab llama-swap[3174452]: "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 692, in generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        async for generation in
Dec 06 23:53:16 ailab llama-swap[3174452]: self.stream_generate(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File
Dec 06 23:53:16 ailab llama-swap[3174452]: "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 779, in
Dec 06 23:53:16 ailab llama-swap[3174452]: stream_generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        async for generation_chunk in
Dec 06 23:53:16 ailab llama-swap[3174452]: self.generate_gen(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File
Dec 06 23:53:16 ailab llama-swap[3174452]: "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 968, in
Dec 06 23:53:16 ailab llama-swap[3174452]: generate_gen
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        raise ValueError(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:    ValueError: Prompt length 10083 is greater
Dec 06 23:53:16 ailab llama-swap[3174452]: than max_seq_len 4096
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.715 ERROR:    Sent to request: Chat completion
Dec 06 23:53:16 ailab llama-swap[3174452]: 96b5accf70144d28907c306816d5513e aborted. Maybe the model was unloaded? Please
Dec 06 23:53:16 ailab llama-swap[3174452]: check the server console.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.716 INFO:     192.168.10.45:0 - "POST /v1/chat/completions
Dec 06 23:53:16 ailab llama-swap[3174452]: HTTP/1.1" 503
Dec 06 23:53:16 ailab llama-swap[3174452]: [WARN] metrics skipped, HTTP status=503, path=/v1/chat/completions

I believe the model configuration may not be assigned correctly to max_seq_len and fails here:

if context_len > self.max_seq_len:

Please let me know if you need more information.

Reproduction steps

Download a version of turboderp/Qwen3-VL-32B-Instruct-exl3 and run a call to endpoint with an image and a text prompt with > 4096 max_seq_len

Expected behavior

The api call should respect the model configuration from config.json

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions