[GSD-12905] ovms crash on arc iGPU when upgrading from 26.01 to 26.18 #937

@jpm-canonical

Description

Pre-submission Checklist

  • I am using the latest GPU driver version (releases)
  • I have searched for similar issues and found none

GPU Hardware

Intel Corporation Lunar Lake [Intel Arc Graphics 130V / 140V] (rev 04)

DRI Devices Information

$ ls -la /dev/dri/by-path/
total 0
drwxr-xr-x 2 root root 100 Jun  2 14:11 .
drwxr-xr-x 3 root root 100 Jun  2 14:11 ..
lrwxrwxrwx+1 root root   8 Jun  2 14:11 pci-0000:00:02.0-card -> ../card1
lrwxrwxrwx 1 root root   8 Jun  2 14:11 pci-0000:00:02.0-platform-simple-framebuffer.0-card -> ../card0
lrwxrwxrwx+1 root root  13 Jun  2 14:11 pci-0000:00:02.0-render -> ../renderD128

GPU Detailed Information (lspci output)

$ sudo lspci -vvv -k -s 0000:00:02.0
00:02.0 VGA compatible controller: Intel Corporation Lunar Lake [Intel Arc Graphics 130V / 140V] (rev 04) (prog-if 00 [VGA controller])
	Subsystem: Dell Device 0cc9
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin ? routed to IRQ 212
	IOMMU group: 4
	Region 0: Memory at 3014000000 (64-bit, prefetchable) [size=16M]
	Region 2: Memory at 2000000000 (64-bit, prefetchable) [size=256M]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [40] Vendor Specific Information: Intel Capabilities v1
		CapA: Peg60Dis- Peg12Dis- Peg11Dis- Peg10Dis- PeLWUDis- DmiWidth=x4
		      EccDis- ForceEccEn- VTdDis- DmiG2Dis- PegG2Dis- DDRMaxSize=Unlimited
		      1NDis- CDDis- DDPCDis- X2APICEn- PDCDis- IGDis- CDID=0 CRID=0
		      DDROCCAP- OCEn- DDRWrtVrefEn+ DDR3LEn+
		CapB: ImguDis- OCbySSKUCap- OCbySSKUEn- SMTCap- CacheSzCap 0x0
		      SoftBinCap- DDR3MaxFreqWithRef100=Disabled PegG3Dis-
		      PkgTyp- AddGfxEn- AddGfxCap- PegX16Dis- DmiG3Dis- GmmDis-
		      DDR3MaxFreq=2932MHz LPDDR3En-
	Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, IntMsgNum 0
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag+ RBE+ FLReset+ TEE-IO-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
			 AtomicOpsCtl: ReqEn-
			 IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
	Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
		Address: 00000000fee00018  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [d0] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Null
	Capabilities: [110 v1] Process Address Space ID (PASID)
		PASIDCap: Exec- Priv-, Max PASID Width: 14
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [200 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable+, Smallest Translation Unit: 00
	Capabilities: [420 v1] Physical Resizable BAR
		BAR 2: current size: 256MB, supported: 256MB
	Capabilities: [400 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Kernel driver in use: xe
	Kernel modules: xe

Driver Version

26.18.38308.1

Installed GPU Driver Packages

$ sudo dpkg --list | grep -iE "igc|gmm|opencl|level-zero|fc|level_zero|ocloc|libze"
ii  bpfcc-tools                                0.35.0+ds-1ubuntu2                          all          tools for BPF Compiler Collection (BCC)
ii  clinfo                                     3.0.25.02.14-1build1                        amd64        Query OpenCL system information
ii  intel-opencl-icd                           26.05.37020.3-1                             amd64        Intel graphics compute runtime for OpenCL
ii  intel-opencl-icd-legacy                    24.35.30872.45-1                            amd64        Intel graphics compute runtime for OpenCL -- legacy platforms
ii  libbpfcc:amd64                             0.35.0+ds-1ubuntu2                          amd64        shared library for BPF Compiler Collection (BCC)
ii  libcbor0.10:amd64                          0.10.2-2ubuntu3                             amd64        library for parsing and generating CBOR (RFC 7049)
ii  libdebconfclient0:amd64                    0.280ubuntu1                                amd64        Debian Configuration Management System (C-implementation library)
ii  libfile-fcntllock-perl                     0.22-4ubuntu6                               amd64        Perl module for file locking with fcntl(2)
ii  libigc1:amd64                              1.0.17791.18+1-3                            amd64        Intel graphics compiler for OpenCL -- core libs
ii  libigc2:amd64                              2.28.4-4                                    amd64        Intel graphics compiler for OpenCL -- core libs
ii  libigdfcl1:amd64                           1.0.17791.18+1-3                            amd64        Intel graphics compiler for OpenCL -- OpenCL library
ii  libigdfcl2:amd64                           2.28.4-4                                    amd64        Intel graphics compiler for OpenCL -- OpenCL library
ii  libigdgmm12:amd64                          22.9.0+ds1-1                                amd64        Intel Graphics Memory Management Library -- shared library
ii  libref-array1t64:amd64                     0.6.2-3build1                               amd64        refcounted array for C
ii  libze1:amd64                               1.28.2-2                                    amd64        oneAPI Level Zero -- share libraries
ii  linux-firmware-qlogic                      20260319.git217ca6e4-0ubuntu1               all          Firmware for QLogic SCSI, FC, and IB host bus and Ethernet adapters
ii  ocl-icd-libopencl1:amd64                   2.3.4-1                                     amd64        Generic OpenCL ICD Loader
ii  python3-bpfcc                              0.35.0+ds-1ubuntu2                          all          Python 3 wrappers for BPF Compiler Collection (BCC)
ii  python3-idna                               3.11-1                                      all          Python IDNA2008 (RFC 5891) handling (Python 3)
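The listing above includes unrelated packages (bpfcc, idna, qlogic firmware and others) because the `fc` alternative in the grep pattern matches anywhere on the line, including descriptions. A minimal sketch of a tighter filter that matches only the package-name column follows. The name pattern is an assumption chosen for illustration, not an official list of compute-runtime packages, and `filter_gpu_packages` is a hypothetical helper name. Because it only looks at names, it also drops `clinfo`, which the original pattern matched through its description.

```shell
# Sketch: keep installed ("ii") packages whose NAME looks GPU-compute related.
# The name pattern is an assumption for illustration, not an authoritative list.
# Reads `dpkg --list` output on stdin and prints "name version" per match.
filter_gpu_packages() {
    awk '$1 == "ii" && $2 ~ /igc|igdfcl|igdgmm|opencl|level-zero|ocloc|libze/ { print $2, $3 }'
}
```

Possible usage on the host: `dpkg --list | filter_gpu_packages`. Note that in this report the compute runtime under test is confined inside the snap, so the host package list may not reflect the versions actually in use.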

Driver Installation Details

Host machine is Ubuntu 26.04.

Software and compute runtime are confined inside a snap using Ubuntu 24.04 as base.

PR for reference: canonical/gemma4-snap#48

Linux Distribution

Ubuntu 24.04 LTS

Other Linux Distribution

No response

Kernel Version & Boot Parameters

$ uname -r
7.0.0-22-generic
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-7.0.0-22-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
$ lsmod | grep -E 'i915|xe'
xe                   4415488  75
drm_gpusvm_helper      57344  1 xe
gpu_sched              69632  1 xe
drm_gpuvm              57344  1 xe
drm_buddy              28672  1 xe
drm_ttm_helper         20480  1 xe
ttm                   135168  2 drm_ttm_helper,xe
drm_exec               12288  2 drm_gpuvm,xe
drm_suballoc_helper    24576  1 xe
drm_display_helper    303104  1 xe
cec                   106496  2 drm_display_helper,xe
i2c_algo_bit           16384  1 xe
intel_vsec             24576  3 intel_pmc_ssram_telemetry,pmt_telemetry,xe
video                  77824  3 dell_wmi,dell_laptop,xe

Actual Behavior

OpenVINO Model Server crashes when it targets the iGPU of my laptop. The crash started after I updated the compute runtime to 26.18.38308.1. With the previous version, 26.01.36711.4, it did not crash and ran correctly.

Expected Behavior

The newer compute runtime should not crash: OpenVINO Model Server should run on the iGPU as it did with 26.01.36711.4.

Reproduction Rate

Always reproduces - 100%

Steps to Reproduce

  1. sudo snap install gemma4 --channel latest/edge
  2. Run gemma4 chat: it works as expected, using the GPU.
  3. sudo snap refresh gemma4 --channel latest/edge/pr-48
  4. Run gemma4 chat again: it no longer works.
  5. sudo snap logs -f gemma4 shows OpenVINO Model Server crashing and restarting.
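To check the outcome of the steps above without reading the log by hand, a small helper can look for the failure signature. This is a minimal sketch: `has_gpu_build_failure` is a hypothetical name, and the two strings it searches for are taken from the API Call Logs section of this report, on the assumption that they identify the same failure on other machines.

```shell
# Sketch: read log text on stdin; exit status 0 if the failure signature from
# this report is present, 1 otherwise. The helper name is hypothetical.
has_gpu_build_failure() {
    grep -qE 'ProgramBuilder build failed|CL_INVALID_EVENT'
}
```

Possible usage after step 3: `sudo snap logs -n=all gemma4 | has_gpu_build_failure && echo "regression reproduced"`.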

Is this a regression?

  • Yes, this is a regression - functionality that previously worked is now broken

Last Known Working Driver Version

26.01.36711.4

First Known Failing Driver Version

26.18.38308.1

API Call Logs

+ ovms --rest_port 8336 --rest_bind_address 127.0.0.1 --model_name gemma4-e4b-it-int4-ov --model_path /tmp/gemma4-e4b-it-int4-ov --pipeline_type VLM --task text_generation --target_device GPU
[2026-06-11 16:40:48.283][1702589][modelmanager][info][modelmanager.cpp:180] Available devices for Open VINO: CPU, GPU
[2026-06-11 16:40:48.283][1702589][serving][info][capimodule.cpp:40] C-APIModule starting
[2026-06-11 16:40:48.283][1702589][serving][info][capimodule.cpp:42] C-APIModule started
[2026-06-11 16:40:48.283][1702589][serving][info][grpcservermodule.cpp:110] GRPCServerModule starting
[2026-06-11 16:40:48.283][1702589][serving][info][grpcservermodule.cpp:114] GRPCServerModule started
[2026-06-11 16:40:48.283][1702589][serving][info][grpcservermodule.cpp:115] Port was not set. GRPC server will not be started.
[2026-06-11 16:40:48.283][1702589][serving][info][httpservermodule.cpp:35] HTTPServerModule starting
[2026-06-11 16:40:48.283][1702589][serving][info][httpservermodule.cpp:39] Will start 8 REST workers
[2026-06-11 16:40:48.284][1702638][serving][info][drogon_http_server.cpp:157] Binding REST server to address: 127.0.0.1:8336
[2026-06-11 16:40:48.334][1702589][serving][info][drogon_http_server.cpp:184] REST server listening on port 8336 with 8 unary threads and 8 streaming threads
[2026-06-11 16:40:48.334][1702589][serving][info][http_server.cpp:249] API key not provided via --api_key_file or API_KEY environment variable. Authentication will be disabled.
[2026-06-11 16:40:48.334][1702589][serving][info][httpservermodule.cpp:52] HTTPServerModule started
[2026-06-11 16:40:48.334][1702589][serving][info][httpservermodule.cpp:53] Started REST server at 127.0.0.1:8336
[2026-06-11 16:40:48.336][1702589][serving][info][server.cpp:461] Graph config created in memory from model_path: /tmp/gemma4-e4b-it-int4-ov
[2026-06-11 16:40:48.336][1702589][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2026-06-11 16:40:48.337][1702589][serving][info][mediapipegraphdefinition.cpp:101] Graph queue globally disabled via OVMS_GRAPH_QUEUE_OFF=1 for mediapipe: gemma4-e4b-it-int4-ov
[2026-06-11 16:40:48.337][1702589][serving][info][mediapipegraphdefinition.cpp:479] MediapipeGraphDefinition initializing graph nodes
[2026-06-11 16:40:48.337][1702589][modelmanager][info][servable_initializer.cpp:467] Initializing Visual Language Model Legacy servable
[2026-06-11 16:40:53.554][1702589][serving][error][servable_initializer.cpp:88] Error during llm node initialization for models_path: /tmp/gemma4-e4b-it-int4-ov exception: Exception from src/inference/src/cpp/core.cpp:117:
Exception from src/inference/src/dev/plugin.cpp:54:
Check 'false' failed at src/plugins/intel_gpu/src/plugin/program_builder.cpp:168:
[GPU] ProgramBuilder build failed!
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_memory.cpp:569:
[GPU] clWaitForEvents, error code: -58 CL_INVALID_EVENT
[2026-06-11 16:40:53.554][1702589][modelmanager][error][servable_initializer.cpp:471] Error during LLM node resources initialization: The LLM Node resource initialization failed
[2026-06-11 16:40:53.554][1702589][serving][error][llm_node_initializer.cpp:60] Failed to process LLM node graph gemma4-e4b-it-int4-ov
[2026-06-11 16:40:53.560][1702589][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: gemma4-e4b-it-int4-ov state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent:
[2026-06-11 16:40:53.560][1702589][modelmanager][error][modelmanager.cpp:208] Couldn't start model manager
[2026-06-11 16:40:53.560][1702589][serving][error][servablemanagermodule.cpp:58] ovms::ModelManager::Start() Error: The LLM Node resource initialization failed
[2026-06-11 16:40:53.560][1702589][serving][info][grpcservermodule.cpp:201] GRPCServerModule shutting down
[2026-06-11 16:40:53.560][1702589][serving][info][grpcservermodule.cpp:211] GRPCServerModule shutdown
[2026-06-11 16:40:53.560][1702589][serving][info][httpservermodule.cpp:59] HTTPServerModule shutting down
[2026-06-11 16:40:53.561][1702589][serving][info][httpservermodule.cpp:64] Shutdown HTTP server
[2026-06-11 16:40:53.561][1702589][serving][info][servablemanagermodule.cpp:65] ServableManagerModule shutting down
[2026-06-11 16:40:53.563][1702589][serving][info][servablemanagermodule.cpp:71] ServableManagerModule shutdown
[2026-06-11 16:40:53.563][1702589][serving][info][capimodule.cpp:50] C-APIModule shutting down
[2026-06-11 16:40:53.563][1702589][serving][info][capimodule.cpp:52] C-APIModule shutdown
Error: exit status 1
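Most of the log above is informational. The failure itself is the error-level lines and the exception chain ending in OpenCL status -58 (CL_INVALID_EVENT) from clWaitForEvents. The sketch below shows one way to pull just those parts out of a saved log. Both function names are hypothetical, and the patterns assume the log layout shown in this report.

```shell
# Sketch: keep error-level OVMS lines and the exception detail lines that
# follow them. Patterns assume the log layout shown in this report.
ovms_errors() {
    grep -E '\]\[error\]\[|^Exception from|^Check .* failed at|^\[GPU\] '
}

# Sketch: print the OpenCL status reported by the GPU plugin as "<code> <name>".
cl_error_code() {
    sed -nE 's/.*error code: (-?[0-9]+) ([A-Z_]+).*/\1 \2/p'
}
```

Possible usage: `sudo snap logs -n=all gemma4 | ovms_errors`, or pipe the same output through `cl_error_code` to get only the OpenCL status.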

strace Logs

No response

System Logs / dmesg Output

No response

Backtrace (if crash or hang occurred)

No response

Source Code / Reproducer

No response

Command Line / Application Details

No response

oneAPI Version (if applicable)

No response

Screenshots / Video

No response

Additional Notes

No response

Metadata

    Labels

      • OS: Linux (issue specific to Linux distributions: Ubuntu, Fedora, RHEL, etc.)
      • Type: Bug (general bug report, unexpected behavior or crash)
      • Type: Regression (previously working functionality is now broken)
