[GSD-12919] Arc 140T (Arrow Lake-P / Xe2): intermittent CCS engine reset under sustained OpenCL MoE compute (UR_RESULT_ERROR_OUT_OF_RESOURCES) #939

### Description

On an Intel Arc 140T iGPU (Arrow Lake-P, Xe2-LPG), sustained OpenCL compute from llama.cpp (SYCL, via the oneAPI Unified Runtime) intermittently hangs the GPU compute engine. The kernel `xe` driver detects and resets it:

```
xe 0000:01:00.0: [drm] exec queue reset detected
xe 0000:01:00.0: [drm] GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=N
```

The `devcoredump` reset reason is `LR job cleanup, guc_id=N`. The application then sees `UR_RESULT_ERROR_OUT_OF_RESOURCES` (surfaced at a later `clFinish` / `stream->wait()`) and aborts.
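For anyone triaging a longer capture, here is a minimal Python sketch that picks the engine-reset lines out of saved `dmesg` output. The regex is derived only from the two sample lines above, so it is an assumption that other kernel versions print the same format. The names `EngineReset` and `find_engine_resets` are hypothetical helpers, not part of any driver tooling.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class EngineReset(NamedTuple):
    device: str        # PCI address, e.g. "0000:01:00.0"
    gt: int            # GT index from the "GT0:" prefix
    engine_class: str  # e.g. "ccs"
    logical_mask: int  # parsed from the hex value
    guc_id: str        # kept as text; the sample above only shows "N"


# Assumed format, based solely on the sample log lines in this report.
_RESET_RE = re.compile(
    r"xe (?P<dev>[0-9a-f:.]+): \[drm\] GT(?P<gt>\d+): Engine reset: "
    r"engine_class=(?P<cls>\w+), logical_mask: (?P<mask>0x[0-9a-fA-F]+), "
    r"guc_id=(?P<guc>\w+)"
)


def find_engine_resets(lines: Iterable[str]) -> Iterator[EngineReset]:
    """Yield one EngineReset per matching log line; other lines are skipped."""
    for line in lines:
        m = _RESET_RE.search(line)
        if m:
            yield EngineReset(
                device=m["dev"],
                gt=int(m["gt"]),
                engine_class=m["cls"],
                logical_mask=int(m["mask"], 16),
                guc_id=m["guc"],
            )
```

Typical use would be something like `list(find_engine_resets(open("dmesg.txt")))`, where `dmesg.txt` is a hypothetical file holding the saved kernel log. Counting the results per run gives a quick way to compare configurations.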

### It is timing-sensitive (looks like a race)

Run bare (no tracing), the workload aborts within ~2 requests. The exact same workload under `SYCL_UR_TRACE=2` (which heavily slows and serializes the UR calls) survives indefinitely (18/18 requests, no reset). Anything that slows submission (tracing, debug logging, fewer concurrent kernels) avoids the reset, which points to an async-ordering issue or race condition rather than a deterministic resource limit.

### Workload

llama.cpp Mixture-of-Experts inference (e.g. gemma-4-26b-a4b) with the experts on the GPU: each layer dispatches many small per-expert matmul kernels back to back. Running the experts on the CPU (`-ot exps=CPU`) avoids the reset entirely, so the problem appears specific to the high-rate small-kernel GPU compute pattern.

### Environment

- GPU: Intel Arc 140T (Arrow Lake-P, Xe2-LPG), Core Ultra 9 285H
- OS: Ubuntu 24.04, kernel 6.17.0-35-generic, `xe` driver
- GuC firmware: 70.53.0 (updated from upstream linux-firmware; the issue persists at this version)
- intel-compute-runtime (NEO): 24.39.31294
- Backend: OpenCL 3.0 NEO via oneAPI Unified Runtime

### Tried, did NOT help

- GuC firmware update 70.36.0 -> 70.53.0 (rebooted, confirmed loaded)
- `GGML_SYCL_DISABLE_OPT=1`

### Question

Is this a known CCS engine-reset issue on Arrow Lake / Xe2 under high-rate OpenCL compute, and is it addressed in a newer compute-runtime (I am on 24.39; latest is ~26.18) or a newer kernel? I am about to retest on a much newer stack (NEO 26.18, kernel 7.0) and can report back either way. Happy to provide the full `devcoredump`, `dmesg`, or a minimal repro on request.

