[GSD-12919] Arc 140T (Arrow Lake-P / Xe2): intermittent CCS engine reset under sustained OpenCL MoE compute (UR_RESULT_ERROR_OUT_OF_RESOURCES) #939

@mayerwin

Description

On an Intel Arc 140T iGPU (Arrow Lake-P, Xe2-LPG), sustained OpenCL compute from llama.cpp (SYCL, via the oneAPI Unified Runtime) intermittently hangs the GPU compute engine. The kernel xe driver detects and resets it:

xe 0000:01:00.0: [drm] exec queue reset detected
xe 0000:01:00.0: [drm] GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=N

The devcoredump gives the reset reason as: LR job cleanup, guc_id=N. The application then sees UR_RESULT_ERROR_OUT_OF_RESOURCES (surfaced at a later clFinish / stream->wait()) and aborts.
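For anyone trying to confirm they are hitting the same reset, here is a minimal Python sketch that scans kernel log text for the two xe messages quoted above. The regular expressions are modeled only on those two lines, so other kernel versions may word them differently, and the function names are my own, not part of any tool.

```python
import re

# Patterns modeled on the two log lines quoted in this report (assumption:
# other kernel versions may format these messages differently).
QUEUE_RESET = re.compile(
    r"xe (?P<pci>[0-9a-f:.]+): \[drm\] exec queue reset detected"
)
ENGINE_RESET = re.compile(
    r"xe (?P<pci>[0-9a-f:.]+): \[drm\] GT(?P<gt>\d+): Engine reset: "
    r"engine_class=(?P<engine_class>\w+), "
    r"logical_mask: (?P<mask>0x[0-9a-fA-F]+), "
    r"guc_id=(?P<guc_id>\w+)"
)


def count_queue_resets(log_text):
    """Count 'exec queue reset detected' lines in log_text."""
    return sum(1 for line in log_text.splitlines() if QUEUE_RESET.search(line))


def find_engine_resets(log_text):
    """Return one dict of parsed fields per 'Engine reset' line in log_text."""
    resets = []
    for line in log_text.splitlines():
        match = ENGINE_RESET.search(line)
        if match:
            resets.append(match.groupdict())
    return resets
```

Passing it the captured output of dmesg and filtering the result for engine_class == "ccs" separates these resets from resets on other engine classes.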

It is timing-sensitive (looks like a race)

Run bare (no tracing), the workload aborts within ~2 requests. The exact same workload under SYCL_UR_TRACE=2 (which heavily slows and serializes the UR calls) survives indefinitely (18/18 requests, no reset). Anything that slows submission (tracing, debug logging, fewer concurrent kernels) avoids the reset, which points to an async-ordering / race condition rather than a deterministic resource limit.
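The bare vs. traced comparison can be scripted roughly as follows. This is only a sketch of an A/B harness, not the exact procedure I used: cmd stands for a hypothetical repro command that issues a fixed number of requests and exits non-zero when the application aborts, and the helper names are made up for illustration. The only thing taken from this report is the SYCL_UR_TRACE=2 setting.

```python
import os
import subprocess


def make_env(ur_trace_level=None, base_env=None):
    """Copy the environment, setting or clearing SYCL_UR_TRACE."""
    env = dict(os.environ if base_env is None else base_env)
    if ur_trace_level is None:
        env.pop("SYCL_UR_TRACE", None)  # bare run: make sure tracing is off
    else:
        env["SYCL_UR_TRACE"] = str(ur_trace_level)
    return env


def run_trial(cmd, ur_trace_level=None, timeout_s=None):
    """Run cmd once, discard its output, and return its exit code."""
    proc = subprocess.run(
        cmd,
        env=make_env(ur_trace_level),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout_s,
    )
    return proc.returncode


def compare(cmd, runs=5):
    """Return the number of clean exits (exit code 0) per configuration."""
    results = {}
    for label, level in (("bare", None), ("SYCL_UR_TRACE=2", 2)):
        results[label] = sum(run_trial(cmd, level) == 0 for _ in range(runs))
    return results
```

If the race hypothesis holds, compare() should report far fewer clean exits for the bare configuration than for the traced one on the same command.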

Workload

llama.cpp Mixture-of-Experts inference (e.g. gemma-4-26b-a4b) with the experts on the GPU: each layer dispatches many small per-expert matmul kernels back to back. Running the experts on the CPU (-ot exps=CPU) avoids it entirely, so it is specific to the high-rate small-kernel GPU compute pattern.

Environment

  • GPU: Intel Arc 140T (Arrow Lake-P, Xe2-LPG), Core Ultra 9 285H
  • OS: Ubuntu 24.04, kernel 6.17.0-35-generic, xe driver
  • GuC firmware: 70.53.0 (updated from upstream linux-firmware; the issue persists at this version)
  • intel-compute-runtime (NEO): 24.39.31294
  • Backend: OpenCL 3.0 NEO via oneAPI Unified Runtime

Tried, did NOT help

  • GuC firmware update 70.36.0 -> 70.53.0 (rebooted, confirmed loaded)
  • GGML_SYCL_DISABLE_OPT=1

Question

Is this a known CCS engine-reset issue on Arrow Lake / Xe2 under high-rate OpenCL compute, and is it addressed in a newer compute-runtime (I am on 24.39; latest is ~26.18) or a newer kernel? I am about to retest on a much newer stack (NEO 26.18, kernel 7.0) and will report back either way. Happy to provide the full devcoredump, dmesg, or a minimal repro on request.

Labels: OS: Linux, Type: Bug
