github ggml-org/llama.cpp b9145

2 hours ago

SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597)

  • SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

  • SYCL: address review feedback - remove try/catch, check device types, deduplicate

  • Remove try/catch from malloc/free/memcpy helpers, check backend and
    device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
  • Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
    and declare in common.hpp to eliminate code duplication
  • Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
  • Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
    host-staged path for iGPU-to-dGPU transfers
  • Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
    in CMakeLists.txt (co-authored with @arthw)

  • SYCL: add build/runtime flags for Level Zero, address review feedback

Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.

  • Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
    Zero code is wrapped in #ifdef so the build works on systems without
    the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
    loader library and headers are checked before enabling.

  • Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
    whether Level Zero or SYCL memory APIs are used. Only one API style is
    used per session, no mixing. If Level Zero is enabled but the devices
    don't support the Level Zero backend, it auto-disables with a warning.

  • Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
    is not called anywhere in the backend) and used try/catch for flow control.

  • Update SYCL.md with documentation for both new parameters.

Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.
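
The runtime flag described above can be sketched as follows. This is a minimal illustration in standard C++, not the backend's actual code: `env_int_or`, `gpu_sketch`, and `resolve_level_zero_enabled` are hypothetical names, and the boolean field stands in for a real SYCL backend query. Only the variable name GGML_SYCL_ENABLE_LEVEL_ZERO and its default of 1 come from the notes.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper: read an integer environment variable, returning
// `fallback` when the variable is unset or empty.
static int env_int_or(const char * name, int fallback) {
    const char * value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    return std::atoi(value);  // non-numeric text parses as 0 (disabled)
}

// Hypothetical stand-in for a GPU device; the real backend would ask the
// SYCL runtime which backend each device belongs to.
struct gpu_sketch {
    bool uses_level_zero_backend;
};

// Decide once which memory API the whole session uses (no mixing).
// Level Zero is on by default and is switched off when any GPU does not
// expose the Level Zero backend.
static bool resolve_level_zero_enabled(const std::vector<gpu_sketch> & gpus) {
    if (env_int_or("GGML_SYCL_ENABLE_LEVEL_ZERO", 1) == 0) {
        return false;  // explicitly disabled by the user
    }
    for (const gpu_sketch & gpu : gpus) {
        if (!gpu.uses_level_zero_backend) {
            return false;  // auto-disable; the notes say a warning is printed here
        }
    }
    return true;
}
```

Resolving the flag a single time and storing the result is what makes the "one API style per session" guarantee simple to uphold: every later allocation and free reads the same stored decision.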

  • SYCL: unify Level Zero malloc/free call sites, address review feedback

Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
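
The "#ifdef inside the functions, not at call sites" structure from this commit might look roughly like the sketch below. It is an assumption-laden model rather than the real helpers: `SKETCH_SUPPORT_LEVEL_ZERO`, `g_sketch_enable_level_zero`, `sketch_malloc_device`, and `sketch_free_device` are hypothetical names, and `std::malloc`/`std::free` stand in for the zeMemAllocDevice and sycl::malloc_device paths so the example compiles without either SDK.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical compile-time switch, standing in for whatever macro the
// GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option defines. Uncomment to model an
// ON build.
// #define SKETCH_SUPPORT_LEVEL_ZERO 1

// Hypothetical runtime switch. It is always defined and starts at 0, so
// builds without Level Zero can never take the Level Zero branch.
int g_sketch_enable_level_zero = 0;

// Always available to callers; the conditional compilation lives in here.
void * sketch_malloc_device(std::size_t size) {
#ifdef SKETCH_SUPPORT_LEVEL_ZERO
    if (g_sketch_enable_level_zero) {
        // A real build would call zeMemAllocDevice here.
        return std::malloc(size);
    }
#endif
    // A real build would call sycl::malloc_device here.
    return std::malloc(size);
}

// Mirror of the allocator: same switches, same shape.
void sketch_free_device(void * ptr) {
#ifdef SKETCH_SUPPORT_LEVEL_ZERO
    if (g_sketch_enable_level_zero) {
        // A real build would call zeMemFree here.
        std::free(ptr);
        return;
    }
#endif
    // A real build would call sycl::free here.
    std::free(ptr);
}
```

A call site then reads the same in every build configuration, for example `void * buf = sketch_malloc_device(n);` followed later by `sketch_free_device(buf);`, which is the property that lets the real call sites share one uniform error-checking wrapper.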

  • SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.

Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):

  • The Level Zero backend check was iterating all SYCL devices
    including CPU. The OpenCL CPU device caused Level Zero to be
    disabled for the GPUs, defeating the fix on multi-GPU systems.
    Added is_gpu() filter so only GPU devices are checked.

  • sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
    were still calling sycl::malloc/sycl::free directly, bypassing the
    Level Zero path. Routed through ggml_sycl_malloc_device/free_device
    for consistency with the other device memory call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
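
The first bug above is easy to reproduce in a small model. The sketch below is illustrative only: `device_sketch`, `backend_kind`, and both function names are hypothetical, and the struct fields stand in for the real SYCL device queries. It contrasts a check over all devices with one that skips non-GPU devices.

```cpp
#include <vector>

enum class backend_kind { level_zero, opencl };

// Hypothetical stand-in for a SYCL device: only the two properties that
// matter for this check.
struct device_sketch {
    bool         is_gpu;
    backend_kind backend;
};

// Buggy variant: every enumerated device must be on Level Zero, so a single
// OpenCL CPU device disables Level Zero for the GPUs as well.
static bool level_zero_ok_all_devices(const std::vector<device_sketch> & devices) {
    for (const device_sketch & dev : devices) {
        if (dev.backend != backend_kind::level_zero) {
            return false;
        }
    }
    return true;
}

// Fixed variant: only GPU devices are inspected, CPU devices are skipped.
static bool level_zero_ok_gpus_only(const std::vector<device_sketch> & devices) {
    for (const device_sketch & dev : devices) {
        if (!dev.is_gpu) {
            continue;  // e.g. the OpenCL CPU device
        }
        if (dev.backend != backend_kind::level_zero) {
            return false;
        }
    }
    return true;
}
```

With two Level Zero GPUs plus one OpenCL CPU device in the list, the first function returns false and the second returns true, which matches the behaviour change described in the commit.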

  • SYCL: address arthw review feedback on Level Zero memory API structure

  • Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
    only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
  • Switch both helpers to use g_ggml_sycl_enable_level_zero global
    instead of per-call queue backend checks
  • Remove #ifdef wrapper from global definition; always declare at 0,
    add #else branch in init block so it stays 0 when L0 not compiled in
  • Update init loop comment to explain GPU-only device check
  • CMakeLists: message(STATUS) before the if block; align option wording

AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

  • SYCL: remove unused cstdio/cstdlib includes from common.cpp

Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

Co-authored-by: Claude Sonnet 4.6 noreply@anthropic.com

  • Apply suggestions from code review

Co-authored-by: Neo Zhang zhang.jianyu@outlook.com

  • SYCL: preserve Level Zero allocation path during early malloc

  • ci: fix Level Zero package conflict in Intel Docker build

  • ci: find Level Zero loader in oneAPI package step

  • ci: allow Windows SYCL package without Level Zero DLL


Co-authored-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
Co-authored-by: Neo Zhang zhang.jianyu@outlook.com

