github ggml-org/llama.cpp b9254

2 hours ago

Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (#22522)

  • Adds initial PDL setup.

  • Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.

  • Further optimization pass of the first half of kernels

  • Optimized PDL barriers for the second batch of kernels

  • Further refinements after rebase.

  • Moves PDL logic to a separate function and removes some whitespace

  • Strips post-hoc PDL logic

  • Adds stream capture PDL setup. Enrolls quantize_q8_1 into PDL to
    overlap execution with previous kernels

  • Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL

  • Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL

  • Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx,
    to enable hip/musa compatibility

  • Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32

  • Enrolls flash_attn_combine_results

  • Fix: Drops needless and broken check of CUDA arch for PDL. PDL either
    works or is without effect.

  • Enrolls flash-attention kernels into PDL

  • Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for
    kernel args. This fixes PDL.

  • Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via
    a template alias and template expansion

  • Enrolls all remaining kernels for qwen3-coder-next into PDL

  • Remove all PDL LC calls to create a baseline

  • Added LC according to internal guidance and tested kernel performance.

  • Enrolls missing qwen3.5 kernels passively into PDL.

  • Kernel optimizations (LC signals) for qwen3.5

  • Enrolls ssm-scan kernels into PDL

  • Adds GGML_CUDA_PDL command line option to toggle PDL.

  • Fix: Compilation for Ada and lower, by guarding PDL calls correctly

  • Cleanup: Removes commented out GGML_CUDA_PDL_LC

  • Cleanup: Removes experimental comments

  • Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.

  • Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.

  • Fix: Correct PDL en/disablement based on device-side arch check. Host
    side check is UB. Required moving from macros to inlined functions

  • Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1

  • Enable PDL by default for Hopper+ devices

  • Enrolls softcap_f32 and two flash_attn kernels into PDL.

  • Improves flash attn PDL barrier placement

  • Fix: Perf regression on ada; excludes ada and below from PDL launches

  • Improves some sync barrier placements

  • Drops superfluous constructor

  • Adds #endif guard comments

  • Reverts experimental change to top-k-moe.cu, which moved expensive allocations
    in front of the PDL barrier. It did not have a meaningful impact.

  • Replaces GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. PDL is disabled if
    and only if GGML_CUDA_PDL=0

  • Revert "Drops superfluous constructor". Adds const to remaining
    arguments

    This reverts commit 12b1d25.

  • Cleanup: Removes and fixes some comments and whitespace

  • Clarifies comment of sync-barrier position

  • Relocates and refactors PDL launch functions and accessories

  • Adds error checking to the regular kernel launch path

  • Drops "auto" in favor of "ggml_cuda_kernel_params"

  • Adds "const" to ggml_cuda_kernel_launch_params

  • [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
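
The enablement rules described in the notes above (on by default for Hopper+, never used on Ada and below, disabled only when GGML_CUDA_PDL=0) and the perfect-forwarding launch wrapper can be modeled roughly as follows. This is a minimal host-only C++ sketch, not the code from the PR: `device_info`, `pdl_enabled`, `launch_path` and `launch_sketch` are hypothetical names, the compute-capability integer stands in for the device-side arch check the notes mention, and the kernel is an ordinary callable so the example builds without CUDA.

```cpp
#include <cstring>
#include <utility>

// Hypothetical stand-in for a device description. Ada is compute
// capability 8.9 and Hopper is 9.0, so the major version is enough here.
struct device_info {
    int cc_major;
};

// Decision table taken from the release notes:
//   - Ada and below (cc_major < 9): PDL is never used.
//   - Hopper and newer: PDL is on by default.
//   - GGML_CUDA_PDL=0: PDL is disabled; any other value leaves it on.
// env_value is the raw value of GGML_CUDA_PDL, or nullptr if it is unset.
inline bool pdl_enabled(const device_info & dev, const char * env_value) {
    if (dev.cc_major < 9) {
        return false;
    }
    if (env_value != nullptr && std::strcmp(env_value, "0") == 0) {
        return false;
    }
    return true;
}

enum class launch_path { regular, pdl };

// Assumed shape of a launch wrapper: pick a path, then pass the kernel
// arguments through unchanged with perfect forwarding. A real CUDA version
// would launch through cudaLaunchKernelEx on the PDL path; here the
// "kernel" is simply called on the host.
template <typename Kernel, typename... Args>
launch_path launch_sketch(const device_info & dev, const char * env_value,
                          Kernel && kernel, Args &&... args) {
    const launch_path path =
        pdl_enabled(dev, env_value) ? launch_path::pdl : launch_path::regular;
    std::forward<Kernel>(kernel)(std::forward<Args>(args)...);
    return path;
}
```

A caller would pass `std::getenv("GGML_CUDA_PDL")` as `env_value`. Keeping the policy in one small function means the launch sites do not have to repeat the arch and environment checks.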

