github ggml-org/llama.cpp b8738

latest release: b8739

ggml: backend-agnostic tensor parallelism (experimental) (#19378)

  • ggml: backend-agnostic tensor parallelism

  • support for GPT-OSS, Qwen 3 MoE

  • partial Vulkan fix

  • add support for 4/8 GPUs

  • unconditional peer access

  • re-use buffers + ggml contexts

  • fix output pattern

  • NCCL support

  • GGML: HIP: add RCCL support

  • Remove shfl and AllReduce from backend interface

  • move allocation workaround out of ggml-alloc.c

  • 2d tensor set/get support

  • Fix the seg fault without NCCL

  • Apply suggestion from JohannesGaessler

  • support for tensor dims % n_devs != 0

  • fix view_offs scaling

  • arbitrary num. of GPUs/tensor split

  • fix compilation

  • better granularity estimate

  • Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

  • partial Qwen 3 Next support

  • Fix qwen3 30b (#8)

  • Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
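
The arithmetic behind this crash can be sketched as follows. This is a minimal illustration, not ggml code: `split_rows` and `is_even_split` are hypothetical helpers, and the two-GPU setup is an assumption. With a granularity of 256, a dimension of 768 yields only 3 blocks, which cannot be divided evenly between 2 devices. With a granularity of 32 (the number of elements in one Q4_0 block), the same dimension yields 24 blocks and splits evenly.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of ggml: split `n` elements across `n_devs`
// devices in whole blocks of `granularity` elements. Leftover blocks go to
// the first devices, so the result can be uneven.
// Assumes n % granularity == 0 and n_devs > 0.
std::vector<int64_t> split_rows(int64_t n, int64_t granularity, int n_devs) {
    const int64_t n_blocks = n / granularity;
    const int64_t base     = n_blocks / n_devs; // blocks every device gets
    const int64_t extra    = n_blocks % n_devs; // blocks left over
    std::vector<int64_t> out(n_devs);
    for (int i = 0; i < n_devs; ++i) {
        out[i] = (base + (i < extra ? 1 : 0)) * granularity;
    }
    return out;
}

// True if every device received the same number of elements.
bool is_even_split(const std::vector<int64_t> & parts) {
    if (parts.empty()) {
        return true;
    }
    for (int64_t p : parts) {
        if (p != parts[0]) {
            return false;
        }
    }
    return true;
}
```

Under these assumptions, `split_rows(768, 256, 2)` gives 512 and 256 elements (uneven), while `split_rows(768, 32, 2)` gives 384 and 384.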

  • Decide block size based on tensor quantization type

  • Fix crashes due to KV cache serialization (#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
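
To illustrate what a set/get with a non-zero offset involves when a tensor is distributed over several devices, here is a simplified sketch. It assumes each device holds one contiguous byte range of the logical tensor, laid out back to back; the actual meta backend splits along tensor dimensions and may work differently. `shard_write` and `map_write` are hypothetical names, not ggml APIs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One per-device piece of a write. All names here are hypothetical.
struct shard_write {
    int    dev;        // index of the device that owns the bytes
    size_t dst_offset; // offset inside that device's shard
    size_t src_offset; // offset inside the caller's source buffer
    size_t size;       // number of bytes to copy
};

// Map a write of `size` bytes at `offset` in the logical tensor onto shards.
// shard_sizes[i] is the number of bytes held by device i (simplifying
// assumption: shards are contiguous and stored in device order).
std::vector<shard_write> map_write(const std::vector<size_t> & shard_sizes,
                                   size_t offset, size_t size) {
    std::vector<shard_write> out;
    const size_t end = offset + size;
    size_t shard_begin = 0;
    for (int dev = 0; dev < (int) shard_sizes.size(); ++dev) {
        const size_t shard_end = shard_begin + shard_sizes[dev];
        // Intersect the requested range with this shard's range.
        const size_t lo = std::max(offset, shard_begin);
        const size_t hi = std::min(end, shard_end);
        if (lo < hi) {
            out.push_back({dev, lo - shard_begin, lo - offset, hi - lo});
        }
        shard_begin = shard_end;
    }
    return out;
}
```

In this model, a write of 80 bytes at offset 60 into two 100-byte shards becomes two copies: 40 bytes at offset 60 on device 0, and 40 bytes at offset 0 on device 1. A get would use the same mapping with source and destination swapped.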

  • metal : fix build (#7)

  • static memory allocations, fix usage count

  • fix tensor granularity

  • more even memory distribution

  • use BF16 for allreduce

  • rebase fixup

  • better error message for unsupported architectures

  • Fix device mismatch during scatter of allReduce. (#11)

There is a mismatch between the dst buffer device and the backend device, which causes synchronous copies to be used.

  • Enable the previous allreduce implementation. It is better in both perf and stability (#12)

  • delay AllReduce for MoE for less I/O

  • build : clean-up compile warnings

  • backend : move most of the meta backend API to ggml-backend-impl.h

  • cont : hide unused public API in the implementation

  • llama : use llama_device + remove ggml_backend_dev_is_meta()

  • ggml-backend : remove unused alloc include

  • minor : remove regex include

  • ggml : introduce ggml-ext.h for staging new APIs

  • rebase fixup

  • fix tests

  • llama : more robust logic for determining Meta devices (#16)

  • llama : more robust logic for determining Meta devices

  • cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

  • cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>


  • disable roundtrip for meta backend

  • fix arch selection

  • Qwen 3.5 support

  • fix Gemma 4 MoE

  • fix OpenVINO, SYCL

  • fix test-llama-archs for CPU-only builds

  • Fix Qwen 3.5 MoE

  • disable meta backend tests for WebGPU

  • tests : filter CPU-based devices from the Meta backend tests (#17)

  • meta : formatting, naming, indentation (#18)

  • formatting : llama-model.cpp

  • formatting : ggml-ext.h

  • formatting : ggml-backend-meta.cpp

  • meta : add TODO

  • add documentation

  • better error messages

  • fix GPT-OSS


Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
