ggml: backend-agnostic tensor parallelism (experimental) (#19378)
- ggml: backend-agnostic tensor parallelism
- support for GPT-OSS, Qwen 3 MoE
- partial Vulkan fix
- add support for 4/8 GPUs
- unconditional peer access
- re-use buffers + ggml contexts
- fix output pattern
- NCCL support
- GGML: HIP: add RCCL support
- Remove shfl and AllReduce from backend interface
- move allocation workaround out of ggml-alloc.c
- 2d tensor set/get support
- Fix the seg fault without NCCL
- Apply suggestion from JohannesGaessler
- support for tensor dims % n_devs != 0
- fix view_offs scaling
- arbitrary num. of GPUs/tensor split
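The last item above ("arbitrary num. of GPUs/tensor split") boils down to distributing a tensor's rows by user-supplied proportions. A minimal sketch in plain Python (hypothetical helper, not the ggml implementation):

```python
def split_rows(n_rows, proportions):
    """Distribute n_rows across devices according to split proportions.

    Hypothetical sketch of a --tensor-split style option; each device's
    share is rounded down and any leftover rows go to the last device.
    """
    total = sum(proportions)
    counts = [int(n_rows * p / total) for p in proportions]
    counts[-1] += n_rows - sum(counts)  # hand the remainder to the last device
    return counts

# Uneven 3-way split of 100 rows in a 3:1:1 ratio.
print(split_rows(100, [3, 1, 1]))  # → [60, 20, 20]
# Rows not divisible by the device count still sum correctly.
print(split_rows(7, [1, 1]))       # → [3, 4]
```

Giving the remainder to one device keeps every other device's slice aligned, which matches the "tensor dims % n_devs != 0" support mentioned above.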
- fix compilation
- better granularity estimate
- Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
  Fix compilation errors.
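The selection rule described in that commit can be sketched as follows: use a device-specific host buffer type only when every underlying backend reports the same one, otherwise fall back to plain pageable memory. Names here are hypothetical; this is not the ggml API:

```python
def pick_host_buffer_type(device_host_types):
    """Return the shared host buffer type if every underlying backend
    reports the same one, else None (fall back to pageable memory).

    Conceptual sketch only; the type names are made up for illustration.
    """
    types = set(device_host_types)
    if len(types) == 1 and None not in types:
        return types.pop()  # e.g. CUDA pinned host memory
    return None

# All backends agree -> the pinned type can be used for host buffers.
print(pick_host_buffer_type(["cuda_host_pinned", "cuda_host_pinned"]))  # → cuda_host_pinned
# Mixed backends -> no common type, fall back to pageable memory.
print(pick_host_buffer_type(["cuda_host_pinned", "vulkan_host"]))       # → None
```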
- partial Qwen 3 Next support
- Fix qwen3 30b (#8)
- Fix crash with Qwen-30B-A3B Q4_0
  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
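The arithmetic behind this crash can be sketched in plain Python (assumed behavior, not the actual code): splitting a dimension on fixed-size block boundaries can make an even split impossible, and choosing the granularity from the quantization block size, as the following commit does, restores it.

```python
def split_blocks(dim, granularity, n_devs):
    """Split `dim` into equal per-device spans aligned to `granularity`.
    Returns None when an even, aligned split does not exist."""
    assert dim % granularity == 0
    n_blocks = dim // granularity
    if n_blocks % n_devs != 0:
        return None  # uneven block split -> unsupported at the time of the fix
    per_dev = (n_blocks // n_devs) * granularity
    return [per_dev] * n_devs

# Qwen-30B-A3B Q4_0: intermediate dim 768 with granularity 256 -> 3 blocks.
print(split_blocks(768, 256, 2))  # → None (3 blocks cannot split over 2 GPUs)
# Granularity derived from the Q4_0 quantization block size (32 weights):
print(split_blocks(768, 32, 2))   # → [384, 384]
```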
- Decide block size based on tensor quantization type
- Fix crashes due to KV cache serialization (#9)
  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
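What "set/get a tensor with a non-zero offset" means can be sketched over a flat byte buffer (plain Python, not the ggml meta-backend code):

```python
def set_tensor(buffer: bytearray, data: bytes, offset: int) -> None:
    """Write `data` into a flat tensor buffer starting at a byte offset."""
    buffer[offset:offset + len(data)] = data

def get_tensor(buffer: bytearray, offset: int, size: int) -> bytes:
    """Read `size` bytes back starting at the same byte offset."""
    return bytes(buffer[offset:offset + size])

buf = bytearray(16)
# KV-cache serialization touches a sub-range of the tensor, so the
# offset is non-zero rather than always 0.
set_tensor(buf, b"\x01\x02\x03\x04", 8)
print(get_tensor(buf, 8, 4))  # → b'\x01\x02\x03\x04'
```

For the tensor-parallel meta backend the extra work is mapping that byte range onto the per-device shards, which this fix adds.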
- metal : fix build (#7)
- static memory allocations, fix usage count
- fix tensor granularity
- more even memory distribution
- use BF16 for allreduce
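Why BF16 helps here: it keeps float32's exponent range in half the bytes, halving allreduce traffic at the cost of mantissa precision. A sketch of the (truncating) conversion; real converters usually round-to-nearest-even instead:

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """bfloat16 keeps the top 16 bits of the IEEE-754 float32 pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Widen back to float32 by zero-filling the dropped mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# Round-trippable values survive; others lose low mantissa bits.
print(bf16_bits_to_f32(f32_to_bf16_bits(1.0)))      # → 1.0
print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159)))  # → 3.140625
```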
- rebase fixup
- better error message for unsupported architectures
- Fix device mismatch during scatter of allReduce. (#11)
  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.
- Enable the previous allreduce implementation. It is better in both perf and stability (#12)
- delay AllReduce for MoE for less I/O
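The I/O saving from delaying the AllReduce can be sketched with toy lists standing in for per-GPU tensors (not the actual ggml graph change): summing each device's expert outputs locally first replaces one reduction per expert with a single reduction.

```python
def allreduce(vectors):
    """Toy sum-allreduce: element-wise sum across 'devices'."""
    return [sum(col) for col in zip(*vectors)]

def moe_naive(per_dev_expert_outs):
    """One allreduce per expert: n_expert communication rounds."""
    out, rounds = None, 0
    for e in range(len(per_dev_expert_outs[0])):
        red = allreduce([dev[e] for dev in per_dev_expert_outs])
        out = red if out is None else [a + b for a, b in zip(out, red)]
        rounds += 1
    return out, rounds

def moe_delayed(per_dev_expert_outs):
    """Sum each device's expert outputs locally, then allreduce once."""
    local = [[sum(vals) for vals in zip(*dev)] for dev in per_dev_expert_outs]
    return allreduce(local), 1

devs = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 devices x 2 experts x dim 2
print(moe_naive(devs))    # → ([16, 20], 2)
print(moe_delayed(devs))  # → ([16, 20], 1)
```

Both paths produce the same result because summation is associative; only the number of cross-device rounds changes.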
- build : clean-up compile warnings
- backend : move most of the meta backend API to ggml-backend-impl.h
- cont : hide unused public API in the implementation
- llama : use llama_device + remove ggml_backend_dev_is_meta()
- ggml-backend : remove unused alloc include
- minor : remove regex include
- ggml : introduce ggml-ext.h for staging new APIs
- rebase fixup
- fix tests
- llama : more robust logic for determining Meta devices (#16)
- llama : more robust logic for determining Meta devices
- cont : fix devs size check
  Co-authored-by: Johannes Gäßler johannesg@5d6.de
- cont : fix log type
  Co-authored-by: Johannes Gäßler johannesg@5d6.de
- disable roundtrip for meta backend
- fix arch selection
- Qwen 3.5 support
- fix Gemma 4 MoE
- fix OpenVINO, SYCL
- fix test-llama-archs for CPU-only builds
- Fix Qwen 3.5 MoE
- disable meta backend tests for WebGPU
- tests : filter CPU-based devices from the Meta backend tests (#17)
- meta : formatting, naming, indentation (#18)
- formatting : llama-model.cpp
- formatting : ggml-ext.h
- formatting : ggml-backend-meta.cpp
- meta : add TODO
- add documentation
- better error messages
- fix GPT-OSS

Co-authored-by: Carl Philipp Klemm carl@uvos.xyz
Co-authored-by: Gaurav Garg gaugarg@nvidia.com
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled)
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: