github ggml-org/llama.cpp b8210


CUDA: Improve performance via fewer synchronizations between tokens (#17795)

  • Adds CPU-to-CUDA copy capability to
    ggml_backend_cuda_cpy_tensor_async()

  • Adds function to relax sync requirements between input copies on
    supported backends (CUDA for now)

  • Replaces the synchronous copy with the async copy function.

  • Adds macro guards to allow compilation in non-CUDA builds

  • Reworked backend detection in ggml-backend.cpp to avoid linking
    conflicts

  • Relaxes the checks for async CUDA copies from backend and buffer
    type to just buffer type, to avoid linking issues

  • Minor cleanup

  • Makes the opt-in to relax explicit syncs more general. Backends like
    Vulkan, which require a synchronization between HtoD copies and graph
    execution, could also adopt this change now.

  • Reintroduces stricter check for CPU->CUDA backend async copy via
    GGML_DEVICE_TYPE_CPU.

  • Corrects initialization of ggml_backend_sync_mode in
    ggml_backend_sched_split

  • Simplifies synchronizations to adhere to saaasg pattern.

  • Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

  • Apply suggestion from @ggerganov (src->buffer to buf_src) v2

