ggml-org/llama.cpp b7276


Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

Add support for CUMSUM and TRI for CUDA. (#17584)

  • Add support for CUMSUM and TRI for CUDA.

  • Minor optimizations.

  • Correct warp_prefix_inclusive_sum in float2 variant to return float2

  • Optimize TRI

  • Whitespace

  • Fix strides.

  • Implement double loop

  • Whitespace

  • Fix HIP compilation bugs

  • Optimizations + big case performance tests

  • Implement using CUB with fallback to custom kernel

  • Remove error message.

  • Fixes from code review

  • Comment out CPU-unsupported F16/BF16 cases to fix CI

  • Fine, you win :P

  • Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

  • Vary warp-size based on physical warp size

  • Add GGML_UNUSED_VARS in tri as well

  • Use constexpr and call prefix_inclusive with warp_size template param

  • Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

  • Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

  • Change to tid % warp_size

  • Fix strides; hardcode mask; add ggml_lane_mask_t

  • Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

  • Too hasty...


Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
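Several of the commits above revolve around a warp-level inclusive prefix sum (`warp_prefix_inclusive_sum`, the `warp_size` template parameter, `tid % warp_size`). The sketch below is not the code merged in #17584, only a minimal illustration of that pattern: a shuffle-based inclusive scan across one warp, with the toy kernel, launch configuration, and full-warp mask 0xffffffff all being illustrative assumptions.

```cuda
// Hedged sketch of a warp-level inclusive prefix sum (Hillis-Steele scan),
// loosely mirroring the pattern referenced by the commits above.
// Not the kernel from #17584.
#include <cstdio>
#include <cuda_runtime.h>

// Inclusive prefix sum across one warp using shuffle intrinsics.
// warp_size is a compile-time parameter, echoing the
// "call prefix_inclusive with warp_size template param" commit.
template <int warp_size>
__device__ float warp_prefix_inclusive_sum(float x) {
    const int lane = threadIdx.x % warp_size;
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        // Pull the partial sum from the lane `offset` positions below.
        const float y = __shfl_up_sync(0xffffffff, x, offset, warp_size);
        if (lane >= offset) {
            x += y;
        }
    }
    return x;
}

// Toy kernel: cumulative sum of a single row that fits in one warp.
template <int warp_size>
__global__ void cumsum_row(const float * src, float * dst, int n) {
    const int i = threadIdx.x;
    const float v = i < n ? src[i] : 0.0f; // out-of-range lanes contribute 0
    const float s = warp_prefix_inclusive_sum<warp_size>(v);
    if (i < n) {
        dst[i] = s;
    }
}

int main() {
    const int n = 8;
    float h_src[n] = {1, 2, 3, 4, 5, 6, 7, 8};
    float h_dst[n] = {0};

    float *d_src, *d_dst;
    cudaMalloc(&d_src, n*sizeof(float));
    cudaMalloc(&d_dst, n*sizeof(float));
    cudaMemcpy(d_src, h_src, n*sizeof(float), cudaMemcpyHostToDevice);

    cumsum_row<32><<<1, 32>>>(d_src, d_dst, n);
    cudaMemcpy(h_dst, d_dst, n*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) {
        printf("%g ", h_dst[i]); // expected: 1 3 6 10 15 21 28 36
    }
    printf("\n");

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

Per the "Implement using CUB with fallback to custom kernel" commit, the merged CUMSUM path also routes suitable cases through CUB's device-wide scan (cub::DeviceScan::InclusiveSum), keeping a hand-written kernel as the fallback.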

Downloads: macOS/iOS, Linux, Windows
