ggml-org/llama.cpp b7276


Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

Add support for CUMSUM and TRI for CUDA. (#17584)

  • Add support for CUMSUM and TRI for CUDA.

  • Minor optimizations.

  • Correct warp_prefix_inclusive_sum in float2 variant to return float2

  • Optimize TRI

  • Whitespace

  • Fix strides.

  • Implement double loop

  • Whitespace

  • Fix HIP compilation bugs

  • Optimizations + big case performance tests

  • Implement using CUB with fallback to custom kernel

  • Remove error message.

  • Fixes from code review

  • Comment out CPU-unsupported F16/BF16 cases to fix CI

  • Fine, you win :P

  • Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

  • Vary warp-size based on physical warp size

  • Add GGML_UNUSED_VARS in tri as well

  • Use constexpr and call prefix_inclusive with warp_size template param

  • Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

  • Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

  • Change to tid % warp_size

  • Fix strides; hardcode mask; add ggml_lane_mask_t

  • Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

  • Too hasty...


Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
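Several of the commits above revolve around a warp-level inclusive prefix sum (`warp_prefix_inclusive_sum`, the `warp_size` template parameter, `tid % warp_size`). The sketch below is not the code merged in #17584, only a minimal illustration of that pattern: a shuffle-based inclusive scan across one warp, with the toy kernel, launch configuration, and full-warp mask 0xffffffff all being illustrative assumptions.

```cuda
// Hedged sketch of a warp-level inclusive prefix sum (Hillis-Steele scan),
// loosely mirroring the pattern referenced by the commits above.
// Not the kernel from #17584.
#include <cstdio>
#include <cuda_runtime.h>

// Inclusive prefix sum across one warp using shuffle intrinsics.
// warp_size is a compile-time parameter, echoing the
// "call prefix_inclusive with warp_size template param" commit.
template <int warp_size>
__device__ float warp_prefix_inclusive_sum(float x) {
    const int lane = threadIdx.x % warp_size;
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        // Pull the partial sum from the lane `offset` positions below.
        const float y = __shfl_up_sync(0xffffffff, x, offset, warp_size);
        if (lane >= offset) {
            x += y;
        }
    }
    return x;
}

// Toy kernel: cumulative sum of a single row that fits in one warp.
template <int warp_size>
__global__ void cumsum_row(const float * src, float * dst, int n) {
    const int i = threadIdx.x;
    const float v = i < n ? src[i] : 0.0f; // out-of-range lanes contribute 0
    const float s = warp_prefix_inclusive_sum<warp_size>(v);
    if (i < n) {
        dst[i] = s;
    }
}

int main() {
    const int n = 8;
    float h_src[n] = {1, 2, 3, 4, 5, 6, 7, 8};
    float h_dst[n] = {0};

    float *d_src, *d_dst;
    cudaMalloc(&d_src, n*sizeof(float));
    cudaMalloc(&d_dst, n*sizeof(float));
    cudaMemcpy(d_src, h_src, n*sizeof(float), cudaMemcpyHostToDevice);

    cumsum_row<32><<<1, 32>>>(d_src, d_dst, n);
    cudaMemcpy(h_dst, d_dst, n*sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) {
        printf("%g ", h_dst[i]); // expected: 1 3 6 10 15 21 28 36
    }
    printf("\n");

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}
```

Per the "Implement using CUB with fallback to custom kernel" commit, the merged CUMSUM path also routes suitable cases through CUB's device-wide scan (cub::DeviceScan::InclusiveSum), keeping a hand-written kernel as the fallback.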

Downloads: macOS/iOS, Linux, Windows
