Warning
Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
Add support for CUMSUM and TRI for CUDA. (#17584)
- Add support for CUMSUM and TRI for CUDA.
- Minor optimizations.
- Correct warp_prefix_inclusive_sum in float2 variant to return float2 (see the warp prefix sum sketch after this list)
- Optimize TRI
- Whitespace
- Fix strides.
- Implement double loop
- Whitespace
- Fix HIP compilation bugs
- Optimizations + big case performance tests
- Implement using CUB with fallback to custom kernel (see the CUB dispatch sketch after this list)
- Remove error message.
- Fixes from code review
- Comment out CPU-unsupported F16/BF16 cases to fix CI
- Fine, you win :P
- Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS
- Vary warp-size based on physical warp size
- Add GGML_UNUSED_VARS in tri as well
- Use constexpr and call prefix_inclusive with warp_size template param
- Update ggml/src/ggml-cuda/cumsum.cu
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
- Apply suggestions from code review
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
- Change to tid % warp_size
- Fix strides; hardcode mask; add ggml_lane_mask_t
- Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()
- Too hasty...
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
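Several items above touch the warp-level inclusive prefix sum behind CUMSUM (the warp_prefix_inclusive_sum fix, the constexpr warp_size template parameter, and the tid % warp_size change). The following is a minimal sketch of that pattern, assuming the usual shuffle-based scan; the body is illustrative rather than the repository's implementation, and the full-warp mask assumes a 32-lane warp (the commits add ggml_lane_mask_t precisely to cover wider HIP warps). The float2 variant mentioned above would apply the same scan to each component and return a float2.

```cuda
#include <cuda_runtime.h>

// Shuffle-based inclusive prefix sum across one warp (Hillis-Steele scan).
// warp_size is a compile-time template parameter, as in the commit messages.
// NOTE: the 0xFFFFFFFFu mask assumes a 32-lane warp; a wider lane-mask type
// would be needed for 64-lane (HIP) warps.
template <int warp_size>
static __device__ __forceinline__ float warp_prefix_inclusive_sum(float x) {
    const int lane = threadIdx.x % warp_size;
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        const float y = __shfl_up_sync(0xFFFFFFFFu, x, offset, warp_size);
        if (lane >= offset) {
            x += y; // accumulate the value shuffled in from the lower lane
        }
    }
    return x; // lane i now holds the sum of lanes 0..i
}
```

A caller would typically launch one warp per row and combine per-warp partial sums for rows longer than a single warp.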
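The "CUB with fallback to custom kernel" item describes the dispatch strategy for the scan. Below is a minimal sketch of that shape, assuming cub::DeviceScan::InclusiveSum for the contiguous case; the USE_CUB guard, the cumsum_f32_cuda wrapper, and cumsum_fallback_kernel are placeholder names for illustration, not the actual symbols in ggml-cuda.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

#define USE_CUB // assume the CUB path is available; undefine to exercise the fallback

#ifdef USE_CUB
#include <cub/device/device_scan.cuh>
#endif

// Placeholder fallback: a single thread scans the array sequentially.
// Correct but slow; it only illustrates the dispatch structure.
static __global__ void cumsum_fallback_kernel(const float * src, float * dst, int64_t ne) {
    if (threadIdx.x != 0 || blockIdx.x != 0) {
        return;
    }
    float acc = 0.0f;
    for (int64_t i = 0; i < ne; ++i) {
        acc += src[i];
        dst[i] = acc;
    }
}

static void cumsum_f32_cuda(const float * src, float * dst, int64_t ne, cudaStream_t stream) {
#ifdef USE_CUB
    // CUB pattern: the first call queries the temporary storage size,
    // the second call runs the actual inclusive scan.
    void * d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, src, dst, ne, stream);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, src, dst, ne, stream);
    cudaFree(d_temp);
#else
    // CUB not available (e.g. some non-CUDA builds): use the custom kernel.
    cumsum_fallback_kernel<<<1, 1, 0, stream>>>(src, dst, ne);
#endif
}
```

The two-branch structure keeps the fast library scan where CUB ships with the toolkit while retaining a self-contained kernel path for builds where it does not.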