github ggml-org/llama.cpp b8680


[CUDA] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

  • Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized, more optimized kernel for the case where nblocks_stream_k is a multiple of ntiles_dst.
Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.

  • Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst, to ensure there is enough concurrency on the GPU

  • Address review comments

  • Address review comments

  • Revert variable names to original
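The rounding and dispatch rules in the bullets above can be sketched in plain C++. This is a hedged illustration, not the actual llama.cpp code: the function and struct names are hypothetical, and rounding the block count down (rather than up) to a multiple of ntiles_dst is an assumption.

```cpp
#include <cassert>

// Hypothetical sketch of the stream-K planning logic described in the
// commit messages above; names are illustrative, not from llama.cpp.
struct stream_k_plan {
    int  nblocks;          // possibly-rounded stream-K block count
    bool use_specialized;  // whether to dispatch the specialized fixup kernel
};

static stream_k_plan plan_stream_k(int nblocks_stream_k_raw, int ntiles_dst) {
    stream_k_plan p;
    // The specialized kernel is used only when there are enough blocks
    // (> 4 * ntiles_dst) to keep the GPU sufficiently busy.
    p.use_specialized = nblocks_stream_k_raw > 4 * ntiles_dst;
    p.nblocks = nblocks_stream_k_raw;
    // When nblocks_stream_k exceeds 2 * ntiles_dst, make it a multiple of
    // ntiles_dst (assumed here to round down).
    if (nblocks_stream_k_raw > 2 * ntiles_dst) {
        p.nblocks = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
    }
    return p;
}
```

Under this sketch, a raw count of 37 blocks with ntiles_dst = 4 rounds to 36 and takes the specialized path, while a raw count of 7 stays unchanged on the generic path.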

