ggml-org/llama.cpp b7865

Vulkan Flash Attention Coopmat1 Refactor (#19075)

  • vulkan: use coopmat for flash attention p*v matrix multiplication (see the sketch after this list)

  • fix P loading issue

  • fix barrier position

  • remove reduction that is no longer needed

  • move max thread reduction into loop

  • remove osh padding

  • add bounds checks and padding

  • remove unused code

  • fix shmem sizes, loop duration and accesses

  • don't overwrite Qf, add new shared psh buffer instead

  • add missing bounds checks

  • use subgroup reductions (see the sketch after this list)

  • optimize

  • move bounds check, reduce barriers

  • support other Bc values and other subgroup sizes

  • remove D_split

  • replace Of register array with shared memory Ofsh array

  • parallelize HSV across the rowgroups

  • go back to Of in registers, not shmem

  • vectorize sfsh

  • don't store entire K tile in shmem

  • fixes

  • load large k tiles to shmem on Nvidia

  • adapt shared memory host check function to shader changes

  • remove Bc 32 case

  • remove unused variable

  • fix missing mask reduction tmspsh barrier

  • fix mask bounds check

  • fix rowmax f16 under/overflow to inf (see the sketch after this list)

  • fix flash_attn_cm2 BLOCK_SIZE preprocessor directives

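The p*v item is the heart of the refactor: the softmax probabilities P and the V tile are pushed through the cooperative-matrix (coopmat1) path instead of a scalar loop. Below is a minimal sketch of that step, assuming 16x16x16 coopmat shapes and hypothetical shared buffers psh and vsh; the real flash_attn_cm1 shader uses its own names, tile sizes, bounds checks, and padding.

```glsl
#version 460
#extension GL_KHR_cooperative_matrix : enable
#extension GL_KHR_memory_scope_semantics : enable
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

// Hypothetical tile sizes and buffer names for illustration only.
const uint Br = 16;  // query rows in the tile
const uint Bc = 16;  // key/value rows in the current KV block
const uint D  = 16;  // head-dimension slice covered by one accumulator

shared float16_t psh[Br * Bc];  // softmax probabilities P for this tile
shared float16_t vsh[Bc * D];   // V tile staged in shared memory

void main() {
    coopmat<float16_t, gl_ScopeSubgroup, Br, Bc, gl_MatrixUseA> matP;
    coopmat<float16_t, gl_ScopeSubgroup, Bc, D,  gl_MatrixUseB> matV;
    // The output accumulator O stays in registers across KV blocks.
    coopmat<float16_t, gl_ScopeSubgroup, Br, D, gl_MatrixUseAccumulator> matO =
        coopmat<float16_t, gl_ScopeSubgroup, Br, D, gl_MatrixUseAccumulator>(float16_t(0.0));

    // Load both operands from shared memory and accumulate O += P * V
    // with a single cooperative-matrix multiply-add.
    coopMatLoad(matP, psh, 0, Bc, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(matV, vsh, 0, D,  gl_CooperativeMatrixLayoutRowMajor);
    matO = coopMatMulAdd(matP, matV, matO);
}
```

Keeping the O accumulator in registers rather than a shared Ofsh buffer matches the later "go back to Of in registers, not shmem" item in the list.
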
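The "use subgroup reductions" item refers to computing the online-softmax row statistics with subgroup arithmetic instead of shared-memory reductions. A rough sketch of the idea (placeholder values, not the PR's actual code):

```glsl
#version 460
#extension GL_KHR_shader_subgroup_arithmetic : enable

layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

void main() {
    // Each invocation holds a partial maximum of the attention scores for
    // its row; a placeholder value stands in for that partial here.
    float partial_max = float(gl_LocalInvocationID.x);

    // One subgroup instruction replaces a shared-memory tree reduction and
    // the barriers it needed.
    float row_max = subgroupMax(partial_max);

    // The online-softmax row sum of exp(score - row_max) folds the same way.
    float row_sum = subgroupAdd(exp(partial_max - row_max));
}
```
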
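On "fix rowmax f16 under/overflow to inf": a row maximum tracked in half precision can saturate to +/-inf, after which exp(score - rowmax) turns into inf or NaN. The sketch below shows the failure mode and one typical guard (clamping the running maximum to the finite f16 range); it is an assumption for illustration, not necessarily the exact change in the commit.

```glsl
#version 460
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

// Largest finite IEEE half-precision value.
const float16_t HALF_MAX = float16_t(65504.0);

// Hypothetical helper: keep the running row maximum inside the finite f16
// range so later exp(score - rowmax) terms cannot become inf or NaN.
float16_t safe_rowmax(float16_t rowmax, float16_t score) {
    return clamp(max(rowmax, score), -HALF_MAX, HALF_MAX);
}

void main() {
    // Placeholder usage so the sketch stands alone as a compute shader.
    float16_t m = safe_rowmax(-HALF_MAX, float16_t(gl_LocalInvocationID.x));
}
```
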
Release binaries are available for macOS/iOS, Linux, Windows, and openEuler.
