ggml-org/llama.cpp b7240


Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

Release binaries are provided for macOS/iOS, Linux, Windows, and openEuler.

vulkan: Reduce temporary memory usage for TOP_K (#17623)

  • Compute the row size for the temp buffer based on the output of the first pass.
  • Update the shader addressing math to use the output row size.
  • Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k".

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
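
To make the sizing intuition concrete, here is a minimal C++ sketch of the arithmetic. It is not the actual Vulkan shader code: the 8-byte (value, index) candidate layout, the 128-column workgroup width, and the helper names are assumptions chosen only to illustrate why sizing the temp buffer from the first-pass output, rather than from the full input row, shrinks it by roughly ncols / (num_workgroups * K). The exact figures in the note above also depend on the real element layout and padding, so this only reproduces the order of magnitude.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustrative sizing sketch only; names, element layout, and workgroup width
// are assumptions, not the actual llama.cpp Vulkan shader parameters.
// Assume each candidate is stored as a (value, index) pair of 4 bytes each.
constexpr uint64_t BYTES_PER_CANDIDATE = 8;

// Previous approach (assumed): size the temp row for the full input row.
uint64_t temp_row_bytes_full_input(uint64_t ncols_input) {
    return ncols_input * BYTES_PER_CANDIDATE;
}

// Updated approach (assumed): size the temp row for the first-pass output,
// i.e. at most k surviving candidates per first-pass workgroup.
uint64_t temp_row_bytes_first_pass(uint64_t ncols_input, uint64_t k,
                                   uint64_t cols_per_workgroup) {
    uint64_t num_workgroups =
        (ncols_input + cols_per_workgroup - 1) / cols_per_workgroup;
    uint64_t first_pass_outputs = num_workgroups * std::min(k, cols_per_workgroup);
    return first_pass_outputs * BYTES_PER_CANDIDATE;
}

int main() {
    // The case cited above: K = 40, src0 = (200000, 1, 1, 1).
    const uint64_t ncols = 200000, k = 40;
    const uint64_t cols_per_workgroup = 128; // assumed first-pass tile width
    std::printf("full-input sizing : %llu bytes\n",
                (unsigned long long) temp_row_bytes_full_input(ncols));
    std::printf("first-pass sizing : %llu bytes\n",
                (unsigned long long) temp_row_bytes_first_pass(ncols, k, cols_per_workgroup));
    return 0;
}
```

With these assumed parameters, the cited src0=(200000,1,1,1), K=40 case yields a first-pass-sized buffer of roughly 500 KB per row, in line with the figure quoted above.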
