github ggml-org/llama.cpp b9460

latest release: b9464
2 hours ago

llama: limit max outputs of llama_context (#23861)

  • llama: save more VRAM by reserving n_outputs == n_seqs when possible

  • add n_outputs_per_seq

  • move n_outputs_max to server-context

  • change ubatch to batch everywhere

macOS/iOS:

Linux:

Android:

Windows:

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

UI:
