ggml-org/llama.cpp b7436

Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

Details

server: fix crash when batch > ubatch with embeddings (#17912)

Fixes #12836, where the server crashes with a GGML_ASSERT failure when running with embeddings enabled and n_batch > n_ubatch.

Root cause: Embeddings use non-causal attention, which requires all tokens to be processed within a single ubatch. When n_batch > n_ubatch, the server attempts to split processing across multiple ubatches, triggering the assertion failure.
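
For context, a rough conceptual sketch of why the split is fatal. This is illustrative C++ only, not the actual llama.cpp decode path; the function and parameter names here are invented:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: a batch of up to n_batch tokens is consumed in
// chunks of at most n_ubatch tokens.
void decode_ubatches(const std::vector<int> & batch, size_t n_ubatch, bool non_causal) {
    for (size_t i = 0; i < batch.size(); i += n_ubatch) {
        const size_t n_tokens = std::min(batch.size() - i, n_ubatch);
        // Non-causal (embedding) attention needs every token of the input
        // visible at once; a chunk covering only part of the batch cannot
        // be evaluated. This is the condition the GGML_ASSERT guards.
        assert(!non_causal || n_tokens == batch.size());
        // ... evaluate the ubatch of n_tokens tokens ...
        (void) n_tokens;
    }
}
```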

Solution:

  • Add parameter validation in main() after common_params_parse()
  • When embeddings are enabled and n_batch > n_ubatch (see the sketch after this list):
    • Log warnings explaining the issue
    • Automatically set n_batch = n_ubatch
    • Prevent the server crash
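
A minimal sketch of that startup check. The field names embedding, n_batch, and n_ubatch follow the parameters described above, but the struct is a hypothetical stand-in for common_params, and plain fprintf stands in for the server's logging; the PR's exact code and wording may differ:

```cpp
#include <cstdio>

// Hypothetical stand-in for the relevant common_params fields.
struct params_view {
    bool embedding;
    int  n_batch;
    int  n_ubatch;
};

// Runs once at startup, after argument parsing, mirroring the fix's intent.
void validate_embedding_params(params_view & params) {
    if (params.embedding && params.n_batch > params.n_ubatch) {
        fprintf(stderr, "warning: embeddings require the full input in a single ubatch, "
                        "but n_batch (%d) > n_ubatch (%d)\n", params.n_batch, params.n_ubatch);
        fprintf(stderr, "warning: setting n_batch = n_ubatch = %d\n", params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }
}
```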

This follows the approach suggested by @ggerganov in issue #12836.

Note: This supersedes the stalled PR #12940, which attempted a runtime fix in the old examples/server/server.cpp location. This implementation validates at startup in tools/server/server.cpp (the current location).

Testing:

  • Build: Compiles successfully
  • Validation triggers: Warns when -b > -ub with --embedding
  • Auto-correction works: Adjusts n_batch = n_ubatch
  • No false positives: Valid params don't trigger warnings
  • Verified on macOS (M3 Pro) with an embedding model
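
With this change, an invocation such as `llama-server -m model.gguf --embedding -b 4096 -ub 512` (model path illustrative) should log the warnings and continue with n_batch clamped to 512 rather than aborting.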

Co-authored-by: ytian218 <ytian218@bloomberg.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Downloads are available for macOS/iOS, Linux, Windows, and openEuler.