github ollama/ollama
v0.5.0

Pre-release · 19 days ago

What's Changed

  • Fixed error importing model vocabulary files
  • Experimental: a new flag to set KV cache quantization to 4-bit (q4_0), 8-bit (q8_0), or 16-bit (f16). This reduces VRAM requirements for longer context windows.
    • To enable for all models, use `OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve` (see the sketch after this list)
    • Note: in the future, flash attention will be enabled by default where available, and KV cache quantization will be configurable on a per-model basis
    • Thank you @sammcj for the contribution in #7926
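
As a minimal sketch of trying the new setting (the model name, prompt, and `num_ctx` value below are placeholder examples; q8_0 is used here instead of q4_0 to trade a little more VRAM for better cache precision):

```shell
# Start the server with flash attention enabled and the KV cache quantized to 8-bit.
# OLLAMA_KV_CACHE_TYPE accepts f16 (the default), q8_0, or q4_0.
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# In another terminal, request a generation with a larger context window
# (llama3.2 is just an example; use any model you have pulled).
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the plot of Hamlet.",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'

# Check how much memory the loaded model is using.
ollama ps
```

With the quantized cache, the memory reported by `ollama ps` for the same model and context length should be lower than with the default f16 cache.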

New Contributors

Full Changelog: v0.4.7...v0.5.0-rc1
