github ollama/ollama
v0.5.0

Pre-release · 19 days ago

What's Changed

  • Fixed error importing model vocabulary files
  • Experimental: a new flag to set KV cache quantization to 4-bit (q4_0), 8-bit (q8_0), or 16-bit (f16). This reduces VRAM requirements for longer context windows.
    • To enable for all models, use `OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve` (see the sketch after this list)
    • Note: in the future, flash attention will be enabled by default where available, and KV cache quantization will be configurable on a per-model basis
    • Thank you @sammcj for the contribution in #7926
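
As a minimal sketch of trying the new setting (the model name, prompt, and `num_ctx` value below are placeholder examples; q8_0 is used here instead of q4_0 to trade a little more VRAM for better cache precision):

```shell
# Start the server with flash attention enabled and the KV cache quantized to 8-bit.
# OLLAMA_KV_CACHE_TYPE accepts f16 (the default), q8_0, or q4_0.
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

# In another terminal, request a generation with a larger context window
# (llama3.2 is just an example; use any model you have pulled).
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the plot of Hamlet.",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'

# Check how much memory the loaded model is using.
ollama ps
```

With the quantized cache, the memory reported by `ollama ps` for the same model and context length should be lower than with the default f16 cache.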

New Contributors

Full Changelog: v0.4.7...v0.5.0-rc1
