## Overview
- Documentation improvements
- Better handling of CTRL-C in interactive mode
- Matmul via low-precision kernels to take advantage of faster cuBLAS GEMM kernels (thanks @lucasavila00); see the sketch after this list
- New loading API (thanks @Jeadie)
- Various small bug fixes
- Reduce dependency complexity (thanks @LLukas22)
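The low-precision matmul work (#238, #317) casts f32 operands down to f16 so the GEMM can run on cuBLAS's faster half-precision kernels, up-casting the result afterwards. Below is a minimal CPU sketch of that precision strategy, assuming the `half` crate; it is illustrative only, as the real change dispatches the f16 GEMM to cuBLAS on GPU rather than running a naive loop:

```rust
use half::f16;

/// Naive m×k · k×n matmul mimicking an f16 GEMM: operands are stored in
/// f16, products are accumulated in f32 (as half-precision GEMM kernels do).
fn matmul_via_f16(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    // One down-cast per operand, amortised over the whole GEMM.
    let a16: Vec<f16> = a.iter().map(|&x| f16::from_f32(x)).collect();
    let b16: Vec<f16> = b.iter().map(|&x| f16::from_f32(x)).collect();
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a16[i * k + p].to_f32() * b16[p * n + j].to_f32();
            }
            out[i * n + j] = acc;
        }
    }
    out
}

fn main() {
    // 2×2 example: matches the f32 result up to f16 rounding error.
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    println!("{:?}", matmul_via_f16(&a, &b, 2, 2, 2)); // ≈ [19, 22, 43, 50]
}
```

Storing operands in f16 while accumulating in f32 is what keeps the accuracy loss small enough for inference while still unlocking the faster kernels.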
## What's Changed
- bug fix: llama kv cache part by @keisuke-niimi-insightedge-jp in #300
- Refactor cache manager and kv cache by @EricLBuehler in #304
- Update the docs for ISQ and misc by @EricLBuehler in #310
- Make `pyo3` an optional dependency in `mistralrs-core` by @LLukas22 in #303
- Update kv cache by @EricLBuehler in #312
- Print gguf metadata consistently by @EricLBuehler in #313
- Allow loading LoRA without activating adapters and fix bugs by @EricLBuehler in #306
- Remove spurious tokenizer warnings by @EricLBuehler in #314
- Better handling of ctrlc by @EricLBuehler in #315 (see the sketch after this list)
- Add analysis bot by @EricLBuehler in #316
- Quantized: Use cublas for prompt by @lucasavila00 in #238
- Support loading model into pipeline from local filesystem by @Jeadie in #308
- Fix the ctrlc handler by @EricLBuehler in #318
- Don't force QLlama to have >2 input dims by @Jeadie in #320
- Matmul via f16 when possible by @EricLBuehler in #317
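The net effect of the ctrlc changes (#315, #318) is that in interactive mode, CTRL-C cancels the in-flight generation instead of terminating the process. Here is a minimal sketch of that pattern, assuming the `ctrlc` crate; the flag and loop below are illustrative stand-ins, not the actual mistral.rs code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // Shared flag flipped by the signal handler.
    let interrupted = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&interrupted);
    // The handler only sets a flag, so it is safe to run from the signal context.
    ctrlc::set_handler(move || flag.store(true, Ordering::SeqCst))
        .expect("failed to install CTRL-C handler");

    // Stand-in for the interactive-mode token generation loop.
    loop {
        thread::sleep(Duration::from_millis(50)); // ... decode the next token ...
        if interrupted.swap(false, Ordering::SeqCst) {
            // CTRL-C abandons this generation but keeps the REPL alive.
            println!("\n[generation interrupted; back to the prompt]");
            break;
        }
    }
}
```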
## New Contributors
- @keisuke-niimi-insightedge-jp made their first contribution in #300
- @Jeadie made their first contribution in #308
**Full Changelog**: v0.1.7...v0.1.8