Features
- server: optimize past key values handling for flash attention (contributed by @njhill)
- router: remove requests when the client closes the connection (co-authored by @njhill)
- server: support quantization for flash models
- router: add info route
- server: optimize token decode
- server: support flash sharded SantaCoder
- security: image signing with cosign
- security: image analysis with Trivy
- docker: reduce image size
Fixes
- server: check CUDA capability before importing flash attention
- server: fix hf_transfer issue with private repositories
- router: add auth token for private tokenizers
Misc
- rust: update to 1.69