Release Notes
Enhancements
- Distributed Inference: Introduced support for distributed inference. See issues #117, #118.
- Legacy GPU Support: Added support for NVIDIA GPUs with Compute Capability below 8.0; a quick way to check a GPU's compute capability is sketched after this list. See issue #216.
- CPU Inference: Implemented support for CPU-based inference. See issue #141.
- Sharded Model Files: Added support for sharded model files. See issue #190.
- Enhanced Scheduling: Improved scheduling capabilities, including binpack and spread strategies, scheduling to specific GPUs, and worker label-based scheduling. See issue #99.
- Replica Scaling: Added functionality to scale model replicas from the model list page. See issue #210.
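The Legacy GPU Support item above hinges on a GPU's compute capability. Below is a minimal, illustrative sketch (not part of the release itself) for listing the compute capability of each detected GPU, assuming the `pynvml` bindings for the NVIDIA Management Library and an NVIDIA driver are installed:

```python
# Illustrative sketch: print each GPU's compute capability so you can tell
# whether it falls below 8.0 and therefore relies on the legacy GPU support.
# Assumes the pynvml package and an NVIDIA driver are installed.
import pynvml

pynvml.nvmlInit()
try:
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        name = pynvml.nvmlDeviceGetName(handle)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        print(f"GPU {index}: {name}, compute capability {major}.{minor}")
finally:
    pynvml.nvmlShutdown()
```

GPUs reporting a major version below 8 are the ones covered by the new legacy GPU support.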
Bug Fixes
- Model Deployment Issue: Resolved an issue where model instances could not be deployed after updating worker names. See issue #191.
- GPU Detection: Fixed GPU detection problems when the `nvidia-drm` kernel module was not loaded; a quick module check is sketched after this list. See issue #212.
- Custom System Reserved Parameter: Addressed failures when using custom `--system-reserved` parameters. See issue #152.
- Ollama Models Download: Fixed an issue that prevented non-library Ollama models from being downloaded correctly. See issue #230.
- GPU Index Assignment: Fixed an issue where incorrect GPU indices were assigned to model instances. See issue #221.
- GLIBC Version Not Found Error: Fixed a bug where model deployment failed with a "GLIBC version not found" error. See issue #270.
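Related to the GPU detection fix above, here is a hypothetical troubleshooting helper (not shipped with the release) for confirming whether the `nvidia-drm` kernel module is loaded on a Linux worker:

```python
# Hypothetical troubleshooting helper: /proc/modules lists loaded kernel modules
# one per line, with the module name first (underscored, e.g. "nvidia_drm").
from pathlib import Path

def nvidia_drm_loaded() -> bool:
    lines = Path("/proc/modules").read_text().splitlines()
    return any(line.split()[0] == "nvidia_drm" for line in lines if line.strip())

if __name__ == "__main__":
    status = "loaded" if nvidia_drm_loaded() else "not loaded"
    print(f"nvidia-drm kernel module is {status}")
```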