Release Notes
Enhancements
- Distributed Inference: Introduced support for distributed inference. See issues #117, #118.
- Legacy GPU Support: Added support for NVIDIA GPUs with Compute Capability below 8.0; a quick way to check a GPU's compute capability is sketched after this list. See issue #216.
- CPU Inference: Implemented support for CPU-based inference. See issue #141.
- Sharded Model Files: Added support for sharded model files. See issue #190.
- Enhanced Scheduling: Improved scheduling capabilities, including binpack and spread strategies, scheduling to specific GPUs, and worker label-based scheduling. See issue #99.
- Replica Scaling: Added functionality to scale model replicas from the model list page. See issue #210.
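The Legacy GPU Support item above hinges on a GPU's compute capability. Below is a minimal, illustrative sketch (not part of the release itself) for listing the compute capability of each detected GPU, assuming the `pynvml` bindings for the NVIDIA Management Library and an NVIDIA driver are installed:

```python
# Illustrative sketch: print each GPU's compute capability so you can tell
# whether it falls below 8.0 and therefore relies on the legacy GPU support.
# Assumes the pynvml package and an NVIDIA driver are installed.
import pynvml

pynvml.nvmlInit()
try:
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        name = pynvml.nvmlDeviceGetName(handle)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        print(f"GPU {index}: {name}, compute capability {major}.{minor}")
finally:
    pynvml.nvmlShutdown()
```

GPUs reporting a major version below 8 are the ones covered by the new legacy GPU support.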
Bug Fixes
- Model Deployment Issue: Resolved an issue where model instances could not be deployed after updating worker names. See issue #191.
- GPU Detection: Fixed GPU detection problems when the `nvidia-drm` kernel module was not loaded; a quick module check is sketched after this list. See issue #212.
- Custom System Reserved Parameter: Addressed failures when using custom `--system-reserved` parameters. See issue #152.
- Ollama Models Download: Fixed an issue that prevented non-library Ollama models from being downloaded correctly. See issue #230.
- GPU Index Assignment: Fixed an issue where incorrect GPU indices were assigned to model instances. See issue #221.
- GLIBC Version Not Found Error: Fixed a bug where model deployment failed with a "GLIBC version not found" error. See issue #270.
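Related to the GPU detection fix above, here is a hypothetical troubleshooting helper (not shipped with the release) for confirming whether the `nvidia-drm` kernel module is loaded on a Linux worker:

```python
# Hypothetical troubleshooting helper: /proc/modules lists loaded kernel modules
# one per line, with the module name first (underscored, e.g. "nvidia_drm").
from pathlib import Path

def nvidia_drm_loaded() -> bool:
    lines = Path("/proc/modules").read_text().splitlines()
    return any(line.split()[0] == "nvidia_drm" for line in lines if line.strip())

if __name__ == "__main__":
    status = "loaded" if nvidia_drm_loaded() else "not loaded"
    print(f"nvidia-drm kernel module is {status}")
```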