bigscience-workshop/petals v1.1.4: Extended GPU support, faster startup, and more


Highlights

🗝️ 8-bit servers support more GPUs. A bitsandbytes update brings 8-bit support to older generations of NVIDIA GPUs, as well as to the GeForce 16 series (e.g., the GTX 1660 Ti). Please try Petals 1.1.4 if you previously saw errors like "Your GPU does not support Int8 Matmul!" or "cublasLt ran into an error!" on some GPUs. This version also loads weights in 8-bit by default when tensor parallelism is enabled.
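As a minimal sketch of combining both features, you can launch a server with 8-bit weights sharded across two GPUs. The flag names below (`--load_in_8bit`, `--tensor_parallel_devices`) are assumptions based on the v1.1.x `petals.cli.run_server` CLI and may differ in your version; check `python -m petals.cli.run_server --help`. The subprocess wrapper is only there to show the full command from Python; you would normally run it directly in a shell.

```python
# Sketch: launch a Petals server with 8-bit weights and tensor parallelism.
# The flags below are assumptions based on the v1.1.x CLI; verify with --help.
import subprocess

subprocess.run(
    [
        "python", "-m", "petals.cli.run_server", "bigscience/bloom-petals",
        "--load_in_8bit", "True",                         # serve block weights in 8-bit (assumed flag)
        "--tensor_parallel_devices", "cuda:0", "cuda:1",  # shard each block across two GPUs (assumed flag)
    ],
    check=True,
)
```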

⏱️ Servers start faster. Servers now take ~2x less time to load block weights from the disk cache into GPU memory. The next release will also cut the time needed to download weights from the Internet, since they will be downloaded in 8-bit instead of 16-bit.

🧵 Multi-threaded clients work faster. Previously, multi-threaded clients performed only one network request at a time due to a bug in hivemind, which has now been fixed. This significantly speeds up the chat.petals.ml app when multiple users chat concurrently.
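For illustration, here is a minimal sketch (not from the release notes) of the pattern this fix accelerates: several Python threads issuing `generate()` calls through one Petals client at the same time, using the standard Petals 1.x client API.

```python
# Sketch: concurrent generation requests from multiple threads through one client.
# Before the hivemind fix, these requests were effectively serialized on the network.
from concurrent.futures import ThreadPoolExecutor

import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=16)
    return tokenizer.decode(outputs[0])

# Each thread opens its own request; with the fix, they overlap on the network.
with ThreadPoolExecutor(max_workers=4) as pool:
    for text in pool.map(generate, ["Hello,", "Bonjour,", "Hola,", "Ciao,"]):
        print(text)
```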

⏱️ Clients start faster. Clients take ~10% less time to load the model, since they build a route through remote servers in parallel with loading the local part of the model (input/output embeddings).

🌳 Relaxed dependency requirements. We relaxed the version requirements for transformers and other Hugging Face libraries, so you can update them independently of Petals. In particular, Petals works with PyTorch 2.0 and the latest transformers release. We also fixed a bug where, with some transformers releases, the client loaded the model in float32 by default instead of bfloat16/float16. Please try Petals 1.1.4 if you previously hit out-of-memory errors when running the client.
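If you prefer not to rely on the default, you can pin the dtype explicitly when loading the client model. A minimal sketch, assuming the standard Petals 1.x client API:

```python
# Pin the client-side dtype explicitly to sidestep an accidental float32 load.
import torch
from petals import DistributedBloomForCausalLM

model = DistributedBloomForCausalLM.from_pretrained(
    "bigscience/bloom-petals",
    torch_dtype=torch.bfloat16,  # explicit dtype instead of the library default
)
```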

What's Changed

Full Changelog: v1.1.3...v1.1.4
