We're excited to announce Petals 2.0.0 — the largest Petals release to date!
Highlights
🦙 Support for LLaMA and LLaMA 2. We've added support for inference and fine-tuning of any model based on 🤗 Transformers `LlamaModel`, including all variants of LLaMA and LLaMA 2, some of the strongest open-source models available today. The public swarm hosts the largest variants of these models, LLaMA-65B and LLaMA 2 (70B and 70B-Chat), providing inference at up to 5-6 tokens/sec.
- You can try them in the 💬 chatbot web app or in 🚀 our Colab tutorial.
🗜️ 4-bit quantization. We've integrated efficient 4-bit (NF4) quantization from the recent "QLoRA: Efficient Finetuning of Quantized LLMs" paper. Compared to the 8-bit quantization we used previously, it needs ~40% less GPU memory (and thus ~40% fewer servers) to fit all model blocks and gives a ~2x speedup for token-by-token inference, with a relatively small quality loss.
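To give some intuition for how 4-bit quantization saves memory, here is a minimal pure-Python sketch of block-wise 4-bit quantization with a uniform grid. This only illustrates the general idea: the helper names are hypothetical, and bitsandbytes' actual NF4 format uses normal-float levels and an optimized CUDA implementation.

```python
# Illustrative block-wise 4-bit quantization (hypothetical helpers,
# not the actual bitsandbytes NF4 code). Each block of weights is
# scaled by its absolute maximum and mapped to one of 16 levels.

def quantize_4bit(values, block_size=64):
    """Quantize floats to 4-bit codes (0..15) plus one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0
        scales.append(scale)
        # Map each value in [-scale, scale] to an integer code in [0, 15]
        codes.extend(round((v / scale + 1) * 7.5) for v in block)
    return codes, scales

def dequantize_4bit(codes, scales, block_size=64):
    """Recover approximate floats from codes and per-block scales."""
    return [(code / 7.5 - 1) * scales[i // block_size]
            for i, code in enumerate(codes)]

# Round-trip a toy weight vector and measure the worst-case error
weights = [0.03 * (i % 17) - 0.25 for i in range(256)]
codes, scales = quantize_4bit(weights)
restored = dequantize_4bit(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing one 4-bit code per weight (plus a small per-block scale) is what roughly halves memory use relative to 8-bit storage.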
🔌 Pre-loading LoRA adapters, such as Guanaco. We've added the ability to pre-load LoRA adapters compatible with the 🤗 PEFT library, which can add extra functionality to the model you host. You can do this with the `--adapters` argument on the server (e.g., `--adapters repo1/adapter1 repo2/adapter2`). These adapters are activated at a client's request: specifically, the client may pass `.from_pretrained(..., active_adapter="repo1/adapter1")` when loading a distributed model. One example is Guanaco, an instruction-finetuned adapter for LLaMA that turns it into a helpful chatbot carefully following the user's instructions. You can try LLaMA with this adapter in our chatbot app.
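A server pre-loading two adapters could be started like this (a sketch: the model repo is illustrative, and the adapter repo names are the placeholders from the example above):

```shell
# Launch a Petals server that hosts part of the model and
# pre-loads two PEFT-compatible LoRA adapters for clients to request
python -m petals.cli.run_server meta-llama/Llama-2-70b-hf \
    --adapters repo1/adapter1 repo2/adapter2
```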
➡️ Direct server-to-server communication. Previously, servers didn't send tensors to each other directly due to the specifics of our fault-tolerant inference algorithm. This update changes that, saving a round trip between the servers and the client and leading to substantial speedups for clients located far away from the servers they're using.
🛣️ Shortest-path routing for inference. Previously, a client didn't properly prefer geographically close and fast servers, so it could end up with a slow inference chain, especially if the swarm had many servers located far away from it. Now, the client builds a full graph of client-server and server-server latencies, as well as server inference speeds, to find the fastest chain of servers for inference among all possible ones. It also considers the amount of GPU memory left for attention caches, so that it doesn't choose a nearby server that doesn't actually have memory for the request.
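The core routing idea can be illustrated with a classic shortest-path search. The sketch below uses hypothetical names and a toy latency graph, not Petals' actual code; the real client additionally weighs server inference speeds and free attention-cache memory when scoring edges.

```python
# Dijkstra's algorithm over a toy latency graph: nodes are the client
# plus servers hosting consecutive block ranges; edge weights stand in
# for measured latencies (and, in Petals, per-block compute time).
import heapq

def cheapest_chain(graph, start, goal):
    """Return (total_cost, path) for the cheapest route in a graph
    given as {node: {neighbor: cost}}."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node in seen:
            continue
        seen.add(node)
        if node == goal:
            return cost, path
        for neighbor, edge in graph.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + edge, neighbor, path + [neighbor]))
    return float("inf"), []

# Client -> first-half servers (A or B) -> second-half server C -> client
graph = {
    "client": {"serverA": 0.05, "serverB": 0.20},
    "serverA": {"serverC": 0.03},
    "serverB": {"serverC": 0.01},
    "serverC": {"client_out": 0.05},
}
cost, path = cheapest_chain(graph, "client", "client_out")
```

Here the chain through serverA wins (total 0.13) even though serverB has the faster link to serverC, because the client-to-serverB hop dominates.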
🌎 Loading models directly from 🤗 Model Hub and `Auto` classes. Starting with Petals 2.0.0, models no longer need to be converted to a special format to be hosted by Petals. Instead, both clients and servers load models directly from 🤗 Model Hub, fetching only the shards they need for their part of the model. Furthermore, you can write code supporting multiple architectures at once using `Auto` classes, such as `AutoDistributedConfig.from_pretrained(...)` and `AutoDistributedModelForCausalLM.from_pretrained(...)`. The guide for adding new model architectures to Petals has also become much simpler, since the code is now generalized across architectures and the model conversion step is gone.
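Loading a model through the new `Auto` classes looks roughly like this (a sketch: the repo name is illustrative, and running it requires access to the public swarm plus, for LLaMA 2, an approved Hugging Face access token):

```python
# Load a distributed model straight from the Hub -- no conversion step
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-hf"  # any architecture Petals supports

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Generation runs the client-side layers locally and the
# transformer blocks on swarm servers
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```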
🏋️ Fine-tuning examples. We've switched most examples to LLaMA-65B and fixed previously reported bugs. In particular, the "Getting started" notebook now includes a simple example of deep prompt tuning on a dummy task, and the sequence classification notebook uses LLaMA-65B and improved hyperparameters for stable training.
🖥️ Upgraded swarm monitor. The swarm monitor now shows much more info about each server, including pre-loaded LoRA adapters, detailed performance stats, latencies to potential next servers, and so on. All this info is published to the DHT, so you don't need to ping each server to fetch it. We've also added a "Contributor" column, so contributors hosting 10+ blocks get a chance to publish their name or advertise their company or a social media account in exchange for hosting a server for Petals. The name (or link) shown there can be set with the server's `--public_name` argument.
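For example (the model repo and display name here are illustrative):

```shell
# Host a server and show a name plus link in the monitor's "Contributor" column
python -m petals.cli.run_server meta-llama/Llama-2-70b-hf \
    --public_name "YourName (https://example.com)"
```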
What's Changed
- Remove unused imports and attributes by @mryab in #324
- Determine block dtype in a unified manner by @mryab in #325
- Use number of tokens for attn_cache_size by @mryab in #286
- Add LLaMA support by @borzunov in #323
- Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} by @borzunov in #329
- Fix llama's lm_head.weight.requires_grad by @borzunov in #330
- Show license links when loading models by @borzunov in #332
- Add benchmark scripts by @borzunov in #319
- Fix warmup steps and minor issues in benchmarks by @borzunov in #334
- Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) by @borzunov in #337
- Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) by @borzunov in #333
- Allow free_disk_space_for() remove arbitrary files from Petals cache by @borzunov in #339
- Implement direct server-to-server communication by @borzunov in #331
- Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 by @borzunov in #340
- Delete deprecated petals.cli scripts by @borzunov in #336
- Use bitsandbytes 0.40.0.post4 with bias hotfix by @borzunov in #342
- Support peft LoRA adapters by @artek0chumak in #335
- Fix convergence issues and switch to LLaMA in the SST-2 example by @mryab in #343
- Mention LLaMA in readme by @borzunov in #344
- Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes by @borzunov in #345
- Fix Docker build by avoiding Python 3.11 by @borzunov in #348
- Support LLaMA repos without "-hf" suffix by @borzunov in #349
- Estimate adapter memory overhead in choose_num_blocks() by @justheuristic in #346
- Spam less in server logs by @borzunov in #350
- Remove unused import os by @justheuristic in #352
- Test that bitsandbytes is not imported when it's not used by @borzunov in #351
- Fix bugs in _choose_num_blocks() added in #346 by @borzunov in #354
- Switch adapters slightly faster by @justheuristic in #353
- Share more info about a server in DHT by @borzunov in #355
- Make a server ping next servers by @borzunov in #356
- Use bitsandbytes 0.40.1.post1 by @borzunov in #357
- Update readme and "Getting started" link by @borzunov in #360
- Report inference, forward, and network RPS separately by @borzunov in #358
- Fix typo in generation_algorithms.py by @eltociear in #364
- Implement shortest-path routing for inference by @borzunov in #362
- Update readme to show new models by @borzunov in #365
- Require transformers < 4.31.0 until we're compatible by @borzunov in #369
- Fix AssertionError on rebalancing by @borzunov in #370
- Update transformers to 4.31.0 and peft to 0.4.0 by @borzunov in #371
- Fix readme code example, require Python < 3.11 until supported by @borzunov in #374
- Fix handler memory leak, get rid of mp.Manager by @justheuristic in #373
- Inherit bitsandbytes compute dtype correctly (override peft quirk) by @justheuristic in #377
- Fix --token arg by @borzunov in #378
- Support Llama 2 by @borzunov in #379
- Require accelerate>=0.20.3 as transformers do by @borzunov in #383
- Bump version to 2.0.0.post1 by @borzunov in #384
New Contributors
- @eltociear made their first contribution in #364
Full Changelog: v1.1.5...v2.0.0.post1