We're excited to announce Petals 2.0.0 — the largest Petals release to date!
Highlights
🦙 Support for LLaMA and LLaMA 2. We've added support for inference and fine-tuning of any model based on 🤗 Transformers `LlamaModel`, including all variants of LLaMA and LLaMA 2, some of the strongest open-source models available today. The public swarm hosts the largest variants of these models, LLaMA-65B and LLaMA 2 (70B and 70B-Chat), providing inference at up to 5-6 tokens/sec.
- You can try them in the 💬 chatbot web app or in 🚀 our Colab tutorial.
🗜️ 4-bit quantization. We've integrated efficient 4-bit (NF4) quantization from the recent "QLoRA: Efficient Finetuning of Quantized LLMs" paper. Compared to the 8-bit quantization we used previously, it needs ~40% less GPU memory (and thus ~40% fewer servers) to fit all model blocks and gives a ~2x speedup for token-by-token inference, with a relatively small quality loss.
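To give some intuition for how 4-bit quantization saves memory, here is a minimal pure-Python sketch of block-wise 4-bit quantization with a uniform grid. This only illustrates the general idea: the helper names are hypothetical, and bitsandbytes' actual NF4 format uses normal-float levels and an optimized CUDA implementation.

```python
# Illustrative block-wise 4-bit quantization (hypothetical helpers,
# not the actual bitsandbytes NF4 code). Each block of weights is
# scaled by its absolute maximum and mapped to one of 16 levels.

def quantize_4bit(values, block_size=64):
    """Quantize floats to 4-bit codes (0..15) plus one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0
        scales.append(scale)
        # Map each value in [-scale, scale] to an integer code in [0, 15]
        codes.extend(round((v / scale + 1) * 7.5) for v in block)
    return codes, scales

def dequantize_4bit(codes, scales, block_size=64):
    """Recover approximate floats from codes and per-block scales."""
    return [(code / 7.5 - 1) * scales[i // block_size]
            for i, code in enumerate(codes)]

# Round-trip a toy weight vector and measure the worst-case error
weights = [0.03 * (i % 17) - 0.25 for i in range(256)]
codes, scales = quantize_4bit(weights)
restored = dequantize_4bit(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing one 4-bit code per weight (plus a small per-block scale) is what roughly halves memory use relative to 8-bit storage.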
🔌 Pre-loading LoRA adapters, such as Guanaco. We've added the ability to pre-load LoRA adapters compatible with the 🤗 PEFT library, which can add extra functionality to the model you host. You can do this with the `--adapters` argument on the server (e.g., `--adapters repo1/adapter1 repo2/adapter2`). These adapters are activated at a client's request: specifically, the client may pass `.from_pretrained(..., active_adapter="repo1/adapter1")` when loading a distributed model. One example is Guanaco, an instruction-finetuned adapter for LLaMA that turns it into a helpful chatbot carefully following the user's instructions. You can try LLaMA with this adapter in our chatbot app.
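A server pre-loading two adapters could be started like this (a sketch: the model repo is illustrative, and the adapter repo names are the placeholders from the example above):

```shell
# Launch a Petals server that hosts part of the model and
# pre-loads two PEFT-compatible LoRA adapters for clients to request
python -m petals.cli.run_server meta-llama/Llama-2-70b-hf \
    --adapters repo1/adapter1 repo2/adapter2
```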
➡️ Direct server-to-server communication. Previously, servers didn't send tensors to each other directly due to the specifics of our fault-tolerant inference algorithm. This update changes that, saving a round trip between the servers and the client and leading to substantial speedups for clients located far away from the servers they're using.
🛣️ Shortest-path routing for inference. Previously, a client didn't properly prefer geographically close and fast servers, so it could end up with a slow inference chain, especially if the swarm had many servers located far away from it. Now, the client builds a full graph of client-server and server-server latencies, as well as server inference speeds, to find the fastest chain of servers for inference among all possible ones. It also considers the amount of GPU memory left for attention caches, so that it doesn't choose a nearby server that doesn't actually have memory for the request.
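The core routing idea can be illustrated with a classic shortest-path search. The sketch below uses hypothetical names and a toy latency graph, not Petals' actual code; the real client additionally weighs server inference speeds and free attention-cache memory when scoring edges.

```python
# Dijkstra's algorithm over a toy latency graph: nodes are the client
# plus servers hosting consecutive block ranges; edge weights stand in
# for measured latencies (and, in Petals, per-block compute time).
import heapq

def cheapest_chain(graph, start, goal):
    """Return (total_cost, path) for the cheapest route in a graph
    given as {node: {neighbor: cost}}."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node in seen:
            continue
        seen.add(node)
        if node == goal:
            return cost, path
        for neighbor, edge in graph.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + edge, neighbor, path + [neighbor]))
    return float("inf"), []

# Client -> first-half servers (A or B) -> second-half server C -> client
graph = {
    "client": {"serverA": 0.05, "serverB": 0.20},
    "serverA": {"serverC": 0.03},
    "serverB": {"serverC": 0.01},
    "serverC": {"client_out": 0.05},
}
cost, path = cheapest_chain(graph, "client", "client_out")
```

Here the chain through serverA wins (total 0.13) even though serverB has the faster link to serverC, because the client-to-serverB hop dominates.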
🌎 Loading models directly from 🤗 Model Hub and `Auto` classes. Starting with Petals 2.0.0, models no longer need to be converted to a special format to be hosted by Petals. Instead, both clients and servers load models directly from 🤗 Model Hub, fetching only the shards they need for their part of the model. Furthermore, you can write code supporting multiple architectures at once using `Auto` classes, such as `AutoDistributedConfig.from_pretrained(...)` and `AutoDistributedModelForCausalLM.from_pretrained(...)`. The guide for adding new model architectures to Petals has also become much simpler, since the code is now generalized across architectures and the model conversion step is gone.
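Loading a model through the new `Auto` classes looks roughly like this (a sketch: the repo name is illustrative, and running it requires access to the public swarm plus, for LLaMA 2, an approved Hugging Face access token):

```python
# Load a distributed model straight from the Hub -- no conversion step
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-hf"  # any architecture Petals supports

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Generation runs the client-side layers locally and the
# transformer blocks on swarm servers
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```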
🏋️ Fine-tuning examples. We've switched most examples to LLaMA-65B and fixed previously reported bugs. In particular, the "Getting started" notebook now includes a simple example of deep prompt tuning on a dummy task, and the sequence classification notebook uses LLaMA-65B and improved hyperparameters for stable training.
🖥️ Upgraded swarm monitor. The swarm monitor now shows much more info about each server, including pre-loaded LoRA adapters, detailed performance stats, latencies to potential next servers, and so on. All this info is published to the DHT, so you don't need to ping each server to fetch it. We've also added a "Contributor" column, so contributors hosting 10+ blocks get a chance to publish their name or advertise their company or a social media account in exchange for hosting a server for Petals. The name (or link) shown there can be set with the server's `--public_name` argument.
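For example (the model repo and display name here are illustrative):

```shell
# Host a server and show a name plus link in the monitor's "Contributor" column
python -m petals.cli.run_server meta-llama/Llama-2-70b-hf \
    --public_name "YourName (https://example.com)"
```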
What's Changed
- Remove unused imports and attributes by @mryab in #324
- Determine block dtype in a unified manner by @mryab in #325
- Use number of tokens for attn_cache_size by @mryab in #286
- Add LLaMA support by @borzunov in #323
- Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} by @borzunov in #329
- Fix llama's lm_head.weight.requires_grad by @borzunov in #330
- Show license links when loading models by @borzunov in #332
- Add benchmark scripts by @borzunov in #319
- Fix warmup steps and minor issues in benchmarks by @borzunov in #334
- Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) by @borzunov in #337
- Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) by @borzunov in #333
- Allow free_disk_space_for() remove arbitrary files from Petals cache by @borzunov in #339
- Implement direct server-to-server communication by @borzunov in #331
- Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 by @borzunov in #340
- Delete deprecated petals.cli scripts by @borzunov in #336
- Use bitsandbytes 0.40.0.post4 with bias hotfix by @borzunov in #342
- Support peft LoRA adapters by @artek0chumak in #335
- Fix convergence issues and switch to LLaMA in the SST-2 example by @mryab in #343
- Mention LLaMA in readme by @borzunov in #344
- Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes by @borzunov in #345
- Fix Docker build by avoiding Python 3.11 by @borzunov in #348
- Support LLaMA repos without "-hf" suffix by @borzunov in #349
- Estimate adapter memory overhead in choose_num_blocks() by @justheuristic in #346
- Spam less in server logs by @borzunov in #350
- Remove unused import os by @justheuristic in #352
- Test that bitsandbytes is not imported when it's not used by @borzunov in #351
- Fix bugs in _choose_num_blocks() added in #346 by @borzunov in #354
- Switch adapters slightly faster by @justheuristic in #353
- Share more info about a server in DHT by @borzunov in #355
- Make a server ping next servers by @borzunov in #356
- Use bitsandbytes 0.40.1.post1 by @borzunov in #357
- Update readme and "Getting started" link by @borzunov in #360
- Report inference, forward, and network RPS separately by @borzunov in #358
- Fix typo in generation_algorithms.py by @eltociear in #364
- Implement shortest-path routing for inference by @borzunov in #362
- Update readme to show new models by @borzunov in #365
- Require transformers < 4.31.0 until we're compatible by @borzunov in #369
- Fix AssertionError on rebalancing by @borzunov in #370
- Update transformers to 4.31.0 and peft to 0.4.0 by @borzunov in #371
- Fix readme code example, require Python < 3.11 until supported by @borzunov in #374
- Fix handler memory leak, get rid of mp.Manager by @justheuristic in #373
- Inherit bitsandbytes compute dtype correctly (override peft quirk) by @justheuristic in #377
- Fix --token arg by @borzunov in #378
- Support Llama 2 by @borzunov in #379
- Require accelerate>=0.20.3 as transformers do by @borzunov in #383
- Bump version to 2.0.0.post1 by @borzunov in #384
New Contributors
- @eltociear made their first contribution in #364
Full Changelog: v1.1.5...v2.0.0.post1