Highlights
🦅 Falcon support. Petals now supports all models based on Falcon, including Falcon 180B, released today. We improved the 🤗 Transformers `FalconModel` implementation to be up to 40% faster on recent GPUs. Our chatbot app runs Falcon 180B-Chat at ~2 tokens/sec.
Falcon-40B is licensed under Apache 2.0, so you can load it by specifying `tiiuae/falcon-40b` or `tiiuae/falcon-40b-instruct` as the model name. Falcon-180B is released under a custom license, and it is not yet clear whether we can provide a Python interface for inference and fine-tuning of this model. Right now, it is only available in the chatbot app, and we are waiting for further clarification from TII on this issue.
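For example, here is a minimal inference sketch for Falcon-40B using the standard Petals client API (this assumes servers hosting the model are online in the public swarm; the prompt is purely illustrative):

```python
# A minimal sketch of Falcon-40B inference over the public swarm
# (assumes petals is installed and servers hosting this model are online).
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# The client runs embeddings locally and sends hidden states through the swarm
inputs = tokenizer("A good name for a pet falcon is", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```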
🍏 Native macOS support. You can run Petals clients and servers on macOS natively. Just install Homebrew and run these commands:

```bash
brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2
```
If your computer has an Apple M1/M2 chip, the Petals server will use the integrated GPU automatically. We recommend hosting only Llama-based models for now, since other supported architectures do not yet run efficiently on M1/M2 chips. We also recommend using Python 3.10+ on macOS (Homebrew installs it automatically).
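Petals runs on PyTorch, and on Apple silicon the integrated GPU is exposed through PyTorch's MPS backend. If you want to confirm that the GPU is visible, here is a quick general-purpose PyTorch check (not a Petals-specific API):

```python
import torch

# On Apple M1/M2, the integrated GPU is available through the "mps" device
if torch.backends.mps.is_available():
    x = torch.ones(1, device="mps")  # allocate a tensor on the integrated GPU
    print("MPS backend is available:", x)
else:
    print("MPS backend is not available; the server will fall back to CPU")
```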
🔌 Serving custom models. Custom models now automatically show up at https://health.petals.dev as "not officially supported" models. As a reminder, you are not limited to the models listed at https://health.petals.dev: you can run a server hosting any model based on the BLOOM, Llama, or Falcon architecture (as long as the model's license allows it), or even add support for a new architecture yourself. We also improved Petals compatibility with some popular Llama-based models (e.g., models from NousResearch) in this release.
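For example, hosting a community Llama-based checkpoint uses the same server command as above, just with a different model name (the repository below is only an illustration; check the model's license before serving it):

```bash
python3 -m petals.cli.run_server NousResearch/Nous-Hermes-Llama2-13b
```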
🐞 Bug fixes. This release also fixes inference of prefix-tuned models, which was broken in Petals 2.1.0.
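For context, a prefix-tuned model in Petals is one loaded with trainable soft prompts while the distributed blocks stay frozen. Here is a minimal sketch in the spirit of the Petals prompt-tuning examples (the model name and pre_seq_len value are illustrative):

```python
# A minimal sketch of loading a model with trainable prompts ("prefix tuning").
# "ptune" trains prefix embeddings at the input layer; "deep_ptune" trains
# separate prefixes for each transformer block.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(
    model_name,
    tuning_mode="ptune",  # train soft prompts prepended to the input
    pre_seq_len=16,       # number of trainable prefix tokens (illustrative)
)

# Inference with such a model (the call fixed in this release)
inputs = tokenizer("Hello", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0]))
```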
What's Changed
- Require transformers>=4.32.0 by @borzunov in #479
- Fix requiring transformers>=4.32.0 by @borzunov in #480
- Rewrite MemoryCache alloc_timeout logic by @justheuristic in #434
- Refactor readme by @borzunov in #482
- Support macOS natively by @borzunov in #477
- Remove no-op process in PrioritizedTaskPool by @borzunov in #484
- Fix `.generate(input_ids=...)` by @borzunov in #485
- Wait for DHT storing state OFFLINE on shutdown by @borzunov in #486
- Fix race condition in MemoryCache by @borzunov in #487
- Replace dots in repo names when building DHT prefixes by @borzunov in #489
- Create model index in DHT by @borzunov in #491
- Force use_cache=True by @borzunov in #496
- Force use_cache=True in config only by @borzunov in #497
- Add Falcon support by @borzunov in #499
- Fix prompt tuning after #464 by @borzunov in #501
- Optimize the Falcon block for inference by @mryab in #500
Full Changelog: v2.1.0...v2.2.0