av/harbor v0.1.4 on GitHub

AirLLM

Handle: airllm
URL: http://localhost:33981

Quickstart |
Configurations |
MacOS |
Example notebooks |
FAQ

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation and pruning. And you can run 405B Llama3.1 on 8GB vram now.

Note that above is true, but don't expect a performant inference. AirLLM loads LLM layers into memory in small groups. The main benefit is that it allows a "transformers"-like workflow for models that are much much larger than your VRAM.

Note

AirLLM requires a GPU with CUDA by default, can't be run on CPU.

Starting

# [Optional] Pre-build the image
# Needs PyTorch and CUDA, so will be quite large
harbor build airllm

# Start the service
# Will download selected models if not present yet
harbor up airllm

For funsies, Harbor implements an OpenAI-compatible server for AirLLM, so that you can connect it to the Open WebUI and... wait... wait... wait... and then get an amazing response from a previously unreachable model.

Full Changelog: v0.1.3...v0.1.4