github bentoml/OpenLLM v0.4.0


Release Highlights

OpenLLM 0.4.0 brings a few revamped features.

Unified API

0.4.0 brings a revamped API for OpenLLM. Users can now run LLMs with two new APIs:

  • await llm.generate_iterator(prompt, stop, **kwargs)
  • await llm.generate(prompt, stop, **kwargs)

llm.generate performs one-shot generation for a given prompt, whereas llm.generate_iterator is its streaming variant.

import openllm, asyncio

llm = openllm.LLM("HuggingFaceH4/zephyr-7b-beta")

async def infer(prompt, **kwargs):
  return await llm.generate(prompt, **kwargs)

asyncio.run(infer("Time is a definition of"))
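
The streaming variant can be consumed with an async for loop. Below is a minimal sketch that prints text as it arrives; it reuses the generation.outputs[0].text access pattern from the service example further down, and the prompt is only an example.

import openllm, asyncio

llm = openllm.LLM("HuggingFaceH4/zephyr-7b-beta")

async def stream(prompt, **kwargs):
  # Each yielded item is a partial generation; print its text as it arrives.
  async for generation in llm.generate_iterator(prompt, **kwargs):
    print(generation.outputs[0].text, end="", flush=True)

asyncio.run(stream("Time is a definition of"))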

To use it within a BentoML Service, one can do the following:

import bentoml, openllm

llm = openllm.LLM("HuggingFaceH4/zephyr-7b-beta")

svc = bentoml.Service(name='zephyr-instruct', runners=[llm.runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text(media_type='text/event-stream'))
async def prompt(input_text: str):
  async for generation in llm.generate_iterator(input_text):
    yield f"data: {generation.outputs[0].text}\n\n"

Mistral support

Mistral is now supported in OpenLLM. Simply run openllm start mistral to start a Mistral server.
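
For the Python API, a minimal sketch might look like the following; the model id mistralai/Mistral-7B-Instruct-v0.1 is only an assumed example of a Mistral checkpoint on the HuggingFace hub.

import openllm, asyncio

# Assumed example model id; substitute any Mistral checkpoint you want to use.
llm = openllm.LLM("mistralai/Mistral-7B-Instruct-v0.1")

async def main():
  result = await llm.generate("Explain the difference between a list and a tuple in Python.")
  print(result)

asyncio.run(main())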

AWQ support

AWQ is now supported with both the vLLM and PyTorch backends. Simply pass --quantize awq to use AWQ.

Important

To use AWQ, it is crucial that the model weights are already quantized with AWQ. Please look on the HuggingFace hub for an AWQ variant of the model you want to use.
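
As a rough Python-API counterpart to the --quantize awq flag (for example openllm start mistral --quantize awq on the CLI), a minimal sketch could look like the following; the checkpoint name and the quantize/backend keyword arguments are assumptions here, mirroring the CLI flags.

import openllm

# Assumed AWQ-quantized checkpoint; look for the AWQ variant of your model on the HuggingFace hub.
# The quantize/backend keyword arguments are assumed to mirror --quantize/--backend on the CLI.
llm = openllm.LLM("TheBloke/zephyr-7B-beta-AWQ", quantize="awq", backend="vllm")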

General bug fixes

Fixes a bug with tag generation. Standalone Bentos that use this new API should work as expected if the model already exists in the model store.

For consistency, make sure to run openllm prune -y --include-bentos

Installation

pip install openllm==0.4.0

To upgrade from a previous version, use the following command:

pip install --upgrade openllm==0.4.0

Usage

All available models: openllm models

To start an LLM: python -m openllm start opt

To run OpenLLM within a container environment (requires GPUs): docker run --gpus all -it -P ghcr.io/bentoml/openllm:0.4.0 start opt

To run OpenLLM Clojure UI (community-maintained): docker run -p 8420:80 ghcr.io/bentoml/openllm-ui-clojure:0.4.0

Find more information about this release in the CHANGELOG.md

What's Changed

  • ci: pre-commit autoupdate [pre-commit.ci] by @pre-commit-ci in #563
  • chore(deps): bump aquasecurity/trivy-action from 0.13.0 to 0.13.1 by @dependabot in #562
  • chore(deps): bump taiki-e/install-action from 2.21.3 to 2.21.7 by @dependabot in #561
  • chore(deps-dev): bump eslint from 8.47.0 to 8.53.0 by @dependabot in #558
  • chore(deps): bump @vercel/og from 0.5.18 to 0.5.20 by @dependabot in #556
  • chore(deps-dev): bump @types/react from 18.2.20 to 18.2.35 by @dependabot in #559
  • chore(deps-dev): bump @typescript-eslint/eslint-plugin from 6.9.0 to 6.10.0 by @dependabot in #564
  • fix : updated client to toggle tls verification by @ABHISHEK03312 in #532
  • perf: unify LLM interface by @aarnphm in #518
  • fix(stop): stop is not available in config by @aarnphm in #566
  • infra: update docs on serving fine-tuning layers by @aarnphm in #567
  • fix: update build dependencies and format chat prompt by @aarnphm in #569
  • chore(examples): update openai client by @aarnphm in #568
  • fix(client): one-shot generation construction by @aarnphm in #570
  • feat: Mistral support by @aarnphm in #571

New Contributors

Full Changelog: v0.3.14...v0.4.0
