🍱 We are excited to announce the release of BentoML v1.0.17, which adds support for 🤗 Hugging Face Transformers pre-trained instances. Prior to this release, only pipelines could be saved and loaded through the `bentoml.transformers` APIs. In response to the community's demand to work with pre-trained models, tokenizers, preprocessors, and feature extractors directly, without wrapping them in a pipeline, we have expanded the `bentoml.transformers` APIs. With this release, any pre-trained instance can be saved and then loaded into either a built-in Transformers framework runner or a custom runner. This update opens up new possibilities for working with pre-trained models, and we are thrilled to see what the community will build with this feature. To learn more, visit the BentoML Transformers framework documentation.
- Pre-trained models and instances, such as tokenizers, preprocessors, and feature extractors, can be saved as standalone models using the `bentoml.transformers.save_model` API.

  ```python
  import bentoml

  from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

  processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
  model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
  vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

  bentoml.transformers.save_model("speecht5_tts_processor", processor)
  bentoml.transformers.save_model(
      "speecht5_tts_model",
      model,
      signatures={"generate_speech": {"batchable": False}},
  )
  bentoml.transformers.save_model("speecht5_tts_vocoder", vocoder)
  ```
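  To verify the save, the instances can be loaded back from the local model store with `bentoml.transformers.load_model`. A minimal sketch, assuming the tag names used in the example above:

  ```python
  import bentoml

  # Each call returns the original pre-trained instance, ready for local use.
  processor = bentoml.transformers.load_model("speecht5_tts_processor:latest")
  model = bentoml.transformers.load_model("speecht5_tts_model:latest")
  vocoder = bentoml.transformers.load_model("speecht5_tts_vocoder:latest")
  ```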
- Pre-trained models and instances can be run either independently as Transformers framework runners or jointly in a custom runner. To use them as individual framework runners, simply get the model references and convert them to runners using the `to_runner` method.

  ```python
  import bentoml
  import torch

  from bentoml.io import NumpyNdarray, Text
  from datasets import load_dataset

  processor_runner = bentoml.transformers.get("speecht5_tts_processor").to_runner()
  model_runner = bentoml.transformers.get("speecht5_tts_model").to_runner()
  vocoder_runner = bentoml.transformers.get("speecht5_tts_vocoder").to_runner()

  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
  speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

  svc = bentoml.Service("text2speech", runners=[processor_runner, model_runner, vocoder_runner])

  @svc.api(input=Text(), output=NumpyNdarray())
  def generate_speech(inp: str):
      inputs = processor_runner.run(text=inp, return_tensors="pt")
      speech = model_runner.generate_speech.run(
          input_ids=inputs["input_ids"],
          speaker_embeddings=speaker_embeddings,
          vocoder=vocoder_runner.run,
      )
      return speech.numpy()
  ```
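  For a quick smoke test without starting a server, runners can be initialized in-process with `init_local` (intended for debugging only, not production serving). A minimal sketch, assuming the runners defined above:

  ```python
  # Debugging only: initialize the runners in-process instead of `bentoml serve`.
  for runner in (processor_runner, model_runner, vocoder_runner):
      runner.init_local(quiet=True)

  inputs = processor_runner.run(text="Hello from BentoML!", return_tensors="pt")
  speech = model_runner.generate_speech.run(
      input_ids=inputs["input_ids"],
      speaker_embeddings=speaker_embeddings,
      vocoder=vocoder_runner.run,
  )
  print(speech.numpy().shape)
  ```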
- To use the pre-trained models and instances together in a custom runner, use the `bentoml.models.get` API to retrieve the model references, then load them inside a custom runner with `bentoml.transformers.load_model`. The pre-trained instances can then be used for inference in the custom runner.

  ```python
  import bentoml
  import torch

  from datasets import load_dataset

  processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
  model_ref = bentoml.models.get("speecht5_tts_model:latest")
  vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")

  class SpeechT5Runnable(bentoml.Runnable):
      def __init__(self):
          self.processor = bentoml.transformers.load_model(processor_ref)
          self.model = bentoml.transformers.load_model(model_ref)
          self.vocoder = bentoml.transformers.load_model(vocoder_ref)
          self.embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
          self.speaker_embeddings = torch.tensor(self.embeddings_dataset[7306]["xvector"]).unsqueeze(0)

      @bentoml.Runnable.method(batchable=False)
      def generate_speech(self, inp: str):
          inputs = self.processor(text=inp, return_tensors="pt")
          speech = self.model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
          return speech.numpy()

  text2speech_runner = bentoml.Runner(
      SpeechT5Runnable,
      name="speecht5_runner",
      models=[processor_ref, model_ref, vocoder_ref],
  )
  svc = bentoml.Service("talk_gpt", runners=[text2speech_runner])

  @svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
  async def generate_speech(inp: str):
      return await text2speech_runner.generate_speech.async_run(inp)
  ```
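  Once the service is running (for example, via `bentoml serve service:svc`), it can be called over HTTP. A minimal client sketch, assuming the service above is saved as `service.py` and listens on the default port 3000:

  ```python
  from bentoml.client import Client

  # Connect to the running "talk_gpt" service and call its API endpoint.
  client = Client.from_url("http://localhost:3000")
  waveform = client.generate_speech("BentoML makes serving Transformers models easy.")
  print(waveform.shape)  # NumPy array of generated audio samples
  ```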
What's Changed
- feat(containerize): caching pip/conda installation layers by @smidm in #3673
- docs(batching): update docs to 503 by @sauyon in #3677
- chore(deps): bump ruff from 0.0.255 to 0.0.256 by @dependabot in #3676
- fix(type): annotate PdSeries with pandas-stubs by @aarnphm in #3466
- chore(dispatcher): refactor out training code by @sauyon in #3663
- fix: makes containerize for triton examples to all amd64 by @aarnphm in #3678
- chore(deps): bump coverage[toml] from 7.2.1 to 7.2.2 by @dependabot in #3679
- revert: "chore(dispatcher): refactor out training code (#3663)" by @sauyon in #3680
- doc: add more links to Bentoml/examples by @larme in #3631
- perf: serialization optimization by @larme in #3606
- examples: Kubeflow by @ssheng in #3656
- chore(deps): bump pytest-asyncio from 0.20.3 to 0.21.0 by @dependabot in #3688
- chore(deps): bump ruff from 0.0.256 to 0.0.257 by @dependabot in #3689
- chore(deps): bump imageio from 2.26.0 to 2.26.1 by @dependabot in #3690
- chore(deps): bump yamllint from 1.29.0 to 1.30.0 by @dependabot in #3694
- fix: remove duplicate dependabot check for pip by @aarnphm in #3691
- chore(deps): bump ruff from 0.0.257 to 0.0.258 by @dependabot in #3699
- docs: Update the Kubeflow example by @ssheng in #3703
- chore(deps): bump ruff from 0.0.258 to 0.0.259 by @dependabot in #3709
- docs: add link to pyfilesystem plugins by @sauyon in #3716
- docs: Kubeflow integration documentation by @ssheng in #3704
- docs: replace load_runner() to get().to_runner() by @KimSoungRyoul in #3715
- chore(deps): bump imageio from 2.26.1 to 2.27.0 by @dependabot in #3720
- fix(readme): format markdown table by @aarnphm in #3722
- fix: copy files before running `setup_script` by @aarnphm in #3713
- chore: remove experimental warning for `bentoml.metrics` by @aarnphm in #3725
- ci: temporary disable coverage by @aarnphm in #3726
- chore(deps): bump ruff from 0.0.259 to 0.0.260 by @dependabot in #3734
- chore(deps): bump tritonclient[all] from 2.31.0 to 2.32.0 by @dependabot in #3730
- fix(type): `bentoml.container.build` should accept multiple `image_tag` by @pmayd in #3719
- chore(deps): bump bufbuild/buf-setup-action from 1.15.1 to 1.16.0 by @dependabot in #3738
- feat: add query params to request context by @sauyon in #3717
- chore(dispatcher): use attr class instead of a tuple by @sauyon in #3731
- fix: Make it so the configured max_batch_size is respected when batching inference requests together by @RShang97 in #3741
- feat(transformers): pretrained protocol support by @aarnphm in #3684
- fix(tests): broken CI by @aarnphm in #3742
- chore(deps): bump ruff from 0.0.260 to 0.0.261 by @dependabot in #3744
- docs: Transformers documentation on pre-trained instances support by @ssheng in #3745
New Contributors
- @smidm made their first contribution in #3673
- @pmayd made their first contribution in #3719
- @RShang97 made their first contribution in #3741
Full Changelog: v1.0.16...v1.0.17