🍱 We are excited to announce the release of BentoML v1.0.17, which adds support for 🤗 Hugging Face Transformers pre-trained instances. Prior to this release, only pipelines could be saved and loaded through the `bentoml.transformers` APIs. In response to the community's demand to work with pre-trained models, tokenizers, preprocessors, and feature extractors directly, without wrapping them in a pipeline, we have expanded the `bentoml.transformers` APIs. With this release, any pre-trained instance can be saved and then loaded into either a built-in Transformers framework runner or a custom runner. This update opens up new possibilities for working with pre-trained models, and we are thrilled to see what the community will build with this feature. To learn more, visit the BentoML Transformers framework documentation.
- Pre-trained models and instances, such as tokenizers, preprocessors, and feature extractors, can be saved as standalone models using the `bentoml.transformers.save_model` API.

  ```python
  import bentoml

  from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

  processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
  model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
  vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

  bentoml.transformers.save_model("speecht5_tts_processor", processor)
  bentoml.transformers.save_model(
      "speecht5_tts_model",
      model,
      signatures={"generate_speech": {"batchable": False}},
  )
  bentoml.transformers.save_model("speecht5_tts_vocoder", vocoder)
  ```
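  To verify the save, the instances can be loaded back from the local model store with `bentoml.transformers.load_model`. A minimal sketch, assuming the tag names used in the example above:

  ```python
  import bentoml

  # Each call returns the original pre-trained instance, ready for local use.
  processor = bentoml.transformers.load_model("speecht5_tts_processor:latest")
  model = bentoml.transformers.load_model("speecht5_tts_model:latest")
  vocoder = bentoml.transformers.load_model("speecht5_tts_vocoder:latest")
  ```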
- Pre-trained models and instances can be run either independently as Transformers framework runners or jointly in a custom runner. To use them as individual framework runners, simply get the model references and convert them to runners using the `to_runner` method.

  ```python
  import bentoml
  import torch

  from bentoml.io import NumpyNdarray, Text
  from datasets import load_dataset

  processor_runner = bentoml.transformers.get("speecht5_tts_processor").to_runner()
  model_runner = bentoml.transformers.get("speecht5_tts_model").to_runner()
  vocoder_runner = bentoml.transformers.get("speecht5_tts_vocoder").to_runner()

  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
  speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

  svc = bentoml.Service("text2speech", runners=[processor_runner, model_runner, vocoder_runner])

  @svc.api(input=Text(), output=NumpyNdarray())
  def generate_speech(inp: str):
      inputs = processor_runner.run(text=inp, return_tensors="pt")
      speech = model_runner.generate_speech.run(
          input_ids=inputs["input_ids"],
          speaker_embeddings=speaker_embeddings,
          vocoder=vocoder_runner.run,
      )
      return speech.numpy()
  ```
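  For a quick smoke test without starting a server, runners can be initialized in-process with `init_local` (intended for debugging only, not production serving). A minimal sketch, assuming the runners defined above:

  ```python
  # Debugging only: initialize the runners in-process instead of `bentoml serve`.
  for runner in (processor_runner, model_runner, vocoder_runner):
      runner.init_local(quiet=True)

  inputs = processor_runner.run(text="Hello from BentoML!", return_tensors="pt")
  speech = model_runner.generate_speech.run(
      input_ids=inputs["input_ids"],
      speaker_embeddings=speaker_embeddings,
      vocoder=vocoder_runner.run,
  )
  print(speech.numpy().shape)
  ```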
- To use the pre-trained models and instances together in a custom runner, use the `bentoml.models.get` API to retrieve the model references, then load them inside a custom runner with `bentoml.transformers.load_model`. The pre-trained instances can then be used for inference in the custom runner.

  ```python
  import bentoml
  import torch

  from datasets import load_dataset

  processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
  model_ref = bentoml.models.get("speecht5_tts_model:latest")
  vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")

  class SpeechT5Runnable(bentoml.Runnable):
      def __init__(self):
          self.processor = bentoml.transformers.load_model(processor_ref)
          self.model = bentoml.transformers.load_model(model_ref)
          self.vocoder = bentoml.transformers.load_model(vocoder_ref)
          self.embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
          self.speaker_embeddings = torch.tensor(self.embeddings_dataset[7306]["xvector"]).unsqueeze(0)

      @bentoml.Runnable.method(batchable=False)
      def generate_speech(self, inp: str):
          inputs = self.processor(text=inp, return_tensors="pt")
          speech = self.model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
          return speech.numpy()

  text2speech_runner = bentoml.Runner(
      SpeechT5Runnable,
      name="speecht5_runner",
      models=[processor_ref, model_ref, vocoder_ref],
  )
  svc = bentoml.Service("talk_gpt", runners=[text2speech_runner])

  @svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
  async def generate_speech(inp: str):
      return await text2speech_runner.generate_speech.async_run(inp)
  ```
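  Once the service is running (for example, via `bentoml serve service:svc`), it can be called over HTTP. A minimal client sketch, assuming the service above is saved as `service.py` and listens on the default port 3000:

  ```python
  from bentoml.client import Client

  # Connect to the running "talk_gpt" service and call its API endpoint.
  client = Client.from_url("http://localhost:3000")
  waveform = client.generate_speech("BentoML makes serving Transformers models easy.")
  print(waveform.shape)  # NumPy array of generated audio samples
  ```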
What's Changed
- feat(containerize): caching pip/conda installation layers by @smidm in #3673
- docs(batching): update docs to 503 by @sauyon in #3677
- chore(deps): bump ruff from 0.0.255 to 0.0.256 by @dependabot in #3676
- fix(type): annotate PdSeries with pandas-stubs by @aarnphm in #3466
- chore(dispatcher): refactor out training code by @sauyon in #3663
- fix: makes containerize for triton examples to all amd64 by @aarnphm in #3678
- chore(deps): bump coverage[toml] from 7.2.1 to 7.2.2 by @dependabot in #3679
- revert: "chore(dispatcher): refactor out training code (#3663)" by @sauyon in #3680
- doc: add more links to Bentoml/examples by @larme in #3631
- perf: serialization optimization by @larme in #3606
- examples: Kubeflow by @ssheng in #3656
- chore(deps): bump pytest-asyncio from 0.20.3 to 0.21.0 by @dependabot in #3688
- chore(deps): bump ruff from 0.0.256 to 0.0.257 by @dependabot in #3689
- chore(deps): bump imageio from 2.26.0 to 2.26.1 by @dependabot in #3690
- chore(deps): bump yamllint from 1.29.0 to 1.30.0 by @dependabot in #3694
- fix: remove duplicate dependabot check for pip by @aarnphm in #3691
- chore(deps): bump ruff from 0.0.257 to 0.0.258 by @dependabot in #3699
- docs: Update the Kubeflow example by @ssheng in #3703
- chore(deps): bump ruff from 0.0.258 to 0.0.259 by @dependabot in #3709
- docs: add link to pyfilesystem plugins by @sauyon in #3716
- docs: Kubeflow integration documentation by @ssheng in #3704
- docs: replace load_runner() to get().to_runner() by @KimSoungRyoul in #3715
- chore(deps): bump imageio from 2.26.1 to 2.27.0 by @dependabot in #3720
- fix(readme): format markdown table by @aarnphm in #3722
- fix: copy files before running `setup_script` by @aarnphm in #3713
- chore: remove experimental warning for `bentoml.metrics` by @aarnphm in #3725
- ci: temporary disable coverage by @aarnphm in #3726
- chore(deps): bump ruff from 0.0.259 to 0.0.260 by @dependabot in #3734
- chore(deps): bump tritonclient[all] from 2.31.0 to 2.32.0 by @dependabot in #3730
- fix(type): `bentoml.container.build` should accept multiple `image_tag` by @pmayd in #3719
- chore(deps): bump bufbuild/buf-setup-action from 1.15.1 to 1.16.0 by @dependabot in #3738
- feat: add query params to request context by @sauyon in #3717
- chore(dispatcher): use attr class instead of a tuple by @sauyon in #3731
- fix: Make it so the configured max_batch_size is respected when batching inference requests together by @RShang97 in #3741
- feat(transformers): pretrained protocol support by @aarnphm in #3684
- fix(tests): broken CI by @aarnphm in #3742
- chore(deps): bump ruff from 0.0.260 to 0.0.261 by @dependabot in #3744
- docs: Transformers documentation on pre-trained instances support by @ssheng in #3745
New Contributors
- @smidm made their first contribution in #3673
- @pmayd made their first contribution in #3719
- @RShang97 made their first contribution in #3741
Full Changelog: v1.0.16...v1.0.17