BentoML - v1.0.17

🍱 We are excited to announce the release of BentoML v1.0.17, which introduces support for 🤗 Hugging Face Transformers pre-trained instances. Prior to this release, only pipelines could be saved and loaded through the bentoml.transformers APIs. In response to the community's demand to work with pre-trained models, tokenizers, preprocessors, and other instances directly, without wrapping them in pipelines, we have expanded the bentoml.transformers APIs. With this release, any pre-trained instance can be saved and then loaded into either a built-in Transformers framework runner or a custom runner. This update opens up new possibilities for working with pre-trained models, and we are thrilled to see what the community will create with this feature. To learn more, visit the BentoML Transformers framework documentation.
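
For example, a tokenizer can now round-trip through the model store on its own. The snippet below is a minimal sketch, assuming the bert-base-uncased checkpoint; the model name bert_tokenizer is an illustrative placeholder, not part of this release.

    import bentoml
    from transformers import AutoTokenizer

    # Save a standalone tokenizer -- no pipeline wrapper required.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bentoml.transformers.save_model("bert_tokenizer", tokenizer)

    # Load it back from the local BentoML model store by tag.
    loaded_tokenizer = bentoml.transformers.load_model("bert_tokenizer:latest")
    print(loaded_tokenizer("Hello, BentoML!"))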

  • Pre-trained models and instances, such as tokenizers, preprocessors, and feature extractors, can also be saved as standalone models using the bentoml.transformers.save_model API.

    import bentoml
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor
    
    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    
    bentoml.transformers.save_model("speecht5_tts_processor", processor)
    bentoml.transformers.save_model("speecht5_tts_model", model, signatures={"generate_speech": {"batchable": False}})
    bentoml.transformers.save_model("speecht5_tts_vocoder", vocoder)
  • Pre-trained models and instances can be run either independently as Transformers framework runners or jointly in a custom runner. To use pre-trained models and instances as individual framework runners, simply get the model references and convert them to runners using the to_runner method.

    import bentoml
    import torch
    
    from bentoml.io import Text, NumpyNdarray
    from datasets import load_dataset
    
    processor_runner = bentoml.transformers.get("speecht5_tts_processor").to_runner()
    model_runner = bentoml.transformers.get("speecht5_tts_model").to_runner()
    vocoder_runner = bentoml.transformers.get("speecht5_tts_vocoder").to_runner()
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
    
    svc = bentoml.Service("text2speech", runners=[proccessor_runner, model_runner, vocoder_runner])
    
    @svc.api(input=Text(), output=NumpyNdarray())
    def generate_speech(inp: str):
        inputs = processor_runner.run(text=inp, return_tensors="pt")
        # generate_speech accepts a vocoder callable; the vocoder runner's run method fills that role.
        speech = model_runner.generate_speech.run(input_ids=inputs["input_ids"], speaker_embeddings=speaker_embeddings, vocoder=vocoder_runner.run)
        return speech.numpy()
  • To use the pre-trained models and instances together in a custom runner, use the bentoml.models.get API to get the model references and load them inside the runner. The pre-trained instances can then be used for inference within the custom runner; a sketch of serving and querying the finished service follows the example below.

    import bentoml
    import torch
    
    from datasets import load_dataset
    
    processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
    model_ref = bentoml.models.get("speecht5_tts_model:latest")
    vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")
    
    class SpeechT5Runnable(bentoml.Runnable):
    
        def __init__(self):
            # Load the saved pre-trained instances from the BentoML model store.
            self.processor = bentoml.transformers.load_model(processor_ref)
            self.model = bentoml.transformers.load_model(model_ref)
            self.vocoder = bentoml.transformers.load_model(vocoder_ref)
            # A fixed speaker embedding (x-vector) determines the synthesized voice.
            self.embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
            self.speaker_embeddings = torch.tensor(self.embeddings_dataset[7306]["xvector"]).unsqueeze(0)
    
        @bentoml.Runnable.method(batchable=False)
        def generate_speech(self, inp: str):
            inputs = self.processor(text=inp, return_tensors="pt")
            speech = self.model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
            return speech.numpy()
    
    text2speech_runner = bentoml.Runner(SpeechT5Runnable, name="speecht5_runner", models=[processor_ref, model_ref, vocoder_ref])
    svc = bentoml.Service("talk_gpt", runners=[text2speech_runner])
    
    @svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
    async def generate_speech(inp: str):
        return await text2speech_runner.generate_speech.async_run(inp)
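
To exercise either service end to end, serve it and post plain text to the endpoint BentoML generates from the API function name. The sketch below is illustrative, not part of this release: it assumes the service above lives in a file named service.py, that the server listens on the default port 3000, and that SpeechT5 produces 16 kHz audio; soundfile is an optional dependency used here only to write the waveform to disk.

    # Start the server first (in a shell): bentoml serve service:svc
    import numpy as np
    import requests
    import soundfile as sf

    # POST plain text to the endpoint generated from the `generate_speech` API.
    response = requests.post(
        "http://localhost:3000/generate_speech",
        headers={"Content-Type": "text/plain"},
        data="Hello from BentoML!",
    )

    # The NumpyNdarray output is returned as a JSON array; decode it and save it as audio.
    waveform = np.array(response.json(), dtype=np.float32)
    sf.write("speech.wav", waveform, samplerate=16000)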

What's Changed

  • feat(containerize): caching pip/conda installation layers by @smidm in #3673
  • docs(batching): update docs to 503 by @sauyon in #3677
  • chore(deps): bump ruff from 0.0.255 to 0.0.256 by @dependabot in #3676
  • fix(type): annotate PdSeries with pandas-stubs by @aarnphm in #3466
  • chore(dispatcher): refactor out training code by @sauyon in #3663
  • fix: makes containerize for triton examples to all amd64 by @aarnphm in #3678
  • chore(deps): bump coverage[toml] from 7.2.1 to 7.2.2 by @dependabot in #3679
  • revert: "chore(dispatcher): refactor out training code (#3663)" by @sauyon in #3680
  • doc: add more links to Bentoml/examples by @larme in #3631
  • perf: serialization optimization by @larme in #3606
  • examples: Kubeflow by @ssheng in #3656
  • chore(deps): bump pytest-asyncio from 0.20.3 to 0.21.0 by @dependabot in #3688
  • chore(deps): bump ruff from 0.0.256 to 0.0.257 by @dependabot in #3689
  • chore(deps): bump imageio from 2.26.0 to 2.26.1 by @dependabot in #3690
  • chore(deps): bump yamllint from 1.29.0 to 1.30.0 by @dependabot in #3694
  • fix: remove duplicate dependabot check for pip by @aarnphm in #3691
  • chore(deps): bump ruff from 0.0.257 to 0.0.258 by @dependabot in #3699
  • docs: Update the Kubeflow example by @ssheng in #3703
  • chore(deps): bump ruff from 0.0.258 to 0.0.259 by @dependabot in #3709
  • docs: add link to pyfilesystem plugins by @sauyon in #3716
  • docs: Kubeflow integration documentation by @ssheng in #3704
  • docs: replace load_runner() to get().to_runner() by @KimSoungRyoul in #3715
  • chore(deps): bump imageio from 2.26.1 to 2.27.0 by @dependabot in #3720
  • fix(readme): format markdown table by @aarnphm in #3722
  • fix: copy files before running setup_script by @aarnphm in #3713
  • chore: remove experimental warning for bentoml.metrics by @aarnphm in #3725
  • ci: temporary disable coverage by @aarnphm in #3726
  • chore(deps): bump ruff from 0.0.259 to 0.0.260 by @dependabot in #3734
  • chore(deps): bump tritonclient[all] from 2.31.0 to 2.32.0 by @dependabot in #3730
  • fix(type): bentoml.container.build should accept multiple image_tag by @pmayd in #3719
  • chore(deps): bump bufbuild/buf-setup-action from 1.15.1 to 1.16.0 by @dependabot in #3738
  • feat: add query params to request context by @sauyon in #3717
  • chore(dispatcher): use attr class instead of a tuple by @sauyon in #3731
  • fix: Make it so the configured max_batch_size is respected when batching inference requests together by @RShang97 in #3741
  • feat(transformers): pretrained protocol support by @aarnphm in #3684
  • fix(tests): broken CI by @aarnphm in #3742
  • chore(deps): bump ruff from 0.0.260 to 0.0.261 by @dependabot in #3744
  • docs: Transformers documentation on pre-trained instances support by @ssheng in #3745

New Contributors

Full Changelog: v1.0.16...v1.0.17
