🍱 To better support LLM serving through response streaming, we are proud to introduce experimental server-sent events (SSE) streaming support in this release of BentoML v1.1.4 and OpenLLM v0.2.27. See an example service definition for SSE streaming with Llama2.
- Added response streaming through SSE to the `bentoml.io.Text` IO descriptor type.
- Added async generator support to both API Server and Runner to `yield` incremental text responses (see the sketch after this list).
- Added native SSE streaming support to ☁️ BentoCloud.
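The following is a minimal, hypothetical sketch of such a streaming service definition, assuming BentoML v1.1.4: the service name and the word-splitting loop are illustrative stand-ins for a real model runner. The key point is that an async generator API function streams each yielded chunk to the client over SSE.

```python
import asyncio

import bentoml
from bentoml.io import Text

svc = bentoml.Service("sse_stream_demo")  # hypothetical service name


@svc.api(input=Text(), output=Text())
async def generate(prompt: str):
    # Yielding from an async generator API function sends each chunk
    # to the client as a server-sent event instead of buffering the
    # full response body.
    for word in prompt.split():  # stand-in for incremental token generation
        yield word + " "
        await asyncio.sleep(0.05)
```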
🦾 OpenLLM added token streaming capabilities to support streaming responses from LLMs.
- Added `/v1/generate_stream` endpoint for streaming responses from LLMs.

```bash
curl -N -X 'POST' 'http://0.0.0.0:3000/v1/generate_stream' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "### Instruction:\n What is the definition of time (200 words essay)?\n\n### Response:",
  "llm_config": {
    "use_llama2_prompt": false,
    "max_new_tokens": 4096,
    "early_stopping": false,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_cache": true,
    "temperature": 0.89,
    "top_k": 50,
    "top_p": 0.76,
    "typical_p": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "diversity_penalty": 0,
    "repetition_penalty": 1,
    "encoder_repetition_penalty": 1,
    "length_penalty": 1,
    "no_repeat_ngram_size": 0,
    "renormalize_logits": false,
    "remove_invalid_values": false,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "encoder_no_repeat_ngram_size": 0,
    "n": 1,
    "best_of": 1,
    "presence_penalty": 0.5,
    "frequency_penalty": 0,
    "use_beam_search": false,
    "ignore_eos": false
  },
  "adapter_name": null
}'
```
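To consume the stream programmatically, a client can read the response incrementally instead of waiting for the full body. Below is a minimal client sketch using `requests`; the prompt and the trimmed-down `llm_config` are illustrative, assuming omitted fields fall back to the server's defaults.

```python
import requests

# Hypothetical payload; the `llm_config` keys mirror the curl example above.
payload = {
    "prompt": "### Instruction:\nWhat is the definition of time?\n\n### Response:",
    "llm_config": {"max_new_tokens": 256, "temperature": 0.89},
    "adapter_name": None,
}

# stream=True keeps the connection open so chunks can be printed as
# the server produces them, mirroring curl's -N (no-buffer) flag.
with requests.post(
    "http://0.0.0.0:3000/v1/generate_stream",
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```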
What's Changed
- docs: Update the models doc by @Sherlock113 in #4145
- docs: Add more workflows to the GitHub Actions doc by @Sherlock113 in #4146
- docs: Add text embedding example to readme by @Sherlock113 in #4151
- fix: bento build cache miss by @xianml in #4153
- fix(buildx): parsing attestation on docker desktop by @aarnphm in #4155
Full Changelog: v1.1.3...v1.1.4