xorbitsai/inference v1.7.1


What's new in 1.7.1 (2025-06-27)

These are the changes in inference v1.7.1.

New features

Enhancements

  • ENH: add enable_flash_attn param for loading qwen3 embedding & rerank by @qinxuye in #3640
  • ENH: add more abilities for builtin model families API by @qinxuye in #3658
  • ENH: improve local cluster startup reliability via child-process readiness signaling by @Checkmate544 in #3642
  • ENH: FishSpeech: support PCM output by @codingl2k1 in #3680
  • ENH: Add 4-sample micro-batching to Qwen-3 reranker to reduce GPU memory by @yasu-oh in #3666
  • ENH: Limit default n_parallel for llama.cpp backend by @codingl2k1 in #3712
  • BLD: pin flash-attn & flashinfer-python version and limit sgl-kernel version by @amumu96 in #3669
  • BLD: Update Dockerfile by @XiaoXiaoJiangYun in #3695
  • REF: remove unused code by @qinxuye in #3664
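
The 4-sample micro-batching added to the Qwen-3 reranker (#3666) can be sketched generically: score only a few (query, document) pairs at a time so peak GPU memory is bounded by the micro-batch size rather than the full candidate list. The `micro_batches` and `rerank` helpers and the `score_fn` callable below are hypothetical illustrations of the technique, not Xinference internals.

```python
def micro_batches(items, batch_size=4):
    """Yield successive slices of `items` so that at most `batch_size`
    samples are in flight at once (bounding peak memory use)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def rerank(score_fn, pairs, batch_size=4):
    """Score all (query, document) pairs in micro-batches and concatenate
    the results; `score_fn` stands in for a model forward pass."""
    scores = []
    for batch in micro_batches(pairs, batch_size):
        scores.extend(score_fn(batch))
    return scores
```

The final ranking is identical to scoring everything in one batch; only the memory/throughput trade-off changes with `batch_size`.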

Bug fixes

  • BUG: fix TTS error: No such file or directory by @robin12jbj in #3625
  • BUG: Fix max_tokens value in Qwen3 Reranker by @yasu-oh in #3665
  • BUG: fix custom embedding by @qinxuye in #3677
  • BUG: [UI] rename the command-line argument from download-hub to download_hub. by @yiboyasss in #3685
  • BUG: fix jina-clip-v2 for text only or image only by @qinxuye in #3690
  • BUG: fix InternVL chat error when using the vLLM engine by @amumu96 in #3722
  • BUG: fix the parsing logic of streaming tool calls by @amumu96 in #3721
  • BUG: fix <think> being wrongly added when chat_template_kwargs is set to {"enable_thinking": False} by @qinxuye in #3718
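
The `chat_template_kwargs` fix in #3718 concerns requests that opt out of Qwen3's "thinking" output. A minimal sketch of such a request payload is below; the key name comes from the bug-fix title, while the model uid and the helper itself are placeholder assumptions, not a documented Xinference API.

```python
import json

def build_chat_request(model_uid, messages, enable_thinking=False):
    """Build an OpenAI-compatible chat payload. With enable_thinking=False,
    the fixed template logic should no longer inject a <think> block."""
    return {
        "model": model_uid,
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

payload = build_chat_request("qwen3", [{"role": "user", "content": "Hello"}])
print(json.dumps(payload, indent=2))
```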

Documentation

New Contributors

Full Changelog: v1.7.0...v1.7.1
