xorbitsai/inference v1.7.1


What's new in 1.7.1 (2025-06-27)

These are the changes in inference v1.7.1.

New features

Enhancements

  • ENH: add enable_flash_attn param for loading qwen3 embedding & rerank by @qinxuye in #3640
  • ENH: add more abilities for builtin model families API by @qinxuye in #3658
  • ENH: improve local cluster startup reliability via child-process readiness signaling by @Checkmate544 in #3642
  • ENH: FishSpeech: support PCM output by @codingl2k1 in #3680
  • ENH: Add 4-sample micro-batching to Qwen-3 reranker to reduce GPU memory by @yasu-oh in #3666
  • ENH: Limit default n_parallel for llama.cpp backend by @codingl2k1 in #3712
  • BLD: pin flash-attn & flashinfer-python version and limit sgl-kernel version by @amumu96 in #3669
  • BLD: Update Dockerfile by @XiaoXiaoJiangYun in #3695
  • REF: remove unused code by @qinxuye in #3664
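
The 4-sample micro-batching added to the Qwen-3 reranker (#3666) can be sketched generically: score only a few (query, document) pairs at a time so peak GPU memory is bounded by the micro-batch size rather than the full candidate list. The `micro_batches` and `rerank` helpers and the `score_fn` callable below are hypothetical illustrations of the technique, not Xinference internals.

```python
def micro_batches(items, batch_size=4):
    """Yield successive slices of `items` so that at most `batch_size`
    samples are in flight at once (bounding peak memory use)."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def rerank(score_fn, pairs, batch_size=4):
    """Score all (query, document) pairs in micro-batches and concatenate
    the results; `score_fn` stands in for a model forward pass."""
    scores = []
    for batch in micro_batches(pairs, batch_size):
        scores.extend(score_fn(batch))
    return scores
```

The final ranking is identical to scoring everything in one batch; only the memory/throughput trade-off changes with `batch_size`.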

Bug fixes

  • BUG: fix TTS error: No such file or directory by @robin12jbj in #3625
  • BUG: Fix max_tokens value in Qwen3 Reranker by @yasu-oh in #3665
  • BUG: fix custom embedding by @qinxuye in #3677
  • BUG: [UI] rename the command-line argument from download-hub to download_hub. by @yiboyasss in #3685
  • BUG: fix jina-clip-v2 for text only or image only by @qinxuye in #3690
  • BUG: fix InternVL chat error when using the vLLM engine by @amumu96 in #3722
  • BUG: fix the parsing logic of streaming tool calls by @amumu96 in #3721
  • BUG: fix <think> being wrongly added when chat_template_kwargs is set to {"enable_thinking": False} by @qinxuye in #3718
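
The `chat_template_kwargs` fix in #3718 concerns requests that opt out of Qwen3's "thinking" output. A minimal sketch of such a request payload is below; the key name comes from the bug-fix title, while the model uid and the helper itself are placeholder assumptions, not a documented Xinference API.

```python
import json

def build_chat_request(model_uid, messages, enable_thinking=False):
    """Build an OpenAI-compatible chat payload. With enable_thinking=False,
    the fixed template logic should no longer inject a <think> block."""
    return {
        "model": model_uid,
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

payload = build_chat_request("qwen3", [{"role": "user", "content": "Hello"}])
print(json.dumps(payload, indent=2))
```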

Documentation

New Contributors

Full Changelog: v1.7.0...v1.7.1
