Second hotfix for the "offline mode is enabled" crash on model load. 0.4.4 reverted the inference-path offline guards but kept the same trap on the load path, so users who updated to 0.4.4 kept hitting the exact error the release was supposed to fix (#526). This release removes the load-path guards and patches the transformers tokenizer load to be robust to HuggingFace metadata failures at the source, so the class of bug can't recur.
Reliability
- Load no longer fails with "offline mode is enabled" (#530, fixes #526). transformers 4.57.x added an unconditional `huggingface_hub.model_info()` call inside `AutoTokenizer.from_pretrained` (via `_patch_mistral_regex`) that runs for every non-local repo load, regardless of cache state or whether the target model is actually a Mistral variant. The load-time `HF_HUB_OFFLINE` guard from 0.4.2 turned that into a hard crash for cached online users the moment 0.4.4 removed the inference-path guard that had been masking the problem. The fix wraps `_patch_mistral_regex` so any exception from the HF metadata check is caught and the tokenizer is returned unchanged, matching the success-path behavior for non-Mistral repos. The wrapper installs at `backend.backends` import time, so it covers Qwen Base, Qwen CustomVoice, TADA, and every other transformers-backed engine on Windows, Linux, and CUDA alike. The load-time `force_offline_if_cached` guards were removed: with the wrapper in place they provide zero value and only risk reintroducing the same failure mode.
- No more 30s pause when generating without a network. The HuggingFace metadata timeout called out as a known caveat in 0.4.4 is covered by the same patch; offline users no longer wait for the check to time out before load completes.
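The wrapping pattern described above can be sketched as a fail-open decorator. This is a minimal illustration, not the release's actual code: `_wrap_fail_open` is a hypothetical name, and the stand-in `_patch_mistral_regex` below only simulates the real transformers internal (whose exact module path and signature are not shown here).

```python
import functools

def _wrap_fail_open(patch_fn):
    """Wrap a tokenizer patch helper so any exception (e.g. a failed
    HuggingFace metadata lookup in offline mode) is swallowed and the
    tokenizer is returned unchanged, matching the success path for
    non-Mistral repos. Hypothetical helper name for illustration."""
    @functools.wraps(patch_fn)
    def safe_patch(tokenizer, *args, **kwargs):
        try:
            return patch_fn(tokenizer, *args, **kwargs)
        except Exception:
            # Fail open: the metadata check is advisory, so a network or
            # offline-mode error must not abort the whole model load.
            return tokenizer
    return safe_patch

# Stand-in for the transformers internal; simulates the HF metadata failure.
def _patch_mistral_regex(tokenizer):
    raise OSError("offline mode is enabled")

safe = _wrap_fail_open(_patch_mistral_regex)
tok = object()
print(safe(tok) is tok)  # tokenizer passes through untouched
```

Installing the wrapper once at import time (before any engine constructs a tokenizer) is what lets a single patch cover every transformers-backed engine.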