New models
ModernBERT
The ModernBert model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
- Rotary Positional Embeddings to support sequences of up to 8192 tokens.
- Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
- GeGLU layers replacing the original MLP layers, shown to improve performance.
- Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
- Flash Attention to speed up processing.
- A model designed following the recommendations of The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
- Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
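A minimal fill-mask sketch, assuming the answerdotai/ModernBERT-base checkpoint id:

from transformers import pipeline
# Masked language modeling with ModernBERT; checkpoint id assumed from the release.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill_mask("The capital of France is [MASK]."))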
- Add ModernBERT to Transformers by @warner-benjamin in #35158
Aria
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
- Add Aria by @aymeric-roucher in #34157
TimmWrapper
We add a TimmWrapper set of classes such that timm models can be loaded as transformers models into the library.
Here's a general usage example:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor
checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)
with torch.no_grad():
    logits = model(**inputs).logits
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc.:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import pipeline
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
- Add TimmWrapper by @qubvel and @amyeroberts in #34564
Pixtral-Large
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
- Update Pixtral conversion script to support large format! by @ArthurZucker in #34801
ColPali
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
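A minimal retrieval sketch, assuming the vidore/colpali-v1.2-hf checkpoint id and the ColPaliForRetrieval / ColPaliProcessor classes added in the PR below:

import torch
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor
model_id = "vidore/colpali-v1.2-hf"  # assumed repo id
model = ColPaliForRetrieval.from_pretrained(model_id).eval()
processor = ColPaliProcessor.from_pretrained(model_id)
# Blank images stand in for real document screenshots
images = [Image.new("RGB", (448, 448), "white"), Image.new("RGB", (448, 448), "black")]
queries = ["What is the total revenue for 2023?", "Which figure shows the model architecture?"]
batch_images = processor(images=images, return_tensors="pt")
batch_queries = processor(text=queries, return_tensors="pt")
with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings
# Late-interaction (MaxSim) scores between every query and every document image
scores = processor.score_retrieval(query_embeddings, image_embeddings)
print(scores)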
- Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736
Falcon3
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:
- One pre-training: they conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
- Depth up-scaling for improved reasoning: building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
- Knowledge distillation for better tiny models: to provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
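A minimal text-generation sketch (the tiiuae/Falcon3-7B-Base repo id is assumed from the release naming):

import torch
from transformers import pipeline
# Plain text generation with a Falcon3 base model; repo id assumed from the release naming.
generator = pipeline("text-generation", model="tiiuae/Falcon3-7B-Base", torch_dtype=torch.bfloat16, device_map="auto")
print(generator("The three laws of thermodynamics state that", max_new_tokens=40))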
- Add Falcon3 documentation by @mokeddembillel in #35307
Bamba
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
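A minimal generation sketch (the ibm-fms/Bamba-9B repo id is assumed; see the checkpoint collection for the exact names):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Bamba is a standard decoder-only causal LM; the repo id below is assumed.
model_id = "ibm-fms/Bamba-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tokenizer("Mamba-2 differs from attention in that", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))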
- Add the Bamba Model by @fabianlim in #34982
VitPose
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.
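A rough sketch of the top-down flow, assuming the usyd-community/vitpose-base-simple checkpoint id and using the full image as a stand-in person box (in practice, boxes in COCO x, y, width, height format come from a detector):

import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoImageProcessor, VitPoseForPoseEstimation
checkpoint = "usyd-community/vitpose-base-simple"  # assumed repo id
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = VitPoseForPoseEstimation.from_pretrained(checkpoint)
img = Image.open(urlopen("http://images.cocodataset.org/val2017/000000000139.jpg"))
# One person box per image, in COCO (x, y, width, height) format; here the whole image as a placeholder.
person_boxes = [[0.0, 0.0, float(img.width), float(img.height)]]
inputs = image_processor(img, boxes=[person_boxes], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
print(pose_results[0])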
- Add VitPose by @SangbumChoi and @NielsRogge in #30530
DINOv2 with registers
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps, due to the model using some image patches as "registers". The authors propose a fix: add some new tokens (called "register" tokens), which are only used during pre-training and thrown away afterwards. This results in:
- no artifacts
- interpretable attention maps
- and improved performance.
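A minimal feature-extraction sketch, assuming the facebook/dinov2-with-registers-small checkpoint id:

import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
checkpoint = "facebook/dinov2-with-registers-small"  # assumed repo id
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
inputs = processor(img, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Register tokens are discarded for downstream use; patch features come out as usual.
print(outputs.last_hidden_state.shape)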
- Add DINOv2 with registers by @NielsRogge in #35348
Emu3
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on the VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.
Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.
- Add Emu3 by @zucchini-nlp in #33770
Cohere2
A new Cohere update was added through a new "Cohere2" set of classes.
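The new classes plug into the usual auto classes; a minimal generation sketch, assuming the CohereForAI/c4ai-command-r7b-12-2024 checkpoint id:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Cohere2 loads through the standard auto classes; the repo id below is assumed.
model_id = "CohereForAI/c4ai-command-r7b-12-2024"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
messages = [{"role": "user", "content": "Say hello in one short sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))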
- Add Cohere2 model by @alexrs-cohere in #35224
TextNet
TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.
- Add TextNet by @jadechoghari in #34979
DiffLlama
DiffLlama combines the Llama architecture with the attention mechanism from Differential Transformer.
- Add DiffLlama by @weak-kajuma in #34083
PixtralLarge
The conversion script needed a few updates, while the modeling code was barely changed!
- [PixtralLarge] Update Pixtral conversion script to support large format! (#34801)
Moonshine
Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
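A minimal transcription sketch, assuming the UsefulSensors/moonshine-tiny checkpoint id:

from transformers import pipeline
# Automatic speech recognition with Moonshine; repo id assumed, any audio file, URL, or array works.
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")
print(asr("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"))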
Quantization methods
VPTQ Quantization
From the VPTQ contributors:
VPTQ is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy. More details here: https://github.com/microsoft/vptq
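A minimal loading sketch for a pre-quantized VPTQ checkpoint (the repo id below is illustrative; see the VPTQ-community collections on the Hub):

# pip install vptq
from transformers import AutoModelForCausalLM, AutoTokenizer
# Loading a pre-quantized VPTQ checkpoint; the repo id below is illustrative.
model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-65536-woft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")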
HIGGS Quantization
From the contributors:
HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.
Runtime support for HIGGS is implemented through the FLUTE library.
This PR adds support for HIGGS+FLUTE into transformers allowing for low-error 0-shot quantization and fast LLM inference.
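A minimal sketch of on-the-fly HIGGS quantization, assuming the HiggsConfig quantization config added alongside this support (model id and bit-width are illustrative):

# pip install flute-kernel
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
# 0-shot HIGGS quantization applied at load time; model id and bit-width are illustrative.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=HiggsConfig(bits=4), device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)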
- HIGGS Quantization Support by @BlackSamorez in #34997
Cleanup
We merged a cleanup for vision language models to make sure all models are standardized.
- VLMs: major clean up 🧼 (#34502)
Breaking changes
Conversion scripts
Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.
In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.
However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
- 🚨🚨🚨 Delete conversion scripts when making release wheels by @Rocketknight1 in #35296
Backtracking in Nougat
A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results, but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model, to ensure your code is performance-optimized.
Whisper decoding
This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the whisper codebase, but it does result in the following breaking changes:
➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).
In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin's generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
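As an illustration of that case, a minimal sketch using openai/whisper-tiny and one second of silence as stand-in audio:

import numpy as np
from transformers import AutoProcessor, WhisperForConditionalGeneration
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
# One second of silence as a stand-in for real 16 kHz audio.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# With return_dict_in_generate=True and return_timestamps=False, a single generate call is made
# and a ModelOutput is returned, including the decoder input IDs and the EOS token ID.
out = model.generate(**inputs, return_dict_in_generate=True, return_timestamps=False)
print(type(out))
print(out.sequences)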
Attention refactor
In order to have cleaner, isolated, future-proof code for the attention layers, they have been refactored so that model-specific attention code stays within each model's file, while attention definitions relating to SDPA, Flash Attention, and other attention types have been moved to a common file.
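Selecting an attention backend continues to work as before; a minimal reminder sketch:

from transformers import AutoModelForCausalLM
# "sdpa", "eager", and "flash_attention_2" (if installed) remain valid backends after the refactor.
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="sdpa")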
- 🚨All attention refactor🚨 by @ArthurZucker in #35235
Bugfixes and improvements
- [tokenizers] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer (#35593)
- Setup loss_type in config at model init time (#34616)
- [docs] Update Python version in translations by @jla524 in #35096
- [docs] top_p, top_k, temperature docstrings by @stevhliu in #35065
- Fix private forked repo. CI by @ydshieh in #35114
- Add feature dim attributes to BitLinear for easier PEFT integration by @agostinv in #34946
- Update I-JEPA checkpoints path by @qubvel in #35120
- Fix GA loss bugs and add unit test by @techkang in #35121
- [I-JEPA] Update docs by @NielsRogge in #35148
- Corrected typo in agent system prompts by @Uvi-12 in #35143
- Option to set 'non_blocking' for to(device) in BatchEncoding and BatchFeature by @daniel-bogdoll in #34883
- Fix typo in EETQ Tests by @MekkCyber in #35160
- Cleanup: continue the init refactor by @LysandreJik in #35167
- Super tiny fix logging message by @fzyzcjy in #35132
- Fixed typo of 'avilable' in prompts.py by @Uvi-12 in #35145
- [CI] Fix bnb quantization tests with accelerate>=1.2.0 by @matthewdouglas in #35172
- Fix num_items_in_batch not being an integer by @xspirus in #35115
- Assisted decoding multi-gpu by @zucchini-nlp in #35116
- Fix file path for shard_num 1 with mllama converter by @strangiato in #35053
- Support BatchNorm in Hubert pos_conv_emb as in fairseq by @gallilmaimon in #34389
- Remove unnecessary masked_fill in deberta models by @xadupre in #35182
- Fix DBRX LayerNorm init method by @hgt312 in #35177
- Fixing GGUF support for StableLm by @MekkCyber in #35060
- [i18n-ar] Translated file : docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
- Multiple typo fixes in NLP, Audio docs by @henryhmko in #35181
- Only import torch.distributed if it is available by @GaetanLepage in #35133
- [i18n-] Translating Benchmarks.md to Chinese by @asdkfjsd in #35137
- [docs] Fix FlashAttention link by @stevhliu in #35171
- Update data collator docstrings to accurately reference Nvidia tensor core compute capability version by @johngrahamreynolds in #35188
- [i18n-] Translating agents.md to Chinese by @HMJ0628 in #35139
- BLIP: enable device map by @zucchini-nlp in #34850
- 🧹 Remove deprecated RotaryEmbedding parts in the Attention layers by @Cyrilvallez in #34858
- [PEFT] Better Trainer error when prompt learning with loading best model at the end by @BenjaminBossan in #35087
- Cleanup: continue the init refactor by @LysandreJik in #35170
- Fix CI by @Cyrilvallez in #35208
- Fix seamless TTS generate by @ylacombe in #34968
- docs: clarify initializer_range parameter description in Idefics3VisionConfig by @h3110Fr13nd in #35215
- Fixed typo of 'indentifier' in audio_utils.py by @Uvi-12 in #35226
- Fix type hints for apply_chat_template by @Rocketknight1 in #35216
- Support Python 3.10+ Union style in chat template type hints parsing by @RezaRahemtola in #35103
- Refactoring AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt in #35009
- Change back to Thread for SF conversion by @ydshieh in #35236
- [Init refactor] Modular changes by @LysandreJik in #35240
- Fix typo in chat template example by @EricWinsorDSIT in #35250
- Run model as compressed/uncompressed mode by @horheynm in #34719
- skip Fuyu from test_generate by @nhamanasu in #35246
- [tests] fix "Tester object has no attribute '_testMethodName'" by @faaany in #34910
- Use rsfE with pytest by @ydshieh in #35119
- Update AMD docker image (rocm 6.1) by @ivarflakstad in #35259
- Fixed typos in Audio Classification Documentation by @Uvi-12 in #35263
- Translating agents_advanced.md to Chinese by @HMJ0628 in #35231
- Fix FSDP no longer working by @muellerzr in #35212
- don't use no_sync when deepspeed doesn't support it for certain zero stages by @winglian in #35157
- [i18n-Chinese] Translating perf_train_cpu.md to Chinese by @asdkfjsd in #35242
- Fall back to slow image processor in ImageProcessingAuto when no fast processor available by @yonigozlan in #34785
- Aggeregate test summary files in CircleCI workflow runs by @ydshieh in #34989
- Blip: fix offloading and MP tests by @zucchini-nlp in #35239
- Fix : model used to test ggml conversion of Falcon-7b is incorrect by @MekkCyber in #35083
- Temporarily disable amd push ci by @ivarflakstad in #35293
- Delete redundancy for loop checks. by @zhanluxianshen in #35288
- [Whisper] patch float type on mps by @eustlb in #35295
- Fix typos in Translated Audio Classification Docs by @jla524 in #35287
- Translating "translate perf_infer_gpu_multi.md" to Chinese by @HMJ0628 in #35271
- Fix wrongs in quicktour[zh] by @zhanluxianshen in #35272
- Improved documentation of Automatic speech recognition by @Uvi-12 in #35268
- fix modular order by @ArthurZucker in #35297
- Add sdpa for Beit by @OmarManzoor in #34941
- Support for SDPA for SAM models by @MagnusS0 in #34110
- remove benchmark job in push-important-models.yml by @ydshieh in #35292
- Fix typos in translated quicktour docs by @jla524 in #35302
- Fix image preview in multi-GPU inference docs by @jla524 in #35303
- Fix remove unused parameter in docs by @zzzzzsa in #35306
- Add Cohere2 docs details by @alexrs-cohere in #35294
- Fixed typo in audio_classification.md by @Uvi-12 in #35305
- [docs] Improve register_pipeline by @stevhliu in #35300
- Fix loading with only state dict and low_cpu_mem_usage = True by @SunMarc in #35217
- [tests] make cuda-only tests device-agnostic by @faaany in #35222
- Trigger GitHub CI with a comment on PR by @ydshieh in #35211
- change bnb tests by @jiqing-feng in #34713
- [Whisper] fix docstrings typo by @eustlb in #35319
- feat: add benchmarks_entrypoint.py by @McPatate in #34495
- Fix documentation for ColPali by @tonywu71 in #35321
- Update comment CI bot by @ydshieh in #35323
- PaliGemma: Make sure to add to suffix if is present in text by @probicheaux in #35201
- Fix some fa2 tests by @ArthurZucker in #35340
- Modernbert Release Fixes by @warner-benjamin in #35344
- [docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
- fix onnx export of speech foundation models by @nikosanto13 in #34224
- [Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
- Reduce CircleCI usage by @ydshieh in #35355
- Implement AsyncTextIteratorStreamer for asynchronous streaming by @CISC in #34931
- Cleaner attention interfaces by @Cyrilvallez in #35342
- Add Tensor Parallel support for Qwen2VL by @jla524 in #35050
- fix zoedepth initialization error under deepspeed zero3 by @Tavish9 in #35011
- Aurevoir PyTorch 1 by @ydshieh in #35358
- bugfix: torch.export failure caused by _make_causal_mask by @jiwoong-choi in #35291
- update codecarbon by @nhamanasu in #35243
- Update test fetcher when we want to test all by @ArthurZucker in #35364
- Use weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
- Make test_generate_with_static_cache even less flaky by @ydshieh in #34995
- Improve modular transformers documentation by @joelpaulkoch in #35322
- Improved Documentation Of Audio Classification by @Uvi-12 in #35368
- [docs] Follow up register_pipeline by @stevhliu in #35310
- owlvit/2 dynamic input resolution by @bastrob in #34764
- Fix new FA2 if is_causal is passed explicitly by @Cyrilvallez in #35390
- bitsandbytes: simplify 8bit dequantization by @matthewdouglas in #35068
- make LlamaModel._update_causal_mask torch compilable by @winglian in #35187
- Patch GPTNeoX to use adequate FA2 if position_ids is provided by @taha-yassine in #35318
- uniformize kwargs for SAM by @tibor-reiss in #34578
- Deprecate _is_quantized_training_enabled by @MekkCyber in #34991
- Scale loss before backward by @qgallouedec in #35207
- Fix typing in docstring for PaliGemmaProcessor by @alvarobartt in #35278
- Fix : VPTQ test by @MekkCyber in #35394
- add bnb support for Ascend NPU by @statelesshz in #31512
- bugfix Idefics3 processor - handle gracefully cases with text and no images by @mfarre in #35363
- Adding logger.info about update_torch_dtype in some quantizers by @MekkCyber in #35046
- Add compile test for fast image processor by @yonigozlan in #35184
- Disable .github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
- enable non-cuda awq model support without modify version by @jiqing-feng in #35334
- [GPTQ, CompressedTensors] Fix unsafe imports and metada check by @vasqu in #34815
- Drop inplace operation for loss computation with gradient accumulation by @qgallouedec in #35416
- Fix: Rename keyword argument in_channels to num_channels by @ningyuv in #35289
- CLIP conversion script - Change fairseq to OpenAI by @gau-nernst in #35384
- Fix f-string to show ACCELERATE_MIN_VERSION on error by @KSafran in #35189
- Fix model_accepts_loss_kwargs for timm model by @qubvel in #35257
- Update perf_infer_gpu_one.md: fix a typo by @martin0258 in #35441
- Add compute_loss_func to Seq2SeqTrainer by @d223302 in #35136
- Update docs for sdpa_kernel by @jla524 in #35410
- [i18n-ar] Translated file: docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
- [i18n-ar] Translated file: docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
- Update translated docs for sdpa_kernel by @jla524 in #35461
- Reintroduce Python 3.9 support for ModernBERT by @tomaarsen in #35458
- Fix new BNB test failures by @matthewdouglas in #35345
- Fix docs typos. by @zhanluxianshen in #35465
- Fix paligemma warning message by @hiyouga in #35486
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ydshieh
- Fix private forked repo. CI (#35114)
- Change back to Thread for SF conversion (#35236)
- Use rsfE with pytest (#35119)
- Aggeregate test summary files in CircleCI workflow runs (#34989)
- remove benchmark job in push-important-models.yml (#35292)
- Trigger GitHub CI with a comment on PR (#35211)
- Update comment CI bot (#35323)
- Reduce CircleCI usage (#35355)
- Aurevoir PyTorch 1 (#35358)
- Use weights_only=True with torch.load for transfo_xl (#35241)
- Make test_generate_with_static_cache even less flaky (#34995)
- Disable .github/workflows/self-comment-ci.yml for now (#35366)
- @aymeric-roucher
- Add Aria (#34157)
- @NielsRogge
- @HMJ0628
- @alexrs-cohere
- @ArthurZucker
- @tonywu71
- @OmarManzoor
- Add sdpa for Beit (#34941)
- @fabianlim
- Add the Bamba Model (#34982)
- @warner-benjamin
- @wejoncy
- FEAT : Adding VPTQ quantization method to HFQuantizer (#34770)
- @bastrob
- owlvit/2 dynamic input resolution (#34764)
- @BlackSamorez
- HIGGS Quantization Support (#34997)