mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256)
-
Add Gemma3nVisionModel - MobileNetV5 vision encoder converter to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
-
Add mobilenetv5 impl
-
Fix comments, remove unused vars
-
Fix permute and remove transpose of projection weights
-
Fix comments, remove debugging prints from hf_to_gguf
-
- Hard-code image_mean = 0 and image_std = 1
- Use available tensor mapping logic
- Remove redundant chat template replacement of soft tokens placeholder with media placeholder
-
- Move mobilenetv5 helpers declarations to clip_graph_mobilenetv5 struct and definitions to mobilenetv5.cpp
- Remove unused clip_is_gemma3n func declarations and definitions
- Remove redundant rescale_image_u8_to_f32 func and use normalize_image_u8_to_f32 with zero mean and unit std
- Calculate n_patches using image_size / patch_size
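
Why a separate rescale helper is redundant once mean is 0 and std is 1 can be sketched in a few lines. This is only an illustration in Python: the two functions below are hypothetical stand-ins named after the C++ helpers, and their signatures and the interleaved RGB layout are assumptions, not the actual clip.cpp code.

```python
def normalize_u8_to_f32(pixels, mean, std):
    """Scale 8-bit values to [0, 1], then apply per-channel (x - mean) / std."""
    out = []
    for i, p in enumerate(pixels):
        c = i % 3  # assumed interleaved RGB layout: channel index cycles 0, 1, 2
        out.append((p / 255.0 - mean[c]) / std[c])
    return out


def rescale_u8_to_f32(pixels):
    """Plain rescale of 8-bit values to [0, 1]."""
    return [p / 255.0 for p in pixels]


pixels = [0, 128, 255, 64, 32, 16]  # two RGB pixels, made up for the example

# With zero mean and unit std the normalization step is the identity,
# so both helpers return the same values.
normalized = normalize_u8_to_f32(pixels, mean=[0.0, 0.0, 0.0], std=[1.0, 1.0, 1.0])
rescaled = rescale_u8_to_f32(pixels)
```

Subtracting 0.0 and dividing by 1.0 leave a float unchanged, so `normalized` and `rescaled` compare equal element by element.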
-
Remove obsolete comments
-
- convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
-
- Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology
-
Fix stem conv bias name
-
Remove explicit handling of bias term for stem conv
-
- Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable
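
The effect of operand order on broadcasting can be sketched on shapes alone. The sketch assumes an element-wise add in which only the second operand may be broadcast to the shape of the first, as is the case for ggml's add to the best of my knowledge. The function names and sizes are hypothetical and chosen only for illustration.

```python
def can_repeat(small, big):
    """True if every dimension of `big` is a whole multiple of `small`'s."""
    return len(small) == len(big) and all(b % s == 0 for s, b in zip(small, big))


def add_shapes(a, b):
    """Shape of a + b when only the second operand may be broadcast."""
    if not can_repeat(b, a):
        raise ValueError(f"cannot broadcast {b} to {a}")
    return a


# Hypothetical sizes, not taken from the model.
n_embd_altup, n_layer, n_tokens = 4, 3, 5
projection = (n_embd_altup, n_layer, n_tokens)
vision_inp_per_layer = (n_embd_altup, n_layer, 1)  # trailing 1 repeats over tokens

# Larger tensor first: the [n_embd_altup, n_layer, 1] input is broadcast.
result_shape = add_shapes(projection, vision_inp_per_layer)

# Reversed order fails under this rule, because the first operand fixes the output shape.
try:
    add_shapes(vision_inp_per_layer, projection)
    reversed_ok = True
except ValueError:
    reversed_ok = False
```

Under this assumption, giving the vision path a trailing dimension of 1 lets it be added to a projection of any token count, provided the projection is the first operand.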
-
clean up conversion script
-
fix code style
-
also preserve audio tensors
-
trailing space
-
split arch A and V
-
rm unused gemma3 func
-
fix alignment
Co-authored-by: Xuan Son Nguyen son@huggingface.co
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: