pypi diffusers 0.39.0
Diffusers 0.39.0: New image and video pipelines, core library improvements, and more

6 hours ago

New Pipelines

Cosmos 3

Cosmos 3 is NVIDIA's unified world foundation model (WFM) for Physical AI: a single omni-model built on a Mixture-of-Transformers (MoT) architecture that combines world generation, physical reasoning, and action generation. It replaces the separate Predict, Reason, and Transfer models from earlier Cosmos releases. A single Cosmos3OmniTransformer runs a Qwen-style language model in parallel with a diffusion generation pathway, joined by a 3D multimodal RoPE. This release also adds video-to-video generation, action-conditioned generation, and a sound encoder.

Thanks to @atharvajoshi10, @yzhautouskay, and @MaciejBalaNV for the contributions.

Ideogram 4

Ideogram 4 is a flow-matching text-to-image model that uses a multimodal text encoder and an asymmetric classifier-free guidance scheme: a dedicated unconditional_transformer produces the negative branch with zeroed text features, while the main transformer consumes the full packed text + image sequence. The pipeline ships with structured prompt upsampling and LoRA loading support.
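
The two-model guidance described above can be sketched as follows. This is a minimal illustration, not the Ideogram 4 pipeline code: the function names, the toy stand-in models, and the use of the standard classifier-free guidance formula to combine the two branches are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Model = Callable[[Sequence[float], Sequence[float]], Vector]


def asymmetric_cfg(
    cond_model: Model,
    uncond_model: Model,
    latents: Sequence[float],
    text_features: Sequence[float],
    guidance_scale: float,
) -> Vector:
    """Combine predictions from two separate networks (hypothetical helper)."""
    # Conditional branch: the main model sees the real text features.
    cond = cond_model(latents, text_features)
    # Negative branch: a different model, fed zeroed text features.
    zeroed = [0.0] * len(text_features)
    uncond = uncond_model(latents, zeroed)
    # Assumed combination rule: standard classifier-free guidance extrapolation.
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy stand-ins for the two transformers (purely illustrative).
def toy_cond(latents: Sequence[float], text: Sequence[float]) -> Vector:
    return [x + sum(text) for x in latents]


def toy_uncond(latents: Sequence[float], text: Sequence[float]) -> Vector:
    return [0.5 * x + sum(text) for x in latents]


guided = asymmetric_cfg(toy_cond, toy_uncond, [1.0, 2.0], [0.25, 0.25], guidance_scale=3.0)
print(guided)  # [3.5, 5.5]
```

With guidance_scale set to 1.0 the result reduces to the conditional prediction alone, which is a quick sanity check for any implementation of this pattern.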

Thanks to @JinLiIdeogram for the contribution.

Krea 2

Krea 2 (K2) is a flow-matching text-to-image model built around a single-stream MMDiT with grouped-query attention. A Qwen3-VL text encoder provides the conditioning — hidden states from twelve decoder layers are tapped per token and fused inside the transformer by a small text-fusion stage — and images are decoded with the Qwen-Image VAE. Both the base (midtrain) and TDM (distilled, few-step) checkpoints are supported, alongside a LoRA DreamBooth trainer.
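
For readers unfamiliar with grouped-query attention, the sketch below shows the head-sharing idea in isolation: several query heads read from the same key/value head. The helper name and the head counts are illustrative assumptions and do not reflect Krea 2's actual configuration.

```python
from typing import List


def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the key/value head shared by a given query head (hypothetical helper)."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query_head out of range")
    # Consecutive query heads form a group that shares one key/value head.
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Illustrative sizes only: 8 query heads sharing 2 key/value heads.
mapping: List[int] = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Setting num_kv_heads equal to num_query_heads recovers ordinary multi-head attention, and setting it to 1 gives multi-query attention.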

Thanks to @EleaZhong and @Abhinay1997 for the contribution.

DreamLite

DreamLite is a text-to-image and image-editing model from ByteDance. It pairs a custom 2D U-Net (DreamLiteUNetModel) with the Qwen3-VL multimodal encoder as its prompt / image-instruction encoder, and uses an AutoencoderTiny (TAESD-style) VAE for fast latent encode/decode. A distilled DreamLiteMobilePipeline targets on-device, low-latency generation.

Thanks to @Carlofkl for the contribution.

PRX Pixel

PRX Pixel is a pixel-space text-to-image model by Photoroom. A ~7B-parameter PRXTransformer2DModel denoises raw RGB images directly, so no VAE is needed. The model is conditioned on a Qwen3-VL text encoder and uses flow matching in which the transformer predicts the clean image at each step (x-prediction).
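
To make x-prediction concrete, here is a small sampler sketch. It assumes the common rectified-flow interpolation x_t = (1 - t) * x_clean + t * noise with t running from 1 to 0, and a plain Euler update; the function names are hypothetical and the scheduler used by the actual pipeline may differ.

```python
import random
from typing import Callable, List

Image = List[float]  # a flattened stand-in for raw pixel values


def euler_sample_x_prediction(
    predict_clean: Callable[[Image, float], Image],
    noise: Image,
    num_steps: int,
) -> Image:
    """Euler sampler for flow matching when the network predicts the clean image.

    Assumed convention: x_t = (1 - t) * x_clean + t * noise, with t going 1 -> 0.
    """
    x = list(noise)
    for i in range(num_steps):
        t = 1.0 - i / num_steps             # current time, never 0 inside the loop
        t_next = 1.0 - (i + 1) / num_steps  # next time, reaches 0 on the last step
        x_clean_pred = predict_clean(x, t)
        # Under the assumed interpolation, dx/dt = noise - x_clean = (x_t - x_clean) / t.
        velocity = [(xt - xc) / t for xt, xc in zip(x, x_clean_pred)]
        x = [xt + (t_next - t) * v for xt, v in zip(x, velocity)]
    return x


# Sanity check with an oracle "model" that always returns the true clean image.
target = [0.2, 0.8, 0.5]
rng = random.Random(0)
start_noise = [rng.gauss(0.0, 1.0) for _ in target]
result = euler_sample_x_prediction(lambda x, t: target, start_noise, num_steps=4)
```

With a perfect clean-image predictor the sampler lands on the target regardless of the step count; a real network only approximates this, which is why more steps usually help.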

Thanks to @DavidBert for the contribution.

Motif-Video

Motif-Video is a 2B-parameter diffusion transformer for text-to-video and image-to-video generation. It features a three-stage architecture (12 dual-stream + 16 single-stream + 8 DDT decoder layers), Shared Cross-Attention for stable text-video alignment over long sequences, a T5Gemma2 text encoder, and rectified flow matching with velocity prediction.

Thanks to @waitingcheung for the contribution.

AnyFlow

AnyFlow from NVIDIA, NUS, and MIT is the first any-step video diffusion framework built on flow maps, enabling a single model (bidirectional or causal) to adapt to arbitrary inference budgets. It ships both bidirectional and FAR causal pipelines built on Wan2.1 backbones, covering text-to-video, image-to-video, and video-to-video.

Thanks to @Enderfga for the contribution.

JoyAI-Image-Edit

JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal LLM with a 16B Multimodal Diffusion Transformer (MMDiT). JoyImageEditPipeline supports general image editing as well as spatial editing capabilities including object move, object rotation, and camera control.

Thanks to @Moran232 for the contribution.

DiffusionGemma

DiffusionGemma is a block-diffusion encoder-decoder language model. A causal encoder reads the clean prompt (and any previously generated blocks) into a KV cache, and a bidirectional decoder denoises a fixed-size "canvas" of tokens by cross-attending to that cache, committing the most confident tokens via the new BlockRefinementScheduler. The released checkpoint is google/diffusiongemma-26B-A4B-it.
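
The "commit the most confident tokens" idea can be illustrated with a short sketch. It is not the BlockRefinementScheduler API or its exact logic: the function name, the use of None to mark open canvas slots, and the fixed per-step commit count are assumptions chosen to keep the example small.

```python
from typing import Dict, List, Optional, Tuple


def commit_most_confident(
    canvas: List[Optional[int]],
    predictions: Dict[int, Tuple[int, float]],
    num_to_commit: int,
) -> List[Optional[int]]:
    """Fill the open canvas slots whose predictions have the highest confidence.

    canvas: token ids, with None marking positions that are still open.
    predictions: position -> (predicted token id, confidence in [0, 1]).
    """
    open_positions = [i for i, tok in enumerate(canvas) if tok is None and i in predictions]
    # Rank open positions from most to least confident.
    ranked = sorted(open_positions, key=lambda i: predictions[i][1], reverse=True)
    updated = list(canvas)
    for pos in ranked[:num_to_commit]:
        updated[pos] = predictions[pos][0]
    return updated


canvas: List[Optional[int]] = [None, None, None, None]
step1 = commit_most_confident(
    canvas, {0: (11, 0.9), 1: (22, 0.4), 2: (33, 0.7), 3: (44, 0.2)}, num_to_commit=2
)
print(step1)  # [11, None, 33, None]
# On the next pass the decoder re-predicts the remaining slots.
step2 = commit_most_confident(step1, {1: (25, 0.8), 3: (44, 0.6)}, num_to_commit=2)
print(step2)  # [11, 25, 33, 44]
```

Already committed positions are left untouched, so each pass only narrows the set of open slots until the block is complete.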

Anima

Anima is a 2-billion-parameter text-to-image model created through a collaboration between CircleStone Labs and Comfy Org. It focuses mainly on anime concepts, characters, and styles, but can also generate a wide variety of other non-photorealistic content.

It reuses the CosmosTransformer3DModel with a Qwen3 text encoder, a T5-token text conditioner, and the AutoencoderKLQwenImage VAE.

Thanks to @rmatif for the contribution.

LTX-2.X IC LoRA and HDR Pipelines

New LTX2InContextPipeline (in-context LoRA) and LTX2HDRPipeline extend the LTX-2 family with in-context conditioning and HDR video generation.

Modular Pipeline Support

  • We added a modular pipeline for Stable Diffusion 3 (SD3) in #13324 (thanks to @AlanPonnachan).
  • We added a modular pipeline for Anima in #13732 (thanks to @rmatif).
  • LoRA loading is now enabled on ErnieImageModularPipeline (#13948) and Ideogram4ModularPipeline (#13980), thanks to @SamuelTallet.

Core Library

All commits

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @DN6
    • [CI] Update all workflows with permissions (#13672)
    • [CI] QOL improvement for PR size labeler (#13554)
    • Update Flax removal version (#13729)
    • Update contribution guidelines (#13753)
    • [CI] Replace print_env step in CI with diffusers-cli env (#13662)
    • [CI] Fix torch_device import in AutoencoderTesterMixin (#13852)
    • [CI] Refactor LTX Transformer Tests (#13254)
    • [CI] Refactor Bria Transformer Tests (#13341)
    • [CI] Refactor Chronoedit, PRX, EasyAnimate, Ovis transformer tests (#13347)
    • [CI] Refactor Chroma , LongCat and HiDream Transformer Tests (#13345)
    • [CI] Refactor Skyreels, Lumina, Ominigen, Mochi transformer tests (#13348)
    • [CI] Refactor SD3 Transformer Test (#13340)
    • [CI] Refactor Z Image Transformer Tests (#13253)
  • @yiyixuxu
    • [agents docs] update models.md with class attributes and attention mask (#13665)
    • [agents docs] update pipelines.md: (#13570)
    • [CI] claude_review: target source PR's branch for follow-up PRs (#13774)
    • [docs] update philosophy.md (finally) (#13808)
    • [.ai] add self-review skill (#13917)
    • update PR template and highlight AI-agent setup for contributors (#13913)
    • Point "Coding with AI agents" links at the rendered docs site (#13952)
    • Make root PHILOSOPHY.md a symlink to the docs philosophy page (#13954)
    • keep the agent symlinks (#13968)
    • Add Krea 2 (K2) text-to-image pipeline and transformer (#14045)
    • [.ai doc] Refine .ai attention-mask and component-mutation guidance (#13982)
    • [.ai] document single-file model layout and "don't reimplement Diffus… (#14048)
    • support loading pipeline from transformer style (flat) repo (#14096)
  • @akshan-main
    • Address ernie-image review findings #13577 (#13663)
    • refactor sana transformer tests (#13826)
    • refactor autoencoder tests (asymmetric_kl, ltx_video) (#13845)
    • refactor autoencoder_magvit tests (#13834)
    • refactor autoencoder_hunyuan_video tests (#13835)
    • refactor autoencoder_kl_cogvideox tests (#13840)
    • refactor autoencoder tests (vq, kvae_video, oobleck, consistency_decoder, tiny, vidtok) (#13849)
    • Add from_single_file support to ErnieImageTransformer2DModel (#13727)
    • refactor autoencoder tests (temporal decoder, cosmos, kvae, mochi) (#13832)
    • refactor controlnet_cosmos tests (#13847)
    • refactor unet_spatiotemporal tests (#13891)
    • refactor unet tests (3d_condition, motion, controlnetxs) (#13897)
    • refactor unet_1d tests (#13898)
    • refactor unet_2d tests (#13901)
    • fix(flux): enable true CFG with precomputed negative embeds (#13957)
    • fix(flux): tighten check_inputs validation (#13955)
    • fix(bria_fibo): fix guidance_embeds, prompt_embeds, tensor-image and multi-image crashes (#13981)
  • @AlanPonnachan
    • feat: Add Modular Pipeline for Stable Diffusion 3 (SD3) (#13324)
  • @Moran232
    • [feat] JoyAI-JoyImage-Edit support (#13444)
  • @terarachang
    • Add LoRA support for Cosmos Predict 2.5 and fix pipeline to match official Cosmos repo (#13664)
  • @dg845
    • Fix GGUF to Work Better with modules_to_not_convert / keep_in_fp32_modules (#13697)
    • Add LTX-2.X IC LoRA and HDR Pipelines (#13572)
  • @waitingcheung
    • feat: Add Motif-Video model and pipelines (#13551)
  • @kashif
    • [LLADA2] Fix llada2 review #13598 (#13698)
    • [discrete diffusion] Add DiffusionGemma pipeline and schedulers (#13986)
    • Add doc pages for the DiffusionGemma schedulers (#14092)
  • @linoytsaban
    • [LTX 2.3] update docs (#13788)
    • Add Ideogram4LoraLoaderMixin (LoRA loading for Ideogram4) (#13921)
    • [lora] add non-diffusers LoRA loading support for Krea 2 LoRAs (#14074)
  • @Enderfga
    • Add AnyFlow Any-Step Video Diffusion Pipelines (Bidirectional + FAR Causal) (#13745)
    • [AnyFlow] FAR: standalone causal-mask builder + torch.compile follow-up (#13792)
  • @atharvajoshi10
    • Adding Cosmos 3 to Diffusers (#13818)
    • multi-GPU VAE Fix for Cosmos 3 (#13924)
  • @rmatif
    • Add Anima modular pipeline (#13732)
  • @JingyaHuang
    • [Neuron] Add AWS Neuron (Trainium/Inferentia) as an officially supported device (#13289)
    • [Neuron] Enable torch.compile compatibility with Neuron device (#13485)
  • @apolinario
    • Add Ideogram 4 (#13859)
    • Add structured prompt upsampling to Ideogram4 (#13860)
    • Krea 2 LoRA DreamBooth trainer (#14046)
    • Ideogram4 lora training (#13861)
  • @yzhautouskay
    • Add Cosmos3 action generation support (#13823)
    • Add Cosmos3 video2video generation support (#13896)
  • @xin3he
    • Integrate AutoRound into Diffusers (#13552)
  • @Carlofkl
    • [Pipelines] Add DreamLite text-to-image and image-edit pipelines (#13815)
  • @liwd190019
    • Add tutorial translations in Chinese (#13932)
  • @MaciejBalaNV
    • Add Sound Encoder to Cosmos3 (#13911)
  • @DavidBert
    • Add PRXPixelPipeline: pixel-space PRX text-to-image pipeline (#13928)
