pypi diffusers 0.15.0
v0.15.0 Beyond Image Generation


Taking Diffusers Beyond Image Generation

We are very excited about this release! It brings new pipelines for video and audio to diffusers, showing that diffusion is a great choice for all sorts of generative tasks. The modular, pluggable approach of diffusers was crucial to integrate the new models intuitively and cohesively with the rest of the library. We hope you appreciate the consistency of the APIs and implementations, as our ultimate goal is to provide the best toolbox to help you solve the tasks you're interested in. Don't hesitate to get in touch if you use diffusers for other projects!

In addition to that, diffusers 0.15 includes many new features and improvements, from performance and deployment gains (faster pipeline loading) to increased flexibility for creative tasks (Karras sigmas, weight prompting, support for Automatic1111 textual inversion embeddings), additional customization options (Multi-ControlNet), and training utilities (ControlNet, Min-SNR weighting). Read on for the details!

🎬 Text-to-Video

Text-guided video generation is no longer a fantasy - it's as simple as spinning up a Colab and running either of the two powerful open-source video generation models.

Text-to-Video

Alibaba's DAMO Vision Intelligence Lab has open-sourced a first research-only video generation model that can generate impressive video clips of up to a minute. To see Darth Vader riding a wave, simply copy-paste the following lines into your favorite Python interpreter:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)

(Generated video: Darth Vader surfing a wave)

For more information, you can have a look at damo-vilab/text-to-video-ms-1.7b.

Text-to-Video Zero

Text2Video-Zero is a zero-shot text-to-video synthesis diffusion model that enables low-cost yet consistent video generation using only pre-trained text-to-image diffusion models, such as Stable Diffusion v1-5. Text2Video-Zero also naturally supports extensions of pre-trained text-to-image models such as Instruct Pix2Pix, ControlNet and DreamBooth, on top of which it offers Video Instruct Pix2Pix, Pose Conditional, Edge Conditional, and Edge Conditional + DreamBooth specialized applications.


For more information, please have a look at PAIR/Text2Video-Zero.
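
As a minimal sketch of how this looks with the new TextToVideoZeroPipeline (assuming Stable Diffusion v1-5 as the base checkpoint and imageio for writing the clip; adjust to taste):

import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Any Stable Diffusion v1.x checkpoint can act as the underlying text-to-image model
pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images  # list of frames with values in [0, 1]

# Convert the frames to uint8 and write them out as a short clip
frames = [(frame * 255).astype("uint8") for frame in result]
imageio.mimsave("video.mp4", frames, fps=4)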

🔉 Audio Generation

Text-guided audio generation has made great progress over the last few months, with many advances based on diffusion models. The 0.15.0 release includes two powerful audio diffusion models.

AudioLDM

Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.

from diffusers import AudioLDMPipeline
import torch

repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

The resulting audio output can be saved as a .wav file:

import scipy

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)

For more information, see cvssp/audioldm.

Spectrogram Diffusion

This model from the Magenta team is a MIDI-to-audio generator. The pipeline takes a MIDI file as input and autoregressively generates 5-second spectrograms, which are finally concatenated and decoded to audio via a spectrogram decoder.

from diffusers import SpectrogramDiffusionPipeline, MidiProcessor

pipe = SpectrogramDiffusionPipeline.from_pretrained("google/music-spectrogram-diffusion")
pipe = pipe.to("cuda")
processor = MidiProcessor()

# Download MIDI from: wget http://www.piano-midi.de/midis/beethoven/beethoven_hammerklavier_2.mid
output = pipe(processor("beethoven_hammerklavier_2.mid"))

audio = output.audios[0]

📗 New Docs

Documentation is crucially important for diffusers, as it's one of the first resources where people try to understand how everything works and fix any issues they are observing. We have spent a lot of time in this release reviewing all documents, adding new ones, reorganizing sections and bringing code examples up to date with the latest APIs. This effort has been led by @stevhliu (thanks a lot! 🙌) and @yiyixuxu, but many others have chimed in and contributed.

Check it out: https://huggingface.co/docs/diffusers/index

Don't hesitate to open PRs for fixes to the documentation; they are greatly appreciated, as discussed in our (revised, of course) contribution guide.


🪄 Stable UnCLIP

Stable UnCLIP is the best open-source image variation model out there. Pass an initial image, and optionally a prompt, to generate variations of it:

from diffusers import DiffusionPipeline
from diffusers.utils import load_image
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-unclip-small", torch_dtype=torch.float16)
pipe.to("cuda")

# get image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
image = load_image(url)

# run image variation
image = pipe(image).images[0]

For more information, you can have a look at stabilityai/stable-diffusion-2-1-unclip.


🚀 More ControlNet

ControlNet was released in diffusers in version 0.14.0, but we have some exciting developments: Multi-ControlNet, a training script, an upcoming event, and a community image-to-image pipeline contributed by @mikegarts!

Multi-ControlNet

Thanks to community member @takuma104, it's now possible to use several ControlNet conditioning models at once! It works with the same API as before, only supplying a list of ControlNets instead of just one:

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet_canny = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", 
                                                   torch_dtype=torch.float16).to("cuda")
controlnet_pose = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", 
                                                   torch_dtype=torch.float16).to("cuda")

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "example/a-sd15-variant-model", torch_dtype=torch.float16,
    controlnet=[controlnet_pose, controlnet_canny]
).to("cuda")

pose_image = ...
canny_image = ...
prompt = ...

image = pipe(prompt=prompt, image=[pose_image, canny_image]).images[0]

And this is an example of how this affects generation:

(Example images omitted: the two control images and the corresponding generated result.)

ControlNet Training

We have created a training script for ControlNet, and can't wait to see what new ideas the community may come up with! In fact, we are so pumped about it that we are organizing a JAX Diffusers sprint with a special focus on ControlNet, where participating teams will be assigned TPU v4-8s to work on their projects 🤯. Those are some mean machines, so make sure you join our Discord to follow the event: https://discord.com/channels/879548962464493619/897387888663232554/1092751149217615902.

🐈‍⬛ Textual Inversion, Revisited

Several great contributors have been working on textual inversion to get the most out of it. @isamu-isozaki made it possible to perform multi-token training, and @piEsposito & @GuiyeC created an easy way to load textual inversion embeddings. These contributors are always a pleasure to work with 🙌; we feel honored and proud of this community 🙏

Loading textual inversion embeddings is compatible with the Automatic1111 format, so you can download embeddings from other services (such as civitai), and easily apply them in diffusers. Please check the updated documentation for details.
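
As a rough sketch (the concept repository and prompt below are just examples), loading an embedding from the Hub and using its placeholder token looks like this:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a textual inversion embedding; Automatic1111-style files downloaded
# from other services can be loaded from a local path in the same way
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The placeholder token defined by the embedding can now be used in prompts
image = pipe("A <cat-toy> sitting on a bench").images[0]
image.save("cat-toy.png")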

🏃 Faster loading of cached pipelines

We conducted a thorough investigation of the pipeline loading process to make it as fast as possible. This is the before and after:

Previous: 2.27 sec
Now: 1.1 sec

Instead of performing 3 HTTP operations, we now get all we need with just one. That single call is necessary to check whether any of the components in the pipeline were updated – if that's the case, then we need to download the new files. This improvement also applies when you load individual models instead of pre-trained pipelines.

This may not sound like much, but many people use diffusers for user-facing services where models and pipelines have to be reused on demand. By minimizing latency, they can provide a better service to their users and reduce operating costs.

Loading time can be reduced even further by forcing diffusers to use only the files on disk and never check for updates, as sketched below. This is not recommended for most users, but can be interesting in production environments.
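
For instance, a minimal offline-loading sketch (assuming the pipeline has already been downloaded to the local cache) is to pass local_files_only=True:

import torch
from diffusers import DiffusionPipeline

# Build the pipeline entirely from the local cache; no HTTP calls are made,
# and an error is raised if a required file has never been downloaded
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    local_files_only=True,
)

Setting the HF_HUB_OFFLINE environment variable achieves a similar effect globally.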

🔩 Weight prompting using compel

Weight prompting is a popular method to increase the importance of some of the elements that appear in a text prompt, as a way to force image generation to obey those concepts. Because diffusers is used in a multitude of services and projects, we wanted to provide a very flexible way to adopt prompt weighting, so users can ultimately build the system they prefer. Our approach was to:

  • Make the Stable Diffusion pipelines accept raw prompt embeddings. You are free to create the embeddings however you see fit, so users can come up with new ideas to express weighting in their projects.
  • At the same time, we adopted compel, by @damian0815, as a higher-level library to create the weighted embeddings.

You don't have to use compel to create the embeddings, but if you do, this is an example of how it looks in practice:

from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
from compel import Compel

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
prompt = "a red cat playing with a ball++"
prompt_embeds = compel_proc(prompt)

image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=20).images[0]

(Generated image: a red cat playing with a ball)

As you can see, we assign more weight to the word ball using compel-specific syntax (ball++). You can use other libraries (or your own code) to create appropriate embeddings to pass to the pipeline.

You can read more details in the documentation.

🎲 Karras Sigmas for schedulers

Some diffusers schedulers now support Karras sigmas! Thanks @nipunjindal !

See Add Karras pattern to discrete euler in #2956 for more information.
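
As a quick sketch (the base model here is just an example), enabling Karras sigmas is a matter of passing use_karras_sigmas=True when creating a supported scheduler, such as DPMSolverMultistepScheduler:

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Use the Karras et al. sigma spacing for the denoising schedule
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=25).images[0]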

All commits

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @Abhinay1997
    • [Community Pipeline] Unclip Image Interpolation (#2400)
  • @ssube
    • add OnnxStableDiffusionUpscalePipeline pipeline (#2158)
    • add support for pre-calculated prompt embeds to Stable Diffusion ONNX pipelines (#2597)
  • @haofanwang
    • Support convert LoRA safetensors into diffusers format (#2403)
    • Support LoRA for text encoder (#2588)
    • Support for Offset Noise in examples (#2753)
    • Update train_text_to_image_lora.py (#2767)
    • Update train_text_to_image_lora.py (#2795)
  • @isamu-isozaki
    • Added multitoken training for textual inversion. Issue 369 (#661)
  • @mikegarts
    • community stablediffusion controlnet img2img pipeline (#2584)
    • img2img.multiple.controlnets.pipeline (#2833)
  • @LolitaSian
    • add translated docs (#2587)
  • @Ttl
    • Controlnet training (#2545)
  • @nipunjindal
    • [1929]: Add CLIP guidance for Img2Img stable diffusion pipeline (#2723)
    • [2737]: Add DPMSolverMultistepScheduler to CLIP guided community pipeline (#2779)
    • [2761]: Add documentation for extra_in_channels UNet1DModel (#2817)
    • [2884]: Fix cross_attention_kwargs in StableDiffusionImg2ImgPipeline (#2902)
    • [2905]: Add Karras pattern to discrete euler (#2956)
    • [2064]: Add Karras to DPMSolverMultistepScheduler (#3001)
  • @bahjat-kawar
    • Add ModelEditing pipeline (#2721)
    • StableDiffusionModelEditingPipeline documentation (#2810)
  • @piEsposito
    • add load textual inversion embeddings to stable diffusion (#2009)
  • @19and99
    • [Pipeline] Add TextToVideoZeroPipeline (#2954)
    • Text2video zero refinements (#3070)
  • @MuhHanif
    • Flax memory efficient attention (#2889)
