These are the release notes of the 🧨 Diffusers library
Introducing Hugging Face's new library for diffusion models.
Diffusion models have proven very effective in artificial synthesis, even beating GANs for images. Because of this, they have gained traction in the machine learning community and play an important role in systems such as DALL-E 2 and Imagen, which generate photorealistic images when prompted with text.
While the most prolific successes of diffusion models have been in the computer vision community, these models have also achieved remarkable results in other domains, such as audio generation and reinforcement learning (see the upcoming pipelines below), and more.
Goals
The goals of diffusers are:
- to centralize the research of diffusion models from independent repositories into a clear and maintained project,
- to reproduce high-impact machine learning systems such as DALL-E 2 and Imagen in a manner that is accessible to the public, and
- to create an easy-to-use API that enables one to train their own models or re-use checkpoints from other repositories for inference.
Release overview
Quickstart:
- For a light walk-through of the library, please have a look at the Official 🧨 Diffusers Notebook.
- To directly jump into training a diffusion model yourself, please have a look at the Training Diffusers Notebook.
Diffusers aims to be a modular toolbox for diffusion techniques, with a focus on the following categories:
🚀 Inference pipelines
Inference pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box. The goal is for them to stick as closely as possible to their original implementations, and they can include components from other libraries (such as text encoders).
The original release contains the following pipelines (a usage sketch follows the list):
- DDPM for unconditional image generation with discrete scheduling in pipeline_ddpm.
- DDIM for unconditional image generation with discrete scheduling in pipeline_ddim.
- PNDM for unconditional image generation with discrete scheduling in pipeline_pndm.
- Stochastic Differential Equations for unconditional image generation with continuous scheduling in score_sde_ve.
- Latent diffusion for text-to-image / conditional image generation in pipeline_latent_diffusion, as well as for unconditional image generation in latent_diffusion_uncond.
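To give a feel for the API, below is a minimal sketch of sampling from two of these pipelines. It assumes the publicly hosted checkpoints google/ddpm-cat-256 and CompVis/ldm-text2im-large-256, and accesses outputs via the stable .images attribute, which may differ in very early versions of the library.

```python
from diffusers import DDPMPipeline, LDMTextToImagePipeline

# Unconditional image generation with DDPM.
ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256")
image = ddpm().images[0]  # a PIL.Image containing one generated sample
image.save("ddpm_generated_cat.png")

# Text-to-image generation with latent diffusion.
ldm = LDMTextToImagePipeline.from_pretrained("CompVis/ldm-text2im-large-256")
prompt = "A painting of a squirrel eating a burger"
image = ldm(prompt, num_inference_steps=50).images[0]
image.save("ldm_squirrel.png")
```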
We are currently working on enabling other pipelines for different modalities. The following pipelines are expected to land in a subsequent release:
- BDDMPipeline for spectrogram-to-sound vocoding
- GLIDEPipeline to support OpenAI's GLIDE model
- Grad-TTS for text-to-audio generation / conditional audio generation
- A reinforcement learning pipeline (happening in #105)
⏰ Schedulers
- Schedulers are the algorithms used to run diffusion models at inference time as well as during training. They include the noise schedules and define algorithm-specific diffusion steps.
- Schedulers can be used interchangeably between diffusion models at inference time to find the preferred trade-off between speed and generation quality.
- Schedulers are available in NumPy, but can easily be converted to PyTorch.
The goal is for each scheduler to provide one or more step() functions that should be called iteratively to unroll the diffusion loop during the forward pass. They are framework-agnostic, but offer conversion methods that allow easy conversion to PyTorch utilities.
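As an illustration, the sketch below unrolls that diffusion loop by hand with a DDPM scheduler and a pretrained unconditional UNet. It uses the PyTorch-based scheduler interface (set_timesteps / step) of later releases, which differs slightly from the NumPy schedulers shipped here, and assumes the google/ddpm-cat-256 checkpoint.

```python
import torch
from diffusers import UNet2DModel, DDPMScheduler

model = UNet2DModel.from_pretrained("google/ddpm-cat-256")
scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)  # sample in fewer steps than the model was trained with

# Start from pure Gaussian noise and iteratively denoise.
sample = torch.randn(1, 3, model.config.sample_size, model.config.sample_size)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample                    # predict the noise residual
    sample = scheduler.step(noise_pred, t, sample).prev_sample  # one reverse-diffusion step
```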
The initial release contains the following schedulers:
- DDIM, from the Denoising Diffusion Implicit Models paper.
- DDPM, from the Denoising Diffusion Probabilistic Models paper.
- PNDM, from the Pseudo Numerical Methods for Diffusion Models on Manifolds paper.
- SDE_VE, from the Score-Based Generative Modeling through Stochastic Differential Equations paper.
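Because the schedulers expose the same step() interface, swapping one for another in the loop above is, in this sketch, a two-line change, for example to trade DDPM's many steps for DDIM's faster sampling:

```python
from diffusers import DDIMScheduler

# Same UNet, different sampling algorithm.
scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)
```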
🏭 Models
Models are hosted in the src/diffusers/models folder.
For the initial release, you'll get to see a few building blocks, as well as some resulting models (a minimal usage sketch follows the list):
- UNet2DModel is the UNet architecture used in recent diffusion papers, in its unconditional form, in contrast to the conditional version that follows below.
- UNet2DConditionModel is similar to UNet2DModel, but is conditional: its downsample and upsample layers contain cross-attention blocks, which can be fed by other models (for example, text encoders). An example of a pipeline using a conditional UNet model is the latent diffusion pipeline.
- AutoencoderKL and VQModel are still experimental models that are prone to breaking changes in the near future. However, they can already be used as part of the Latent Diffusion pipelines.
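As a minimal sketch, here is how an unconditional UNet2DModel can be instantiated from scratch and run for a single forward pass; the configuration values are illustrative and not the defaults of any released checkpoint.

```python
import torch
from diffusers import UNet2DModel

# A small unconditional UNet; the config values here are illustrative only.
model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)

noisy_images = torch.randn(1, 3, 64, 64)  # a batch of noised inputs
timesteps = torch.tensor([10])            # the diffusion step of each sample
noise_pred = model(noisy_images, timesteps).sample
print(noise_pred.shape)                   # torch.Size([1, 3, 64, 64])
```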
📈 Training example
The first release contains a dataset-agnostic unconditional training example and a training notebook:
- The train_unconditional.py example, which trains a DDPM UNet model on a dataset of your choice (a sketch of the core training step follows below).
- More examples can be found under the Hugging Face Diffusers Notebooks.
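As a rough illustration of what the example does under the hood, the sketch below shows a single DDPM training step with the standard noise-prediction objective; the random tensors stand in for a real image batch, and the scheduler's add_noise helper performs the forward diffusion.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

clean_images = torch.randn(4, 3, 64, 64)  # stand-in for a real data batch
noise = torch.randn_like(clean_images)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (4,))

# Forward-diffuse the clean images, then train the UNet to predict the noise.
noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
noise_pred = model(noisy_images, timesteps).sample
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```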
Credits
This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations, which have helped us in our development and without which the API could not have been as polished as it is today:
- @CompVis' latent diffusion models library, available here.
- @hojonathanho's original DDPM implementation, available here, as well as the extremely useful translation into PyTorch by @pesser, available here.
- @ermongroup's DDIM implementation, available here.
- @yang-song's Score-VE and Score-VP implementations, available here.
We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available here.