
Diffusion

Diffusion models learn to denoise data.

Model configs

The model config file for a diffusion model should set the model_type to diffusion_cond if the model uses conditioning, or diffusion_uncond if it does not. The model object should have the following properties (see the sketch after this list for how they nest):

  • diffusion
    • The configuration for the diffusion model itself. See below for more information on the diffusion model config
  • pretransform
    • The configuration of the diffusion model's pretransform, such as an autoencoder for latent diffusion.
    • Optional
  • conditioning
    • The configuration of the various conditioning modules for the diffusion model
    • Only required for diffusion_cond
  • io_channels
    • The base number of input/output channels for the diffusion model
    • Used by inference scripts to determine the shape of the noise to generate for the diffusion model
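
Putting these together, the top of a conditioned model config nests these properties as sketched below. This is a minimal illustration, not a complete config: the io_channels value is a placeholder, and the elided sub-configs are covered in the sections below.

"model_type": "diffusion_cond",
"model": {
    "pretransform": { ... },
    "conditioning": { ... },
    "diffusion": { ... },
    "io_channels": 64
}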

Diffusion configs

  • type
    • The type of model backbone to use for the diffusion model (see "Model types" below)
  • cross_attention_cond_ids
    • Conditioner ids for conditioning information to be used as cross-attention input
    • If multiple ids are specified, the conditioning tensors will be concatenated along the sequence dimension
  • global_cond_ids
    • Conditioner ids for conditioning information to be used as global conditioning input
    • If multiple ids are specified, the conditioning tensors will be concatenated along the channel dimension
  • prepend_cond_ids
    • Conditioner ids for conditioning information to be prepended to the model input
    • If multiple ids are specified, the conditioning tensors will be concatenated along the sequence dimension
    • Only works with diffusion transformer models
  • input_concat_ids
    • Conditioner ids for conditioning information to be concatenated to the model input
    • If multiple ids are specified, the conditioning tensors will be concatenated along the channel dimension
    • If the conditioning tensors are not the same length as the model input, they will be interpolated along the sequence dimension to be the same length.
      • The interpolation algorithm is model-dependent, but usually uses nearest-neighbor resampling.
  • config
    • The configuration for the model backbone itself
    • Model-dependent
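
For example, a conditioned diffusion transformer might route its conditioners as sketched below. This is an illustrative sketch, not a verified config: the "dit" type string is an assumption based on the DiT model type described under "Model types", and the conditioner ids assume conditioners named "prompt", "seconds_start", and "seconds_total" are defined in the conditioning config.

"diffusion": {
    "type": "dit",
    "cross_attention_cond_ids": ["prompt", "seconds_start", "seconds_total"],
    "global_cond_ids": ["seconds_start", "seconds_total"],
    "config": { ... }
}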

Training configs

The training config in the diffusion model config file should have the following properties:

  • learning_rate
    • The learning rate to use during training
    • Uses a constant learning rate by default; can be overridden with optimizer_configs
  • use_ema
    • If true, a copy of the model weights is maintained during training and updated as an exponential moving average of the trained model's weights.
    • Optional. Default: true
  • log_loss_info
    • If true, additional diffusion loss info will be gathered across all GPUs and displayed during training
    • Optional. Default: false
  • loss_configs
    • Configurations for the loss function calculation
    • Optional
  • optimizer_configs
    • Configuration for optimizers and schedulers
    • Optional; overrides learning_rate if provided
  • demo
    • Configuration for the demos during training, including conditioning information

Example config

"training": {
    "use_ema": true,
    "log_loss_info": false,
    "optimizer_configs": {
        "diffusion": {
            "optimizer": {
                "type": "AdamW",
                "config": {
                    "lr": 5e-5,
                    "betas": [0.9, 0.999],
                    "weight_decay": 1e-3
                }
            },
            "scheduler": {
                "type": "InverseLR",
                "config": {
                    "inv_gamma": 1000000,
                    "power": 0.5,
                    "warmup": 0.99
                }
            }
        }
    },
    "demo": { ... }
}

Demo configs

The demo config in the diffusion model training config should have the following properties:

  • demo_every
    • How many training steps between demos
  • demo_steps
    • Number of diffusion timesteps to run for the demos
  • num_demos
    • The number of examples to generate in each demo
  • demo_cond
    • For conditioned diffusion models, the conditioning metadata to provide for each example, given as a list
    • NOTE: The list must be the same length as num_demos (the example below has num_demos set to 4 and four demo_cond entries)
  • demo_cfg_scales
    • For conditioned diffusion models, a list of classifier-free guidance (CFG) scales to render during the demos. Rendering multiple scales can help show how the model responds to different conditioning strengths as training progresses.

Example config

"demo": {
    "demo_every": 2000,
    "demo_steps": 250,
    "num_demos": 4,
    "demo_cond": [
        {"prompt": "A beautiful piano arpeggio", "seconds_start": 0, "seconds_total": 80},
        {"prompt": "A tropical house track with upbeat melodies, a driving bassline, and cheery vibes", "seconds_start": 0, "seconds_total": 250},
        {"prompt": "A cool 80s glam rock song with driving drums and distorted guitars", "seconds_start": 0, "seconds_total": 180},
        {"prompt": "A grand orchestral arrangement", "seconds_start": 0, "seconds_total": 190}
    ],
    "demo_cfg_scales": [3, 6, 9]
}

Model types

A variety of different model types can be used as the underlying backbone for a diffusion model. At the moment, this includes variants of U-Net and Transformer models.

Diffusion Transformers (DiT)

Transformers tend to outperform U-Nets in model quality, but they are much more memory- and compute-intensive, so they work best on shorter sequences such as latent encodings of audio.

Continuous Transformer

This is our custom implementation of a transformer model, based on the x-transformers implementation but with efficiency improvements such as fused QKV layers and Flash Attention 2 support.

x-transformers

This model type uses the ContinuousTransformerWrapper class from the https://github.com/lucidrains/x-transformers repository as the diffusion transformer backbone.

x-transformers is a great baseline transformer implementation with lots of options for various experimental settings. It's great for testing out experimental features without implementing them yourself, but the implementations might not be fully optimized, and breaking changes may be introduced without much warning.

Diffusion U-Net

U-Nets use a hierarchical architecture to gradually downsample the input data before heavier processing is performed, then upsample the data again, using skip connections to pass data across the downsampling "valley" (the "U" in the name) to the upsampling layers at the same resolution.

audio-diffusion-pytorch U-Net (ADP)

This model type uses a modified implementation of the UNetCFG1D class from version 0.0.94 of the https://github.com/archinetai/audio-diffusion-pytorch repo, with added Flash Attention support.

Dance Diffusion U-Net

This is a reimplementation of the U-Net used in Dance Diffusion. It has minimal conditioning support, only supporting global conditioning, and is mostly used for unconditional diffusion models.