# Autoencoders
At a high level, autoencoders are models made up of two parts: an encoder and a decoder.

The encoder takes in a sequence (such as mono or stereo audio) and outputs a compressed representation of that sequence as a d-channel "latent sequence", usually heavily downsampled by a constant factor.

The decoder takes in a d-channel latent sequence and upsamples it back to the original input sequence length, reversing the compression of the encoder.

Autoencoders are trained with a combination of reconstruction and adversarial losses to create a compact, invertible representation of raw audio data. This lets downstream models work in a data-compressed "latent space" with desirable and controllable properties such as reduced sequence length, noise resistance, and discretization.

The autoencoder architectures defined in `stable-audio-tools` are largely fully-convolutional, which allows autoencoders trained on short lengths to be applied to arbitrary-length sequences. For example, an autoencoder trained on 1-second samples could be used to encode 45-second inputs to a latent diffusion model.
## Model configs
The model config file for an autoencoder should set the `model_type` to `autoencoder`, and the `model` object should have the following properties (a complete example combining them is shown after this list):
- `encoder` - Configuration for the autoencoder's encoder half
- `decoder` - Configuration for the autoencoder's decoder half
- `latent_dim` - Latent dimension of the autoencoder, used by inference scripts and downstream models
- `downsampling_ratio` - Downsampling ratio between the input sequence and the latent sequence, used by inference scripts and downstream models
- `io_channels` - Number of input and output channels for the autoencoder when they're the same, used by inference scripts and downstream models
- `bottleneck` - Configuration for the autoencoder's bottleneck
  - Optional
- `pretransform` - A pretransform definition for the autoencoder, such as wavelet decomposition or another autoencoder
  - See pretransforms.md for more information
  - Optional
- `in_channels` - Specifies the number of input channels for the autoencoder, when it's different from `io_channels`, such as in a mono-to-stereo model
  - Optional
- `out_channels` - Specifies the number of output channels for the autoencoder, when it's different from `io_channels`
  - Optional
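Putting these properties together, a complete autoencoder model config might look like the following sketch. It is assembled from the Oobleck encoder/decoder and VAE bottleneck examples shown later in this document, with illustrative values rather than recommendations; any top-level config fields not covered in this section are omitted. Note that the encoder's `latent_dim` (128) is twice the decoder's (64) because the VAE bottleneck splits the encoder output into "mean" and "scale" halves (see the Bottlenecks section), and the `downsampling_ratio` of 512 is just the product of the example strides (2 × 4 × 8 × 8).

```json
{
    "model_type": "autoencoder",
    "model": {
        "encoder": {
            "type": "oobleck",
            "config": {
                "in_channels": 2,
                "channels": 128,
                "c_mults": [1, 2, 4, 8],
                "strides": [2, 4, 8, 8],
                "latent_dim": 128,
                "use_snake": true
            }
        },
        "decoder": {
            "type": "oobleck",
            "config": {
                "out_channels": 2,
                "channels": 128,
                "c_mults": [1, 2, 4, 8],
                "strides": [2, 4, 8, 8],
                "latent_dim": 64,
                "use_snake": true
            }
        },
        "bottleneck": {
            "type": "vae"
        },
        "latent_dim": 64,
        "downsampling_ratio": 512,
        "io_channels": 2
    }
}
```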
## Training configs
The `training` config in the autoencoder model config file should have the following properties (a short example combining them follows the list):
- `learning_rate` - The learning rate to use during training
- `use_ema` - If true, a copy of the model weights is maintained during training and updated as an exponential moving average of the trained model's weights.
  - Optional. Default: `false`
- `warmup_steps` - The number of training steps before turning on adversarial losses
  - Optional. Default: `0`
- `encoder_freeze_on_warmup` - If true, freezes the encoder after the warmup steps have completed, so adversarial training only affects the decoder.
  - Optional. Default: `false`
- `loss_configs` - Configurations for the loss function calculation
  - Optional
- `optimizer_configs` - Configuration for optimizers and schedulers
  - Optional
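Taken together, the scalar properties above combine into a `training` block along the lines of the sketch below (illustrative values only, not tuned recommendations). The `loss_configs` contents are described in the next section, the demo settings further below, and `optimizer_configs` is omitted here so the trainer's defaults apply.

```json
"training": {
    "learning_rate": 1e-4,
    "use_ema": true,
    "warmup_steps": 0,
    "encoder_freeze_on_warmup": false
}
```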
### Loss configs

A few different types of losses are used for autoencoder training, including spectral losses, time-domain losses, adversarial losses, and bottleneck-specific losses.

Hyperparameters for these losses, as well as loss weighting factors, can be configured in the `loss_configs` property in the `training` config. A combined example showing how the individual loss blocks fit together under `loss_configs` is given at the end of this section.
#### Spectral losses
Multi-resolution STFT losses are the main reconstruction loss used for our audio autoencoders. We use the `auraloss` library for our spectral loss functions.

For mono autoencoders (`io_channels` == 1), we use the `MultiResolutionSTFTLoss` module.

For stereo autoencoders (`io_channels` == 2), we use the `SumAndDifferenceSTFTLoss` module.
##### Example config

```json
"spectral": {
    "type": "mrstft",
    "config": {
        "fft_sizes": [2048, 1024, 512, 256, 128, 64, 32],
        "hop_sizes": [512, 256, 128, 64, 32, 16, 8],
        "win_lengths": [2048, 1024, 512, 256, 128, 64, 32],
        "perceptual_weighting": true
    },
    "weights": {
        "mrstft": 1.0
    }
}
```
#### Time-domain loss
We compute the L1 distance between the original audio and the decoded audio to provide a time-domain loss.
##### Example config

```json
"time": {
    "type": "l1",
    "weights": {
        "l1": 0.1
    }
}
```
#### Adversarial losses
Adversarial losses bring in an ensemble of discriminator models that learn to distinguish between real and reconstructed audio, giving the autoencoder a signal about perceptual discrepancies to fix.

We largely rely on the multi-scale STFT discriminator from the EnCodec repo.
##### Example config

```json
"discriminator": {
    "type": "encodec",
    "config": {
        "filters": 32,
        "n_ffts": [2048, 1024, 512, 256, 128],
        "hop_lengths": [512, 256, 128, 64, 32],
        "win_lengths": [2048, 1024, 512, 256, 128]
    },
    "weights": {
        "adversarial": 0.1,
        "feature_matching": 5.0
    }
}
```
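For reference, here is how the individual loss blocks above nest together under the `loss_configs` property. The types and weights are copied from the examples above; a `bottleneck` entry would sit alongside them only when the chosen bottleneck contributes its own loss term, as described in the Bottlenecks section.

```json
"loss_configs": {
    "spectral": {
        "type": "mrstft",
        "config": {
            "fft_sizes": [2048, 1024, 512, 256, 128, 64, 32],
            "hop_sizes": [512, 256, 128, 64, 32, 16, 8],
            "win_lengths": [2048, 1024, 512, 256, 128, 64, 32],
            "perceptual_weighting": true
        },
        "weights": {
            "mrstft": 1.0
        }
    },
    "time": {
        "type": "l1",
        "weights": {
            "l1": 0.1
        }
    },
    "discriminator": {
        "type": "encodec",
        "config": {
            "filters": 32,
            "n_ffts": [2048, 1024, 512, 256, 128],
            "hop_lengths": [512, 256, 128, 64, 32],
            "win_lengths": [2048, 1024, 512, 256, 128]
        },
        "weights": {
            "adversarial": 0.1,
            "feature_matching": 5.0
        }
    }
}
```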
### Demo config

The only property to set for autoencoder training demos is the `demo_every` property, which determines the number of steps between demos.
#### Example config

```json
"demo": {
    "demo_every": 2000
}
```
## Encoder and decoder types
Encoders and decoders are defined separately in the model configuration, so encoders and decoders from different model architectures and libraries can be used interchangeably.
### Oobleck
Oobleck is Harmonai's in-house autoencoder architecture, implementing features from a variety of other autoencoder architectures.
#### Example config

```json
"encoder": {
    "type": "oobleck",
    "config": {
        "in_channels": 2,
        "channels": 128,
        "c_mults": [1, 2, 4, 8],
        "strides": [2, 4, 8, 8],
        "latent_dim": 128,
        "use_snake": true
    }
},
"decoder": {
    "type": "oobleck",
    "config": {
        "out_channels": 2,
        "channels": 128,
        "c_mults": [1, 2, 4, 8],
        "strides": [2, 4, 8, 8],
        "latent_dim": 64,
        "use_snake": true,
        "use_nearest_upsample": false
    }
}
```
### DAC

This is the Encoder and Decoder definitions from the `descript-audio-codec` repo. It's a simple fully-convolutional autoencoder with channels doubling every level. The encoder and decoder configs are passed directly into the constructors for the DAC Encoder and Decoder.

Note: This does not include the DAC quantizer and does not load pre-trained DAC models; it only provides the encoder and decoder definitions.
#### Example config

```json
"encoder": {
    "type": "dac",
    "config": {
        "in_channels": 2,
        "latent_dim": 32,
        "d_model": 128,
        "strides": [2, 4, 4, 4]
    }
},
"decoder": {
    "type": "dac",
    "config": {
        "out_channels": 2,
        "latent_dim": 32,
        "channels": 1536,
        "rates": [4, 4, 4, 2]
    }
}
```
### SEANet

This is the SEANetEncoder and SEANetDecoder definitions from Meta's EnCodec repo. This is the same encoder and decoder architecture used in the EnCodec models used by MusicGen, without the quantizer.

The encoder and decoder configs are passed directly into the SEANetEncoder and SEANetDecoder classes, though we reverse the input order of the strides (ratios) in the encoder to make it consistent with the order in the decoder.
#### Example config

```json
"encoder": {
    "type": "seanet",
    "config": {
        "channels": 2,
        "dimension": 128,
        "n_filters": 64,
        "ratios": [4, 4, 8, 8],
        "n_residual_layers": 1,
        "dilation_base": 2,
        "lstm": 2,
        "norm": "weight_norm"
    }
},
"decoder": {
    "type": "seanet",
    "config": {
        "channels": 2,
        "dimension": 64,
        "n_filters": 64,
        "ratios": [4, 4, 8, 8],
        "n_residual_layers": 1,
        "dilation_base": 2,
        "lstm": 2,
        "norm": "weight_norm"
    }
}
```
## Bottlenecks

In our terminology, the "bottleneck" of an autoencoder is a module placed between the encoder and decoder to enforce particular constraints on the latent space the encoder creates.

Bottlenecks have a similar interface to the autoencoder, with `encode()` and `decode()` functions defined. Some bottlenecks return extra information in addition to the output latent series, such as quantized token indices, or additional losses to be considered during training.

To define a bottleneck for the autoencoder, provide the `bottleneck` object in the autoencoder's model configuration, specifying the bottleneck `type` and, where applicable, a `config` object, as shown in the examples below.
### VAE

The Variational Autoencoder (VAE) bottleneck splits the encoder's output in half along the channel dimension, treats the two halves as the "mean" and "scale" parameters for VAE sampling, and performs the latent sampling. At a basic level, the "scale" values determine the amount of noise to add to the "mean" latents, which creates a noise-resistant latent space where more of the latent space decodes to perceptually "valid" audio. This is particularly helpful for diffusion models, where the outputs of the diffusion sampling process retain a bit of Gaussian error noise.

Note: For the VAE bottleneck to work, the output dimension of the encoder must be twice the size of the input dimension for the decoder.
#### Example config

```json
"bottleneck": {
    "type": "vae"
}
```
#### Extra info

The VAE bottleneck also returns a `kl` value in the encoder info. This is the KL divergence between the encoded/sampled latent distribution and a Gaussian distribution. By including this value as a loss value to optimize, we push our latent distribution closer to a normal distribution, potentially trading off reconstruction quality.
#### Example loss config

```json
"bottleneck": {
    "type": "kl",
    "weights": {
        "kl": 1e-4
    }
}
```
### Tanh
This bottleneck applies the tanh function to the latent series, "soft-clipping" the latent values to be between -1 and 1. This is a quick and dirty way to enforce a limit on the variance of the latent space, but training these models can be unstable as it's seemingly easy for the latent space to saturate the values to -1 or 1 and never recover.
#### Example config

```json
"bottleneck": {
    "type": "tanh"
}
```
### Wasserstein
The Wasserstein bottleneck implements the WAE-MMD regularization method from the Wasserstein Auto-Encoders paper, calculating the Maximum Mean Discrepancy (MMD) between the latent space and a Gaussian distribution. Including this value as a loss value to optimize leads to a more Gaussian latent space, but does not require stochastic sampling as with a VAE, so the encoder is deterministic.

The Wasserstein bottleneck also exposes the `noise_augment_dim` property, which concatenates `noise_augment_dim` channels of Gaussian noise to the latent series before passing it into the decoder. This adds some stochasticity to the latents, which can be helpful for adversarial training, while keeping the encoder outputs deterministic.

Note: The MMD calculation is very VRAM-intensive for longer sequence lengths, so training a Wasserstein autoencoder is best done on autoencoders with a decent downsampling factor, or on short sequence lengths. For inference, the MMD calculation is disabled.
#### Example config

```json
"bottleneck": {
    "type": "wasserstein"
}
```
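To use the noise augmentation described above, `noise_augment_dim` would go in the bottleneck's `config` object, along the lines of the sketch below. This assumes the `config` dict is passed through to the bottleneck's constructor, as it is for the quantizer bottlenecks later in this section; the value 2 is purely illustrative.

```json
"bottleneck": {
    "type": "wasserstein",
    "config": {
        "noise_augment_dim": 2
    }
}
```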
#### Extra info

This bottleneck adds the `mmd` value to the encoder info, representing the Maximum Mean Discrepancy.
#### Example loss config

```json
"bottleneck": {
    "type": "mmd",
    "weights": {
        "mmd": 100
    }
}
```
### L2 normalization (Spherical autoencoder)

The L2 normalization bottleneck normalizes the latents across the channel dimension, projecting the latents onto a d-dimensional hypersphere. This acts as a form of latent space normalization.
#### Example config

```json
"bottleneck": {
    "type": "l2_norm"
}
```
### RVQ

Residual vector quantization (RVQ) is currently the leading method for learning discrete neural audio codecs (tokenizers for audio). In vector quantization, each item in the latent sequence is individually "snapped" to the nearest vector in a discrete "codebook" of learned vectors. The index of that vector in the codebook can then be used as a token index for things like autoregressive transformers. Residual vector quantization improves the precision of normal vector quantization by adding additional codebooks. For a deeper dive into RVQ, check out this blog post by Dr. Scott Hawley.

This RVQ bottleneck uses lucidrains' implementation from the `vector-quantize-pytorch` repo, which provides a lot of different quantizer options. The bottleneck config is passed through to the `ResidualVQ` constructor.

Note: This RVQ implementation uses manual replacement of codebook vectors to reduce codebook collapse. This does not work with multi-GPU training, as the random replacement is not synchronized across devices.
#### Example config

```json
"bottleneck": {
    "type": "rvq",
    "config": {
        "num_quantizers": 4,
        "codebook_size": 2048,
        "dim": 1024,
        "decay": 0.99
    }
}
```
### DAC RVQ

This is the residual vector quantization implementation from the `descript-audio-codec` repo. It differs from the above implementation in that it does not use manual replacement to improve codebook usage, but instead uses learnable linear layers to project the latents down to a lower-dimensional space before performing the individual quantization operations. This means it's compatible with distributed training.

The bottleneck config is passed directly into the `ResidualVectorQuantize` constructor.

The `quantize_on_decode` property is also exposed, which moves the quantization process to the decoder. This should not be used during training, but it is helpful when training latent diffusion models that use the quantization process as a way to remove error after the diffusion sampling process.
#### Example config

```json
"bottleneck": {
    "type": "dac_rvq",
    "config": {
        "input_dim": 64,
        "n_codebooks": 9,
        "codebook_dim": 32,
        "codebook_size": 1024,
        "quantizer_dropout": 0.5
    }
}
```
#### Extra info

The DAC RVQ bottleneck also adds the following properties to the `info` object:

- `pre_quantizer` - The pre-quantization latent series, useful in combination with `quantize_on_decode` for training latent diffusion models
- `vq/commitment_loss` - Commitment loss for the quantizer
- `vq/codebook_loss` - Codebook loss for the quantizer