Spaces:
Sleeping
Pretransforms
Many models require some fixed transform to be applied to the input audio before the audio is passed in to the trainable layers of the model, as well as a corresponding inverse transform to be applied to the outputs of the model. We refer to these as "pretransforms".
At the moment, stable-audio-tools
supports two pretransforms, frozen autoencoders for latent diffusion models and wavelet decompositions.
Pretransforms have a similar interface to autoencoders with "encode" and "decode" functions defined for each pretransform.
Autoencoder pretransform
To define a model with an autoencoder pretransform, you can define the "pretransform" property in the model config, with the type
property set to autoencoder
. The config
property should be an autoencoder model definition.
Example:
"pretransform": {
"type": "autoencoder",
"config": {
"encoder": {
...
},
"decoder": {
...
}
...normal autoencoder configuration
}
}
Latent rescaling
The original Latent Diffusion paper found that rescaling the latent series to unit variance before performing diffusion improved quality. To this end, we expose a scale
property on autoencoder pretransforms that will take care of this rescaling. The scale should be set to the original standard deviation of the latents, which can be determined experimentally, or by looking at the latent_std
value during training. The pretransform code will divide by this scale factor in the encode
function and multiply by this scale in the decode
function.
Wavelet pretransform
stable-audio-tools
also exposes wavelet decomposition as a pretransform. Wavelet decomposition is a quick way to trade off sequence length for channels in autoencoders, while maintaining a multi-band implicit bias.
Wavelet pretransforms take the following properties:
channels
- The number of input and output audio channels for the wavelet transform
levels
- The number of successive wavelet decompositions to perform. Each level doubles the channel count and halves the sequence length
wavelet
- The specific wavelet from PyWavelets to use, currently limited to
"bior2.2", "bior2.4", "bior2.6", "bior2.8", "bior4.4", "bior6.8"
- The specific wavelet from PyWavelets to use, currently limited to
Future work
We hope to add more filters and transforms to this list, including PQMF and STFT transforms.