Spaces: Runtime error
Commit 5a9b731
Parent(s): 98a3a53
uploading audio diffusion attacks
This view is limited to 50 files because it contains too many changes. See raw diff
- .DS_Store +0 -0
- audio_diffusion_attacks +0 -1
- audio_diffusion_attacks_forhf/.DS_Store +0 -0
- audio_diffusion_attacks_forhf/README.md +37 -0
- audio_diffusion_attacks_forhf/assets/.DS_Store +0 -0
- audio_diffusion_attacks_forhf/assets/audios/.DS_Store +0 -0
- audio_diffusion_attacks_forhf/assets/audios/hyperpop.wav +0 -0
- audio_diffusion_attacks_forhf/assets/example_MAS.png +0 -0
- audio_diffusion_attacks_forhf/assets/example_duration.png +0 -0
- audio_diffusion_attacks_forhf/assets/example_mel.png +0 -0
- audio_diffusion_attacks_forhf/assets/example_untrained_phone_encoding.png +0 -0
- audio_diffusion_attacks_forhf/assets/gradtts_system.png +0 -0
- audio_diffusion_attacks_forhf/audio_ethics.yml +0 -0
- audio_diffusion_attacks_forhf/config.yml +2 -0
- audio_diffusion_attacks_forhf/gen_audio_ethics_3.10.yml +8 -0
- audio_diffusion_attacks_forhf/models/.DS_Store +0 -0
- audio_diffusion_attacks_forhf/models/__init__.py +0 -0
- audio_diffusion_attacks_forhf/models/__pycache__/__init__.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/models/__pycache__/phoneme_encoder.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/models/__pycache__/style_diffusion.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/models/__pycache__/utils.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/models/datasets/__pycache__/music_datasets.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/models/datasets/music_datasets.py +65 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/.DS_Store +0 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/__init__.py +23 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/__pycache__/__init__.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/build/temp.linux-x86_64-cpython-310/core.o +0 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/core.c +0 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/core.cpython-310-x86_64-linux-gnu.so +0 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/core.pyx +45 -0
- audio_diffusion_attacks_forhf/models/monotonic_align/setup.py +11 -0
- audio_diffusion_attacks_forhf/models/phoneme_encoder.py +363 -0
- audio_diffusion_attacks_forhf/models/style_diffusion.py +111 -0
- audio_diffusion_attacks_forhf/models/utils.py +77 -0
- audio_diffusion_attacks_forhf/notebooks/data_exploration/00_fma_exploration.ipynb +0 -0
- audio_diffusion_attacks_forhf/resources/cmu_dictionary +0 -0
- audio_diffusion_attacks_forhf/scripts/.DS_Store +0 -0
- audio_diffusion_attacks_forhf/scripts/data_processing/process_music_mels.py +106 -0
- audio_diffusion_attacks_forhf/scripts/data_processing/process_music_numpy.py +74 -0
- audio_diffusion_attacks_forhf/scripts/train/music_models/train_music_completion.py +243 -0
- audio_diffusion_attacks_forhf/scripts/train/train_tts.py +430 -0
- audio_diffusion_attacks_forhf/src/.DS_Store +0 -0
- audio_diffusion_attacks_forhf/src/__pycache__/losses.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/src/__pycache__/music_gen.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/src/__pycache__/test_encoder_attack.cpython-310.pyc +0 -0
- audio_diffusion_attacks_forhf/src/balancer.py +137 -0
- audio_diffusion_attacks_forhf/src/losses.py +329 -0
- audio_diffusion_attacks_forhf/src/music_gen.py +100 -0
- audio_diffusion_attacks_forhf/src/speech_inference.py +94 -0
- audio_diffusion_attacks_forhf/src/test_audio/.Il Sogno Del Marinaio - Nanos' Waltz.mp3.icloud +0 -0
.DS_Store
CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
audio_diffusion_attacks
DELETED
@@ -1 +0,0 @@
-Subproject commit 1aaf4563762c407f31436ad452a72dd5af929443
audio_diffusion_attacks_forhf/.DS_Store
ADDED
Binary file (6.15 kB).
audio_diffusion_attacks_forhf/README.md
ADDED
@@ -0,0 +1,37 @@
## Audio Data Ownership

## Installation

conda env create -n audio_ethics --file gen_audio_ethics_3.10.yml

To set up wandb, please check out the following link: [https://docs.wandb.ai/quickstart](https://docs.wandb.ai/quickstart)

## Run Encoder Attack

cd src

python test_encoder_attack.py

## Overview

## Task 1: Audio Completion with Diffusion Models

For this task, we use the [Free Music Archive (FMA)](https://github.com/mdeff/fma), which is a collection of royalty-free music. You can use any version of the dataset you wish, but we'll use the `fma_large` partition for training an initial system.

Note: If your librosa version is too high, you have to edit the corresponding line in audioldm to be `fft_window = pad_center(fft_window, size=filter_length)`.

To preprocess FMA, configure the file with your corresponding path and run the correct preprocessing script to convert the `.mp3` files to numpy (loading audio files during training is prohibitively slow).
- Preprocessing for ArchiSound encoders: `nohup python -u scripts/data_processing/process_music_numpy.py > logs/process_48k_music.out &`

## Task 2: TTS with Diffusion Models

TTS with diffusion (or flow) models is currently one of several approaches to SOTA TTS. In this repo, we have a model similar to [Grad-TTS](https://grad-tts.github.io/), with the example inference for Grad-TTS below:

![Inference Figure for Grad-TTS](./assets/gradtts_system.png)

To run, first you need to build the `monotonic_align` code:

`cd models/monotonic_align; python setup.py build_ext --inplace; cd ../..`

You may have to move the generated .so file to the `monotonic_align/` directory if it is generated in `monotonic_align/build/`.
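The librosa note in the README above is easiest to see as code. A minimal, self-contained sketch (the hann/1024 values are illustrative, taken from the STFT config used later in this commit): newer librosa releases make `size` a keyword-only argument of `librosa.util.pad_center`, so the older positional call inside audioldm fails until it is rewritten like this.

# Hedged sketch of the edit described in the README note; window sizes are illustrative.
import numpy as np
from scipy.signal import get_window
from librosa.util import pad_center

win_length, filter_length = 1024, 1024            # values matching the STFT config in this commit
fft_window = get_window("hann", win_length, fftbins=True)
fft_window = pad_center(fft_window, size=filter_length)  # keyword form required by newer librosa
print(fft_window.shape)                           # (1024,)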
audio_diffusion_attacks_forhf/assets/.DS_Store
ADDED
Binary file (6.15 kB).
audio_diffusion_attacks_forhf/assets/audios/.DS_Store
ADDED
Binary file (6.15 kB).
audio_diffusion_attacks_forhf/assets/audios/hyperpop.wav
ADDED
Binary file (640 kB).
audio_diffusion_attacks_forhf/assets/example_MAS.png
ADDED
audio_diffusion_attacks_forhf/assets/example_duration.png
ADDED
audio_diffusion_attacks_forhf/assets/example_mel.png
ADDED
audio_diffusion_attacks_forhf/assets/example_untrained_phone_encoding.png
ADDED
audio_diffusion_attacks_forhf/assets/gradtts_system.png
ADDED
audio_diffusion_attacks_forhf/audio_ethics.yml
ADDED
File without changes
audio_diffusion_attacks_forhf/config.yml
ADDED
@@ -0,0 +1,2 @@
wandb_settings:
  project_name: audio_attacks
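For context, a hedged sketch of how a script could consume this config. The reading side is not shown in this commit, so the PyYAML pattern below is an assumption; only the key names come from the file above.

# Hypothetical reader for config.yml (not taken from this commit).
import yaml
import wandb

with open("audio_diffusion_attacks_forhf/config.yml") as f:
    cfg = yaml.safe_load(f)

wandb.init(project=cfg["wandb_settings"]["project_name"])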
audio_diffusion_attacks_forhf/gen_audio_ethics_3.10.yml
ADDED
@@ -0,0 +1,8 @@
name: gen_audio_ethics_3.10
channels:
  - defaults
dependencies:
  - python=3.10
  - conda-forge::libsndfile
  - librosa
prefix: /home/willie/anaconda3/envs/gen_audio_ethics_3.10
audio_diffusion_attacks_forhf/models/.DS_Store
ADDED
Binary file (6.15 kB).
audio_diffusion_attacks_forhf/models/__init__.py
ADDED
File without changes
audio_diffusion_attacks_forhf/models/__pycache__/__init__.cpython-310.pyc
ADDED
Binary file (173 Bytes).
audio_diffusion_attacks_forhf/models/__pycache__/phoneme_encoder.cpython-310.pyc
ADDED
Binary file (12 kB).
audio_diffusion_attacks_forhf/models/__pycache__/style_diffusion.cpython-310.pyc
ADDED
Binary file (4.38 kB).
audio_diffusion_attacks_forhf/models/__pycache__/utils.cpython-310.pyc
ADDED
Binary file (2.54 kB).
audio_diffusion_attacks_forhf/models/datasets/__pycache__/music_datasets.cpython-310.pyc
ADDED
Binary file (1.89 kB).
audio_diffusion_attacks_forhf/models/datasets/music_datasets.py
ADDED
@@ -0,0 +1,65 @@
"""
music_datasets.py
Desc: Contains the code for the music datasets.
"""

import torch
from torch.utils.data import Dataset
import torchaudio
import numpy as np
import pandas as pd


"""
MusicMelDataset:
Given pre-processed mel-spectrograms, return a chunk of audio from the mel, with a masked version of a defined length
Args:
    audio_files: List of .npy files consisting of mel-specs
    audio_len: length in seconds (roughly) of audio to be returned
    mask_ratio: Size of mask as a ratio of audio_len
    mask_start: Where the mask starts for learning
        "midpoint": always mask out the second half of the mel-spec
    crop_start: Where the starting point for the sample of audio is taken
        "random": Random valid starting point from audio is taken
"""
class MusicMelDataset(Dataset):
    def __init__(self, audio_files, audio_len = 6, mask_ratio = 0.5, mask_start = "midpoint", crop_start = "random"):
        self.audio_files = audio_files

        # Convert length to number of frames
        self.audio_len = int(audio_len * 100) # 100 is heuristic conversion made
        self.mask_ratio = mask_ratio
        self.mask_len = int(np.floor(self.audio_len * mask_ratio))
        self.mask_start = mask_start
        self.crop_start = crop_start

    def __len__(self):
        return len(self.audio_files)

    # Get a random crop using audio_length
    def get_random_crop(self, mel):
        crop_start = torch.randint(0, mel.shape[0] - self.audio_len - 1, (1,))
        return mel[crop_start:crop_start + self.audio_len, :]

    def __getitem__(self, idx):
        mel = torch.Tensor(np.load(self.audio_files[idx]))

        if self.crop_start == "random":
            mel = self.get_random_crop(mel)
        else:
            raise NotImplementedError(f"{self.crop_start} is not an implemented parameter for crop_start")

        mask = torch.ones_like(mel)
        if self.mask_start == "midpoint":
            if self.mask_ratio == 0.5:
                mask[self.mask_len:, :] = 0
            else:
                # Zero out mask_len frames starting at the midpoint
                mask[self.audio_len // 2 : self.audio_len // 2 + self.mask_len, :] = 0
        else:
            raise NotImplementedError(f"{self.mask_start} is not an implemented parameter for mask_start")

        mel_mask = mel*mask

        return mel, mel_mask
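A hedged usage sketch for the dataset class above (the glob path is a placeholder, not from this commit); it shows the (mel, masked_mel) pairs that the midpoint masking produces.

# Hypothetical usage of MusicMelDataset; path and batch size are placeholders.
import glob
import torch
from models.datasets.music_datasets import MusicMelDataset

files = glob.glob("/path/to/fma_processed/*/*.npy")   # placeholder location of preprocessed mels
dataset = MusicMelDataset(files, audio_len=5.12, mask_ratio=0.5)
loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)

mel, mel_masked = next(iter(loader))
print(mel.shape, mel_masked.shape)   # e.g. torch.Size([4, 512, 64]) for 64-bin mels at ~100 frames/s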
audio_diffusion_attacks_forhf/models/monotonic_align/.DS_Store
ADDED
Binary file (6.15 kB).
audio_diffusion_attacks_forhf/models/monotonic_align/__init__.py
ADDED
@@ -0,0 +1,23 @@
""" from https://github.com/jaywalnut310/glow-tts """

import numpy as np
import torch
from .core import maximum_path_c


def maximum_path(value, mask):
    """ Cython optimised version.
    value: [b, t_x, t_y]
    mask: [b, t_x, t_y]
    """
    value = value * mask
    device = value.device
    dtype = value.dtype
    value = value.data.cpu().numpy().astype(np.float32)
    path = np.zeros_like(value).astype(np.int32)
    mask = mask.data.cpu().numpy()

    t_x_max = mask.sum(1)[:, 0].astype(np.int32)
    t_y_max = mask.sum(2)[:, 0].astype(np.int32)
    maximum_path_c(path, value, t_x_max, t_y_max)
    return torch.from_numpy(path).to(device=device, dtype=dtype)
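A hedged usage sketch for `maximum_path` with toy tensors (shapes are illustrative); it assumes the Cython extension built by the `setup.py` further down is already compiled.

# Hypothetical call to the monotonic alignment search; value holds alignment scores
# [b, t_x, t_y], mask marks valid positions, and the result is a hard monotonic path.
import torch
from models.monotonic_align import maximum_path

b, t_x, t_y = 2, 5, 9
value = torch.randn(b, t_x, t_y)
mask = torch.ones(b, t_x, t_y)              # all positions valid in this toy example
path = maximum_path(value, mask)
print(path.shape, path.sum(dim=1).max())    # torch.Size([2, 5, 9]); each frame aligns to exactly one phoneme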
audio_diffusion_attacks_forhf/models/monotonic_align/__pycache__/__init__.cpython-310.pyc
ADDED
Binary file (881 Bytes).
audio_diffusion_attacks_forhf/models/monotonic_align/build/temp.linux-x86_64-cpython-310/core.o
ADDED
Binary file (236 kB).
audio_diffusion_attacks_forhf/models/monotonic_align/core.c
ADDED
The diff for this file is too large to render.
See raw diff
audio_diffusion_attacks_forhf/models/monotonic_align/core.cpython-310-x86_64-linux-gnu.so
ADDED
Binary file (178 kB).
audio_diffusion_attacks_forhf/models/monotonic_align/core.pyx
ADDED
@@ -0,0 +1,45 @@
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange


@cython.boundscheck(False)
@cython.wraparound(False)
cdef void maximum_path_each(int[:,::1] path, float[:,::1] value, int t_x, int t_y, float max_neg_val) nogil:
    cdef int x
    cdef int y
    cdef float v_prev
    cdef float v_cur
    cdef float tmp
    cdef int index = t_x - 1

    for y in range(t_y):
        for x in range(max(0, t_x + y - t_y), min(t_x, y + 1)):
            if x == y:
                v_cur = max_neg_val
            else:
                v_cur = value[x, y-1]
            if x == 0:
                if y == 0:
                    v_prev = 0.
                else:
                    v_prev = max_neg_val
            else:
                v_prev = value[x-1, y-1]
            value[x, y] = max(v_cur, v_prev) + value[x, y]

    for y in range(t_y - 1, -1, -1):
        path[index, y] = 1
        if index != 0 and (index == y or value[index, y-1] < value[index-1, y-1]):
            index = index - 1


@cython.boundscheck(False)
@cython.wraparound(False)
cpdef void maximum_path_c(int[:,:,::1] paths, float[:,:,::1] values, int[::1] t_xs, int[::1] t_ys, float max_neg_val=-1e9) nogil:
    cdef int b = values.shape[0]

    cdef int i
    for i in prange(b, nogil=True):
        maximum_path_each(paths[i], values[i], t_xs[i], t_ys[i], max_neg_val)
audio_diffusion_attacks_forhf/models/monotonic_align/setup.py
ADDED
@@ -0,0 +1,11 @@
""" from https://github.com/jaywalnut310/glow-tts """

from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    name = 'monotonic_align',
    ext_modules = cythonize("core.pyx"),
    include_dirs=[numpy.get_include()]
)
audio_diffusion_attacks_forhf/models/phoneme_encoder.py
ADDED
@@ -0,0 +1,363 @@
""" from https://github.com/jaywalnut310/glow-tts and https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS/"""

import math

import numpy as np  # np.prod is used in BaseModule.nparams
import torch

from models.utils import sequence_mask, convert_pad_shape


# def sequence_mask(length, max_length=None):
#     if max_length is None:
#         max_length = length.max()
#     x = torch.arange(int(max_length), dtype=length.dtype, device=length.device)
#     return x.unsqueeze(0) < length.unsqueeze(1)

# def convert_pad_shape(pad_shape):
#     l = pad_shape[::-1]
#     pad_shape = [item for sublist in l for item in sublist]
#     return pad_shape

class BaseModule(torch.nn.Module):
    def __init__(self):
        super(BaseModule, self).__init__()

    @property
    def nparams(self):
        """
        Returns number of trainable parameters of the module.
        """
        num_params = 0
        for name, param in self.named_parameters():
            if param.requires_grad:
                num_params += np.prod(param.detach().cpu().numpy().shape)
        return num_params


    def relocate_input(self, x: list):
        """
        Relocates provided tensors to the same device set for the module.
        """
        device = next(self.parameters()).device
        for i in range(len(x)):
            if isinstance(x[i], torch.Tensor) and x[i].device != device:
                x[i] = x[i].to(device)
        return x

class LayerNorm(BaseModule):
    def __init__(self, channels, eps=1e-4):
        super(LayerNorm, self).__init__()
        self.channels = channels
        self.eps = eps

        self.gamma = torch.nn.Parameter(torch.ones(channels))
        self.beta = torch.nn.Parameter(torch.zeros(channels))

    def forward(self, x):
        n_dims = len(x.shape)
        mean = torch.mean(x, 1, keepdim=True)
        variance = torch.mean((x - mean)**2, 1, keepdim=True)

        x = (x - mean) * torch.rsqrt(variance + self.eps)

        shape = [1, -1] + [1] * (n_dims - 2)
        x = x * self.gamma.view(*shape) + self.beta.view(*shape)
        return x


class ConvReluNorm(BaseModule):
    def __init__(self, in_channels, hidden_channels, out_channels, kernel_size,
                 n_layers, p_dropout):
        super(ConvReluNorm, self).__init__()
        self.in_channels = in_channels
        self.hidden_channels = hidden_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.n_layers = n_layers
        self.p_dropout = p_dropout

        self.conv_layers = torch.nn.ModuleList()
        self.norm_layers = torch.nn.ModuleList()
        self.conv_layers.append(torch.nn.Conv1d(in_channels, hidden_channels,
                                                kernel_size, padding=kernel_size//2))
        self.norm_layers.append(LayerNorm(hidden_channels))
        self.relu_drop = torch.nn.Sequential(torch.nn.ReLU(), torch.nn.Dropout(p_dropout))
        for _ in range(n_layers - 1):
            self.conv_layers.append(torch.nn.Conv1d(hidden_channels, hidden_channels,
                                                    kernel_size, padding=kernel_size//2))
            self.norm_layers.append(LayerNorm(hidden_channels))
        self.proj = torch.nn.Conv1d(hidden_channels, out_channels, 1)
        self.proj.weight.data.zero_()
        self.proj.bias.data.zero_()

    def forward(self, x, x_mask):
        x_org = x
        for i in range(self.n_layers):
            x = self.conv_layers[i](x * x_mask)
            x = self.norm_layers[i](x)
            x = self.relu_drop(x)
        x = x_org + self.proj(x)
        return x * x_mask


class DurationPredictor(BaseModule):
    def __init__(self, in_channels, filter_channels, kernel_size, p_dropout):
        super(DurationPredictor, self).__init__()
        self.in_channels = in_channels
        self.filter_channels = filter_channels
        self.p_dropout = p_dropout

        self.drop = torch.nn.Dropout(p_dropout)
        self.conv_1 = torch.nn.Conv1d(in_channels, filter_channels,
                                      kernel_size, padding=kernel_size//2)
        self.norm_1 = LayerNorm(filter_channels)
        self.conv_2 = torch.nn.Conv1d(filter_channels, filter_channels,
                                      kernel_size, padding=kernel_size//2)
        self.norm_2 = LayerNorm(filter_channels)
        self.proj = torch.nn.Conv1d(filter_channels, 1, 1)

    def forward(self, x, x_mask):
        x = self.conv_1(x * x_mask)
        x = torch.relu(x)
        x = self.norm_1(x)
        x = self.drop(x)
        x = self.conv_2(x * x_mask)
        x = torch.relu(x)
        x = self.norm_2(x)
        x = self.drop(x)
        x = self.proj(x * x_mask)
        return x * x_mask


class MultiHeadAttention(BaseModule):
    def __init__(self, channels, out_channels, n_heads, window_size=None,
                 heads_share=True, p_dropout=0.0, proximal_bias=False,
                 proximal_init=False):
        super(MultiHeadAttention, self).__init__()
        assert channels % n_heads == 0

        self.channels = channels
        self.out_channels = out_channels
        self.n_heads = n_heads
        self.window_size = window_size
        self.heads_share = heads_share
        self.proximal_bias = proximal_bias
        self.p_dropout = p_dropout
        self.attn = None

        self.k_channels = channels // n_heads
        self.conv_q = torch.nn.Conv1d(channels, channels, 1)
        self.conv_k = torch.nn.Conv1d(channels, channels, 1)
        self.conv_v = torch.nn.Conv1d(channels, channels, 1)
        if window_size is not None:
            n_heads_rel = 1 if heads_share else n_heads
            rel_stddev = self.k_channels**-0.5
            self.emb_rel_k = torch.nn.Parameter(torch.randn(n_heads_rel,
                             window_size * 2 + 1, self.k_channels) * rel_stddev)
            self.emb_rel_v = torch.nn.Parameter(torch.randn(n_heads_rel,
                             window_size * 2 + 1, self.k_channels) * rel_stddev)
        self.conv_o = torch.nn.Conv1d(channels, out_channels, 1)
        self.drop = torch.nn.Dropout(p_dropout)

        torch.nn.init.xavier_uniform_(self.conv_q.weight)
        torch.nn.init.xavier_uniform_(self.conv_k.weight)
        if proximal_init:
            self.conv_k.weight.data.copy_(self.conv_q.weight.data)
            self.conv_k.bias.data.copy_(self.conv_q.bias.data)
        torch.nn.init.xavier_uniform_(self.conv_v.weight)

    def forward(self, x, c, attn_mask=None):
        q = self.conv_q(x)
        k = self.conv_k(c)
        v = self.conv_v(c)

        x, self.attn = self.attention(q, k, v, mask=attn_mask)

        x = self.conv_o(x)
        return x

    def attention(self, query, key, value, mask=None):
        b, d, t_s, t_t = (*key.size(), query.size(2))
        query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
        key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
        value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)

        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.k_channels)
        if self.window_size is not None:
            assert t_s == t_t, "Relative attention is only available for self-attention."
            key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
            rel_logits = self._matmul_with_relative_keys(query, key_relative_embeddings)
            rel_logits = self._relative_position_to_absolute_position(rel_logits)
            scores_local = rel_logits / math.sqrt(self.k_channels)
            scores = scores + scores_local
        if self.proximal_bias:
            assert t_s == t_t, "Proximal bias is only available for self-attention."
            scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device,
                                                                    dtype=scores.dtype)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e4)
        p_attn = torch.nn.functional.softmax(scores, dim=-1)
        p_attn = self.drop(p_attn)
        output = torch.matmul(p_attn, value)
        if self.window_size is not None:
            relative_weights = self._absolute_position_to_relative_position(p_attn)
            value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s)
            output = output + self._matmul_with_relative_values(relative_weights,
                                                                value_relative_embeddings)
        output = output.transpose(2, 3).contiguous().view(b, d, t_t)
        return output, p_attn

    def _matmul_with_relative_values(self, x, y):
        ret = torch.matmul(x, y.unsqueeze(0))
        return ret

    def _matmul_with_relative_keys(self, x, y):
        ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
        return ret

    def _get_relative_embeddings(self, relative_embeddings, length):
        pad_length = max(length - (self.window_size + 1), 0)
        slice_start_position = max((self.window_size + 1) - length, 0)
        slice_end_position = slice_start_position + 2 * length - 1
        if pad_length > 0:
            padded_relative_embeddings = torch.nn.functional.pad(
                relative_embeddings, convert_pad_shape([[0, 0],
                [pad_length, pad_length], [0, 0]]))
        else:
            padded_relative_embeddings = relative_embeddings
        used_relative_embeddings = padded_relative_embeddings[:,
            slice_start_position:slice_end_position]
        return used_relative_embeddings

    def _relative_position_to_absolute_position(self, x):
        batch, heads, length, _ = x.size()
        x = torch.nn.functional.pad(x, convert_pad_shape([[0,0],[0,0],[0,0],[0,1]]))
        x_flat = x.view([batch, heads, length * 2 * length])
        x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0,0],[0,0],[0,length-1]]))
        x_final = x_flat.view([batch, heads, length+1, 2*length-1])[:, :, :length, length-1:]
        return x_final

    def _absolute_position_to_relative_position(self, x):
        batch, heads, length, _ = x.size()
        x = torch.nn.functional.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length-1]]))
        x_flat = x.view([batch, heads, length**2 + length*(length - 1)])
        x_flat = torch.nn.functional.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
        x_final = x_flat.view([batch, heads, length, 2*length])[:,:,:,1:]
        return x_final

    def _attention_bias_proximal(self, length):
        r = torch.arange(length, dtype=torch.float32)
        diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
        return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)


class FFN(BaseModule):
    def __init__(self, in_channels, out_channels, filter_channels, kernel_size,
                 p_dropout=0.0):
        super(FFN, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.filter_channels = filter_channels
        self.kernel_size = kernel_size
        self.p_dropout = p_dropout

        self.conv_1 = torch.nn.Conv1d(in_channels, filter_channels, kernel_size,
                                      padding=kernel_size//2)
        self.conv_2 = torch.nn.Conv1d(filter_channels, out_channels, kernel_size,
                                      padding=kernel_size//2)
        self.drop = torch.nn.Dropout(p_dropout)

    def forward(self, x, x_mask):
        x = self.conv_1(x * x_mask)
        x = torch.relu(x)
        x = self.drop(x)
        x = self.conv_2(x * x_mask)
        return x * x_mask


class Encoder(BaseModule):
    def __init__(self, hidden_channels, filter_channels, n_heads, n_layers,
                 kernel_size=1, p_dropout=0.0, window_size=None, **kwargs):
        super(Encoder, self).__init__()
        self.hidden_channels = hidden_channels
        self.filter_channels = filter_channels
        self.n_heads = n_heads
        self.n_layers = n_layers
        self.kernel_size = kernel_size
        self.p_dropout = p_dropout
        self.window_size = window_size

        self.drop = torch.nn.Dropout(p_dropout)
        self.attn_layers = torch.nn.ModuleList()
        self.norm_layers_1 = torch.nn.ModuleList()
        self.ffn_layers = torch.nn.ModuleList()
        self.norm_layers_2 = torch.nn.ModuleList()
        for _ in range(self.n_layers):
            self.attn_layers.append(MultiHeadAttention(hidden_channels, hidden_channels,
                                    n_heads, window_size=window_size, p_dropout=p_dropout))
            self.norm_layers_1.append(LayerNorm(hidden_channels))
            self.ffn_layers.append(FFN(hidden_channels, hidden_channels,
                                   filter_channels, kernel_size, p_dropout=p_dropout))
            self.norm_layers_2.append(LayerNorm(hidden_channels))

    def forward(self, x, x_mask):
        attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
        for i in range(self.n_layers):
            x = x * x_mask
            y = self.attn_layers[i](x, x, attn_mask)
            y = self.drop(y)
            x = self.norm_layers_1[i](x + y)
            y = self.ffn_layers[i](x, x_mask)
            y = self.drop(y)
            x = self.norm_layers_2[i](x + y)
        x = x * x_mask
        return x


class TextEncoder(BaseModule):
    def __init__(self, n_vocab, n_feats, n_channels, filter_channels,
                 filter_channels_dp, n_heads, n_layers, kernel_size,
                 p_dropout, window_size=None, spk_emb_dim=64, n_spks=1):
        super(TextEncoder, self).__init__()
        self.n_vocab = n_vocab
        self.n_feats = n_feats
        self.n_channels = n_channels
        self.filter_channels = filter_channels
        self.filter_channels_dp = filter_channels_dp
        self.n_heads = n_heads
        self.n_layers = n_layers
        self.kernel_size = kernel_size
        self.p_dropout = p_dropout
        self.window_size = window_size
        self.spk_emb_dim = spk_emb_dim
        self.n_spks = n_spks

        self.emb = torch.nn.Embedding(n_vocab, n_channels)
        torch.nn.init.normal_(self.emb.weight, 0.0, n_channels**-0.5)

        self.prenet = ConvReluNorm(n_channels, n_channels, n_channels,
                                   kernel_size=5, n_layers=3, p_dropout=0.5)

        self.encoder = Encoder(n_channels + (spk_emb_dim if n_spks > 1 else 0), filter_channels, n_heads, n_layers,
                               kernel_size, p_dropout, window_size=window_size)

        self.proj_m = torch.nn.Conv1d(n_channels + (spk_emb_dim if n_spks > 1 else 0), n_feats, 1)
        self.proj_w = DurationPredictor(n_channels + (spk_emb_dim if n_spks > 1 else 0), filter_channels_dp,
                                        kernel_size, p_dropout)

    def forward(self, x, x_lengths, spk=None):
        x = self.emb(x) * math.sqrt(self.n_channels)
        x = torch.transpose(x, 1, -1)
        x_mask = torch.unsqueeze(sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)

        x = self.prenet(x, x_mask)
        if self.n_spks > 1:
            x = torch.cat([x, spk.unsqueeze(-1).repeat(1, 1, x.shape[-1])], dim=1)
        x = self.encoder(x, x_mask)
        mu = self.proj_m(x) * x_mask

        x_dp = torch.detach(x)
        logw = self.proj_w(x_dp, x_mask)

        return mu, logw, x_mask
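A hedged sketch of driving `TextEncoder` with dummy phoneme ids; the hyperparameter values below are illustrative guesses (roughly Grad-TTS-like), not values taken from this commit, and it assumes the script is run from the repo root so the `models` package resolves.

# Hypothetical TextEncoder smoke test with made-up hyperparameters and random phoneme ids.
import torch
from models.phoneme_encoder import TextEncoder

encoder = TextEncoder(n_vocab=149, n_feats=80, n_channels=192, filter_channels=768,
                      filter_channels_dp=256, n_heads=2, n_layers=6, kernel_size=3,
                      p_dropout=0.1, window_size=4)
x = torch.randint(1, 149, (2, 37))            # batch of 2 phoneme sequences of length 37
x_lengths = torch.tensor([37, 30])
mu, logw, x_mask = encoder(x, x_lengths)
print(mu.shape, logw.shape, x_mask.shape)     # [2, 80, 37], [2, 1, 37], [2, 1, 37]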
audio_diffusion_attacks_forhf/models/style_diffusion.py
ADDED
@@ -0,0 +1,111 @@
"""
style_diffusion.py
Desc: Contains StyleVDiffusion models for training style transfer/editing models. These are essentially slight modifications of the original VDiffusion classes.
"""

from math import pi
from typing import Any, Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange, repeat
from torch import Tensor
from tqdm import tqdm

from audio_diffusion_pytorch.utils import default
from audio_diffusion_pytorch import Diffusion, Sampler, VDiffusion, VSampler, LinearSchedule, Schedule, Distribution, UniformDistribution

def pad_dims(x: Tensor, ndim: int) -> Tensor:
    # Pads additional ndims to the right of the tensor
    return x.view(*x.shape, *((1,) * ndim))


def clip(x: Tensor, dynamic_threshold: float = 0.0):
    if dynamic_threshold == 0.0:
        return x.clamp(-1.0, 1.0)
    else:
        # Dynamic thresholding
        # Find dynamic threshold quantile for each batch
        x_flat = rearrange(x, "b ... -> b (...)")
        scale = torch.quantile(x_flat.abs(), dynamic_threshold, dim=-1)
        # Clamp to a min of 1.0
        scale.clamp_(min=1.0)
        # Clamp all values and scale
        scale = pad_dims(scale, ndim=x.ndim - scale.ndim)
        x = x.clamp(-scale, scale) / scale
        return x


def extend_dim(x: Tensor, dim: int):
    # e.g. if dim = 4: shape [b] => [b, 1, 1, 1],
    return x.view(*x.shape + (1,) * (dim - x.ndim))

class StyleVDiffusion(Diffusion):
    def __init__(
        self, net: nn.Module, sigma_distribution: Distribution = UniformDistribution()
    ):
        super().__init__()
        self.net = net
        self.sigma_distribution = sigma_distribution

    def get_alpha_beta(self, sigmas: Tensor) -> Tuple[Tensor, Tensor]:
        angle = sigmas * pi / 2
        alpha, beta = torch.cos(angle), torch.sin(angle)
        return alpha, beta

    def forward(self, x: Tensor, y: Tensor, **kwargs) -> Tensor:  # type: ignore
        batch_size, device = x.shape[0], x.device
        # Sample amount of noise to add for each batch element
        sigmas = self.sigma_distribution(num_samples=batch_size, device=device)
        sigmas_batch = extend_dim(sigmas, dim=y.ndim)
        # Get noise
        noise = torch.randn_like(y)
        # Combine input and noise weighted by half-circle
        alphas, betas = self.get_alpha_beta(sigmas_batch)
        y_noisy = alphas * y + betas * noise
        y_noisy = torch.concat((y_noisy, x), dim=1)
        v_target = alphas * noise - betas * y
        # Predict velocity and return loss
        v_pred = self.net(y_noisy, sigmas, **kwargs)
        return F.mse_loss(v_pred, v_target)


class StyleVSampler(Sampler):

    diffusion_types = [VDiffusion]

    def __init__(self, net: nn.Module, schedule: Schedule = LinearSchedule()):
        super().__init__()
        self.net = net
        self.schedule = schedule

    def get_alpha_beta(self, sigmas: Tensor) -> Tuple[Tensor, Tensor]:
        angle = sigmas * pi / 2
        alpha, beta = torch.cos(angle), torch.sin(angle)
        return alpha, beta

    @torch.no_grad()
    def forward(  # type: ignore
        self, x: Tensor, x_noisy: Tensor, num_steps: int, show_progress: bool = False, **kwargs
    ) -> Tensor:
        b = x_noisy.shape[0]
        x = x[None, ...]
        sigmas = self.schedule(num_steps + 1, device=x_noisy.device)
        sigmas = repeat(sigmas, "i -> i b", b=b)
        sigmas_batch = extend_dim(sigmas, dim=x_noisy.ndim + 1)
        alphas, betas = self.get_alpha_beta(sigmas_batch)
        progress_bar = tqdm(range(num_steps), disable=not show_progress)

        for i in progress_bar:
            x_mix = torch.cat((x_noisy, x), dim=1)
            v_pred = self.net(x_mix, sigmas[i], **kwargs)
            x_pred = alphas[i] * x_noisy - betas[i] * v_pred
            noise_pred = betas[i] * x_noisy + alphas[i] * v_pred
            x_noisy = alphas[i + 1] * x_pred + betas[i + 1] * noise_pred
            progress_bar.set_description(f"Sampling (noise={sigmas[i+1,0]:.2f})")

        return x_noisy

if __name__ == "__main__":
    print("Loaded dependencies correctly.")
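A hedged sketch of the `StyleVDiffusion` calling convention with a tiny stand-in network (not the repo's UNetV0; shapes are illustrative): `forward(x, y)` concatenates the conditioning `x` onto the noised target `y` along the channel dimension, so the wrapped net must accept the summed channel count and predict velocity for the target channels.

# Hypothetical minimal training step; TinyNet is a stand-in for the U-Net.
import torch
import torch.nn as nn
from models.style_diffusion import StyleVDiffusion

class TinyNet(nn.Module):
    # Stand-in net: maps 2 input channels (noisy target + condition) to 1 output channel.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
    def forward(self, x, sigmas):
        return self.conv(x)                       # ignores sigmas in this toy example

diffusion = StyleVDiffusion(net=TinyNet())
cond = torch.randn(4, 1, 512, 64)                 # masked mel (conditioning)
target = torch.randn(4, 1, 512, 64)               # full mel (target)
loss = diffusion(cond, target)                    # MSE on the predicted velocity
loss.backward()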
audio_diffusion_attacks_forhf/models/utils.py
ADDED
@@ -0,0 +1,77 @@
"""
utils.py
Desc: A file for miscellaneous util functions
"""

import numpy as np
import torch


# MonoTransform, this does not exist in PyTorch anymore since it is a simple mean calculation. We provide an implementation here
class MonoTransform(object):
    """
    Convert audio sample to mono channel

    Args for __call__:
        audio_sample with shape (C, T) or (B, C, T), where C is the number of channels.

    TODO: IMPLEMENT __call__
    """
    def __init__(self):
        pass

    def __call__(self, sample):
        pass

"""
Below: Helper functions for Grad-TTS
"""

## Duration Loss
## Desc: A function for computing the duration loss for the duration predictor
def duration_loss(logw, logw_, lengths):
    loss = torch.sum((logw - logw_)**2) / torch.sum(lengths)
    return loss

def intersperse(lst, item):
    # Adds blank symbol
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result


def sequence_mask(length, max_length=None):
    if max_length is None:
        max_length = length.max()
    x = torch.arange(int(max_length), dtype=length.dtype, device=length.device)
    return x.unsqueeze(0) < length.unsqueeze(1)


def fix_len_compatibility(length, num_downsamplings_in_unet=2):
    while True:
        if length % (2**num_downsamplings_in_unet) == 0:
            return length
        length += 1


def convert_pad_shape(pad_shape):
    l = pad_shape[::-1]
    pad_shape = [item for sublist in l for item in sublist]
    return pad_shape


def generate_path(duration, mask):
    device = duration.device

    b, t_x, t_y = mask.shape
    cum_duration = torch.cumsum(duration, 1)
    path = torch.zeros(b, t_x, t_y, dtype=mask.dtype).to(device=device)

    cum_duration_flat = cum_duration.view(b * t_x)
    path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
    path = path.view(b, t_x, t_y)
    path = path - torch.nn.functional.pad(path, convert_pad_shape([[0, 0],
        [1, 0], [0, 0]]))[:, :-1]
    path = path * mask
    return path
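A hedged sketch exercising the Grad-TTS helpers above with toy tensors, showing what `sequence_mask`, `generate_path`, and `intersperse` return.

# Hypothetical toy usage of the helpers defined in utils.py.
import torch
from models.utils import sequence_mask, generate_path, intersperse

lengths = torch.tensor([3, 5])
print(sequence_mask(lengths))            # [[True,True,True,False,False],[True,True,True,True,True]]

duration = torch.tensor([[1, 2, 2]])     # one sequence: 3 phonemes covering 5 frames in total
mask = torch.ones(1, 3, 5)
print(generate_path(duration, mask))     # each row marks which frames that phoneme covers

print(intersperse([4, 7, 9], 0))         # [0, 4, 0, 7, 0, 9, 0] -- blank token between phonemes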
audio_diffusion_attacks_forhf/notebooks/data_exploration/00_fma_exploration.ipynb
ADDED
The diff for this file is too large to render.
See raw diff
audio_diffusion_attacks_forhf/resources/cmu_dictionary
ADDED
The diff for this file is too large to render.
See raw diff
audio_diffusion_attacks_forhf/scripts/.DS_Store
ADDED
Binary file (6.15 kB).
audio_diffusion_attacks_forhf/scripts/data_processing/process_music_mels.py
ADDED
@@ -0,0 +1,106 @@
"""
process_music_mels.py
Desc: Run this script with the appropriate data paths to extract mels
Command: `python -u scripts/data_processing/process_music_mels.py`
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import IPython

import scipy
import torch
import torchaudio
import os
import ast
import soundfile as sf
import glob

# Old Code for Importing AudioLDM
# from audioldm.pipeline import build_model

# HF Code for AudioLDM2
# from diffusers import AudioLDM2Pipeline
from audioldm.audio import wav_to_fbank, TacotronSTFT
try:
    from audioldm2 import build_model
except:
    from audioldm2 import build_model

# TODO: Replace these with args
audio_path = "/data/robbizorg/music_datasets/fma/data/fma_large/"
target_audio_path = "/data/robbizorg/music_datasets/fma/data/fma_processed/"
remake = False

## AudioLDM Mel Spec
default_mel_config = {
    "preprocessing": {
        "audio": {
            "sampling_rate": 16000,
            "max_wav_value": 32768,
            "duration": 10.24,
        },
        "stft": {"filter_length": 1024, "hop_length": 160, "win_length": 1024},
        "mel": {"n_mel_channels": 64, "mel_fmin": 0, "mel_fmax": 8000},
    }}

fn_STFT = TacotronSTFT(
    default_mel_config["preprocessing"]["stft"]["filter_length"],
    default_mel_config["preprocessing"]["stft"]["hop_length"],
    default_mel_config["preprocessing"]["stft"]["win_length"],
    default_mel_config["preprocessing"]["mel"]["n_mel_channels"],
    default_mel_config["preprocessing"]["audio"]["sampling_rate"],
    default_mel_config["preprocessing"]["mel"]["mel_fmin"],
    default_mel_config["preprocessing"]["mel"]["mel_fmax"],
)

if __name__ == "__main__":
    audio_files = glob.glob(os.path.join(audio_path, "*/*.mp3"))
    failed_files = []

    # Preprocess all mel_specs
    for i, f in enumerate(audio_files):
        if i % 1000 == 0:
            print(f"{i} of {len(audio_files)} files have been processed.")
        dir_info = f.split("/")
        filename = dir_info[-1].split(".")[0]
        parent_dir = dir_info[-2]

        # Skip the file if it's already generated
        if not remake and os.path.exists(os.path.join(target_audio_path, parent_dir, filename + '.npy')):
            continue

        try:
            audio, sr = torchaudio.load(f)
        except:
            failed_files.append(f)
            print(f"Failed on File {f}")
            continue

        if audio.shape[0] == 2:
            mono_audio = torch.mean(audio, axis = 0) # Convert to Mono
        else:
            mono_audio = audio[0, :] # remove channel info

        # Resample Audio
        resamp_16k = torchaudio.functional.resample(mono_audio, sr, 16000)

        duration = resamp_16k.shape[0]/16000
        target_length = int(duration * 100) # int(duration * 102.4)

        mel, _, _ = wav_to_fbank(resamp_16k.cpu(), target_length=target_length, fn_STFT=fn_STFT)

        # Make parent dir
        if not os.path.exists(os.path.join(target_audio_path, parent_dir)):
            os.mkdir(os.path.join(target_audio_path, parent_dir))

        with open(os.path.join(target_audio_path, parent_dir, filename + '.npy'), 'wb') as numpy_f:
            np.save(numpy_f, mel.numpy())

    print("Failed_Files:", len(failed_files))
    for x in failed_files:
        print(x)
audio_diffusion_attacks_forhf/scripts/data_processing/process_music_numpy.py
ADDED
@@ -0,0 +1,74 @@
"""
process_music_numpy.py
Desc: Run this script with the appropriate data paths to preprocess audio files and convert to 48k for the ArchiSound encoders
Command: `python -u scripts/data_processing/process_music_numpy.py`
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import IPython

import scipy
import torch
import torchaudio
import os
import ast
import soundfile as sf
import glob

# Old Code for Importing AudioLDM
# from audioldm.pipeline import build_model

# HF Code for AudioLDM2
# from diffusers import AudioLDM2Pipeline
from audioldm.audio import wav_to_fbank, TacotronSTFT
try:
    from audioldm2 import build_model
except:
    from audioldm2 import build_model

# TODO: Replace these with args
audio_path = "/data/robbizorg/music_datasets/fma/data/fma_large/"
target_audio_path = "/data/robbizorg/music_datasets/fma/data/fma_processed_48k/"
remake = False

if __name__ == "__main__":
    audio_files = glob.glob(os.path.join(audio_path, "*/*.mp3"))
    failed_files = []

    # Preprocess all mel_specs
    for i, f in enumerate(audio_files):
        if i % 1000 == 0:
            print(f"{i} of {len(audio_files)} files have been processed.")
        dir_info = f.split("/")
        filename = dir_info[-1].split(".")[0]
        parent_dir = dir_info[-2]

        # Skip the file if it's already generated
        if not remake and os.path.exists(os.path.join(target_audio_path, parent_dir, filename + '.npy')):
            continue

        try:
            audio, sr = torchaudio.load(f)
        except:
            failed_files.append(f)
            print(f"Failed on File {f}")
            continue

        # Resample Audio--Don't need to make mono since archisound encoders take in stereo
        resamp_48k = torchaudio.functional.resample(audio, sr, 48000)

        # Make parent dir
        if not os.path.exists(os.path.join(target_audio_path, parent_dir)):
            os.mkdir(os.path.join(target_audio_path, parent_dir))

        with open(os.path.join(target_audio_path, parent_dir, filename + '.npy'), 'wb') as numpy_f:
            np.save(numpy_f, resamp_48k.numpy())

    print("Failed_Files:", len(failed_files))
    for x in failed_files:
        print(x)
audio_diffusion_attacks_forhf/scripts/train/music_models/train_music_completion.py
ADDED
@@ -0,0 +1,243 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
"""
|
2 |
+
train_music_completion.py
|
3 |
+
Desc: Train a model for completing a 3 seconds of audio given 3 seconds of music as input.
|
4 |
+
Note: There are two possible approaches for this task
|
5 |
+
1. Perform masking and try to get the model to fill in the blank with StyleVDiffusion
|
6 |
+
2. Condition on the mel-spec with VDiffusion
|
7 |
+
"""
|
8 |
+
|
9 |
+
import sys
|
10 |
+
|
11 |
+
import numpy as np
|
12 |
+
import torch
|
13 |
+
import torch.nn as nn
|
14 |
+
import torchaudio
|
15 |
+
import gc
|
16 |
+
import argparse
|
17 |
+
import os
|
18 |
+
from tqdm import tqdm
|
19 |
+
import wandb
|
20 |
+
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler
|
21 |
+
import soundfile as sf
|
22 |
+
|
23 |
+
sys.path.append(".")
|
24 |
+
from models.style_diffusion import StyleVDiffusion, StyleVSampler
|
25 |
+
from models.datasets.music_datasets import MusicMelDataset
|
26 |
+
|
27 |
+
import logging
|
28 |
+
|
29 |
+
from audioldm.audio import wav_to_fbank, TacotronSTFT
|
30 |
+
from audioldm2 import build_model
|
31 |
+
|
32 |
+
|
33 |
+
# Uncomment out below if wanting to supress
|
34 |
+
import warnings
|
35 |
+
warnings.filterwarnings("ignore")
|
36 |
+
|
37 |
+
# Set Sample Rate if like so if desired
|
38 |
+
SAMPLE_RATE = 16000
|
39 |
+
BATCH_SIZE = 16
|
40 |
+
|
41 |
+
|
42 |
+
# Function for creating a model that acts on mel-specs
|
43 |
+
def create_mel_model():
|
44 |
+
return DiffusionModel(
|
45 |
+
net_t=UNetV0, # The model type used for diffusion (U-Net V0 in this case)
|
46 |
+
# dim=2, # for spectrogram we can use 2D-CNN, but not going to for now
|
47 |
+
in_channels=600, # U-Net: number of input (time) channels
|
48 |
+
out_channels=300, # U-Net: number of output (time) channels
|
49 |
+
channels=[8, 32, 64, 128, 256, 512], # U-Net: channels at each layer
|
50 |
+
factors=[2, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
|
51 |
+
items=[2, 2, 2, 2, 2, 2], # U-Net: number of repeating items at each layer
|
52 |
+
attentions=[0, 0, 1, 1, 1, 1], # U-Net: attention enabled/disabled at each layer
|
53 |
+
attention_heads=8, # U-Net: number of attention heads per attention item
|
54 |
+
attention_features=64, # U-Net: number of attention features per attention item
|
55 |
+
diffusion_t=StyleVDiffusion, # The diffusion method used
|
56 |
+
sampler_t=StyleVSampler, # The diffusion sampler used
|
57 |
+
# embedding_features = 7, # Embedding Features for when conditioned
|
58 |
+
# cross_attentions=[0, 0, 0, 0, 1, 1, 1, 1]
|
59 |
+
)
|
60 |
+
|
61 |
+
def create_2Dmel_model():
|
62 |
+
return DiffusionModel(
|
63 |
+
net_t=UNetV0, # The model type used for diffusion (U-Net V0 in this case)
|
64 |
+
dim=2, # for spectrogram we can use 2D-CNN
|
65 |
+
in_channels=2, # U-Net: number of input (time) channels
|
66 |
+
out_channels=1, # U-Net: number of output (time) channels
|
67 |
+
channels=[8, 32, 64, 128, 256, 512], # U-Net: channels at each layer
|
68 |
+
factors=[2, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
|
69 |
+
items=[2, 2, 2, 2, 2, 2], # U-Net: number of repeating items at each layer
|
70 |
+
attentions=[0, 0, 1, 1, 1, 1], # U-Net: attention enabled/disabled at each layer
|
71 |
+
attention_heads=8, # U-Net: number of attention heads per attention item
|
72 |
+
attention_features=64, # U-Net: number of attention features per attention item
|
73 |
+
diffusion_t=StyleVDiffusion, # The diffusion method used
|
74 |
+
sampler_t=StyleVSampler, # The diffusion sampler used
|
75 |
+
# embedding_features = 7, # Embedding Features for when conditioned
|
76 |
+
# cross_attentions=[0, 0, 0, 0, 1, 1, 1, 1]
|
77 |
+
)
|
78 |
+
|
79 |
+
def parse_args():
|
80 |
+
parser = argparse.ArgumentParser()
|
81 |
+
parser.add_argument("--checkpoint", type=str, default='/data/robbizorg/attacksanddefenses/checkpoints/')
|
82 |
+
parser.add_argument("--resume", action="store_true")
|
83 |
+
parser.add_argument("--run_id", type=str, default='')
|
84 |
+
parser.add_argument("--debug", action="store_true")
|
85 |
+
    parser.add_argument("--data_path", type = str, default = "./data/fma_valid_files.npy")
    parser.add_argument("--epoch_num", type = int, default = 101)
    args = parser.parse_args()
    return vars(args)


if __name__ == "__main__":

    args = parse_args()

    if args['run_id'] == '':
        raise ValueError(f"Please provide a run_id for this training session.")

    cuda_ids = [phy_id for phy_id in range(len(os.environ["CUDA_VISIBLE_DEVICES"].split(",")))]

    if len(cuda_ids) > 1:
        raise NotImplementedError("Currently training is only allowed on a single GPU")

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    logging.basicConfig(
        format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        level=os.environ.get("LOGLEVEL", "INFO").upper(),
        stream=sys.stdout,
        filemode='w',
    )
    logger = logging.getLogger("")

    audio_files = list(np.load(args['data_path'], allow_pickle = True).item())

    dataset = MusicMelDataset(audio_files, audio_len=5.12)

    print(f"Dataset length: {len(dataset)}")

    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        num_workers=16,
        pin_memory=False,
    )

    # Use this model for vocoder
    vae_model = build_model().to(device)

    diff_model = create_2Dmel_model().to(device)

    optimizer = torch.optim.AdamW(params=list(diff_model.parameters()), lr=1e-4, betas=(0.95, 0.999), eps=1e-6, weight_decay=1e-3)

    print(f"Number of parameters: {sum(p.numel() for p in diff_model.parameters() if p.requires_grad)}")

    if not args['debug']:
        run_id = wandb.util.generate_id()
        if args["run_id"] is not None:
            run_id = args["run_id"]
        print(f"Run ID: {run_id}")

        wandb.init(project="music-completion", resume=args["resume"], id=run_id)

    epoch = 0
    step = 0

    checkpoint_path = os.path.join(args["checkpoint"], args["run_id"])

    if not os.path.exists(checkpoint_path):
        os.makedirs(checkpoint_path)
        os.makedirs(os.path.join(checkpoint_path, "mels"))
        os.makedirs(os.path.join(checkpoint_path, "wavs"))

    if not args['debug'] and wandb.run.resumed:
        if os.path.exists(checkpoint_path):
            checkpoint = torch.load(checkpoint_path)
        else:
            checkpoint = torch.load(wandb.restore(checkpoint_path))
        diff_model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        epoch = checkpoint['epoch']
        step = epoch * len(dataloader)

    scaler = torch.cuda.amp.GradScaler()

    diff_model.train()

    while epoch < args['epoch_num']:
        avg_loss = 0
        avg_loss_step = 0
        progress = tqdm(dataloader)
        for i, (audio, masked_audio) in enumerate(progress):
            optimizer.zero_grad()
            # audio = torch.swapaxes(audio.to(device), 1, 2)
            # masked_audio = torch.swapaxes(masked_audio.to(device), 1, 2)
            # audio = audio.to(device)
            # masked_audio = masked_audio.to(device)
            audio = audio.to(device).unsqueeze(1)
            masked_audio = masked_audio.to(device).unsqueeze(1)

            with torch.cuda.amp.autocast():
                loss = diff_model(masked_audio, audio)
            avg_loss += loss.item()
            avg_loss_step += 1
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            progress.set_postfix(
                # loss=loss.item(),
                loss=avg_loss / avg_loss_step,
                epoch=epoch + i / len(dataloader),
            )

            if step % 500 == 0:
                # if step % 1 == 0:
                # Turn noise into new audio sample with diffusion
                # noise = torch.randn(1, 300, 64, device=device)
                # noise = torch.randn(1, 64, 300, device=device)
                noise = torch.randn(1, 1, 512, 64, device=device)  # 2D example

                with torch.cuda.amp.autocast():
                    sample = diff_model.sample(masked_audio[0], noise, num_steps=200)

                orig_wav = vae_model.mel_spectrogram_to_waveform(audio[0].unsqueeze(0), save = False)[0][0].astype(np.float32)  # 1, 1, len
                gen_wav = vae_model.mel_spectrogram_to_waveform(sample, save = False)[0][0].astype(np.float32)  # 1, 1, len

                orig_dir = os.path.join(checkpoint_path, 'wavs', f'target_{step}0.wav')
                gen_dir = os.path.join(checkpoint_path, 'wavs', f'gen_{step}0.wav')
                sf.write(orig_dir, orig_wav, samplerate = 16000)
                sf.write(gen_dir, gen_wav, samplerate = 16000)

                if not args['debug']:
                    wandb.log({
                        "step": step,
                        "epoch": epoch + i / len(dataloader),
                        "loss": avg_loss / avg_loss_step,
                        "target_audio": wandb.Audio(orig_dir, caption="Target audio", sample_rate=SAMPLE_RATE),
                        "generated_audio": wandb.Audio(gen_dir, caption="Generated audio", sample_rate=SAMPLE_RATE)
                    })

            if not args['debug'] and step % 100 == 0:
                wandb.log({
                    "step": step,
                    "epoch": epoch + i / len(dataloader),
                    "loss": avg_loss / avg_loss_step,
                })
                avg_loss = 0
                avg_loss_step = 0

            step += 1

        epoch += 1

        if not args['debug'] and epoch % 100 == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': diff_model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, os.path.join(checkpoint_path, f"epoch-{epoch}.pt"))
            wandb.save(checkpoint_path, base_path=args["checkpoint"])
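
For reference, a minimal sketch (not part of this commit) of how a checkpoint written by the loop above could be reloaded before resuming training; the dict keys mirror the torch.save call, while the helper name and the epoch-based file name argument are illustrative assumptions.

# Hypothetical helper, not in the repository: reload a checkpoint saved by the loop above.
import os
import torch

def load_checkpoint(checkpoint_path, epoch, model, optimizer, device="cuda"):
    # File name pattern matches the torch.save call above: epoch-{epoch}.pt
    ckpt_file = os.path.join(checkpoint_path, f"epoch-{epoch}.pt")
    checkpoint = torch.load(ckpt_file, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]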
audio_diffusion_attacks_forhf/scripts/train/train_tts.py
ADDED
@@ -0,0 +1,430 @@
"""
train_tts.py
Desc: An example script for training a Diffusion-based TTS model with a speaker encoder.
"""

import sys

import torch
import torch.nn as nn
import torchaudio
import gc
import argparse
import os
from tqdm import tqdm
import wandb
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler

sys.path.append(".")
from models.style_diffusion import StyleVDiffusion, StyleVSampler
# from models.utils import MonoTransform

# from util import calculate_codebook_bitrate, extract_melspectrogram, get_audio_file_bitrate, get_duration, load_neural_audio_codec
from audioldm.pipeline import build_model
import torch.multiprocessing as mp

# Needed for Instruction/Prompt Models
# from transformers import AutoTokenizer, T5EncoderModel

import logging

# Uncomment below to suppress warnings
# import warnings
# warnings.filterwarnings("ignore")

# Set the sample rate like so if desired
SAMPLE_RATE = 16000
BATCH_SIZE = 16
NUM_SAMPLES = int(2.56 * SAMPLE_RATE)
# NUM_SAMPLES = 2 ** 15


def create_model():
    return DiffusionModel(
        net_t=UNetV0,  # The model type used for diffusion (U-Net V0 in this case)
        # dim=2,  # for spectrogram we use 2D-CNN
        in_channels=314,  # U-Net: number of input (audio) channels
        out_channels=157,  # U-Net: number of output (audio) channels
        channels=[256, 256, 512, 512, 768, 768, 1280, 1280],  # U-Net: channels at each layer
        factors=[2, 2, 2, 2, 2, 2, 2, 1],  # U-Net: downsampling and upsampling factors at each layer
        items=[2, 2, 2, 2, 2, 2, 2, 2],  # U-Net: number of repeating items at each layer
        attentions=[0, 0, 0, 0, 1, 1, 1, 1],  # U-Net: attention enabled/disabled at each layer
        attention_heads=8,  # U-Net: number of attention heads per attention item
        attention_features=64,  # U-Net: number of attention features per attention item
        diffusion_t=StyleVDiffusion,  # The diffusion method used
        sampler_t=StyleVSampler,  # The diffusion sampler used
        # embedding_features=8,
        # embedding_features=2,  # Embedding for when it's just res and weight
        embedding_features=7,  # Embedding features for when severity is dropped
        cross_attentions=[0, 0, 0, 0, 1, 1, 1, 1]
    )


def main():
    pass
    # args = parse_args()

    # os.environ["CUDA_DEVICE_ORDER"] = 'PCI_BUS_ID'
    # os.environ["CUDA_VISIBLE_DEVICES"] = args['cuda_ids']
    # cuda_ids = [phy_id for phy_id in range(len(args['cuda_ids'].split(",")))]

    # logging.basicConfig(
    #     format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    #     datefmt="%Y-%m-%d %H:%M:%S",
    #     level=os.environ.get("LOGLEVEL", "INFO").upper(),
    #     stream=sys.stdout,
    #     filemode='w',
    # )
    # logger = logging.getLogger("")

    # # mp.set_start_method('spawn')
    # # mp.set_sharing_strategy('file_system')

    # device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # # Load in text model
    # # tokenizer = AutoTokenizer.from_pretrained("t5-small")
    # # text_model = T5EncoderModel.from_pretrained("t5-small")
    # # text_model.eval()  # Don't want to train it!

    # dataset = DSVAE_CondStyleWAVDataset(
    #     path="/data/robbizorg/pqvd_gen_w_conditioning/speech_non_speech_timesteps_VCTK.json",
    #     random_crop_size=NUM_SAMPLES,
    #     sample_rate=SAMPLE_RATE,
    #     transforms=AllTransform(
    #         mono=True,
    #     ),
    #     reconstructive = False,  # Make this true to just train a reconstructive model
    #     identity_limit = 1  # Affects how often we learn identity mapping
    # )

    # print(f"Dataset length: {len(dataset)}")

    # dataloader = torch.utils.data.DataLoader(
    #     dataset,
    #     batch_size=BATCH_SIZE,
    #     shuffle=True,
    #     num_workers=16,
    #     pin_memory=False,
    # )

    # vae_model = DSVAE(logger, **args).cuda()

    # if not os.path.exists(args['model_path']):
    #     logger.warning("model not exist and we just create the new model......")
    # else:
    #     logger.info("Model Exists")
    #     logger.info("Model Path is " + args['model_path'])
    #     vae_model.loadParameters(args['model_path'])
    # vae_model = torch.nn.DataParallel(vae_model, device_ids = cuda_ids, output_device=cuda_ids[0])
    # vae_model = vae_model.cuda()
    # vae_model.eval()
    # vae_model.module.eer = True

    # diff_model = create_model().to(device)
    # # audio_codec = build_model().to(device)
    # # audio_codec.latent_t_size = 157
    # # config, audio_codec, vocoder = load_neural_audio_codec('2021-05-19T22-16-54_vggsound_codebook', './logs', device)

    # # optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # optimizer = torch.optim.AdamW(params=list(diff_model.parameters()), lr=1e-4, betas= (0.95, 0.999), eps=1e-6, weight_decay=1e-3)

    # print(f"Number of parameters: {sum(p.numel() for p in diff_model.parameters() if p.requires_grad)}")

    # run_id = wandb.util.generate_id()
    # if args["run_id"] is not None:
    #     run_id = args["run_id"]
    # print(f"Run ID: {run_id}")

    # wandb.init(project="audio-diffusion-no-condition", resume=args["resume"], id=run_id)

    # epoch = 0
    # step = 0

    # checkpoint_path = os.path.join(args["checkpoint"], args["run_id"])

    # if not os.path.exists(checkpoint_path):
    #     os.makedirs(checkpoint_path)
    #     os.makedirs(os.path.join(checkpoint_path, "mels"))
    #     os.makedirs(os.path.join(checkpoint_path, "wavs"))

    # if wandb.run.resumed:
    #     if os.path.exists(checkpoint_path):
    #         checkpoint = torch.load(checkpoint_path)
    #     else:
    #         checkpoint = torch.load(wandb.restore(checkpoint_path))
    #     diff_model.load_state_dict(checkpoint['model_state_dict'])
    #     optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    #     epoch = checkpoint['epoch']
    #     step = epoch * len(dataloader)

    # scaler = torch.cuda.amp.GradScaler()

    # diff_model.train()
    # while epoch < 101:
    #     avg_loss = 0
    #     avg_loss_step = 0
    #     progress = tqdm(dataloader)
    #     for i, (audio, target, embedding) in enumerate(progress):
    #         optimizer.zero_grad()
    #         audio = audio.to(device)
    #         target = target.to(device)
    #         embedding = embedding.to(device)

    #         with torch.no_grad():
    #             embedding = embedding.float()  # Make it float like the others

    #             speaker_embed_source, content_embed_source = vae_model(audio)
    #             speaker_embed_source = speaker_embed_source.unsqueeze(1).expand(-1, 157, -1)

    #             audio_embed = torch.cat((speaker_embed_source, content_embed_source), axis = -1)

    #             # zeroes = torch.zeros(16, 3, 128, dtype=audio_embed.dtype, device = audio_embed.device)
    #             # audio_embed = torch.cat((audio_embed, zeroes), dim=1)

    #             speaker_embed, content_embed = vae_model(target)
    #             speaker_embed = speaker_embed.unsqueeze(1).expand(-1, 157, -1)

    #             # in order to simulate paired data, do (naive) voice conversion first
    #             target_embed = torch.cat((speaker_embed, content_embed_source), axis = -1)
    #             # target_embed = torch.cat((target_embed, zeroes), dim = 1)

    #         with torch.cuda.amp.autocast():
    #             loss = diff_model(audio_embed, target_embed, embedding=embedding)
    #         avg_loss += loss.item()
    #         avg_loss_step += 1
    #         scaler.scale(loss).backward()
    #         scaler.step(optimizer)
    #         scaler.update()
    #         progress.set_postfix(
    #             # loss=loss.item(),
    #             loss=avg_loss / avg_loss_step,
    #             epoch=epoch + i / len(dataloader),
    #         )

    #         if step % 500 == 0:
    #             # if step % 1 == 0:
    #             # Turn noise into new audio sample with diffusion
    #             noise = torch.randn(1, 157, 128, device=device)

    #             with torch.cuda.amp.autocast():
    #                 sample = diff_model.sample(audio_embed[0], noise, embedding=embedding[0][None, :], num_steps=200)

    #             # Save the melspecs
    #             audio_sub = torch.swapaxes(audio[0].unsqueeze(0), 1, 2)
    #             # target_sub = torch.swapaxes(target[0].unsqueeze(0), 1, 2)  # This is the original target audio, not what we want
    #             target_sub = vae_model.module.share_decoder(target_embed).loc
    #             gen_mel = vae_model.module.share_decoder(sample).loc

    #             vae_model.module.draw_mel(audio_sub, mode=f"source_{step}", file_path = os.path.join(checkpoint_path, "mels"))
    #             vae_model.module.draw_mel(target_sub, mode=f"target_{step}", file_path = os.path.join(checkpoint_path, "mels"))
    #             vae_model.module.draw_mel(gen_mel, mode=f"gen_{step}", file_path = os.path.join(checkpoint_path, "mels"))

    #             vae_model.module.mel2wav(audio_sub, mode=f"source_{step}", task="vc", file_path = os.path.join(checkpoint_path, "wavs"))
    #             vae_model.module.mel2wav(target_sub, mode=f"target_{step}", task="vc", file_path = os.path.join(checkpoint_path, "wavs"))
    #             vae_model.module.mel2wav(gen_mel, mode=f"gen_{step}", task="vc", file_path = os.path.join(checkpoint_path, "wavs"))

    #             # torchaudio.save(os.path.join(checkpoint_path, 'wavs', f'test_input_sound_{step}.wav'), torch.from_numpy(audio_codec.mel_spectrogram_to_waveform(audio_codec.decode_first_stage(audio[0].unsqueeze(0))))[0], SAMPLE_RATE)
    #             # torchaudio.save(os.path.join(checkpoint_path, 'wavs', f'test_generated_sound_{step}.wav'), torch.from_numpy(audio_codec.mel_spectrogram_to_waveform(audio_codec.decode_first_stage(sample[0].unsqueeze(0))))[0], SAMPLE_RATE)
    #             # torchaudio.save(os.path.join(checkpoint_path, 'wavs', f'test_target_sound_{step}.wav'), torch.from_numpy(audio_codec.mel_spectrogram_to_waveform(audio_codec.decode_first_stage(target[0].unsqueeze(0))))[0], SAMPLE_RATE)

    #             wandb.log({
    #                 "step": step,
    #                 "epoch": epoch + i / len(dataloader),
    #                 "loss": avg_loss / avg_loss_step,
    #                 "input_mel": wandb.Image(os.path.join(checkpoint_path, "mels", f"source_{step}_mel_0.png"), caption="Input Mel"),
    #                 "target_mel": wandb.Image(os.path.join(checkpoint_path, "mels", f"target_{step}_mel_0.png"), caption="Target Mel"),
    #                 "gen_mel": wandb.Image(os.path.join(checkpoint_path, "mels", f"gen_{step}_mel_0.png"), caption="Gen Mel"),
    #                 "input_audio": wandb.Audio(os.path.join(checkpoint_path, 'wavs', f'source_{step}0.wav'), caption="Input audio", sample_rate=SAMPLE_RATE),
    #                 "target_audio": wandb.Audio(os.path.join(checkpoint_path, 'wavs', f'target_{step}0.wav'), caption="Target audio", sample_rate=SAMPLE_RATE),
    #                 "generated_audio": wandb.Audio(os.path.join(checkpoint_path, 'wavs', f'gen_{step}0.wav'), caption="Generated audio", sample_rate=SAMPLE_RATE)
    #             })

    #         if step % 100 == 0:
    #             wandb.log({
    #                 "step": step,
    #                 "epoch": epoch + i / len(dataloader),
    #                 "loss": avg_loss / avg_loss_step,
    #             })
    #             avg_loss = 0
    #             avg_loss_step = 0

    #         step += 1

    #     epoch += 1

    #     if epoch % 100 == 0:
    #         torch.save({
    #             'epoch': epoch,
    #             'model_state_dict': diff_model.state_dict(),
    #             'optimizer_state_dict': optimizer.state_dict(),
    #         }, os.path.join(checkpoint_path, f"epoch-{epoch}.pt"))
    #         wandb.save(checkpoint_path, base_path=args["checkpoint"])


# def parse_args():
#     parser = argparse.ArgumentParser()
#     parser.add_argument("--checkpoint", type=str, default='/data/robbizorg/pqvd_gen_w_dsvae/checkpoints/')
#     parser.add_argument("--resume", action="store_true")
#     parser.add_argument("--run_id", type=str, default='condition_ldm')

#     ## Params from DSVAE
#     parser.add_argument('--dataset', type=str, default="VCTK", help='VCTK, LibriTTS')
#     parser.add_argument('--encoder', type=str, default='dsvae', help='dsvae. tdnn')
#     parser.add_argument('--vocoder', type=str, default='hifigan', help='wavenet, hifigan')
#     parser.add_argument('--save_tsne', dest='save_tsne', action='store_true', help='save_tsne')
#     parser.add_argument('--mel_tsne', dest='mel_tsne', action='store_true', help='mel_tsne')
#     parser.add_argument('--feature', type=str, default='mel_spec', help='stft, mel_spec, mfcc')
#     parser.add_argument('--model_path', type=str, default='/home/robbizorg/research/dsvae/save_models/dsvae/best699.pth')
#     # parser.add_argument('--model_path', type=str, default='/data/andreaguz/save_models/dsvae_003_03/best699.pth')  # Using the fine-tuned dsvae
#     # parser.add_argument('--model_path', type=str, default='/data/andreaguz/save_models/dsvae_0001_0005/best.pth')  # Using the fine-tuned dsvae
#     parser.add_argument('--save_path', type=str, default='save_models/dsvae')
#     parser.add_argument('--cuda_ids', type=str, default='0')
#     parser.add_argument('--tsne_mode', type=str, default='test')
#     parser.add_argument("--optimizer", type=str, default='adam', help='sgd, adam')
#     parser.add_argument("--path_vc_1", type=str, default='', help='')
#     parser.add_argument("--path_vc_2", type=str, default='', help='')
#     parser.add_argument('--max_frames', type=int, default=100, help='1frame~10ms')
#     parser.add_argument("--hop_size", type=int, default=256, help='hop_size')
#     parser.add_argument("--win_length", type=int, default=1024, help='win_length')
#     parser.add_argument("--spk_dim", type=int, default=64, help='spk_embed')
#     parser.add_argument("--ecapa_spk_dim", type=int, default=128, help='ecapa spk_embed')
#     parser.add_argument("--content_dim", type=int, default=64, help="content_embed")
#     parser.add_argument("--conformer_hidden_dim", type=int, default=256, help="content_embed")
#     parser.add_argument('--n_epochs', type=int, default=700, help='n_epochs')
#     parser.add_argument('--eval_epoch', type=int, default=5, help='eval_epoch')
#     parser.add_argument('--step_size', type=int, default=5, help='step_size')
#     parser.add_argument('--num_workers', type=int, default=16, help='num_workers')
#     parser.add_argument('--lr_decay_rate', type=float, default=0.95, help='lr_decay_rate')
#     parser.add_argument('--lr', type=float, default=3e-4, help='lr_rate')
#     # parser.add_argument('--klf_factor', type=float, default=3e-3, help='klf_factor')
#     # parser.add_argument('--klt_factor', type=float, default=5, help='klt_factor')
#     parser.add_argument('--klf_factor', type=float, default=3e-4, help='klf_factor')  # Changed for the fine-tuned version
#     parser.add_argument('--klt_factor', type=float, default=3e-3, help='klt_factor')
#     parser.add_argument('--rec_factor', type=float, default=1, help='rec_factor')
#     parser.add_argument('--vq_factor', type=float, default=1000, help='vq_factor')
#     parser.add_argument('--zf_vq_factor', type=float, default=1000, help='vq_factor')
#     parser.add_argument('--klf_std', type=float, default=0.5, help='klf_std')
#     parser.add_argument('--rec_std', type=float, default=0.04, help='rec_std')
#     parser.add_argument('--clip', type=float, default=1, help='rec_std')
#     parser.add_argument('--phoneme_factor', type=float, default=1, help='phoneme_factor')
#     parser.add_argument('--r_vq_factor', type=float, default=10, help='r_vq_factor')
#     parser.add_argument('--compute_speaker_eer', dest='compute_speaker_eer', action='store_true', help='ASV EER')
#     parser.add_argument('--eval_phoneme', dest='eval_phoneme', action='store_true', help='ASV EER')
#     parser.add_argument('--num_eval', type=int, default=20, help='num of segments for eval')
#     parser.add_argument('--batch_size', type=int, default=256, help='batch_size')
#     parser.add_argument('--num_phonemes', type=int, default=100, help='num_phonemes')
#     parser.add_argument('--with_phoneme', dest='with_phoneme', action='store_true', help='')
#     parser.add_argument("--conversion", action='store_true', help='for conversion text')
#     parser.add_argument("--conversion2", action='store_true', help='for conversion text')
#     parser.add_argument("--conversion3", action='store_true', help='for conversion text')
#     parser.add_argument("--mel2npy", action='store_true', help='mel2npy')
#     parser.add_argument("--unconditional", action='store_true', help='unconditional')
#     parser.add_argument('--zt_norm_mean', action='store_true', help='instancenorm1d on zt prior and post')
#     parser.add_argument('--zf_norm_mean', action='store_true', help='instancenorm1d on zf prior and post')
#     parser.add_argument('--freeze_encoder', action='store_true', help='if or not to freeze encoder')
#     parser.add_argument('--freeze_decoder', action='store_true', help='if or not to freeze decoder')
#     parser.add_argument("--sample_rate", type=int, default=16000, help='16000 or 48000')
#     parser.add_argument('--noise_path', type=str, default='datasets/noise_list.scp', help='nosie invariant')
#     parser.add_argument('--wav_aug_train', action='store_true', help='with data augmentation')
#     parser.add_argument('--spec_aug_train', action='store_true', help='with data augmentation')
#     parser.add_argument('--noise_train', action='store_true', help='noise')
#     parser.add_argument('--triphn', action='store_true', help='with triphn')
#     parser.add_argument('--train_hifigan', action='store_true', help='train hifigan')
#     parser.add_argument("--prior_alignment", action='store_true', help='')
#     parser.add_argument("--zf_vq", action='store_true', help='')
#     parser.add_argument("--vq_prior_independent", action='store_true', help='')
#     parser.add_argument("--vq_prior_regressive", action='store_true', help='')
#     parser.add_argument("--vq_prior_pseudo", action='store_true', help='')
#     parser.add_argument("--vq_size_zt", type=int, default=200, help='')
#     parser.add_argument("--vq_size_zf", type=int, default=200, help='')
#     parser.add_argument("--ignore_index", type=int, default=0, help='')
#     parser.add_argument("--hidden_dim", type=int, default=256, help='')

#     parser.add_argument("--share_encoder", type=str, default='cnn', help='')
#     parser.add_argument("--share_decoder", type=str, default='cnn_lstm', help='cnn_lstm, cnn_transformer')
#     parser.add_argument("--zt_encoder", type=str, default='lstm', help='lstm, conformer_encoder, transformer_encoder')
#     parser.add_argument("--zf_encoder", type=str, default='lstm', help='lstm, transformer_encoder, ecapa_tdnn')
#     parser.add_argument("--zt_prior_model", type=str, default='lstm', help='lstm, vqvae, transformer')
#     parser.add_argument("--prior_signal", type=str, default='None', help='alignment_triphn, alignment_mono, melspec_pseudo, wavlm_pseudo, vq_embeds, vq_pseudo')
#     parser.add_argument("--multi_scale_add", action='store_true', help='')
#     parser.add_argument("--multi_scale_cat", action='store_true', help='')
#     parser.add_argument("--num_scales", type=int, default=1, help='')

#     parser.add_argument("--kmeans_num_clusters", type=int, default=50, help='')
#     parser.add_argument("--wavlm_dim", type=int, default=768, help='')

#     parser.add_argument("--ema_zt", action='store_true', help='')
#     parser.add_argument("--ema_zf", action='store_true', help='')

#     parser.add_argument("--r_vqvae", action='store_true', help='')
#     parser.add_argument("--masked_mel", action='store_true', help='')

#     parser.add_argument("--rec_noise", action='store_true', help='')
#     parser.add_argument("--rec_mask", action='store_true', help='')

#     parser.add_argument("--mel_classification", action='store_true', help='')
#     parser.add_argument("--test_script", action='store_true', help='')

#     parser.add_argument("--no_klt", action='store_true', help='')

#     parser.add_argument("--zt_prior_ce_r_vq", action='store_true', help='')
#     parser.add_argument('--zt_prior_ce_r_vq_factor', type=float, default=1000, help='factor')

#     parser.add_argument("--zt_post_ce_r_vq", action='store_true', help='')

#     parser.add_argument("--zt_prior_ce_kmeans", action='store_true', help='')
#     parser.add_argument('--zt_prior_ce_kmeans_factor', type=float, default=1000, help='factor')

#     parser.add_argument("--zt_post_ce_kmeans", action='store_true', help='')
#     parser.add_argument('--zt_post_ce_kmeans_factor', type=float, default=10, help='factor')

#     parser.add_argument("--zt_prior_ce_alignment", action='store_true', help='')
#     parser.add_argument('--zt_prior_ce_alignment_factor', type=float, default=1000, help='factor')

#     parser.add_argument("--prior_type", type=str, default='None', help='normal, condition, lm')
#     parser.add_argument("--prior_embedding", type=str, default='one-hot', help='one-hot, embedding')
#     parser.add_argument("--prior_mask", action='store_true', help='')

#     parser.add_argument("--wavlm", action='store_true', help='')
#     parser.add_argument("--wavlm_type", type=str, default='base', help='')

#     parser.add_argument("--tts_phn_wav_path", type=str, default='', help='')

#     parser.add_argument("--sr", type=str, default="16000", help='')

#     parser.add_argument("--text", type=str, default="your tts", help='')
#     parser.add_argument("--tts_align", action='store_true', help='')
#     parser.add_argument("--tts_wavlm", action='store_true', help='')
#     parser.add_argument("--tts", action='store_true', help='')
#     parser.add_argument("--tts_config", type=str, default="conf/LibriTTS/preprocess.yaml", help='')
#     parser.add_argument("--tts_target_wav_path", type=str, default='', help='')
#     parser.add_argument("--speed", type=float, default='1.0', help='')

#     parser.add_argument("--train_mapping", action='store_true', help='')
#     parser.add_argument("--mapping_encoder", type=str, default='lstm', help='')
#     parser.add_argument("--mapping_model_path", type=str, default='lstm', help='')
#     parser.add_argument("--mask_mapping", action='store_true', help='')
#     parser.add_argument("--mask_mapping_factor", type=float, default=1, help='')
#     parser.add_argument("--l1_mapping_factor", type=float, default=1, help='')
#     parser.add_argument("--mapping_ratio", type=float, default=1.0, help='')

#     parser.add_argument("--condition2", action='store_true', help='')

#     args = parser.parse_args()
#     return update_args(**vars(args))


# if __name__ == "__main__":
#     # torch.cuda.empty_cache()
#     main()
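
Since the training body of train_tts.py is commented out, a quick way to sanity-check the file is to instantiate the factory above on its own. The snippet below is a sketch only: it assumes the imports at the top of train_tts.py resolve (audio_diffusion_pytorch and models.style_diffusion) and simply reports the U-Net size, mirroring the parameter count printed in the commented-out loop.

# Sketch: instantiate the diffusion U-Net from create_model() and report its size.
model = create_model()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of parameters: {n_params}")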
audio_diffusion_attacks_forhf/src/.DS_Store
ADDED
Binary file (6.15 kB)

audio_diffusion_attacks_forhf/src/__pycache__/losses.cpython-310.pyc
ADDED
Binary file (10.2 kB)

audio_diffusion_attacks_forhf/src/__pycache__/music_gen.cpython-310.pyc
ADDED
Binary file (3.49 kB)

audio_diffusion_attacks_forhf/src/__pycache__/test_encoder_attack.cpython-310.pyc
ADDED
Binary file (5.47 kB)
audio_diffusion_attacks_forhf/src/balancer.py
ADDED
@@ -0,0 +1,137 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import typing as tp

import flashy
import torch
from torch import autograd


class Balancer:
    """Loss balancer.

    The loss balancer combines losses together to compute gradients for the backward.
    Given `y = f(...)`, and a number of losses `l1(y, ...)`, `l2(y, ...)`, with `...`
    not having any dependence on `f`, the balancer can efficiently normalize the partial gradients
    `d l1 / d y`, `d l2 / dy` before summing them in order to achieve a desired ratio between
    the losses. For instance if `weights = {'l1': 2, 'l2': 1}`, 66% of the gradient
    going into `f(...)` will come from `l1` on average, and 33% from `l2`. This allows for an easy
    interpretation of the weights even if the intrinsic scale of `l1`, `l2` ... is unknown.

    Noting `g1 = d l1 / dy`, etc., the balanced gradient `G` will be
    (with `avg` an exponential moving average over the updates),

        G = sum_i total_norm * g_i / avg(||g_i||) * w_i / sum(w_i)

    If `balance_grads` is False, this is deactivated, and instead the gradient will just be the
    standard sum of the partial gradients with the given weights.

    A call to the backward method of the balancer will compute the partial gradients,
    combining all the losses and potentially rescaling the gradients,
    which can help stabilize the training and reason about multiple losses with varying scales.
    The obtained gradient with respect to `y` is then back-propagated to `f(...)`.

    Expected usage:

        weights = {'loss_a': 1, 'loss_b': 4}
        balancer = Balancer(weights, ...)
        losses: dict = {}
        losses['loss_a'] = compute_loss_a(x, y)
        losses['loss_b'] = compute_loss_b(x, y)
        if model.training():
            effective_loss = balancer.backward(losses, x)

    Args:
        weights (dict[str, float]): Weight coefficient for each loss. The balancer expects the keys
            passed to the backward method to match the weights keys, so each provided loss is assigned its weight.
        balance_grads (bool): Whether to rescale gradients so that weights reflect the fraction of the
            overall gradient, rather than a constant multiplier.
        total_norm (float): Reference norm when rescaling gradients, ignored otherwise.
        ema_decay (float): EMA decay for averaging the norms.
        per_batch_item (bool): Whether to compute the averaged norm per batch item or not. This only holds
            when rescaling the gradients.
        epsilon (float): Epsilon value for numerical stability.
        monitor (bool): If True, stores in `self.metrics` the relative ratio between the norm of the gradients
            coming from each loss, when calling `backward()`.
    """
    def __init__(self, weights: tp.Dict[str, float], balance_grads: bool = True, total_norm: float = 1.,
                 ema_decay: float = 0.999, per_batch_item: bool = True, epsilon: float = 1e-12,
                 monitor: bool = False):
        self.weights = weights
        self.per_batch_item = per_batch_item
        self.total_norm = total_norm or 1.
        self.averager = flashy.averager(ema_decay or 1.)
        self.epsilon = epsilon
        self.monitor = monitor
        self.balance_grads = balance_grads
        self._metrics: tp.Dict[str, tp.Any] = {}

    @property
    def metrics(self):
        return self._metrics

    def backward(self, losses: tp.Dict[str, torch.Tensor], input: torch.Tensor) -> torch.Tensor:
        """Compute the backward and return the effective train loss, e.g. the loss obtained from
        computing the effective weights. If `balance_grads` is True, the effective weights
        are the ones that need to be applied to each gradient to respect the desired relative
        scale of gradients coming from each loss.

        Args:
            losses (Dict[str, torch.Tensor]): dictionary with the same keys as `self.weights`.
            input (torch.Tensor): the input of the losses, typically the output of the model.
                This should be the single point of dependence between the losses
                and the model being trained.
        """
        norms = {}
        grads = {}
        for name, loss in losses.items():
            # Compute partial derivative of the loss with respect to the input.
            grad, = autograd.grad(loss, [input], retain_graph=True)
            if self.per_batch_item:
                # We do not average the gradient over the batch dimension.
                dims = tuple(range(1, grad.dim()))
                norm = grad.norm(dim=dims, p=2).mean()
            else:
                norm = grad.norm(p=2)
            norms[name] = norm
            grads[name] = grad

        count = 1
        if self.per_batch_item:
            count = len(grad)
        # Average norms across workers. Theoretically we should average the
        # squared norm, then take the sqrt, but it worked fine like that.
        avg_norms = flashy.distrib.average_metrics(self.averager(norms), count)
        # We approximate the total norm of the gradient as the sums of the norms.
        # Obviously this can be very incorrect if all gradients are aligned, but it works fine.
        total = sum(avg_norms.values())

        self._metrics = {}
        if self.monitor:
            # Store the ratio of the total gradient represented by each loss.
            for k, v in avg_norms.items():
                self._metrics[f'ratio_{k}'] = v / total

        total_weights = sum([self.weights[k] for k in avg_norms])
        assert total_weights > 0.
        desired_ratios = {k: w / total_weights for k, w in self.weights.items()}

        out_grad = torch.zeros_like(input)
        effective_loss = torch.tensor(0., device=input.device, dtype=input.dtype)
        for name, avg_norm in avg_norms.items():
            if self.balance_grads:
                # g_balanced = g / avg(||g||) * total_norm * desired_ratio
                scale = desired_ratios[name] * self.total_norm / (self.epsilon + avg_norm)
            else:
                # We just do regular weighted sum of the gradients.
                scale = self.weights[name]
            out_grad.add_(grads[name], alpha=scale)
            effective_loss += scale * losses[name].detach()
        # Send the computed partial derivative with respect to the output of the model to the model.
        input.backward(out_grad)
        return effective_loss
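
The "Expected usage" snippet in the docstring can be expanded into a runnable toy example. Everything below (the linear model, the two losses, the weights) is illustrative only, and it assumes flashy's distributed helpers fall back gracefully when torch.distributed is not initialized.

# Toy example of Balancer.backward on a dummy model (illustrative only).
import torch
from torch import nn

model = nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x = torch.randn(4, 8)
target = torch.randn(4, 8)

optimizer.zero_grad()
y = model(x)  # single point of dependence between the losses and the model
losses = {
    "loss_a": nn.functional.l1_loss(y, target),
    "loss_b": nn.functional.mse_loss(y, target),
}
balancer = Balancer(weights={"loss_a": 1.0, "loss_b": 4.0}, monitor=True)
effective_loss = balancer.backward(losses, y)  # back-propagates the balanced gradient into the model
optimizer.step()
print(effective_loss.item(), balancer.metrics)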
audio_diffusion_attacks_forhf/src/losses.py
ADDED
@@ -0,0 +1,329 @@
# https://github.com/descriptinc/descript-audio-codec/blob/main/dac/nn/loss.py

import typing
from typing import List

import torch
import torch.nn.functional as F
from audiotools import AudioSignal
from audiotools import STFTParams
from torch import nn


class L1Loss(nn.L1Loss):
    """L1 Loss between AudioSignals. Defaults
    to comparing ``audio_data``, but any
    attribute of an AudioSignal can be used.

    Parameters
    ----------
    attribute : str, optional
        Attribute of signal to compare, defaults to ``audio_data``.
    weight : float, optional
        Weight of this loss, defaults to 1.0.

    Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/distance.py
    """

    def __init__(self, attribute: str = "audio_data", weight: float = 1.0, **kwargs):
        self.attribute = attribute
        self.weight = weight
        super().__init__(**kwargs)

    def forward(self, x: AudioSignal, y: AudioSignal):
        """
        Parameters
        ----------
        x : AudioSignal
            Estimate AudioSignal
        y : AudioSignal
            Reference AudioSignal

        Returns
        -------
        torch.Tensor
            L1 loss between AudioSignal attributes.
        """
        if isinstance(x, AudioSignal):
            x = getattr(x, self.attribute)
            y = getattr(y, self.attribute)
        return super().forward(x, y)


class SISDRLoss(nn.Module):
    """
    Computes the Scale-Invariant Source-to-Distortion Ratio between a batch
    of estimated and reference audio signals or aligned features.

    Parameters
    ----------
    scaling : bool, optional
        Whether to use scale-invariant (True) or
        signal-to-noise ratio (False), by default True
    reduction : str, optional
        How to reduce across the batch (either 'mean',
        'sum', or 'none'), by default 'mean'
    zero_mean : bool, optional
        Zero mean the references and estimates before
        computing the loss, by default True
    clip_min : float, optional
        The minimum possible loss value. Helps network
        to not focus on making already good examples better, by default None
    weight : float, optional
        Weight of this loss, defaults to 1.0.

    Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/distance.py
    """

    def __init__(
        self,
        scaling: int = True,
        reduction: str = "mean",
        zero_mean: int = True,
        clip_min: int = None,
        weight: float = 1.0,
    ):
        self.scaling = scaling
        self.reduction = reduction
        self.zero_mean = zero_mean
        self.clip_min = clip_min
        self.weight = weight
        super().__init__()

    def forward(self, x: AudioSignal, y: AudioSignal):
        eps = 1e-8
        # nb, nc, nt
        if isinstance(x, AudioSignal):
            references = x.audio_data
            estimates = y.audio_data
        else:
            references = x
            estimates = y

        nb = references.shape[0]
        references = references.reshape(nb, 1, -1).permute(0, 2, 1)
        estimates = estimates.reshape(nb, 1, -1).permute(0, 2, 1)

        # samples now on axis 1
        if self.zero_mean:
            mean_reference = references.mean(dim=1, keepdim=True)
            mean_estimate = estimates.mean(dim=1, keepdim=True)
        else:
            mean_reference = 0
            mean_estimate = 0

        _references = references - mean_reference
        _estimates = estimates - mean_estimate

        references_projection = (_references**2).sum(dim=-2) + eps
        references_on_estimates = (_estimates * _references).sum(dim=-2) + eps

        scale = (
            (references_on_estimates / references_projection).unsqueeze(1)
            if self.scaling
            else 1
        )

        e_true = scale * _references
        e_res = _estimates - e_true

        signal = (e_true**2).sum(dim=1)
        noise = (e_res**2).sum(dim=1)
        sdr = -10 * torch.log10(signal / noise + eps)

        if self.clip_min is not None:
            sdr = torch.clamp(sdr, min=self.clip_min)

        if self.reduction == "mean":
            sdr = sdr.mean()
        elif self.reduction == "sum":
            sdr = sdr.sum()
        return sdr


class MultiScaleSTFTLoss(nn.Module):
    """Computes the multi-scale STFT loss from [1].

    Parameters
    ----------
    window_lengths : List[int], optional
        Length of each window of each STFT, by default [2048, 512]
    loss_fn : typing.Callable, optional
        How to compare each loss, by default nn.L1Loss()
    clamp_eps : float, optional
        Clamp on the log magnitude, below, by default 1e-5
    mag_weight : float, optional
        Weight of raw magnitude portion of loss, by default 1.0
    log_weight : float, optional
        Weight of log magnitude portion of loss, by default 1.0
    pow : float, optional
        Power to raise magnitude to before taking log, by default 2.0
    weight : float, optional
        Weight of this loss, by default 1.0
    match_stride : bool, optional
        Whether to match the stride of convolutional layers, by default False

    References
    ----------

    1.  Engel, Jesse, Chenjie Gu, and Adam Roberts.
        "DDSP: Differentiable Digital Signal Processing."
        International Conference on Learning Representations. 2019.

    Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/spectral.py
    """

    def __init__(
        self,
        window_lengths: List[int] = [2048, 512],
        loss_fn: typing.Callable = nn.L1Loss(),
        clamp_eps: float = 1e-5,
        mag_weight: float = 1.0,
        log_weight: float = 1.0,
        pow: float = 2.0,
        weight: float = 1.0,
        match_stride: bool = False,
        window_type: str = None,
    ):
        super().__init__()
        self.stft_params = [
            STFTParams(
                window_length=w,
                hop_length=w // 4,
                match_stride=match_stride,
                window_type=window_type,
            )
            for w in window_lengths
        ]
        self.loss_fn = loss_fn
        self.log_weight = log_weight
        self.mag_weight = mag_weight
        self.clamp_eps = clamp_eps
        self.weight = weight
        self.pow = pow

    def forward(self, x: AudioSignal, y: AudioSignal):
        """Computes multi-scale STFT between an estimate and a reference
        signal.

        Parameters
        ----------
        x : AudioSignal
            Estimate signal
        y : AudioSignal
            Reference signal

        Returns
        -------
        torch.Tensor
            Multi-scale STFT loss.
        """
        loss = 0.0
        for s in self.stft_params:
            x.stft(s.window_length, s.hop_length, s.window_type)
            y.stft(s.window_length, s.hop_length, s.window_type)
            loss += self.log_weight * self.loss_fn(
                x.magnitude.clamp(self.clamp_eps).pow(self.pow).log10(),
                y.magnitude.clamp(self.clamp_eps).pow(self.pow).log10(),
            )
            loss += self.mag_weight * self.loss_fn(x.magnitude, y.magnitude)
        return loss


class MelSpectrogramLoss(nn.Module):
    """Compute distance between mel spectrograms. Can be used
    in a multi-scale way.

    Parameters
    ----------
    n_mels : List[int]
        Number of mels per STFT, by default [150, 80],
    window_lengths : List[int], optional
        Length of each window of each STFT, by default [2048, 512]
    loss_fn : typing.Callable, optional
        How to compare each loss, by default nn.L1Loss()
    clamp_eps : float, optional
        Clamp on the log magnitude, below, by default 1e-5
    mag_weight : float, optional
        Weight of raw magnitude portion of loss, by default 1.0
    log_weight : float, optional
        Weight of log magnitude portion of loss, by default 1.0
    pow : float, optional
        Power to raise magnitude to before taking log, by default 2.0
    weight : float, optional
        Weight of this loss, by default 1.0
    match_stride : bool, optional
        Whether to match the stride of convolutional layers, by default False

    Implementation copied from: https://github.com/descriptinc/lyrebird-audiotools/blob/961786aa1a9d628cca0c0486e5885a457fe70c1a/audiotools/metrics/spectral.py
    """

    def __init__(
        self,
        n_mels: List[int] = [150, 80],
        window_lengths: List[int] = [2048, 512],
        loss_fn: typing.Callable = nn.MSELoss(),
        clamp_eps: float = 1e-5,
        mag_weight: float = 1.0,
        log_weight: float = 1.0,
        pow: float = 2.0,
        weight: float = 1.0,
        match_stride: bool = False,
        mel_fmin: List[float] = [0.0, 0.0],
        mel_fmax: List[float] = [None, None],
        window_type: str = None,
    ):
        super().__init__()
        self.stft_params = [
            STFTParams(
                window_length=w,
                hop_length=w // 4,
                match_stride=match_stride,
                window_type=window_type,
            )
            for w in window_lengths
        ]
        self.n_mels = n_mels
        self.loss_fn = loss_fn
        self.clamp_eps = clamp_eps
        self.log_weight = log_weight
        self.mag_weight = mag_weight
        self.weight = weight
        self.mel_fmin = mel_fmin
        self.mel_fmax = mel_fmax
        self.pow = pow

    def forward(self, x: AudioSignal, y: AudioSignal):
        """Computes mel loss between an estimate and a reference
        signal.

        Parameters
        ----------
        x : AudioSignal
            Estimate signal
        y : AudioSignal
            Reference signal

        Returns
        -------
        torch.Tensor
            Mel loss.
        """
        loss = 0.0
        for n_mels, fmin, fmax, s in zip(
            self.n_mels, self.mel_fmin, self.mel_fmax, self.stft_params
        ):
            kwargs = {
                "window_length": s.window_length,
                "hop_length": s.hop_length,
                "window_type": s.window_type,
            }
            x_mels = x.mel_spectrogram(n_mels, mel_fmin=fmin, mel_fmax=fmax, **kwargs)
            y_mels = y.mel_spectrogram(n_mels, mel_fmin=fmin, mel_fmax=fmax, **kwargs)

            loss += self.log_weight * self.loss_fn(
                x_mels.clamp(self.clamp_eps).pow(self.pow).log10(),
                y_mels.clamp(self.clamp_eps).pow(self.pow).log10(),
            )
            loss += self.mag_weight * self.loss_fn(x_mels, y_mels)
        return loss
audio_diffusion_attacks_forhf/src/music_gen.py
ADDED
@@ -0,0 +1,100 @@
from transformers import AutoProcessor, MusicgenForConditionalGeneration
# Andy removed: from datasets import load_dataset
import torchaudio
import torch
# Andy edited: import losses
import audio_diffusion_attacks_forhf.src.losses as losses
from audiotools import AudioSignal

class MusicGenEval:

    def __init__(self, input_sample_rate, audio_steps):
        model_name="facebook/musicgen-stereo-small"
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = MusicgenForConditionalGeneration.from_pretrained(model_name)
        self.model=self.model.to(device='cuda')
        self.input_sample_rate=input_sample_rate
        self.audio_steps=audio_steps
        self.mel_loss = losses.MelSpectrogramLoss(n_mels=[5, 10, 20, 40, 80, 160, 320],
                                                  window_lengths=[32, 64, 128, 256, 512, 1024, 2048],
                                                  mel_fmin=[0, 0, 0, 0, 0, 0, 0],
                                                  pow=1.0,
                                                  clamp_eps=1.0e-5,
                                                  mag_weight=0.0)

    def eval(self, original_audio, protected_audio):
        original_audio=original_audio[:, :, :self.audio_steps]
        protected_audio=protected_audio[:, :, :self.audio_steps]
        input_len=original_audio.shape[-1]

        unprotected_gen=self.generate_audio(original_audio)[0].to(device='cuda')
        protected_gen=self.generate_audio(protected_audio)[0].to(device='cuda')

        eval_dict={}
        # Difference between original and unprotected gen
        eval_dict["original_unprotectedgen_l1"]=torch.mean(torch.abs(original_audio-unprotected_gen[:, :input_len]))
        eval_dict["original_unprotectedgen_mel"]=self.mel_loss(AudioSignal(original_audio, self.input_sample_rate), AudioSignal(unprotected_gen[:, :input_len], self.input_sample_rate))
        # Difference between original and protected gen
        eval_dict["original_protectedgen_l1"]=torch.mean(torch.abs(original_audio-protected_gen[:, :input_len]))
        eval_dict["original_protectedgen_mel"]=self.mel_loss(AudioSignal(original_audio, self.input_sample_rate), AudioSignal(protected_gen[:, :input_len], self.input_sample_rate))
        # Difference between protected and protected gen
        eval_dict["protected_protectedgen_l1"]=torch.mean(torch.abs(protected_audio-protected_gen[:, :input_len]))
        eval_dict["protected_protectedgen_mel"]=self.mel_loss(AudioSignal(protected_audio, self.input_sample_rate), AudioSignal(protected_gen[:, :input_len], self.input_sample_rate))
        # Difference between unprotected gen and protected gen
        eval_dict["protectedgen_unprotectedgen_l1"]=torch.mean(torch.abs(protected_gen-unprotected_gen))
        eval_dict["protectedgen_unprotectedgen_mel"]=self.mel_loss(AudioSignal(protected_gen, self.input_sample_rate), AudioSignal(unprotected_gen, self.input_sample_rate))
        return eval_dict, unprotected_gen, protected_gen

    def generate_audio(self, audio):
        torch.manual_seed(0)

        transform = torchaudio.transforms.Resample(self.input_sample_rate, 32000).to(device='cuda')
        waveform=transform(audio[0]).detach().cpu()
        # waveform.clamp_(0,1)
        a=torch.min(waveform)
        b=torch.max(waveform)
        c=waveform.isnan().any()
        # sample = processor(raw_audio=waveform, sampling_rate=48000, return_tensors="pt")

        inputs = self.processor(
            audio=waveform,
            sampling_rate=32000,
            text=["music"],
            padding=True,
            return_tensors="pt",
        )
        for d in inputs.data:
            inputs.data[d]=inputs.data[d].to(device='cuda')
        audio_values = self.model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=1024)

        transform = torchaudio.transforms.Resample(32000, self.input_sample_rate).to(device='cuda')
        audio_values=transform(audio_values)
        return audio_values

model_name="facebook/musicgen-stereo-small"
processor = AutoProcessor.from_pretrained(model_name)
model = MusicgenForConditionalGeneration.from_pretrained(model_name).to(device='cuda')

'''Andy commented:
song_name="Texas Sun"
waveform, sample_rate = torchaudio.load(f"test_audio/{song_name}.mp3")
waveform=waveform[:, :500000]
torch.manual_seed(0)
transform = torchaudio.transforms.Resample(sample_rate, 32000)
waveform=transform(waveform)
# sample = processor(raw_audio=waveform, sampling_rate=48000, return_tensors="pt")

inputs = processor(
    audio=waveform,
    sampling_rate=32000,
    text=["music"],
    padding=True,
    return_tensors="pt",
)
for d in inputs.data:
    inputs.data[d]=inputs.data[d].to(device='cuda')
audio_values = model.generate(**inputs, do_sample=False, guidance_scale=3, max_new_tokens=512, top_k=0, top_p=250)
torchaudio.save(f"test_audio/perturbed/{model_name[9:]}_{song_name}.mp3", audio_values.detach().cpu()[0], 32000)

u=0
'''
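
A hedged sketch of how MusicGenEval might be driven end to end; the audio path is hypothetical, the (1, channels, samples) layout is inferred from the slicing in eval(), and a CUDA device plus the Hugging Face MusicGen weights are required.

# Sketch: compare MusicGen continuations of an original clip and a perturbed ("protected") copy.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("test_audio/example.mp3")  # hypothetical file, shape (channels, samples)
original = waveform.unsqueeze(0).to("cuda")                        # (1, channels, samples), as indexed in eval()
protected = (original + 0.001 * torch.randn_like(original)).clamp(-1, 1)  # stand-in for a protected version

evaluator = MusicGenEval(input_sample_rate=sample_rate, audio_steps=5 * sample_rate)  # first five seconds
eval_dict, unprotected_gen, protected_gen = evaluator.eval(original, protected)
for name, value in eval_dict.items():
    print(name, float(value))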
audio_diffusion_attacks_forhf/src/speech_inference.py
ADDED
@@ -0,0 +1,94 @@
import torch
from TTS.api import TTS
#Andy edited: import losses
import audio_diffusion_attacks_forhf.src.losses as losses
from audiotools import AudioSignal
import numpy as np
import torchaudio
import random
import string
import os

class XTTS_Eval:

    def __init__(self, input_sample_rate, text="The quick brown fox jumps over the lazy dog."):
        self.model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
        self.model = self.model.to(device='cuda')
        self.text = text
        self.input_sample_rate = input_sample_rate
        # Multi-scale mel-spectrogram loss used for all perceptual comparisons below
        self.mel_loss = losses.MelSpectrogramLoss(n_mels=[5, 10, 20, 40, 80, 160, 320],
                                                  window_lengths=[32, 64, 128, 256, 512, 1024, 2048],
                                                  mel_fmin=[0, 0, 0, 0, 0, 0, 0],
                                                  pow=1.0,
                                                  clamp_eps=1.0e-5,
                                                  mag_weight=0.0)

    def eval(self, original_audio, protected_audio):

        original_audio = original_audio[0]
        protected_audio = protected_audio[0]

        unprotected_gen = self.generate_audio(original_audio).to(device='cuda')
        protected_gen = self.generate_audio(protected_audio).to(device='cuda')

        # Trim references and generations to a common length before comparing
        match_len = min(original_audio.shape[1], unprotected_gen.shape[1])
        if original_audio.shape[1] < unprotected_gen.shape[1]:
            s_unprotected_gen = unprotected_gen[:, :match_len]
            s_protected_gen = protected_gen[:, :match_len]
            s_original_audio = original_audio
            s_protected_audio = protected_audio
        else:
            s_unprotected_gen = unprotected_gen
            s_protected_gen = protected_gen
            s_original_audio = original_audio[:, :match_len]
            s_protected_audio = protected_audio[:, :match_len]

        match_len = min(protected_gen.shape[1], unprotected_gen.shape[1])
        protected_gen = protected_gen[:, :match_len]
        unprotected_gen = unprotected_gen[:, :match_len]

        eval_dict = {}
        # Difference between original and unprotected gen
        eval_dict["original_unprotectedgen_l1"] = torch.mean(torch.abs(s_original_audio - s_unprotected_gen))
        eval_dict["original_unprotectedgen_mel"] = self.mel_loss(AudioSignal(s_original_audio, self.input_sample_rate), AudioSignal(s_unprotected_gen, self.input_sample_rate))
        # Difference between original and protected gen
        eval_dict["original_protectedgen_l1"] = torch.mean(torch.abs(s_original_audio - s_protected_gen))
        eval_dict["original_protectedgen_mel"] = self.mel_loss(AudioSignal(s_original_audio, self.input_sample_rate), AudioSignal(s_protected_gen, self.input_sample_rate))
        # Difference between protected and protected gen
        eval_dict["protected_protectedgen_l1"] = torch.mean(torch.abs(s_protected_audio - s_protected_gen))
        eval_dict["protected_protectedgen_mel"] = self.mel_loss(AudioSignal(s_protected_audio, self.input_sample_rate), AudioSignal(s_protected_gen, self.input_sample_rate))
        # Difference between unprotected gen and protected gen
        eval_dict["protectedgen_unprotectedgen_l1"] = torch.mean(torch.abs(protected_gen - unprotected_gen))
        eval_dict["protectedgen_unprotectedgen_mel"] = self.mel_loss(AudioSignal(protected_gen, self.input_sample_rate), AudioSignal(unprotected_gen, self.input_sample_rate))
        return eval_dict, unprotected_gen, protected_gen

    def generate_audio(self, audio):
        # XTTS voice cloning needs a speaker reference on disk, so write a temporary wav
        random_str = ''.join(random.choices(string.ascii_uppercase + string.digits, k=50))
        torchaudio.save(f"test_audio/{random_str}.wav", torch.reshape(audio.detach().cpu(), (2, audio.shape[1])), self.input_sample_rate, format="wav")
        torch.manual_seed(0)

        wav = self.model.tts(text=self.text,
                             speaker_wav=f"test_audio/{random_str}.wav",
                             language="en")
        os.remove(f"test_audio/{random_str}.wav")
        wav = torch.from_numpy(np.array(wav))
        # Duplicate the mono XTTS output into two channels
        stereo_wave = torch.zeros((2, wav.shape[0]))
        stereo_wave[:, :] = wav

        # XTTS v2 outputs 24 kHz audio; resample back to the input sample rate
        transform = torchaudio.transforms.Resample(24000, self.input_sample_rate)
        stereo_wave = transform(stereo_wave)
        return stereo_wave

# # Init TTS
# tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
#
# # Run TTS
# # ❗ Since this is a multi-lingual voice cloning model, we must set the target speaker_wav and language
# # Text to speech list of amplitude values as output
# # wav = tts.tts(text="Hello world!", speaker_wav=, language="en")
# # Text to speech to a file
# tts.tts_to_file(text="Hello world!",
#                 speaker_wav="/media/willie/1caf5422-4135-4f2c-9619-c44041b51146/audio_data/DS_10283_3443/VCTK-Corpus-0.92/wav48_silence_trimmed/p227/p227_023_mic1.flac",
#                 language="en",
#                 file_path="/home/willie/eclipse-workspace/audio_diffusion_attacks/src/test_audio/speech/output.wav")
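A minimal sketch of how XTTS_Eval might be driven end to end: the wav file names below are placeholders, the batched (1, 2, T) input shape is inferred from the [0] indexing in eval(), and a writable test_audio/ directory is assumed because generate_audio writes a temporary speaker-reference file there.

# Illustrative usage sketch (assumptions: placeholder files "original.wav" and "protected.wav",
# both stereo at the same sample rate, and an existing test_audio/ directory).
import torchaudio
from audio_diffusion_attacks_forhf.src.speech_inference import XTTS_Eval

original, sr = torchaudio.load("original.wav")    # shape: (2, T)
protected, _ = torchaudio.load("protected.wav")   # shape: (2, T)

evaluator = XTTS_Eval(input_sample_rate=sr)

# eval() indexes [0] on its arguments, so pass batched tensors of shape (1, 2, T) on the GPU
eval_dict, unprotected_gen, protected_gen = evaluator.eval(
    original.unsqueeze(0).to("cuda"),
    protected.unsqueeze(0).to("cuda"),
)

for name, value in eval_dict.items():
    print(f"{name}: {float(value):.4f}")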
audio_diffusion_attacks_forhf/src/test_audio/.Il Sogno Del Marinaio - Nanos' Waltz.mp3.icloud
ADDED
Binary file (192 Bytes).