sander-wood committed · Commit b27ec7f · verified · 1 Parent(s): 961d3b5

Update README.md

Files changed (1)
  1. README.md +93 -30
README.md CHANGED
@@ -104,7 +104,7 @@ tags:
104
  ---
105
  # **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**
106
  [![Homepage](https://img.shields.io/badge/CLaMP%203%20Homepage-GitHub-181717?style=for-the-badge&logo=home-assistant)](https://sanderwood.github.io/clamp3/)
107
- [![Paper](https://img.shields.io/badge/CLaMP%203%20Paper-Coming%20Soon-lightgrey?style=for-the-badge&logo=arxiv)](#)
108
  [![GitHub](https://img.shields.io/badge/CLaMP%203%20Code-GitHub-181717?style=for-the-badge&logo=github)](https://github.com/sanderwood/clamp3)
109
  [![Demo](https://img.shields.io/badge/CLaMP%203%20Demo-Gradio-green?style=for-the-badge&logo=gradio)](https://huggingface.co/spaces/sander-wood/clamp3)
110
  [![Hugging Face](https://img.shields.io/badge/Model%20Weights-Hugging%20Face-ffcc00?style=for-the-badge&logo=huggingface)](https://huggingface.co/sander-wood/clamp3/tree/main)
@@ -128,26 +128,30 @@ CLaMP 3 is a multimodal and multilingual framework for music information retriev
128
  - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.
129
 
130
  - **Datasets & Benchmarking:**
131
- - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A **web-scale** dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
132
  - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.
133
 
134
- ### **Applications**
135
- CLaMP 3 supports a **wide range of music research tasks**, including but not limited to:
136
- - **Semantic Retrieval:** Find music based on **descriptions** or retrieve textual metadata for **audio or symbolic** inputs.
137
- - **Zero-Shot Classification:** Categorize **music by genre, region, or other attributes** without labeled training data.
138
- - **Music Quality Assessment:** Compute the **semantic distance** between reference and generated music features, similar to **Fréchet Inception Distance (FID)**.
139
- - **Cross-Modal Generative Model Evaluation:** Assess **text-to-music generation, music captioning**, and **symbolic-to-audio synthesis** models.
140
- - **Computational Musicology:** By visualizing the distribution of data within the **shared representation space**, researchers can explore regional music patterns, stylistic similarities, and cross-cultural influences.
141
 
142
- Importantly, these applications are **not restricted to any specific music modality or language**, making CLaMP 3 a powerful tool for **diverse music AI research**.
143
 
144
  ## **Repository Structure**
145
  - **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
146
  - **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.
147
- - **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Convert data into **Interleaved ABC, MTF, or MERT-extracted features**.
148
  - **[retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval)** → Semantic search, retrieval evaluation, and similarity calculations.
149
 
150
- > **Note:** Ensure the model weights are placed in the `code/` folder, and verify the **configuration hyperparameters** before use.
151
 
152
  ## **Getting Started**
153
  ### **Environment Setup**
@@ -161,7 +165,7 @@ conda activate clamp3
161
  #### **1. Convert Music Data to Compatible Formats**
162
  Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.
163
 
164
- > **Note:** Each script requires a manual edit of the `input_dir` variable at the top of the file before running, **except for the MERT extraction script (`extract_mert.py`), which takes command-line arguments for input and output paths.**
165
 
166
  ##### **1.1 Convert MusicXML to Interleaved ABC Notation**
167
 
@@ -182,7 +186,7 @@ python batch_interleaved_abc.py
182
  - **Output:** `.abc` *(Interleaved ABC for CLaMP 3)*
183
 
184
  ##### **1.2 Convert MIDI to MTF Format**
185
- CLaMP 3 processes **performance signals** in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):
186
 
187
  ```bash
188
  python batch_midi2mtf.py
@@ -191,7 +195,7 @@ python batch_midi2mtf.py
191
  - **Output:** `.mtf` *(MTF for CLaMP 3)*
192
 
193
  ##### **1.3 Extract Audio Features using MERT**
194
- For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract **MERT-based features** from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):
195
 
196
  ```bash
197
  python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path m-a-p/MERT-v1-95M --mean_features
@@ -199,28 +203,37 @@ python extract_mert.py --input_path <input_path> --output_path <output_path> --m
199
  - **Input:** `.mp3`, `.wav`
200
  - **Output:** `.npy` *(Processed audio features for CLaMP 3)*
201
 
202
- ### **Training and Feature Extraction**
203
- #### **1. Training Models**
204
- Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters and data paths**.
 
205
 
206
- To train CLaMP 3 on **symbolic music**, use **[train_clamp3_symbolic.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_symbolic.py)**:
207
 
 
 
 
208
  ```bash
209
  python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
210
  ```
211
-
212
- For **audio data**, use **[train_clamp3_audio.py](https://github.com/sanderwood/clamp3/blob/main/code/train_clamp3_audio.py)**:
213
-
214
  ```bash
215
  python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
216
  ```
217
 
218
- Alternatively, you can use **pre-trained weights**:
219
- - **[CLaMP 3 SAAS (Optimal for Audio)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth)**
220
- - **[CLaMP 3 C2 (Optimal for Symbolic Music)](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth)**
221
 
222
- By default, CLaMP 3 is configured for the **SAAS version**, which provides **optimal performance on audio data**. If working primarily with **symbolic music**, download the **C2 variant** and modify **line 66 in `config.py`** from **saas** to **c2**.
 
 
 
223
 
 
 
 
 
 
224
  #### **2. Feature Extraction**
225
  After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):
226
 
@@ -237,12 +250,50 @@ All extracted features are stored as `.npy` files.
237
  > **Note**: For retrieval, `--get_global` must be used. Without it, CLaMP 3 will not work correctly for retrieval tasks. Omit `--get_global` only if you are performing downstream fine-tuning or need raw feature extraction for custom tasks.
238
 
239
  ### **Retrieval and Classification**
240
- #### **1. Semantic Search**
241
- Retrieve **similar music features** using **[`semantic_search.py`](https://github.com/sanderwood/clamp3/tree/main/retrieval/semantic_search.py)**:
 
 
 
 
242
  ```bash
243
  python semantic_search.py <query_file> <reference_folder> [--top_k TOP_K]
244
  ```
245
- > **Note:** Zero-shot classification is essentially **semantic search**, where the query feature is compared against class prototypes.
246
 
247
  #### **2. Classification**
248
  Train a linear classifier using **[`train_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/train_cls.py)**:
@@ -255,4 +306,16 @@ python inference_cls.py <weights_path> <feature_folder> <output_file>
255
  ```
256
 
257
  ## **Citation**
258
- *Coming Soon...*
104
  ---
105
  # **CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages**
106
  [![Homepage](https://img.shields.io/badge/CLaMP%203%20Homepage-GitHub-181717?style=for-the-badge&logo=home-assistant)](https://sanderwood.github.io/clamp3/)
107
+ [![Paper](https://img.shields.io/badge/CLaMP%203%20Paper-Arxiv-red?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2502.10362)
108
  [![GitHub](https://img.shields.io/badge/CLaMP%203%20Code-GitHub-181717?style=for-the-badge&logo=github)](https://github.com/sanderwood/clamp3)
109
  [![Demo](https://img.shields.io/badge/CLaMP%203%20Demo-Gradio-green?style=for-the-badge&logo=gradio)](https://huggingface.co/spaces/sander-wood/clamp3)
110
  [![Hugging Face](https://img.shields.io/badge/Model%20Weights-Hugging%20Face-ffcc00?style=for-the-badge&logo=huggingface)](https://huggingface.co/sander-wood/clamp3/tree/main)
 
128
  - Trained on **27 languages** and generalizes to all **100 languages** supported by **[XLM-R](https://arxiv.org/abs/1911.02116)**.
129
 
130
  - **Datasets & Benchmarking:**
131
+ - **[M4-RAG](https://huggingface.co/datasets/sander-wood/m4-rag):** A web-scale dataset of **2.31M high-quality music-text pairs** across 27 languages and 194 countries.
132
  - **[WikiMT-X](https://huggingface.co/datasets/sander-wood/wikimt-x):** A MIR benchmark containing **1,000 triplets** of sheet music, audio, and diverse text annotations.
133
 
134
+ ### **What Can CLaMP 3 Do?**
 
 
 
 
 
 
135
 
136
+ CLaMP 3 unifies diverse music data and text into a shared representation space, enabling the following key capabilities:
137
+
138
+ - **Text-to-Music Retrieval**: Finds relevant music based on text descriptions in 100 languages.
139
+ - **Image-to-Music Retrieval**: Finds music that matches the scene depicted in an image.
140
+ - **Cross-Modal Music Retrieval**: Enables music retrieval and recommendation across different modalities.
141
+ - **Zero-Shot Music Classification**: Identifies musical attributes such as genres, moods, and styles without labeled training data.
142
+ - **Music Semantic Similarity Evaluation**: Measures semantic similarity between:
143
+ - **Generated music and its text prompt**, validating how well text-to-music models follow instructions.
144
+ - **Generated music and reference music**, assessing their semantic similarity, including aspects like style, instrumentation, and musicality.
145
+
146
+ For examples demonstrating these capabilities, visit the [CLaMP 3 Homepage](https://sanderwood.github.io/clamp3/).
147
 
148
  ## **Repository Structure**
149
  - **[code/](https://github.com/sanderwood/clamp3/tree/main/code)** → Training & feature extraction scripts.
150
  - **[classification/](https://github.com/sanderwood/clamp3/tree/main/classification)** → Linear classification training and prediction.
151
+ - **[preprocessing/](https://github.com/sanderwood/clamp3/tree/main/preprocessing)** → Convert data into Interleaved ABC, MTF, or MERT-extracted features.
152
  - **[retrieval/](https://github.com/sanderwood/clamp3/tree/main/retrieval)** → Semantic search, retrieval evaluation, and similarity calculations.
153
 
154
+ > **Note:** Ensure the model weights are placed in the `code/` folder, and verify the configuration hyperparameters before use.
155
 
156
  ## **Getting Started**
157
  ### **Environment Setup**
 
165
  #### **1. Convert Music Data to Compatible Formats**
166
  Before using CLaMP 3, preprocess **MusicXML files** into **Interleaved ABC**, **MIDI files** into **MTF**, and **audio files** into **MERT-extracted features**.
167
 
168
+ > **Note:** Each script requires a manual edit of the `input_dir` variable at the top of the file before running, except for the MERT extraction script (`extract_mert.py`), which takes command-line arguments for input and output paths.
169
 
170
  ##### **1.1 Convert MusicXML to Interleaved ABC Notation**
171
 
 
186
  - **Output:** `.abc` *(Interleaved ABC for CLaMP 3)*
187
 
188
  ##### **1.2 Convert MIDI to MTF Format**
189
+ CLaMP 3 processes performance signals in **MIDI Text Format (MTF)**. Convert **MIDI files** (`.mid`, `.midi`) into **MTF format** using [`batch_midi2mtf.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/midi/batch_midi2mtf.py):
190
 
191
  ```bash
192
  python batch_midi2mtf.py
 
195
  - **Output:** `.mtf` *(MTF for CLaMP 3)*
196
 
197
  ##### **1.3 Extract Audio Features using MERT**
198
+ For audio processing, CLaMP 3 uses **MERT-extracted features** instead of raw waveforms. Extract MERT-based features from raw audio (`.mp3`, `.wav`) using [`extract_mert.py`](https://github.com/sanderwood/clamp3/blob/main/preprocessing/audio/extract_mert.py):
199
 
200
  ```bash
201
  python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path m-a-p/MERT-v1-95M --mean_features
 
203
  - **Input:** `.mp3`, `.wav`
204
  - **Output:** `.npy` *(Processed audio features for CLaMP 3)*
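
When the extraction is driven from a Python pipeline rather than the shell, the same call can be assembled programmatically. The sketch below only rebuilds the command line shown above with the flags it already uses; the helper names `build_mert_command` and `extract_mert` are hypothetical, and it assumes `extract_mert.py` is in the current working directory.

```python
import subprocess
import sys
from pathlib import Path


def build_mert_command(input_path, output_path,
                       model_path="m-a-p/MERT-v1-95M",
                       mean_features=True,
                       script="extract_mert.py"):
    """Return the argument list for the extract_mert.py call shown above."""
    cmd = [
        sys.executable, script,
        "--input_path", str(Path(input_path)),
        "--output_path", str(Path(output_path)),
        "--model_path", model_path,
    ]
    if mean_features:
        cmd.append("--mean_features")
    return cmd


def extract_mert(input_path, output_path, **kwargs):
    """Run the extraction; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_mert_command(input_path, output_path, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.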
205
 
206
+ ### **Training and Feature Extraction**
207
+
208
+ #### **1. Training Models**
209
+ The pre-trained CLaMP 3 weights are sufficient for most retrieval use cases, so retraining is usually not needed. If you do need to train on your own data, follow these steps.
210
 
211
+ 1. Modify **[config.py](https://github.com/sanderwood/clamp3/blob/main/code/config.py)** to adjust **hyperparameters** and **data paths**.
212
 
213
+ 2. Train on your own data.
214
+
215
+ To train CLaMP 3 on **symbolic music** (e.g., sheet music, MIDI), run:
216
  ```bash
217
  python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_symbolic.py
218
  ```
219
+ To train on **audio data**, run:
 
 
220
  ```bash
221
  python -m torch.distributed.launch --nproc_per_node=<GPUs> --use_env train_clamp3_audio.py
222
  ```
223
 
224
+ ##### **Using Pre-Trained Models (Recommended)**
225
+ For most use cases, it's best to use pre-trained weights instead of training from scratch.
 
226
 
227
+ | Version | Best for | Download Link |
228
+ |---------|---------|--------------|
229
+ | **CLaMP 3 SAAS** | **Audio-based retrieval (Recommended)** | [Download SAAS](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_saas_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |
230
+ | **CLaMP 3 C2** | **Symbolic music retrieval (Sheet music, MIDI)** | [Download C2](https://huggingface.co/sander-wood/clamp3/blob/main/weights_clamp3_c2_h_size_768_t_model_FacebookAI_xlm-roberta-base_t_length_128_a_size_768_a_layers_12_a_length_128_s_size_768_s_layers_12_p_size_64_p_length_512.pth) |
231
 
232
+ ##### **How to Switch Between Versions?**
233
+ By default, CLaMP 3 is configured for the **SAAS version** (optimized for audio).
234
+ - If working with **symbolic music (MIDI, sheet music)**, use the **C2 version**:
235
+ **Modify line 66 in `config.py`** from `"saas"` to `"c2"`.
236
+
237
  #### **2. Feature Extraction**
238
  After training (or using pre-trained weights), extract features using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py):
239
 
 
250
  > **Note**: For retrieval, `--get_global` must be used. Without it, CLaMP 3 will not work correctly for retrieval tasks. Omit `--get_global` only if you are performing downstream fine-tuning or need raw feature extraction for custom tasks.
251
 
252
  ### **Retrieval and Classification**
253
+ #### **1. Semantic Search**
254
+
255
+ To perform semantic search with CLaMP 3, you first need to extract the features for both your **query** and **reference** data using [`extract_clamp3.py`](https://github.com/sanderwood/clamp3/blob/main/code/extract_clamp3.py). The query is usually a text description, and the reference folder contains a large set of music data, such as audio or sheet music.
256
+
257
+ After extracting the features, run the search with the [`semantic_search.py`](https://github.com/sanderwood/clamp3/blob/main/retrieval/semantic_search.py) script:
258
+
259
  ```bash
260
  python semantic_search.py <query_file> <reference_folder> [--top_k TOP_K]
261
  ```
262
+ - **`<query_file>`**: Path to the query feature (e.g., `ballad.npy`).
263
+ - **`<reference_folder>`**: Folder containing reference features for comparison.
264
+ - **`--top_k`**: *(Optional)* Number of top similar items to display (default is 10).
265
+
266
+ Semantic search compares features extracted from the query with features extracted from the reference data, which supports the retrieval and evaluation tasks listed below. In general, the larger and more diverse the reference music dataset, the more likely it is to contain a relevant, well-matched result.
267
+
268
+ ##### **1. Text-to-Music Retrieval**
269
+ - **Query:** Text description of the desired music.
270
+ - **Reference:** Music data (e.g., audio files).
271
+ - **Output:** Retrieves music that best matches the semantic meaning of the text description.
272
+
273
+ ##### **2. Image-to-Music Retrieval**
274
+ - **Query:** A caption of the image, generated with a captioning model such as [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-base).
275
+ - **Reference:** Music data (e.g., audio files).
276
+ - **Output:** Finds music that semantically aligns with the image.
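
The captioning step can be sketched with the Hugging Face `transformers` BLIP classes. This is an illustration under assumptions, not part of the CLaMP 3 repository: the helper names `load_blip` and `caption_image` and the `max_new_tokens` value are hypothetical choices, and the resulting caption still has to go through feature extraction like any other text query.

```python
def load_blip(model_name="Salesforce/blip-image-captioning-base"):
    """Load the BLIP processor and captioning model from the Hugging Face Hub."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)
    return processor, model


def caption_image(image, processor, model, max_new_tokens=30):
    """Generate a caption for a PIL image with a BLIP processor/model pair."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


# Example usage (downloads the model on first run):
# from PIL import Image
# processor, model = load_blip()
# caption = caption_image(Image.open("scene.jpg").convert("RGB"), processor, model)
```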
277
+
278
+ ##### **3. Cross-Modal and Same-Modal Music Retrieval**
279
+ - **Cross-Modal Retrieval:**
280
+ - **Query:** Music data from one modality (e.g., audio).
281
+ - **Reference:** Music data from another modality (e.g., MIDI, ABC notation).
282
+ - **Output:** Finds semantically similar music across different representations.
283
+
284
+ - **Same-Modal Retrieval (Semantic-Based Music Recommendation):**
285
+ - **Query & Reference:** Both are from the same modality (e.g., audio-to-audio).
286
+ - **Output:** Recommends similar music based on semantic meaning.
287
+
288
+ ##### **4. Zero-Shot Music Classification**
289
+ - **Query:** Music data.
290
+ - **Reference:** Class descriptions (e.g., "It is classical," "It is folk").
291
+ - **Output:** Assigns the most relevant class based on feature similarity.
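
Conceptually, this amounts to picking the class description whose feature is closest to the music feature. The sketch below assumes the features are already extracted as single global vectors and uses cosine similarity as the closeness measure; `zero_shot_classify` is a hypothetical helper, not a function from the repository.

```python
import numpy as np


def zero_shot_classify(music_feature, class_features):
    """Return (best_label, scores) by cosine similarity to each class description feature.

    class_features maps a label to the feature of its text description.
    """
    query = np.asarray(music_feature, dtype=np.float64).reshape(-1)
    query = query / np.linalg.norm(query)
    scores = {}
    for label, feature in class_features.items():
        proto = np.asarray(feature, dtype=np.float64).reshape(-1)
        scores[label] = float(query @ (proto / np.linalg.norm(proto)))
    best_label = max(scores, key=scores.get)
    return best_label, scores
```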
292
+
293
+ ##### **5. Music Semantic Similarity Evaluation**
294
+ - **Query:** High-quality music or a music generation prompt.
295
+ - **Reference:** Generated music.
296
+ - **Output:** Ranks generated music by semantic similarity to the query. For large-scale evaluation of generated music against reference music, [`clamp3_score.py`](https://github.com/sanderwood/clamp3/blob/main/retrieval/clamp3_score.py) is recommended instead.
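
All of the tasks above reduce to ranking reference features against one query feature. As a rough illustration, the sketch below ranks `.npy` files by cosine similarity. It is not the repository's `semantic_search.py`: it assumes each file holds one non-zero global vector (as produced with `--get_global`) and that cosine similarity is the ranking measure, and the names `load_feature` and `top_k_similar` are hypothetical.

```python
from pathlib import Path

import numpy as np


def load_feature(path):
    """Load one .npy feature and flatten it to a 1-D float vector."""
    return np.load(path).astype(np.float64).reshape(-1)


def top_k_similar(query_file, reference_folder, top_k=10):
    """Return the top_k (file name, cosine similarity) pairs, best first."""
    query = load_feature(query_file)
    query = query / np.linalg.norm(query)
    scored = []
    for ref_path in sorted(Path(reference_folder).glob("*.npy")):
        ref = load_feature(ref_path)
        if ref.shape != query.shape:
            continue  # skip files that are not a single vector of the same size
        score = float(query @ (ref / np.linalg.norm(ref)))
        scored.append((ref_path.name, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The default of 10 results mirrors the `--top_k` default described above.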
297
 
298
  #### **2. Classification**
299
  Train a linear classifier using **[`train_cls.py`](https://github.com/sanderwood/clamp3/tree/main/classification/train_cls.py)**:
 
306
  ```
307
 
308
  ## **Citation**
309
+ If you find CLaMP 3 useful in your work, please consider citing our paper:
310
+
311
+ ```bibtex
312
+ @misc{wu2025clamp3universalmusic,
313
+ title={CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages},
314
+ author={Shangda Wu and Zhancheng Guo and Ruibin Yuan and Junyan Jiang and Seungheon Doh and Gus Xia and Juhan Nam and Xiaobing Li and Feng Yu and Maosong Sun},
315
+ year={2025},
316
+ eprint={2502.10362},
317
+ archivePrefix={arXiv},
318
+ primaryClass={cs.SD},
319
+ url={https://arxiv.org/abs/2502.10362}
320
+ }
321
+ ```