Any tips, links or advice on how to run this .gguf?
https://github.com/multimodal-art-projection/YuE/blob/main/inference/infer.py#L73 - AutoModelForCausalLM.from_pretrained(gguf_model) should work
I installed Miniconda for this (on Windows 10), so I also had to execute conda init before the second command (conda activate yue) in the setup process in the YuE readme.
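In other words, the Windows setup ended up looking roughly like this (the create line is from the YuE readme; I'm going from memory on the exact Python version it pins):
conda create -n yue python=3.8
conda init
(restart the terminal, then)
conda activate yue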
I moved genre.txt and lyrics.txt into the inference folder, because the instructions said to run it with that as the working directory, and I figured it wasn't going to go hunting for those files.
Then I got an error when I first tried to install flash-attn; I just followed the "unsafe, unsupported, undocumented workaround" in the error message and executed SET KMP_DUPLICATE_LIB_OK=TRUE before trying again.
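So the retry was essentially this (the --no-build-isolation flag is what the flash-attn readme recommends; I'm not sure it's strictly needed here):
SET KMP_DUPLICATE_LIB_OK=TRUE
pip install flash-attn --no-build-isolation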
It took something like an hour to build flash-attn... Then I messed with the Python file based on the hint above and the Transformers docs, and ended up just lazily changing the statement morriszms pointed to:
model = AutoModelForCausalLM.from_pretrained(
    "tensorblock/YuE-s1-7B-anneal-en-cot-GGUF",
    gguf_file="YuE-s1-7B-anneal-en-cot-Q8_0.gguf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
)
Then I got another error: "Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch." I had already seen it download PyTorch, so I assumed torch was fine and gguf was the problem; I ran pip install gguf and tried again. Then it told me to run pip install accelerate, so I did that and tried again.
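One thing to watch out for: in the Windows command prompt the > in the version specifier has to be quoted, or cmd treats it as an output redirect:
pip install "gguf>=0.10.0" accelerate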
Then it said, "Converting and de-quantizing GGUF tensors..." and promptly consumed 14 GB of my 16 GB of VRAM after spitting out the various warnings below, so I can only assume it isn't actually possible to run it quantized this way: Transformers de-quantizes the GGUF weights back to the requested torch_dtype rather than running them quantized.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
C:\Users\DePro\.conda\envs\yue\lib\site-packages\torch\nn\utils\weight_norm.py:134: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
infer_gguf.py:87: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
parameter_dict = torch.load(args.resume_path, map_location='cpu')
0%| | 0/3 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
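Most of those look harmless, but the torch.load one is easy to quiet; assuming the checkpoint at args.resume_path only contains plain tensors (I haven't verified that), this should be a drop-in change in the script:
parameter_dict = torch.load(args.resume_path, map_location='cpu', weights_only=True)
If it errors out, the checkpoint holds non-tensor objects and the warning can just be ignored instead.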
I ran it from within the inference folder with this command line:
python infer.py --stage1_model YuE-s1-7B-anneal-en-cot-Q8_0 --stage2_model m-a-p/YuE-s2-1B-general --genre_txt genre.txt --lyrics_txt lyrics.txt --run_n_segments 2 --stage2_batch_size 4 --output_dir ./output --cuda_idx 0 --max_new_tokens 1000
Transformers redownloaded the GGUF even though I already had it in that folder, too. Then it downloaded the stage2 model after taking ~3.5 minutes to run stage 1 inference. Stage 2 took 500 seconds on my RTX 4060 Ti.
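The re-download should be avoidable, I think: from_pretrained also accepts a local directory together with gguf_file, so something like this should pick up the file already on disk (untested; assumes the .gguf sits in the inference folder):
model = AutoModelForCausalLM.from_pretrained(
    ".",  # local directory containing the .gguf, instead of the Hub repo id
    gguf_file="YuE-s1-7B-anneal-en-cot-Q8_0.gguf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)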
Oh, and after running inference successfully once, it fails every time with this error:
Traceback (most recent call last):
  File "infer_gguf.py", line 197, in <module>
    ids = raw_output[0].cpu().numpy()
NameError: name 'raw_output' is not defined
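I haven't dug into why yet, but presumably raw_output only gets assigned inside the generation loop, so when that loop never runs, line 197 blows up. A crude guard (my own sketch, not something in the repo) would at least make the failure readable:
# just before the line the traceback points at
if 'raw_output' not in locals():
    raise RuntimeError("generation loop produced no output; check run_n_segments and lyrics.txt")
ids = raw_output[0].cpu().numpy()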
They updated the licence of the model to Apache 2.0. The licence of this repo can be changed to Apache 2.0 now too :)
Regarding NameError: name 'raw_output' is not defined: put 3 or more segments into lyrics.txt ([verse], [chorus], [verse]). For some reason it won't work with only 1 or 2 segments.
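So a minimal working lyrics.txt would look something like this (placeholder lines; blank line between segments, matching the example prompts in the YuE repo):
[verse]
(lyric lines here)

[chorus]
(lyric lines here)

[verse]
(lyric lines here)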
Oh, that's weird, you're right. I thought I had reverted genre.txt and lyrics.txt before one of my attempts, so I could say I hadn't changed anything when it quit working, but apparently I did not...