AttributeError: 'CLIPImageProcessor' object has no attribute 'patch_size' when loading fine-tuned LLaVA model from Google Drive

#47 · opened by mdnasif

I have fine-tuned a LLaVA (Large Language and Vision Assistant) model on Google Colab and saved it to my Google Drive. Here’s how I saved the model:

'''
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
import os

save_path = "/content/drive/MyDrive/fineTune model1/LLaVA-med-MAKAUT_v1"
os.makedirs(save_path, exist_ok=True)

trainer.model.save_pretrained(save_path)
trainer.tokenizer.save_pretrained(save_path)
processor.image_processor.save_pretrained(save_path)
'''
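
For reference, newer transformers releases also let you save the whole processor in a single call, so that processor-level fields such as patch_size and vision_feature_select_strategy are written out alongside the tokenizer and image-processor files. A minimal sketch, assuming `processor` is the LlavaProcessor used during training:

'''
# Saves the tokenizer, the image processor, and a processor_config.json
# (which holds processor-level attributes like patch_size) in one place,
# so that LlavaProcessor.from_pretrained(save_path) can restore everything later.
processor.save_pretrained(save_path)
'''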

After saving, my Google Drive folder contains the following files:
README.md
adapter_model.safetensors
adapter_config.json
tokenizer_config.json
special_tokens_map.json
added_tokens.json
tokenizer.model
tokenizer.json
preprocessor_config.json
config.json

However, when I try to load the model for testing, I get an AttributeError related to patch_size:

'''
import torch
from PIL import Image
from transformers import LlavaProcessor, LlavaForConditionalGeneration, CLIPImageProcessor
model_path = "/content/drive/MyDrive/fineTune model/LLaVA-med-MAKAUT_v1"
processor1 = LlavaProcessor.from_pretrained(model_path)
'''

Checking the patch size from the model's vision_config:

'''
patch_size = new_model_v1.config.vision_config.patch_size
print("Patch size:", patch_size)
# Output: Patch size: 14
'''

The error occurs here:

'''
print(processor1.image_processor.patch_size)
'''

Error Message:

'''
AttributeError: 'CLIPImageProcessor' object has no attribute 'patch_size'
'''

What I Have Tried:

  1. Ensuring that the model is properly saved and loaded.
  2. Confirming that the patch size is present in the model's vision configuration (patch_size: 14).
  3. Attempting to manually set patch_size:

'''
processor1.image_processor.patch_size = 14
'''

However, this doesn't seem to be the right approach since CLIPImageProcessor doesn’t have this attribute.
Questions:

  1. Why does CLIPImageProcessor lack the patch_size attribute even though it is defined in the model’s vision_config?
  2. What is the correct way to ensure that the LLaVA processor aligns with the fine-tuned model’s configuration, especially concerning patch_size?
  3. Is there a recommended way to properly load and utilize the fine-tuned LLaVA model along with its processor for inference in Colab?

Reply from the Llava Hugging Face org:

@mdnasif hey!

I guess you trained the model a long time ago, when patch_size was not a required argument. Since the last release, it has become required. You can manually update the processor and save it back as follows:

'''
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
processor.num_additional_image_tokens = 1  # CLIP adds 1 extra token ([CLS]) on top of the image grid; with e.g. SigLIP it should be 0, because SigLIP adds no extra tokens
processor.save_pretrained("my_model_directory")
'''
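
Applied to the setup in the question, a minimal sketch might look like the following (the Drive path and the num_additional_image_tokens value are taken from the question and the note above, so treat them as assumptions rather than a verified recipe):

'''
from transformers import LlavaProcessor, LlavaForConditionalGeneration

model_path = "/content/drive/MyDrive/fineTune model1/LLaVA-med-MAKAUT_v1"

# Load the fine-tuned model and the processor components saved alongside it
model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = LlavaProcessor.from_pretrained(model_path)

# Copy the vision settings from the model config onto the processor
processor.patch_size = model.config.vision_config.patch_size            # 14 for LLaVA-1.5
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
processor.num_additional_image_tokens = 1                               # CLIP prepends a [CLS] token

# Persist the patched processor so future sessions load it correctly
processor.save_pretrained(model_path)
'''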

Subject: Mismatch Between Image Tokens and Features in LLaVA Model Fine-Tuning
Hi @RaushanTurganbay ma'am,
I’m working on fine-tuning a LLaVA model (LLaVA-1.5-7B) for a medical imaging task, and I’ve encountered an issue that I’m unable to resolve. I would greatly appreciate your expertise and guidance on this matter.

Issue Description
When I try to generate a response using the fine-tuned model, I encounter the following error:

'''
ValueError: Image features and image tokens do not match: tokens: 575, features: 576
'''

This error occurs during the generate() call, indicating a mismatch between the number of image tokens and image features.
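
For reference, a rough sketch of where the two numbers can come from, assuming a 336x336 CLIP-ViT-L/14 vision tower and the placeholder-token arithmetic used by recent LlavaProcessor releases (this is an assumption about the internals, not something confirmed in this thread):

'''
# Illustrative arithmetic only
crop_size = 336          # CLIPImageProcessor resizes/center-crops to 336x336 for LLaVA-1.5
patch_size = 14
num_patches = (crop_size // patch_size) ** 2          # 24 * 24 = 576

# The vision tower yields 576 patch features once the [CLS] state is dropped
# by vision_feature_select_strategy="default"  ->  "features: 576"

# The processor estimates how many <image> placeholders to insert roughly as:
num_additional_image_tokens = 0   # value if this attribute was never set to 1
num_image_tokens = num_patches + num_additional_image_tokens - 1   # 575 -> "tokens: 575"
'''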

Steps I’ve Taken:

Image Preprocessing:
I resized the input image so that its dimensions are multiples of the patch_size (14 for LLaVA models).
The image is resized to 518x336, both of which are multiples of 14.

Processor Configuration:
I manually set the patch_size and vision_feature_select_strategy in the processor to match the model's configuration.
I verified that the processor's configuration is correct.

Debugging Inputs:
I printed the inputs (input_ids, pixel_values, etc.) to ensure they are correctly formatted.
The inputs are moved to the GPU for processing.

Model Loading:
The model and processor are loaded from the fine-tuned directory, and the model is moved to the GPU.

Code Snippet
Here’s the relevant part of my code:

Load the processor and model

'''
processor1 = LlavaProcessor.from_pretrained(model_path)
new_model_v1 = LlavaForConditionalGeneration.from_pretrained(model_path).to("cuda:0")
'''

Resize the image

'''
patch_size = new_model_v1.config.vision_config.patch_size
shortest_edge = processor1.image_processor.size.get("shortest_edge", 336)
original_width, original_height = raw_image.size
scale_factor = shortest_edge / min(original_width, original_height)
new_width = int(original_width * scale_factor)
new_height = int(original_height * scale_factor)

new_width = (new_width // patch_size) * patch_size
new_height = (new_height // patch_size) * patch_size
raw_image = raw_image.resize((new_width, new_height))  # Resized to multiples of patch_size
'''

Process inputs

'''
inputs = processor1(images=raw_image, text=prompt, return_tensors='pt')
inputs = {k: v.to("cuda:0") for k, v in inputs.items()}
'''

Generate response

'''
output = new_model_v1.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=200,
    do_sample=False
)
'''

Questions:

  1. Why is there a mismatch between the number of image tokens (575) and image features (576)?
  2. Is there a mistake in my image preprocessing or model configuration that could cause this issue?
  3. How can I ensure that the number of image tokens matches the number of image features?
  4. Are there any additional steps I need to take to align the image tokens and features correctly?

Additional Information
I have tried the Hugging Face transformers library at version 4.48.2.
The fine-tuned model is saved in Google Drive, and I’m loading it in a new Colab session.
The base model is llava-hf/llava-1.5-7b-hf.

mdnasif changed discussion status to closed
