Commit 3e5c586 (verified) by czczup · 1 parent: 02eb177

Update README.md

Files changed (1): README.md (+219 -3)
@@ -1,3 +1,219 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - laion/laion2B-en
+ - laion/laion-coco
+ - laion/laion2B-multi
+ - kakaobrain/coyo-700m
+ - conceptual_captions
+ - wanng/wukong100m
+ pipeline_tag: visual-question-answering
+ ---
+
+ # Model Card for InternVL2-2B
+
+ [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
+
+ [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)
+
+ ## Model Usage
+
+ We provide example code to run InternVL2-2B using `transformers`.
+
+ You can also try the model in our [online demo](https://internvl.opengvlab.com/).
+
+ > Please use transformers==4.37.2 to ensure the model works as expected.
+
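Because the note above pins `transformers` to 4.37.2, it can help to check the installed version before loading the model. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `matches_pin` are made up for this example and are not part of the InternVL code.

```python
from importlib import metadata

REQUIRED_TRANSFORMERS = "4.37.2"  # the version pinned in the note above


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, required=REQUIRED_TRANSFORMERS):
    """True only when the installed version string equals the pinned one."""
    return installed is not None and installed.strip() == required


# Warn early instead of failing somewhere inside model loading.
found = installed_version("transformers")
if not matches_pin(found):
    print(f"Expected transformers=={REQUIRED_TRANSFORMERS}, found {found}")
```

An exact string comparison is used on purpose, since the note asks for one specific release rather than a minimum version.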
+ ```python
+ import torch
+ import torchvision.transforms as T
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ from transformers import AutoModel, AutoTokenizer
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+
+ def build_transform(input_size):
+     MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     # enumerate the candidate tile grids as (columns, rows) pairs
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     # find the grid whose aspect ratio is closest to the image's
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     # calculate the target width and height
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     # resize the image
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         # split the image
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     if use_thumbnail and len(processed_images) != 1:
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
+
+
+ def load_image(image_file, input_size=448, max_num=6):
+     image = Image.open(image_file).convert('RGB')
+     transform = build_transform(input_size=input_size)
+     images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+     pixel_values = [transform(image) for image in images]
+     pixel_values = torch.stack(pixel_values)
+     return pixel_values
+
+
+ path = 'OpenGVLab/InternVL2-2B'
+ model = AutoModel.from_pretrained(
+     path,
+     torch_dtype=torch.bfloat16,
+     low_cpu_mem_usage=True,
+     trust_remote_code=True).eval().cuda()
+
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
+ # set the max number of tiles in `max_num`
+ pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
+
+ generation_config = dict(
+     num_beams=1,
+     max_new_tokens=1024,
+     do_sample=False,
+ )
+
+ # pure-text conversation
+ question = 'Hello, who are you?'
+ response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ question = 'Can you tell me a story?'
+ response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # single-image single-round conversation
+ question = '<image>\nPlease describe the image briefly.'
+ response = model.chat(tokenizer, pixel_values, question, generation_config)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # single-image multi-round conversation
+ question = '<image>\nPlease describe the image in detail.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ question = 'Please write a poem according to the image.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # multi-image multi-round conversation
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
+
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
+                                num_patches_list=num_patches_list,
+                                history=None, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ question = 'What are the similarities and differences between these two images?'
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
+                                num_patches_list=num_patches_list,
+                                history=history, return_history=True)
+ print(f'User: {question}')
+ print(f'Assistant: {response}')
+
+ # batch inference, single image per sample
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
+
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
+ responses = model.batch_chat(tokenizer, pixel_values,
+                              num_patches_list=num_patches_list,
+                              questions=questions,
+                              generation_config=generation_config)
+ for question, response in zip(questions, responses):
+     print(f'User: {question}')
+     print(f'Assistant: {response}')
+ ```
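To make the tiling step easier to follow, here is a small sketch of how `dynamic_preprocess` and `find_closest_aspect_ratio` pick a tile grid, rewritten in plain Python so it runs without `torch` or `PIL`. The function names `candidate_grids`, `pick_grid`, and `tile_boxes` are hypothetical and exist only for this illustration. One deliberate difference: the sort uses extra keys so that grids with the same tile count come out in a fixed order, whereas the README code sorts by tile count alone.

```python
def candidate_grids(min_num=1, max_num=6):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count; the extra keys only make the order deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0], g[1]))


def pick_grid(width, height, image_size=448, min_num=1, max_num=6):
    """Choose the grid whose aspect ratio is closest to width / height."""
    aspect_ratio = width / height
    best_grid, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_grid, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # On a tie, prefer the larger grid when the image has enough pixels.
            best_grid = (cols, rows)
    return best_grid


def tile_boxes(grid, image_size=448):
    """Crop boxes (left, top, right, bottom) in row-major order."""
    cols, rows = grid
    return [
        (c * image_size, r * image_size, (c + 1) * image_size, (r + 1) * image_size)
        for r in range(rows)
        for c in range(cols)
    ]


print(pick_grid(1600, 800))    # wide image
print(pick_grid(1000, 1000))   # large square image
print(pick_grid(500, 500))     # small square image
print(tile_boxes(pick_grid(1600, 800)))
```

Under these assumptions a 1600×800 image maps to a 2×1 grid, that is, two 448×448 tiles; since `load_image` passes `use_thumbnail=True`, a thumbnail is appended whenever there is more than one tile, giving three tensors in total. A 500×500 image stays a single tile and gets no thumbnail.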
+
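The multi-image and batch examples concatenate the tiles of all images along the first dimension and pass the per-image tile counts as `num_patches_list`. The sketch below illustrates that bookkeeping with plain lists standing in for tensors; `split_by_counts` is a hypothetical helper written for this example, and it says nothing about how `model.chat` uses the counts internally.

```python
def split_by_counts(tiles, counts):
    """Split a flat sequence of tiles into per-image groups."""
    if sum(counts) != len(tiles):
        raise ValueError("tile counts must add up to the number of tiles")
    groups, start = [], 0
    for count in counts:
        groups.append(tiles[start:start + count])
        start += count
    return groups


# Stand-ins for the tiles of two images: 3 tiles for the first, 5 for the second.
tiles = [f"img1_tile{i}" for i in range(3)] + [f"img2_tile{i}" for i in range(5)]
groups = split_by_counts(tiles, [3, 5])
print([len(g) for g in groups])
```

With real tensors, `torch.split(pixel_values, num_patches_list)` performs the same split along dimension 0.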
+ ## Citation
+
+ If you find this project useful in your research, please consider citing:
+
+ ```BibTeX
+ @article{chen2023internvl,
+   title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
+   author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
+   journal={arXiv preprint arXiv:2312.14238},
+   year={2023}
+ }
+ @article{chen2024far,
+   title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+   author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+   journal={arXiv preprint arXiv:2404.16821},
+   year={2024}
+ }
+ ```
+
+ ## License
+
+ This project is released under the MIT license.
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!