---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
---

# πŸ“• InternVL_2_5_HiCo_R64 ⚑
<!-- [\[πŸ“° Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) -->
[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5)
[\[πŸ“œ Tech Report\]](https://arxiv.org/abs/2501.12386)
<!-- [\[πŸ—¨οΈ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->

## πŸ“ˆ Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| InternVL_2_5_HiCo_R64 | - | - | - |

## πŸš€ How to use the model

First, you need to install [flash attention2](https://github.com/Dao-AILab/flash-attention) and a few other dependencies. We provide a simple installation example below:
```bash
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
pip install flash-attn --no-build-isolation
```
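
If you want to verify the environment before downloading the model weights, a quick import check such as the one below can help. This snippet is an illustrative addition, not part of the official instructions; note that the `opencv-python` package imports as `cv2`.
```python
# Sanity-check the dependencies installed above before loading the model.
import importlib

for pkg in ("transformers", "av", "imageio", "decord", "cv2", "flash_attn"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: OK (version {getattr(mod, '__version__', 'unknown')})")
    except ImportError as err:
        print(f"{pkg}: missing ({err})")
```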

Then you can use our model:
```python
from transformers import AutoModel, AutoTokenizer

# model setting
model_path = 'OpenGVLab/InternVL_2_5_HiCo_R64'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
image_processor = model.get_vision_tower().image_processor

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,  # greedy decoding; temperature/top_p have no effect when sampling is disabled
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question1, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output1)

# multi-turn conversation
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(video_path=video_path, tokenizer=tokenizer, user_prompt=question2, chat_history=chat_history, return_history=True, max_num_frames=max_num_frames, generation_config=generation_config)

print(output2)
```
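
Depending on your hardware, you may prefer to load the weights directly in `bfloat16` rather than converting with `.half()` afterwards. The sketch below is an assumption-based variant of the loading step above, using the standard `torch_dtype` argument of `from_pretrained`; the rest of the usage is unchanged.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'OpenGVLab/InternVL_2_5_HiCo_R64'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load in bfloat16, which is usually more numerically stable than float16
# on Ampere or newer GPUs; .eval() switches the model to inference mode.
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
```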

## ✏️ Citation

```bibtex
@article{wang2025internvideo,
  title={InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling},
  author={Wang, Yi and Li, Xinhao and Yan, Ziang and He, Yinan and Yu, Jiashuo and Zeng, Xiangyu and Wang, Chenting and Ma, Changlian and Huang, Haian and Gao, Jianfei and Dou, Min and Chen, Kai and Wang, Wenhai and Qiao, Yu and Wang, Yali and Wang, Limin},
  journal={arXiv preprint arXiv:2501.12386},
  year={2025}
}
```