---
tags:
- clip
- llm-jp-clip
- japanese-clip
library_name: open_clip
pipeline_tag: zero-shot-image-classification
license:
- apache-2.0
datasets:
- llm-jp/relaion2B-en-research-safe-japanese-translation
language:
- ja
---
# Model Card for llm-jp-clip-vit-large-patch14

# Model Details

A Japanese CLIP model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation), a Japanese translation of the English subset of ReLAION-5B ([relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe)) produced with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).

The total number of parameters of this model is 467M.

# How to Use

## Installation

```bash
$ pip install open_clip_torch
```

## Zero-shot Image Classification
```python
import open_clip
import requests
import torch
from PIL import Image

# Load the model, preprocessing transform, and tokenizer from the Hugging Face Hub
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')

# Download a sample image and prepare the image and text inputs
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["猫", "犬", "鳥"])  # "cat", "dog", "bird"

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode both modalities and L2-normalize the embeddings
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarities, converted to probabilities over the labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
```

Reference: 
- [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), HuggingFace Docs
- OpenCLIP [repository](https://github.com/mlfoundations/open_clip)
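
## Text-to-Image Retrieval

The same encoders can also be used for retrieval: embed a text query and a set of images into the shared space and rank the images by cosine similarity. The snippet below is a minimal sketch rather than part of the official tooling; the local image file names are placeholders.

```python
import open_clip
import torch
from PIL import Image

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
model.eval()

# Placeholder image files; replace with your own paths
image_paths = ["cat.jpg", "dog.jpg", "bird.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
query = tokenizer(["ソファの上で眠る猫"])  # "a cat sleeping on a sofa"

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between the query and each image; higher means a better match
    similarity = (text_features @ image_features.T).squeeze(0)

best = similarity.argmax().item()
print("Best match:", image_paths[best], "score:", similarity[best].item())
```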


# Training Details

## Model Architecture

- Text Encoder: RoBERTa base with llm-jp-tokenizer
- Image Encoder: ViT-L/14
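
As a quick sanity check of the 467M figure above, the per-tower parameter counts can be read off the loaded OpenCLIP model. The sketch below assumes the standard OpenCLIP attribute layout, where `model.visual` holds the image tower and everything else belongs to the text side (text encoder plus the logit scale).

```python
import open_clip

# The preprocessing transform is not needed just for counting parameters
model, _ = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')

total = sum(p.numel() for p in model.parameters())
visual = sum(p.numel() for p in model.visual.parameters())  # ViT-L/14 image tower

print(f"total parameters:               {total / 1e6:.0f}M")
print(f"image encoder:                  {visual / 1e6:.0f}M")
print(f"text encoder and logit scale:   {(total - visual) / 1e6:.0f}M")
```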

## Training Data

This model is trained on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation).
Because roughly 70% of the image URLs could be downloaded successfully, the usable dataset contained 1.45 billion samples; we trained on it for 9 epochs, i.e. about 13 billion samples seen in total (1.45B × 9 ≈ 13B).

# Evaluation

Evaluation Code: https://github.com/llm-jp/clip-eval

**Table:** Performance of each model on zero-shot image classification and image-text retrieval tasks. **Bold** indicates the best score, and _underline_ indicates the second-best.


| Model                        | Params (M) | ImageNet | Recruit | CIFAR10 | CIFAR100 | Food101 | Caltech101 | XM3600 I → T | XM3600 T → I | Avg.  |
|-----------------------------|-------------|----------|---------|---------|----------|---------|------------|-------------|-------------|------|
| **Japanese CLIP**           |             |          |         |         |          |         |            |             |             |      |
| [Rinna ViT-B/16](https://huggingface.co/rinna/japanese-clip-vit-b-16)              | 196         | 50.6     | 39.9    | 90.7    | 64.0     | 53.2    | 84.6       | 53.8        | 54.0        | 61.4 |
| [Rinna ViT-B/16 cloob](https://huggingface.co/rinna/japanese-cloob-vit-b-16)        | 196         | 54.6     | 41.6    | 88.2    | 60.3     | 57.2    | 80.2       | 53.4        | 53.4        | 61.1 |
| [LY ViT-B/16](https://huggingface.co/line-corporation/clip-japanese-base)                 | 196         | 52.0     | **83.8** | 96.3    | 76.7     | 73.9    | **88.4**   | **76.9**    | **78.0**    | **78.3** |
| [**llm-jp-ViT-B/16**](https://huggingface.co/llm-jp/llm-jp-clip-vit-base-patch16)        | 248         | 54.2     | 59.4    | 91.8    | 69.2     | _82.2_   | 85.6       | 73.6        | 72.7        | 73.6 |
| [StabilityAI ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16)        | 414         | **62.4** | 70.5    | _97.6_   | **84.1** | 74.0    | 86.7       | 67.3        | 66.0        | 76.1 |
| [**llm-jp-ViT-L/14**](https://huggingface.co/llm-jp/llm-jp-clip-vit-large-patch14)        | 467         | _59.5_   | 62.9    | 96.4    | 77.0     | **88.2** | _87.8_      | 74.1        | _74.1_      | _77.5_ |
| **Multilingual CLIP**       |             |          |         |         |          |         |            |             |             |      |
| [SigLIP B/16-256 multi](https://huggingface.co/google/siglip-base-patch16-256-multilingual)       | 370         | 51.9     | 71.2    | 92.4    | 65.8     | 78.6    | 85.6       | 45.9        | 43.0        | 66.8 |
| [jina-clip-v2](https://huggingface.co/jinaai/jina-clip-v2)                | 865         | 35.8     | 48.1    | 95.1    | 58.3     | 52.0    | 69.4       | 67.3        | 66.4        | 61.6 |
| [LAION ViT-H/14 multi](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k)        | 1193        | 53.0     | _74.5_   | **97.9** | _78.4_   | 74.3    | 85.1       | _75.0_      | 72.0        | 76.3 |
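
For reference, the zero-shot classification protocol amounts to encoding one Japanese prompt per class, encoding each test image, and predicting the class whose text embedding is most similar. The sketch below illustrates this on CIFAR10; it is not the llm-jp/clip-eval implementation, and the Japanese class names and prompt template are illustrative only.

```python
import open_clip
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
model.eval()

# Illustrative Japanese names for the ten CIFAR10 classes, in label order
classes = ["飛行機", "自動車", "鳥", "猫", "鹿", "犬", "カエル", "馬", "船", "トラック"]
prompts = tokenizer([f"{c}の写真" for c in classes])  # "a photo of a {class}"

dataset = CIFAR10(root=".", train=False, download=True, transform=preprocess)
loader = DataLoader(dataset, batch_size=256)

correct, total = 0, 0
with torch.no_grad():
    text_features = model.encode_text(prompts)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    for images, labels in loader:
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        pred = (image_features @ text_features.T).argmax(dim=-1)
        correct += (pred == labels).sum().item()
        total += labels.numel()

print(f"zero-shot top-1 accuracy: {correct / total:.3f}")
```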


# LICENSE
[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)


Please also refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), since the training data was translated with gemma-2-9b-it. We used Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), this model was not trained "in order to cause that model to perform similarly to Gemma," so we have concluded that it does not need to inherit the Gemma license.

# Citation

BibTeX:
```
@inproceedings{sugiura2025clip,
  author = {Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Naoaki Okazaki},
  title  = {Development of a Japanese CLIP Model Leveraging Translation by an Open LLM},
  series = {The 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)},
  month  = mar,
  year   = {2025}
}
```