---
metrics:
- accuracy
- f1
base_model:
- openai/clip-vit-base-patch32
pipeline_tag: image-classification
---

## Model description
`clip-asl-fingerspelling` is a classifier of American Sign Language (ASL) fingerspelled letters. It is built on OpenAI's CLIP vision model (`openai/clip-vit-base-patch32`),
fine-tuned with a classification head added on top of the vision encoder. The full code, dataset details, and an example inference
script are available on GitHub: [`clip-asl-fingerspelling`](https://github.com/aleksandra-baranowska/clip-asl-fingerspelling?tab=readme-ov-file).
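
The exact head architecture is defined in the GitHub repository; the sketch below shows one plausible construction, assuming a single linear layer over the encoder's pooled output (`CLIPFingerspellingClassifier` and its internals are illustrative names, not the repository's):

```python
import torch.nn as nn
from transformers import CLIPVisionModel

class CLIPFingerspellingClassifier(nn.Module):
    """Hypothetical sketch: CLIP vision encoder + linear classification head."""

    def __init__(self, num_classes: int = 26):
        super().__init__()
        self.vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        # The pooled output of this checkpoint is hidden_size-dimensional (768).
        self.classifier = nn.Linear(self.vision_model.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values=pixel_values)
        # pooler_output: (batch, hidden_size) image representation fed to the head.
        return self.classifier(outputs.pooler_output)
```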

## Training
The model was trained on 206,137 images of signs corresponding to the 26 letters of the English alphabet (A-Z). The [ASL Alphabet Dataset](https://www.kaggle.com/datasets/debashishsau/aslamerican-sign-language-aplhabet-dataset)
was processed using `CLIPProcessor` and split into train, validation, and test sets (70%, 20%, and 10% respectively). Training used the following parameters (a minimal setup sketch follows the list):
- Learning rate: 1e-5
- Batch size: 32
- Epochs: 10
- Optimizer: AdamW
- Learning rate scheduler: StepLR (step_size=5, gamma=0.1)
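
A minimal sketch of this training setup, assuming the `CLIPFingerspellingClassifier` sketched above and a hypothetical `train_loader` yielding `CLIPProcessor`-preprocessed batches of 32; the repository's actual training loop may differ:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

model = CLIPFingerspellingClassifier(num_classes=26)  # sketch class from above
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)  # LR x0.1 every 5 epochs
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for pixel_values, labels in train_loader:  # hypothetical DataLoader, batch size 32
        optimizer.zero_grad()
        loss = criterion(model(pixel_values), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # StepLR advances once per epoch
```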

## Results
Performance on the test set was measured with accuracy, weighted F1 score, and per-class F1 score.
The fine-tuned model achieves:

| Metric                | Value  |
|-----------------------|--------|
| **Accuracy**          | 99.88% |
| **Weighted F1 Score** | 99.88% |

Per-class F1 scores range from 99.61% to 100% (available in the [notebook version](https://colab.research.google.com/drive/1SHz-t2I9DKyxEbC9F7C4nKdhVSZyUXSJ?authuser=3#scrollTo=r3H2wC7jYcCn) of `clip-asl-fingerspelling.py`).
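
These metrics correspond to what scikit-learn computes as follows (a sketch, not necessarily the repository's exact evaluation code; `y_true` and `y_pred` are hypothetical arrays of integer labels gathered over the test set):

```python
from sklearn.metrics import accuracy_score, f1_score

# y_true, y_pred: hypothetical integer label arrays collected over the test set
accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # class-frequency weighted
per_class_f1 = f1_score(y_true, y_pred, average=None)       # one score per letter
```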

## How to use
Example inference is available on GitHub ([inference script](https://github.com/aleksandra-baranowska/clip-asl-fingerspelling?tab=readme-ov-file#inference)).
Two scripts show how to load the model together with the additional classifier layer and trained weights: one classifies a single
given image, while the other handles batch classification and reports performance results.
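
For orientation, a minimal single-image inference sketch, again assuming the `CLIPFingerspellingClassifier` above and that class indices map alphabetically to A-Z; the checkpoint filename and image path are placeholders, so see the repository's scripts for the exact usage:

```python
import string
import torch
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPFingerspellingClassifier(num_classes=26)  # sketch class from above
# "classifier_weights.pth" is a placeholder; see the repository for the real checkpoint.
model.load_state_dict(torch.load("classifier_weights.pth", map_location="cpu"))
model.eval()

image = Image.open("sign.jpg")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs["pixel_values"])
print(string.ascii_uppercase[logits.argmax(dim=-1).item()])  # predicted letter A-Z
```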

## Limitations
The dataset the model was trained on contains some inaccurate signing, which influences the final result. When tested on a small sample
of images captured under different conditions, performance was much worse (accuracy: 79.66%).

