## Model description
`clip-asl-fingerspelling` is a classifier of American Sign Language (ASL) fingerspelled letters. It is based on OpenAI's CLIP vision model (`openai/clip-vit-base-patch32`),
fine-tuned with a classification head added on top of the visual encoder. The full code, dataset details, and an example inference
script are available on GitHub: [`clip-asl-fingerspelling`](https://github.com/aleksandra-baranowska/clip-asl-fingerspelling?tab=readme-ov-file).
## Training
The model was trained on 206,137 images of signs corresponding to the 26 letters of the English alphabet (A-Z). The [ASL Alphabet Dataset](https://www.kaggle.com/datasets/debashishsau/aslamerican-sign-language-aplhabet-dataset)
was processed using `CLIPProcessor` and split into training, validation, and test sets (70%, 20%, and 10%, respectively). Training used the following parameters:
- Learning rate: 1e-5
- Batch size: 32
- Epochs: 10
- Optimizer: AdamW
- Learning rate scheduler: StepLR (step_size=5, gamma=0.1)
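The setup above can be sketched as follows. This is an illustrative reconstruction, not the repository's actual training code: the class name and head are assumptions, and the vision encoder is built from a default (randomly initialized) config here rather than the pretrained `openai/clip-vit-base-patch32` checkpoint the real training loads.

```python
import torch
from torch import nn
from transformers import CLIPVisionConfig, CLIPVisionModel

class CLIPFingerspellingClassifier(nn.Module):
    """CLIP visual encoder with a linear head over the 26 ASL letters (sketch)."""

    def __init__(self, num_classes: int = 26):
        super().__init__()
        # Default CLIPVisionConfig matches the ViT-B/32 shape (hidden_size=768,
        # patch_size=32); the actual model loads the pretrained checkpoint instead.
        self.vision = CLIPVisionModel(CLIPVisionConfig())
        self.head = nn.Linear(self.vision.config.hidden_size, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pooler_output is the pooled [CLS] representation of the image
        pooled = self.vision(pixel_values=pixel_values).pooler_output
        return self.head(pooled)

model = CLIPFingerspellingClassifier()
# Optimizer and scheduler as listed in the model card
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
```

With `step_size=5` and `gamma=0.1`, the learning rate drops from 1e-5 to 1e-6 after epoch 5, so the second half of the 10 epochs fine-tunes at a tenth of the initial rate.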
## Results
Performance on the test set was measured with accuracy, weighted F1 score, and per-class F1 score.
The fine-tuned model achieves:
| Metric | Value |
|--------------------------|--------|
| **Accuracy** | 99.88% |
| **Weighted F1 Score** | 99.88% |
Per-class F1 scores range from 99.61% to 100% (available in the [notebook version](https://colab.research.google.com/drive/1SHz-t2I9DKyxEbC9F7C4nKdhVSZyUXSJ?authuser=3#scrollTo=r3H2wC7jYcCn) of `clip-asl-fingerspelling.py`).
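These metrics can be computed with scikit-learn, for example. The labels below are illustrative placeholders, not the model's actual test-set predictions:

```python
from sklearn.metrics import accuracy_score, f1_score

# Dummy class indices for illustration only (A=0, B=1, ...)
y_true = [0, 1, 2, 2, 3]
y_pred = [0, 1, 2, 1, 3]

acc = accuracy_score(y_true, y_pred)
# "weighted" averages per-class F1 scores weighted by class support
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
# average=None returns one F1 score per class
per_class_f1 = f1_score(y_true, y_pred, average=None)
print(f"Accuracy: {acc:.2%}, Weighted F1: {weighted_f1:.2%}")
```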
## How to use
An example inference setup is available on GitHub ([inference script](https://github.com/aleksandra-baranowska/clip-asl-fingerspelling?tab=readme-ov-file#inference)).
Two scripts show how to load the model along with the additional classifier layer and trained weights: one classifies
a single given image, while the other handles batch classification and reports performance results.
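Single-image classification can be sketched roughly as below. This is a hypothetical outline, not the repository's script: `classify` and its signature are assumptions, `model` stands in for the fine-tuned classifier loaded from the checkpoint, and preprocessing uses CLIP's default image-processor settings (the repo uses `CLIPProcessor` loaded from the base checkpoint).

```python
import string

import torch
from PIL import Image
from transformers import CLIPImageProcessor

# Class index -> letter, assuming classes are ordered A..Z
LETTERS = list(string.ascii_uppercase)

def classify(image: Image.Image, model, processor) -> str:
    """Return the predicted ASL letter for a single image (sketch)."""
    # Resize/normalize the image the same way CLIP expects
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs["pixel_values"])  # shape: (1, 26)
    return LETTERS[logits.argmax(dim=-1).item()]
```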
## Limitations
The dataset the model was trained on contains some inaccurate signing, which influences the final result. When tested on a small sample
of images captured under different conditions, performance dropped substantially (accuracy: 79.66%).
---
metrics:
- accuracy
- f1
base_model:
- openai/clip-vit-base-patch32
pipeline_tag: image-classification
---