## Model description
`clip-asl-fingerspelling` is a classifier of American Sign Language (ASL) fingerspelled letters. The base is OpenAI's CLIP vision model (`openai/clip-vit-base-patch32`), fine-tuned with a classifier head added on top of the visual encoder. The full code, dataset details, and an example inference script are available on GitHub: clip-asl-fingerspelling.
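The card does not spell out the exact head architecture, so the snippet below is a minimal sketch of one plausible setup: the CLIP vision encoder with a single linear layer on its pooled output. The class name `CLIPFingerspellingClassifier` and the single-linear-layer head are assumptions; the repository's code is authoritative.

```python
# Minimal sketch: CLIP vision encoder + linear classification head.
# The head shown here (one linear layer, no dropout) is an assumption.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class CLIPFingerspellingClassifier(nn.Module):
    def __init__(self, num_classes: int = 26, base: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        self.vision_encoder = CLIPVisionModel.from_pretrained(base)
        hidden_size = self.vision_encoder.config.hidden_size  # 768 for ViT-B/32
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        outputs = self.vision_encoder(pixel_values=pixel_values)
        pooled = outputs.pooler_output        # (batch, hidden_size)
        return self.classifier(pooled)        # (batch, num_classes) logits
```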
## Training
The model was trained on 206,137 images of signs corresponding to the 26 letters of the English alphabet (A-Z). The ASL Alphabet Dataset was processed with `CLIPProcessor` and split into train, validation, and test sets (70%, 20%, and 10%, respectively). Training used the following parameters (a sketch of this setup follows the list):
- Learning rate: 1e-5
- Batch size: 32
- Epochs: 10
- Optimizer: AdamW
- Learning rate scheduler: StepLR (step_size=5, gamma=0.1)
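As a rough illustration of the configuration above, the sketch below wires up AdamW and StepLR with the listed values. The training loop itself, the `train_loader`, and the reuse of the `CLIPFingerspellingClassifier` class from the earlier sketch are assumptions; the actual loop in the repository may differ.

```python
# Hedged sketch of the training configuration listed above. Assumes the
# CLIPFingerspellingClassifier sketched earlier and a hypothetical
# `train_loader` (batch size 32) yielding CLIPProcessor-preprocessed
# pixel values and integer labels 0-25.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPFingerspellingClassifier().to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)  # lr x0.1 every 5 epochs
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for pixel_values, labels in train_loader:  # hypothetical DataLoader
        pixel_values, labels = pixel_values.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(pixel_values), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```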
## Results
Performance on the test set was measured with accuracy, weighted F1 score, and per-class F1 score. The fine-tuned model achieves:
| Metric | Value |
|---|---|
| Accuracy | 99.88% |
| Weighted F1 Score | 99.88% |
Per-class F1 scores range from 99.61% to 100% (available in the notebook version of `clip-asl-fingerspelling.py`).
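For reference, these metrics can be computed from model predictions with scikit-learn. The random `y_true`/`y_pred` arrays below are placeholders standing in for the test-set labels and predictions, used only to show the metric calls.

```python
# Illustrative metric computation; y_true / y_pred would normally come from
# running the model over the test split.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 26, size=1000)   # placeholder labels (0-25 = A-Z)
y_pred = y_true.copy()                    # placeholder predictions

accuracy = accuracy_score(y_true, y_pred)                   # overall accuracy
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted F1 score
per_class_f1 = f1_score(y_true, y_pred, average=None)       # one score per letter
```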
## How to use
An inference example is available on GitHub (inference script). There are two scripts that show how to load the model together with the additional classifier layer and the trained weights. One is intended for classifying a single image, while the other handles batch classification and reports performance results.
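The sketch below shows single-image inference under the same assumptions as the earlier snippets. The `CLIPFingerspellingClassifier` class, the checkpoint filename, and the image path are illustrative; refer to the inference script on GitHub for the exact loading code.

```python
# Illustrative single-image inference; the classifier class, checkpoint
# filename, and image path are assumptions, not the repository's exact code.
import string
import torch
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPFingerspellingClassifier()                      # class sketched above
model.load_state_dict(torch.load("classifier_weights.pt",   # hypothetical checkpoint
                                 map_location="cpu"))
model.eval()

image = Image.open("example_sign.jpg").convert("RGB")       # hypothetical image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(pixel_values=inputs["pixel_values"])
predicted_letter = string.ascii_uppercase[logits.argmax(dim=-1).item()]
print(predicted_letter)
```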
## Limitations
The dataset the model was trained on contains some inaccurately signed examples, which influences the final results. When tested on a small sample of images captured under different conditions, performance was much worse (accuracy: 79.66%).