aalof/clipvision-asl-fingerspelling

Model description

clip-asl-fingerspelling is a classifier of American Sign Language (ASL) fingerspelled letters. It is built on OpenAI’s CLIP vision model (openai/clip-vit-base-patch32), fine-tuned with a classifier head added on top of the visual encoder. The full code, dataset details, and an example inference script are available on GitHub: clip-asl-fingerspelling.
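As a rough illustration of the architecture described above, the sketch below wraps a CLIP vision encoder with a classification head. The head design (a single linear layer over the pooled output), the class name `ClipAslClassifier`, and the helper `build_tiny_model` are assumptions made for illustration; the actual head used in the repository may differ.

```python
import torch
from torch import nn
from transformers import CLIPVisionConfig, CLIPVisionModel

NUM_CLASSES = 26  # one class per letter A-Z


class ClipAslClassifier(nn.Module):
    """CLIP vision encoder followed by a linear classifier head (assumed design)."""

    def __init__(self, encoder: CLIPVisionModel, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.encoder = encoder
        # Assumption: the head maps the pooled image embedding to class logits.
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        pooled = self.encoder(pixel_values=pixel_values).pooler_output
        return self.head(pooled)


def build_tiny_model() -> ClipAslClassifier:
    """Hypothetical helper: a small, randomly initialised encoder for shape checks."""
    config = CLIPVisionConfig(
        hidden_size=32,
        intermediate_size=64,
        num_hidden_layers=1,
        num_attention_heads=2,
        image_size=32,
        patch_size=16,
    )
    return ClipAslClassifier(CLIPVisionModel(config))
```

To start from the pretrained base instead of the tiny random encoder, pass `CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")` as the `encoder` argument. The fine-tuned head weights must still be loaded separately, as the scripts on GitHub do.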

Training

The model was trained on 206,137 images of signs corresponding to the 26 letters of the English alphabet (A-Z). The ASL Alphabet Dataset was processed using CLIPProcessor and split into training, validation, and test sets (70%, 20%, and 10%, respectively). Training used the following parameters:

Learning rate: 1e-5
Batch size: 32
Epochs: 10
Optimizer: AdamW
Learning rate scheduler: StepLR (step_size=5, gamma=0.1)
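The data split and the learning-rate schedule listed above can be sketched in plain Python. This is a minimal illustration, not the repository's code: it assumes that 206,137 is the total image count before splitting, that the split is a simple seeded random shuffle (the card does not say whether it was stratified), and that the scheduler is stepped once per epoch. The names `split_indices` and `step_lr` are hypothetical.

```python
import random


def split_indices(n_items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle item indices and cut them into train/validation/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_items * fractions[0])
    n_val = int(n_items * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test


def step_lr(epoch, base_lr=1e-5, step_size=5, gamma=0.1):
    """Learning rate for a given epoch under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    train, val, test = split_indices(206_137)
    print(len(train), len(val), len(test))
    for epoch in range(10):
        print(epoch, step_lr(epoch))
```

With step_size=5 and gamma=0.1 over 10 epochs, the learning rate stays at 1e-5 for the first five epochs and drops to about 1e-6 for the last five.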

Results

Performance on the test set was measured with accuracy, weighted F1 score, and per-class F1 score. The fine-tuned model achieves:

Metric	Value
Accuracy	99.88%
Weighted F1 Score	99.88%

Per-class F1 scores range from 99.61% to 100% (the full per-class values are available in the notebook version of clip-asl-fingerspelling.py).
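For readers unfamiliar with these metrics, the following sketch shows how accuracy, per-class F1, and weighted F1 relate to each other. It is a plain-Python illustration with hypothetical function names, not the evaluation code from the repository. Weighted F1 here means the mean of per-class F1 scores weighted by each class's number of true samples.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 score for every label that appears in the true or predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[t] += 1  # true label gets a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)
```

When classes are roughly balanced and errors are rare, accuracy and weighted F1 end up very close, which is consistent with the identical values reported in the table.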

How to use

Example inference code is available on GitHub (see the inference script). Two scripts show how to load the model together with the additional classifier layer and the trained weights. One classifies a single given image; the other handles batch classification and reports performance results.
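Once the model has produced 26 logits for an image, the last step of single-image classification is to turn them into a letter. The sketch below shows that step only. It assumes that class indices follow alphabetical order (index 0 is "A", index 25 is "Z"), which should be checked against the label mapping in the GitHub scripts; `decode_prediction` is a hypothetical helper name.

```python
import math
import string

# Assumed class order: index 0 -> "A", ..., index 25 -> "Z".
LETTERS = string.ascii_uppercase


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits):
    """Return the most likely letter and its probability."""
    if len(logits) != len(LETTERS):
        raise ValueError(f"expected {len(LETTERS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LETTERS[best], probs[best]
```

For example, `decode_prediction(model_logits)` returns a `(letter, probability)` pair, where `model_logits` is a list of 26 floats taken from the classifier output for one image.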

Limitations

The training dataset contains some inaccurately signed letters, which affects the final result. When tested on a small sample of images taken under different conditions, the model performed much worse (accuracy: 79.66%).

metrics:
- accuracy
- f1
base_model:
- openai/clip-vit-base-patch32
pipeline_tag: image-classification