---
metrics:
- accuracy
- f1
base_model:
- openai/clip-vit-base-patch32
pipeline_tag: image-classification
---

## Model description

`clip-asl-fingerspelling` is a classifier of American Sign Language (ASL) fingerspelled letters. The base is OpenAI’s CLIP vision model (`openai/clip-vit-base-patch32`), fine-tuned with a classifier head added on top of the visual encoder. The full code, dataset details, and an example inference script are available on GitHub: [`clip-asl-fingerspelling`](https://github.com/aleksandra-baranowska/clip-asl-fingerspelling?tab=readme-ov-file).

## Training

The model was trained on 206,137 images of signs corresponding to the 26 letters of the English alphabet (A-Z). The [ASL Alphabet Dataset](https://www.kaggle.com/datasets/debashishsau/aslamerican-sign-language-aplhabet-dataset) was processed with `CLIPProcessor` and split into train, validation, and test sets (70%, 20%, and 10% respectively). Training used the following hyperparameters (a minimal sketch of this setup is given at the end of this card):

- Learning rate: 1e-5
- Batch size: 32
- Epochs: 10
- Optimizer: AdamW
- Learning rate scheduler: StepLR (step_size=5, gamma=0.1)

## Results

Performance on the test set was measured with accuracy, weighted F1 score, and per-class F1 score. The fine-tuned model achieves:

| Metric                | Value  |
|-----------------------|--------|
| **Accuracy**          | 99.88% |
| **Weighted F1 Score** | 99.88% |

Per-class F1 scores range from 99.61% to 100% (available in the [notebook version](https://colab.research.google.com/drive/1SHz-t2I9DKyxEbC9F7C4nKdhVSZyUXSJ?authuser=3#scrollTo=r3H2wC7jYcCn) of `clip-asl-fingerspelling.py`).

## How to use

Example inference is available on GitHub: [inference script](https://github.com/aleksandra-baranowska/clip-asl-fingerspelling?tab=readme-ov-file#inference). Two scripts show how to load the model together with the additional classifier layer and its trained weights: one classifies a single given image, while the other handles batch classification and reports performance results. A hedged inference sketch is also included at the end of this card.

## Limitations

The dataset the model was trained on contains some inaccurate signing, which influences the final result. When tested on a small sample of images taken under different conditions, performance was much worse (accuracy: 79.66%).
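## Code sketches

The sketches below are illustrative only and are not taken from the repository; for the authoritative code, see the GitHub links above.

This first sketch shows one plausible reading of the setup described in the Training section: a linear classifier head on top of CLIP's pooled visual features, trained with the listed hyperparameters. The class name `ASLClassifier` and the use of `pooler_output` are assumptions.

```python
# Minimal sketch of the fine-tuning setup (assumed details marked below).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

NUM_CLASSES = 26  # letters A-Z


class ASLClassifier(nn.Module):  # class name is illustrative, not from the repo
    def __init__(self):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        # Assumption: the classifier head is a single linear layer on the
        # pooled visual features of the CLIP vision encoder.
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_CLASSES)

    def forward(self, pixel_values):
        pooled = self.encoder(pixel_values=pixel_values).pooler_output
        return self.head(pooled)


model = ASLClassifier()
# Hyperparameters as listed in the Training section.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
criterion = nn.CrossEntropyLoss()
```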
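Metrics like those in the Results section can be computed with scikit-learn; this is a generic sketch, not the repository's evaluation code, and `y_true`/`y_pred` are placeholders.

```python
# Generic evaluation-metric sketch (y_true/y_pred are placeholders).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2]   # placeholder ground-truth class indices
y_pred = [0, 1, 2, 1]   # placeholder predicted class indices

accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
per_class_f1 = f1_score(y_true, y_pred, average=None)  # one score per class
```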
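Building on the first sketch above, this sketch shows single-image inference with `CLIPProcessor`. The checkpoint filename (`asl_classifier.pt`), the image path, and the A-Z label ordering are assumptions; the actual files and label mapping are defined in the repository's inference script.

```python
# Single-image inference sketch (filenames and label order are assumed).
import string

import torch
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = list(string.ascii_uppercase)  # assumed label order: "A" .. "Z"

# Load the trained weights into the ASLClassifier defined above.
model.load_state_dict(torch.load("asl_classifier.pt", map_location="cpu"))
model.eval()

image = Image.open("example_sign.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")  # resizes/normalizes for CLIP

with torch.no_grad():
    logits = model(inputs["pixel_values"])
predicted_letter = labels[logits.argmax(dim=-1).item()]
print(f"Predicted letter: {predicted_letter}")
```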