The fully trained version of this model is now available at https://huggingface.co/sarvamai/sarvam-1

Update (Aug 15, 2024): You can now get started with text completions and supervised fine-tuning using this notebook on Google Colab!

This is an early checkpoint of sarvam-2b, a small yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to perform well in 10 Indic languages in addition to English. The officially supported Indic languages are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

The final checkpoint of sarvam-2b will be released soon. It will be trained on a data mixture of 4 trillion tokens, containing equal parts English (2T) and Indic (2T) tokens.

The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in this video.

The model was trained with NVIDIA NeMo™ Framework on the Yotta Shakti Cloud using HGX H100 systems.

Getting started:

```python
from transformers import pipeline

# Load the text-generation pipeline on the first GPU (device=0).
pipe = pipeline('text-generation', model='sarvamai/sarvam-2b-v0.5', device=0)

# Complete a Hindi prompt ("India's first Prime Minister"); do_sample=True is needed for temperature to take effect.
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, do_sample=True, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'
```
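The update above also mentions supervised fine-tuning. Below is a minimal, illustrative sketch of causal-LM fine-tuning with the standard transformers Trainer; the toy one-example dataset, the "text" field name, and the hyperparameters are assumptions for illustration, and the official Colab notebook remains the recommended starting point.

```python
# A minimal supervised fine-tuning sketch using the standard transformers
# Trainer. The toy dataset and hyperparameters below are illustrative
# assumptions, not the settings from the official notebook.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "sarvamai/sarvam-2b-v0.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Causal-LM tokenizers often ship without a pad token; fall back to EOS.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Toy dataset: each example is one full prompt+response string.
# ("Question: What is the capital of India? Answer: New Delhi")
dataset = Dataset.from_list(
    [{"text": "प्रश्न: भारत की राजधानी क्या है?\nउत्तर: नई दिल्ली"}]
)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sarvam-2b-sft",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        bf16=True,  # the checkpoint ships in BF16; assumes BF16-capable hardware
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator copy input_ids into labels (next-token loss).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```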

Tokenizer

sarvam-2b's tokenizer is built to be efficient for Indic languages. It has an average fertility score (tokens generated per word) of ~2, which is significantly lower than that of other popular models.

Here is a comparison of fertility scores between sarvam-2b and other popular models:

| Language | Sarvam-2B | Llama-3.1 | Gemma-2 | GPT-4o |
|----------|-----------|-----------|---------|--------|
| ben_Beng | 2.07 | 8.02 | 3.72 | 2.34 |
| eng_Latn | 1.43 | 1.24 | 1.23 | 1.23 |
| guj_Gujr | 1.81 | 9.97 | 3.90 | 2.30 |
| hin_Deva | 1.40 | 2.67 | 1.96 | 1.65 |
| kan_Knda | 2.37 | 14.95 | 5.55 | 3.29 |
| mal_Mlym | 2.85 | 16.26 | 5.88 | 3.52 |
| mar_Deva | 1.77 | 3.99 | 3.20 | 2.56 |
| ory_Orya | 2.35 | 16.84 | 6.87 | 6.83 |
| pan_Guru | 1.68 | 8.19 | 3.37 | 2.72 |
| tam_Taml | 2.17 | 12.39 | 4.19 | 3.17 |
| tel_Telu | 2.14 | 13.30 | 4.57 | 3.06 |
| **Average** | 2.08 | 9.34 | 4.01 | 3.00 |
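
For reference, fertility here is the average number of tokens the tokenizer produces per word: lower is better. A rough sketch of how such a score can be computed follows; the sample sentence and the whitespace-based word split are simplifying assumptions, and the exact methodology behind the table may differ.

```python
# Rough fertility estimate: tokens per whitespace-separated word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-2b-v0.5")

def fertility(text: str) -> float:
    words = text.split()  # crude word split; real evaluations may segment differently
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(tokens) / len(words)

# Example on a Hindi sentence ("India's first Prime Minister was Jawaharlal Nehru.").
print(fertility("भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।"))
```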

More technical details like evaluations and benchmarking will be posted soon.
