Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages

This is a sentence-transformers model fine-tuned from sentence-transformers/stsb-xlm-r-multilingual. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and related tasks.

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Download from the 🤗 Hub
model = SentenceTransformer("krutrim-ai-labs/vyakyarth")

# Run inference; encode() returns a NumPy array with one 768-dimensional row per sentence
sentences = ["मैं अपने दोस्त से मिला", "I met my friend", "I love you"]
embeddings = model.encode(sentences)

# Hindi sentence vs. its English translation
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
# Score: 0.9861017

# Hindi sentence vs. an unrelated English sentence
print(cosine_similarity([embeddings[0]], [embeddings[2]])[0][0])
# Score: 0.26329127
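
The scores above are cosine similarities between embedding vectors. As a minimal sketch of what that computation does and how it can drive semantic search, the following uses small made-up vectors in place of real 768-dimensional model outputs. The function names `cosine` and `rank_by_similarity` are illustrative and not part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, corpus):
    """Return corpus indices sorted from most to least similar to the query."""
    scores = [cosine(query, doc) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for model embeddings (assumption).
query = [1.0, 0.0, 1.0]
corpus = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, corpus)
print(ranking)
```

Because cosine similarity ignores vector length, the second corpus entry ranks first even though its magnitude differs from the query's.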

Evaluation/Benchmarking

Dataset: FLORES cross-lingual sentence retrieval task from the IndicXTREME benchmark

Language	MuRIL	IndicBERT	Vyakyarth	jina-embeddings-v3
Bengali	77.0	91.0	98.7	97.4
Gujarati	67.0	92.4	98.7	97.3
Hindi	84.2	90.5	99.9	98.8
Kannada	88.4	89.1	99.2	96.8
Malayalam	82.2	89.2	98.7	96.3
Marathi	83.9	92.5	98.8	97.1
Sanskrit	36.4	30.4	90.1	84.1
Tamil	79.4	90.0	97.9	95.8
Telugu	43.5	88.6	97.5	97.3
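
The source does not state which metric these scores report. Assuming they are top-1 retrieval accuracies in percent, where each query sentence must retrieve its own translation as the nearest neighbour by cosine similarity, the evaluation could be sketched as follows. The toy vectors and the function name `top1_retrieval_accuracy` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top1_retrieval_accuracy(queries, targets):
    """Percentage of queries whose most similar target has the same index.

    Assumes queries[i] and targets[i] embed the two sides of a translation pair.
    """
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(targets)), key=lambda j: cosine(q, targets[j]))
        if best == i:
            hits += 1
    return 100.0 * hits / len(queries)


# Toy embeddings (assumption): the first three pairs align, the fourth does not.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
targets = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.5]]

accuracy = top1_retrieval_accuracy(queries, targets)
print(accuracy)
```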

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
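
The source gives this configuration without naming the component it belongs to. Assuming it parameterizes an in-batch contrastive loss, where the cosine similarity (`cos_sim`) between each anchor and every candidate in the batch is multiplied by `scale` and passed through softmax cross-entropy with the matching pair as the correct class, the computation could be sketched as below. This is an illustration on toy vectors under that assumption, not the confirmed training objective, and `in_batch_contrastive_loss` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    For anchor i, positives[i] is the correct match and every other
    positives[j] in the batch acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp of the logits.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch (assumption): two orthogonal sentence pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]

aligned = in_batch_contrastive_loss(anchors, positives)
swapped = in_batch_contrastive_loss(anchors, positives[::-1])
print(aligned, swapped)
```

With correctly matched pairs the loss is close to zero, while swapping the positives drives it to roughly the value of `scale`. A larger scale sharpens the softmax, so mismatched pairs are penalized more strongly.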

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
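
The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence embedding is the average of the token embeddings. A minimal sketch of mask-aware mean pooling follows, assuming padding tokens are excluded through an attention mask. It uses tiny 2-dimensional token vectors in place of the 768-dimensional ones, and `mean_pool` is an illustrative name rather than the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention mask entry is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                sums[k] += vec[k]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in sums]


# Two real tokens followed by one padding token (toy values, assumption).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)
```

The padding vector is ignored, so the result is the mean of the first two token vectors only.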

License

This code repository and the model weights are licensed under the Krutrim Community License.

Citation

@misc{vyakyarth2024,
  author = {Pushkar Singh and Sandeep Kumar Pandey and Rajkiran Panuganti},
  title = {Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ola-krutrim/Vyakyarth}}
}

Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.