Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages


This is a sentence-transformers model fine-tuned from sentence-transformers/stsb-xlm-r-multilingual. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.


Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Download the model from the 🤗 Hub
model = SentenceTransformer("krutrim-ai-labs/vyakyarth")

# The first sentence is Hindi for "I met my friend"
sentences = [
    "मैं अपने दोस्त से मिला",
    "I met my friend",
    "I love you",
]

# Run inference; encode() returns one 768-dimensional vector per sentence
embeddings = model.encode(sentences)

# Hindi sentence vs. its English translation
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
# Score: 0.9861017

# Hindi sentence vs. an unrelated English sentence
print(cosine_similarity([embeddings[0]], [embeddings[2]])[0][0])
# Score: 0.26329127
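
The same cosine similarity can be used for the semantic search use case mentioned in the introduction. The following is a minimal sketch, not part of the model's own API: the helper names `cosine_sim_matrix` and `top_k` are hypothetical, and small toy vectors stand in for real 768-dimensional embeddings so the example runs without downloading the model.

```python
import numpy as np


def cosine_sim_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a (n, d) and b (m, d)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k corpus rows most similar to query."""
    scores = cosine_sim_matrix(query[None, :], corpus)[0]
    order = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-dimensional vectors (an assumption for illustration only).
corpus = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
query = np.array([2.0, 0.1, 0.0])

hits = top_k(query, corpus, k=2)
print(hits)  # corpus rows 0 and 2 rank highest for this query
```

With the real model, `corpus` and `query` would presumably come from `model.encode(...)`, which returns NumPy arrays by default.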

Evaluation/Benchmarking

Dataset: Flores Cross-Lingual Sentence Retrieval Task of the IndicXtreme Benchmark

Language    MuRIL   IndicBERT   Vyakyarth   jina-embeddings-v3
Bengali     77.0    91.0        98.7        97.4
Gujarati    67.0    92.4        98.7        97.3
Hindi       84.2    90.5        99.9        98.8
Kannada     88.4    89.1        99.2        96.8
Malayalam   82.2    89.2        98.7        96.3
Marathi     83.9    92.5        98.8        97.1
Sanskrit    36.4    30.4        90.1        84.1
Tamil       79.4    90.0        97.9        95.8
Telugu      43.5    88.6        97.5        97.3
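
The card does not say how the scores above are computed. A common way to score cross-lingual sentence retrieval is top-1 accuracy: each source sentence retrieves its nearest target sentence by cosine similarity, and the score is the fraction of correct matches. The sketch below assumes that metric; the function name `retrieval_accuracy` and the toy vectors are hypothetical and are not taken from the benchmark's code.

```python
import numpy as np


def retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Top-1 accuracy, assuming row i of src is a translation of row i of tgt."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src_n @ tgt_n.T            # (n_src, n_tgt) cosine similarities
    predicted = sims.argmax(axis=1)   # nearest target for each source row
    return float((predicted == np.arange(len(src))).mean())


# Toy data: each "source" vector is its "target" vector plus a constant shift.
tgt = np.eye(4)
src = tgt + 0.1

acc = retrieval_accuracy(src, tgt)                        # all pairs aligned
acc_swapped = retrieval_accuracy(src[[1, 0, 2, 3]], tgt)  # two rows swapped
print(acc, acc_swapped)
```

Multiplying such an accuracy by 100 gives a percentage on the same 0 to 100 scale as the table.
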
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
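
The card does not say what these two parameters configure. In the Sentence Transformers library, `scale=20.0` and `similarity_fct=cos_sim` are the default arguments of `MultipleNegativesRankingLoss`, so one plausible reading is a ranking loss with in-batch negatives. The NumPy sketch below illustrates that kind of loss under this assumption; `in_batch_ranking_loss` is a hypothetical name, and this is not the authors' training code.

```python
import numpy as np


def in_batch_ranking_loss(anchors: np.ndarray, positives: np.ndarray,
                          scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities.

    Row i of positives is the correct match for row i of anchors; every
    other row in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


anchors = np.eye(3)
loss_aligned = in_batch_ranking_loss(anchors, np.eye(3))
loss_misaligned = in_batch_ranking_loss(anchors, np.roll(np.eye(3), 1, axis=0))
print(loss_aligned, loss_misaligned)
```

The scale acts as an inverse temperature: a larger value sharpens the softmax, so the loss penalizes a wrong nearest neighbour more heavily.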

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
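
The Pooling module above has `pooling_mode_mean_tokens` set to True, which means the sentence embedding is the average of the token embeddings, with padding positions ignored. The following NumPy sketch illustrates that operation on toy data; `mean_pool` is a hypothetical helper, not the library's implementation, and 2-dimensional vectors stand in for the 768-dimensional ones.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence axis, skipping padding.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# One sentence with two real tokens and one padding token.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool(tokens, mask)
print(pooled)  # the padding vector does not affect the average
```

Because `max_seq_length` is 128, inputs longer than 128 tokens are truncated before pooling, so text beyond that limit does not contribute to the embedding.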

License

This code repository and the model weights are licensed under the Krutrim Community License.

Citation

@misc{vyakyarth2024,
  author = {Pushkar Singh and Sandeep Kumar Pandey and Rajkiran Panuganti},
  title = {Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ola-krutrim/Vyakyarth}}
}

Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
