Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages
This is a sentence-transformers model fine-tuned from sentence-transformers/stsb-xlm-r-multilingual. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar embedding-based tasks.
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Download from the 🤗 Hub
model = SentenceTransformer("krutrim-ai-labs/vyakyarth")
# Run inference. The first sentence is Hindi for "I met my friend".
sentences = ["मैं अपने दोस्त से मिला", "I met my friend", "I love you"]
# encode() returns a NumPy array with one 768-dimensional row per sentence
embeddings = model.encode(sentences)
# Hindi sentence vs. its English translation
print(cosine_similarity([embeddings[0]], [embeddings[1]])[0][0])
# Score: 0.9861017
# Hindi sentence vs. an unrelated English sentence
print(cosine_similarity([embeddings[0]], [embeddings[2]])[0][0])
# Score: 0.26329127
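For semantic search, the same cosine score can be used to rank a corpus against a query. The following is a minimal sketch and not part of the model's API: the helper names `cosine` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings so the example runs without downloading the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption: placeholders for model.encode() output).
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

ranking = rank_by_similarity(query, corpus)
print([index for index, _ in ranking])  # [1, 0, 2]
```

With the real model, the rows of `model.encode(...)` can be passed in place of the toy vectors. For large corpora a vectorized or indexed search would be a better fit than this pure-Python loop.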
Evaluation/Benchmarking
Dataset: FLORES cross-lingual sentence retrieval task of the IndicXTREME benchmark
Language | MuRIL | IndicBERT | Vyakyarth | jina-embeddings-v3 |
---|---|---|---|---|
Bengali | 77.0 | 91.0 | 98.7 | 97.4 |
Gujarati | 67.0 | 92.4 | 98.7 | 97.3 |
Hindi | 84.2 | 90.5 | 99.9 | 98.8 |
Kannada | 88.4 | 89.1 | 99.2 | 96.8 |
Malayalam | 82.2 | 89.2 | 98.7 | 96.3 |
Marathi | 83.9 | 92.5 | 98.8 | 97.1 |
Sanskrit | 36.4 | 30.4 | 90.1 | 84.1 |
Tamil | 79.4 | 90.0 | 97.9 | 95.8 |
Telugu | 43.5 | 88.6 | 97.5 | 97.3 |
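The card does not state how the scores above are computed. A common way to score cross-lingual sentence retrieval is top-1 accuracy: for each source sentence, retrieve the nearest target sentence by cosine similarity and count a hit when it is the aligned translation. The sketch below illustrates that metric under this assumption; `retrieval_accuracy` is a hypothetical helper and the 2-dimensional vectors are toy data, not model output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieval_accuracy(src_vecs, tgt_vecs):
    """Percentage of source vectors whose nearest target has the same index.

    Assumes src_vecs[i] and tgt_vecs[i] embed a sentence and its translation.
    """
    hits = 0
    for i, src in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        hits += int(best == i)
    return 100.0 * hits / len(src_vecs)

# Toy aligned pairs; the last source vector is closest to the wrong target.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
tgt = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6], [0.5, 0.3]]

accuracy = retrieval_accuracy(src, tgt)
print(accuracy)  # 75.0
```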
Training loss parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
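The card does not name the loss these two parameters belong to. In Sentence Transformers, `scale` and `similarity_fct` are the parameters of in-batch-negatives ranking losses such as `MultipleNegativesRankingLoss`, so the sketch below assumes an objective of that kind: cosine similarities are multiplied by the scale and fed to a cross-entropy in which each anchor's own positive is the correct class. It is an illustration in plain Python with a hypothetical function name, not the training code used for this model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    positives[i] is the correct match for anchors[i]; every other positive
    in the batch acts as a negative. Assumed loss form, see the lead-in.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other positive

loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
print(loss_aligned < loss_swapped)  # True
```

A scale of 20 sharpens the softmax: because cosine similarity is bounded in [-1, 1], unscaled logits would leave the loss nearly flat even for well-separated pairs.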
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
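In the printout above, only `pooling_mode_mean_tokens` is enabled, so the sentence vector is the mean of the token embeddings produced by the XLM-RoBERTa encoder. The sketch below shows mean pooling with an attention mask so that padding tokens are ignored. It uses plain lists and 2-dimensional toy vectors instead of 768-dimensional tensors, and `mean_pool` is a hypothetical name, not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1, skipping padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    return [value / count for value in total]

# Two real tokens followed by one padding token (mask value 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```

Note that `max_seq_length` is 128, so inputs longer than 128 tokens are truncated before pooling and the remaining text does not contribute to the embedding.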
License
This code repository and the model weights are licensed under the Krutrim Community License.
Citation
@inproceedings{vyakyarth2024,
author = {Pushkar Singh and Sandeep Kumar Pandey and Rajkiran Panuganti},
title = {Vyakyarth: A Multilingual Sentence Embedding Model for Indic Languages},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ola-krutrim/Vyakyarth}}
}
Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.