---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
base_model:
- intfloat/multilingual-e5-large
---

# Knesset-multi-e5-large

This is a [sentence-transformers](https://www.sbert.net) model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.

**Knesset-multi-e5-large** is based on the [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) model. The transformer encoder has been fine-tuned on [Knesset data](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) to better capture legislative and parliamentary language.

## Usage (Sentence-Transformers)

Using this model is straightforward if you have [sentence-transformers](https://www.sbert.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# Two example sentences in Hebrew:
# "This is a first example sentence", "This is the second sentence"
sentences = ["זה משפט ראשון לדוגמה", "זה המשפט השני"]

model = SentenceTransformer('GiliGold/Knesset-multi-e5-large')
embeddings = model.encode(sentences)
print(embeddings)
```

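The embeddings are L2-normalized by the model's final Normalize() layer (see the architecture below), so they can be compared directly with cosine similarity. A minimal sketch using the standard `sentence_transformers.util.cos_sim` helper on the two example sentences above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('GiliGold/Knesset-multi-e5-large')

sentences = ["זה משפט ראשון לדוגמה", "זה המשפט השני"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings.
# Since the embeddings are L2-normalized, this is equivalent to their dot product.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())
```
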
## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```
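For reference, the same pipeline can be reproduced with plain Hugging Face `transformers`: encode with the XLM-RoBERTa encoder, mean-pool the token embeddings (as the Pooling module does), then L2-normalize (as the Normalize module does). This is a minimal sketch, assuming the repository's transformer weights load through `AutoModel` in the usual way:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('GiliGold/Knesset-multi-e5-large')
model = AutoModel.from_pretrained('GiliGold/Knesset-multi-e5-large')

sentences = ["זה משפט ראשון לדוגמה", "זה המשפט השני"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors='pt')

with torch.no_grad():
    outputs = model(**batch)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = batch['attention_mask'].unsqueeze(-1).float()
summed = (outputs.last_hidden_state * mask).sum(dim=1)
embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)

# L2-normalize, mirroring the Normalize() module.
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 1024])
```
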
## Additional Details

- Base model: [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- Fine-tuning data: [Knesset data](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus)
- Key modifications:
  - The encoder has been fine-tuned on Knesset data to enhance performance for tasks involving legislative and parliamentary content (see the retrieval sketch below).
  - The original pooling and normalization layers have been retained to ensure that the model's embeddings remain consistent with the architecture of the base model.

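As a small illustration of the semantic-search use case, a retrieval sketch built on `sentence_transformers.util.semantic_search`. The corpus and query below are hypothetical Hebrew placeholders, not sentences from the Knesset Corpus:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('GiliGold/Knesset-multi-e5-large')

# Hypothetical corpus sentences (placeholders, not real Knesset data):
# "The committee discussed the bill", "The Knesset members voted on the budget",
# "This is a sentence unrelated to the topic".
corpus = [
    "הוועדה דנה בהצעת החוק",
    "חברי הכנסת הצביעו על התקציב",
    "זה משפט שאינו קשור לנושא",
]
# Query: "a discussion of a bill in committee"
query = "דיון על הצעת חוק בוועדה"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar corpus sentences for the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])
```
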
## Citing & Authors

<!--- Describe where people can find more information -->

TBD