---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
base_model:
- intfloat/multilingual-e5-large
---

# Knesset-multi-e5-large

This is a [sentence-transformers](https://www.sbert.net) model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.

**Knesset-multi-e5-large** is based on the [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) model. The transformer encoder has been fine-tuned on [Knesset data](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) to better capture legislative and parliamentary language.

## Usage (Sentence-Transformers)

Using this model is straightforward if you have [sentence-transformers](https://www.sbert.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# Two example sentences in Hebrew:
# "This is a first example sentence", "This is the second sentence"
sentences = ["זה משפט ראשון לדוגמה", "זה המשפט השני"]

model = SentenceTransformer('GiliGold/Knesset-multi-e5-large')
embeddings = model.encode(sentences)
print(embeddings)
```

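The embeddings are L2-normalized by the model's final Normalize() layer (see the architecture below), so they can be compared directly with cosine similarity. A minimal sketch using the standard `sentence_transformers.util.cos_sim` helper on the two example sentences above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('GiliGold/Knesset-multi-e5-large')

sentences = ["זה משפט ראשון לדוגמה", "זה המשפט השני"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings.
# Since the embeddings are L2-normalized, this is equivalent to their dot product.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())
```
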
## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```
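For reference, the same pipeline can be reproduced with plain Hugging Face `transformers`: encode with the XLM-RoBERTa encoder, mean-pool the token embeddings (as the Pooling module does), then L2-normalize (as the Normalize module does). This is a minimal sketch, assuming the repository's transformer weights load through `AutoModel` in the usual way:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('GiliGold/Knesset-multi-e5-large')
model = AutoModel.from_pretrained('GiliGold/Knesset-multi-e5-large')

sentences = ["זה משפט ראשון לדוגמה", "זה המשפט השני"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors='pt')

with torch.no_grad():
    outputs = model(**batch)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = batch['attention_mask'].unsqueeze(-1).float()
summed = (outputs.last_hidden_state * mask).sum(dim=1)
embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)

# L2-normalize, mirroring the Normalize() module.
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 1024])
```
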
## Additional Details

- Base model: [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- Fine-tuning data: [Knesset data](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus)
- Key modifications:
  - The encoder has been fine-tuned on Knesset data to enhance performance for tasks involving legislative and parliamentary content (see the retrieval sketch below).
  - The original pooling and normalization layers have been retained to ensure that the model's embeddings remain consistent with the architecture of the base model.

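As a small illustration of the semantic-search use case, a retrieval sketch built on `sentence_transformers.util.semantic_search`. The corpus and query below are hypothetical Hebrew placeholders, not sentences from the Knesset Corpus:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('GiliGold/Knesset-multi-e5-large')

# Hypothetical corpus sentences (placeholders, not real Knesset data):
# "The committee discussed the bill", "The Knesset members voted on the budget",
# "This is a sentence unrelated to the topic".
corpus = [
    "הוועדה דנה בהצעת החוק",
    "חברי הכנסת הצביעו על התקציב",
    "זה משפט שאינו קשור לנושא",
]
# Query: "a discussion of a bill in committee"
query = "דיון על הצעת חוק בוועדה"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar corpus sentences for the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])
```
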
## Citing & Authors

<!--- Describe where people can find more information -->

TBD