Update README.md
README.md
@@ -1,124 +0,0 @@
---
pipeline_tag: sentence-similarity
language: fr
datasets:
- stsb_multi_mt
tags:
- Text
- Sentence Similarity
- Sentence-Embedding
- camembert-large
license: apache-2.0
model-index:
- name: sentence-camembert-large by Van Tuan DANG
  results:
  - task:
      name: Sentence-Embedding
      type: Text Similarity
    dataset:
      name: Text Similarity fr
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Test Pearson correlation coefficient
      type: Pearson_correlation_coefficient
      value: xx.xx
---
## Description:
[**Sentence-CamemBERT-Large**](https://huggingface.co/dangvantuan/sentence-camembert-large) is an embedding model for French developed by [La Javaness](https://www.lajavaness.com/). It represents the content and semantics of a French sentence as a mathematical vector, capturing the meaning of the text beyond individual words in queries and documents, which enables powerful semantic search.
## Pre-trained sentence embedding models are the state of the art for French sentence embeddings.
The model is fine-tuned from the pre-trained [facebook/camembert-large](https://huggingface.co/camembert/camembert-large) using
[Siamese BERT-Networks with 'sentence-transformers'](https://www.sbert.net/) on the [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) dataset.
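
This card does not say which training loss was used. As an illustration only, the sketch below shows the regression objective commonly paired with Siamese networks on STS data: both sentences go through the same encoder, and the cosine similarity of the two embeddings is compared to the gold score (normalized from 0..5 to 0..1) with a mean squared error. The function names and the toy vectors are hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_mse_loss(embedding_pairs, gold_scores):
    """Mean squared error between cosine similarity and normalized gold score.

    embedding_pairs: list of (u, v) tuples, one per sentence pair
    gold_scores: STS similarity scores in the range 0..5 (assumption)
    """
    total = 0.0
    for (u, v), gold in zip(embedding_pairs, gold_scores):
        target = gold / 5.0  # same normalization as in the evaluation script
        total += (cosine(u, v) - target) ** 2
    return total / len(embedding_pairs)

# Toy vectors: identical pair with gold 5.0, orthogonal pair with gold 0.0
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
loss = cosine_mse_loss(pairs, [5.0, 0.0])
```

A loss of zero here means the cosine similarities already match the normalized gold scores; during fine-tuning, the encoder weights would be updated to push the loss toward that value.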

## Usage
The model can be used directly (without a language model) as follows:

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

sentences = ["Un avion est en train de décoller.",
             "Un homme joue d'une grande flûte.",
             "Un homme étale du fromage râpé sur une pizza.",
             "Une personne jette un chat au plafond.",
             "Une personne est en train de plier un morceau de papier.",
             ]

embeddings = model.encode(sentences)
```
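
Once sentences are encoded, semantic search reduces to ranking document vectors by their similarity to a query vector. The following is a minimal, dependency-free sketch of that ranking step; `rank_by_similarity` is a hypothetical helper, and the three-dimensional toy vectors stand in for the much larger vectors returned by `model.encode`.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k documents closest to the query."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for sentence embeddings
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query = [1.0, 0.1, 0.0]
top = rank_by_similarity(query, docs, top_k=2)
```

With real embeddings, `docs` would be the rows of `model.encode(sentences)` and `query` the encoding of a single query sentence.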

## Evaluation
The model can be evaluated as follows on the French dev and test splits of stsb.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from datasets import load_dataset

# Load the model to evaluate
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

def convert_dataset(dataset):
    dataset_samples = []
    for df in dataset:
        score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
        inp_example = InputExample(texts=[df['sentence1'],
                                          df['sentence2']], label=score)
        dataset_samples.append(inp_example)
    return dataset_samples

# Loading the dataset for evaluation
df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
df_test = load_dataset("stsb_multi_mt", name="fr", split="test")

# Convert the dataset for evaluation

# For Dev set:
dev_samples = convert_dataset(df_dev)
val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
val_evaluator(model, output_path="./")

# For Test set:
test_samples = convert_dataset(df_test)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path="./")
```
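
The evaluation compares predicted similarities with gold scores using Pearson and Spearman correlation. As a rough, dependency-free sketch of what those two metrics compute (not the evaluator's actual implementation), Pearson measures linear agreement between the raw values and Spearman is Pearson applied to their ranks. The helper names and the sample numbers below are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up cosine similarities and normalized gold scores
predicted = [0.1, 0.4, 0.35, 0.8]
gold = [0.0, 0.5, 0.4, 1.0]
p = pearson(predicted, gold)
s = spearman(predicted, gold)
```

The tables in this card report these correlations multiplied by 100.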

**Test Result**:
The performance is measured using Pearson and Spearman correlation:
- On dev

| Model | Pearson correlation | Spearman correlation | #params |
| ------------- | ------------- | ------------- | ------------- |
| [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large) | 88.2 | 88.02 | 336M |
| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 86.73 | 86.54 | 110M |
| [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16 | 135M |
| [GPT-3 (text-davinci-003)](https://platform.openai.com/docs/models) | 85 | N/A | 175B |
| [GPT (text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.75 | 80.44 | N/A |

- On test

| Model | Pearson correlation | Spearman correlation | #params |
| ------------- | ------------- | ------------- | ------------- |
| [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large) | 85.9 | 85.8 | 336M |
| [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 81.64 | 110M |
| [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 78.62 | 77.48 | 135M |
| [GPT-3 (text-davinci-003)](https://platform.openai.com/docs/models) | 82 | N/A | 175B |
| [GPT (text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.05 | 77.56 | N/A |

## Citation

@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={https://arxiv.org/abs/1908.10084},
  year={2019}
}

@article{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}