Husain committed on
Commit 4309eec · 1 Parent(s): 9b8ff32

Update README.md

Files changed (1)
  1. README.md +0 -124
README.md CHANGED
@@ -1,124 +0,0 @@
1
- ---
2
- pipeline_tag: sentence-similarity
3
- language: fr
4
- datasets:
5
- - stsb_multi_mt
6
- tags:
7
- - Text
8
- - Sentence Similarity
9
- - Sentence-Embedding
10
- - camembert-large
11
- license: apache-2.0
12
- model-index:
13
- - name: sentence-camembert-large by Van Tuan DANG
14
-   results:
15
-   - task:
16
-       name: Sentence-Embedding
17
-       type: Text Similarity
18
-     dataset:
19
-       name: Text Similarity fr
20
-       type: stsb_multi_mt
21
-       args: fr
22
-     metrics:
23
-     - name: Test Pearson correlation coefficient
24
-       type: Pearson_correlation_coefficient
25
-       value: xx.xx
26
- ---
27
- ## Description:
28
- [**Sentence-CamemBERT-Large**](https://huggingface.co/dangvantuan/sentence-camembert-large) is a sentence-embedding model for French developed by [La Javaness](https://www.lajavaness.com/). It represents the content and semantics of a French sentence as a numerical vector, capturing the meaning of the text beyond the individual words in queries and documents, which makes it well suited to semantic search.
29
- ## Pre-trained sentence embedding models are the state of the art for French sentence embeddings.
30
- The model is fine-tuned from the pre-trained [facebook/camembert-large](https://huggingface.co/camembert/camembert-large) using
31
- [Siamese BERT-Networks with 'sentence-transformers'](https://www.sbert.net/) on the [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) dataset.
32
-
33
-
34
- ## Usage
35
- The model can be used directly (without a language model) as follows:
36
-
37
- ```python
38
- from sentence_transformers import SentenceTransformer
39
- model = SentenceTransformer("dangvantuan/sentence-camembert-large")
40
-
41
- sentences = ["Un avion est en train de décoller.",
42
-              "Un homme joue d'une grande flûte.",
43
-              "Un homme étale du fromage râpé sur une pizza.",
44
-              "Une personne jette un chat au plafond.",
45
-              "Une personne est en train de plier un morceau de papier.",
46
-              ]
47
-
48
- embeddings = model.encode(sentences)
49
- ```
50
-
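To illustrate how embeddings like those returned by `model.encode` are typically compared, here is a minimal sketch of cosine similarity and similarity-based ranking. It is an assumption-laden illustration, not part of the original README: the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the 3-dimensional toy vectors stand in for real embeddings, which are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return corpus indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    return sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy vectors standing in for real sentence embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_corpus = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(toy_query, toy_corpus)
print(ranking)
```

With real data, the same functions could be applied to rows of the `embeddings` array, for example `cosine_similarity(embeddings[0], embeddings[1])`.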
51
- ## Evaluation
52
- The model can be evaluated on the French dev and test splits of stsb as follows:
53
-
54
- ```python
55
- from sentence_transformers import SentenceTransformer, InputExample
56
- from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
57
- from datasets import load_dataset
58
- def convert_dataset(dataset):
59
-     dataset_samples = []
60
-     for df in dataset:
61
-         score = float(df['similarity_score']) / 5.0  # Normalize score to range 0 ... 1
62
-         inp_example = InputExample(texts=[df['sentence1'],
63
-                                           df['sentence2']], label=score)
64
-         dataset_samples.append(inp_example)
65
-     return dataset_samples
66
-
67
- # Loading the dataset for evaluation
68
- df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
69
- df_test = load_dataset("stsb_multi_mt", name="fr", split="test")
70
-
71
- # Load the model, then convert each split and run the evaluator
72
- model = SentenceTransformer("dangvantuan/sentence-camembert-large")
73
- # For Dev set:
74
- dev_samples = convert_dataset(df_dev)
75
- val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
76
- val_evaluator(model, output_path="./")
77
-
78
- # For Test set:
79
- test_samples = convert_dataset(df_test)
80
- test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
81
- test_evaluator(model, output_path="./")
82
- ```
83
-
84
- **Test Result**:
85
- The performance is measured using Pearson and Spearman correlation:
86
- - On dev
87
-
88
-
89
- | Model | Pearson correlation | Spearman correlation | #params |
90
- | ------------- | ------------- | ------------- |------------- |
91
- | [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large)| 88.2 |88.02 | 336M|
92
- | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 86.73|86.54 | 110M |
93
- | [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16|135M |
94
- | [GPT-3 (text-davinci-003)](https://platform.openai.com/docs/models) | 85 | NaN|175B |
95
- | [GPT-(text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.75 | 80.44|NaN |
96
- - On test
97
-
98
-
99
- | Model | Pearson correlation | Spearman correlation | #params |
100
- | ------------- | ------------- | ------------- | ------------- |
101
- | [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large) | 85.9 | 85.8 | 336M |
102
- | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 81.64 | 110M |
103
- | [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 78.62 | 77.48 | 135M |
104
- | [GPT-3 (text-davinci-003)](https://platform.openai.com/docs/models) | 82 | NaN | 175B |
105
- | [GPT-(text-embedding-ada-002)](https://platform.openai.com/docs/models) | 79.05 | 77.56 | NaN |
106
-
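As a rough illustration of the two metrics reported above, the following sketch computes Pearson and Spearman correlation in plain Python. It is not the evaluator's actual implementation: the helper names and the toy `gold` and `predicted` values are hypothetical. The tables above appear to report the coefficients multiplied by 100, which is also an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical gold scores on the 0-5 scale, normalized to 0-1 as in convert_dataset.
gold = [s / 5.0 for s in [5.0, 3.0, 1.0, 0.0]]
# Hypothetical cosine similarities predicted by a model for the same pairs.
predicted = [0.9, 0.8, 0.1, 0.2]

print(f"Pearson:  {100 * pearson(gold, predicted):.2f}")
print(f"Spearman: {100 * spearman(gold, predicted):.2f}")
```

Pearson measures linear agreement between the scores themselves, while Spearman only compares their rank order, so the two can diverge when predictions are monotonic but not linear in the gold scores.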
107
-
108
- ## Citation
109
-
110
-
111
- @article{reimers2019sentence,
112
- title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
113
- author={Reimers, Nils and Gurevych, Iryna},
114
- journal={https://arxiv.org/abs/1908.10084},
115
- year={2019}
116
- }
117
-
118
-
119
- @article{martin2020camembert,
120
- title={CamemBERT: a Tasty French Language Model},
121
- author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
122
- journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
123
- year={2020}
124
- }