---
language: fr
tags:
pipeline_tag: token-classification
---

This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:

- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avions, voitures, calèches, vélos, ...
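
The entities are *nested*: a mention may contain other mentions. A minimal illustration (the span format here is hypothetical, not the model's actual output API):

```python
# Hypothetical span format, for illustration only.
# In "la reine de France", the GPE mention "France" is nested
# inside the longer PER mention "la reine de France".
text = "la reine de France"
entities = [
    {"label": "PER", "start": 0, "end": 18},   # la reine de France
    {"label": "GPE", "start": 12, "end": 18},  # France (nested)
]
```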

## MODEL PERFORMANCES (LOOCV):

Scores were obtained by leave-one-out cross-validation (LOOCV):

| NER_tag   | precision | recall | f1_score | support |
|-----------|-----------|--------|----------|---------|
| PER       | 90.58%    | 93.52% | 92.03%   | 31,570  |
| FAC       | 70.49%    | 71.75% | 71.12%   | 2,294   |
| TIME      | 58.40%    | 58.68% | 58.54%   | 1,670   |
| GPE       | 76.69%    | 74.05% | 75.35%   | 871     |
| LOC       | 60.92%    | 44.37% | 51.35%   | 773     |
| VEH       | 66.18%    | 49.25% | 56.47%   | 465     |
| micro_avg | 86.70%    | 88.64% | 87.61%   | 37,643  |
| macro_avg | 70.55%    | 65.27% | 67.48%   | 37,643  |
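
As a sanity check on how the columns relate, f1_score is the harmonic mean of precision and recall; the snippet below (illustrative only, not project code) reproduces the PER row:

```python
# f1 is the harmonic mean of precision and recall.
p, r = 0.9058, 0.9352  # PER precision and recall from the table
f1 = 2 * p * r / (p + r)
print(f"{f1:.2%}")  # 92.03%, matching the PER f1_score
```

The gap between the micro and macro averages reflects class imbalance: micro_avg aggregates over all 37,643 mentions and is therefore dominated by PER (31,570 of them), while macro_avg weights the six classes equally.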

## TRAINING PARAMETERS:
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
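
These six entity types, combined with the BIOES tagging scheme, yield the 25 output labels used in the architecture below (assuming the conventional B-/I-/E-/S- prefix notation, which may differ from the project's internal naming):

```python
# 6 entity types x 4 BIOES positions + the O label = 25 labels.
types = ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
labels = ['O'] + [f"{pos}-{t}" for t in types for pos in "BIES"]
print(len(labels))  # 25
```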

## MODEL ARCHITECTURE:
Model Input: Maximum context camembert-large embeddings (1024 dimensions)
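
"Maximum context" is read here as embedding each sentence together with as much surrounding text as fits in the encoder window, rather than in isolation (an assumption; the plain per-sentence sketch below omits that windowing). Obtaining the 1024-dimensional input vectors with the `transformers` library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# camembert-large produces one 1024-dim vector per subtoken.
tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-large")
encoder = AutoModel.from_pretrained("almanach/camembert-large")

sentence = "Indiana Delmare quitta le château ce matin."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # shape: (1, n_subtokens, 1024)
print(embeddings.shape)
```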

- Locked Dropout: 0.5

- Projection layer:
  - layer type: highway layer
  - input: 1024 dimensions
  - output: 2048 dimensions

- BiLSTM layer:
  - input: 2048 dimensions
  - output: 256 dimensions (hidden state)

- Linear layer:
  - input: 256 dimensions
  - output: 25 dimensions (predicted labels, BIOES tagging scheme: 6 entity types × B/I/E/S + O)

- CRF layer

Model Output: BIOES labels sequence
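
Putting the pieces together, here is a minimal PyTorch sketch of the stack described above. It is an illustration, not the project's actual implementation: the `pytorch-crf` package stands in for the CRF layer, the locked-dropout and highway modules are simplified re-implementations, and the BiLSTM's "256 dimensions" is read as 128 per direction:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (an assumption, not necessarily what BookNLP-fr uses)


class LockedDropout(nn.Module):
    """Variational dropout: one mask per sequence, shared across all timesteps."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, seq_len, dim)
        if not self.training or self.p == 0.0:
            return x
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p) / (1 - self.p)
        return x * mask


class Highway(nn.Module):
    """Highway layer; a carry projection is added because input and output sizes differ."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.transform = nn.Linear(dim_in, dim_out)
        self.gate = nn.Linear(dim_in, dim_out)
        self.carry = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * self.carry(x)


class NestedNerTagger(nn.Module):
    def __init__(self, emb_dim=1024, proj_dim=2048, lstm_hidden=128, num_labels=25):
        super().__init__()
        self.dropout = LockedDropout(0.5)
        self.projection = Highway(emb_dim, proj_dim)          # 1024 -> 2048
        self.bilstm = nn.LSTM(proj_dim, lstm_hidden,
                              batch_first=True, bidirectional=True)  # -> 2 * 128 = 256
        self.linear = nn.Linear(2 * lstm_hidden, num_labels)  # 256 -> 25 BIOES scores
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, embeddings, tags=None, mask=None):
        # embeddings: (batch, seq_len, 1024) camembert-large vectors
        x = self.projection(self.dropout(embeddings))
        x, _ = self.bilstm(x)
        emissions = self.linear(x)
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)  # training loss (NLL)
        return self.crf.decode(emissions, mask=mask)      # inference: best BIOES paths
```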

TOTAL: 275,360 tokens

## PREDICTIONS CONFUSION MATRIX:

Rows are gold labels, columns are predicted labels; the O column counts missed entity mentions and the O row counts spurious predictions.

| Gold Labels | PER   | FAC  | TIME | GPE | LOC | VEH | O    |
|-------------|-------|------|------|-----|-----|-----|------|
| PER         | 29525 | 27   | 13   | 6   | 7   | 26  | 1966 |
| FAC         | 43    | 1646 | 0    | 17  | 12  | 2   | 574  |
| TIME        | 5     | 1    | 980  | 1   | 1   | 0   | 682  |
| GPE         | 18    | 28   | 1    | 645 | 27  | 0   | 152  |
| LOC         | 5     | 63   | 0    | 54  | 343 | 0   | 308  |
| VEH         | 58    | 8    | 1    | 0   | 0   | 229 | 169  |
| O           | 2902  | 532  | 682  | 110 | 167 | 89  | 0    |
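
Since each entity row sums to the corresponding class support, per-class recall can be read directly off the matrix; this small check (illustrative only, not project code) reproduces the recall column of the performance table above:

```python
# Per-class recall = diagonal cell / gold row total.
labels = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH"]
matrix = [
    [29525, 27, 13, 6, 7, 26, 1966],
    [43, 1646, 0, 17, 12, 2, 574],
    [5, 1, 980, 1, 1, 0, 682],
    [18, 28, 1, 645, 27, 0, 152],
    [5, 63, 0, 54, 343, 0, 308],
    [58, 8, 1, 0, 0, 229, 169],
]
for i, label in enumerate(labels):
    recall = matrix[i][i] / sum(matrix[i])
    print(f"{label}: recall = {recall:.2%}")  # PER: 93.52%, FAC: 71.75%, ...
```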

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com