AntoineBourgois committed 2eee424 (verified; parent 418bd9c): Upload 2 files

Files changed: README.md (+24, -21)
README.md:

---
language: fr
tags:
pipeline_tag: token-classification
---

This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avions, voitures, calèches, vélos, ...
 
## MODEL PERFORMANCES (LOOCV):
| NER_tag | precision | recall | f1_score | support |
|-----------|-------------|----------|------------|-----------|
| PER | 90.58% | 93.52% | 92.03% | 31,570 |
| FAC | 70.49% | 71.75% | 71.12% | 2,294 |
| TIME | 58.40% | 58.68% | 58.54% | 1,670 |
| GPE | 76.69% | 74.05% | 75.35% | 871 |
| LOC | 60.92% | 44.37% | 51.35% | 773 |
| VEH | 66.18% | 49.25% | 56.47% | 465 |
| micro_avg | 86.70% | 88.64% | 87.61% | 37,643 |
| macro_avg | 70.55% | 65.27% | 67.48% | 37,643 |
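
The macro averages in this table are the unweighted means of the six per-class scores, whereas the micro averages pool all mentions regardless of class (which is why PER, with by far the largest support, dominates them). A quick pure-Python sanity check of the macro row, using the per-class values copied from the table (a sketch; agreement is to within rounding of the underlying unrounded scores):

```python
# Per-class (precision, recall, f1) percentages copied from the table above.
per_class = {
    "PER":  (90.58, 93.52, 92.03),
    "FAC":  (70.49, 71.75, 71.12),
    "TIME": (58.40, 58.68, 58.54),
    "GPE":  (76.69, 74.05, 75.35),
    "LOC":  (60.92, 44.37, 51.35),
    "VEH":  (66.18, 49.25, 56.47),
}

def macro_average(scores):
    """Unweighted mean over classes, one value per metric column."""
    columns = list(zip(*scores.values()))
    return [sum(col) / len(col) for col in columns]

macro_p, macro_r, macro_f1 = macro_average(per_class)
# Each value agrees with the reported macro_avg row
# (70.55 / 65.27 / 67.48) to within 0.01.
```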
 
## TRAINING PARAMETERS:
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
## MODEL ARCHITECTURE:
Model Input: maximum-context camembert-large embeddings (1024 dimensions)

- Locked Dropout: 0.5
- Projection layer:
  - layer type: highway layer
  - input: 1024 dimensions
  - output: 2048 dimensions
- BiLSTM layer:
  - input: 2048 dimensions
  - output: 256 dimensions (hidden state)
- Linear layer:
  - input: 256 dimensions
  - output: 25 dimensions (predicted labels with BIOES tagging scheme)
- CRF layer

Model Output: BIOES label sequence
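
The 25 output dimensions follow directly from the BIOES scheme: four positional tags (B, I, E, S) for each of the six entity types, plus a single O tag. The sketch below builds that label set and shows how a flat BIOES sequence decodes back into spans; `decode_bioes` is an illustrative assumption, not the project's own decoder, and a flat pass like this handles one nesting level at a time (nested entities need one tag sequence per level):

```python
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]

# 6 types x {B, I, E, S} + "O" = 25 labels, matching the linear layer's output size.
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in "BIES"]
assert len(LABELS) == 25

def decode_bioes(tags):
    """Turn a flat BIOES tag sequence into (start, end, type) spans, end exclusive.

    Illustrative decoder: it trusts B...E pairing and does not police
    type mismatches inside a span.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, etype = tag.split("-")
        if prefix == "S":                          # single-token entity
            spans.append((i, i + 1, etype))
            start = None
        elif prefix == "B":                        # entity opens
            start = i
        elif prefix == "E" and start is not None:  # entity closes
            spans.append((start, i + 1, etype))
            start = None
    return spans

# "Indiana Delmare regardait le petit hameau"
tags = ["B-PER", "E-PER", "O", "B-GPE", "I-GPE", "E-GPE"]
print(decode_bioes(tags))  # [(0, 2, 'PER'), (3, 6, 'GPE')]
```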
 
| 28 | TOTAL | 275,360 tokens |

## PREDICTIONS CONFUSION MATRIX:
| Gold Labels | PER | FAC | TIME | GPE | LOC | VEH | O |
|---------------|-------|-------|--------|-------|-------|-------|------|
| PER | 29525 | 27 | 13 | 6 | 7 | 26 | 1966 |
| FAC | 43 | 1646 | 0 | 17 | 12 | 2 | 574 |
| TIME | 5 | 1 | 980 | 1 | 1 | 0 | 682 |
| GPE | 18 | 28 | 1 | 645 | 27 | 0 | 152 |
| LOC | 5 | 63 | 0 | 54 | 343 | 0 | 308 |
| VEH | 58 | 8 | 1 | 0 | 0 | 229 | 169 |
| O | 2902 | 532 | 682 | 110 | 167 | 89 | 0 |
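
Because the rows of this matrix are gold labels, each class's support and recall can be read straight off a row: support is the row sum, and recall is the diagonal cell divided by it. A pure-Python check using the matrix values above (a sketch; precision would instead come from column sums):

```python
# Rows = gold label, columns = predicted label; values copied from the matrix above.
CLASSES = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH", "O"]
MATRIX = {
    "PER":  [29525, 27, 13, 6, 7, 26, 1966],
    "FAC":  [43, 1646, 0, 17, 12, 2, 574],
    "TIME": [5, 1, 980, 1, 1, 0, 682],
    "GPE":  [18, 28, 1, 645, 27, 0, 152],
    "LOC":  [5, 63, 0, 54, 343, 0, 308],
    "VEH":  [58, 8, 1, 0, 0, 229, 169],
}

def recall_and_support(label):
    """Recall (%) and support for one gold class, read off its matrix row."""
    row = MATRIX[label]
    support = sum(row)
    diagonal = row[CLASSES.index(label)]
    return round(100 * diagonal / support, 2), support

print(recall_and_support("PER"))  # (93.52, 31570) -- matches the LOOCV table
print(recall_and_support("LOC"))  # (44.37, 773)
```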
 
## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com
 