---
language: fr
tags:
pipeline_tag: token-classification
---

This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings, trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:

- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avions, voitures, calèches, vélos, ...
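
The entities are *nested*: a mention may contain other mentions. A minimal illustration (the span format here is hypothetical, not the model's actual output API):

```python
# Hypothetical span format, for illustration only.
# In "la reine de France", the GPE mention "France" is nested
# inside the longer PER mention "la reine de France".
text = "la reine de France"
entities = [
    {"label": "PER", "start": 0, "end": 18},   # la reine de France
    {"label": "GPE", "start": 12, "end": 18},  # France (nested)
]
```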

## MODEL PERFORMANCES (LOOCV):

Scores were obtained by leave-one-out cross-validation (LOOCV):

| NER_tag   | precision | recall | f1_score | support |
|-----------|-----------|--------|----------|---------|
| PER       | 90.58%    | 93.52% | 92.03%   | 31,570  |
| FAC       | 70.49%    | 71.75% | 71.12%   | 2,294   |
| TIME      | 58.40%    | 58.68% | 58.54%   | 1,670   |
| GPE       | 76.69%    | 74.05% | 75.35%   | 871     |
| LOC       | 60.92%    | 44.37% | 51.35%   | 773     |
| VEH       | 66.18%    | 49.25% | 56.47%   | 465     |
| micro_avg | 86.70%    | 88.64% | 87.61%   | 37,643  |
| macro_avg | 70.55%    | 65.27% | 67.48%   | 37,643  |
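
As a sanity check on how the columns relate, f1_score is the harmonic mean of precision and recall; the snippet below (illustrative only, not project code) reproduces the PER row:

```python
# f1 is the harmonic mean of precision and recall.
p, r = 0.9058, 0.9352  # PER precision and recall from the table
f1 = 2 * p * r / (p + r)
print(f"{f1:.2%}")  # 92.03%, matching the PER f1_score
```

The gap between the micro and macro averages reflects class imbalance: micro_avg aggregates over all 37,643 mentions and is therefore dominated by PER (31,570 of them), while macro_avg weights the six classes equally.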

## TRAINING PARAMETERS:
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
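
These six entity types, combined with the BIOES tagging scheme, yield the 25 output labels used in the architecture below (assuming the conventional B-/I-/E-/S- prefix notation, which may differ from the project's internal naming):

```python
# 6 entity types x 4 BIOES positions + the O label = 25 labels.
types = ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
labels = ['O'] + [f"{pos}-{t}" for t in types for pos in "BIES"]
print(len(labels))  # 25
```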

## MODEL ARCHITECTURE:
Model Input: Maximum context camembert-large embeddings (1024 dimensions)
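
"Maximum context" is read here as embedding each sentence together with as much surrounding text as fits in the encoder window, rather than in isolation (an assumption; the plain per-sentence sketch below omits that windowing). Obtaining the 1024-dimensional input vectors with the `transformers` library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# camembert-large produces one 1024-dim vector per subtoken.
tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-large")
encoder = AutoModel.from_pretrained("almanach/camembert-large")

sentence = "Indiana Delmare quitta le château ce matin."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # shape: (1, n_subtokens, 1024)
print(embeddings.shape)
```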

- Locked Dropout: 0.5

- Projection layer:
  - layer type: highway layer
  - input: 1024 dimensions
  - output: 2048 dimensions

- BiLSTM layer:
  - input: 2048 dimensions
  - output: 256 dimensions (hidden state)

- Linear layer:
  - input: 256 dimensions
  - output: 25 dimensions (predicted labels, BIOES tagging scheme: 6 entity types × B/I/E/S + O)

- CRF layer

Model Output: BIOES labels sequence
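
Putting the pieces together, here is a minimal PyTorch sketch of the stack described above. It is an illustration, not the project's actual implementation: the `pytorch-crf` package stands in for the CRF layer, the locked-dropout and highway modules are simplified re-implementations, and the BiLSTM's "256 dimensions" is read as 128 per direction:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (an assumption, not necessarily what BookNLP-fr uses)


class LockedDropout(nn.Module):
    """Variational dropout: one mask per sequence, shared across all timesteps."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, seq_len, dim)
        if not self.training or self.p == 0.0:
            return x
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p) / (1 - self.p)
        return x * mask


class Highway(nn.Module):
    """Highway layer; a carry projection is added because input and output sizes differ."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.transform = nn.Linear(dim_in, dim_out)
        self.gate = nn.Linear(dim_in, dim_out)
        self.carry = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * self.carry(x)


class NestedNerTagger(nn.Module):
    def __init__(self, emb_dim=1024, proj_dim=2048, lstm_hidden=128, num_labels=25):
        super().__init__()
        self.dropout = LockedDropout(0.5)
        self.projection = Highway(emb_dim, proj_dim)          # 1024 -> 2048
        self.bilstm = nn.LSTM(proj_dim, lstm_hidden,
                              batch_first=True, bidirectional=True)  # -> 2 * 128 = 256
        self.linear = nn.Linear(2 * lstm_hidden, num_labels)  # 256 -> 25 BIOES scores
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, embeddings, tags=None, mask=None):
        # embeddings: (batch, seq_len, 1024) camembert-large vectors
        x = self.projection(self.dropout(embeddings))
        x, _ = self.bilstm(x)
        emissions = self.linear(x)
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)  # training loss (NLL)
        return self.crf.decode(emissions, mask=mask)      # inference: best BIOES paths
```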

TOTAL: 275,360 tokens

## PREDICTIONS CONFUSION MATRIX:

Rows are gold labels, columns are predicted labels; the O column counts missed entity mentions and the O row counts spurious predictions.

| Gold Labels | PER   | FAC  | TIME | GPE | LOC | VEH | O    |
|-------------|-------|------|------|-----|-----|-----|------|
| PER         | 29525 | 27   | 13   | 6   | 7   | 26  | 1966 |
| FAC         | 43    | 1646 | 0    | 17  | 12  | 2   | 574  |
| TIME        | 5     | 1    | 980  | 1   | 1   | 0   | 682  |
| GPE         | 18    | 28   | 1    | 645 | 27  | 0   | 152  |
| LOC         | 5     | 63   | 0    | 54  | 343 | 0   | 308  |
| VEH         | 58    | 8    | 1    | 0   | 0   | 229 | 169  |
| O           | 2902  | 532  | 682  | 110 | 167 | 89  | 0    |
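
Since each entity row sums to the corresponding class support, per-class recall can be read directly off the matrix; this small check (illustrative only, not project code) reproduces the recall column of the performance table above:

```python
# Per-class recall = diagonal cell / gold row total.
labels = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH"]
matrix = [
    [29525, 27, 13, 6, 7, 26, 1966],
    [43, 1646, 0, 17, 12, 2, 574],
    [5, 1, 980, 1, 1, 0, 682],
    [18, 28, 1, 645, 27, 0, 152],
    [5, 63, 0, 54, 343, 0, 308],
    [58, 8, 1, 0, 0, 229, 169],
]
for i, label in enumerate(labels):
    recall = matrix[i][i] / sum(matrix[i])
    print(f"{label}: recall = {recall:.2%}")  # PER: 93.52%, FAC: 71.75%, ...
```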

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com