---
language: fr
tags:
- NER
- camembert
- literary-texts
- nested-entities
- BookNLP-fr
license: apache-2.0
metrics:
- f1
- precision
- recall
base_model:
- almanach/camembert-large
pipeline_tag: token-classification
---

## INTRODUCTION:
This model, developed as part of the [BookNLP-fr project](https://github.com/lattice-8094/fr-litbank), is a NER model built on top of [camembert-large](https://huggingface.co/almanach/camembert-large) embeddings and trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:
- mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- facilities (FAC): château, sentier, chambre, couloir, ...
- time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- geo-political entities (GPE): Montrouge, France, le petit hameau, ...
- locations (LOC): le sud, Mars, l'océan, le bois, ...
- vehicles (VEH): avion, voitures, calèche, vélos, ...

## MODEL PERFORMANCE (LOOCV):
| NER_tag | precision | recall | f1_score | support | support % |
|-----------|-------------|----------|------------|-----------|-------------|
| PER | 90.58% | 93.52% | 92.03% | 31,570 | 83.87% |
| FAC | 70.49% | 71.75% | 71.12% | 2,294 | 6.09% |
| TIME | 58.40% | 58.68% | 58.54% | 1,670 | 4.44% |
| GPE | 76.69% | 74.05% | 75.35% | 871 | 2.31% |
| LOC | 60.92% | 44.37% | 51.35% | 773 | 2.05% |
| VEH | 66.18% | 49.25% | 56.47% | 465 | 1.24% |
| micro_avg | 86.70% | 88.64% | 87.61% | 37,643 | 100.00% |
| macro_avg | 70.55% | 65.27% | 67.48% | 37,643 | 100.00% |
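
As a reading aid for the table above, the following minimal sketch shows how the `f1_score` column relates to the precision and recall columns (harmonic mean), and how a macro average is an unweighted mean over the six entity types. The variable and function names are illustrative and not part of the project code. Because the inputs are the rounded percentages from the table, the recomputed values can differ from the reported ones in the last decimal.

```python
# Per-class (precision, recall) in percent, copied from the performance table.
scores = {
    "PER": (90.58, 93.52),
    "FAC": (70.49, 71.75),
    "TIME": (58.40, 58.68),
    "GPE": (76.69, 74.05),
    "LOC": (60.92, 44.37),
    "VEH": (66.18, 49.25),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 recomputed for each entity type.
per_class_f1 = {tag: f1(p, r) for tag, (p, r) in scores.items()}

# Macro average: every entity type counts equally, whatever its support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for tag, value in per_class_f1.items():
    print(f"{tag:5s} F1 = {value:.2f}%")
print(f"macro F1 = {macro_f1:.2f}%")
```

The micro average, by contrast, is computed over all mentions pooled together, so it is dominated by PER, which accounts for 83.87% of the support.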

## TRAINING PARAMETERS:
- Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
- Tagging scheme: BIOES
- Nested entity levels: [0, 1]
- Split strategy: Leave-one-out cross-validation (28 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16
- Initial learning rate: 0.00014
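
To illustrate the tagging scheme listed above, here is a minimal sketch that enumerates a BIOES label set for the six entity types. The `B-PER`-style label strings, the label order, and the `label2id` mapping are assumptions made for illustration; the actual label vocabulary used in training may be named or ordered differently. The count, however, follows directly from the parameters: 6 entity types × 4 prefixes (B, I, E, S) plus the `O` label gives 25, which matches the 25-dimensional output of the linear layer.

```python
# Entity types as listed in the training parameters.
entity_types = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]

# BIOES prefixes: Begin, Inside, End, Single-token. "O" marks tokens outside any entity.
prefixes = ["B", "I", "E", "S"]

# Hypothetical label naming and ordering, for illustration only.
labels = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in prefixes]
label2id = {label: index for index, label in enumerate(labels)}

print(len(labels))   # size of the label set
print(labels[:5])    # first few labels
```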

## MODEL ARCHITECTURE:
Model Input: Maximum context camembert-large embeddings (1024 dimensions)

- Locked Dropout: 0.5

- Projection layer:
  - layer type: highway layer
  - input: 1024 dimensions
  - output: 2048 dimensions

- BiLSTM layer:
  - input: 2048 dimensions
  - output: 256 dimensions (hidden state)

- Linear layer:
  - input: 256 dimensions
  - output: 25 dimensions (predicted labels with BIOES tagging scheme)

- CRF layer

Model Output: BIOES label sequence
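
Since the model outputs a BIOES label sequence, a downstream step has to turn that sequence back into entity spans. The following is a minimal sketch of such a decoder, not the project's own post-processing: the function name `bioes_to_spans`, the `B-PER`-style label strings, and the choice to silently drop malformed sequences are all assumptions. With nested entities, one plausible setup (also an assumption) is to decode each nesting level's label sequence separately with the same function.

```python
def bioes_to_spans(labels):
    """Convert a BIOES label sequence into (start, end, type) spans.

    Indices are token positions and `end` is inclusive.
    Malformed sequences (e.g. an E- tag without a matching B-) are dropped.
    """
    spans = []
    start, current = None, None  # start index and type of the open entity
    for i, label in enumerate(labels):
        if label == "O":
            start, current = None, None
            continue
        prefix, _, etype = label.partition("-")
        if prefix == "S":                      # single-token entity
            spans.append((i, i, etype))
            start, current = None, None
        elif prefix == "B":                    # open a new multi-token entity
            start, current = i, etype
        elif prefix == "I":                    # continue only if types match
            if current != etype:
                start, current = None, None
        elif prefix == "E":                    # close the entity if one is open
            if start is not None and current == etype:
                spans.append((start, i, etype))
            start, current = None, None
    return spans


# Hypothetical example sentence and labels, for illustration only.
tokens = ["Le", "capitaine", "entra", "dans", "la", "chambre", "."]
labels = ["B-PER", "E-PER", "O", "O", "B-FAC", "E-FAC", "O"]

for start, end, etype in bioes_to_spans(labels):
    print(etype, " ".join(tokens[start:end + 1]))
```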

## HOW TO USE:
*** UNDER CONSTRUCTION ***

## TRAINING CORPUS:
| | Document | Token count | Included in model evaluation |
|----|----------------------------------------------------------------|----------------|------------------------------------|
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | True |
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | True |
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | True |
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | True |
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | True |
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | True |
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | True |
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | True |
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | True |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | True |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | True |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | True |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | True |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | True |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | True |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | True |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | True |
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | True |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | True |
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | True |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | True |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | True |
| 28 | TOTAL | 275,360 tokens | 28 files used for cross-validation |

## PREDICTIONS CONFUSION MATRIX:
Rows are gold labels and columns are predicted labels; each row sums to its support.

| Gold Labels | PER | FAC | TIME | GPE | LOC | VEH | O | support |
|---------------|--------|-------|--------|-------|-------|-------|-------|-----------|
| PER | 29,525 | 27 | 13 | 6 | 7 | 26 | 1,966 | 31,570 |
| FAC | 43 | 1,646 | 0 | 17 | 12 | 2 | 574 | 2,294 |
| TIME | 5 | 1 | 980 | 1 | 1 | 0 | 682 | 1,670 |
| GPE | 18 | 28 | 1 | 645 | 27 | 0 | 152 | 871 |
| LOC | 5 | 63 | 0 | 54 | 343 | 0 | 308 | 773 |
| VEH | 58 | 8 | 1 | 0 | 0 | 229 | 169 | 465 |
| O | 2,902 | 532 | 682 | 110 | 167 | 89 | 0 | 4,482 |
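
The recall column of the performance table can be recovered from this matrix: for each gold label, recall is the diagonal cell divided by the row total (the support). The sketch below reproduces that arithmetic with the counts copied from the matrix; the dictionary layout and variable names are illustrative only.

```python
# Column order of the confusion matrix (predicted labels).
columns = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH", "O"]

# Rows are gold labels; values are counts per predicted label, copied from the matrix.
confusion = {
    "PER":  [29525, 27, 13, 6, 7, 26, 1966],
    "FAC":  [43, 1646, 0, 17, 12, 2, 574],
    "TIME": [5, 1, 980, 1, 1, 0, 682],
    "GPE":  [18, 28, 1, 645, 27, 0, 152],
    "LOC":  [5, 63, 0, 54, 343, 0, 308],
    "VEH":  [58, 8, 1, 0, 0, 229, 169],
}

# Support of a gold label is the sum of its row.
support = {gold: sum(row) for gold, row in confusion.items()}

# Recall (in %) = correctly predicted mentions / all gold mentions of that type.
recall = {
    gold: 100 * row[columns.index(gold)] / support[gold]
    for gold, row in confusion.items()
}

# Share of gold mentions that were missed entirely (predicted as "O").
missed = {
    gold: 100 * row[columns.index("O")] / support[gold]
    for gold, row in confusion.items()
}

for gold in confusion:
    print(f"{gold:5s} recall = {recall[gold]:.2f}%  missed = {missed[gold]:.2f}%")
```

Read this way, the matrix shows that most recall errors are mentions predicted as `O` rather than confusions between entity types; among type confusions, the largest are gold LOC mentions predicted as FAC (63) or GPE (54).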

## CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com