---
language:
- en
- multilingual
license: cc-by-sa-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
datasets:
- DFKI-SLT/few-nerd
metrics:
- precision
- recall
- f1
widget:
- text: "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
  example_title: "English 1"
- text: The WPC led the international peace movement in the decade after the Second
    World War, but its failure to speak out against the Soviet suppression of the
    1956 Hungarian uprising and the resumption of Soviet nuclear tests in 1961 marginalised
    it, and in the 1960s it was eclipsed by the newer, non-aligned peace organizations
    like the Campaign for Nuclear Disarmament.
  example_title: "English 2"
- text: Most of the Steven Seagal movie "Under Siege" (co-starring Tommy Lee Jones)
    was filmed on the Battleship USS Alabama, which is docked on Mobile Bay at Battleship
    Memorial Park and open to the public.
  example_title: "English 3"
- text: 'The Central African CFA franc (French: "franc CFA" or simply "franc", ISO
    4217 code: XAF) is the currency of six independent states in Central Africa: Cameroon,
    Central African Republic, Chad, Republic of the Congo, Equatorial Guinea and Gabon.'
  example_title: "English 4"
- text: Brenner conducted post-doctoral research at Brandeis University with Gregory
    Petsko and then took his first academic position at Thomas Jefferson University
    in 1996, moving to Dartmouth Medical School in 2003, where he served as Associate
    Director for Basic Sciences at Norris Cotton Cancer Center.
  example_title: "English 5"
- text: On Friday, October 27, 2017, the Senate of Spain (Senado) voted 214 to 47
    to invoke Article 155 of the Spanish Constitution over Catalonia after the Catalan
    Parliament declared the independence.
  example_title: "English 6"
- text: "Amelia Earhart voló su Lockheed Vega 5B monomotor a través del Océano Atlántico hasta París."
  example_title: "Spanish"
- text: "Amelia Earhart a fait voler son monomoteur Lockheed Vega 5B à travers l'océan Atlantique jusqu'à Paris."
  example_title: "French"
- text: "Amelia Earhart flog mit ihrer einmotorigen Lockheed Vega 5B über den Atlantik nach Paris."
  example_title: "German"
- text: "Амелия Эрхарт перелетела на своем одномоторном самолете Lockheed Vega 5B через Атлантический океан в Париж."
  example_title: "Russian"
- text: "Amelia Earhart vloog met haar één-motorige Lockheed Vega 5B over de Atlantische Oceaan naar Parijs."
  example_title: "Dutch"
- text: "Amelia Earhart przeleciała swoim jednosilnikowym samolotem Lockheed Vega 5B przez Ocean Atlantycki do Paryża."
  example_title: "Polish"
- text: "Amelia Earhart flaug eins hreyfils Lockheed Vega 5B yfir Atlantshafið til Parísar."
  example_title: "Icelandic"
- text: "Η Amelia Earhart πέταξε το μονοκινητήριο Lockheed Vega 5B της πέρα από τον Ατλαντικό Ωκεανό στο Παρίσι."
  example_title: "Greek"
pipeline_tag: token-classification
co2_eq_emissions:
  emissions: 572.6675932546113
  source: codecarbon
  training_type: fine-tuning
  on_cloud: false
  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
  ram_total_size: 31.777088165283203
  hours_used: 3.867
  hardware_used: 1 x NVIDIA GeForce RTX 3090
base_model: bert-base-multilingual-cased
model-index:
- name: SpanMarker with bert-base-multilingual-cased on FewNERD
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: FewNERD
      type: DFKI-SLT/few-nerd
      split: test
    metrics:
    - type: f1
      value: 0.7006507253689264
      name: F1
    - type: precision
      value: 0.7040676584045078
      name: Precision
    - type: recall
      value: 0.6972667978051558
      name: Recall
---

# SpanMarker with bert-base-multilingual-cased on FewNERD

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) as the underlying encoder.

## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
- **Languages:** en, multilingual
- **License:** cc-by-sa-4.0
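
To make the **Maximum Entity Length** limit concrete, the sketch below enumerates every candidate word span of up to 8 words in a sentence. It only illustrates what such a limit implies (entities longer than 8 words are never considered). The function name `candidate_spans` is hypothetical, and this is not claimed to be SpanMarker's internal implementation.

```python
from typing import List, Tuple


def candidate_spans(words: List[str], max_entity_length: int = 8) -> List[Tuple[int, int]]:
    """Return (start, end) word indices, end exclusive, for all spans of at most max_entity_length words."""
    spans = []
    for start in range(len(words)):
        # A span may not run past the end of the sentence or exceed the length limit.
        last_end = min(len(words), start + max_entity_length)
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


words = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris .".split()
spans = candidate_spans(words)
print(len(words), "words ->", len(spans), "candidate spans")
```

The number of candidates grows roughly linearly with sentence length once the sentence is longer than the limit, which is one reason a cap on entity length keeps span-based approaches tractable.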

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels

| Label | Examples |
|:------|:---------|
| art-broadcastprogram | "Corazones", "Street Cents", "The Gale Storm Show : Oh , Susanna" |
| art-film | "L'Atlantide", "Bosch", "Shawshank Redemption" |
| art-music | "Atkinson , Danko and Ford ( with Brockie and Hilton )", "Hollywood Studio Symphony", "Champion Lover" |
| art-other | "Aphrodite of Milos", "The Today Show", "Venus de Milo" |
| art-painting | "Production/Reproduction", "Touit", "Cofiwch Dryweryn" |
| art-writtenart | "The Seven Year Itch", "Time", "Imelda de ' Lambertazzi" |
| building-airport | "Luton Airport", "Newark Liberty International Airport", "Sheremetyevo International Airport" |
| building-hospital | "Hokkaido University Hospital", "Yeungnam University Hospital", "Memorial Sloan-Kettering Cancer Center" |
| building-hotel | "Flamingo Hotel", "The Standard Hotel", "Radisson Blu Sea Plaza Hotel" |
| building-library | "British Library", "Bayerische Staatsbibliothek", "Berlin State Library" |
| building-other | "Communiplex", "Henry Ford Museum", "Alpha Recording Studios" |
| building-restaurant | "Fatburger", "Carnegie Deli", "Trumbull" |
| building-sportsfacility | "Sports Center", "Glenn Warner Soccer Facility", "Boston Garden" |
| building-theater | "Sanders Theatre", "Pittsburgh Civic Light Opera", "National Paris Opera" |
| event-attack/battle/war/militaryconflict | "Vietnam War", "Jurist", "Easter Offensive" |
| event-disaster | "1693 Sicily earthquake", "the 1912 North Mount Lyell Disaster", "1990s North Korean famine" |
| event-election | "March 1898 elections", "1982 Mitcham and Morden by-election", "Elections to the European Parliament" |
| event-other | "Eastwood Scoring Stage", "Masaryk Democratic Movement", "Union for a Popular Movement" |
| event-protest | "Russian Revolution", "Iranian Constitutional Revolution", "French Revolution" |
| event-sportsevent | "Stanley Cup", "World Cup", "National Champions" |
| location-GPE | "Mediterranean Basin", "Croatian", "the Republic of Croatia" |
| location-bodiesofwater | "Norfolk coast", "Atatürk Dam Lake", "Arthur Kill" |
| location-island | "Staten Island", "Laccadives", "new Samsat district" |
| location-mountain | "Miteirya Ridge", "Ruweisat Ridge", "Salamander Glacier" |
| location-other | "Victoria line", "Cartuther", "Northern City Line" |
| location-park | "Painted Desert Community Complex Historic District", "Shenandoah National Park", "Gramercy Park" |
| location-road/railway/highway/transit | "Friern Barnet Road", "Newark-Elizabeth Rail Link", "NJT" |
| organization-company | "Church 's Chicken", "Dixy Chicken", "Texas Chicken" |
| organization-education | "MIT", "Barnard College", "Belfast Royal Academy and the Ulster College of Physical Education" |
| organization-government/governmentagency | "Supreme Court", "Diet", "Congregazione dei Nobili" |
| organization-media/newspaper | "TimeOut Melbourne", "Clash", "Al Jazeera" |
| organization-other | "IAEA", "Defence Sector C", "4th Army" |
| organization-politicalparty | "Al Wafa ' Islamic", "Kenseitō", "Shimpotō" |
| organization-religion | "Christian", "UPCUSA", "Jewish" |
| organization-showorganization | "Lizzy", "Mr. Mister", "Bochumer Symphoniker" |
| organization-sportsleague | "China League One", "NHL", "First Division" |
| organization-sportsteam | "Luc Alphand Aventures", "Tottenham", "Arsenal" |
| other-astronomything | "`` Caput Larvae ''", "Algol", "Zodiac" |
| other-award | "GCON", "Order of the Republic of Guinea and Nigeria", "Grand Commander of the Order of the Niger" |
| other-biologything | "BAR", "Amphiphysin", "N-terminal lipid" |
| other-chemicalthing | "sulfur", "uranium", "carbon dioxide" |
| other-currency | "Travancore Rupee", "$", "lac crore" |
| other-disease | "bladder cancer", "hypothyroidism", "French Dysentery Epidemic of 1779" |
| other-educationaldegree | "Master", "Bachelor", "BSc ( Hons ) in physics" |
| other-god | "Fujin", "Raijin", "El" |
| other-language | "Latin", "English", "Breton-speaking" |
| other-law | "Thirty Years ' Peace", "United States Freedom Support Act", "Leahy–Smith America Invents Act ( AIA" |
| other-livingthing | "monkeys", "insects", "patchouli" |
| other-medical | "Pediatrics", "amitriptyline", "pediatrician" |
| person-actor | "Edmund Payne", "Ellaline Terriss", "Tchéky Karyo" |
| person-artist/author | "George Axelrod", "Hicks", "Gaetano Donizett" |
| person-athlete | "Tozawa", "Neville", "Jaguar" |
| person-director | "Richard Quine", "Frank Darabont", "Bob Swaim" |
| person-other | "Richard Benson", "Campbell", "Holden" |
| person-politician | "Rivière", "William", "Emeric" |
| person-scholar | "Wurdack", "Stedman", "Stalmine" |
| person-soldier | "Joachim Ziegler", "Krukenberg", "Helmuth Weidling" |
| product-airplane | "Luton", "Spey-equipped FGR.2s", "EC135T2 CPDS" |
| product-car | "Corvettes - GT1 C6R", "Phantom", "100EX" |
| product-food | "V. labrusca", "yakiniku", "red grape" |
| product-game | "Airforce Delta", "Hardcore RPG", "Splinter Cell" |
| product-other | "PDP-1", "Fairbottom Bobs", "X11" |
| product-ship | "HMS `` Chinkara ''", "Congress", "Essex" |
| product-software | "Apdf", "Wikipedia", "AmiPDF" |
| product-train | "Royal Scots Grey", "High Speed Trains", "55022" |
| product-weapon | "AR-15 's", "ZU-23-2M Wróbel", "ZU-23-2MR Wróbel II" |
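
Every label in the table above follows a `<coarse>-<fine>` naming pattern (for example `person-actor`). If only the coarse type is needed downstream, the label can be split on its first hyphen. This is a minimal sketch; the helper name `split_label` is hypothetical and not part of SpanMarker.

```python
from typing import Tuple


def split_label(label: str) -> Tuple[str, str]:
    """Split a fine-grained label such as "person-actor" into (coarse, fine)."""
    # Split on the first hyphen only: the fine part may contain "/" or further characters.
    coarse, _, fine = label.partition("-")
    return coarse, fine


for label in ["person-actor", "location-GPE", "event-attack/battle/war/militaryconflict"]:
    print(split_label(label))
```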

## Evaluation

### Metrics

| Label | Precision | Recall | F1 |
|:------|:----------|:-------|:-------|
| **all** | 0.7041 | 0.6973 | 0.7007 |
| art-broadcastprogram | 0.5863 | 0.6252 | 0.6051 |
| art-film | 0.7779 | 0.7520 | 0.7647 |
| art-music | 0.8014 | 0.7570 | 0.7786 |
| art-other | 0.4209 | 0.3221 | 0.3649 |
| art-painting | 0.5938 | 0.6667 | 0.6281 |
| art-writtenart | 0.6854 | 0.6415 | 0.6628 |
| building-airport | 0.8197 | 0.8242 | 0.8219 |
| building-hospital | 0.7215 | 0.8187 | 0.7671 |
| building-hotel | 0.7233 | 0.6906 | 0.7066 |
| building-library | 0.7588 | 0.7268 | 0.7424 |
| building-other | 0.5842 | 0.5855 | 0.5848 |
| building-restaurant | 0.5567 | 0.4871 | 0.5195 |
| building-sportsfacility | 0.6512 | 0.7690 | 0.7052 |
| building-theater | 0.6994 | 0.7516 | 0.7246 |
| event-attack/battle/war/militaryconflict | 0.7800 | 0.7332 | 0.7559 |
| event-disaster | 0.5767 | 0.5266 | 0.5505 |
| event-election | 0.5106 | 0.1319 | 0.2096 |
| event-other | 0.4931 | 0.4145 | 0.4504 |
| event-protest | 0.3711 | 0.4337 | 0.4000 |
| event-sportsevent | 0.6156 | 0.6156 | 0.6156 |
| location-GPE | 0.8175 | 0.8508 | 0.8338 |
| location-bodiesofwater | 0.7297 | 0.7622 | 0.7456 |
| location-island | 0.7314 | 0.6703 | 0.6995 |
| location-mountain | 0.7538 | 0.7283 | 0.7409 |
| location-other | 0.4370 | 0.3040 | 0.3585 |
| location-park | 0.7063 | 0.6878 | 0.6969 |
| location-road/railway/highway/transit | 0.7092 | 0.7259 | 0.7174 |
| organization-company | 0.6911 | 0.6943 | 0.6927 |
| organization-education | 0.7799 | 0.7973 | 0.7885 |
| organization-government/governmentagency | 0.5518 | 0.4474 | 0.4942 |
| organization-media/newspaper | 0.6268 | 0.6761 | 0.6505 |
| organization-other | 0.5804 | 0.5341 | 0.5563 |
| organization-politicalparty | 0.6627 | 0.7306 | 0.6949 |
| organization-religion | 0.5636 | 0.6265 | 0.5934 |
| organization-showorganization | 0.6023 | 0.6086 | 0.6054 |
| organization-sportsleague | 0.6594 | 0.6497 | 0.6545 |
| organization-sportsteam | 0.7341 | 0.7703 | 0.7518 |
| other-astronomything | 0.7806 | 0.8289 | 0.8040 |
| other-award | 0.7230 | 0.6703 | 0.6957 |
| other-biologything | 0.6733 | 0.6366 | 0.6544 |
| other-chemicalthing | 0.5962 | 0.5838 | 0.5899 |
| other-currency | 0.7135 | 0.7822 | 0.7463 |
| other-disease | 0.6260 | 0.7063 | 0.6637 |
| other-educationaldegree | 0.6000 | 0.6033 | 0.6016 |
| other-god | 0.7051 | 0.7118 | 0.7085 |
| other-language | 0.6849 | 0.7968 | 0.7366 |
| other-law | 0.6814 | 0.6843 | 0.6829 |
| other-livingthing | 0.5959 | 0.6443 | 0.6192 |
| other-medical | 0.5247 | 0.4811 | 0.5020 |
| person-actor | 0.8342 | 0.7960 | 0.8146 |
| person-artist/author | 0.7052 | 0.7482 | 0.7261 |
| person-athlete | 0.8396 | 0.8530 | 0.8462 |
| person-director | 0.7250 | 0.7329 | 0.7289 |
| person-other | 0.6866 | 0.6672 | 0.6767 |
| person-politician | 0.6819 | 0.6852 | 0.6835 |
| person-scholar | 0.5468 | 0.4953 | 0.5198 |
| person-soldier | 0.5360 | 0.5641 | 0.5497 |
| product-airplane | 0.6825 | 0.6730 | 0.6777 |
| product-car | 0.7205 | 0.7016 | 0.7109 |
| product-food | 0.6036 | 0.5394 | 0.5697 |
| product-game | 0.7740 | 0.6876 | 0.7282 |
| product-other | 0.5250 | 0.4117 | 0.4615 |
| product-ship | 0.6781 | 0.6763 | 0.6772 |
| product-software | 0.6701 | 0.6603 | 0.6652 |
| product-train | 0.5919 | 0.6051 | 0.5984 |
| product-weapon | 0.6507 | 0.5433 | 0.5921 |
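
The F1 column is the harmonic mean of precision and recall, $F_1 = \frac{2PR}{P + R}$. As a quick sanity check, the sketch below recomputes two rows of the table from their rounded precision and recall values. Because the inputs are already rounded to four decimals, a recomputed F1 can in principle differ from a table entry in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall = f1_score(0.7041, 0.6973)   # the "all" row
election = f1_score(0.5106, 0.1319)  # the "event-election" row
print(f"all: {overall:.4f}, event-election: {election:.4f}")
```

The `event-election` row shows how the harmonic mean punishes imbalance: a precision above 0.5 cannot compensate for a recall of about 0.13.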

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-fewnerd-fine-super")
# Run inference
entities = model.predict("Most of the Steven Seagal movie \"Under Siege\" (co-starring Tommy Lee Jones) was filmed on the Battleship USS Alabama, which is docked on Mobile Bay at Battleship Memorial Park and open to the public.")
```
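
`model.predict` returns one dictionary per detected entity. The sketch below assumes each dictionary carries at least the keys `span`, `label` and `score`, and shows one way to drop low-confidence predictions and group the rest by label. The helper `group_by_label`, the 0.5 threshold, and the sample predictions are all hypothetical, hand-written for illustration and not real model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group entity spans by predicted label, skipping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hypothetical predictions in the assumed output format (not real model output).
example_entities = [
    {"span": "Steven Seagal", "label": "person-actor", "score": 0.98},
    {"span": "Under Siege", "label": "art-film", "score": 0.95},
    {"span": "Tommy Lee Jones", "label": "person-actor", "score": 0.97},
    {"span": "Mobile Bay", "label": "location-bodiesofwater", "score": 0.41},
]
print(group_by_label(example_entities))
```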

### Downstream Use

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-fewnerd-fine-super")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-mbert-base-fewnerd-fine-super-finetuned")
```

</details>


## Training Details

### Training Set Metrics

| Training set | Min | Median | Max |
|:----------------------|:----|:--------|:----|
| Sentence length | 1 | 24.4945 | 267 |
| Entities per sentence | 0 | 2.5832 | 88 |

### Training Hyperparameters

- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
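
To show what `lr_scheduler_type: linear` combined with `lr_scheduler_warmup_ratio: 0.1` means in practice, here is a minimal pure-Python sketch of a linear warmup followed by linear decay. It is an approximation for illustration, not the trainer's own code. The function name `linear_schedule_lr` is hypothetical, and `TOTAL_STEPS` is an assumption: the exact number of optimizer steps is not stated in this card.

```python
def linear_schedule_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp down from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 30_000  # Assumption for illustration; the true total is not given in this card.
for step in (0, 1_500, 3_000, 16_500, 30_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

Under these assumptions the peak learning rate of 5e-05 is reached after the first 10% of steps and is halved midway through the decay phase.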

### Training Results

| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.2972 | 3000 | 0.0274 | 0.6488 | 0.6457 | 0.6473 | 0.9121 |
| 0.5944 | 6000 | 0.0252 | 0.6686 | 0.6545 | 0.6615 | 0.9160 |
| 0.8915 | 9000 | 0.0239 | 0.6918 | 0.6547 | 0.6727 | 0.9178 |
| 1.1887 | 12000 | 0.0235 | 0.6962 | 0.6727 | 0.6842 | 0.9210 |
| 1.4859 | 15000 | 0.0233 | 0.6872 | 0.6742 | 0.6806 | 0.9201 |
| 1.7831 | 18000 | 0.0226 | 0.6969 | 0.6891 | 0.6929 | 0.9236 |
| 2.0802 | 21000 | 0.0231 | 0.7030 | 0.6916 | 0.6973 | 0.9246 |
| 2.3774 | 24000 | 0.0227 | 0.7020 | 0.6936 | 0.6978 | 0.9248 |
| 2.6746 | 27000 | 0.0223 | 0.7079 | 0.6989 | 0.7034 | 0.9258 |
| 2.9718 | 30000 | 0.0222 | 0.7089 | 0.7009 | 0.7049 | 0.9263 |

### Environmental Impact

Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).

- **Carbon Emitted**: 0.573 kg of CO2
- **Hours Used**: 3.867 hours
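
The figure above follows from the `emissions` value in the card metadata, which is recorded in grams of CO2-equivalent. The short sketch below reproduces the conversion to kilograms and derives an average emission rate. The per-hour rate is a derived number, not something reported by the original card.

```python
emissions_g = 572.6675932546113  # grams of CO2-eq, from the card metadata
hours_used = 3.867               # training duration in hours, from the card metadata

emissions_kg = emissions_g / 1000
grams_per_hour = emissions_g / hours_used  # derived average, not reported in the card
print(f"{emissions_kg:.3f} kg total, about {grams_per_hour:.0f} g per hour")
```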

### Training Hardware

- **On Cloud**: No
- **GPU Model**: 1 x NVIDIA GeForce RTX 3090
- **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
- **RAM Size**: 31.78 GB

### Framework Versions

- Python: 3.9.16
- SpanMarker: 1.4.1.dev
- Transformers: 4.30.0
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.0
- Tokenizers: 0.13.2

## Citation

### BibTeX

```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```