import streamlit as st
# Page configuration
st.set_page_config(
layout="wide",
initial_sidebar_state="auto"
)
# Title
st.markdown('Introduction to CamemBERT Annotators in Spark NLP', unsafe_allow_html=True)
# Subtitle
st.markdown("""
Spark NLP provides several CamemBERT-based annotators, each targeting a different natural language processing task. CamemBERT is a language model built specifically for French that offers state-of-the-art performance across a range of NLP applications. Below is an overview of the four key CamemBERT annotators:
""", unsafe_allow_html=True)
st.markdown("""
CamemBERT for Token Classification

The CamemBertForTokenClassification annotator is designed for Named Entity Recognition (NER) tasks using CamemBERT, a French language model derived from RoBERTa. This model efficiently handles token classification, which involves labeling tokens in a text with tags that correspond to specific entities. CamemBERT offers robust performance in French NLP tasks, making it a valuable tool for real-time applications in this language.

Token classification with CamemBERT enables:

- Named Entity Recognition (NER): Identifying and classifying entities such as names, organizations, locations, and other predefined categories.
- Information Extraction: Extracting key information from unstructured text for further analysis.
- Text Categorization: Enhancing document retrieval and categorization based on entity recognition.

Here is an example of how CamemBERT token classification works:

| Entity | Label |
|---|---|
| Paris | LOC |
| Emmanuel Macron | PER |
| Élysée Palace | ORG |
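
To make the labeling step concrete, the sketch below shows one way token-level tags can be merged into entity chunks like those in the table above. It is an illustration only, not Spark NLP's implementation: the function name `group_entities` is hypothetical, and it assumes IOB-style tags (`B-PER`, `I-PER`, `O`), which may differ from the tag scheme a given model emits.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    # Hypothetical helper: merge consecutive tokens of one entity into (chunk, label) pairs.
    entities: List[Tuple[str, str]] = []
    chunk: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # "B-PER" splits into prefix "B" and type "PER"; "O" has an empty type.
        prefix, _, entity_type = tag.partition("-")
        starts_new = tag == "O" or prefix == "B" or entity_type != label
        if starts_new and chunk:
            # Close the entity collected so far.
            entities.append((" ".join(chunk), label))
            chunk, label = [], None
        if tag != "O":
            chunk.append(token)
            label = entity_type
    if chunk:
        entities.append((" ".join(chunk), label))
    return entities


tokens = ["Emmanuel", "Macron", "réside", "à", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(group_entities(tokens, tags))
# [('Emmanuel Macron', 'PER'), ('Paris', 'LOC')]
```

In the Spark NLP pipeline, this grouping is the job of a converter stage that runs after the token classifier, so application code normally reads finished chunks rather than raw tags.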
""", unsafe_allow_html=True)
# CamemBERT Token Classification - French WikiNER
st.markdown('CamemBERT Token Classification - French WikiNER', unsafe_allow_html=True)
st.markdown("""
The camembert_base_token_classifier_wikiner model is a CamemBERT model fine-tuned for Named Entity Recognition (NER) on the French WikiNER dataset. It predicts five labels: LOC, PER, MISC, and ORG for entities, plus O for tokens that fall outside any entity.
""", unsafe_allow_html=True)
# How to Use the Model - Token Classification
st.markdown('How to Use the Model', unsafe_allow_html=True)
st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr
# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()
document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')
tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')
tokenClassifier = CamemBertForTokenClassification \\
    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\
    .setInputCols(['document', 'token']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)
# Convert NER labels to entities
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')
pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])
data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(
    expr("explode(entities) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')
# Results
st.text("""
+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Paris             |LOC      |
|France            |LOC      |
|Emmanuel Macron   |PER      |
|Élysée Palace     |ORG      |
|Apple Inc.        |ORG      |
+------------------+---------+
""")
# Performance Metrics
st.markdown('Performance Metrics', unsafe_allow_html=True)
st.markdown("""
Here are the detailed performance metrics for the CamemBERT token classification model:

| Entity | Precision | Recall | F1-Score |
|---|---|---|---|
| LOC | 0.93 | 0.94 | 0.94 |
| PER | 0.95 | 0.95 | 0.95 |
| ORG | 0.92 | 0.91 | 0.91 |
| MISC | 0.86 | 0.85 | 0.85 |
| O | 0.99 | 0.99 | 0.99 |
| Overall | 0.97 | 0.98 | 0.98 |
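
The F1-Score column is the harmonic mean of precision and recall. The minimal sketch below recomputes it for the MISC and ORG rows; the helper name `f1_score` is ours and is not part of Spark NLP. Because the table reports rounded values, recomputing from them can differ in the last digit for some rows, such as LOC.

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MISC row: precision 0.86, recall 0.85
print(round(f1_score(0.86, 0.85), 2))  # 0.85
# ORG row: precision 0.92, recall 0.91
print(round(f1_score(0.92, 0.91), 2))  # 0.91
```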
""", unsafe_allow_html=True)
# Model Information - Token Classification
st.markdown('Model Information', unsafe_allow_html=True)
st.markdown("""
- Model Name: camembert_base_token_classifier_wikiner
- Compatibility: Spark NLP 4.2.0+
- License: Open Source
- Edition: Official
- Input Labels: [token, document]
- Output Labels: [ner]
- Language: French
- Size: 412.2 MB
- Case Sensitive: Yes
- Max Sentence Length: 512
""", unsafe_allow_html=True)