import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# Custom CSS for titles, sections, links, and tables
st.markdown("""
<style>
.main-title {
    font-size: 36px;
    color: #4A90E2;
    font-weight: bold;
    text-align: center;
}
.sub-title {
    font-size: 24px;
    color: #4A90E2;
    margin-top: 20px;
}
.section {
    background-color: #f9f9f9;
    padding: 15px;
    border-radius: 10px;
    margin-top: 20px;
}
.section h2 {
    font-size: 22px;
    color: #4A90E2;
}
.section p, .section ul {
    color: #666666;
}
.link {
    color: #4A90E2;
    text-decoration: none;
}
.benchmark-table {
    width: 100%;
    border-collapse: collapse;
    margin-top: 20px;
}
.benchmark-table th, .benchmark-table td {
    border: 1px solid #ddd;
    padding: 8px;
    text-align: left;
}
.benchmark-table th {
    background-color: #4A90E2;
    color: white;
}
.benchmark-table td {
    background-color: #f2f2f2;
}
</style>
""", unsafe_allow_html=True)

# Page title and introduction
st.markdown('<div class="main-title">Introduction to CamemBERT Annotators in Spark NLP</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
    <p>Spark NLP provides several CamemBERT-based annotators for different natural language processing tasks. CamemBERT is a language model built specifically for French and performs strongly across a range of French NLP tasks. Below, we look at the CamemBERT annotator for token classification:</p>
</div>
""", unsafe_allow_html=True)

# Overview of the token classification annotator
st.markdown("""
<div class="section">
    <h2>CamemBERT for Token Classification</h2>
    <p>The <strong>CamemBertForTokenClassification</strong> annotator applies CamemBERT, a French language model derived from RoBERTa, to token classification tasks such as Named Entity Recognition (NER). Token classification assigns a tag to each token in a text, and each tag marks the kind of entity the token belongs to, if any. Because CamemBERT is trained on French text, the annotator is well suited to French NER pipelines.</p>
    <p>Token classification with CamemBERT enables:</p>
    <ul>
        <li><strong>Named Entity Recognition (NER):</strong> Identifying and classifying entities such as people, organizations, locations, and other predefined categories.</li>
        <li><strong>Information Extraction:</strong> Extracting key information from unstructured text for further analysis.</li>
        <li><strong>Text Categorization:</strong> Using the recognized entities to improve document retrieval and categorization.</li>
    </ul>
    <p>For example, entities in a text could be labeled as follows:</p>
    <table class="benchmark-table">
        <tr>
            <th>Entity</th>
            <th>Label</th>
        </tr>
        <tr>
            <td>Paris</td>
            <td>LOC</td>
        </tr>
        <tr>
            <td>Emmanuel Macron</td>
            <td>PER</td>
        </tr>
        <tr>
            <td>Élysée Palace</td>
            <td>ORG</td>
        </tr>
    </table>
</div>
""", unsafe_allow_html=True)

# Pretrained model description
st.markdown('<div class="sub-title">CamemBERT Token Classification - French WikiNER</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>The <strong>camembert_base_token_classifier_wikiner</strong> model is a CamemBERT model fine-tuned for Named Entity Recognition (NER) on the French WikiNER dataset. It predicts five labels: LOC, PER, ORG, and MISC for entities, plus O for tokens that are not part of any entity.</p>
</div>
""", unsafe_allow_html=True)

# Usage example shown to the reader
st.markdown('<div class="sub-title">How to Use the Model</div>', unsafe_allow_html=True)
st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Turn raw text into document annotations
document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

# Load the pretrained CamemBERT NER model (published under the 'en' language code)
tokenClassifier = CamemBertForTokenClassification \\
    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\
    .setInputCols(['document', 'token']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Convert NER labels to entities
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Show one row per recognized entity chunk with its label
result.select(
    expr("explode(entities) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')

st.text("""
+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Paris             |LOC      |
|France            |LOC      |
|Emmanuel Macron   |PER      |
|Élysée Palace     |ORG      |
|Apple Inc.        |ORG      |
+------------------+---------+
""")

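# The NerConverter stage in the example above turns token-level labels into the
# entity chunks listed in the output table. The helper below is only a rough,
# pure-Python sketch of that grouping idea, assuming IOB-style tags such as
# "B-PER", "I-PER", and "O". `group_entities` is a hypothetical name used for
# illustration, not a Spark NLP API, and the real annotator handles more cases.
def group_entities(tokens, tags):
    """Merge consecutive tokens that share an entity label into (text, label) chunks."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        # Split "B-PER" into prefix "B" and label "PER"; "O" yields an empty label.
        prefix, _, label = tag.partition("-")
        starts_new = tag == "O" or prefix == "B" or label != current_label
        if starts_new and current_tokens:
            # Close the chunk collected so far.
            chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if tag != "O":
            current_tokens.append(token)
            current_label = label
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks
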
# Per-label evaluation results
st.markdown('<div class="sub-title">Performance Metrics</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>Precision, recall, and F1-score of the CamemBERT token classification model, per label and overall:</p>
    <table class="benchmark-table">
        <tr>
            <th>Label</th>
            <th>Precision</th>
            <th>Recall</th>
            <th>F1-Score</th>
        </tr>
        <tr>
            <td>LOC</td>
            <td>0.93</td>
            <td>0.94</td>
            <td>0.94</td>
        </tr>
        <tr>
            <td>PER</td>
            <td>0.95</td>
            <td>0.95</td>
            <td>0.95</td>
        </tr>
        <tr>
            <td>ORG</td>
            <td>0.92</td>
            <td>0.91</td>
            <td>0.91</td>
        </tr>
        <tr>
            <td>MISC</td>
            <td>0.86</td>
            <td>0.85</td>
            <td>0.85</td>
        </tr>
        <tr>
            <td>O</td>
            <td>0.99</td>
            <td>0.99</td>
            <td>0.99</td>
        </tr>
        <tr>
            <td>Overall</td>
            <td>0.97</td>
            <td>0.98</td>
            <td>0.98</td>
        </tr>
    </table>
</div>
""", unsafe_allow_html=True)

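# The F1-score column in the table above is the harmonic mean of precision and
# recall. A minimal sketch of that formula follows. `f1_score` is a hypothetical
# helper for illustration, not part of Spark NLP. Because the table shows values
# rounded to two decimals, recomputing F1 from them may differ in the last digit.
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
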
# Model card summary
st.markdown('<div class="sub-title">Model Information</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><strong>Model Name:</strong> camembert_base_token_classifier_wikiner</li>
        <li><strong>Compatibility:</strong> Spark NLP 4.2.0+</li>
        <li><strong>License:</strong> Open Source</li>
        <li><strong>Edition:</strong> Official</li>
        <li><strong>Input Labels:</strong> [token, document]</li>
        <li><strong>Output Labels:</strong> [ner]</li>
        <li><strong>Language:</strong> French (loaded with the <code>en</code> language code)</li>
        <li><strong>Size:</strong> 412.2 MB</li>
        <li><strong>Case Sensitive:</strong> Yes</li>
        <li><strong>Max Sentence Length:</strong> 512</li>
    </ul>
</div>
""", unsafe_allow_html=True)

# External references
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr" target="_blank" rel="noopener">French WikiNER Dataset on Hugging Face</a></li>
        <li><a class="link" href="https://sparknlp.org/2022/09/23/camembert_base_token_classifier_wikiner_en.html" target="_blank" rel="noopener">CamemBERT Token Classification on Spark NLP Hub</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)