import streamlit as st
# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)
# Custom CSS for better styling
st.markdown("""
<style>
.main-title {
font-size: 36px;
color: #4A90E2;
font-weight: bold;
text-align: center;
}
.sub-title {
font-size: 24px;
color: #4A90E2;
margin-top: 20px;
}
.section {
background-color: #f9f9f9;
padding: 15px;
border-radius: 10px;
margin-top: 20px;
}
.section h2 {
font-size: 22px;
color: #4A90E2;
}
.section p, .section ul {
color: #666666;
}
.link {
color: #4A90E2;
text-decoration: none;
}
.benchmark-table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
}
.benchmark-table th, .benchmark-table td {
border: 1px solid #ddd;
padding: 8px;
text-align: left;
}
.benchmark-table th {
background-color: #4A90E2;
color: white;
}
.benchmark-table td {
background-color: #f2f2f2;
}
</style>
""", unsafe_allow_html=True)
# Title
st.markdown('<div class="main-title">Introduction to CamemBERT Annotators in Spark NLP</div>', unsafe_allow_html=True)
# Subtitle
st.markdown("""
<div class="section">
<p>Spark NLP offers a variety of CamemBERT-based annotators tailored for multiple natural language processing tasks. CamemBERT is a robust and versatile model designed specifically for the French language, offering state-of-the-art performance in a range of NLP applications. Below, we provide an overview of the four key CamemBERT annotators:</p>
</div>
""", unsafe_allow_html=True)
st.markdown("""
<div class="section">
<h2>CamemBERT for Token Classification</h2>
<p>The <strong>CamemBertForTokenClassification</strong> annotator performs Named Entity Recognition (NER) using CamemBERT, a French language model based on the RoBERTa architecture. Token classification assigns a label to each token in a text, so that entity mentions can be identified and typed. CamemBERT delivers strong performance on French NLP tasks, making it well suited to real-time French-language applications.</p>
<p>Token classification with CamemBERT enables:</p>
<ul>
<li><strong>Named Entity Recognition (NER):</strong> Identifying and classifying entities such as names, organizations, locations, and other predefined categories.</li>
<li><strong>Information Extraction:</strong> Extracting key information from unstructured text for further analysis.</li>
<li><strong>Text Categorization:</strong> Enhancing document retrieval and categorization based on entity recognition.</li>
</ul>
<p>Here is an example of how CamemBERT token classification works:</p>
<table class="benchmark-table">
<tr>
<th>Entity</th>
<th>Label</th>
</tr>
<tr>
<td>Paris</td>
<td>LOC</td>
</tr>
<tr>
<td>Emmanuel Macron</td>
<td>PER</td>
</tr>
<tr>
<td>Élysée Palace</td>
<td>ORG</td>
</tr>
</table>
</div>
""", unsafe_allow_html=True)
# CamemBERT Token Classification - French WikiNER
st.markdown('<div class="sub-title">CamemBERT Token Classification - French WikiNER</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>The <strong>camembert_base_token_classifier_wikiner</strong> model is a CamemBERT model fine-tuned for token classification, specifically Named Entity Recognition (NER) on the French WikiNER dataset. It predicts four entity types — LOC, PER, MISC, and ORG — plus the O label for tokens outside any entity.</p>
</div>
""", unsafe_allow_html=True)
# How to Use the Model - Token Classification
st.markdown('<div class="sub-title">How to Use the Model</div>', unsafe_allow_html=True)
st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start a Spark session with Spark NLP
spark = sparknlp.start()

document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

# Note: the Spark NLP Models Hub registers this model under the 'en' language code
tokenClassifier = CamemBertForTokenClassification \\
    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\
    .setInputCols(['document', 'token']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Convert token-level NER tags to entity chunks
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select(
    expr("explode(entities) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')
# Results
st.text("""
+------------------+---------+
|chunk |ner_label|
+------------------+---------+
|Paris |LOC |
|France |LOC |
|Emmanuel Macron |PER |
|Élysée Palace |ORG |
|Apple Inc. |ORG |
+------------------+---------+
""")
# Performance Metrics
st.markdown('<div class="sub-title">Performance Metrics</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>Here are the detailed performance metrics for the CamemBERT token classification model:</p>
<table class="benchmark-table">
<tr>
<th>Label</th>
<th>Precision</th>
<th>Recall</th>
<th>F1-Score</th>
</tr>
<tr>
<td>LOC</td>
<td>0.93</td>
<td>0.94</td>
<td>0.94</td>
</tr>
<tr>
<td>PER</td>
<td>0.95</td>
<td>0.95</td>
<td>0.95</td>
</tr>
<tr>
<td>ORG</td>
<td>0.92</td>
<td>0.91</td>
<td>0.91</td>
</tr>
<tr>
<td>MISC</td>
<td>0.86</td>
<td>0.85</td>
<td>0.85</td>
</tr>
<tr>
<td>O</td>
<td>0.99</td>
<td>0.99</td>
<td>0.99</td>
</tr>
<tr>
<td>Overall</td>
<td>0.97</td>
<td>0.98</td>
<td>0.98</td>
</tr>
</table>
</div>
""", unsafe_allow_html=True)
# Model Information - Token Classification
st.markdown('<div class="sub-title">Model Information</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<ul>
<li><strong>Model Name:</strong> camembert_base_token_classifier_wikiner</li>
<li><strong>Compatibility:</strong> Spark NLP 4.2.0+</li>
<li><strong>License:</strong> Open Source</li>
<li><strong>Edition:</strong> Official</li>
<li><strong>Input Labels:</strong> [token, document]</li>
<li><strong>Output Labels:</strong> [ner]</li>
<li><strong>Language:</strong> French</li>
<li><strong>Size:</strong> 412.2 MB</li>
<li><strong>Case Sensitive:</strong> Yes</li>
<li><strong>Max Sentence Length:</strong> 512</li>
</ul>
</div>
""", unsafe_allow_html=True)
# References - Token Classification
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<ul>
<li><a class="link" href="https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr" target="_blank" rel="noopener">CamemBERT WikiNER Dataset</a></li>
<li><a class="link" href="https://sparknlp.org/2022/09/23/camembert_base_token_classifier_wikiner_en.html" target="_blank" rel="noopener">CamemBERT Token Classification on Spark NLP Hub</a></li>
</ul>
</div>
""", unsafe_allow_html=True)