import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown(""" """, unsafe_allow_html=True)

# Title
st.markdown('
Introduction to CamemBERT Annotators in Spark NLP
', unsafe_allow_html=True)

# Subtitle
st.markdown("""

Spark NLP offers several CamemBERT-based annotators for different natural language processing tasks. CamemBERT is a language model built specifically for French, with strong performance across a range of NLP applications. Below, we provide an overview of the four key CamemBERT annotators:

""", unsafe_allow_html=True)

st.markdown("""

CamemBERT for Token Classification

The CamemBertForTokenClassification annotator performs token classification, most commonly Named Entity Recognition (NER), using CamemBERT, a French language model derived from RoBERTa. Token classification assigns each token in a text a tag that identifies the kind of entity it belongs to, if any. Because CamemBERT was built for French, it is a natural choice for French NER pipelines.

Token classification with CamemBERT enables:

Here is an example of how CamemBERT token classification works:

| Entity | Label |
|---|---|
| Paris | LOC |
| Emmanuel Macron | PER |
| Élysée Palace | ORG |
""", unsafe_allow_html=True)

# CamemBERT Token Classification - French WikiNER
st.markdown('
CamemBERT Token Classification - French WikiNER
', unsafe_allow_html=True)

st.markdown("""

The camembert_base_token_classifier_wikiner model is a CamemBERT model fine-tuned for Named Entity Recognition (NER) on the French WikiNER dataset. It assigns each token one of five labels: LOC, PER, MISC, and ORG for entity types, plus O for tokens that are not part of any entity.
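
To make the step from per-token labels to entity chunks concrete, here is a minimal sketch in plain Python. It assumes a simplified grouping rule (adjacent tokens that share the same non-O label form one chunk); the function name `group_entities` and the sample tokens are hypothetical and are not part of Spark NLP.

```python
from itertools import groupby

def group_entities(tokens, labels, outside="O"):
    # Hypothetical helper: merge runs of adjacent tokens that share a non-O label
    chunks = []
    pairs = zip(tokens, labels)
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside:
            chunks.append((" ".join(tok for tok, _ in run), label))
    return chunks

# Illustrative input, not real model output
tokens = ["Emmanuel", "Macron", "réside", "à", "Paris"]
labels = ["PER", "PER", "O", "O", "LOC"]
entities = group_entities(tokens, labels)
print(entities)
```

One limitation of this simplified rule is that two distinct entities of the same type that sit next to each other would be merged into a single chunk.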

""", unsafe_allow_html=True)

# How to Use the Model - Token Classification
st.markdown('
How to Use the Model
', unsafe_allow_html=True)

st.code('''
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

tokenClassifier = CamemBertForTokenClassification \\
    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\
    .setInputCols(['document', 'token']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Convert NER labels to entities
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select(
    expr("explode(entities) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')

# Results
st.text("""
+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Paris             |LOC      |
|France            |LOC      |
|Emmanuel Macron   |PER      |
|Élysée Palace     |ORG      |
|Apple Inc.        |ORG      |
+------------------+---------+
""")

# Performance Metrics
st.markdown('
Performance Metrics
', unsafe_allow_html=True)

st.markdown("""

Here are the detailed performance metrics for the CamemBERT token classification model:

| Entity | Precision | Recall | F1-Score |
|---|---|---|---|
| LOC | 0.93 | 0.94 | 0.94 |
| PER | 0.95 | 0.95 | 0.95 |
| ORG | 0.92 | 0.91 | 0.91 |
| MISC | 0.86 | 0.85 | 0.85 |
| O | 0.99 | 0.99 | 0.99 |
| Overall | 0.97 | 0.98 | 0.98 |
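
The source does not say how the F1 column was computed. Assuming the usual definition (the harmonic mean of precision and recall), the relationship between the three columns can be sketched as follows; the function name `f1_score` is illustrative only.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0.0 when both are zero
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recompute two rows from the rounded precision and recall values in the table
per_f1 = round(f1_score(0.95, 0.95), 2)
org_f1 = round(f1_score(0.92, 0.91), 2)
print(per_f1, org_f1)
```

Because the precision and recall values shown are already rounded to two decimals, a value recomputed this way can differ from the listed F1-Score in the last digit.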
""", unsafe_allow_html=True) # Model Information - Token Classification st.markdown('
Model Information
', unsafe_allow_html=True)

st.markdown("""
""", unsafe_allow_html=True)

# References - Token Classification
st.markdown('
References
', unsafe_allow_html=True)

st.markdown("""
""", unsafe_allow_html=True)