import streamlit as st

# Page configuration
st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

# Custom CSS for better styling
st.markdown("""
""", unsafe_allow_html=True)

# Title
st.markdown('

Introduction to CamemBERT Annotators in Spark NLP

', unsafe_allow_html=True)

# Subtitle
st.markdown("""

Spark NLP provides several CamemBERT-based annotators for different natural language processing tasks. CamemBERT is a language model designed specifically for French, and it performs strongly across a range of French NLP applications. Below is an overview of the four key CamemBERT annotators:

""", unsafe_allow_html=True)

st.markdown("""

CamemBERT for Token Classification

The CamemBertForTokenClassification annotator performs token classification, most commonly Named Entity Recognition (NER), using CamemBERT, a French language model based on the RoBERTa architecture. Token classification assigns each token in a text a tag that identifies the entity it belongs to, if any. Because CamemBERT was built for French, it is a good fit for French NLP pipelines.

Token classification with CamemBERT enables:

- Named Entity Recognition (NER): Identifying and classifying entities such as names, organizations, locations, and other predefined categories.
- Information Extraction: Extracting key information from unstructured text for further analysis.
- Text Categorization: Enhancing document retrieval and categorization based on entity recognition.

Here is an example of how CamemBERT token classification works:

| Entity | Label |
|---|---|
| Paris | LOC |
| Emmanuel Macron | PER |
| Élysée Palace | ORG |
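
The table above lists whole entities, while a token classifier labels individual tokens. As a rough illustration of how token-level labels can be merged into entity chunks, here is a minimal sketch in plain Python. The function name `group_entities`, the sample tokens, and the handling of optional `B-`/`I-` prefixes are assumptions made for this example; it is not Spark NLP's NerConverter implementation.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    # Merge consecutive tokens that share an entity label into chunks.
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None
    for token, label in zip(tokens, labels):
        # Drop an optional IOB prefix such as "B-" or "I-" (assumed scheme).
        entity = label.split("-", 1)[1] if label[:2] in ("B-", "I-") else label
        starts_new = label.startswith("B-") or entity != current_label
        # Close the open chunk when the entity ends or a new one begins.
        if current_tokens and (entity == "O" or starts_new):
            chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if entity != "O":
            current_tokens.append(token)
            current_label = entity
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hypothetical token-level predictions for a short French sentence.
tokens = ["Emmanuel", "Macron", "vit", "à", "Paris"]
labels = ["PER", "PER", "O", "O", "LOC"]
print(group_entities(tokens, labels))
```

For the sample input this prints `[('Emmanuel Macron', 'PER'), ('Paris', 'LOC')]`: the two PER tokens are merged into one chunk, and tokens labeled O are dropped.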

""", unsafe_allow_html=True)

# CamemBERT Token Classification - French WikiNER
st.markdown('

CamemBERT Token Classification - French WikiNER

', unsafe_allow_html=True)

st.markdown("""

The camembert_base_token_classifier_wikiner model is a CamemBERT model fine-tuned for Named Entity Recognition (NER) on the French WikiNER dataset. It predicts four entity types (LOC, PER, MISC, and ORG), plus the O label for tokens that are not part of any entity.

""", unsafe_allow_html=True)

# How to Use the Model - Token Classification
st.markdown('

How to Use the Model

', unsafe_allow_html=True)

st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, expr

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Turn raw text into document annotations
document_assembler = DocumentAssembler() \\
    .setInputCol('text') \\
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \\
    .setInputCols(['document']) \\
    .setOutputCol('token')

# Label each token with the pretrained CamemBERT model
tokenClassifier = CamemBertForTokenClassification \\
    .pretrained('camembert_base_token_classifier_wikiner', 'en') \\
    .setInputCols(['document', 'token']) \\
    .setOutputCol('ner') \\
    .setCaseSensitive(True) \\
    .setMaxSentenceLength(512)

# Convert NER labels to entities
ner_converter = NerConverter() \\
    .setInputCols(['document', 'token', 'ner']) \\
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier,
    ner_converter
])

data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

# One row per recognized entity chunk
result.select(
    expr("explode(entities) as ner_chunk")
).select(
    col("ner_chunk.result").alias("chunk"),
    col("ner_chunk.metadata.entity").alias("ner_label")
).show(truncate=False)
''', language='python')

# Results
st.text("""
+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Paris             |LOC      |
|France            |LOC      |
|Emmanuel Macron   |PER      |
|Élysée Palace     |ORG      |
|Apple Inc.        |ORG      |
+------------------+---------+
""")

# Performance Metrics
st.markdown('

Performance Metrics

', unsafe_allow_html=True)

st.markdown("""

Here are the detailed performance metrics for the CamemBERT token classification model:

| Entity | Precision | Recall | F1-Score |
|---|---|---|---|
| LOC | 0.93 | 0.94 | 0.94 |
| PER | 0.95 | 0.95 | 0.95 |
| ORG | 0.92 | 0.91 | 0.91 |
| MISC | 0.86 | 0.85 | 0.85 |
| O | 0.99 | 0.99 | 0.99 |
| Overall | 0.97 | 0.98 | 0.98 |
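
For readers less familiar with these metrics, the F1-score is the harmonic mean of precision and recall. The sketch below shows the calculation in plain Python; the helper name `f1_score` and the `rows` dictionary are illustrative, with values copied from the table. Because the table's precision and recall are already rounded to two decimals, recomputed F1 values can differ from the listed ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall per entity type, copied from the table above.
rows = {
    "LOC": (0.93, 0.94),
    "PER": (0.95, 0.95),
    "ORG": (0.92, 0.91),
    "MISC": (0.86, 0.85),
}

for entity, (p, r) in rows.items():
    print(entity, round(f1_score(p, r), 3))
```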

""", unsafe_allow_html=True)

# Model Information - Token Classification
st.markdown('

Model Information

', unsafe_allow_html=True)

st.markdown("""

- Model Name: camembert_base_token_classifier_wikiner
- Compatibility: Spark NLP 4.2.0+
- License: Open Source
- Edition: Official
- Input Labels: [token, document]
- Output Labels: [ner]
- Language: French
- Size: 412.2 MB
- Case Sensitive: Yes
- Max Sentence Length: 512

""", unsafe_allow_html=True)

# References - Token Classification
st.markdown('

References

', unsafe_allow_html=True)

st.markdown("""

""", unsafe_allow_html=True)