Spaces:

spark-nlp
/

CamemBERT

Sleeping

App Files Files Community

CamemBERT / pages /Workflow & Model Overview.py

abdullahmubeen10

Upload 5 files

6b63571 verified 6 months ago

raw

history blame contribute delete

8.72 kB

	import streamlit as st

	# Page configuration
	st.set_page_config(
	layout="wide",
	initial_sidebar_state="auto"
	)

	# Custom CSS for better styling
	st.markdown("""
	<style>
	.main-title {
	font-size: 36px;
	color: #4A90E2;
	font-weight: bold;
	text-align: center;
	}
	.sub-title {
	font-size: 24px;
	color: #4A90E2;
	margin-top: 20px;
	}
	.section {
	background-color: #f9f9f9;
	padding: 15px;
	border-radius: 10px;
	margin-top: 20px;
	}
	.section h2 {
	font-size: 22px;
	color: #4A90E2;
	}
	.section p, .section ul {
	color: #666666;
	}
	.link {
	color: #4A90E2;
	text-decoration: none;
	}
	.benchmark-table {
	width: 100%;
	border-collapse: collapse;
	margin-top: 20px;
	}
	.benchmark-table th, .benchmark-table td {
	border: 1px solid #ddd;
	padding: 8px;
	text-align: left;
	}
	.benchmark-table th {
	background-color: #4A90E2;
	color: white;
	}
	.benchmark-table td {
	background-color: #f2f2f2;
	}
	</style>
	""", unsafe_allow_html=True)

	# Title
	st.markdown('<div class="main-title">Introduction to CamemBERT Annotators in Spark NLP</div>', unsafe_allow_html=True)

	# Subtitle
	st.markdown("""
	<div class="section">
	<p>Spark NLP offers a variety of CamemBERT-based annotators tailored for multiple natural language processing tasks. CamemBERT is a robust and versatile model designed specifically for the French language, offering state-of-the-art performance in a range of NLP applications. Below, we provide an overview of the four key CamemBERT annotators:</p>
	</div>
	""", unsafe_allow_html=True)

	st.markdown("""
	<div class="section">
	<h2>CamemBERT for Token Classification</h2>
	<p>The <strong>CamemBertForTokenClassification</strong> annotator is designed for Named Entity Recognition (NER) tasks using CamemBERT, a French language model derived from RoBERTa. This model efficiently handles token classification, which involves labeling tokens in a text with tags that correspond to specific entities. CamemBERT offers robust performance in French NLP tasks, making it a valuable tool for real-time applications in this language.</p>
	<p>Token classification with CamemBERT enables:</p>
	<ul>
	<li><strong>Named Entity Recognition (NER):</strong> Identifying and classifying entities such as names, organizations, locations, and other predefined categories.</li>
	<li><strong>Information Extraction:</strong> Extracting key information from unstructured text for further analysis.</li>
	<li><strong>Text Categorization:</strong> Enhancing document retrieval and categorization based on entity recognition.</li>
	</ul>
	<p>Here is an example of how CamemBERT token classification works:</p>
	<table class="benchmark-table">
	<tr>
	<th>Entity</th>
	<th>Label</th>
	</tr>
	<tr>
	<td>Paris</td>
	<td>LOC</td>
	</tr>
	<tr>
	<td>Emmanuel Macron</td>
	<td>PER</td>
	</tr>
	<tr>
	<td>Élysée Palace</td>
	<td>ORG</td>
	</tr>
	</table>
	</div>
	""", unsafe_allow_html=True)

	# CamemBERT Token Classification - French WikiNER
	st.markdown('<div class="sub-title">CamemBERT Token Classification - French WikiNER</div>', unsafe_allow_html=True)
	st.markdown("""
	<div class="section">
	<p>The <strong>camembert_base_token_classifier_wikiner</strong> is a fine-tuned CamemBERT model for token classification tasks, specifically adapted for Named Entity Recognition (NER) on the French WikiNER dataset. It is designed to recognize five types of entities: O, LOC, PER, MISC, and ORG.</p>
	</div>
	""", unsafe_allow_html=True)

	# How to Use the Model - Token Classification
	st.markdown('<div class="sub-title">How to Use the Model</div>', unsafe_allow_html=True)
	st.code('''
	from sparknlp.base import *
	from sparknlp.annotator import *
	from pyspark.ml import Pipeline
	from pyspark.sql.functions import col, expr

	document_assembler = DocumentAssembler() \\
	.setInputCol('text') \\
	.setOutputCol('document')

	tokenizer = Tokenizer() \\
	.setInputCols(['document']) \\
	.setOutputCol('token')

	tokenClassifier = CamemBertForTokenClassification \\
	.pretrained('camembert_base_token_classifier_wikiner', 'en') \\
	.setInputCols(['document', 'token']) \\
	.setOutputCol('ner') \\
	.setCaseSensitive(True) \\
	.setMaxSentenceLength(512)

	# Convert NER labels to entities
	ner_converter = NerConverter() \\
	.setInputCols(['document', 'token', 'ner']) \\
	.setOutputCol('entities')

	pipeline = Pipeline(stages=[
	document_assembler,
	tokenizer,
	tokenClassifier,
	ner_converter
	])

	data = spark.createDataFrame([["""Paris est la capitale de la France et abrite le Président Emmanuel Macron, qui réside au palais de l'Élysée. Apple Inc. a une présence significative dans la ville."""]]).toDF("text")
	result = pipeline.fit(data).transform(data)

	result.select(
	expr("explode(entities) as ner_chunk")
	).select(
	col("ner_chunk.result").alias("chunk"),
	col("ner_chunk.metadata.entity").alias("ner_label")
	).show(truncate=False)
	''', language='python')

	# Results
	st.text("""
	+------------------+---------+
	\|chunk \|ner_label\|
	+------------------+---------+
	\|Paris \|LOC \|
	\|France \|LOC \|
	\|Emmanuel Macron \|PER \|
	\|Élysée Palace \|ORG \|
	\|Apple Inc. \|ORG \|
	+------------------+---------+
	""")

	# Performance Metrics
	st.markdown('<div class="sub-title">Performance Metrics</div>', unsafe_allow_html=True)
	st.markdown("""
	<div class="section">
	<p>Here are the detailed performance metrics for the CamemBERT token classification model:</p>
	<table class="benchmark-table">
	<tr>
	<th>Entity</th>
	<th>Precision</th>
	<th>Recall</th>
	<th>F1-Score</th>
	</tr>
	<tr>
	<td>LOC</td>
	<td>0.93</td>
	<td>0.94</td>
	<td>0.94</td>
	</tr>
	<tr>
	<td>PER</td>
	<td>0.95</td>
	<td>0.95</td>
	<td>0.95</td>
	</tr>
	<tr>
	<td>ORG</td>
	<td>0.92</td>
	<td>0.91</td>
	<td>0.91</td>
	</tr>
	<tr>
	<td>MISC</td>
	<td>0.86</td>
	<td>0.85</td>
	<td>0.85</td>
	</tr>
	<tr>
	<td>O</td>
	<td>0.99</td>
	<td>0.99</td>
	<td>0.99</td>
	</tr>
	<tr>
	<td>Overall</td>
	<td>0.97</td>
	<td>0.98</td>
	<td>0.98</td>
	</tr>
	</table>
	</div>
	""", unsafe_allow_html=True)

	# Model Information - Token Classification
	st.markdown('<div class="sub-title">Model Information</div>', unsafe_allow_html=True)
	st.markdown("""
	<div class="section">
	<ul>
	<li><strong>Model Name:</strong> camembert_base_token_classifier_wikiner</li>
	<li><strong>Compatibility:</strong> Spark NLP 4.2.0+</li>
	<li><strong>License:</strong> Open Source</li>
	<li><strong>Edition:</strong> Official</li>
	<li><strong>Input Labels:</strong> [token, document]</li>
	<li><strong>Output Labels:</strong> [ner]</li>
	<li><strong>Language:</strong> French</li>
	<li><strong>Size:</strong> 412.2 MB</li>
	<li><strong>Case Sensitive:</strong> Yes</li>
	<li><strong>Max Sentence Length:</strong> 512</li>
	</ul>
	</div>
	""", unsafe_allow_html=True)

	# References - Token Classification
	st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
	st.markdown("""
	<div class="section">
	<ul>
	<li><a class="link" href="https://huggingface.co/datasets/Jean-Baptiste/wikiner_fr" target="_blank" rel="noopener">CamemBERT WikiNER Dataset</a></li>
	<li><a class="link" href="https://sparknlp.org/2022/09/23/camembert_base_token_classifier_wikiner_en.html" target="_blank" rel="noopener">CamemBERT Token Classification on Spark NLP Hub</a></li>
	</ul>
	</div>
	""", unsafe_allow_html=True)