scampion
/

piiranha

@@ -1,58 +1,101 @@
 ---
-library_name: transformers
-license: apache-2.0
-base_model: answerdotai/ModernBERT-base
-tags:
-- generated_from_trainer
 metrics:
-- precision
 - recall
 - f1
-- accuracy
-model-index:
-- name: piiranha
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# piiranha
-This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.001
-- Precision: 0.9212
-- Recall: 0.9272
-- F1: 0.9242
-- Accuracy: 0.9953
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 5e-05
-- train_batch_size: 32
-- eval_batch_size: 32
-- seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-06 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- num_epochs: 4
-### Training results
 | Epoch | Training Loss | Validation Loss | Precision | Recall | F1    | Accuracy |
 |-------|---------------|-----------------|-----------|--------|-------|----------|
@@ -61,9 +104,34 @@ The following hyperparameters were used during training:
 | 3     | 0.005000      | 0.015703        | 0.919432  | 0.928394 | 0.923892 | 0.995136 |
 | 4     | 0.001000      | 0.022899        | 0.921234  | 0.927212 | 0.924213 | 0.995267 |
-### Framework versions
-- Transformers 4.48.2
-- Pytorch 2.5.1+cu124
-- Datasets 3.2.0
-- Tokenizers 0.21.0

 ---
+datasets:
+- ai4privacy/pii-masking-400k
 metrics:
+- accuracy
 - recall
+- precision
 - f1
+base_model:
+- answerdotai/ModernBERT-base
+pipeline_tag: token-classification
+tags:
+- pii
+- privacy
+- personal
+- identification
 ---
+# 🐟 PII-RANHA: Privacy-Preserving Token Classification Model
+## Overview
+PII-RANHA is a token classification model fine-tuned from **ModernBERT-base** (Answer.AI). It identifies and classifies Personally Identifiable Information (PII) in text. The model is trained on the `ai4privacy/pii-masking-400k` dataset and detects 17 PII categories, including account numbers, credit card numbers, and email addresses.
+This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.
+## Model Details
+### Model Architecture
+- **Base Model**: `answerdotai/ModernBERT-base`
+- **Task**: Token Classification
+- **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens)
+## Usage
+### Installation
+To use the model, ensure you have the `transformers` and `datasets` libraries installed:
+```bash
+pip install transformers datasets
+```
+### Inference Example
+Here’s how to load and use the model for PII detection:
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+from transformers import pipeline
+# Load the model and tokenizer
+model_name = "scampion/piiranha"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+# Create a token classification pipeline
+pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)
+# Example input
+text = "My email is [email protected] and my phone number is 555-123-4567."
+# Detect PII
+results = pii_pipeline(text)
+for entity in results:
+    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
+```
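+Example output, one line per sub-word token (`Ġ` is the byte-level BPE marker for a leading space):
+```text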
+Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
+Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
+Entity: ., Label: I-USERNAME, Score: 0.5871
+Entity: do, Label: I-USERNAME, Score: 0.5350
+Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
+Entity: -, Label: I-SOCIALNUM, Score: 0.5948
+Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
+Entity: -, Label: I-SOCIALNUM, Score: 0.6151
+Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
+Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
+```
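
The raw output above is per token, so redaction needs character spans instead. The sketch below assumes entity dictionaries in the shape returned by `pipeline("token-classification", ..., aggregation_strategy="simple")`, with `entity_group`, `score`, `start`, and `end` keys. The `redact` helper, its `min_score` threshold, and the sample span are hypothetical illustrations, not part of the model or its real output.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict], min_score: float = 0.5) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    `min_score` is a hypothetical cutoff for dropping low-confidence
    predictions; tune it on your own data. Overlapping spans are not handled.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        label = ent.get("entity_group") or ent.get("entity", "PII")
        text = text[: ent["start"]] + f"[{label}]" + text[ent["end"]:]
    return text


# Hand-written span for illustration only; this is NOT real model output.
sample_text = "Call me at 555-123-4567 tomorrow."
sample_entities = [
    {"entity_group": "TELEPHONENUM", "score": 0.99,
     "start": 11, "end": 23, "word": "555-123-4567"},
]
print(redact(sample_text, sample_entities))
```

To use it with the model, pass the list returned by a pipeline created with `aggregation_strategy="simple"` as `entities`.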
+## Training Details
+### Dataset
+The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains roughly 400,000 text examples with annotated PII tokens.
+### Training Configuration
+- **Batch Size:** 32
+- **Learning Rate:** 5e-5
+- **Epochs:** 4
+- **Optimizer:** AdamW
+- **Weight Decay:** 0.01
+- **Scheduler:** Linear learning rate scheduler
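
As a rough illustration of the linear scheduler listed above, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup (the card does not state a warmup setting) and uses a hypothetical total of 1,000 steps; the real step count depends on dataset size and batch size. The function name `linear_lr` is made up for this example.

```python
def linear_lr(step: int, total_steps: int,
              base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Linear warmup (optional) followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps=1000 is a hypothetical value chosen for illustration.
for step in (0, 250, 500, 750, 1000):
    print(f"step {step:4d}: lr = {linear_lr(step, total_steps=1000):.2e}")
```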
+### Evaluation Metrics
+The model was evaluated using the following metrics:
+- Precision
+- Recall
+- F1 Score
+- Accuracy
 | Epoch | Training Loss | Validation Loss | Precision | Recall | F1    | Accuracy |
 |-------|---------------|-----------------|-----------|--------|-------|----------|
 | 3     | 0.005000      | 0.015703        | 0.919432  | 0.928394 | 0.923892 | 0.995136 |
 | 4     | 0.001000      | 0.022899        | 0.921234  | 0.927212 | 0.924213 | 0.995267 |
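
The F1 column can be sanity-checked against the precision and recall columns. The sketch below assumes F1 is the harmonic mean of the two (the usual definition, though the card does not state how it was computed) and uses the rounded values from the two table rows shown, so small differences in the last digit are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table rows for epochs 3 and 4.
rows = {
    3: (0.919432, 0.928394, 0.923892),
    4: (0.921234, 0.927212, 0.924213),
}
for epoch, (p, r, reported) in rows.items():
    print(f"epoch {epoch}: computed F1 = {f1_score(p, r):.6f}, "
          f"reported = {reported:.6f}")
```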
+## License
+This model is licensed under the Apache License 2.0 with the Commons Clause condition. For details, see the Commons Clause website.
+For other licensing terms, contact the author.
+## Author
+- **Name:** Sébastien Campion
+- **Email:** [email protected]
+- **Date:** 2025-01-30
+- **Version:** 0.1
+## Citation
+If you use this model in your work, please cite it as follows:
+```bibtex
+@misc{piiranha2025,
+  author = {Sébastien Campion},
+  title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
+  year = {2025},
+  version = {0.1},
+  url = {https://huggingface.co/scampion/piiranha},
+}
+```
+## Disclaimer
+This model is provided "as-is" without any guarantees of performance or suitability for specific use cases.
+Always evaluate the model's performance in your specific context before deployment.