scampion committed on
Commit
0d8f8be
·
verified ·
1 Parent(s): 4a4b06e
Files changed (1)
  1. README.md +110 -42
README.md CHANGED
@@ -1,58 +1,101 @@
1
  ---
2
- library_name: transformers
3
- license: apache-2.0
4
- base_model: answerdotai/ModernBERT-base
5
- tags:
6
- - generated_from_trainer
7
  metrics:
8
- - precision
9
  - recall
 
10
  - f1
11
- - accuracy
12
- model-index:
13
- - name: piiranha
14
- results: []
 
 
 
 
15
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
16
 
17
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
18
- should probably proofread and complete it, then remove this comment. -->
19
 
20
- # piiranha
21
 
22
- This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an unknown dataset.
23
- It achieves the following results on the evaluation set:
24
- - Loss: 0.001
25
- - Precision: 0.9212
26
- - Recall: 0.9272
27
- - F1: 0.9242
28
- - Accuracy: 0.9953
29
 
30
- ## Model description
 
 
31
 
32
- More information needed
 
33
 
34
- ## Intended uses & limitations
 
 
35
 
36
- More information needed
 
 
 
37
 
38
- ## Training and evaluation data
 
39
 
40
- More information needed
 
41
 
42
- ## Training procedure
 
 
 
43
 
44
- ### Training hyperparameters
45
 
46
- The following hyperparameters were used during training:
47
- - learning_rate: 5e-05
48
- - train_batch_size: 32
49
- - eval_batch_size: 32
50
- - seed: 42
51
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-06 and optimizer_args=No additional optimizer arguments
52
- - lr_scheduler_type: linear
53
- - num_epochs: 4
 
 
 
 
54
 
55
- ### Training results
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
56
 
57
  | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
58
  |-------|---------------|-----------------|-----------|--------|-------|----------|
@@ -61,9 +104,34 @@ The following hyperparameters were used during training:
61
  | 3 | 0.005000 | 0.015703 | 0.919432 | 0.928394 | 0.923892 | 0.995136 |
62
  | 4 | 0.001000 | 0.022899 | 0.921234 | 0.927212 | 0.924213 | 0.995267 |
63
 
64
- ### Framework versions
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
65
 
66
- - Transformers 4.48.2
67
- - Pytorch 2.5.1+cu124
68
- - Datasets 3.2.0
69
- - Tokenizers 0.21.0
 
1
  ---
2
+ datasets:
3
+ - ai4privacy/pii-masking-400k
 
 
 
4
  metrics:
5
+ - accuracy
6
  - recall
7
+ - precision
8
  - f1
9
+ base_model:
10
+ - answerdotai/ModernBERT-base
11
+ pipeline_tag: token-classification
12
+ tags:
13
+ - pii
14
+ - privacy
15
+ - personal
16
+ - identification
17
  ---
18
+ # 🐟 PII-RANHA: Privacy-Preserving Token Classification Model
19
+
20
+ ## Overview
21
+ PII-RANHA is a fine-tuned token classification model based on **ModernBERT-base** from Answer.AI. It is designed to identify and classify Personally Identifiable Information (PII) in text data. The model is trained on the `ai4privacy/pii-masking-400k` dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, email addresses, and more.
22
+
23
+ This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.
24
+
25
+ ## Model Details
26
+
27
+ ### Model Architecture
28
+ - **Base Model**: `answerdotai/ModernBERT-base`
29
+ - **Task**: Token Classification
30
+ - **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens)
31
 
 
 
32
 
33
+ ## Usage
34
 
35
+ ### Installation
36
+ To use the model, ensure you have the `transformers` and `datasets` libraries installed:
 
 
 
 
 
37
 
38
+ ```bash
39
+ pip install transformers datasets
40
+ ```
41
 
42
+ ### Inference Example
43
+ Here’s how to load and use the model for PII detection:
44
 
45
+ ```python
46
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
47
+ from transformers import pipeline
48
 
49
+ # Load the model and tokenizer
50
+ model_name = "scampion/piiranha"
51
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
52
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
53
 
54
+ # Create a token classification pipeline
55
+ pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)
56
 
57
+ # Example input
58
+ text = "My email is john.doe@example.com and my phone number is 555-123-4567."
59
 
60
+ # Detect PII
61
+ results = pii_pipeline(text)
62
+ for entity in results:
63
+     print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
64
 
65
+ ```
66
 
67
+ ```bash
68
+ Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
69
+ Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
70
+ Entity: ., Label: I-USERNAME, Score: 0.5871
71
+ Entity: do, Label: I-USERNAME, Score: 0.5350
72
+ Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
73
+ Entity: -, Label: I-SOCIALNUM, Score: 0.5948
74
+ Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
75
+ Entity: -, Label: I-SOCIALNUM, Score: 0.6151
76
+ Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
77
+ Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
78
+ ```
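The raw output above is reported per subword token (the `Ġ` prefix marks a leading space in the tokenizer's vocabulary), so neighbouring pieces of one value can carry different labels. Below is a minimal sketch of one way to post-process it. It assumes each result also carries the `start`/`end` character offsets that the `transformers` token-classification pipeline normally returns. Tokens whose offsets touch are merged into one span, the span takes the label with the highest summed score, and the span is masked. The helper names `merge_tokens` and `redact` and the sample predictions are hypothetical, not part of the model card and not real model output.

```python
from collections import defaultdict


def merge_tokens(tokens, max_gap=0):
    """Group token-level predictions into character spans.

    Tokens are merged when the next one starts at most `max_gap` characters
    after the previous one ends. Each span gets the label with the highest
    summed score among its tokens, with any B-/I- prefix stripped.
    """
    spans = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        if spans and tok["start"] - spans[-1]["end"] <= max_gap:
            span = spans[-1]
            span["end"] = max(span["end"], tok["end"])
        else:
            span = {"start": tok["start"], "end": tok["end"], "votes": defaultdict(float)}
            spans.append(span)
        label = tok["entity"].split("-", 1)[-1]
        span["votes"][label] += float(tok["score"])
    return [
        {"start": s["start"], "end": s["end"], "label": max(s["votes"], key=s["votes"].get)}
        for s in spans
    ]


def redact(text, spans):
    """Replace each span with a [LABEL] placeholder, working right to left."""
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: s["start"]] + f"[{s['label']}]" + text[s["end"]:]
    return text


# Hypothetical predictions in the pipeline's output format (not real model output).
text = "Call 555-123-4567 now"
tokens = [
    {"entity": "I-ACCOUNTNUM", "score": 0.84, "start": 5, "end": 8},
    {"entity": "I-SOCIALNUM", "score": 0.59, "start": 8, "end": 9},
    {"entity": "I-TELEPHONENUM", "score": 0.63, "start": 9, "end": 12},
    {"entity": "I-TELEPHONENUM", "score": 0.62, "start": 12, "end": 13},
    {"entity": "I-TELEPHONENUM", "score": 0.37, "start": 13, "end": 17},
]
spans = merge_tokens(tokens)
print(redact(text, spans))
```

The pipeline's own `aggregation_strategy` argument (for example `"simple"`) is an alternative that groups tokens for you. The manual version is shown only to make the merging rule explicit. Raising `max_gap` to 1 also merges tokens separated by a single space, at the cost of joining two distinct entities that sit next to each other.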
79
 
80
+ ## Training Details
81
+
82
+ ### Dataset
83
+ The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.
84
+
85
+ ### Training Configuration
86
+ - **Batch Size:** 32
87
+ - **Learning Rate:** 5e-5
88
+ - **Epochs:** 4
89
+ - **Optimizer:** AdamW
90
+ - **Weight Decay:** 0.01
91
+ - **Scheduler:** Linear learning rate scheduler
92
+
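To make the "linear learning rate scheduler" entry concrete, here is a small sketch of a linear decay from the listed learning rate of 5e-5 down to zero. The total step count is hypothetical, and zero warmup steps is an assumption, since the configuration above does not state either value. The function `linear_lr` is illustrative and is not the training code used for this model.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real value depends on dataset size and batch size
for step in (0, 250, 500, 750, 1000):
    print(step, f"{linear_lr(step, total_steps):.2e}")
```

With no warmup, the rate starts at the base value and reaches zero at the final step.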
93
+ ### Evaluation Metrics
94
+ The model was evaluated using the following metrics:
95
+ - Precision
96
+ - Recall
97
+ - F1 Score
98
+ - Accuracy
99
 
100
  | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
101
  |-------|---------------|-----------------|-----------|--------|-------|----------|
 
104
  | 3 | 0.005000 | 0.015703 | 0.919432 | 0.928394 | 0.923892 | 0.995136 |
105
  | 4 | 0.001000 | 0.022899 | 0.921234 | 0.927212 | 0.924213 | 0.995267 |
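As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does this for the two rows visible in this diff. It assumes the reported precision and recall are aggregated the same way as the reported F1, which the card does not state.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the table rows shown above
rows = [
    (3, 0.919432, 0.928394, 0.923892),
    (4, 0.921234, 0.927212, 0.924213),
]
for epoch, p, r, reported in rows:
    print(epoch, round(f1_score(p, r), 6), reported)
```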
106
 
108
+
109
+ ## License
110
+ This model is licensed under the Apache License 2.0 with the Commons Clause condition. For more details, see the Commons Clause website.
111
+ For other licensing terms, contact the author.
112
+
113
+ ## Author
114
+ Name: Sébastien Campion
115
+
116
117
+
118
+ Date: 2025-01-30
119
+
120
+ Version: 0.1
121
+
122
+ ## Citation
123
+ If you use this model in your work, please cite it as follows:
124
+
125
+ ```bibtex
126
+ @misc{piiranha2025,
127
+ author = {Sébastien Campion},
128
+ title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
129
+ year = {2025},
130
+ version = {0.1},
131
+ url = {https://huggingface.co/scampion/piiranha},
132
+ }
133
+ ```
134
 
135
+ ## Disclaimer
136
+ This model is provided "as-is" without any guarantees of performance or suitability for specific use cases.
137
+ Always evaluate the model's performance in your specific context before deployment.