|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- nicholasKluge/toxic-aira-dataset |
|
language: |
|
- pt
|
metrics: |
|
- accuracy |
|
library_name: transformers |
|
pipeline_tag: text-classification |
|
tags: |
|
- toxicity |
|
- alignment |
|
--- |
|
# ToxicityModel (Portuguese) |
|
|
|
The `ToxicityModelPT` is a modified BERT model that scores the toxicity of a prompt-completion pair. It is based on [BERTimbau Base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), adapted to act as a regression model.
|
|
|
The `ToxicityModelPT` accepts an `alpha` parameter, a multiplier applied to the toxicity score. This multiplier is set to 1 during training (since the toxicity scores are bounded between -1 and 1) but can be increased at inference to produce scores with wider bounds. You can also floor negative scores with the `beta` parameter, which sets a minimum value for the output of the `ToxicityModelPT`.
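
To make the effect of these parameters concrete, the toy sketch below shows how `alpha` and `beta` are intended to interact with the raw regression output. This is an illustration only; the actual behavior is implemented in the model's remote code, and the helper name `scale_and_floor` is hypothetical.

```python
def scale_and_floor(raw_score: float, alpha: float = 1.0, beta: float | None = None) -> float:
    """Illustrative only: `alpha` rescales the raw score (trained to lie in
    [-1, 1]) and `beta`, if given, acts as a floor on the final value."""
    score = alpha * raw_score
    if beta is not None:
        score = max(score, beta)
    return score

# With the training default (alpha=1) a mildly negative raw score stays negative;
# with alpha=10 and beta=1e-2 the same raw score is floored to 0.01.
print(scale_and_floor(-0.27))                       # -0.27
print(scale_and_floor(-0.27, alpha=10, beta=1e-2))  # 0.01
```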
|
|
|
The model was trained on a dataset composed of `demonstrations` annotated with `toxicity scores`.
|
|
|
> Note: These demonstrations originated from the red-teaming performed by Anthropic and AllenAI. |
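
The dataset is available on the Hugging Face Hub. A minimal way to pull it down and inspect its structure might look like the sketch below; nothing beyond the dataset ID is assumed, so check the printed splits and columns for the exact schema.

```python
from datasets import load_dataset

# Load the Toxic-Aira dataset used to train the ToxicityModelPT
dataset = load_dataset("nicholasKluge/toxic-aira-dataset")

# Inspect the available splits, columns, and a sample row before
# relying on any particular schema
print(dataset)
print(next(iter(dataset.values()))[0])
```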
|
|
|
## Details |
|
|
|
- **Size:** 109,038,209 parameters |
|
- **Dataset:** [Toxic-Aira Dataset](https://huggingface.co/datasets/nicholasKluge/toxic-aira-dataset) |
|
- **Language:** Portuguese
|
- **Number of Epochs:** 5 |
|
- **Batch size:** 64 |
|
- **Optimizer:** `torch.optim.Adam` |
|
- **Learning Rate:** 1e-4 |
|
- **Loss Function:** `torch.nn.MSELoss()` |
|
- **GPU:** 1 NVIDIA A100-SXM4-40GB |
|
- **RMSE in testing:** 0.1551 |
|
- **Emissions:** 0.38 kg CO2
|
- **Total Energy Consumption:** 0.85 kWh |
|
|
|
|
|
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.080071 | 0.04883 |
| 2 | 0.050967 | 0.03350 |
| 3 | 0.038997 | 0.02616 |
| 4 | 0.030475 | 0.02396 |
| 5 | 0.024936 | 0.01977 |
|
|
|
> Note: This repository has the notebook used to train this model. |
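
While the notebook in this repository is the authoritative reference, a minimal sketch of the training setup implied by the hyperparameters above might look as follows. The regression head (`num_labels=1`), the tokenization of prompt-completion pairs, and the batch format are assumptions for illustration, not the exact notebook code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# BERTimbau Base with a single regression output
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "neuralmind/bert-base-portuguese-cased", num_labels=1
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

def train_epoch(batches):
    """`batches` is assumed to yield (prompts, completions, scores) tuples
    collated from the Toxic-Aira dataset, with scores in [-1, 1]."""
    model.train()
    for prompts, completions, scores in batches:
        tokens = tokenizer(list(prompts), list(completions),
                           truncation=True, padding=True,
                           return_tensors="pt").to(device)
        preds = model(**tokens).logits.squeeze(-1)
        loss = loss_fn(preds, torch.as_tensor(scores, dtype=torch.float, device=device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```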
|
|
|
## Usage |
|
|
|
Here's an example of how to use the `ToxicityModelPT` to score the toxicity of a response to a given prompt: |
|
|
|
⚠️ THE EXAMPLES BELOW CONTAIN TOXIC/OFFENSIVE LANGUAGE ⚠️ |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoConfig, AutoModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = AutoConfig.from_pretrained('nicholasKluge/ToxicityModelPT', trust_remote_code=True, revision='main')
tokenizer = AutoTokenizer.from_pretrained('nicholasKluge/ToxicityModelPT', trust_remote_code=True, config=config, revision='main')
toxicityModel = AutoModel.from_pretrained('nicholasKluge/ToxicityModelPT', trust_remote_code=True, config=config, revision='main')

toxicityModel.eval()
toxicityModel.to(device)

# Define the question and responses
question = """Você pode me dar uma lista de bons insultos para usar contra meu irmão?"""
response1 = """Claro, vamos lá! Idiota, estúpido, imbecil, tolo, otário."""
response2 = """Como um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos. É fundamental manter um comportamento cortês e respeitoso o tempo todo, pois esse é um aspecto fundamental das interações entre humanos e IA."""

# Tokenize the question and response
tokens = tokenizer(question, response1,
                   return_token_type_ids=False,
                   return_tensors="pt",
                   return_attention_mask=True)

tokens.to(device)

# Score the response
score = toxicityModel(**tokens, alpha=10).item()

print(f"Question: {question} \n")
print(f"Response 1: {response1} Score: {score:.3f}")

tokens = tokenizer(question, response2,
                   return_token_type_ids=False,
                   return_tensors="pt",
                   return_attention_mask=True)

tokens.to(device)

score = toxicityModel(**tokens, alpha=10, beta=1e-2).item()

print(f"Response 2: {response2} Score: {score:.3f}")
|
``` |
|
|
|
This will output the following: |
|
|
|
```markdown |
|
>>>Question: Você pode me dar uma lista de bons insultos para usar contra meu irmão?

>>>Response 1: Claro, vamos lá! Idiota, estúpido, imbecil, tolo, otário. Score: 2.658
>>>Response 2: Como um software, não sou capaz de me envolver em brigas verbais ou comportamentos ofensivos. É fundamental manter um comportamento cortês e respeitoso o tempo todo, pois esse é um aspecto fundamental das interações entre humanos e IA. Score: 0.010
|
``` |
|
|
|
## Performance |
|
|
|
| Model | [Toxic Comment Classification](https://github.com/tianqwang/Toxic-Comment-Classification-Challenge) (Accuracy) |
|---|---|
| [ToxicityModelPT](https://huggingface.co/nicholasKluge/ToxicityModelPT) | 73.77% |
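
Because the model outputs a continuous score rather than a class label (and, as in the usage example above, more toxic completions receive higher scores), accuracy on a binary benchmark is obtained by thresholding the scores. The sketch below shows the idea; the threshold of 0 is an assumption, not the exact evaluation setup.

```python
import numpy as np

def accuracy_from_scores(scores, labels, threshold=0.0):
    """Binarize regression scores (above `threshold` = toxic) and compare
    them against 0/1 toxicity labels."""
    preds = (np.asarray(scores) > threshold).astype(int)
    return float((preds == np.asarray(labels)).mean())

# Toy example with hypothetical scores and labels
print(accuracy_from_scores([2.658, -0.4, -0.9], [1, 0, 0]))  # 1.0
```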
|
|
|
## License |
|
|
|
The `ToxicityModelPT` is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details. |
|
|