---
language:
- ru
tags:
- toxic comments classification
licenses:
- cc-by-nc-sa
license: openrail++
base_model:
- DeepPavlov/rubert-base-cased-conversational
---
A BERT-based classifier (fine-tuned from [Conversational RuBERT](https://huggingface.co/DeepPavlov/rubert-base-cased-conversational)) trained on a merge of the Russian Language Toxic Comments [dataset](https://www.kaggle.com/blackmoon/russian-language-toxic-comments/metadata) collected from 2ch.hk and the Toxic Russian Comments [dataset](https://www.kaggle.com/alexandersemiletov/toxic-russian-comments) collected from ok.ru.
The datasets were merged, shuffled, and split into train, dev, and test sets in an 80-10-10 proportion.
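
The merge, shuffle, and 80-10-10 split described above can be sketched as follows. This is only an illustration: the function name `split_80_10_10`, the seed, and the toy data are assumptions, and the authors' actual splitting code is not part of this card.

```python
import random

def split_80_10_10(examples, seed=0):
    """Shuffle a list of examples and cut it into train/dev/test parts."""
    shuffled = list(examples)              # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed (an assumption) for reproducibility
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # the remainder goes to the test set
    return train, dev, test

# Hypothetical stand-in for the two merged datasets: (text, label) pairs.
merged = [(f"comment {i}", i % 2) for i in range(1000)]
train, dev, test = split_80_10_10(merged)
print(len(train), len(dev), len(test))
```
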
The metrics obtained on the test set are as follows:
| | precision | recall | f1-score | support |
|:------------:|:---------:|:------:|:--------:|:-------:|
| 0 | 0.98 | 0.99 | 0.98 | 21384 |
| 1 | 0.94 | 0.92 | 0.93 | 4886 |
| accuracy | | | 0.97 | 26270 |
| macro avg | 0.96 | 0.96 | 0.96 | 26270 |
| weighted avg | 0.97 | 0.97 | 0.97 | 26270 |
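
As a rough check on how the two average rows relate to the per-class rows, the sketch below recomputes them from the table. The macro average is the unweighted mean over classes, and the weighted average weights each class by its support. Because the inputs are the rounded values from the table, the results agree with the reported averages only up to rounding; the variable and function names are illustrative.

```python
# Per-class values copied from the table above (already rounded to two decimals).
per_class = {
    0: {"precision": 0.98, "recall": 0.99, "f1": 0.98, "support": 21384},
    1: {"precision": 0.94, "recall": 0.92, "f1": 0.93, "support": 4886},
}
total_support = sum(row["support"] for row in per_class.values())

def macro_avg(metric):
    """Unweighted mean of a metric over the classes."""
    return sum(row[metric] for row in per_class.values()) / len(per_class)

def weighted_avg(metric):
    """Mean of a metric over the classes, weighted by class support."""
    return sum(row[metric] * row["support"] for row in per_class.values()) / total_support

macro_f1 = macro_avg("f1")
weighted_f1 = weighted_avg("f1")
print(f"macro f1 = {macro_f1:.3f}, weighted f1 = {weighted_f1:.3f}")
```
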
## How to use
```python
from transformers import BertTokenizer, BertForSequenceClassification
# load tokenizer and model weights
tokenizer = BertTokenizer.from_pretrained('s-nlp/russian_toxicity_classifier')
model = BertForSequenceClassification.from_pretrained('s-nlp/russian_toxicity_classifier')
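# prepare the input ('ты супер' means "you are great")
batch = tokenizer.encode('ты супер', return_tensors='pt')
# inference: the returned object's .logits field holds one raw score per class
model(batch)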
```
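
The `model(batch)` call returns raw logits rather than probabilities. A minimal sketch of converting them with a softmax is shown below. It assumes that index 0 is the neutral class and index 1 the toxic class, matching the rows of the metrics table; this label order is an assumption and can be checked against `model.config.id2label`. The logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input, e.g. from model(batch).logits[0].tolist()
example_logits = [2.0, -1.0]
probs = softmax(example_logits)
toxic_probability = probs[1]  # assumes index 1 is the toxic class
print(f"P(toxic) = {toxic_probability:.3f}")
```
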
## Citation
To acknowledge our work, please use the corresponding citation:
```
@article{dementieva2022russe,
title={RUSSE-2022: Findings of the First Russian Detoxification Shared Task Based on Parallel Corpora},
author={Dementieva, Daryna and Logacheva, Varvara and Nikishina, Irina and Fenogenova, Alena and Dale, David and Krotova, Irina and Semenov, Nikita and Shavrina, Tatiana and Panchenko, Alexander}
}
```
## Licensing Information
This model is licensed under the OpenRAIL++ License, which supports the development of various technologies, both industrial and academic, that serve the public good.