Model Card for AIaLT-IICT/bert_bg_lit_web_large_uncased
Uncased BERT model trained on Bulgarian literature, Web, and other datasets.
Model Details
355M-parameter BERT model trained on 19B tokens (23B depending on tokenization) for 3 epochs with a Masked Language Modelling objective.
- Tokenizer vocabulary size: 50176
- Hidden dimension: 1024
- Feed-forward dimension: 4096
- Hidden layers: 24
These values can also be read from the model configuration, as sketched at the end of this section.
Developed by: Artificial Intelligence and Language Technologies Department at the Institute of Information and Communication Technologies - Bulgarian Academy of Sciences.
Funded by: The model was pretrained within CLaDA-BG: National Interdisciplinary Research E-Infrastructure for Bulgarian Language and Cultural Heritage, a member of the pan-European research consortia CLARIN-ERIC and DARIAH-ERIC, funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure). The training was performed on the HEMUS supercomputer at IICT-BAS, part of the research infrastructures of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020) and co-financed by the European Union through the ESIF.
Model type: BERT
Language(s) (NLP): Bulgarian.
License: MIT
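The architecture details above can be checked against the model configuration. A minimal sketch; the attribute names are the standard BertConfig fields:

from transformers import AutoConfig

config = AutoConfig.from_pretrained('AIaLT-IICT/bert_bg_lit_web_large_uncased')
print(config.vocab_size)         # expected: 50176
print(config.hidden_size)        # expected: 1024
print(config.intermediate_size)  # expected: 4096
print(config.num_hidden_layers)  # expected: 24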
Uses
The model is intended to be used as a base model for fine-tuning tasks in NLP.
Direct Use
>>> from transformers import (
...     PreTrainedTokenizerFast,
...     BertForMaskedLM,
...     pipeline
... )
>>> model = BertForMaskedLM.from_pretrained('AIaLT-IICT/bert_bg_lit_web_large_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/bert_bg_lit_web_large_uncased')
>>> fill_mask = pipeline(
...     "fill-mask",
...     model=model,
...     tokenizer=tokenizer
... )
>>> fill_mask("Заради 3 завода няма да [MASK] нито есенниците неподхранени, нито зърното да поскъпне заради тях.")
[{'score': 0.5374130606651306,
'token': 17875,
'token_str': 'останат',
'sequence': 'заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.23659224808216095,
'token': 15017,
'token_str': 'остави',
'sequence': 'заради 3 завода няма да остави нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.07789137959480286,
'token': 26913,
'token_str': 'оставим',
'sequence': 'заради 3 завода няма да оставим нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.06473446637392044,
'token': 12941,
'token_str': 'има',
'sequence': 'заради 3 завода няма да има нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.029545966535806656,
'token': 22988,
'token_str': 'оставят',
'sequence': 'заради 3 завода няма да оставят нито есенниците неподхранени, нито зърното да поскъпне заради тях.'}]
Out-of-Scope Use
The model is not trained with a Next Sentence Prediction objective, so the [CLS] token embedding will not be useful out of the box. If you want to use the model for sequence classification, it is recommended to fine-tune it first, as sketched below.
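A minimal fine-tuning sketch with the Hugging Face Trainer; the toy dataset, label count, output directory, and hyperparameters are placeholders for illustration, not the authors' recipe.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained('AIaLT-IICT/bert_bg_lit_web_large_uncased')
model = BertForSequenceClassification.from_pretrained(
    'AIaLT-IICT/bert_bg_lit_web_large_uncased',
    num_labels=2,  # placeholder: set to the number of classes in your task
)

# Toy dataset used only to make the sketch runnable.
train_data = Dataset.from_dict({'text': ['пример едно', 'пример две'], 'labels': [0, 1]})
train_data = train_data.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='bert_bg_seq_cls', num_train_epochs=1),
    train_dataset=train_data,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()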
Recommendations
It is recommended to use the model for token classification and sequence classification fine-tuning tasks. The model can also be used within the SentenceTransformers framework for producing embeddings, as sketched below.
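For embeddings, the encoder can be wrapped in a SentenceTransformers model, for instance with mean pooling; the pooling choice here is an assumption, not something specified by this card, and fine-tuning the wrapped model will usually give better sentence embeddings.

from sentence_transformers import SentenceTransformer, models

# Mean pooling over token embeddings; a common but arbitrary choice.
word_embedding = models.Transformer('AIaLT-IICT/bert_bg_lit_web_large_uncased', max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode='mean')
embedder = SentenceTransformer(modules=[word_embedding, pooling])

embeddings = embedder.encode(['Примерно изречение на български.'])
print(embeddings.shape)  # (1, 1024)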
Training Details
Training Data
Trained on 19B tokens, consisting mainly of:
- uonlp/CulturaX
- MaCoCu-bg 2.0
- Literature
- others
Training Procedure
Trained with a Masked Language Modelling objective (20% of tokens masked) for 3 epochs, with tf32 mixed precision, a 512-token context window, and a batch size of 256*512 tokens.
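As an illustration only, the 20% masking rate corresponds to a data collator configured as below; this is a sketch of the masking setup, not the authors' full training script.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('AIaLT-IICT/bert_bg_lit_web_large_uncased')

# 20% of tokens are selected for masking, instead of the default 15%.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,
)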
Evaluation
The model is evaluated on the Masked Language Modelling objective on a test split with 20% of tokens randomly masked. It achieves a test loss of 1.14 and a test accuracy of 74.55%.
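A minimal sketch of how accuracy over masked tokens can be computed from the logits and the collator labels, assuming the standard -100 ignore-index convention; the exact evaluation script is not reproduced here.

import torch

def masked_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    # Only positions selected for masking carry a real label; the rest are -100.
    mask = labels != -100
    preds = logits.argmax(dim=-1)
    return (preds[mask] == labels[mask]).float().mean().item()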
Model Card Authors
Nikolay Paev