Update README.md
README.md
CHANGED
@@ -8,9 +8,9 @@ tags:
 datasets:
 - mc4
 - wikipedia
-pipeline_tag: fill-mask
 widget:
 - text: "Moikka olen <mask> kielimalli."
+
 ---
 
 # RoBERTa large model for Finnish
@@ -105,7 +105,7 @@ neutral. Therefore, the model can have biased predictions.
 ## Training data
 
 This Finnish RoBERTa model was pretrained on the combination of five datasets:
-- [mc4](https://huggingface.co/datasets/mc4), the dataset mC4 is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
+- [mc4](https://huggingface.co/datasets/mc4), the dataset mC4 is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of the mC4 dataset
 - [wikipedia](https://huggingface.co/datasets/wikipedia) We used the Finnish subset of the wikipedia (August 2021) dataset
 - [Yle Finnish News Archive](http://urn.fi/urn:nbn:fi:lb-2017070501)
 - [Finnish News Agency Archive (STT)](http://urn.fi/urn:nbn:fi:lb-2018121001)