izumi-lab/bert-small-japanese-fin

@@ -32,12 +32,18 @@ The model architecture is the same as BERT small in the [original ELECTRA paper]
 ## Training Data
-The models are trained on the Japanese version of Wikipedia.
-The training corpus is generated from the Japanese version of Wikipedia, using Wikipedia dump file as of June 1, 2021.
 The corpus file is 2.9GB, consisting of approximately 20M sentences.
 ## Tokenization
 The texts are first tokenized by MeCab with IPA dictionary and then split into subwords by the WordPiece algorithm.

 ## Training Data
+The models are trained on a Wikipedia corpus and a financial corpus.
+The Wikipedia corpus is generated from the Japanese Wikipedia dump file as of June 1, 2021.
 The corpus file is 2.9GB, consisting of approximately 20M sentences.
+The financial corpus consists of 2 corpora:
+- Summaries of financial results from October 9, 2012, to December 31, 2020
+- Securities reports from February 8, 2018, to December 31, 2020
+The financial corpus file is 5.2GB, consisting of approximately 27M sentences.
 ## Tokenization
 The texts are first tokenized by MeCab with IPA dictionary and then split into subwords by the WordPiece algorithm.
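
The second tokenization stage can be illustrated with a minimal sketch of greedy longest-match-first WordPiece splitting, the inference scheme commonly used by BERT tokenizers. Everything below is an assumption made for illustration: the function name `wordpiece`, the toy vocabulary, and the hand-written input words, which stand in for MeCab output rather than reproduce it. The model's real vocabulary and its exact tokenizer configuration are not shown here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one pre-tokenized word into subwords by greedy longest match.

    Pieces that do not start the word carry the "##" continuation prefix.
    If some position cannot be matched, the whole word becomes `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the model's actual vocabulary.
vocab = {"決算", "##短信", "有価", "##証券", "##報告", "##書"}

# Hand-written stand-ins for words produced by a morphological analyzer
# such as MeCab; real MeCab output may segment these differently.
words = ["決算短信", "有価証券報告書"]

tokens = [piece for word in words for piece in wordpiece(word, vocab)]
print(tokens)
```

In this sketch a word with no matching pieces collapses to a single `[UNK]` token, which is one reason a vocabulary built on in-domain financial text can matter for downstream coverage.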