izumilab committed
Commit 1084fe3 · 1 Parent(s): fbada5d

add financial corpus to Training Data section

Files changed (1): README.md (+9 −2)
README.md CHANGED
@@ -32,12 +32,19 @@ The model architecture is the same as BERT small in the [original ELECTRA paper]
 
 ## Training Data
 
-The models are trained on the Japanese version of Wikipedia.
+The models are trained on a Wikipedia corpus and a financial corpus.
 
-The training corpus is generated from the Japanese version of Wikipedia, using Wikipedia dump file as of June 1, 2021.
+The Wikipedia corpus is generated from the Japanese Wikipedia dump file as of June 1, 2021.
 
 The corpus file is 2.9GB, consisting of approximately 20M sentences.
 
+The financial corpus consists of two corpora:
+
+- Summaries of financial results from October 9, 2012, to December 31, 2020
+- Securities reports from February 8, 2018, to December 31, 2020
+
+The financial corpus file is 5.2GB, consisting of approximately 27M sentences.
+
 ## Tokenization
 
 The texts are first tokenized by MeCab with the IPA dictionary and then split into subwords by the WordPiece algorithm.
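
The Training Data section above measures both corpora in sentences. As a rough illustration only, a minimal sketch (an assumption, not the authors' preprocessing script) of how such a figure can be approximated for a plain-text corpus file, treating the Japanese full stop as a sentence boundary; the file name is hypothetical:

```python
def count_sentences(path: str) -> int:
    # Count sentence boundaries; each "。" is assumed to close one sentence.
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += line.count("。")
    return total

# Hypothetical file name. Per the figures above, the Wikipedia corpus
# should yield roughly 20M sentences and the financial corpus roughly 27M.
print(count_sentences("corpus.txt"))
```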
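
The MeCab + WordPiece pipeline described in the Tokenization section above corresponds to what `BertJapaneseTokenizer` in the Hugging Face `transformers` library implements. A minimal sketch, assuming the `fugashi` and `ipadic` packages are installed; the model ID is a placeholder for illustration, not confirmed by this diff:

```python
from transformers import BertJapaneseTokenizer

# MeCab performs morphological analysis (the IPA dictionary is the default),
# then the WordPiece algorithm splits each word into subwords.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "izumi-lab/electra-small-japanese-discriminator",  # assumed model ID
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
)

print(tokenizer.tokenize("決算短信は財務コーパスに含まれます。"))
```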