Update README.md
README.md CHANGED
@@ -83,7 +83,43 @@ CLUECorpus2020 and CLUECorpusSmall are used as training corpus.
## Training procedure
Models are pre-trained with [UER-py](https://github.com/dbiir/UER-py/) on Tencent Cloud TI-ONE. We first pre-train for 1,000,000 steps with a sequence length of 128, and then for an additional 250,000 steps with a sequence length of 512. Each stage consists of a preprocessing step followed by a pre-training step:
```
python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpus_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 \
                      --dynamic_masking --target mlm
```
```
python3 pretrain.py --dataset_path cluecorpus_seq128_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert_tiny_config.json \
                    --output_model_path models/cluecorpus_roberta_tiny_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --tie_weights --encoder bert --target mlm
```
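The command above assumes a node with eight GPUs. For a quick smoke test on a single GPU, a minimal sketch of the same stage (our variant, not part of the original card; it assumes UER-py's `--world_size`/`--gpu_ranks` flags accept a single rank):

```
# Single-GPU variant (assumption). The global batch size drops from
# 8 x 64 to 64, so results will not match the 8-GPU run.
python3 pretrain.py --dataset_path cluecorpus_seq128_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert_tiny_config.json \
                    --output_model_path models/cluecorpus_roberta_tiny_seq128_model.bin \
                    --world_size 1 --gpu_ranks 0 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --tie_weights --encoder bert --target mlm
```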
```
python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpus_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --target mlm
```
```
python3 pretrain.py --dataset_path cluecorpus_seq512_dataset.pt \
                    --pretrained_model_path models/cluecorpus_roberta_tiny_seq128_model.bin-1000000 \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert_tiny_config.json \
                    --output_model_path models/cluecorpus_roberta_tiny_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --tie_weights --encoder bert --target mlm
```
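Before release, the UER-py checkpoint is typically converted to Hugging Face format. A sketch using UER-py's `scripts/convert_bert_from_uer_to_huggingface.py`; the checkpoint suffix (`-250000`) and the `--layers_num` value are assumptions that must match your run and `bert_tiny_config.json`:

```
# The -250000 suffix and --layers_num 2 are assumed; adjust to your
# final checkpoint name and the layer count in your config.
python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpus_roberta_tiny_seq512_model.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 2 --target mlm
```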
### BibTeX entry and citation info