Update README.md
README.md CHANGED
@@ -1,4 +1,4 @@
-From scratch pretraining on english only no synthetic data, no code, 3 epochs of 1 gig of data for the ~
+From-scratch pretraining on English only: no synthetic data, no code, 3 epochs of ~1 GB of data for the ~135M param model.

 Test network using [Differential Transformer (Attention)](https://arxiv.org/abs/2410.05258). Other than some alterations to the attention, such as 16 heads instead of 9 and using differential attention, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct

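For context, the core idea of the differential attention referenced above is to replace the single softmax attention map with the difference of two softmax maps, scaled by a learnable λ. The sketch below is a minimal, illustrative PyTorch version under assumed settings (576-dim hidden size as in SmolLM2-135M, 16 heads, constant λ_init = 0.8); the names and shapes are assumptions, not this repo's actual code, and the paper's per-head RMSNorm and (1 − λ_init) output scaling are omitted for brevity.

```python
# Minimal sketch of differential attention (arXiv:2410.05258), not the repo's implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 16, lambda_init: float = 0.8):
        super().__init__()
        self.num_heads = num_heads
        # Each head's queries/keys are split into two halves (Q1/Q2, K1/K2).
        self.head_dim = dim // num_heads // 2
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        # Learnable reparameterization of lambda (dot products of small vectors),
        # lambda_init kept constant here instead of the paper's depth-dependent schedule.
        self.lambda_init = lambda_init
        self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, dim = x.shape
        h, d = self.num_heads, self.head_dim
        q = self.q_proj(x).view(b, t, h, 2, d).transpose(1, 2)   # (b, h, t, 2, d)
        k = self.k_proj(x).view(b, t, h, 2, d).transpose(1, 2)   # (b, h, t, 2, d)
        v = self.v_proj(x).view(b, t, h, 2 * d).transpose(1, 2)  # (b, h, t, 2d)

        q1, q2 = q[..., 0, :], q[..., 1, :]
        k1, k2 = k[..., 0, :], k[..., 1, :]
        scale = 1.0 / math.sqrt(d)
        # Causal mask: -inf above the diagonal so softmax ignores future tokens.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=x.device), diagonal=1)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale + mask, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale + mask, dim=-1)

        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)
        # Differential attention: difference of the two softmax maps applied to V.
        out = (a1 - lam * a2) @ v                                 # (b, h, t, 2d)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, dim))
```

A module like `DiffAttention(576, num_heads=16)` would slot into the attention position of a SmolLM2-style block, keeping the rest of the setup described above unchanged.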