update readme
README.md (changed)
@@ -14,13 +14,6 @@ Given the current market price of H100 GPU hours, training the model only costs
 To our surprise, JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B despite the lower training cost and computation.
 Compared to a model with similar training and inference computation, JetMoE-8B achieves significantly better performance than Gemma-2B.
 
-<figure>
-<center>
-<img src="images/jetmoe_architecture.png" width="40%">
-<figcaption>JetMoE Architecture</figcaption>
-</center>
-</figure>
-
 ## Evaluation Results
 |Model|Active Params|Training Tokens|ARC-challenge|Hellaswag|MMLU|TruthfulQA|WinoGrande|GSM8k|Open LLM Leaderboard Average|MBPP|HumanEval|
 |---|---|---|---|---|---|---|---|---|---|---|---|
@@ -57,6 +50,13 @@ Each MoA and MoE layer has 8 experts, and 2 experts are activated for each input
 It has 8 billion parameters in total and 2.2B active parameters.
 JetMoE-8B is trained on 1.25T tokens from publicly available datasets, with a learning rate of 5.0 x 10<sup>-4</sup> and a global batch size of 4M tokens.
 
+<figure>
+<center>
+<img src="images/jetmoe_architecture.png" width="40%">
+<figcaption>JetMoE Architecture</figcaption>
+</center>
+</figure>
+
 **Input** Models input text only.
 
 **Output** Models generate text only.
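
The hunk above describes JetMoE-8B's sparse layers: each MoA/MoE layer holds 8 experts and activates 2 of them per input, which is why only 2.2B of the 8B total parameters are active. A minimal top-2 routing sketch in PyTorch, with illustrative layer sizes that are assumptions rather than JetMoE's actual dimensions:

```python
# Toy top-2-of-8 mixture-of-experts layer (illustrative only, not JetMoE's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden_size=1024, ffn_size=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (num_tokens, hidden_size)
        scores = self.router(x)                     # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, -1)  # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)        # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens whose slot routes to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 5 tokens through the layer; only 2 of the 8 experts run per token.
layer = Top2MoE()
print(layer(torch.randn(5, 1024)).shape)  # torch.Size([5, 1024])
```

This sparse activation is what lets the README compare JetMoE-8B's inference computation to that of a model like Gemma-2B despite the 8B total parameter count.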
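
Since the updated README notes the model takes text in and generates text out, a minimal usage sketch follows; the Hub repo id `jetmoe/jetmoe-8b` and the `trust_remote_code=True` flag are assumptions, not confirmed by this diff.

```python
# Minimal text-in / text-out sketch with Hugging Face transformers.
# The repo id below is an assumption; substitute the actual checkpoint location.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = AutoModelForCausalLM.from_pretrained("jetmoe/jetmoe-8b", trust_remote_code=True)

inputs = tokenizer("The JetMoE architecture", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```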