Update README.md
README.md CHANGED
```diff
@@ -140,12 +140,12 @@ Tülu3 is designed for state-of-the-art performance on a diversity of tasks in a
 | **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
 | **Reward Model (RM)** | [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |
 
-
 | **Stage** | **Llama 3.1 405B** |
 |-----------|-------------------|
 | **Base Model** | [meta-llama/llama-3.1-405B](https://huggingface.co/meta-llama/llama-3.1-405B) |
 | **SFT** | [allenai/llama-3.1-Tulu-3-405B-SFT](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B-SFT) |
 | **Final Model (DPO)** | [allenai/llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B) |
+| **Reward Model (RM)** | (Same as 8B) |
 
 
 ## Using the model
```
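The checkpoints in the tables above load like any Llama 3.1 chat model. A minimal usage sketch with Hugging Face `transformers` (the model ID comes from the table; the device and dtype settings are illustrative assumptions, not pinned by the README):

```python
# Minimal sketch: load a Tülu 3 checkpoint from the table above.
# Device/dtype choices are illustrative; adjust for your hardware.
import torch
import transformers

pipe = transformers.pipeline(
    "text-generation",
    model="allenai/Llama-3.1-Tulu-3-8B",  # any SFT/DPO/RLVR checkpoint above
    device_map="auto",                    # shard across available GPUs
    torch_dtype=torch.bfloat16,
)
messages = [{"role": "user", "content": "What is Tülu 3?"}]
out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])  # assistant reply
```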
```diff
@@ -230,6 +230,20 @@ See the Falcon 180B model card for an example of this.
 | **IFEval (prompt loose)** | 82.1 | 82.6 | 83.2 | **88.0** | 87.6 | 76.0 | 79.9 |
 | **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
 | **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |
+| Benchmark (eval) | Tülu 3 405B SFT | Tülu 3 405B DPO | Tülu 3 405B | Llama 3.1 405B Instruct | Nous Hermes 3 405B | Deepseek V3 | GPT 4o (11-24) |
+|-----------------|----------------|----------------|-------------|------------------------|-------------------|-------------|----------------|
+| **Avg w/o Safety** | 76.3 | 79.0 | 80.0 | 78.1 | 74.4 | 79.0 | **80.5** |
+| **Avg w/ Safety** | 77.5 | 79.6 | 80.7 | 79.0 | 73.5 | 75.9 | **81.6** |
+| **MMLU (5 shot, CoT)** | 84.4 | 86.6 | 87.0 | **88.0** | 84.9 | 82.1 | 87.9 |
+| **PopQA (3 shot)** | **55.7** | 55.4 | 55.5 | 52.9 | 54.2 | 44.9 | 53.6 |
+| **BigBenchHard (0 shot, CoT)** | 88.0 | 88.8 | 88.6 | 87.1 | 87.7 | **89.5** | 83.3 |
+| **MATH (4 shot, Flex)** | 63.4 | 59.9 | 67.3 | 66.6 | 58.4 | **72.5** | 68.8 |
+| **GSM8K (8 shot, CoT)** | 93.6 | 94.2 | **95.5** | 95.4 | 92.7 | 94.1 | 91.7 |
+| **HumanEval (pass@10)** | 95.7 | **97.2** | 95.9 | 95.9 | 92.3 | 94.6 | 97.0 |
+| **HumanEval+ (pass@10)** | 93.3 | **93.9** | 92.9 | 90.3 | 86.9 | 91.6 | 92.7 |
+| **IFEval (prompt loose)** | 82.4 | 85.0 | 86.0 | **88.4** | 81.9 | 88.0 | 84.8 |
+| **AlpacaEval 2 (LC % win)** | 30.4 | 49.8 | 51.4 | 38.5 | 30.2 | 53.5 | **65.0** |
+| **Safety (6 task avg.)** | 87.7 | 85.5 | 86.7 | 86.8 | 65.8 | 72.2 | **90.9** |
 
 
 ## Hyperparameters
```
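One sanity check on the new 405B table: the two average rows are consistent with the per-benchmark rows, assuming "Avg w/o Safety" is the unweighted mean of the nine capability benchmarks and "Avg w/ Safety" folds in the safety average as a tenth score. Recomputing for the Tülu 3 405B column:

```python
# Recompute the averages for the Tülu 3 405B column of the table above.
# Order: MMLU, PopQA, BBH, MATH, GSM8K, HumanEval, HumanEval+, IFEval, AlpacaEval 2.
capability = [87.0, 55.5, 88.6, 67.3, 95.5, 95.9, 92.9, 86.0, 51.4]
safety = 86.7

avg_wo_safety = sum(capability) / len(capability)
avg_w_safety = (sum(capability) + safety) / (len(capability) + 1)
print(round(avg_wo_safety, 1), round(avg_w_safety, 1))  # 80.0 80.7, matching the table
```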
```diff
@@ -245,13 +259,13 @@ PPO settings for RLVR:
 - **Gradient Norm Threshold**: 1.0
 - **Learning Rate Schedule**: Linear
 - **Generation Temperature**: 1.0
-- **Batch Size (effective)**:
+- **Batch Size (effective)**: 224
 - **Max Token Length**: 2,048
 - **Max Prompt Token Length**: 2,048
 - **Penalty Reward Value for Responses without an EOS Token**: -10.0
-- **Response Length**:
+- **Response Length**: 2,048
 - **Total Episodes**: 100,000
-- **KL penalty coefficient (beta)**:
+- **KL penalty coefficient (beta)**: 0.05
 - **Warm up ratio (omega)**: 0.0
 
 ## License and use
```
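The EOS penalty and KL coefficient filled in above combine in the usual RLVR-style shaped reward: the verifiable reward is paid only for complete responses, a flat -10.0 replaces it when generation ends without an EOS token, and the KL divergence from the reference policy is subtracted with beta = 0.05. A hypothetical sketch of that shaping (function and argument names are illustrative, not taken from the Tülu 3 training code; only the two constants come from the list above):

```python
# Hypothetical RLVR reward shaping using the hyperparameters listed above.
EOS_PENALTY = -10.0  # "Penalty Reward Value for Responses without an EOS Token"
KL_BETA = 0.05       # "KL penalty coefficient (beta)"

def shaped_reward(response_ids, eos_token_id, verifiable_reward, kl_to_ref):
    """Score one sampled response for PPO.

    verifiable_reward: output of the task verifier (e.g. an exact-match check).
    kl_to_ref: summed KL divergence of the policy from the reference model.
    """
    if eos_token_id not in response_ids:
        # Truncated generations get the flat penalty instead of the verifier score.
        return EOS_PENALTY - KL_BETA * kl_to_ref
    return verifiable_reward - KL_BETA * kl_to_ref
```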