vwxyzjn committed
Commit fd8237d · verified · 1 Parent(s): fe0a228

Update README.md

Files changed (1): README.md (+18 −4)

README.md CHANGED
@@ -140,12 +140,12 @@ Tülu3 is designed for state-of-the-art performance on a diversity of tasks in a
 | **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
 | **Reward Model (RM)** | [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |
 
-
 | **Stage** | **Llama 3.1 405B** |
 |-----------|-------------------|
 | **Base Model** | [meta-llama/llama-3.1-405B](https://huggingface.co/meta-llama/llama-3.1-405B) |
 | **SFT** | [allenai/llama-3.1-Tulu-3-405B-SFT](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B-SFT) |
 | **Final Model (DPO)** | [allenai/llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/llama-3.1-Tulu-3-405B) |
+| **Reward Model (RM)** | (Same as 8B) |
 
 
 ## Using the model
@@ -230,6 +230,20 @@ See the Falcon 180B model card for an example of this.
 | **IFEval (prompt loose)** | 82.1 | 82.6 | 83.2 | **88.0** | 87.6 | 76.0 | 79.9 |
 | **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
 | **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |
+| Benchmark (eval) | Tülu 3 405B SFT | Tülu 3 405B DPO | Tülu 3 405B | Llama 3.1 405B Instruct | Nous Hermes 3 405B | Deepseek V3 | GPT 4o (11-24) |
+|-----------------|----------------|----------------|-------------|------------------------|-------------------|-------------|----------------|
+| **Avg w/o Safety** | 76.3 | 79.0 | 80.0 | 78.1 | 74.4 | 79.0 | **80.5** |
+| **Avg w/ Safety** | 77.5 | 79.6 | 80.7 | 79.0 | 73.5 | 75.9 | **81.6** |
+| **MMLU (5 shot, CoT)** | 84.4 | 86.6 | 87.0 | **88.0** | 84.9 | 82.1 | 87.9 |
+| **PopQA (3 shot)** | **55.7** | 55.4 | 55.5 | 52.9 | 54.2 | 44.9 | 53.6 |
+| **BigBenchHard (0 shot, CoT)** | 88.0 | 88.8 | 88.6 | 87.1 | 87.7 | **89.5** | 83.3 |
+| **MATH (4 shot, Flex)** | 63.4 | 59.9 | 67.3 | 66.6 | 58.4 | **72.5** | 68.8 |
+| **GSM8K (8 shot, CoT)** | 93.6 | 94.2 | **95.5** | 95.4 | 92.7 | 94.1 | 91.7 |
+| **HumanEval (pass@10)** | 95.7 | **97.2** | 95.9 | 95.9 | 92.3 | 94.6 | 97.0 |
+| **HumanEval+ (pass@10)** | 93.3 | **93.9** | 92.9 | 90.3 | 86.9 | 91.6 | 92.7 |
+| **IFEval (prompt loose)** | 82.4 | 85.0 | 86.0 | **88.4** | 81.9 | 88.0 | 84.8 |
+| **AlpacaEval 2 (LC % win)** | 30.4 | 49.8 | 51.4 | 38.5 | 30.2 | 53.5 | **65.0** |
+| **Safety (6 task avg.)** | 87.7 | 85.5 | 86.7 | 86.8 | 65.8 | 72.2 | **90.9** |
 
 
 ## Hyperparameters
@@ -245,13 +259,13 @@ PPO settings for RLVR:
 - **Gradient Norm Threshold**: 1.0
 - **Learning Rate Schedule**: Linear
 - **Generation Temperature**: 1.0
-- **Batch Size (effective)**: 512
+- **Batch Size (effective)**: 224
 - **Max Token Length**: 2,048
 - **Max Prompt Token Length**: 2,048
 - **Penalty Reward Value for Responses without an EOS Token**: -10.0
-- **Response Length**: 1,024 (but 2,048 for MATH)
+- **Response Length**: 2,048
 - **Total Episodes**: 100,000
-- **KL penalty coefficient (beta)**: [0.1, 0.05, 0.03, 0.01]
+- **KL penalty coefficient (beta)**: 0.05
 - **Warm up ratio (omega)**: 0.0
 
 ## License and use
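The "Penalty Reward Value for Responses without an EOS Token: -10.0" setting in the hunk above corresponds to a common RLVR reward-shaping rule: a rollout that never emits the end-of-sequence token (i.e. was truncated at the length limit) receives a fixed negative reward instead of the verifier's score. A minimal sketch, assuming a Llama-3-style EOS id; the function and names are illustrative, not from the Tülu 3 codebase:

```python
EOS_TOKEN_ID = 128009   # assumed Llama 3.1 end-of-turn id; illustrative only
NO_EOS_PENALTY = -10.0  # matches the penalty reward value listed above

def shape_reward(response_token_ids, verifier_reward):
    """Return the verifier's reward, or a fixed penalty if the
    response was truncated (no EOS token was ever generated)."""
    if EOS_TOKEN_ID not in response_token_ids:
        return NO_EOS_PENALTY
    return verifier_reward

# Truncated rollout: penalized regardless of the verifier's score.
print(shape_reward([1, 2, 3], verifier_reward=1.0))       # -10.0
# Completed rollout: the verifier reward passes through.
print(shape_reward([1, 2, EOS_TOKEN_ID], verifier_reward=1.0))  # 1.0
```

This gives the policy a strong incentive to terminate generations within the response-length budget rather than run to the truncation limit.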
 
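For reference, the 405B PPO/RLVR settings as updated by this commit can be collected into a single config mapping. This is a readability sketch; the key names are illustrative, and only the values come from the README diff:

```python
# PPO settings for the 405B RLVR run, per this commit.
# Key names are illustrative; values are from the README diff.
ppo_rlvr_405b = {
    "gradient_norm_threshold": 1.0,
    "lr_schedule": "linear",
    "generation_temperature": 1.0,
    "effective_batch_size": 224,        # changed from 512 in this commit
    "max_token_length": 2048,
    "max_prompt_token_length": 2048,
    "no_eos_penalty_reward": -10.0,
    "response_length": 2048,            # was 1,024 (2,048 only for MATH)
    "total_episodes": 100_000,
    "kl_penalty_coefficient_beta": 0.05,  # was a sweep [0.1, 0.05, 0.03, 0.01]
    "warmup_ratio_omega": 0.0,
}

# Rough number of PPO batches implied by the episode budget:
n_batches = ppo_rlvr_405b["total_episodes"] // ppo_rlvr_405b["effective_batch_size"]
print(n_batches)  # 446
```

Note that the commit pins a single KL coefficient (0.05) and a single response length (2,048) for the 405B run, where the 8B/70B recipes listed a sweep and a task-dependent length.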