yifanzhang114 committed on
Commit d0b0080 · verified · 1 Parent(s): e5df528

Upload trainer_state.json with huggingface_hub
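
The uploaded file is a Trainer state log whose `log_history` array holds one record per optimizer step. Below is a minimal sketch of how such a file could be read and sanity-checked. The helper names `summarize_log_history` and `warmup_cosine_lr` are hypothetical. The schedule parameters (peak rate 5e-6, 14 warmup steps, cosine decay to zero at step 134) are assumptions inferred from the logged `learning_rate` values, not something the commit states.

```python
import json
import math


def summarize_log_history(state):
    """Return (step, loss, learning_rate) for each log entry that has all three keys."""
    rows = []
    for entry in state.get("log_history", []):
        if {"step", "loss", "learning_rate"} <= entry.keys():
            rows.append((entry["step"], entry["loss"], entry["learning_rate"]))
    return rows


def warmup_cosine_lr(step, peak_lr=5e-6, warmup_steps=14, total_steps=134):
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to zero."""
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Two entries excerpted from the log in this commit; a full file would instead be
# loaded with json.load() on an open handle to trainer_state.json.
sample = json.loads("""
{"global_step": 134, "log_history": [
  {"step": 1, "loss": 3.6136, "learning_rate": 3.5714285714285716e-07},
  {"step": 54, "loss": 2.5364, "learning_rate": 3.7500000000000005e-06}
]}
""")

rows = summarize_log_history(sample)
for step, loss, lr in rows:
    # Print the logged rate next to the rate predicted by the assumed schedule.
    print(step, loss, lr, warmup_cosine_lr(step))
```

If the assumed schedule is right, the logged and predicted rates agree to floating-point precision; a mismatch would suggest a different warmup length or decay shape.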

Files changed (1)
  1. trainer_state.json +1906 -0
trainer_state.json ADDED
@@ -0,0 +1,1906 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.9981378026070763,
5
+ "eval_steps": 500,
6
+ "global_step": 134,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.01,
13
+ "grad_norm": 20.656896651231627,
14
+ "learning_rate": 3.5714285714285716e-07,
15
+ "loss": 3.6136,
16
+ "step": 1,
17
+ "trainloss/critic_chosen": 1.459133505821228,
18
+ "trainloss/critic_rejected": 1.468864917755127,
19
+ "trainloss/reward": 1.459133505821228,
20
+ "trainrewards/accuracies": 0.5833333134651184,
21
+ "trainrewards/chosen": 0.3359375,
22
+ "trainrewards/margins": 0.0308837890625,
23
+ "trainrewards/rejected": 0.3046875
24
+ },
25
+ {
26
+ "epoch": 0.01,
27
+ "grad_norm": 20.747827742934472,
28
+ "learning_rate": 7.142857142857143e-07,
29
+ "loss": 3.6381,
30
+ "step": 2,
31
+ "trainloss/critic_chosen": 1.4447739124298096,
32
+ "trainloss/critic_rejected": 1.4999535083770752,
33
+ "trainloss/reward": 1.4447739124298096,
34
+ "trainrewards/accuracies": 0.5104166865348816,
35
+ "trainrewards/chosen": 0.314453125,
36
+ "trainrewards/margins": 0.01531982421875,
37
+ "trainrewards/rejected": 0.298828125
38
+ },
39
+ {
40
+ "epoch": 0.02,
41
+ "grad_norm": 19.429850511676147,
42
+ "learning_rate": 1.0714285714285714e-06,
43
+ "loss": 3.6713,
44
+ "step": 3,
45
+ "trainloss/critic_chosen": 1.4738179445266724,
46
+ "trainloss/critic_rejected": 1.505049228668213,
47
+ "trainloss/reward": 1.4738179445266724,
48
+ "trainrewards/accuracies": 0.5364583134651184,
49
+ "trainrewards/chosen": 0.302734375,
50
+ "trainrewards/margins": 0.015869140625,
51
+ "trainrewards/rejected": 0.287109375
52
+ },
53
+ {
54
+ "epoch": 0.03,
55
+ "grad_norm": 19.353394477551088,
56
+ "learning_rate": 1.4285714285714286e-06,
57
+ "loss": 3.6414,
58
+ "step": 4,
59
+ "trainloss/critic_chosen": 1.4593632221221924,
60
+ "trainloss/critic_rejected": 1.4944238662719727,
61
+ "trainloss/reward": 1.4593632221221924,
62
+ "trainrewards/accuracies": 0.5572916865348816,
63
+ "trainrewards/chosen": 0.357421875,
64
+ "trainrewards/margins": 0.0286865234375,
65
+ "trainrewards/rejected": 0.328125
66
+ },
67
+ {
68
+ "epoch": 0.04,
69
+ "grad_norm": 19.314515274756275,
70
+ "learning_rate": 1.7857142857142859e-06,
71
+ "loss": 3.6014,
72
+ "step": 5,
73
+ "trainloss/critic_chosen": 1.4528993368148804,
74
+ "trainloss/critic_rejected": 1.5066261291503906,
75
+ "trainloss/reward": 1.4528993368148804,
76
+ "trainrewards/accuracies": 0.7760417461395264,
77
+ "trainrewards/chosen": 0.494140625,
78
+ "trainrewards/margins": 0.130859375,
79
+ "trainrewards/rejected": 0.36328125
80
+ },
81
+ {
82
+ "epoch": 0.04,
83
+ "grad_norm": 17.65488892011503,
84
+ "learning_rate": 2.1428571428571427e-06,
85
+ "loss": 3.5132,
86
+ "step": 6,
87
+ "trainloss/critic_chosen": 1.4163868427276611,
88
+ "trainloss/critic_rejected": 1.4701489210128784,
89
+ "trainloss/reward": 1.4163868427276611,
90
+ "trainrewards/accuracies": 0.7864583730697632,
91
+ "trainrewards/chosen": 0.54296875,
92
+ "trainrewards/margins": 0.169921875,
93
+ "trainrewards/rejected": 0.373046875
94
+ },
95
+ {
96
+ "epoch": 0.05,
97
+ "grad_norm": 14.922878013160528,
98
+ "learning_rate": 2.5e-06,
99
+ "loss": 3.4609,
100
+ "step": 7,
101
+ "trainloss/critic_chosen": 1.4025382995605469,
102
+ "trainloss/critic_rejected": 1.4922353029251099,
103
+ "trainloss/reward": 1.4025382995605469,
104
+ "trainrewards/accuracies": 0.8281250596046448,
105
+ "trainrewards/chosen": 0.9453125,
106
+ "trainrewards/margins": 0.376953125,
107
+ "trainrewards/rejected": 0.56640625
108
+ },
109
+ {
110
+ "epoch": 0.06,
111
+ "grad_norm": 14.232594419094823,
112
+ "learning_rate": 2.8571428571428573e-06,
113
+ "loss": 3.3368,
114
+ "step": 8,
115
+ "trainloss/critic_chosen": 1.3685592412948608,
116
+ "trainloss/critic_rejected": 1.4283447265625,
117
+ "trainloss/reward": 1.3685592412948608,
118
+ "trainrewards/accuracies": 0.8645833730697632,
119
+ "trainrewards/chosen": 1.0078125,
120
+ "trainrewards/margins": 0.47265625,
121
+ "trainrewards/rejected": 0.53515625
122
+ },
123
+ {
124
+ "epoch": 0.07,
125
+ "grad_norm": 12.480541507697524,
126
+ "learning_rate": 3.2142857142857147e-06,
127
+ "loss": 3.0686,
128
+ "step": 9,
129
+ "trainloss/critic_chosen": 1.3232171535491943,
130
+ "trainloss/critic_rejected": 1.3606585264205933,
131
+ "trainloss/reward": 1.3232171535491943,
132
+ "trainrewards/accuracies": 0.9114583134651184,
133
+ "trainrewards/chosen": 1.671875,
134
+ "trainrewards/margins": 1.3359375,
135
+ "trainrewards/rejected": 0.333984375
136
+ },
137
+ {
138
+ "epoch": 0.07,
139
+ "grad_norm": 8.25153148096521,
140
+ "learning_rate": 3.5714285714285718e-06,
141
+ "loss": 3.0194,
142
+ "step": 10,
143
+ "trainloss/critic_chosen": 1.289527177810669,
144
+ "trainloss/critic_rejected": 1.3465213775634766,
145
+ "trainloss/reward": 1.289527177810669,
146
+ "trainrewards/accuracies": 0.9010416865348816,
147
+ "trainrewards/chosen": 1.3125,
148
+ "trainrewards/margins": 1.4375,
149
+ "trainrewards/rejected": -0.1220703125
150
+ },
151
+ {
152
+ "epoch": 0.08,
153
+ "grad_norm": 15.573626174864643,
154
+ "learning_rate": 3.928571428571429e-06,
155
+ "loss": 2.9476,
156
+ "step": 11,
157
+ "trainloss/critic_chosen": 1.266632318496704,
158
+ "trainloss/critic_rejected": 1.3359544277191162,
159
+ "trainloss/reward": 1.266632318496704,
160
+ "trainrewards/accuracies": 0.9010417461395264,
161
+ "trainrewards/chosen": 0.208984375,
162
+ "trainrewards/margins": 2.25,
163
+ "trainrewards/rejected": -2.046875
164
+ },
165
+ {
166
+ "epoch": 0.09,
167
+ "grad_norm": 8.798738351150583,
168
+ "learning_rate": 4.2857142857142855e-06,
169
+ "loss": 2.9208,
170
+ "step": 12,
171
+ "trainloss/critic_chosen": 1.285549521446228,
172
+ "trainloss/critic_rejected": 1.336460828781128,
173
+ "trainloss/reward": 1.285549521446228,
174
+ "trainrewards/accuracies": 0.9270833730697632,
175
+ "trainrewards/chosen": 1.390625,
176
+ "trainrewards/margins": 2.453125,
177
+ "trainrewards/rejected": -1.0625
178
+ },
179
+ {
180
+ "epoch": 0.1,
181
+ "grad_norm": 10.920314695532706,
182
+ "learning_rate": 4.642857142857144e-06,
183
+ "loss": 2.9737,
184
+ "step": 13,
185
+ "trainloss/critic_chosen": 1.3031879663467407,
186
+ "trainloss/critic_rejected": 1.3538345098495483,
187
+ "trainloss/reward": 1.3031879663467407,
188
+ "trainrewards/accuracies": 0.9583333134651184,
189
+ "trainrewards/chosen": 1.9453125,
190
+ "trainrewards/margins": 2.203125,
191
+ "trainrewards/rejected": -0.255859375
192
+ },
193
+ {
194
+ "epoch": 0.1,
195
+ "grad_norm": 7.902411216600844,
196
+ "learning_rate": 5e-06,
197
+ "loss": 2.8568,
198
+ "step": 14,
199
+ "trainloss/critic_chosen": 1.2445811033248901,
200
+ "trainloss/critic_rejected": 1.3195788860321045,
201
+ "trainloss/reward": 1.2445811033248901,
202
+ "trainrewards/accuracies": 0.90625,
203
+ "trainrewards/chosen": 1.5078125,
204
+ "trainrewards/margins": 2.09375,
205
+ "trainrewards/rejected": -0.5859375
206
+ },
207
+ {
208
+ "epoch": 0.11,
209
+ "grad_norm": 11.497405609399582,
210
+ "learning_rate": 4.999143312438893e-06,
211
+ "loss": 2.8485,
212
+ "step": 15,
213
+ "trainloss/critic_chosen": 1.2575112581253052,
214
+ "trainloss/critic_rejected": 1.303661823272705,
215
+ "trainloss/reward": 1.2575112581253052,
216
+ "trainrewards/accuracies": 0.9270833134651184,
217
+ "trainrewards/chosen": 0.55078125,
218
+ "trainrewards/margins": 1.84375,
219
+ "trainrewards/rejected": -1.2890625
220
+ },
221
+ {
222
+ "epoch": 0.12,
223
+ "grad_norm": 8.767557307270321,
224
+ "learning_rate": 4.9965738368864345e-06,
225
+ "loss": 2.8737,
226
+ "step": 16,
227
+ "trainloss/critic_chosen": 1.2387288808822632,
228
+ "trainloss/critic_rejected": 1.3038721084594727,
229
+ "trainloss/reward": 1.2387288808822632,
230
+ "trainrewards/accuracies": 0.9010416865348816,
231
+ "trainrewards/chosen": 1.578125,
232
+ "trainrewards/margins": 2.59375,
233
+ "trainrewards/rejected": -1.015625
234
+ },
235
+ {
236
+ "epoch": 0.13,
237
+ "grad_norm": 8.749611939075479,
238
+ "learning_rate": 4.992293334332821e-06,
239
+ "loss": 2.8681,
240
+ "step": 17,
241
+ "trainloss/critic_chosen": 1.2373957633972168,
242
+ "trainloss/critic_rejected": 1.302640676498413,
243
+ "trainloss/reward": 1.2373957633972168,
244
+ "trainrewards/accuracies": 0.9322916865348816,
245
+ "trainrewards/chosen": 1.5859375,
246
+ "trainrewards/margins": 2.203125,
247
+ "trainrewards/rejected": -0.61328125
248
+ },
249
+ {
250
+ "epoch": 0.13,
251
+ "grad_norm": 8.484573172931489,
252
+ "learning_rate": 4.986304738420684e-06,
253
+ "loss": 2.8305,
254
+ "step": 18,
255
+ "trainloss/critic_chosen": 1.23964262008667,
256
+ "trainloss/critic_rejected": 1.297165870666504,
257
+ "trainloss/reward": 1.23964262008667,
258
+ "trainrewards/accuracies": 0.9166666865348816,
259
+ "trainrewards/chosen": 0.60546875,
260
+ "trainrewards/margins": 1.765625,
261
+ "trainrewards/rejected": -1.15625
262
+ },
263
+ {
264
+ "epoch": 0.14,
265
+ "grad_norm": 5.43676917020596,
266
+ "learning_rate": 4.978612153434527e-06,
267
+ "loss": 2.7193,
268
+ "step": 19,
269
+ "trainloss/critic_chosen": 1.2180960178375244,
270
+ "trainloss/critic_rejected": 1.2327347993850708,
271
+ "trainloss/reward": 1.2180960178375244,
272
+ "trainrewards/accuracies": 0.9479166865348816,
273
+ "trainrewards/chosen": 1.5546875,
274
+ "trainrewards/margins": 2.296875,
275
+ "trainrewards/rejected": -0.7421875
276
+ },
277
+ {
278
+ "epoch": 0.15,
279
+ "grad_norm": 5.651490595569425,
280
+ "learning_rate": 4.9692208514878445e-06,
281
+ "loss": 2.8528,
282
+ "step": 20,
283
+ "trainloss/critic_chosen": 1.2103852033615112,
284
+ "trainloss/critic_rejected": 1.3035379648208618,
285
+ "trainloss/reward": 1.2103852033615112,
286
+ "trainrewards/accuracies": 0.9062500596046448,
287
+ "trainrewards/chosen": 1.75,
288
+ "trainrewards/margins": 2.6875,
289
+ "trainrewards/rejected": -0.94921875
290
+ },
291
+ {
292
+ "epoch": 0.16,
293
+ "grad_norm": 5.1975082682744524,
294
+ "learning_rate": 4.958137268909887e-06,
295
+ "loss": 2.7287,
296
+ "step": 21,
297
+ "trainloss/critic_chosen": 1.1857198476791382,
298
+ "trainloss/critic_rejected": 1.2193048000335693,
299
+ "trainloss/reward": 1.1857198476791382,
300
+ "trainrewards/accuracies": 0.9114583730697632,
301
+ "trainrewards/chosen": 1.4296875,
302
+ "trainrewards/margins": 2.21875,
303
+ "trainrewards/rejected": -0.7890625
304
+ },
305
+ {
306
+ "epoch": 0.16,
307
+ "grad_norm": 5.466958288822169,
308
+ "learning_rate": 4.9453690018345144e-06,
309
+ "loss": 2.7514,
310
+ "step": 22,
311
+ "trainloss/critic_chosen": 1.1868751049041748,
312
+ "trainloss/critic_rejected": 1.256333827972412,
313
+ "trainloss/reward": 1.1868751049041748,
314
+ "trainrewards/accuracies": 0.9427083730697632,
315
+ "trainrewards/chosen": 0.953125,
316
+ "trainrewards/margins": 1.71875,
317
+ "trainrewards/rejected": -0.76171875
318
+ },
319
+ {
320
+ "epoch": 0.17,
321
+ "grad_norm": 4.795005616591591,
322
+ "learning_rate": 4.930924800994192e-06,
323
+ "loss": 2.7025,
324
+ "step": 23,
325
+ "trainloss/critic_chosen": 1.1841645240783691,
326
+ "trainloss/critic_rejected": 1.2626478672027588,
327
+ "trainloss/reward": 1.1841645240783691,
328
+ "trainrewards/accuracies": 0.9270833134651184,
329
+ "trainrewards/chosen": 1.0625,
330
+ "trainrewards/margins": 2.09375,
331
+ "trainrewards/rejected": -1.0390625
332
+ },
333
+ {
334
+ "epoch": 0.18,
335
+ "grad_norm": 7.216599935232891,
336
+ "learning_rate": 4.914814565722671e-06,
337
+ "loss": 2.7024,
338
+ "step": 24,
339
+ "trainloss/critic_chosen": 1.1600149869918823,
340
+ "trainloss/critic_rejected": 1.216670036315918,
341
+ "trainloss/reward": 1.1600149869918823,
342
+ "trainrewards/accuracies": 0.90625,
343
+ "trainrewards/chosen": 1.953125,
344
+ "trainrewards/margins": 2.53125,
345
+ "trainrewards/rejected": -0.578125
346
+ },
347
+ {
348
+ "epoch": 0.19,
349
+ "grad_norm": 5.574536341669933,
350
+ "learning_rate": 4.897049337170483e-06,
351
+ "loss": 2.6825,
352
+ "step": 25,
353
+ "trainloss/critic_chosen": 1.17496919631958,
354
+ "trainloss/critic_rejected": 1.2430278062820435,
355
+ "trainloss/reward": 1.17496919631958,
356
+ "trainrewards/accuracies": 0.9427083730697632,
357
+ "trainrewards/chosen": 1.84375,
358
+ "trainrewards/margins": 2.71875,
359
+ "trainrewards/rejected": -0.87109375
360
+ },
361
+ {
362
+ "epoch": 0.19,
363
+ "grad_norm": 8.130831144336229,
364
+ "learning_rate": 4.8776412907378845e-06,
365
+ "loss": 2.7403,
366
+ "step": 26,
367
+ "trainloss/critic_chosen": 1.1843974590301514,
368
+ "trainloss/critic_rejected": 1.2316250801086426,
369
+ "trainloss/reward": 1.1843974590301514,
370
+ "trainrewards/accuracies": 0.9270833730697632,
371
+ "trainrewards/chosen": 0.322265625,
372
+ "trainrewards/margins": 2.09375,
373
+ "trainrewards/rejected": -1.78125
374
+ },
375
+ {
376
+ "epoch": 0.2,
377
+ "grad_norm": 4.106462749941039,
378
+ "learning_rate": 4.856603727730446e-06,
379
+ "loss": 2.6318,
380
+ "step": 27,
381
+ "trainloss/critic_chosen": 1.1325989961624146,
382
+ "trainloss/critic_rejected": 1.2125966548919678,
383
+ "trainloss/reward": 1.1325989961624146,
384
+ "trainrewards/accuracies": 0.9375,
385
+ "trainrewards/chosen": 0.98828125,
386
+ "trainrewards/margins": 1.859375,
387
+ "trainrewards/rejected": -0.875
388
+ },
389
+ {
390
+ "epoch": 0.21,
391
+ "grad_norm": 7.501840024960186,
392
+ "learning_rate": 4.833951066243004e-06,
393
+ "loss": 2.7439,
394
+ "step": 28,
395
+ "trainloss/critic_chosen": 1.156808853149414,
396
+ "trainloss/critic_rejected": 1.218095064163208,
397
+ "trainloss/reward": 1.156808853149414,
398
+ "trainrewards/accuracies": 0.9270833730697632,
399
+ "trainrewards/chosen": 2.03125,
400
+ "trainrewards/margins": 2.0,
401
+ "trainrewards/rejected": 0.021240234375
402
+ },
403
+ {
404
+ "epoch": 0.22,
405
+ "grad_norm": 10.542645887143404,
406
+ "learning_rate": 4.809698831278217e-06,
407
+ "loss": 2.6949,
408
+ "step": 29,
409
+ "trainloss/critic_chosen": 1.146854043006897,
410
+ "trainloss/critic_rejected": 1.2151950597763062,
411
+ "trainloss/reward": 1.146854043006897,
412
+ "trainrewards/accuracies": 0.9427083730697632,
413
+ "trainrewards/chosen": 2.546875,
414
+ "trainrewards/margins": 2.375,
415
+ "trainrewards/rejected": 0.169921875
416
+ },
417
+ {
418
+ "epoch": 0.22,
419
+ "grad_norm": 5.112451478263716,
420
+ "learning_rate": 4.783863644106502e-06,
421
+ "loss": 2.6784,
422
+ "step": 30,
423
+ "trainloss/critic_chosen": 1.1554011106491089,
424
+ "trainloss/critic_rejected": 1.2330732345581055,
425
+ "trainloss/reward": 1.1554011106491089,
426
+ "trainrewards/accuracies": 0.9322916865348816,
427
+ "trainrewards/chosen": 1.6875,
428
+ "trainrewards/margins": 2.390625,
429
+ "trainrewards/rejected": -0.703125
430
+ },
431
+ {
432
+ "epoch": 0.23,
433
+ "grad_norm": 5.406168864359725,
434
+ "learning_rate": 4.7564632108746524e-06,
435
+ "loss": 2.716,
436
+ "step": 31,
437
+ "trainloss/critic_chosen": 1.162062168121338,
438
+ "trainloss/critic_rejected": 1.2383043766021729,
439
+ "trainloss/reward": 1.162062168121338,
440
+ "trainrewards/accuracies": 0.9322916865348816,
441
+ "trainrewards/chosen": 0.609375,
442
+ "trainrewards/margins": 1.9765625,
443
+ "trainrewards/rejected": -1.3671875
444
+ },
445
+ {
446
+ "epoch": 0.24,
447
+ "grad_norm": 5.067065444400827,
448
+ "learning_rate": 4.72751631047092e-06,
449
+ "loss": 2.6516,
450
+ "step": 32,
451
+ "trainloss/critic_chosen": 1.1599905490875244,
452
+ "trainloss/critic_rejected": 1.2166763544082642,
453
+ "trainloss/reward": 1.1599905490875244,
454
+ "trainrewards/accuracies": 0.9270833730697632,
455
+ "trainrewards/chosen": 0.7265625,
456
+ "trainrewards/margins": 2.21875,
457
+ "trainrewards/rejected": -1.4921875
458
+ },
459
+ {
460
+ "epoch": 0.25,
461
+ "grad_norm": 6.759001147106114,
462
+ "learning_rate": 4.697042781654913e-06,
463
+ "loss": 2.6586,
464
+ "step": 33,
465
+ "trainloss/critic_chosen": 1.1513490676879883,
466
+ "trainloss/critic_rejected": 1.1805285215377808,
467
+ "trainloss/reward": 1.1513490676879883,
468
+ "trainrewards/accuracies": 0.9270833134651184,
469
+ "trainrewards/chosen": 1.8203125,
470
+ "trainrewards/margins": 2.234375,
471
+ "trainrewards/rejected": -0.408203125
472
+ },
473
+ {
474
+ "epoch": 0.25,
475
+ "grad_norm": 7.671545596305826,
476
+ "learning_rate": 4.665063509461098e-06,
477
+ "loss": 2.6397,
478
+ "step": 34,
479
+ "trainloss/critic_chosen": 1.1355525255203247,
480
+ "trainloss/critic_rejected": 1.1824406385421753,
481
+ "trainloss/reward": 1.1355525255203247,
482
+ "trainrewards/accuracies": 0.9479166865348816,
483
+ "trainrewards/chosen": 2.15625,
484
+ "trainrewards/margins": 2.34375,
485
+ "trainrewards/rejected": -0.1923828125
486
+ },
487
+ {
488
+ "epoch": 0.26,
489
+ "grad_norm": 4.120967770831028,
490
+ "learning_rate": 4.631600410885231e-06,
491
+ "loss": 2.6941,
492
+ "step": 35,
493
+ "trainloss/critic_chosen": 1.1876243352890015,
494
+ "trainloss/critic_rejected": 1.2462928295135498,
495
+ "trainloss/reward": 1.1876243352890015,
496
+ "trainrewards/accuracies": 0.9322916865348816,
497
+ "trainrewards/chosen": 1.6640625,
498
+ "trainrewards/margins": 2.453125,
499
+ "trainrewards/rejected": -0.78125
500
+ },
501
+ {
502
+ "epoch": 0.27,
503
+ "grad_norm": 4.873851901547121,
504
+ "learning_rate": 4.596676419863561e-06,
505
+ "loss": 2.5644,
506
+ "step": 36,
507
+ "trainloss/critic_chosen": 1.1080451011657715,
508
+ "trainloss/critic_rejected": 1.1967337131500244,
509
+ "trainloss/reward": 1.1080451011657715,
510
+ "trainrewards/accuracies": 0.96875,
511
+ "trainrewards/chosen": 0.80078125,
512
+ "trainrewards/margins": 2.125,
513
+ "trainrewards/rejected": -1.328125
514
+ },
515
+ {
516
+ "epoch": 0.28,
517
+ "grad_norm": 3.9620697117121004,
518
+ "learning_rate": 4.560315471555039e-06,
519
+ "loss": 2.5956,
520
+ "step": 37,
521
+ "trainloss/critic_chosen": 1.1373480558395386,
522
+ "trainloss/critic_rejected": 1.2142869234085083,
523
+ "trainloss/reward": 1.1373480558395386,
524
+ "trainrewards/accuracies": 0.9375000596046448,
525
+ "trainrewards/chosen": 1.0234375,
526
+ "trainrewards/margins": 2.40625,
527
+ "trainrewards/rejected": -1.390625
528
+ },
529
+ {
530
+ "epoch": 0.28,
531
+ "grad_norm": 5.768390511994523,
532
+ "learning_rate": 4.522542485937369e-06,
533
+ "loss": 2.6888,
534
+ "step": 38,
535
+ "trainloss/critic_chosen": 1.1469529867172241,
536
+ "trainloss/critic_rejected": 1.2045722007751465,
537
+ "trainloss/reward": 1.1469529867172241,
538
+ "trainrewards/accuracies": 0.9114583730697632,
539
+ "trainrewards/chosen": 1.8046875,
540
+ "trainrewards/margins": 2.484375,
541
+ "trainrewards/rejected": -0.6875
542
+ },
543
+ {
544
+ "epoch": 0.29,
545
+ "grad_norm": 4.874609149891752,
546
+ "learning_rate": 4.4833833507280884e-06,
547
+ "loss": 2.5543,
548
+ "step": 39,
549
+ "trainloss/critic_chosen": 1.1098886728286743,
550
+ "trainloss/critic_rejected": 1.1714903116226196,
551
+ "trainloss/reward": 1.1098886728286743,
552
+ "trainrewards/accuracies": 0.958333432674408,
553
+ "trainrewards/chosen": 1.8984375,
554
+ "trainrewards/margins": 2.625,
555
+ "trainrewards/rejected": -0.7265625
556
+ },
557
+ {
558
+ "epoch": 0.3,
559
+ "grad_norm": 3.396284173013532,
560
+ "learning_rate": 4.442864903642428e-06,
561
+ "loss": 2.6564,
562
+ "step": 40,
563
+ "trainloss/critic_chosen": 1.1380069255828857,
564
+ "trainloss/critic_rejected": 1.2159972190856934,
565
+ "trainloss/reward": 1.1380069255828857,
566
+ "trainrewards/accuracies": 0.9427083134651184,
567
+ "trainrewards/chosen": 0.9296875,
568
+ "trainrewards/margins": 1.9375,
569
+ "trainrewards/rejected": -1.0078125
570
+ },
571
+ {
572
+ "epoch": 0.31,
573
+ "grad_norm": 3.8092153086439087,
574
+ "learning_rate": 4.401014914000078e-06,
575
+ "loss": 2.5515,
576
+ "step": 41,
577
+ "trainloss/critic_chosen": 1.123491883277893,
578
+ "trainloss/critic_rejected": 1.1983730792999268,
579
+ "trainloss/reward": 1.123491883277893,
580
+ "trainrewards/accuracies": 0.9479166865348816,
581
+ "trainrewards/chosen": 1.015625,
582
+ "trainrewards/margins": 2.125,
583
+ "trainrewards/rejected": -1.1015625
584
+ },
585
+ {
586
+ "epoch": 0.31,
587
+ "grad_norm": 3.292224209377405,
588
+ "learning_rate": 4.357862063693486e-06,
589
+ "loss": 2.6296,
590
+ "step": 42,
591
+ "trainloss/critic_chosen": 1.1296896934509277,
592
+ "trainloss/critic_rejected": 1.193892002105713,
593
+ "trainloss/reward": 1.1296896934509277,
594
+ "trainrewards/accuracies": 0.9270833134651184,
595
+ "trainrewards/chosen": 1.3671875,
596
+ "trainrewards/margins": 2.234375,
597
+ "trainrewards/rejected": -0.859375
598
+ },
599
+ {
600
+ "epoch": 0.32,
601
+ "grad_norm": 4.97304962284229,
602
+ "learning_rate": 4.313435927530719e-06,
603
+ "loss": 2.5984,
604
+ "step": 43,
605
+ "trainloss/critic_chosen": 1.106866478919983,
606
+ "trainloss/critic_rejected": 1.1839522123336792,
607
+ "trainloss/reward": 1.106866478919983,
608
+ "trainrewards/accuracies": 0.9166666865348816,
609
+ "trainrewards/chosen": 1.859375,
610
+ "trainrewards/margins": 2.515625,
611
+ "trainrewards/rejected": -0.6640625
612
+ },
613
+ {
614
+ "epoch": 0.33,
615
+ "grad_norm": 3.1437931542625615,
616
+ "learning_rate": 4.267766952966369e-06,
617
+ "loss": 2.6053,
618
+ "step": 44,
619
+ "trainloss/critic_chosen": 1.141026496887207,
620
+ "trainloss/critic_rejected": 1.1881659030914307,
621
+ "trainloss/reward": 1.141026496887207,
622
+ "trainrewards/accuracies": 0.9375,
623
+ "trainrewards/chosen": 1.484375,
624
+ "trainrewards/margins": 2.5,
625
+ "trainrewards/rejected": -1.015625
626
+ },
627
+ {
628
+ "epoch": 0.34,
629
+ "grad_norm": 3.0260425195721727,
630
+ "learning_rate": 4.220886439234385e-06,
631
+ "loss": 2.6162,
632
+ "step": 45,
633
+ "trainloss/critic_chosen": 1.1437909603118896,
634
+ "trainloss/critic_rejected": 1.1694350242614746,
635
+ "trainloss/reward": 1.1437909603118896,
636
+ "trainrewards/accuracies": 0.9270833134651184,
637
+ "trainrewards/chosen": 1.3359375,
638
+ "trainrewards/margins": 2.265625,
639
+ "trainrewards/rejected": -0.93359375
640
+ },
641
+ {
642
+ "epoch": 0.34,
643
+ "grad_norm": 3.9421991947992803,
644
+ "learning_rate": 4.172826515897146e-06,
645
+ "loss": 2.559,
646
+ "step": 46,
647
+ "trainloss/critic_chosen": 1.1193464994430542,
648
+ "trainloss/critic_rejected": 1.1624045372009277,
649
+ "trainloss/reward": 1.1193464994430542,
650
+ "trainrewards/accuracies": 0.9270833134651184,
651
+ "trainrewards/chosen": 1.2109375,
652
+ "trainrewards/margins": 1.96875,
653
+ "trainrewards/rejected": -0.75
654
+ },
655
+ {
656
+ "epoch": 0.35,
657
+ "grad_norm": 4.76800798471375,
658
+ "learning_rate": 4.123620120825459e-06,
659
+ "loss": 2.5633,
660
+ "step": 47,
661
+ "trainloss/critic_chosen": 1.1039447784423828,
662
+ "trainloss/critic_rejected": 1.1683855056762695,
663
+ "trainloss/reward": 1.1039447784423828,
664
+ "trainrewards/accuracies": 0.9270833134651184,
665
+ "trainrewards/chosen": 1.5,
666
+ "trainrewards/margins": 1.8515625,
667
+ "trainrewards/rejected": -0.357421875
668
+ },
669
+ {
670
+ "epoch": 0.36,
671
+ "grad_norm": 4.677899041279874,
672
+ "learning_rate": 4.073300977624594e-06,
673
+ "loss": 2.6104,
674
+ "step": 48,
675
+ "trainloss/critic_chosen": 1.1293267011642456,
676
+ "trainloss/critic_rejected": 1.173600435256958,
677
+ "trainloss/reward": 1.1293267011642456,
678
+ "trainrewards/accuracies": 0.9270833134651184,
679
+ "trainrewards/chosen": 1.5625,
680
+ "trainrewards/margins": 1.953125,
681
+ "trainrewards/rejected": -0.38671875
682
+ },
683
+ {
684
+ "epoch": 0.36,
685
+ "grad_norm": 2.8973280324668815,
686
+ "learning_rate": 4.021903572521802e-06,
687
+ "loss": 2.5884,
688
+ "step": 49,
689
+ "trainloss/critic_chosen": 1.1289738416671753,
690
+ "trainloss/critic_rejected": 1.169731855392456,
691
+ "trainloss/reward": 1.1289738416671753,
692
+ "trainrewards/accuracies": 0.9375000596046448,
693
+ "trainrewards/chosen": 1.3125,
694
+ "trainrewards/margins": 2.515625,
695
+ "trainrewards/rejected": -1.203125
696
+ },
697
+ {
698
+ "epoch": 0.37,
699
+ "grad_norm": 2.9211772383175685,
700
+ "learning_rate": 3.969463130731183e-06,
701
+ "loss": 2.5868,
702
+ "step": 50,
703
+ "trainloss/critic_chosen": 1.1367411613464355,
704
+ "trainloss/critic_rejected": 1.1725157499313354,
705
+ "trainloss/reward": 1.1367411613464355,
706
+ "trainrewards/accuracies": 0.9114583730697632,
707
+ "trainrewards/chosen": 1.265625,
708
+ "trainrewards/margins": 2.484375,
709
+ "trainrewards/rejected": -1.21875
710
+ },
711
+ {
712
+ "epoch": 0.38,
713
+ "grad_norm": 3.4239805909008627,
714
+ "learning_rate": 3.916015592312083e-06,
715
+ "loss": 2.5442,
716
+ "step": 51,
717
+ "trainloss/critic_chosen": 1.101191759109497,
718
+ "trainloss/critic_rejected": 1.2063257694244385,
719
+ "trainloss/reward": 1.101191759109497,
720
+ "trainrewards/accuracies": 0.9583333730697632,
721
+ "trainrewards/chosen": 1.6171875,
722
+ "trainrewards/margins": 2.546875,
723
+ "trainrewards/rejected": -0.9296875
724
+ },
725
+ {
726
+ "epoch": 0.39,
727
+ "grad_norm": 3.3449791279382155,
728
+ "learning_rate": 3.861597587537568e-06,
729
+ "loss": 2.5532,
730
+ "step": 52,
731
+ "trainloss/critic_chosen": 1.1064534187316895,
732
+ "trainloss/critic_rejected": 1.1979490518569946,
733
+ "trainloss/reward": 1.1064534187316895,
734
+ "trainrewards/accuracies": 0.9427083730697632,
735
+ "trainrewards/chosen": 1.6015625,
736
+ "trainrewards/margins": 2.46875,
737
+ "trainrewards/rejected": -0.875
738
+ },
739
+ {
740
+ "epoch": 0.39,
741
+ "grad_norm": 3.9419484082989724,
742
+ "learning_rate": 3.806246411789872e-06,
743
+ "loss": 2.6147,
744
+ "step": 53,
745
+ "trainloss/critic_chosen": 1.138405680656433,
746
+ "trainloss/critic_rejected": 1.1973538398742676,
747
+ "trainloss/reward": 1.138405680656433,
748
+ "trainrewards/accuracies": 0.9375,
749
+ "trainrewards/chosen": 1.25,
750
+ "trainrewards/margins": 2.609375,
751
+ "trainrewards/rejected": -1.3671875
752
+ },
753
+ {
754
+ "epoch": 0.4,
755
+ "grad_norm": 3.4980254593155413,
756
+ "learning_rate": 3.7500000000000005e-06,
757
+ "loss": 2.5364,
758
+ "step": 54,
759
+ "trainloss/critic_chosen": 1.0845965147018433,
760
+ "trainloss/critic_rejected": 1.1989306211471558,
761
+ "trainloss/reward": 1.0845965147018433,
762
+ "trainrewards/accuracies": 0.9635417461395264,
763
+ "trainrewards/chosen": 1.6953125,
764
+ "trainrewards/margins": 2.59375,
765
+ "trainrewards/rejected": -0.89453125
766
+ },
767
+ {
768
+ "epoch": 0.41,
769
+ "grad_norm": 3.7684432316267347,
770
+ "learning_rate": 3.6928969006490212e-06,
771
+ "loss": 2.5578,
772
+ "step": 55,
773
+ "trainloss/critic_chosen": 1.105364441871643,
774
+ "trainloss/critic_rejected": 1.1862692832946777,
775
+ "trainloss/reward": 1.105364441871643,
776
+ "trainrewards/accuracies": 0.9270833730697632,
777
+ "trainrewards/chosen": 1.8046875,
778
+ "trainrewards/margins": 2.65625,
779
+ "trainrewards/rejected": -0.86328125
780
+ },
781
+ {
782
+ "epoch": 0.42,
783
+ "grad_norm": 2.733004985886796,
784
+ "learning_rate": 3.634976249348867e-06,
785
+ "loss": 2.5665,
786
+ "step": 56,
787
+ "trainloss/critic_chosen": 1.1256849765777588,
788
+ "trainloss/critic_rejected": 1.1650742292404175,
789
+ "trainloss/reward": 1.1256849765777588,
790
+ "trainrewards/accuracies": 0.9322916865348816,
791
+ "trainrewards/chosen": 1.328125,
792
+ "trainrewards/margins": 2.390625,
793
+ "trainrewards/rejected": -1.0546875
794
+ },
795
+ {
+ "epoch": 0.42,
+ "grad_norm": 3.063205301802556,
+ "learning_rate": 3.5762777420207382e-06,
+ "loss": 2.5733,
+ "step": 57,
+ "trainloss/critic_chosen": 1.1022924184799194,
+ "trainloss/critic_rejected": 1.1559257507324219,
+ "trainloss/reward": 1.1022924184799194,
+ "trainrewards/accuracies": 0.9166666865348816,
+ "trainrewards/chosen": 1.40625,
+ "trainrewards/margins": 2.28125,
+ "trainrewards/rejected": -0.875
+ },
+ {
+ "epoch": 0.43,
+ "grad_norm": 3.2936675397250985,
+ "learning_rate": 3.516841607689501e-06,
+ "loss": 2.529,
+ "step": 58,
+ "trainloss/critic_chosen": 1.0919924974441528,
+ "trainloss/critic_rejected": 1.1807957887649536,
+ "trainloss/reward": 1.0919924974441528,
+ "trainrewards/accuracies": 0.9375000596046448,
+ "trainrewards/chosen": 1.0703125,
+ "trainrewards/margins": 2.0625,
+ "trainrewards/rejected": -1.0
+ },
+ {
+ "epoch": 0.44,
+ "grad_norm": 2.9687788874925505,
+ "learning_rate": 3.4567085809127247e-06,
+ "loss": 2.5538,
+ "step": 59,
+ "trainloss/critic_chosen": 1.152530312538147,
+ "trainloss/critic_rejected": 1.128198504447937,
+ "trainloss/reward": 1.152530312538147,
+ "trainrewards/accuracies": 0.9375,
+ "trainrewards/chosen": 1.3125,
+ "trainrewards/margins": 2.171875,
+ "trainrewards/rejected": -0.86328125
+ },
+ {
+ "epoch": 0.45,
+ "grad_norm": 2.5189366946202374,
+ "learning_rate": 3.39591987386325e-06,
+ "loss": 2.4931,
+ "step": 60,
+ "trainloss/critic_chosen": 1.0971665382385254,
+ "trainloss/critic_rejected": 1.189927339553833,
+ "trainloss/reward": 1.0971665382385254,
+ "trainrewards/accuracies": 0.96875,
+ "trainrewards/chosen": 1.3828125,
+ "trainrewards/margins": 2.671875,
+ "trainrewards/rejected": -1.2890625
+ },
+ {
+ "epoch": 0.45,
+ "grad_norm": 4.707774798123127,
+ "learning_rate": 3.3345171480844275e-06,
+ "loss": 2.4995,
+ "step": 61,
+ "trainloss/critic_chosen": 1.1144541501998901,
+ "trainloss/critic_rejected": 1.1472208499908447,
+ "trainloss/reward": 1.1144541501998901,
+ "trainrewards/accuracies": 0.9739583730697632,
+ "trainrewards/chosen": 1.9921875,
+ "trainrewards/margins": 2.765625,
+ "trainrewards/rejected": -0.7734375
+ },
+ {
+ "epoch": 0.46,
+ "grad_norm": 3.621977923726089,
+ "learning_rate": 3.272542485937369e-06,
+ "loss": 2.5767,
+ "step": 62,
+ "trainloss/critic_chosen": 1.1388683319091797,
+ "trainloss/critic_rejected": 1.1852062940597534,
+ "trainloss/reward": 1.1388683319091797,
+ "trainrewards/accuracies": 0.9479167461395264,
+ "trainrewards/chosen": 1.8203125,
+ "trainrewards/margins": 3.09375,
+ "trainrewards/rejected": -1.265625
+ },
+ {
+ "epoch": 0.47,
+ "grad_norm": 4.340502288849219,
+ "learning_rate": 3.2100383617598075e-06,
+ "loss": 2.5008,
+ "step": 63,
+ "trainloss/critic_chosen": 1.0960522890090942,
+ "trainloss/critic_rejected": 1.1389869451522827,
+ "trainloss/reward": 1.0960522890090942,
+ "trainrewards/accuracies": 0.9427083730697632,
+ "trainrewards/chosen": 1.25,
+ "trainrewards/margins": 2.8125,
+ "trainrewards/rejected": -1.5703125
+ },
+ {
+ "epoch": 0.48,
+ "grad_norm": 3.2652013602478087,
+ "learning_rate": 3.147047612756302e-06,
+ "loss": 2.4784,
+ "step": 64,
+ "trainloss/critic_chosen": 1.1066646575927734,
+ "trainloss/critic_rejected": 1.1423835754394531,
+ "trainloss/reward": 1.1066646575927734,
+ "trainrewards/accuracies": 0.9427083730697632,
+ "trainrewards/chosen": 1.2265625,
+ "trainrewards/margins": 2.859375,
+ "trainrewards/rejected": -1.6328125
+ },
+ {
+ "epoch": 0.48,
+ "grad_norm": 4.460312181878758,
+ "learning_rate": 3.0836134096397642e-06,
+ "loss": 2.5315,
+ "step": 65,
+ "trainloss/critic_chosen": 1.097680926322937,
+ "trainloss/critic_rejected": 1.1829330921173096,
+ "trainloss/reward": 1.097680926322937,
+ "trainrewards/accuracies": 0.9322916865348816,
+ "trainrewards/chosen": 1.71875,
+ "trainrewards/margins": 2.375,
+ "trainrewards/rejected": -0.66015625
+ },
+ {
+ "epoch": 0.49,
+ "grad_norm": 5.398290397831798,
+ "learning_rate": 3.019779227044398e-06,
+ "loss": 2.4912,
+ "step": 66,
+ "trainloss/critic_chosen": 1.0728169679641724,
+ "trainloss/critic_rejected": 1.1528609991073608,
+ "trainloss/reward": 1.0728169679641724,
+ "trainrewards/accuracies": 0.9479166865348816,
+ "trainrewards/chosen": 1.75,
+ "trainrewards/margins": 2.1875,
+ "trainrewards/rejected": -0.44140625
+ },
+ {
+ "epoch": 0.5,
+ "grad_norm": 4.530365049006353,
+ "learning_rate": 2.9555888137303695e-06,
+ "loss": 2.4768,
+ "step": 67,
+ "trainloss/critic_chosen": 1.0978233814239502,
+ "trainloss/critic_rejected": 1.1454182863235474,
+ "trainloss/reward": 1.0978233814239502,
+ "trainrewards/accuracies": 0.9479166865348816,
+ "trainrewards/chosen": 1.515625,
+ "trainrewards/margins": 2.1875,
+ "trainrewards/rejected": -0.66015625
+ },
+ {
+ "epoch": 0.51,
+ "grad_norm": 3.090064735833262,
+ "learning_rate": 2.8910861626005774e-06,
+ "loss": 2.5542,
+ "step": 68,
+ "trainloss/critic_chosen": 1.1045993566513062,
+ "trainloss/critic_rejected": 1.1823933124542236,
+ "trainloss/reward": 1.1045993566513062,
+ "trainrewards/accuracies": 0.9166666865348816,
+ "trainrewards/chosen": 1.296875,
+ "trainrewards/margins": 2.296875,
+ "trainrewards/rejected": -1.0
+ },
+ {
+ "epoch": 0.51,
+ "grad_norm": 2.801294778929006,
+ "learning_rate": 2.82631548055013e-06,
+ "loss": 2.4752,
+ "step": 69,
+ "trainloss/critic_chosen": 1.0862737894058228,
+ "trainloss/critic_rejected": 1.1638906002044678,
+ "trainloss/reward": 1.0862737894058228,
+ "trainrewards/accuracies": 0.9479166865348816,
+ "trainrewards/chosen": 1.46875,
+ "trainrewards/margins": 2.8125,
+ "trainrewards/rejected": -1.359375
+ },
+ {
+ "epoch": 0.52,
+ "grad_norm": 3.5888770327583503,
+ "learning_rate": 2.761321158169134e-06,
+ "loss": 2.5502,
+ "step": 70,
+ "trainloss/critic_chosen": 1.1130059957504272,
+ "trainloss/critic_rejected": 1.1747164726257324,
+ "trainloss/reward": 1.1130059957504272,
+ "trainrewards/accuracies": 0.9583333134651184,
+ "trainrewards/chosen": 1.75,
+ "trainrewards/margins": 2.953125,
+ "trainrewards/rejected": -1.203125
+ },
+ {
+ "epoch": 0.53,
+ "grad_norm": 3.553005435624982,
+ "learning_rate": 2.696147739319613e-06,
+ "loss": 2.4735,
+ "step": 71,
+ "trainloss/critic_chosen": 1.1133400201797485,
+ "trainloss/critic_rejected": 1.1409944295883179,
+ "trainloss/reward": 1.1133400201797485,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.96875,
+ "trainrewards/margins": 3.375,
+ "trainrewards/rejected": -1.40625
+ },
+ {
+ "epoch": 0.54,
+ "grad_norm": 2.7088469336528145,
+ "learning_rate": 2.6308398906073603e-06,
+ "loss": 2.4512,
+ "step": 72,
+ "trainloss/critic_chosen": 1.1119564771652222,
+ "trainloss/critic_rejected": 1.1244186162948608,
+ "trainloss/reward": 1.1119564771652222,
+ "trainrewards/accuracies": 0.96875,
+ "trainrewards/chosen": 1.5703125,
+ "trainrewards/margins": 3.03125,
+ "trainrewards/rejected": -1.4609375
+ },
+ {
+ "epoch": 0.54,
+ "grad_norm": 3.938561115333166,
+ "learning_rate": 2.5654423707696834e-06,
+ "loss": 2.4921,
+ "step": 73,
+ "trainloss/critic_chosen": 1.0844348669052124,
+ "trainloss/critic_rejected": 1.163710355758667,
+ "trainloss/reward": 1.0844348669052124,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.0703125,
+ "trainrewards/margins": 2.734375,
+ "trainrewards/rejected": -1.6640625
+ },
+ {
+ "epoch": 0.55,
+ "grad_norm": 3.7076560975513293,
+ "learning_rate": 2.5e-06,
+ "loss": 2.4702,
+ "step": 74,
+ "trainloss/critic_chosen": 1.105428695678711,
+ "trainloss/critic_rejected": 1.1134750843048096,
+ "trainloss/reward": 1.105428695678711,
+ "trainrewards/accuracies": 0.9531250596046448,
+ "trainrewards/chosen": 1.1015625,
+ "trainrewards/margins": 2.4375,
+ "trainrewards/rejected": -1.328125
+ },
+ {
+ "epoch": 0.56,
+ "grad_norm": 4.584325275815331,
+ "learning_rate": 2.434557629230318e-06,
+ "loss": 2.5531,
+ "step": 75,
+ "trainloss/critic_chosen": 1.1023496389389038,
+ "trainloss/critic_rejected": 1.1693300008773804,
+ "trainloss/reward": 1.1023496389389038,
+ "trainrewards/accuracies": 0.9322916865348816,
+ "trainrewards/chosen": 1.6953125,
+ "trainrewards/margins": 2.265625,
+ "trainrewards/rejected": -0.5703125
+ },
+ {
+ "epoch": 0.57,
+ "grad_norm": 5.707921133643401,
+ "learning_rate": 2.3691601093926406e-06,
+ "loss": 2.512,
+ "step": 76,
+ "trainloss/critic_chosen": 1.0742218494415283,
+ "trainloss/critic_rejected": 1.1473562717437744,
+ "trainloss/reward": 1.0742218494415283,
+ "trainrewards/accuracies": 0.9375000596046448,
+ "trainrewards/chosen": 1.984375,
+ "trainrewards/margins": 2.359375,
+ "trainrewards/rejected": -0.380859375
+ },
+ {
+ "epoch": 0.57,
+ "grad_norm": 5.052893345106084,
+ "learning_rate": 2.3038522606803882e-06,
+ "loss": 2.5495,
+ "step": 77,
+ "trainloss/critic_chosen": 1.09754478931427,
+ "trainloss/critic_rejected": 1.175227165222168,
+ "trainloss/reward": 1.09754478931427,
+ "trainrewards/accuracies": 0.9218751192092896,
+ "trainrewards/chosen": 1.8671875,
+ "trainrewards/margins": 2.359375,
+ "trainrewards/rejected": -0.490234375
+ },
+ {
+ "epoch": 0.58,
+ "grad_norm": 3.505818483136781,
+ "learning_rate": 2.238678841830867e-06,
+ "loss": 2.5073,
+ "step": 78,
+ "trainloss/critic_chosen": 1.100816249847412,
+ "trainloss/critic_rejected": 1.1553771495819092,
+ "trainloss/reward": 1.100816249847412,
+ "trainrewards/accuracies": 0.9375000596046448,
+ "trainrewards/chosen": 1.4375,
+ "trainrewards/margins": 2.1875,
+ "trainrewards/rejected": -0.75
+ },
+ {
+ "epoch": 0.59,
+ "grad_norm": 4.2251215117971475,
+ "learning_rate": 2.173684519449872e-06,
+ "loss": 2.5035,
+ "step": 79,
+ "trainloss/critic_chosen": 1.093074083328247,
+ "trainloss/critic_rejected": 1.163825511932373,
+ "trainloss/reward": 1.093074083328247,
+ "trainrewards/accuracies": 0.9531250596046448,
+ "trainrewards/chosen": 0.91796875,
+ "trainrewards/margins": 2.140625,
+ "trainrewards/rejected": -1.21875
+ },
+ {
+ "epoch": 0.6,
+ "grad_norm": 4.171916933286059,
+ "learning_rate": 2.1089138373994226e-06,
+ "loss": 2.4726,
+ "step": 80,
+ "trainloss/critic_chosen": 1.0706841945648193,
+ "trainloss/critic_rejected": 1.160952091217041,
+ "trainloss/reward": 1.0706841945648193,
+ "trainrewards/accuracies": 0.9322916865348816,
+ "trainrewards/chosen": 0.98046875,
+ "trainrewards/margins": 2.375,
+ "trainrewards/rejected": -1.390625
+ },
+ {
+ "epoch": 0.6,
+ "grad_norm": 2.7690433360924085,
+ "learning_rate": 2.0444111862696313e-06,
+ "loss": 2.4269,
+ "step": 81,
+ "trainloss/critic_chosen": 1.0752573013305664,
+ "trainloss/critic_rejected": 1.1339901685714722,
+ "trainloss/reward": 1.0752573013305664,
+ "trainrewards/accuracies": 0.9739583730697632,
+ "trainrewards/chosen": 1.484375,
+ "trainrewards/margins": 2.578125,
+ "trainrewards/rejected": -1.09375
+ },
+ {
+ "epoch": 0.61,
+ "grad_norm": 3.358268001716196,
+ "learning_rate": 1.9802207729556023e-06,
+ "loss": 2.461,
+ "step": 82,
+ "trainloss/critic_chosen": 1.1075457334518433,
+ "trainloss/critic_rejected": 1.1157523393630981,
+ "trainloss/reward": 1.1075457334518433,
+ "trainrewards/accuracies": 0.953125,
+ "trainrewards/chosen": 1.8828125,
+ "trainrewards/margins": 2.90625,
+ "trainrewards/rejected": -1.0234375
+ },
+ {
+ "epoch": 0.62,
+ "grad_norm": 4.328068525423629,
+ "learning_rate": 1.9163865903602374e-06,
+ "loss": 2.5352,
+ "step": 83,
+ "trainloss/critic_chosen": 1.1028249263763428,
+ "trainloss/critic_rejected": 1.1644842624664307,
+ "trainloss/reward": 1.1028249263763428,
+ "trainrewards/accuracies": 0.96875,
+ "trainrewards/chosen": 1.8671875,
+ "trainrewards/margins": 2.921875,
+ "trainrewards/rejected": -1.0625
+ },
+ {
+ "epoch": 0.63,
+ "grad_norm": 3.266438978334478,
+ "learning_rate": 1.852952387243698e-06,
+ "loss": 2.4134,
+ "step": 84,
+ "trainloss/critic_chosen": 1.0756388902664185,
+ "trainloss/critic_rejected": 1.1303694248199463,
+ "trainloss/reward": 1.0756388902664185,
+ "trainrewards/accuracies": 0.9687500596046448,
+ "trainrewards/chosen": 1.9140625,
+ "trainrewards/margins": 3.25,
+ "trainrewards/rejected": -1.328125
+ },
+ {
+ "epoch": 0.63,
+ "grad_norm": 2.386641393706194,
+ "learning_rate": 1.7899616382401935e-06,
+ "loss": 2.401,
+ "step": 85,
+ "trainloss/critic_chosen": 1.0511287450790405,
+ "trainloss/critic_rejected": 1.128703236579895,
+ "trainloss/reward": 1.0511287450790405,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.6015625,
+ "trainrewards/margins": 2.953125,
+ "trainrewards/rejected": -1.359375
+ },
+ {
+ "epoch": 0.64,
+ "grad_norm": 3.7161933000807403,
+ "learning_rate": 1.7274575140626318e-06,
+ "loss": 2.4732,
+ "step": 86,
+ "trainloss/critic_chosen": 1.0827696323394775,
+ "trainloss/critic_rejected": 1.1439146995544434,
+ "trainloss/reward": 1.0827696323394775,
+ "trainrewards/accuracies": 0.9583333134651184,
+ "trainrewards/chosen": 1.0,
+ "trainrewards/margins": 2.734375,
+ "trainrewards/rejected": -1.734375
+ },
+ {
+ "epoch": 0.65,
+ "grad_norm": 3.4186216283012754,
+ "learning_rate": 1.665482851915573e-06,
+ "loss": 2.5064,
+ "step": 87,
+ "trainloss/critic_chosen": 1.093652367591858,
+ "trainloss/critic_rejected": 1.1373913288116455,
+ "trainloss/reward": 1.093652367591858,
+ "trainrewards/accuracies": 0.927083432674408,
+ "trainrewards/chosen": 1.09375,
+ "trainrewards/margins": 2.5625,
+ "trainrewards/rejected": -1.46875
+ },
+ {
+ "epoch": 0.66,
+ "grad_norm": 2.4263959266567996,
+ "learning_rate": 1.6040801261367494e-06,
+ "loss": 2.5409,
+ "step": 88,
+ "trainloss/critic_chosen": 1.1319228410720825,
+ "trainloss/critic_rejected": 1.1887366771697998,
+ "trainloss/reward": 1.1319228410720825,
+ "trainrewards/accuracies": 0.9687501192092896,
+ "trainrewards/chosen": 1.3125,
+ "trainrewards/margins": 2.6875,
+ "trainrewards/rejected": -1.375
+ },
+ {
+ "epoch": 0.66,
+ "grad_norm": 4.091003192293857,
+ "learning_rate": 1.5432914190872757e-06,
+ "loss": 2.5386,
+ "step": 89,
+ "trainloss/critic_chosen": 1.1037051677703857,
+ "trainloss/critic_rejected": 1.1342533826828003,
+ "trainloss/reward": 1.1037051677703857,
+ "trainrewards/accuracies": 0.9427083134651184,
+ "trainrewards/chosen": 1.640625,
+ "trainrewards/margins": 2.375,
+ "trainrewards/rejected": -0.734375
+ },
+ {
+ "epoch": 0.67,
+ "grad_norm": 4.356596196020246,
+ "learning_rate": 1.4831583923105e-06,
+ "loss": 2.4845,
+ "step": 90,
+ "trainloss/critic_chosen": 1.0889127254486084,
+ "trainloss/critic_rejected": 1.1599314212799072,
+ "trainloss/reward": 1.0889127254486084,
+ "trainrewards/accuracies": 0.9583333134651184,
+ "trainrewards/chosen": 1.875,
+ "trainrewards/margins": 2.59375,
+ "trainrewards/rejected": -0.7265625
+ },
+ {
+ "epoch": 0.68,
+ "grad_norm": 3.484859150407605,
+ "learning_rate": 1.4237222579792618e-06,
+ "loss": 2.504,
+ "step": 91,
+ "trainloss/critic_chosen": 1.1031081676483154,
+ "trainloss/critic_rejected": 1.1596983671188354,
+ "trainloss/reward": 1.1031081676483154,
+ "trainrewards/accuracies": 0.953125,
+ "trainrewards/chosen": 1.7265625,
+ "trainrewards/margins": 2.5,
+ "trainrewards/rejected": -0.765625
+ },
+ {
+ "epoch": 0.69,
+ "grad_norm": 3.5906077474254046,
+ "learning_rate": 1.3650237506511333e-06,
+ "loss": 2.497,
+ "step": 92,
+ "trainloss/critic_chosen": 1.1017568111419678,
+ "trainloss/critic_rejected": 1.1597734689712524,
+ "trainloss/reward": 1.1017568111419678,
+ "trainrewards/accuracies": 0.9427083730697632,
+ "trainrewards/chosen": 1.734375,
+ "trainrewards/margins": 2.609375,
+ "trainrewards/rejected": -0.87109375
+ },
+ {
+ "epoch": 0.69,
+ "grad_norm": 3.883326754801315,
+ "learning_rate": 1.307103099350979e-06,
+ "loss": 2.4881,
+ "step": 93,
+ "trainloss/critic_chosen": 1.1008602380752563,
+ "trainloss/critic_rejected": 1.1622505187988281,
+ "trainloss/reward": 1.1008602380752563,
+ "trainrewards/accuracies": 0.9374999403953552,
+ "trainrewards/chosen": 1.8359375,
+ "trainrewards/margins": 2.65625,
+ "trainrewards/rejected": -0.81640625
+ },
+ {
+ "epoch": 0.7,
+ "grad_norm": 3.106807473961497,
+ "learning_rate": 1.2500000000000007e-06,
+ "loss": 2.5239,
+ "step": 94,
+ "trainloss/critic_chosen": 1.1186132431030273,
+ "trainloss/critic_rejected": 1.1955691576004028,
+ "trainloss/reward": 1.1186132431030273,
+ "trainrewards/accuracies": 0.9479166865348816,
+ "trainrewards/chosen": 1.4453125,
+ "trainrewards/margins": 2.78125,
+ "trainrewards/rejected": -1.34375
+ },
+ {
+ "epoch": 0.71,
+ "grad_norm": 3.0694983477589237,
+ "learning_rate": 1.193753588210128e-06,
+ "loss": 2.4975,
+ "step": 95,
+ "trainloss/critic_chosen": 1.089274287223816,
+ "trainloss/critic_rejected": 1.1611120700836182,
+ "trainloss/reward": 1.089274287223816,
+ "trainrewards/accuracies": 0.9166667461395264,
+ "trainrewards/chosen": 1.21875,
+ "trainrewards/margins": 2.625,
+ "trainrewards/rejected": -1.4140625
+ },
+ {
+ "epoch": 0.72,
+ "grad_norm": 2.647849041797858,
+ "learning_rate": 1.1384024124624324e-06,
+ "loss": 2.4533,
+ "step": 96,
+ "trainloss/critic_chosen": 1.0731533765792847,
+ "trainloss/critic_rejected": 1.1588420867919922,
+ "trainloss/reward": 1.0731533765792847,
+ "trainrewards/accuracies": 0.9531250596046448,
+ "trainrewards/chosen": 1.2578125,
+ "trainrewards/margins": 2.671875,
+ "trainrewards/rejected": -1.40625
+ },
+ {
+ "epoch": 0.72,
+ "grad_norm": 3.0284092019206743,
+ "learning_rate": 1.0839844076879186e-06,
+ "loss": 2.52,
+ "step": 97,
+ "trainloss/critic_chosen": 1.1046061515808105,
+ "trainloss/critic_rejected": 1.1355366706848145,
+ "trainloss/reward": 1.1046061515808105,
+ "trainrewards/accuracies": 0.9114583134651184,
+ "trainrewards/chosen": 1.5234375,
+ "trainrewards/margins": 2.515625,
+ "trainrewards/rejected": -1.0
+ },
+ {
+ "epoch": 0.73,
+ "grad_norm": 3.239930376341791,
+ "learning_rate": 1.0305368692688175e-06,
+ "loss": 2.3829,
+ "step": 98,
+ "trainloss/critic_chosen": 1.0649518966674805,
+ "trainloss/critic_rejected": 1.1148179769515991,
+ "trainloss/reward": 1.0649518966674805,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.828125,
+ "trainrewards/margins": 2.921875,
+ "trainrewards/rejected": -1.0859375
+ },
+ {
+ "epoch": 0.74,
+ "grad_norm": 2.844057334430093,
+ "learning_rate": 9.780964274781984e-07,
+ "loss": 2.4761,
+ "step": 99,
+ "trainloss/critic_chosen": 1.0876730680465698,
+ "trainloss/critic_rejected": 1.1597809791564941,
+ "trainloss/reward": 1.0876730680465698,
+ "trainrewards/accuracies": 0.9583333134651184,
+ "trainrewards/chosen": 1.65625,
+ "trainrewards/margins": 2.703125,
+ "trainrewards/rejected": -1.046875
+ },
+ {
+ "epoch": 0.74,
+ "grad_norm": 2.523374395571222,
+ "learning_rate": 9.266990223754069e-07,
+ "loss": 2.4511,
+ "step": 100,
+ "trainloss/critic_chosen": 1.0983867645263672,
+ "trainloss/critic_rejected": 1.1455085277557373,
+ "trainloss/reward": 1.0983867645263672,
+ "trainrewards/accuracies": 0.9791666865348816,
+ "trainrewards/chosen": 1.5546875,
+ "trainrewards/margins": 2.78125,
+ "trainrewards/rejected": -1.21875
+ },
+ {
+ "epoch": 0.75,
+ "grad_norm": 3.3175745581436917,
+ "learning_rate": 8.763798791745413e-07,
+ "loss": 2.453,
+ "step": 101,
+ "trainloss/critic_chosen": 1.094862699508667,
+ "trainloss/critic_rejected": 1.1401726007461548,
+ "trainloss/reward": 1.094862699508667,
+ "trainrewards/accuracies": 0.9531250596046448,
+ "trainrewards/chosen": 1.625,
+ "trainrewards/margins": 2.78125,
+ "trainrewards/rejected": -1.15625
+ },
+ {
+ "epoch": 0.76,
+ "grad_norm": 2.8366252126004596,
+ "learning_rate": 8.271734841028553e-07,
+ "loss": 2.5483,
+ "step": 102,
+ "trainloss/critic_chosen": 1.0930631160736084,
+ "trainloss/critic_rejected": 1.173164963722229,
+ "trainloss/reward": 1.0930631160736084,
+ "trainrewards/accuracies": 0.8958333134651184,
+ "trainrewards/chosen": 1.390625,
+ "trainrewards/margins": 2.515625,
+ "trainrewards/rejected": -1.125
+ },
+ {
+ "epoch": 0.77,
+ "grad_norm": 2.9707853433568023,
+ "learning_rate": 7.791135607656147e-07,
+ "loss": 2.3986,
+ "step": 103,
+ "trainloss/critic_chosen": 1.0701719522476196,
+ "trainloss/critic_rejected": 1.1288470029830933,
+ "trainloss/reward": 1.0701719522476196,
+ "trainrewards/accuracies": 0.9791667461395264,
+ "trainrewards/chosen": 1.6328125,
+ "trainrewards/margins": 2.765625,
+ "trainrewards/rejected": -1.1328125
+ },
+ {
+ "epoch": 0.77,
+ "grad_norm": 3.309757754883857,
+ "learning_rate": 7.322330470336314e-07,
+ "loss": 2.429,
+ "step": 104,
+ "trainloss/critic_chosen": 1.0845046043395996,
+ "trainloss/critic_rejected": 1.1261361837387085,
+ "trainloss/reward": 1.0845046043395996,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.6796875,
+ "trainrewards/margins": 2.71875,
+ "trainrewards/rejected": -1.0234375
+ },
+ {
+ "epoch": 0.78,
+ "grad_norm": 2.4159819079210556,
+ "learning_rate": 6.865640724692815e-07,
+ "loss": 2.3868,
+ "step": 105,
+ "trainloss/critic_chosen": 1.0498684644699097,
+ "trainloss/critic_rejected": 1.131639003753662,
+ "trainloss/reward": 1.0498684644699097,
+ "trainrewards/accuracies": 0.9687500596046448,
+ "trainrewards/chosen": 1.5,
+ "trainrewards/margins": 2.90625,
+ "trainrewards/rejected": -1.3984375
+ },
+ {
+ "epoch": 0.79,
+ "grad_norm": 2.630877225161229,
+ "learning_rate": 6.421379363065142e-07,
+ "loss": 2.4745,
+ "step": 106,
+ "trainloss/critic_chosen": 1.0781538486480713,
+ "trainloss/critic_rejected": 1.1649324893951416,
+ "trainloss/reward": 1.0781538486480713,
+ "trainrewards/accuracies": 0.9375,
+ "trainrewards/chosen": 1.5,
+ "trainrewards/margins": 2.78125,
+ "trainrewards/rejected": -1.28125
+ },
+ {
+ "epoch": 0.8,
+ "grad_norm": 2.71300394057213,
+ "learning_rate": 5.989850859999227e-07,
+ "loss": 2.4433,
+ "step": 107,
+ "trainloss/critic_chosen": 1.0875132083892822,
+ "trainloss/critic_rejected": 1.1300991773605347,
+ "trainloss/reward": 1.0875132083892822,
+ "trainrewards/accuracies": 0.9635416865348816,
+ "trainrewards/chosen": 1.4140625,
+ "trainrewards/margins": 3.109375,
+ "trainrewards/rejected": -1.703125
+ },
+ {
+ "epoch": 0.8,
+ "grad_norm": 2.722489376200587,
+ "learning_rate": 5.571350963575728e-07,
+ "loss": 2.467,
+ "step": 108,
+ "trainloss/critic_chosen": 1.0709630250930786,
+ "trainloss/critic_rejected": 1.154178500175476,
+ "trainloss/reward": 1.0709630250930786,
+ "trainrewards/accuracies": 0.9479166865348816,
+ "trainrewards/chosen": 1.359375,
+ "trainrewards/margins": 2.859375,
+ "trainrewards/rejected": -1.5
+ },
+ {
+ "epoch": 0.81,
+ "grad_norm": 3.255161744830997,
+ "learning_rate": 5.166166492719124e-07,
+ "loss": 2.4854,
+ "step": 109,
+ "trainloss/critic_chosen": 1.1081310510635376,
+ "trainloss/critic_rejected": 1.152534008026123,
+ "trainloss/reward": 1.1081310510635376,
+ "trainrewards/accuracies": 0.973958432674408,
+ "trainrewards/chosen": 1.34375,
+ "trainrewards/margins": 2.96875,
+ "trainrewards/rejected": -1.6328125
+ },
+ {
+ "epoch": 0.82,
+ "grad_norm": 2.762498507683836,
+ "learning_rate": 4.774575140626317e-07,
+ "loss": 2.4388,
+ "step": 110,
+ "trainloss/critic_chosen": 1.065203070640564,
+ "trainloss/critic_rejected": 1.0969582796096802,
+ "trainloss/reward": 1.065203070640564,
+ "trainrewards/accuracies": 0.9635416269302368,
+ "trainrewards/chosen": 1.3046875,
+ "trainrewards/margins": 2.609375,
+ "trainrewards/rejected": -1.3125
+ },
+ {
+ "epoch": 0.83,
+ "grad_norm": 2.780757216314426,
+ "learning_rate": 4.396845284449608e-07,
+ "loss": 2.4319,
+ "step": 111,
+ "trainloss/critic_chosen": 1.083713173866272,
+ "trainloss/critic_rejected": 1.119750738143921,
+ "trainloss/reward": 1.083713173866272,
+ "trainrewards/accuracies": 0.9687500596046448,
+ "trainrewards/chosen": 1.7421875,
+ "trainrewards/margins": 3.03125,
+ "trainrewards/rejected": -1.296875
+ },
+ {
+ "epoch": 0.83,
+ "grad_norm": 3.7107544004289323,
+ "learning_rate": 4.033235801364402e-07,
+ "loss": 2.4846,
+ "step": 112,
+ "trainloss/critic_chosen": 1.106475830078125,
+ "trainloss/critic_rejected": 1.1211233139038086,
+ "trainloss/reward": 1.106475830078125,
+ "trainrewards/accuracies": 0.9322916865348816,
+ "trainrewards/chosen": 1.7421875,
+ "trainrewards/margins": 2.703125,
+ "trainrewards/rejected": -0.96484375
+ },
+ {
+ "epoch": 0.84,
+ "grad_norm": 3.3073751739512787,
+ "learning_rate": 3.683995891147696e-07,
+ "loss": 2.4629,
+ "step": 113,
+ "trainloss/critic_chosen": 1.0521959066390991,
+ "trainloss/critic_rejected": 1.173767328262329,
+ "trainloss/reward": 1.0521959066390991,
+ "trainrewards/accuracies": 0.9531250596046448,
+ "trainrewards/chosen": 1.828125,
+ "trainrewards/margins": 2.921875,
+ "trainrewards/rejected": -1.0859375
+ },
+ {
+ "epoch": 0.85,
+ "grad_norm": 2.99774406823753,
+ "learning_rate": 3.3493649053890325e-07,
+ "loss": 2.536,
+ "step": 114,
+ "trainloss/critic_chosen": 1.1110682487487793,
+ "trainloss/critic_rejected": 1.155356526374817,
+ "trainloss/reward": 1.1110682487487793,
+ "trainrewards/accuracies": 0.9270833730697632,
+ "trainrewards/chosen": 1.5859375,
+ "trainrewards/margins": 2.75,
+ "trainrewards/rejected": -1.1640625
+ },
+ {
+ "epoch": 0.86,
+ "grad_norm": 3.513330205624271,
+ "learning_rate": 3.0295721834508686e-07,
+ "loss": 2.4707,
+ "step": 115,
+ "trainloss/critic_chosen": 1.0783016681671143,
+ "trainloss/critic_rejected": 1.1236062049865723,
+ "trainloss/reward": 1.0783016681671143,
+ "trainrewards/accuracies": 0.9270833134651184,
+ "trainrewards/chosen": 1.703125,
+ "trainrewards/margins": 2.671875,
+ "trainrewards/rejected": -0.97265625
+ },
+ {
+ "epoch": 0.86,
+ "grad_norm": 2.800803642232422,
+ "learning_rate": 2.7248368952908055e-07,
+ "loss": 2.4803,
+ "step": 116,
+ "trainloss/critic_chosen": 1.080200433731079,
+ "trainloss/critic_rejected": 1.1515780687332153,
+ "trainloss/reward": 1.080200433731079,
+ "trainrewards/accuracies": 0.9375,
+ "trainrewards/chosen": 1.5546875,
+ "trainrewards/margins": 2.5625,
+ "trainrewards/rejected": -1.0
+ },
+ {
+ "epoch": 0.87,
+ "grad_norm": 2.7889069352140585,
+ "learning_rate": 2.43536789125349e-07,
+ "loss": 2.4905,
+ "step": 117,
+ "trainloss/critic_chosen": 1.088797688484192,
+ "trainloss/critic_rejected": 1.1520254611968994,
+ "trainloss/reward": 1.088797688484192,
+ "trainrewards/accuracies": 0.9375,
+ "trainrewards/chosen": 1.5,
+ "trainrewards/margins": 2.515625,
+ "trainrewards/rejected": -1.0078125
+ },
+ {
+ "epoch": 0.88,
+ "grad_norm": 2.931939214492335,
+ "learning_rate": 2.1613635589349756e-07,
+ "loss": 2.3937,
+ "step": 118,
+ "trainloss/critic_chosen": 1.0556377172470093,
+ "trainloss/critic_rejected": 1.1278637647628784,
+ "trainloss/reward": 1.0556377172470093,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.4375,
+ "trainrewards/margins": 2.46875,
+ "trainrewards/rejected": -1.03125
+ },
+ {
+ "epoch": 0.89,
+ "grad_norm": 2.954190958849344,
+ "learning_rate": 1.9030116872178317e-07,
+ "loss": 2.418,
+ "step": 119,
+ "trainloss/critic_chosen": 1.0978080034255981,
+ "trainloss/critic_rejected": 1.1148779392242432,
+ "trainloss/reward": 1.0978080034255981,
+ "trainrewards/accuracies": 0.9687500596046448,
+ "trainrewards/chosen": 1.4453125,
+ "trainrewards/margins": 2.46875,
+ "trainrewards/rejected": -1.03125
+ },
+ {
+ "epoch": 0.89,
+ "grad_norm": 2.9483773523832353,
+ "learning_rate": 1.6604893375699594e-07,
+ "loss": 2.4694,
+ "step": 120,
+ "trainloss/critic_chosen": 1.1018942594528198,
+ "trainloss/critic_rejected": 1.1370322704315186,
+ "trainloss/reward": 1.1018942594528198,
+ "trainrewards/accuracies": 0.9322916865348816,
+ "trainrewards/chosen": 1.3515625,
+ "trainrewards/margins": 2.40625,
+ "trainrewards/rejected": -1.0546875
+ },
+ {
+ "epoch": 0.9,
+ "grad_norm": 2.9215978058037764,
+ "learning_rate": 1.4339627226955394e-07,
+ "loss": 2.4822,
+ "step": 121,
+ "trainloss/critic_chosen": 1.1052017211914062,
+ "trainloss/critic_rejected": 1.148177981376648,
+ "trainloss/reward": 1.1052017211914062,
+ "trainrewards/accuracies": 0.9531250596046448,
+ "trainrewards/chosen": 1.3515625,
+ "trainrewards/margins": 2.515625,
+ "trainrewards/rejected": -1.1640625
+ },
+ {
+ "epoch": 0.91,
+ "grad_norm": 2.8843923021301667,
+ "learning_rate": 1.223587092621162e-07,
+ "loss": 2.4942,
+ "step": 122,
+ "trainloss/critic_chosen": 1.0736005306243896,
+ "trainloss/critic_rejected": 1.168089509010315,
+ "trainloss/reward": 1.0736005306243896,
+ "trainrewards/accuracies": 0.9322916865348816,
+ "trainrewards/chosen": 1.3359375,
+ "trainrewards/margins": 2.328125,
+ "trainrewards/rejected": -0.99609375
+ },
+ {
+ "epoch": 0.92,
+ "grad_norm": 2.8724683106941193,
+ "learning_rate": 1.0295066282951738e-07,
+ "loss": 2.4881,
+ "step": 123,
+ "trainloss/critic_chosen": 1.09504234790802,
+ "trainloss/critic_rejected": 1.1352362632751465,
+ "trainloss/reward": 1.09504234790802,
+ "trainrewards/accuracies": 0.9322916865348816,
+ "trainrewards/chosen": 1.4375,
+ "trainrewards/margins": 2.3125,
+ "trainrewards/rejected": -0.87109375
+ },
+ {
+ "epoch": 0.92,
+ "grad_norm": 3.0064917280475045,
+ "learning_rate": 8.518543427732951e-08,
+ "loss": 2.5066,
+ "step": 124,
+ "trainloss/critic_chosen": 1.0965893268585205,
+ "trainloss/critic_rejected": 1.12990403175354,
+ "trainloss/reward": 1.0965893268585205,
+ "trainrewards/accuracies": 0.9166667461395264,
+ "trainrewards/chosen": 1.4375,
+ "trainrewards/margins": 2.3125,
+ "trainrewards/rejected": -0.8828125
+ },
+ {
+ "epoch": 0.93,
+ "grad_norm": 2.6882210161223425,
+ "learning_rate": 6.907519900580862e-08,
+ "loss": 2.3973,
+ "step": 125,
+ "trainloss/critic_chosen": 1.0724809169769287,
+ "trainloss/critic_rejected": 1.1239736080169678,
+ "trainloss/reward": 1.0724809169769287,
+ "trainrewards/accuracies": 0.9687500596046448,
+ "trainrewards/chosen": 1.546875,
+ "trainrewards/margins": 2.5625,
+ "trainrewards/rejected": -1.015625
+ },
+ {
+ "epoch": 0.94,
+ "grad_norm": 3.2130299463812233,
+ "learning_rate": 5.463099816548578e-08,
+ "loss": 2.4583,
+ "step": 126,
+ "trainloss/critic_chosen": 1.053167700767517,
+ "trainloss/critic_rejected": 1.1157554388046265,
+ "trainloss/reward": 1.053167700767517,
+ "trainrewards/accuracies": 0.9270833730697632,
+ "trainrewards/chosen": 1.390625,
+ "trainrewards/margins": 2.171875,
+ "trainrewards/rejected": -0.7890625
+ },
+ {
+ "epoch": 0.95,
+ "grad_norm": 2.5537231163007004,
+ "learning_rate": 4.186273109011374e-08,
+ "loss": 2.5432,
+ "step": 127,
+ "trainloss/critic_chosen": 1.1048160791397095,
+ "trainloss/critic_rejected": 1.1709802150726318,
+ "trainloss/reward": 1.1048160791397095,
+ "trainrewards/accuracies": 0.9270833134651184,
+ "trainrewards/chosen": 1.234375,
+ "trainrewards/margins": 2.296875,
+ "trainrewards/rejected": -1.0546875
+ },
+ {
+ "epoch": 0.95,
+ "grad_norm": 3.455571318987563,
+ "learning_rate": 3.077914851215585e-08,
+ "loss": 2.4356,
+ "step": 128,
+ "trainloss/critic_chosen": 1.0750889778137207,
+ "trainloss/critic_rejected": 1.1610357761383057,
+ "trainloss/reward": 1.0750889778137207,
+ "trainrewards/accuracies": 0.9635416865348816,
+ "trainrewards/chosen": 1.734375,
+ "trainrewards/margins": 2.625,
+ "trainrewards/rejected": -0.89453125
+ },
+ {
+ "epoch": 0.96,
+ "grad_norm": 3.095008653826813,
+ "learning_rate": 2.1387846565474047e-08,
+ "loss": 2.4339,
+ "step": 129,
+ "trainloss/critic_chosen": 1.0719571113586426,
+ "trainloss/critic_rejected": 1.1504600048065186,
+ "trainloss/reward": 1.0719571113586426,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.6328125,
+ "trainrewards/margins": 2.59375,
+ "trainrewards/rejected": -0.96484375
+ },
+ {
+ "epoch": 0.97,
+ "grad_norm": 3.014103190303491,
+ "learning_rate": 1.3695261579316776e-08,
+ "loss": 2.4359,
+ "step": 130,
+ "trainloss/critic_chosen": 1.0608913898468018,
+ "trainloss/critic_rejected": 1.1623928546905518,
+ "trainloss/reward": 1.0608913898468018,
+ "trainrewards/accuracies": 0.9791667461395264,
+ "trainrewards/chosen": 1.5078125,
+ "trainrewards/margins": 2.453125,
+ "trainrewards/rejected": -0.94921875
+ },
+ {
+ "epoch": 0.98,
+ "grad_norm": 3.5824668969231825,
+ "learning_rate": 7.70666566718009e-09,
+ "loss": 2.457,
+ "step": 131,
+ "trainloss/critic_chosen": 1.0644171237945557,
+ "trainloss/critic_rejected": 1.1539928913116455,
+ "trainloss/reward": 1.0644171237945557,
+ "trainrewards/accuracies": 0.9583333730697632,
+ "trainrewards/chosen": 1.578125,
+ "trainrewards/margins": 2.5,
+ "trainrewards/rejected": -0.91796875
+ },
+ {
+ "epoch": 0.98,
+ "grad_norm": 3.085175163660414,
+ "learning_rate": 3.4261631135654174e-09,
+ "loss": 2.4695,
+ "step": 132,
+ "trainloss/critic_chosen": 1.0733301639556885,
+ "trainloss/critic_rejected": 1.1289600133895874,
+ "trainloss/reward": 1.0733301639556885,
+ "trainrewards/accuracies": 0.9427083730697632,
+ "trainrewards/chosen": 1.484375,
+ "trainrewards/margins": 2.296875,
+ "trainrewards/rejected": -0.80859375
+ },
+ {
+ "epoch": 0.99,
+ "grad_norm": 2.6627177185366087,
+ "learning_rate": 8.566875611068503e-10,
+ "loss": 2.456,
+ "step": 133,
+ "trainloss/critic_chosen": 1.0969208478927612,
+ "trainloss/critic_rejected": 1.1649024486541748,
+ "trainloss/reward": 1.0969208478927612,
+ "trainrewards/accuracies": 0.96875,
+ "trainrewards/chosen": 1.4765625,
+ "trainrewards/margins": 2.59375,
+ "trainrewards/rejected": -1.1171875
+ },
+ {
+ "epoch": 1.0,
+ "grad_norm": 2.6690531080971973,
+ "learning_rate": 0.0,
+ "loss": 2.4519,
+ "step": 134,
+ "trainloss/critic_chosen": 1.090218186378479,
+ "trainloss/critic_rejected": 1.128463625907898,
+ "trainloss/reward": 1.090218186378479,
+ "trainrewards/accuracies": 0.9531250596046448,
+ "trainrewards/chosen": 1.5078125,
+ "trainrewards/margins": 2.53125,
+ "trainrewards/rejected": -1.0234375
+ },
+ {
+ "epoch": 1.0,
+ "step": 134,
+ "total_flos": 0.0,
+ "train_loss": 2.6233635464710976,
+ "train_runtime": 32287.388,
+ "train_samples_per_second": 0.799,
+ "train_steps_per_second": 0.004
+ }
+ ],
+ "logging_steps": 1.0,
+ "max_steps": 134,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 1000,
+ "total_flos": 0.0,
+ "train_batch_size": 1,
+ "trial_name": null,
+ "trial_params": null
+ }