Compiled models train faster, so more of them can be trained to better convergence within a short experiment. 921107d verified SQCU committed 20 days ago
89,301,000-parameter attention_ii model with z-loss, trained for 6,250 steps at batchsize:4*32, device_batchsize:32. 8a69386 verified SQCU committed 22 days ago
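The "z-lossed" model above presumably adds an auxiliary z-loss term that penalizes the squared log-partition function of the logits, keeping their scale from drifting. The exact coefficient and formulation used in this repo are not stated; a minimal NumPy sketch of the common form, with an assumed coefficient of 1e-4:

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    """Auxiliary z-loss: coeff * mean(logsumexp(logits)**2).

    Penalizes large log-partition values so logits stay well-scaled.
    Coefficient 1e-4 is an assumption, not the repo's actual setting.
    """
    # numerically stabilized logsumexp over the vocabulary axis
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return coeff * np.mean(lse ** 2)
```

During training this term is simply added to the cross-entropy loss; for all-zero logits over a vocabulary of size V it evaluates to `coeff * log(V)**2`.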
Sling the illustrious and mysterious "attention_II" models; also some layerwise RMSNorm and QK-projection RMSNorm models, one twice as large as the other. 1f45909 verified SQCU committed 22 days ago
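The "qkprojection rmsnorm" variants presumably normalize the query and key projections before the attention scores are formed (QK-norm), which bounds the logit scale of the softmax. The repo's exact layer layout isn't shown; a single-head NumPy sketch under that assumption:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    # rescale each vector so its root-mean-square is ~1, then apply a learned gain
    return gain * x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def qk_normed_attention(x, wq, wk, wv):
    """Hypothetical single-head attention with RMSNorm on the q/k projections.

    Weight names and unit gains are illustrative, not the repo's parameters.
    """
    d = wq.shape[1]
    q = rmsnorm(x @ wq, np.ones(d))   # QK-norm: normalize after projection
    k = rmsnorm(x @ wk, np.ones(d))
    v = x @ wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilized softmax
    probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return probs @ v
```

Because q and k each have RMS ~1 after normalization, the pre-softmax scores are bounded regardless of how large the projection weights grow.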