Thrilled to introduce Adam-mini, an optimizer that achieves on-par or better performance than AdamW with a 45% to 50% smaller memory footprint. Adam-mini can also achieve 49.5% higher throughput than AdamW on Llama2-7B pre-training.
The design of Adam-mini is inspired by certain Hessian structures we observed on Transformers.
Feel free to try it out! Switch to Adam-mini with the same hyperparameters as AdamW and it should work with only half the memory. Hope Adam-mini can help save time, cost, and energy in your tasks!
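For illustration, here is a minimal sketch of the drop-in swap in PyTorch. The import path and constructor arguments are assumptions based on the announcement, not a verified API; check the official Adam-mini repository for the exact signature.

```python
import torch
from adam_mini import Adam_mini  # assumed import path; check the official repo

# Toy model and data purely for illustration.
model = torch.nn.Linear(16, 4)
x, y = torch.randn(32, 16), torch.randn(32, 4)

# Before: the usual AdamW setup.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
#                               betas=(0.9, 0.95), weight_decay=0.1)

# After: the same hyperparameters, passed to Adam-mini instead.
# The keyword arguments below are assumptions, not a verified signature.
optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# The training loop itself is unchanged.
for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```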
We are happy to introduce InstantStyle, a framework that employs simple yet effective techniques to disentangle style and content from reference images.
After giving GPU Programming a hands-on try, I have come to appreciate the level of complexity in AI compute:
- Existing/leading frameworks (CUDA, OpenCL, DSLs, even Triton) are still at the mercy of low-level compute details that demand deep understanding and experience.
- Optimization methods are often ambiguous and will drive you mad.
- Triton is cool but not cool enough: its high-level abstractions fall back to low-level compute issues as you build more specialized kernels (see the sketch after this list).
- As for CUDA, optimization requires considering all the major components of the GPU (DRAM, SRAM, ALUs).
- Models today require custom, hand-written GPU kernels to reduce storage and compute cost.
- GPTQ was a big save.
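To make the Triton point concrete: even a trivial fused element-wise kernel makes you manage program IDs, block sizes, and masked loads/stores explicitly, which are the same DRAM/SRAM questions CUDA raises. A minimal sketch, assuming a standard Triton install and a CUDA GPU; the kernel and wrapper names are just illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    # Loads and stores map directly onto DRAM traffic; the block size controls
    # how much work (and how many registers) each program instance uses.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program instance per 1024 elements
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Usage (requires a CUDA GPU):
# a = torch.randn(10_000, device="cuda")
# b = torch.randn(10_000, device="cuda")
# c = fused_add_relu(a, b)
```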
@karpathy is right: expertise in this area is scarce, and the reason is quite obvious. There is still so much uncertainty: we are still struggling to get peak performance from interconnected GPUs while maintaining precision and reducing cost.