HARP: A challenging human-annotated math reasoning benchmark Paper • 2412.08819 • Published Dec 11, 2024
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training Paper • 2501.18511 • Published 7 days ago
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models Paper • 2501.09653 • Published 21 days ago