Article Revisiting TemplateGSM: Advancing Mathematical Reasoning in Language Models with Template-based Data Generation By yifAI • Nov 14, 2024 • 2
🧠 Reasoning datasets Collection Datasets with reasoning traces for math and code released by the community • 11 items • Updated 3 days ago • 61
Tulu 3 Datasets Collection All datasets released with Tulu 3 -- state-of-the-art open post-training recipes. • 33 items • Updated 4 days ago • 68
ProX Refining Models Collection Adapted small language models used to generate data refining programs • 5 items • Updated Oct 10, 2024 • 2
Magpie-Qwen2 Datasets Collection Dataset built with Qwen2 72B and Qwen2 7B. • 6 items • Updated Jan 13 • 10
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 91
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence Paper • 2406.11931 • Published Jun 17, 2024 • 63
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published Jun 20, 2024 • 88
Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 77
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct Paper • 2405.14906 • Published May 23, 2024 • 25