Reasoning Datasets Collection Distilled synthetic Reasoning datasets • 7 items • Updated 4 days ago • 44
Article Mastering Long Contexts in LLMs with KVPress By nvidia and 1 other • 14 days ago • 59
Article Explore, Curate and Vector Search Any Hugging Face Dataset with Nomic Atlas By MaxNomic and 4 others • 14 days ago • 29
Article Exploring Synthetic Data Generation with DataDreamer By asoria • 16 days ago • 6
Towards Best Practices for Open Datasets for LLM Training Paper • 2501.08365 • Published 23 days ago • 53
high-quality Chinese training datasets Collection A suite of high-quality Chinese datasets for pretraining, fine-tuning, or preference alignment, along with the models trained on them. • 12 items • Updated 20 days ago • 9
Article Synthetic Data Generation with FastData and Hugging Face By asoria • 30 days ago • 14
Article Finding Moroccan Arabic (Darija) in Fineweb 2 By omarkamali and 3 others • Dec 8, 2024 • 21
Article Bridging the Gap Between Physical Numerical Simulations and Machine Learning: Introducing The Well By rubenohana • Dec 2, 2024 • 17
OLMo 2 Collection Artifacts for the second set of OLMo models. • 22 items • Updated about 1 month ago • 81
Marqo-Ecommerce-Embeddings Collection State-of-the-art embedding models fine-tuned for the ecommerce domain. +67% increase in evaluation metrics vs ViT-B-16-SigLIP. • 10 items • Updated Nov 14, 2024 • 17
NLI Eval Datasets Collection A curated collection of NLI evaluation datasets. Each dataset is exactly as originally proposed. • 19 items • Updated Nov 12, 2024 • 3
BhasaAnuvaad Collection A Speech Translation Dataset for 13 Indian Languages • 11 items • Updated 21 days ago • 14