MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published May 13, 2024 • 19
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach Paper • 2405.15613 • Published May 24, 2024 • 15
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20, 2024 • 15
How Do Large Language Models Acquire Factual Knowledge During Pretraining? Paper • 2406.11813 • Published Jun 17, 2024 • 31
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published Jun 17, 2024 • 50
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published Jun 17, 2024 • 62
From Pixels to Prose: A Large Dataset of Dense Image Captions Paper • 2406.10328 • Published Jun 14, 2024 • 18
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 17, 2024 • 21
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images Paper • 2406.13735 • Published Jun 19, 2024 • 5
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models Paper • 2406.14599 • Published Jun 20, 2024 • 17
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published Jun 28, 2024 • 97
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper • 2406.17720 • Published Jun 25, 2024 • 8
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published Jul 2, 2024 • 51
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild Paper • 2406.19380 • Published Jun 27, 2024 • 47
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge Paper • 2407.03958 • Published Jul 4, 2024 • 19
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions Paper • 2407.06358 • Published Jul 8, 2024 • 19
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes Paper • 2407.10957 • Published Jul 15, 2024 • 24
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus Paper • 2407.11144 • Published Jul 15, 2024 • 9
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks Paper • 2407.19795 • Published Jul 29, 2024 • 11
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation Paper • 2408.00205 • Published Aug 1, 2024 • 5
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation Paper • 2408.02629 • Published Aug 5, 2024 • 14
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine Paper • 2408.02900 • Published Aug 6, 2024 • 28
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond Paper • 2408.03900 • Published Aug 7, 2024 • 10
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models Paper • 2408.04594 • Published Aug 8, 2024 • 14
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads Paper • 2407.18245 • Published Jul 25, 2024 • 9
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays Paper • 2408.06281 • Published Aug 12, 2024 • 9
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning Paper • 2408.07089 • Published Aug 9, 2024 • 14
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning Paper • 2408.08441 • Published Aug 15, 2024 • 8
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images Paper • 2408.16176 • Published Aug 28, 2024 • 8
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution Paper • 2408.15993 • Published Aug 28, 2024 • 8
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts Paper • 2409.00447 • Published Aug 31, 2024 • 2
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation Paper • 2407.17438 • Published Jul 24, 2024 • 24
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation Paper • 2411.04709 • Published Nov 5, 2024 • 25
Improving the detection of technical debt in Java source code with an enriched dataset Paper • 2411.05457 • Published Nov 8, 2024 • 2
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models Paper • 2411.05830 • Published Nov 5, 2024 • 20
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper • 2411.07461 • Published Nov 12, 2024 • 22
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation Paper • 2411.08380 • Published Nov 13, 2024 • 25
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published Nov 19, 2024 • 51
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper • 2411.14794 • Published Nov 22, 2024 • 13
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published Dec 1, 2024 • 26
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information Paper • 2412.00947 • Published Dec 1, 2024 • 8
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation Paper • 2412.03304 • Published Dec 4, 2024 • 17
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing Paper • 2412.04280 • Published Dec 5, 2024 • 13
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 47
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks Paper • 2412.04626 • Published Dec 5, 2024 • 13
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations Paper • 2412.08580 • Published Dec 11, 2024 • 45
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation Paper • 2412.07147 • Published Dec 10, 2024 • 5
VisionArena: 230K Real World User-VLM Conversations with Preference Labels Paper • 2412.08687 • Published Dec 11, 2024 • 13
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training Paper • 2501.08197 • Published 23 days ago • 7
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models Paper • 2501.09653 • Published 21 days ago • 12
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Paper • 2501.15907 • Published 10 days ago • 15
OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas Paper • 2501.15427 • Published 11 days ago • 6
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training Paper • 2501.18511 • Published 7 days ago • 17
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation Paper • 2502.02589 • Published 2 days ago • 7
Generating Multi-Image Synthetic Data for Text-to-Image Customization Paper • 2502.01720 • Published 3 days ago • 4