The case for specialized pre-training: ultra-fast foundation models for dedicated tasks By Pclanglais • Aug 4, 2024 • 29
Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing By Pclanglais • Jul 19, 2024 • 20
Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM By Pclanglais • Apr 26, 2024 • 15
Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18, 2024 • 22
Releasing Common Corpus: the largest public domain dataset for training LLMs By Pclanglais • Mar 20, 2024 • 18