AI-assisted German Employment Contract Review: A Benchmark Dataset Paper • 2501.17194 • Published 10 days ago • 1
Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data Paper • 2412.10121 • Published Dec 13, 2024 • 1
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies Paper • 2502.00894 • Published 4 days ago • 1
Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter Paper • 2501.14491 • Published 13 days ago • 1
Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models Paper • 2501.10322 • Published 20 days ago • 1
Towards Best Practices for Open Datasets for LLM Training Paper • 2501.08365 • Published 23 days ago • 53
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages Paper • 2501.08284 • Published 23 days ago • 6
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models Paper • 2501.04828 • Published 29 days ago • 11
PolInterviews -- A Dataset of German Politician Public Broadcast Interviews Paper • 2501.04484 • Published 29 days ago • 1
Article FineWeb2-C: Help Build Better Language Models in Your Language By davanstrien and 5 others • Dec 23, 2024 • 18
BabyHGRN: Exploring RNNs for Sample-Efficient Training of Language Models Paper • 2412.15978 • Published Dec 20, 2024 • 1
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images Paper • 2412.08802 • Published Dec 11, 2024 • 5
Evaluating Pixel Language Models on Non-Standardized Languages Paper • 2412.09084 • Published Dec 12, 2024 • 1
Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain Paper • 2412.09341 • Published Dec 12, 2024 • 1
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages Paper • 2412.09587 • Published Dec 12, 2024 • 3