Papers
arxiv:2407.12838
Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Published on Jul 4, 2024
Authors:
Abstract
This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is specifically applied to the newly created dataset.
Models citing this paper 0
No model linking this paper
Cite arxiv.org/abs/2407.12838 in a model README.md to link it from this page.
Datasets citing this paper 0
No dataset linking this paper
Cite arxiv.org/abs/2407.12838 in a dataset README.md to link it from this page.
Spaces citing this paper 0
No Space linking this paper
Cite arxiv.org/abs/2407.12838 in a Space README.md to link it from this page.
Collections including this paper 0
No Collection including this paper
Add this paper to a
collection
to link it from this page.