Hugging Face has introduced FinePDFs, the largest publicly available corpus of PDFs, comprising 475 million documents in 1,733 languages and totaling approximately 3 trillion tokens. At 3.65 terabytes, the dataset expands the pool of open training resources with high-quality, domain-specific content, especially in fields such as law and academia. PDFs have historically been hard to process because of formatting issues and varied text-extraction needs, so FinePDFs uses a dual strategy: Docling for text extraction and RolmOCR for GPU-powered OCR. This keeps processing efficient while preserving quality. The dataset contains over 1.1 trillion tokens in English, with substantial contributions from Spanish, German, and French, among other languages. In evaluations with 1.67B-parameter models, FinePDFs performed comparably to state-of-the-art HTML datasets. It is released under the Open Data Commons Attribution license and is available through the Hugging Face Hub and the Datatrove library, supporting research and development and promoting data transparency in AI.
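To put the headline figures in proportion, the short Python sketch below derives a few ratios from the numbers quoted above. The inputs are the summary's rounded values, decimal terabytes are an assumption, and the function and constant names are illustrative only, so the results are rough estimates rather than official statistics.

```python
# Back-of-the-envelope ratios from the figures quoted in the summary.
# All inputs are rounded values from the text, not measured statistics.
TOTAL_DOCS = 475_000_000               # 475 million documents
TOTAL_TOKENS = 3_000_000_000_000       # approximately 3 trillion tokens
ENGLISH_TOKENS = 1_100_000_000_000     # "over 1.1 trillion", so a lower bound
SIZE_BYTES = 3.65e12                   # 3.65 TB, assuming decimal terabytes


def corpus_stats():
    """Return rough per-document and per-token ratios for the corpus."""
    return {
        "tokens_per_doc": TOTAL_TOKENS / TOTAL_DOCS,
        "english_share": ENGLISH_TOKENS / TOTAL_TOKENS,
        "bytes_per_token": SIZE_BYTES / TOTAL_TOKENS,
    }


if __name__ == "__main__":
    for name, value in corpus_stats().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the average document runs to roughly 6,300 tokens, and English accounts for a little over a third of all tokens.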