Hugging Face Unveils FinePDFs: A Massive 3-Trillion-Token Dataset Derived from PDF Sources

Hugging Face has introduced FinePDFs, the largest publicly available corpus built from PDFs, comprising 475 million documents in 1,733 languages and totaling approximately 3 trillion tokens. At 3.65 terabytes, the dataset expands the pool of open training resources with high-quality, domain-specific content, especially in fields such as law and academia. PDFs have historically been hard to process because of formatting issues and varied text-extraction needs. To handle them, FinePDFs uses a dual strategy: Docling for text extraction and RolmOCR for GPU-powered OCR, which keeps processing efficient while preserving quality. English accounts for over 1.1 trillion tokens, with substantial contributions from Spanish, German, French, and other languages. In evaluations with 1.67B-parameter models, FinePDFs performed comparably to state-of-the-art HTML-based datasets. The dataset is released under the Open Data Commons Attribution license and is available through the Hugging Face Hub and the Datatrove library, supporting research and development and promoting data transparency in AI.
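To make the dual strategy concrete, here is a minimal Python sketch of how a pipeline might route a document to either plain text extraction or OCR. The article does not say how FinePDFs makes this choice, so the routing rule, the `PdfPage` type, the `choose_pipeline` function, and both thresholds are hypothetical illustrations, not the project's actual method or the Docling or RolmOCR APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PdfPage:
    # Hypothetical stand-in for a parsed page: the text layer already
    # embedded in the PDF (empty for pages that are only scanned images).
    embedded_text: str


def choose_pipeline(pages: List[PdfPage],
                    min_chars_per_page: int = 50,
                    min_text_page_ratio: float = 0.5) -> str:
    """Pick an extraction path for one document.

    Assumed heuristic (not from the article): if enough pages carry a
    usable embedded text layer, cheap text extraction is sufficient;
    otherwise the document is sent to the GPU-based OCR path.
    """
    if not pages:
        return "ocr"
    # Count pages whose embedded text is long enough to be considered usable.
    text_pages = sum(
        1 for page in pages
        if len(page.embedded_text.strip()) >= min_chars_per_page
    )
    if text_pages / len(pages) >= min_text_page_ratio:
        return "text_extraction"
    return "ocr"
```

Under these assumptions, a born-digital report with a full text layer would take the text-extraction path, while a scanned court filing with no embedded text would go to OCR, reserving GPU time for the documents that need it.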
