The AI industry’s demand for high-quality training data drives companies like Anthropic to acquire vast amounts of text, often from books. Large language models (LLMs), such as those powering ChatGPT and Claude, need billions of words to learn to generate coherent responses, and data quality matters: well-edited published works yield better models than lower-quality web text. Publishers, however, retain legal control over that content, making licensing negotiations slow and costly. To sidestep those negotiations, Anthropic initially downloaded pirated book collections; it later purchased millions of print books and scanned them destructively, invoking the first-sale doctrine to avoid licensing altogether. Facing legal repercussions, the company has since reconsidered this approach and is seeking safer, more legitimate sources of training data, underscoring the ongoing tension between AI companies and copyright law.
Anthropic Disposes of Millions of Print Books to Develop AI Models
