Encyclopedia Britannica and Merriam-Webster just dropped a major copyright lawsuit on OpenAI, alleging the AI giant trained GPT-4 on their premium content without permission—and now the model spits out near-verbatim copies on demand. Filed Friday, the suit claims OpenAI systematically scraped and memorized copyrighted encyclopedia entries and dictionary definitions, adding fresh fuel to the already blazing debate over AI training practices and intellectual property rights.
OpenAI is facing another heavyweight copyright battle. Encyclopedia Britannica and dictionary publisher Merriam-Webster filed a lawsuit Friday alleging the AI company systematically copied their copyrighted content to train ChatGPT, then generated responses that reproduce their work without permission or attribution, Reuters first reported.
The allegations are pretty damning. According to the complaint filed in federal court, GPT-4 has essentially memorized huge chunks of Britannica's encyclopedia entries. "GPT-4 itself has 'memorized' much of Britannica's copyrighted content and will output near-verbatim copies of significant portions on demand," the lawsuit states. "The memorized examples are unauthorized copies that [OpenAI] used to train their models, including GPT-4."
This isn't just about a few snippets here and there. The publishers claim OpenAI engaged in a pattern of wholesale copying, scraping their carefully curated reference materials to build out the knowledge base that makes ChatGPT seem so authoritative. When users ask ChatGPT for information, they're often getting responses that closely mirror—sometimes word-for-word—content that Britannica and Merriam-Webster spent resources creating and fact-checking.
The timing couldn't be more pointed. OpenAI has been on a signing spree lately, cutting licensing deals with major publishers like News Corp and the Associated Press to legitimize its training data sources. But those agreements came after the company had already trained its models on massive amounts of web content—including, the lawsuit alleges, premium reference materials that were never meant to be free for commercial AI training.
Britannica and Merriam-Webster join a growing chorus of publishers and creators taking legal action against AI companies. The New York Times sued OpenAI and Microsoft in December 2023 over similar allegations of copyright infringement. Authors including John Grisham and George R.R. Martin have filed their own suits. Visual artists are challenging image generators like Stable Diffusion and Midjourney. The legal landscape is getting messy, fast.
What makes this case particularly interesting is who's bringing it. Encyclopedia Britannica isn't some scrappy startup crying foul—it's a 255-year-old institution that's been the gold standard for reference content since before the internet existed. Merriam-Webster has been America's dictionary authority since 1828. These are companies that built entire business models around being trusted, curated sources of information. Now they're watching an AI model trained on their work compete directly with them for user attention.
The legal argument centers on whether AI training constitutes fair use—a question that's becoming the defining intellectual property battle of the AI era. OpenAI and other AI companies have argued that training models on publicly available content is transformative use protected under copyright law. Publishers and creators counter that systematic copying for commercial purposes can't be fair use, especially when the resulting AI outputs can substitute for the original works.
The stakes are enormous. If Britannica and Merriam-Webster prevail, it could force OpenAI and other AI companies to either pay licensing fees for training data or strip their models of knowledge derived from copyrighted sources. That could fundamentally reshape how large language models are built and trained. It might also create a lucrative new revenue stream for content owners who've watched AI companies build billion-dollar valuations on the back of their intellectual property.
OpenAI hasn't publicly responded to the lawsuit yet, but the company has consistently maintained that its training practices fall within legal bounds. In previous cases, OpenAI has argued that its models learn patterns and concepts from training data rather than storing and reproducing content verbatim. The Britannica complaint directly challenges that narrative with allegations of memorization and near-verbatim reproduction.
For context, GPT-4 was trained on hundreds of billions of words scraped from the internet, books, and other text sources. The model reportedly cost over $100 million to train, and OpenAI has been notoriously secretive about exactly what data went into it. The company has said it filters out copyrighted content when possible, but clearly publishers like Britannica believe their content made it into the training mix anyway.
This lawsuit could also have ripple effects beyond OpenAI. Google, Meta, and other tech giants are all racing to build more powerful AI models, and they're all grappling with the same fundamental question about training data. If courts establish that reference content requires licensing, it'll reshape the entire AI industry's approach to dataset curation.
The Britannica lawsuit is another flashpoint in the ongoing war over AI training data and copyright. With established reference publishers now joining newspapers, authors, and artists in challenging OpenAI's practices, the company faces mounting legal and financial pressure to either prove its training methods constitute fair use or start paying for the content that powers its models. The outcome will likely set precedent for the entire AI industry and determine whether content creators can claw back control—and compensation—for intellectual property that's already been ingested by dozens of large language models. Watch for OpenAI's response and whether other reference publishers follow Britannica's lead.