Adobe Faces Class-Action Lawsuit Over Alleged Use of Pirated Books in AI Model Training
The artificial intelligence sector continues to grapple with escalating legal challenges over the use of copyrighted materials in training datasets as companies race to develop advanced language models. A recent proposed class-action lawsuit against Adobe highlights these tensions, accusing the software giant of incorporating unauthorized copies of authors’ works into its AI systems. The case underscores broader concerns about intellectual property rights in an industry that analysts project will reach $1.8 trillion in market value by 2030, and its outcome could reshape how AI firms source and validate training data.
Adobe's SlimLM Model at the Center of Copyright Dispute
Allegations in the Elizabeth Lyon Lawsuit
Elizabeth Lyon, an Oregon-based author specializing in non-fiction writing guidebooks, has filed a proposed class-action lawsuit against Adobe, claiming the company misused pirated versions of her books and those of other authors to train its SlimLM language model. The suit alleges that Adobe’s training process involved datasets containing unauthorized reproductions, violating copyright laws without consent, credit, or compensation. Key details from the complaint include:
- Lyon’s works were part of a processed subset derived from the SlimPajama dataset, which the lawsuit describes as “a derivative copy of the RedPajama dataset, including the Books3 dataset.”
- Books3 comprises approximately 191,000 books, a collection that has fueled multiple infringement claims across the tech sector.
- The suit seeks damages and an injunction to prevent further use of such materials, with the proposed class potentially covering thousands of authors.
This litigation, filed in U.S. federal court, reflects a pattern of individual creators challenging large corporations over AI development practices. Adobe had not publicly responded to the suit as of this writing, but the case could set precedents for liability in AI training pipelines.
Technical Background and Dataset Origins
Adobe’s SlimLM is described as an efficient small language model series optimized for on-device document assistance tasks, particularly on mobile platforms. The model was pre-trained on the SlimPajama-627B dataset, a deduplicated, multi-corpora open-source collection released by Cerebras in June 2023. This dataset, totaling 627 billion tokens, aims to provide cleaned data for AI research but traces its roots to the RedPajama dataset, which incorporates Books3. The interconnected nature of these datasets raises analytical questions about provenance tracking in AI development:
- RedPajama was built as an open reproduction of the data recipe behind Meta’s LLaMA model, but its inclusion of Books3, a scraped library of pirated e-books, has drawn scrutiny for lacking proper licensing.
- SlimPajama’s processing steps, intended to remove duplicates and low-quality content, did not fully excise copyrighted materials, according to the lawsuit (a simplified sketch of this style of deduplication follows the list).
- Historical context: since 2023, AI firms have increasingly relied on such large-scale, web-scraped corpora to train models, with datasets like Books3 enabling rapid scaling but exposing companies to legal risks that, by some estimates, could cost the industry billions in settlements.
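To make the deduplication point concrete, here is a minimal near-duplicate filter in the general style such pipelines use. It is an illustrative sketch only, not Cerebras’ published pipeline: it computes exact Jaccard similarity over lowercase character shingles, where a production system would approximate that with MinHash and locality-sensitive hashing to handle billions of documents, and the shingle length and 0.8 threshold are assumptions chosen for the demo.

```python
# Illustrative near-duplicate filter of the kind used to build deduplicated
# corpora such as SlimPajama. A simplified sketch, not Cerebras' actual
# pipeline: exact Jaccard over lowercase 13-character shingles instead of
# MinHash + LSH, and parameters chosen purely for demonstration.

def shingles(text: str, n: int = 13) -> set[str]:
    """Lowercase character n-gram shingles of a document."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each document unless it nearly duplicates one already kept.

    Note what this does NOT do: it says nothing about copyright status.
    A pirated book that appears only once passes straight through.
    """
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, ksh) < threshold for _, ksh in kept):
            kept.append((doc, sh))
    return [doc for doc, _ in kept]

if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog!",  # near-duplicate
        "An entirely different document about something else.",
    ]
    print(dedup(corpus))  # the near-duplicate is dropped; unique texts survive
```

The point the lawsuit turns on is visible in the logic itself: the filter compares documents only to one another, so cleaning for duplicates and quality leaves a pirated book that appears once in the corpus untouched.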
The dataset lineage itself is publicly documented in the respective release notes, though the exact extent of copyrighted content in SlimPajama remains subject to ongoing discovery in the case.
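That lineage also makes the provenance question partially checkable in code. The sketch below assumes the record schema documented on the dataset’s Hugging Face card, where each example carries a meta.redpajama_set_name label naming its RedPajama source corpus (the book-derived subset is labeled “RedPajamaBook”); it streams a small sample rather than the full corpus, and field names should be verified against the current dataset card.

```python
# Hedged sketch: sample SlimPajama and tally per-record provenance labels.
# Assumes the schema on the cerebras/SlimPajama-627B dataset card, where
# each record looks like {"text": ..., "meta": {"redpajama_set_name": ...}}.
from collections import Counter

from datasets import load_dataset  # Hugging Face `datasets` library

# Streaming avoids downloading the full multi-terabyte corpus.
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

counts = Counter()
for i, record in enumerate(stream):
    counts[record["meta"]["redpajama_set_name"]] += 1
    if i + 1 >= 10_000:  # a small sample, not a census of all 627B tokens
        break

# "RedPajamaBook" records are the book-derived slice at issue in the suit.
for source, n in counts.most_common():
    print(f"{source}: {n}")
```

A sample like this can show that book-derived records are present in the corpus, but only discovery in the case can establish which specific titles they contain.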
Broader Industry Implications and Precedent-Setting Cases
This lawsuit is part of a wave of similar actions targeting AI companies over dataset practices, signaling potential regulatory shifts in how intellectual property is handled in machine learning. In September 2025, Apple faced a comparable suit alleging use of RedPajama for its Apple Intelligence model, and Salesforce faced similar claims in October over the same dataset. A notable precedent came in September 2025, when Anthropic agreed to pay authors $1.5 billion to settle accusations that it used pirated works to train its Claude chatbot. That settlement, involving multiple plaintiffs, marked a significant financial acknowledgment of infringement risk and has influenced ongoing negotiations across the sector.
"The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3)," the lawsuit states, emphasizing the chain of unauthorized derivations.
These cases highlight broader societal impacts, including an erosion of trust between creators and tech firms and economic pressure on authors whose works already generate minimal royalties amid AI-driven content generation. By legal tracking counts, more than 20 major AI copyright lawsuits have been filed since 2023, with settlements averaging in the hundreds of millions of dollars. For Adobe, whose AI initiatives such as Firefly have driven a 15% revenue uptick in Creative Cloud services this fiscal year, the outcome could shape investment in ethical data sourcing and compliance frameworks. As AI adoption accelerates, with global enterprise spending on generative AI expected to hit $200 billion by 2025, these disputes are prompting a reevaluation of open-source data ethics, and of how AI development can balance innovation with creator rights.
