Like almost every other major technology company, Adobe has been leaning heavily into AI over the past few years. The software company has launched a range of AI services since 2023, including Firefly, an AI-powered media generation suite. But the company's full embrace of the technology may now have caused problems, as a new lawsuit alleges that Adobe used pirated books to train one of its AI models.
A proposed class action lawsuit filed on behalf of Oregon author Elizabeth Ryan alleges that Adobe used pirated copies of numerous books, including hers, in training for its SlimLM program.
Adobe describes SlimLM as a series of small language models that “can be optimized for document assistance tasks on mobile devices.” The company states that SlimLM was pre-trained on SlimPajama-627B, a “deduplicated multi-corpus open source dataset” released by Cerebras in June 2023. Ryan, who has written numerous guidebooks on nonfiction writing, says some of her work was included in the pre-training dataset Adobe used.
Ryan's lawsuit, first reported by Reuters, says her work was included in the data that Adobe's model was trained on, because SlimPajama is itself derived from another dataset. “The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including a copy of Books3),” the lawsuit states. “Accordingly, because SlimPajama is a derivative copy of the RedPajama dataset, SlimPajama includes the Books3 dataset, which includes plaintiff and class member works.”
“Books3,” a massive collection of roughly 191,000 books used to train generative AI systems, has been a source of ongoing legal trouble for the tech industry, and RedPajama has also been cited in numerous lawsuits. In September, a lawsuit against Apple alleged that the company used copyrighted material to train its Apple Intelligence models; it cited the same dataset and accused the company of copying protected works “without consent, credit, or compensation.” A similar lawsuit filed against Salesforce in October alleged that the company used RedPajama for training purposes.
Unfortunately for the tech industry, such lawsuits are now somewhat common. AI models are trained on large datasets, and in some cases those datasets allegedly contain pirated material. In September, Anthropic agreed to pay $1.5 billion to authors who accused the company of using pirated copies of their works to train its chatbot Claude. The case was seen as a potential turning point in the ongoing legal battle over the use of copyrighted material in AI training data.
