If you've ever wondered whether that chatbot you're using knows the entire text of a particular book, answers are on the way. Computer scientists have developed a more effective way to coax memorized content from large language models, a development that may address regulatory concerns while helping to clarify copyright infringement claims arising from AI model training and inference.
Researchers affiliated with Carnegie Mellon University, Instituto Superior Técnico/INESC-ID, and AI security platform Hydrox AI describe their approach in a preprint paper titled "RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline."
The authors – André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li, and Zhuo Li – argue that the ongoing concerns about AI models being

The Register
