OpenAI’s Models May Have Memorized Copyrighted Content, New Study Suggests

A recent study suggests that OpenAI's models may have memorized copyrighted content during training, a finding that arrives amid ongoing lawsuits from rights-holders. Researchers from the University of Washington, the University of Copenhagen, and Stanford University developed a method for detecting such memorization that focuses on 'high-surprisal' words, meaning words that are statistically unlikely given their surrounding context. Their tests indicated that GPT-4 may have memorized excerpts from popular fiction and New York Times articles. The findings highlight the need for greater transparency about the data used to train AI models, even as OpenAI continues to advocate for looser copyright rules for model development.
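To give a rough sense of what 'high-surprisal' means, the sketch below scores each word by its surprisal, defined as the negative base-2 logarithm of its probability, and picks out the least expected word in a passage. It is only an illustration and not the researchers' method: the toy unigram model, the add-one smoothing, the sample sentences, and the function names `unigram_surprisal` and `high_surprisal_words` are all assumptions made here. A real analysis would presumably use a language model's context-dependent probabilities.

```python
import math
from collections import Counter


def unigram_surprisal(reference_tokens):
    """Build a surprisal function from word counts (toy stand-in for a real language model)."""
    counts = Counter(reference_tokens)
    total = len(reference_tokens)
    vocab = len(counts) + 1  # one extra slot so unseen words get nonzero probability

    def surprisal(word):
        # Add-one smoothed probability, then surprisal in bits: -log2(p)
        p = (counts[word] + 1) / (total + vocab)
        return -math.log2(p)

    return surprisal


def high_surprisal_words(passage_tokens, surprisal, k=2):
    """Return the k distinct words in the passage with the highest surprisal."""
    unique = sorted(set(passage_tokens))  # sort first so ties break alphabetically
    return sorted(unique, key=surprisal, reverse=True)[:k]


# Hypothetical sample text, invented for this illustration
reference = "the cat sat on the mat and the dog sat on the rug".split()
passage = "the cat sat on the zeppelin".split()

surprisal = unigram_surprisal(reference)
top_words = high_surprisal_words(passage, surprisal, k=1)
print(top_words)
```

Here "zeppelin" never appears in the reference text, so it receives the highest surprisal and is the word that stands out. The intuition is that a word this unexpected is hard to guess from context alone.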