AI models at risk of "model collapse" from self-generated data
AI models trained on data generated by artificial intelligence can fail. Scientists emphasize that the quality of the information provided to the models is crucial for their functionality.
25 July 2024 12:51
In an article published in the prestigious scientific journal "Nature," scientists argue that artificial intelligence (AI) models can experience so-called "model collapse" when trained on data generated by other AI models. They highlight the necessity of using reliable, accurate data during training to ensure that the models function properly.
Their argument rests on the concept of "model collapse": the degradation that occurs when AI models are trained on datasets generated by other AI models. The scientists say that such training can "contaminate" the results, so that the original content of the data is gradually replaced with unrelated nonsense. Consequently, after several generations, AI models can begin generating content that makes no sense.
The scientists point to generative AI tools such as large language models (LLMs), which have gained popularity and were trained mainly on human-generated data. However, the researchers warn that as output from these models proliferates on the internet, computer-generated content may end up being used to train other AI models, or even later versions of the same model. This process is known as a recursive loop.
Ilia Shumailov from the University of Oxford in the United Kingdom and his team illustrated, using mathematical models, how AI models can experience "collapse." They showed that a model may overlook certain outputs (e.g., less common text fragments) in its training data, so that it effectively learns from only part of the dataset.
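The loss of less common fragments can be illustrated with a toy simulation. The sketch below is not the researchers' mathematical model; it is a simplified stand-in that assumes a "model" is nothing more than a table of token frequencies, re-estimated each generation from a finite sample drawn from the previous generation. The vocabulary, the probabilities, the sample size and the function names are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Draw a finite sample from the current distribution and re-estimate it.

    Any token that happens not to be drawn gets no entry in the new
    distribution, so it can never be generated again.
    """
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    return {t: counts[t] / sample_size for t in tokens if counts[t] > 0}


def simulate(probs, generations, sample_size, seed=0):
    """Return the list of distributions, starting with the original one."""
    rng = random.Random(seed)
    history = [dict(probs)]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        history.append(probs)
    return history


# Hypothetical "human" data: one common token and ten rare ones.
initial = {"common": 0.90}
initial.update({f"rare_{i}": 0.01 for i in range(10)})

history = simulate(initial, generations=9, sample_size=50)

# Number of distinct tokens each generation is still able to produce.
support_sizes = [len(p) for p in history]
print(support_sizes)
```

In this toy setup the number of distinct tokens can only stay the same or shrink from one generation to the next, because a token that is missed once has zero probability from then on. How quickly the rare tokens vanish depends on the assumed sample size and random seed.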
Collapsing AI models
Researchers also analyzed how AI models respond to a training dataset primarily created by artificial intelligence. They found that feeding the model data generated by AI degrades subsequent generations of models in terms of learning ability, ultimately leading to "model collapse."
All of the recursively trained language models that the scientists tested showed a tendency to repeat phrases. In one test, the scientists used a text about medieval architecture as the training input. By the ninth generation, the model was generating content about hares instead of architecture.
The study authors emphasize that "model collapse" is inevitable if AI training indiscriminately uses datasets generated by previous generations of models. They say that training artificial intelligence effectively on its own output is possible, but only with careful filtering of the generated data. The scientists also note that technology companies that decide to use only human-generated content for AI training may gain a competitive advantage.