The Problem of AI 'Getting Stupid' and Its Consequences

This article comes from the WeChat public account SF Chinese (ID: kexuejiaodian). Author: SF

AI can seem to 'answer every question' mainly because of the sheer amount of data it is trained on. As long as that training data is plentiful enough, AI can keep playing the role of our 'good teacher and helpful friend'. But things are not so simple or optimistic: AI is getting stupid.

At this stage, the data used to train AI mainly comes from the Internet. The massive volume of data online is what lets AI answer our questions faster, more completely, and more appropriately. But as AI develops, the amount of AI-generated data on the Internet is bound to grow, and with it the share of AI-generated data within the data used to train AI. This creates a big problem for AI.

1. AI is getting stupid

An article published in the journal Nature on July 24, 2024, points out that training AI on AI-generated data can drive models toward 'collapse' as they iterate.

Researchers from the University of Oxford, the University of Cambridge, Imperial College London, the University of Toronto, and other institutions trained successive versions of a large language model (LLM, such as GPT or OPT), denoted LLM-n, on web data in which output generated by the previous versions made up the majority. They found that as n increases, the model exhibits a 'model collapse' phenomenon.

Take Meta's large language model OPT as an example. The researchers tested OPT-125m. The initial training text they fed in was: 'According to the British writer Poyntz Wright, some medieval buildings begun before 1360 were usually completed by experienced masons and casual masons, with local parish laborers also taking part. Other authors, however, disagree; they believe the leaders of the construction teams designed these buildings after the example of early Perpendicular buildings.'

At first, successive versions of OPT could still give accurate construction dates for some Perpendicular buildings based on the training data. But because each later version was trained on data generated by the previous one, the answers grew more and more absurd with each iteration: by the 9th generation, OPT was outputting the names of a list of rabbits.
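The mechanism can be caricatured with a toy simulation. This is a minimal sketch for illustration only, not the paper's actual experiment: each 'generation' fits a simple Gaussian model to the previous generation's output and then samples its own training data from that fit. Because a finite sample misses the tails of the distribution, the data's spread withers over generations, a bare-bones analogue of model collapse.

```python
import random
import statistics

def next_generation(data, n_samples):
    # "Train": estimate the distribution from the previous generation's data.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    # "Generate": replace the data entirely with the model's own output.
    return [random.gauss(mu, sigma) for _ in range(n_samples)]

random.seed(0)
# Generation 0: the original "human-created" data.
data = [random.gauss(0.0, 1.0) for _ in range(10)]
initial_spread = statistics.stdev(data)

# Iterate many generations, each trained only on the previous one's output.
for _ in range(1000):
    data = next_generation(data, 10)

final_spread = statistics.stdev(data)
print(f"spread: {initial_spread:.3f} -> {final_spread:.6f}")
```

The small sample size per generation exaggerates the effect: rare (tail) values are progressively lost, so the estimated spread shrinks toward zero, just as the later OPT generations lost the information present in the original text.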

2. What will happen after AI gets stupid?

So what are the consequences if AI gets stupid, or even collapses? The research team points out that this kind of long-term 'pollution' of training data has already happened. For example, the researchers observed the rise of 'troll farms' (organizations dedicated to spreading false or inflammatory content on the Internet; think of them as professional 'trolls' or online flame-warriors). The 'pollution' troll farms inflict on search engines shows up as distorted search results. More worrying still, as AI large language models push deeper into the online world, such 'pollution' will grow in scale and spread ever faster.

For this reason, Google has lowered the search weight of troll-farm content, and DuckDuckGo, a search engine focused on protecting user privacy, simply removes it. But none of these measures fundamentally solves the problem of AI getting stupid. For AI to keep 'learning properly' over the long term instead of being 'polluted', the original human-created data on the Internet must remain accessible. The researchers believe the key to achieving this is being able to distinguish AI-generated data from human-created data.

This comes down to tracing the provenance of AI-generated data, but scientists do not yet know how to track the source of AI-generated content at scale.

In the article, the researchers suggest one possible solution: establish community-level cooperation, so that everyone involved in AI-generated content can share the information needed to trace where content came from.

References:

Shumailov, I. et al. AI models collapse when trained on recursively generated data. Nature (2024). https://www.nature.com/articles/s41586-024-07566-y#Abs1
