In recent years, artificial intelligence (AI) driven by large models has permeated every aspect of modern society, but its rapid development is inseparable from the support of massive amounts of data, so much so that the industry describes data as the “fuel” and “minerals” that power AI. Yet Ilya Sutskever, co-founder and former chief scientist of AI giant OpenAI, recently warned publicly that AI training data is “running out like fossil fuels”, immediately triggering wide discussion within the AI industry: will large AI models really run out of data, and what comes next?
“The pre-training model must come to an end”
According to Wired.com, AI cannot develop without three core elements: algorithms, computing power and data. Computing power keeps growing with hardware upgrades and data-center expansion, and algorithms continue to iterate, but the growth of data is beginning to fall behind AI’s needs. Speaking at the 38th annual Conference on Neural Information Processing Systems (NeurIPS) in Vancouver, Canada, Sutskever warned that “the pre-training model as we know it will come to an end.” He explained: “AI’s training data, like oil, is running out. There is no changing the fact that we have only one internet. We have reached peak data and there will be no more; we have to deal with what we have.”
Shen Yang, a professor at Tsinghua University’s School of Journalism and its College of AI, told the Global Times on the 17th that pre-training of large models refers to the initial training of large-scale AI models, such as the GPT series, on massive amounts of unlabeled data. Through self-supervised learning, the model learns the basic structure of language, its grammatical rules and broad knowledge, forming a general-purpose language representation. This phase enables the model to understand and generate natural language and lays a solid foundation for subsequent specific tasks such as text classification and question answering. Pre-training not only improves the model’s performance across a wide range of tasks, but also reduces the need for large amounts of annotated data and speeds up application development.
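For readers who want to see what “self-supervised learning on unlabeled text” looks like in practice, here is a minimal, illustrative sketch in PyTorch. The tiny recurrent model and toy corpus are assumptions for demonstration only, not the GPT-series systems described above; the point is simply that the training targets are the text itself shifted by one token, so no human annotation is required.

```python
import torch
import torch.nn as nn

# Toy "unlabeled" corpus; real pre-training uses trillions of tokens of web text.
corpus = "AI training data is often compared to fuel and minerals. " * 50
vocab = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in corpus])

class TinyLM(nn.Module):
    """A deliberately tiny character-level language model (illustrative only)."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Self-supervised objective: predict the next character from the previous ones.
# The "labels" are just the same text shifted by one position.
seq_len = 64
for step in range(200):
    i = torch.randint(0, len(ids) - seq_len - 1, (1,)).item()
    x = ids[i:i + seq_len].unsqueeze(0)           # input tokens
    y = ids[i + 1:i + seq_len + 1].unsqueeze(0)   # next-token targets
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, len(vocab)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A real pre-training run differs mainly in scale, with transformer architectures, billions of parameters and web-scale corpora, which is exactly why the supply of raw text matters so much.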
This is not the first time the AI industry has flagged the problem of “insufficient data”. The Economist recently reported that “AI companies will soon run out of most internet data”, citing a prediction from the research firm Epoch AI that the human text data available on the internet will be exhausted by 2028.
Why does AI need more and more data?
According to Shen Yang, the demand for data to train large models is indeed growing rapidly, roughly multiplying with each generation. Specifically, models like GPT typically require on the order of hundreds of billions to trillions of words of data for pre-training. These huge datasets help the model learn language structure and semantic relationships in depth, which is what ultimately gives it strong performance and broad applicability.
As for why each iteration of a large model leads to a sharp jump in data demand, Shen Yang explained that this is mainly driven by model scale and the pursuit of better performance. As the number of parameters grows, so does the model’s capacity to learn and represent information; more data is needed to train those parameters fully and to ensure the model generalizes well.
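As a rough, back-of-envelope illustration of that relationship between parameters and data, the snippet below applies the widely cited “Chinchilla” rule of thumb of roughly 20 training tokens per parameter (Hoffmann et al., 2022); the parameter counts are illustrative assumptions, not figures reported in this article.

```python
# Rough illustration: data needed grows in proportion to model size under the
# commonly cited "Chinchilla" heuristic of ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20

for params_in_billions in (1, 10, 100, 1000):
    tokens_in_billions = params_in_billions * TOKENS_PER_PARAM
    print(f"{params_in_billions:>5}B parameters -> ~{tokens_in_billions:>6,}B training tokens")
```

By this heuristic, a trillion-parameter model would want on the order of 20 trillion training tokens, which is why each jump in scale puts so much pressure on the supply of text.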
On the other hand, the diversity and coverage of data is another important driver of growing data demand. To improve a model’s generality and adaptability, large amounts of data covering a wide range of topics and language styles are needed; this not only helps the model grasp complex language structures and semantic relationships, it also ensures that it performs well across a variety of application scenarios. At the same time, as models expand into new uses such as multimodal and cross-domain applications, the demand for data of different types and from different domains also rises significantly, further pushing up data volumes.
In general, there is a strong positive correlation between technical iteration and data volume. Every technological advance, especially growth in model size and complexity, drives the need for larger and richer datasets. This rapidly rising demand serves not only to improve model performance and generalization, but also to support models in broader and more complex application scenarios.
The demand for training data has grown exponentially as large models such as GPT-4o and OpenAI’s subsequent o1 pro have scaled up. With each iteration, the larger number of parameters requires more data to ensure the model learns sufficiently and generalizes well. The growth of the internet and other data sources has not fully kept pace with this demand, making high-quality training data relatively scarce. In addition, as privacy regulations such as the European Union’s General Data Protection Regulation grow increasingly stringent, obtaining and using data at scale has become more complex and restricted for the companies and institutions building large models, further exacerbating the imbalance between data supply and demand.
Is the future about “small data”?
Shen Yang said that comparing the state of available AI data to traditional mineral resources does not simply mean that the “total amount” of data has dried up. Rather, as the “deposits” are mined, fewer high-quality “ores” (high-quality data) remain readily accessible, and what is left is either more homogeneous or of lower quality, so it cannot directly meet the training needs of a new generation of large models. Data today may still be plentiful, but much of it is riddled with bias, inconsistency or missing labels, like the leaner ore left after mining, which requires more refining and processing.
In the future, therefore, besides continuing to seek out new data sources (including more obscure corpora and data from specialized fields), strategies such as synthetic data, data augmentation, transfer learning and federated learning can be tried to improve the efficiency of data use and the quality of data management. In general, the dilemma is not simply one of “insufficient quantity” but of insufficient data “quality and availability”, and the countermeasure is to improve the precision and efficiency of data processing at the levels of technology, strategy and institutions.
Synthetic data has become a new way to address the shortage of training data for large models. In contrast to real data collected or measured in the physical world, synthetic data is created by generative models that simulate the distribution and statistical characteristics of real data. It can produce large training datasets on demand, but it also brings the so-called “overfitting” problem, in which a large model performs well on synthetic data yet poorly in real-world scenarios.
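That risk can be shown with a small, self-contained sketch: below, a classifier is trained only on synthetic samples that are “cleaner” (less noisy) than the real distribution they imitate, so its score on synthetic data overstates how well it does on real data. The toy dataset, the deliberately shrunken noise level and all numbers are assumptions for illustration, not part of any system mentioned in the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n_per_class, spread):
    """Two overlapping 2-D classes; `spread` controls how noisy the data is."""
    x0 = rng.normal(loc=0.0, scale=spread, size=(n_per_class, 2))
    x1 = rng.normal(loc=2.0, scale=spread, size=(n_per_class, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

# Real-world data is noisy; the synthetic generator underestimates that noise,
# mimicking a generative model that misses the tails of the real distribution.
X_real, y_real = make_data(1000, spread=1.5)
X_synth, y_synth = make_data(1000, spread=0.5)

# Train only on synthetic data, then compare scores on synthetic vs. real data.
clf = LogisticRegression().fit(X_synth, y_synth)
print("accuracy on synthetic data:", round(clf.score(X_synth, y_synth), 3))
print("accuracy on real data:     ", round(clf.score(X_real, y_real), 3))
# Typically prints a near-perfect score on synthetic data and a noticeably
# lower one on real data -- the gap the article calls "overfitting".
```

This is why synthetic-data pipelines are usually validated against held-out real data rather than against more synthetic samples.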
Shen Yang stressed that two points need to be clarified in the recent, much-discussed question of whether the pre-training data for large AI models will be exhausted. First, the “exhaustion” the industry is discussing mainly concerns the text data used to train large models; large models’ learning from and use of spatial data, video data and the vast amounts of data that sensors capture from the natural world are only just beginning. In other words, moving from text data to these enormous data sources will mean expansion on a huge scale. Second, in the future we should keep strengthening the pre-training of large models, but more importantly we should study reasoning, agents and human-machine symbiosis. “In other words, we need to study how to make AI stronger by learning from huge amounts of data, and at the same time study how to make humans stronger. No matter how strong AI becomes, humans will ultimately be able to master it.”
Lu Benfu, a professor at the University of Chinese Academy of Sciences, told the Global Times on the 17th that the data said to be “exhausted” by the pre-training of large AI models mainly refers to data on the internet and various published datasets, while the memories of each person’s life remain in their minds and have not yet been effectively mined. Alongside the heated debate over whether pre-training data will run out, some argue that the future will enter a “small model” era. Lu Benfu believes that in the future, large models, vertical models and agents will each have to find their own domain of value. Academia also speaks of a “world model”, which, unlike today’s large language models, encompasses not only logical relations (probabilistic judgments) but also physical laws. So the higher-level “showdown” among the large models of the future is far from over.