By Tal Roded and Peter Slattery
In this article, we provide a high-level overview of a key trend in AI: models' data requirements may be growing faster than the supply of suitable data. We then explore some implications.
Progress in artificial intelligence is underpinned by advances in three areas: compute, data, and algorithms. Compute refers to the use of computer systems to perform calculations or process data; it encompasses a range of activities from simple arithmetic to complex simulations. Data refers to the information processed or produced by a computer, and it is required to train and validate AI models. Algorithms are the procedures or formulas that computer systems use for solving problems or completing tasks.
AI models learn from training data and use it to generate responses. Larger and better AI models generally require more training data; Figure 1 below uses information from one of our ‘computer progress’ datasets to show how training data requirements have grown between 2010 and 2023.
The amount of available data has exploded. In just ten years, the total amount of new data generated per year grew 32-fold, from 2 zettabytes in 2010 to over 64 zettabytes in 2020. Figure 2 draws on the Statista research dataset to illustrate this.
Even though more data is now produced every month than existed in total just over a decade ago, supply is still failing to keep pace with growth in demand. As shown in Figure 3, analysis from Epoch, whom we advise, suggests that several sources of data, such as high-quality language data, will be exhausted within the next few years.
If suitable data does become scarce, several problems could follow. For instance, it would become harder to train larger models, and newer models may be less accurate and more biased because their training data is insufficiently broad and dense. These problems would likely impede innovation both in sectors that already depend on AI and among future adopters.
However, Epoch predicts only a 20% probability that data will become a bottleneck to improvements in AI by 2040. So why might we not run out of data?
There are strong economic incentives to use data more effectively, because this would reduce the rapidly growing costs of training larger models. In response, organizations and researchers are developing training approaches that require less data for the same level of AI model performance. Examples include few-shot learning, where models are optimized to learn from a small number of examples, and transfer learning, where models reuse knowledge from one task on another rather than being trained from scratch (a rough sketch follows below).
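To give a rough sense of how transfer learning reduces data needs, the sketch below uses PyTorch and torchvision (our choice of framework; any comparable library would do). An image model pretrained on ImageNet is adapted to a new task by retraining only its final layer, so far less task-specific data is required. The ten-class task and the data loader are placeholders, not part of any specific study.

```python
# Minimal transfer-learning sketch (PyTorch / torchvision).
# A ResNet pretrained on ImageNet is reused for a new ten-class task:
# only the final layer is retrained, so far less task-specific data is needed.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # reuse pretrained features

# Freeze the pretrained backbone so its weights are not updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the new task (ten classes is illustrative).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters are optimized during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop over a small task-specific dataset (dataloader assumed):
# for images, labels in dataloader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```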
AI models are also being trained using data augmentation, where existing data is modified to enlarge the training set without collecting new data, for instance by rotating images or adding noise to audio files (see the sketch below). Research is also improving our understanding of where and how to prioritize data quality over quantity. For instance, as costs increase, AI models are increasingly being trained on better (e.g., well-labelled and diverse) but smaller training datasets.
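To make image augmentation concrete, the sketch below uses torchvision's transforms (again our choice of library; the specific transforms and parameter values are purely illustrative) to generate modified variants of each training image on the fly.

```python
# Minimal data-augmentation sketch (torchvision transforms).
# Each original image yields many modified variants, enlarging the
# effective training set without collecting any new data.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # rotate images slightly
    transforms.RandomHorizontalFlip(p=0.5),   # mirror images half the time
    transforms.ColorJitter(brightness=0.2,    # vary lighting conditions
                           contrast=0.2),
    transforms.ToTensor(),
])

# Applied inside a dataset, every epoch sees a differently transformed
# version of each image, e.g.:
#     augmented_image = augment(original_pil_image)
```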
Another approach is synthetic data, which is artificially generated using AI. This is particularly useful in domains where existing data is limited or insufficiently diverse. See Figure 4 for some examples (from recent research by Philip Isola).
Synthetic data also provides some additional benefits over real-world data. For instance, it can be much cheaper to produce, can emulate scenarios and variations that are not easily captured in real-world datasets, and can be customized for specific types of models. A simple sketch of the idea follows below.
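The sketch below gives a rough sense of how synthetic text data might be produced. The `generate_text` function is a placeholder for whatever generative model is available, and the sentiment-labelled product reviews are an invented example; no specific model or API is implied.

```python
# Synthetic-data sketch: a generative model is prompted to produce labelled
# training examples where real examples are scarce. `generate_text` is a
# placeholder for whatever model is actually used; no specific API is implied.

def generate_text(prompt: str) -> str:
    """Placeholder: replace with a call to your text-generation model."""
    raise NotImplementedError

def make_synthetic_reviews(n_per_label: int = 50) -> list[dict]:
    """Produce labelled product reviews to supplement a small real dataset."""
    examples = []
    for label in ("positive", "negative"):
        for _ in range(n_per_label):
            prompt = (f"Write a short, realistic product review with {label} "
                      "sentiment. Return only the review text.")
            examples.append({"text": generate_text(prompt), "label": label})
    return examples

# The synthetic examples are then mixed with whatever real data exists:
# training_set = real_examples + make_synthetic_reviews()
```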
In summary, the development of AI models increasingly requires vast amounts of data, creating the risk that the demand for data will outpace the supply. However, this risk is perhaps relatively low due to ongoing innovations such as few-shot learning, transfer learning, data augmentation, and the use of synthetic data.
Our dataset exploring the use of algorithms, parameters, and training data can be accessed here. For further information on the types of algorithms developed over time and their applications and uses, check our Algorithm Wiki.
Data visualizations in this post were made by Tal Roded in R using the tidyverse, readxl, ggthemes, and RColorBrewer packages.
Feedback was provided by Harry Lyu.