Projections published in 2022 predict that we will exhaust the stock of low-quality language data between 2030 and 2050, high-quality language data before 2026, and vision data between 2030 and 2060. This might slow down ML progress.
However, the conclusions rely on the unrealistic assumptions that current trends in ML data usage and production will continue and that there will be no major innovations in data efficiency. Relaxing these and other assumptions would be promising future work.
Improvements in data efficiency are needed to continue to improve AI.
A more realistic model should take into account increases in data efficiency, the use of synthetic data, and other algorithmic and economic factors.
The authors have seen some promising early advances in data efficiency, so if a lack of data becomes a larger problem in the future, we might expect larger advances to follow. This is particularly true because unlabeled data has never been a constraint in the past, so there is probably a lot of low-hanging fruit in unlabeled data efficiency. In the particular case of high-quality data, there are even more possibilities, such as quantity-quality tradeoffs and learned metrics to extract high-quality data from low-quality sources.
All in all, they believe that there is about a 20% chance that the scaling (as measured in training compute) of ML models will significantly slow down by 2040 due to a lack of training data.
Transformers with retrieval mechanisms are more sample-efficient. EfficientZero is a dramatic example of data efficiency, but in a different domain.
In addition to increased data efficiency, synthetic data is already being used to train language models.
They project the growth of training datasets for vision and language models using both the historical growth rate and the compute-optimal dataset size given current scaling laws and existing compute availability estimates.
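To make the shape of such a projection concrete, here is a minimal sketch in Python. It is not the authors' model. The function names, the constant yearly growth rates, and all input numbers are hypothetical placeholders. The compute-optimal helper assumes the common rules of thumb that training compute is roughly 6 x parameters x tokens and that about 20 tokens per parameter is compute-optimal, which may differ from the scaling laws used in the projections.

```python
import math


def exhaustion_year(start_year, dataset_size, dataset_growth,
                    stock_size, stock_growth, horizon=200):
    """Return the first year in which the projected training dataset is at
    least as large as the projected data stock, or None if that does not
    happen within `horizon` years.

    Growth rates are fractional yearly rates (0.5 means +50% per year).
    Assumption: both quantities grow at a constant exponential rate.
    """
    for t in range(horizon + 1):
        dataset = dataset_size * (1 + dataset_growth) ** t
        stock = stock_size * (1 + stock_growth) ** t
        if dataset >= stock:
            return start_year + t
    return None


def compute_optimal_tokens(compute_flop, tokens_per_param=20.0,
                           flop_per_param_token=6.0):
    """Approximate compute-optimal dataset size in tokens.

    Assumption: C = 6 * N * D and D = 20 * N, which together give
    D = sqrt(20 * C / 6). These constants are rules of thumb only.
    """
    return math.sqrt(tokens_per_param * compute_flop / flop_per_param_token)


if __name__ == "__main__":
    # Placeholder inputs, chosen only to show the mechanics.
    year = exhaustion_year(
        start_year=2022,
        dataset_size=1e12,    # tokens used by the largest training run
        dataset_growth=0.5,   # dataset size grows 50% per year
        stock_size=1e14,      # tokens available in total
        stock_growth=0.07,    # stock grows 7% per year
    )
    print("Projected exhaustion year:", year)
    print("Compute-optimal tokens at 1e24 FLOP:", compute_optimal_tokens(1e24))
```

The point of the sketch is that the exhaustion date depends strongly on the gap between the two growth rates, which is why relaxing the constant-trend assumption can move the result by decades.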