
The AI revolution is running out of data. What can researchers do?

Nature Podcast

[email protected]

Science, News, Technology

4.4859 Ratings

🗓️ 31 January 2025

⏱️ 17 minutes


Summary

The explosive improvement in artificial intelligence (AI) technology has largely been driven by making neural networks bigger and training them on more data. But experts suggest that the developers of these systems may soon run out of data to train their models. As a result, teams are taking new approaches, such as searching for unconventional data sources or generating new data to train their AIs.


This is an audio version of our Feature: The AI revolution is running out of data. What can researchers do?



Hosted on Acast. See acast.com/privacy for more information.

Transcript

Click on a timestamp to play from that location

0:00.0

This is an audio long read from Nature. In this episode, the AI revolution is running out of data.

0:09.3

What can researchers do? Written by Nicola Jones and read by me, Benjamin Thompson.

0:17.4

The internet is a vast ocean of human knowledge, but it isn't infinite, and artificial intelligence

0:26.0

researchers have nearly sucked it dry.

0:29.6

The past decade of explosive improvement in AI has been driven in large part by making neural

0:36.0

networks bigger and training them on ever more data.

0:40.6

This scaling has proved surprisingly effective at making large language models, or LLMs, such

0:46.9

as those that power the chatbot ChatGPT, both more capable of replicating conversational

0:52.9

language and of developing emergent properties,

0:56.7

such as reasoning.

0:58.7

But some specialists say that we are now approaching the limits of scaling.

1:03.6

That's in part because of the ballooning energy requirements for computing.

1:07.9

But it's also because LLM developers are running out of the conventional

1:12.5

data sets used to train their models. A prominent study made headlines last year by putting

1:19.7

a number on this problem. Researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of a

1:30.7

data set used to train an AI model will reach the same size as the total estimated stock

1:36.9

of public online text. In other words, AI is likely to run out of training data in about

1:44.1

three years' time. At the same time,

1:47.5

data owners, such as newspaper publishers, are starting to crack down on how their content can be

1:53.3

used, tightening access even more. That's causing a crisis in the size of the data commons,

2:03.8

says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge, who leads the Data Provenance Initiative,

2:10.1

a grassroots organization that conducts audits of AI datasets.

...

