When The Internet Is Not Enough
There have been a lot of stories recently about how the large language models behind the biggest artificial intelligence products from OpenAI, Google and Meta have consumed so much data during training that they have exhausted (or will imminently exhaust) all of the data available on the Internet.
In case you missed that, I’ll repeat it. They have consumed virtually ALL of the data available on the Internet. All of it. That’s a lot of data.
I’ve previously written about the copyright issues at play with this behavior, and some of these recent stories from credible sources suggest that OpenAI, Google and Meta knew they were in potentially questionable legal territory by using copyrighted works to train their models, but did so anyway. They were so eager (even desperate) to find new, untapped pools of data for AI training that OpenAI and Google figured out how to transcribe the audio portions of more than one million hours of YouTube videos, likely in violation of YouTube’s own terms of use. And let’s not forget that YouTube is owned by Google! Meta even considered purchasing Simon & Schuster, the book publisher, to mine its catalog of books. While it does not excuse their violation of copyright holders’ rights, hearing about the need for, and competition over, such vast amounts of data gives some insight into why they proceeded without heeding the warning signs.