Simeon Spencer

Could NVIDIA and Other Tech Giants' Scraping of Internet Videos and Data Eventually Destroy Their LLMs?



In a report by 404 Media, NVIDIA got caught red-handed stealthily scraping around 80 years' worth of video content and data daily from platforms like YouTube, Netflix, and MovieNet. This treasure trove of audiovisual delights is being fed into NVIDIA's AI models, all in the name of creating their next wave of products, including the much-anticipated Omniverse 3D world generator and the intriguingly named AI robotics project, GR00T. GR00T, we are told, aims to endow robots with the remarkable ability to understand instructions via language, video, and demonstration, presumably to make our future robotic helpers that much more adept at following orders.
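For a rough sense of scale: 80 years of footage works out to about 80 × 365 × 24 ≈ 700,000 hours of video ingested every single day.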


NVIDIA Project GR00T


However, NVIDIA is not the only one caught with its hand in the cookie jar. Rumor has it that OpenAI has also been indulging in a bit of YouTube scraping for training purposes, and Apple and Anthropic are suspected of similarly raiding large-scale datasets that conveniently include YouTube videos.


YouTube’s CEO, Neal Mohan, is understandably pissed. He has previously stated in no uncertain terms that YouTube’s terms and conditions “do not allow for things like transcripts or video bits to be downloaded. That is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform.” But NVIDIA, not one to care, responded in an interview with Tom’s Guide, asserting that they hadn't infringed on any copyright laws and that their actions were comfortably shielded by the "fair use" doctrine.


Should AI Model Training Be Considered Fair Use? Probably.

"Copyright law protects particular expressions but not facts, ideas, data, or information. Anyone is free to learn facts, ideas, data, or information from another source and use it to make their own expressions."
NVIDIA’s Statement to Tom’s Guide

Technically, anything you post on the internet for public viewing is like a buffet for AI companies: deliciously available and ready to be devoured by their LLMs. This open-access approach is crucial unless, of course, you prefer to cement the dominance of today's tech giants and hand them a cozy little oligopoly over LLMs.

Just as we humans learn from freely available resources like YouTube without much fuss, AI training on these resources should be viewed in the same light. Consider the fashion industry: brands often draw inspiration from each other, resulting in designs that are eerily similar yet each unique in its own way. Similarly, while generative AI may nibble on online content for inspiration, it ultimately produces its own creations.


To reach the lofty goal of artificial general intelligence (AGI)—essentially crafting AI that can rival human intelligence—we need to treat AI training like a person accessing free online resources to create their own masterpieces. If regulations clamp down on AI training with cries of "copyright infringement," data costs will skyrocket. Only the tech behemoths will be able to afford such expenses, effectively pricing out newer players and stifling the development of smaller, more innovative LLMs.


But beyond the murky waters of copyright issues lies an even more pressing matter: the quality of data available today.


AI Models Could Fail If Trained Repeatedly on AI-Generated Content


It is becoming increasingly clear that generative AI is not just a passing trend but a permanent fixture in our digital landscape. More and more people are turning to AI to generate text, images, and videos for online content. Simultaneously, these very AI models are being trained on the vast expanse of the internet. But what happens as the volume of AI-generated content continues to grow?


A recent study by AI researchers (Shumailov, I., Shumaylov, Z., Zhao, Y., et al., Nature 631, 755–759 (2024)) has uncovered a rather alarming phenomenon. They found that the indiscriminate use of AI-generated content in training leads to irreversible defects in the resulting models. This phenomenon, which they term "model collapse," occurs when the tails of the original content distribution vanish. Essentially, the models start forgetting improbable events over time, becoming increasingly poisoned by their own distorted projections of reality. It's no wonder DALL-E outputs have been resembling a surrealist’s fever dream lately (though copyright issues might also play a part, but who knows).
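The mechanism is easy to see in miniature. Below is a minimal sketch (assuming Python with NumPy; a toy Gaussian analogue of my own, not the paper's actual language-model experiments): each "generation" fits a normal distribution to its training data, and the next generation then trains only on samples drawn from that fit. The estimated spread drifts downward over time, and the tails are the first thing to go.

```python
import numpy as np

# Toy "model collapse": each generation fits a Gaussian to its training
# data, then the next generation trains ONLY on samples from that fit.
rng = np.random.default_rng(42)
n = 100                                   # training samples per generation
data = rng.normal(0.0, 1.0, size=n)      # generation 0: "real" human data

for gen in range(1, 501):
    mu, sigma = data.mean(), data.std()   # "train": maximum-likelihood fit
    data = rng.normal(mu, sigma, size=n)  # next gen sees only synthetic data
    if gen % 100 == 0:
        print(f"generation {gen}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```

Each refit multiplies the variance by a noisy factor whose expectation is (n − 1)/n < 1, so over many generations sigma decays toward zero and rare events simply stop being generated. That is the paper's "forgetting improbable events" in its simplest possible form.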


Looking ahead, AI advancement will be stymied by two significant data issues: availability and quality. As more people master the art of using AI tools to generate content, the share of original, human-made data online will dwindle. If stringent copyright regulations are enforced on AI models, data will become the new digital gold, more valuable than Bitcoin (at least for AI companies). This scenario could lead to two possible outcomes. First, we might miss every forecast for achieving AGI simply because there isn't enough high-quality data. Second, the current giants like OpenAI, Anthropic, Google, and Meta will cement their positions as our lifelong AI vendors.



