Large Tech Companies Use YouTube Videos for AI Model Training

TapTechNews, July 17: The non-profit news studio ProofNews published a blog post yesterday (July 16) stating that large technology companies, including Apple, NVIDIA, Salesforce, and Anthropic, have all used YouTube video content to train their AI models.

According to the report, these companies trained their AI models on a data set named YouTubeSubtitles, which is 5.7 GB in size (489 million words).

The data set was created by EleutherAI and first released in 2020. It contains the subtitles of 173,536 YouTube videos from more than 48,000 channels, including the subtitles of more than 12,000 videos that have since been deleted from the platform.

The YouTubeSubtitles data set draws mainly on popular YouTube channels. TapTechNews lists some examples below:

MrBeast (289 million subscribers, with 2 videos used for training)

MarquesBrownlee (19 million subscribers, with 7 videos)

Jacksepticeye (nearly 31 million subscribers, with 377 videos)

PewDiePie (111 million subscribers, with 337 videos)

The YouTubeSubtitles data set is part of a larger collection called The Pile, which includes several other training data sets. Most of The Pile's data sets are open to anyone with enough storage space and computing power.
