Apple and Other Tech Giants Involved in Using YouTube Video Subtitles for AI Training Without Consent

By: Jack Published 2024-07-16T13:48:07Z

TapTechNews, July 16: According to a Wired report, several tech giants, including Apple, have used subtitle files from YouTube creators' videos to train artificial intelligence models without the creators' consent.

TapTechNews noted that the affected creators include well-known tech blogger MKBHD (Marques Brownlee), MrBeast, and PewDiePie, as well as talk show hosts Stephen Colbert, John Oliver, and Jimmy Kimmel. The subtitle files used for training amount to text transcripts of the videos.

Investigative journalists disclosed that some of the world's richest tech companies have been using material from thousands of YouTube videos to train AI, in violation of YouTube's rule against scraping content from the platform without permission. Reportedly, more than 173,000 subtitle files from 48,000 YouTube channels have been used to train artificial intelligence models by Silicon Valley giants including Apple, NVIDIA, and Salesforce.

According to reports, the subtitle files were downloaded by a non-profit organization named EleutherAI, which says its purpose is to help developers train AI models. Although EleutherAI may have originally intended to provide training material for small developers and academic researchers, the data set has also been used by tech giants such as Apple.

According to a research paper published by EleutherAI, this data set is part of a larger collection the organization released called The Pile. Most of the data sets in The Pile are public, and anyone with enough storage space and computing power can access them. Some scholars and developers have used the data set, but so have companies with market values of tens or even hundreds of billions of dollars: Apple, NVIDIA, and Salesforce have all described in their research papers and posts how they used it to train AI models.

Documents show that Apple used The Pile for training a few weeks before it released the closely watched OpenELM model in April. The OpenELM release coincided with Apple's announcement that it would add new AI features to the iPhone and MacBook.

It should be noted that Apple did not download the data itself; EleutherAI did. Technically speaking, therefore, it is EleutherAI that violated YouTube's terms of use.

Although Apple and other companies may have only used a publicly available data set, the incident highlights the legal risks of scraping data from the Internet to train AI systems. AI systems have previously been found to copy entire paragraphs when answering questions on niche topics, and relying on data sets compiled by third parties only increases the risk of using material without permission.