Meta Releases New Web Crawler for AI Training Data

By: Mia. Published 2024-08-21T05:12:03Z

On August 21, Beijing time, it was reported that Meta recently and quietly released a new web crawler that searches the internet and collects large amounts of data to support its artificial intelligence models.

According to three companies that track web crawlers, Meta's new crawler, MetaExternalAgent, launched last month. Like OpenAI's GPTBot, it collects artificial intelligence training data from the web, such as the text of news articles or conversations in online discussion groups.

Archived version history shows that Meta updated a company website for developers at the end of July, adding a tab that disclosed the existence of the new crawler. Meta has not publicly announced the crawler so far, however.

Meta's Llama is one of the largest LLMs. Although the company has not disclosed the training data used for the latest version of the model, Llama 3, the initial version was trained on large datasets drawn from other sources such as CommonCrawl.

Earlier this year, Mark Zuckerberg, Meta's co-founder and CEO, boasted on an earnings call that the company's social platforms had accumulated a body of data for artificial intelligence training that exceeds even CommonCrawl.

The existence of the new crawler suggests that Meta's massive data trove may no longer be sufficient. As the company continues to update Llama and expand Meta AI, it typically needs new, high-quality training data to keep improving those products.

Data from DarkVisitors shows that nearly 25% of the most popular websites worldwide now block GPTBot, while only 2% block Meta's new crawler.
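
For readers wondering what "blocking" a crawler usually means in practice, sites typically list the crawler's user-agent token in a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check a sample file. The robots.txt contents, the `is_blocked` helper, and the token `meta-externalagent` are assumptions made for illustration; Meta's developer page should be consulted for the exact token, and robots.txt only works if a crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away two AI crawlers and allows all others.
# "meta-externalagent" is an assumed token, not confirmed by the article.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "meta-externalagent", "Googlebot"):
        print(agent, "blocked:", is_blocked(ROBOTS_TXT, agent))
```

A tracker could run a check like this against the robots.txt files of many popular sites to estimate what share of them turn away a given crawler, which is presumably how figures like those above are compiled.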