AI Companies Bypassing Publishers' Content Restrictions

TapTechNews, June 24. According to a Reuters report last Saturday, TollBit, a startup focused on content licensing, recently warned news publishers that several artificial intelligence companies are circumventing the common web standard publishers use to block crawlers, and are using the scraped content to train generative AI systems.

The news comes amid an open dispute between AI search startup Perplexity and the media outlet 'Forbes' over that same web standard, and against a broader debate between technology and media companies over the value of content in the era of generative AI.

TollBit positions itself as a 'matchmaker' between content-hungry AI companies and publishers willing to sign substantial licensing agreements with them.

TapTechNews note: 'Forbes' has accused Perplexity of plagiarizing its reporting in an AI-generated summary that neither credited 'Forbes' as the source nor was produced with its permission.

In addition, 'Wired' magazine published an investigative report last week pointing out that Perplexity may have bypassed the Robots Exclusion Protocol set by news publishers, or other measures intended to block web crawlers.


The 'News Media Alliance', a trade organization that says it represents more than 2,000 US publishers, has also expressed concern about this behavior: AI companies turning a blind eye to 'no crawl' signals, such as robots.txt, that publishers put in place. Danielle Coffey, the group's president, said: 'If AI companies can't stop massive crawling, we can't make a profit from valuable content and can't pay salaries to journalists.'

TollBit said that Perplexity is not the only offender ignoring the 'no crawl' mechanism on publishers' websites. According to its analysis, 'a large number' of AI platforms bypass robots.txt, the mechanism through which a website indicates which parts of its content may, and may not, be crawled.
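To illustrate what robots.txt actually does, here is a minimal sketch of how a well-behaved crawler consults it, using Python's standard `urllib.robotparser` module. The rules and bot names below are hypothetical examples, not taken from any real publisher's site.

```python
# Sketch: how a compliant crawler honors robots.txt (Robots Exclusion Protocol).
# The rules and user-agent names here are invented for illustration.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /public/
Disallow: /articles/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler checks before fetching; the behavior described in the
# article amounts to skipping this check entirely.
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/story"))  # False
print(parser.can_fetch("OtherBot", "https://example.com/public/page"))         # True
```

Note that robots.txt is purely advisory: nothing technically prevents a crawler from fetching disallowed pages, which is why compliance depends on the crawler's operator.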

'This means that AI platforms from multiple sources (not just one company) are choosing to bypass the robots.txt protocol to retrieve content from websites,' TollBit wrote. 'The more publisher logs we receive, the more often this pattern appears.'

Some publishers, including 'The New York Times', have sued AI companies over such alleged infringement. Others have signed licensing agreements with AI companies willing to pay for content, although the two sides often disagree about the material's value. Many AI developers maintain that accessing the content for free violates no laws.
