AI startup Perplexity has come under fire again, this time for allegedly scraping content from websites that had explicitly blocked its crawlers. The criticism comes from Cloudflare, a large internet infrastructure company, which claims the startup used evasive techniques to get around the anti-scraping measures that website administrators had put in place.
Cloudflare published findings earlier this week alleging that Perplexity’s bots accessed restricted websites in violation of the widely accepted robots.txt standard, under which a website publishes a plain-text file telling automated bots which areas they are allowed to access.
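To illustrate how a robots.txt rule is meant to work, the following is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the `example.com` URLs, the `is_allowed` helper, and the `PerplexityBot` user-agent token are assumptions chosen for illustration and are not taken from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/1"
    # A crawler that identifies itself honestly is told to stay out.
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", article))
    # The same request sent with a browser-like user agent falls under
    # the generic "*" rules, which do not cover /articles/.
    browser_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
    print(is_allowed(ROBOTS_TXT, browser_ua, article))
```

The sketch also shows why compliance is voluntary: robots.txt matches on the name a bot reports about itself, so the rule only binds crawlers that identify themselves truthfully and choose to check the file.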
According to Cloudflare, Perplexity's crawlers not only disregarded robots.txt directives while collecting data but also tried to conceal their identity.
Cloudflare said it observed Perplexity's crawlers disguising themselves in several ways. They rotated user-agent strings, the request headers that identify the browser and device accessing a site, so that the crawlers appeared to be ordinary browsers such as Google Chrome on macOS. They also switched between IP addresses spread across different Autonomous System Numbers (ASNs).
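The rotation pattern described above can be sketched as a simple log-analysis heuristic. This is not Cloudflare's detection method, which the company has not published as code; it is an illustrative sketch that assumes each request carries a hypothetical `fingerprint` field (some client signal that stays stable when the user agent and IP address change). The `Request` type, the `find_rotating_clients` function, and the sample data are all invented for the example, and the ASNs come from the private-use range.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    fingerprint: str  # hypothetical stable client signal
    user_agent: str   # value of the User-Agent header
    asn: int          # ASN that announces the source IP address


def find_rotating_clients(requests, min_user_agents=2, min_asns=2):
    """Return fingerprints seen with several user agents AND several ASNs."""
    user_agents = defaultdict(set)
    asns = defaultdict(set)
    for req in requests:
        user_agents[req.fingerprint].add(req.user_agent)
        asns[req.fingerprint].add(req.asn)
    return sorted(
        fp for fp in user_agents
        if len(user_agents[fp]) >= min_user_agents
        and len(asns[fp]) >= min_asns
    )


SAMPLE = [
    Request("fp-a", "ExampleBot/1.0", 64512),
    Request("fp-a", "Mozilla/5.0 (Macintosh) Chrome", 64513),
    Request("fp-b", "Mozilla/5.0 (X11) Firefox", 64514),
    Request("fp-b", "Mozilla/5.0 (X11) Firefox", 64514),
]

if __name__ == "__main__":
    print(find_rotating_clients(SAMPLE))
```

A real system would need far more signals and much higher thresholds, since ordinary users also change networks and browsers. The point is only that rotating identifiers stand out once requests can be tied together by something the client does not rotate.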
Legal experts believe that robots.txt violations could escalate the broader controversy over web scraping by AI companies, and critics argue that Perplexity needs to adopt more transparent and ethical data sourcing practices. The scope of the operation was sizeable: Cloudflare's analysis found the same methods in use across tens of thousands of websites, generating millions of automated requests.
The allegations have reignited debates about AI training methods and data ownership. Cloudflare said it detected the activity through a combination of machine learning analysis and network monitoring, after customers reported that their sites had been accessed even though they had blocked Perplexity’s bots.
Responding to the claims, Perplexity spokesperson Jesse Dwyer dismissed the report as 'a sales pitch,' asserting that the bots named in Cloudflare’s findings did not belong to the company. Dwyer also claimed that the screenshots in Cloudflare's report did not show any actual data scraping.
Cloudflare responded to Dwyer by reiterating that controlled testing had tied the activity to Perplexity's systems. In response to its findings, Cloudflare removed Perplexity from its approved crawler list and rolled out new protections that let publishers block this kind of access.
The episode adds to a growing list of controversies involving AI startups and content scraping. Perplexity also drew backlash in 2023 for allegedly plagiarizing journalistic content and using subscription-based materials without permission.
Cloudflare's decision reflects a broader shift in how internet companies are responding to the data collection practices of AI systems. Cloudflare recently launched a marketplace that lets website owners charge AI companies for using their data, while continuing to offer free tools to block AI training bots.
As generative AI tools continue to rely on huge datasets of content scraped from the internet, questions of digital ownership, consent, and responsible AI development are drawing more attention.