
Cloudflare is accusing the AI search engine Perplexity of covertly crawling websites, even when site owners have explicitly forbidden access through robots.txt files or firewall rules. According to the company, Perplexity disguises its identity to sidestep restrictions and is violating established internet norms.

In a recent blog post, Cloudflare claims that Perplexity switches to stealth crawling tactics whenever its official bot is blocked. The company says it has removed Perplexity from its verified bots list and taken steps to block what it calls "stealth crawling."

Cloudflare's investigation began after customers reported that Perplexity continued to access their content despite explicit blocks in robots.txt and custom firewall rules.

Cloudflare exposes evasion tactics

To verify the allegations, Cloudflare ran its own experiment. The team set up new, unlisted domains and blocked all bots in the robots.txt file. They also used firewall rules to block Perplexity's declared crawlers, "PerplexityBot" and "Perplexity-User."
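A robots.txt file that blocks all crawlers, as in Cloudflare's experiment, would look something like the following. This is a minimal sketch; Cloudflare did not publish the exact files or firewall rule syntax it used:

```
# robots.txt at the root of the test domain:
# disallow every path for every crawler
User-agent: *
Disallow: /
```

A well-behaved crawler fetches this file first and, seeing the blanket disallow, stops before requesting any content.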

Despite these measures, Perplexity was still able to provide detailed information about the restricted domains, Cloudflare says. The tests showed that Perplexity either ignored robots.txt or didn't check it at all.

Disguised crawlers with rotating identities

Cloudflare's findings point to a two-step process. First, Perplexity attempts to access content with its declared user agent. If blocked, a second, undeclared crawler takes over, using a generic user agent that mimics Google Chrome on macOS.
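This is why user-agent-based blocking alone fails against a disguised crawler. The sketch below illustrates the gap with a naive rule that matches only the declared crawler names; the user agent strings are illustrative, not taken from Cloudflare's report verbatim:

```python
# Hedged sketch: a firewall rule that blocks only declared crawler identities.
DECLARED_AGENTS = ["PerplexityBot", "Perplexity-User"]  # names cited by Cloudflare

def is_blocked(user_agent: str) -> bool:
    """Return True if the request's user agent matches a declared crawler name."""
    return any(name in user_agent for name in DECLARED_AGENTS)

# A request that honestly declares itself:
declared = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
# A generic Chrome-on-macOS user agent, like the one the stealth crawler reportedly used:
disguised = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

print(is_blocked(declared))   # True  - the declared crawler is caught
print(is_blocked(disguised))  # False - the disguised request slips through
```

Catching the disguised traffic instead requires behavioral signals (request patterns, network fingerprints), which is what Cloudflare's managed rules target.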

This disguised crawler not only uses undeclared IP addresses but also rotates both IPs and ASNs (Autonomous System Numbers) to bypass blocks. An ASN is a unique identifier assigned to a network operator, representing a larger network that manages its own routing. By switching ASNs, the crawler can appear to originate from entirely different networks.
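Rotating across networks defeats IP-based blocking for the same reason. The sketch below uses documentation address ranges (not real crawler IPs) to show how a block list tied to one network misses the same crawler arriving from another:

```python
import ipaddress

# Hedged sketch: an IP-range block that covers only one identified network.
# 192.0.2.0/24 and 198.51.100.0/24 are reserved documentation ranges, used
# here to stand in for two different networks (different ASNs).
BLOCKED_NETWORK = ipaddress.ip_network("192.0.2.0/24")

def is_blocked(ip: str) -> bool:
    """Return True if the source IP falls inside the blocked network range."""
    return ipaddress.ip_address(ip) in BLOCKED_NETWORK

print(is_blocked("192.0.2.10"))    # True  - request from the blocked network
print(is_blocked("198.51.100.7"))  # False - same crawler, different network
```

Each time the crawler moves to a fresh network, the site owner would have to identify and block the new range, which is why Cloudflare leans on crawler signatures rather than address lists alone.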

Cloudflare observed this activity across tens of thousands of domains, with millions of requests each day. When even the disguised crawler was blocked, Perplexity returned only vague answers, confirming that the block was effective.

OpenAI as a positive example

Cloudflare contrasts Perplexity's behavior with that of more transparent crawlers that respect site owners' rules. OpenAI is highlighted as a positive example: the company clearly declares its crawlers and their purpose, and it honors both robots.txt directives and network blocks.

In the same test, ChatGPT stopped crawling after reading the robots.txt file and did not attempt to access content using alternate user agents.

Cloudflare rolls out new protections

In response, Cloudflare has added the disguised crawler's signatures to its managed rules for blocking AI crawlers, which are available to all customers, including those on free plans. Customers who use bot management rules to block or challenge requests are already protected.

Cloudflare expects bot operators' tactics to keep evolving and says it is working with experts to standardize crawler behavior, including proposed updates for robots.txt.

Summary
  • Cloudflare accuses Perplexity of covertly crawling websites by disguising its identity and bypassing explicit blocks set by site owners in robots.txt files and firewall rules, calling the behavior a violation of internet norms.
  • Cloudflare’s investigation found that after its official crawler was blocked, Perplexity switched to using a generic user agent that mimics Google Chrome, rotating IP addresses and ASNs to evade detection, and continued accessing restricted content across tens of thousands of domains.
  • As a response, Cloudflare has removed Perplexity from its verified bots list, updated its protections to block disguised crawlers, and pointed to OpenAI as a positive example of respecting site owners’ rules and network blocks.
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.