
Controlling AI Crawlers: robots.txt, llms.txt, and Who's Scraping You


A few years ago there was one crawler that mattered: Googlebot. Now there’s a crowd of AI bots reading your site, and they’re not all there for the same reason. Some want to put you in AI answers. Some want your content to train a model. A few will ignore your wishes entirely.

Before you block or allow anything, the question to settle is what you actually want. Most site owners conflate two separate decisions:

  1. Do you want to appear in AI answers? (ChatGPT, Perplexity, Google’s AI Overviews citing your page)
  2. Do you want your content used to train models?

You can say yes to the first and no to the second. Different bots, different rules. Here’s how to tell them apart and what to do about it.

The bots that matter, and what each is for

AI crawlers fall into three jobs. Knowing the job tells you whether blocking helps or hurts you.

| User-agent | Operator | Job |
| --- | --- | --- |
| GPTBot | OpenAI | Crawls content for model training |
| OAI-SearchBot | OpenAI | Indexes pages for ChatGPT search results |
| ChatGPT-User | OpenAI | Fetches a page live when a user asks about it |
| ClaudeBot | Anthropic | Crawls content (training and product) |
| PerplexityBot | Perplexity | Indexes pages so Perplexity can cite them |
| Google-Extended | Google | Controls use in Gemini / AI training, separate from Search |
| Bytespider | ByteDance | Crawls content for training |
| CCBot | Common Crawl | Open web archive widely used as training data |
| Applebot-Extended | Apple | Controls use of your content for Apple’s AI training |
| Amazonbot | Amazon | Crawls for Alexa and Amazon’s AI |
| Meta-ExternalAgent | Meta | Crawls content for Meta’s AI |

The important split: retrieval bots (OAI-SearchBot, PerplexityBot, ChatGPT-User) are how you get cited in AI answers. Training bots (GPTBot, CCBot, Bytespider, Applebot-Extended) feed models but don’t necessarily send you anything back. If you want visibility in AI answers, blocking the retrieval bots is shooting yourself in the foot.

Google-Extended deserves a callout: it controls whether your content feeds Google’s AI products without affecting your normal Google Search ranking. Blocking it does not hurt your SEO. That makes it the cleanest “opt out of AI training, keep my rankings” lever available.
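
One way to put this split to work is to tag requests by job when you look at your access logs. The sketch below is a minimal illustration, not a complete detector: the `classify` and `tally` helpers and the sample header strings are hypothetical, and the groupings simply mirror the split described above. It assumes you already have User-Agent header strings to hand. As far as we know, Google-Extended and Applebot-Extended are robots.txt control tokens rather than crawlers with their own request headers, so they are left out here.

```python
from collections import Counter

# Groupings mirror the retrieval/training split described in the article.
RETRIEVAL = ("oai-searchbot", "perplexitybot", "chatgpt-user")
TRAINING = ("gptbot", "ccbot", "bytespider")


def classify(user_agent: str) -> str:
    """Return 'retrieval', 'training', or 'other' for a User-Agent header."""
    ua = user_agent.lower()
    if any(token in ua for token in RETRIEVAL):
        return "retrieval"
    if any(token in ua for token in TRAINING):
        return "training"
    return "other"


def tally(user_agents: list[str]) -> Counter:
    """Count requests per job across a list of User-Agent headers."""
    return Counter(classify(ua) for ua in user_agents)


# Hypothetical header strings, for illustration only.
sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
print(tally(sample))
```

This only sees bots that identify themselves honestly, so treat the counts as a lower bound.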

The deprecated user-agent trap

Here’s a mistake we see in a lot of older robots.txt files. Anthropic’s old user-agents Claude-Web and anthropic-ai are no longer active. The current crawler is ClaudeBot. If your robots.txt blocks only the old strings, you are not blocking Anthropic’s crawler at all — you’re blocking ghosts.

The lesson generalizes: AI crawler names change, get retired, and split into new ones faster than SEO advice gets updated. Treat your robots.txt as something to review every few months, not set-and-forget.
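
A periodic review can be partly automated. The following sketch scans a robots.txt for the retired Anthropic names mentioned above and warns when ClaudeBot has no rule of its own. The `DEPRECATED` map and the `stale_rules` helper are hypothetical names for illustration, and the map covers only the two strings this article names; you would need to maintain it yourself.

```python
# Retired user-agent names mapped to their current replacement.
# Only the pair named in the article; extend as crawler names change.
DEPRECATED = {"Claude-Web": "ClaudeBot", "anthropic-ai": "ClaudeBot"}


def declared_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def stale_rules(robots_txt: str) -> list[str]:
    """Warn about retired names whose replacement has no rule."""
    agents = declared_agents(robots_txt)
    warnings = []
    for old, current in DEPRECATED.items():
        if old.lower() in agents and current.lower() not in agents:
            warnings.append(f"{old} is retired; add a rule for {current}")
    return sorted(warnings)


old_file = "User-agent: Claude-Web\nDisallow: /\n"
print(stale_rules(old_file))
```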

How to write the rules

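robots.txt lives at the root of your domain (https://yoursite.com/robots.txt). It is organized into groups, each starting with one or more User-agent lines followed by that group’s Allow and Disallow rules. A crawler that follows the standard (RFC 9309) obeys the group that names it and falls back to the * group if none does, so the order of the groups does not change the outcome. To block a specific bot: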

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

A common, sensible posture for a business that wants to be found in AI answers but doesn’t want to donate to training datasets: allow the retrieval bots, block the training-only ones.

# Let AI answer engines cite us
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Opt out of training crawls
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Keep Google Search, opt out of Google's AI training
User-agent: Google-Extended
Disallow: /
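
Before deploying a file like this, it is worth checking that it does what you intend. The sketch below feeds the rules above to Python’s standard-library `urllib.robotparser` and asks whether each bot may fetch a page. The `check_policy` helper and the `/pricing` path are hypothetical, and the standard-library parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The policy from the article, comments omitted.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def check_policy(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent to whether the rules let it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = check_policy(
    ROBOTS_TXT,
    ["OAI-SearchBot", "GPTBot", "Google-Extended", "Googlebot"],
    "https://yoursite.com/pricing",  # hypothetical page
)
for agent, allowed in result.items():
    print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

Googlebot has no group of its own and the file has no `*` group, so it falls through to the default of allowed, which is the "keep Google Search" half of the policy.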

There’s no single right answer here. A publisher protecting premium content blocks aggressively. A consultancy that wants to be the source ChatGPT quotes allows everything. Decide based on whether being cited is worth more to you than the content itself.

You can validate your file with our robots.txt Checker, and build one from scratch with the robots.txt Generator. For the SEO fundamentals, see why robots.txt matters.

The honest caveat: robots.txt is a request, not a fence

This matters enough to say plainly. robots.txt is voluntary. Well-behaved bots from OpenAI, Anthropic, Google, and Perplexity honor it. Plenty of scrapers do not, and some have been caught crawling with disguised or rotating user-agents.

If you need enforcement rather than a polite request, robots.txt is the wrong tool. You need edge-level blocking: a WAF rule, a CDN bot-management feature (Cloudflare and others now ship one-click AI bot blocking), or rate limiting that actually drops the connection. robots.txt expresses intent; the edge enforces it. Use both — the file documents your policy, the edge backs it up.
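
To make the difference between a request and a fence concrete, here is a minimal sketch of the enforcing side, written as Python WSGI middleware that refuses matching requests with a 403. It stands in for the equivalent WAF or CDN rule and is not a substitute for one. The `block_ai_bots` wrapper, the `demo_app`, and the blocked list are hypothetical. It matches only on the declared User-Agent, so it will not catch scrapers that disguise themselves; that needs IP or behavior-based controls at the edge.

```python
# Tokens to refuse, matched case-insensitively against the User-Agent header.
BLOCKED_TOKENS = ("gptbot", "ccbot", "bytespider")


def block_ai_bots(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from listed bots get a 403."""

    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in blocked):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"),
                 ("Content-Length", str(len(body)))],
            )
            return [body]
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Hypothetical application standing in for your real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


guarded = block_ai_bots(demo_app)
```

Any WSGI server can serve `guarded` in place of the original app; the robots.txt file still documents the policy this code enforces.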

llms.txt: what it is, and why we’re lukewarm

You’ve probably heard about llms.txt. It’s a proposed standard from Jeremy Howard (Answer.AI), introduced in September 2024: a Markdown file at the root of your site that gives AI models a curated, clean summary of your most important pages, instead of making them parse your navigation and footers.
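
For a sense of what such a file contains, the sketch below generates one in Python. It follows the layout of the proposal as we understand it: a title heading, a one-line summary in a blockquote, then sections of annotated links. The `build_llms_txt` helper, the company name, and every URL are hypothetical.

```python
def build_llms_txt(title: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render a Markdown llms.txt: title, summary, then link sections."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content, for illustration only.
text = build_llms_txt(
    "Example Co",
    "Consultancy that builds and ships web software.",
    {
        "Docs": [
            ("Getting started", "https://example.com/docs/start.md",
             "Setup and first deploy"),
        ],
    },
)
print(text)
```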

The idea is genuinely nice. The reality, as of mid-2026, is underwhelming:

  • Google has said it does not use llms.txt and has no plans to. Google’s Gary Illyes confirmed this in July 2025, and John Mueller compared it to the long-discredited keywords meta tag.
  • No major AI provider has committed to it as a ranking or answer signal in their production systems.
  • Adoption is thin. A study of 300,000 domains found roughly 1 in 10 had one. After a year and a half of hype, that’s not momentum.

So our position: adding an llms.txt is cheap and harmless, and if you maintain a docs site it can be a tidy index. But do not expect it to move you up in AI answers, and do not let anyone sell it to you as a ranking lever. It is not one today.

The work that does influence whether AI engines cite you is the same work that earns featured snippets: clear structure, real expertise, content that directly answers the question, and being crawlable in the first place. That’s Generative Engine Optimization, and it’s where your effort should go.

A decision framework

Strip away the noise and it comes down to three questions:

  1. Do you want to be cited in AI answers? If yes, allow the retrieval bots (OAI-SearchBot, PerplexityBot, ChatGPT-User) and do the GEO work. Most businesses should.
  2. Do you mind your content training models? If you do, block the training bots (GPTBot, CCBot, Bytespider, Applebot-Extended) and set Google-Extended to disallow. Your Search ranking is unaffected.
  3. Do you need this enforced? If you’re protecting genuinely valuable content, back robots.txt with edge-level bot blocking, because the file alone won’t stop a determined scraper.
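
The first two questions map directly onto robots.txt rules, which the sketch below turns into a small generator. The `build_robots_txt` helper is hypothetical, and it assumes that answering "no" to being cited means disallowing the retrieval bots, which is our reading rather than something the framework states. The third question is not represented: enforcement happens at the edge, not in this file.

```python
# Bot groupings taken from the decision framework above.
RETRIEVAL_BOTS = ("OAI-SearchBot", "PerplexityBot", "ChatGPT-User")
TRAINING_BOTS = ("GPTBot", "CCBot", "Bytespider",
                 "Applebot-Extended", "Google-Extended")


def build_robots_txt(cited_in_ai_answers: bool, allow_training: bool) -> str:
    """Emit one robots.txt group per bot based on the two answers."""
    groups = []
    for bot in RETRIEVAL_BOTS:
        # Assumption: a "no" to citation means blocking retrieval bots.
        rule = "Allow: /" if cited_in_ai_answers else "Disallow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    for bot in TRAINING_BOTS:
        rule = "Allow: /" if allow_training else "Disallow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(groups) + "\n"


# Be cited in AI answers, opt out of training.
print(build_robots_txt(cited_in_ai_answers=True, allow_training=False))
```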

Skip llms.txt as a ranking play. Add it only if a clean index of your content is useful for its own sake.

If you want a look at who’s actually hitting your site and a robots.txt that matches your strategy instead of someone else’s copy-paste, reach out and we’ll sort it out.
