AI bot directives in robots.txt
Explicit User-agent rules for AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) — whether allow or disallow. Demonstrates an intentional posture toward AI traffic.
What are AI bot directives in robots.txt?
AI bot directives are explicit User-agent rules in your robots.txt file that tell AI crawlers (GPTBot, ClaudeBot, PerplexityBot, and others) whether they may access your content. Unlike traditional search-engine crawlers, these bots collect data to train models or populate knowledge bases for conversational agents. A directive can allow or disallow them, site-wide or by path. Rate limiting relies on the non-standard Crawl-delay line, which is not part of RFC 9309 and which not every crawler honors.
Technically, this means adding stanzas to /robots.txt with user-agent strings like GPTBot, ClaudeBot, Claude-Web, anthropic-ai, PerplexityBot, Google-Extended, Applebot-Extended, CCBot, FacebookBot, or Bytespider. The OpenAI crawler overview documents GPTBot; other vendors maintain similar docs. A directive can be Disallow: / (full block) or selective path rules. The key is intentionality—stating your stance, not leaving it ambiguous.
Why do AI bot directives in robots.txt matter for conversational agents?
When ChatGPT, Claude, or Perplexity answer questions, they may attempt to fetch fresh data from your site—either to cite sources in real-time or to update their training corpus. If your robots.txt is silent on AI bots, you're relying on each vendor's default behavior, which varies. Some crawl aggressively; others interpret silence as permission. An explicit directive ensures you control whether your documentation, pricing pages, or blog posts feed these systems—or whether agents must rely on stale snapshots.
For businesses building agent-first experiences—agentic commerce checkouts, API-driven support bots, or tool-calling integrations—clarity on crawling posture reduces surprise traffic spikes and WAF false positives. If you want your product referenced in ChatGPT answers, allowing the crawler improves citation freshness. If you disallow it, you signal a paywall or prefer direct API integrations over scraping. Either is fine; ambiguity is the problem.
Are AI bot directives in robots.txt required for my site?
This check is optional for most sites. The Robots Exclusion Protocol doesn't mandate AI-specific rules, and many sites operate fine without them. It becomes recommended—or required—if you're in one of three buckets: (1) you're deliberately courting AI citations (SaaS docs, news publishers, open knowledge bases); (2) you've seen unexplained traffic spikes and want to fence off training crawlers; or (3) you're under legal or compliance pressure to limit data reuse (financial services, healthcare portals). If none apply, leaving AI bots unmentioned is a rational default—though you lose the ability to measure or negotiate.
What the robots.txt standard says about AI crawler user-agents
The Robots Exclusion Protocol (RFC 9309) defines User-agent and Disallow directives but doesn't enumerate AI crawlers. Each AI vendor publishes its own bot documentation:
- OpenAI: GPTBot and ChatGPT-User
- Anthropic: ClaudeBot, Claude-Web, anthropic-ai
- Perplexity: PerplexityBot
- Google: Google-Extended (separate from Googlebot)
- Apple: Applebot-Extended
- Common Crawl: CCBot
A minimal valid example blocking the AI crawlers named on this page (training crawlers as well as user-triggered fetchers such as ChatGPT-User):
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Bytespider
Disallow: /
Allowing all AI crawlers is equally valid—just omit these stanzas or set Disallow: (blank). Selective rules (e.g., Disallow: /internal/) are supported but require vigilance as bot names evolve.
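Both variants can be checked locally before you deploy. The sketch below assumes Python's standard urllib.robotparser module and a placeholder example.com URL; render_rules and parse are illustrative helper names, not part of any library. It renders one group per agent and confirms how the block-all and allow-all forms resolve.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the vendor list above; extend as needed.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
             "Google-Extended", "Applebot-Extended", "CCBot"]


def render_rules(agents, blocked=True):
    """Return robots.txt text with one group per agent."""
    # "Disallow: /" blocks everything; a blank "Disallow:" blocks nothing.
    rule = "Disallow: /" if blocked else "Disallow:"
    groups = [f"User-agent: {agent}\n{rule}" for agent in agents]
    return "\n\n".join(groups) + "\n"


def parse(text):
    """Load robots.txt text into the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


blocked = parse(render_rules(AI_AGENTS, blocked=True))
opened = parse(render_rules(AI_AGENTS, blocked=False))

url = "https://example.com/docs/"  # placeholder URL
print(blocked.can_fetch("GPTBot", url))     # False
print(opened.can_fetch("GPTBot", url))      # True
print(blocked.can_fetch("Googlebot", url))  # True: no group names it
```

Note that urllib.robotparser matches user-agents by substring and applies the first matching rule in a group, so treat it as a quick sanity check rather than a reference implementation of RFC 9309.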
What good AI bot directives in robots.txt look like
The New York Times explicitly disallows AI training bots to protect licensed content, while Stack Overflow historically allowed crawlers but later tightened rules amid licensing negotiations. These illustrate the two poles: full block vs. controlled access.
A permissive example (conceptual, based on common open-documentation patterns):
User-agent: GPTBot
Allow: /docs/
Allow: /blog/
Disallow: /admin/
User-agent: ClaudeBot
Allow: /docs/
Allow: /blog/
Disallow: /admin/
A restrictive example:
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
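The pattern: state your baseline for traditional crawlers, then override for AI-specific agents. Under RFC 9309 a crawler follows the group that names it most specifically and falls back to User-agent: * only when no group names it, so a named AI bot ignores your wildcard rules entirely. Repeat inside its group any path rules you still want it to obey.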
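How a named group interacts with the wildcard group can be confirmed with a short local test. This is a minimal sketch, assuming Python's standard urllib.robotparser and a placeholder example.com URL; allowed is an illustrative helper name. It parses the same two groups in both orders and asks what GPTBot and Googlebot may fetch.

```python
from urllib.robotparser import RobotFileParser

WILDCARD_FIRST = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

SPECIFIC_FIRST = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, agent, url="https://example.com/pricing"):
    """Return True if `agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for name, text in [("wildcard first", WILDCARD_FIRST),
                   ("specific first", SPECIFIC_FIRST)]:
    # GPTBot uses its own group; Googlebot falls back to the * group.
    print(name, allowed(text, "GPTBot"), allowed(text, "Googlebot"))
# wildcard first False True
# specific first False True
```

Both orderings give the same answer: the group naming GPTBot applies to GPTBot, and the wildcard group applies only to agents that no other group names.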
How do I add AI bot directives to my robots.txt file?
- Audit your current robots.txt: Fetch https://yourdomain.com/robots.txt and check for AI bot user-agents. If missing, proceed.
- Decide your posture: Allow (for citation/discovery), disallow (for control/licensing), or selective (allow docs, block admin).
- Add stanzas: Append user-agent blocks for the bots above. Group order does not change matching under RFC 9309, but listing specific agents before the wildcard keeps the file easy to read.
- Validate syntax: Use a robots.txt validator (such as the robots.txt report in Google Search Console) or a local parser to catch typos.
- Deploy: If you're on a static host (Vercel, Netlify), place robots.txt in /public. On Next.js, use public/robots.txt or generate it in middleware. On WordPress, use a plugin like Yoast or edit via FTP.
- Monitor: Check server logs for these user-agents; confirm the crawlers respect your rules (most do; enforcement is voluntary).
Example for a SaaS allowing docs crawling:
User-agent: GPTBot
Allow: /docs/
Allow: /api-reference/
Disallow: /
User-agent: ClaudeBot
Allow: /docs/
Allow: /api-reference/
Disallow: /
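The example above works because RFC 9309 resolves conflicts by specificity: the matching rule with the longest path wins, and Allow wins a tie. The sketch below is a simplified illustration of that precedence, not a full parser. is_allowed and SAAS_RULES are illustrative names, and the * and $ wildcards are deliberately left out.

```python
def is_allowed(rules, path):
    """Apply longest-match precedence to (directive, prefix) pairs.

    Simplification: plain prefix matching only; the * and $ wildcards
    described in RFC 9309 are not handled here.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        allow = directive.lower() == "allow"
        # A longer prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, best_allow = len(prefix), allow
    return best_allow


# The GPTBot group from the SaaS example above.
SAAS_RULES = [
    ("Allow", "/docs/"),
    ("Allow", "/api-reference/"),
    ("Disallow", "/"),
]

for path in ["/docs/quickstart", "/api-reference/v1", "/pricing", "/"]:
    print(path, is_allowed(SAAS_RULES, path))
# /docs/quickstart True
# /api-reference/v1 True
# /pricing False
# /False
```

Because precedence depends on path length rather than position, reordering the three rules gives the same results.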
How can I test my robots.txt for AI bot directives?
Fetch your robots.txt and check for AI-specific user-agents:
curl -s https://yourdomain.com/robots.txt | grep -E "(GPTBot|ClaudeBot|PerplexityBot|Google-Extended|CCBot)"
If the output is empty, you have no AI directives.
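The same audit can be scripted when you want a per-agent report rather than raw grep output. This is a minimal sketch that works on robots.txt text you have already downloaded; audit, AI_AGENTS, and SAMPLE are illustrative names, and the agent list is an assumption you should adjust to the bots you care about.

```python
import re

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
             "Google-Extended", "Applebot-Extended", "CCBot"]


def audit(robots_txt, agents=AI_AGENTS):
    """Map each agent to True if some User-agent line names it."""
    declared = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = re.match(r"\s*user-agent\s*:\s*(\S+)", line, re.IGNORECASE)
        if match:
            declared.add(match.group(1).lower())
    # Compare case-insensitively, as crawlers do with product tokens.
    return {agent: agent.lower() in declared for agent in agents}


SAMPLE = """\
User-agent: *
Allow: /

user-agent: gptbot   # lower-case on purpose
Disallow: /
"""

report = audit(SAMPLE)
missing = [agent for agent, found in report.items() if not found]
print(missing)
# ['ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Google-Extended',
#  'Applebot-Extended', 'CCBot']
```

Unlike the grep one-liner, this only counts User-agent lines, so an agent name that appears in a comment is not mistaken for a directive.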
Frequently asked questions
Does blocking GPTBot in robots.txt prevent ChatGPT from citing my site?
No, this is a common myth. Blocking GPTBot stops OpenAI from crawling your site for training data, but ChatGPT can still cite you if it has older snapshots or if users paste your URLs directly. Fetches made on a user's behalf identify as ChatGPT-User, which you must block separately, and OpenAI also documents an OAI-SearchBot agent for its search features. Disallowing each agent you care about gives you fuller control over how OpenAI accesses your content.
Do AI bots respect robots.txt directives, or do they ignore them?
Most reputable AI crawlers—GPTBot, ClaudeBot, Google-Extended, PerplexityBot—respect robots.txt directives. However, compliance is voluntary; the protocol has no enforcement mechanism. Smaller or malicious bots may ignore rules. Monitor server logs for the user-agent strings listed in your robots.txt to verify compliance. Consider WAF rules or rate-limiting for non-compliant crawlers.
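Log monitoring of this kind can be sketched in a few lines. The example below assumes access logs in the common "combined" layout and uses made-up log lines with shortened user-agent strings and documentation-range IP addresses. ai_hits, LOG_LINES, and DISALLOWED_PREFIXES are illustrative names, and the disallowed prefix stands in for whatever your robots.txt actually blocks.

```python
from collections import Counter

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot"]

# Made-up sample lines; real user-agent strings are longer.
LOG_LINES = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET /docs/ HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '192.0.2.10 - - [01/Jan/2025:00:00:02 +0000] "GET /admin/ HTTP/1.1" 403 0 "-" "GPTBot/1.0"',
    '192.0.2.11 - - [01/Jan/2025:00:00:03 +0000] "GET /docs/ HTTP/1.1" 200 512 "-" "ClaudeBot/1.0"',
    '192.0.2.12 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 128 "-" "Mozilla/5.0"',
]

DISALLOWED_PREFIXES = ["/admin/"]  # assumption: paths your robots.txt blocks


def ai_hits(lines, agents=AI_AGENTS):
    """Count requests per (agent, path) for known AI user-agent tokens."""
    hits = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not in the expected layout
        fields = parts[1].split()  # e.g. ['GET', '/docs/', 'HTTP/1.1']
        if len(fields) < 2:
            continue
        path, user_agent = fields[1], parts[5].lower()
        for agent in agents:
            if agent.lower() in user_agent:
                hits[(agent, path)] += 1
                break
    return hits


hits = ai_hits(LOG_LINES)
violations = sorted(
    (agent, path) for agent, path in hits
    if any(path.startswith(prefix) for prefix in DISALLOWED_PREFIXES)
)
print(violations)  # [('GPTBot', '/admin/')]
```

User-agent strings can be spoofed, so a hit on a disallowed path is a prompt to investigate, not proof that a vendor's crawler ignored your rules.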
Should e-commerce sites block or allow AI bot crawlers in robots.txt?
It depends on your goals. Allowing GPTBot and ClaudeBot on product pages can help conversational agents recommend your products in shopping queries. Block them on checkout, account, or admin paths to prevent data leakage. Many e-commerce sites use selective Allow: /products/ rules while disallowing sensitive endpoints. Monitor for traffic spikes and adjust based on conversion data.
How do AI bot directives in robots.txt differ from the llms.txt standard?
robots.txt controls whether crawlers access your site; llms.txt, which is a proposed convention rather than a ratified standard, gives agents a curated guide to your content once they have it. You might disallow GPTBot in robots.txt to block training scrapes, while still publishing llms.txt to guide agents using cached or API-provided data. The two are complementary: robots.txt is access control, llms.txt is guidance.
Can I add AI bot directives to robots.txt on Vercel or Netlify?
Yes. Place a robots.txt file in your /public directory (Next.js, Vite, Astro) or at the root of your deploy folder. Vercel and Netlify serve static files at the root automatically. You can also generate robots.txt dynamically in Next.js middleware or API routes. After deploying, verify with curl https://yourdomain.com/robots.txt.
Should SaaS documentation sites block or allow ClaudeBot and GPTBot?
Most SaaS docs sites allow these crawlers on public documentation paths to improve discoverability in AI-powered search and coding assistants. Block them on customer portals, admin panels, or gated content. Example: Allow: /docs/ and Disallow: /dashboard/ for ClaudeBot. This maximizes citation potential while protecting sensitive areas. Monitor for traffic patterns and adjust rules as needed.
What happens if I don't specify any AI bot user-agents in robots.txt?
Silence is typically interpreted as permission: most AI crawlers will access your site under their default policies. GPTBot, for instance, treats public pages as crawlable unless robots.txt disallows it. If you have no position on AI training or citations, leaving robots.txt silent is fine. But you give up a documented stance to point to when you measure crawler impact or negotiate licensing. Explicit directives give you control and visibility.
How do I handle new AI bot user-agents that emerge after I configure robots.txt?
The AI crawler landscape evolves quickly, and new bots launch regularly. Follow the crawler documentation that vendors publish (OpenAI, Anthropic, Perplexity) and monitor your server logs for unfamiliar user-agents matching AI patterns. Review your robots.txt on a fixed schedule, for example quarterly, to add new crawlers. Consider a catch-all strategy: allow known-good bots explicitly, then use WAF rules to rate-limit or block unrecognized agents.
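Spotting unfamiliar bots in logs can be partly automated. The sketch below is a rough heuristic, not a detection method any vendor endorses: it pulls tokens ending in "bot", "crawler", or "spider" out of user-agent strings and reports those missing from a list you maintain. KNOWN_AGENTS, BOT_TOKEN, and unfamiliar_bots are illustrative names, and ExampleAIBot is a made-up agent.

```python
import re

# Agents you have already made a decision about (lower-case).
KNOWN_AGENTS = {"gptbot", "chatgpt-user", "claudebot", "perplexitybot",
                "ccbot", "googlebot", "bingbot"}

# Heuristic: a word that ends in "bot", "crawler", or "spider".
BOT_TOKEN = re.compile(
    r"([A-Za-z][A-Za-z0-9_-]*(?:bot|crawler|spider))\b", re.IGNORECASE
)


def unfamiliar_bots(user_agents, known=KNOWN_AGENTS):
    """Return sorted bot-like tokens that are not in `known`."""
    found = set()
    for user_agent in user_agents:
        for token in BOT_TOKEN.findall(user_agent):
            if token.lower() not in known:
                found.add(token)
    return sorted(found)


SAMPLE_AGENTS = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ExampleAIBot/0.1",               # made-up agent for illustration
    "Mozilla/5.0 (Windows NT 10.0)",  # ordinary browser, no bot token
    "Bytespider",
]

print(unfamiliar_bots(SAMPLE_AGENTS))  # ['Bytespider', 'ExampleAIBot']
```

Agents that do not follow these naming habits (ChatGPT-User or Google-Extended, for example) slip past the pattern, so use the output as a review queue alongside vendor documentation, not as a complete inventory.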