required · Technical · discoverability

robots.txt

Every crawlable site should publish a robots.txt at the origin root. AI agents read it to learn which paths are allowed and which user-agents are blocked.

6 min read · Updated 2026-04-25

What is robots.txt and why do AI agents check it?

A robots.txt file is a plain-text resource served at the root of your domain—https://yoursite.com/robots.txt—that tells automated crawlers which parts of your site they may or may not access. Originally designed for search-engine spiders in the mid-1990s, it has become the de facto standard for polite bot behavior. AI agents—whether they're scraping for training data, building citation indexes, or executing live retrieval tasks—check this file before they start crawling.

The syntax and semantics are now codified in RFC 9309, published by the IETF in 2022. The spec defines how to declare user-agent groups and how to allow or disallow URL paths for them. Sitemap pointers and crawl delays are common extensions that sit outside the core protocol. A well-formed robots.txt is a simple, declarative contract: "If you're bot X, stay out of /admin and /drafts, but feel free to read everything else."

Why does robots.txt matter for GPTBot, ClaudeBot, and other AI crawlers?

Modern AI crawlers, including GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's control token for AI training), PerplexityBot, and dozens of others, are documented by their operators as honoring robots.txt. If you want your content cited in ChatGPT, surfaced in Perplexity's answer engine, or retrieved by Claude, a misconfigured robots.txt can silently exclude you, and a missing one leaves the decision to each crawler's defaults. Conversely, an overly permissive file invites crawling of staging environments, internal APIs, or user-generated content you never intended to train someone else's model. Because robots.txt is advisory rather than an access control, protect anything truly sensitive with authentication.

The business outcomes are concrete. A properly scoped robots.txt can improve your chances of being cited in agent-generated answers, because crawlers know exactly which public pages they may fetch. It also documents which bots you expect, which helps when tuning Web Application Firewall rules that otherwise can't distinguish a legitimate agent from a scraper. And for agentic commerce flows, where an agent browses your catalog, adds items to a cart, and completes checkout, explicit crawl permission is table stakes. If the agent can't discover your product pages, it can't buy from you.

Is robots.txt required for agent-ready sites?

This check is required for most sites. If your content is public and you want any form of agent discoverability—whether for citation, retrieval, or agentic transactions—you need a robots.txt that explicitly permits the user-agents you care about. A missing file is interpreted by most crawlers as "allow all," but that's a soft signal. An explicit, well-formed robots.txt is unambiguous and demonstrates you've thought through access control.

The exceptions are narrow: internal tools behind authentication, ephemeral dev environments, or sites that have deliberately chosen to block all automated access. Even then, an explicit Disallow: / for User-agent: * is clearer than silence.

What RFC 9309 says about robots.txt syntax and directives

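RFC 9309 defines the core records; two widely used extensions sit alongside them:

User-agent: A product token identifying the crawler (matched case-insensitively). It starts a group of rules.
Allow / Disallow: Path prefixes the agent may or may not access. The longest matching path wins; on a tie, Allow wins.
Sitemap: Optional pointer to your XML sitemap (can appear multiple times). Not part of the core protocol, but RFC 9309 mentions it as an example of an additional record crawlers may interpret.
Crawl-delay: Non-standard directive requesting a pause (in seconds) between requests. Some crawlers (for example Bingbot) honor it; others (for example Googlebot) ignore it.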

A minimal valid example:

```
User-agent: *
Disallow: /admin/
Disallow: /tmp/

User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```

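This permits all agents everywhere except /admin/ and /tmp/, and advertises a sitemap for efficient discovery. GPTBot matches its own group, so it follows only that group's rules: a crawler obeys the most specific matching User-agent group and ignores the wildcard group. As written, GPTBot may therefore fetch /admin/ and /tmp/ as well; repeat the Disallow lines inside the GPTBot group if that is not intended.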
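The group selection and longest-match behavior can be sketched in JavaScript. This is a minimal illustration rather than a complete RFC 9309 parser: it assumes plain prefix rules with no `*` or `$` wildcards and no percent-encoding, and `parseRobots` and `isAllowed` are hypothetical helper names, not a library API.

```javascript
// Minimal sketch of RFC 9309 group selection and longest-match evaluation.
// Assumptions: paths are plain prefixes (no `*` / `$`), no percent-encoding.
// `parseRobots` and `isAllowed` are illustrative names only.

function parseRobots(text) {
  const groups = []; // each group: { agents: [...], rules: [{ allow, path }] }
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === 'user-agent') {
      // Consecutive User-agent lines belong to the same group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (key === 'allow' || key === 'disallow') {
      lastWasAgent = false;
      // An empty value (e.g. "Disallow:") restricts nothing, so skip it.
      if (current && value !== '') {
        current.rules.push({ allow: key === 'allow', path: value });
      }
    }
    // Other records (Sitemap, Crawl-delay, ...) are ignored in this sketch.
  }
  return groups;
}

function isAllowed(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();
  // Prefer groups that name this crawler; otherwise fall back to `*`.
  let matched = groups.filter(g =>
    g.agents.some(a => a !== '*' && a.length > 0 && ua.includes(a)));
  if (matched.length === 0) {
    matched = groups.filter(g => g.agents.includes('*'));
  }
  let best = null;
  for (const rule of matched.flatMap(g => g.rules)) {
    if (!path.startsWith(rule.path)) continue;
    const longer = best === null || rule.path.length > best.path.length;
    const tieAllow = best !== null &&
      rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best === null ? true : best.allow; // no matching rule means allowed
}

const robots = `User-agent: *
Disallow: /admin/
Disallow: /tmp/

User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml`;

const groups = parseRobots(robots);
console.log(isAllowed(groups, 'Bingbot', '/admin/users')); // false
console.log(isAllowed(groups, 'GPTBot', '/admin/users'));  // true
console.log(isAllowed(groups, 'Bingbot', '/blog/post'));   // true
```

The second result is the one worth noticing: because GPTBot has its own group, the wildcard group's Disallow lines no longer apply to it.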

What good robots.txt implementation looks like in production

Companies that get this right publish clear, purpose-driven robots.txt files. Stripe, for example, blocks staging paths and internal tooling while allowing public docs and API references. Shopify permits agent crawling of product pages but disallows checkout and account routes.

A representative example (simplified):

```
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/
Allow: /products/
Allow: /collections/
Allow: /pages/

User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

Sitemap: https://shop.example.com/sitemap.xml
Sitemap: https://shop.example.com/sitemap-products.xml
```

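This allows general crawling of marketing and catalog pages, blocks Common Crawl (CCBot), and points to two sitemaps for efficient indexing. The Allow lines in the wildcard group are redundant, since anything not disallowed is already allowed, but they document intent. Note that the GPTBot group replaces the wildcard group for that crawler rather than extending it, so Allow: / also opens /admin/, /checkout/, and /account/ to GPTBot; repeat those Disallow lines in the GPTBot group to keep them blocked. The logic is explicit and auditable.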
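Since a named group replaces the wildcard group instead of extending it, one way to keep sensitive paths blocked for every crawler is to generate the file from a single config. The sketch below assumes a hypothetical `buildRobotsTxt` helper and an illustrative config shape; neither comes from a real library.

```javascript
// Sketch: render robots.txt from one config so shared Disallow rules are
// repeated inside every group. `buildRobotsTxt` and the config shape are
// hypothetical, chosen only for illustration.

function buildRobotsTxt({ sharedDisallow = [], groups = [], sitemaps = [] }) {
  const blocks = groups.map(({ agent, blockAll = false }) => {
    const lines = [`User-agent: ${agent}`];
    if (blockAll) {
      lines.push('Disallow: /');
    } else {
      // Every non-blocked group gets the same sensitive-path exclusions.
      for (const path of sharedDisallow) lines.push(`Disallow: ${path}`);
      lines.push('Allow: /');
    }
    return lines.join('\n');
  });
  const sitemapBlock = sitemaps.map(url => `Sitemap: ${url}`).join('\n');
  return [...blocks, sitemapBlock].filter(Boolean).join('\n\n') + '\n';
}

const robotsTxt = buildRobotsTxt({
  sharedDisallow: ['/admin/', '/checkout/', '/account/'],
  groups: [
    { agent: '*' },
    { agent: 'GPTBot' },
    { agent: 'CCBot', blockAll: true },
  ],
  sitemaps: ['https://shop.example.com/sitemap.xml'],
});

console.log(robotsTxt);
```

With this shape, adding a new crawler is a one-line change and the sensitive paths cannot be forgotten in its group.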

How do I add or fix robots.txt on my site?

Create the file. Add a plain-text file named robots.txt at the root of your public directory (e.g., public/robots.txt in Next.js, static/robots.txt in many frameworks).
Define your user-agent rules. Start with a wildcard rule, then override for specific bots:
```
User-agent: *
Disallow: /private/

User-agent: GPTBot
Allow: /
```
Add a sitemap reference. Help agents find your structured content:
```
Sitemap: https://yoursite.com/sitemap.xml
```

Deploy and verify. For Next.js, place robots.txt in the public/ folder; it's served automatically. For Vercel, the same applies. For Cloudflare Workers, serve it from a route handler:

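```
// Body of the robots.txt file; edit the rules to match your site.
const robotsTxt = `User-agent: *
Disallow: /private/

Sitemap: https://yoursite.com/sitemap.xml
`;

export default {
  async fetch(request) {
    const url = new URL(request.url);
    if (url.pathname === '/robots.txt') {
      return new Response(robotsTxt, {
        headers: { 'Content-Type': 'text/plain; charset=utf-8' }
      });
    }
    // Placeholder: replace with your real routing or origin fetch.
    return new Response('Not found', { status: 404 });
  }
};
```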

Test syntax. Use an online validator or the robots.txt report in Google Search Console to catch typos and fetch errors.

How can I test my robots.txt file right now?

```
curl -I https://yoursite.com/robots.txt
curl https://yoursite.com/robots.txt
```

The first command should return a 200 status (for example HTTP/2 200 or HTTP/1.1 200 OK) with Content-Type: text/plain. The second should display your rules. Or just run a free scan and we'll check this for you alongside 30+ other agent-readiness signals.
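The same checks can be scripted. The sketch below is an assumption-laden example: `checkRobotsResponse` and `checkSite` are hypothetical helpers that look only at the status code, the Content-Type header, and whether the body contains at least one User-agent line.

```javascript
// Sketch: validate the response a crawler would see for /robots.txt.
// `checkRobotsResponse` and `checkSite` are hypothetical helper names.

function checkRobotsResponse({ status, contentType, body }) {
  const problems = [];
  if (status !== 200) {
    problems.push(`expected status 200, got ${status}`);
  }
  if (!/^text\/plain\b/i.test(contentType || '')) {
    problems.push(`expected Content-Type text/plain, got ${contentType || 'none'}`);
  }
  if (!/^\s*user-agent\s*:/im.test(body || '')) {
    problems.push('no User-agent line found in body');
  }
  return problems; // an empty array means the basic checks passed
}

// Live check; needs a runtime with a global fetch (for example Node 18+).
async function checkSite(origin) {
  const res = await fetch(new URL('/robots.txt', origin));
  return checkRobotsResponse({
    status: res.status,
    contentType: res.headers.get('content-type'),
    body: await res.text(),
  });
}

// A common failure: the server answers with an HTML page instead of the file.
console.log(checkRobotsResponse({
  status: 200,
  contentType: 'text/html; charset=utf-8',
  body: '<!doctype html><title>Not found</title>',
}));
```

Calling `checkSite('https://yoursite.com')` would run the same checks against a live origin.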

Frequently asked questions

Does a missing robots.txt file mean AI agents can't crawl my site?

No—most crawlers interpret a missing robots.txt as 'allow all,' but it's a weak signal. An explicit, well-formed robots.txt is unambiguous and shows you've configured access intentionally. For agent discoverability and citation, an explicit file with targeted User-agent rules is far better than relying on default behavior.

Can I use robots.txt to block AI training bots but still allow search indexing?

Yes. Use separate User-agent directives: allow Googlebot and Bingbot for search, then Disallow: / for GPTBot, CCBot, or other training crawlers. Remember that User-agent: * is a wildcard; specific rules override it. This lets you granularly control which bots access which content.

How does robots.txt apply to e-commerce sites selling through AI agents?

For agentic commerce, agent crawlers must be able to discover your product pages and collections. Use Allow: /products/ and Allow: /collections/ while blocking /admin/, /account/, and (as in the production example above) /checkout/ from general crawling. If agents can't read your catalog, they can't buy. A clear robots.txt with sitemap references speeds up product discovery and transaction readiness.

Do SaaS product documentation sites need robots.txt differently than marketing sites?

Yes. SaaS docs often have versioned paths (/v2/, /v3/) and /changelog/ or /api-reference/ sections. Allow agent access to current docs with Allow: /docs/ and reference a docs-specific sitemap. Block deprecated versions or internal-only guides with Disallow: /docs/internal/. This ensures agents cite up-to-date, public documentation in answers.

What's the difference between robots.txt and llms.txt for AI agents?

The robots.txt file controls access (which bots can crawl which paths), per RFC 9309. The llms.txt file, a community proposal rather than an IETF standard, provides context (structured summaries and guidance for LLM consumption). Think of robots.txt as permission and llms.txt as instruction. Use both: robots.txt to gate crawling, llms.txt to optimize how agents understand your content.

How do I serve robots.txt on Cloudflare Workers or Vercel Edge?

For Cloudflare Workers, handle /robots.txt in your fetch handler and return a Response with Content-Type: text/plain. For Vercel, place robots.txt in the public/ directory (Next.js) or define a route in vercel.json. Both approaches let you serve a static or dynamic robots.txt at the root path.

Can I use robots.txt to set different crawl rates for different AI agents?

Partly, via the non-standard Crawl-delay directive. For example, User-agent: GPTBot followed by Crawl-delay: 10 asks GPTBot to wait 10 seconds between requests. Crawl-delay is not in RFC 9309, and support varies: some crawlers honor it, while others, including Googlebot, ignore it, so check each operator's documentation. For crawlers that ignore it, throttle with server-side rate limiting instead.
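To show how a crawler that supports the directive might read it, here is a minimal JavaScript sketch. It is an assumption, not any crawler's actual implementation: `crawlDelays` and `delayFor` are hypothetical helpers, and agent matching is a simple exact match on the lowercased token.

```javascript
// Sketch: extract per-agent Crawl-delay values (a non-standard extension).
// `crawlDelays` and `delayFor` are hypothetical helper names.

function crawlDelays(text) {
  const delays = new Map(); // lowercased agent token -> seconds, or null
  let agents = [];
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === 'user-agent') {
      if (!lastWasAgent) agents = []; // a new group starts here
      const agent = value.toLowerCase();
      agents.push(agent);
      // Remember that this agent has its own group, even without a delay.
      if (!delays.has(agent)) delays.set(agent, null);
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (key === 'crawl-delay' && value !== '') {
      const seconds = Number(value);
      if (Number.isFinite(seconds) && seconds >= 0) {
        for (const agent of agents) delays.set(agent, seconds);
      }
    }
  }
  return delays;
}

function delayFor(delays, userAgent) {
  const ua = userAgent.toLowerCase();
  // A named group replaces the wildcard group, so only fall back to `*`
  // when the crawler has no group of its own.
  if (delays.has(ua)) return delays.get(ua);
  return delays.get('*') ?? null;
}

const sample = `User-agent: GPTBot
Crawl-delay: 10

User-agent: *
Crawl-delay: 2`;

const delays = crawlDelays(sample);
console.log(delayFor(delays, 'GPTBot'));  // 10
console.log(delayFor(delays, 'Bingbot')); // 2
```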

Do I need a separate robots.txt for each subdomain (www, blog, docs)?

Yes—each subdomain is a distinct origin and requires its own robots.txt at the root. So https://www.example.com/robots.txt, https://blog.example.com/robots.txt, and https://docs.example.com/robots.txt are all separate files. Agents check the file on the exact hostname they're crawling. Deploy tailored rules per subdomain to match each site's purpose.
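When several subdomains are served by one application, the per-host rule can be handled in code. The following sketch is illustrative only: the hostnames, rules, and the `robotsForHost` and `handleRobots` helpers are assumptions, and the block-all default for unknown hosts is a design choice, not a requirement.

```javascript
// Sketch: choose a robots.txt body per hostname. Hostnames, rules, and
// helper names are hypothetical and exist only for illustration.

const ROBOTS_BY_HOST = new Map([
  ['www.example.com',
    'User-agent: *\nDisallow: /admin/\n\n' +
    'Sitemap: https://www.example.com/sitemap.xml\n'],
  ['docs.example.com',
    'User-agent: *\nDisallow: /docs/internal/\n\n' +
    'Sitemap: https://docs.example.com/sitemap.xml\n'],
]);

// Assumption: hosts not listed above (for example preview deployments)
// should not be crawled at all.
const BLOCK_ALL = 'User-agent: *\nDisallow: /\n';

function robotsForHost(hostname) {
  return ROBOTS_BY_HOST.get(hostname.toLowerCase()) ?? BLOCK_ALL;
}

// Returns the robots.txt body for a request URL, or null for other paths.
function handleRobots(requestUrl) {
  const { hostname, pathname } = new URL(requestUrl);
  if (pathname !== '/robots.txt') return null;
  return robotsForHost(hostname);
}

console.log(handleRobots('https://docs.example.com/robots.txt'));
console.log(handleRobots('https://preview-123.example.com/robots.txt'));
```

Defaulting unknown hosts to a full block keeps staging and preview hostnames out of crawls unless they are explicitly listed.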

Related standards

recommended

/.well-known/* capability discovery

RFC 8615 defines /.well-known/ as a reserved namespace for site-wide metadata. Agents probe a known set: oauth-authorization-server, openid-configuration, mcp.json, agents.json, api-catalog, etc.

optional

Agent Skills index

Enumerable list of discrete skills your site exposes — lighter than MCP, heavier than a raw OpenAPI blob. Path: /.well-known/agent-skills/index.json.

optional

Agentic commerce protocols (ACP, UCP, MPP, x402)

Four overlapping standards that let AI agents pay and transact: Agentic Commerce Protocol, Universal Commerce Protocol, Merchant Payments Protocol, x402.