Cloudflare Content Signals

A robots.txt directive (Content-Signal: search=yes, ai-input=yes, ai-train=no) declaring how your content may be used for search, AI-generated answers, and AI training.

Updated 2026-04-25

What is Cloudflare's Content Signals directive for robots.txt?

Content Signals is a robots.txt directive that lets you declare—in a single line—whether your site's content may be used for traditional search indexing, AI-powered search answers, and AI model training. Think of it as a machine-readable policy statement that search engines and AI vendors can check before scraping or ingesting your content.

Cloudflare announced the directive in September 2025 as part of its Content Signals Policy. It takes the form Content-Signal: search=yes, ai-input=yes, ai-train=no and sits inside a User-agent group, alongside the usual Allow and Disallow rules. Each of the three signals (search, ai-input, and ai-train) is set to yes or no; leaving a signal out neither grants nor restricts that use. It's not a formal RFC or W3C standard, but Cloudflare has positioned it as a vendor-neutral extension to robots.txt that any site or provider can adopt, and it gives you a lightweight, defensible answer to "what's our AI policy?"

Why does Content Signals matter for AI agents and search-AI applications?

When Perplexity, ChatGPT Search, or Google's AI Overviews fetch your site to power conversational answers, they're making a judgment call about whether you've consented to that use. Without an explicit signal, they fall back on guesswork: whatever they can infer from user-agent rules written for search crawlers, or from the absence of any rule at all. Content Signals gives you a clean way to say "yes, power search answers with my content" (ai-input=yes) while simultaneously saying "no, don't use it to train your next GPT variant" (ai-train=no). That distinction matters: you might want Claude to cite your API documentation in a developer query, but not have that same documentation end up as training data for a competing LLM.

For agent frameworks like LangGraph or AutoGPT that autonomously fetch context from the web, a clear Content Signals directive reduces friction. An agent can read one line instead of carrying custom per-site logic, which makes it easier to respect your terms. The directive does not change WAF or bot-management rules on its own, so if you set ai-input=yes, check that those rules are not blocking the crawlers you mean to admit. If you're building agentic commerce flows or RAG pipelines that depend on third-party content, you want to avoid the legal and technical morass of scraping sites that haven't declared their stance. Publishing your own signals removes the same ambiguity for anyone fetching your site.
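As a rough illustration of the agent side, the sketch below looks up what a robots.txt declares for one purpose before a fetcher uses a page. It assumes the signal names from Cloudflare's published policy (search, ai-input, ai-train) and that the directive sits inside a User-agent group; the function name content_signal_for and the agent name ExampleAgent are hypothetical. It is deliberately simplified: agent names are matched exactly, and only the first matching group is consulted.

```python
from typing import Optional


def content_signal_for(robots_txt: str, agent: str, purpose: str) -> Optional[bool]:
    """Return True/False if the group that applies to `agent` sets `purpose`,
    or None when no signal is declared (neither granted nor restricted)."""
    groups: list[tuple[list[str], dict[str, bool]]] = []
    agents: list[str] = []
    signals: dict[str, bool] = {}
    in_rules = False  # True once the current group has seen a non-User-agent line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                groups.append((agents, signals))
                agents, signals, in_rules = [], {}, False
            agents.append(value.lower())
        else:
            in_rules = True
            if field == "content-signal":
                for item in value.split(","):
                    key, _, flag = item.strip().partition("=")
                    flag = flag.strip().lower()
                    if flag in ("yes", "no"):
                        signals[key.strip().lower()] = flag == "yes"
    groups.append((agents, signals))

    # Prefer a group naming this agent; otherwise fall back to the "*" group.
    for wanted in (agent.lower(), "*"):
        for names, found in groups:
            if wanted in names:
                return found.get(purpose)
    return None
```

Returning None rather than False for a missing signal keeps "not declared" distinct from "declined", so the caller can decide how cautious to be when a site has said nothing.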

This check is recommended for most sites because it costs you nothing to add and buys you meaningful legal and operational clarity. If your content has commercial value (SaaS docs, editorial articles, research reports), your legal team will eventually ask whether you've addressed AI training exposure. A single line in robots.txt is the cheapest defensible answer.

Skip it only if you genuinely don't care how your content is used: vanity blogs, ephemeral marketing sites, or content already behind authentication. If you're running an open-source project or public knowledge base and want maximum LLM visibility, you should still declare ai-input=yes, ai-train=yes explicitly rather than relying on permissive silence.

What the Content Signals specification says about ai-input and ai-train

Cloudflare's policy defines three signals, each set to yes or no:

  • search: building a search index and returning results such as links and short excerpts (yes | no)
  • ai-input: feeding your content into AI models at answer time, for example retrieval-augmented generation or AI-generated search answers (yes | no)
  • ai-train: training or fine-tuning AI models on your content (yes | no)

There is no third value. If you leave a signal out, you neither grant nor restrict that use through Content Signals. For per-contract nuance, publish the terms on a separate policy page (e.g., /ai-policy.html) and point to it from a robots.txt comment or your Terms of Service. The directive itself is a single line inside a User-agent group in your robots.txt file:

User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no

This is not an IETF standard. Cloudflare published it as a de facto proposal, and adoption depends on vendors (OpenAI, Anthropic, Google, Perplexity) choosing to honor it. Enforcement is voluntary.
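To make the line format concrete, here is a minimal parsing sketch. It assumes the signal names from Cloudflare's published policy (search, ai-input, ai-train) and treats any value other than yes or no as malformed, which is a choice of this sketch and not something a crawler is known to do. The function name parse_content_signal is hypothetical.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse one 'Content-Signal: key=value, ...' robots.txt line.

    Returns a mapping such as {"search": True, "ai-train": False}.
    Signals missing from the line are simply absent from the result.
    """
    name, sep, value = line.partition(":")
    if not sep or name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")

    signals: dict[str, bool] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, eq, raw = item.partition("=")
        raw = raw.strip().lower()
        if not eq or raw not in ("yes", "no"):
            raise ValueError(f"malformed signal: {item!r}")
        signals[key.strip().lower()] = raw == "yes"
    return signals


print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
```

Keeping unset signals out of the result, instead of defaulting them to False, preserves the difference between declining a use and saying nothing about it.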

What good Content Signals implementation looks like in robots.txt

A fully permissive directive allows all three uses:

Content-Signal: search=yes, ai-input=yes, ai-train=yes

Many SaaS companies with public documentation opt for a middle ground, allowing search and AI answers but declining training:

Content-Signal: search=yes, ai-input=yes, ai-train=no
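If the policy lives in configuration instead of a hand-edited file, the two variants above could be produced from three flags. This is only a sketch: render_content_signal is a hypothetical helper, the signal names are the ones from Cloudflare's published policy, and passing None is this sketch's way of leaving a signal undeclared.

```python
from typing import Optional


def render_content_signal(search: Optional[bool] = None,
                          ai_input: Optional[bool] = None,
                          ai_train: Optional[bool] = None) -> str:
    """Build a Content-Signal line; None leaves that signal out entirely."""
    choices = {"search": search, "ai-input": ai_input, "ai-train": ai_train}
    parts = [f"{name}={'yes' if allowed else 'no'}"
             for name, allowed in choices.items() if allowed is not None]
    if not parts:
        raise ValueError("at least one signal must be set")
    return "Content-Signal: " + ", ".join(parts)


print(render_content_signal(search=True, ai_input=True, ai_train=True))
print(render_content_signal(search=True, ai_input=True, ai_train=False))
```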

If you want to see examples in the wild, inspect the robots.txt of major CDN providers or developer platforms that have publicly commented on AI training policies. Be aware that adoption is still emerging; don't assume every site you admire has implemented this yet.

How do I add Content Signals to my site's robots.txt file?

  1. Open your robots.txt file (usually at the root of your domain, e.g., https://example.com/robots.txt).

  2. Add the Content-Signal directive inside the User-agent group it should apply to, typically directly under User-agent: *:

    User-agent: *
    Content-Signal: search=yes, ai-input=yes, ai-train=no

  3. Choose your values based on business policy:

    • search=yes unless you're deliberately delisting from search engines
    • ai-input=yes if you want citations in ChatGPT Search, Perplexity, or Google AI Overviews
    • ai-train=no if you want to decline training use (most common for commercial content)
  4. Deploy the change:

    • Static sites (Vercel, Netlify): Update public/robots.txt and redeploy.
    • Next.js: Place the directive in public/robots.txt or use a dynamic route if you need per-environment logic.
    • Cloudflare Workers: Serve robots.txt from a Worker or static asset binding; append the Content-Signal line.
  5. Test the file is accessible (see next section).
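For sites that generate or patch robots.txt in a build step, the edit in step 2 can be scripted. The sketch below is one possible approach, not a prescribed one: add_content_signal is a hypothetical helper that places the directive under the first User-agent: * line, leaves the text alone if any Content-Signal line already exists, and appends a new "*" group when the file has none.

```python
def add_content_signal(robots_txt: str, directive: str) -> str:
    """Insert `directive` into the "*" group of a robots.txt text."""
    lines = robots_txt.splitlines()

    # Already declared somewhere: do not add a second line.
    if any(line.strip().lower().startswith("content-signal:") for line in lines):
        return robots_txt

    for i, line in enumerate(lines):
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent" and value.strip() == "*":
            # Skip any further User-agent lines that share this group.
            j = i + 1
            while j < len(lines) and lines[j].strip().lower().startswith("user-agent:"):
                j += 1
            lines.insert(j, directive)
            return "\n".join(lines) + "\n"

    # No "*" group yet: append one that carries the signal.
    if lines and lines[-1].strip():
        lines.append("")
    lines += ["User-agent: *", directive, "Allow: /"]
    return "\n".join(lines) + "\n"


before = "User-agent: *\nDisallow: /admin/\n"
print(add_content_signal(before, "Content-Signal: search=yes, ai-input=yes, ai-train=no"))
```

Because the function returns the input unchanged once a Content-Signal line is present, it is safe to run on every deploy.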

How do I test my Content Signals implementation in robots.txt?

First confirm the file is served with a 200 status and a text/plain content type:

curl -I https://yoursite.com/robots.txt

Then fetch the full content:

curl -s https://yoursite.com/robots.txt | grep -i "^Content-Signal:"

You should see your directive in the output. Or just run a free scan and we'll check this for you alongside 30+ other agent-readiness signals.
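The same check can run in a test suite or CI job. The sketch below uses only the Python standard library; content_signal_lines and check_site are hypothetical names, and yoursite.com is the placeholder host from the commands above. Unlike a plain grep, it ignores lines that are commented out.

```python
from urllib.request import urlopen


def content_signal_lines(robots_txt: str) -> list[str]:
    """Return every Content-Signal line in the text, comments excluded."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("content-signal:"):
            found.append(line)
    return found


def check_site(base_url: str) -> list[str]:
    """Fetch <base_url>/robots.txt and return its Content-Signal lines.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    missing robots.txt surfaces as an exception, not an empty list.
    """
    with urlopen(base_url.rstrip("/") + "/robots.txt", timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return content_signal_lines(body)


# Example call (performs a network request):
#   check_site("https://yoursite.com")
```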

Frequently asked questions

Is Content Signals an official standard like robots.txt?

No. Content Signals is a vendor-neutral extension that Cloudflare announced in September 2025, not an IETF or W3C standard. Adoption is voluntary: vendors like OpenAI, Anthropic, or Perplexity choose whether to honor it. Think of it as an emerging convention rather than a formal RFC, similar to how X-Robots-Tag became widely adopted without official standardization.

Can I set Content Signals to block ChatGPT Search but allow Perplexity?

Not directly. Content Signals uses broad categories (ai-input, ai-train) rather than per-vendor flags. To block a specific crawler, combine Content Signals with traditional User-agent rules, for example User-agent: OAI-SearchBot (the crawler OpenAI documents for ChatGPT search) followed by Disallow: /. Vendor-specific terms that go beyond allow or block belong on a policy page or in your Terms of Service.

Do I need Content Signals if I already block AI crawlers with User-agent rules?

Content Signals offers declarative clarity that User-agent blocks don't. Blocking GPTBot or PerplexityBot is reactive and fragile because new agents appear constantly. Setting ai-input=yes, ai-train=no states your policy upfront, reducing legal ambiguity and making compliance easier for well-behaved crawlers. Many sites use both approaches together.

How does Content Signals work for SaaS documentation sites?

SaaS docs typically benefit from ai-input=yes (so Claude or ChatGPT can cite your API references) but ai-train=no (to prevent competitors from training models on your proprietary knowledge). This balances discoverability with IP protection. Add the directive to your docs subdomain's robots.txt even if your marketing site uses different settings.

Is there a conditional value in Content Signals?

No. Each signal is either yes or no, and omitting a signal means you neither grant nor restrict that use. If your AI policy depends on context like commercial use, attribution, or per-contract terms, set the signals that have a clear answer and describe the rest on a policy page (e.g., /ai-policy.html) referenced from your robots.txt comments or site footer. That covers enterprises with nuanced legal requirements without relying on values crawlers won't recognize.

Can I add Content Signals to a WordPress site without a plugin?

Yes. Upload a physical robots.txt to your site's root directory over FTP or SFTP; WordPress only serves its generated virtual robots.txt when no physical file exists there. If you'd rather stay inside WordPress, an SEO plugin like Yoast or Rank Math lets you edit robots.txt from the dashboard, and a code snippet can append the directive through the robots_txt filter. After either change, load /robots.txt in a browser to confirm the line is being served.

How does Content Signals compare to the Spawning.ai API for AI training opt-out?

Spawning's API (used by Stability AI, Hugging Face) requires registration and returns JSON opt-out data, while Content Signals is a lightweight robots.txt line readable by any crawler. Content Signals is simpler for most sites, but Spawning offers granular per-image or per-URL opt-out, which is useful for stock photo libraries or creative portfolios. Many sites use both.

Will setting ai-train=no in Content Signals stop all LLM training on my content?

No—only vendors who choose to honor the directive. There's no legal enforcement mechanism built into robots.txt. However, a clearly stated ai-train=no strengthens your position in disputes and makes it harder for vendors to claim implied consent. Pair it with Terms of Service that explicitly prohibit training use for maximum protection.
