Content-Signal in robots.txt
Content-Signal is one of the most important new signals for AI-use governance because it expresses a layer that ordinary Allow and Disallow rules do not cover.
robots.txt can say which crawlers should access which paths. It does not fully express what those crawlers should do with content after access. Cloudflare’s Content Signals Policy adds a vocabulary for that second question.
For Better Robots.txt, this matters because it separates access from use.
Current Better Robots status
Better Robots treats Content-Signal as a governance-relevant layer, but it should not be confused with hard enforcement.
Current public status:
- Content-Signal is documented as a post-crawl usage preference signal.
- A future score-neutral detection pass is appropriate for /check.
- Content-Signal is not a WAF rule, firewall rule or crawler-authentication mechanism.
- If /check does not explicitly mark Content-Signal as scored, it should not be inferred as part of the current score.
- Future profile alignment may evaluate search, ai-input and ai-train, but that requires an explicit ruleset update.
The basic pattern
Cloudflare gives this example:
User-Agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
The Content-Signal line does not replace Allow or Disallow. It adds a declared preference about downstream use.
The three currently important purposes are:
| Signal | Practical meaning | Governance question |
|---|---|---|
| search | Search indexing and results with links or excerpts | Should content be used for search discovery? |
| ai-input | Use as input to AI models at query time, including retrieval, grounding, or RAG-style answers | Should content be used inside answer generation or retrieval workflows? |
| ai-train | Training or fine-tuning AI models | Should content be used to improve future models? |
The absence of a signal should not be over-read. Cloudflare explicitly describes absence as neutral for that specific use.
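A detection pass needs to read the line before it can interpret it. The following is a minimal Python sketch of such a parser, not the Better Robots implementation. The function name `parse_content_signal` and the choice to represent an undeclared signal as `None` are assumptions made for illustration; the accepted values are limited to the `yes` and `no` forms shown in the examples on this page.

```python
KNOWN_SIGNALS = ("search", "ai-input", "ai-train")


def parse_content_signal(line):
    """Parse one Content-Signal line into (signals, warnings).

    signals maps each known purpose to True (yes), False (no),
    or None when the purpose is not declared (treated as neutral).
    """
    signals = {name: None for name in KNOWN_SIGNALS}
    warnings = []

    # Drop a trailing robots.txt comment, then split "field: value".
    line = line.split("#", 1)[0]
    field, sep, value = line.partition(":")
    if not sep or field.strip().lower() != "content-signal":
        return signals, ["not a Content-Signal line"]

    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        key, val = key.strip().lower(), val.strip().lower()
        if key not in signals:
            warnings.append(f"unknown signal: {key}")
        elif val not in ("yes", "no"):
            warnings.append(f"invalid value for {key}: {val!r}")
        else:
            signals[key] = (val == "yes")
    return signals, warnings


print(parse_content_signal("Content-Signal: search=yes, ai-train=no"))
# -> ({'search': True, 'ai-input': None, 'ai-train': False}, [])
```

Keeping `None` distinct from `False` lets a later audit step report "not declared" instead of silently reading absence as a refusal.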
Why this belongs in crawler governance
Better Robots already separates crawler families by purpose:
- search crawlers;
- AI search crawlers;
- training crawlers;
- user-triggered fetchers;
- social and preview crawlers;
- SEO tools;
- bad bots.
Content-Signal adds another axis. It says what the site declares about use after access.
That makes it a direct fit for the audit’s intent-profile model.
For example, the profile AI search open, training restricted may align with:
Content-Signal: search=yes, ai-input=yes, ai-train=no
or, in a more conservative interpretation:
Content-Signal: search=yes, ai-train=no
The difference is doctrinally important. ai-input=yes permits answer-time input more explicitly than search=yes alone. A site owner should not enable it by accident.
Access and use can disagree
The most useful audit finding is not simply “Content-Signal exists.”
The useful finding is whether the declared use signal aligns with crawler access rules.
Case 1: training use refused, training crawlers still open
User-agent: *
Content-Signal: ai-train=no
Allow: /
User-agent: GPTBot
Allow: /
This does not automatically mean the site is broken. It means the site is relying on a downstream-use declaration while still allowing access to a training-related crawler. Some crawlers may respect the signal. Some may not.
A careful audit should say:
Usage restriction declared, access still open.
Case 2: training use allowed, training crawlers blocked
User-agent: *
Content-Signal: ai-train=yes
Allow: /
User-agent: GPTBot
Disallow: /
This is a stronger contradiction. The usage signal says training is allowed, but the access rule blocks a known training crawler.
A careful audit should say:
Training use is declared as allowed, but a training crawler is blocked.
Case 3: ai-input refused, answer systems still opened
User-agent: *
Content-Signal: ai-input=no
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
This needs interpretation. ai-input=no is about answer-time use as model input. Some bots may serve search, retrieval, citation, or user-triggered use cases. The audit should not flatten all of them into one category.
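Cases 1 and 2 can be checked mechanically by comparing the declared ai-train value with what the access rules allow for a training crawler. The sketch below is one possible approach, not the audit's actual logic. It assumes Python's standard `urllib.robotparser` for the access side (its parser skips directives it does not recognize, such as Content-Signal), reads only the first ai-train declaration in the file and ignores per-group scoping, and uses the hypothetical name `training_mismatch` and a placeholder `example.com` URL. Case 3 is deliberately left out because it needs per-bot interpretation.

```python
import re
from urllib.robotparser import RobotFileParser

# First "ai-train=yes|no" found on any Content-Signal line (simplification).
AI_TRAIN_RE = re.compile(
    r"^\s*content-signal\s*:.*?\bai-train\s*=\s*(yes|no)\b",
    re.IGNORECASE | re.MULTILINE,
)


def training_mismatch(robots_txt, crawler="GPTBot"):
    """Return an audit message when use and access disagree, else None."""
    match = AI_TRAIN_RE.search(robots_txt)
    declared = match.group(1).lower() if match else None

    # Access side: standard Allow/Disallow evaluation for the crawler.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    can_access = parser.can_fetch(crawler, "https://example.com/")

    if declared == "no" and can_access:
        return "Usage restriction declared, access still open."
    if declared == "yes" and not can_access:
        return ("Training use is declared as allowed, "
                "but a training crawler is blocked.")
    return None  # aligned, or no ai-train signal declared
```

Returning `None` when no ai-train value is declared keeps an absent signal neutral instead of turning it into a finding.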
How Better Robots should treat Content-Signal
Content-Signal should be treated as a strong machine-readable preference, not as hard enforcement.
A safe audit model is:
| Situation | Audit interpretation |
|---|---|
| Present and aligned with selected profile | Positive signal |
| Present but contradictory with crawler access rules | Warning or mismatch finding |
| Present with invalid values | Warning |
| Absent | Informational or not evaluated, not a hard failure |
Absence should not be penalized heavily, because Content-Signal is not part of the Robots Exclusion Protocol itself and is not universally honored by crawlers.
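The table above can be read as a small decision function. The sketch below is illustrative only: the function name `classify_content_signal`, the string labels, and the input shapes are assumptions, and the final fallback (present, valid, not contradictory, but not matching the selected profile) is not covered by the table and is treated here as informational.

```python
def classify_content_signal(declared, expected, invalid=False, access_conflict=False):
    """Map a Content-Signal observation onto the audit table.

    declared / expected: dicts such as {"search": "yes", "ai-train": "no"}.
    An empty `declared` with invalid=False means no Content-Signal was found.
    """
    if not declared and not invalid:
        return "informational"  # absent: not a hard failure
    if invalid:
        return "warning"        # present with invalid values
    if access_conflict:
        return "mismatch"       # contradicts crawler access rules
    if all(declared.get(key) == value for key, value in expected.items()):
        return "positive"       # aligned with the selected profile
    return "informational"      # assumption: case not listed in the table


profile = {"search": "yes", "ai-train": "no"}
print(classify_content_signal({"search": "yes", "ai-train": "no"}, profile))
# -> positive
```

None of the branches returns a failing grade, which matches the idea that Content-Signal is a preference layer and not hard enforcement.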
How this relates to Better Robots.txt PRO
At first, /check should detect and explain Content-Signal without changing the core score.
Later, Better Robots.txt PRO could emit a profile-based line such as:
Content-Signal: search=yes, ai-input=yes, ai-train=no
for sites that explicitly choose an AI-search-open and training-restricted posture.
That should be a deliberate profile decision, not a hidden default.
Recommended profile mapping
| Profile | Possible Content-Signal posture | Notes |
|---|---|---|
| AI search open, training restricted | search=yes, ai-input=yes, ai-train=no | Strongest expression of retrieval-time use plus training restriction |
| Publisher protection | search=yes, ai-input=no, ai-train=no | Keeps search discovery while refusing answer-time input and training |
| Maximum AI visibility | search=yes, ai-input=yes, ai-train=yes | Useful only when the site deliberately accepts broad AI reuse |
| WordPress safe default | search=yes, ai-train=no, optionally ai-input=yes | Should depend on the site owner’s appetite for answer-time use |
| Strict crawler restriction | ai-input=no, ai-train=no, search depends on whether Search visibility remains desired | Do not accidentally turn a strict AI posture into a Search blackout |
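As a rough illustration of how a profile could be turned into a line, the sketch below maps the three profiles whose values are fully determined in the table to a Content-Signal string. The profile keys and the function name `content_signal_line` are hypothetical identifiers, not Better Robots.txt PRO settings. The two profiles that depend on the site owner's choices (WordPress safe default and Strict crawler restriction) are intentionally omitted, and an unknown profile raises an error instead of falling back to a default.

```python
# Hypothetical profile keys; values are taken from the mapping table.
PROFILE_SIGNALS = {
    "ai-search-open-training-restricted": {
        "search": "yes", "ai-input": "yes", "ai-train": "no",
    },
    "publisher-protection": {
        "search": "yes", "ai-input": "no", "ai-train": "no",
    },
    "maximum-ai-visibility": {
        "search": "yes", "ai-input": "yes", "ai-train": "yes",
    },
}


def content_signal_line(profile):
    """Build the Content-Signal line for an explicitly chosen profile.

    Raises KeyError for unknown profiles so nothing is emitted by default.
    """
    signals = PROFILE_SIGNALS[profile]
    pairs = ", ".join(f"{name}={value}" for name, value in signals.items())
    return f"Content-Signal: {pairs}"


print(content_signal_line("publisher-protection"))
# -> Content-Signal: search=yes, ai-input=no, ai-train=no
```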
What not to claim
Do not claim that Content-Signal guarantees obedience.
Do not claim it replaces WAF, Bot Management, signed-agent verification, logs, or contractual controls.
Do not claim it is a Search ranking factor.
Do not claim every crawler respects it.
The correct claim is narrower and stronger:
Content-Signal gives a site a machine-readable way to express post-crawl use preferences in robots.txt.
That is exactly the kind of signal a crawler-governance audit should understand.
FAQ
Is Content-Signal the same as Disallow?
No. Disallow is access guidance for a path. Content-Signal is a declared preference about what content may be used for after access.
Should a site add Content-Signal even if it blocks AI bots?
Possibly. A block rule expresses access posture. Content-Signal expresses usage posture. The two can reinforce each other when they are coherent.
Can Better Robots audit this today?
This page describes the recommended direction for the audit and plugin. The safest implementation path is to detect Content-Signal as a non-scoring informational signal first, then add profile-alignment scoring after the ruleset is updated.
Is ai-input the same as training?
No. ai-input covers answer-time use such as retrieval, grounding, or other use as model input. ai-train covers training or fine-tuning.
Read next
- Search vs ai-input vs ai-train
- What AI usage signals can and cannot do
- Signal vs enforcement for AI crawlers
- Crawler governance vs agentic readiness
- AI training vs AI search crawlers
References
- Cloudflare Content Signals Policy: https://blog.cloudflare.com/content-signals-policy/
- Content Signals reference site: https://contentsignals.org/