Content-Signal in robots.txt

Content-Signal is one of the most important new signals for AI-use governance because it expresses a layer that ordinary Allow and Disallow rules do not cover.

robots.txt can say which crawlers should access which paths. It does not fully express what those crawlers should do with content after access. Cloudflare’s Content Signals Policy adds a vocabulary for that second question.

For Better Robots.txt, this matters because it separates access from use.


Current Better Robots status

Better Robots treats Content-Signal as a governance-relevant layer, but it should not be confused with hard enforcement.

Current public status:

Content-Signal is documented as a post-crawl usage preference signal.
A future score-neutral detection pass is appropriate for /check.
Content-Signal is not a WAF rule, firewall rule or crawler-authentication mechanism.
If /check does not explicitly mark Content-Signal as scored, it should not be inferred as part of the current score.
Future profile alignment may evaluate search, ai-input and ai-train, but that requires an explicit ruleset update.

The basic pattern

Cloudflare gives this example:

txt

User-Agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

This line does not replace Allow or Disallow. It adds a declared preference about downstream use.

The three currently important purposes are:

| Signal | Practical meaning | Governance question |
| --- | --- | --- |
| `search` | Search indexing and results with links or excerpts | Should content be used for search discovery? |
| `ai-input` | Use as input to AI models at query time, including retrieval, grounding, or RAG-style answers | Should content be used inside answer generation or retrieval workflows? |
| `ai-train` | Training or fine-tuning AI models | Should content be used to improve future models? |

The absence of a signal should not be over-read. Cloudflare explicitly describes absence as neutral for that specific use.
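To make the pattern concrete, here is a minimal Python sketch of how a detection pass might read a single Content-Signal line. The function name `parse_content_signal` and the decision to treat field names and values case-insensitively are assumptions for illustration, not part of Better Robots or of Cloudflare's policy text. A signal that is not declared is simply left out of the result, which matches the neutral reading of absence.

```python
# Illustrative sketch only: names and normalisation choices are assumptions.
VALID_SIGNALS = {"search", "ai-input", "ai-train"}
VALID_VALUES = {"yes", "no"}


def parse_content_signal(line):
    """Parse one robots.txt line.

    Returns None if the line is not a Content-Signal line, otherwise a
    (signals, problems) tuple: a dict of recognised signals and a list of
    items that could not be interpreted.
    """
    line = line.split("#", 1)[0]  # drop a trailing robots.txt comment
    name, sep, value = line.partition(":")
    if not sep or name.strip().lower() != "content-signal":
        return None

    signals, problems = {}, []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, eq, val = item.partition("=")
        key, val = key.strip().lower(), val.strip().lower()
        if not eq or key not in VALID_SIGNALS or val not in VALID_VALUES:
            problems.append(item)  # keep the raw item for the audit report
            continue
        signals[key] = val
    return signals, problems
```

For the Cloudflare example line, this returns `search` and `ai-train` with their declared values and no entry for `ai-input`, so a caller can distinguish "declared no" from "not declared".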

Why this belongs in crawler governance

Better Robots already separates crawler families by purpose:

search crawlers;
AI search crawlers;
training crawlers;
user-triggered fetchers;
social and preview crawlers;
SEO tools;
bad bots.

Content-Signal adds another axis. It says what the site declares about use after access.

That makes it a direct fit for the audit’s intent-profile model.

For example, the profile AI search open, training restricted may align with:

txt

Content-Signal: search=yes, ai-input=yes, ai-train=no

or, in a more conservative interpretation:

txt

Content-Signal: search=yes, ai-train=no

The difference matters. ai-input=yes explicitly permits answer-time input, whereas search=yes alone leaves that use undeclared. A site owner should not enable it by accident.

Access and use can disagree

The most useful audit finding is not simply “Content-Signal exists.”

The useful finding is whether the declared use signal aligns with crawler access rules.

Case 1: training use refused, training crawlers still open

txt

User-agent: *
Content-Signal: ai-train=no
Allow: /

User-agent: GPTBot
Allow: /

This does not automatically mean the site is broken. It means the site is relying on a downstream-use declaration while still allowing access to a training-related crawler. Some crawlers may respect the signal. Some may not.

A careful audit should say:

Usage restriction declared, access still open.

Case 2: training use allowed, training crawlers blocked

txt

User-agent: *
Content-Signal: ai-train=yes
Allow: /

User-agent: GPTBot
Disallow: /

This is a stronger contradiction. The usage signal says training is allowed, but the access rule blocks a known training crawler.

A careful audit should say:

Training use is declared as allowed, but a training crawler is blocked.

Case 3: ai-input refused, answer systems still opened

txt

User-agent: *
Content-Signal: ai-input=no
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

This needs interpretation. ai-input=no is about answer-time use as model input. Some bots may serve search, retrieval, citation, or user-triggered use cases. The audit should not flatten all of them into one category.
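Cases 1 and 2 are mechanical enough to sketch in code. The following Python example is a simplified illustration, not the Better Robots implementation. It assumes a hypothetical `TRAINING_CRAWLERS` set containing only GPTBot, reads the ai-train signal from the `*` group as the site-wide declaration (as the cases above do), and treats a crawler as blocked only when its group has a blanket `Disallow: /`. Case 3 is deliberately left out because it needs interpretation rather than a rule.

```python
# Illustrative sketch only: crawler list, names, and matching rules are assumptions.
TRAINING_CRAWLERS = {"gptbot"}  # hypothetical classification for this example


def parse_groups(text):
    """Split robots.txt text into user-agent groups (simplified)."""
    groups, current, last_was_agent = [], None, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:  # a new group starts here
                current = {"agents": [], "rules": [], "signals": {}}
                groups.append(current)
            current["agents"].append(value.lower())
            last_was_agent = True
            continue
        last_was_agent = False
        if current is None:
            continue
        if field in ("allow", "disallow"):
            current["rules"].append((field, value))
        elif field == "content-signal":
            for item in value.split(","):
                key, _, val = item.partition("=")
                current["signals"][key.strip().lower()] = val.strip().lower()
    return groups


def group_for(groups, agent):
    """Return the group naming this agent, else the '*' group, else None."""
    for wanted in (agent.lower(), "*"):
        for group in groups:
            if wanted in group["agents"]:
                return group
    return None


def root_blocked(group):
    """Simplification: only a blanket 'Disallow: /' counts as blocked."""
    rules = group["rules"]
    return ("disallow", "/") in rules and ("allow", "/") not in rules


def training_findings(text):
    """Compare the declared ai-train signal with training-crawler access."""
    groups = parse_groups(text)
    star = group_for(groups, "*")
    signal = star["signals"].get("ai-train") if star else None
    findings = []
    for bot in sorted(TRAINING_CRAWLERS):
        group = group_for(groups, bot)
        if group is None:
            continue
        blocked = root_blocked(group)
        if signal == "no" and not blocked:
            findings.append(f"{bot}: Usage restriction declared, access still open.")
        elif signal == "yes" and blocked:
            findings.append(
                f"{bot}: Training use is declared as allowed, "
                "but a training crawler is blocked."
            )
    return findings
```

A real audit would need full path matching and a maintained crawler classification; the point here is only that the finding comes from comparing two layers, not from the presence of either one.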

How Better Robots should treat Content-Signal

Content-Signal should be treated as a strong machine-readable preference, not as hard enforcement.

A safe audit model is:

| Situation | Audit interpretation |
| --- | --- |
| Present and aligned with selected profile | Positive signal |
| Present but contradictory with crawler access rules | Warning or mismatch finding |
| Present with invalid values | Warning |
| Absent | Informational or not evaluated, not a hard failure |

Absence should not be punished too strongly because Content-Signal is not the Robots Exclusion Protocol itself and is not universally enforced.
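The audit model above could be expressed as a small decision function. This Python sketch is an assumption-laden illustration: the `Finding` enum, the `interpret` function, its parameters, and the order in which conditions are checked are all hypothetical. The table does not say what to report when a signal is present but does not match the selected profile, so the sketch falls back to the informational result in that case.

```python
# Illustrative sketch only: names and check order are assumptions.
from enum import Enum


class Finding(Enum):
    POSITIVE = "positive signal"
    MISMATCH = "warning: contradicts crawler access rules"
    INVALID = "warning: invalid values"
    NOT_EVALUATED = "informational: not evaluated"


def interpret(signals, invalid_items, profile_signals, access_conflict):
    """Map a detected Content-Signal state to a score-neutral finding.

    signals:         dict of recognised signals, e.g. {"ai-train": "no"}
    invalid_items:   raw items that could not be parsed
    profile_signals: signals the selected profile expects
    access_conflict: True if access rules contradict the declared use
    """
    if not signals and not invalid_items:
        return Finding.NOT_EVALUATED  # absent: never a hard failure
    if invalid_items:
        return Finding.INVALID
    if access_conflict:
        return Finding.MISMATCH
    if all(signals.get(key) == val for key, val in profile_signals.items()):
        return Finding.POSITIVE
    # Present but not aligned with the profile: not covered by the table,
    # so this sketch reports it as informational (an assumption).
    return Finding.NOT_EVALUATED
```

None of these results changes a score; they are labels a report could attach next to the detected line.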

How this relates to Better Robots.txt PRO

At first, /check should detect and explain Content-Signal without changing the core score.

Later, Better Robots.txt PRO could emit a profile-based line such as:

txt

Content-Signal: search=yes, ai-input=yes, ai-train=no

for sites that explicitly choose an AI-search-open and training-restricted posture.

That should be a deliberate profile decision, not a hidden default.

Recommended profile mapping

| Profile | Possible Content-Signal posture | Notes |
| --- | --- | --- |
| AI search open, training restricted | `search=yes, ai-input=yes, ai-train=no` | Strongest expression of retrieval-time use plus training restriction |
| Publisher protection | `search=yes, ai-input=no, ai-train=no` | Keeps search discovery while refusing answer-time input and training |
| Maximum AI visibility | `search=yes, ai-input=yes, ai-train=yes` | Useful only when the site deliberately accepts broad AI reuse |
| WordPress safe default | `search=yes, ai-train=no`, optionally `ai-input=yes` | Should depend on the site owner’s appetite for answer-time use |
| Strict crawler restriction | `ai-input=no, ai-train=no`, `search` depends on whether Search visibility remains desired | Do not accidentally turn a strict AI posture into a Search blackout |
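As a rough illustration of how a profile could be turned into an emitted line, the Python sketch below covers only the three profiles whose posture is fully specified in the mapping. The profile keys, the `PROFILE_SIGNALS` dict, and the `content_signal_line` function are hypothetical names, not an existing Better Robots.txt PRO API. The two remaining profiles depend on an explicit owner choice and are left out rather than given a hidden default.

```python
# Illustrative sketch only: profile keys and function name are assumptions.
PROFILE_SIGNALS = {
    "ai-search-open-training-restricted": {
        "search": "yes", "ai-input": "yes", "ai-train": "no",
    },
    "publisher-protection": {
        "search": "yes", "ai-input": "no", "ai-train": "no",
    },
    "maximum-ai-visibility": {
        "search": "yes", "ai-input": "yes", "ai-train": "yes",
    },
}


def content_signal_line(profile):
    """Build the Content-Signal line for an explicitly chosen profile.

    Raises KeyError for unknown profiles so that nothing is emitted
    unless the site owner has made a deliberate choice.
    """
    signals = PROFILE_SIGNALS[profile]
    pairs = ", ".join(f"{name}={value}" for name, value in signals.items())
    return f"Content-Signal: {pairs}"
```

Raising on an unknown profile, instead of falling back to some default posture, keeps the emitted line a deliberate decision.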

What not to claim

Do not claim that Content-Signal guarantees obedience.

Do not claim it replaces WAF, Bot Management, signed-agent verification, logs, or contractual controls.

Do not claim it is a Search ranking factor.

Do not claim every crawler respects it.

The correct claim is narrower and stronger:

Content-Signal gives a site a machine-readable way to express post-crawl use preferences in robots.txt.

That is exactly the kind of signal a crawler-governance audit should understand.

FAQ

Is Content-Signal the same as Disallow?

No. Disallow is access guidance for a path. Content-Signal is a declared preference about what content may be used for after access.

Should a site add Content-Signal even if it blocks AI bots?

Possibly. A block rule expresses access posture. Content-Signal expresses usage posture. The two can reinforce each other when they are coherent.

Can Better Robots audit this today?

Not as a scored check. This page describes the recommended direction for the audit and plugin. The safest implementation path is to detect Content-Signal as a non-scoring informational signal first, then add profile-alignment scoring only after the ruleset is explicitly updated.

Is ai-input the same as training?

No. ai-input covers answer-time use such as retrieval, grounding, or other use as model input. ai-train covers training or fine-tuning.

References

Cloudflare Content Signals Policy: https://blog.cloudflare.com/content-signals-policy/
Content Signals reference site: https://contentsignals.org/
