
Content-Signal in robots.txt

Content-Signal matters for AI-use governance because it expresses something ordinary Allow and Disallow rules do not cover: what may be done with content after it has been fetched.

robots.txt can say which crawlers should access which paths. It does not fully express what those crawlers should do with content after access. Cloudflare’s Content Signals Policy adds a vocabulary for that second question.

For Better Robots.txt, this matters because it separates access from use.

Current Better Robots status

Better Robots treats Content-Signal as a governance-relevant layer, but it should not be confused with hard enforcement.

Current public status:

  • Content-Signal is documented as a post-crawl usage preference signal.
  • A future score-neutral detection pass is appropriate for /check.
  • Content-Signal is not a WAF rule, firewall rule or crawler-authentication mechanism.
  • Unless /check explicitly marks Content-Signal as scored, do not assume it is part of the current score.
  • Future profile alignment may evaluate search, ai-input and ai-train, but that requires an explicit ruleset update.

The basic pattern

Cloudflare gives this example:

```txt
User-Agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
```

The Content-Signal line does not replace Allow or Disallow. It adds a declared preference about downstream use.

The three currently important purposes are:

| Signal | Practical meaning | Governance question |
|---|---|---|
| search | Search indexing and results with links or excerpts | Should content be used for search discovery? |
| ai-input | Use as input to AI models at query time, including retrieval, grounding, or RAG-style answers | Should content be used inside answer generation or retrieval workflows? |
| ai-train | Training or fine-tuning AI models | Should content be used to improve future models? |

The absence of a signal should not be over-read. Cloudflare explicitly describes absence as neutral for that specific use.
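To make the neutral-absence rule concrete, here is a minimal Python sketch of how a single Content-Signal line could be read. The function name `parse_content_signal` and the use of `None` for "not declared" are assumptions made for illustration; this is not Better Robots' or Cloudflare's implementation.

```python
from typing import Dict, Optional

# The three purposes discussed on this page.
KNOWN_SIGNALS = ("search", "ai-input", "ai-train")


def parse_content_signal(line: str) -> Dict[str, Optional[bool]]:
    """Parse one 'Content-Signal:' line into {signal: True / False / None}.

    None means the signal was not declared, which is treated as neutral
    rather than as an implicit yes or no.
    """
    result: Dict[str, Optional[bool]] = {name: None for name in KNOWN_SIGNALS}
    field, _, value = line.partition(":")
    if field.strip().lower() != "content-signal":
        return result  # not a Content-Signal line: everything stays neutral
    for item in value.split(","):
        key, _, raw = item.partition("=")
        key, raw = key.strip().lower(), raw.strip().lower()
        # Unknown keys and values other than yes/no are ignored in this sketch.
        if key in result and raw in ("yes", "no"):
            result[key] = raw == "yes"
    return result
```

For Cloudflare's example line, `search` comes back as `True`, `ai-train` as `False`, and `ai-input` as `None`, which keeps "undeclared" distinct from "refused".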

Why this belongs in crawler governance

Better Robots already separates crawler families by purpose:

  • search crawlers;
  • AI search crawlers;
  • training crawlers;
  • user-triggered fetchers;
  • social and preview crawlers;
  • SEO tools;
  • bad bots.

Content-Signal adds another axis. It says what the site declares about use after access.

That makes it a direct fit for the audit’s intent-profile model.

For example, the profile AI search open, training restricted may align with:

```txt
Content-Signal: search=yes, ai-input=yes, ai-train=no
```

or, in a more conservative interpretation:

```txt
Content-Signal: search=yes, ai-train=no
```

The difference matters. ai-input=yes explicitly permits answer-time input, whereas search=yes alone leaves ai-input undeclared, which is neutral rather than a grant. A site owner should not enable ai-input by accident.

Access and use can disagree

The most useful audit finding is not simply “Content-Signal exists.”

The useful finding is whether the declared use signal aligns with crawler access rules.

Case 1: training use refused, training crawlers still open

```txt
User-agent: *
Content-Signal: ai-train=no
Allow: /

User-agent: GPTBot
Allow: /
```

This does not automatically mean the site is broken. It means the site is relying on a downstream-use declaration while still allowing access to a training-related crawler. Some crawlers may respect the signal. Some may not.

A careful audit should say:

Usage restriction declared, access still open.

Case 2: training use allowed, training crawlers blocked

```txt
User-agent: *
Content-Signal: ai-train=yes
Allow: /

User-agent: GPTBot
Disallow: /
```

This is a stronger contradiction. The usage signal says training is allowed, but the access rule blocks a known training crawler.

A careful audit should say:

Training use is declared as allowed, but a training crawler is blocked.

Case 3: ai-input refused, answer systems still opened

```txt
User-agent: *
Content-Signal: ai-input=no
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

This needs interpretation. ai-input=no is about answer-time use as model input. Some bots may serve search, retrieval, citation, or user-triggered use cases. The audit should not flatten all of them into one category.
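Cases 1 and 2 can be sketched as a small comparison between the declared ai-train signal and the access granted to training crawlers. This is an illustrative sketch only: the `TRAINING_CRAWLERS` set, the `training_findings` function, and the `root_access` input shape are all hypothetical, and the sketch leaves ai-input out on purpose because Case 3 needs per-bot interpretation.

```python
from typing import Dict, List, Optional

# Hypothetical, non-exhaustive classification used only for this sketch.
TRAINING_CRAWLERS = {"GPTBot"}


def training_findings(ai_train: Optional[bool], root_access: Dict[str, bool]) -> List[str]:
    """Compare the declared ai-train signal with access given to training crawlers.

    ai_train is True (yes), False (no) or None (not declared, neutral).
    root_access maps a crawler name to True (allowed at /) or False (blocked at /).
    """
    findings: List[str] = []
    for bot in sorted(TRAINING_CRAWLERS):
        if bot not in root_access:
            continue  # no explicit rule for this crawler: nothing to compare
        allowed = root_access[bot]
        if ai_train is False and allowed:
            # Case 1: the site relies on the declaration while access stays open.
            findings.append(f"{bot}: usage restriction declared, access still open")
        elif ai_train is True and not allowed:
            # Case 2: the declaration and the access rule point in opposite directions.
            findings.append(f"{bot}: training use declared as allowed, but crawler is blocked")
    return findings
```

An undeclared ai-train signal (`None`) yields no finding here, consistent with treating absence as neutral.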

How Better Robots should treat Content-Signal

Content-Signal should be treated as a strong machine-readable preference, not as hard enforcement.

A safe audit model is:

| Situation | Audit interpretation |
|---|---|
| Present and aligned with selected profile | Positive signal |
| Present but contradictory with crawler access rules | Warning or mismatch finding |
| Present with invalid values | Warning |
| Absent | Informational or not evaluated, not a hard failure |

Absence should not be punished too strongly because Content-Signal is not the Robots Exclusion Protocol itself and is not universally enforced.
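The audit model above can be expressed as a small decision function. The sketch below is one possible reading, not the /check implementation: the function name `interpret`, its inputs, and the returned labels are assumptions, and the final branch covers a case the audit model does not define.

```python
from typing import Dict, Optional

VALID_VALUES = {"yes", "no"}


def interpret(
    declared: Optional[Dict[str, str]],
    profile: Dict[str, str],
    access_conflict: bool,
) -> str:
    """Map a Content-Signal observation to an audit interpretation.

    declared: raw signal values found in robots.txt, or None if no line exists.
    profile: the signal values expected by the selected profile.
    access_conflict: True if crawler access rules contradict the declared signals.
    """
    if not declared:
        return "informational"  # absent: not a hard failure
    if any(value not in VALID_VALUES for value in declared.values()):
        return "warning: invalid values"
    if access_conflict:
        return "warning: mismatch with crawler access rules"
    if all(declared.get(name) == value for name, value in profile.items()):
        return "positive: aligned with selected profile"
    # Not covered by the audit model above; this label is an assumption.
    return "informational: present, not aligned with selected profile"
```

Invalid values are checked before alignment so that a malformed line can never be reported as a positive signal.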

How this relates to Better Robots.txt PRO

At first, /check should detect and explain Content-Signal without changing the core score.

Later, Better Robots.txt PRO could emit a profile-based line such as:

```txt
Content-Signal: search=yes, ai-input=yes, ai-train=no
```

for sites that explicitly choose an AI-search-open and training-restricted posture.

That should be a deliberate profile decision, not a hidden default.

| Profile | Possible Content-Signal posture | Notes |
|---|---|---|
| AI search open, training restricted | search=yes, ai-input=yes, ai-train=no | Strongest expression of retrieval-time use plus training restriction |
| Publisher protection | search=yes, ai-input=no, ai-train=no | Keeps search discovery while refusing answer-time input and training |
| Maximum AI visibility | search=yes, ai-input=yes, ai-train=yes | Useful only when the site deliberately accepts broad AI reuse |
| WordPress safe default | search=yes, ai-train=no, optionally ai-input=yes | Should depend on the site owner's appetite for answer-time use |
| Strict crawler restriction | ai-input=no, ai-train=no; search depends on whether Search visibility remains desired | Do not accidentally turn a strict AI posture into a Search blackout |
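A profile-based emitter could look roughly like the following Python sketch. It covers only the three profiles whose posture is fully specified in the table; the profile identifiers and the `emit_content_signal` function are hypothetical names, not part of Better Robots.txt PRO.

```python
from typing import Dict

# Hypothetical profile identifiers; the values follow the table above.
PROFILE_SIGNALS: Dict[str, Dict[str, str]] = {
    "ai-search-open-training-restricted": {"search": "yes", "ai-input": "yes", "ai-train": "no"},
    "publisher-protection": {"search": "yes", "ai-input": "no", "ai-train": "no"},
    "maximum-ai-visibility": {"search": "yes", "ai-input": "yes", "ai-train": "yes"},
}


def emit_content_signal(profile: str) -> str:
    """Return the Content-Signal line for an explicitly chosen profile.

    Raises KeyError for an unknown profile, so no line is emitted
    unless the site owner has made a deliberate choice.
    """
    signals = PROFILE_SIGNALS[profile]
    return "Content-Signal: " + ", ".join(f"{name}={value}" for name, value in signals.items())
```

Raising on an unknown profile, rather than falling back to a default, mirrors the point that the emitted line should be a deliberate profile decision.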

What not to claim

Do not claim that Content-Signal guarantees obedience.

Do not claim it replaces WAF, Bot Management, signed-agent verification, logs, or contractual controls.

Do not claim it is a Search ranking factor.

Do not claim every crawler respects it.

The correct claim is narrower and stronger:

Content-Signal gives a site a machine-readable way to express post-crawl use preferences in robots.txt.

That is exactly the kind of signal a crawler-governance audit should understand.

FAQ

Is Content-Signal the same as Disallow?

No. Disallow is access guidance for a path. Content-Signal is a declared preference about what content may be used for after access.

Should a site add Content-Signal even if it blocks AI bots?

Possibly. A block rule expresses access posture. Content-Signal expresses usage posture. The two can reinforce each other when they are coherent.

Can Better Robots audit this today?

This page describes the recommended direction for the audit and plugin. The safest implementation path is to detect Content-Signal as a non-scoring informational signal first, then add profile-alignment scoring after the ruleset is updated.

Is ai-input the same as training?

No. ai-input covers answer-time use such as retrieval, grounding, or other use as model input. ai-train covers training or fine-tuning.

References