
robots.txt declaration vs technical blocking

robots.txt is a public declaration layer. It is not a firewall.

Better Robots treats robots.txt as a governance surface because it tells cooperative crawlers what the site declares. That is valuable, but it is not the same as technical enforcement.

Four different layers

| Layer | Main question | Examples |
| --- | --- | --- |
| Declaration | What does the site ask cooperative crawlers to do? | robots.txt, User-agent, Allow, Disallow, Sitemap |
| Post-crawl use preference | What may happen after access? | Content-Signal, AI usage policy, ai-train, ai-input |
| Enforcement | What is technically blocked? | WAF, server rules, authentication, rate limits, bot verification |
| Evidence | What actually happened? | logs, verified bot IPs, request traces, re-audits |

Why robots.txt still matters

robots.txt is still the first public place many crawlers inspect. It remains the most widely known crawl-access declaration format.

It helps a site express:

  • what should remain crawlable;
  • what should not be crawled;
  • which sitemap should be used;
  • whether specific crawler families are handled separately;
  • whether search, training and user-triggered fetchers are collapsed or distinguished.
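
As a rough illustration of these declarations, the sketch below parses a sample robots.txt with Python's standard urllib.robotparser. The file contents, the example.com URLs and the ExampleBot name are made up for illustration; GPTBot stands in for a crawler family that is handled separately. This parser applies the first matching rule in a group, so the Disallow line is listed before the broad Allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)

# A generic crawler falls under the "*" group.
generic_public = parser.can_fetch("ExampleBot", "https://example.com/blog/post")
generic_private = parser.can_fetch("ExampleBot", "https://example.com/private/report")

# GPTBot has its own group, so the "*" rules do not apply to it.
gptbot_public = parser.can_fetch("GPTBot", "https://example.com/blog/post")

# Sitemap declarations are exposed separately (Python 3.8+).
sitemaps = parser.site_maps()
```

A cooperative crawler performs a similar check before fetching. Nothing in this check stops a client that never performs it.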

What robots.txt cannot do

robots.txt cannot guarantee that every crawler will obey.

It cannot authenticate a bot.

It cannot stop a malicious scraper by itself.

It cannot prove that content was not used for training.

It cannot replace legal terms, WAF rules, logs or server-level enforcement.
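
Authentication therefore has to happen elsewhere. One common approach, sketched below under stated assumptions, is forward-confirmed reverse DNS: resolve the client IP to a hostname, check that the hostname belongs to the crawler operator's domain, then resolve the hostname back and confirm it returns the same IP. The function name verify_bot_ip and its parameters are hypothetical, and the list of accepted hostname suffixes must come from the crawler operator's own documentation.

```python
import socket


def verify_bot_ip(ip, allowed_suffixes,
                  reverse_lookup=socket.gethostbyaddr,
                  forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` passes forward-confirmed reverse DNS.

    `allowed_suffixes` is a sequence of hostname suffixes published by the
    crawler operator (an assumption of this sketch; always check the
    operator's documentation). The lookup functions are injectable so the
    logic can be exercised without network access.
    """
    try:
        # Step 1: reverse lookup, IP -> hostname.
        hostname = reverse_lookup(ip)[0].lower().rstrip(".")
        # Step 2: the hostname must sit under an operator-owned domain.
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        # Step 3: forward lookup, hostname -> IPs, must include the original IP.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as unverified.
        return False
```

A User-agent header alone is only a claim. A check of this kind belongs to the enforcement and evidence layers, not to robots.txt.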

Where Content-Signal fits

Content-Signal belongs to the post-crawl usage layer. It can express preferences such as:

```txt
search=yes
ai-input=no
ai-train=no
```

That is useful because it says something different from Allow and Disallow. It describes a declared usage preference, not access itself.
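
To make the distinction concrete, here is a minimal sketch that reads such preferences out of a robots.txt file. It assumes the signals appear as a comma-separated list of key=value pairs on a Content-Signal line with yes/no values; the function name parse_content_signal is hypothetical and the parsing is deliberately simplified.

```python
def parse_content_signal(robots_txt: str) -> dict[str, bool]:
    """Collect Content-Signal preferences as {signal_name: allowed}.

    Assumes lines of the form:
        Content-Signal: search=yes, ai-input=no, ai-train=no
    Unknown or malformed pairs are skipped rather than guessed.
    """
    signals: dict[str, bool] = {}
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, setting = pair.partition("=")
            setting = setting.strip().lower()
            if eq and setting in ("yes", "no"):
                signals[key.strip().lower()] = setting == "yes"
    return signals
```

The result says nothing about whether a crawler may fetch a URL. It only records what the site declares about use after access.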

How Better Robots uses this distinction

Better Robots /check audits declarations, posture and coherence.

A strong audit can say:

```txt
Your access rules allow GPTBot, but your declared post-crawl usage preference refuses training.
```

or:

```txt
Your robots.txt blocks training crawlers, but no AI usage policy explains the intended reuse boundary.
```

That is governance. Enforcement remains a separate layer.
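
A coherence check of this kind could be sketched as follows. This is not the Better Robots implementation; the function name audit_training_coherence, its parameters and the choice of GPTBot as the only training crawler are assumptions made for illustration. The signals argument is a plain dictionary mapping a preference name to True or False.

```python
from urllib.robotparser import RobotFileParser


def audit_training_coherence(robots_txt, signals,
                             training_agents=("GPTBot",),
                             has_ai_policy=False):
    """Compare declared access with declared post-crawl usage preferences.

    `signals` maps preference names to booleans, e.g. {"ai-train": False}.
    `has_ai_policy` states whether the site publishes an AI usage policy.
    Returns a list of human-readable findings (empty if nothing stands out).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    refuses_training = signals.get("ai-train") is False

    findings = []
    for agent in training_agents:
        # Access declaration: may this agent fetch the site root?
        allowed = parser.can_fetch(agent, "/")
        if allowed and refuses_training:
            findings.append(
                f"Access rules allow {agent}, but the declared "
                "post-crawl usage preference refuses training."
            )
        if not allowed and not has_ai_policy:
            findings.append(
                f"robots.txt blocks {agent}, but no AI usage policy "
                "explains the intended reuse boundary."
            )
    return findings
```

Both findings describe declarations only. Neither one shows what a crawler actually did.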

Practical recommendation

Use robots.txt to declare cooperative crawler access.

Use Content-Signal and AI policies to declare downstream usage preferences.

Use WAF and server rules to enforce against unwanted traffic.

Use logs and re-audits to verify what is actually happening.
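
For the last step, a minimal sketch of a log check might look like this. It assumes access logs in the common "combined" format and matches crawlers by User-agent substring, which is a claim rather than a verified identity; the function name find_violations and the regular expression are illustrative assumptions.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches the request, status, size, referrer and user-agent fields of a
# combined-format access log line (GET and HEAD requests only).
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_violations(robots_txt, log_lines, crawler_token):
    """List requested paths that the declared rules disallow for a crawler.

    A request counts when its User-agent contains `crawler_token` and
    robots.txt disallows the requested path for that token.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    violations = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # Not a line this sketch understands.
        if crawler_token.lower() not in match["agent"].lower():
            continue  # A different client.
        if not parser.can_fetch(crawler_token, match["path"]):
            violations.append(match["path"])
    return violations
```

A non-empty result suggests that a declaration is being ignored, or that another client is borrowing the crawler's name. Telling the two apart requires verified bot IPs.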