robots.txt declaration vs technical blocking
robots.txt is a public declaration layer. It is not a firewall.
Better Robots treats robots.txt as a governance surface because it publishes what the site asks cooperative crawlers to do. That is valuable, but it is not the same as technical enforcement.
Four different layers
| Layer | Main question | Examples |
|---|---|---|
| Declaration | What does the site ask cooperative crawlers to do? | robots.txt, User-agent, Allow, Disallow, Sitemap |
| Post-crawl use preference | What may happen after access? | Content-Signal, AI usage policy, ai-train, ai-input |
| Enforcement | What is technically blocked? | WAF, server rules, authentication, rate limits, bot verification |
| Evidence | What actually happened? | logs, verified bot IPs, request traces, re-audits |
Why robots.txt still matters
robots.txt is still the first public place many crawlers inspect. It remains the most widely known crawl-access declaration format.
It helps a site express:
- what should remain crawlable;
- what should not be crawled;
- which sitemap should be used;
- whether specific crawler families are handled separately;
- whether search, training and user-triggered fetchers are grouped together or distinguished.
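The declarations listed above can be read the way a cooperative crawler would read them. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example.com` URLs and the `ExampleBot` name are assumptions made up for illustration, and `declared_access` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def declared_access(agent: str, url: str) -> bool:
    """Return what the file declares for this agent; nothing is enforced here."""
    return parser.can_fetch(agent, url)


for agent, url in [
    ("ExampleBot", "https://example.com/blog/post"),
    ("ExampleBot", "https://example.com/private/report"),
    ("GPTBot", "https://example.com/blog/post"),
]:
    print(agent, url, declared_access(agent, url))

print(parser.site_maps())
```

The check runs inside the crawler, which is why this layer is a declaration: a client that never calls `can_fetch` is not stopped by anything in the file. Note also that Python's parser applies the first matching rule in file order, so other parsers may resolve overlapping Allow and Disallow rules differently.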
What robots.txt cannot do
robots.txt cannot guarantee that every crawler will obey.
It cannot authenticate a bot.
It cannot stop a malicious scraper by itself.
It cannot prove that content was not used for training.
It cannot replace legal terms, WAF rules, logs or server-level enforcement.
Where Content-Signal fits
Content-Signal belongs to the post-crawl usage layer. It can express preferences such as:
search=yes
ai-input=no
ai-train=no
That is useful because it says something different from Allow and Disallow. It describes a declared usage preference, not access itself.
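The signal values shown above can be collected into a simple mapping. This is a sketch only: it assumes the signals appear in robots.txt on a `Content-Signal:` line as comma-separated `key=yes|no` pairs, it ignores which User-agent group the line belongs to, and `parse_content_signals` is a hypothetical helper name.

```python
def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Collect Content-Signal preferences as {signal: True/False}.

    Assumes a line such as
        Content-Signal: search=yes, ai-input=no, ai-train=no
    and ignores User-agent grouping for simplicity.
    """
    signals: dict[str, bool] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, val = pair.partition("=")
            val = val.strip().lower()
            if eq and val in ("yes", "no"):
                signals[key.strip().lower()] = val == "yes"
    return signals


sample = "User-agent: *\nContent-Signal: search=yes, ai-input=no, ai-train=no\nAllow: /\n"
print(parse_content_signals(sample))
```

The result says nothing about whether a crawler may fetch a page. It only records what the site declares about use after access.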
How Better Robots uses this distinction
Better Robots /check audits declarations, posture and coherence.
A strong audit can say:
Your access rules allow GPTBot, but your declared post-crawl usage preference refuses training.
or:
Your robots.txt blocks training crawlers, but no AI usage policy explains the intended reuse boundary.
That is governance. Enforcement remains a separate layer.
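A coherence check of this kind can be sketched in a few lines. The function below is not Better Robots' implementation; `coherence_findings`, its parameters and the shape of the `signals` mapping are assumptions chosen to mirror the two example findings above.

```python
def coherence_findings(
    crawler: str,
    access_allowed: bool,
    signals: dict[str, bool],
    has_ai_policy: bool,
) -> list[str]:
    """Compare the access declaration with the usage declaration.

    signals maps a signal name such as "ai-train" to True (yes) or False (no);
    a missing key means the site declared nothing for that signal.
    """
    findings: list[str] = []
    if access_allowed and signals.get("ai-train") is False:
        findings.append(
            f"Access rules allow {crawler}, but the declared post-crawl "
            "usage preference refuses training."
        )
    if not access_allowed and not has_ai_policy:
        findings.append(
            f"robots.txt blocks {crawler}, but no AI usage policy explains "
            "the intended reuse boundary."
        )
    return findings


print(coherence_findings("GPTBot", True, {"ai-train": False}, has_ai_policy=True))
```

Both findings are statements about declarations. Neither says whether any request was actually blocked.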
Practical recommendation
Use robots.txt to declare cooperative crawler access.
Use Content-Signal and AI policies to declare downstream usage preferences.
Use WAF and server rules to enforce against unwanted traffic.
Use logs and re-audits to verify what is actually happening.
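The last recommendation can be illustrated with a small log check. This is a sketch under stated assumptions: the access log uses the common "combined" format, the sample lines and IP addresses are invented, and `count_disallowed_hits` is a hypothetical helper. A User-Agent string can be spoofed, so a real check would also verify the requesting IP.

```python
import re
from collections import Counter

# Matches the request, status, size, referer and user-agent fields
# of a combined-format access log line (assumed format).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_disallowed_hits(log_lines, disallowed_prefixes, agent_token):
    """Count requests by a named crawler to paths the site asked it to avoid."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        # The agent string is self-reported; it is evidence, not proof.
        if agent_token.lower() not in match["agent"].lower():
            continue
        path = match["path"]
        if any(path.startswith(prefix) for prefix in disallowed_prefixes):
            hits[path] += 1
    return hits


sample_log = [
    '203.0.113.7 - - [01/Jan/2025:00:00:00 +0000] "GET /private/report HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /blog/post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.9 - - [01/Jan/2025:00:00:02 +0000] "GET /private/report HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OtherBot/2.0)"',
]
hits = count_disallowed_hits(sample_log, ["/private/"], "GPTBot")
print(hits)
```

A non-empty result means the declaration and the observed traffic disagree, which is the signal to move from robots.txt to WAF or server rules.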