robots.txt, llms.txt, and WebMCP
Modern machine access is not one problem. It is a stack of different control surfaces.
For WordPress teams, the most common mistake is to collapse every new AI or agentic question into one file. That creates bad decisions:
- using robots.txt as if it could describe every form of AI use;
- using llms.txt as if it could enforce crawler behavior;
- treating WebMCP-style interaction as if it were just another crawl directive;
- assuming a public AI policy proves runtime compliance;
- assuming a Lighthouse check certifies full agentic readiness.
The safer model is to separate the layers.
The short version
| Surface | Primary role | Good use | Bad use |
|---|---|---|---|
| robots.txt | Crawl-access guidance | Tell crawlers which paths they should or should not fetch | Treat it as indexing, licensing, training, or runtime enforcement |
| llms.txt | Machine-readable orientation | Summarize the site and route machines to priority source pages | Treat it as a ranking factor, crawler block, or sitemap replacement |
| Content-Signal | Post-crawl usage preference | Declare search, ai-input, and ai-train posture in robots.txt | Treat it as access blocking or guaranteed enforcement |
| AI usage policy | Public interpretation layer | Explain acceptable machine use, limits, and source precedence | Treat it as hard technical enforcement |
| Governance files | Machine-readable policy stack | Clarify precedence, ambiguity handling, and response limits | Let every file speak with equal authority |
| WebMCP-style surfaces | Agent interaction and tool-use layer | Give agents structured ways to interact with site capabilities | Treat it as equivalent to robots.txt or llms.txt |
| Edge / WAF controls | Runtime enforcement | Verify, block, allowlist, or rate-limit traffic | Expect WordPress content files to enforce identity |
Each layer matters. None of them should pretend to be the whole system.
robots.txt: crawl access, not full machine governance
robots.txt remains the first public crawl policy file most bots inspect. It is useful for:
- path-level allow/disallow guidance;
- crawler-family segmentation;
- sitemap declaration;
- WordPress crawl-hygiene cleanup;
- reducing low-value fetches;
- keeping search crawlers open while restricting selected AI crawlers.
But robots.txt is not a complete machine-use policy.
It does not reliably express:
- whether content may be used for model training;
- whether a user-triggered agent may fetch a page;
- whether a system may quote or summarize a passage;
- how to resolve contradictory site claims;
- which page is the canonical source for a topic;
- whether a runtime visitor is a verified agent.
That is why Better Robots.txt treats robots.txt as the base layer, not as the entire policy stack.
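As a minimal sketch of the path-level and crawler-family guidance described above, Python's standard `urllib.robotparser` can evaluate a policy that keeps a search crawler open while refusing one AI crawler. The domain `example.com`, the paths, and the rules themselves are illustrative assumptions, not Better Robots.txt output. This parser appears to apply the first matching rule within a group, so the narrower `Allow` line is listed before the broader `Disallow`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy; example.com and every path here are assumptions.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
post = "https://example.com/blog/hello-world/"

checks = {
    "search crawler may fetch a post": parser.can_fetch("Googlebot", post),
    "AI crawler may fetch a post": parser.can_fetch("GPTBot", post),
    "other bots may fetch wp-admin": parser.can_fetch(
        "OtherBot", "https://example.com/wp-admin/options.php"
    ),
}
for label, allowed in checks.items():
    print(f"{label}: {allowed}")
print("sitemaps:", parser.site_maps())
```

The check only reports what the file asks for. It says nothing about whether a given crawler honors the request, which is why the file is treated as guidance rather than enforcement.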
llms.txt: orientation, not enforcement
llms.txt is a machine-readable summary layer.
It can help a machine reader understand:
- what the site is;
- which pages are primary;
- which policies exist;
- which support pages matter;
- which source pages should be read before lower-value pages;
- what the site does not claim.
It should not be used as:
- a crawler block;
- a replacement for robots.txt;
- a ranking promise;
- a proof of ingestion;
- a list of every URL;
- a license agreement by itself;
- a substitute for clear source pages.
A good llms.txt makes the site more legible. It does not force external systems to obey it.
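To make the orientation role concrete, here is a small sketch that renders an llms.txt-style document: a title, a short summary, and sections of prioritized links. It assumes the Markdown layout used by the public llms.txt proposal (H1 title, blockquote summary, H2 sections with link lists). The `Link` class, the `render_llms_txt` function, the site name, and all URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    title: str
    url: str
    note: str = ""  # optional hint about why the page matters


def render_llms_txt(site_name: str, summary: str,
                    sections: dict[str, list[Link]]) -> str:
    """Render an H1 title, a blockquote summary, and H2 link sections."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        for link in links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip("\n") + "\n"


# Hypothetical site data, for illustration only.
llms_txt = render_llms_txt(
    "Example Site",
    "Documentation for a hypothetical WordPress plugin. "
    "This file routes readers; it grants no permissions.",
    {
        "Primary pages": [
            Link("Product overview", "https://example.com/overview/", "read first"),
        ],
        "Policies": [
            Link("AI usage policy", "https://example.com/ai-usage-policy/"),
        ],
    },
)
print(llms_txt)
```

The output is a short list of priority pages, not a full URL inventory, which matches the point that llms.txt should not try to be a sitemap.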
Content-Signal: usage posture, not access control
Content-Signal fills a gap between robots.txt access rules and broader AI usage policy. It can declare whether the site permits or refuses uses such as search, answer-time model input, or training.
That does not make it a hard block. It is a machine-readable preference signal. The right audit question is whether it agrees with the rest of the site’s crawler and policy posture.
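The audit question can be sketched in code. The example below assumes Content-Signal is written as a comma-separated `key=yes|no` line inside a robots.txt group, reads the wildcard group's signals, and flags cases where a use is declared as permitted while a crawler associated with that use is blocked from fetching. The function names, the crawler-to-purpose mapping, and `example.com` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def parse_content_signals(robots_txt: str) -> dict[str, dict[str, bool]]:
    """Map each user-agent token to the Content-Signal values in its group."""
    signals: dict[str, dict[str, bool]] = {}
    agents: list[str] = []
    in_rules = False  # True once the current group has a non-User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a new group starts here
                agents, in_rules = [], False
            agents.append(value.lower())
            continue
        in_rules = True
        if field == "content-signal":
            for item in value.split(","):
                key, sep, flag = item.partition("=")
                flag = flag.strip().lower()
                if sep and flag in ("yes", "no"):
                    for agent in agents:
                        signals.setdefault(agent, {})[key.strip().lower()] = flag == "yes"
    return signals


def signal_conflicts(robots_txt: str, crawler_purposes: dict[str, str],
                     probe_url: str) -> list[str]:
    """List uses declared as permitted whose associated crawler cannot fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    wildcard = parse_content_signals(robots_txt).get("*", {})
    conflicts = []
    for crawler, purpose in crawler_purposes.items():
        if wildcard.get(purpose) is True and not parser.can_fetch(crawler, probe_url):
            conflicts.append(f"{purpose}=yes but {crawler} is blocked from {probe_url}")
    return conflicts


ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=yes
Allow: /

User-agent: GPTBot
Disallow: /
"""

# Hypothetical mapping of crawler tokens to the use they are checked against.
purposes = {"Googlebot": "search", "GPTBot": "ai-train"}
conflicts = signal_conflicts(ROBOTS_TXT, purposes, "https://example.com/")
print(conflicts)
```

The reverse case (a crawler may fetch, but a use is refused) is deliberately not flagged, because allowing access while declining a downstream use is a coherent posture.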
Read: Content-Signal in robots.txt.
AI usage policy: human-readable and machine-readable explanation
An AI usage policy explains the public posture of the site.
It can say:
- what the site allows or refuses;
- how policy signals should be interpreted;
- which files are higher priority;
- what the site does not guarantee;
- how machines should handle unsupported claims;
- when runtime verification is required.
For Better Robots.txt, policy surfaces are important because crawler instructions and machine summaries can be over-read. A policy can constrain that over-reading.
But a policy is still not runtime enforcement. It is a public statement and interpretive guide.
Governance files: precedence and ambiguity reduction
A mature site should not publish isolated files that conflict with each other.
Governance files solve the coordination problem. They can define:
- source precedence;
- response legitimacy;
- anti-plausibility constraints;
- output boundaries;
- canonical entrypoints;
- entity relationships;
- routing indexes;
- terms and definitions.
This is why Better Robots.txt publishes a governance stack under /.well-known/ and related public files. The goal is not to multiply files for decoration. The goal is to prevent machines from treating every page, policy, summary, and marketing statement as equally authoritative.
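One way to picture source precedence is a resolver that, given conflicting claims from different surfaces, keeps the claim from the most authoritative surface. The precedence order, surface names, and `Claim` structure below are assumptions made for illustration; they are not the contents of any published governance file.

```python
from dataclasses import dataclass

# Hypothetical precedence order, highest authority first.
PRECEDENCE = [
    "governance-file",
    "ai-usage-policy",
    "source-page",
    "llms.txt",
    "marketing-page",
]


@dataclass(frozen=True)
class Claim:
    topic: str
    value: str
    surface: str


def resolve(claims: list[Claim]) -> dict[str, Claim]:
    """For each topic, keep the claim from the highest-precedence surface."""
    rank = {surface: index for index, surface in enumerate(PRECEDENCE)}
    best: dict[str, Claim] = {}
    for claim in claims:
        if claim.surface not in rank:
            continue  # unknown surfaces carry no authority in this sketch
        current = best.get(claim.topic)
        if current is None or rank[claim.surface] < rank[current.surface]:
            best[claim.topic] = claim
    return best


claims = [
    Claim("ai-train", "allowed", "marketing-page"),
    Claim("ai-train", "refused", "ai-usage-policy"),
    Claim("ai-train", "allowed", "forum-comment"),
]
resolved = resolve(claims)
print(resolved["ai-train"].value, "per", resolved["ai-train"].surface)
```

Without an explicit order like this, a reader has no principled way to choose between the marketing statement and the policy, which is the coordination problem governance files are meant to address.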
WebMCP-style surfaces: interaction, not crawl policy
WebMCP-style surfaces belong to a different category. They are about structured agent interaction, not ordinary crawler access.
Where robots.txt says “these paths should or should not be fetched,” and llms.txt says “here is how to understand the site,” an agent interaction layer may say:
- these actions are available;
- these inputs are expected;
- these outputs are returned;
- these tools or workflows can be invoked;
- these constraints apply during interaction.
That is closer to an interface contract than to a crawl policy.
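The interface-contract idea can be illustrated with a plain data structure that names an action, its expected inputs, its returned fields, and its constraints. This is a generic sketch and not the WebMCP API; `ToolContract`, the `search_products` tool, and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolContract:
    name: str                       # the action that is available
    description: str
    inputs: dict[str, type]         # expected input names and their types
    output_fields: tuple[str, ...]  # fields returned to the agent
    constraints: tuple[str, ...] = ()

    def validate(self, payload: dict) -> list[str]:
        """Return a list of problems with an agent's input payload."""
        errors = []
        for key, expected in self.inputs.items():
            if key not in payload:
                errors.append(f"missing input: {key}")
            elif not isinstance(payload[key], expected):
                errors.append(f"{key} should be {expected.__name__}")
        for key in payload:
            if key not in self.inputs:
                errors.append(f"unexpected input: {key}")
        return errors


# Hypothetical tool exposed by a site, for illustration only.
search_products = ToolContract(
    name="search_products",
    description="Search the public product catalog.",
    inputs={"query": str, "limit": int},
    output_fields=("title", "url", "price"),
    constraints=("read-only", "limit must not exceed 20"),
)

print(search_products.validate({"query": "robots", "limit": 5}))
print(search_products.validate({"query": "robots", "limit": "5"}))
```

A crawl directive has nothing comparable to `inputs` or `output_fields`, which is the practical difference between the two layers.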
For most WordPress sites, WebMCP is not the first implementation step. The first steps are more basic:
- keep search crawl open where needed;
- separate AI crawler purposes;
- publish a useful llms.txt if appropriate;
- document AI usage posture;
- reduce low-value WordPress routes;
- improve source pages;
- fix accessibility and interaction stability.
Only then does a structured agent-interaction layer become a serious next step.
Where Better Robots.txt fits in the stack
Better Robots.txt is strongest in these layers:
- robots.txt generation and review;
- crawler-family segmentation;
- WordPress crawl hygiene;
- AI crawler posture;
- optional llms.txt publication;
- machine-readable governance signals;
- audit interpretation and correction workflow.
It does not claim to be:
- a WebMCP server;
- an accessibility remediation tool;
- a WAF;
- a signed-agent identity verifier;
- a full UI agent testing suite;
- a Search ranking guarantee.
That boundary is important for trust.
The implementation sequence
Phase 1: stabilize robots.txt
Use Better Robots.txt to create a coherent crawl policy and avoid accidental Search blocking.
Read:
- Robots.txt checker for AI crawlers
- Control AI crawlers on WordPress
- Robots.txt presence and validity check
Phase 2: separate crawler purposes
Distinguish Search, training, user-triggered retrieval, archives, SEO tools, and bad bots.
Read:
- AI training vs AI search crawlers
- Search vs answer vs training permissions
- What AI usage signals can and cannot do
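Separating crawler purposes can start with a simple lookup from user-agent tokens to purpose families. The sketch below uses a handful of publicly documented crawler tokens, but the family assignments are assumptions that each site should review, and the table is far from complete. User-agent strings are self-declared, so this classifies what a visitor claims to be and does not verify identity; bad bots in particular cannot be reliably identified this way.

```python
# Token-to-family assignments are illustrative assumptions, not a registry.
CRAWLER_FAMILIES = {
    "search": ("googlebot", "bingbot"),
    "training": ("gptbot", "ccbot"),
    "user-triggered retrieval": ("chatgpt-user",),
    "archive": ("ia_archiver",),
    "seo tool": ("ahrefsbot", "semrushbot"),
}


def classify(user_agent: str) -> str:
    """Return the purpose family for a declared user-agent string."""
    ua = user_agent.lower()
    for family, tokens in CRAWLER_FAMILIES.items():
        if any(token in ua for token in tokens):
            return family
    return "unclassified"


# Sample user-agent strings, shortened for readability.
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (compatible; AhrefsBot/7.0)",
    "curl/8.5.0",
]
for ua in samples:
    print(f"{classify(ua):<26} {ua}")
```

Once visitors are grouped this way, each family can receive its own robots.txt group instead of one blanket rule.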
Phase 3: publish machine-readable guidance
If useful, publish llms.txt and policy surfaces that point to the right pages.
Read:
- llms.txt and Lighthouse audit for WordPress
- How to add llms.txt on WordPress
- AI governance files checker
Phase 4: improve source pages
A machine-readable summary is only as useful as the pages it points to.
Phase 5: inspect agent interaction
Use Lighthouse Agentic Browsing, accessibility checks, front-end QA, form testing, and workflow reviews.
Read:
- Lighthouse Agentic Browsing for WordPress
- Agentic readiness checklist for WordPress
- Crawler governance vs agentic readiness
- Content-Signal in robots.txt
- Agentic readiness for WordPress
FAQ
Does WebMCP replace robots.txt?
No. They solve different problems. robots.txt is crawl-access guidance. WebMCP-style surfaces are closer to structured agent interaction.
Does llms.txt replace WebMCP?
No. llms.txt summarizes and routes. WebMCP-style surfaces can expose interaction capabilities and constraints.
Can Better Robots.txt implement WebMCP today?
No. Better Robots.txt does not claim to be a WebMCP server. It should be understood primarily as a WordPress crawl-governance and machine-guidance layer; WebMCP-style interaction would be a separate implementation category.
Which layer should a WordPress site implement first?
Start with robots.txt and Search safety. Then separate crawler purposes. Then publish accurate machine guidance. Then improve source pages and interaction readiness.