robots.txt, llms.txt, and WebMCP

Modern machine access is not one problem. It is a stack of different control surfaces.

For WordPress teams, the most common mistake is to collapse every new AI or agentic question into one file. That creates bad decisions:

  • using robots.txt as if it could describe every form of AI use;
  • using llms.txt as if it could enforce crawler behavior;
  • treating WebMCP-style interaction as if it were just another crawl directive;
  • assuming a public AI policy proves runtime compliance;
  • assuming a Lighthouse check certifies full agentic readiness.

The safer model is to separate the layers.

The short version

Surface | Primary role | Good use | Bad use
robots.txt | Crawl-access guidance | Tell crawlers which paths they should or should not fetch | Treat it as indexing, licensing, training, or runtime enforcement
llms.txt | Machine-readable orientation | Summarize the site and route machines to priority source pages | Treat it as a ranking factor, crawler block, or sitemap replacement
Content-Signal | Post-crawl usage preference | Declare search, ai-input, and ai-train posture in robots.txt | Treat it as access blocking or guaranteed enforcement
AI usage policy | Public interpretation layer | Explain acceptable machine use, limits, and source precedence | Treat it as hard technical enforcement
Governance files | Machine-readable policy stack | Clarify precedence, ambiguity handling, and response limits | Let every file speak with equal authority
WebMCP-style surfaces | Agent interaction and tool-use layer | Give agents structured ways to interact with site capabilities | Treat it as equivalent to robots.txt or llms.txt
Edge / WAF controls | Runtime enforcement | Verify, block, allowlist, or rate-limit traffic | Expect WordPress content files to enforce identity

Each layer matters. None of them should pretend to be the whole system.

robots.txt: crawl access, not full machine governance

robots.txt remains the first public crawl policy file most bots inspect. It is useful for:

  • path-level allow/disallow guidance;
  • crawler-family segmentation;
  • sitemap declaration;
  • WordPress crawl-hygiene cleanup;
  • reducing low-value fetches;
  • keeping search crawlers open while restricting selected AI crawlers.
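
As a rough illustration of path-level guidance and crawler-family segmentation, the sketch below feeds a small robots.txt to Python's standard-library urllib.robotparser and asks what each crawler may fetch. The file contents, the ExampleAIBot user-agent token, and the example.com URLs are assumptions made for illustration, not output from Better Robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler family is refused, everything else
# keeps access apart from the WordPress admin area.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let user_agent fetch path."""
    return parser.can_fetch(user_agent, "https://example.com" + path)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/blog/hello-world/"),
        ("Googlebot", "/wp-admin/options.php"),
        ("Googlebot", "/wp-admin/admin-ajax.php"),
        ("ExampleAIBot", "/blog/hello-world/"),
    ]:
        print(agent, path, may_fetch(agent, path))
    print(parser.site_maps())
```

Python's parser applies rules in file order, which is why the Allow line precedes the Disallow line here; other crawlers may resolve overlapping rules differently, for example by longest match. Either way, the result only describes what a compliant crawler should do.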

But robots.txt is not a complete machine-use policy.

It does not reliably express:

  • whether content may be used for model training;
  • whether a user-triggered agent may fetch a page;
  • whether a system may quote or summarize a passage;
  • how to resolve contradictory site claims;
  • which page is the canonical source for a topic;
  • whether a runtime visitor is a verified agent.

That is why Better Robots.txt treats robots.txt as the base layer, not as the entire policy stack.

llms.txt: orientation, not enforcement

llms.txt is a machine-readable summary layer.

It can help a machine reader understand:

  • what the site is;
  • which pages are primary;
  • which policies exist;
  • which support pages matter;
  • which source pages should be read before lower-value pages;
  • what the site does not claim.

It should not be used as:

  • a crawler block;
  • a replacement for robots.txt;
  • a ranking promise;
  • a proof of ingestion;
  • a list of every URL;
  • a license agreement by itself;
  • a substitute for clear source pages.

A good llms.txt makes the site more legible. It does not force external systems to obey it.
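
To make the orientation role concrete, here is a minimal sketch that renders a short llms.txt-style summary. It assumes the commonly used Markdown layout for llms.txt (a title, a one-line summary, then sections of links); the site name, section names, and URLs are hypothetical, and the Link and render_llms_txt names are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Link:
    title: str
    url: str
    note: str = ""  # optional hint about why the page matters


def render_llms_txt(site_name: str, summary: str, sections: dict[str, list[Link]]) -> str:
    """Render a short llms.txt-style Markdown summary (assumed layout)."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            suffix = f": {link.note}" if link.note else ""
            lines.append(f"- [{link.title}]({link.url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site: a handful of priority pages, not a list of every URL.
EXAMPLE = render_llms_txt(
    "Example Site",
    "Documentation for a hypothetical WordPress plugin.",
    {
        "Primary pages": [
            Link("Product overview", "https://example.com/overview/", "read this first"),
        ],
        "Policies": [
            Link("AI usage policy", "https://example.com/ai-policy/"),
        ],
    },
)

if __name__ == "__main__":
    print(EXAMPLE)
```

The output is a routing aid only. Nothing in it blocks a crawler or obliges an external system to read the listed pages first.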

Content-Signal: usage posture, not access control

Content-Signal fills a gap between robots.txt access rules and broader AI usage policy. It can declare whether the site permits or refuses uses such as search, answer-time model input, or training.

That does not make it a hard block. It is a machine-readable preference signal. The right audit question is whether it agrees with the rest of the site’s crawler and policy posture.
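
The sketch below shows one way an audit script might read that posture. It assumes a Content-Signal line of the form "Content-Signal: search=yes, ai-input=no, ai-train=no" inside robots.txt; the exact syntax, the sample file, and the parse_content_signals helper are assumptions for illustration. For simplicity it ignores which User-agent group the line belongs to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: access is open, but usage preferences are declared.
ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes, ai-input=no, ai-train=no
Allow: /
"""


def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Collect key=yes/no pairs from Content-Signal lines (assumed syntax)."""
    signals: dict[str, bool] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        for item in value.split(","):
            name, eq, flag = item.partition("=")
            if eq:
                signals[name.strip().lower()] = flag.strip().lower() == "yes"
    return signals


signals = parse_content_signals(ROBOTS_TXT)

# The access rules are unaffected: the standard-library parser skips the
# Content-Signal line and still reports the path as fetchable.
access = RobotFileParser()
access.parse(ROBOTS_TXT.splitlines())
can_fetch_post = access.can_fetch("AnyBot", "https://example.com/post/")

if __name__ == "__main__":
    print(signals)
    print(can_fetch_post)
```

Here the page stays fetchable even though training use is refused, which is the distinction between a usage preference and an access rule.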

Read: Content-Signal in robots.txt.

AI usage policy: human-readable and machine-readable explanation

An AI usage policy explains the public posture of the site.

It can say:

  • what the site allows or refuses;
  • how policy signals should be interpreted;
  • which files are higher priority;
  • what the site does not guarantee;
  • how machines should handle unsupported claims;
  • when runtime verification is required.

For Better Robots.txt, policy surfaces are important because crawler instructions and machine summaries can be over-read. A policy can constrain that over-reading.

But a policy is still not runtime enforcement. It is a public statement and interpretive guide.

Governance files: precedence and ambiguity reduction

A mature site should not publish isolated files that conflict with each other.

Governance files solve the coordination problem. They can define:

  • source precedence;
  • response legitimacy;
  • anti-plausibility constraints;
  • output boundaries;
  • canonical entrypoints;
  • entity relationships;
  • routing indexes;
  • terms and definitions.

This is why Better Robots.txt publishes a governance stack under /.well-known/ and related public files. The goal is not to multiply files for decoration. The goal is to prevent machines from treating every page, policy, summary, and marketing statement as equally authoritative.
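
Source precedence can be pictured as a simple conflict-resolution rule. The sketch below is an illustration only: the precedence order, the surface names, and the Claim and resolve names are hypothetical and do not describe the actual contents of any governance file.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical precedence order, highest authority first. A real site would
# define its own order in its governance files.
PRECEDENCE = ["governance", "ai-usage-policy", "robots.txt", "llms.txt", "marketing-page"]


@dataclass(frozen=True)
class Claim:
    source: str  # which public surface made the statement
    topic: str
    value: str


def resolve(claims: list[Claim], topic: str) -> Optional[Claim]:
    """Return the claim on topic from the highest-precedence known source."""
    relevant = [c for c in claims if c.topic == topic and c.source in PRECEDENCE]
    if not relevant:
        return None  # no supported claim: better to say nothing than to guess
    return min(relevant, key=lambda c: PRECEDENCE.index(c.source))


CLAIMS = [
    Claim("marketing-page", "ai-train", "allowed"),
    Claim("ai-usage-policy", "ai-train", "refused"),
]

if __name__ == "__main__":
    print(resolve(CLAIMS, "ai-train"))
    print(resolve(CLAIMS, "pricing"))
```

With an explicit order, the policy statement outranks the marketing page instead of both being read as equally authoritative.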

WebMCP-style surfaces: interaction, not crawl policy

WebMCP-style surfaces belong to a different category. They are about structured agent interaction, not ordinary crawler access.

Where robots.txt says “these paths should or should not be fetched,” and llms.txt says “here is how to understand the site,” an agent interaction layer may say:

  • these actions are available;
  • these inputs are expected;
  • these outputs are returned;
  • these tools or workflows can be invoked;
  • these constraints apply during interaction.

That is closer to an interface contract than to a crawl policy.
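
A minimal sketch of what such an interface contract involves is shown below. It is a generic illustration in Python, not the WebMCP API: the Tool class, the search_docs action, and the call limit are all hypothetical names chosen to mirror the bullets above (available action, expected inputs, returned outputs, constraints).

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    name: str                               # the action that is available
    description: str
    inputs: dict[str, type]                 # the inputs that are expected
    handler: Callable[..., dict[str, Any]]  # produces the returned output
    max_calls_per_session: int = 5          # a constraint on interaction
    calls: int = field(default=0, init=False)

    def invoke(self, **kwargs: Any) -> dict[str, Any]:
        """Validate the call against the contract, then run the handler."""
        if self.calls >= self.max_calls_per_session:
            raise RuntimeError(f"{self.name}: call limit reached")
        if set(kwargs) != set(self.inputs):
            raise ValueError(f"{self.name}: expected inputs {sorted(self.inputs)}")
        for key, expected in self.inputs.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{self.name}: {key} must be {expected.__name__}")
        self.calls += 1
        return self.handler(**kwargs)


def search_docs(query: str, limit: int) -> dict[str, Any]:
    """Hypothetical site capability: search a tiny in-memory page list."""
    pages = ["robots.txt guide", "llms.txt guide", "AI usage policy"]
    hits = [p for p in pages if query.lower() in p.lower()]
    return {"results": hits[:limit]}


search_tool = Tool(
    name="search_docs",
    description="Search documentation pages by keyword.",
    inputs={"query": str, "limit": int},
    handler=search_docs,
)

if __name__ == "__main__":
    print(search_tool.invoke(query="txt", limit=5))
```

Unlike a crawl directive, the contract is checked at the moment of interaction: a call with missing or mistyped inputs is rejected rather than merely discouraged.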

For most WordPress sites, WebMCP is not the first implementation step. The first steps are more basic:

  1. keep search crawl open where needed;
  2. separate AI crawler purposes;
  3. publish a useful llms.txt if appropriate;
  4. document AI usage posture;
  5. reduce low-value WordPress routes;
  6. improve source pages;
  7. fix accessibility and interaction stability.

Only then does a structured agent-interaction layer become a serious next step.

Where Better Robots.txt fits in the stack

Better Robots.txt is strongest in these layers:

  • robots.txt generation and review;
  • crawler-family segmentation;
  • WordPress crawl hygiene;
  • AI crawler posture;
  • optional llms.txt publication;
  • machine-readable governance signals;
  • audit interpretation and correction workflow.

It does not claim to be:

  • a WebMCP server;
  • an accessibility remediation tool;
  • a WAF;
  • a signed-agent identity verifier;
  • a full UI agent testing suite;
  • a Search ranking guarantee.

That boundary is important for trust.

The implementation sequence

Phase 1: stabilize robots.txt

Use Better Robots.txt to create a coherent crawl policy and avoid accidental Search blocking.

Phase 2: separate crawler purposes

Distinguish Search, training, user-triggered retrieval, archives, SEO tools, and bad bots.
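
One way to start that separation is a small lookup from user-agent token to purpose, sketched below. The mapping is illustrative and reflects how the operators have publicly described these crawlers at the time of writing; it is an assumption that may go out of date, and the classify helper is a hypothetical name.

```python
# Illustrative mapping from a lowercase user-agent token to a purpose label.
# Verify against each operator's current documentation before relying on it.
PURPOSES = {
    "googlebot": "search",
    "bingbot": "search",
    "gptbot": "training",
    "chatgpt-user": "user-triggered retrieval",
    "archive.org_bot": "archive",
    "ahrefsbot": "seo-tool",
}


def classify(user_agent_header: str) -> str:
    """Return a purpose label for a User-Agent header, or 'unknown'."""
    ua = user_agent_header.lower()
    for token, purpose in PURPOSES.items():
        if token in ua:
            return purpose
    return "unknown"


if __name__ == "__main__":
    for header in [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
        "curl/8.0",
    ]:
        print(classify(header), "<-", header)
```

A User-Agent header is self-declared, so this classification supports policy decisions but does not verify identity. Verification belongs to the edge or WAF layer.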

Phase 3: publish machine-readable guidance

If useful, publish llms.txt and policy surfaces that point to the right pages.

Phase 4: improve source pages

A machine-readable summary is only as useful as the pages it points to.

Phase 5: inspect agent interaction

Use Lighthouse Agentic Browsing, accessibility checks, front-end QA, form testing, and workflow reviews.

FAQ

Does WebMCP replace robots.txt?

No. They solve different problems. robots.txt is crawl-access guidance. WebMCP-style surfaces are closer to structured agent interaction.

Does llms.txt replace WebMCP?

No. llms.txt summarizes and routes. WebMCP-style surfaces can expose interaction capabilities and constraints.

Can Better Robots.txt implement WebMCP today?

Not as a WebMCP server. Better Robots.txt should be understood primarily as a WordPress crawl-governance and machine-guidance layer, and it does not claim to be a WebMCP server. WebMCP-style interaction would be a separate implementation category.

Which layer should a WordPress site implement first?

Start with robots.txt and Search safety. Then separate crawler purposes. Then publish accurate machine guidance. Then improve source pages and interaction readiness.