
Robots.txt checker for AI crawlers

Start with the scan. Run the free AI robots.txt checker, then use this guide to understand what each block means, why it matters, and how Better Robots.txt can fix the WordPress side.

A traditional robots.txt validator answers a narrow question: does /robots.txt exist and does its syntax look acceptable? That was enough when the practical audience was mostly search crawlers. It is not enough anymore.

Modern sites are read by search engines, AI search crawlers, model-training crawlers, user-triggered retrieval agents, social preview bots, advertising validators, SEO tools, archive services, and many low-value automated actors. They do not have the same purpose. They do not create the same value. They do not carry the same risk. A single User-agent: * group can no longer express a serious posture for every machine visitor.

The Better Robots.txt checker was built around a more useful question: does the site express a clear, machine-readable, WordPress-aware crawler and AI governance posture?

The tool checks the surfaces that machines meet first: robots.txt, sitemap references, crawler directives, llms.txt, governance pointers, resource access, and WordPress crawl hygiene. It does not pretend that one file can force every AI company to obey. It measures whether your site is explicit, coherent, and actionable enough for search engines, AI systems, consultants, and site owners to understand what you intend.

Why a classic robots.txt validator is no longer enough

A validator can tell you that this file is syntactically valid:

```txt
User-agent: *
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap.xml
```

That does not answer the questions that matter for AI-era visibility:

| Question | Why it matters |
| --- | --- |
| Is OAI-SearchBot allowed while GPTBot is restricted? | Search discovery and training control are different goals. |
| Is Claude-SearchBot handled separately from ClaudeBot? | Anthropic documents different robots for search, user requests, and model-development crawling. |
| Is Google-Extended used without blocking Googlebot? | Google-Extended is a control token, not a replacement for Googlebot. |
| Is llms.txt published as a guidance surface? | It can help machine readers find concise context, but it is not an enforcement layer. |
| Are WordPress trap paths controlled without blocking public resources? | WordPress sites often waste crawl attention through feeds, reply parameters, internal search, WooCommerce carts, and account pages. |
| Does the site expose a clear AI usage policy? | A policy reduces ambiguity, but it must not be confused with crawler enforcement. |

Better Robots.txt treats the audit as a bridge between diagnosis and configuration. The scan tells you what is missing. The plugin gives WordPress users a safer path to implement the correction without manually editing server files.

What the checker analyzes

1. Robots.txt presence and validity

The first block checks whether /robots.txt exists, responds correctly, looks like a real robots.txt file, and contains parseable rules. It also evaluates whether the file exposes a Sitemap directive and whether obvious syntax or structure problems make interpretation harder.
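
The kind of inspection described above can be sketched in a few lines of Python. This is a minimal illustration, not the checker's actual implementation: the `RobotsSummary` type, the `summarize_robots` function, and the set of recognized field names are assumptions made for this example.

```python
from dataclasses import dataclass, field

# Fields this sketch chooses to recognize (an assumption, not a standard list).
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


@dataclass
class RobotsSummary:
    user_agents: list[str] = field(default_factory=list)
    rule_count: int = 0
    sitemaps: list[str] = field(default_factory=list)
    problems: list[str] = field(default_factory=list)


def summarize_robots(text: str) -> RobotsSummary:
    """Collect groups, rules, sitemap references, and obvious structure problems."""
    summary = RobotsSummary()
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            summary.problems.append(f"line {number}: no 'field: value' pair")
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name == "user-agent":
            summary.user_agents.append(value)
        elif name in ("allow", "disallow"):
            if not summary.user_agents:
                summary.problems.append(f"line {number}: rule before any User-agent")
            summary.rule_count += 1
        elif name == "sitemap":
            summary.sitemaps.append(value)
        elif name not in KNOWN_FIELDS:
            summary.problems.append(f"line {number}: unknown field '{name}'")
    if not summary.sitemaps:
        summary.problems.append("no Sitemap directive")
    return summary


SAMPLE = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap.xml
"""
summary = summarize_robots(SAMPLE)
```

A real audit would also fetch the file and check the HTTP status and content type; this sketch only looks at the text once it has been retrieved.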

This matters because robots.txt is scoped to the exact origin: scheme, host, and port. A file at https://example.com/robots.txt does not automatically govern https://www.example.com/, http://example.com/, or https://blog.example.com/; each of those origins needs its own file. Google’s documentation also reminds site owners that robots.txt is a crawl-access mechanism, not a security layer and not a reliable way to remove pages from search results. See Google’s robots.txt introduction and robots.txt specification notes.
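
Origin scoping is easy to see in code. The helper below is a small sketch (the name `robots_url` is chosen for this example) that derives the robots.txt location that applies to a given page URL.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin (scheme + host + port) of a page."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


pages = [
    "https://example.com/blog/post?x=1",
    "https://www.example.com/blog/post",
    "http://example.com/blog/post",
    "https://blog.example.com/",
]
# Four pages, four different origins, four different robots.txt files.
locations = {robots_url(page) for page in pages}
```

Each distinct result is a separate file that has to exist and be checked on its own.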

2. Search crawler baseline

The scan checks whether Googlebot, Bingbot, and other search-related crawlers appear to be accidentally blocked. A site can become too defensive when trying to control AI crawlers. The common mistake is to use a broad rule like Disallow: / against all agents, then assume visibility will survive because the intention was only to block model training.

Search crawler access is a baseline, not an advanced feature. If your public pages, media, CSS, JavaScript, and sitemap cannot be reached by the crawlers that matter to search visibility, everything else is secondary.
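
The overblocking mistake can be reproduced with Python's standard `urllib.robotparser`. This is a rough sketch rather than the checker's own logic; the function name and the two sample files are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_search_crawlers(robots_text, crawlers=("Googlebot", "Bingbot"), path="/"):
    """Return the search crawlers that may not fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, path)]


# Intended to stop model training, but written against every agent.
OVERBLOCKED = "User-agent: *\nDisallow: /\n"

# Restricts one training crawler and leaves search crawlers on the wildcard group.
TARGETED = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /wp-admin/\n"

overblocked_result = blocked_search_crawlers(OVERBLOCKED)
targeted_result = blocked_search_crawlers(TARGETED)
```

The first file blocks both search crawlers from the home page; the second blocks neither.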

3. AI crawler coverage

The checker evaluates whether major AI-related agents are explicitly addressed instead of being left to wildcard behavior. The scan currently focuses on operational families such as OpenAI, Anthropic, Google, Perplexity, Apple, Meta, ByteDance, Amazon, You.com, DuckAssist, Cohere, AI2, and Diffbot.

Explicit coverage does not mean blocking everything. It means the site is not silent. Silence is the weakest posture because it gives no practical clue about whether the owner wants AI search visibility, training exclusion, user-triggered retrieval, or a conservative no-AI stance.
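
One way to picture "silence" is to list which agent tokens a file never names. The sketch below assumes a watch list built from the user agents mentioned in this guide; the checker's real list and matching rules may differ, and the function names are hypothetical.

```python
# Tokens taken from the vendor families discussed in this guide (not exhaustive).
WATCHED = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "Google-Extended", "PerplexityBot", "Perplexity-User",
)


def named_agents(robots_text: str) -> set[str]:
    """Collect every User-agent value in the file, lowercased."""
    agents = set()
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def silent_agents(robots_text: str, watched=WATCHED) -> list[str]:
    """Return watched agents that are only covered by wildcard behavior, if at all."""
    named = named_agents(robots_text)
    return [agent for agent in watched if agent.lower() not in named]


WILDCARD_ONLY = "User-agent: *\nDisallow: /wp-admin/\n"
PARTIAL = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: OAI-SearchBot\nAllow: /\n"
```

Note that an agent counts as addressed whether it is allowed or disallowed; the point is that the file says something about it.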

4. Training, search, and user-triggered retrieval

AI crawler governance becomes useful only when it separates purposes: model training, search indexing, and user-triggered retrieval.

OpenAI documents different user agents including GPTBot, OAI-SearchBot, and ChatGPT-User. The practical implication is that blocking a training-related crawler is not the same thing as blocking a search crawler or a user-triggered fetcher. See OpenAI’s crawler documentation and publisher FAQ.

Anthropic also documents separate robots, including ClaudeBot, Claude-SearchBot, and Claude-User, with different purposes and consequences. See Anthropic’s site-owner crawler documentation.

Google-Extended is another distinct case. Google describes it as a standalone product token, without a separate HTTP user agent, and says it does not affect inclusion or ranking in Google Search. See Google’s common crawlers documentation.

Perplexity documents PerplexityBot for search-related discovery and Perplexity-User for user actions. Its documentation also recommends combining user-agent and IP-range verification for WAF-level controls. See Perplexity’s crawler documentation.
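
To make the separation concrete, here is a sketch of a training-restricted posture, checked with Python's standard `urllib.robotparser`. The grouping is one possible policy chosen for illustration, not a recommendation from any vendor, and the `posture` helper is a name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: restrict training-related tokens, keep search crawlers open.
TRAINING_RESTRICTED = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /wp-admin/
"""


def posture(robots_text: str, agents, path: str = "/") -> dict:
    """Map each agent to whether it may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


result = posture(
    TRAINING_RESTRICTED,
    ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "Googlebot"],
)
```

In this sketch GPTBot and ClaudeBot are refused while OAI-SearchBot, Claude-SearchBot, and Googlebot stay allowed. User-triggered fetchers such as ChatGPT-User are not named here and fall back to the wildcard group.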

5. llms.txt guidance

The checker tests whether the site publishes llms.txt. This file should be framed correctly. It is not a crawler-enforcement mechanism and it is not a ranking guarantee. It is a proposed Markdown guidance file intended to help language-model-powered tools find concise context and relevant links. The original proposal describes it as a way to provide information that an LLM may want to retrieve while assembling context. See the Answer.AI proposal and the llms.txt project site.

In the Better Robots.txt model, llms.txt is valuable because it improves discoverability of the right documentation, not because it magically forces citation.
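
As a rough illustration of what such a guidance file contains, the sketch below pulls out the parts the proposal describes: a top-level title, a short blockquote summary, second-level sections, and Markdown links. The function name, the sample content, and the example.com URLs are assumptions for this example, and the parsing is deliberately loose.

```python
import re

# Matches Markdown links of the form [label](url).
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def inspect_llms_txt(text: str) -> dict:
    """Extract the title, summary, section names, and linked URLs from llms.txt text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    has_title = bool(lines) and lines[0].startswith("# ")
    return {
        "title": lines[0][2:].strip() if has_title else None,
        "summary": next((l[1:].strip() for l in lines if l.startswith(">")), None),
        "sections": [l[3:].strip() for l in lines if l.startswith("## ")],
        "links": [url for _label, url in LINK.findall(text)],
    }


SAMPLE = """\
# Example Site

> Short description of what the site covers.

## Docs

- [Getting started](https://example.com/docs/start.md): setup steps
- [AI usage policy](https://example.com/ai-policy.md): how content may be used
"""
report = inspect_llms_txt(SAMPLE)
```

A missing title or an empty link list is a hint that the file will not help a machine reader much, even if it exists.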

6. AI governance files and policy pointers

The scan also checks for supporting governance surfaces:

  • ai-manifest.json;
  • .well-known/ai-governance.json;
  • .well-known/llm-policy.json;
  • .well-known/interpretation-policy.json;
  • an AI usage policy page;
  • a machine-readable policy pointer from robots.txt.

These files are not treated as universal enforcement standards. They are governance signals. They help document intent, reduce ambiguity, and create a clearer relationship between crawler rules, policy, and machine-readable guidance.
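
A scan of these surfaces starts by working out which URLs to probe. The sketch below builds that list for a given site; it assumes ai-manifest.json lives at the site root, which the list above does not state, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Paths from the list above. Placing ai-manifest.json at the root is an assumption.
GOVERNANCE_PATHS = (
    "/ai-manifest.json",
    "/.well-known/ai-governance.json",
    "/.well-known/llm-policy.json",
    "/.well-known/interpretation-policy.json",
)


def governance_urls(site_url: str) -> list[str]:
    """Return the governance file URLs to check for the origin of `site_url`."""
    parts = urlsplit(site_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    return [origin + path for path in GOVERNANCE_PATHS]


candidates = governance_urls("https://example.com/blog/some-post/")
```

Fetching each candidate and recording which ones respond would be the next step; that part is left out to keep the sketch offline.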

7. WordPress crawl hygiene

WordPress is the primary correction surface for Better Robots.txt. The checker looks for patterns that often create crawl waste or accidental ambiguity:

  • /wp-admin/ and admin-only routes;
  • internal search pages;
  • comment reply parameters;
  • feeds when they create low-value crawl loops;
  • WooCommerce cart, checkout, and account paths;
  • parameterized filters and faceted paths;
  • public media and asset access;
  • social preview crawlers and advertising files.

The objective is not to make WordPress invisible. The objective is to keep public content discoverable while reducing low-value crawler paths.
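
The balance between blocking trap paths and keeping public resources open can be sketched as follows. The rules assume default WooCommerce slugs (/cart/, /checkout/, /my-account/) and are not the plugin's actual preset. Search and reply parameters usually need wildcard patterns, which Python's `urllib.robotparser` does not evaluate, so this example sticks to path prefixes. The Allow line comes first because this parser applies rules in file order.

```python
from urllib.robotparser import RobotFileParser

# Illustrative baseline only; slugs and sitemap URL are assumptions.
WORDPRESS_BASELINE = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Sitemap: https://example.com/sitemap.xml
"""


def allowed(robots_text: str, path: str, agent: str = "Googlebot") -> bool:
    """Check whether `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)


checks = {
    path: allowed(WORDPRESS_BASELINE, path)
    for path in (
        "/blog/hello-world/",            # public content stays open
        "/wp-content/uploads/logo.png",  # public media stays open
        "/wp-admin/admin-ajax.php",      # front-end AJAX endpoint stays open
        "/wp-admin/options.php",         # admin-only route is blocked
        "/cart/",                        # cart is blocked
    )
}
```

Running the checks before and after a change is a cheap way to confirm that a new rule did not block public assets by accident.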

How to interpret your result

The score is not a legal opinion, not a guarantee of obedience, and not a ranking promise. It is a maturity indicator.

| Audit state | Interpretation | Best next action |
| --- | --- | --- |
| Missing or unreachable robots.txt | Crawlers cannot reliably read your baseline posture. | Publish a clean robots.txt and include sitemap references. |
| Search crawler risk | Classic search visibility may be harmed by overblocking. | Separate search crawlers from training or AI-specific rules. |
| AI silent | Major AI crawler families are not explicitly addressed. | Define whether your posture is open, training-restricted, or privacy-first. |
| Policy missing | The site has rules, but little explanation of intent. | Publish an AI usage policy and link it from machine-readable surfaces. |
| WordPress traps present | The site may waste crawl attention on low-value routes. | Use Better Robots.txt presets, preview changes, then re-scan. |
| Strong governance | The site is explicit and coherent. | Monitor drift and re-audit after major site changes. |

Why the plugin matters after the scan

The scan identifies the gap. WordPress users still need a safe way to fix it. Manual edits to robots.txt can be fragile when the file is generated dynamically by WordPress, an SEO plugin, a server rule, or a host-level configuration.

Better Robots.txt is the fix layer. It lets WordPress site owners configure crawler posture, AI crawler controls, sitemap signals, WordPress defaults, llms.txt, and previewable output from inside WordPress. That is the conversion path:

```txt
scan → understand the issue → install Better Robots.txt → apply a safer preset → preview output → re-scan
```

The strongest value is not the scan alone. It is the closed loop between external audit and WordPress correction.

FAQ

Does robots.txt block AI training?

It can express crawl restrictions for documented crawlers that respect robots.txt. It is not a universal enforcement mechanism, and it is not a security system. For stronger enforcement, you need verified bot identity, server logs, WAF rules, contractual controls, or access restrictions.
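
Verified bot identity usually means checking more than the User-Agent header, since that header is trivial to spoof. The sketch below pairs a user-agent token with a set of IP ranges. The `examplebot` token and the ranges are placeholders drawn from documentation-reserved address space, not any vendor's real data; a real deployment would load the ranges each vendor publishes.

```python
import ipaddress

# Placeholder data: hypothetical token, documentation-reserved ranges.
PUBLISHED_RANGES = {
    "examplebot": ("192.0.2.0/24", "2001:db8::/32"),
}


def is_verified(user_agent: str, remote_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """True only if the user agent names a known bot AND the IP is in its ranges."""
    ip = ipaddress.ip_address(remote_ip)
    agent = user_agent.lower()
    for token, cidrs in ranges.items():
        if token in agent:
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # unknown agents are never treated as verified


genuine = is_verified("ExampleBot/1.0", "192.0.2.10")
spoofed = is_verified("ExampleBot/1.0", "203.0.113.5")
```

A request that claims the bot's name from an address outside the published ranges fails the check, which is the case a WAF rule would act on.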

Can I block AI training but keep AI search visibility?

Often, yes, if the vendor documents separate crawlers for training and search. For OpenAI, for example, GPTBot and OAI-SearchBot serve different roles. For Anthropic, ClaudeBot and Claude-SearchBot are also distinct. The audit helps you see whether that distinction appears in your file.

Does llms.txt improve AI rankings?

No guarantee should be made. llms.txt is a guidance layer. It can make the most useful content easier to discover and summarize, but it should not be sold as a ranking factor or citation guarantee.

Is the checker only for WordPress?

The scan can be used on any public domain. The fix layer is primarily WordPress because Better Robots.txt is a WordPress plugin. Non-WordPress sites can still use the audit as a manual implementation guide.