
AI crawler coverage check

AI crawler coverage measures whether a site is explicit about the automated systems that now read, retrieve, summarize, train on, or preview public web content.

A site can have a valid robots.txt file and still be strategically silent toward AI crawlers. That silence is the issue. It leaves important crawler families to inherit generic wildcard rules, even when the owner would prefer a more precise posture.

Coverage is not the same as blocking

The checker does not reward blind blocking. It rewards clarity.

A good policy might allow OAI-SearchBot while disallowing GPTBot. Another site might allow both. A private site might disallow both. The important point is that the file expresses a deliberate choice instead of leaving everything to User-agent: *.
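As a minimal sketch of such a deliberate choice, the following uses Python's standard urllib.robotparser to evaluate a hypothetical robots.txt that allows OAI-SearchBot, disallows GPTBot, and leaves every other agent on the wildcard group. The file contents, the example.com URL, and the is_allowed helper are assumptions made for illustration. The standard-library parser may also differ from how an individual vendor's crawler interprets the same file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: AI search allowed, AI training disallowed,
# every other agent falls back to the wildcard group.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, agent, url="https://example.com/article"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("OAI-SearchBot", "GPTBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent))
```

Here the two OpenAI agents receive different answers because each has its own group, while SomeOtherBot (a made-up name) is handled by User-agent: *.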

Crawler families the audit looks for

| Family | Examples | Typical audit question |
| --- | --- | --- |
| OpenAI | GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot | Does the site distinguish training, search, user fetch, and ads-related access? |
| Anthropic | ClaudeBot, Claude-SearchBot, Claude-User, Claude-Web | Does the site separate search visibility from model-development crawling? |
| Google | Googlebot, Google-Extended, GoogleOther | Does the site avoid confusing Google Search with downstream AI usage controls? |
| Perplexity | PerplexityBot, Perplexity-User | Does the site distinguish search discovery from user-triggered actions? |
| Apple | Applebot, Applebot-Extended | Does the site separate search-related Apple crawling from extended usage controls? |
| Other AI families | Meta, ByteDance, Amazon, You.com, DuckAssist, Cohere, AI2, Diffbot | Does the site avoid being completely silent toward major AI-related access? |

Why wildcard rules are weak

This file is technically valid:

```txt
User-agent: *
Disallow:
```

But it says little about intent. It does not tell an operator, consultant, legal reviewer, or machine reader whether the site welcomes AI search, rejects training, accepts user-triggered retrieval, or simply never considered the question.

Coverage improves the file by turning silence into declared posture.
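To make the idea of silence concrete, here is a rough sketch that lists which AI crawler tokens a robots.txt never names, meaning they fall through to the wildcard group. The helper names (named_user_agents, inherits_wildcard) and the short agent list are assumptions for illustration. Matching is exact and case-insensitive, which is simpler than what real crawlers may do.

```python
WILDCARD_ONLY = """\
User-agent: *
Disallow:
"""


def named_user_agents(robots_txt):
    """Return the lowercased User-agent tokens in the file, excluding '*'."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            token = value.strip().lower()
            if token and token != "*":
                agents.add(token)
    return agents


def inherits_wildcard(robots_txt, agents):
    """Return the agents that have no group of their own in the file."""
    named = named_user_agents(robots_txt)
    return [agent for agent in agents if agent.lower() not in named]


if __name__ == "__main__":
    print(inherits_wildcard(WILDCARD_ONLY, ["GPTBot", "OAI-SearchBot", "ClaudeBot"]))
```

For the wildcard-only file above, all three agents are returned: the file is valid, but none of them is addressed by name.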

The three categories that matter most

Training crawlers

These are the crawlers site owners usually have in mind when they say they want to block AI training. Examples include GPTBot and ClaudeBot, depending on the vendor’s documented purpose.

AI search and answer crawlers

These crawlers help AI systems discover, summarize, cite, or link to public web content. Examples include OAI-SearchBot, Claude-SearchBot, and PerplexityBot. Blocking them may reduce AI search visibility.

User-triggered fetchers

These systems fetch content because a user explicitly requested something. Examples include ChatGPT-User, Claude-User, and Perplexity-User. Some vendors treat these differently from ordinary crawlers, so a robots.txt-only analysis should be careful not to overstate enforcement.
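The three categories can be turned into a small posture report. The sketch below groups the example agents named in this section by purpose and asks Python's urllib.robotparser what a given robots.txt says for each one. The CATEGORIES mapping, the SAMPLE policy, and the posture_by_category helper are illustrative assumptions. The result describes what the file requests, not what any vendor actually enforces, especially for user-triggered fetchers.

```python
from urllib.robotparser import RobotFileParser

# Example agents from the text, grouped by documented purpose (illustrative).
CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user-triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}

# Hypothetical training-restricted policy.
SAMPLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:
"""


def posture_by_category(robots_txt, url="https://example.com/"):
    """Map each category to {agent: allowed?} according to robots.txt alone."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        category: {agent: parser.can_fetch(agent, url) for agent in agents}
        for category, agents in CATEGORIES.items()
    }


if __name__ == "__main__":
    for category, result in posture_by_category(SAMPLE).items():
        print(category, result)
```

With the sample policy, the training agents are reported as disallowed while the search and user-triggered agents inherit the permissive wildcard group.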

How the checker scores coverage

The scan looks for explicit references to major AI crawler families and then checks whether the resulting posture is coherent. A site that addresses only one vendor receives less confidence than a site that handles multiple major families. A site that blocks everything without preserving search discovery may receive warnings even if it is technically explicit.
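The checker's actual scoring rules are not described here, so the following is only a hypothetical illustration of the first half of that idea: counting how many major families a robots.txt names explicitly. The FAMILIES mapping reuses the tokens from the table above, and covered_families and coverage_ratio are invented names. It does not attempt the coherence checks or warnings mentioned in the paragraph.

```python
import re

# Tokens per family, taken from the table above (illustrative, not exhaustive).
FAMILIES = {
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "OAI-AdsBot"],
    "Anthropic": ["ClaudeBot", "Claude-SearchBot", "Claude-User", "Claude-Web"],
    "Google": ["Googlebot", "Google-Extended", "GoogleOther"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
    "Apple": ["Applebot", "Applebot-Extended"],
}

USER_AGENT_LINE = re.compile(r"(?im)^[ \t]*user-agent[ \t]*:[ \t]*([^#\s]+)")


def covered_families(robots_txt):
    """Return the families with at least one token named in a User-agent line."""
    named = {token.lower() for token in USER_AGENT_LINE.findall(robots_txt)}
    return [
        family
        for family, tokens in FAMILIES.items()
        if any(token.lower() in named for token in tokens)
    ]


def coverage_ratio(robots_txt):
    """Hypothetical score: share of families addressed explicitly (0.0 to 1.0)."""
    return len(covered_families(robots_txt)) / len(FAMILIES)


if __name__ == "__main__":
    one_vendor = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"
    print(covered_families(one_vendor), coverage_ratio(one_vendor))
```

A file that names only GPTBot covers one family out of five under this toy measure, which mirrors the point that single-vendor policies earn less confidence than multi-family ones.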

What to fix after a weak score

  1. Decide the intended posture: AI-visible, training-restricted, privacy-first, or conservative publisher.
  2. Separate search crawlers from training crawlers where vendors document that distinction.
  3. Preserve Googlebot and Bingbot unless the site is intentionally private.
  4. Add llms.txt and policy context if the site wants better machine-readable guidance.
  5. Use Better Robots.txt if the site runs on WordPress, then re-scan.

Implementation checklist

Use the audit as an implementation sequence, not as a decorative score.

  1. Confirm the audited origin: protocol, host, and subdomain must match the site you actually want to govern.
  2. Preserve search access unless the site is intentionally private.
  3. Decide whether the goal is maximum AI visibility, training restriction, conservative publishing, or strict privacy.
  4. Configure crawler families by documented purpose (training, search, user-triggered fetch) rather than by blanket reaction.
  5. Publish policy context only when it is coherent with the active rules.
  6. Re-scan after changes because a generated WordPress robots.txt file can be modified by plugins, cache, server rules, or edge middleware.
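Step 1 of the checklist can be checked mechanically. The sketch below compares the protocol, host, and port of two URLs using Python's urllib.parse, so that https://example.com and https://www.example.com are treated as different origins. The helper names and the example URLs are assumptions for illustration only.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port), filling in the default port when omitted."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def same_origin(audited_url, intended_url):
    """True only if protocol, host (including subdomain), and port all match."""
    return origin(audited_url) == origin(intended_url)


if __name__ == "__main__":
    print(same_origin("https://example.com/robots.txt", "https://www.example.com/"))
```

A mismatch here usually means the scan governed a different robots.txt than the one visitors and crawlers actually receive.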

Manual spot check

A technical reviewer can validate the audit manually by requesting these URLs:

```txt
/robots.txt
/llms.txt
/ai-manifest.json
/.well-known/ai-governance.json
/.well-known/llm-policy.json
```

Then compare the result with the public pages, sitemap, and WordPress configuration. The important question is not only whether each file exists. It is whether those files express the same intent. A robots.txt block, a permissive llms.txt, and a contradictory AI policy create a weak governance layer even if each file loads successfully.
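The spot check can be scripted. The following is a minimal sketch using only Python's standard library: it requests each path on a given origin and records the HTTP status code. The function names, the User-Agent string, and the timeout are assumptions for illustration. It only reports whether each file loads, so comparing the files' intent still has to be done by a reviewer.

```python
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen

SPOT_CHECK_PATHS = [
    "/robots.txt",
    "/llms.txt",
    "/ai-manifest.json",
    "/.well-known/ai-governance.json",
    "/.well-known/llm-policy.json",
]


def http_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if the request fails."""
    request = Request(url, headers={"User-Agent": "spot-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 404 when the file does not exist
    except OSError:
        return None  # DNS failure, refused connection, timeout, and similar


def spot_check(origin, fetch=http_status):
    """Map each spot-check path to the status returned by `fetch`."""
    return {path: fetch(urljoin(origin, path)) for path in SPOT_CHECK_PATHS}


if __name__ == "__main__":
    for path, status in spot_check("https://example.com").items():
        print(path, status)
```

The fetch parameter exists so the network call can be swapped for a stub when testing, or for a client that sends a specific crawler User-Agent.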

Conversion path for WordPress

If the site is WordPress, the practical next step is not a spreadsheet of recommendations. It is a configuration pass inside Better Robots.txt: choose the closest preset, adjust crawler families, preview the output, publish, and re-run the external scan. That is what turns the audit from education into proof.