How to block AI training without blocking AI search

Many site owners say they want to “block AI”. What they often mean is more specific: they want to reduce or restrict training-related crawling while keeping search visibility, AI search discoverability, and user-requested retrieval intact.

That is a reasonable posture, but it requires precision.

The dangerous shortcut

User-agent: *
Disallow: /

This may block training crawlers that respect robots.txt, but it also blocks search crawlers, AI search crawlers, preview bots, and other useful agents. For a public site, that is usually too blunt.
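The effect of the blanket rule can be checked locally. The sketch below uses Python's standard-library `urllib.robotparser`; the helper name `allowed_agents` is an assumption made for illustration, and Python's parser only approximates how any given vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule from the example above.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="/"):
    """Return {agent: may_fetch} for each user-agent token (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Training, classic search, and AI search agents all get the same answer.
result = allowed_agents(
    BLANKET_BLOCK, ["GPTBot", "Googlebot", "Bingbot", "OAI-SearchBot"]
)
for agent, ok in result.items():
    print(agent, "allowed" if ok else "blocked")
```

Every agent in the list is reported as blocked, which is the problem: the rule cannot tell a training crawler from a search crawler.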

The better model

A more precise policy separates crawler purposes:

# Keep AI search visibility where desired
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Restrict documented training-related crawlers where desired
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: AI2Bot
Disallow: /

This is an example, not a universal prescription. The correct file depends on the site’s goals, jurisdiction, content model, and risk profile.

Each restricted user agent above is documented by its provider as training-related and belongs to the training_crawlers_or_tokens category, while the allowed ones are search or answer crawlers. Google-Extended is a robots.txt control token rather than a separate crawler, which is why the category covers tokens as well as crawlers. That mapping, with each provider's source, is maintained in the bot taxonomy and in the machine-readable registry at /bot-registry.json. The same distinction appears in the plugin as the AI training, AI search and discovery, and user-triggered access controls.
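A policy like the example above can be sanity-checked before publishing. This is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `check_policy` and the `https://example.com/` URL are assumptions for illustration. Python's parser matches user-agent tokens by substring and has no knowledge of vendor-specific behavior, so treat the result as a syntax and intent check, not as proof of how each crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# The purpose-separated policy from this article, reproduced as a string.
POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: AI2Bot
Disallow: /
"""


def check_policy(robots_txt, agents, url="https://example.com/"):
    """Return {agent: may_fetch} under the given robots.txt (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


search_agents = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
training_agents = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "AI2Bot"]
unlisted_agents = ["Googlebot", "Bingbot"]  # no matching group, so allowed by default

report = check_policy(POLICY, search_agents + training_agents + unlisted_agents)
for agent, ok in report.items():
    print(agent, "allowed" if ok else "blocked")
```

Because the file has no `User-agent: *` group, agents that are not listed, such as Googlebot and Bingbot, fall through to the default and remain allowed.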

Decision process

1. Decide whether the site wants AI search visibility.
2. Identify which vendors document separate training and search crawlers.
3. Preserve Googlebot and Bingbot unless the site is intentionally private.
4. Add explicit rules for training-related crawlers.
5. Add llms.txt and policy context so the posture is legible.
6. Re-scan after publishing.
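The core of this process, one decision per crawler purpose followed by explicit rules, can be sketched as a small generator. Everything here is illustrative: the `CRAWLER_FAMILIES` grouping and the `render_robots` function are assumptions, with agent names taken from the example policy in this article. It is not how Better Robots.txt generates its output.

```python
# Hypothetical grouping of user-agent tokens by purpose.
CRAWLER_FAMILIES = {
    "ai_search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "ai_training": ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "AI2Bot"],
}


def render_robots(decisions):
    """Render one robots.txt group per agent.

    decisions maps a family name to True (Allow: /) or False (Disallow: /).
    Agents that are not listed, such as Googlebot and Bingbot, get no group
    and therefore keep their default access.
    """
    lines = []
    for family, allow in decisions.items():
        rule = "Allow: /" if allow else "Disallow: /"
        lines.append(f"# {family}")
        for agent in CRAWLER_FAMILIES[family]:
            lines += [f"User-agent: {agent}", rule, ""]
    return "\n".join(lines).rstrip() + "\n"


print(render_robots({"ai_search": True, "ai_training": False}))
```

Keeping the decision at the family level means a posture change is a single flag flip instead of a hand edit across many groups.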

When this posture makes sense

This is often useful for:

- publishers that want citations but not broad training reuse;
- brands that want ChatGPT, Claude, Perplexity, and Google ecosystems to discover canonical public pages;
- agencies that need a defensible configuration for clients;
- WordPress sites that want a practical preset instead of a hand-edited file.

When it is not enough

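If the content is confidential, robots.txt is not enough. Use authentication, access control, or server-level restrictions. A robots.txt file only publishes crawl instructions that compliant crawlers choose to honor; it does not stop anyone from requesting a URL, so it is not a private-content protection system.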

WordPress implementation

Better Robots.txt can make this posture easier to manage by separating crawler families in the WordPress admin, previewing the generated output, and allowing re-audit after deployment.

Implementation checklist

Use the audit as an implementation sequence, not as a decorative score.

1. Confirm the audited origin: protocol, host, and subdomain must match the site you actually want to govern.
2. Preserve search access unless the site is intentionally private.
3. Decide whether the goal is maximum AI visibility, training restriction, conservative publishing, or strict privacy.
4. Configure crawler families by purpose rather than by emotion.
5. Publish policy context only when it is coherent with the active rules.
6. Re-scan after changes because a generated WordPress robots.txt file can be modified by plugins, cache, server rules, or edge middleware.
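The origin check in the first item matters because a robots.txt file governs only the scheme, host, and port it is served from. A minimal sketch of that comparison follows; the function names `origin` and `same_origin` and the example URLs are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Default ports, so https://example.com and https://example.com:443 compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port) for a URL, with default ports filled in."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(audited_url, intended_url):
    """True when both URLs share protocol, host (including subdomain), and port."""
    return origin(audited_url) == origin(intended_url)


print(same_origin("https://example.com/robots.txt", "https://example.com:443/"))
print(same_origin("https://www.example.com/", "https://example.com/"))
print(same_origin("http://example.com/", "https://example.com/"))
```

The first comparison matches; the second and third do not, because a `www` subdomain or a different protocol is a different origin with its own robots.txt.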

Manual spot check

A technical reviewer can validate the audit manually by requesting these paths on the audited origin:

/robots.txt
/llms.txt
/ai-manifest.json
/.well-known/ai-governance.json
/.well-known/llm-policy.json

Then compare the result with the public pages, sitemap, and WordPress configuration. The important question is not only whether each file exists. It is whether those files express the same intent. A robots.txt block, a permissive llms.txt, and a contradictory AI policy create a weak governance layer even if each file loads successfully.
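The request step of the spot check can be scripted with the Python standard library. This is a sketch under stated assumptions: `spot_check_urls` and `fetch_status` are hypothetical helper names, `https://example.com` is a placeholder origin, and the script only reports HTTP status codes. Judging whether the files express the same intent still requires reading them.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

# The paths listed in the manual spot check above.
GOVERNANCE_PATHS = [
    "/robots.txt",
    "/llms.txt",
    "/ai-manifest.json",
    "/.well-known/ai-governance.json",
    "/.well-known/llm-policy.json",
]


def spot_check_urls(origin):
    """Build the absolute URLs to request for a given origin."""
    return [urljoin(origin, path) for path in GOVERNANCE_PATHS]


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url; 4xx and 5xx are returned, not raised."""
    request = urllib.request.Request(url, headers={"User-Agent": "spot-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

A typical use is to loop over `spot_check_urls("https://example.com")`, print `fetch_status(url)` next to each URL, and then open the files that returned 200 to compare their stated intent.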

Conversion path for WordPress

If the site is WordPress, the practical next step is not a spreadsheet of recommendations. It is a configuration pass inside Better Robots.txt: choose the closest preset, adjust crawler families, preview the output, publish, and re-run the external scan. That is what turns the audit from education into proof.
