How to block AI training without blocking AI search
Many site owners say they want to “block AI”. What they often mean is more specific: they want to reduce or restrict training-related crawling while keeping search visibility, AI search discoverability, and user-requested retrieval intact.
That is a reasonable posture, but it requires precision.
The dangerous shortcut
User-agent: *
Disallow: /
This may block training crawlers that respect robots.txt, but it also blocks search crawlers, AI search crawlers, preview bots, and other useful agents. For a public site, that is usually too blunt.
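To see the effect concretely, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `example.com` URL is a placeholder, and the parser only models crawlers that honor robots.txt; it says nothing about agents that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule discussed in this section.
BLANKET_RULES = """\
User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user agent, whether it may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Placeholder URL; under a blanket rule every path gives the same answer.
result = allowed_agents(
    BLANKET_RULES,
    ["GPTBot", "Googlebot", "OAI-SearchBot"],
    "https://example.com/article",
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Under this rule the training crawler, the search crawler, and the AI search crawler all get the same answer, which is the bluntness described above.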
The better model
A more precise policy separates crawler purposes:
# Keep AI search visibility where desired
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Restrict documented training-related crawlers where desired
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
This is an example, not a universal prescription. The correct file depends on the site’s goals, jurisdiction, content model, and risk profile.
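One way to sanity-check a purpose-based policy before publishing is to state the intended outcome per crawler and compare it with what a parser reports. The sketch below uses Python's `urllib.robotparser`; the `policy_mismatches` helper and the `EXPECTED` table are illustrative names, not part of any tool. The standard-library parser matches user-agent tokens by substring and uses the first matching group, which may differ from how a given vendor's crawler resolves rules. Google documents Google-Extended as a control token honored by its existing crawlers rather than a separate HTTP user agent, so the check only confirms how the rule parses.

```python
from urllib.robotparser import RobotFileParser

# The example policy from this section, reproduced verbatim.
SPLIT_POLICY = """\
# Keep AI search visibility where desired
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Restrict documented training-related crawlers where desired
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
"""

# Intended outcome per token (True = may crawl). Googlebot and Bingbot have
# no group in the file, so they should fall through to "allowed".
EXPECTED = {
    "OAI-SearchBot": True,
    "Claude-SearchBot": True,
    "PerplexityBot": True,
    "GPTBot": False,
    "ClaudeBot": False,
    "Google-Extended": False,
    "Googlebot": True,
    "Bingbot": True,
}


def policy_mismatches(robots_txt, expected, url="https://example.com/"):
    """List the agents whose parsed permission differs from the intended one."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        agent
        for agent, want in expected.items()
        if parser.can_fetch(agent, url) != want
    ]


mismatches = policy_mismatches(SPLIT_POLICY, EXPECTED)
print(mismatches or "policy matches expectations")
```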
Decision process
- Decide whether the site wants AI search visibility.
- Identify which vendors document separate training and search crawlers.
- Preserve Googlebot and Bingbot unless the site is intentionally private.
- Add explicit rules for training-related crawlers.
- Add llms.txt and policy context so the posture is legible.
- Re-scan after publishing.
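The decision steps above can be sketched as a small generator: tag each crawler token with a purpose, decide which purposes are allowed, and emit the groups. This is an illustration only. The `Crawler` class, the purpose labels, and `render_robots_txt` are hypothetical names invented for this sketch, and the token list should be checked against each vendor's current documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crawler:
    token: str    # robots.txt user-agent token
    purpose: str  # "search" or "training"; labels invented for this sketch


# Tokens taken from the example policy in this article.
CRAWLERS = [
    Crawler("OAI-SearchBot", "search"),
    Crawler("Claude-SearchBot", "search"),
    Crawler("PerplexityBot", "search"),
    Crawler("GPTBot", "training"),
    Crawler("ClaudeBot", "training"),
    Crawler("Google-Extended", "training"),
]


def render_robots_txt(crawlers, allowed_purposes):
    """Emit one group per crawler: Allow if its purpose is allowed, else Disallow."""
    lines = []
    for crawler in crawlers:
        directive = "Allow" if crawler.purpose in allowed_purposes else "Disallow"
        lines.append(f"User-agent: {crawler.token}")
        lines.append(f"{directive}: /")
        lines.append("")  # a blank line separates groups
    return "\n".join(lines)


# Keep AI search visibility, restrict training-related crawling.
robots_txt = render_robots_txt(CRAWLERS, allowed_purposes={"search"})
print(robots_txt)
```

Googlebot and Bingbot are deliberately absent from the list: with no group naming them and no wildcard group, they stay allowed by default.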
When this posture makes sense
This is often useful for:
- publishers that want citations but not broad training reuse;
- brands that want ChatGPT, Claude, Perplexity, and Google ecosystems to discover canonical public pages;
- agencies that need a defensible configuration for clients;
- WordPress sites that want a practical preset instead of a hand-edited file.
When it is not enough
If the content is confidential, robots.txt is not enough. Robots.txt carries crawl instructions that compliant crawlers choose to honor, and the file itself is publicly readable. It is not a private-content protection system. Use authentication, access control, or server-level restrictions.
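The difference is that access control is enforced by the server, whatever the client claims to be. As a rough illustration, the sketch below checks an HTTP Basic `Authorization` header and returns a 401 status when it is missing or wrong. The function names and the credentials are hypothetical, and a real deployment would normally rely on the web server's or framework's own authentication rather than hand-rolled code.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_user, expected_password):
    """Validate an HTTP Basic Authorization header value."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(authorization_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # Constant-time comparisons avoid leaking how much of a value matched.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


def status_for(headers, expected_user="editor", expected_password="s3cret"):
    """Return 200 for authenticated requests and 401 for everything else."""
    if is_authorized(headers.get("Authorization"), expected_user, expected_password):
        return 200
    return 401


# A crawler without credentials is refused regardless of its user agent
# or of what robots.txt says.
crawler_status = status_for({"User-Agent": "GPTBot"})

token = base64.b64encode(b"editor:s3cret").decode("ascii")
editor_status = status_for({"Authorization": f"Basic {token}"})
print(crawler_status, editor_status)
```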
WordPress implementation
Better Robots.txt can make this posture easier to manage by separating crawler families in the WordPress admin, previewing the generated output, and allowing re-audit after deployment.
Implementation checklist
Use the audit as an implementation sequence, not as a decorative score.
- Confirm the audited origin: protocol, host, and subdomain must match the site you actually want to govern.
- Preserve search access unless the site is intentionally private.
- Decide whether the goal is maximum AI visibility, training restriction, conservative publishing, or strict privacy.
- Configure crawler families by purpose (search, training, user-requested retrieval) rather than by emotion.
- Publish policy context only when it is coherent with the active rules.
- Re-scan after changes because a generated WordPress robots.txt file can be modified by plugins, cache, server rules, or edge middleware.
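Two items in this checklist lend themselves to a quick script: confirming the audited origin and detecting drift after publishing. The sketch below assumes you have the intended robots.txt text and the text actually served. `same_origin`, `normalized_rules`, and `has_drifted` are hypothetical helper names, not functions of any plugin.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url):
    """Return (scheme, host, port), filling in the scheme's default port."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(audited_url, intended_url):
    """True only if protocol, host (including subdomain), and port all match."""
    return origin_of(audited_url) == origin_of(intended_url)


def normalized_rules(robots_txt):
    """Drop comments, blank lines, and padding so cosmetic edits are ignored."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            rules.append(line)
    return rules


def has_drifted(intended, served):
    """True if the served rules differ from the intended ones."""
    return normalized_rules(intended) != normalized_rules(served)


print(same_origin("https://www.example.com/robots.txt", "https://example.com/"))
print(has_drifted("User-agent: GPTBot\nDisallow: /\n",
                  "User-agent: GPTBot\nDisallow: /\nUser-agent: *\nDisallow: /\n"))
```

The second call models a plugin or edge rule appending a wildcard block after publication, which is the kind of change a re-scan is meant to catch.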
Manual spot check
A technical reviewer can validate the audit manually by requesting these URLs:
/robots.txt
/llms.txt
/ai-manifest.json
/.well-known/ai-governance.json
/.well-known/llm-policy.json
Then compare the result with the public pages, sitemap, and WordPress configuration. The important question is not only whether each file exists. It is whether those files express the same intent. A robots.txt block, a permissive llms.txt, and a contradictory AI policy create a weak governance layer even if each file loads successfully.
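The existence half of that check is easy to script. The following is a minimal sketch using only Python's standard library; the user-agent string and function names are made up for illustration, and the fetcher is injectable so the logic can be exercised without network access. It reports HTTP status only. Whether the files agree with each other still needs a human read.

```python
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# The paths listed in this section.
GOVERNANCE_PATHS = [
    "/robots.txt",
    "/llms.txt",
    "/ai-manifest.json",
    "/.well-known/ai-governance.json",
    "/.well-known/llm-policy.json",
]


def http_status(url, timeout=10.0):
    """Return the HTTP status for `url`, or None if the request fails outright."""
    # Hypothetical user-agent string for this sketch.
    request = Request(url, headers={"User-Agent": "governance-spot-check/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # e.g. 404 when the file is absent
    except OSError:
        return None  # DNS failures, refused connections, timeouts


def spot_check(origin, fetch_status=http_status):
    """Map each governance path to the status returned for it at `origin`."""
    return {path: fetch_status(urljoin(origin, path)) for path in GOVERNANCE_PATHS}


def missing_files(statuses):
    """Paths that did not return 200."""
    return [path for path, status in statuses.items() if status != 200]
```

Calling `spot_check("https://example.com")` with your own origin in place of the placeholder performs the five requests; `missing_files` then lists whatever did not load.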
Conversion path for WordPress
If the site is WordPress, the practical next step is not a spreadsheet of recommendations. It is a configuration pass inside Better Robots.txt: choose the closest preset, adjust crawler families, preview the output, publish, and re-run the external scan. That is what turns the audit from education into proof.