Robots.txt presence and validity check
Every AI crawler governance audit starts with the same baseline: can machines actually read a usable robots.txt file for this origin?
The question sounds simple, but many sites fail it in subtle ways: wrong host, forced redirect, generated HTML shell, empty file, stale cache, syntax noise, missing sitemap, or a file that exists on www but not on the apex domain.
What the scan verifies
| Check | Meaning |
|---|---|
| HTTP availability | /robots.txt returns a usable response. |
| File-like content | The body looks like robots.txt, not a generic HTML page. |
| Basic directives | The file includes recognizable User-agent, Allow, Disallow, or Sitemap lines. |
| Sitemap reference | Crawlers can discover sitemap URLs directly from robots.txt. |
| Origin scope | The audited protocol and host are the ones the user intended. |
| Parseability | Rules are structured enough to be interpreted. |
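The checks in this table can be approximated offline against a response that has already been fetched. The following is a minimal sketch, not the scanner's actual logic: the function name `check_robots_response`, the result keys, and the HTML-detection heuristic are all assumptions made for illustration.

```python
import re

# Directive names come from the table above; matching is case-insensitive.
DIRECTIVE_RE = re.compile(r"^\s*(user-agent|allow|disallow|sitemap)\s*:", re.IGNORECASE)
# Heuristic (an assumption): a body containing these tags is treated as an HTML page.
HTML_RE = re.compile(r"<\s*(!doctype|html|head|body|script)\b", re.IGNORECASE)


def check_robots_response(status: int, body: str) -> dict:
    """Apply baseline checks to an already-fetched /robots.txt response."""
    directives = [ln for ln in body.splitlines() if DIRECTIVE_RE.match(ln)]
    return {
        # Any 2xx status counts as a usable response.
        "http_available": 200 <= status < 300,
        # Non-empty and not obviously an HTML document.
        "file_like": bool(body.strip()) and not HTML_RE.search(body),
        # At least one recognizable directive line.
        "has_directives": bool(directives),
        # At least one Sitemap line among the directives.
        "has_sitemap": any(ln.strip().lower().startswith("sitemap") for ln in directives),
    }
```

A single-page application that returns its HTML shell with status 200 would pass the availability check here but fail the file-like and directive checks, which is the distinction the table draws.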
Why origin scope matters
Robots.txt is scoped by scheme, host, and port. A policy at https://example.com/robots.txt does not automatically govern https://www.example.com/ or https://blog.example.com/; each of those origins is governed only by its own /robots.txt. Google’s robots documentation describes this origin-scoped behavior and the rule syntax used by compliant crawlers.
For multi-host businesses, this means the audit should be repeated on each important host: main site, blog, store, app, help center, and documentation subdomain.
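To make the scoping concrete, here is a small sketch that maps a page URL to the robots.txt URL for its origin. The helper name `robots_url` and the example paths are hypothetical; the sketch does not normalize host case or strip credentials.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin of the given page URL."""
    parts = urlsplit(page_url)
    # netloc keeps the host and any explicit port, so each origin maps to its own file.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical pages on three hosts of the same business.
pages = [
    "https://example.com/pricing",
    "https://www.example.com/pricing",
    "https://blog.example.com/post/1",
]
for page in pages:
    print(page, "->", robots_url(page))
```

Each page resolves to a different robots.txt URL, which is why the audit has to be repeated per host.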
Minimal healthy file
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
This is not a complete AI governance policy. It is a baseline. It tells crawlers the file exists, the public site is mostly open, the WordPress admin is restricted, and the sitemap is discoverable.
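The minimal file can be sanity-checked locally with Python's standard-library parser. This is a sketch under the assumption that Python 3.8 or later is available (for `site_maps()`); it shows how one parser reads the file, not how any particular crawler does.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())

home_allowed = parser.can_fetch("*", "https://example.com/")
admin_allowed = parser.can_fetch("*", "https://example.com/wp-admin/")
sitemaps = parser.site_maps()  # list of Sitemap URLs, or None if there are none

print("home allowed:", home_allowed)
print("admin allowed:", admin_allowed)
print("sitemaps:", sitemaps)
```

The standard-library parser evaluates rules in the order they appear, so its answer for /wp-admin/admin-ajax.php may differ from crawlers that apply the most specific matching rule. That path is left out of the check for this reason.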
Invalid does not always mean broken, but it means ambiguous
Robots.txt is historically permissive. Many crawlers ignore unsupported lines. That does not mean every line is useful. When the checker warns about validity or unknown directives, the point is to reduce ambiguity. A clean file is easier to audit, easier to explain, and easier to maintain.
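A simple lint pass can surface the lines that create this ambiguity. The sketch below is illustrative only: the function name `unknown_directives` is hypothetical, and the set of "known" directives is an assumption limited to the four named in this article, so extend it to match the crawlers you care about.

```python
# Assumption: only the four directives named in this article count as known.
KNOWN = {"user-agent", "allow", "disallow", "sitemap"}


def unknown_directives(text: str) -> list[tuple[int, str]]:
    """Return (line number, content) for lines that are not known directives."""
    findings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank or comment-only lines are not reported
        name, sep, _ = line.partition(":")
        # Report lines with no colon as well as lines with an unrecognized name.
        if not sep or name.strip().lower() not in KNOWN:
            findings.append((number, line))
    return findings
```

A line such as `Noindex: /old/` would be reported, not because it necessarily breaks the file, but because a reviewer cannot tell from the file alone whether any crawler honors it.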
Common problems
- The file redirects several times before loading.
- The file is blocked by a WAF or returns a challenge page.
- A single-page application serves its HTML shell as /robots.txt.
- The file exists on www but the audit was run on the apex domain.
- Sitemap URLs are relative, outdated, or missing.
- Comments contain policy text, but no active directive or pointer exists.
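Some of these problems can be detected from the file body alone. As one example, the sketch below flags missing or relative Sitemap lines; the function names are hypothetical, and an outdated sitemap URL cannot be detected this way because that requires fetching the sitemap itself.

```python
from urllib.parse import urlsplit


def sitemap_values(text: str) -> list[str]:
    """Collect the values of all Sitemap lines, ignoring comments."""
    values = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            values.append(value.strip())
    return values


def is_absolute(url: str) -> bool:
    """True when the URL has an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def sitemap_problems(text: str) -> list[str]:
    """Describe missing or relative Sitemap references in a robots.txt body."""
    values = sitemap_values(text)
    if not values:
        return ["no Sitemap line found"]
    return [f"relative sitemap URL: {v}" for v in values if not is_absolute(v)]
```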
WordPress correction path
For WordPress sites, Better Robots.txt can generate and preview a stable output from the admin. That is safer than editing a physical file that may be ignored, overwritten, or contradicted by another plugin.
Implementation checklist
Use the audit as an implementation sequence, not as a decorative score.
- Confirm the audited origin: scheme, host (including any subdomain), and port must match the site you actually want to govern.
- Preserve search access unless the site is intentionally private.
- Decide whether the goal is maximum AI visibility, training restriction, conservative publishing, or strict privacy.
- Configure crawler families by purpose rather than by emotion.
- Publish policy context only when it is coherent with the active rules.
- Re-scan after changes because a generated WordPress robots.txt file can be modified by plugins, cache, server rules, or edge middleware.
Manual spot check
A technical reviewer can validate the audit manually by requesting these URLs:
/robots.txt
/llms.txt
/ai-manifest.json
/.well-known/ai-governance.json
/.well-known/llm-policy.json
Then compare the result with the public pages, sitemap, and WordPress configuration. The important question is not only whether each file exists. It is whether those files express the same intent. A robots.txt block, a permissive llms.txt, and a contradictory AI policy create a weak governance layer even if each file loads successfully.
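The spot check can be scripted so the same five paths are requested on every origin. This is a minimal sketch using only the standard library; the function names and the `User-Agent` string are hypothetical, and it reports only status and content type, leaving the comparison of intent to the reviewer.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Paths taken from the list above.
SPOT_CHECK_PATHS = [
    "/robots.txt",
    "/llms.txt",
    "/ai-manifest.json",
    "/.well-known/ai-governance.json",
    "/.well-known/llm-policy.json",
]


def spot_check_urls(origin: str) -> list[str]:
    """Build the full spot-check URLs for the origin of the given URL."""
    parts = urlsplit(origin)
    base = f"{parts.scheme}://{parts.netloc}"
    return [base + path for path in SPOT_CHECK_PATHS]


def fetch_status(url: str, timeout: float = 10.0):
    """Return (status, content_type), or (None, reason) on a network error."""
    request = Request(url, headers={"User-Agent": "spot-check-sketch/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type", "")
    except HTTPError as err:  # 4xx/5xx responses still carry a status code
        return err.code, err.headers.get("Content-Type", "")
    except URLError as err:  # DNS failure, refused connection, timeout, etc.
        return None, str(err.reason)


def report(origin: str) -> None:
    """Print one line per spot-check URL; performs live network requests."""
    for url in spot_check_urls(origin):
        print(url, fetch_status(url))
```

Calling `report("https://example.com")` prints one line per path. A `text/html` content type on any of these URLs is a hint that a generic page, not the expected file, is being served.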
Conversion path for WordPress
If the site is WordPress, the practical next step is not a spreadsheet of recommendations. It is a configuration pass inside Better Robots.txt: choose the closest preset, adjust crawler families, preview the output, publish, and re-run the external scan. That is what turns the audit from education into proof.