Robots.txt presence and validity check

Every AI crawler governance audit starts with the same baseline: can machines actually read a usable robots.txt file for this origin?

The question sounds simple, but many sites fail it in subtle ways: wrong host, forced redirect, generated HTML shell, empty file, stale cache, syntax noise, missing sitemap, or a file that exists only on www but not on the root domain.

What the scan verifies

Check	Meaning
HTTP availability	`/robots.txt` returns a successful (2xx) response, directly or after a short redirect chain, rather than an error or a challenge page.
File-like content	The body looks like robots.txt, not a generic HTML page.
Basic directives	The file includes recognizable `User-agent`, `Allow`, `Disallow`, or `Sitemap` lines.
Sitemap reference	Crawlers can discover sitemap URLs directly from robots.txt.
Origin scope	The audited protocol and host are the ones the user intended.
Parseability	Rules are structured enough to be interpreted.
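
The content-level rows of this table can be approximated offline once a response body is in hand. The following is a minimal sketch, not the scanner's actual implementation: the function name `inspect_robots_body`, the result keys, and the HTML heuristic are all assumptions made for illustration.

```python
import re

# Directive names taken from the table above; matching is case-insensitive.
KNOWN_DIRECTIVES = ("user-agent", "allow", "disallow", "sitemap")


def inspect_robots_body(body: str) -> dict:
    """Run rough content checks on a body served at /robots.txt (hypothetical helper)."""
    head = body.lstrip()[:200].lower()
    # Heuristic: a body that opens with a doctype or HTML tag is a page shell.
    looks_like_html = head.startswith(("<!doctype", "<html", "<head", "<body"))

    found = set()
    sitemaps = []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"([A-Za-z-]+)\s*:\s*(.*)", line)
        if not match:
            continue
        name, value = match.group(1).lower(), match.group(2).strip()
        if name in KNOWN_DIRECTIVES:
            found.add(name)
            if name == "sitemap" and value:
                sitemaps.append(value)

    return {
        "file_like": not looks_like_html and bool(found),
        "directives": sorted(found),
        "sitemaps": sitemaps,
    }
```

A body that yields `file_like: False` usually points at one of the failure modes listed later on this page, such as an HTML shell or a challenge page served in place of the file.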

Why origin scope matters

Robots.txt is scoped by scheme, host, and port. A policy at https://example.com/robots.txt does not automatically govern https://www.example.com/ or https://blog.example.com/; each of those hosts is governed by its own /robots.txt. Google’s robots documentation describes this origin-scoped behavior and the rule syntax used by compliant crawlers.

For multi-host businesses, this means the audit should be repeated on each important host: main site, blog, store, app, help center, and documentation subdomain.
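
To make the scoping concrete, the sketch below maps a page URL to the robots.txt URL that governs it. The helper name `robots_url` and the example hosts are hypothetical, and IPv6 literals are not handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin of page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    return f"{scheme}://{netloc}/robots.txt"


# Illustrative hosts: each one resolves to a different robots.txt.
hosts = [
    "https://example.com/pricing",
    "https://www.example.com/pricing",
    "https://blog.example.com/post/1",
    "https://example.com:8443/admin",
]
for url in hosts:
    print(url, "->", robots_url(url))
```

Four page URLs produce four distinct robots.txt URLs here, which is why each important host needs its own audit run.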

Minimal healthy file

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

This is not a complete AI governance policy. It is a baseline. It tells crawlers the file exists, the public site is mostly open, the WordPress admin is restricted, and the sitemap is discoverable.
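
The `Allow` line nested inside a `Disallow` path works because RFC 9309 gives precedence to the most specific (longest) matching rule, with `Allow` winning a tie. The sketch below illustrates that precedence for the minimal file above. It is a simplification that uses plain prefix matching and ignores the `*` and `$` wildcards; `is_allowed` and `RULES` are hypothetical names. Some parsers apply rules in file order instead, so their answers can differ.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Evaluate (directive, path-prefix) rules with longest-match precedence.

    Simplification: prefix matching only; `*` and `$` are not handled.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes do not apply
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Rules for `User-agent: *` copied from the minimal file above.
RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

for path in ("/", "/wp-admin/options.php", "/wp-admin/admin-ajax.php"):
    print(path, "allowed" if is_allowed(RULES, path) else "blocked")
```

Under these assumptions the homepage and `admin-ajax.php` are allowed, while the rest of `/wp-admin/` is blocked.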

Invalid does not always mean broken, but it means ambiguous

Robots.txt parsing is historically permissive: many crawlers silently ignore lines they do not support. That does not make every line useful. When the checker warns about validity or unknown directives, the point is to reduce ambiguity. A clean file is easier to audit, easier to explain, and easier to maintain.
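
A line-level lint pass is one way to surface that ambiguity. The sketch below is an assumption about how such a warning could be produced, not the checker's real logic. The `KNOWN` set holds only the four directives named on this page and should be extended for directives your target crawlers document.

```python
# Assumption: only the four directives named on this page count as known.
KNOWN = {"user-agent", "allow", "disallow", "sitemap"}


def lint_robots(body: str) -> list[tuple[int, str]]:
    """Return (line_number, message) warnings for ambiguous lines (hypothetical helper)."""
    warnings = []
    for number, raw in enumerate(body.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank lines and pure comments are fine
        name, sep, _value = line.partition(":")
        if not sep:
            warnings.append((number, "no 'name: value' structure"))
        elif name.strip().lower() not in KNOWN:
            warnings.append((number, f"unknown directive '{name.strip()}'"))
    return warnings


sample = "User-agent: *\nNoindex: /private/\nthis is not a rule\n# comment\nDisallow: /tmp/\n"
for line_number, message in lint_robots(sample):
    print(f"line {line_number}: {message}")
```

A warning here does not mean a crawler will reject the file. It marks a line whose effect depends on which crawler reads it.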

Common problems

The file redirects several times before loading.
The file is blocked by a WAF or returns a challenge page.
A single-page application serves its HTML shell as /robots.txt.
The file exists on www but the audit was run on the apex domain.
Sitemap URLs are relative, outdated, or missing.
Comments contain policy text, but no active directive or pointer exists.
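
The last two problems in this list can be detected from the file body alone. The sketch below flags comment-only files and missing or relative `Sitemap` lines. `find_common_problems` is a hypothetical helper, and it cannot tell whether a sitemap URL is outdated, since that requires fetching the sitemap itself.

```python
from urllib.parse import urlsplit


def find_common_problems(body: str) -> list[str]:
    """Flag comment-only files and missing or relative Sitemap lines."""
    problems = []
    active_lines = 0
    sitemap_lines = 0
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # comments never count as directives
        name, sep, value = line.partition(":")
        if not sep:
            continue
        active_lines += 1
        if name.strip().lower() == "sitemap":
            sitemap_lines += 1
            target = urlsplit(value.strip())
            # An absolute URL needs both an http(s) scheme and a host.
            if target.scheme not in ("http", "https") or not target.netloc:
                problems.append(f"relative or malformed Sitemap URL: {value.strip()}")
    if active_lines == 0:
        problems.append("no active directives (empty or comments only)")
    elif sitemap_lines == 0:
        problems.append("no Sitemap line")
    return problems
```

Redirect chains, WAF challenges, and host mismatches need a live request to confirm, so they are outside the scope of this body-only check.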

WordPress correction path

For WordPress sites, Better Robots.txt can generate and preview a stable output from the admin. That is safer than editing a physical file that may be ignored, overwritten, or contradicted by another plugin.

Implementation checklist

Use the audit as an implementation sequence, not as a decorative score.

Confirm the audited origin: scheme, host (including any subdomain), and port must match the site you actually want to govern.
Preserve search access unless the site is intentionally private.
Decide whether the goal is maximum AI visibility, training restriction, conservative publishing, or strict privacy.
Configure crawler families by purpose, not by gut reaction.
Publish policy context only when it is coherent with the active rules.
Re-scan after changes because a generated WordPress robots.txt file can be modified by plugins, cache, server rules, or edge middleware.

Manual spot check

A technical reviewer can validate the audit manually by requesting these URLs:

/robots.txt
/llms.txt
/ai-manifest.json
/.well-known/ai-governance.json
/.well-known/llm-policy.json

Then compare the result with the public pages, sitemap, and WordPress configuration. The important question is not only whether each file exists. It is whether those files express the same intent. A robots.txt block, a permissive llms.txt, and a contradictory AI policy create a weak governance layer even if each file loads successfully.
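
The spot check above can be scripted with the Python standard library. This is a rough sketch under stated assumptions: `spot_check_urls` and `fetch_status` are hypothetical helpers, the User-Agent string is made up, and network failures such as DNS errors or timeouts are left unhandled. It reports only whether each file loads. Judging whether the files express the same intent remains a manual step.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

# Paths copied from the list above.
SPOT_CHECK_PATHS = [
    "/robots.txt",
    "/llms.txt",
    "/ai-manifest.json",
    "/.well-known/ai-governance.json",
    "/.well-known/llm-policy.json",
]


def spot_check_urls(origin: str) -> list[str]:
    """Expand the spot-check paths against one origin, e.g. https://example.com."""
    return [urljoin(origin.rstrip("/") + "/", path) for path in SPOT_CHECK_PATHS]


def fetch_status(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Return (HTTP status, Content-Type) for a URL; redirects are followed."""
    request = urllib.request.Request(url, headers={"User-Agent": "spot-check-sketch/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type", "")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses still carry a status worth reporting.
        return error.code, error.headers.get("Content-Type", "")


# Usage (performs live requests):
#   for url in spot_check_urls("https://example.com"):
#       print(url, *fetch_status(url))
```

A `text/html` content type on any of these paths is worth a closer look, since it often means a page shell or challenge page was served instead of the file.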

Conversion path for WordPress

If the site is WordPress, the practical next step is not a spreadsheet of recommendations. It is a configuration pass inside Better Robots.txt: choose the closest preset, adjust crawler families, preview the output, publish, and re-run the external scan. That is what turns the audit from education into proof.
