
WordPress robots.txt hygiene check

WordPress crawl hygiene is the part of robots.txt strategy that keeps crawlers from spending requests on low-value paths while keeping public content visible.

A WordPress site can have a valid robots.txt file and still be noisy. Feeds, reply parameters, internal search URLs, WooCommerce carts, account pages, checkout paths, faceted filters, and generated archives can create crawl waste. At the same time, broad blocks can hide useful assets, images, or public pages.

What the checker looks for

| WordPress surface | Typical recommendation |
| --- | --- |
| /wp-admin/ | Restrict admin paths while allowing necessary public AJAX when relevant. |
| /wp-content/uploads/ | Usually keep public media crawlable. |
| Internal search | Consider blocking low-value search result URLs. |
| ?replytocom= | Reduce comment-reply crawl traps. |
| Feeds | Decide based on publishing strategy; avoid uncontrolled feed loops. |
| WooCommerce cart and checkout | Usually block transactional utility paths from crawlers. |
| Account pages | Usually block private or utility account routes. |
| Facets and filters | Avoid crawl explosion on parameterized combinations. |
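
As a rough illustration of how a URL could be sorted into the surfaces listed above, here is a minimal Python sketch. The function name `classify_wp_surface`, the label strings, and the specific paths and parameters (`?s=`, `/feed/`, `admin-ajax.php`) are assumptions based on common WordPress and WooCommerce defaults, not the checker's actual logic; a customized site may use different routes.

```python
from urllib.parse import parse_qs, urlsplit


def classify_wp_surface(url: str) -> str:
    """Return a hypothetical surface label for a WordPress URL."""
    parts = urlsplit(url)
    path = parts.path.lower()
    segments = [s for s in path.split("/") if s]
    params = parse_qs(parts.query, keep_blank_values=True)

    if path.startswith("/wp-admin/"):
        # admin-ajax.php is often called by public front-end features.
        return "public-ajax" if path == "/wp-admin/admin-ajax.php" else "admin"
    if path.startswith("/wp-content/uploads/"):
        return "uploads"
    if "replytocom" in params:
        return "comment-reply"
    if "s" in params:  # default WordPress search parameter
        return "internal-search"
    if "feed" in segments:  # heuristic: any path segment named "feed"
        return "feed"
    if "add-to-cart" in params or segments[:1] in (["cart"], ["checkout"]):
        return "transactional"
    if segments[:1] == ["my-account"]:
        return "account"
    if any(k == "orderby" or k.startswith("filter_") for k in params):
        return "facet"
    return "content"
```

Running a sample of crawled URLs through a classifier like this gives a quick picture of how much of the crawl lands on utility routes rather than content.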

Hygiene is not secrecy

Robots.txt should not be used to hide sensitive information. If a URL must remain private, use authentication, permissions, server controls, or removal. Robots.txt is a crawl instruction for compliant crawlers. It can reduce waste, but it is not security.

WooCommerce-specific concerns

WooCommerce introduces routes that rarely belong in search or AI answers:

```txt
/cart/
/checkout/
/my-account/
?add-to-cart=
?orderby=
?filter_=
```

The audit looks for signs that the site has an e-commerce context and whether the robots.txt policy reduces obvious low-value crawler paths.
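
The coverage half of that check can be sketched in Python as follows. This is a simplified illustration, not the audit's implementation: it reads only `Disallow` lines from groups that name `User-agent: *`, ignores `Allow` overrides, and tests each route with a sample URL. The helper names and the sample URLs (for example `/shop/?filter_color=red`) are assumptions.

```python
import re

# Route label from the list above -> hypothetical sample URL path to test.
WOO_SAMPLES = {
    "/cart/": "/cart/",
    "/checkout/": "/checkout/",
    "/my-account/": "/my-account/orders/",
    "?add-to-cart=": "/shop/?add-to-cart=42",
    "?orderby=": "/shop/?orderby=price",
    "?filter_=": "/shop/?filter_color=red",
}


def disallow_rules(robots_txt: str, agent: str = "*") -> list[str]:
    """Collect Disallow patterns from groups that name `agent` exactly."""
    rules, group_agents, in_rules = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (p.strip() for p in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value and agent.lower() in group_agents:
                rules.append(value)
    return rules


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate robots.txt wildcards: '*' is any run, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def uncovered_woo_paths(robots_txt: str) -> list[str]:
    """Return the route labels whose sample URL matches no Disallow rule."""
    regexes = [pattern_to_regex(p) for p in disallow_rules(robots_txt)]
    return [label for label, sample in WOO_SAMPLES.items()
            if not any(r.match(sample) for r in regexes)]
```

An empty result means every sample route is matched by at least one `Disallow` pattern; it does not prove the policy is right for the store, only that the obvious utility paths are addressed.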

The public-resource balance

A strong WordPress policy should block admin and traps without blocking public rendering. Broadly disallowing /wp-content/ can hurt image discovery and page understanding. Broadly disallowing /wp-includes/ may also create problems if public JavaScript is needed for rendering.
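
To make the balance concrete, the sketch below tests a rule set against a few public resources. It assumes the longest-match precedence described in RFC 9309 (the most specific matching rule wins, and `Allow` wins a tie) and handles plain path prefixes only, not wildcards. The sample resource paths and the function names are illustrative assumptions.

```python
# Hypothetical public resources that a renderer or image crawler may need.
PUBLIC_SAMPLES = {
    "uploaded image": "/wp-content/uploads/2024/05/photo.jpg",
    "theme stylesheet": "/wp-content/themes/example/style.css",
    "core script": "/wp-includes/js/jquery/jquery.min.js",
    "public AJAX endpoint": "/wp-admin/admin-ajax.php",
}


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


def blocked_public_resources(rules: list[tuple[str, str]]) -> list[str]:
    """Return labels of sample public resources the rules would block."""
    return [label for label, path in PUBLIC_SAMPLES.items()
            if not is_allowed(rules, path)]


# A broad policy blocks everything in the sample set.
broad = [("Disallow", "/wp-admin/"), ("Disallow", "/wp-content/"),
         ("Disallow", "/wp-includes/")]
# A narrower policy blocks admin but re-opens the public AJAX endpoint.
balanced = [("Disallow", "/wp-admin/"), ("Allow", "/wp-admin/admin-ajax.php")]
```

With these inputs, `blocked_public_resources(broad)` lists all four sample resources, while `blocked_public_resources(balanced)` returns an empty list.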

Better Robots.txt implementation path

Better Robots.txt gives WordPress users a guided configuration layer. Instead of manually editing rules, the user can select a preset, adjust crawler families, review WordPress-specific options, preview the final output, and re-scan externally.

The best result is not the longest robots.txt file. It is the clearest file that expresses the site’s crawl intent without creating accidental invisibility.

Implementation checklist

Use the audit as an implementation sequence, not as a decorative score.

  1. Confirm the audited origin: protocol, host, and subdomain must match the site you actually want to govern.
  2. Preserve search access unless the site is intentionally private.
  3. Decide whether the goal is maximum AI visibility, training restriction, conservative publishing, or strict privacy.
  4. Configure crawler families by purpose rather than by reaction to individual bot names.
  5. Publish policy context only when it is coherent with the active rules.
  6. Re-scan after changes because a generated WordPress robots.txt file can be modified by plugins, cache, server rules, or edge middleware.
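
Steps 1 and 6 lend themselves to a small helper. The Python sketch below derives the robots.txt URL that governs a given page (robots.txt applies per protocol, host, and port) and fingerprints the served file so that a later re-scan can detect silent changes. The function names are hypothetical, and the whitespace normalization is one possible choice among several.

```python
import hashlib
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin that serves `page_url`."""
    parts = urlsplit(page_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an absolute http(s) URL: {page_url!r}")
    port = f":{parts.port}" if parts.port else ""
    # urlsplit lowercases the hostname; the path and query are discarded.
    return f"{parts.scheme}://{parts.hostname}{port}/robots.txt"


def fingerprint(robots_txt: str) -> str:
    """Hash the file after normalizing line endings and trailing whitespace."""
    lines = [line.rstrip() for line in robots_txt.strip().splitlines()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
```

Storing the fingerprint at publish time and comparing it on each re-scan shows whether a plugin, cache, or edge layer has altered the file since the last review. Note that `https://example.com` and `https://www.example.com` yield different robots.txt URLs, which is the point of step 1.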

Manual spot check

A technical reviewer can validate the audit manually by requesting these URLs:

```txt
/robots.txt
/llms.txt
/ai-manifest.json
/.well-known/ai-governance.json
/.well-known/llm-policy.json
```

Then compare the result with the public pages, sitemap, and WordPress configuration. The important question is not only whether each file exists. It is whether those files express the same intent. A robots.txt block, a permissive llms.txt, and a contradictory AI policy create a weak governance layer even if each file loads successfully.
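
A reviewer who prefers a script to a browser could start from something like the following Python sketch. It only builds the URLs and reports the HTTP status of each one; judging whether the files express the same intent remains a manual step. The function names and the `User-Agent` string are assumptions, and network failures such as DNS errors or timeouts are left to propagate.

```python
import urllib.error
import urllib.request

SPOT_CHECK_PATHS = [
    "/robots.txt",
    "/llms.txt",
    "/ai-manifest.json",
    "/.well-known/ai-governance.json",
    "/.well-known/llm-policy.json",
]


def spot_check_urls(origin: str) -> list[str]:
    """Build the full spot-check URLs for an origin such as 'https://example.com'."""
    base = origin.rstrip("/")
    return [base + path for path in SPOT_CHECK_PATHS]


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for `url` (4xx and 5xx included)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "spot-check-sketch/0.1"})  # hypothetical UA
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def spot_check(origin: str) -> dict[str, int]:
    """Map each spot-check URL to its status code (performs network requests)."""
    return {url: fetch_status(url) for url in spot_check_urls(origin)}
```

A 200 on every path only confirms that the files load; a 404 on an optional file is not necessarily a problem if the site never intended to publish it.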

Conversion path for WordPress

If the site is WordPress, the practical next step is not a spreadsheet of recommendations. It is a configuration pass inside Better Robots.txt: choose the closest preset, adjust crawler families, preview the output, publish, and re-run the external scan. That is what turns the audit from education into proof.