How to save the planet with your website: a defensible crawl strategy

This historical article was originally published by Gautier Dorval in January 2019. It was rewritten in July 2026 to preserve its useful premise while removing unsupported carbon estimates.

The original idea was simple: when crawlers repeatedly fetch useless URLs, the web performs work that provides no value. That premise still holds. The defensible conclusion, however, is narrower than claiming that every bot request represents a fixed quantity of carbon dioxide.

A request can involve server processing, storage access, network transfer, rendering, logging, and cache activity. Reducing avoidable requests can therefore reduce avoidable technical work. The exact change in energy use or emissions depends on the infrastructure, cache state, payload, location, electricity mix, and measurement boundary. A backlink report or a server log alone is not enough to calculate it.

What a website owner can responsibly claim

The W3C Web Sustainability Guidelines treat data transfer, processing, caching, and infrastructure choices as connected parts of web sustainability. They also make an important point: sustainability is multidimensional, and its metrics and guidance evolve.

For crawl governance, this leads to a useful rule:

Reduce demonstrable waste first. Measure operational effects second. Make an environmental claim only when its scope and method are explicit.

This approach is less spectacular than attaching a universal carbon number to every request. It is also technically auditable.

Find crawl waste before blocking anything

Start with evidence from server logs and search-engine tools. Classify requests by path family, crawler, status code, response size, latency, and recurrence. Look for patterns such as:

faceted navigation and filter combinations that create near-infinite URL spaces;
internal search results and tracking parameters;
duplicate HTTP, HTTPS, www, non-www, slash, and non-slash routes;
obsolete XML sitemap entries;
redirect chains;
soft 404 pages that return 200;
persistent 5xx responses;
links that repeatedly send crawlers to removed URLs.
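
The classification described above can be sketched in a few lines of Python. This is a minimal illustration, not a finished log analyzer: it assumes access logs in the common "combined" format, groups URLs by their first path segment, and marks parameterized requests with a `?*` suffix. The sample lines, the `ExampleBot/1.0` user agent, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: "combined" access-log format; adjust the pattern for other formats.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def path_family(target: str) -> str:
    """Group a request target by its first path segment, flagging query strings."""
    path, _, query = target.partition("?")
    segments = [segment for segment in path.split("/") if segment]
    family = "/" + segments[0] if segments else "/"
    return family + ("?*" if query else "")


def summarize(lines):
    """Count requests and response bytes per (user agent, path family, status)."""
    hits = Counter()
    transferred = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not look like request entries
        key = (match["agent"] or "-", path_family(match["target"]), int(match["status"]))
        hits[key] += 1
        size = match["size"]
        transferred[key] += 0 if size == "-" else int(size)
    return hits, transferred


# Hypothetical sample data, for illustration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Jul/2026:10:00:00 +0000] "GET /shop?color=red&size=m HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Jul/2026:10:00:01 +0000] "GET /shop?color=blue HTTP/1.1" 200 5100 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [10/Jul/2026:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    hits, transferred = summarize(SAMPLE_LINES)
    for key, count in hits.most_common():
        print(key, count, transferred[key])
```

A path family that collects many parameterized 200 responses, or many repeated 404s, is a candidate for closer inspection. It is not yet proof of waste.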

Google defines crawl budget as the combination of crawl capacity and crawl demand. Its current crawl budget guidance is mainly intended for large or rapidly changing sites. A small site whose important pages are crawled promptly does not need an elaborate crawl-budget project.

The environmental framing does not change that threshold. Do not manufacture complexity where the logs show no material problem.

Fix URL inventory before adding directives

The strongest correction usually happens in the routing and discovery architecture:

Link internally only to canonical, indexable 200 pages.
Keep XML sitemaps limited to canonical URLs that should be crawled.
Consolidate duplicates with a stable canonical URL and consistent links.
Remove crawl traps at their source when the application can stop generating them.
Normalize host, protocol, and trailing-slash variants in one hop.
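
One-hop normalization can be illustrated with a small function that computes the final canonical URL directly, so the server can answer any variant with a single redirect. This is a sketch under stated assumptions: the canonical form is HTTPS on `www.example.com` with no trailing slash, the host names are hypothetical, and ports and fragments are ignored.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: hypothetical hosts and a "no trailing slash" policy.
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"
HOST_ALIASES = {"example.com", "www.example.com"}


def canonical_url(url: str) -> str:
    """Return the canonical form of a URL on one of the site's own hosts."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in HOST_ALIASES:
        return url  # not one of our hosts: leave untouched
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    # Scheme, host, and slash are fixed together so one redirect is enough.
    return urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, parts.query, ""))


def redirect_target(url: str) -> Optional[str]:
    """Return the single redirect destination, or None if the URL is already canonical."""
    target = canonical_url(url)
    return None if target == url else target


if __name__ == "__main__":
    for variant in ("http://example.com/blog/", "https://www.example.com/blog"):
        print(variant, "->", redirect_target(variant))
```

Computing the destination in one step avoids chains such as HTTP to HTTPS, then non-www to www, then slash to non-slash.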

Robots.txt can stop a compliant crawler from fetching a matching path, but it does not repair a broken URL model. If templates continue generating thousands of unnecessary URLs, the underlying defect remains.

Return an honest HTTP response

Historical URLs require a semantic decision, not a blanket redirect:

Restore the page with a 200 response when the content still serves a real intent.
Use a permanent 301 redirect when a clear, relevant successor exists.
Return 404 or 410 when the resource is gone and no equivalent exists.
Do not redirect unrelated legacy URLs to the homepage.
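
The decision above can be written down as a small function, which is sometimes useful when reviewing a long list of legacy URLs. The sketch below is illustrative only: the `LegacyUrl` record and its fields are hypothetical, the editorial judgment (does the content still serve an intent, is there a relevant successor) must come from a person, and 410 is chosen over 404 purely as an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LegacyUrl:
    """Hypothetical review record for one historical URL."""
    path: str
    still_serves_intent: bool        # editorial judgment, not computed here
    successor: Optional[str] = None  # a clearly relevant replacement, if any


def decide(url: LegacyUrl) -> Tuple[int, Optional[str]]:
    """Return (status code, redirect target) for a reviewed legacy URL."""
    if url.still_serves_intent:
        return 200, None  # restore the page
    # Blunt guard (assumption): a homepage target is treated as "no relevant successor".
    if url.successor and url.successor != "/":
        return 301, url.successor
    return 410, None  # gone, with no equivalent; 404 would also be honest


if __name__ == "__main__":
    print(decide(LegacyUrl("/guide-2019", still_serves_intent=True)))
    print(decide(LegacyUrl("/old-pricing", False, successor="/pricing")))
    print(decide(LegacyUrl("/promo-2018", False)))
```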

Google documents a permanent redirect as a strong signal that the destination should be processed, while a 4xx response says the old content does not exist. Its HTTP status guidance also warns against soft 404s and long redirect chains.

This is both an SEO rule and a resource-efficiency rule. A direct, truthful response gives people and crawlers a clear outcome without extra hops or repeated re-evaluation.

Use robots.txt for crawl control, not as a universal remedy

According to Google’s robots.txt guidance, robots.txt primarily manages crawler access and traffic. It is not a reliable mechanism for removing a web page from search results.

Use it conservatively:

block a path only when the crawler does not need to fetch it;
keep CSS, JavaScript, images, and other resources accessible when they are needed to understand the page;
do not block a URL that must be crawled to discover a noindex directive;
remember that crawler compliance is actor-specific and cannot be guaranteed by the file itself.
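
Before publishing a rule set, it can help to check that it blocks only what was intended. The sketch below uses Python's standard `urllib.robotparser` against a hypothetical robots.txt with simple prefix rules. Treat it as an approximation: the standard-library parser applies the first matching rule and does not handle wildcards, so individual crawlers may interpret more complex files differently. The paths and the `ExampleBot` name are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block internal search and cart paths, keep everything else open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if this parser would refuse the path for the given agent."""
    return not parser.can_fetch(agent, path)


# Paths that must stay crawlable, including rendering resources.
MUST_STAY_OPEN = ["/", "/blog/crawl-budget", "/assets/site.css", "/assets/app.js"]

if __name__ == "__main__":
    for path in MUST_STAY_OPEN:
        print(path, "blocked" if blocked(path) else "open")
    print("/search?q=shoes", "blocked" if blocked("/search?q=shoes") else "open")
```

A passing check shows only what a compliant parser would do with the file. It says nothing about whether a given crawler actually honored it.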

Better Robots.txt helps WordPress teams publish and review these crawl instructions. It is not a carbon calculator, an indexing guarantee, a security boundary, or proof that every crawler complied.

Make legitimate responses cheaper to serve

Not every crawler request is waste. Important pages still need to be discovered and refreshed. Lower the cost of those valid requests:

keep server response times stable;
use HTTP caching and 304 Not Modified where appropriate;
compress text assets and avoid unnecessarily large payloads;
remove unused scripts and third-party requests from public pages;
use a CDN and caching strategy that fit the site’s actual traffic and update model;
correct repeated 5xx and rate-limiting failures instead of hiding them.
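
As one example from the list above, a conditional request lets the server answer 304 Not Modified with an empty body when the client already holds the current version. The sketch below shows the comparison logic only, outside any web framework. The function names are hypothetical, the ETag is derived from a hash of the body as an assumption, and splitting the `If-None-Match` header on commas is a simplification.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted entity tag from the response body (assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def etag_matches(header: str, etag: str) -> bool:
    """Weak comparison of an If-None-Match header value against the current ETag."""
    if header.strip() == "*":
        return True
    for candidate in header.split(","):
        candidate = candidate.strip()
        if candidate.startswith("W/"):
            candidate = candidate[2:]  # weak validators compare equal here
        if candidate == etag:
            return True
    return False


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET with an optional validator."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None and etag_matches(if_none_match, etag):
        return 304, headers, b""  # no body is transferred
    return 200, headers, body


if __name__ == "__main__":
    page = b"<html><body>Hello</body></html>"
    status, headers, _ = respond(page, None)
    print(status, headers["ETag"])
    print(respond(page, headers["ETag"])[0])
```

Whether a particular crawler sends validators at all varies, so the effect should be confirmed in the logs rather than assumed.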

These changes can reduce transfer and processing. They do not, by themselves, prove a specific emissions reduction.

Measure the result

Compare equivalent periods before and after the change. Account for publishing cadence, campaigns, migrations, and seasonality. Track:

crawler requests to the corrected path families;
2xx, 3xx, 4xx, and 5xx distributions;
response bytes and latency;
cache-hit behavior when available;
crawl statistics and indexing reports in Google Search Console;
discovery and recrawl of the canonical destination URLs.
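
A before-and-after comparison of status distributions can be as simple as the sketch below. It assumes the status codes for each period have already been extracted from the logs for the same path families and the same crawlers; the function names and the sample numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def status_class_share(statuses: Iterable[int]) -> Dict[str, float]:
    """Return the share of responses per status class (2xx, 3xx, 4xx, 5xx)."""
    counts = Counter(f"{code // 100}xx" for code in statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: counts[cls] / total for cls in sorted(counts)}


def compare(before: Iterable[int], after: Iterable[int]) -> Dict[str, float]:
    """Return the change in share per status class (after minus before)."""
    share_before = status_class_share(before)
    share_after = status_class_share(after)
    classes = sorted(set(share_before) | set(share_after))
    return {
        cls: round(share_after.get(cls, 0.0) - share_before.get(cls, 0.0), 4)
        for cls in classes
    }


if __name__ == "__main__":
    # Hypothetical crawler responses for one path family in two equivalent periods.
    before = [200] * 6 + [301] * 2 + [404] * 2
    after = [200] * 9 + [301] * 1
    print(compare(before, after))
```

Shares are compared rather than raw counts so that a change in overall crawl volume does not hide a change in the mix. Raw counts and bytes still matter and should be reported alongside.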

If an environmental result will be published, define the system boundary, data source, estimation model, location assumptions, and uncertainty. Otherwise, report the operational result directly: fewer duplicate URLs, fewer unnecessary responses, shorter redirect chains, or lower transferred bytes.

Practical crawl-efficiency checklist

[ ] Crawl logs identify a real waste pattern.
[ ] Internal links and sitemaps use canonical 200 URLs.
[ ] Duplicate URL generation is corrected at the source.
[ ] Every legacy URL has a content-based decision: restore, redirect, or remove.
[ ] Permanent redirects point directly to a relevant successor.
[ ] Removed resources return a genuine 404 or 410.
[ ] Robots.txt rules match the intended crawler and purpose.
[ ] Important rendering resources remain crawlable.
[ ] Cache, payload, and response-time changes are measured.
[ ] Environmental language states its method and limitations.

Continue with crawl budget explained or the robots.txt guide, or run the free Better Robots.txt audit.
