Content-shape signals vs AI citation readiness

AI citation work often starts with a practical observation: a page can be crawled and still never be quoted.

That observation is useful. It does not mean an external audit can predict whether a private AI system will retrieve, cite, rank or recommend a page.

The useful part of extractability

Extractability describes whether a page contains clean, bounded passages that can be isolated without forcing a system to reconstruct the whole context.

A content-shape observation can inspect visible HTML or Markdown signals such as:

an early direct answer;
headings that match the user question or intent;
paragraphs short enough to be lifted whole;
stable entity naming;
explicit scope and non-guarantee boundaries;
proximity to a source, proof point or canonical reference.

These are observable page-shape signals. They can be fixed by the publisher.
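
As a minimal sketch of how a few of these signals could be observed, the Python below inspects a Markdown string whose blocks are separated by blank lines. The function name `observe_content_shape`, the 60-word and 120-word thresholds, and the question-mark and link heuristics are illustrative assumptions, not Better Robots behaviour. It returns observations only and computes no score.

```python
import re

# Hypothetical thresholds, chosen only for illustration.
MAX_ANSWER_WORDS = 60
MAX_PARAGRAPH_WORDS = 120


def observe_content_shape(markdown: str) -> dict:
    """Return page-shape observations for a Markdown document (no score)."""
    # Assumption: headings and paragraphs are separated by blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    headings = [b for b in blocks if b.startswith("#")]
    paragraphs = [b for b in blocks if not b.startswith("#")]

    first = paragraphs[0] if paragraphs else ""
    long_paragraphs = [p for p in paragraphs if len(p.split()) > MAX_PARAGRAPH_WORDS]

    return {
        # A short first paragraph is treated as an early direct answer candidate.
        "early_direct_answer_candidate": 0 < len(first.split()) <= MAX_ANSWER_WORDS,
        # Headings phrased as questions are a rough proxy for intent matching.
        "question_headings": [
            h.lstrip("# ").strip() for h in headings if h.rstrip().endswith("?")
        ],
        # Count of paragraphs that exceed the illustrative length limit.
        "long_paragraph_count": len(long_paragraphs),
        # Any Markdown link is a weak proxy for proximity to a source.
        "has_source_link": bool(re.search(r"\[[^\]]+\]\(https?://[^)]+\)", markdown)),
    }
```

Each key maps to something a publisher can change on the page; none of them says anything about what a retrieval system will do with it.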

The dangerous overclaim

A page can have excellent extractability and still not be cited.

Citation depends on layers Better Robots does not control or observe directly:

private retrieval indexes;
embeddings and rerankers;
model-specific source preferences;
freshness and authority signals;
query expansion and fan-out;
user context;
grounding rules;
competitor source availability.

For that reason, Better Robots should not call content-shape signals “AI citation probability”.

Where this fits in the Better Robots layer model

Layer	Question	Better Robots role
Crawler governance	Can the crawler access the content, and is the declared access posture coherent?	Core `/check` score
Post-crawl usage governance	What use is declared after access?	Policy and signal detection
Content-shape observations	Does the page expose bounded passages that look extractable?	Optional diagnostic, no score impact
Interpretive fidelity	Is the brand or claim reconstructed correctly by models?	Outside `/check`; InferensLab-style measurement
AI visibility measurement	Is the brand actually mentioned or cited?	Outside `/check`; downstream tracking

Product boundary for /check

If implemented, a content-shape module should remain separate from the main `/check` score.

It may say:

This page contains, or does not contain, structurally extractable passages.

It must not say:

This page is likely to be cited by ChatGPT, Claude, Gemini or Perplexity.
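
One way to hold that boundary in code is to give the diagnostic a type with no score or probability field at all. The sketch below is hypothetical: the class name `ContentShapeDiagnostic` and its single field are assumptions for illustration, and the wording is taken from the statements on this page.

```python
from dataclasses import dataclass

# Non-guarantee wording reused from this page's recommended public language.
DISCLAIMER = (
    "This does not predict retrieval, citation, ranking, "
    "recommendation or model adoption."
)


@dataclass(frozen=True)
class ContentShapeDiagnostic:
    """Optional diagnostic; deliberately has no score or probability field."""

    extractable_passage_candidates: int

    def statement(self) -> str:
        # Only the two permitted findings can be produced.
        if self.extractable_passage_candidates > 0:
            finding = "This page contains structurally extractable passages."
        else:
            finding = "This page does not contain structurally extractable passages."
        return f"{finding} {DISCLAIMER}"
```

Because the type carries only a count and a fixed statement, there is nothing for a caller to fold into the main score by accident.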

Recommended public language

Use cautious wording:

“Content-shape observations”;
“Extractable passage candidates”;
“Boundary statement missing”;
“No early direct answer candidate”;
“This does not predict retrieval, citation, ranking, recommendation or model adoption.”

Avoid overclaiming language:

“AI citation probability”;
“GEO score”;
“ranking readiness”;
“guaranteed AI visibility”;
“model compliance”.
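
The avoid-list above can be checked mechanically before copy is published. The following is a small sketch, assuming a hypothetical helper named `find_overclaims`; the phrase list is copied from this section, and the case-insensitive whole-phrase matching is an illustrative choice.

```python
import re

# Phrases this section lists as overclaiming language.
OVERCLAIM_PHRASES = [
    "AI citation probability",
    "GEO score",
    "ranking readiness",
    "guaranteed AI visibility",
    "model compliance",
]


def find_overclaims(copy: str) -> list[str]:
    """Return the listed overclaiming phrases found in `copy`, ignoring case."""
    found = []
    for phrase in OVERCLAIM_PHRASES:
        # Word boundaries avoid matching inside longer words.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, copy, flags=re.IGNORECASE):
            found.append(phrase)
    return found
```

For example, `find_overclaims("Our GEO score predicts ranking readiness.")` returns `["GEO score", "ranking readiness"]`, while copy that sticks to "content-shape observations" returns an empty list.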

Why this still matters

Content-shape work is useful because it gives publishers a fixable layer after crawl access and before downstream citation measurement.

It helps answer a narrower question:

If a system already reaches this page, does the page contain a clean passage that can be reused without losing its scope?

That is a legitimate diagnostic layer. It becomes misleading only when it is sold as a prediction of citation.
