The rubric
Every check we run, in plain English. Published openly so you can see exactly how a score is reached.
For the method behind these checks, see the full AI SEO guide.
A · AI crawler access (blocking)
- robots.txt fetched: present and parseable.
- 13 AI agents and opt-out tokens checked: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, CCBot, Amazonbot, Meta-ExternalAgent, plus the Google-Extended and Applebot-Extended robots.txt opt-out tokens (which control AI training, not crawling). For each agent and how to allow or block it, see the AI crawlers list.
- X-Robots-Tag header: checks for noai/noimageai (these are unofficial AI opt-out signals with no W3C or IETF standard) and noindex.
- Meta robots: checks for noindex on the home page.
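The per-agent robots.txt check can be approximated locally with Python's standard urllib.robotparser. This is a rough sketch and not the scanner's own implementation: the function name blocked_agents, the example.com URL and the sample robots.txt are illustrative assumptions, and robotparser's matching may differ in detail from how each crawler reads the rules.

```python
from urllib.robotparser import RobotFileParser

# Agent names copied from the list above. The two "-Extended" entries are
# opt-out tokens rather than crawlers, but robots.txt matches them the same way.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User", "CCBot",
    "Amazonbot", "Meta-ExternalAgent", "Google-Extended", "Applebot-Extended",
]


def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the agents that robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that blocks one agent and allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(sample))
```

Parsing a string keeps the example offline; for a live site you would fetch the file first and pass its text in.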
B · Discoverability & structure
- sitemap.xml: reachable, either at the standard path or referenced in robots.txt.
- llms.txt: present at root and non-empty (emerging convention, not an adopted standard). See what llms.txt is and how to write one.
- Canonical tag: present.
- lang attribute: declared on the <html> tag.
- HTTPS + HSTS: site served over HTTPS with optional HSTS hardening.
- Redirect chain: flags if a URL needs more than one hop to resolve.
C · Structured data
- JSON-LD presence: at least one parseable block.
- Article / BlogPosting: requires author, datePublished, headline.
- Organization: requires name, url, logo, sameAs.
- Product: requires name, offers.
- FAQPage: at least 2 Question items.
- Microdata / RDFa: detected as fallback; we recommend migrating to JSON-LD.
- Open Graph + Twitter Card: checks og:title, og:description, og:image, og:type, twitter:card.
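The required-field checks above could look roughly like the following. It is a simplified sketch under stated assumptions: the regex-based extraction, the REQUIRED table layout and the missing_fields helper are illustrative, and real pages often nest types inside @graph or use a list for @type, which this version skips.

```python
import json
import re

# Required properties per type, as listed in the rubric above.
REQUIRED = {
    "Article": ["author", "datePublished", "headline"],
    "BlogPosting": ["author", "datePublished", "headline"],
    "Organization": ["name", "url", "logo", "sameAs"],
    "Product": ["name", "offers"],
}

# Naive extraction of JSON-LD script bodies (an assumption, not a full HTML parser).
SCRIPT_RE = re.compile(
    r'''<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>''',
    re.IGNORECASE | re.DOTALL,
)


def missing_fields(html: str) -> dict[str, list[str]]:
    """Map each recognised @type found in the page to its missing required fields."""
    report: dict[str, list[str]] = {}
    for raw in SCRIPT_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # an unparseable block does not count as present
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            schema_type = item.get("@type")
            if isinstance(schema_type, str) and schema_type in REQUIRED:
                report[schema_type] = [
                    field for field in REQUIRED[schema_type] if field not in item
                ]
    return report


# Hypothetical page with an incomplete Organization block.
page = """<script type="application/ld+json">
{"@type": "Organization", "name": "Acme", "url": "https://example.com"}
</script>"""

print(missing_fields(page))  # {'Organization': ['logo', 'sameAs']}
```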
D · Content extractability
- Raw-HTML body depth: scored in bands on visible words in raw HTML (most AI crawlers do not run JS): under 100 is a blocking miss, under 300 a miss, 300 to 599 partial, 600 to 999 mostly credited, 1000+ full. Utility pages (contact, pricing, checkout, legal) are exempt.
- H1 presence + heading hierarchy: at least one h1; no skipped levels in h2/h3/h4.
- Semantic landmarks: <article> or <main> present.
- Image alt-text coverage: flags low coverage on pages with 5+ images.
- Text-to-code ratio: flags pages dominated by markup over visible text.
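To make the body-depth bands concrete, here is a minimal sketch that counts visible words in raw HTML and maps the count to the bands listed above. The class and function names are hypothetical, and the word counter is deliberately crude (it skips script, style, noscript and template content and splits on whitespace); the scanner's own definition of "visible words" may differ.

```python
from html.parser import HTMLParser


class VisibleWordCounter(HTMLParser):
    """Counts whitespace-separated words outside non-rendered elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.words += len(data.split())


def count_visible_words(raw_html: str) -> int:
    counter = VisibleWordCounter()
    counter.feed(raw_html)
    return counter.words


def body_depth_band(word_count: int, is_utility_page: bool = False) -> str:
    """Band thresholds taken from the rubric; utility pages are exempt."""
    if is_utility_page:
        return "exempt"
    if word_count < 100:
        return "blocking miss"
    if word_count < 300:
        return "miss"
    if word_count < 600:
        return "partial"
    if word_count < 1000:
        return "mostly credited"
    return "full"


html = "<body><h1>Hello world</h1><script>var a = 1;</script><p>three more words</p></body>"
words = count_visible_words(html)
print(words, body_depth_band(words))  # 5 blocking miss
```

Because nothing here executes JavaScript, text injected client-side is not counted, which mirrors the raw-HTML premise of the check.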
E · Citability signals
- Author byline: meta author, JSON-LD author, rel="author", or "By [Name]" text pattern.
- Publish / update date: JSON-LD datePublished/dateModified, <time> with datetime, or article meta.
- Outbound authoritative links: only on claim-making pages. A link to a recognised authority (gov/edu/Wikipedia/schema.org, major research firms, major business and tech press) clears it. Outbound links to unrecognised domains produce a non-scored note, not a penalty. Only a claim-heavy page with no outbound links at all gets a nice-to-have flag.
- Internal link density: flags very low link count on the home page.
F · Answer-shape
- Question-form headings: h2/h3 starting with "What/How/Why…" or ending in "?".
- Lists and tables: presence of structured content blocks.
- FAQ schema match: a page with three or more headings that genuinely read as questions (ending in a question mark, excluding calls-to-action) but no FAQPage schema gets a nice-to-have flag. This applies to a real FAQ section anywhere, including the end of a services, product or article page. In 2023 Google restricted FAQ rich results to a small set of well-known, authoritative government and health sites, so for everyone else this markup mainly helps AI extraction.
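The two heading rules in this section can be sketched as below. The helper names are hypothetical, only the three starter words named above are used (the original list is elided), and the calls-to-action exclusion is not modelled because the rubric does not say how it is detected.

```python
# Only the starters named in the rubric; the full list is not published here.
QUESTION_STARTERS = ("what", "how", "why")


def is_question_heading(text: str) -> bool:
    """True if an h2/h3 starts with a question word or ends in '?'."""
    cleaned = text.strip().lower()
    if not cleaned:
        return False
    first_word = cleaned.split()[0]
    return cleaned.endswith("?") or first_word in QUESTION_STARTERS


def needs_faq_flag(headings: list[str], has_faqpage_schema: bool) -> bool:
    """Nice-to-have flag: 3+ headings ending in '?' and no FAQPage schema.

    Assumption: call-to-action headings would be filtered out before this step.
    """
    genuine = [h for h in headings if h.strip().endswith("?")]
    return len(genuine) >= 3 and not has_faqpage_schema


headings = ["What is llms.txt?", "How does scoring work?", "Is it free?", "Pricing"]
print([h for h in headings if is_question_heading(h)])
print(needs_faq_flag(headings, has_faqpage_schema=False))  # True
```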
G · Classic SEO basics
- <title> length: 30-65 characters target.
- Meta description length: 120-160 characters target.
- Favicon + apple-touch-icon: present.
- Mobile viewport: declared.
- Core Web Vitals: optional. The main scan stays fast and skips this; you can run it on demand from the report. When run, it is fetched live from Google PageSpeed Insights (LCP, INP, CLS, performance score, mobile and desktop) and folded into the Classic SEO score.
H · Free extras
- Test-it-yourself deep links: prefilled queries to ChatGPT, Claude, Perplexity, Google AI Mode for actual citation evidence.
- Auto-generated llms.txt template: when missing.
- Auto-generated robots.txt patch: when AI bots are blocked.
- Auto-generated JSON-LD snippet: for the most-missing schema type.
- URL hygiene: clean, lowercase, no session params.
I · Content depth (the email-unlocked Content scan)
These checks run on the same pages as the main scan and answer a different question: not "do you have it" but "is it good enough to get cited". They are free; leave your email on the report page and an unlock link is sent to you. They feed the separate Content score, never the AI SEO or Classic SEO scores.
- Citable answer passages: under each question-shaped heading, is there a self-contained answer in the 115-180 word range AI assistants lift most readily (80-250 earns partial credit)?
- Entity statement up front: does the home page plainly say what the brand is and does within its opening 300 words of main content?
- Verifiable identity (sameAs): does the Organization schema link at least two recognised identity hosts (Wikipedia, Wikidata, LinkedIn, X, GitHub, Crunchbase and similar)?
- Complete article dates: do articles carry both datePublished and a valid dateModified, with a properly structured author?
- Social card depth: is og:image an absolute URL with width, height and alt text, paired with a valid Twitter card?
- Breadcrumb validation: where BreadcrumbList markup exists, are positions contiguous and every item named?
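Two of the Content-scan checks are simple enough to illustrate in a few lines. This is a sketch under assumptions: the function names are invented, breadcrumb positions are assumed to be integers starting at 1, and each item is assumed to carry its name directly on the ListItem.

```python
def passage_credit(word_count: int) -> str:
    """Credit for an answer passage, using the word ranges from the rubric."""
    if 115 <= word_count <= 180:
        return "full"
    if 80 <= word_count <= 250:
        return "partial"
    return "none"


def breadcrumb_ok(items: list[dict]) -> bool:
    """True if positions run 1..n without gaps and every item has a name."""
    if not items:
        return False
    positions = sorted(item.get("position", 0) for item in items)
    contiguous = positions == list(range(1, len(items) + 1))
    named = all(str(item.get("name", "")).strip() for item in items)
    return contiguous and named


# Hypothetical itemListElement entries from a BreadcrumbList block.
crumbs = [
    {"@type": "ListItem", "position": 1, "name": "Home"},
    {"@type": "ListItem", "position": 2, "name": "Blog"},
]

print(passage_credit(150), passage_credit(100), passage_credit(300))
print(breadcrumb_ok(crumbs))
```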
Scoring
The score is the share of points a site earns out of the points that actually apply to it. We do not start at 100 and subtract: we only grade a page on the signals that page type is expected to have.
- Applicable signals only: a signal that does not apply to a site is in neither the numerator nor the denominator. A site with no products is never graded on Product schema; a page with no Q&A content is never graded on FAQ schema; a contact or checkout page is never graded on article-length body text. These appear in a "Not applicable" list on the report so the denominator is transparent.
- Weighted by impact: each signal carries a weight reflecting how much it matters for being cited or ranked. Crawler access, HTTPS, structured-data presence and content depth are weighted far more heavily than a missing Twitter Card or favicon.
- Partial credit: a signal can be partly earned. Schema present but missing fields, 60% image-alt coverage, or a problem on 3 of 10 scanned pages each earn a proportional share of that signal's weight rather than all-or-nothing.
- Two independent scores: AI SEO and Classic SEO each have their own applicable signal set and are scored separately, 1 to 100.
- Critical gates: if AI crawlers are blocked in robots.txt, the page is set to noindex, or the site is not served over HTTPS, the score is capped low no matter how good everything else is. If assistants cannot read or are told not to index the site, a high score would be misleading.
Every signal we evaluated, passed or failed, is listed on the report (passes are grouped and collapsed) so the headline number always reconciles with the detail.
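The scoring rules above (applicable signals only, weights, partial credit, critical gates) can be illustrated with a small sketch. Everything numeric here is invented for the example: the weights, the cap of 30 and the signal names are assumptions, not the real values, and the floor of 1 simply mirrors the stated 1 to 100 range.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    weight: float            # hypothetical impact weight
    earned: float            # share of the weight earned, 0.0 to 1.0
    applicable: bool = True  # non-applicable signals are left out entirely


def score(signals: list[Signal], gate_failed: bool = False, gate_cap: int = 30) -> int:
    """Earned points over applicable points, scaled to 1..100, capped on a failed gate."""
    applicable = [s for s in signals if s.applicable]
    total = sum(s.weight for s in applicable)
    if total == 0:
        raise ValueError("no applicable signals to score")
    earned = sum(s.weight * s.earned for s in applicable)
    value = max(1, round(100 * earned / total))
    return min(value, gate_cap) if gate_failed else value


signals = [
    Signal("crawler_access", weight=10, earned=1.0),
    Signal("image_alt_coverage", weight=2, earned=0.6),   # partial credit
    Signal("product_schema", weight=5, earned=0.0, applicable=False),
    Signal("twitter_card", weight=1, earned=0.0),
]

print(score(signals))                    # 86
print(score(signals, gate_failed=True))  # 30
```

The non-applicable Product signal changes neither the numerator nor the denominator, which is why the first result is 11.2 out of 13 rather than 11.2 out of 18.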