The AI Search Readiness Checklist: 15 Things to Fix
Getting cited by AI search isn't one big lever — it's a stack of small, fixable things. If an AI crawler can't reach your page, can't parse a clean answer out of it, or can't tell who's behind it, you don't get cited, no matter how good the content is. This is a concrete, numbered checklist across five areas — crawler access, content structure, extractability, authority signals, and technical setup. Work through each item, apply the fix, and check it off. Most of these take minutes, not weeks.
Crawler Access: Can AI Even Reach You? (Items 1–4)
Citations start with access. AI search engines use distinct crawlers, and many sites block them by accident — often through a security plugin, CDN bot rule, or an overzealous robots.txt. Before anything else, make sure the bots you want are allowed in.
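AI bots fall into two broad categories. Search-facing crawlers such as OAI-SearchBot and PerplexityBot fetch pages so they can be surfaced and cited in answers. Training crawlers such as GPTBot and ClaudeBot collect content for model training. Google-Extended is a special case: it is a robots.txt control token rather than a separate crawler, and it governs whether content Googlebot has already fetched may be used for Gemini. You can allow the search-facing bots while making your own call on the training ones, because each is controlled by its own user-agent rule.
- 1. Allow the AI search crawlers in robots.txt. — Explicitly permit OAI-SearchBot (ChatGPT search) and PerplexityBot, and make sure Googlebot is not blocked: Google states that Google-Extended does not affect inclusion in Google Search, so it is not the switch for AI Overviews. A missing robots.txt allows everything by default, so the usual culprit is a restrictive one. A blanket Disallow or a plugin-generated rule is one of the most common reasons a site is invisible to AI search.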
- 2. Decide on the training/index crawlers deliberately. — GPTBot and ClaudeBot are separate user-agents. Blocking them is a legitimate choice, but do it on purpose — don't let a default deny rule silently lock out crawlers you'd actually want.
- 3. Check your CDN, WAF, and firewall, not just robots.txt. — Cloudflare, security plugins, and rate limiters often block unknown bots before robots.txt is ever read. A clean robots.txt means nothing if your edge returns a 403 to PerplexityBot.
- 4. Confirm key pages return HTTP 200 to a bot user-agent. — Soft 404s, login walls, and geo-redirects can serve humans fine but hand crawlers an error or a redirect loop. Test the actual pages you want cited.
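Items 1 and 2 can be checked mechanically. The following is a minimal sketch, assuming you have the robots.txt content as a string; the function name `crawler_access` and the sample rules are illustrative, not part of any tool mentioned here. It uses Python's standard-library `urllib.robotparser`, whose rule matching may differ from each crawler's own parser in edge cases, so treat the result as a first pass rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this checklist; extend the tuple as needed.
AI_USER_AGENTS = ("OAI-SearchBot", "PerplexityBot", "GPTBot", "ClaudeBot", "Google-Extended")


def crawler_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Return {user_agent: allowed} for one URL, given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_USER_AGENTS}


if __name__ == "__main__":
    # Hypothetical rules: GPTBot is blocked outright, everyone else
    # is only kept out of /private/.
    sample = (
        "User-agent: GPTBot\n"
        "Disallow: /\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /private/\n"
    )
    for agent, allowed in crawler_access(sample, "https://example.com/pricing").items():
        print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

This only covers robots.txt. It says nothing about CDN or WAF rules (item 3) or the HTTP status your pages return (item 4), which have to be tested against the live site.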
Content Structure: Can It Find the Answer? (Items 5–8)
AI systems extract answers, not pages. They pull a clean, self-contained passage that responds to a query. If your answer is buried three scrolls down, hedged across five paragraphs, or trapped inside marketing copy, the model has nothing crisp to lift and attribute.
Structure your pages so the answer is obvious, near the top, and stands on its own without the surrounding context.
- 5. Put a direct answer in the first 1–2 sentences under each heading. — Lead with the conclusion, then explain. Models favor passages that answer the implied question immediately rather than building up to it.
- 6. Use descriptive, question-shaped headings. — H2s and H3s that mirror how people actually ask ("How much does X cost?", "What is Y?") map cleanly to queries and make passages easy to isolate.
- 7. Break content into scannable chunks. — Short paragraphs, bullet lists, comparison tables, and step-by-step formats give models discrete, quotable units. Walls of text are hard to extract a single citation from.
- 8. Make each section self-contained. — Avoid "as mentioned above" references. A passage that only makes sense with the rest of the page is hard to cite in isolation, which is exactly how AI answers use it.
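Some of items 6 through 8 can be approximated with a simple lint pass. The sketch below is a rough heuristic, not a measure of how any AI system actually scores a page: the function name `lint_section`, the list of question words, the back-reference phrases, and the 120-word paragraph limit are all assumptions chosen for illustration. Item 5 (leading with a direct answer) still needs a human read.

```python
import re

# Words that commonly open a question-shaped heading (an assumed, partial list).
QUESTION_WORDS = (
    "how", "what", "why", "when", "where", "which", "who",
    "can", "does", "do", "is", "are", "should", "will",
)

# Phrases that make a passage depend on text elsewhere on the page.
BACK_REFERENCES = re.compile(
    r"\b(as (mentioned|noted|discussed|described) (above|earlier|previously)|see above)\b",
    re.IGNORECASE,
)


def lint_section(heading: str, body: str, max_words: int = 120) -> list[str]:
    """Return human-readable warnings for one heading and its body text."""
    warnings = []
    stripped = heading.strip()
    first_word = stripped.split(" ", 1)[0].lower() if stripped else ""
    if not (stripped.endswith("?") or first_word in QUESTION_WORDS):
        warnings.append("heading is not question-shaped")
    match = BACK_REFERENCES.search(body)
    if match:
        warnings.append(f"back-reference found: '{match.group(0)}'")
    # Paragraphs are assumed to be separated by blank lines.
    for index, paragraph in enumerate(body.split("\n\n"), start=1):
        if len(paragraph.split()) > max_words:
            warnings.append(f"paragraph {index} exceeds {max_words} words")
    return warnings


if __name__ == "__main__":
    for warning in lint_section("Pricing", "As mentioned above, X costs $10 per month."):
        print(warning)
```

Running it on your flagship page section by section gives a quick list of headings to rephrase and passages to make self-contained.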
Extractability & Technical Setup: Can It Parse You? (Items 9–12)
Even a well-structured answer is useless if the crawler can't read it. Many AI fetchers do not run JavaScript, or run it inconsistently. If your content only exists after client-side rendering, treat it as invisible until proven otherwise.
This is also where structured data and the emerging llms.txt convention come in — both give machines an explicit, unambiguous read on what your page is about.
- 9. Serve content in the initial HTML. — Server-side render or pre-render the body text. "View source" should show your actual content, not an empty div waiting for a framework to hydrate.
- 10. Add relevant schema markup. — Article, FAQPage, HowTo, Product, and Organization JSON-LD give machines explicit structure. It won't manufacture authority, but it removes ambiguity about what the page is.
- 11. Add an llms.txt file at your domain root. — A plain-Markdown map of your most important pages and what your site is for. Adoption is still early and uneven across platforms, so treat it as low-cost insurance, not a guaranteed ranking factor.
- 12. Keep pages fast and clean. — Reasonable load times, valid HTML, and a working sitemap.xml all help crawlers reach and parse more of your content. The SEO fundamentals still apply underneath generative engine optimization (GEO).
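For items 10 and 11, both artifacts are small enough to generate from a script. This is a minimal sketch under stated assumptions: the helper names `article_jsonld` and `build_llms_txt`, the "Key pages" section title, and every sample value are hypothetical. The JSON-LD uses standard schema.org Article properties, and the llms.txt layout (a title, a blockquote summary, then a link list) follows the publicly proposed convention, which platforms may or may not read.

```python
import json


def article_jsonld(headline: str, author: str, published: str, modified: str) -> str:
    """Build a schema.org Article JSON-LD script tag for a page's <head>."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": published,
        "dateModified": modified,
    }
    # Escape "</" so a value can never close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"


def build_llms_txt(site_name: str, summary: str, pages: list[tuple[str, str, str]]) -> str:
    """Build llms.txt content from (title, url, note) tuples."""
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Key pages", ""]
    lines += [f"- [{title}]({url}): {note}" for title, url, note in pages]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; substitute your own page data.
    print(article_jsonld("AI Search Readiness", "Jane Doe", "2025-01-10", "2025-03-02"))
    print(build_llms_txt(
        "Example",
        "Docs and pricing for Example.",
        [("Pricing", "https://example.com/pricing", "Plans and costs")],
    ))
```

The generated script tag belongs in the initial HTML (item 9), since markup injected by client-side JavaScript may not be seen by fetchers that do not render pages.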
Authority & Citation Signals: Will It Trust You? (Items 13–15)
Access and structure get you eligible. Authority gets you chosen. When multiple sources could answer a query, AI systems lean toward content that looks credible, current, and corroborated elsewhere — and they often surface the source that's already widely referenced.
These signals are slower to build than a robots.txt fix, but they're what separates a citable page from a cited one. Be honest that the platforms are opaque here: nobody outside these companies has the exact ranking logic, so optimize for genuine credibility rather than tricks.
- 13. Show clear authorship and dates. — Named authors, an about page, visible publish and updated dates, and citations to primary sources are all credibility signals that AI systems appear to weigh.
- 14. Build off-page presence. — AI systems frequently cite sources that are mentioned, linked, and discussed across the web. Mentions on reputable sites, in communities, and in directories raise the odds you're the cited answer.
- 15. Keep content current. — Stale pages get passed over for time-sensitive queries. Update facts, refresh dates only when you genuinely revise, and prune content that is no longer accurate. Accuracy is itself a citation signal.
Do It Now: The 15-Minute Pass
You don't have to do all 15 at once. Make a fast triage pass first — fix what's blocking you, then improve what's weak. Here's the order that catches the most damage fastest.
- Step 1 — Unblock (items 1–4). — Open robots.txt and your CDN bot rules. Confirm the search crawlers are allowed and your top pages return 200. This is the highest-leverage 5 minutes you'll spend.
- Step 2 — Check rendering (item 9). — View source on a key page. If the body text isn't there, that page is likely invisible to non-JS crawlers — fix that before anything cosmetic.
- Step 3 — Tighten one flagship page (items 5–8). — Pick your most important page, lead each section with a direct answer, and break up any walls of text. Use it as the template for the rest.
- Step 4 — Add the machine signals (items 10–11). — Drop in the right schema and publish an llms.txt at your root. Both are quick wins that cost almost nothing.
- Step 5 — Queue the slow work (items 13–15). — Authorship, freshness, and off-page presence are ongoing — put them on your roadmap rather than your to-do list for today.
# robots.txt: allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Optional: training crawlers and usage tokens. Allow or disallow on purpose.
# Google-Extended is a control token for Gemini use, not a separate crawler.
User-agent: Google-Extended
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
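Steps 1 and 2 of the triage pass (status code and rendering) can be scripted as well. The sketch below is an approximation under several assumptions: `text_in_initial_html` and `fetch_as_bot` are hypothetical helper names, and the default user-agent value is a bare token, whereas real crawlers send longer user-agent strings and some firewalls verify bots by IP address. A 200 here therefore does not prove the real crawler gets through, but a 403 or a missing phrase is a strong hint that something is wrong.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_in_initial_html(html: str, phrase: str) -> bool:
    """True if `phrase` appears in the page text without running any JavaScript."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    visible = " ".join(" ".join(extractor.chunks).split())  # collapse whitespace
    return phrase.lower() in visible.lower()


def fetch_as_bot(url: str, user_agent: str = "OAI-SearchBot") -> tuple[int, str]:
    """Fetch `url` with a bot-like User-Agent header; return (status, body)."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except HTTPError as error:  # 4xx and 5xx responses raise instead of returning
        return error.code, ""
```

A typical use is to call `fetch_as_bot` on a key page, confirm the status is 200, then pass the body and a sentence you expect on that page to `text_in_initial_html`. If the sentence is missing, the page probably depends on client-side rendering.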
See your AI search readiness score
Working through 15 items by hand means a lot of manual checking — reading robots.txt, testing bot responses, viewing source on each page, hunting for missing schema. The free Am I Citable scanner does that pass for you: enter your URL and it checks AI-crawler access, content structure, schema, and whether you have an llms.txt, then returns a single 0–100 readiness score so you can see exactly which of these items you're failing. It also generates a ready-to-use llms.txt file you can drop at your domain root, knocking out item 11 instantly. Run it once to get your baseline, fix the flagged items, and re-scan to confirm the score moved.
FAQ
How do I check whether AI crawlers can reach my site?
Check three places, because any one of them can block you silently: your robots.txt rules for OAI-SearchBot, PerplexityBot, and Google-Extended; your CDN or WAF bot-management settings (Cloudflare and security plugins often block unknown agents before robots.txt is read); and the actual HTTP response your key pages return to a bot user-agent. A scan that fetches your site the way an AI crawler would is the fastest way to catch all three at once.
Which items should I fix first?
Crawler access (items 1–4). Content quality, schema, and llms.txt are irrelevant if the crawler gets a 403 or a disallow rule before it ever reads your page. Confirm the AI search crawlers can reach your top pages and get a 200, then work down the list. Rendering (item 9) is a close second, because content that only appears after JavaScript runs is effectively invisible to many AI fetchers.
Does completing the checklist guarantee I'll be cited?
No, and be skeptical of anyone who promises it. These platforms are opaque and their selection logic changes; the checklist makes you eligible and competitive, not guaranteed. What it does reliably is remove the mechanical reasons you'd never be cited (blocked crawlers, unparseable pages, missing answers) and strengthen the credibility signals that tip a decision your way. Citation also depends on query demand and how strong the competing sources are.