AI Search Optimisation

The 8 AI Crawlers You Should Be Allowing in 2026 (And How)

A definitive list of every AI crawler your website should explicitly welcome in 2026, who runs each one, and the exact robots.txt block to copy. Updated for the current AI search ecosystem.

Arclight Digital · 4 min read

If you've already read our piece on why your website is invisible to ChatGPT, you know the fix is in robots.txt. But which crawlers actually matter, and which ones are noise? This is the current shortlist — every AI crawler worth explicitly allowing in 2026.

The 8 AI crawlers to know

| Crawler | Owner | What it powers | Why it matters |
| --- | --- | --- | --- |
| GPTBot | OpenAI | ChatGPT training + indexing | The biggest AI surface today; single largest source of AI citations |
| ChatGPT-User | OpenAI | Live web fetches when ChatGPT is asked to visit a page | Different from GPTBot; used in real time, not training |
| OAI-SearchBot | OpenAI | ChatGPT Search (the search-engine product) | Powers the answer-engine experience |
| ClaudeBot | Anthropic | Claude indexing + research | Growing fast in B2B and research-heavy verticals |
| PerplexityBot | Perplexity | Perplexity search + answer engine | Highest citation visibility; Perplexity always shows sources |
| Google-Extended | Google | AI Overviews + Gemini training | Separate from Googlebot; controls AI answers without affecting normal SEO |
| Applebot-Extended | Apple | Apple Intelligence + Siri | ~30% of AU mobile traffic uses iOS, where Apple AI is the default |
| Meta-ExternalAgent | Meta | Llama training + Meta AI | Meta AI is integrated into Instagram, WhatsApp, and Messenger search |

The honourable mentions

Worth knowing but lower priority — they're either smaller in volume or operate slightly differently:

  • Amazonbot — Amazon's general crawler, increasingly tied to Alexa AI
  • CCBot — Common Crawl, the open dataset that trains many smaller LLMs
  • DuckAssistBot — DuckDuckGo's AI features
  • MistralAI-User — Mistral's crawler (popular in EU markets)
  • Bytespider — ByteDance / Doubao (TikTok parent), AI for the Chinese market
  • Diffbot — knowledge-graph data feeding many LLMs as a service
  • YouBot — You.com's AI search
  • cohere-ai — Cohere's enterprise LLM training

The exact robots.txt to copy

Drop this into your robots.txt file at yourdomain.com/robots.txt. It explicitly welcomes every major AI crawler and keeps traditional search (Google, Bing) accessible:

# Arclight Digital — robots.txt template (2026)
# Welcomes traditional search + every major AI crawler

User-agent: *
Allow: /

# ───── OpenAI ─────
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# ───── Anthropic ─────
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /

# ───── Perplexity ─────
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# ───── Google ─────
User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

# ───── Apple ─────
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

# ───── Meta ─────
User-agent: Meta-ExternalAgent
Allow: /

User-agent: FacebookBot
Allow: /

# ───── Honourable mentions ─────
User-agent: Amazonbot
Allow: /

User-agent: CCBot
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Diffbot
Allow: /

User-agent: YouBot
Allow: /

# ───── Sitemap ─────
Sitemap: https://yourdomain.com/sitemap.xml

Replace yourdomain.com on the last line with your actual domain. Save as plain text. Upload to the root of your site (same folder as your homepage).
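Before uploading, you can sanity-check the template with Python's standard-library robots.txt parser. This is a minimal sketch: paste the full template into the `ROBOTS_TXT` string (only a few groups are shown here), and treat `yourdomain.com` as a placeholder.

```python
from urllib import robotparser

# Paste the full template here; a few groups shown for brevity.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
"""

# The eight headline crawlers from the table above.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Meta-ExternalAgent",
]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for agent in AI_AGENTS:
    # Agents without their own group fall back to the "User-agent: *" rules.
    allowed = rp.can_fetch(agent, "https://yourdomain.com/")
    print(f"{agent}: {'allowed' if allowed else 'BLOCKED'}")
```

Every agent should print `allowed`; any `BLOCKED` line means a stray `Disallow` slipped into the file.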

Platform-specific instructions

  • Squarespace: can't edit robots.txt directly. Toggle every crawler ON under Settings → Crawlers & Spiders. Squarespace blocks the major AI bots by default — see our deeper guide on the Squarespace trap.
  • WordPress: use Rank Math or Yoast (both have a robots.txt editor in their SEO settings) — or edit directly via your hosting file manager.
  • Wix: Settings → SEO Tools → Robots.txt. Paste the block above.
  • Shopify: theme code → robots.txt.liquid. Newer plans allow direct edits; older plans require theme code customisation.
  • Custom builds (static HTML, Vercel, Netlify): drop a file named robots.txt into your public directory. Done.

How to verify it worked

Once your file is live:

  1. Visit yourdomain.com/robots.txt in an incognito window. Confirm the new content loads.
  2. Use Google Search Console's robots.txt tester to validate parsing.
  3. Wait 2-4 weeks, then ask ChatGPT: "Tell me about [your business name]." Accurate, cited answer? You're in.

Robots.txt is necessary — but not sufficient. Allowing crawlers gets them to your door. To get cited, the content behind the door also needs schema markup, FAQ structure, and clear topical authority. Read our AI Search Optimisation Brisbane page for the full picture.

What if I don't want some of these crawlers?

You can selectively block crawlers if you have a real reason to. Common cases:

  • Concerned about training data: block GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent. Note: this also blocks AI citation, not just training.
  • Heavy server load from a specific bot: block Bytespider or CCBot first — these are the most aggressive crawlers.
  • Paywall content: use Disallow: /premium/ patterns rather than blanket-blocking the bot.
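To illustrate the paywall case, here is a sketch of a selective-block pattern: training bots are kept out of a premium section while the rest of the site stays open. The `/premium/` path is a placeholder for your own paywalled directory.

```
# Keep AI training bots out of paywalled content only
User-agent: GPTBot
Disallow: /premium/

User-agent: ClaudeBot
Disallow: /premium/

# Everyone else: full access
User-agent: *
Allow: /
```

Note that a bot with its own user-agent group ignores the `*` group entirely, so GPTBot and ClaudeBot remain free to crawl everything outside `/premium/`.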

For most small businesses in 2026, the answer is allow everything. Blocking trades guaranteed invisibility on AI surfaces for a theoretical training-data concern that doesn't really apply to a 5-page service site.

Updated when?

This list reflects the AI crawler ecosystem as of 2026. The landscape is shifting fast — new players appear quarterly. We update this page when meaningful changes happen. Bookmark it.

Get a free AI crawler audit

We'll check your robots.txt, schema, and AI citation status — and send back a one-page report with the highest-impact fixes.

Get a Free Audit