AI Search Optimisation

The 8 AI Crawlers You Should Be Allowing in 2026 (And How)

A definitive list of every AI crawler your website should explicitly welcome in 2026, who runs each one, and the exact robots.txt block to copy. Updated for the current AI search ecosystem.

Arclight Digital · 4 min read

If you've already read our piece on why your website is invisible to ChatGPT, you know the fix is in robots.txt. But which crawlers actually matter, and which ones are noise? This is the current shortlist — every AI crawler worth explicitly allowing in 2026.

The 8 AI crawlers to know

| Crawler | Owner | What it powers | Why it matters |
| --- | --- | --- | --- |
| GPTBot | OpenAI | ChatGPT training + indexing | The biggest AI surface today; single largest source of AI citations |
| ChatGPT-User | OpenAI | Live web fetches when ChatGPT is asked to visit a page | Different from GPTBot; used in real time, not training |
| OAI-SearchBot | OpenAI | ChatGPT Search (the search-engine product) | Powers the answer-engine experience |
| ClaudeBot | Anthropic | Claude indexing + research | Growing fast in B2B and research-heavy verticals |
| PerplexityBot | Perplexity | Perplexity search + answer engine | Highest citation visibility; Perplexity always shows sources |
| Google-Extended | Google | AI Overviews + Gemini training | Separate from Googlebot; controls AI answers without affecting normal SEO |
| Applebot-Extended | Apple | Apple Intelligence + Siri | ~30% of AU mobile traffic uses iOS, where Apple AI is the default |
| Meta-ExternalAgent | Meta | Llama training + Meta AI | Meta AI is integrated into Instagram, WhatsApp, and Messenger search |

The honourable mentions

Worth knowing but lower priority — they're either smaller in volume or operate slightly differently:

  • Amazonbot — Amazon's general crawler, increasingly tied to Alexa AI
  • CCBot — Common Crawl, the open dataset that trains many smaller LLMs
  • DuckAssistBot — DuckDuckGo's AI features
  • MistralAI-User — Mistral's crawler (popular in EU markets)
  • Bytespider — ByteDance / Doubao (TikTok parent), AI for the Chinese market
  • Diffbot — knowledge-graph data feeding many LLMs as a service
  • YouBot — You.com's AI search
  • cohere-ai — Cohere's enterprise LLM training

The exact robots.txt to copy

Drop this into your robots.txt file at yourdomain.com/robots.txt. It explicitly welcomes every major AI crawler and keeps traditional search (Google, Bing) accessible:

# Arclight Digital — robots.txt template (2026)
# Welcomes traditional search + every major AI crawler

User-agent: *
Allow: /

# ───── OpenAI ─────
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# ───── Anthropic ─────
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /

# ───── Perplexity ─────
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# ───── Google ─────
User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

# ───── Apple ─────
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

# ───── Meta ─────
User-agent: Meta-ExternalAgent
Allow: /

User-agent: FacebookBot
Allow: /

# ───── Honourable mentions ─────
User-agent: Amazonbot
Allow: /

User-agent: CCBot
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Diffbot
Allow: /

User-agent: YouBot
Allow: /

# ───── Sitemap ─────
Sitemap: https://yourdomain.com/sitemap.xml

Replace yourdomain.com on the last line with your actual domain. Save as plain text. Upload to the root of your site (same folder as your homepage).
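Before uploading, you can sanity-check the template with Python's standard-library robots.txt parser. This is a minimal sketch: paste the full template into the `ROBOTS_TXT` string (only a few groups are shown here), and treat `yourdomain.com` as a placeholder.

```python
from urllib import robotparser

# Paste the full template here; a few groups shown for brevity.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
"""

# The eight headline crawlers from the table above.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Meta-ExternalAgent",
]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for agent in AI_AGENTS:
    # Agents without their own group fall back to the "User-agent: *" rules.
    allowed = rp.can_fetch(agent, "https://yourdomain.com/")
    print(f"{agent}: {'allowed' if allowed else 'BLOCKED'}")
```

Every agent should print `allowed`; any `BLOCKED` line means a stray `Disallow` slipped into the file.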

Platform-specific instructions

  • Squarespace: can't edit robots.txt directly. Toggle every crawler ON under Settings → Crawlers & Spiders. Squarespace blocks the major AI bots by default — see our deeper guide on the Squarespace trap.
  • WordPress: use Rank Math or Yoast (both have a robots.txt editor in their SEO settings) — or edit directly via your hosting file manager.
  • Wix: Settings → SEO Tools → Robots.txt. Paste the block above.
  • Shopify: theme code → robots.txt.liquid. Newer plans allow direct edits; older plans require theme code customisation.
  • Custom builds (static HTML, Vercel, Netlify): drop a file named robots.txt into your public directory. Done.

How to verify it worked

Once your file is live:

  1. Visit yourdomain.com/robots.txt in an incognito window. Confirm the new content loads.
  2. Use Google Search Console's robots.txt tester to validate parsing.
  3. Wait 2-4 weeks, then ask ChatGPT: "Tell me about [your business name]." Accurate, cited answer? You're in.

Robots.txt is necessary — but not sufficient. Allowing crawlers gets them to your door. To get cited, the content behind the door also needs schema markup, FAQ structure, and clear topical authority. Read our AI Search Optimisation Brisbane page for the full picture.

What if I don't want some of these crawlers?

You can selectively block crawlers if you have a real reason to. Common cases:

  • Concerned about training data: block GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent. Note: this also blocks AI citation, not just training.
  • Heavy server load from a specific bot: block Bytespider or CCBot first — these are the most aggressive crawlers.
  • Paywall content: use Disallow: /premium/ patterns rather than blanket-blocking the bot.
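To illustrate the paywall case, here is a sketch of a selective-block pattern: training bots are kept out of a premium section while the rest of the site stays open. The `/premium/` path is a placeholder for your own paywalled directory.

```
# Keep AI training bots out of paywalled content only
User-agent: GPTBot
Disallow: /premium/

User-agent: ClaudeBot
Disallow: /premium/

# Everyone else: full access
User-agent: *
Allow: /
```

Note that a bot with its own user-agent group ignores the `*` group entirely, so GPTBot and ClaudeBot remain free to crawl everything outside `/premium/`.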

For most small businesses in 2026, the answer is allow everything. Blocking trades guaranteed invisibility on AI surfaces for a theoretical training-data concern that doesn't really apply to a 5-page service site.

Updated when?

This list reflects the AI crawler ecosystem as of 2026. The landscape is shifting fast — new players appear quarterly. We update this page when meaningful changes happen. Bookmark it.

Get a free AI crawler audit

We'll check your robots.txt, schema, and AI citation status — and send back a one-page report with the highest-impact fixes.

Get a Free Audit