AI Crawler User Agent List: Complete Reference for 2026

A reference list of AI crawler user agents for 2026, covering GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, CCBot, and more.

There are two categories: AI search/browse crawlers (index content for real-time AI answers) and AI training crawlers (collect data to train models).

Blocking search/browse crawlers reduces your AI citation visibility. Blocking training crawlers has minimal direct impact on current AI search.

This list is updated as new crawlers are identified. Last reviewed: February 2026.

How to Use This List

For robots.txt: Use the value in the “User-agent string contains” column as the User-agent: token, exactly as shown.
For server log analysis: Search for the string in the “User-agent string contains” column.
For WordPress bot detection (PulseRank): PulseRank uses this list automatically. You do not need to configure anything manually.
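
The substring matching described above can be sketched in a few lines of Python. This is a minimal illustration, not PulseRank's implementation: the `KNOWN_AI_CRAWLERS` mapping holds only a subset of the tokens from this reference, and the category labels, the function name, and the sample user-agent string are assumptions made for the example.

```python
# Illustrative subset of the tokens listed in this reference. The category
# labels ("search", "training") are hypothetical names used by this sketch.
KNOWN_AI_CRAWLERS = {
    "GPTBot": "search",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "Applebot": "search",
    "Applebot-Extended": "search",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
}


def classify_user_agent(user_agent):
    """Return (token, category) for the first matching token, else None."""
    lowered = user_agent.lower()
    # Try longer tokens first so "Applebot-Extended" wins over "Applebot".
    for token in sorted(KNOWN_AI_CRAWLERS, key=len, reverse=True):
        if token.lower() in lowered:
            return token, KNOWN_AI_CRAWLERS[token]
    return None


# Made-up sample string for illustration; real strings vary by version.
sample = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(classify_user_agent(sample))
```

Matching is case-insensitive here because log entries do not always preserve the operator's capitalization. Keep in mind that a user-agent string is self-reported, so a match shows what a client claims to be, not what it is.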

AI Search / Browse Crawlers

These crawlers index content to power real-time AI assistant answers. If you block them, your content is unlikely to appear in AI-generated responses.

Bot name	Operator	User-agent string contains	Purpose
GPTBot	OpenAI	`GPTBot`	Web crawling for ChatGPT search and retrieval
OAI-SearchBot	OpenAI	`OAI-SearchBot`	SearchGPT (OpenAI’s search product)
PerplexityBot	Perplexity AI	`PerplexityBot`	Perplexity search and answer generation
Bingbot / Copilot	Microsoft	`Bingbot`	Bing search + Microsoft Copilot retrieval
Applebot	Apple	`Applebot`	Apple search + Siri knowledge
Applebot-Extended	Apple	`Applebot-Extended`	Apple AI feature indexing
YouBot	You.com	`YouBot`	You.com AI search
PhindBot	Phind	`PhindBot`	Phind AI search for developers
Googlebot	Google	`Googlebot`	Google Search + Gemini retrieval
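Google-Extended	Google	`Google-Extended` (robots.txt token only; requests arrive under Google’s existing user agents, so it will not appear in logs)	Google AI training and Gemini features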
DuckDuckBot	DuckDuckGo	`DuckDuckBot`	DuckDuckGo search (used in some AI integrations)

AI Training Crawlers

These crawlers collect content to train or fine-tune AI models. Blocking them does not directly prevent current AI assistants from citing your content, but limits future training data use.

Bot name	Operator	User-agent string contains	Purpose
CCBot	Common Crawl	`CCBot`	General web crawl used in many AI training datasets
anthropic-ai	Anthropic	`anthropic-ai`	Anthropic training data collection
ClaudeBot	Anthropic	`ClaudeBot`	Claude AI crawling (training + retrieval)
Bytespider	ByteDance	`Bytespider`	TikTok parent company crawler
PetalBot	Huawei	`PetalBot`	Huawei search and AI data collection
Diffbot	Diffbot	`Diffbot`	AI-powered data extraction service
Omgilibot	Webz.io	`Omgilibot`	AI content aggregation
magpie-crawler	Magpie	`magpie-crawler`	AI data collection
Brightbot	BrightEdge	`Brightbot`	AI SEO and training data
DataForSeoBot	DataForSEO	`DataForSeoBot`	SEO data collection with AI applications

Unknown / Shadow AI Crawlers

These user-agents have been identified in server logs of WordPress sites but do not have publicly documented policies. Treat with caution.

User-agent string	Notes
`ChatGPT-User`	Used by ChatGPT’s browsing feature (not the training crawler)
`cohere-ai`	Cohere LLM training crawler
`AI2Bot`	Allen Institute for AI crawler
`img2dataset`	Image dataset collection bot
`Scrapy`	Python scraping framework; often used by AI data collectors

If you encounter unfamiliar user-agents in your logs, check for documentation on the crawling organization’s website before deciding to block.

Robots.txt Quick Reference

Allow all AI search crawlers, block training crawlers

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PetalBot
Disallow: /
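
Before deploying rules like these, it can help to check how a standard parser reads them. The sketch below uses Python's standard-library `urllib.robotparser` on an abbreviated copy of the rules above; the helper name `allowed_agents` and the example.com URL are placeholders chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the quick-reference rules from this section.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each agent name to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "CCBot", "Bytespider"],
    "https://example.com/blog/post",  # placeholder URL
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

This only shows how Python's parser interprets the file. Individual crawlers may interpret edge cases differently, and an agent with no matching group and no `*` group is treated as allowed.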

How to Identify AI Crawlers in WordPress Server Logs

If you have access to raw access logs, filter for known AI crawler strings:

Apache / Nginx (bash)

# Show client IP, timestamp, and URL for each AI crawler request
# (field positions assume the default combined log format)
grep -E "GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|anthropic-ai|Applebot|CCBot|Bytespider" /var/log/nginx/access.log | awk '{print $1, $4, $7}'

Count requests per bot (current log file)

# -o prints only the matched bot name, so the counts group by bot
grep -oE "GPTBot|PerplexityBot|ClaudeBot|CCBot" /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn
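
For anything beyond a quick look, the same count can be done in Python, which makes it easier to match only against the user-agent field rather than the whole line. This is a rough sketch that assumes the combined log format; the regular expression, the token list, and the sample log lines are illustrative assumptions, not output from a real server.

```python
import re
from collections import Counter

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')
BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot", "CCBot", "Bytespider"]


def count_bot_requests(lines):
    """Count requests per known bot token, looking only at the user-agent field."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua").lower()
        for token in BOT_TOKENS:
            if token.lower() in ua:
                counts[token] += 1
                break
    return counts


# Made-up sample lines for illustration.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Feb/2026:06:25:01 +0000] "GET /blog/a HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [10/Feb/2026:06:25:09 +0000] "GET /blog/b HTTP/1.1" 200 4410 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [10/Feb/2026:07:02:13 +0000] "GET / HTTP/1.1" 200 9800 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.44 - - [10/Feb/2026:07:05:00 +0000] "GET /GPTBot-guide HTTP/1.1" 200 700 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for bot, n in count_bot_requests(SAMPLE_LOG).most_common():
    print(bot, n)
```

The last sample line shows why matching on the user-agent field matters: a plain `grep` for `GPTBot` would also count an ordinary browser request whose URL happens to contain the bot name.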

AI Crawler User Agent In WordPress (via PulseRank)

PulseRank matches against this list automatically and displays crawler activity per bot, per page, per day — without requiring server log access.

AI Crawler User Agent Crawl Rates

As of early 2026, typical crawl behavior observed across WordPress sites:

GPTBot — typically crawls 3–8 pages per day on most small-to-medium sites. Re-crawls vary from weekly to monthly depending on content update frequency.
PerplexityBot — lower crawl volume than GPTBot; tends to focus on specific pages rather than full-site crawls.
OAI-SearchBot — higher crawl frequency than GPTBot; appears to prioritize recently updated pages.
CCBot — large batch crawls; may visit hundreds of pages in a single session.
ClaudeBot / anthropic-ai — relatively recent crawler; crawl volumes are growing.

These are observational estimates, not official figures. Use PulseRank or server logs to see the actual crawl rate for your specific site.
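
To compare your own site against these estimates from raw logs, one option is to count the distinct pages a given bot requests per day. The sketch below assumes the combined log format with English month abbreviations in timestamps; the pattern, the function name, and the sample lines are assumptions made for this example.

```python
import re
from collections import defaultdict
from datetime import datetime

# Captures the timestamp, the request path, and the user-agent field.
LINE_PATTERN = re.compile(
    r'\[(?P<ts>[^\]]+)\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def pages_per_day(lines, token):
    """Return {ISO date: number of distinct paths requested by `token`}."""
    seen = defaultdict(set)
    for line in lines:
        m = LINE_PATTERN.search(line)
        if not m or token.lower() not in m.group("ua").lower():
            continue
        stamp = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        seen[stamp.date().isoformat()].add(m.group("path"))
    return {day: len(paths) for day, paths in sorted(seen.items())}


GPTBOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
# Made-up sample lines for illustration.
SAMPLE_LOG = [
    f'203.0.113.5 - - [10/Feb/2026:06:25:01 +0000] "GET /blog/a HTTP/1.1" 200 5123 "-" "{GPTBOT_UA}"',
    f'203.0.113.5 - - [10/Feb/2026:06:25:09 +0000] "GET /blog/b HTTP/1.1" 200 4410 "-" "{GPTBOT_UA}"',
    f'203.0.113.5 - - [10/Feb/2026:18:02:44 +0000] "GET /blog/a HTTP/1.1" 200 5123 "-" "{GPTBOT_UA}"',
    f'203.0.113.5 - - [11/Feb/2026:07:10:00 +0000] "GET /pricing HTTP/1.1" 200 3300 "-" "{GPTBOT_UA}"',
    '192.0.2.44 - - [11/Feb/2026:07:11:00 +0000] "GET /blog/a HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

print(pages_per_day(SAMPLE_LOG, "GPTBot"))
```

Counting distinct paths rather than raw requests keeps repeat fetches of the same page from inflating the daily figure. Dates are taken as logged, without converting time zones.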

AI Crawler User Agent Compliance with robots.txt

Crawler	Honors robots.txt	Documentation
GPTBot	Yes (documented)	OpenAI docs
OAI-SearchBot	Yes (documented)	OpenAI docs
PerplexityBot	Yes (documented)	Perplexity AI docs
ClaudeBot / anthropic-ai	Yes (documented)	Anthropic docs
Google-Extended	Yes	Google developer docs
CCBot	Yes	Common Crawl docs
Bytespider	Claimed yes; inconsistent in practice	ByteDance docs
PetalBot	Yes	Huawei docs
Unknown scrapers	Often no	N/A

AI Crawler User Agent Changelog

Date	Change
February 2026	Added `ChatGPT-User` (browse feature), updated Applebot-Extended notes
January 2026	Added `cohere-ai`, `AI2Bot` to shadow crawlers section
November 2025	Added `OAI-SearchBot` (SearchGPT launch)
September 2025	Updated crawl rate estimates; added PhindBot

From Our Testing: Crawl Behavior Observed in the Wild

Based on server log analysis across hundreds of WordPress sites via PulseRank:

GPTBot is consistently the highest-volume AI crawler across all site niches, followed by PerplexityBot and OAI-SearchBot on pages covering AI, SaaS, and technical topics.
CCBot makes large batch crawls — we observed it requesting 200–400 pages per day on sites it visits infrequently, then going quiet for weeks.
ClaudeBot has approximately doubled its crawl rate between Q3 2025 and Q1 2026 on the sites we monitor, suggesting Anthropic is scaling its retrieval infrastructure significantly.
Bytespider claimed robots.txt compliance but we observed it accessing disallowed paths on 3 of 8 test sites within 30 days of applying a block. Treat it like a scraper until proven otherwise.
Unknown user-agents account for 5–12% of bot traffic across sites we monitor. Many appear to be AI data collection tools using generic or spoofed strings.
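
A compliance check like the Bytespider observation above can be approximated by replaying logged requests against your own robots.txt. The following is a minimal sketch using Python's standard-library `urllib.robotparser`, not the method used for the figures above; the rules, the `find_violations` helper, and the observed requests are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def find_violations(robots_txt, requests):
    """Return the (agent, path) pairs that robots_txt disallows.

    `requests` is an iterable of (agent_token, path) tuples, for example
    produced by a log parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent, path in requests
        if not parser.can_fetch(agent, path)
    ]


# Hypothetical rules and observations for illustration.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /private/
"""

observed = [
    ("Bytespider", "/blog/post-1"),
    ("GPTBot", "/blog/post-1"),
    ("GPTBot", "/private/report"),
    ("PerplexityBot", "/private/report"),
]

for agent, path in find_violations(ROBOTS_TXT, observed):
    print(f"{agent} requested disallowed path {path}")
```

Crawlers may cache robots.txt for some time, so requests logged shortly after a rule change are not necessarily violations. Spoofed user-agent strings can also make a compliant crawler look non-compliant.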

OpenAI’s official GPTBot documentation is at platform.openai.com/docs/gptbot. Common Crawl’s CCBot dataset is documented at commoncrawl.org and is acknowledged as one of the most widely used pre-training sources for large language models in published LLM research.

See which AI crawlers are already visiting your site and turn that data into visibility with PulseRank, your WordPress AI Analytics Plugin.

Frequently Asked Questions

We’ve anticipated your concerns and engineered solutions for each one.

Q: What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI’s background crawler that indexes web content for future retrieval. ChatGPT-User is the user-agent used when a ChatGPT user actively triggers the browsing feature to look up a specific URL.

Both respect robots.txt.

Q: Should I block Google-Extended?

Only if you want to prevent your content from being used in Google’s AI features (Gemini, AI Overviews).

This may reduce Gemini citations. Most sites should leave it allowed unless they have a specific policy reason to block it.

Q: How do I add new crawlers to this list?

Crawlers are identified by monitoring server logs across many WordPress sites. If you find an undocumented user-agent, check the IP range’s WHOIS and any documentation the operator publishes.

Submit findings to discussions in the AI/SEO community.

Q: Is WordPress AI analytics data private with PulseRank?

Yes. Your data stays on your site. PulseRank stores analytics locally in your WordPress database with no external dashboards and no data sent to third parties.

It’s GDPR-ready with IP hashing, export/delete tools, and configurable data retention controls.

Q: Is CCBot really used for AI training?

Yes. Common Crawl’s dataset is one of the most widely used pre-training datasets for large language models.

Many major LLMs (GPT series, LLaMA, Mistral, etc.) were trained on Common Crawl data. Blocking CCBot won’t remove your content from existing trained models, but it limits future inclusion.

Q: What is PulseRank’s approach to detecting unknown crawlers?

In addition to user-agent matching, PulseRank applies behavioral heuristics to identify bot-like sessions that do not match any known user-agent.
