AI Crawler User Agent List: Complete Reference for 2026: GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, CCBot, and more.
There are two categories: AI search/browse crawlers (index content for real-time AI answers) and AI training crawlers (collect data to train models).
Blocking search/browse crawlers reduces your AI citation visibility. Blocking training crawlers has minimal direct impact on current AI search.
This list is updated as new crawlers are identified. Last reviewed: February 2026.
How to Use This List
- For robots.txt: Use the User-agent: value exactly as shown in the “User-agent string contains” column.
- For server log analysis: Search for the string in the “User-agent string contains” column.
- For WordPress bot detection (PulseRank): PulseRank uses this list automatically. You do not need to configure anything manually.
AI Search and Browse Crawlers
These crawlers index content to power real-time AI assistant answers. If you block them, your content is unlikely to appear in AI-generated responses.
| Bot name | Operator | User-agent string contains | Purpose |
|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Web crawling for ChatGPT search and retrieval |
| OAI-SearchBot | OpenAI | OAI-SearchBot | SearchGPT (OpenAI’s search product) |
| PerplexityBot | Perplexity AI | PerplexityBot | Perplexity search and answer generation |
| Bingbot / Copilot | Microsoft | Bingbot | Bing search + Microsoft Copilot retrieval |
| Applebot | Apple | Applebot | Apple search + Siri knowledge |
| Applebot-Extended | Apple | Applebot-Extended | Apple AI feature indexing |
| YouBot | You.com | YouBot | You.com AI search |
| PhindBot | Phind | PhindBot | Phind AI search for developers |
| Googlebot | Google | Googlebot | Google Search + Gemini retrieval |
| Google-Extended | Google | Google-Extended | Google AI training and Gemini features |
| DuckDuckBot | DuckDuckGo | DuckDuckBot | DuckDuckGo search (used in some AI integrations) |
AI Training Crawlers
These crawlers collect content to train or fine-tune AI models. Blocking them does not directly prevent current AI assistants from citing your content, but limits future training data use.
| Bot name | Operator | User-agent string contains | Purpose |
|---|---|---|---|
| CCBot | Common Crawl | CCBot | General web crawl used in many AI training datasets |
| anthropic-ai | Anthropic | anthropic-ai | Anthropic training data collection |
| ClaudeBot | Anthropic | ClaudeBot | Claude AI crawling (training + retrieval) |
| Bytespider | ByteDance | Bytespider | TikTok parent company crawler |
| PetalBot | Huawei | PetalBot | Huawei search and AI data collection |
| Diffbot | Diffbot | Diffbot | AI-powered data extraction service |
| Omgilibot | Webz.io | Omgilibot | AI content aggregation |
| magpie-crawler | Magpie | magpie-crawler | AI data collection |
| Brightbot | BrightEdge | Brightbot | AI SEO and training data |
| DataForSeoBot | DataForSEO | DataForSeoBot | SEO data collection with AI applications |
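The two tables above can be turned into a small lookup. The following is a minimal sketch, not part of PulseRank: `classify_ua` is a hypothetical helper name, and the grouping simply mirrors the tables in this reference. It lowercases the input first because some crawlers send a different capitalization than their listed name (Bing's crawler, for example, identifies itself as `bingbot`).

```shell
# classify_ua UA_STRING (hypothetical helper): print "search", "training",
# or "unknown" according to the two tables in this reference.
classify_ua() {
  local ua
  # Lowercase the input so "bingbot/2.0" and "Bingbot" both match.
  ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ua" in
    *gptbot*|*oai-searchbot*|*perplexitybot*|*bingbot*|*applebot*|*youbot*|*phindbot*|*googlebot*|*google-extended*|*duckduckbot*)
      echo "search" ;;
    *ccbot*|*anthropic-ai*|*claudebot*|*bytespider*|*petalbot*|*diffbot*|*omgilibot*|*magpie-crawler*|*brightbot*|*dataforseobot*)
      echo "training" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, `classify_ua "CCBot/2.0 (https://commoncrawl.org/faq/)"` prints `training`. The `*applebot*` pattern also covers Applebot-Extended, since both appear in the same table.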
Unknown / Shadow AI Crawlers
These user-agents have been identified in server logs of WordPress sites but do not have publicly documented policies. Treat with caution.
| User-agent string | Notes |
|---|---|
| ChatGPT-User | Used by ChatGPT’s browsing feature (not the training crawler) |
| cohere-ai | Cohere LLM training crawler |
| AI2Bot | Allen Institute for AI crawler |
| img2dataset | Image dataset collection bot |
| Scrapy | Python scraping framework; often used by AI data collectors |
If you encounter unfamiliar user-agents in your logs, check for documentation on the crawling organization’s website before deciding to block.
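One way to surface unfamiliar user-agents is to list the bot-like strings in your log that match nothing in this reference. The sketch below assumes the default "combined" access log format; `unfamiliar_agents` is a hypothetical helper, and the `bot|crawl|spider|scrape` filter is a rough assumption about what "bot-like" means, so expect both false positives and misses.

```shell
# unfamiliar_agents LOGFILE (hypothetical helper): count bot-like user-agents
# that match none of the crawler tokens listed in this reference.
unfamiliar_agents() {
  local known='gptbot|oai-searchbot|chatgpt-user|perplexitybot|bingbot|applebot|youbot|phindbot'
  known="$known|googlebot|google-extended|duckduckbot|ccbot|anthropic-ai|claudebot|bytespider"
  known="$known|petalbot|diffbot|omgilibot|magpie-crawler|brightbot|dataforseobot"
  known="$known|cohere-ai|ai2bot|img2dataset|scrapy"
  # Splitting on double quotes, field 6 is the user-agent in the combined log format.
  awk -F'"' '{print $6}' "$1" \
    | grep -viE "$known" \
    | grep -iE 'bot|crawl|spider|scrape' \
    | sort | uniq -c | sort -rn
}
```

Running `unfamiliar_agents /var/log/nginx/access.log` prints one line per unlisted user-agent, most frequent first, which gives you a short list to research before deciding whether to block.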
Robots.txt Quick Reference
Allow all AI search crawlers, block training crawlers
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: PetalBot
Disallow: /
How to Identify AI Crawlers in WordPress Server Logs
If you have access to raw access logs, filter for known AI crawler strings:
Apache / Nginx (bash)
# Show client IP and timestamp, request line, and user-agent for AI crawler requests
# (assumes the default "combined" log format)
grep -E "GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|anthropic-ai|Applebot|CCBot|Bytespider" /var/log/nginx/access.log | awk -F'"' '{print $1, $2, $6}'
# Count requests per bot over the period covered by the log file
# (use a rotated or pre-filtered log to limit this to the last 30 days)
grep -oE "GPTBot|PerplexityBot|ClaudeBot|CCBot" /var/log/nginx/access.log | \
sort | uniq -c | sort -rn
AI Crawler User Agents in WordPress (via PulseRank)
PulseRank matches against this list automatically and displays crawler activity per bot, per page, per day — without requiring server log access.
AI Crawler User Agent Crawl Rates
As of early 2026, typical crawl behavior observed across WordPress sites:
- GPTBot — typically crawls 3–8 pages per day on most small-to-medium sites. Re-crawls vary from weekly to monthly depending on content update frequency.
- PerplexityBot — lower crawl volume than GPTBot; tends to focus on specific pages rather than full-site crawls.
- OAI-SearchBot — higher crawl frequency than GPTBot; appears to prioritize recently updated pages.
- CCBot — large batch crawls; may visit hundreds of pages in a single session.
- ClaudeBot / anthropic-ai — relatively recent crawler; crawl volumes are growing.
These are observational estimates, not official figures. Use PulseRank or server logs to see the actual crawl rate for your specific site.
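To check these estimates against your own logs, you can count requests per bot per day. This is a minimal sketch assuming the "combined" log format; `crawl_rate` is a hypothetical helper, and it counts requests rather than unique pages, so repeated fetches of the same URL are counted each time.

```shell
# crawl_rate LOGFILE (hypothetical helper): print "count day bot" for each
# (day, bot) pair found in a combined-format access log.
crawl_rate() {
  awk '
    # Skip lines that mention none of the tracked bots.
    !match($0, /GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|CCBot/) { next }
    {
      bot = substr($0, RSTART, RLENGTH)
      # $4 looks like [01/Feb/2026:10:00:00 so characters 2-12 are the day.
      day = substr($4, 2, 11)
      count[day " " bot]++
    }
    END { for (key in count) print count[key], key }
  ' "$1" | sort -k2,2 -k3,3
}
```

The output is sorted by the day string and then by bot name. The day sort is lexical, so it is only chronological within a single month.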
AI Crawler User Agent Compliance with robots.txt
| Crawler | Honors robots.txt | Documentation |
|---|---|---|
| GPTBot | Yes (documented) | OpenAI docs |
| OAI-SearchBot | Yes (documented) | OpenAI docs |
| PerplexityBot | Yes (documented) | Perplexity AI docs |
| ClaudeBot / anthropic-ai | Yes (documented) | Anthropic docs |
| Google-Extended | Yes | Google developer docs |
| CCBot | Yes | Common Crawl docs |
| Bytespider | Claimed yes; inconsistent in practice | ByteDance docs |
| PetalBot | Yes | Huawei docs |
| Unknown scrapers | Often no | N/A |
AI Crawler User Agent Changelog
| Date | Change |
|---|---|
| February 2026 | Added ChatGPT-User (browse feature), updated Applebot-Extended notes |
| January 2026 | Added cohere-ai, AI2Bot to shadow crawlers section |
| November 2025 | Added OAI-SearchBot (SearchGPT launch) |
| September 2025 | Updated crawl rate estimates; added PhindBot |
From Our Testing: Crawl Behavior Observed in the Wild
Based on server log analysis across hundreds of WordPress sites via PulseRank:
- GPTBot is consistently the highest-volume AI crawler across all site niches, followed by PerplexityBot and OAI-SearchBot on pages covering AI, SaaS, and technical topics.
- CCBot makes large batch crawls — we observed it requesting 200–400 pages per day on sites it visits infrequently, then going quiet for weeks.
- ClaudeBot approximately doubled its crawl rate between Q3 2025 and Q1 2026 on the sites we monitor, suggesting Anthropic is scaling its retrieval infrastructure significantly.
- Bytespider claims robots.txt compliance, but we observed it accessing disallowed paths on 3 of 8 test sites within 30 days of applying a block. Treat it like a scraper until proven otherwise.
- Unknown user-agents account for 5–12% of bot traffic across sites we monitor. Many appear to be AI data collection tools using generic or spoofed strings.
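A compliance check like the Bytespider observation above can be approximated from your own logs by listing requests a bot made to a path you have disallowed. The sketch below is an illustration under assumptions, not the method used for the figures above: `disallowed_hits` is a hypothetical helper, it assumes the "combined" log format, and it treats a disallow rule as a simple path prefix (no wildcard handling). Requests for `/robots.txt` itself are excluded, since fetching that file is expected behavior.

```shell
# disallowed_hits LOGFILE BOT PREFIX (hypothetical helper): print the timestamp
# and path of each request by BOT whose path starts with the disallowed PREFIX.
disallowed_hits() {
  local log="$1" bot="$2" prefix="$3"
  awk -v bot="$bot" -v prefix="$prefix" '
    # $7 is the request path in the combined log format.
    index($0, bot) && $7 != "/robots.txt" && index($7, prefix) == 1 { print $4, $7 }
  ' "$log"
}
```

For example, `disallowed_hits /var/log/nginx/access.log Bytespider /` lists every Bytespider request other than `/robots.txt`. Only count hits dated after the block was applied, and allow some time for the crawler to re-fetch robots.txt.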
OpenAI’s official GPTBot documentation is at platform.openai.com/docs/gptbot. Common Crawl’s CCBot dataset is documented at commoncrawl.org and is acknowledged as one of the most widely used pre-training sources for large language models in published LLM research.
See which AI crawlers are already visiting your site and turn that data into visibility with PulseRank, your WordPress AI Analytics Plugin.
Frequently Asked Questions
We’ve anticipated your concerns and engineered solutions for each one.
Q: What is the difference between GPTBot and ChatGPT-User?
GPTBot is OpenAI’s background crawler that indexes web content for future retrieval. ChatGPT-User is the user-agent used when a ChatGPT user actively triggers the browsing feature to look up a specific URL. Both respect robots.txt.
Q: Should I block Google-Extended?
Only if you want to prevent your content from being used in Google’s AI features (Gemini, AI Overviews). Blocking it may reduce Gemini citations. Most sites should leave it allowed unless they have a specific policy reason to block it.
Q: How are new AI crawlers identified?
Crawlers are identified by monitoring server logs across many WordPress sites. If you find an undocumented user-agent, check the IP range’s WHOIS record and any documentation the operator publishes. Consider sharing your findings in AI/SEO community discussions.
Q: Does PulseRank keep my data private?
Yes. Your data stays on your site. PulseRank stores analytics locally in your WordPress database with no external dashboards and no data sent to third parties. It’s GDPR-ready with IP hashing, export/delete tools, and configurable data retention controls.
Q: Is CCBot really used for AI training?
Yes. Common Crawl’s dataset is one of the most widely used pre-training datasets for large language models. Many major LLMs (GPT series, LLaMA, Mistral, etc.) were trained on Common Crawl data. Blocking CCBot won’t remove your content from existing trained models, but it limits future inclusion.
Q: How does PulseRank detect crawlers that are not on this list?
In addition to user-agent matching, PulseRank applies behavioral heuristics to identify bot-like sessions that do not match any known user-agent.
