AI Crawler User Agent List: Complete Reference for 2026

The complete reference for GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, CCBot, and more.

There are two categories: AI search/browse crawlers (index content for real-time AI answers) and AI training crawlers (collect data to train models).

Blocking search/browse crawlers reduces your AI citation visibility. Blocking training crawlers has minimal direct impact on current AI search.

This list is updated as new crawlers are identified. Last reviewed: February 2026.

How to Use This List

  • For robots.txt: Use the value from the “User-agent string contains” column, exactly as shown, as the User-agent: line of your rule.
  • For server log analysis: Search for the string in the “User-agent string contains” column.
  • For WordPress bot detection (PulseRank): PulseRank uses this list automatically. You do not need to configure anything manually.

AI Search / Browse Crawlers

These crawlers index content to power real-time AI assistant answers. If you block them, your content is unlikely to appear in AI-generated responses.

| Bot name | Operator | User-agent string contains | Purpose |
| --- | --- | --- | --- |
| GPTBot | OpenAI | GPTBot | Web crawling for ChatGPT search and retrieval |
| OAI-SearchBot | OpenAI | OAI-SearchBot | SearchGPT (OpenAI’s search product) |
| PerplexityBot | Perplexity AI | PerplexityBot | Perplexity search and answer generation |
| Bingbot / Copilot | Microsoft | Bingbot | Bing search + Microsoft Copilot retrieval |
| Applebot | Apple | Applebot | Apple search + Siri knowledge |
| Applebot-Extended | Apple | Applebot-Extended | Apple AI feature indexing |
| YouBot | You.com | YouBot | You.com AI search |
| PhindBot | Phind | PhindBot | Phind AI search for developers |
| Googlebot | Google | Googlebot | Google Search + Gemini retrieval |
| Google-Extended | Google | Google-Extended | Google AI training and Gemini features |
| DuckDuckBot | DuckDuckGo | DuckDuckBot | DuckDuckGo search (used in some AI integrations) |

AI Training Crawlers

These crawlers collect content to train or fine-tune AI models. Blocking them does not directly prevent current AI assistants from citing your content, but limits future training data use.

| Bot name | Operator | User-agent string contains | Purpose |
| --- | --- | --- | --- |
| CCBot | Common Crawl | CCBot | General web crawl used in many AI training datasets |
| anthropic-ai | Anthropic | anthropic-ai | Anthropic training data collection |
| ClaudeBot | Anthropic | ClaudeBot | Claude AI crawling (training + retrieval) |
| Bytespider | ByteDance | Bytespider | TikTok parent company crawler |
| PetalBot | Huawei | PetalBot | Huawei search and AI data collection |
| Diffbot | Diffbot | Diffbot | AI-powered data extraction service |
| Omgilibot | Webz.io | Omgilibot | AI content aggregation |
| magpie-crawler | Magpie | magpie-crawler | AI data collection |
| Brightbot | BrightEdge | Brightbot | AI SEO and training data |
| DataForSeoBot | DataForSEO | DataForSeoBot | SEO data collection with AI applications |

Unknown / Shadow AI Crawlers

These user-agents have been identified in server logs of WordPress sites but do not have publicly documented policies. Treat with caution.

| User-agent string | Notes |
| --- | --- |
| ChatGPT-User | Used by ChatGPT’s browsing feature (not the training crawler) |
| cohere-ai | Cohere LLM training crawler |
| AI2Bot | Allen Institute for AI crawler |
| img2dataset | Image dataset collection bot |
| Scrapy | Python scraping framework; often used by AI data collectors |

If you encounter unfamiliar user-agents in your logs, check for documentation on the crawling organization’s website before deciding to block.
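As a starting point, that review can be scripted: extract every user-agent from the access log, filter out the bots already covered by the tables above, and inspect whatever is left. This is a hedged sketch using a stand-in sample log and an illustrative known-bot pattern, not PulseRank’s actual detection logic; it assumes the Apache/Nginx “combined” log format, where the quoted user-agent is the 6th double-quote-delimited field.

```shell
# Stand-in sample log (on a server, use /var/log/nginx/access.log):
cat > /tmp/ua_sample.log <<'EOF'
1.2.3.4 - - [10/Feb/2026:12:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
5.6.7.8 - - [10/Feb/2026:12:01:00 +0000] "GET /b HTTP/1.1" 200 512 "-" "MysteryBot/1.0"
5.6.7.8 - - [10/Feb/2026:12:02:00 +0000] "GET /c HTTP/1.1" 200 512 "-" "MysteryBot/1.0"
EOF

# Illustrative (not exhaustive) pattern of known bots from this page:
known="GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|anthropic-ai|CCBot|Bytespider|Googlebot|Bingbot|Applebot"

# Print each unrecognized user-agent with its request count.
awk -F'"' '{print $6}' /tmp/ua_sample.log \
  | grep -viE "$known" \
  | sort | uniq -c | sort -rn
# prints the count followed by the user-agent (here: 2 MysteryBot/1.0)
```

Anything that surfaces here with meaningful volume is worth a WHOIS lookup before you decide to block it.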

Robots.txt Quick Reference

Allow all AI search crawlers, block training crawlers:

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PetalBot
Disallow: /
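Once a file like this is deployed, a quick sanity check is to confirm each bot’s User-agent line is actually followed by the rule you intended. The sketch below only inspects adjacent lines in a stand-in local copy; real robots.txt evaluation is more involved (grouping, longest-match precedence), so treat this as a smoke test, not a parser.

```shell
# Stand-in local copy of a robots.txt like the example above:
cat > /tmp/robots_check.txt <<'EOF'
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /
EOF

# Print the rule that immediately follows each bot's User-agent line.
for bot in GPTBot CCBot; do
  rule=$(grep -A1 "^User-agent: $bot$" /tmp/robots_check.txt | tail -n 1)
  echo "$bot -> $rule"
done
# prints:
# GPTBot -> Allow: /
# CCBot -> Disallow: /
```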

How to Identify AI Crawlers in WordPress Server Logs

If you have access to raw access logs, filter for known AI crawler strings:

Apache / Nginx (bash)

# Show all AI crawler requests with IP, timestamp, and URL
grep -E "GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|anthropic-ai|Applebot|CCBot|Bytespider" /var/log/nginx/access.log | awk '{print $1, $4, $7}'

Count requests per bot

# Counts every entry in the file; rotate or date-filter the log
# first if you want a specific window (e.g. the last 30 days).
grep -oE "GPTBot|PerplexityBot|ClaudeBot|CCBot" /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn

AI Crawler Detection in WordPress (via PulseRank)

PulseRank matches against this list automatically and displays crawler activity per bot, per page, per day — without requiring server log access.

AI Crawler Crawl Rates

As of early 2026, typical crawl behavior observed across WordPress sites:

  • GPTBot — typically crawls 3–8 pages per day on most small-to-medium sites. Re-crawls vary from weekly to monthly depending on content update frequency.
  • PerplexityBot — lower crawl volume than GPTBot; tends to focus on specific pages rather than full-site crawls.
  • OAI-SearchBot — re-crawls pages more frequently than GPTBot despite lower overall volume; appears to prioritize recently updated pages.
  • CCBot — large batch crawls; may visit hundreds of pages in a single session.
  • ClaudeBot / anthropic-ai — relatively recent crawler; crawl volumes are growing.

These are observational estimates, not official figures. Use PulseRank or server logs to see the actual crawl rate for your specific site.
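If you have log access, measuring the rate yourself is straightforward: count hits per bot per calendar day. A sketch against a stand-in sample log in combined format (field 4 holds the timestamp, e.g. `[10/Feb/2026:12:00:00`):

```shell
# Stand-in sample log (on a server, use /var/log/nginx/access.log):
cat > /tmp/rate_sample.log <<'EOF'
1.2.3.4 - - [10/Feb/2026:12:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.0"
1.2.3.4 - - [10/Feb/2026:14:00:00 +0000] "GET /b HTTP/1.1" 200 512 "-" "GPTBot/1.0"
1.2.3.4 - - [11/Feb/2026:09:00:00 +0000] "GET /c HTTP/1.1" 200 512 "-" "GPTBot/1.0"
EOF

# Strip the "[" and the time-of-day from field 4, leaving the day,
# then count GPTBot hits per day.
grep "GPTBot" /tmp/rate_sample.log \
  | awk '{sub(/\[/, "", $4); split($4, d, ":"); print d[1]}' \
  | sort | uniq -c
# prints one count per day, e.g. 2 for 10/Feb/2026 and 1 for 11/Feb/2026
```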

AI Crawler Compliance with robots.txt

| Crawler | Honors robots.txt | Documentation |
| --- | --- | --- |
| GPTBot | Yes (documented) | OpenAI docs |
| OAI-SearchBot | Yes (documented) | OpenAI docs |
| PerplexityBot | Yes (documented) | Perplexity AI docs |
| ClaudeBot / anthropic-ai | Yes (documented) | Anthropic docs |
| Google-Extended | Yes | Google developer docs |
| CCBot | Yes | Common Crawl docs |
| Bytespider | Claimed yes; inconsistent in practice | ByteDance docs |
| PetalBot | Yes | Huawei docs |
| Unknown scrapers | Often no | N/A |

Changelog

| Date | Change |
| --- | --- |
| February 2026 | Added ChatGPT-User (browse feature); updated Applebot-Extended notes |
| January 2026 | Added cohere-ai and AI2Bot to shadow crawlers section |
| November 2025 | Added OAI-SearchBot (SearchGPT launch) |
| September 2025 | Updated crawl rate estimates; added PhindBot |

From Our Testing: Crawl Behavior Observed in the Wild

Based on server log analysis across hundreds of WordPress sites via PulseRank:

  • GPTBot is consistently the highest-volume AI crawler across all site niches, followed by PerplexityBot and OAI-SearchBot on pages covering AI, SaaS, and technical topics.
  • CCBot makes large batch crawls — we observed it requesting 200–400 pages per day on sites it visits infrequently, then going quiet for weeks.
  • ClaudeBot has approximately doubled its crawl rate between Q3 2025 and Q1 2026 on the sites we monitor, suggesting Anthropic is scaling its retrieval infrastructure significantly.
  • Bytespider claimed robots.txt compliance but we observed it accessing disallowed paths on 3 of 8 test sites within 30 days of applying a block. Treat it like a scraper until proven otherwise.
  • Unknown user-agents account for 5–12% of bot traffic across sites we monitor. Many appear to be AI data collection tools using generic or spoofed strings.
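The Bytespider observation above can be reproduced on your own logs: once a Disallow rule has taken effect, any hit from that bot to a disallowed path is a potential violation. A sketch using a stand-in sample log and a hypothetical disallowed path /private/:

```shell
# Sample log; /private/ stands in for a path disallowed in robots.txt.
cat > /tmp/block_sample.log <<'EOF'
9.9.9.9 - - [10/Feb/2026:12:00:00 +0000] "GET /private/report HTTP/1.1" 200 512 "-" "Bytespider"
9.9.9.9 - - [10/Feb/2026:12:05:00 +0000] "GET /public/page HTTP/1.1" 200 512 "-" "Bytespider"
EOF

# Any line printed here is a request from the bot to a disallowed path.
# ($2 with -F'"' is the request line, e.g. "GET /private/report HTTP/1.1")
grep "Bytespider" /tmp/block_sample.log | awk -F'"' '$2 ~ "/private/"'
```

Restrict the search to entries dated after the block was deployed before drawing conclusions, since crawlers may take days to re-fetch robots.txt.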

OpenAI’s official GPTBot documentation is at platform.openai.com/docs/gptbot. Common Crawl’s CCBot dataset is documented at commoncrawl.org and is acknowledged as one of the most widely used pre-training sources for large language models in published LLM research.

See which AI crawlers are already visiting your site and turn that data into visibility with PulseRank, your WordPress AI Analytics Plugin.

Frequently Asked Questions

Common questions about AI crawlers and how PulseRank handles them.

Q: What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI’s background crawler that indexes web content for future retrieval. ChatGPT-User is the user-agent used when a ChatGPT user actively triggers the browsing feature to look up a specific URL.

Both respect robots.txt.

Q: Should I block Google-Extended?

Only if you want to prevent your content from being used in Google’s AI features (Gemini, AI Overviews).

Blocking it may reduce Gemini citations. Most sites should leave it allowed unless they have a specific policy reason to block it.

Q: How are new crawlers added to this list?

Crawlers are identified by monitoring server logs across many WordPress sites. If you find an undocumented user-agent, check the IP range’s WHOIS and any documentation the operator publishes.

Submit findings to discussions in the AI/SEO community.

Q: Is CCBot really used for AI training?

Yes. Common Crawl’s dataset is one of the most widely used pre-training datasets for large language models.

Many major LLMs (GPT series, LLaMA, Mistral, etc.) were trained on Common Crawl data. Blocking CCBot won’t remove your content from existing trained models, but it limits future inclusion.

Q: Does PulseRank keep my analytics data private?

Yes. Your data stays on your site. PulseRank stores analytics locally in your WordPress database with no external dashboards and no data sent to third parties.

It’s GDPR-ready with IP hashing, export/delete tools, and configurable data retention controls.

Q: Can PulseRank detect bots that don’t use a known user-agent?

Yes. In addition to user-agent matching, PulseRank applies behavioral heuristics to identify bot-like sessions that do not match any known user-agent.
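PulseRank’s heuristics aren’t documented publicly, but one common behavioral signal is request volume: an IP issuing far more requests than a human session plausibly would. A minimal, hypothetical sketch of that idea against a stand-in sample log:

```shell
# Sample log: 5.6.7.8 makes 3 requests, 1.2.3.4 makes 1.
cat > /tmp/heur_sample.log <<'EOF'
5.6.7.8 - - [10/Feb/2026:12:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [10/Feb/2026:12:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [10/Feb/2026:12:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/5.0"
1.2.3.4 - - [10/Feb/2026:13:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF

# Hypothetical threshold t: flag any IP with more than t requests,
# regardless of what user-agent it claims.
awk -v t=2 '{count[$1]++} END {for (ip in count) if (count[ip] > t) print ip, count[ip]}' /tmp/heur_sample.log
# prints: 5.6.7.8 3
```

Real heuristics would also weigh request timing, path patterns, and asset fetching; this only illustrates the general approach.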

Stop Guessing. Start Measuring.

Join WordPress sites already using PulseRank to uncover their AI traffic and optimize for the future of search.