Common Crawl · Open dataset

CCBot

CCBot is the crawler operated by the non-profit Common Crawl, which publishes free, openly available snapshots of the web. That dataset is one of the most widely used training corpora for large language models, so CCBot indirectly feeds many AI systems at once. CCBot respects robots.txt, so blocking it keeps your content out of future Common Crawl snapshots that many model builders draw from.

User-agent token: CCBot
Operator: Common Crawl
Feeds: Common Crawl dataset (feeds many LLMs)
robots.txt: Unverified

How to control CCBot in robots.txt

Edit the robots.txt file at the root of your domain (for example https://example.com/robots.txt), add one of the groups below, then save and re-deploy. Remember: a crawler follows only the most specific group that matches it, so a named User-agent: CCBot group replaces your global User-agent: * rules rather than adding to them. Repeat any private Disallow paths inside the CCBot group.
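The override behavior is easy to miss, so here is a minimal sketch of it using Python's standard urllib.robotparser module. The robots.txt content, the /drafts/ path, and the ExampleBot name are hypothetical and chosen only for illustration; the CCBot group deliberately does not repeat the global Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a global group plus a CCBot group that
# does NOT repeat the global Disallow rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: CCBot
Disallow: /drafts/
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "ExampleBot"):
        for path in ("/admin/", "/drafts/"):
            url = f"https://example.com{path}"
            print(agent, path, can_fetch(agent, url))
```

In this sketch CCBot is allowed into /admin/ because its own group never mentions that path, while ExampleBot, which falls back to the global group, is kept out. Copying Disallow: /admin/ into the CCBot group would close the gap.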

Allow CCBot (recommended for AI visibility)

# Welcome CCBot, but keep private areas blocked.
# A named user-agent group overrides "User-agent: *", so repeat
# your own private Disallow rules inside this group.
User-agent: CCBot
Allow: /
Disallow: /admin/
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/
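The group above mixes Allow: / with narrower Disallow lines. Under the Robots Exclusion Protocol (RFC 9309), the rule with the longest matching path wins, and Allow wins a tie. The following is a rough sketch of that precedence, evaluated by hand so the rule is explicit; the is_allowed helper and CCBOT_GROUP name are hypothetical, and wildcard patterns (* and $) are not handled.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest-match evaluation for a single robots.txt group.

    `rules` holds (directive, path_prefix) pairs. Wildcards are not
    supported in this sketch; only plain prefix matching is done.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The CCBot group from the example above, as (directive, path) pairs.
CCBOT_GROUP = [
    ("Allow", "/"),
    ("Disallow", "/admin/"),
    ("Disallow", "/account/"),
    ("Disallow", "/cart/"),
    ("Disallow", "/checkout/"),
]

if __name__ == "__main__":
    for path in ("/blog/post", "/admin/users", "/cart/", "/cartoons/"):
        print(path, is_allowed(path, CCBOT_GROUP))
```

Because an unmatched path is allowed by default, the Allow: / line mainly documents intent; the Disallow lines still apply since their paths are longer than "/".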

Block CCBot

# Block CCBot from the entire site.
User-agent: CCBot
Disallow: /

FAQ

Why does blocking CCBot matter for AI?

Because the Common Crawl dataset that CCBot builds is reused by many LLM training pipelines. Blocking CCBot keeps your pages out of future crawl snapshots, and therefore out of any model trained on those snapshots. It does not remove pages already captured in previously published snapshots.

Does CCBot follow robots.txt?

Yes. Common Crawl documents that CCBot honors robots.txt directives.

Part of the Cite Hustle AI crawler directory. For the full framework on AI search visibility, read the GEO methodology.