robots.txt Guide: Control Which Crawlers Access Your Site (and Block AI Scrapers)

robots.txt is a plain text file at the root of your domain. It uses a simple protocol to tell web crawlers which pages they can and cannot access. It's not enforced technically — any crawler can ignore it — but Google, Bing, and most reputable bots respect it.

Getting robots.txt right is foundational technical SEO. Getting it wrong can accidentally de-index your entire site or leave paths exposed that you'd rather not have indexed.

The Basic Syntax

User-agent: *
Allow: /
Disallow: /admin

Sitemap: https://yoursite.com/sitemap.xml

User-agent — which crawler this rule applies to. * means all crawlers.
Allow — explicitly permit this path (useful to override a broader Disallow).
Disallow — block crawlers from this path.
Sitemap — tell crawlers where your sitemap is. Always include this.

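For Google and other crawlers that follow RFC 9309, the order of rules within a group does not matter. The rule with the longest matching path wins, and when an Allow and a Disallow match equally, Allow wins. Some older crawlers apply the first matching rule instead, so avoid writing files that only work under one interpretation.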
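The precedence used by RFC 9309 compliant crawlers (longest matching path wins, Allow wins a tie) can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function name `is_allowed` and the `(directive, prefix)` rule format are assumptions made for the example, and wildcards are deliberately ignored.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs, for example
    ("disallow", "/admin"). Hypothetical format, plain prefix matching only.
    """
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("allow", "/"), ("disallow", "/admin")]
print(is_allowed("/blog/post", rules))    # True
print(is_allowed("/admin/users", rules))  # False

# Reordering the rules does not change the outcome.
print(is_allowed("/admin/users", list(reversed(rules))))  # False
```

Because the decision depends on path length rather than position, `Allow: /` followed by `Disallow: /admin` behaves the same as the reverse order.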

Path Matching

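| Pattern | Matches |
|---|---|
| /admin | Any path that starts with /admin: /admin, /admin/users, /admin/settings, and also /administrator |
| /admin/ | /admin/ and everything below it, but not /admin itself |
| /*.json | Any URL containing .json; use /*.json$ to match only URLs that end in .json |
| /api/* | Everything under /api/ (equivalent to /api/) |
| / | The entire site |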

Wildcards: * matches any sequence of characters. $ anchors the end of a URL.
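To see how `*` and `$` behave, one approach is to translate a pattern into a regular expression. The sketch below is an approximation for experimenting with patterns, not the matcher any particular crawler uses; the helper names `pattern_to_regex` and `matches` are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start of the path, mirroring prefix matching.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/admin", "/admin/users"))        # True
print(matches("/*.json", "/data/feed.json"))    # True
print(matches("/*.json", "/data/feed.jsonl"))   # True
print(matches("/*.json$", "/data/feed.jsonl"))  # False
print(matches("/*?", "/search?q=robots"))       # True
```

The third and fourth lines show why the `$` anchor matters: without it, a pattern matches any URL that merely contains the text.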

Blocking Specific Crawlers

Set different rules for different bots by using multiple User-agent blocks:

# Allow Google everywhere
User-agent: Googlebot
Allow: /

# Block Bing from the API
User-agent: Bingbot
Disallow: /api

# Block everything for a specific bot
User-agent: BadBot
Disallow: /
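A quick way to check per-bot rules before deploying is Python's standard-library `urllib.robotparser`. The sketch below feeds it the three groups above and asks what each bot may fetch; the URLs are placeholders. Keep in mind that this parser uses plain prefix matching and does not interpret `*` or `$` inside paths, so it is only a rough stand-in for how Google or Bing evaluate a file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Disallow: /api

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("Googlebot", "https://yoursite.com/api/users"),
    ("Bingbot", "https://yoursite.com/api/users"),
    ("Bingbot", "https://yoursite.com/blog"),
    ("BadBot", "https://yoursite.com/"),
]
for agent, url in checks:
    print(agent, url, parser.can_fetch(agent, url))
```

With these rules the four checks come out as allowed, blocked, allowed, blocked. A bot that matches no group and finds no `*` group is allowed everywhere.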

Blocking AI Training Crawlers

Since 2023, AI companies have deployed crawlers to collect training data. Many site owners want to block them. Here are the main ones and their User-agent strings:

| Bot | Company | User-agent |
|---|---|---|
| GPTBot | OpenAI | GPTBot |
| ChatGPT-User | OpenAI | ChatGPT-User |
| ClaudeBot | Anthropic | ClaudeBot |
| CCBot | Common Crawl | CCBot |
| Amazonbot | Amazon | Amazonbot |
| Google-Extended | Google (Gemini) | Google-Extended |
| Bytespider | ByteDance | Bytespider |

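To block the most widely used of these crawlers (add matching groups for Amazonbot and Bytespider if you want to exclude them as well):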

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

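Important: This stops compliant crawlers from collecting new training data but does not retroactively remove your content from existing models. Note that ChatGPT-User is the agent ChatGPT uses when it fetches a page on a user's behalf, not a training crawler. Keep that group only if you also want to block ChatGPT's browsing feature; remove it if you only mean to opt out of training.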
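Since every group in the block list has the same shape, the list can be generated from a set of user-agent names, which makes it easier to maintain as new crawlers appear. The following is a small sketch; `AI_CRAWLERS` and `build_block_rules` are names invented for the example, and the list simply mirrors the groups shown above.

```python
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot", "Google-Extended"]


def build_block_rules(user_agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


print(build_block_rules(AI_CRAWLERS))
```

Append the output to your existing robots.txt rather than replacing it, so the rules for regular search crawlers stay in place.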

Common Patterns

Block admin and private areas

User-agent: *
Allow: /
Disallow: /admin
Disallow: /private
Disallow: /dashboard
Disallow: /api

Block entire site (maintenance / staging)

User-agent: *
Disallow: /

Prevent crawling of URLs with query parameters

User-agent: *
Disallow: /*?

Allow only specific bots

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

What robots.txt Does NOT Do

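It doesn't prevent pages from appearing in search results. If another site links to your blocked page, Google may still show the URL in results (without a description, since it can't crawl it). To keep a page out of the index, use a noindex meta tag and leave the page crawlable: a crawler that is blocked by robots.txt never fetches the page, so it never sees the tag.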
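To confirm that a page really carries a noindex directive, you can parse its HTML and look for a robots meta tag. The sketch below uses Python's standard `html.parser`; the class name `RobotsMetaParser` and the helper `has_noindex` are hypothetical, and the sample markup is made up for the example. It only inspects the generic `robots` meta tag, not bot-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```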

It doesn't protect sensitive data. robots.txt is public. Blocking /admin tells any human reading the file that /admin exists. For real security, use authentication.

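It doesn't guarantee that JavaScript-rendered content gets crawled. Googlebot can execute JavaScript, but rendering may be deferred or fail, and many other crawlers don't run scripts at all. If you disallow the script files a page depends on, Googlebot cannot render that page fully. Critical content shouldn't rely solely on JS rendering for crawlability.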

Deployment

Place robots.txt at exactly: https://yourdomain.com/robots.txt

It must be at the root of the host, not in a subdirectory. Crawlers never look for a file at yoursite.com/blog/robots.txt. Each subdomain counts as a separate host, so a blog hosted at blog.yoursite.com needs its own file at blog.yoursite.com/robots.txt.
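The location rule can be expressed as a small function that derives, for any page URL, the robots.txt URL a crawler would consult. This is a sketch using the standard `urllib.parse` module; the function name `robots_url` is made up for the example and the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including subdomain and port); drop path and query.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yoursite.com/blog/post-1"))
# https://yoursite.com/robots.txt
print(robots_url("https://blog.yoursite.com/post-1?ref=home"))
# https://blog.yoursite.com/robots.txt
```

The path of the page never influences the result; only the scheme and host do, which is why a file inside /blog/ has no effect.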

After deploying, verify with Google Search Console: Settings → robots.txt.

Generate Yours Now

Use our Robots.txt Generator to build your file visually — add per-bot rules, choose Allow/Disallow paths from common suggestions, block AI crawlers with one click, add your sitemap URL, and download the ready-to-deploy file.
