SEO · October 22, 2025

robots.txt Guide: Control Which Crawlers Access Your Site (and Block AI Scrapers)

robots.txt is the first file a well-behaved crawler reads. Get it wrong and Google can't crawl your site. Get it right and you tell crawlers exactly what they may access, including AI training bots.

By DevToolsLib Team·5 min read

robots.txt is a plain text file at the root of your domain. It uses a simple protocol to tell web crawlers which pages they can and cannot access. It's not enforced technically — any crawler can ignore it — but Google, Bing, and most reputable bots respect it.

Getting robots.txt right is foundational technical SEO. Getting it wrong can accidentally de-index your entire site or leave paths exposed that you'd rather not have indexed.

The Basic Syntax

User-agent: *
Allow: /
Disallow: /admin

Sitemap: https://yoursite.com/sitemap.xml
  • User-agent — which crawler this rule applies to. * means all crawlers.
  • Allow — explicitly permit this path (useful to override a broader Disallow).
  • Disallow — block crawlers from this path.
  • Sitemap — tell crawlers where your sitemap is. Always include this.

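Rule order within a group does not matter to crawlers that follow RFC 9309, including Googlebot: the most specific matching rule, meaning the one with the longest path, wins, and Allow wins a tie. Some simpler parsers apply the first matching rule instead, so list narrow rules before broad ones if you need to support them.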
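To make the "most specific rule wins" behavior concrete, here is a minimal Python sketch of longest-match evaluation as described in RFC 9309: among all rules whose path is a prefix of the URL path, the longest one decides, and Allow wins a tie. The function name `is_allowed` and the tuple format are illustrative assumptions, and wildcards are left out to keep it short.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled under one group's rules.

    `rules` is a list of (directive, prefix) tuples, e.g. ("disallow", "/admin").
    This hypothetical helper models longest-match precedence only;
    it does not handle the * and $ wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty path or a non-matching prefix never applies
        is_allow = directive.lower() == "allow"
        # A longer match always wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("allow", "/"), ("disallow", "/admin")]
print(is_allowed(rules, "/admin/users"))  # False: "/admin" is longer than "/"
print(is_allowed(rules, "/pricing"))      # True: only "/" matches
```

Not every parser works this way. Python's built-in `urllib.robotparser`, for example, applies the first matching rule in file order, so it would report `/admin/users` as allowed for the rules above. Listing Disallow lines before a broad `Allow: /` keeps both styles of parser in agreement.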

Path Matching

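| Pattern | Matches |
|---|---|
| /admin | Any path that starts with /admin: /admin, /admin/users, /admin/settings, and also /administrator |
| /admin/ | /admin/ and everything below it, but not /admin itself |
| /*.json | Any URL that contains .json; use /*.json$ to match only URLs that end in .json |
| /api/* | Everything under /api/ (equivalent to /api/) |
| / | The entire site |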

Wildcards: * matches any sequence of characters, including none. $ anchors the end of a URL. Without $, a pattern matches any URL that starts with it.
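The wildcard rules can be sketched as a translation from a robots.txt pattern to a regular expression. The helper names below (`robots_pattern_to_regex`, `pattern_matches`) are hypothetical, and this is a simplified model rather than any crawler's actual implementation. Note that a pattern without `$` is still a prefix match, so `/*.json` also matches a `.json` URL that carries a query string.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Hypothetical helper: "*" becomes ".*", a trailing "$" anchors the end,
    and everything else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    body = ".*".join(literal_parts)
    return re.compile(body + (r"\Z" if anchored else ""))


def pattern_matches(pattern, url_path):
    # re.match anchors at the start, which gives prefix semantics.
    return robots_pattern_to_regex(pattern).match(url_path) is not None


print(pattern_matches("/admin", "/admin/users"))         # True
print(pattern_matches("/admin", "/administrator"))       # True (plain prefix match)
print(pattern_matches("/admin/", "/admin"))              # False
print(pattern_matches("/*.json", "/data.json?page=2"))   # True (no end anchor)
print(pattern_matches("/*.json$", "/data.json?page=2"))  # False
print(pattern_matches("/*.json$", "/api/data.json"))     # True
```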

Blocking Specific Crawlers

Set different rules for different bots by using multiple User-agent blocks:

# Allow Google everywhere
User-agent: Googlebot
Allow: /

# Block Bing from the API
User-agent: Bingbot
Disallow: /api

# Block everything for a specific bot
User-agent: BadBot
Disallow: /
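One way to check per-bot rules before deploying is to feed the file to Python's standard-library `urllib.robotparser` and ask it what each user agent may fetch. This is a rough sanity check under that parser's semantics (it does not support wildcards), not a guarantee of how any particular crawler behaves. The `can_crawl` helper is an assumption made for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Allow Google everywhere
User-agent: Googlebot
Allow: /

# Block Bing from the API
User-agent: Bingbot
Disallow: /api

# Block everything for a specific bot
User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can_crawl(user_agent, path):
    # yoursite.com is the placeholder domain used throughout this post.
    return parser.can_fetch(user_agent, "https://yoursite.com" + path)


print(can_crawl("Googlebot", "/api/users"))  # True
print(can_crawl("Bingbot", "/api/users"))    # False
print(can_crawl("Bingbot", "/blog"))         # True
print(can_crawl("BadBot", "/blog"))          # False
print(can_crawl("OtherBot", "/api/users"))   # True: no group applies to it
```

Because this file has no `User-agent: *` group, any bot that is not named falls through with no restrictions at all.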

Blocking AI Training Crawlers

Since 2023, AI companies have deployed crawlers to collect training data. Many site owners want to block them. Here are the main ones and their User-agent strings:

| Bot | Company | User-agent |
|---|---|---|
| GPTBot | OpenAI | GPTBot |
| ChatGPT-User | OpenAI | ChatGPT-User |
| ClaudeBot | Anthropic | ClaudeBot |
| CCBot | Common Crawl | CCBot |
| Amazonbot | Amazon | Amazonbot |
| Google-Extended | Google (Gemini) | Google-Extended |
| Bytespider | ByteDance | Bytespider |

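To block these crawlers, give each one its own group. The example below covers five of the bots from the table; add Amazonbot and Bytespider the same way if you want to block them too: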

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

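Important: This prevents future crawling for training data but does not retroactively remove your content from existing models or datasets. Two entries deserve a closer look. ChatGPT-User is the user agent OpenAI uses when a ChatGPT user asks it to open a page, not a training crawler, so the rule above also blocks ChatGPT's browsing feature; drop that group if you only want to opt out of training. Google-Extended is a control token rather than a separate crawler: it governs whether content fetched by Google's crawlers may be used for Gemini, and blocking it does not affect Google Search.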
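Since every AI crawler gets the same two-line group, the block can be generated from a list instead of typed by hand. The sketch below is one possible approach: `build_block_rules` is a hypothetical helper, the user-agent tokens are the ones listed in this post, and the final check uses Python's standard-library parser only as a rough sanity test.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the table in this post.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot",
    "Amazonbot", "Google-Extended", "Bytespider",
]


def build_block_rules(user_agents):
    """Return robots.txt text with one 'Disallow: /' group per user agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"


robots_txt = build_block_rules(AI_CRAWLERS)

# Sanity check the generated text with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://yoursite.com/blog"))     # False
print(parser.can_fetch("Googlebot", "https://yoursite.com/blog"))  # True
```

Keeping the list in one place makes it easier to add or remove a bot as crawler names change; append the generated text to the rest of your robots.txt rather than replacing it.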

Common Patterns

Block admin and private areas

User-agent: *
Allow: /
Disallow: /admin
Disallow: /private
Disallow: /dashboard
Disallow: /api

Block entire site (maintenance / staging)

User-agent: *
Disallow: /

Prevent crawling of URLs with query parameters

User-agent: *
Disallow: /*?

Allow only specific bots

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
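The allow-list pattern relies on a crawler using only the most specific group that names it and falling back to the `*` group otherwise. A quick check with Python's standard-library `urllib.robotparser`, offered as a sketch rather than proof of how every crawler behaves:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# yoursite.com is the placeholder domain used throughout this post.
for agent in ("Googlebot", "Bingbot", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "https://yoursite.com/pricing")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Run as written, this reports Googlebot and Bingbot as allowed and SomeOtherBot as blocked, because the named groups replace the `*` group for those two bots instead of merging with it.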

What robots.txt Does NOT Do

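It doesn't prevent pages from appearing in search results. If another site links to your blocked page, Google may still show the URL in results (without a description, since it can't crawl it). To keep a page out of the index, use a noindex meta tag or an X-Robots-Tag header, and leave the page crawlable: a crawler that is blocked by robots.txt never sees the noindex.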
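When auditing pages that should stay out of the index, it can help to confirm that the noindex directive is actually present in the HTML. The sketch below uses Python's standard-library `html.parser`; the class and function names are assumptions for this example, and it only looks at the generic `<meta name="robots">` tag, not bot-specific meta tags or the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```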

It doesn't protect sensitive data. robots.txt is public. Blocking /admin tells any human reading the file that /admin exists. For real security, use authentication.

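It doesn't guarantee that JavaScript-rendered content gets crawled. Googlebot can execute JavaScript, but rendering may be delayed or skipped, and if robots.txt blocks the script or CSS files a page depends on, Googlebot cannot render that page properly. Critical content shouldn't rely solely on JS rendering for crawlability.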

Deployment

Place robots.txt at exactly: https://yourdomain.com/robots.txt

It must be at the root of the host, not in a subdirectory: crawlers ignore a file at yoursite.com/blog/robots.txt. Each subdomain needs its own file, so a blog hosted at blog.yoursite.com is governed by blog.yoursite.com/robots.txt, not by the file on yoursite.com.
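Because the file is looked up per host, the robots.txt location for any page can be derived from the page URL alone. A minimal Python sketch, with `robots_txt_url` as a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs `page_url`.

    Keeps the scheme and host (including any port) and replaces the
    path, query, and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yoursite.com/blog/post?id=7"))
# https://yoursite.com/robots.txt
print(robots_txt_url("https://blog.yoursite.com/post"))
# https://blog.yoursite.com/robots.txt
```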

After deploying, verify with Google Search Console: Settings → robots.txt.

Generate Yours Now

Use our Robots.txt Generator to build your file visually — add per-bot rules, choose Allow/Disallow paths from common suggestions, block AI crawlers with one click, add your sitemap URL, and download the ready-to-deploy file.
