robots.txt is a plain text file at the root of your domain. It uses a simple protocol to tell web crawlers which pages they can and cannot access. It's not enforced technically — any crawler can ignore it — but Google, Bing, and most reputable bots respect it.
Getting robots.txt right is foundational technical SEO. Getting it wrong can accidentally de-index your entire site or leave paths exposed that you'd rather not have indexed.
The Basic Syntax
User-agent: *
Allow: /
Disallow: /admin
Sitemap: https://yoursite.com/sitemap.xml
- User-agent sets which crawler the following rules apply to; * means all crawlers.
- Allow explicitly permits a path (useful to override a broader Disallow).
- Disallow blocks crawlers from a path.
- Sitemap tells crawlers where your sitemap is. Always include this.
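For Google and other crawlers that follow RFC 9309, the order of rules within a group does not decide the outcome. The most specific matching rule (the one with the longest path) wins, and if an Allow and a Disallow are equally specific, Allow wins. Some simpler parsers apply the first matching rule instead, so don't write a file that only works under one of the two interpretations.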
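The "most specific rule wins" behavior can be sketched in a few lines of Python. This is a minimal illustration, assuming specificity is measured by pattern length (as in RFC 9309) and handling plain prefix patterns only; the function name `is_allowed` and the sample rules are hypothetical, not part of any library.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, pattern) pairs from one User-agent group,
    for example [("allow", "/"), ("disallow", "/admin")]. Only plain prefix
    patterns are handled; wildcards are left out to keep the sketch short.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        # An empty pattern matches nothing; otherwise use prefix matching.
        if not pattern or not path.startswith(pattern):
            continue
        is_allow = directive.lower() == "allow"
        # Longer pattern wins; on equal length, Allow beats Disallow.
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len = len(pattern)
            allowed = is_allow
    return allowed


# Hypothetical rule set: block /admin but re-open one subfolder.
rules = [("allow", "/"), ("disallow", "/admin"), ("allow", "/admin/public")]

for path in ("/blog/post", "/admin/users", "/admin/public/logo.png"):
    print(path, is_allowed(path, rules))
```

Because the comparison is by length rather than position, shuffling the entries in `rules` gives the same answers.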
Path Matching
| Pattern | Matches |
|---|---|
| /admin | /admin, /admin/users, /admin/settings (and, because matching is by prefix, /administrator too) |
| /admin/ | /admin/ and everything below |
| /*.json | Any URL containing .json (including /data.json?v=2) |
| /*.json$ | Any URL ending in .json |
| /api/* | Everything under /api |
| / | The entire site |
Wildcards: * matches any sequence of characters. $ anchors the end of a URL.
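To see how these patterns behave, here is a rough Python sketch that translates a robots.txt path pattern into a regular expression. It assumes the common interpretation (patterns match from the start of the path, * matches any run of characters, a trailing $ anchors the end); the helper names `pattern_to_regex` and `matches` are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start of the string, giving prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


checks = [
    ("/admin", "/admin/users"),
    ("/admin", "/administrator"),
    ("/admin/", "/admin"),
    ("/*.json$", "/data/feed.json"),
    ("/*.json$", "/feed.json?v=2"),
    ("/*?", "/page?x=1"),
]
for pattern, path in checks:
    print(f"{pattern!r} vs {path!r}: {matches(pattern, path)}")
```

Running a handful of your own URLs through a helper like this before deploying is a cheap way to catch a pattern that matches more (or less) than intended.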
Blocking Specific Crawlers
Set different rules for different bots by using multiple User-agent blocks:
# Allow Google everywhere
User-agent: Googlebot
Allow: /
# Block Bing from the API
User-agent: Bingbot
Disallow: /api
# Block everything for a specific bot
User-agent: BadBot
Disallow: /
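One way to sanity-check per-bot rules like these is Python's standard-library `urllib.robotparser`. The sketch below parses the example from a string and asks what each bot may fetch; `example.com` is a placeholder host. Note that this parser applies the first matching rule in file order and does not appear to implement the * and $ path wildcards, so treat it as a rough check rather than a model of how Googlebot evaluates the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Allow Google everywhere
User-agent: Googlebot
Allow: /

# Block Bing from the API
User-agent: Bingbot
Disallow: /api

# Block everything for a specific bot
User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())

urls = ("https://example.com/blog", "https://example.com/api/users")
for agent in ("Googlebot", "Bingbot", "BadBot"):
    for url in urls:
        print(agent, url, parser.can_fetch(agent, url))
```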
Blocking AI Training Crawlers
Since 2023, AI companies have deployed crawlers to collect training data. Many site owners want to block them. Here are the main ones and their User-agent strings:
| Bot | Company | User-agent |
|---|---|---|
| GPTBot | OpenAI | GPTBot |
| ChatGPT-User | OpenAI | ChatGPT-User |
| ClaudeBot | Anthropic | ClaudeBot |
| CCBot | Common Crawl | CCBot |
| Amazonbot | Amazon | Amazonbot |
| Google-Extended | Google (Gemini) | Google-Extended |
| Bytespider | ByteDance | Bytespider |
To block the most widely used of these (add a matching block for Amazonbot or Bytespider if you want to cover them too):
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Important: This stops future crawling for training data but does not remove your content from models that were already trained on it. ChatGPT-User is a different kind of agent: it fetches pages when a ChatGPT user asks for them rather than collecting training data, so keep that block only if you also want to opt out of ChatGPT's browsing feature.
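Since the list of AI crawlers keeps changing, it can be easier to generate these blocks from a list than to maintain them by hand. A minimal Python sketch, using the user-agent names from the table above; the function name `disallow_all` and the constant `AI_AGENTS` are hypothetical, and you would trim the list to the agents you actually want to block.

```python
# User-agent tokens taken from the table above; edit to suit your site.
AI_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "CCBot",
    "Amazonbot",
    "Google-Extended",
    "Bytespider",
]


def disallow_all(agents):
    """Build one 'Disallow: /' group per user agent, separated by blank lines."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


print(disallow_all(AI_AGENTS))
```

The output can be appended to an existing robots.txt; the blank line between groups is optional but keeps the file readable.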
Common Patterns
Block admin and private areas
User-agent: *
Allow: /
Disallow: /admin
Disallow: /private
Disallow: /dashboard
Disallow: /api
Block entire site (maintenance / staging)
User-agent: *
Disallow: /
Prevent crawling of URLs with query parameters
User-agent: *
Disallow: /*?
Allow only specific bots
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
What robots.txt Does NOT Do
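It doesn't prevent pages from appearing in search results. If another site links to your blocked page, Google may still show the URL in results (without a description, since it can't crawl it). To keep a page out of the index, use a noindex meta tag (<meta name="robots" content="noindex">) and leave the page crawlable: a crawler that is blocked by robots.txt never sees the tag.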
It doesn't protect sensitive data. robots.txt is public. Blocking /admin tells any human reading the file that /admin exists. For real security, use authentication.
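It doesn't treat JavaScript-rendered content specially. Googlebot can execute JavaScript, but rendering is not guaranteed, and if robots.txt blocks the script or API URLs a page depends on, Googlebot cannot render that content. Critical content shouldn't rely solely on JS rendering for crawlability.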
Deployment
Place robots.txt at exactly: https://yourdomain.com/robots.txt
It must be at the root of the host, not in a subdirectory. Crawlers ignore a file at yoursite.com/blog/robots.txt. If your blog is hosted separately at blog.yoursite.com, that subdomain needs its own file at blog.yoursite.com/robots.txt.
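The location rule can be expressed as a small helper: given any page URL, the robots.txt that governs it lives at the root of the same scheme and host. A sketch in Python using the standard `urllib.parse` module; the function name `robots_url` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace the path and
    # drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yoursite.com/blog/post?x=1"))
print(robots_url("https://blog.yoursite.com/post"))
```

The two calls return different URLs, which is the point: each host, including each subdomain, is checked against its own file.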
After deploying, verify with Google Search Console: Settings → robots.txt.
Generate Yours Now
Use our Robots.txt Generator to build your file visually — add per-bot rules, choose Allow/Disallow paths from common suggestions, block AI crawlers with one click, add your sitemap URL, and download the ready-to-deploy file.