Guide

AI Crawlers and robots.txt: What to Allow, Block, and Monitor

Learn how robots.txt affects AI crawlers, why OAI-SearchBot matters for ChatGPT Search, and how publishers can choose a practical crawl policy.

Updated May 17, 2026

robots.txt still matters in the AI era because it is the first place many automated crawlers look for permission rules. If you want content discoverable in ChatGPT Search, OpenAI says OAI-SearchBot must be allowed to crawl the site. If you want to restrict access, robots.txt can express that choice, but it cannot solve attribution, payment, or content quality problems on its own.

The role of robots.txt#

Google describes robots.txt as the standard way to tell automated crawlers which parts of a site they may access. It is a crawl-control file, not a ranking file, not a rights-management system, and not a substitute for authentication.

Primary sources:

AI crawlers are not one thing#

Different bots serve different jobs. OpenAI says OAI-SearchBot is used to surface websites in ChatGPT Search and is not the crawler used to train foundation models. That distinction matters because site owners may want search visibility while applying different rules to other automated uses.

Bot category	Main purpose	Typical publisher question
Search crawler	Discover pages for answer or search experiences	Do I want to be found?
Training crawler	Collect content for model development	Do I allow reuse?
Commerce crawler	Read product data and availability	Do I want shopping visibility?
Security or monitoring bot	Verify uptime, abuse, or threats	Is this bot legitimate?

Treating all bots as identical usually produces bad policy. It can block useful discovery while failing to protect what actually matters.

A practical robots.txt policy#

Start from business intent:

Goal	Sensible direction
Maximize search visibility	Allow search crawlers on public pages
Protect member-only content	Use authentication, not just robots.txt
Exclude staging or faceted URLs	Block or canonicalize low-value paths
Support ecommerce discovery	Keep product URLs crawlable and current
Monetize access	Explore policy plus payment layers, not robots.txt alone

For a public ecommerce site that wants ChatGPT Search visibility, this may be relevant:

User-agent: OAI-SearchBot
Allow: /

Do not copy rules blindly. Validate what the crawler actually receives through your CDN, bot protection layer, and origin server.

Common mistakes#

Allowing the bot in robots.txt but blocking it at the CDN.
Blocking all AI user agents and then wondering why AI search referrals disappear.
Treating robots.txt as security for private content.
Forgetting that product, documentation, and support pages may need different policies.
Publishing contradictory rules across robots.txt, headers, and edge controls.

The Cloudflare AI Crawl Control guide covers the edge-policy side. The ChatGPT product recommendations checklist shows why crawler access matters for ecommerce discovery.

What to monitor#

Signal	Why it matters
Server logs by user agent	Shows whether bots actually fetch your pages
Search referrals	Shows whether discoverability creates traffic
Bot-blocking events	Reveals CDN or WAF conflicts
Product page freshness	Protects AI shopping quality
Crawl errors	Finds accidental 403, 429, or 5xx responses

For broader AEO measurement, connect crawl data with the AEO KPIs guide.

robots.txt is not an AEO strategy#

Good crawl policy is only the entry point. Agents still need:

clear page structure
accurate facts
current pricing or availability
internal links
machine-readable product or service data
safe action paths

That is why robots.txt belongs in the read layer, while task completion belongs in the execution layer.

FAQ#

Does allowing OAI-SearchBot guarantee ChatGPT visibility?#

No. OpenAI says allowing the crawler is important for inclusion, but it does not guarantee placement.

Can robots.txt protect private content?#

No. Use authentication and authorization for private content. robots.txt is a voluntary crawl directive.

Should I block every AI crawler?#

That is a business decision, not a default best practice. Separate search visibility goals from training and monetization policies.

How do I test my policy?#

Check the live robots.txt, inspect server logs, and verify that important URLs return the expected status to the user agents you want to support.

Bottom line#

AI crawler policy should be deliberate, not emotional. Decide which automated uses you want, express those rules clearly, and verify them at the edge and origin before assuming they work.