AI Crawlers and robots.txt: What to Allow, Block, and Monitor
Learn how robots.txt affects AI crawlers, why OAI-SearchBot matters for ChatGPT Search, and how publishers can choose a practical crawl policy.
Updated May 17, 2026
robots.txt still matters in the AI era because it is the first place many automated crawlers look for permission rules. If you want content discoverable in ChatGPT Search, OpenAI says OAI-SearchBot must be allowed to crawl the site. If you want to restrict access, robots.txt can express that choice, but it cannot solve attribution, payment, or content quality problems on its own.
The role of robots.txt#
Google describes robots.txt as the standard way to tell automated crawlers which parts of a site they may access. It is a crawl-control file, not a ranking file, not a rights-management system, and not a substitute for authentication.
Primary sources:
- Google robots.txt documentation
- OpenAI: Help ChatGPT discover your products
- OpenAI Help: ChatGPT Search
AI crawlers are not one thing#
Different bots serve different jobs. OpenAI says OAI-SearchBot is used to surface websites in ChatGPT Search and is not the crawler used to train foundation models. That distinction matters because site owners may want search visibility while applying different rules to other automated uses.
| Bot category | Main purpose | Typical publisher question |
|---|---|---|
| Search crawler | Discover pages for answer or search experiences | Do I want to be found? |
| Training crawler | Collect content for model development | Do I allow reuse? |
| Commerce crawler | Read product data and availability | Do I want shopping visibility? |
| Security or monitoring bot | Verify uptime, abuse, or threats | Is this bot legitimate? |
Treating all bots as identical usually produces bad policy. It can block useful discovery while failing to protect what actually matters.
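To make the per-bot distinction concrete, here is a minimal sketch that evaluates a differentiated policy with Python's standard `urllib.robotparser`. The policy text is hypothetical, `GPTBot` is used as an example of a training crawler (it is not named in the sources above), and `ExampleBot` and the example.com URLs are placeholders. Python's parser is only an approximation of how any individual crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow a search crawler, block a training crawler,
# and keep /staging/ off-limits for every other agent.
POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /staging/
"""


def decisions(robots_txt, agents, url):
    """Return {agent: allowed} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["OAI-SearchBot", "GPTBot", "ExampleBot"]
    for url in ("https://example.com/products/widget",
                "https://example.com/staging/draft"):
        print(url, decisions(POLICY, agents, url))
```

One design point worth noticing: an agent that matches a named group follows only that group and does not inherit the `*` rules. In this sketch, OAI-SearchBot is therefore still allowed on `/staging/` unless its own group disallows that path.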
A practical robots.txt policy#
Start from business intent:
| Goal | Sensible direction |
|---|---|
| Maximize search visibility | Allow search crawlers on public pages |
| Protect member-only content | Use authentication, not just robots.txt |
| Exclude staging or faceted URLs | Block or canonicalize low-value paths |
| Support ecommerce discovery | Keep product URLs crawlable and current |
| Monetize access | Explore policy plus payment layers, not robots.txt alone |
For a public ecommerce site that wants ChatGPT Search visibility, a minimal rule could look like this:
User-agent: OAI-SearchBot
Allow: /
Do not copy rules blindly. Validate what the crawler actually receives through your CDN, bot protection layer, and origin server.
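One way to check what a given user agent receives is to request the same URL with different `User-Agent` headers and compare status codes. The sketch below uses only the standard library; `fetch_status` and `probe` are illustrative names, not part of any crawler tooling. It is a rough approximation: a spoofed header does not reproduce the crawler's source IP addresses, so bot protection that verifies IPs may treat these requests differently from the real bot.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=10.0):
    """Return the HTTP status that a request with this User-Agent receives."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses raise HTTPError; the code is the signal we want.
        return error.code


def probe(urls, user_agents, fetch=fetch_status):
    """Map (url, user_agent) to the observed status for every combination.

    `fetch` is injectable so the comparison logic can be exercised offline.
    """
    return {(url, agent): fetch(url, agent)
            for url in urls
            for agent in user_agents}
```

Calling `probe` with your key product URLs and a short list of user agent strings gives a table of statuses; a 403 or 429 that appears for only one agent suggests an edge rule rather than an origin problem.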
Common mistakes#
- Allowing the bot in robots.txt but blocking it at the CDN.
- Blocking all AI user agents and then wondering why AI search referrals disappear.
- Treating robots.txt as security for private content.
- Forgetting that product, documentation, and support pages may need different policies.
- Publishing contradictory rules across robots.txt, headers, and edge controls.
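The first and last mistakes in this list can be detected mechanically by comparing what robots.txt permits with the statuses that requests actually received. The following is a sketch under stated assumptions: `find_conflicts` is a hypothetical helper, the observations are `(user_agent, url, status)` tuples you would collect from logs or manual checks, and treating 401, 403, and 429 as "refused" is an illustrative choice rather than a standard.

```python
from urllib.robotparser import RobotFileParser

# Statuses treated as "refused upstream" in this sketch (an assumption).
REFUSED = {401, 403, 429}


def find_conflicts(robots_txt, observations):
    """Return observations where robots.txt and the served status disagree."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = []
    for agent, url, status in observations:
        allowed = parser.can_fetch(agent, url)
        if allowed and status in REFUSED:
            conflicts.append(
                (agent, url, status, "allowed in robots.txt but refused upstream"))
        elif not allowed and 200 <= status < 300:
            conflicts.append(
                (agent, url, status, "disallowed in robots.txt but served"))
    return conflicts
```

A "refused upstream" finding usually points at a CDN or WAF rule. A "served" finding means either a crawler that ignores robots.txt or a request that only claimed that user agent.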
The Cloudflare AI Crawl Control guide covers the edge-policy side. The ChatGPT product recommendations checklist shows why crawler access matters for ecommerce discovery.
What to monitor#
| Signal | Why it matters |
|---|---|
| Server logs by user agent | Shows whether bots actually fetch your pages |
| Search referrals | Shows whether discoverability creates traffic |
| Bot-blocking events | Reveals CDN or WAF conflicts |
| Product page freshness | Protects AI shopping quality |
| Crawl errors | Finds accidental 403, 429, or 5xx responses |
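The first and last signals in this table can be pulled from access logs with a short script. This sketch assumes logs in the common "combined" format (with the user agent as the final quoted field); the sample lines, IP addresses, and user agent strings are made up for illustration, and `TRACKED_BOTS` is a hypothetical list you would extend for your own policy.

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

TRACKED_BOTS = ("OAI-SearchBot",)  # hypothetical: add the agents you care about

# Fabricated sample lines, for illustration only.
SAMPLE = [
    '203.0.113.7 - - [17/May/2026:10:00:00 +0000] "GET /products/widget HTTP/1.1" '
    '200 5120 "-" "ExampleUA (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [17/May/2026:10:00:05 +0000] "GET /products/gadget HTTP/1.1" '
    '403 312 "-" "ExampleUA (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.4 - - [17/May/2026:10:01:00 +0000] "GET / HTTP/1.1" '
    '200 2048 "-" "ExampleBrowser/1.0"',
]


def summarize(lines, bots=TRACKED_BOTS):
    """Count requests per (bot, status); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for bot in bots:
            if bot.lower() in agent:
                counts[(bot, int(match.group("status")))] += 1
    return counts


if __name__ == "__main__":
    for (bot, status), total in sorted(summarize(SAMPLE).items()):
        print(f"{bot} {status} {total}")
```

Matching on the user agent string alone counts any client that claims to be the bot, so treat these numbers as an upper bound unless you also verify the source of the requests.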
For broader AEO measurement, connect crawl data with the AEO KPIs guide.
robots.txt is not an AEO strategy#
Good crawl policy is only the entry point. Agents still need:
- clear page structure
- accurate facts
- current pricing or availability
- internal links
- machine-readable product or service data
- safe action paths
That is why robots.txt belongs in the read layer, while task completion belongs in the execution layer.
FAQ#
Does allowing OAI-SearchBot guarantee ChatGPT visibility?#
No. OpenAI says allowing the crawler is important for inclusion, but it does not guarantee placement.
Can robots.txt protect private content?#
No. Use authentication and authorization for private content. robots.txt is a voluntary crawl directive: compliant crawlers honor it, but nothing in the file enforces it.
Should I block every AI crawler?#
That is a business decision, not a default best practice. Separate search visibility goals from training and monetization policies.
How do I test my policy?#
Check the live robots.txt, inspect server logs, and verify that important URLs return the expected status to the user agents you want to support.
Bottom line#
AI crawler policy should be deliberate, not emotional. Decide which automated uses you want, express those rules clearly, and verify them at the edge and origin before assuming they work.