AI Crawlers and robots.txt: What to Allow, Block, and Monitor

Learn how robots.txt affects AI crawlers, why OAI-SearchBot matters for ChatGPT Search, and how publishers can choose a practical crawl policy.

Updated May 17, 2026

robots.txt still matters in the AI era because it is the first place many automated crawlers look for permission rules. If you want your content to be discoverable in ChatGPT Search, OpenAI says OAI-SearchBot must be allowed to crawl the site. If you want to restrict access, robots.txt can express that choice, but it cannot solve attribution, payment, or content quality problems on its own.

The role of robots.txt

Google describes robots.txt as the standard way to tell automated crawlers which parts of a site they may access. It is a crawl-control file, not a ranking file, not a rights-management system, and not a substitute for authentication.
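
As a minimal sketch of what "crawl control" means in practice, the snippet below parses a small rule set with Python's standard `urllib.robotparser` and asks whether a compliant crawler would fetch two URLs. The rules, the `ExampleBot` name, and the `example.com` URLs are hypothetical and chosen only for illustration. The result describes what a rule-following crawler should do; nothing here enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not taken from any real site.
# Disallow is listed first because Python's parser applies the first
# matching rule within a group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler may fetch the public page but not the staging path.
print(parser.can_fetch("ExampleBot", "https://example.com/products/widget"))  # True
print(parser.can_fetch("ExampleBot", "https://example.com/staging/draft"))    # False
```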

AI crawlers are not one thing

Different bots serve different jobs. OpenAI says OAI-SearchBot is used to surface websites in ChatGPT Search and is not the crawler used to train foundation models. That distinction matters because site owners may want search visibility while applying different rules to other automated uses.

| Bot category | Main purpose | Typical publisher question |
| --- | --- | --- |
| Search crawler | Discover pages for answer or search experiences | Do I want to be found? |
| Training crawler | Collect content for model development | Do I allow reuse? |
| Commerce crawler | Read product data and availability | Do I want shopping visibility? |
| Security or monitoring bot | Verify uptime, abuse, or threats | Is this bot legitimate? |

Treating all bots as identical usually produces bad policy. It can block useful discovery while failing to protect what actually matters.
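
One way to see the difference is to write separate groups per user agent and check how each one resolves. The sketch below is an assumption-laden example: the policy is hypothetical, `GPTBot` stands in as the example training crawler name, and `SomeOtherBot` is invented. Confirm current user-agent tokens against each vendor's documentation before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow a search crawler, disallow a training crawler,
# and keep only the staging area off limits to everything else.
POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())

# The same URL resolves differently depending on which group matches.
for agent in ("OAI-SearchBot", "GPTBot", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "https://example.com/docs/intro")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```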

A practical robots.txt policy

Start from business intent:

| Goal | Sensible direction |
| --- | --- |
| Maximize search visibility | Allow search crawlers on public pages |
| Protect member-only content | Use authentication, not just robots.txt |
| Exclude staging or faceted URLs | Block or canonicalize low-value paths |
| Support ecommerce discovery | Keep product URLs crawlable and current |
| Monetize access | Explore policy plus payment layers, not robots.txt alone |

For a public ecommerce site that wants ChatGPT Search visibility, a rule like the following may be relevant:

User-agent: OAI-SearchBot
Allow: /

Do not copy rules blindly. Validate what the crawler actually receives through your CDN, bot protection layer, and origin server.
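
A rough first check is to request a URL with a specific User-Agent header and look at the status code that comes back. The helper below is a sketch using only the standard library; `status_for_agent` is a hypothetical name. It assumes that sending the header from your own machine is a useful approximation, which is not always true: bot protection layers may also verify crawlers by IP address, so server logs remain the more reliable evidence.

```python
import urllib.error
import urllib.request


def status_for_agent(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status that a GET with this User-Agent header receives."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is exactly what we want.
        return err.code
```

Calling `status_for_agent("https://example.com/robots.txt", "OAI-SearchBot")` against your own domain, and again with a browser-like User-Agent, shows whether the two are treated differently somewhere between the edge and the origin.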

Common mistakes

  1. Allowing the bot in robots.txt but blocking it at the CDN.
  2. Blocking all AI user agents and then wondering why AI search referrals disappear.
  3. Treating robots.txt as security for private content.
  4. Forgetting that product, documentation, and support pages may need different policies.
  5. Publishing contradictory rules across robots.txt, headers, and edge controls.
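
The first mistake in the list can be caught mechanically by comparing what robots.txt permits with the status codes a crawler was actually served. The sketch below assumes you have already collected observed statuses, for example from logs; `find_conflicts`, the choice of blocking statuses, and the sample data are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Statuses treated here as "blocked at the edge"; adjust to your own setup.
BLOCKING_STATUSES = {401, 403, 429}


def find_conflicts(robots_txt: str, user_agent: str, observed: dict[str, int]) -> list[tuple[str, int]]:
    """Return (url, status) pairs that robots.txt allows but the server refused."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    conflicts = []
    for url, status in observed.items():
        if parser.can_fetch(user_agent, url) and status in BLOCKING_STATUSES:
            conflicts.append((url, status))
    return conflicts


# Hypothetical example: robots.txt allows the bot, but one URL returned 403.
robots = "User-agent: OAI-SearchBot\nAllow: /\n"
observed = {
    "https://example.com/products/widget": 200,
    "https://example.com/products/gadget": 403,
}
print(find_conflicts(robots, "OAI-SearchBot", observed))
```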

The Cloudflare AI Crawl Control guide covers the edge-policy side. The ChatGPT product recommendations checklist shows why crawler access matters for ecommerce discovery.

What to monitor

| Signal | Why it matters |
| --- | --- |
| Server logs by user agent | Shows whether bots actually fetch your pages |
| Search referrals | Shows whether discoverability creates traffic |
| Bot-blocking events | Reveals CDN or WAF conflicts |
| Product page freshness | Protects AI shopping quality |
| Crawl errors | Finds accidental 403, 429, or 5xx responses |
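
The log-based signals can be approximated with a small script that counts requests per bot and status code. This sketch assumes access logs in the common "combined" format; the regular expression, the `summarize` helper, and the sample lines (including their User-Agent strings) are illustrative, not real crawler traffic.

```python
import re
from collections import Counter

# Matches the request, status, size, referer, and user-agent fields
# of a combined-format access log line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines, bot_tokens):
    """Count requests per (bot token, status) for user agents containing a token."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for token in bot_tokens:
            if token.lower() in agent:
                counts[(token, int(match.group("status")))] += 1
                break
    return counts


# Hypothetical log lines for illustration only.
SAMPLE = [
    '203.0.113.5 - - [17/May/2026:10:00:00 +0000] "GET /products/widget HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.5 - - [17/May/2026:10:00:02 +0000] "GET /products/gadget HTTP/1.1" 403 310 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.7 - - [17/May/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for (bot, status), n in sorted(summarize(SAMPLE, ["OAI-SearchBot"]).items()):
    print(bot, status, n)
```

A rising count of 403 or 429 responses for a bot you intend to allow is usually the first visible sign of a CDN or WAF conflict.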

For broader AEO measurement, connect crawl data with the AEO KPIs guide.

robots.txt is not an AEO strategy

Good crawl policy is only the entry point. Agents still need:

  • clear page structure
  • accurate facts
  • current pricing or availability
  • internal links
  • machine-readable product or service data
  • safe action paths

That is why robots.txt belongs in the read layer, while task completion belongs in the execution layer.

FAQ

Does allowing OAI-SearchBot guarantee ChatGPT visibility?

No. OpenAI says allowing the crawler is important for inclusion, but it does not guarantee placement.

Can robots.txt protect private content?

No. Use authentication and authorization for private content. robots.txt is a voluntary crawl directive: compliant crawlers honor it, but nothing stops a non-compliant client from ignoring it.

Should I block every AI crawler?

That is a business decision, not a default best practice. Separate search visibility goals from training and monetization policies.

How do I test my policy?

Check the live robots.txt, inspect server logs, and verify that important URLs return the expected status to the user agents you want to support.

Bottom line

AI crawler policy should be deliberate, not emotional. Decide which automated uses you want, express those rules clearly, and verify them at the edge and origin before assuming they work.