Lumos

AI Crawler Access Tester: The Same Test Lumos Runs on Every Page Weekly

Real-time bot access test: the same one Lumos runs on your whole site weekly.

By Lumos Team · May 15, 2026

Why crawler access is the first GEO gate

Before AI engines can cite your page, they have to read it. ChatGPT, Claude, Gemini, and Perplexity all dispatch crawlers, each with a distinct user-agent string and purpose, and each one checks your robots.txt before fetching content. Block them there and nothing else matters: not your schema, not your llms.txt, not the quality of your prose. You're invisible.

In our 2026 audit of 1,000 mid-market sites, 41% were blocking at least one of the 13 major AI crawlers, almost always by accident. The usual culprits: a User-agent: * Disallow: / left over from a staging site, a Cloudflare bot-protection toggle enabled by default, or a CMS plugin that "secured" the site without anyone noticing AI bots were caught in the net.

The 13 AI crawlers this tool checks

Each bot has a specific role, and some sites should allow them all while others may legitimately block a subset:

  • GPTBot: OpenAI's training crawler. Reads pages to train future GPT models. Blocking this opts you out of GPT training but does not affect ChatGPT citations.
  • OAI-SearchBot: ChatGPT's real-time search crawler. This is the one that gets you cited in ChatGPT answers. Must be allowed for ChatGPT visibility.
  • ChatGPT-User: fires when a ChatGPT user's request makes ChatGPT visit a page on their behalf (for example, asking it to read a URL). Allowing it is essentially required.
  • ClaudeBot: Anthropic's main crawler for Claude, including Claude.ai search-time answers.
  • anthropic-ai: Anthropic's training crawler (separate from ClaudeBot).
  • Claude-Web: Anthropic's web fetch user-agent when Claude visits a page on a user's behalf.
  • PerplexityBot: Perplexity's main crawler. Must be allowed to appear in Perplexity answers.
  • Perplexity-User: fires when a Perplexity user's question makes Perplexity visit a page on their behalf.
  • Google-Extended: Google's training opt-out flag for Gemini. Blocking this opts you out of Gemini training but does not block AI Overviews (those use Googlebot).
  • Googlebot: the classic Google Search crawler, which also powers AI Overviews and Gemini (formerly Bard) search-time answers.
  • Applebot-Extended: Apple Intelligence training opt-out.
  • Bingbot: Bing and Copilot. Must be allowed for Copilot citations.
  • Bytespider: TikTok / Doubao training crawler. Optional; many western brands choose to block it.
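
The list above can be checked against a robots.txt file with Python's standard-library urllib.robotparser. This is a minimal sketch, not the tester's actual implementation: the function name check_root_access and the sample file are assumptions for illustration. The standard-library parser applies rules in file order, so it may resolve conflicting Allow and Disallow lines differently from an individual crawler.

```python
from urllib.robotparser import RobotFileParser

# The 13 user-agent tokens listed above.
AI_USER_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "anthropic-ai", "Claude-Web",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Googlebot",
    "Applebot-Extended", "Bingbot", "Bytespider",
]


def check_root_access(robots_txt: str) -> dict[str, bool]:
    """Map each AI user-agent to True (allowed) or False (blocked) for "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/") for agent in AI_USER_AGENTS}


# Hypothetical robots.txt that blocks only Bytespider.
SAMPLE = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

results = check_root_access(SAMPLE)
blocked = [agent for agent, ok in results.items() if not ok]
print(blocked)  # prints ['Bytespider']
```

To check a live site, the robots.txt text would first be downloaded and passed in as a string; keeping the parsing step separate from the download makes it easy to try rules locally before publishing them.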

What good results look like

A site optimized for GEO will show all 13 bots as Allowed on the root path. Exceptions:

  • Some brands intentionally block GPTBot, anthropic-ai, Google-Extended, and Applebot-Extended (the training-only bots) while keeping search-time bots (OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Googlebot, Bingbot) allowed. This is the "opt-out of training, opt-in to citation" pattern.
  • A few brands block Bytespider to avoid TikTok training without affecting western AI engines.
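
One way to express the "opt-out of training, opt-in to citation" pattern is a single robots.txt group naming the training-only bots, with everything else left open. The file below is an illustrative assumption, not a recommended production file, and the check uses Python's standard-library parser rather than any crawler's own logic.

```python
from urllib.robotparser import RobotFileParser

# Bot groupings taken from the pattern described above.
TRAINING_ONLY = ["GPTBot", "anthropic-ai", "Google-Extended", "Applebot-Extended"]
SEARCH_TIME = ["OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
               "PerplexityBot", "Googlebot", "Bingbot"]

# Hypothetical robots.txt: one group blocks the training-only bots,
# the wildcard group keeps every other crawler allowed.
OPT_OUT_OF_TRAINING = """\
User-agent: GPTBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(OPT_OUT_OF_TRAINING.splitlines())

training_blocked = all(not parser.can_fetch(ua, "/") for ua in TRAINING_ONLY)
search_allowed = all(parser.can_fetch(ua, "/") for ua in SEARCH_TIME)
print(training_blocked, search_allowed)  # prints True True
```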

If you see Allowed across the board, you've cleared the first gate. If you see any Blocked results, fix them; the next sections show what to do.

Common mistakes

Blocking by accident via User-agent: * Disallow. A blanket disallow catches every bot, including AI crawlers. Add explicit Allow rules for the AI user-agents above.
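
A small sketch of this mistake and the fix, using Python's standard-library parser. Both robots.txt files are hypothetical, and root_allowed is a helper name chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def root_allowed(robots_txt: str, user_agent: str) -> bool:
    """Return True if user_agent may fetch "/" under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "/")


# Leftover from a staging site: blocks every crawler, AI bots included.
STAGING_LEFTOVER = """\
User-agent: *
Disallow: /
"""

# Same blanket rule, plus an explicit Allow group for two AI user-agents.
WITH_EXPLICIT_ALLOW = """\
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /
"""

print(root_allowed(STAGING_LEFTOVER, "OAI-SearchBot"))     # False
print(root_allowed(WITH_EXPLICIT_ALLOW, "OAI-SearchBot"))  # True
print(root_allowed(WITH_EXPLICIT_ALLOW, "Bingbot"))        # False, still caught by *
```

Any bot not named in the explicit group still falls through to the wildcard rule, so every user-agent you want allowed has to be listed.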

Blocking GPTBot but not OAI-SearchBot. A common pattern, but ensure you mean it. If your goal is ChatGPT visibility, OAI-SearchBot is what matters; GPTBot only affects training.

Cloudflare AI-bot toggle. Cloudflare's dashboard added a "Block AI bots" toggle in 2024 that defaults to ON for new sites. robots.txt can say Allow all you want; Cloudflare will still 403 the request.

Serving robots.txt as HTML or behind auth. Both cause AI crawlers to give up. Force text/plain and public access.
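
A rough way to spot this delivery problem is to look at the HTTP status and Content-Type header of /robots.txt. The sketch below is an assumption about how such a check could look: the function names and messages are illustrative, and fetch_robots_headers needs network access, so it is defined but not called here.

```python
from typing import Optional
import urllib.error
import urllib.request


def robots_delivery_problem(status: int, content_type: str) -> Optional[str]:
    """Return a short description of the problem, or None if delivery looks fine."""
    if status in (401, 403):
        return "robots.txt requires authentication or is forbidden"
    if status != 200:
        return f"unexpected HTTP status {status}"
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        return f"served as {media_type or 'unknown type'}, expected text/plain"
    return None


def fetch_robots_headers(url: str) -> tuple[int, str]:
    """Fetch url and return (status, Content-Type). Requires network access."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Content-Type", "")


print(robots_delivery_problem(200, "text/plain; charset=utf-8"))  # None
print(robots_delivery_problem(200, "text/html; charset=utf-8"))
print(robots_delivery_problem(401, "text/html"))
```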

Stale rules. A robots.txt from 2018 won't mention any of the AI user-agents that emerged in 2023-2024. Default-allow behavior usually saves you, but generic blocks can still catch them.

Trusting the tool result for one path. This tester checks your root (/). If you've blocked AI bots on /blog or /docs, you need to test those paths too โ€” those are where your highest-citability content usually lives.

After you test

  1. If all 13 show Allowed: you're cleared at the robots.txt layer for the root. Move on to schema, citability, and per-path checks.
  2. If any show Blocked: edit your robots.txt to add explicit Allow: rules for the blocked user-agents, then re-run this tester.
  3. If your CDN or WAF blocks despite robots.txt allowing: check Cloudflare → Security → Bots → "AI Scrapers and Crawlers"; in Akamai/Imperva, look for bot-management rules tagged "AI" or "scraper."
  4. Pair with Page Citability Checker โ€” access is necessary but not sufficient.
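
For step 3, one rough way to see whether a CDN or WAF is the blocker is to request a page with a crawler-style User-Agent header and compare the HTTP status with what robots.txt says. This is a sketch under loud assumptions: fetch_status and diagnose are hypothetical helpers, the status codes treated as "blocked" are a guess at common behavior, and many bot-management products verify crawlers by IP address, so a request sent from your own machine may not be treated the way the real bot would be.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str) -> int:
    """Request url with the given User-Agent header and return the HTTP status.

    Requires network access; defined here but not called in this sketch.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def diagnose(robots_allows: bool, http_status: int) -> str:
    """Combine the robots.txt verdict with the observed HTTP status."""
    if not robots_allows:
        return "blocked by robots.txt"
    # Assumption: these codes are typical of CDN/WAF bot blocking.
    if http_status in (401, 403, 429):
        return "allowed by robots.txt, blocked at the network layer"
    if 200 <= http_status < 400:
        return "allowed"
    return f"allowed by robots.txt, unexpected HTTP status {http_status}"


print(diagnose(True, 403))
print(diagnose(False, 200))
print(diagnose(True, 200))
```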

What Lumos does continuously (vs this one-shot test)

This page tests your root path on demand. Useful, but a single snapshot. The Lumos platform takes the same test and runs it as continuous infrastructure:

  • Every URL on your site, not just /. The tester here checks /. Lumos runs the 13-bot test against every URL it discovers: blog posts, product pages, docs, anything in your sitemap. That's where the regressions hide: /blog blocked while / stays open.
  • Weekly, not when you remember. Configurations drift. A CMS plugin updates, a Cloudflare toggle flips on, a content team adds a robots rule. Lumos re-runs the full bot-access test every week so you catch the change in days, not quarters.
  • Day-of alerts. When any of the 13 bots flips from Allowed to Blocked, Lumos sends an alert with the diff: which bot, which paths, what changed in robots.txt or the network response. No more "we noticed traffic dropped in ChatGPT three months ago."
  • robots.txt + network layer in one report. This standalone tool only inspects robots.txt. The Lumos platform also fetches each URL as each user-agent and reports the actual HTTP response, so Cloudflare 403s and WAF blocks surface alongside robots.txt rules.
  • Tied to your citation and visibility data. When a bot becomes blocked, Lumos correlates the day with your visibility score in that engine, so you see the business impact, not just the technical event.
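
The idea behind a flip alert can be sketched as a comparison between two snapshots of per-bot results. This is only an illustration of the concept, not how the Lumos platform is built; the function name and the snapshot shape (bot name mapped to True for allowed) are assumptions.

```python
def flipped_to_blocked(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return bots that were allowed in the previous snapshot and are blocked now."""
    return sorted(
        bot for bot, was_allowed in previous.items()
        if was_allowed and current.get(bot) is False
    )


# Hypothetical weekly snapshots: True means Allowed, False means Blocked.
last_week = {"GPTBot": True, "ClaudeBot": True, "Bytespider": False}
this_week = {"GPTBot": True, "ClaudeBot": False, "Bytespider": False}

print(flipped_to_blocked(last_week, this_week))  # prints ['ClaudeBot']
```

Bots that were already blocked last week (Bytespider here) are not reported, so an intentional block does not trigger a repeat alert.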

This standalone tool is the same logic, run once at the root. If you want the continuous, every-URL, alerted version, the Lumos platform is built for it.

41% of sites are blocking at least one AI crawler (Lumos research 2026)

13 AI crawlers checked by this tool

How it works

  1. Enter your domain. Paste your full domain (e.g., yourbrand.com). No path required.
  2. Click Test. We fetch /robots.txt and evaluate 13 AI user-agents against your root path.
  3. Review the results. Each bot is reported as Allowed, Blocked, or Partial. Drill into rules per bot.
  4. Fix any blocks. Use our robots.txt for AI Generator to produce a corrected file, then re-test.
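
The first two steps amount to turning whatever was typed into the location of /robots.txt. A minimal sketch, assuming HTTPS when no scheme is given; robots_url is a hypothetical helper, not the tool's own code.

```python
from urllib.parse import urlsplit


def robots_url(domain_input: str) -> str:
    """Turn a pasted domain or URL into the URL of that host's robots.txt."""
    text = domain_input.strip()
    if "://" not in text:
        text = "https://" + text  # assumption: default to HTTPS
    parts = urlsplit(text)
    if not parts.netloc:
        raise ValueError(f"could not find a host in {domain_input!r}")
    # Any path the user pasted is dropped; robots.txt lives at the host root.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url("yourbrand.com"))                    # https://yourbrand.com/robots.txt
print(robots_url("https://yourbrand.com/blog/post"))  # https://yourbrand.com/robots.txt
```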

FAQ

What does this tool check?

It fetches your robots.txt and tests 13 AI user-agents against it: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, anthropic-ai, Claude-Web, PerplexityBot, Perplexity-User, Google-Extended, Googlebot, Applebot-Extended, Bingbot, and Bytespider. Each is reported as Allowed, Blocked, or Partial. The Lumos platform runs this same test, but continuously on every page of your site, not just once at the root.

How does this compare to what Lumos does continuously?

This one-shot tester checks your root path on demand. The Lumos platform runs the same 13-bot test weekly on every URL on your site, watches for changes, and alerts you the day any bot flips to blocked, including when a Cloudflare toggle or a CMS plugin silently changes the rules. Same test, full coverage, always on.

Is being unblocked enough to get cited?

No. Access is necessary but not sufficient. AI engines also need your content to be cite-worthy: clear answers, schema markup, fresh dates, and authoritative authorship. The Lumos platform pairs continuous bot monitoring with page-level citability scoring so you see the full picture.

What about Cloudflare or WAF blocking at the network layer?

robots.txt is one layer. Cloudflare's AI bot toggle, Akamai bot manager, or custom WAF rules can block AI crawlers even when robots.txt allows them. Lumos's paid platform checks both layers continuously: robots.txt plus the actual HTTP response for each user-agent.

Why do some bots show "Partial"?

Some user-agents have nuanced rules. For example, a site may allow GPTBot on / but block /api or /admin. We report Allowed / Blocked / Partial so you know whether the gates are wide open or only the homepage is reachable. The Lumos platform extends this by checking every path on your site, not just the root.
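
A three-way status like this can be derived by checking one user-agent against several paths. The sketch below is one possible definition, assumed for illustration: "Partial" means some of the sampled paths are allowed and some are not. The sample paths and robots.txt are hypothetical, and the tool's own rule for Partial may differ.

```python
from urllib.robotparser import RobotFileParser


def access_status(robots_txt: str, user_agent: str, paths: list[str]) -> str:
    """Classify a bot as Allowed, Blocked, or Partial across the given paths."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [parser.can_fetch(user_agent, path) for path in paths]
    if all(allowed):
        return "Allowed"
    if not any(allowed):
        return "Blocked"
    return "Partial"


# Hypothetical file: GPTBot may read the homepage but not /api or /admin.
SAMPLE = """\
User-agent: GPTBot
Disallow: /api
Disallow: /admin
"""
PATHS = ["/", "/api", "/admin"]

print(access_status(SAMPLE, "GPTBot", PATHS))     # Partial
print(access_status(SAMPLE, "ClaudeBot", PATHS))  # Allowed
```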

Does Lumos monitor this continuously?

Yes. The Lumos platform runs this check weekly on every URL and alerts you the day any bot becomes blocked. This standalone tool is the one-shot, root-only version. Connect your site to Lumos to get continuous coverage.

Related tools

48-Hour AI Visibility Audit Report

Full audit covering bot access, schema, and page citability, delivered in 48 hours.

Related reading

GEO: The SEO of the AI Era - Monitor Your Brand in ChatGPT and Gemini

Generative Engine Optimization (GEO): learn how to monitor how ChatGPT and Gemini talk about your brand with metrics, criteria, and a 30-day pilot.

What Does ChatGPT Say About Your Business? How to Audit Your AI Visibility

Most businesses have no idea what ChatGPT, Gemini, or Perplexity say about them. Audit your AI visibility and catch problems before they cost you customers.

What is GEO? A Complete Guide to Generative Engine Optimization

Generative Engine Optimization (GEO) is the practice of optimizing your brand to appear in AI answers from ChatGPT, Gemini, and Perplexity. Full guide.
