What Is Agent Analytics? Tracking AI Crawlers, Citations & Referrals (2026)

The fastest-growing visitor segment on your website is invisible to your analytics stack. GA4 cannot see an AI crawler. Neither can any tool that depends on a JavaScript tag. While marketing teams stare at dashboards built for human browsers, GPTBot, ClaudeBot, and PerplexityBot are reading their sites thousands of times a month and deciding whether to cite them.

This guide defines the category, gives you a reference table for every major AI bot, shows how to collect the data on your own infrastructure, and ends with a diagnostic framework for turning bot logs into decisions.

What is agent analytics?

Agent analytics is bot-level web analytics for the AI era. Where traditional analytics answers “which humans visited my site and what did they do,” agent analytics answers three different questions:

Which AI systems are reading my site? ChatGPT, Claude, Perplexity, Gemini, Copilot, and Meta AI all operate distinct crawlers with distinct purposes.
What are they reading it for? A training crawl, an indexing crawl, and a live citation fetch are different events with different business value.
Did the reading turn into anything? Citations in AI answers and referral clicks back to your site are the outcomes that make the crawl data worth having.

The term comes from the product category Profound created with Agent Analytics, which reads AI bot traffic at the CDN and server-log level instead of the browser level. The category exists because the measurement gap is structural: the tools marketers already own run in the browser, so they cannot see traffic that never opens one.

Why GA4 misses AI crawlers entirely

GA4, and every analytics product built on the same model, counts a visit when a JavaScript snippet executes in a browser. That assumption held for twenty years of human web traffic. It fails completely for AI agents.

When GPTBot requests a page, it fetches the raw HTML server-to-server. No browser, no JavaScript execution, no tracking pixel, no event. The request is real, the content gets read, and your analytics records nothing. The same is true when ChatGPT fetches your pricing page mid-conversation to answer a buyer’s question. From GA4’s perspective, the most commercially interesting visit your site received that day never happened.

The only places this traffic shows up are your server logs and your CDN. That is why every serious agent analytics implementation works at the infrastructure level: edge workers, log streaming, or raw log ingestion. JavaScript is the wrong layer.

🔍 One partial exception: when a human clicks a link inside a ChatGPT answer, that click arrives as a normal browser session with a chatgpt.com referrer, and GA4 can see it. The crawls and live fetches that earned the citation in the first place stay invisible.

The three intents: citation, indexing, training

Not all bot visits are equal, and treating them as one number is the most common agent analytics mistake. Every AI bot visit falls into one of three intent classes:

Citation. A user-triggered, real-time fetch. ChatGPT-User hitting your page means an actual person asked a question and the model is reading your content right now to answer it. Of the three classes, this is the highest-value visit.
Indexing. A search-index crawl, like OAI-SearchBot or PerplexityBot building the retrieval layer that future answers draw from. Indexing visits are the leading indicator of citation visits.
Training. A bulk crawl that feeds model training, like GPTBot or Meta-ExternalAgent. Training visits influence what future model versions know about you, but they produce no immediate citation or referral.

The ratio between these three classes tells you where you stand. Heavy training traffic with no citation traffic means models know you exist but answers never surface you. Heavy citation traffic means you are already part of live answers and should protect whatever is working.
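To make the ratio concrete, here is a minimal Python sketch that maps user-agent strings to the three intent classes and reports each class's share. The token-to-intent mapping follows the descriptions above; the function names and the sample user-agent strings in the usage note are illustrative assumptions, and tokens with mixed purposes are left out on purpose.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Token -> intent, following the three classes described in this section.
# Mixed-purpose tokens are omitted here; extend the mapping to suit your logs.
INTENT_BY_TOKEN = {
    "ChatGPT-User": "citation",
    "Claude-User": "citation",
    "OAI-SearchBot": "indexing",
    "PerplexityBot": "indexing",
    "GPTBot": "training",
    "Meta-ExternalAgent": "training",
}


def classify_intent(user_agent: str) -> Optional[str]:
    """Return the intent class for a user-agent string, or None if unknown."""
    ua = user_agent.lower()
    for token, intent in INTENT_BY_TOKEN.items():
        if token.lower() in ua:
            return intent
    return None


def intent_mix(user_agents: Iterable[str]) -> Dict[str, float]:
    """Share of classified bot visits per intent class (empty dict if none)."""
    counts = Counter()
    for ua in user_agents:
        intent = classify_intent(ua)
        if intent is not None:
            counts[intent] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total
            for name in ("citation", "indexing", "training")}
```

Feed it the user-agent column from your logs, for example `intent_mix(["Mozilla/5.0 (compatible; GPTBot/1.0)", "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"])`. Matching on the user-agent alone says nothing about whether the request is genuine, so run it on verified traffic only.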

AI bot user-agent reference table

These are the user-agent tokens that matter in 2026, what each one is for, and how to confirm a request is genuine rather than spoofed.

User-agent	Owner	Intent	How to verify
`GPTBot`	OpenAI	Training	UA match + OpenAI’s published IP ranges
`ChatGPT-User`	OpenAI	Citation (live fetch during a conversation)	UA match + OpenAI’s published IP ranges
`OAI-SearchBot`	OpenAI	Indexing (ChatGPT search index)	UA match + OpenAI’s published IP ranges
`ClaudeBot`	Anthropic	Training / indexing	UA match + IP range and ASN lookup
`Claude-User`	Anthropic	Citation (live fetch during a conversation)	UA match + IP range and ASN lookup
`PerplexityBot`	Perplexity	Indexing	UA match + Perplexity’s published IP ranges
`Google-Extended`	Google	Training control (robots.txt token, rides on Googlebot crawls)	Managed in robots.txt, no separate crawler to verify
`GoogleOther`	Google	Indexing / research crawls outside core Search	Reverse + forward DNS lookup
`Bingbot`	Microsoft	Indexing (also grounds Copilot answers)	Reverse + forward DNS lookup
`Meta-ExternalAgent`	Meta	Training	UA match + IP range and ASN lookup

Two details trip people up. First, Google-Extended is a robots.txt control token rather than a crawler you will see in logs; it governs whether content Googlebot already fetched can feed Gemini and other AI products. Second, OpenAI runs three separate bots on purpose, so blocking GPTBot in robots.txt to keep your content out of training data does nothing to stop ChatGPT-User from fetching your page for a live answer. Each bot needs its own robots.txt decision.
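The per-bot point is easy to check with Python's standard-library robots.txt parser. The sketch below uses a made-up robots.txt and a hypothetical `allowed` helper; it shows how a standards-following parser reads the rules, and does not prove how any particular operator's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of training, stay open to search and live fetches.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        print(agent, allowed(ROBOTS_TXT, agent, "https://example.com/pricing"))
```

Only the GPTBot group carries a Disallow rule, so the other two tokens remain free to fetch the same URL.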

Verifying real bots vs spoofed ones

User-agent strings are plain text, and scrapers impersonate AI bots constantly to slip past rate limits. A verification pipeline runs four checks in order:

UA match against the known token list above.
IP range check against the ranges OpenAI and Perplexity publish for their crawlers.
Reverse DNS, then forward DNS to confirm the hostname resolves back to the same IP (the standard method for Googlebot and Bingbot).
ASN lookup to confirm the request originates from the operator’s network rather than a residential proxy.

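Not every check applies to every operator: published IP ranges cover OpenAI and Perplexity, while reverse and forward DNS is the method for Googlebot and Bingbot. A request that fails any check that applies to its claimed operator should be treated as spoofed, and including it in your numbers inflates every downstream metric. Profound’s Agent Analytics runs this verification automatically and benchmarks the cleaned data against 100,000+ pages in the Profound Network, which is the kind of baseline you cannot build from one site’s logs.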
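Two of these checks can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the CIDR blocks are RFC 5737 documentation ranges standing in for an operator's published list, the hostname suffixes are supplied by the caller, and the resolver functions are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List

# Placeholder ranges (RFC 5737 documentation blocks), NOT any operator's real list.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """IP range check: is the address inside any of the given CIDR blocks?"""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def _reverse(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    return socket.gethostbyname_ex(host)[2]


def forward_confirmed_rdns(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Reverse DNS, then forward DNS, confirming the hostname maps back to the IP."""
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:  # lookup failures (herror, gaierror) subclass OSError
        return False
    if not host.endswith(tuple(allowed_suffixes)):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

In practice you would load the current ranges from the operator's documentation and pass the hostname suffixes the operator documents for its crawlers. The ASN lookup is left out because it needs an external data source.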

How to collect the data: integration matrix

Since the data lives at the infrastructure layer, the integration path depends on your stack. These are the standard collection methods:

Platform	Method	Delivery
Cloudflare	Worker at the edge	Real time
Cloudflare	Logpush	Near real time (batched pushes)
Amazon CloudFront	Access log delivery	Batch
Vercel	Log drains	Streaming
Fastly	Real-time log streaming over HTTPS	Streaming
Netlify	Log drains	Streaming
Akamai	DataStream 2	Streaming
WordPress	Server-side plugin	Real time
Shopify	Platform integration	Managed
Any origin server	Raw access log upload	Batch

If you run on a major CDN, you can be collecting verified bot data within a day; the CDN already sees every request, and the integration just routes a filtered copy somewhere queryable. The hard part comes after collection: classifying intent, verifying identity, and joining bot visits to citation outcomes, which is the part purpose-built tools handle. Profound’s Pages view rolls citations, bot activity, and page health into a single per-URL command center, so you can move from raw log data to per-page diagnosis without stitching queries together.

💡 If you do nothing else this week: grep your raw access logs for GPTBot, ClaudeBot, and ChatGPT-User. Most teams who look for the first time find thousands of AI bot requests they had no idea existed.
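If you would rather count than eyeball grep output, the same first look can be sketched in Python. This assumes your server writes the common "combined" access log format; the regular expression, the function name, and the `access.log` path in the usage note are illustrative, not tied to any particular server configuration.

```python
import re
from collections import Counter
from typing import Iterable

TOKENS = ["GPTBot", "ClaudeBot", "ChatGPT-User"]

# Combined log format: "METHOD path protocol" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_bot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per (bot token, HTTP status) across access log lines."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua").lower()
        for token in TOKENS:
            if token.lower() in ua:
                hits[(token, match.group("status"))] += 1
                break
    return hits
```

A typical call is `with open("access.log") as f: print(count_ai_bot_hits(f).most_common())`. Keeping the status code in the key means blocked requests (403, 404) show up next to successful ones. These are raw user-agent matches, so treat the totals as unverified until the IP checks have run.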

The diagnostic framework: what your bot data means

Bot logs become useful when you read patterns instead of totals. This is the troubleshooting table I use:

What you see	What it means
High GPTBot visits, zero ChatGPT citations	Extractability problem. The model reads you but cannot lift an answer. Check for missing structured data, key claims buried below the fold, or answers that require three paragraphs of context.
403 or 404 spikes from ClaudeBot or GPTBot	Technical block. Your WAF, bot management rules, or robots.txt is turning crawlers away. The model cannot cite what it cannot fetch.
200 responses, healthy crawl volume, still no citations	Content problem rather than technical. The page works mechanically; the answer it contains loses to competitors’ pages.
Rising ChatGPT-User visits	You are appearing in live answers right now. Find which pages get the fetches and protect them in your next redesign.
Training-bot visits but no indexing-bot visits	Future model versions will know you, but current search-grounded answers will not surface you. Check whether your robots.txt blocks the search bots specifically.
Bot traffic concentrated on the homepage only	Discovery problem. Internal linking or sitemap issues are keeping crawlers from reaching your deep pages, which is where most citations come from.
New page uncited past 37 days	Past the 90th percentile for time to first citation in the benchmark below. Assume a technical block and audit robots.txt, WAF rules, and response codes before touching the content.
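Several of these rows can be encoded as per-page rules. The sketch below is one possible encoding, not Profound's implementation: the `PageStats` fields, the 20% error-share threshold, and the finding labels are all assumptions, and the site-wide homepage-concentration check is left out because it needs more than one page's data.

```python
from dataclasses import dataclass
from typing import List

ESCALATION_DAYS = 37     # 90th-percentile benchmark from the table
ERROR_SHARE_LIMIT = 0.2  # hypothetical threshold for "spike" of 403/404 responses


@dataclass
class PageStats:
    training_visits: int     # e.g. GPTBot
    indexing_visits: int     # e.g. OAI-SearchBot, PerplexityBot
    live_fetches: int        # e.g. ChatGPT-User
    citations: int           # citations observed in AI answers
    blocked_responses: int   # 403/404 responses served to AI bots
    days_since_publish: int


def diagnose(p: PageStats) -> List[str]:
    """Return the table rows that match a page's bot data, in table order."""
    findings = []
    visits = p.training_visits + p.indexing_visits + p.live_fetches
    if visits and p.blocked_responses / visits > ERROR_SHARE_LIMIT:
        findings.append("technical block: check WAF, bot rules, robots.txt")
    if p.training_visits and not p.indexing_visits:
        findings.append("indexing gap: check robots.txt rules for search bots")
    if visits and not p.citations and not p.blocked_responses:
        findings.append("extractability or content problem: crawled but never cited")
    if p.live_fetches:
        findings.append("live answers: protect this page in the next redesign")
    if not p.citations and p.days_since_publish > ESCALATION_DAYS:
        findings.append("escalate: audit robots.txt, WAF rules, response codes")
    return findings
```

For example, `diagnose(PageStats(40, 0, 0, 0, 0, 45))` flags the indexing gap, the crawled-but-uncited pattern, and the escalation rule at once.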

That last row comes from a measurable benchmark, which brings us to the KPI side of agent analytics.

The benchmark: how fast should citations arrive?

Agent analytics gives you a number worth managing against. Profound tracked roughly 900 newly published marketing pages over a 60-day window:

Median time from publish to first ChatGPT or Claude citation is 6.81 days. 75% of cited pages are cited within 18.68 days, and 90% within 37.1 days. Under a week to first citation puts a page ahead of the curve.

This turns agent analytics from passive logging into an operating metric. Publish, watch for the indexing crawl, watch for the first citation, and escalate to a technical audit if day 37 passes quietly. Without bot-level data you cannot run that play, because you cannot distinguish “the model chose not to cite us” from “the model never saw the page.”
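That play reduces to a small status check per page. The sketch below hard-codes the two benchmark figures quoted above; the function name, the status labels, and the dates in the usage note are illustrative assumptions.

```python
from datetime import date
from typing import Optional

MEDIAN_DAYS = 6.81  # median days to first citation, from the benchmark above
P90_DAYS = 37.1     # 90% of cited pages were cited within this many days


def citation_status(published: date, today: date,
                    first_cited: Optional[date] = None) -> str:
    """Compare a page's time to first citation against the benchmark."""
    if first_cited is not None:
        days = (first_cited - published).days
        return "ahead of median" if days <= MEDIAN_DAYS else "cited"
    waited = (today - published).days
    if waited > P90_DAYS:
        return "escalate: technical audit"
    return "waiting"
```

A page published on `date(2026, 1, 1)` and still uncited on `date(2026, 2, 10)` has waited 40 days, so `citation_status` returns the escalation label.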

One more thing measurement kills: best practices that do not survive contact with data. Serving LLMs Markdown instead of HTML is a popular recommendation, and Profound A/B tested it across 381 pages for three weeks:

Profound research

Serving Markdown to LLMs shows no statistically significant lift over HTML in A/B test

Profound A/B tested 381 pages across 3 weeks comparing Markdown vs HTML responses to LLM crawlers. Markdown produced ~16% higher mean bot visits but the result was not statistically significant; any lift concentrated in already high-traffic pages.

381 pages tracked for 3 weeks
Markdown saw ~16% higher mean bot visits, not statistically significant
Median bot visits — All LLM bots: HTML 6, Markdown 7; ChatGPT: HTML 4, Markdown 5
Average bot visits — All LLM bots: HTML 13.4, Markdown 15.7; ChatGPT: HTML 9.9, Markdown 11.7
Lift concentrated in pages at the 60th percentile and above; median page gained ~1 extra visit

Profound · Serving Markdown to LLMs has no statistically significant benefits · added 2026-05-14

The Markdown group showed about 16% higher mean bot visits, which did not reach statistical significance, and the apparent lift concentrated in pages that were already high-traffic. Measure before you re-platform.
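If you run a similar test on your own pages, a permutation test is one reasonable way to ask whether a difference in mean bot visits could be chance. It makes no normality assumption, which suits skewed visit counts. The source does not say which test Profound used, so read this as a generic sketch; the function name and defaults are assumptions.

```python
import random
from typing import Sequence


def permutation_p_value(a: Sequence[float], b: Sequence[float],
                        rounds: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in group means."""
    def mean(values: Sequence[float]) -> float:
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly reassign pages to the two groups
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / rounds
```

Pass the per-page bot-visit counts for each arm, such as `permutation_p_value(html_visits, markdown_visits)`. A p-value above your chosen threshold means the observed gap is consistent with random assignment.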

Why agent analytics matters now: the May 7 shift

For most of the category’s life, the honest criticism was that AI visibility was hard to tie to traffic. On May 7, 2026, that changed. ChatGPT switched from citation chips to inline branded hyperlinks, routing brand mentions directly to brand websites. Profound measured the before and after across 8M+ referral visits:

Daily OpenAI referrals jumped from roughly 158K to 249K, about 1.6x, overnight, and held.
The share of answers containing a clickable brand URL went from 4-5% to 22%, a roughly 5x larger attribution surface.
Homepage share of OpenAI referrals went from 3.5% to 24.2%; about 1 in 4 clicks now lands on a homepage.
B2B SaaS referrals roughly tripled, while e-commerce stayed flat.

📊 Takeaway: before May 7, agent analytics measured a leading indicator. After May 7, the chain is complete and auditable: bot crawl → citation → clickable link → referral session → revenue. Citation share now converts to clicks at 4 to 5x the prior rate, which means the bot-level data that predicts citations finally connects to a number your CFO recognizes.

The homepage stat deserves its own action item. If a quarter of your AI referral traffic lands on /, your homepage is now an AI landing page, and most homepages are written for people who already know the brand. Agent analytics tells you which pages the bots read; referral data tells you where the humans arrive. Right now those two lists barely overlap for most sites.
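To check your own homepage share, you can classify sessions by referrer and count how many AI-referred sessions land on /. This sketch assumes you can export (referrer, landing path) pairs from your analytics tool. Only the chatgpt.com referrer comes from this article; the function names are illustrative and the host mapping is meant to be extended.

```python
from typing import Iterable, Optional, Tuple
from urllib.parse import urlparse

# Referrer host -> AI engine. Add other engines' referrer hosts as you confirm them.
AI_REFERRER_HOSTS = {"chatgpt.com": "ChatGPT"}


def ai_source(referrer: str) -> Optional[str]:
    """Return the AI engine name for a referrer URL, or None if not AI-referred."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, name in AI_REFERRER_HOSTS.items():
        if host == domain or host.endswith("." + domain):
            return name
    return None


def homepage_share(sessions: Iterable[Tuple[str, str]]) -> float:
    """Fraction of AI-referred sessions whose landing path is the homepage."""
    landings = [path for referrer, path in sessions if ai_source(referrer)]
    if not landings:
        return 0.0
    return sum(1 for path in landings if path == "/") / len(landings)
```

Compare the result with the 24.2% figure above, then line up the list of AI landing pages against the pages the bots actually read.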

Where this fits in an AEO program

Agent analytics is the measurement layer of answer engine optimization, the same way log-file analysis was the unglamorous backbone of technical SEO. The workflow it enables:

Instrument your CDN or server logs so every AI bot visit is captured and verified.
Classify visits by intent: citation, indexing, training.
Diagnose with the framework above: technical blocks, extractability gaps, discovery problems.
Benchmark new content against the 6.81-day median and the 37-day escalation threshold.
Attribute referral traffic from AI engines back to the pages and citations that earned it.

Citation share is the new market share, and agent analytics is how you audit it. The brands that win AI search over the next two years will be the ones that stopped guessing what the models see and started reading the logs.

Frequently asked questions

What is agent analytics?

Agent analytics is the practice of measuring AI crawler and assistant traffic to your website through server logs or CDN integrations. It identifies which AI bots visit your pages, classifies their intent (citation, indexing, or training), and connects bot activity to outcomes like AI citations and referral traffic.

How is agent analytics different from GA4?

GA4 measures human visitors by executing a JavaScript tag in the browser. AI crawlers fetch pages server-to-server and never run that JavaScript, so GA4 records nothing. Agent analytics reads server logs or CDN data instead, which captures every request regardless of whether JavaScript runs.

Why does GA4 miss AI crawlers?

Because AI crawlers do not execute JavaScript. GA4 only counts a visit when its tracking script fires in a browser. Bots like GPTBot and ClaudeBot request the raw HTML and leave, so the script never loads and the visit never registers.

How do I verify a real ChatGPT bot vs a spoofed one?

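Check more than the user-agent string, which anyone can fake. Match the request IP against OpenAI’s published IP ranges, and use an ASN check as a further signal that the request comes from the operator’s network. Reverse and forward DNS lookups are the equivalent check for crawlers such as Googlebot and Bingbot. A request claiming to be GPTBot from an IP outside OpenAI’s published ranges is spoofed.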

How long until ChatGPT cites a new page?

Profound tracked roughly 900 newly published marketing pages over a 60-day window and found a median of 6.81 days from publish to first ChatGPT or Claude citation, with 90% of cited pages cited within 37 days. Past 37 days uncited, suspect a technical block rather than a content problem.

What is the difference between GPTBot, ChatGPT-User, and OAI-SearchBot?

GPTBot collects training data for OpenAI’s models. OAI-SearchBot builds the search index behind ChatGPT’s web search. ChatGPT-User fires in real time when ChatGPT fetches a page to answer a live question, making it the strongest citation signal of the three. Each respects its own robots.txt rule, so blocking one does nothing about the other two.
