Every AI visibility number you report is a sample estimate, and most dashboards present it like a census. In Profound’s tracking of 22.5M ChatGPT Shopping offers, 95% of product titles appeared in under 30% of runs of the same prompt. An arXiv paper published in March 2026 (2603.08924, Ronald Sielinski) reached the same conclusion from the academic side: citation distributions follow a power law, sample-to-sample variability is large, and many of the domain-level differences marketers report fall entirely inside the measurement noise floor.

So this page is the reference I wish existed when teams ask me which metrics to put on the AEO dashboard. For each metric: a one-line definition, the formula, a 2026 benchmark with a source, and the sample size behind that benchmark. If a number has no source, it is not on this page.

The 2026 AI visibility metrics taxonomy

| Metric | What it answers | 2026 benchmark | Source |
| --- | --- | --- | --- |
| Citation Share / Share of Voice | How much of the answer surface do I own? | Topic-dependent; worked example below: 0.50% | Profound citation data |
| Time-to-First-Citation | How fast do new pages get picked up? | Median 6.81 days, P90 37.10 days | Profound, ~900 pages, Mar to May 2026 |
| Co-citation Rate | Which domains do engines pair me with? | Edmunds + KBB: 32% | Profound, 700K+ ChatGPT conversations |
| Prompt Fanout Uniqueness | How far does the engine wander from the literal prompt? | ChatGPT 91%, Perplexity 14% | Profound, 10K prompts, Mar to Apr 2026 |
| Inline Brand Hyperlink Share | How often does my mention carry a clickable link? | 22% of answers (up from 4 to 5%) | Profound, 8M+ referral visits, May 2026 |
| Shopping Trigger Rate | Does the Shopping surface even apply to me? | Apparel 5.2x baseline, SaaS <0.01x | Profound, 100.7M runs |
| Rank Stability | Is my rank movement real or noise? | Power-law variance; bootstrap CIs required | arXiv 2603.08924 |

Each metric gets a formula box below: numerator, denominator, and the minimum measurement discipline that makes the number trustworthy.

Citation Share and Share of Voice

Definition: the percentage of citations (or brand mentions) in a defined prompt set that belong to your domain.

Formula: Citation Share = (citations of your domain ÷ all citations across the prompt set) × 100. Share of Voice is the mention-based variant: (answers mentioning your brand ÷ all answers in the prompt set) × 100. Track them separately. A brand can be mentioned without being linked, and after May 7, 2026, that distinction has a traffic number attached (see Inline Brand Hyperlink Share below).

Worked example: in the AI visibility metrics topic window I track for this site, ChatGPT produced 48,589 citations, of which nicklafferty.com earned 242. That is a 0.50% citation share. The honest version of this metric always names the denominator: 0.50% of a 48,589-citation topic window is a different claim than 0.50% of ten hand-picked prompts.
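To make the two denominators concrete, here is a minimal Python sketch of both formulas. The function names and the plain substring match for brand mentions are my own illustrative choices, not any vendor's method; a real tracker would need better mention detection than a case-insensitive substring.

```python
def citation_share(domain_citations: int, total_citations: int) -> float:
    """Citation Share: your citations as a percentage of all citations in the prompt set."""
    if total_citations <= 0:
        raise ValueError("total_citations must be positive")
    return 100.0 * domain_citations / total_citations


def share_of_voice(answers: list[str], brand: str) -> float:
    """Share of Voice: percentage of answers that mention the brand.

    Assumption: a case-insensitive substring match counts as a mention.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    needle = brand.lower()
    mentions = sum(1 for text in answers if needle in text.lower())
    return 100.0 * mentions / len(answers)


# The worked example above: 242 citations out of a 48,589-citation topic window.
share = citation_share(242, 48_589)
print(f"{share:.2f}% of 48,589 citations")  # prints "0.50% of 48,589 citations"
```

Printing the denominator next to the share is deliberate: the number should never travel without it.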

Measurement discipline: citation share is only comparable across periods if the prompt set is frozen. Change the prompts and you changed the denominator, and the trend line is fiction. Per-engine reporting is mandatory here; I wrote a full essay on why the same page gets cited 18% on ChatGPT and 0% on Perplexity in the citation asymmetry post.

Time-to-First-Citation

Definition: days between publishing a page and the first observed citation of that page in an AI answer.

Formula: TTFC = date of first observed citation minus publish date, reported as a distribution (median, P75, P90), never as a mean. The distribution is right-skewed (the P90 below is more than five times the median), so the mean tells you nothing useful.
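A minimal sketch of that reporting rule, using only the Python standard library. The helper names are hypothetical, and the sketch assumes every page in the cohort has already been cited at least once; pages still waiting for a first citation would need separate handling.

```python
from datetime import date
from statistics import median, quantiles


def ttfc_days(publish: date, first_citation: date) -> int:
    """Days between publishing a page and its first observed citation."""
    return (first_citation - publish).days


def ttfc_summary(days: list[float]) -> dict[str, float]:
    """Median, P75 and P90 for a cohort of pages (needs at least two pages)."""
    # quantiles(n=100) returns 99 cut points; index 74 is P75, index 89 is P90.
    cuts = quantiles(days, n=100, method="inclusive")
    return {"median": median(days), "p75": cuts[74], "p90": cuts[89]}


# Hypothetical cohort: one page published March 2 and first cited March 9.
print(ttfc_days(date(2026, 3, 2), date(2026, 3, 9)))  # prints 7
```

Feed `ttfc_summary` one cohort at a time (for example, every page published in a given month) so the percentiles describe a comparable group.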

Benchmarks (Profound, ~900 newly published marketing pages, 60-day window, March to May 2026):

| Percentile | Days to first ChatGPT/Claude citation |
| --- | --- |
| Median | 6.81 |
| P75 | 18.68 |
| P90 | 37.10 |

How to use it: under 7 days puts a page ahead of the curve. If a page is still uncited past day 37, stop waiting and start debugging: check robots.txt, confirm AI crawlers can reach the page, and verify nothing at the CDN layer is blocking GPTBot or ClaudeBot. Past P90, the problem is almost always technical access; content quality is the wrong first suspect.
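The robots.txt part of that checklist can be scripted. This is a sketch using Python's standard `urllib.robotparser`; the `blocked_crawlers` helper, the sample robots.txt, and the example.com URL are all made up for illustration. It only evaluates robots.txt rules, so a block at the CDN or firewall layer will not show up here.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the checklist above.
AI_CRAWLERS = ("GPTBot", "ClaudeBot")


def blocked_crawlers(robots_txt: str, page_url: str) -> list[str]:
    """Return the AI crawlers that robots.txt forbids from fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, page_url)]


# Hypothetical robots.txt that blocks GPTBot but allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_crawlers(sample, "https://example.com/blog/new-post"))  # prints ['GPTBot']
```

To check a live site, fetch its robots.txt yourself and pass the text in; keeping the fetch separate makes the rule check easy to test offline.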

Co-citation Rate

Definition: the share of AI answers citing domain A or domain B that cite both. ChatGPT validates answers through multiple sources, and the sources travel in vertical-specific pairs.

Formula: Co-citation Rate (A, B) = (answers citing both A and B ÷ answers citing A or B) × 100.

Benchmarks (Profound, 700,000+ U.S. English ChatGPT conversations):

| Vertical | Pair | Co-citation rate |
| --- | --- | --- |
| Car research | edmunds.com + kbb.com | 32% |
| Careers | glassdoor.com + indeed.com | 29% |
| Real estate | redfin.com + zillow.com | 28% |
| Travel | kayak.com + expedia.com | 21% |
| News | apnews.com + reuters.com | 15% |

How to find your category’s pair: pull the citations for your top 50 category prompts and count which two domains appear together most often. That pair is the validation set ChatGPT uses for your vertical. Your strategic position is one of three: you are in the pair, you are the third source engines reach for, or you are invisible. Two related stats sharpen the picture: roughly 18% of ChatGPT conversations trigger a web search at all, and Turn 1 is 2.5x more likely to cite than Turn 10, so co-citation battles are won in the first exchange.
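The pair-finding step is a counting exercise, and a rough Python sketch looks like this. It assumes you already have, for each answer, the set of domains it cited; the function names and the sample data are invented for illustration, and `co_citation_rate` follows the formula given above (both ÷ either).

```python
from collections import Counter
from itertools import combinations


def top_pairs(answers: list[set[str]], n: int = 3) -> list[tuple[tuple[str, str], int]]:
    """Count how often each pair of domains is cited in the same answer."""
    pair_counts: Counter = Counter()
    for cited in answers:
        # sorted() keeps (a, b) and (b, a) from being counted as different pairs.
        pair_counts.update(combinations(sorted(cited), 2))
    return pair_counts.most_common(n)


def co_citation_rate(answers: list[set[str]], a: str, b: str) -> float:
    """Answers citing both a and b, as a percentage of answers citing a or b."""
    both = sum(1 for cited in answers if a in cited and b in cited)
    either = sum(1 for cited in answers if a in cited or b in cited)
    return 100.0 * both / either if either else 0.0


# Hypothetical citations from five answers to car-research prompts.
answers = [
    {"edmunds.com", "kbb.com", "caranddriver.com"},
    {"edmunds.com", "kbb.com"},
    {"kbb.com"},
    {"edmunds.com", "caranddriver.com"},
    {"edmunds.com", "kbb.com", "reddit.com"},
]
print(top_pairs(answers, 1))  # prints [(('edmunds.com', 'kbb.com'), 3)]
print(co_citation_rate(answers, "edmunds.com", "kbb.com"))  # prints 60.0
```

Five answers is far too few to trust; the point is only the shape of the computation before you run it on your top 50 category prompts.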

Prompt Fanout Uniqueness

Definition: the share of retrieval queries an engine generates from a user prompt that are unique rather than restatements of the original wording. This is an engine-level property, and it determines how you should sample.

Formula: Fanout Uniqueness = (unique generated queries ÷ all generated queries) × 100, measured by running a fixed prompt set and logging the engine’s retrieval queries.
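The formula leaves open how to decide that a generated query is a restatement. One possible rule, sketched below in Python, is token overlap: treat a query as a restatement when its word set has a Jaccard similarity of 0.8 or more with the prompt. Both that rule and the 0.8 threshold are my assumptions for illustration; nothing here describes how the benchmark study classified queries.

```python
import re


def _tokens(text: str) -> frozenset[str]:
    """Lowercased alphanumeric word set."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def is_restatement(prompt: str, query: str, threshold: float = 0.8) -> bool:
    """Assumption: a query restates the prompt if token Jaccard >= threshold."""
    p, q = _tokens(prompt), _tokens(query)
    if not p or not q:
        return False
    return len(p & q) / len(p | q) >= threshold


def fanout_uniqueness(prompt: str, queries: list[str], threshold: float = 0.8) -> float:
    """Percentage of logged retrieval queries that are not restatements."""
    if not queries:
        raise ValueError("queries must not be empty")
    unique = sum(1 for q in queries if not is_restatement(prompt, q, threshold))
    return 100.0 * unique / len(queries)


# Hypothetical retrieval log for a single prompt.
prompt = "best crm for small business"
queries = [
    "best CRM for small business",
    "crm pricing comparison 2026",
    "small business crm reviews reddit",
    "hubspot vs salesforce smb",
]
print(fanout_uniqueness(prompt, queries))  # prints 75.0
```

Whatever rule you pick, freeze it along with the prompt set, or quarter-over-quarter comparisons will measure your classifier instead of the engine.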

Benchmarks (Profound, 10,000 prompts over 2 weeks, March to April 2026):

| Engine | Unique queries | Overlap with original prompt |
| --- | --- | --- |
| ChatGPT | 91% | 13% |
| Copilot | 47% | 50% |
| Perplexity | 14% | 88% |

Why it belongs on a metrics page: fanout uniqueness sets the variance of every other metric you measure on that engine. ChatGPT behaves like a researcher, generating mostly novel queries per run, which means run-to-run citation results swing more. Perplexity runs nearly 1:1 search-style queries, so its results are more repeatable at lower sample counts. The engine with the highest visibility upside is also the one that demands the most runs before you trust a number.

Inline Brand Hyperlink Share

Definition: the share of AI answers mentioning your brand that include a clickable inline link to your site. This metric barely existed before May 7, 2026, when ChatGPT shifted from citation chips to inline branded URLs.

Formula: Inline Hyperlink Share = (answers mentioning your brand with a clickable brand URL ÷ answers mentioning your brand) × 100.

Benchmarks (Profound, 8M+ referral visits across thousands of sites):

  • Share of answers with a clickable brand URL: 4 to 5% before May 7, 22% after. Roughly a 5x jump.
  • OpenAI referral traffic: ~158K daily visits before, ~249K after, holding at about 1.6x.
  • Homepage share of OpenAI referrals: 3.5% to 24.2%. About 1 in 4 clicks now lands on a homepage, so homepages are suddenly an AEO surface again.
  • B2B SaaS referrals rose about 3x; ecommerce stayed flat.

How to use it: if you measured “mentions” before May 7 and “linked mentions” after, your trend line spans two different metrics. Reset the baseline at the change date and report the two eras separately.
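A small Python sketch of the baseline split. Each record is one answer from whatever set your denominator counts, flagged for whether it carried a clickable brand link. The record layout and function name are hypothetical, and counting May 7 itself as the first day of the new era is my assumption.

```python
from datetime import date
from typing import Optional

CHANGE_DATE = date(2026, 5, 7)  # the ChatGPT inline-link change described above

Record = tuple[date, bool]  # (answer date, answer carried a clickable brand link)


def split_baseline(
    records: list[Record], change_date: date = CHANGE_DATE
) -> dict[str, Optional[float]]:
    """Linked share (%) for each era; None when an era has no answers."""

    def share(flags: list[bool]) -> Optional[float]:
        return 100.0 * sum(flags) / len(flags) if flags else None

    # Assumption: the change date itself belongs to the "after" era.
    before = [linked for day, linked in records if day < change_date]
    after = [linked for day, linked in records if day >= change_date]
    return {"before": share(before), "after": share(after)}


# Hypothetical records straddling the change date.
records = [
    (date(2026, 5, 1), False),
    (date(2026, 5, 2), False),
    (date(2026, 5, 3), True),
    (date(2026, 5, 4), False),
    (date(2026, 5, 8), True),
    (date(2026, 5, 9), False),
]
print(split_baseline(records))  # prints {'before': 25.0, 'after': 50.0}
```

Returning two numbers instead of one blended share makes it hard to draw a single trend line through the change by accident.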

Shopping Trigger Rate

Definition: the share of prompt runs in your category that surface the ChatGPT Shopping carousel. Measure this first, because for most B2B categories the answer is “never,” and every downstream Shopping metric is moot.

Formula: Shopping Trigger Rate = (runs surfacing the Shopping carousel ÷ total runs) × 100, with each prompt run 10+ times.
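In code, the 10-run floor is just a filter applied before any rate is reported. The sketch below assumes you log one boolean per run (carousel shown or not) keyed by prompt; the names and the sample prompts are illustrative only.

```python
MIN_RUNS = 10  # floor from the formula above: each prompt run 10+ times


def trigger_rates(runs: dict[str, list[bool]], min_runs: int = MIN_RUNS) -> dict[str, float]:
    """Per-prompt Shopping trigger rate (%), skipping under-sampled prompts."""
    return {
        prompt: 100.0 * sum(flags) / len(flags)
        for prompt, flags in runs.items()
        if len(flags) >= min_runs
    }


def never_triggered_share(rates: dict[str, float]) -> float:
    """Percentage of sufficiently sampled prompts that never triggered Shopping."""
    if not rates:
        raise ValueError("no prompt has enough runs yet")
    return 100.0 * sum(1 for rate in rates.values() if rate == 0.0) / len(rates)


# Hypothetical run log: True means the Shopping carousel appeared in that run.
runs = {
    "best running shoes": [True] * 7 + [False] * 3,
    "best crm software": [False] * 10,
    "cheap flights to denver": [False, True],  # only 2 runs, so it is excluded
}
rates = trigger_rates(runs)
print(rates)  # prints {'best running shoes': 70.0, 'best crm software': 0.0}
print(never_triggered_share(rates))  # prints 50.0
```

Dropping the two-run prompt rather than reporting it as "50%" is the whole discipline: an under-sampled rate is worse than a missing one.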

Benchmarks (Profound, 7,500 unique open-ended prompts across 100.7M runs):

| Category | Trigger rate vs baseline |
| --- | --- |
| Apparel / fashion | 5.2x |
| Physical products | 4.7x |
| Consumable grocery | 2.5x |
| Health / medical | 0.5x |
| Vehicles / equipment | 0.17x |
| Travel / hospitality | 0.05x |
| Professional services | 0.04x |
| Software / SaaS | <0.01x |
| Financial products | 0.0x |

Category alone reproduces trigger behavior with 95 to 97% accuracy. Across roughly 2M prompts run 10+ times each, 79% never triggered Shopping in any run, only about 6% triggered reliably, and a triggering prompt had an 83% chance of triggering again the next day. If you sell software or financial products, take Shopping Trigger Rate off the dashboard and spend the measurement budget on Citation Share.

Rank Stability: the metric that audits all the others

Definition: the consistency of your domain’s citation rank across repeated runs of an identical prompt set. This is the metric arXiv 2603.08924 was written about, and it is the one most dashboards skip.

Sielinski’s method: repeated sampling across Perplexity, SearchGPT, and Gemini in two regimes (daily collections over nine days, plus 10-minute-interval sampling), then bootstrap confidence intervals around each domain’s citation share. Three findings matter for practitioners:

  1. Citation distributions follow a power law. A few domains absorb most citations; the long tail is sparse, so tail estimates are inherently noisy.
  2. Many domain-level differences fall inside the noise floor. Two domains with “different” visibility scores often have overlapping confidence intervals, meaning the difference is not measurable at that sample size.
  3. Rank instability extends beyond the tail. Rankings wobble even among well-cited domains, so “we moved from #6 to #4 this week” needs an interval before it deserves a slide.

Profound’s larger datasets point the same direction. The 700K-conversation co-citation study and the 27M-prompt source-distribution analysis both show heavy concentration (Tier-1 publishers take just 2.6% of citations across 27M prompts, while Wikipedia appears in about 1 in 6 cited ChatGPT conversations), which is exactly the power-law shape that makes small samples lie. And in the 22.5M-offer Shopping dataset, 95% of product titles appeared in under 30% of runs of the same prompt. Run-to-run instability is the norm, even at the top of the distribution.

The practical formula: report every citation-share number as an interval. Bootstrap it: resample your runs with replacement, recompute the share, take the 2.5th and 97.5th percentiles. If this week’s interval overlaps last week’s, you do not have a trend yet. You have two samples from the same distribution.
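Here is a minimal Python sketch of that bootstrap, treating each run as the resampling unit. It is a percentile bootstrap written for illustration and makes no claim to match the paper's exact procedure; the function names, the 2,000-resample default, and the fixed seed are my own choices.

```python
import random

Run = tuple[int, int]  # (citations of your domain, all citations) in one run


def citation_share(runs: list[Run]) -> float:
    """Pooled citation share (%) across a list of runs."""
    total = sum(all_cites for _, all_cites in runs)
    if total == 0:
        raise ValueError("no citations in these runs")
    return 100.0 * sum(own for own, _ in runs) / total


def bootstrap_ci(runs: list[Run], n_boot: int = 2000, seed: int = 0) -> tuple[float, float]:
    """95% percentile-bootstrap interval for the pooled citation share."""
    if not runs:
        raise ValueError("runs must not be empty")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    shares = []
    for _ in range(n_boot):
        resample = rng.choices(runs, k=len(runs))  # draw runs with replacement
        if sum(all_cites for _, all_cites in resample) > 0:
            shares.append(citation_share(resample))
    if not shares:
        raise ValueError("no citations in any resample")
    shares.sort()
    lo = shares[int(0.025 * (len(shares) - 1))]
    hi = shares[int(0.975 * (len(shares) - 1))]
    return lo, hi


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share any point, i.e. no trend yet."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Usage is two calls and a comparison: compute `bootstrap_ci` for last week's runs and for this week's, then hold the slide if `intervals_overlap` returns True.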

How many runs do I need?

The honest answer is “more than your dashboard defaults to.” Since the underlying distributions are power-law, there is no single magic N, but the published studies give you calibration points for what serious measurement looks like:

| Metric | Sample behind the published benchmark | What that implies for your program |
| --- | --- | --- |
| Citation Share | 27M prompts (source distribution); 48,589 citations (one topic window) | Freeze the prompt set; report per engine with intervals |
| Time-to-First-Citation | ~900 pages over 60 days | Track cohorts of pages, not single launches |
| Co-citation Rate | 700K+ conversations | Use your top 50 category prompts as a floor for pair detection |
| Fanout Uniqueness | 10K prompts over 2 weeks | Re-measure quarterly; engine behavior shifts with model updates |
| Inline Hyperlink Share | 8M+ referral visits | Split baselines at the May 7, 2026 change date |
| Shopping Trigger Rate | ~2M prompts at 10+ runs each | 10 runs per prompt minimum before calling a trigger rate stable |

Two rules fall out of this. First, repeated runs beat more prompts: a 50-prompt set run 10 times tells you more about stability than a 500-prompt set run once, because the variance lives at the run level. Second, refuse to promote a delta without significance. Profound’s own Markdown-vs-HTML A/B across 381 pages found a 16% mean lift in bot visits and reported it as not statistically significant. That is the standard. A 16% lift that does not clear the noise floor is a hypothesis, and most week-over-week “wins” on AEO dashboards are smaller than 16%.

💡 Takeaway: day-over-day visibility reporting on a small prompt set is the AEO equivalent of checking a poll of 40 people every morning. The number will move. The movement means nothing.

Which metrics belong on your dashboard

  • B2B SaaS: Citation Share (per engine), Inline Brand Hyperlink Share (your category gained about 3x referrals from the May 7 change), Time-to-First-Citation, Co-citation Rate. Skip Shopping entirely.
  • Ecommerce and retail: all of the above plus Shopping Trigger Rate and rank-1 buy-link share. In ChatGPT Shopping, Walmart holds 8.78% rank-1 buy-link share against Target’s 4.93%, while Target leads total offer presence at 7.16% vs 6.54%. Headline placement and overall presence are separate metrics; track both.
  • Publishers and marketplaces: Citation Share and Co-citation Rate first. The 27M-prompt finding that 97.4% of citations come from non-Tier-1 earned media is the whole opportunity: engines cite the long tail of credible sources far more than prestige mastheads.

For the tools that measure these metrics, I keep a current vendor rundown in the LLM tracking tools guide, and the evaluation criteria for choosing between them live in the AI visibility platform buyer’s guide.

The standard to hold every number to

Citation share is the new market share, but market share reported without a denominator, an engine breakdown, and a confidence interval is a vanity metric with extra steps. The arXiv paper gave the industry the statistical vocabulary; Profound’s 700K-conversation, 27M-prompt, and 100M-run datasets confirm the shape at production scale. Measurable, not aspirational, and measurable means sampled honestly.

Frequently Asked Questions

What is a good citation share or Share of Voice benchmark?

There is no universal benchmark, because the denominator changes with every prompt set. A 0.50% citation share of a 48,589-citation topic window (my own number for this topic) is a real measurement; “12% Share of Voice” with no denominator is marketing. The more useful target is positional: find your category’s co-citation pair (Edmunds + KBB run at 32% in cars), and work toward being the third domain engines reach for when validating answers in your vertical.

How often should I measure AI visibility metrics?

Measure continuously, report weekly or monthly, and never present a day-over-day delta from a small prompt set as a trend. Citation distributions follow a power law with heavy run-to-run variability: 95% of ChatGPT Shopping titles appeared in under 30% of runs of the same prompt. Profound’s shopping methodology used 10+ runs per prompt before treating rates as stable, and that is a reasonable floor for any per-prompt metric.

Which AI visibility metrics matter for B2B SaaS vs ecommerce?

B2B SaaS: Citation Share per engine, Inline Brand Hyperlink Share, Time-to-First-Citation, and Co-citation Rate. B2B SaaS referrals rose about 3x after the May 7, 2026 ChatGPT change, so the hyperlink metric now carries pipeline weight. Ecommerce: add Shopping Trigger Rate and rank-1 buy-link share, since the Shopping carousel is a separate surface with its own concentration dynamics. SaaS and financial brands should drop Shopping metrics: those categories trigger at under 0.01x and 0.0x baseline respectively.

Why do my AI visibility numbers change day to day?

Because every measurement is a sample, and the underlying distribution is power-law with substantial variance. arXiv 2603.08924 showed via bootstrap confidence intervals that many domain-level visibility differences sit inside the measurement noise floor, and that rank instability extends well beyond low-ranked domains. The fix is more runs of a frozen prompt set, intervals on every reported share, and the discipline to call an insignificant lift insignificant, the way Profound did with its own 16% Markdown finding.