AI Visibility Metrics: Formulas, Benchmarks & Sample Sizes (2026)

Every AI visibility number you report is a sample estimate, and most dashboards present it like a census. In Profound’s tracking of 22.5M ChatGPT Shopping offers, 95% of product titles appeared in under 30% of runs of the same prompt. An arXiv paper published in March 2026 (2603.08924, Ronald Sielinski) reached the same conclusion from the academic side: citation distributions follow a power law, sample-to-sample variability is large, and many of the domain-level differences marketers report fall entirely inside the measurement noise floor.

So this page is the reference I wish existed when teams ask me which metrics to put on the AEO dashboard. For each metric: a one-line definition, the formula, a 2026 benchmark with a source, and the sample size behind that benchmark. If a number has no source, it is not on this page.

The 2026 AI visibility metrics taxonomy

Metric	What it answers	2026 benchmark	Source
Citation Share / Share of Voice	How much of the answer surface do I own?	Topic-dependent; worked example below: 0.50%	Profound citation data
Time-to-First-Citation	How fast do new pages get picked up?	Median 6.81 days, P90 37.10 days	Profound, ~900 pages, Mar to May 2026
Co-citation Rate	Which domains do engines pair me with?	Edmunds + KBB: 32%	Profound, 700K+ ChatGPT conversations
Prompt Fanout Uniqueness	How far does the engine wander from the literal prompt?	ChatGPT 91%, Perplexity 14%	Profound, 10K prompts, Mar to Apr 2026
Inline Brand Hyperlink Share	How often does my mention carry a clickable link?	22% of answers (up from 4 to 5%)	Profound, 8M+ referral visits, May 2026
Shopping Trigger Rate	Does the Shopping surface even apply to me?	Apparel 5.2x baseline, SaaS <0.01x	Profound, 100.7M runs
Rank Stability	Is my rank movement real or noise?	Power-law variance; bootstrap CIs required	arXiv 2603.08924
Google Model Variability	How much does visibility differ across Gemini, AIO, and AI Mode?	Median 8-point gap between best and worst Google model	Profound, 15,155 brand configs, May 2026

Each metric gets a formula box below: numerator, denominator, and the minimum measurement discipline that makes the number trustworthy.

Citation Share / Share of Voice

Definition: the percentage of citations (or brand mentions) in a defined prompt set that belong to your domain.

Formula: Citation Share = (citations of your domain ÷ all citations across the prompt set) × 100. Share of Voice is the mention-based variant: (answers mentioning your brand ÷ all answers in the prompt set) × 100. Track them separately. A brand can be mentioned without being linked, and after May 7, 2026, that distinction has a traffic number attached (see Inline Brand Hyperlink Share below).

Worked example: in the AI visibility metrics topic window I track for this site, ChatGPT produced 48,589 citations, of which nicklafferty.com earned 242. That is a 0.50% citation share. The honest version of this metric always names the denominator: 0.50% of a 48,589-citation topic window is a different claim than 0.50% of ten hand-picked prompts.
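The two formulas above are simple enough to pin down in code. This is a minimal sketch, not a vendor implementation; the function names are my own, and the only real numbers are the 242 and 48,589 from the worked example.

```python
def citation_share(domain_citations: int, total_citations: int) -> float:
    """Citations of one domain as a percentage of all citations in the prompt set."""
    if total_citations <= 0:
        raise ValueError("total_citations must be positive")
    return 100.0 * domain_citations / total_citations


def share_of_voice(answers_mentioning_brand: int, total_answers: int) -> float:
    """Answers mentioning the brand as a percentage of all answers in the prompt set."""
    if total_answers <= 0:
        raise ValueError("total_answers must be positive")
    return 100.0 * answers_mentioning_brand / total_answers


# Worked example from the text: 242 of 48,589 ChatGPT citations.
share = citation_share(242, 48_589)
print(f"{share:.2f}% of 48,589 citations")  # prints "0.50% of 48,589 citations"
```

Printing the denominator next to the percentage is deliberate: it keeps the number from travelling without its context.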

Measurement discipline: citation share is only comparable across periods if the prompt set is frozen. Change the prompts and you have changed the denominator, which turns the trend line into fiction. Per-engine reporting is mandatory here; I wrote a full essay on why the same page gets cited 18% on ChatGPT and 0% on Perplexity in the citation asymmetry post.

Time-to-First-Citation

Definition: days between publishing a page and the first observed citation of that page in an AI answer.

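Formula: TTFC = date of first observed citation minus publish date, reported as a distribution (median, P75, P90), never as an average. The distribution is right-skewed (in the benchmark below, P90 is more than five times the median), so a handful of slow pages drags the mean well above what a typical page experiences.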

Benchmarks (Profound, ~900 newly published marketing pages, 60-day window, March to May 2026):

Percentile	Days to first ChatGPT/Claude citation
Median	6.81
P75	18.68
P90	37.10

How to use it: a first citation in under 7 days puts a page at or ahead of the median. If a page is still uncited past day 37, stop waiting and start debugging: check robots.txt, confirm AI crawlers can reach the page, and verify nothing at the CDN layer is blocking GPTBot or ClaudeBot. Past P90, the problem is almost always technical access; content quality is the wrong first suspect.
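To report TTFC as a distribution, something like the following works. It is a sketch under stated assumptions: the cohort dates are invented for illustration, the `percentile` helper uses plain linear interpolation (other percentile conventions give slightly different values), and the day-37 cutoff is simply the P90 benchmark rounded down.

```python
from datetime import date


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to summarize")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Hypothetical cohort: (publish date, first observed citation date or None).
cohort = [
    (date(2026, 3, 2), date(2026, 3, 5)),
    (date(2026, 3, 2), date(2026, 3, 9)),
    (date(2026, 3, 4), date(2026, 3, 16)),
    (date(2026, 3, 10), date(2026, 4, 14)),
    (date(2026, 3, 12), None),  # still uncited
]
today = date(2026, 5, 1)  # assumed reporting date

cited = [(first - pub).days for pub, first in cohort if first is not None]
# Uncited pages are reported separately; dropping them silently would make
# the distribution look faster than it is.
overdue = [pub for pub, first in cohort if first is None and (today - pub).days > 37]

print("median", percentile(cited, 50))
print("P75", percentile(cited, 75))
print("P90", round(percentile(cited, 90), 2))
print("uncited past day 37:", len(overdue))
```

Tracking the uncited count next to the percentiles matters because the percentiles only describe pages that were eventually cited.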

Co-citation Rate

Definition: of the AI answers that cite domain A or domain B, the share that cite both. ChatGPT validates answers through multiple sources, and the sources travel in vertical-specific pairs.

Formula: Co-citation Rate (A, B) = (answers citing both A and B ÷ answers citing A or B) × 100.

Profound research

ChatGPT co-cites domain pairs by vertical — Edmunds & KBB co-cited 32% of the time

Across 700,000+ U.S. English ChatGPT conversations (Oct–Dec 2025), Profound found that ~18% of conversations trigger a web search and cited sources cluster in vertical-specific pairs. Wikipedia anchors as the default knowledge layer.

~18% of ChatGPT conversations trigger at least one web search
Turn 1 is 2.5x more likely to cite than Turn 10 and 4x more likely than Turn 20
Wikipedia appears in ~1 in 6 cited conversations
Car directory co-citation: Edmunds & KBB 32%
Career co-citation: Glassdoor & Indeed 29%
Real estate: Redfin & Zillow 28%
Travel: Kayak & Expedia 21%
News: APNews & Reuters 15%
Sample = 700,000+ U.S. English ChatGPT conversations, Oct–Dec 2025

Profound · ChatGPT Validates Answers Through Multiple Sources · added 2026-05-14

Benchmarks (Profound, 700,000+ U.S. English ChatGPT conversations):

Vertical	Pair	Co-citation rate
Car research	edmunds.com + kbb.com	32%
Careers	glassdoor.com + indeed.com	29%
Real estate	redfin.com + zillow.com	28%
Travel	kayak.com + expedia.com	21%
News	apnews.com + reuters.com	15%

How to find your category’s pair: pull the citations for your top 50 category prompts and count which two domains appear together most often. That pair is the validation set ChatGPT uses for your vertical. Your strategic position is one of three: you are in the pair, you are the third source engines reach for, or you are invisible. Two related stats sharpen the picture: roughly 18% of ChatGPT conversations trigger a web search at all, and Turn 1 is 2.5x more likely to cite than Turn 10, so co-citation battles are won in the first exchange.
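The pair-finding step can be sketched as a counting exercise. The domains and answers below are placeholders, and the rate uses the formula from this section (answers citing both, divided by answers citing either); whether Profound computes its published rates exactly this way is an assumption on my part.

```python
from collections import Counter
from itertools import combinations


def top_pairs(answers, n=3):
    """Count how often each unordered domain pair is cited in the same answer."""
    pair_counts = Counter()
    for domains in answers:
        for pair in combinations(sorted(set(domains)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(n)


def co_citation_rate(answers, a, b):
    """Answers citing both a and b, as a percentage of answers citing a or b."""
    both = sum(1 for d in answers if a in d and b in d)
    either = sum(1 for d in answers if a in d or b in d)
    return 100.0 * both / either if either else 0.0


# Hypothetical cited-domain sets, one per AI answer.
answers = [
    {"alpha.example", "beta.example", "wiki.example"},
    {"alpha.example", "beta.example"},
    {"alpha.example", "gamma.example"},
    {"beta.example"},
    {"wiki.example"},
]

print(top_pairs(answers, n=1))
print(co_citation_rate(answers, "alpha.example", "beta.example"))
```

In this toy set the top pair appears together in 2 of the 4 answers that cite either domain, so the rate is 50%.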

Prompt Fanout Uniqueness

Definition: the share of retrieval queries an engine generates from a user prompt that are unique rather than restatements of the original wording. This is an engine-level property, and it determines how you should sample.

Formula: Fanout Uniqueness = (unique generated queries ÷ all generated queries) × 100, measured by running a fixed prompt set and logging the engine’s retrieval queries.
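One way to operationalize "unique rather than a restatement" is a token-overlap test. This is a rough sketch, and the rule is my assumption, not Profound's published method: a generated query counts as unique when its Jaccard token similarity with the original prompt falls below a threshold (0.5 here, chosen arbitrarily). The prompt and queries are invented.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def fanout_uniqueness(prompt, generated_queries, threshold=0.5):
    """Percentage of generated queries that are not restatements of the prompt.

    Assumption: "restatement" means Jaccard similarity >= threshold.
    """
    if not generated_queries:
        raise ValueError("no generated queries logged")
    unique = sum(1 for q in generated_queries if jaccard(prompt, q) < threshold)
    return 100.0 * unique / len(generated_queries)


prompt = "best crm for small business"
queries = [
    "best crm for small business",        # identical wording
    "best crm small business 2026",       # near restatement
    "crm pricing comparison",             # novel angle
    "customer support software reviews",  # novel angle
]
print(fanout_uniqueness(prompt, queries))  # 50.0
```

Whatever rule you pick, freeze it along with the prompt set; changing the threshold changes the metric.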

Benchmarks (Profound, 10,000 prompts over 2 weeks, March to April 2026):

Engine	Unique queries	Overlap with original prompt
ChatGPT	91%	13%
Copilot	47%	50%
Perplexity	14%	88%

Why it belongs on a metrics page: fanout uniqueness sets the variance of every other metric you measure on that engine. ChatGPT behaves like a researcher, generating mostly novel queries per run, which means run-to-run citation results swing more. Perplexity runs nearly 1:1 search-style queries, so its results are more repeatable at lower sample counts. The engine with the highest visibility upside is also the one that demands the most runs before you trust a number.

Inline Brand Hyperlink Share

Definition: the share of AI answers mentioning your brand that include a clickable inline link to your site. This metric barely existed before May 7, 2026, when ChatGPT shifted from citation chips to inline branded URLs.

Formula: Inline Hyperlink Share = (answers with a clickable link to your site ÷ answers mentioning your brand) × 100.

Benchmarks (Profound, 8M+ referral visits across thousands of sites):

Share of answers with a clickable brand URL: 4 to 5% before May 7, 22% after. Roughly a 5x jump.
OpenAI referral traffic: ~158K daily visits before, ~249K after, holding at about 1.6x.
Homepage share of OpenAI referrals: 3.5% to 24.2%. About 1 in 4 clicks now lands on a homepage, so homepages are suddenly an AEO surface again.
B2B SaaS referrals rose about 3x; ecommerce stayed flat.

How to use it: if you measured “mentions” before May 7 and “linked mentions” after, your trend line spans two different metrics. Reset the baseline at the change date and report the two eras separately.
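Splitting the baseline at the change date might look like this. It is a sketch with invented records; the field names are hypothetical, the denominator follows the definition above (answers that mention the brand), and treating May 7 itself as part of the "after" era is my assumption.

```python
from datetime import date

CHANGE_DATE = date(2026, 5, 7)


def inline_link_share(answers):
    """Linked mentions as a percentage of answers mentioning the brand."""
    mentioned = [a for a in answers if a["mentions_brand"]]
    if not mentioned:
        return None  # no mentions in this era, so the share is undefined
    linked = sum(1 for a in mentioned if a["has_brand_link"])
    return 100.0 * linked / len(mentioned)


def share_by_era(answers, change_date=CHANGE_DATE):
    """Report the two eras separately instead of drawing one trend line."""
    before = [a for a in answers if a["date"] < change_date]
    after = [a for a in answers if a["date"] >= change_date]
    return {"before": inline_link_share(before), "after": inline_link_share(after)}


# Hypothetical answer log.
answers = [
    {"date": date(2026, 5, 1), "mentions_brand": True, "has_brand_link": False},
    {"date": date(2026, 5, 3), "mentions_brand": True, "has_brand_link": True},
    {"date": date(2026, 5, 5), "mentions_brand": False, "has_brand_link": False},
    {"date": date(2026, 5, 8), "mentions_brand": True, "has_brand_link": True},
    {"date": date(2026, 5, 9), "mentions_brand": True, "has_brand_link": True},
    {"date": date(2026, 5, 10), "mentions_brand": True, "has_brand_link": False},
]
eras = share_by_era(answers)
print(eras)
```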

Shopping Trigger Rate

Definition: the share of prompt runs in your category that surface the ChatGPT Shopping carousel. Measure this first, because for most B2B categories the answer is “never,” and every downstream Shopping metric is moot.

Formula: Shopping Trigger Rate = (runs surfacing the Shopping carousel ÷ total runs) × 100, with each prompt run 10+ times.
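A small sketch of the formula with the 10-run floor enforced. The run data is invented, and `vs_baseline` assumes the "x baseline" multiples in the benchmark table are a category's trigger rate divided by an overall trigger rate; the source does not spell out how the baseline is computed.

```python
MIN_RUNS = 10  # floor taken from the formula above


def trigger_rate(run_results):
    """Percentage of runs that surfaced the carousel, or None if under-sampled.

    run_results: list of booleans, one per run of the same prompt.
    """
    if len(run_results) < MIN_RUNS:
        return None
    return 100.0 * sum(run_results) / len(run_results)


def vs_baseline(category_rate, baseline_rate):
    """Category trigger rate as a multiple of a baseline rate (assumed definition)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return category_rate / baseline_rate


# Hypothetical: 12 runs of one prompt, 3 of which showed the carousel.
runs = [True] * 3 + [False] * 9
rate = trigger_rate(runs)
print(rate)                      # 25.0
print(trigger_rate([True] * 5))  # None: too few runs to report
print(vs_baseline(rate, 5.0))    # 5.0
```

Returning `None` instead of a percentage for under-sampled prompts keeps a 1-of-2 result from showing up on a dashboard as "50%".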

Benchmarks (Profound, 7,500 unique open-ended prompts across 100.7M runs):

Category	Trigger rate vs baseline
Apparel / fashion	5.2x
Physical products	4.7x
Consumable grocery	2.5x
Health / medical	0.5x
Vehicles / equipment	0.17x
Travel / hospitality	0.05x
Professional services	0.04x
Software / SaaS	<0.01x
Financial products	0.0x

Category alone reproduces trigger behavior with 95 to 97% accuracy. Across roughly 2M prompts run 10+ times each, 79% never triggered Shopping in any run, only about 6% triggered reliably, and a triggering prompt had an 83% chance of triggering again the next day. If you sell software or financial products, take Shopping Trigger Rate off the dashboard and spend the measurement budget on Citation Share.

Rank Stability: the metric that audits all the others

Definition: the consistency of your domain’s citation rank across repeated runs of an identical prompt set. This is the metric arXiv 2603.08924 was written about, and it is the one most dashboards skip.

Sielinski’s method: repeated sampling across Perplexity, SearchGPT, and Gemini in two regimes (daily collections over nine days, plus 10-minute-interval sampling), then bootstrap confidence intervals around each domain’s citation share. Three findings matter for practitioners:

Citation distributions follow a power law. A few domains absorb most citations; the long tail is sparse, so tail estimates are inherently noisy.
Many domain-level differences fall inside the noise floor. Two domains with “different” visibility scores often have overlapping confidence intervals, meaning the difference is not measurable at that sample size.
Rank instability extends beyond the tail. Rankings wobble even among well-cited domains, so “we moved from #6 to #4 this week” needs an interval before it deserves a slide.

Profound’s larger datasets point the same direction. The 700K-conversation co-citation study and the 27M-prompt source-distribution analysis both show a heavily skewed distribution (Wikipedia alone appears in about 1 in 6 cited ChatGPT conversations, while Tier-1 publishers as a group take just 2.6% of citations across 27M prompts), which is exactly the power-law shape that makes small samples lie. And in the 22.5M-offer Shopping dataset, 95% of product titles appeared in under 30% of runs of the same prompt. Run-to-run instability is the norm, even at the top of the distribution.

The practical formula: report every citation-share number as an interval. Bootstrap it: resample your runs with replacement, recompute the share, take the 2.5th and 97.5th percentiles. If this week’s interval overlaps last week’s, you do not have a trend yet. You have two samples from the same distribution.
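The bootstrap described above fits in a few lines. Treat this as a minimal percentile-bootstrap sketch, not the paper's exact procedure: the per-run counts are invented, the run is assumed to be the resampling unit, and 2,000 resamples is an arbitrary but common choice.

```python
import random


def share(runs):
    """Citation share (%) pooled over runs of (domain_citations, total_citations)."""
    total = sum(t for _, t in runs)
    return 100.0 * sum(d for d, _ in runs) / total if total else 0.0


def bootstrap_ci(runs, n_resamples=2000, seed=0):
    """Percentile bootstrap: resample runs with replacement, recompute the share."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    estimates = sorted(
        share(rng.choices(runs, k=len(runs))) for _ in range(n_resamples)
    )
    low = estimates[int(0.025 * (n_resamples - 1))]
    high = estimates[int(0.975 * (n_resamples - 1))]
    return low, high


def intervals_overlap(a, b):
    """True when two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical weeks: (citations of my domain, all citations) for each run.
last_week = [(2, 410), (0, 395), (5, 402), (1, 388), (3, 420),
             (0, 399), (4, 405), (2, 391), (1, 412), (3, 407)]
this_week = [(3, 400), (1, 415), (6, 398), (2, 390), (2, 409),
             (1, 401), (5, 396), (3, 404), (0, 411), (4, 393)]

ci_last, ci_this = bootstrap_ci(last_week), bootstrap_ci(this_week)
print(f"last week {share(last_week):.2f}% CI {ci_last}")
print(f"this week {share(this_week):.2f}% CI {ci_this}")
print("trend claim allowed:", not intervals_overlap(ci_last, ci_this))
```

Resampling whole runs rather than individual citations keeps the run-level variance in the interval, which is the variance this section is about.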

Google model variability: Gemini vs AI Overviews vs AI Mode

Gemini, AI Overviews, and AI Mode share infrastructure but behave like distinct products. Profound’s analysis of 15,155 brand configurations tracked daily in May 2026 found brands saw a median 8-point visibility gap between their best and worst-performing Google model. The gap is not driven by mention volume (all three surface a near-identical 4.4 to 5.0 brands per response) but by which brands each model chooses to feature and which sources it cites to back them up. Gemini leans on editorial and reference sources like Reddit, YouTube, and Wikipedia, while AI Overviews and AI Mode cite far more heavily from social and UGC platforms, at roughly double Gemini’s citation depth per run. Practical implication: reporting a single “Google” citation share hides an 8-point spread. Split the metric by surface.
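Splitting by surface is mostly a grouping exercise. In this sketch "visibility" is assumed to mean the share of runs in which the brand appeared, which may not match Profound's scoring, and the run outcomes are invented.

```python
def visibility(runs):
    """Percentage of runs in which the brand appeared (assumed definition)."""
    if not runs:
        raise ValueError("no runs for this surface")
    return 100.0 * sum(runs) / len(runs)


# Hypothetical outcomes for one brand: True means the brand appeared in that run.
by_surface = {
    "Gemini": [True] * 7 + [False] * 3,
    "AI Overviews": [True] * 6 + [False] * 4,
    "AI Mode": [True] * 5 + [False] * 5,
}

scores = {surface: visibility(runs) for surface, runs in by_surface.items()}
gap = max(scores.values()) - min(scores.values())

for surface, score in scores.items():
    print(f"{surface}: {score:.0f}%")
print(f"best-to-worst gap: {gap:.0f} points")
```

A blended "Google" number for this toy brand would read 60%, which describes only one of the three surfaces.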

How many runs do I need?

The honest answer is “more than your dashboard defaults to.” Since the underlying distributions are power-law, there is no single magic N, but the published studies give you calibration points for what serious measurement looks like:

Metric	Sample behind the published benchmark	What that implies for your program
Citation Share	27M prompts (source distribution); 48,589 citations (one topic window)	Freeze the prompt set; report per engine with intervals
Time-to-First-Citation	~900 pages over 60 days	Track cohorts of pages, not single launches
Co-citation Rate	700K+ conversations	Use your top 50 category prompts as a floor for pair detection
Fanout Uniqueness	10K prompts over 2 weeks	Re-measure quarterly; engine behavior shifts with model updates
Inline Hyperlink Share	8M+ referral visits	Split baselines at the May 7, 2026 change date
Shopping Trigger Rate	~2M prompts at 10+ runs each	10 runs per prompt minimum before calling a trigger rate stable

Two rules fall out of this. First, repeated runs beat more prompts: a 50-prompt set run 10 times tells you more about stability than a 500-prompt set run once, because the variance lives at the run level. Second, refuse to promote a delta without significance. Profound’s own Markdown-vs-HTML A/B across 381 pages found a 16% mean lift in bot visits and reported it as not statistically significant. That is the standard. A 16% lift that does not clear the noise floor is a hypothesis, and most week-over-week “wins” on AEO dashboards are smaller than 16%.

💡 Takeaway: day-over-day visibility reporting on a small prompt set is the AEO equivalent of checking a poll of 40 people every morning. The number will move. The movement means nothing.
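The poll analogy can be made concrete with the textbook margin-of-error formula. This sketch assumes independent, simple random sampling and a normal approximation, which is a best case; heavy-tailed citation data will generally be noisier than this.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p observed in n samples."""
    return z * math.sqrt(p * (1 - p) / n)


for n in (40, 400, 4000):
    moe_points = 100 * margin_of_error(0.5, n)
    print(f"n={n}: +/- {moe_points:.1f} points")
```

At 40 samples a reading of 50% is compatible with anything from roughly 35% to 65%, so a morning-to-morning move of a few points carries no information.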

Which metrics belong on your dashboard

B2B SaaS: Citation Share (per engine), Inline Brand Hyperlink Share (your category gained about 3x referrals from the May 7 change), Time-to-First-Citation, Co-citation Rate. Skip Shopping entirely.
Ecommerce and retail: all of the above plus Shopping Trigger Rate and rank-1 buy-link share. In ChatGPT Shopping, Walmart holds 8.78% rank-1 buy-link share against Target’s 4.93%, while Target leads total offer presence at 7.16% vs 6.54%. Headline placement and overall presence are separate metrics; track both.
Publishers and marketplaces: Citation Share and Co-citation Rate first. The 27M-prompt finding that 97.4% of citations come from non-Tier-1 earned media is the whole opportunity: engines cite the long tail of credible sources far more than prestige mastheads.
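For the ecommerce pair of metrics, here is a sketch of how rank-1 buy-link share and total offer presence can diverge. The merchants and offers are placeholders, and the record layout (one dict per offer with a `rank` and a `merchant`) is an assumption about how you log Shopping results.

```python
from collections import Counter


def rank1_share(offers):
    """Each merchant's percentage of rank-1 offers."""
    top = [o["merchant"] for o in offers if o["rank"] == 1]
    return {m: 100.0 * c / len(top) for m, c in Counter(top).items()}


def presence_share(offers):
    """Each merchant's percentage of all offers, regardless of rank."""
    counts = Counter(o["merchant"] for o in offers)
    return {m: 100.0 * c / len(offers) for m, c in counts.items()}


# Hypothetical offers logged across three carousel runs.
offers = [
    {"rank": 1, "merchant": "shop-a"}, {"rank": 2, "merchant": "shop-b"},
    {"rank": 3, "merchant": "shop-b"}, {"rank": 1, "merchant": "shop-a"},
    {"rank": 2, "merchant": "shop-b"}, {"rank": 1, "merchant": "shop-b"},
    {"rank": 2, "merchant": "shop-b"}, {"rank": 3, "merchant": "shop-a"},
]

print(rank1_share(offers))
print(presence_share(offers))
```

Here shop-a leads rank-1 share while shop-b leads overall presence, the same kind of split the Walmart and Target figures show.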

For the tools that measure these metrics, I keep a current vendor rundown in the LLM tracking tools guide, and the evaluation criteria for choosing between them live in the AI visibility platform buyer’s guide. Profound also publishes the Profound Index, a category-level leaderboard for AI Search visibility that applies the sampling discipline this page argues for. A newer companion metric to track alongside Citation Share is factual accuracy: Profound’s FactCheck measures what AI engines get right and wrong about your brand at scale, and identifies which cited sources are driving the errors, so you can fix the upstream cause rather than the symptom. For ecommerce teams specifically, Profound’s ChatGPT Shopping deep dive analyzed over 1 million ChatGPT shopping offers and found product feeds increasingly outweigh PDPs as the source ChatGPT pulls from, which reshapes what an ecommerce AEO dashboard should even be measuring. And for the downstream half of the funnel, Profound’s AI mention effect analysis measures the browsing behavior that follows a brand mention, an additional metric worth pairing with Inline Brand Hyperlink Share once your citation-side numbers are stable. Profound’s behavioral study with Kevin Indig and Clickstream Solutions, The shortlist is the new shelf, watched 56 people run 221 real shopping tasks inside ChatGPT and found a strong link between how often a brand appears in ChatGPT’s answers and what people actually buy, which is the clearest evidence yet that Citation Share is a leading indicator of purchase, not just visibility.

The standard to hold every number to

Citation share is the new market share, but market share reported without a denominator, an engine breakdown, and a confidence interval is a vanity metric with extra steps. The arXiv paper gave the industry the statistical vocabulary; Profound’s 700K-conversation, 27M-prompt, and 100M-run datasets confirm the shape at production scale. Measurable, not aspirational, and measurable means sampled honestly.

Frequently Asked Questions

What is a good Share of Voice in AI search?

There is no universal benchmark, because the denominator changes with every prompt set. A 0.50% citation share of a 48,589-citation topic window (my own number for this topic) is a real measurement; “12% Share of Voice” with no denominator is marketing. The more useful target is positional: find your category’s co-citation pair (Edmunds + KBB run at 32% in cars), and work toward being the third domain engines reach for when validating answers in your vertical.

How often should I measure AI visibility metrics?

Measure continuously, report weekly or monthly, and never present a day-over-day delta from a small prompt set as a trend. Citation distributions follow a power law with heavy run-to-run variability: 95% of ChatGPT Shopping titles appeared in under 30% of runs of the same prompt. Profound’s shopping methodology used 10+ runs per prompt before treating rates as stable, and that is a reasonable floor for any per-prompt metric.

Which AI visibility metrics matter for B2B SaaS vs ecommerce?

B2B SaaS: Citation Share per engine, Inline Brand Hyperlink Share, Time-to-First-Citation, and Co-citation Rate. The May 7, 2026 ChatGPT change tripled B2B SaaS referrals, so the hyperlink metric now carries pipeline weight. Ecommerce: add Shopping Trigger Rate and rank-1 buy-link share, since the Shopping carousel is a separate surface with its own concentration dynamics. SaaS and financial brands should drop Shopping metrics: those categories trigger at under 0.01x and 0.0x baseline respectively.

Why do my AI visibility numbers change day to day?

Because every measurement is a sample, and the underlying distribution is power-law with substantial variance. arXiv 2603.08924 showed via bootstrap confidence intervals that many domain-level visibility differences sit inside the measurement noise floor, and that rank instability extends well beyond low-ranked domains. The fix is more runs of a frozen prompt set, intervals on every reported share, and the discipline to call an insignificant lift insignificant, the way Profound did with its own 16% Markdown finding.