Benchmarking Methodology

Constellation publishes efficiency metrics for every MCP tool it exposes to AI coding assistants. This page explains how those numbers are produced, what they represent, and the deliberate limits of what they can tell you.

What We Measure

Every benchmark answers the same question for a given tool:

How much work does an AI assistant do to accomplish a task with Constellation versus without it?

For each tool we record:

Metric	What it captures
Input tokens	Tokens the model reads from the prompt and tool responses
Output tokens	Tokens the model generates in the final answer
Cache read / creation tokens	Tokens served from (or written to) the provider's prompt cache
Total billable tokens	The combined token volume a provider would price the request against
Cost	Provider-reported cost for the request, in USD
Turn count	Number of model turns required to converge on an answer
Wall-clock duration	End-to-end latency of the operation

These are paired into a single per-tool delta: the difference between an AI assistant answering the same prompt with Constellation available and the same assistant answering it without.
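
As a rough illustration of that pairing, the sketch below subtracts one arm's metrics from the other's. The `RunMetrics` class, its field names, the sign convention (baseline minus Constellation), and all numbers are assumptions made for this example, not Constellation's actual schema or results.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunMetrics:
    """Metrics for one benchmark run (hypothetical field names)."""
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_creation_tokens: int
    cost_usd: float
    turns: int
    duration_s: float


def delta(baseline: RunMetrics, constellation: RunMetrics) -> dict[str, float]:
    """Baseline minus Constellation per metric; positive means Constellation used less."""
    return {
        f.name: getattr(baseline, f.name) - getattr(constellation, f.name)
        for f in fields(RunMetrics)
    }


# Made-up numbers, purely for illustration.
baseline = RunMetrics(12000, 900, 4000, 1500, 0.084, 9, 61.5)
with_constellation = RunMetrics(2500, 400, 1000, 500, 0.021, 3, 14.0)

for name, value in delta(baseline, with_constellation).items():
    print(f"{name}: {value}")
```

Keeping every metric in the delta, rather than collapsing to a single score, lets a reader see whether a saving comes from fewer tokens, fewer turns, or lower latency.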

How We Measure It

Benchmarks compare two arms under tightly controlled conditions:

Baseline: an AI coding assistant with no Constellation MCP server attached. It answers code-intelligence questions the conventional way: reading files, running searches, following imports.
Constellation: the same assistant, same model, same prompt, with the Constellation MCP server attached and permitted to query the code graph.

To keep the comparison honest, both arms start from an identical, empty conversational context. Project-specific guidance files, persistent memory, and any prior session state are neutralized before each run, so the only meaningful difference between the two arms is whether Constellation is available.

Each tool is exercised across multiple iterations per arm. Per-tool results are aggregated into a mean and standard deviation, which become the published efficiency metric for that tool. Calibration is re-run periodically and whenever a tool's behavior changes materially.
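
The aggregation step can be sketched as follows. This is a minimal example under stated assumptions: the sample values are invented, and the use of the sample (n-1) standard deviation is a choice made here, since the text above does not say which estimator is used.

```python
import statistics


def summarize(samples: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for one metric in one arm."""
    if len(samples) < 2:
        raise ValueError("need at least two iterations to estimate spread")
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical total-token counts from five iterations per arm.
baseline_tokens = [11800, 12400, 12100, 11950, 12250]
constellation_tokens = [2450, 2600, 2500, 2550, 2400]

for arm, samples in (("baseline", baseline_tokens),
                     ("constellation", constellation_tokens)):
    mean, stdev = summarize(samples)
    print(f"{arm}: mean={mean:.0f} stdev={stdev:.1f}")
```

Reporting the spread alongside the mean matters because model output is non-deterministic: two runs of the same prompt in the same arm rarely consume identical token counts.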

Why We Call These "Estimates"

The numbers Constellation publishes are precise as measurements, but they describe a moving target. Three factors keep us from calling them anything stronger than an estimate.

Tokenizer Variance

Each inference provider tokenizes input differently. A response that costs 1,200 tokens against Anthropic's tokenizer might be 1,100 against OpenAI's and 1,400 against Google's. Even within a single provider, tokenizer changes between model generations shift the numbers. Our calibration runs against a specific model on a specific provider; the absolute token counts are accurate for that target and approximate everywhere else.

Pricing Variance

Token cost is not one number; it is at least four:

Input tokens (uncached)
Output tokens
Cache read tokens
Cache creation tokens

Each is priced independently. Each varies by model. Each is adjusted by providers over time. A dollar figure attached to a benchmark reflects the pricing in effect at the time of calibration, against the model used for calibration. Real-world cost will diverge as pricing changes and as users select different models.
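
To make the effect concrete, here is a small sketch that prices the same token usage against two price sheets. The `PriceSheet` class, the per-million-token rates, and the usage figures are placeholders invented for this example; they are not any provider's real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """USD per million tokens for each token class (placeholder values only)."""
    input_uncached: float
    output: float
    cache_read: float
    cache_creation: float


def request_cost(usage: dict[str, int], prices: PriceSheet) -> float:
    """Price each token class independently, then sum to a USD total."""
    total_per_million = (
        usage["input_uncached"] * prices.input_uncached
        + usage["output"] * prices.output
        + usage["cache_read"] * prices.cache_read
        + usage["cache_creation"] * prices.cache_creation
    )
    return total_per_million / 1_000_000


# Identical usage, two hypothetical price sheets.
usage = {"input_uncached": 2000, "output": 500,
         "cache_read": 10000, "cache_creation": 1000}
at_calibration = PriceSheet(2.0, 10.0, 0.2, 2.5)
after_price_change = PriceSheet(1.0, 8.0, 0.1, 1.25)

print(f"at calibration:     ${request_cost(usage, at_calibration):.5f}")
print(f"after price change: ${request_cost(usage, after_price_change):.5f}")
```

The token counts never change between the two calls, yet the dollar figure does, which is why a published cost is tied to the pricing and model in effect at calibration time.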

Workload Variance

A benchmark is a representative scenario, not a forecast. Real prompts vary in length, ambiguity, and follow-up depth. Real codebases vary in size, language mix, and graph density. Real conversations carry prior context that changes how the model approaches the next task. We calibrate against carefully chosen representative prompts on a fixed project so that the comparison between arms is fair, but the absolute savings any user will realize depend on their configuration and workload.

Why The Estimates Still Matter

Even with all three sources of variance, the ratio between the two arms is far more stable than either arm in isolation. If Constellation cuts token consumption by an order of magnitude on the calibration workload, that ratio holds up across providers and models even when the absolute token counts do not.
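
The intuition can be sketched in a few lines. The example assumes, as a simplification, that a different tokenizer scales both arms by roughly the same factor; the scale factors reuse the 1,200 / 1,100 / 1,400 illustration from the Tokenizer Variance section, and the calibrated token counts are invented. Real tokenizers will not scale both arms perfectly uniformly, so treat the exact equality here as an idealization.

```python
def reduction_ratio(baseline_tokens: float, constellation_tokens: float) -> float:
    """How many times more tokens the baseline arm used."""
    if constellation_tokens <= 0:
        raise ValueError("constellation_tokens must be positive")
    return baseline_tokens / constellation_tokens


# Hypothetical calibrated counts: (baseline, constellation).
CALIBRATED = (12_000.0, 1_200.0)

# Assumed uniform scale factors relative to the calibration tokenizer.
TOKENIZER_SCALES = {
    "calibration target": 1.0,
    "tokenizer B": 1100 / 1200,
    "tokenizer C": 1400 / 1200,
}


def ratios_by_tokenizer() -> dict[str, float]:
    """Rescale both arms by each factor and recompute the ratio."""
    base, const = CALIBRATED
    return {
        name: reduction_ratio(base * scale, const * scale)
        for name, scale in TOKENIZER_SCALES.items()
    }


for name, ratio in ratios_by_tokenizer().items():
    print(f"{name}: {ratio:.2f}x")
```

Because the scale factor appears in both the numerator and the denominator, it cancels: the absolute counts move with the tokenizer, while the ratio between arms stays put.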

Significance Of The Numbers

Constellation's published metrics are designed to demonstrate relative efficiency: the difference between operating with Constellation and operating without it.

You can use the numbers to…	You should not use the numbers to…
Compare AI-assisted workflows with and without Constellation	Predict your exact monthly AI spend
Estimate savings on common code-intelligence tasks	Compare costs between two different LLM providers
Justify Constellation in a build-vs-buy or efficiency conversation	Treat the per-token figures as authoritative for your tokenizer
Track Constellation's own efficiency improvements over time	Substitute for your own provider's billing data

The metrics exist to confidently answer the question: Does adding Constellation to an AI coding agent's toolkit produce a meaningful, measurable improvement in performance? For that question, a carefully constructed comparative benchmark is the right instrument, and labeling its outputs as estimates is the honest way to present them.