Benchmarking Methodology
Constellation publishes efficiency metrics for every MCP tool it exposes to AI coding assistants. This page explains how those numbers are produced, what they represent, and the deliberate limits of what they can tell you.
What We Measure
Every benchmark answers the same question for a given tool:
How much work does an AI assistant do to accomplish a task with Constellation versus without it?
For each tool we record:
| Metric | What it captures |
|---|---|
| Input tokens | Tokens the model reads from the prompt and tool responses |
| Output tokens | Tokens the model generates in the final answer |
| Cache read / creation tokens | Tokens served from (or written to) the provider's prompt cache |
| Total billable tokens | The combined token volume a provider would price the request against |
| Cost | Provider-reported cost for the request, in USD |
| Turn count | Number of model turns required to converge on an answer |
| Wall-clock duration | End-to-end latency of the operation |
These are paired into a single per-tool delta: the difference between an AI assistant answering the same prompt with Constellation available and the same assistant answering it without.
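As a minimal sketch of that pairing, the per-tool delta can be thought of as a field-by-field subtraction of the two arms. The `RunMetrics` type, its field names, and every number below are illustrative assumptions, not Constellation's actual schema or published results.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunMetrics:
    """One arm's measurements for a single prompt (hypothetical field names)."""
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_creation_tokens: int
    cost_usd: float
    turns: int
    duration_s: float


def per_tool_delta(baseline: RunMetrics, constellation: RunMetrics) -> dict[str, float]:
    """Return baseline minus Constellation for every metric; positive means savings."""
    return {
        f.name: getattr(baseline, f.name) - getattr(constellation, f.name)
        for f in fields(RunMetrics)
    }


# Made-up numbers, for illustration only.
baseline = RunMetrics(48_000, 1_500, 30_000, 6_000, 0.42, 9, 75.0)
constellation = RunMetrics(5_200, 600, 3_000, 1_000, 0.05, 2, 11.5)
delta = per_tool_delta(baseline, constellation)
print(delta["input_tokens"], delta["turns"])
```

Subtracting in the baseline-minus-Constellation direction keeps every positive value readable as "work saved".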
How We Measure It
Benchmarks compare two arms under tightly controlled conditions:
- Baseline: an AI coding assistant with no Constellation MCP server attached. It answers code-intelligence questions the conventional way: reading files, running searches, following imports.
- Constellation: the same assistant, same model, same prompt, with the Constellation MCP server attached and permitted to query the code graph.
To keep the comparison honest, both arms start from an identical, empty conversational context. Project-specific guidance files, persistent memory, and any prior session state are neutralized before each run, so the only meaningful difference between the two arms is whether Constellation is available.
Each tool is exercised across multiple iterations per arm. Per-tool results are aggregated into a mean and standard deviation, which become the published efficiency metric for that tool. Calibration is re-run periodically and whenever a tool's behavior changes materially.
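The aggregation step might look roughly like the sketch below. It assumes the sample standard deviation (`statistics.stdev`) rather than the population form, which this page does not specify, and the iteration counts and token values are hypothetical.

```python
from statistics import mean, stdev


def summarize(samples: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for one metric across iterations."""
    if len(samples) < 2:
        raise ValueError("need at least two iterations to compute a standard deviation")
    return mean(samples), stdev(samples)


# Hypothetical total-token counts from five iterations of each arm.
baseline_runs = [50_000, 52_000, 48_000, 51_000, 49_000]
constellation_runs = [5_000, 5_200, 4_800, 5_100, 4_900]

baseline_mean, baseline_sd = summarize(baseline_runs)
constellation_mean, constellation_sd = summarize(constellation_runs)
print(f"baseline:      {baseline_mean:.0f} +/- {baseline_sd:.0f}")
print(f"constellation: {constellation_mean:.0f} +/- {constellation_sd:.0f}")
```

Reporting the standard deviation alongside the mean shows how much run-to-run noise sits behind each published figure.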
Why We Call These "Estimates"
The numbers Constellation publishes are precise as measurements, but they describe a moving target. Three factors keep us from calling them anything stronger than an estimate.
Tokenizer Variance
Each inference provider tokenizes input differently. A response that costs 1,200 tokens against Anthropic's tokenizer might be 1,100 against OpenAI's and 1,400 against Google's. Even within a single provider, tokenizer changes between model generations shift the numbers. Our calibration runs against a specific model on a specific provider; the absolute token counts are accurate for that target and approximate everywhere else.
Pricing Variance
Token cost is not one number; it is at least four:
- Input tokens (uncached)
- Output tokens
- Cache read tokens
- Cache creation tokens
Each is priced independently. Each varies by model. Each is adjusted by providers over time. A dollar figure attached to a benchmark reflects the pricing in effect at the time of calibration, against the model used for calibration. Real-world cost will diverge as pricing changes and as users select different models.
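To make the four-way split concrete, here is a small sketch that prices the same token usage under two price sheets. The `Pricing` type and every rate in it are placeholders chosen for illustration; they are not any provider's real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. All values used below are placeholders."""
    input_uncached: float
    output: float
    cache_read: float
    cache_creation: float


def request_cost(input_uncached: int, output: int, cache_read: int,
                 cache_creation: int, pricing: Pricing) -> float:
    """Price each token category independently and sum the results."""
    tokens_per_million = 1_000_000
    return (
        input_uncached * pricing.input_uncached
        + output * pricing.output
        + cache_read * pricing.cache_read
        + cache_creation * pricing.cache_creation
    ) / tokens_per_million


# Identical usage, two hypothetical price sheets.
usage = dict(input_uncached=10_000, output=2_000, cache_read=40_000, cache_creation=5_000)
at_calibration = Pricing(input_uncached=3.0, output=15.0, cache_read=0.3, cache_creation=3.75)
after_change = Pricing(input_uncached=2.0, output=10.0, cache_read=0.2, cache_creation=2.5)

print(f"{request_cost(**usage, pricing=at_calibration):.4f} USD")
print(f"{request_cost(**usage, pricing=after_change):.4f} USD")
```

The token counts never change between the two calls, yet the dollar figure does, which is why a published cost is tied to the pricing in effect at calibration time.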
Workload Variance
A benchmark is a representative scenario, not a forecast. Real prompts vary in length, ambiguity, and follow-up depth. Real codebases vary in size, language mix, and graph density. Real conversations carry prior context that changes how the model approaches the next task. We calibrate against carefully chosen representative prompts on a fixed project so that the comparison between arms is fair, but the absolute savings any user will realize depend on their configuration and workload.
Even with all three sources of variance, the ratio between the two arms is far more stable than either arm in isolation. If Constellation cuts token consumption by an order of magnitude on the calibration workload, that ratio tends to hold up across providers and models even when the absolute token counts do not.
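The intuition can be sketched with a simplifying assumption: that switching tokenizers scales both arms' token counts by roughly the same factor. Real tokenizers will not scale both arms identically, so treat this as an idealized illustration with hypothetical numbers, not a measurement.

```python
def savings_ratio(baseline_tokens: float, constellation_tokens: float) -> float:
    """How many times more tokens the baseline arm uses than the Constellation arm."""
    return baseline_tokens / constellation_tokens


# Hypothetical calibration counts and hypothetical tokenizer scale factors.
baseline_tokens, constellation_tokens = 60_000, 6_000
for scale in (1.0, 0.9, 1.2):
    b = baseline_tokens * scale
    c = constellation_tokens * scale
    print(f"scale={scale}: baseline={b:.0f}, constellation={c:.0f}, "
          f"ratio={savings_ratio(b, c):.1f}")
```

The absolute counts move with each scale factor while the ratio stays put, because the factor cancels in the division.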
Significance Of The Numbers
Constellation's published metrics are designed to demonstrate relative efficiency: the difference between operating with Constellation and operating without it.
| You can use the numbers to… | You should not use the numbers to… |
|---|---|
| Compare AI-assisted workflows with and without Constellation | Predict your exact monthly AI spend |
| Estimate savings on common code-intelligence tasks | Compare costs between two different LLM providers |
| Justify Constellation in a build-vs-buy or efficiency conversation | Treat the per-token figures as authoritative for your tokenizer |
| Track Constellation's own efficiency improvements over time | Substitute for your own provider's billing data |
The metrics exist to confidently answer the question: Does adding Constellation to an AI coding agent's toolkit produce a meaningful, measurable improvement in performance? For that question, a carefully constructed comparative benchmark is the right instrument, and labeling its outputs as estimates is the honest way to present them.
See Also
- What is Constellation? — Why structural code intelligence reduces AI workload
- MCP Tools — The full catalog of tools whose efficiency these benchmarks measure