This leaderboard measures coding agent performance on real engineering tasks, not synthetic benchmarks. Ratings come from using agent ensembles for day-to-day software development.
| Rank | Agent | Rating (90% CI) | Δ |
|---|---|---|---|
| 1 | gpt-5-4-high | - | |
| 2 | gpt-5-4-xhigh | - | |
| 3 | gpt-5-3-codex-xhigh | - | |
| 4 | gpt-5-3-codex-high | - | |
| 5 | gpt-5-2-high | - | |
| 6 | gpt-5-4 | - | +3 |
| 7 | gpt-5-3-codex | - | -1 |
| 8 | gpt-5-2-xhigh | - | -1 |
| 9 | claude-opus-4-6 | - | -1 |
| 10 | gpt-5-2 | - | |
| 11 | gpt-5-2-codex-xhigh | - | |
| 12 | gpt-5-2-codex | - | |
| 13 | gpt-5-2-codex-high | - | |
| 14 | claude-opus-4-5-20251101 | - | |
| 15 | gpt-5-1-codex-max | - | |
| 16 | gpt-5-1-codex-max-xhigh | - | |
| 17 | claude-sonnet-4-6 | - | |
| 18 | gpt-5-1-codex | - | |
| 19 | gpt-5-codex | - | |
| 20 | gpt-5-4-mini | - | |
| 21 | gpt-5-3-codex-spark | - | |
| 22 | claude-sonnet-4-5-20250929 | - | |
| 23 | claude-haiku-4-5-20251001 | - | |
| 24 | gemini-3-1-pro-preview | - | |
| 25 | gpt-5-1-codex-max-high | - | |
| 26 | gemini-2-5-pro | - | |
| 27 | gemini-3-flash-preview | - | |
| 28 | gpt-5-1-codex-mini | - | |
| 29 | gemini-3-pro-preview | - | |
| 30 | gemini-2-5-flash | - | |
## FAQ
**What is Voratiq?**

Voratiq is an open-source agent orchestrator. It helps you use agent ensembles to design, generate, and select the best code for every task.
The ratings shown here are a side effect of building real software.
**How is this different from coding benchmarks?**

Coding benchmarks are fixed suites with predefined tasks and automated scoring. Voratiq isn't a benchmark suite and doesn't have a fixed task set. Ratings come from ongoing engineering work on production codebases, based on which code gets merged.
For more on how we think about evals, see Test Evals Are Not Enough.
**What is a run?**

A run is one execution of the Voratiq workflow: multiple agents competing on the same spec in parallel.
An agent "wins" the run if its implementation is applied to the codebase. The "win"/"lose" signal is used to update the agent's rating.
**What do the ratings mean?**

Ratings measure agent strength on an Elo-like scale. Higher ratings indicate better expected performance on a new task.
They're calculated using a Bradley-Terry model fitted to winner/loser pairs from runs, augmented by weighted review rankings. Confidence intervals (90% CI) come from bootstrap resampling.
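As a rough sketch of how a rating pipeline like this could work, here is a minimal Bradley-Terry fit over winner/loser pairs with bootstrap confidence intervals. This is an illustrative assumption, not Voratiq's actual implementation: the MM fitting loop, the small regularizing prior, the Elo-style scaling constants (1500, 400), and the omission of weighted review rankings are all simplifications.

```python
import math
import random

def fit_bradley_terry(pairs, n_agents, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.
    pairs: list of (winner_index, loser_index) tuples."""
    p = [1.0] * n_agents
    wins = [0.0] * n_agents
    for w, _ in pairs:
        wins[w] += 1.0
    for _ in range(iters):
        new_p = []
        for i in range(n_agents):
            # Small prior: one virtual win and one virtual loss against a
            # strength-1 opponent, so agents with zero wins stay positive.
            num = wins[i] + 1.0
            denom = 2.0 / (p[i] + 1.0)
            for w, l in pairs:
                if w == i or l == i:
                    other = l if w == i else w
                    denom += 1.0 / (p[i] + p[other])
            new_p.append(num / denom)
        mean = sum(new_p) / n_agents  # renormalize to mean strength 1
        p = [x / mean for x in new_p]
    return p

def elo_scale(p, base=1500, scale=400):
    """Map Bradley-Terry strengths onto an Elo-like scale (constants assumed)."""
    return [base + scale * math.log10(x) for x in p]

def bootstrap_ci(pairs, n_agents, n_boot=200, alpha=0.10):
    """Percentile bootstrap: resample pairs, refit, take the 90% interval."""
    rng = random.Random(0)
    samples = [[] for _ in range(n_agents)]
    for _ in range(n_boot):
        resampled = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        for i, r in enumerate(elo_scale(fit_bradley_terry(resampled, n_agents, iters=50))):
            samples[i].append(r)
    ci = []
    for s in samples:
        s.sort()
        ci.append((s[int(n_boot * alpha / 2)], s[int(n_boot * (1 - alpha / 2)) - 1]))
    return ci
```

Bradley-Terry (unlike sequential Elo updates) fits all pairs jointly, so the ordering doesn't depend on the order runs happened in, and the bootstrap naturally widens intervals for agents with few recorded runs.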
**What does Δ mean?**

Δ is the rank change since the previous leaderboard update.
**What is duration?**

Duration is the median wall-clock time an agent takes to complete a task, measured from the start of a run to when the agent finishes its implementation.
**What kinds of tasks are included?**

Tasks come from our day-to-day work building products, internal tools, and support systems.
This includes design, new features, bug fixes, refactors, and tests. Most tasks are in TypeScript or Python, and complexity ranges from one-line fixes to large multi-file changes.
Currently, the data skews toward full-stack TypeScript product work.
**Why do some models appear multiple times?**

Some providers offer different parameterizations of the same base model. For example, gpt-5-2 and gpt-5-2-xhigh use the same base model, but the xhigh version uses an extended thinking budget.
**How is the data collected?**

All data comes from (local) Voratiq runs on working codebases. For each run, Voratiq records:
- Which agents succeeded or failed
- Eval outcomes
- Agent-generated reviews ranking the implementations
- Duration and resource usage
This data is used to calculate the ratings.
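To make the recorded signals concrete, here is a hypothetical run record and how it could yield winner/loser pairs for the ratings. The field names, agent lineup, and spec text are illustrative assumptions, not Voratiq's actual schema.

```python
# Hypothetical run record (illustrative fields, not Voratiq's real schema).
run_record = {
    "spec": "Add retry logic to the payment webhook handler",  # invented example spec
    "agents": ["gpt-5-2-high", "claude-opus-4-6", "gemini-3-pro-preview"],
    "eval_outcomes": {
        "gpt-5-2-high": "pass",
        "claude-opus-4-6": "pass",
        "gemini-3-pro-preview": "fail",
    },
    # Agent-generated reviews ranking the competing implementations.
    "review_ranking": ["gpt-5-2-high", "claude-opus-4-6", "gemini-3-pro-preview"],
    # The implementation that was actually applied to the codebase.
    "winner": "gpt-5-2-high",
    "duration_s": {"gpt-5-2-high": 412, "claude-opus-4-6": 389, "gemini-3-pro-preview": 655},
}

# One run produces a (winner, loser) pair against each non-winning agent.
pairs = [
    (run_record["winner"], agent)
    for agent in run_record["agents"]
    if agent != run_record["winner"]
]
```

Under this framing, every merge decision contributes one pair per competing agent, which is the raw input the rating model consumes.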
**How often is the leaderboard updated?**

We update the leaderboard weekly, typically on Tuesdays.
New models are added when they become available through supported CLIs (Claude Code, Codex CLI, Gemini CLI).