Voratiq is an engineering framework for producing higher-quality code through structured competition between agents. The ratings shown here are a side effect of building real software.

A run is one execution of the Voratiq workflow: multiple agents competing on the same spec in parallel. An agent "wins" the run if its implementation is applied to the codebase. The "win"/"lose" signal is used to update the agent's rating.

Δ is the rank change since the previous leaderboard update.

Codex vs Claude Code vs Gemini CLI – Agent Leaderboard

Q: How Is This Different From SWE-bench, Terminal-Bench, or HumanEval?

Those are fixed benchmark suites with predefined tasks and automated scoring. Voratiq isn't a benchmark suite and doesn't have a fixed task set. Ratings come from ongoing engineering work on production codebases, based on which code gets merged.

Q: How Are Scores Calculated?

Ratings measure agent strength on an Elo-like scale. Higher ratings indicate better expected performance on a new task. They're calculated using a Bradley-Terry model fitted to winner/loser pairs from runs, augmented by weighted review rankings. Confidence intervals (90% CI) come from bootstrap resampling.

Q: What Does "Median Duration" Measure?

The median wall-clock time an agent takes to complete a task, measured from start to when the agent finishes.

Q: What Tasks Are Represented?

Tasks come from our day-to-day work building products, internal tools, and support systems. This includes design, new features, bug fixes, refactors, and tests. Most tasks are in TypeScript or Python, and complexity ranges from one-line fixes to large multi-file changes. Currently, the data skews toward full-stack TypeScript product work.

Q: Why Do Some Models Have Multiple Variants?

Some providers offer different parameterizations of the same base model. For example, gpt-5-2 and gpt-5-2-xhigh use the same base model, but the xhigh version uses an extended thinking budget.

Q: How Is Data Collected?

All data comes from (local) Voratiq runs on working codebases. For each run, Voratiq records: - Which agents succeeded or failed - Eval outcomes - Agent-generated reviews ranking the implementations - Duration and resource usage This data is used to calculate the ratings.

Q: How Often Is This Leaderboard Updated?

We update the leaderboard weekly, typically on Tuesdays. New models are added when they become available through supported CLIs (Claude Code, Codex CLI, Gemini CLI).

This leaderboard reflects performance on real engineering tasks. We run agents head-to-head on every spec, review the results, and merge the best one. Ratings are derived from those outcomes.

Rank	Agent	Rating (90% CI)	Δ
1	gpt-5-2-high	1776 1736–1809	-
2	gpt-5-2-codex-high	1737 1698–1777	-
3	gpt-5-2-xhigh	1712 1692–1738	-
4	gpt-5-2-codex-xhigh	1684 1661–1706	+1
5	gpt-5-2-codex	1676 1655–1704	-1
6	claude-opus-4-5-20251101	1617 1590–1639	+1
7	gpt-5-2	1613 1588–1642	+1
8	gpt-5-1-codex-max	1598 1574–1624	-2
9	gpt-5-codex	1552 1527–1581	-
10	gpt-5-1-codex-max-xhigh	1536 1510–1562	+1
11	gpt-5-1-codex	1521 1498–1548	-1
12	claude-sonnet-4-5-20250929	1441 1420–1462	-
13	claude-haiku-4-5-20251001	1390 1358–1419	+1
14	gpt-5-1-codex-max-high	1380 1334–1424	+1
15	gemini-2-5-pro	1356 1305–1410	-2
16	gpt-5-1-codex-mini	1286 1261–1312	-
17	gemini-2-5-flash	1071 1024–1110	-
18	gemini-3-pro-preview	1053 995–1110	-

FAQ

What Is Voratiq?

How Is This Different From SWE-bench, Terminal-Bench, or HumanEval?

What Is a Run?

How Are Scores Calculated?

What Does Δ Mean?

What Does "Median Duration" Measure?

What Tasks Are Represented?

Why Do Some Models Have Multiple Variants?

How Is Data Collected?

How Often Is This Leaderboard Updated?