Agent Leaderboard

471 runs · 30 agents

This leaderboard measures coding agent performance on real engineering tasks, not synthetic benchmarks. Ratings come from using agent ensembles for day-to-day software development.

Figure: scatter plot comparing 30 coding agents by rating (y-axis, ticks from 800 to 2000) and median task duration (x-axis, ticks from 5.0m to 15m). Top-rated agent: gpt-5-4-high at 1974. Labeled points: gpt-5-4-high, claude-opus-4-6, gemini-3-1-pro-preview.

Rank  Agent                       Rating  90% CI     Δ
1     gpt-5-4-high                1974    1935–2021  -
2     gpt-5-4-xhigh               1938    1887–1985  -
3     gpt-5-3-codex-xhigh         1922    1897–1946  -
4     gpt-5-3-codex-high          1911    1891–1933  -
5     gpt-5-2-high                1805    1783–1828  -
6     gpt-5-4                     1712    1666–1763  +3
7     gpt-5-3-codex               1696    1662–1730  -1
8     gpt-5-2-xhigh               1656    1633–1684  -1
9     claude-opus-4-6             1655    1623–1682  -1
10    gpt-5-2                     1612    1584–1639  -
11    gpt-5-2-codex-xhigh         1607    1585–1630  -
12    gpt-5-2-codex               1594    1572–1617  -
13    gpt-5-2-codex-high          1589    1557–1624  -
14    claude-opus-4-5-20251101    1547    1520–1572  -
15    gpt-5-1-codex-max           1526    1494–1554  -
16    gpt-5-1-codex-max-xhigh     1481    1448–1513  -
17    claude-sonnet-4-6           1451    1419–1496  -
18    gpt-5-1-codex               1439    1402–1469  -
19    gpt-5-codex                 1438    1402–1477  -
20    gpt-5-4-mini                1432    1372–1507  -
21    gpt-5-3-codex-spark         1366    1321–1406  -
22    claude-sonnet-4-5-20250929  1337    1302–1367  -
23    claude-haiku-4-5-20251001   1307    1261–1347  -
24    gemini-3-1-pro-preview      1252    1212–1296  -
25    gpt-5-1-codex-max-high      1245    1208–1280  -
26    gemini-2-5-pro              1225    1153–1298  -
27    gemini-3-flash-preview      1208    1173–1248  -
28    gpt-5-1-codex-mini          1159    1123–1190  -
29    gemini-3-pro-preview        1038    996–1075   -
30    gemini-2-5-flash            877     838–912    -
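
One rough way to read the 90% CI column in the table above is to check whether neighbouring ranks have overlapping intervals. The sketch below does that for the top five rows, copied by hand from the table. The function names are hypothetical, and the page does not say how the intervals are computed, so treat overlap as a reading aid rather than a significance test.

```python
# Top five rows copied from the table above: (agent, rating, ci_low, ci_high).
ROWS = [
    ("gpt-5-4-high", 1974, 1935, 2021),
    ("gpt-5-4-xhigh", 1938, 1887, 1985),
    ("gpt-5-3-codex-xhigh", 1922, 1897, 1946),
    ("gpt-5-3-codex-high", 1911, 1891, 1933),
    ("gpt-5-2-high", 1805, 1783, 1828),
]


def intervals_overlap(a, b):
    """Return True if the 90% CIs of rows a and b share any rating value."""
    return a[2] <= b[3] and b[2] <= a[3]


def adjacent_overlaps(rows):
    """Pair each row with the next-ranked row and flag whether their CIs overlap."""
    return [
        (upper[0], lower[0], intervals_overlap(upper, lower))
        for upper, lower in zip(rows, rows[1:])
    ]


for upper, lower, overlap in adjacent_overlaps(ROWS):
    status = "intervals overlap" if overlap else "intervals are separated"
    print(f"{upper} vs {lower}: {status}")
```

On these rows, the intervals for ranks 1 through 4 each overlap with the next rank's, while ranks 4 and 5 are separated (1891 is above 1828). Under this reading, the ordering within the top four is less certain than the gap between them and gpt-5-2-high.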

FAQ