Voratiq

Selection Rather Than Prediction

Coding agents are getting quite good, and the question everyone asks is: which one should I use?

However, agent performance varies considerably by language, task type, and time. When you commit to a single agent, you're predicting it will be best for whatever task you throw at it.

That bet might be informed by evals, experience, or word of mouth. But the variance is high enough that you'll often be wrong.

Selection sidesteps the prediction problem. Generate many candidate implementations and choose the best from the pool: instead of guessing in advance which agent will do best, you only have to recognize the best output after the fact. This converts a prediction problem into an optimization problem.

So, we think the question to ask instead is: how many agents should I use, and which ones?

This is often called "best-of-N": run N parallel attempts (here, across different models), then select the best output.

Agents Compete, Humans Arbitrate

We've been running this workflow for a few months now. Here's what it looks like:

[Figure: flowchart of the "Agents Compete, Humans Arbitrate" workflow.]

We write a spec for the task and fan it out to multiple agents in parallel. Each agent works in its own isolated worktree and runs the repo's evals. A human reviewer then looks at the diffs, picks the best implementation, and applies that patch. The agent whose diff gets applied is the winner.

This is best-of-N with a human judge.
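Sketched in code, the loop looks roughly like this. Everything on the `Tools` interface is a hypothetical stand-in for whatever agent, worktree, eval, and patch tooling you already have:

```typescript
// Sketch of the fan-out-and-arbitrate loop. All helpers on Tools are
// hypothetical stand-ins, not a real API.

type Attempt = { agent: string; diff: string; evalsPassed: boolean };

interface Tools {
  runAgentInWorktree(agent: string, spec: string): Promise<string>; // returns a diff
  runEvals(agent: string): Promise<boolean>;
  humanPick(attempts: Attempt[]): Promise<Attempt>;
  applyPatch(diff: string): Promise<void>;
  recordWinner(spec: string, winner: string, participants: string[]): Promise<void>;
}

async function runCompetition(
  spec: string,
  agents: string[],
  tools: Tools,
): Promise<string> {
  // Fan the same spec out to every agent, each in its own isolated worktree.
  const attempts = await Promise.all(
    agents.map(async (agent) => ({
      agent,
      diff: await tools.runAgentInWorktree(agent, spec),
      evalsPassed: await tools.runEvals(agent),
    })),
  );

  // A human reviews the diffs (and eval results), picks the best
  // implementation, and we apply that patch.
  const winner = await tools.humanPick(attempts);
  await tools.applyPatch(winner.diff);

  // Whoever's diff got applied is the winner; record it as an eval signal.
  await tools.recordWinner(spec, winner.agent, agents);
  return winner.agent;
}
```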

This turns everyday work into a useful eval signal: which agent, given a real task in a real codebase, produced the code we actually merged?

The data here comes from real day-to-day work (rather than a benchmark) and spans 211 tasks across 18 agents. We track ongoing results on our leaderboard. The plots here are a snapshot.

Most tasks are full-stack TypeScript product work, usually atomically scoped features, bug fixes, or refactors that take minutes to about an hour.

Rankings Are Noisy

Once each run has a winner, we can treat it as a multi-way match and fit ratings from the outcomes.

We fit a Bradley-Terry model to the winner/loser pairs and map strengths to an Elo-style rating.
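Here's a minimal sketch of that fit, assuming each run has already been flattened into (winner, loser) pairs. The MM-style update and the Elo mapping are standard; the function names, pseudo-count, and constants are our own choices for illustration, not a reference implementation:

```typescript
// Minimal Bradley-Terry fit over (winner, loser) pairs, with the fitted
// strengths mapped onto an Elo-style scale. Sketch only.

type Pair = { winner: string; loser: string };

function fitRatings(pairs: Pair[], iterations = 200): Map<string, number> {
  const agents = [...new Set(pairs.flatMap((p) => [p.winner, p.loser]))];
  const key = (a: string, b: string) => (a < b ? `${a}|${b}` : `${b}|${a}`);

  // Wins per agent and head-to-head comparison counts.
  const wins = new Map<string, number>();
  const games = new Map<string, number>();
  for (const a of agents) wins.set(a, 0);
  for (const { winner, loser } of pairs) {
    wins.set(winner, wins.get(winner)! + 1);
    games.set(key(winner, loser), (games.get(key(winner, loser)) ?? 0) + 1);
  }

  // MM updates for Bradley-Terry: pi_i <- W_i / sum_j (n_ij / (pi_i + pi_j)).
  const strength = new Map<string, number>();
  for (const a of agents) strength.set(a, 1);
  for (let it = 0; it < iterations; it++) {
    const next = new Map<string, number>();
    for (const a of agents) {
      let denom = 0;
      for (const b of agents) {
        if (a === b) continue;
        const n = games.get(key(a, b)) ?? 0;
        if (n > 0) denom += n / (strength.get(a)! + strength.get(b)!);
      }
      // Small pseudo-count keeps zero-win agents at a finite rating.
      next.set(a, denom > 0 ? (wins.get(a)! + 0.1) / denom : strength.get(a)!);
    }
    // Normalize so the geometric mean of strengths stays at 1.
    const logMean =
      [...next.values()].reduce((s, v) => s + Math.log(v), 0) / agents.length;
    for (const a of agents) strength.set(a, next.get(a)! / Math.exp(logMean));
  }

  // Elo-style scale: P(i beats j) = 1 / (1 + 10^((R_j - R_i) / 400)).
  const ratings = new Map<string, number>();
  for (const a of agents) {
    ratings.set(a, 1500 + (400 / Math.LN10) * Math.log(strength.get(a)!));
  }
  return ratings;
}
```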

[Figure: horizontal interval plot of all 18 agents by rating (x-axis, roughly 1000 to 1800) with 90% confidence intervals, from gemini-3-pro-preview at the low end to gpt-5-2-high at the top.]

Each point is an agent's rating, where higher is better. Whiskers are 90% bootstrap confidence intervals, and color indicates model family.
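The whiskers come from a standard percentile bootstrap: resample whole runs with replacement, refit the ratings on each resample, and take the 5th and 95th percentiles per agent. A sketch, with a hypothetical `Run` shape and any rating-fit function (such as the one above, applied to each run's winner/loser pairs) passed in:

```typescript
// 90% bootstrap confidence intervals for the ratings. Run shape and the
// injected fit function are hypothetical.

type Run = { participants: string[]; winner: string };

function bootstrapIntervals(
  runs: Run[],
  fit: (runs: Run[]) => Map<string, number>,
  resamples = 1000,
): Map<string, { lo: number; hi: number }> {
  const samples = new Map<string, number[]>();

  for (let b = 0; b < resamples; b++) {
    // Each replicate is a same-sized dataset drawn from the observed runs.
    const replicate = Array.from(
      { length: runs.length },
      () => runs[Math.floor(Math.random() * runs.length)],
    );
    for (const [agent, rating] of fit(replicate)) {
      if (!samples.has(agent)) samples.set(agent, []);
      samples.get(agent)!.push(rating);
    }
  }

  // 5th and 95th percentiles of each agent's bootstrap distribution.
  const intervals = new Map<string, { lo: number; hi: number }>();
  for (const [agent, ratings] of samples) {
    ratings.sort((x, y) => x - y);
    intervals.set(agent, {
      lo: ratings[Math.floor(0.05 * (ratings.length - 1))],
      hi: ratings[Math.ceil(0.95 * (ratings.length - 1))],
    });
  }
  return intervals;
}
```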

Across this data, the agents separate naturally into tiers, and in particular there's a clear gap between the top tier and the rest.

Within that top tier, the confidence intervals overlap heavily. The intervals for the top two agents in this snapshot, gpt-5-2-high and gpt-5-2-codex-high, overlap by about forty rating points, so first and second place can't be separated with any confidence.

The ranking exists, but it's noisy. If you had to pick one agent based on the leaderboard, expect variance from task to task even with a top-rated model.

Selection Advantage Is Large

So, how much do you gain by running a cohort instead of a single agent?

We measure the value of a cohort by its win rate on runs where at least one member participated. For example, if you ran the top three agents (the top-3 cohort), how often would someone in that group have produced the winning output?

[Figure: scatter plot of win rate (y-axis, 0% to 100%) against cohort size (x-axis, 1 to 18); the top-18 cohort wins 100% of runs.]

The x-axis is cohort size, where the top-k cohort is the k highest-rated agents. The y-axis is the win rate on runs where at least one agent from that cohort participated; a win means the merged diff came from someone in that cohort.
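A sketch of that metric, and of the curve in the plot, with a hypothetical `Run` shape:

```typescript
// Cohort win rate: of the runs where at least one cohort member participated,
// the fraction whose merged diff came from someone in the cohort.
// Run shape and field names are hypothetical.

type Run = { participants: string[]; winner: string };

function cohortWinRate(runs: Run[], cohort: string[]): number {
  const members = new Set(cohort);
  const eligible = runs.filter((r) => r.participants.some((a) => members.has(a)));
  if (eligible.length === 0) return 0;
  return eligible.filter((r) => members.has(r.winner)).length / eligible.length;
}

// The curve in the plot: win rate of the top-k cohort for k = 1..N,
// where `ranked` is the agent list sorted by rating, best first.
function winRateCurve(runs: Run[], ranked: string[]): number[] {
  return ranked.map((_, k) => cohortWinRate(runs, ranked.slice(0, k + 1)));
}
```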

Across our data, the top agent alone (top-1) wins 24% of the time. Add the next two agents (top-3) and someone in that group wins 51%. Expand to seven agents (top-7), and the win rate reaches 91%.

The first few agents matter most. After seven, the curve flattens, and agents 8 through 18 improve your odds by only a small amount.

You don't need to run every agent, but running only one leaves better code on the table most of the time.

Conclusion

Coding agents cluster into performance tiers, and within the top tier the margins are thin and noisy. A leaderboard is still useful, but mostly as a way to choose a cohort instead of a single default.

The best agent for a given task is hard to predict in advance, but if you run a few from the top tier, one of them will usually get it right.

In our data, going from one agent to three roughly doubles your win rate. Going from three to seven captures most of the remaining gains.

Again, this is across our data, which reflects performance on day-to-day product work in our codebase, according to our preferences. If your workflow and domain differ meaningfully from ours, your results may differ.

Tokens are cheap, and human engineering time is expensive. It may be worth running a few more agents if it means a better foundation, less cleanup, and fewer bugs for future work.