
AI coding tools don’t just speed up development—they inflate throughput: more PRs, more diffs, more experiments, more parallel threads. Many orgs then discover the same pattern: total engineering work doesn’t go down. It moves into review capacity, coordination overhead, and cognitive load.
For Heads of Engineering, the goal isn’t “more code.” It’s faster, safer delivery with sustainable attention. That requires managing new bottlenecks and measuring them well, which is where DevEx surveys become a critical instrument.
How many hours of clean, self-contained software work can an AI agent finish end-to-end? That’s the most useful way to read METR’s results.
METR benchmarks agents on well-specified software tasks, estimates how long a skilled human would take, and reports the task length where the agent succeeds about 50% of the time (and also at 80%). METR tested this on frontier models in their evaluation suite with methodology described here.
METR’s testing suggests that Claude Opus 4.6 can complete software tasks that would take a human expert about 14.5 hours with 50% success (as summarized here). In practical terms: the model can sometimes stay on track for what’s essentially two full workdays of human effort—but it’s still a coin flip.
If you raise the bar and ask for ~80% reliability, the same summary notes the effective task size drops sharply (from ~14.5 hours down to ~1 hour). That gap is exactly what engineering leaders feel in practice: AI can handle longer arcs sometimes, but reliability still forces you to design work as shorter, verifiable chunks.
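That tradeoff can be put in rough numbers. Under a deliberately crude retry model (an illustrative assumption, not part of METR’s methodology), a task of T human-hours that succeeds with probability p per attempt, where a failed attempt burns the full T hours before anyone notices, has an expected cost of T / p hours:

```python
# Illustrative retry model (an assumption for this sketch, not METR's
# methodology): a task worth T human-hours succeeds with probability p
# per attempt, and a failed attempt costs the full T hours before it's
# caught. Expected total cost until one attempt succeeds = T / p.

def expected_hours(task_hours: float, success_rate: float) -> float:
    """Expected hours spent until one attempt succeeds."""
    return task_hours / success_rate

# One big 14.5-hour task delegated at 50% reliability:
big = expected_hours(14.5, 0.50)            # 29.0 expected hours

# The same 14.5 hours of work split into 1-hour chunks at 80%:
chunked = 14.5 * expected_hours(1.0, 0.80)  # ~18.1 expected hours

print(f"one big task: {big:.1f}h, chunked: {chunked:.1f}h")
```

Chunking also shrinks the variance: a failed 1-hour attempt is cheap to detect and redo, which is the practical argument for designing work as shorter, verifiable units.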
METR reports that this “how many hours can it finish” curve has historically doubled about every seven months (paper), and some commentary suggests it may be accelerating. If that trajectory holds, the ceiling on what you can delegate safely will keep rising—fast.
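Reasoning about that trajectory is just exponential arithmetic: horizon(t) = horizon₀ × 2^(t / 7 months). A minimal sketch, taking the ~14.5-hour figure quoted above as the starting point and assuming the historical doubling rate simply continues (a strong assumption that METR itself hedges on):

```python
# Project the 50%-success task horizon forward, assuming METR's
# historical ~7-month doubling simply continues (a strong assumption).

def projected_horizon(h0_hours: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Task horizon after `months_ahead` months of exponential doubling."""
    return h0_hours * 2 ** (months_ahead / doubling_months)

for months in (7, 14, 28):
    h = projected_horizon(14.5, months)
    print(f"+{months:2d} months: ~{h:.0f} human-hours")  # 29, 58, 232
```

Even if the real curve bends, the shape of the math is the point: any sustained doubling overwhelms a delegation policy that was calibrated once and never revisited.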
These results are for clean, self-contained coding tasks. Real engineering work includes ambiguous requirements, stakeholder negotiation, “why did we do it this way?”, and hidden constraints—areas where the bottlenecks remain very human.
The key leadership takeaway isn’t “AI can do 14.5 hours of work.” It’s: AI can sometimes do multi-hour work, but you can’t plan your delivery system around a coin flip.
In practice, that means the best ROI comes from:
If PR volume doubles, reviewers become the limiting resource. Symptoms show up quickly:
AI outputs are often plausible; correctness requires expertise. A recurring pattern is that adoption increases the need for strong reviewers and verification discipline. Teams win when they invest in:
With multiple AI threads in flight, more proposals, and more partial solutions, teams can drift into “fast activity” without real progress on outcomes.
When code generation gets cheaper, teams generate more code. That’s often productive—but it also exposes constraints that used to be hidden:
This is why some teams feel busier after adopting AI—even if they’re shipping more (and why “output” metrics alone can be misleading).
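The reviewer-bottleneck effect described above follows from basic queueing behavior: as reviewer utilization approaches 100%, turnaround grows non-linearly, not proportionally. A minimal sketch using the standard M/M/1 mean-time-in-system formula; the arrival and review rates are invented illustrative numbers, not data from any team:

```python
# M/M/1 queue sketch: PRs arrive at rate `lam` per day and one
# reviewer completes `mu` reviews per day. Mean time a PR spends
# waiting plus in review is W = 1 / (mu - lam).
# The rates below are invented for illustration.

def avg_turnaround_days(lam: float, mu: float) -> float:
    """Mean time in system for an M/M/1 queue (requires lam < mu)."""
    assert lam < mu, "queue is unstable once arrivals reach capacity"
    return 1.0 / (mu - lam)

mu = 10.0                     # reviewer capacity: 10 PRs/day
for lam in (5.0, 8.0, 9.5):   # PR volume ramping up with AI tooling
    print(f"{lam:>4} PRs/day -> {avg_turnaround_days(lam, mu):.2f} days per PR")
```

Going from half-loaded (5 PRs/day) to nearly saturated (9.5 PRs/day) against the same capacity multiplies turnaround tenfold, from 0.2 to 2.0 days. That non-linearity is why “reviewers are the limiting resource” tends to show up suddenly rather than gradually.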
Engineering metrics tell you what happened (cycle time, PR throughput, incidents). They rarely tell you why it happened—or what friction is building until it’s already costly.
DevEx surveys fill that gap by capturing signals that don’t show up reliably in systems data:
Use these five questions from the standard DevEx question bank to quantify the exact issues this article discusses: