When AI writes most of the code, engineering moves up the stack

From Uber’s agent stack to OpenAI’s small teams: the new operating model for engineering

AI didn’t just make coding faster.

It made everything around coding—problem framing, system design, review, and operational safety—the new constraint.

Across the industry, we’re seeing two things happen at once: AI is dramatically accelerating code production, and the work around the code—judgment, review, integration, operations—is becoming the constraint.

This article synthesizes recent perspectives on what’s breaking, what becomes uniquely human, and how top teams are actually working with AI agents today.

The key challenges when AI produces code at scale

The illusion of speed (progress that turns into rework)

When execution is cheap, it’s easy to ship a lot of code in the wrong direction. The trap is mistaking “generated output” for “validated progress”.

In practice, this shows up as:

  • features that compile but don’t solve the real user problem
  • quick wins that create long-term operational drag
  • requirements ambiguity “baked into” thousands of lines of plausible code

The cost of being wrong goes up because the mistake propagates faster.

One‑nine software flooding systems that require five‑nines thinking

A recurring warning from ecosystem maintainers, for example at Google: a “90% solution” is also a “one‑nine” solution. Agentic tools can mass-produce partial implementations and low-quality changes faster than organizations can:

  • review them
  • test them
  • integrate them safely
  • maintain them over time

This is especially painful in mature, long-lived codebases where the real work is evolution and reliability.

Hallucinations and unverified claims are structural, not a vendor bug

Organizations hoping vendors will “fix hallucinations” are waiting for something that likely won’t arrive in the way they imagine. Current models optimize for plausible output, not verified truth. So enterprise workflows must assume:

  • the model will occasionally be confidently wrong
  • the model will omit critical constraints unless prompted
  • the model will create subtle inconsistencies across modules

Review and integration become the bottleneck

As AI increases code throughput, teams quickly hit second-order effects:

  • more PRs and more code to route to the right reviewers
  • more noise in reviews (“looks fine” comments) and more risk of missing the one critical issue
  • more testing demand (and cost) to maintain confidence

At scale, you don’t just need better prompts—you need better workflow design.

Cost becomes real (tokens, compute, and platform load)

The agent era pushes compute and platform constraints into the foreground. When engineers run multiple agents in parallel, the limiting factor can shift from “developer time” to:

  • token spend
  • CI capacity
  • code review bandwidth
  • platform/infra investment needed to make agents safe and useful
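One way to make token spend a first-class constraint is to gate agent runs behind an explicit budget. The sketch below is purely illustrative (the `TokenBudget` class and its `charge` method are hypothetical names, not a real library API), but it shows the shape of the control: record spend per period, and refuse runs that would blow the limit.

```python
# Hypothetical sketch: a per-team token budget that gates agent runs.
# TokenBudget / charge are illustrative names, not a real library API.
from dataclasses import dataclass


@dataclass
class TokenBudget:
    limit: int          # tokens allowed per period (e.g., per day)
    spent: int = 0

    def charge(self, tokens: int) -> bool:
        """Record usage; return False when the run would exceed the budget."""
        if self.spent + tokens > self.limit:
            return False
        self.spent += tokens
        return True


budget = TokenBudget(limit=1_000_000)
assert budget.charge(400_000)        # first agent run fits
assert budget.charge(500_000)        # second run fits
assert not budget.charge(200_000)    # third run would exceed the budget
```

In practice the same gate extends naturally to the other resources listed above—CI minutes and concurrent agent slots—so "run agents where it matters" becomes an enforced policy rather than a slogan.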

The human role—and why deep work and deep thinking matter more

As code generation becomes abundant, judgment becomes scarce. The highest-leverage human work moves “above the code”:

  • What do we build?
  • Why, and for whom?
  • How does it fit into the system?

Deep work is upstream risk management

In an AI-saturated environment, deep work isn’t a luxury. It’s the cheapest form of risk reduction.

The “slow phases” are where teams decide:

  • what success looks like (and what’s explicitly out of scope)
  • constraints (security, performance, compliance, reliability)
  • the shape of the system (interfaces, invariants, operability)
  • what evidence is required before shipping

If those decisions are fuzzy, AI will happily fill the vacuum with plausible implementation details.

One practical implication: “protecting deep work” can’t just be a calendar aspiration—it needs operational support. Work Smart is our approach to safeguarding focus time through observability over activity and system log data (e.g., collaboration signals and interruption patterns), so teams can spot where deep work is being eroded and put guardrails in place. Uber, for example, also uses focus time as a key engineering productivity metric.

Humans remain the final checkpoint (taste, accountability, verification)

A practical operating assumption: AI can draft; humans must sign off.

Senior engineers’ “taste” shows up as:

  • spotting subtle inconsistencies
  • recognizing when a change violates system intent
  • knowing which risks are existential vs acceptable
  • demanding evidence (tests, monitoring signals, rollback plan)

Communication becomes part of the technical system

If the new constraints are decisions and coordination, then communication quality directly affects throughput.

This is also where measurement helps: Developer Experience Surveys can quantify friction (interruptions, review latency, context switching, clarity of goals, perceived autonomy) and help leaders validate whether AI is improving the experience—or just shifting pain to different parts of the system. High-performing teams use lightweight patterns to avoid drowning in context.

How top teams work with AI agents today

Small, empowered teams with high autonomy

Teams that move fastest tend to be small (often 3–4 people), with high ownership and minimal ceremony. One example: OpenAI shipped Sora in ~28 days with a 4‑person team and also runs a very small team for the Codex app.

The key is not “adding AI to the process,” but continuously removing bottlenecks across the lifecycle:

  • planning
  • prioritization
  • generation
  • review
  • testing

Leaders must avoid becoming the bottleneck; local decision-making is required when tooling and models change weekly. One comparison used for the Codex org model is a modern Bell Labs.

Parallelizing work with multiple agents

A common workflow shift: Uber moved from single-threaded IDE work to orchestrating multiple parallel agents, using tools like Claude Code, Cursor, and Codex.

  • pre‑AI: most time in IDE authoring code
  • early agents: one agent at a time
  • current: several parallel agents, each with a defined task

Engineers orchestrate, review, and integrate rather than type every line.
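The “several parallel agents, each with a defined task” pattern can be sketched as a fan-out loop with a concurrency cap. Everything here is illustrative: `run_agent` stands in for a real coding-agent call (Claude Code, Cursor, Codex), and the engineer’s role is the step after `gather`—reviewing and integrating the drafts.

```python
# Hypothetical sketch: fan out well-scoped tasks to parallel "agents",
# then hand every draft back to a human for review and integration.
import asyncio


async def run_agent(task: str) -> str:
    # Placeholder for the actual agent invocation (Claude Code, Codex, ...).
    await asyncio.sleep(0)
    return f"draft for: {task}"


async def orchestrate(tasks: list[str], max_parallel: int = 3) -> list[str]:
    sem = asyncio.Semaphore(max_parallel)   # cap concurrent agents

    async def bounded(task: str) -> str:
        async with sem:
            return await run_agent(task)

    # Results come back in task order; a human reviews each draft next.
    return await asyncio.gather(*(bounded(t) for t in tasks))


drafts = asyncio.run(
    orchestrate(["fix flaky test", "write migration", "update docs"])
)
```

The semaphore is the interesting design choice: it makes the parallelism bound explicit, which is exactly where the cost controls from the previous section (token spend, CI capacity) attach.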

Building an internal “AI stack,” not just buying a tool

Uber shows what it takes to scale agentic development in a large organization:

  • a platform layer (model gateway)
  • secure access to internal context (code, docs, tickets)
  • support for best-of-breed external tools
  • specialized agents (background tasks, test generation, code review)
  • enablement: measurement, education, and cost control

They also invested in infrastructure to make this manageable (again, see the Uber deep dive):

  • an MCP (Model Context Protocol) gateway to connect agents to internal/external tools with centralized auth + telemetry
  • an internal agent builder/registry to reuse workflows
  • a standard CLI entrypoint to provision/update tools and run background agents

Guardrails: speed requires stronger quality gates

As throughput increases, teams need explicit guardrails to keep quality and operability intact.

A practical checklist:

  1. Definition of done includes evidence (tests, performance checks, security checks).
  2. Risk-tier reviews (low-risk changes can be delegated more; high-risk requires senior sign-off).
  3. Repo-aware agents with constrained scopes (clear boundaries, clear interfaces).
  4. CI and observability as first-class (canarying, monitoring, rollback).
  5. Cost controls (token budgets, concurrency limits, caching, “run agents where it matters”).
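Item 2 of the checklist—risk-tier reviews—can be made concrete with a small routing rule. The tiers, path prefixes, and reviewer requirements below are illustrative assumptions, not a real policy engine; the point is that the rule is explicit code, not tribal knowledge.

```python
# Hypothetical sketch: route changes by risk tier.
# Path prefixes and thresholds are illustrative, not a real policy.
def risk_tier(paths: list[str], lines_changed: int) -> str:
    """Classify a change as 'high' or 'low' risk."""
    sensitive = ("auth/", "billing/", "migrations/")
    if any(p.startswith(sensitive) for p in paths):
        return "high"
    return "high" if lines_changed > 500 else "low"


def required_review(tier: str) -> str:
    # Low-risk changes can lean on agent pre-review and a peer;
    # high-risk changes need senior sign-off plus evidence.
    if tier == "high":
        return "senior sign-off + tests + rollback plan"
    return "peer review"


tier = risk_tier(["billing/invoice.py"], lines_changed=40)
```

A rule like this can run as a CI step that labels the PR and blocks merge until the matching review requirement is met.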

What this means for engineering leaders

If you want AI speed without quality collapse, the playbook is not “mandate AI.” It’s:

  • invest in deep thinking upstream (problem framing, constraints, intent) — and protect the capacity for deep work with tooling like Work Smart
  • redesign workflows so review/testing/operations can keep up
  • build the platform connectors and guardrails that make agents safe
  • measure outcomes, not output — combining delivery signals with the “human system” signals you get from Developer Experience Surveys

Network Perspective lens: as coordination becomes the constraint, leaders need visibility into how work flows through the organization—handoffs, review bottlenecks, decision latency, and where context breaks. The teams that win will treat collaboration and decision-making as measurable, improvable systems, and then systematically reduce the friction that prevents deep work.

Suggested next steps (starting this week)

  • Pick one workflow (e.g., migrations, test creation, or bugfixes) and pilot an agent-first approach with clear guardrails.
  • Introduce a “thinking first” template: problem, success criteria, constraints, risks, evidence required.
  • Run a fast baseline Developer Experience Survey to identify where AI is increasing friction (context switching, review latency, unclear requirements), then re-run after 2–4 weeks to verify improvement.
  • Use Work Smart to protect focus time during the pilot (e.g., detect interruption-heavy periods, identify meeting/notification hotspots, and reinforce deep-work blocks).
  • Split changes into risk tiers and adjust review rigor accordingly.
  • Track the new bottlenecks end-to-end: PR routing time, review latency, CI time, incident rate, token spend — and correlate them with focus/interrupt signals.
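Tracking one of those bottlenecks end-to-end can start very simply. The sketch below computes review latency from (opened, first-review) timestamp pairs; the data shape is an illustrative assumption—in practice these would come from your Git host’s API.

```python
# Hypothetical sketch: review latency from (opened, first_review) pairs.
# The PR event data shape is illustrative, not a real API response.
from datetime import datetime
from statistics import median

prs = [
    ("2026-03-01T09:00", "2026-03-01T11:30"),   # reviewed same morning
    ("2026-03-01T10:00", "2026-03-02T10:00"),   # waited a full day
]


def hours_between(opened: str, reviewed: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(reviewed, fmt) - datetime.strptime(opened, fmt)
    return delta.total_seconds() / 3600


latencies = [hours_between(o, r) for o, r in prs]
median_latency = median(latencies)   # 13.25 hours for this sample
```

The same pattern—collect event pairs, compute a latency distribution, watch the median and tail—applies to PR routing time, CI time, and incident response, which makes it easy to correlate them with focus/interrupt signals over the pilot window.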
March 31, 2026
