When AI Builds AI: The Feedback Loop That Changes Everything
Inspired by MC Escher
Here are two numbers that, taken together, should give anyone in technology pause.
The first: according to METR’s time horizon benchmark, Claude Opus 4.6 can now successfully complete software tasks that take skilled human engineers over 14 hours—roughly a 3x jump from Claude Opus 4.5’s ~5-hour mark just a few months earlier.
The second: Mike Krieger, Anthropic’s Chief Product Officer and co-founder of Instagram, recently stated at the Cisco AI Summit that approximately 95% of Claude Code is now written by Claude itself. Boris Cherny, who leads Claude Code development, told Fortune he hasn’t personally written a line of code in over two months—while shipping over 250 pull requests in a single month, all written by Claude.
Read those two facts together and the implication is striking: AI is building AI. And the AI it’s building is getting dramatically better at building AI.
This is the feedback loop that people have been theorizing about for years. The question is whether the data supports the idea that it’s actually happening.
The Graph Everyone Is Talking About
Source: METR Time Horizons — reproduced with attribution
If you follow AI developments, you’ve probably seen METR’s time horizon graph. On a logarithmic scale, pure exponential growth should appear as a straight line—steady, predictable, orderly. And for years, that’s roughly what the data showed: a consistent doubling rate, models climbing the curve at a regular pace.
METR (Model Evaluation & Threat Research) measures what they call “time horizons”—the task duration, measured by how long a skilled human would take, at which an AI agent succeeds with 50% reliability. Their latest data shows this metric has been doubling approximately every 128 days—about 4.3 months—accelerating from a historical average of roughly 6 months.
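That doubling rate is easy to turn into a back-of-the-envelope projection. A minimal sketch in Python (the starting horizon and the assumption that the 128-day rate simply holds are illustrative choices, not METR’s forecast):

```python
# Project METR's 50% time horizon forward, assuming the current
# ~128-day doubling rate simply continues (an assumption, not a forecast).
doubling_days = 128
start_horizon_hours = 14.5  # Claude Opus 4.6's reported 50% horizon

def horizon_after(days, start=start_horizon_hours, doubling=doubling_days):
    """Projected 50% time horizon (hours) after `days`, under constant doubling."""
    return start * 2 ** (days / doubling)

print(f"{horizon_after(128):.1f} h after one doubling period")  # 29.0 h
print(f"{horizon_after(365):.1f} h after one year")
```

One doubling period takes 14.5 hours to 29; a year of constant doubling lands above 100 hours, which is why small changes in the doubling rate matter so much downstream.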
Here’s where the frontier models currently stand:
| Model | 50% Time Horizon | Date |
|---|---|---|
| Claude 3.7 Sonnet | ~1 hour | Feb 2025 |
| o3 | ~2 hours | Apr 2025 |
| GPT-5 | ~3.5 hours | Aug 2025 |
| Claude Opus 4.5 | ~5 hours 20 min | Nov 2025 |
| GPT-5.2 | ~6 hours 34 min | Dec 2025 |
| Claude Opus 4.6 | ~14.5 hours | Feb 2026 |
That last row is the one that caught my eye—and here’s what’s significant about it. Claude Opus 4.6 isn’t just continuing the exponential trend. It’s an outlier even on the log scale. When a data point breaks above the trend line on a logarithmic chart, it means the growth rate itself is accelerating. We’re not just seeing exponential progress; we may be seeing the exponent increase.
Now look at the same data on a linear scale:
The same data, different story. On a linear scale, the recent jump from ~5 hours to ~14.5 hours is impossible to miss.
On a linear scale, the implication hits harder. The jump from Opus 4.5 to 4.6 is larger than all previous jumps combined. The log scale makes exponential growth look like steady progress. The linear scale shows you what exponential growth actually feels like when you’re living through it. And when the log scale itself shows acceleration, the linear scale becomes a hockey stick.
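The claim that the trend itself is accelerating can be checked with quick arithmetic: treat the two Opus data points as samples of an exponential and back out the implied doubling time. (The table gives only months, so the exact release days below are assumptions for illustration.)

```python
import math
from datetime import date

# Implied doubling time between two points on the METR curve.
# Horizons are from the table above; exact days are assumed.
date_45, horizon_45 = date(2025, 11, 24), 5.33   # Claude Opus 4.5, ~5 h 20 min
date_46, horizon_46 = date(2026, 2, 5), 14.5     # Claude Opus 4.6, ~14.5 h

days = (date_46 - date_45).days
growth = horizon_46 / horizon_45
implied_doubling = days * math.log(2) / math.log(growth)
print(f"{days} days, {growth:.2f}x growth -> doubling every {implied_doubling:.0f} days")
```

With these assumed dates, the implied doubling time comes out around 50 days—well below the trend’s 128—which is exactly what “the exponent is increasing” means. The enormous confidence intervals discussed below mean this number should be held loosely.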
But before we get carried away, some important caveats.
Reading the Graph Correctly
MIT Technology Review recently called this “the most misunderstood graph in AI,” and they’re right. There are several things people commonly get wrong.
First, the y-axis does not represent how long the AI can work independently. It represents how long a human would take to complete the tasks the AI can do. Just because Claude Code can iterate for 12 hours without user input doesn’t mean it has a 12-hour time horizon.
Second, the confidence intervals are enormous. METR reports Claude Opus 4.6’s 50% time horizon at roughly 14.5 hours, but the 95% confidence interval stretches from about 6 hours to 98 hours. That’s a massive range. The benchmark is running into saturation problems—the tasks aren’t long enough to precisely measure where the frontier models actually top out. METR has been expanding their task suite (up 34% in the latest version, with double the number of 8+ hour tasks), but they’re essentially building the road as they drive on it.
Third, there’s a reliability gap that matters enormously in practice. GPT-5.2’s 50% time horizon is about 6.5 hours, but its 80% time horizon—the task length at which it succeeds four times out of five—is only about 55 minutes. That’s roughly a 7:1 ratio, and it’s consistent across models. Being able to do something half the time is very different from being able to do it reliably.
These caveats matter. But they don’t erase the trend. Even the conservative end of the confidence intervals shows remarkable progress.
Claude Is Writing Claude
Now here’s where it gets interesting.
In March 2025, Anthropic CEO Dario Amodei predicted that AI would write 90% of code within 3-6 months. At the time, many people—myself included—thought this was aggressive. By late 2025, he refined the claim: 70-90% of code at Anthropic was being written by Claude.
By February 2026, the numbers appear to bear him out—at least internally. Krieger’s claim that 95% of Claude Code is written by Claude itself means the tool is substantially authoring its own next version. An Anthropic spokesperson clarified the company-wide figure is between 70% and 90%, which is still remarkable.
They’re not alone. Cognition reports that Devin now produces 25% of its own company’s internal pull requests, with a target of 50% by year-end. Microsoft’s Satya Nadella said AI was generating about 30% of code at Microsoft as of mid-2025.
But Anthropic’s situation is qualitatively different. When Claude writes 95% of Claude Code, and Claude Code is the tool people use to direct Claude, you have something that looks like the early stages of a recursive improvement loop: AI builds tools → tools make AI more capable → more capable AI builds better tools.
Is This the Self-Improvement Boom?
Let’s be careful here. The concept of recursive AI self-improvement has been discussed for decades, often in the context of an “intelligence explosion”—a hypothetical scenario where AI rapidly bootstraps itself to superhuman capability. That scenario remains speculative. What we’re seeing is something more mundane but still significant.
The current loop looks like this: human engineers at Anthropic set direction, define architecture, and make judgment calls. Claude generates the vast majority of the implementation code. That code makes Claude better. The next version of Claude generates even more of the code. Humans remain in the loop—but the loop is tightening.
This is not fully autonomous self-improvement. It’s more like a power tool that keeps getting better at manufacturing the next version of itself, while a human operator still guides the process. But the acceleration is real. Consider the pace: Claude Opus 4.5 shipped in November 2025 with a 5-hour time horizon. Just months later, Opus 4.6 landed at 14.5 hours. Previous generations took longer to produce smaller gains.
An ICLR 2026 workshop on recursive self-improvement notes that while AI systems now routinely rewrite their own codebases, they still require human guidance—which places an upper bound on how fast improvement can come. But that upper bound appears to be rising.
What This Means Practically
For those of us watching AI’s impact on software development—a topic I’ve been writing about in this series—the METR trend has concrete implications.
If the doubling rate holds, we’re looking at agents that can handle full-day tasks by mid-2026 and multi-day tasks by early 2027. Even if the rate slows—and it might, as we hit harder problems—the trajectory means that AI coding agents are moving from “useful assistant” to “junior developer” to something approaching “senior developer” capability in the span of a year or two.
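Those dates fall out of the same doubling arithmetic. A quick sketch (the 24-hour and 72-hour thresholds are illustrative stand-ins for “full-day” and “multi-day,” not METR definitions):

```python
import math

# Days until the 50% horizon crosses a given task length,
# assuming the ~128-day doubling rate holds (an assumption).
current_hours = 14.5   # Claude Opus 4.6, Feb 2026
doubling_days = 128

def days_until(target_hours):
    """Days from now until the projected horizon reaches `target_hours`."""
    return doubling_days * math.log2(target_hours / current_hours)

print(f"24 h tasks in ~{days_until(24):.0f} days")   # ~93 days -> mid-2026
print(f"72 h tasks in ~{days_until(72):.0f} days")   # ~296 days -> late 2026 / early 2027
```

Starting from February 2026, roughly 93 days puts full-day tasks in mid-2026 and roughly 296 days puts multi-day tasks around the turn of 2027—consistent with the projection above, and just as sensitive to the doubling rate holding.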
This doesn’t mean human developers become irrelevant—I’ve explored that question separately. The reliability gap means humans remain essential for catching errors, making judgment calls, and directing work. But the nature of the work is shifting fast. Boris Cherny shipping more than 250 PRs in a month without writing any code himself is a preview of what development looks like when the human’s job is direction, not implementation.
For the Garage Renaissance—the trend of solo founders building with AI agents—the acceleration is fuel on the fire. Every doubling of the time horizon expands the complexity of what a single person can build.
The Questions We Should Be Asking
The recursive improvement loop raises questions that go beyond developer productivity.
If AI is substantially writing itself, how do we audit and trust the output? Anthropic says they have “all the right scaffolds” around the process, but as the percentage climbs toward 100%, the scaffolding becomes the whole structure. What does code review look like when the reviewer is also AI?
There’s also the concentration question. If the companies building AI are the ones most accelerated by AI, the gap between leaders and everyone else could widen fast. Anthropic, OpenAI, and Google are all using their own models to accelerate development. Companies without that advantage fall further behind with each cycle.
And there’s the question I keep coming back to: rather than asking “are we almost at AGI?”—a question that may not have a meaningful answer—we should be asking what tasks AI can now perform, and what that implies for the systems we depend on. The METR graph is one answer to that question: the range of tasks is expanding exponentially. Whether or not that constitutes “intelligence” is less important than what it means for how work gets done.
Questions Worth Asking
- If AI agents can handle 14-hour tasks today, what does your job look like when they can handle 100-hour tasks in the near future?
- How should we think about code that is substantially written by AI, reviewed by AI, and used to build more AI? (And, as a corollary, how should we think about code written and reviewed by humans now?) What will a world filled with AI-generated code look like?
- Does the recursive improvement loop have natural speed limits, or are we at the beginning of a sustained acceleration? How can we keep up?
- What happens to the competitive landscape when the leading AI companies are also the ones most accelerated by AI? Who is setting the direction of these models?
What do you think? I’d love to hear your thoughts.