What is a "truly new task"? Does there exist such a thing? What's an example of one?
Everything we do builds on top of what's already been done. When I write a new program, I'm composing a bunch of heuristics and tricks I've learned from previous programs. When a mathematician approaches an open problem, they use the tactics they've developed from their experience. When Newton derived the laws of physics, he stood on the shoulders of giants. Sure, some approaches are more or less novel, but it's a difference in degree, not kind. There's no magical firebreak separating what AI is doing or will do from the things the most talented humans do.
The phrase "everything is a remix" was highlighted for a good reason: there's a documentary of that same name, and I can certainly recommend it.
At the same time, there are things that are truly novel. Even if an idea is based on combining two common approaches, the implementation might need to be truly novel, with new formulas and new questions arising from those. AI can't help there, speaking from experience.
I don't understand why their "Instant Grep + roundtrip to us-east-1" is so slow. First of all, the round-trip latency should not be nearly so bad to us-east-1. But second, and much more importantly, the LLM runs in the cloud. Shouldn't you just situate the LLM, agent runtime, and regex index in the same region? Wouldn't that be faster than round-tripping to the user's local machine?
Yeah, assuming there's no active monitoring during the training runs, you can trivially give the agent an abstraction which turns "1 GPU" into "16 GPUs" that just so happens to take 16x the wall-clock time to run.
In fact, looking at the blog post, the agent orchestrating 16 GPUs is half as efficient in GPU-time as the agent using 1 GPU: it uses 16 GPUs to reach the same result in 1/8 of the time, i.e. 2x the GPU-hours.
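The "virtual GPU" trick described above can be sketched in a few lines. This is a hypothetical illustration (the class and its API are made up, not from the post): 16 logical GPUs are time-sliced onto one physical device, so each logical GPU runs at 1/16 speed, and a job spread across all 16 takes the same wall-clock time as running on the bare device.

```python
class VirtualGpuPool:
    """Present n_logical 'GPUs' to the agent while multiplexing them
    onto a single physical device (hypothetical sketch, not a real API)."""

    def __init__(self, physical_flops, n_logical=16):
        self.n_logical = n_logical
        # Each logical GPU gets a 1/n_logical time slice of the real device.
        self.per_gpu_flops = physical_flops / n_logical

    def wall_clock(self, work_flops, gpus_used):
        """Wall-clock time to finish work_flops split over gpus_used logical GPUs."""
        gpus_used = min(gpus_used, self.n_logical)
        return (work_flops / gpus_used) / self.per_gpu_flops

pool = VirtualGpuPool(physical_flops=1.0)
print(pool.wall_clock(1.0, 1))   # 16.0: one logical GPU takes 16x bare-metal time
print(pool.wall_clock(1.0, 16))  # 1.0: all 16 logical GPUs match bare metal
```

From the agent's point of view, "scaling out" to 16 GPUs yields a genuine 16x speedup over one logical GPU, even though no extra hardware exists.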
Do you have a sense of whether these validation loss improvements are leading to generalized performance uplifts? From afar I can't tell whether these are broadly useful new ideas or just industrialized overfitting on a particular (model, dataset, hardware) tuple.
Super interesting study. One curious thing I've noticed is that coding agents tend to increase the code complexity of a project, but simultaneously massively reduce the cost of that code complexity.
If a module becomes unsustainably complex, I can ask Claude questions about it, have it write tests and scripts that empirically demonstrate the code's behavior, and worse comes to worst, rip out that code entirely and replace it with something better in a fraction of the time it used to take.
That's not to say complexity isn't bad anymore—the paper's findings on diminishing returns on velocity seem well-grounded and plausible. But while the newest (post-Nov. 2025) models often make inadvisable design decisions, they rarely do things that are outright wrong or hallucinated anymore. That makes them much more useful for cleaning up old messes.
Bad code has real-world consequences. It's not limited to having to rewrite it: the cost might also include sanctions, lost users, attrition, and other negative consequences you don't just measure in dev hours.
Right, but that cost is also incurred by human-written code that happens to have bugs.
In theory, experienced humans introduce fewer bugs. That sounds reasonable and believable, but anyone who's ever been paid to write software knows that finding reliable humans is not an easy task unless you're at a large, established company.
Well, if you keep in mind that "professionals" means "people paid to write code" then LLMs have been generating code at the same quality OR BETTER for about a year now. Most code sucks.
If you compare it to beautiful code written by true experts, then obviously not, but that kind of code isn't what makes the world go 'round.
We should qualify that kind of statement, as it's valuable to define just what percentile of “professional developers” the quality falls into. It will likely never replace p90 developers, for example, but it's better than developers somewhere between there and p10. (Arbitrary numbers, just for illustration.)
Can you quantify the quality of a p90 or p10 developer?
I would frame it differently. There are developers successfully shipping product X. Those developers are, on average, as skilled as necessary to work on project X; otherwise they would have moved on, or the project would have failed.
Can LLMs produce the same level of quality as project X developers? The only projects I know of where this is true are toy and hobby projects.
> Can you quantify the quality of a p90 or p10 developer?
Of course not: you've switched “quality” in this statement to modify the developer instead of their work. Regarding the work: each project, as you agreed in your reply, has an average quality for its code. Some developers bring that down on the whole; others bring it up. An LLM would sit somewhere on that spectrum.
In a one-shot scenario, I agree. But LLMs make iteration much faster. So the comparison is not really between an AI and an experienced dev coding by hand, it's between the dev iterating with an LLM and the dev iterating by hand. And the former can produce high-quality code much faster than the latter.
The question is: what happens when you have a middling dev iterating with an LLM? In that case, the drop in quality is probably non-linear; it can get pretty bad, pretty fast.
There was a recent study posted here showing that AI introduces regressions at an alarming rate (all but one above 50%), which indicates they spend a lot of time fixing their own mistakes. You've probably seen them do this kind of thing: making one change that breaks another, going and adjusting that, not realizing it's making things worse.
The study is likely "SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration". Regression rate plot is figure 6.
Read the study to understand what it is measuring and how it was measured. As I understand it, the parent's summary is fine, but you want to understand the study yourself before repeating it to others.
Bentley Software is proof that you can ship products with massive, embarrassing defects and never lose a customer. I can’t explain enterprise software procurement, but I can guarantee you product quality is not part of that equation.
This only helps if you notice the code is bad. Especially in overly complex code, you have to really be paying attention to notice when a subtle invariant is broken, an edge case is missed, etc.
It's the same reason a junior + a senior engineer is about as fast as a senior + 100 junior engineers: the senior's review time becomes the bottleneck and does not scale.
And even with the latest models and tooling, the quality of the code is below what I expect from a junior. But you sure can get it fast.
This is the most important point in the thread. The study measures code complexity but the REAL bottleneck is cognitive load (and drain) on the reviewer.
I've been doing 10-12 hour days paired with Claude for months. The velocity gains are absolutely real; I am shipping things I would never have attempted solo before AI, and shipping them faster than ever. BUT the cognitive cost of reviewing AI output is significantly higher than reviewing human code. It's verbose, plausible-looking, and wrong in ways that require sustained deep attention to catch.
The study found "transient velocity increase" followed by "persistent complexity increase." That matches exactly. The speed feels incredible at first, then the review burden compounds and you're spending more time verifying than you saved generating.
The fix isn't "apply traditional methods" — it's recognizing that AI shifts the bottleneck from production to verification, and that verification under sustained cognitive load degrades in ways nobody's measuring yet. I think I've found some fixes to help me personally with this and for me velocity is still high, but only time will tell if this remains true for long.
The part that gets me is when it passes lint, passes tests, and the logic is technically correct, but it quietly changed how something gets called. Rename a parameter. Wrap a return value in a Promise that wasn't there before. Add some intermediate type nobody asked for. None of that shows up as a failure anywhere. You only notice three days later when some other piece of code that depended on the old shape breaks in a way that has nothing to do with the original change.
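The comment's examples are JavaScript-flavored; a minimal Python analog of the same failure mode (function names are hypothetical) is a helper quietly made async during a "cleanup". A test that awaits it still passes, while an old call site silently receives a coroutine instead of a string:

```python
import asyncio

# Before the refactor: a plain synchronous helper (hypothetical name).
def get_user_name(user_id):
    return f"user-{user_id}"

# After an agent "cleanup": identical logic, quietly made async.
async def get_user_name_v2(user_id):
    return f"user-{user_id}"

# A test written against the new version awaits it and still passes:
assert asyncio.run(get_user_name_v2(1)) == "user-1"

# But an old call site that used the return value directly now gets a
# coroutine object, not a string, and nothing fails on this line.
result = get_user_name_v2(1)
print(isinstance(result, str))  # False: the API's "shape" changed silently
result.close()  # only to suppress the never-awaited warning in this demo
```

Nothing in lint or the (updated) tests flags this; the breakage only appears later, at whichever call site still assumes the old synchronous shape.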
> The study found "transient velocity increase" followed by "persistent complexity increase."
Companies facing this reality are of course typically going to use AI to help manage the increased complexity. But that leads quickly to AI becoming a crutch, without which even basic maintenance could pose an insurmountable challenge.
I would argue they are. Those traditional methods aim at keeping complexity low so that reading code is easier and requires less effort, which accelerates code review.
This matches what I've seen. The bottleneck moved from writing to reviewing, but we didn't update the process to reflect that. What helped our team was shifting to smaller, more frequent commits with tight scope descriptions — reviewing five 30-line diffs is dramatically less taxing than one 150-line diff, even though the total volume is the same. The cognitive load is nonlinear.
I’ve also seen Opus 4.5 and 4.6 churn out tons of essentially meaningless tests, including ones where it sets a field on a structure and then tests that the field was set.
You have to actually care about quality with these power saws or you end up with poorly-fitting cabinets and might even lose a thumb in the process.
I find LLMs get much more prone to making mistakes or missing references when the size or complexity of the code increases. I have a “vibe coded” application that is just for personal use, and I’ll usually create a fresh prompt after a large refactor and ask “were all references to the previous approach removed, and has the application been fully migrated to using the new approach?”
It finds spots it missed during the refactor basically every time.
So I partially agree with you, but I think it takes multiple passes and at least enough understanding to challenge the LLM and ask pointed questions.
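That "did the refactor miss anything" pass can be partially automated with a plain text scan before (or alongside) asking the model. A minimal sketch; the identifier names passed in are hypothetical stand-ins for whatever the old approach used:

```python
import os
import re

def find_leftover_references(root, old_identifiers, exts=(".py",)):
    """Walk a source tree and report lines that still mention the old API."""
    pattern = re.compile("|".join(re.escape(name) for name in old_identifiers))
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(exts):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="ignore") as f:
                for lineno, line in enumerate(f, 1):
                    if pattern.search(line):
                        hits.append((path, lineno, line.strip()))
    return hits

# Hypothetical old-approach names to hunt for after a refactor:
for path, lineno, line in find_leftover_references(".", ["legacy_fetch", "OldConfigLoader"]):
    print(f"{path}:{lineno}: {line}")
```

A dumb grep like this won't catch semantic leftovers (reimplemented logic, stale comments), which is where the pointed follow-up questions to the LLM still earn their keep.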
The open source models are pretty good too now. They are a few months behind, but not more than that. Sure, you still have to host them in the cloud to get enough VRAM to run them - but looking 10+ years into the future, the end game here is probably a local LLM that runs on your own computer and is more than capable of doing the coding for you.
I've been asking myself the same question. Realistically I think it depends a lot on how many providers are available in the future. If you lose access to one you can move to another, so it's not a single point of failure per se. I think this question gets a lot more relevant if the providers consolidate toward a monopoly rather than spreading wider; so far we've only seen more providers appear.
> Super interesting study. One curious thing I've noticed is that coding agents tend to increase the code complexity of a project, but simultaneously massively reduce the cost of that code complexity.
This is the same pattern I observed with IDEs. Autocomplete and being able to jump to a definition means spaghetti code can be successfully navigated so there's no "natural" barrier to writing spaghetti code.
I think that's a fallacy. As of right now there is a point of no return where the complexity can't be untangled by the agent itself without breaking other things. I've seen it before: agents cheat on tests, break lint and type rules.
I was hoping for it to work, but it didn't for me.
> but simultaneously massively reduce the cost of that code complexity.
Citation needed. Until proven otherwise, complexity is still public enemy #1. Particularly given that system complexity almost always starts causing most of its problems once a project is further along, I don't think we will know anything meaningful about that claim for at least a year.
I don’t necessarily endorse the author’s broad conclusions about “AI”, but I will say that the Spotify DJ specifically is an enragingly bad product. Nothing close to the utility of Claude Code.
OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark results, quantization, etc. etc. etc.
We validate with MMLU and Hellaswag presently, and are getting this independently verified by a 3rd party.
We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.
Also if you need a rough intuition as to why this is possible: it's because this entire inference stack was built for exactly one model, and thus we can really tune the entire framework accordingly.
I've no problem with the intuition. But I would hope for a lot more focus in the marketing materials on proving the (statistical) correctness of the implementation. 15% better inference speed is not worth switching to a completely unknown inference engine that hasn't been tested across a wide range of generation scenarios.
This is a fair critique! We plan to use our system to generate many more inference libraries of this nature, and I'll make it a point to release better, broader correctness measures when we do so.
More likely: this is a transitional phase where our previously hard problems become easy, and we will soon set our sights on new and much harder problems. The pinnacle of creative achievement in the universe is probably not 2010s B2B SaaS.
It is entirely possible, however, that human beings will not be the primary drivers of progress on those problems.
Finally, a perspective that looks beyond the buggy whips! As for your last comment, it depends on what you mean by the primary drivers. Figurative crank turners, maybe not. Creativity and insight, don’t count us out just yet.
I did, and yet I also felt more relaxed reading it than I am reading most blog entries posted on here. I didn't feel like I had to guard against my time being wasted by vacuous LLM fiction.
And, sadly, that is indistinguishable (to me, at least) from a human genuinely availing themself of LLM assistance to rough out a draft, then making an honest effort to personalize the text with their own effort and insight.