Hey! Thanks for your comment - I'm the one who wrote this article. I wasn't trying to say that the paper authors talked about "unexpected edge cases" or "thinking outside the box." I edited the post to be more clear that some of these takeaways are my own opinions.
This article is less a summary of the paper and more a commentary on what its results entail. After all, Hacker News is meant for discussion :)
I will say, though, that I still stand by the "exponentially more valuable" portion. I think the fact that LLMs can fluke their way into "hitting a jackpot" in terms of test coverage is exactly why they're so valuable. When you have something constantly trying out different combinations, if it hits even one jackpot, like in the paper, it's extremely valuable to the team. It's a case that could have been either non-obvious or simply too tedious to write a test for manually. I think there's tremendous value in that, especially speaking as someone who has spent way too much time simply figuring out how to test something within a Big Tech codebase (F/G) when I already knew what to test.
Pedantic warning here. In fast-and-loose, day-to-day English, "exponentially more" means "fast growth" or "a whole lot". But that usage is meaningless! Why? Technically, you can't have exponential growth without an independent variable. You can have exponential growth as a function of time, height, spend, distance, any freaking metric or variable. But it has to be as a function of a value.
You CAN'T have exponential growth that is not a function of some value or variable or input.
I suppose in this case you could argue you have exponential growth as a function of the discrete using-an-LLM or not-using-an-LLM variable, but I've never heard of exponential growth as a function of a discrete variable.
Often people using the term "exponential growth" in common English don't understand what it means. Sorry.
Good point! I used exponential to emphasize the nature of the value of a certain test compared to the rest of the generated tests, but you're right it's not the right word. I updated the article to remove the usage. :)
FWIW, exponential growth as a function of a discrete variable is very common (e.g. all of algorithmic complexity), but it has to be (at least modeled as) an unbounded numeric variable.
You can't have exponential growth as a function of a binary variable.
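To make that concrete, here is a minimal sketch of why two data points can't tell you anything about the shape of growth. The values `y_without_llm` and `y_with_llm` are made up purely for illustration and are not from the paper. With only x = 0 and x = 1 available, a straight line and an exponential curve both fit perfectly, so calling the difference "exponential" adds no information. Only a third point separates them.

```python
import math

def fit_linear(y0, y1):
    """Line through (0, y0) and (1, y1)."""
    return lambda x: y0 + (y1 - y0) * x

def fit_exponential(y0, y1):
    """Exponential through (0, y0) and (1, y1); needs y0 > 0 and y1 > 0."""
    ratio = y1 / y0
    return lambda x: y0 * ratio ** x

# Hypothetical "value" scores for the binary case (illustrative only).
y_without_llm, y_with_llm = 2.0, 50.0

linear = fit_linear(y_without_llm, y_with_llm)
exponential = fit_exponential(y_without_llm, y_with_llm)

# On a binary variable the two models are indistinguishable.
for x in (0, 1):
    assert math.isclose(linear(x), exponential(x))

# A third value of x is what finally tells them apart.
print(linear(2), exponential(2))  # 98.0 1250.0
```

So the claim only becomes meaningful once the input can keep growing (number of tests, input size n, and so on).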
Seconding digdugdirk's comment :) Thanks for the thoughtful response and I apologize if I came across as mean.
My problem is we have no clue what those lines actually were. If it was effectively dead code, then it's not surprising that it was untested, and the LLM-generated test wouldn't be valuable to the team. We have no clue what the value of the test actually was, and using a single stat like "lines of code covered" doesn't actually tell us anything. Saying the test was "exponentially more valuable" is pure speculation, and IMO not an especially well-founded one. (Sort of like saying people who write more lines of code are more productive.)
This speculation seems downright irresponsible when the paper specifically emphasizes that this result was a fluke. When the authors said "hit the jackpot" they did not mean "hit the jackpot with a valuable test", they meant "hit the jackpot with an outlier that somewhat artificially juked the stats." I truly believe if the LLM managed to write an unusually valuable test with such broad coverage they would have mentioned it in the qualitative discussion. Instead they went out of their way to dismiss the importance of the 1,326 figure.
You're right. I've edited my wording to be more realistic about the value of the test. I believe you're right that the test is not an outlier in terms of value provided.
Some of my comments within the article are more aspirational than realistic in this case, and I've made edits to reflect that.
I want to clarify that I view this LLM as a junior dev that submits PRs that pass presubmits and other verifiable, programmatic checks. A human dev then reviews the PR manually. In this case, the LLM + its processing is used to make sure that no BS is sent out for review - only potential improvements.
In no scenario should its auto-generated code be auto-submitted into the codebase. That becomes a nightmare really fast.
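For what it's worth, here's a rough sketch of the gating I have in mind. Everything in it is hypothetical: `CandidateTest`, the three checks, and `queue_for_human_review` are names I made up for illustration, not the paper's pipeline or any real tool's API. The point is just that programmatic checks filter out the junk, and the output is a review queue, never an automatic merge.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CandidateTest:
    """A hypothetical LLM-generated test plus the results of automated checks."""
    name: str
    builds: bool
    passes: bool
    new_lines_covered: int

Check = Callable[[CandidateTest], bool]

# Assumed presubmit-style checks; a real setup would have its own list.
PRESUBMIT_CHECKS: List[Check] = [
    lambda t: t.builds,                 # compiles / imports cleanly
    lambda t: t.passes,                 # passes when run
    lambda t: t.new_lines_covered > 0,  # adds coverage that did not exist
]

def queue_for_human_review(candidates: List[CandidateTest]) -> List[CandidateTest]:
    """Keep only candidates that pass every check. Nothing is merged here."""
    return [t for t in candidates if all(check(t) for check in PRESUBMIT_CHECKS)]

candidates = [
    CandidateTest("test_parse_empty", builds=True, passes=True, new_lines_covered=12),
    CandidateTest("test_broken_import", builds=False, passes=False, new_lines_covered=0),
    CandidateTest("test_duplicate", builds=True, passes=True, new_lines_covered=0),
]

review_queue = queue_for_human_review(candidates)
print([t.name for t in review_queue])  # ['test_parse_empty']
```

Note that a coverage check like the last one says nothing about whether the test is actually valuable. That judgment is exactly what the human reviewer is for.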
Thanks for engaging with the above constructive criticism, it's a refreshing change from what is sadly the norm.
One additional question - do you foresee any issues with this application where LLMs enter a non-value-add "doom loop"? I can imagine a scenario where a test generation LLM gets hooked on the lower-value simplistic tests, and yet management sees such a huge increase in the test metric ("100x increase in unit tests in an afternoon? Let's do it again!") that they continue to bloat the test suite to near-infinity. Now we're in a situation where all future models are training on a complete cesspool of meaningless tests that technically add coverage, but mostly just to cover an edge case that only an LLM would create.
Not sure if that makes sense, but tl;dr - having LLMs in the loop for both code creation and code testing seems like it's a feedback loop waiting to happen, with what seems like solely negative repercussions for future LLM training data.
I could see that, and I wouldn't want LLMs just generating tests willy-nilly with no human oversight. I don't have any trust in the reasoning ability of LLMs at all.
Rather, I prefer to view LLMs as a junior dev that submits PRs that pass presubmits and other verifiable, programmatic checks. A human dev then reviews the PR manually. In this case, the LLM + its processing is used to make sure that no BS is sent out for review - only potential improvements.
Perhaps there should be domains of focus for the test LLMs - even if they are clones, each one is assigned to only a particular domain, and their results have to go through PRs, etc.
Why not treat every LLM as a dev contributing to git, such that humans or other LLMs need to gatekeep in case something like that happens? (Start by treating them as interns, rather than professors with office hours.)