Hey! Thanks for your comment - I'm the one who wrote this article. I wasn't tryi...

camkego · on Feb 24, 2024

Pedantic warning here. In fast and loose day-to-day English, "exponentially more" means "fast growth" or "a whole lot". But that usage is meaningless! Why? Technically, you can't have exponential growth without an independent variable. You can have exponential growth as a function of time, height, spend, distance, any freaking metric or variable. But it has to be as a function of a value.

You CAN'T have exponential growth that is not a function of some value or variable or input.

I suppose in this case you could argue you have exponential growth as a function of the discrete variable "using an LLM" versus "not using an LLM", but I've never heard of exponential growth as a function of a discrete variable.

Often people using the term "exponential growth" in common English don't understand what it means. Sorry.

engineercodex · on Feb 27, 2024

Good point! I used "exponential" to emphasize the value of one particular test compared to the rest of the generated tests, but you're right, it's not the right word. I updated the article to remove the usage. :)

atq2119 · on Feb 24, 2024

Spot on.

FWIW, exponential growth as a function of a discrete variable is very common (e.g. all of algorithmic complexity), but it has to be (at least modeled as) an unbounded numeric variable.

You can't have exponential growth as a function of a binary variable.
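To make that distinction concrete, here is a minimal Python sketch. The function names and the two sample values are purely illustrative assumptions, not anything from the article or the paper. It shows that a count like "subsets of an n-element set" has a constant ratio between consecutive values of an unbounded discrete n, while two observations from a binary variable are fit equally well by a linear and an exponential curve, so the word "exponential" adds no information there.

```python
def subset_count(n: int) -> int:
    """Number of subsets of an n-element set: 2**n, exponential in discrete n."""
    return 2 ** n


def growth_ratios(values):
    """Ratios between consecutive values; a constant ratio signals exponential growth."""
    return [b / a for a, b in zip(values, values[1:])]


# Unbounded discrete variable: n = 0, 1, 2, ...
counts = [subset_count(n) for n in range(6)]
ratios = growth_ratios(counts)  # every ratio is the same (the base, 2)


# Binary variable: only two observations exist, at x = 0 and x = 1.
def linear_fit(y0: float, y1: float):
    """Line through (0, y0) and (1, y1)."""
    return lambda x: y0 + (y1 - y0) * x


def exponential_fit(y0: float, y1: float):
    """Exponential curve through (0, y0) and (1, y1); requires y0 != 0."""
    return lambda x: y0 * (y1 / y0) ** x


# Hypothetical values for "without LLM" (x = 0) and "with LLM" (x = 1).
y0, y1 = 3.0, 300.0
linear = linear_fit(y0, y1)
exponential = exponential_fit(y0, y1)

# Both curves agree at the only two points that exist, so the data
# cannot tell "linear" from "exponential".
print(linear(0), exponential(0))
print(linear(1), exponential(1))
```

With only x = 0 and x = 1 available, any growth law passes through both points, which is why the variable needs more than two values before "exponential" means anything.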

nicklecompte · on Feb 24, 2024

Seconding digdugdirk's comment :) Thanks for the thoughtful response and I apologize if I came across as mean.

My problem is we have no clue what those lines actually were. If it was effectively dead code, then it's not surprising that it was untested, and the LLM-generated test wouldn't be valuable to the team. We have no clue what the value of the test actually was, and using a single stat like "lines of code covered" doesn't actually tell us anything. Saying the test was "exponentially more valuable" is pure speculation, and IMO not an especially well-founded one. (Sort of like saying people who write more lines of code are more productive.)

This speculation seems downright irresponsible when the paper specifically emphasizes that this result was a fluke. When the authors said "hit the jackpot" they did not mean "hit the jackpot with a valuable test", they meant "hit the jackpot with an outlier that somewhat artificially juked the stats." I truly believe if the LLM managed to write an unusually valuable test with such broad coverage they would have mentioned it in the qualitative discussion. Instead they went out of their way to dismiss the importance of the 1,326 figure.

engineercodex · on Feb 27, 2024

You're right. I've edited my wording to be more realistic about the value of the test. I believe you're right that the test is not an outlier in terms of value provided.

Some of my comments within the article are more aspirational than realistic in this case, and I've made edits to reflect that.

I want to clarify that I view this LLM as a junior dev that submits PRs that pass presubmits and other verifiable, programmatic checks. A human dev then reviews the PR manually. In this case, the LLM + its processing is used to make sure that no BS is sent out for review - only potential improvements.

In no scenario should its auto-generated code be auto-submitted into the codebase. That becomes a nightmare really fast.
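A minimal Python sketch of that gate, under loud assumptions: the `Candidate` fields, the three checks, and the function name are all hypothetical stand-ins (a real setup would call out to a build system, test runner, and coverage tool). The only point is the shape: a candidate reaches a human reviewer only if every programmatic check passes, and nothing is ever merged automatically.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A hypothetical LLM-generated test change (fields are assumptions)."""
    name: str
    builds: bool        # did the change compile / import cleanly?
    passes: bool        # did the generated test pass when run?
    coverage_gain: int  # extra lines covered, an assumed metric


Check = Callable[[Candidate], bool]

# Assumed programmatic checks, standing in for "presubmits".
CHECKS: List[Check] = [
    lambda c: c.builds,
    lambda c: c.passes,
    lambda c: c.coverage_gain > 0,
]


def ready_for_human_review(
    candidates: List[Candidate], checks: List[Check] = CHECKS
) -> List[Candidate]:
    """Return only the candidates that pass every check.

    This function never merges anything; its output is a queue
    for a human reviewer.
    """
    return [c for c in candidates if all(check(c) for check in checks)]


candidates = [
    Candidate("test_a", builds=False, passes=False, coverage_gain=0),
    Candidate("test_b", builds=True, passes=True, coverage_gain=12),
    Candidate("test_c", builds=True, passes=True, coverage_gain=0),
]

review_queue = ready_for_human_review(candidates)
print([c.name for c in review_queue])
```

Note that passing the gate says nothing about whether a test is valuable; that judgment stays with the human reviewing the PR.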

digdugdirk · on Feb 24, 2024

Thanks for engaging with the above constructive criticism, it's a refreshing change from what is sadly the norm.

One additional question - do you foresee any issues with this application where LLMs enter a non-value-add "doom loop"? I can imagine a scenario where a test generation LLM gets hooked on the lower-value simplistic tests, and yet management sees such a huge increase in the test metric ("100x increase in unit tests in an afternoon? Let's do it again!") that they continue to bloat the test suite to near-infinity. Now we're in a situation where all future training data is a complete cesspool of meaningless tests that technically add coverage, but mostly just to cover an edge case that only an LLM would create.

Not sure if that makes sense, but tl;dr - having LLMs in the loop for both code creation and code testing seems like it's a feedback loop waiting to happen, with what seems like solely negative repercussions for future LLM training data.

engineercodex · on Feb 27, 2024

I could see that, and I wouldn't want LLMs just generating tests willy-nilly with no human oversight. I don't have any trust in the reasoning ability of LLMs at all.

Rather, I prefer to view LLMs as a junior dev that submits PRs that pass presubmits and other verifiable, programmatic checks. A human dev then reviews the PR manually. In this case, the LLM + its processing is used to make sure that no BS is sent out for review - only potential improvements.

samstave · on Feb 24, 2024

Perhaps there should be domains of focus for the test LLMs - even if they are clones, each would be assigned to only one particular domain, and their results would have to go through PRs, etc.

Why not treat every LLM as a dev contributing to git, such that humans or other LLMs need to gatekeep in case something like that happens? (Start by treating them as interns, rather than professors with office hours.)