
In every single system I have worked on, tests were not just tests - they were their own parallel application, and it took careful architecture and constant refactoring to keep them from getting out of hand.

"More tests" is not the goal - you need to write high-impact tests, and you need to think about how to cover as much of your app's surface as possible with the least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).

Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.



> Using an autocomplete to "bang it out" seems foolish.

Based on my own experience, I find the widespread scepticism on HN about AI-assisted coding misplaced. There will be corner cases, there will be errors, and there will be bugs. There will also be apps for which AI is not helpful at all. But that's fine - nobody is saying otherwise. The only question is whether it is a _significant_ net saving on the time spent across various project types. The answer to that is a resounding yes.

The entire set of tests for a web framework I wrote recently was generated with Claude and GPT. You can see them here: https://github.com/webjsx/webjsx/tree/main/src/test

On average, these tests are better than the tests I would have written myself. The project was written mostly by AI as well, like most other stuff I've written since GPT-4 came out.

"Using an autocomplete to bang it out" is exactly what one should do - in most cases.


Ok, but after looking at those tests for just a second (for createElement), you might want to go through them again, or ask the computer to. For example, edgeCases.test.ts is totally redundant - you are running the exact same tests in children.test.ts.

Edit: such an LLM repo... why did it feel the need to recreate these DOM types? Is your AI just trying to maximize LoC? It just seems like such a pain and a potential source of real trouble when these are already available. https://github.com/webjsx/webjsx/blob/main/src%2FjsxTypes.ts


Actually, the file you identified is the (only) one that's mostly human-written. It came from a previous project; I may be able to get rid of it.

But generally, the tests are very useful. My point is that there will be redundancies, and maybe even bugs - and that's fine, because the time needed to fix them is much less than what it would have taken to write the tests from scratch.


I want to share my own experience from a codebase I briefly worked on about ten years ago: in one module, basically all the unit test assertions were commented out. The takeaway is that there should still be someone responsible for the code an LLM generated, and there should still be at least one more person who does a decent code review at some point. Otherwise, having the unit tests there is as useless as in my example above with the assertions removed.


Fully agreed.

It's bad enough when human team members submit useless, brittle tests with their PRs just to satisfy some org pressure to write them. The lazy ones provide a false sense of security even though they neglect critical scenarios, the flaky ones undermine trust in the test output because they intermittently fail for reasons nobody has time to debug, and the pointless ones do nothing but cement the current architecture so it becomes too laborious to refactor anything.

As contextually aware generators, there are doubtless good uses for LLMs in test development, but (as with many other domains) they threaten to amplify an already troubling problem of low-quality, high-volume content spam.


Mostly agree.

My first thought when I read this post was: Is his goal to test the code, or validate the features?

The first problem is that he's providing the code and asking for tests. If his code has a bug, the tests will enshrine that bug. It's like me writing some code and then handing it to a junior colleague with no context, saying "Hey, write some tests for this."

This is backwards. I'm not a TDD guy, but you should think of your test cases independently of your code.


But in a system that exists without tests (this is the real world after all), the current functionality is already enshrined in the app.

Adding tests that capture the current state of things - so that when a bug is uncovered, a test can easily be updated to the correct behavior to prove the bug before fixing it - is a much better place to be than the status quo.
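This approach is often called characterization (or golden-master) testing: the test pins down what the code does today, bugs included, so a later fix starts from a deliberately flipped, failing expectation. A minimal sketch, using a made-up legacy function:

```python
def legacy_slugify(title: str) -> str:
    # Hypothetical legacy function. The double-hyphen behavior below
    # may well be a bug, but it is what ships today.
    return title.strip().lower().replace(" ", "-")

def test_characterize_slugify():
    # Assert what the code DOES today, not what it SHOULD do.
    assert legacy_slugify("Hello World") == "hello-world"
    # Double spaces currently yield double hyphens. If that is later
    # confirmed as a bug, flip this expectation to "a-b", watch the
    # test fail, then fix the implementation.
    assert legacy_slugify("A  B") == "a--b"

test_characterize_slugify()
```

The test suite then documents current behavior, and every future bug fix leaves a proof behind it.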

The horse may have bolted from the barn, but we can at least close the farm gate in the hopes of recapturing it eventually.


Right! AI is going to help you write passing tests - not BREAK your code, which is the whole point of writing tests.


Tests are not just for breaking your code. Writing passing tests is great for regression testing, which I think is the most important kind of unit testing.

If your goal is to break your code, try fuzzing. For some reason, it seems that the only people who do it are in the field of cybersecurity. Fuzzing can do more than find vulnerabilities.
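A toy illustration of the idea outside any security context: instead of asserting specific outputs, generate random inputs and check an invariant. All names here are made up; real fuzzing would use a tool like AFL or Python's hypothesis/atheris.

```python
import random

def rle_encode(s):
    """Toy run-length encoder: the code under test."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

def fuzz(iterations=1000, seed=0):
    # The invariant: decode(encode(x)) must round-trip for ANY input,
    # including empty strings, whitespace, and non-ASCII characters.
    rng = random.Random(seed)
    for _ in range(iterations):
        s = "".join(rng.choice("ab \n\u00e9") for _ in range(rng.randrange(0, 30)))
        assert rle_decode(rle_encode(s)) == s, repr(s)

fuzz()
```

A thousand random inputs against one invariant can shake out edge cases that no hand-picked example list would.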


> not providing any context

You can provide the context to an AI model though, you can share the source with it.


I subscribe to the concept of the "pyramid of tests" - lots of simple unit tests, fewer integration tests, and very few end-to-end tests. I find using LLMs to write unit tests very useful. If the code I just wrote has good naming for classes, methods and variables, has useful comments where necessary, and if I already have other tests the LLM can use as examples of how I test things, then I usually just need to read the generated tests and sometimes add some test cases - just writing the "it should 'this and that'" part for cases which weren't covered.

An added bonus: if the tests aren't what you expect, that often helps you realize the code isn't as clear as it should be.


I also subscribe to a testing pyramid, but IMO the common one is upside down.

You should have a few very granular unit tests where they make the most sense (known dangerous areas, or places where they are very easy to write, e.g. pure analysis code).

More library/service tests: I read in an old config file and check it has the values I expect.

Integration/system tests should be the most common: I spin up the entire app in a container and use the public API to test the application as a whole.

Then, most importantly, automated UI tests: I run the standard customer workflows, and either they work or they don't.

The nice thing is that when you strongly rely on UI and public API tests you can have very strong confidence that your core features actually work. And when there are bugs they are far more niche. And this doesn't require many tests at all.

(We've all been in the situation where the 50,000 unit tests pass and the application is critically broken)
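The "spin up the whole app and hit the public API" style can be sketched like this. In the real workflow described above the app would be started in a container; to keep this self-contained, a tiny in-process HTTP server stands in for it, and all names are illustrative:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class AppHandler(BaseHTTPRequestHandler):
    """Stand-in for the whole application behind its public API."""

    def do_GET(self):
        body = json.dumps({"status": "ok", "path": self.path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet

def run_smoke_test():
    server = HTTPServer(("127.0.0.1", 0), AppHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Exercise only the public API, exactly as a customer would.
        url = f"http://127.0.0.1:{server.server_port}/health"
        with urllib.request.urlopen(url) as resp:
            payload = json.loads(resp.read())
        assert payload["status"] == "ok"
        return payload
    finally:
        server.shutdown()

result = run_smoke_test()
```

One such test exercises routing, serialization, and wiring all at once - exactly the layers where "50,000 unit tests pass but the app is broken" failures hide.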


This is exactly my experience.


Pretty much this, and I prefer the opposite direction: "Here's the new test case from me, make the code pass it" is a decent workflow with Aider.

I get that occasionally there are some really trivial but important tests that take time and would be nice to automate. But that's a minority in my experience.


> "More tests" is not the goal - you need to write high-impact tests, and you need to think about how to cover as much of your app's surface as possible with the least amount of test code.

Are there ways we can measure this?

One idea I’ve had is to collect code coverage separately for each test. If a test doesn’t cover any unique code or branches, maybe it is superfluous - although not necessarily: it can make sense to separately test all the boundary conditions of a function even if doing so doesn’t hit any unique branches.
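A rough sketch of that idea, using `sys.settrace` on a toy function rather than a real coverage tool (coverage.py's "dynamic contexts" feature does this properly). Tests whose covered lines are all covered by other tests get flagged for review:

```python
import sys

def grade(score):
    # Toy function under test.
    if score >= 90:
        return "A"
    if score >= 60:
        return "pass"
    return "fail"

def test_a():
    assert grade(95) == "A"

def test_pass():
    assert grade(70) == "pass"

def test_pass_boundary():
    assert grade(60) == "pass"  # same branch as test_pass: no unique lines

def lines_covered(test_fn):
    """Record which lines of grade() execute while one test runs."""
    covered = set()
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is grade.__code__:
            covered.add(frame.f_lineno)
        return tracer
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)
    return covered

tests = [test_a, test_pass, test_pass_boundary]
coverage = {t.__name__: lines_covered(t) for t in tests}
suspects = [
    name for name, lines in coverage.items()
    if not lines - set().union(*(v for k, v in coverage.items() if k != name))
]
# test_pass and test_pass_boundary cover each other's lines, so BOTH are
# flagged - illustrating the caveat above: "no unique coverage" does not
# automatically mean a test is superfluous (the boundary test is valuable).
```

The flag is a prompt for a human, not an automatic delete list.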

Maybe prefer a smaller test that covers the same code over a bigger one. However, a very DRY test can be more brittle, since it can be non-obvious how to update it to handle a code change. Updating a repetitive test can be laborious, but at least it's reasonably obvious how to do so.

Could an LLM evaluate test quality, if you give it a prompt containing some expert advice on good and bad testing practices?


Sometimes you actually have to think, or hire someone who can. Go join the comments section on the Goodhart's Law post if you want to go on about measuring magical metrics.


> Sometimes you actually have to think, or hire someone who can.

I'm perfectly capable of thinking. Thinking about "how can I create a system which reduces some of my cognitive load on testing so I can spend more of my cognitive resources on other things" is a particularly valuable form of thinking.

> Go join the comments section on the Goodharts Law post to go on about measuring magical metrics.

That problem is when managers take a metric and turn it into a KPI. That doesn't happen to all metrics. I can think of many metrics I've personally collected that no manager ever once gazed upon.

The real measure of a metric's value is how meaningful a domain expert finds it to be. And if the answer to that is "not very" - is that an inherent property of metrics, or a sign that the metric needs to be refined?


Good tests reduce your cognitive load; you can have more confidence that code will work and spend less time worrying that someone will break it.

BTW, I think the above are the best metrics to use for tests. Actually measuring them can be hard, but keeping track of when functionality doesn't work and when people break your code is a good start.

And I think all of this should be measured in terms of doing the right thing business-logic-wise, weighing what needs testing by the business value lost when things don't work.


There is an art to writing tests, especially getting abstraction levels right. For example, do you integration-test hitting the password field with 1000 cases, or do that as a unit test - and does doing it as a unit test sufficiently cover it?
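One common way to resolve that split: put the 1000 value cases in a fast unit test against the validation function itself, and leave the integration test to verify only that the field is wired to it. A sketch with a hypothetical validator, using `unittest`'s `subTest` so each case reports independently:

```python
import unittest

def password_ok(pw):
    # Hypothetical validation rules: the unit under test.
    return (len(pw) >= 8
            and any(c.isdigit() for c in pw)
            and any(c.isalpha() for c in pw))

CASES = [
    ("abcdefg1", True),
    ("short1", False),       # too short
    ("abcdefgh", False),     # no digit
    ("12345678", False),     # no letter
    ("pass word 1", True),   # spaces are allowed
]

class TestPasswordRules(unittest.TestCase):
    def test_cases(self):
        # Hundreds of cases stay cheap here; the integration suite then
        # only needs one or two checks that the field calls this function.
        for pw, expected in CASES:
            with self.subTest(pw=pw):
                self.assertEqual(password_ok(pw), expected)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPasswordRules)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Whether that unit test "sufficiently covers" the field is then a question about the wiring, which is exactly what the one remaining integration test answers.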

AI may be able to do all this thinking in the future, but not yet, I believe!

On top of that, the codebase is likely already a mess of bad practice (I've never seen one that isn't! That is life), so often part of the job is leaving the campground a bit better than how you found it.

LLMs can help now on last-mile stuff: fill in this one test, generate data for 100 test cases, etc.


Great point on focusing on high-impact tests. I agree that LLMs risk giving a false sense of coverage. Maybe a smart strategy is letting them generate the boilerplate tests while we focus on custom edge cases.


Absolutely with you on the need for high-impact tests. I find that humans are still way better at coming up with the tests that actually matter, while AI can handle the implementation faster - especially when there’s a human guiding it.

Keeping a human in the loop is essential, in my experience. The AI does the heavy lifting, but we make sure the tests are genuinely useful. That balance helps avoid the trap of churning out “dumb” tests that might look impressive but don’t add real value.


>Sometimes I spend more time on the test code than the actual code (probably normal).

This seems highly dependent on the kind of project: if you have an MVP and your test code is taking longer than the actual code, then the testing effort is antagonistic to the whole concept of an MVP.


Detecting regressions is the goal. If LLMs can do that for free or cheap, that’s good. It doesn’t have to be complicated.


Totally agree, especially about the need for well-architected, high-impact tests that go beyond just coverage. At Loadmill, we found out pretty early that building AI to generate tests was just the starting point. The real challenge came with making the system flexible enough to handle complex customer architectures. Think of multiple test environments, unique authentication setups, and dynamic data preparation.

There’s a huge difference between using an LLM to crank out test code and having a product that can actually support complex, evolving setups long-term. A lot of tools work great in demos but don’t hold up for these real-world testing needs.

And yeah, this is even trickier for higher-level tests. Without careful design, it’s way too easy to end up with “dumb” tests that add little real value.



