Posted by johnjwang 2 days ago
"More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).
Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.
Based on my own experience, I find the widespread scepticism on HN about AI-assisted coding misplaced. There will be corner cases, there will be errors, and there will be bugs. There will also be apps for which AI is not helpful at all. But that's fine - nobody is saying otherwise. The question is only whether it is a _significant_ net saving on the time spent across various project types. The answer to that is a resounding Yes.
The entire set of tests for a web framework I wrote recently was generated with Claude and GPT. You can see them here: https://github.com/webjsx/webjsx/tree/main/src/test
On average, these tests are better than the tests I would have written myself. The project was written mostly by AI as well, like most other stuff I've written since GPT-4 came out.
"Using an autocomplete to bang it out" is exactly what one should do - in most cases.
It's bad enough when human team members are submitting useless, brittle tests with their PRs just to satisfy some org pressure to write them. The lazy ones provide a false sense of security even though they neglect critical scenarios, the unstable ones undermine trust in the test output because they intermittently raise false negatives that nobody has time to debug, and the pointless ones do nothing but reify architecture so it becomes too laborious to refactor anything.
As contextually aware generators, LLMs doubtless have good uses in test development, but (as with many other domains) they threaten to amplify an already troubling problem with low-quality, high-volume content spam.
My first thought when I read this post was: Is his goal to test the code, or validate the features?
The first problem is he's providing the code, and asking for tests. If his code has a bug, the tests will enshrine those bugs. It's like me writing some code, and then giving it to a junior colleague, not providing any context, and saying "Hey, write some tests for this."
This is backwards. I'm not a TDD guy, but you should think of your test cases independently of your code.
Adding tests that capture the current state of things is a much better place to be than the status quo: when a bug is uncovered, the tests can easily be updated to the correct functionality, proving the bug exists prior to fixing it.
The horse may have bolted from the barn, but we can at least close the farm gate in the hopes of recapturing it eventually.
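"Capture the current state of things" is essentially characterization testing. A minimal sketch, assuming a hypothetical legacy function `legacy_discount` (the function, its quantities, and the suspected boundary bug are all made up for illustration):

```python
import unittest

def legacy_discount(price, qty):
    # Hypothetical legacy function whose exact behavior we want to pin down.
    # Charging full price at qty == 10 may well be a bug, but the
    # characterization test records what the code does *today*.
    return price * qty * (0.9 if qty > 10 else 1.0)

class TestLegacyDiscountCharacterization(unittest.TestCase):
    """Pin current behavior; update deliberately once a bug is confirmed."""

    def test_bulk_orders_get_ten_percent_off(self):
        self.assertEqual(legacy_discount(100, 20), 1800.0)

    def test_boundary_currently_charges_full_price(self):
        # Suspected off-by-one: qty == 10 gets no discount today.
        # When confirmed as a bug, flip this expectation first, watch it
        # fail, then fix the code.
        self.assertEqual(legacy_discount(100, 10), 1000.0)

if __name__ == "__main__":
    unittest.main()
```

The point is that the second test documents the suspicious behavior explicitly, so the later fix starts from a failing test rather than from folklore.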
You can provide the context to an AI model though, you can share the source with it.
An added bonus is that if the tests aren't what you expect, often it helps you understand that the code isn't as clear as it should be.
You should have a few very granular unit tests where they make the most sense (known dangerous areas, or places where they are very easy to write, e.g. analysis).
More library/service tests, e.g. I read in an old config file and check it has the values I expect.
Integration/system tests should be the most common, I spin up the entire app in a container and use the public API to test the application as a whole.
Then, most importantly, automated UI tests: I run the standard customer workflows, and either they work or they don't.
The nice thing is that when you strongly rely on UI and public API tests you can have very strong confidence that your core features actually work. And when there are bugs they are far more niche. And this doesn't require many tests at all.
(We've all been in the situation where the 50,000 unit tests pass and the application is critically broken)
I get that occasionally there are some really trivial but important tests that take time and would be nice to automate. But that's a minority in my experience.
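The "spin up the whole app and hit the public API" style above can be sketched with only the standard library. Here an in-process HTTP server stands in for the containerized app; `ApiHandler` and the `/health` endpoint are made-up examples, not anyone's real API:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ApiHandler(BaseHTTPRequestHandler):
    # Stand-in for the real application's public API.
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging in tests
        pass

def check_health(port):
    # The integration test proper: hit the public endpoint over real HTTP
    # and assert on the contract, not on any internals.
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
        return resp.status, json.load(resp)

server = HTTPServer(("127.0.0.1", 0), ApiHandler)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()
status, payload = check_health(server.server_address[1])
server.shutdown()
server.server_close()
print(status, payload)  # 200 {'status': 'ok'}
```

In a real setup the server would be the actual app in a container and the test would only contain `check_health`, which is why one such test covers so much surface.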
Are there ways we can measure this?
One idea I’ve had is to collect code coverage separately for each test. If a test isn’t covering any unique code or branches, maybe it is superfluous - although not necessarily: it can make sense to separately test all the boundary conditions of a function, even if doing so doesn’t hit any unique branches.
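A toy version of per-test coverage collection, using `sys.settrace` rather than a real tool like coverage.py (the `clamp` function and the three "tests" are invented for the demo):

```python
import sys

def lines_covered(fn, *args):
    """Record which line numbers execute inside fn during one call.
    A toy per-test coverage collector; coverage.py does this properly."""
    covered = set()
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_name == fn.__name__:
            covered.add(frame.f_lineno)
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return covered

def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

below = lines_covered(clamp, -5, 0, 10)   # exercises the x < lo branch
above = lines_covered(clamp, 99, 0, 10)   # exercises the x > hi branch
dup   = lines_covered(clamp, -3, 0, 10)   # same branch as `below`

print(bool(above - below))  # True: `above` reaches lines `below` never did
print(dup <= below)         # True: `dup` covers nothing unique -> candidate to cut
```

Comparing these per-test sets is exactly the superfluousness signal described above - with the stated caveat that `dup` might still be worth keeping as a boundary test.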
Maybe prefer a smaller test that covers the same code to a bigger one. However, a very DRY test can be more brittle, since it can be non-obvious how to update it to handle a code change. Updating a repetitive test can be laborious, but at least it's reasonably obvious how to do so.
Could an LLM evaluate test quality, if you give it a prompt containing some expert advice on good and bad testing practices?
I'm perfectly capable of thinking. Thinking about "how can I create a system which reduces some of my cognitive load on testing so I can spend more of my cognitive resources on other things" is a particularly valuable form of thinking.
> Go join the comments section on the Goodhart's Law post to go on about measuring magical metrics.
That problem arises when managers take a metric and turn it into a KPI. That doesn't happen to all metrics. I can think of many metrics I've personally collected that no manager ever once gazed upon.
The real measure of a metric's value, is how meaningful a domain expert finds it to be. And if the answer to that is "not very" – is that an inherent property of metrics, or a sign that the metric needs to be refined?
BTW, I think above are the best metrics to use for tests. Actually measuring it can be hard, but I think keeping track of when functionality doesn't work and people break your code is a good start.
And I think all of this should be measured in terms of doing the right thing business logic-wise and weighing importance of what needs testing based on the business value of when things don't work.
This seems like the kind of thing that should be highly dependent on the kind of project you're doing: if you have an MVP and your test code is taking longer than the actual code, then the test code is clearly antagonistic to the whole concept of an MVP.
AI may be able to do all this thinking in the future, but not yet, I believe!
Not to mention that the codebase is likely already a mess of bad practice (I've never seen one that isn't! That is life), so often part of the job is leaving the campground a bit better than you found it.
LLMs can help now on last mile stuff. Fill in this one test. Generate data for 100 test cases. Etc.
I was, however, extremely impressed with Claude this time around. Not only did it do a great job off the bat, but it taught me some techniques and tricks available in the language/framework (Ruby, Rspec) which I wasn't familiar with.
I'm certain that it helped having a decent prompt, asking it to consider all the potential user paths and edge cases, and also having a very good understanding of the code myself. Still, this was the first time for me I could honestly say that an LLM actually saved me time as a developer.
We are really already past the point of being able to discuss these matters though in large groups.
The herd speaks as if all LLMs on all programming languages are basically the same.
It is an absurdity. Talking to the herd is mostly for entertainment at this point. If I actually want to learn something, I will ask Sonnet.
In the past I've been involved in several projects deeply using MDA (Model Driven Architecture) techniques which used various code generation methods to develop software. One of the main obstacles was the problem of maintaining the generated code.
IOW: how should we treat generated code?
If we treat it the same way as code produced by humans (i.e. we maintain it), then the maintenance cost grows (super-linearly) with the amount of code we generate. To make matters worse for LLMs: since the code they generate is buggy, we have more buggy code to maintain. Code review is not the answer, because code review is very weak at finding bugs.
This is unlike compilers (that also generate code) because we don't maintain code generated by compilers - we regenerate it anytime we need.
The fundamental issue is: for a given set of requirements the goal is to produce less code, not more. _Any_ code generation (however smart it might be) goes against this goal.
EDIT: typos
Refactoring is harder, especially if it's not clear why a test is in place. I've seen many developers disable tests simply because they could not understand how, or why, to fix them.
I'm hopeful that LLMs can provide guidance in removing useless tests or simplifying things. In an ideal future they may even help in formulating requirements or design documentation.
I am very sceptical here as well. The biggest problem with formulating requirements or design documentation is translation from informal to formal language. In other words... writing programs.
LLMs are good at generating content that doesn't provide useful information (i.e. has low information content). Their usefulness right now comes from the fact that people are used to reading a lot of text and distilling information from it (i.e. all the useless e-mails formulated in corporate language, all the multi-page requirement documents written in human-readable form). The job of a software engineer is to extract information from low-information-content text and write it down in a formal language.
In this context:
What I expect in the long run is that people will start to value concise, high-information-content text. And obviously, that cannot be generated by any LLM, because an LLM cannot provide any information by itself. There is really no point in: provide short, high-information-content text (i.e. a prompt) to the LLM -> receive long, low-information-content text from the LLM -> extract information from the long text.
If you need to change behaviour of generated code you need to change your generator to provide the right hooks.
Obviously none of this applies to "AI" generated code because the "AI" generator is not deterministic and will hallucinate different bugs from run to run. You must treat "AI" generated code as if it was written by the dumbest person you've ever worked with.
If you have injected services in your current service, the LLM doesn't know anything about them, so it makes poor guesses. You have to bring them into context so they can be mocked properly.
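For context, this is the kind of mocking at stake. A sketch with `unittest.mock`; the service names (`InvoiceHandler`, `billing_service`, `open_invoices`) are hypothetical, and the point is that an LLM can only produce the `return_value` line correctly if it has seen the dependency's interface:

```python
from unittest.mock import Mock

class InvoiceHandler:
    """A service with an injected dependency. Without the dependency's
    source in context, an LLM has to guess what to mock."""
    def __init__(self, billing_service):
        self.billing = billing_service

    def total_due(self, customer_id):
        invoices = self.billing.open_invoices(customer_id)
        return sum(inv["amount"] for inv in invoices)

# Replace the injected service with a mock that mirrors its real interface.
billing = Mock()
billing.open_invoices.return_value = [{"amount": 40}, {"amount": 2}]

handler = InvoiceHandler(billing)
assert handler.total_due("cust-1") == 42
billing.open_invoices.assert_called_once_with("cust-1")
```

If the model never saw `open_invoices` or its return shape, it will invent a plausible-looking but wrong mock, which is exactly the guidance overhead the comment describes.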
You end up spending a lot of time guiding the LLM, so it's not measurably faster than writing test by hand.
I want my prompt to be: "write unit tests for XYZ method" without having to accurately describe in the prompt what the method does, how it does it, and why it does it. Writing too many details in the prompt takes the same time as writing the code myself.
GitHub Copilot should be better, since it's supposed to have access to your entire code base. But somehow it doesn't look at dependencies, and it only uses its knowledge of the codebase for stylistic purposes.
It's probably my fault, there are for sure better ways to use LLMs for code, but I am probably not the only one who struggles.
Happy to open source if anyone is interested.
The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.
LLMs are good at instantiating typical or normal patterns (based on their training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea with an important hazard attached: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.
Using LLMs allows us to have much higher coverage than if we didn't use it. To me and our engineering team, this is a pretty good thing because in the time prioritization matrix, if I can get a higher quality code base with higher test coverage with minimal extra work, I will definitely take it (and in fact it's something I encourage our engineering teams to do).
Most of the base tests that we use were created originally by some of our best engineers. The patterns they developed are used throughout our code base and LLMs can take these and make our code very consistent, which I also view as a plus.
re: Complacency: We actually haven't found this to be the case. In fact, we've seen more tests being written with this method. Just think about how much easier it is to review a PR and make edits versus writing a PR. You can actually spend your time enforcing higher-quality tests because you don't have to do most of the boilerplate of writing a test.
> which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.
I’ve been working with AI, too. I see what I’m guessing is the same unreliability that you admit in the last part of your article. For some reason, you are sanguine about it, whereas I see it as a serious problem.
You say you aren’t complacent, but your words don’t seem to address the complacency issue. “More tests” does not mean better testing, or even good enough testing.
Google “automation bias” and tell me what policies and procedures or training is in place to avoid it.
Having good tests allows me to be more liberal with LLMs on implementation. I still only use LLMs to bootstrap the implementation, and I finish it myself. LLMs, being generative, are really good for ideating different implementations (it proposes implementations that I would never have thought of), but I never take any implementation as-is -- I always try to step through it and finish it off manually.
Some might argue that it'd be faster if I wrote the entire thing myself, but it depends on the problem domain. So much of what I do involves implementing code for unsolved problems (I'm not writing CRUD apps, for instance) that I really do get a speed-up from LLMs.
I imagine folks writing conventional code might spend more time fixing LLM mistakes and thus think that LLMs slow them down. But this is not true for my problem domain.
If you don't understand how the code works, don't approve it.
Sure, complacent developers will get burned. They'll find plenty of other non-AI ways to burn themselves too.
I do think that LLMs will increase the volume of bad code though. I use Cursor a lot, and occasionally it will produce perfect code, but often I need to direct and refine, and sometimes throw away. But I'm sure many devs will get lazy and just push once they've got the thing working...
I think the issue is that we are currently being sold that it is. I'm blown away by how useful AI is, and how stupid it can be at the same time. Take a look at the following example:
https://app.gitsense.com/?doc=f7419bfb27c896&highlight=&othe...
If you click on the sentence, you can see how dumb Sonnet-3.5 and GPT-4 can be. Each model was asked to spell-check and grammar-check the sentence 5 times each, and you can see that GPT-4o-mini was the only one that got this right all 5 times. The other models mostly got it comically wrong.
I believe LLM is going to change things for the better for developers, but we need to properly set expectations. I suspect this will be difficult, since a lot of VC money is being pumped into AI.
I also think a lot of mistakes can be prevented if your prompt asks the model to explain how and why it did what it did. For example, the prompt used in the blog post should include "After writing the test, summarize how each rule was applied."
The message that these systems are flawed appears to be pretty universal to me:
ChatGPT footer: "ChatGPT can make mistakes. Check important info."
Claude footer: "Claude can make mistakes. Please double-check responses."
https://www.meta.ai/ "Messages are generated by AI and may be inaccurate or inappropriate."
etc etc etc.
I still think the problem here is science fiction. We have decades of sci-fi telling us that AI systems never make mistakes, but instead will cause harm by following their rules too closely (paperclip factories, 2001: A Space Odyssey etc).
Turns out the actual AI systems we have make mistakes all the time.
I'd say parent is absolutely correct - we ARE being sold (quite literally, through promotional material, i.e. ads) that these models are way more capable than they actually are.
I do see your science fiction angle, but I think the bigger issue is the media, VCs, etc. are not clearly spelling out that we are nowhere near science fiction AI.
What absolute nonsense. What an absurd false equivalence. It's not that we expect perfection or even human level performance from "AI". It's that the crap that comes out of LLMs is not even at the level of a first year student. I've never in my entire life reviewed the code of a junior engineer and seen them invent third party APIs from whole cloth. I've never had a junior send me code that generates a payload that doesn't validate at the first layer of the operation with zero manual testing to check it. No junior has ever asked me to review a pull request containing references to an open source framework that doesn't exist anywhere in my application. Yet these scenarios are commonplace in "AI" generated code.
If an LLM hallucinates a method that doesn't exist I find out the moment I try and run the code.
If I'm using ChatGPT Code Interpreter (for Python) or Claude analysis mode (for JavaScript) I don't even have to intervene: the LLM can run in a loop, generating code, testing that it executes without errors and correcting any mistakes it makes.
I still need to carefully review the code, but the mistakes which cause it not to run at all are by far the least amount of work to identify.
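The loop being described has roughly this shape. A sketch only: `ask_llm` is a hypothetical stand-in for the model call, and the demo swaps in a canned "model" that fixes its mistake on the second try:

```python
import subprocess
import sys
import tempfile

def run_snippet(code):
    """Execute a generated snippet in a subprocess; report success and stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def generate_until_it_runs(ask_llm, prompt, max_attempts=3):
    """Generate code, run it, feed the error back, retry - the
    generate/execute/correct loop described above."""
    feedback = ""
    for _ in range(max_attempts):
        code = ask_llm(prompt + feedback)
        ok, stderr = run_snippet(code)
        if ok:
            return code  # runs cleanly; still needs human review
        feedback = f"\nThe previous attempt failed with:\n{stderr}\nFix it."
    raise RuntimeError("no runnable code after retries")

# Canned "model": first attempt raises NameError, second one runs.
attempts = iter(["print(undefined_name)", "print('ok')"])
code = generate_until_it_runs(lambda _: next(attempts), "print ok")
print(code)  # print('ok')
```

Note what the loop does and does not buy you: it filters out code that fails to execute, which is the cheap class of mistake; it says nothing about whether the surviving code is correct, hence the careful review still required.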
I think the source code for tools like this one is genuinely good code: https://github.com/simonw/tools/blob/main/extract-urls.html
What do you see that's wrong with that?
I can knock out small but useful applications in genuinely less time than it would take me to Google for an existing solution to the same problem.
You can call them dreck if you like. I call (most of) them useful solutions.
... ... ... |_____| <- it's good code, but toy problem
I guess we all see where the goalposts will be tomorrow. Good code, good problem, I don't like the language. Or something :)