LLMs work best when the user defines their acceptance criteria first

Posted by dnw 8 hours ago

LLMs work best when the user defines their acceptance criteria first(blog.katanaquant.com)

210 points | 172 comments

alexhans 40 seconds ago|

> The vibes are not enough. Define what correct means. Then measure.

Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important to understand how much you need to cover upfront.

The Author mentions evaluations, which in this context are often called AI evals [1] and one thing I'd love to see is those evals become a common language of actually provable user stories instead of there being a disconnect between different types of roles, e.g. a scientist, a business guy and a software developer.

The more we can speak a common language and easily write and maintain these no matter which background we have, the easier it'll be to collaborate and empower people and to move fast without losing control.

- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )

pornel 7 hours ago||

Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

unlikelytomato 6 hours ago||

This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.

danparsonson 3 hours ago|||

Yeah that describes most legacy codebases I've worked on XD

Foobar8568 3 hours ago||||

LLM code is higher quality than any codes I have seen in my 20 years in F500. So yeah you need to "guide" it, and ensure that it will not bypass all the security guidance for ex...But at least you are in control, although the cognitive load is much higher as well than just "blind trust of what is delivered".

But I can see the carnage with offshoring+LLM, or "most employees", including so call software engineer + LLM.

thesz 2 hours ago|||

  > LLM code is higher quality than any codes I have seen in my 20 years in F500.

"Any codes"?

Foobar8568 2 hours ago|||

At least my comment hasn't been reviewed or written by a LLM.

And in my French brain, code or codebase is countable and not uncountable.

sebastiennight 1 hour ago|||

As far as I've ever heard, "le code" used in a codebase is uncountable, like "le café" you'd put in a cup, so we would still say "meilleur que tout le code que j'ai vu en 20 ans" and not "meilleur que tous les codes que j'ai vus en 20 ans".

There is a countable "code" (just like "un café" is either a place, or a cup of coffee, or a type of coffee), and "un code" would be the one used as a password or secret, as in "j'ai utilisé tous les codes de récupération et perdu mon accès Gmail" (I used all the recovery codes and lost Gmail access).

Foobar8568 1 hour ago||

You are correct, we generally say le code. To be exact at that time, I was more thinking toutes les lignes de code.

thesz 2 hours ago||||

I guess you can guide it to write in any style.

But what set me off is an universal qualifier: there was no code seen by you that is of equal quality or better that what LLMs generate.

mejutoco 1 hour ago||

cows are brown, from one side.

https://www.neatorama.com/2007/01/22/a-mathematical-cow-joke...

Implicated 2 hours ago|||

I got curious and had to fire up the ol LLM to find out what the story is about the words that aren't pluralized - TIL about countable and uncountable nouns. I wonder if the guy giving you trouble about your English speaks French.

thesz 2 hours ago|||

I speak Russian and some English, but the question was about universal quantification: author declares that LLMs generate code of better quality than "any codes" he seen in his career.

iLoveOncall 1 hour ago|||

I'm native French and nobody would consider code countable. "codes" makes no sense. We'd talk about "lines of code" as a countable in French just like in English.

Implicated 2 hours ago|||

You'll find, at times, that those communicating in a language that's not their primary language will tend to deviate from what one whose it was their primary language might expect.

If that's obvious to you than you're just being rude. If it's not obvious to you, then you'll also find this is a common deviance (plural 'code') from those who come from a particular primary language's region.

Edit; This got me thinking - what is the grammar/rule around what gets pluralized and what doesn't? How does one know that "code" can refer to a single line of code, a whole file of code, a project, or even the entirety of all code your eyes have ever seen without having to have an s tacked on to the end of it?

tsimionescu 2 hours ago|||

"Codes" as a way to refer to programs/libraries is actually common usage in academia and scientific programming, even by native English speakers. I believe, but am not sure, that it may just be relatively old jargon, before the use of "programs" became more common in the industry.

As for the grammar rule, it's the question of whether a word is countable or uncountable. In common industry usage, "code" is an uncountable noun, just like "flour" in cooking (you say 2 lines of code, 1 pound of flour).

It's actually pretty common for the same word to have both countable and uncountable versions, with different, though related, meanings. Typically the uncountable version is used with a measure of quantity, while the countable version denotes different kinds (flours - different types of flour; peoples - different groups of people).

Implicated 2 hours ago||

> Typically the uncountable version is used with a measure of quantity, while the countable version denotes different kinds (flours - different types of flour; peoples - different groups of people).

This was very helpful, thank you! (I had just gotten off the phone with Claude learning about countable and uncountable nouns but those additional details you provided should prove quite valuable)

thesz 2 hours ago||||

The question was about universal quantification, not grammar error.

As if author of the comment had not seen any code that is better or of equal quality of code generated by LLMs.

Implicated 2 hours ago||

Well now I look like an idiot. But I did learn some things! :D My apologies.

thaumasiotes 1 hour ago|||

> what is the grammar/rule around what gets pluralized and what doesn't? How does one know that "code" can refer to a single line of code, a whole file of code, a project, or even the entirety of all code your eyes have ever seen without having to have an s tacked on to the end of it?

Well, the grammar is that English has two different classes of noun, and any given noun belongs to one class or the other. Standard terminology calls them "mass nouns" and "count nouns".

The distinction is so deeply embedded in the language that it requires agreement from surrounding words; you might compare many [which can only apply to count nouns] vs much [only to mass nouns], or observe that there are separate generic nouns for each class [thing is the generic count noun; stuff is the generic mass noun].

For "how does one know", the general concept is that count nouns refer to things that occur discretely, and mass nouns refer to things that are indivisible or continuous, most prototypically materials like water, mud, paper, or steel.

Where the class of a noun is not fixed by common use (for example, if you're making it up, or if it's very rare), a speaker will assign it to one class or the other based on how they internally conceive of whatever they're referring to.

mettamage 1 hour ago|||

Giving it prompts of the Shannon project helps for security

lwansbrough 4 hours ago||||

For me, I'll do the engineering work of designing a system, then give it the specific designs and constraints. I'll let it plan out the implementation, then I give it notes if it varies in ways I didn't expect. Once we agree on a solution, that's when I set it free. The frontier models usually do a pretty good job with this work flow at this point.

YesBox 4 hours ago||||

Heh, people like to have someone else to blame.

iLoveOncall 1 hour ago|||

Really? Because this perfectly explains why it will never replace them: it needs an exact language listing everything required to function as you expect it.

You need code to get it to generate proper code.

abm53 13 minutes ago||

I think GP was a joke about the ability of a typical programmer.

I certainly read it as one and found it funny.

stingraycharles 7 hours ago|||

> If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

Nevermind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.

Mavvie 4 hours ago||

Sounds like my coworkers.

lelanthran 50 minutes ago|||

Maybe, but I'd bet a large sum of money that each of your coworkers aren't turning out this drivel at a rate of 3kLoC per hour.

Can you imagine working with someone who produces 100k lines of unmaintainable code in a single sprint?

This is your future.

Foobar8568 3 hours ago|||

That's the reality nobody really wants to say.

Jweb_Guru 2 hours ago||

It's not reality. I'm really not a fan of the way that people excuse the really terrible code LLMs write by claiming that people write code just as bad. Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.

lukan 14 minutes ago|||

"Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later."

I admire your experience with people.

darkwater 53 minutes ago||||

> it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.

I had a coworker that more or less exactly did that. You left a comment in a ticket about something extra to be done, he answered "yes sure" and after a few days proceeded to close the ticket without doing the thing you asked. Depending on the quantity of work you had at the moment, you might not notice that until after a few months, when the missing thing would bite you back in bitter revenge.

imiric 2 hours ago||||

It's an easy copout.

Tool works as expected? It's superintelligence. Programming is dead.

Tool makes dumb mistake? So do humans.

brabel 1 hour ago||

Yes and both are right. It’s a matter of which is working as expected and making fewer mistakes more often. And as someone using Claude Code heavily now, I would say we’re already at a point where AI wins.

ttoinou 1 hour ago||||

No but they will despise you for bringing the problem up

evolve-maz 1 hour ago|||

[dead]

marginalia_nu 6 hours ago|||

My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

ehnto 5 hours ago|||

Well when you write it manually you are doing the review and sanity checking in real time. For some tasks, not all but definitely difficult tasks, the sanity checking is actually the whole task. The code was never the hard part, so I am much more interested in the evolving of AIs real world problem solving skills over code problems.

I think programming is giving people a false impression on how intelligent the models are, programmers are meant to be smart right so being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text based. When it comes to problem solving I still see them regularly confused by simple stuff, having to reset context to try and straighten it out. It's not a general purpose human replacement just yet.

LPisGood 5 hours ago|||

And it’s slower to review because you didn’t do the hard part of understanding the code as it was being written.

Implicated 5 hours ago||

You're holding it wrong.

Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.

ie: enforce conventions, set specific and measurable/verifiable goals, define skeletons of the resulting solutions if you want/can.

To give an example. I do a lot of image similarity stuff and I wanted to test the Redis VectorSet stuff when it was still in beta and the PHP extension for redis (the fastest one, which is written in C and is a proper language extension not a runtime lib) didn't support the new commands. I cloned the repo, fired up claude code and pointed it to a local copy of the Redis VectorSet documentation I put in the directory root telling it I wanted it to update the extension to provide support for the new commands I would want/need to handle VectorSets. This was, idk, maybe a year ago. So not even Opus. It nailed it. But I chickened out about pushing that into a production environment, so I then told it to just write me a PHP run time client that mirrors the functionality of Predis (pure-php implementation of redis client) but does so via shell commands executed by php (lmao, I know).

Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.

slopinthebag 4 hours ago||

They aren't holding it wrong, it's a fundamental limitation of not writing the code yourself. You can make it easier to understand later when you review it, but you still need to put in that effort.

joquarky 2 hours ago|||

Don't let it deteriorate so far that it can't recover in one session.

Perform regular sessions dedicated to cleaning up tech debt (including docs).

vannevar 6 hours ago|||

I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.

coolius 1 hour ago||

+1 if you are vibe coding projects from scratch. if the architecture you specify doesn't make sense, the llm will start struggling, the only way out of their misery is mocking tests. the good thing is that a complete rewrite with proper architecture and lessons learned is now totally affordable.

disgruntledphd2 47 minutes ago||

I think the best thing about LLMs is how incredibly easy they make it to build one to throw away.

I've definitely built the same thing a few times, getting incrementally better designs each time.

Implicated 5 hours ago|||

Not trying to be snarky, with all due respect... this is a skill issue.

It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

> If you tell them the code is slow

Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't do "no-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.

Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).

> you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

This is an area I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking things that just makes these things go off the rails. The solution to this is the same as the solution to the issue of it keeping digging or chasing it's tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task you state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set the more sticky they'll be.

And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is setup and any time you run into an issue testing related have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.

Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

The more you stick to convention the better off you'll be. And use planning mode.

riffraff 2 hours ago|||

> Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results?

Yes? Why don't you?

They are capable people that just didn't notice something, id I notice some telemetry and tell them "hey this is slow" they are expected to understand the reason(s).

Implicated 2 hours ago|||

So, you observed some telemetry - which would have been some sort of specific metric, right? Wouldn't you communicate that to them as well, not just "it's slow"?

"Hey, I saw that metric A was reporting 40% slower, are you aware already or have any ideas as to what might be causing that?"

Those two approaches are going to produce rather distinctly different results whether you're speaking to a human or typing to a GPU.

bryanrasmussen 1 hour ago||||

Yeah if my co-worker can't start figuring out why the code is slow, with a reasonable reference to what the code in question is, that is a knock against their skills. I would actually expect some ideas as to what the problem is just off the top of their heads, but that the coding agent can't do that isn't a hit against it specifically, this is now a good part of what needs to be done differently.

The suggestion to tell the agent to do performance analysis of the part of the code you think is problematic, and offer suggestions for improvements seems like the proper way to talk to a machine, whereas "hey your code is slow" feels like the proper way to talk to a human.

brabel 46 minutes ago||

As someone who leads a team of engineers, telling someone their code is slow is not nice, helpful or something a good team member should do. It’s like telling them there’s a bug and not explaining what the bug is. Code can be slow for infinite reasons, maybe the input you gave is never expected and it’s plenty fast otherwise. Or the other dev is not senior enough to know where problems may be. It can be you when I tell you your OOP code is super slow, but you only ever done OOP and have no idea how to put data in a memory layouts that avoids cpu cache misses or whatever. So no that’s not the proper way to talk to humans. And AI is only as good as the quality of what you’re asking. It’s a bit like a genie, it will give you what you asked , not what you actually wanted. Are you prepared for the ai to rewrite your Python code in C to speed it up? Can it just add fast libraries to replace the slow ones you had selected? Can it write advanced optimization techniques it learned about from phd thesis you would never even understand?

zabzonk 1 hour ago|||

Well, I would say something like "We seem to be having some performance issues the business has noticed in the XYZ stuff. Shall we sit down together and see if we can work out if we can improve things?"

girvo 1 hour ago||||

I absolutely tell a coworker their code is slow and expect them to fix it…

brabel 59 minutes ago||||

Great answer, and the reason some people have bad experiences is actually patently clear: they don’t work with the AI as a partner, but as a slave. But even for them, AI is getting better at automatically entering planning mode, asking for clarification (what exactly is slow, can you elaborate?), saying some idea is actually bad (I got that a few times), and so on… essentially, the AI is starting to force people to work as a partner and give it proper information, not just tell them “it’s broken, fix it” like they used to do on StackOverflow.

otabdeveloper4 3 hours ago||||

It is not a tool. It is an oracle.

It can be a tool, for specific niche problems: summarization, extraction, source-to-source translation -- if post-trained properly.

But that isn't what y'all are doing, you're engaging in "replace all the meatsacks AGI ftw" nonsense.

Implicated 2 hours ago||

If I was on the "replace all the meatsacks AGI ftw" team then I would have referred to it as an oracle, by your own logic, wouldn't I have?

It's a tool. It's good for some things, not for others. Use the right tool for the job and know the job well enough to know which tools apply to which tasks.

More than anything it's a learning tool. It's also wildly effective at writing code, too. But, man... the things that it makes available to the curious mind are rather unreal.

I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32 powered "CYD" touchscreen LCD that also utilizes a hall effect sensor to monitor, log and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.

I didn't know anything about all this stuff before I started. I didn't AGI myself here. I used a learning tool.

But keep up with your schtick if that's what you want to do.

leptons 1 hour ago||

>I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32 powered "CYD" touchscreen LCD that also utilizes a hall effect sensor to monitor, log and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.

So what? That's honestly amateur hour. And the LLM derived all of it from things that have been done and posted about a thousand times before.

You could have achieved the same thing with a few google searches 15 years ago (obviously not with ESP32, but other microcontrollers).

5o1ecist 1 hour ago|||

[dead]

codebolt 4 hours ago|||

I use the restore checkpoint/fork conversation feature in GitHub Copilot heavily because of this. Most of the time it's better to just rewind than to salvage something that's gone off track.

disgruntledphd2 43 minutes ago||

Yeah I'm a big fan of branching for basically every change, as it provides a known good checkpoint.

leke 2 hours ago|||

i wonder if the solution is to just ask it to refactor its code once it's working.

MadnessASAP 1 hour ago||

You can, and it might make things a bit better. The only real way I've found so far is to start going through file by file, picking it apart.

I wouldn't be surprised if over half my prompts start with "Why ...?", usually followed by "Nope, ... instead”

Maybe the occasional "Fuck that you idiot, throw the whole thing out"

MattGaiser 5 hours ago|||

> If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

Are you using plan mode? I used to experience the do a poor approach and dig issue, but with planning that seems to have gone away?

bryanrasmussen 6 hours ago|||

maybe there should be an LLM trained on a corpus of a deletions and cleanup of code.

ashdksnndck 30 minutes ago|||

I think this is in the training data since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.

krackers 5 hours ago|||

I'm guessing there's a very strong prior to "just keep generating more tokens" as opposed to deleting code that needs to be overcome. Maybe this is done already but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and then do RL training against against each individual patch committed.

movedx01 29 minutes ago||

Perhaps the problem is that you RL on one patch a time, failing to capture the overarching long term theme, an architecture change being introduced gradually over many months, that exists in the maintainer’s mental model but not really explicitly in diffs.

esafak 6 hours ago||

I have run into this too. Some of it is because models lack the big picture; so called agentic search (aka grep) is myopic.

grey-area 1 hour ago||

This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow). Doesn't use is_ipk to identify primary keys, uses fsync on every statement. The problem with larger projects like this even if you are competent is that there are just too many lines of code to read it properly and understand it all. Bravo to the author for taking the time to read this project, most people never will (clearly including the author of it).

I find LLMs at present work best as autocomplete -

The chunks of code are small and can be carefully reviewed at the point of writing

Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete

That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs but ugly code, slow code, duplicate code or incorrect code.

Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.

D-Machine 5 hours ago||

This article is great. And the blog-article headline is interesting, but wrong. LLM's don't in general write plausible code (as a rule) either.

They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

kubb 1 hour ago||

IIRC, the most code in its training data is Python. Closely followed by Web technologies (HTML, JS/TS, CSS). This corresponds to the most abundant developers. Many of them dedicated their entire careers to one technology.

We stubbornly use the same language to refer to all software development, regardless of the task being solved. This lets us all be a part of the same community, but is also a source of misunderstanding.

Some of us are prone to not thinking about things in terms of what they are, and taking the shortcut of looking at industry leaders to tell us what we should think.

These guys consistently, in lockstep, talk about intelligent agents solving development tasks. Predominately using the same abstract language that gives us an illusion of unity. This is bound to make those of us solving the common problems believe that the industry is done.

ozozozd 5 hours ago||

Exactly. It’s also easy to find yourself in the out-of-distribution territory. Just ask for some tree-sitter queries and watch Gemini 3, Opus 4.5 and GLM 5 hallucinate new directives.

ehnto 3 hours ago|||

I think this could be the key difference in how people are experiencing the tools. Using Claude in industries full of proprietary code is a totally different experience to writing some React components, or framework code in C#, PHP or Java. It's shockingly good at the later, but as you get into proprietary frameworks or newer problem domains it feels like AI in 2023 again, even with the benefit of the agentic harnesses and context augments like memory etc.

simianwords 1 hour ago|||

Any example of how I can get it to hallucinate?

consumer451 56 minutes ago||

Nitpick/question: the "LLM" is what you get via raw API call, correct?

If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.

I keep trying to explain this in meetings, and in rando comments. If I am not way off-base here, then what should be the term, or terms, be? LLM-based agents?

dragonwriter 46 minutes ago||

> Nit pick/question: The LLM is what you get via raw API call, correct?

You always need a harness of some kind to interact with an LLM. Normal web APIs (especially for hosted commercial systems) wrapped around LLMs are non-minimal harnesses, that have built in tools, interpretation of tool calls, application of what is exposed in local toolchains as “prompt templates” to transform the context structure in the API call into a prompt (in some cases even supporting managing some of the conversation state that is used to construct the prompt on the backend.)

> If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

You are essentially always using something more than an LLM (unless “you” are the person writing the whole software stack, and the only thing you are consuming is the model weights, or arguably a truly minimal harness that just takes setting and a prompt that is not transformed in any way before tokenization, and returns the result after no transformations or filtering other than mapping back from tokens to text.)

But, yes, if you are using an elaborate frontend of the type you enumerate (whether web or CLI or something else), you are probably using substantially more stuff on top of the LLM than if you are using the providers web API.

consumer451 43 minutes ago||

In meetings, I try to explain the roles of system prompts, agentic loops, tool calls, etc in the products I create, to the stakeholders.

However, they just look at the whole thing as "the LLM," which carries specific baggage. If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent many very smart folks who are not practitioners from saying dumb-sounding stuff.

staplers 26 minutes ago||

  If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent very smart folks from outside the field from saying dumb-sounding stuff.

This is an example of why LLMs won't displace engineers as severely as many think. There are very old solved processes and hyper-efficient ways of building things in the real world that still require a level of understanding many simply don't care or want to achieve.

xlth 54 minutes ago||

You're not off-base at all. The way I think about it:

- LLM = the model itself (stateless, no tools, just text in/text out) - LLM + system prompt + conversation history = chatbot (what most people interact with via ChatGPT, Claude, etc.) - LLM + tools + memory + orchestration = agent (can take actions, persist state, use APIs)

When someone says "LLMs have no memory" they're correct about the raw model, but Claude Code or Cursor are agents - they have context, tool access, and can maintain state across interactions.

The industry seems to be settling on "agentic system" or just "agent" for that last category, and "chatbot" or "assistant" for the middle one. The confusion comes from product names (ChatGPT, Claude) blurring these boundaries - people say "LLM" when they mean the whole stack.

flerchin 7 hours ago||

Yes plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep hidden requirements.

g947o 7 hours ago|

Attributing these to "hidden requirements" is a slippery slope.

My own experience using Claude Code and similar tools tells me that "hidden requirements" could include:

* Make sure DESIGN.md is up to date

* Write/update tests after changing source, and make sure they pass

* Add integration test, not only unit tests that mock everything

* Don't refactor code that is unrelated to the current task

...

These are not even project/language specific instructions. They are usually considered common sense/good practice in software engineering, yet I sometimes had to almost beg coding agents to follow them. (You want to know how many times I have to emphasize don't use "any" in a TypeScript codebase?)

People should just admit it's a limitation of these coding tools, and we can still have a meaningful discussion.

flerchin 7 hours ago||

Yeah I agree generally that the most banal things must be specified, but I do think that a single sentence in the prompt "Performance should be equivalent" would likely have yielded better results.

jqpabc123 6 hours ago||

LLMs have no idea what "correct" means.

Anything they happen to get "correct" is the result of probability applied to their large training database.

Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in it's training data. The user has no way to know if this is the case so they are basically flying blind and hoping for the best.

Relying on an LLM for anything "serious" is a liability issue waiting to happen.

A1kmm 48 minutes ago||

Yes Transformer models are non-deterministic, but it is absolutely not true that they can't generalise (the equivalent of interpolation and extrapolation in linear regression, just with a lot more parameters and training).

For example, let's try a simple experiment. I'll generate a random UUID:

> uuidgen 44cac250-2a76-41d2-bbed-f0513f2cbece

Now it is extremely unlikely that such a UUID is in the training set.

Now I'll use OpenCode with "Qwen3 Coder 480B A35B Instruct" with this prompt: "Generate a single Python file that prints out the following UUID: "44cac250-2a76-41d2-bbed-f0513f2cbece". Just generate one file."

It generates a Python file containing 'print("44cac250-2a76-41d2-bbed-f0513f2cbece")'. Now this is a very simple task (with a 480B model), but it solves a problem that is not in the training data, because it is a generalisation over similar but different problems in the training data.

Almost every programming task is, at some level of abstraction, and with different levels of complexity, an instance of solving a more general type of problem, where there will be multiple examples of different solutions to that same general type of problem in the training set. So you can get a very long way with Transformer model generalisations.

tonypapousek 6 hours ago|||

It’s a shame of bulk of that training data is likely 2010s blogspam that was poor quality to begin with.

2god3 6 hours ago||

But isn't that a reflection of reality?

If you've made a significant investment in human capital, you're even more likely to protect it now and prevent posting valuable stuff on the web.

simianwords 1 hour ago|||

This is easily proven incorrect. Just go to ChatGPT and say something incorrect and ask it to verify. Why do people still believe this type of thing?

girvo 1 hour ago||

And yet models get things wrong all the time, too.

simianwords 58 minutes ago||

That’s what I would expect even if it can have the concept of truth. Like humans.

2god3 6 hours ago|||

Aye. I wish more conversations would be more of this nature - in that we should start with basic propositions - e.g. the thing does not 'know' or 'understand' what correct is.

LarsDu88 6 hours ago||

This is about to change very soon. Unlike many other domains (such as greenfield scientific discovery), most coding problems for which we can write tests and benchmarks are "verifiable domains".

This means an LLM can autogenerated millions of code problem prompts, attempt millions of solutions (both working and non-working), and from the working solutions, penalize answers that have poor performance. The resulting synthetic dataset can then be used as a finetuning dataset.

There are now reinforcement finetuning techniques that have not been incorporated into the existing slate of LLMs that will enable finetuning them for both plausibility AND performance with a lot of gray area (like readability, conciseness, etc) in between.

What we are observing now is just the tip of a very large iceberg.

2god3 6 hours ago||

Lets suppose whatever you say is true.

If Im the govt, Id be foaming at the mouth - those projects that used to require enormous funding now will supposedly require much less.

Hmmm, what to do? Oh I know. Lets invest in Digital ID-like projects. Fun.

LarsDu88 3 hours ago||

It is true. Here is the publication going over how to generate this type of dataset and finetune: https://arxiv.org/pdf/2506.14245

I don't think you grasp my statement. LLMs will exceed humans greatly for any domain that is easy to computationally verify such as math and code. For areas not amenable to deterministic computations such as human biology, or experimental particle physics, progress will be slower

comex 7 hours ago||

Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):

https://news.ycombinator.com/item?id=47176209

seanmcdirmid 4 hours ago||

I'm using an LLM to write queries ATM. I have it write lots of tests, do some differential testing to get the code and the tests correct, and then have it optimize the query so that it can run on our backend (and optimization isn't really optional since we are processing a lot of rows in big tables). Without the tests this wouldn't work at all, and not just tests, we need pretty good coverage since if some edge case isn't covered, it likely will wash out during optimization (if the code is ever correct about it in the first place). I've had to add edge cases manually in the past, although my workflow has gotten better about this over time.

I don't use a planner though, I have my own workflow setup to do this (since it requires context isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well known optimizations), it might work.

88j88 4 hours ago|

100% I found that you think you are smarter than the LLM and knowing what you want, but this is not the case. Give the LLM some leeway to come up with solution based on what you are looking to achieve- give requirements, but don't ask it to produce the solution that you would have because then the response is forced and it is lower quality.

More comments...