LLMs work best when the user defines their acceptance criteria first

Posted by dnw 9 hours ago

LLMs work best when the user defines their acceptance criteria first(blog.katanaquant.com)

210 points | 172 commentspage 2

gormen 4 hours ago|

Excellent article. But to be fair, many of these effects disappear when the model is given strict invariants, constraints, and built-in checks that are applied not only at the beginning but at every stage of generation.

lukeify 8 hours ago||

Most humans also write plausible code.

tartoran 8 hours ago||

LLMs piggyback on human knowledge encoded in all the texts they were trained on without understanding what they're doing.

Humans would execute that code and validate it. From plausible it'd becomes hey, it does this and this is what I want. LLMs skip that part, they really have no understanding other than the statistical patterns they infer from their training and they really don't need any for what they are.

red75prime 4 hours ago|||

Could we stop using vague terms like “understanding” when talking about LLMs and machine learning? You don't know what understanding is. You only know how it feels to understand something.

It's better to describe what you can do that LLMs currently can't.

stevenhuang 3 hours ago||

At least it's an easy way for those who don't know that they're talking about to out themselves.

If they'd bother to see how modern neuroscience tries to explain human cognition they'd see it explained in terms that parallel modern ML. https://en.wikipedia.org/wiki/Predictive_coding

We only have theories for what intelligence even means, I wouldn't be surprised there are more similarities than differences between human minds and LLMs, fundamentally (prediction and error minimization)

owlninja 8 hours ago||||

They probably at least look at the docs?

stevenhuang 7 hours ago|||

LLMs can execute code and validate it too so the assertions you've made in your argument are incorrect.

What a shame your human reasoning and "true understanding" led you astray here.

gitaarik 3 hours ago|||

All code is plausible by design

einrealist 3 hours ago||

> SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.

Someone (with deep pockets to bear the token costs) should let Claude run for 26 months to have it optimize its Rust code base iteratively towards equal benchmarks. Would be an interesting experiment.

The article points out the general issue when discussing LLMs: audience and subject matter. We mostly discuss anecdotally about interactions and results. We really need much more data, more projects to succeed with LLMs or to fail with them - or to linger in a state of ignorance, sunk-cost fallacy and supressed resignation. I expect the latter will remain the standard case that we do not hear about - the part of the iceberg that is underwater, mostly existing within the corporate world or in private GitHubs, a case that is true with LLMs and without them.

In my experience, 'Senior Software Engineer' has NO general meaning. It's a title to be awarded for each participation in a project/product over and over again. The same goes for the claim: "Me, Senior SWE treat LLMs as Junior SWE, and I am 10x more productive." Imagine me facepalming every time.

helsinki 5 hours ago||

That's why I added an invariant tool to my Go agent framework, fugue-labs/gollem:

https://github.com/fugue-labs/gollem/blob/main/ext/codetool/...

FrankWilhoit 8 hours ago||

Enterprise customers don't buy correct code, they buy plausible code.

kibwen 8 hours ago||

Enterprise customers don't buy plausible code, they buy the promise of plausible code as sold by the hucksters in the sales department.

2god3 8 hours ago|||

They're not buying code.

They are buying a service. As long as the service 'works' they do not care about the other stuff. But they will hold you liable when things go wrong.

The only caveat is highly regulated stuff, where they actually care very much.

marginalia_nu 8 hours ago||

I think SolarWinds would have preferred correct code back in 2020.

qup 8 hours ago||

Okay, but what did they buy?

marginalia_nu 8 hours ago||

Code, from their employees.

spullara 2 hours ago||

human developers work best when the user defines their acceptance criteria first.

sim04ful 4 hours ago||

I've noticed a key quality signal with LLM coding is an LOC growth rate that tapers off or even turns negative.

raw_anon_1111 7 hours ago||

The difference for me recently

Write a lambda that takes an S3 PUT event and inserts the rows of a comma separated file into a Postgres database.

Naive implementation: download the file from s3 and do a bulk insert - it would have taken 20 minutes and what Claude did at first.

I had to tell it to use the AWS sql extension to Postgres that will load a file directly from S3 into a table. It took 20 seconds.

I treat coding agents like junior developers.

datagobes 37 minutes ago||

Same pattern in data engineering generally. LLMs default to the obvious row-by-row or download-then-insert approach and you have to steer them toward the efficient path (COPY, bulk loaders, server-side imports). Once you name the right primitive, they execute it correctly, permissions and all, as you found.

The deeper issue is that "efficient ingest" depends heavily on context that's implicit in your setup: file sizes, partitioning, schema evolution expectations, downstream consumers. A Lambda doing direct S3-to-Postgres import is fine for small/occasional files, but if you're dealing with high-volume event-driven ingestion you'll hit connection pool pressure fast on RDS. At that point the conversation shifts to something like a queue buffer or moving toward a proper staging layer (S3 → Redshift/Snowflake/Databricks with native COPY or autoloader). The LLM won't surface that tradeoff unless you explicitly bring it up. It optimizes for the stated task, not for the unstated architectural constraints.

svpyk 7 hours ago|||

Unlike junior developers, llms can take detailed instructions and produce outstanding results at first shot a good number of times.

conception 7 hours ago||

Did you ask it to research best practices for this method, have an adversarial performance based agent review their approach or search for performant examples of the task first? Relying on training data only will always get your subpar results. Using “What is the most performant way to load a CSV from S3 into PostgreSQL on RDS? Compare all viable and research approaches before recommending one.” gave me the extension as the top option.

raw_anon_1111 6 hours ago||

I knew the best way. I was just surprised that Claude got it wrong. As soon as I told it to use the s3 extension, it knew to add the appropriate permissions, to update my sql unit script to enable the extension and how to write the code

marginalia_nu 9 hours ago||

I tried to make Claude Code, Sonnet 4.6, write a program that draws a fleur-de-lis.

No exaggeration it floundered for an hour before it started to look right.

It's really not good at tasks it has not seen before.

hrmtst93837 29 minutes ago||

The model stumbles when asked to invent procedural geometry it has rarely tokenized because LLMs predict tokens, not precise coordinate math. For reliable output define acceptance criteria up front and require a strict format such as an SVG path with absolute coordinates and explicit cubic Bezier control points, plus a tiny rendering test that checks a couple of landmark pixels.

Break the job into microtasks, ask for one petal as a pair of cubic Beziers with explicit numeric control points, render that snippet locally with a simple rasterizer, then iterate on the numbers. If determinism matters accept the tradeoff of writing a small generator using a geometry library like Cairo or a bezier solver so you get reproducible coordinates instead of watching the model flounder for an hour.

ehnto 8 hours ago|||

Even with well understood languages, if there isn't much in the public domain for the framework you're using it's not really that helpful. You know you're at the edges of its knowledge when you can see the exact forum posts you are looking at showing up verbatim in it's responses.

I think some industries with mostly proprietary code will be a bit disappointing to use AI within.

jshmrsn 8 hours ago|||

Considering that a fleur-de-lis involves somewhat intricate curves, I think I'd be pretty happy with myself if I could get that task done in an hour.

Given a harness that allows the model to validate the result of its program visually, and given the models are capable of using this harness to self correct (which isn't yet consistently true), then you're in a situation where in that hour you are free to do some other work.

A dishwasher might take 3 hours to do for what a human could do in 30 minutes, but they're still very useful because the machine's labor is cheaper than human labor.

marginalia_nu 8 hours ago||

I didn't provide any constraints on how to draw it.

TBH I would have just rendered a font glyph, or failing that, grabbed an image.

Drawing it with vector graphics programmatically is very hard, but a decent programmer would and should push back on that.

zeroxfe 8 hours ago||

> TBH I would have just rendered a font glyph, or failing that, grabbed an image.

If an LLM did that, people would be all up in arms about it cheating. :-)

For all its flaws, we seem to hold LLMs up to an unreasonably high bar.

marginalia_nu 8 hours ago||

That's the job description for a good programmer though. Question assumptions and requirements, and then find the simplest solution that does the job.

Just about anyone can eventually come up with a hideously convoluted HeraldicImageryEngineImplFactory<FleurDeLis>.

comex 8 hours ago|||

LLMs are really bad at anything visual, as demonstrated by pelicans riding bicycles, or Claude Plays Pokémon.

Opus would probably do better though.

tartoran 8 hours ago||

How could they be any good at visuals? They are trained on text after all.

comex 8 hours ago|||

Supposedly the frontier LLMs are multimodal and trained on images as well, though I don't know how much that helps for tasks that don't use the native image input/output support.

Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:

https://simonwillison.net/tags/pelican-riding-a-bicycle/

But they're still not very good.

tartoran 8 hours ago||

I have to admit I'm seeing this for the first time and am somewhat impressed by the results and even think they will get better with more training, why not... But are these multimodal LLMs still LLMs though? I mean, they're still LLMs but with a sidecar that does other things and the training of the image takes place outside the LLMs so in a way the LLMs still don't "know" anything about these images, they're just generating them on the fly upon request.

boxedemp 7 hours ago||

Maybe we should drop one of the L's

astrange 8 hours ago||||

Claude is multimodal and can see images, though it's not good at thinking in them.

msephton 8 hours ago||||

Shapes can be described as text or mathematical formulas.

tempest_ 8 hours ago|||

An SVG is just text.

internet2000 8 hours ago|||

I got Opus 4.6 to one shot it, took 5-ish mins. "Write me a python program that outputs an svg of a fleur-de-lis. Use freely available images to double check your work."

It basically just re-created the wikipedia article fleur-de-lis, which I'm not sure proves anything beyond "you have to know how to use LLMs"

64738 6 hours ago|||

Just for reference, Codex using GPT-5.4 and that exact prompt was a 4-shot that took ten minutes. The first result was a horrific caricature. After a slight rebuke ("That looks terrible. Read https://en.wikipedia.org/wiki/Fleur-de-lis for a better understanding of what it should look like."), it produced a very good result but it then took two more prompts about the right side of the image being clipped off before it got it right.

robertcope 7 hours ago|||

Same, I used Sonnet 4.6 with the prompt, "Write a simple program that displays a fleur-de-lis. Python is a good language for this." Took five or six minutes, but it wrong a nice Python TK app that did exactly what it was supposed to.

scuff3d 6 hours ago|||

I tried to use Codex to write a simple TCP to QUIC proxy. I intentionally kept the request fairly simple, take one TCP connection and map it to a QUIC connection. Gave a detailed spec, went through plan mode, clarified all the misunderstandings, let it write it in Python, had it research the API, had it write a detailed step by step roadmap... The result was a fucking mess.

Beyond the fact that it was "correct" in the same way the author of the article talked about, there was absolutely bizarre shit in there. As an example, multiple times it tried to import modules that didn't exist. It noticed this when tests failed, and instead of figuring out the import problem it add a fucking try/except around the import and did some goofy Python shenanigans to make it "work".

tartoran 9 hours ago||

Have you tried describing to Claude what it is? The more the detail the better the result. At some point it does become easier to just do it yourself.

parvardegr 3 hours ago|||

agreed with part that at some point it's better to just do it yourself but for sure they will get better and better

marginalia_nu 8 hours ago||||

It knows what it is, it's a very well known symbol. But translating that knowledge to code is something else.

Interesting shortcoming, really shows how weak the reasoning is.

cat_plus_plus 8 hours ago||

Try writing code from description without looking at the picture or generated graphics. Visual LLM with a suggestion to find coordinates of different features and use lines/curves to match them might do better.

vdfs 9 hours ago|||

Most people just forget to tell it "make it quick" and "make no mistake"

mekael 8 hours ago|||

I’m unable to determine if you’re missing /s or not.

tartoran 8 hours ago|||

That's kind of foolish IMO. How can an open ended generic and terse request satisfy something users have in mind?

More comments...