Posted by aray07 19 hours ago
Even better though - external test suites. I recently made an S3 server that the LLM made quick work of for the MVP. Then I found a Ceph S3 test suite that I could run against it and oh boy. Ended up working really well as TDD though.
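The general shape of this (names and the in-memory stand-in are mine, not the Ceph suite's) is that a conformance suite is written against an interface, so the same tests can exercise any implementation, including an AI-generated MVP:

```python
# Sketch of the external-test-suite pattern. MemoryS3 is a hypothetical
# stand-in for the MVP server; the suite knows nothing about its internals.

class MemoryS3:
    def __init__(self):
        self.objects = {}

    def put_object(self, bucket, key, body):
        self.objects[(bucket, key)] = body

    def get_object(self, bucket, key):
        return self.objects[(bucket, key)]


def run_conformance_suite(client):
    """External suite: only talks to the public interface."""
    client.put_object("b", "k", b"hello")
    assert client.get_object("b", "k") == b"hello"
    return "suite passed"


print(run_conformance_suite(MemoryS3()))
```

Swapping in a real HTTP client pointed at the server gives you the TDD loop described above: run the suite, fix what fails, repeat.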
the part that doesn't get talked about enough: most people are hitting a single provider API and treating it as fixed cost. but inference pricing varies a lot across providers for the same model. we've seen 3-5x spreads for equivalent quality on commodity models.
so half the cost problem is architectural (don't let agents spin unboundedly) and the other half is just... shopping around. not glamorous but real.
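To make the spread concrete, here's a back-of-the-envelope comparison. The per-million-token prices and monthly volume below are invented for illustration, not real provider quotes:

```python
# Hypothetical per-million-token prices for the *same* commodity model
# on three providers (numbers are made up for illustration only).
prices_per_mtok = {"provider_a": 0.20, "provider_b": 0.55, "provider_c": 0.90}

monthly_tokens = 2_000_000_000  # 2B tokens/month, also an assumption

# Monthly bill per provider.
costs = {p: rate * monthly_tokens / 1_000_000 for p, rate in prices_per_mtok.items()}

cheapest, priciest = min(costs.values()), max(costs.values())
print(costs)
print(f"spread: {priciest / cheapest:.1f}x")  # 4.5x here, inside the 3-5x range
```

At this volume the gap between the cheapest and priciest provider is real money every month, which is the whole "shopping around" argument.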
Not a rhetorical question. Trillion-token burners and such.
Since you have to test that manually anyway, you can have the AI write the code first; you test it; and if it produces the right result, you tell the AI this is correct, so it should write test cases for that result.
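That workflow is essentially characterization (golden) testing: run the code once, verify the output by hand, then freeze it in an assertion. A minimal sketch, where `slugify` is a hypothetical stand-in for the AI-written code you just verified:

```python
def slugify(title):
    # Imagine this is the AI-generated function you tested manually.
    return "-".join(title.lower().split())


def test_slugify_matches_verified_output():
    # After confirming the output by hand, lock it in so regressions surface.
    assert slugify("Hello World Again") == "hello-world-again"


test_slugify_matches_verified_output()
print("characterization test passed")
```

The test doesn't prove the behavior is *right* in the abstract; it proves the behavior stays the same as the version a human signed off on.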
This resonates with my experience, and it's also a refreshingly honest take: pushing back on heavy upfront process isn't laziness, it's just an engineer's natural drive to build things and feel productive.
> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.
> I care about this. I don't want to push slop
They clearly didn't care about that. They only cared about non-stop line-of-code generation and shipping anything fast. Otherwise they wouldn't have needed weeks to realise they weren't reading or testing this code - it's obvious from the outset.
Maybe their approach to this changed and that's fine, but at the beginning they very much did not care, and I feel people only keep saying they do because otherwise they'd have to be the one to admit the emperor isn't wearing clothes.
But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.
The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.
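One way to do that (my assumption about a mechanism, not something the comment prescribes) is commit-message trailers: the criteria live in `git log`, so they travel with the code and any agent can recover them later. The repo, commit message, and trailer name here are all invented for illustration:

```shell
# Demo in a throwaway repo so the commands are self-contained.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ship the feature with its acceptance criteria as a trailer
# in the commit message (second -m paragraph).
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty \
    -m "Add bulk-delete endpoint" \
    -m "Acceptance-Criteria: deleting 1000 keys returns per-key status"

# Later, recover the criteria from history before touching the feature.
criteria=$(git log -1 --format='%(trailers:key=Acceptance-Criteria,valueonly)')
echo "$criteria"
```

Unlike a Notion doc or a PR comment, this survives repo clones, mirrors, and tooling changes, because it's part of the history itself.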
maybe it still sends you to the same valley, but there are so many parameters and dimensions that I don't think that's very likely without it also being correct
LLMs don't actually have a reward system like some other ML models.