Posted by aray07 19 hours ago
Even better though - external test suites. I recently made an S3 server that the LLM made quick work of for the MVP. Then I found a Ceph S3 test suite that I could run against it and oh boy. Ended up working really well as TDD though.
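The general shape of this (names and the in-memory stand-in are mine, not the Ceph suite's) is that a conformance suite is written against an interface, so the same tests can exercise any implementation, including an AI-generated MVP:

```python
# Sketch of the external-test-suite pattern. MemoryS3 is a hypothetical
# stand-in for the MVP server; the suite knows nothing about its internals.

class MemoryS3:
    def __init__(self):
        self.objects = {}

    def put_object(self, bucket, key, body):
        self.objects[(bucket, key)] = body

    def get_object(self, bucket, key):
        return self.objects[(bucket, key)]


def run_conformance_suite(client):
    """External suite: only talks to the public interface."""
    client.put_object("b", "k", b"hello")
    assert client.get_object("b", "k") == b"hello"
    return "suite passed"


print(run_conformance_suite(MemoryS3()))
```

Swapping in a real HTTP client pointed at the server gives you the TDD loop described above: run the suite, fix what fails, repeat.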
the part that doesn't get talked about enough: most people are hitting a single provider API and treating it as fixed cost. but inference pricing varies a lot across providers for the same model. we've seen 3-5x spreads for equivalent quality on commodity models.
so half the cost problem is architectural (don't let agents spin unboundedly) and the other half is just... shopping around. not glamorous but real.
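To make the spread concrete, here's a back-of-the-envelope comparison. The per-million-token prices and monthly volume below are invented for illustration, not real provider quotes:

```python
# Hypothetical per-million-token prices for the *same* commodity model
# on three providers (numbers are made up for illustration only).
prices_per_mtok = {"provider_a": 0.20, "provider_b": 0.55, "provider_c": 0.90}

monthly_tokens = 2_000_000_000  # 2B tokens/month, also an assumption

# Monthly bill per provider.
costs = {p: rate * monthly_tokens / 1_000_000 for p, rate in prices_per_mtok.items()}

cheapest, priciest = min(costs.values()), max(costs.values())
print(costs)
print(f"spread: {priciest / cheapest:.1f}x")  # 4.5x here, inside the 3-5x range
```

At this volume the gap between the cheapest and priciest provider is real money every month, which is the whole "shopping around" argument.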
Not a rhetorical question. Trillion-token burners and such.
Since you have to test that manually anyway, you can have the AI write the code first; you test it; and if it produces the right result, you tell the AI this is correct, so it should write test cases for that result.
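That workflow is essentially characterization (golden) testing: run the code once, verify the output by hand, then freeze it in an assertion. A minimal sketch, where `slugify` is a hypothetical stand-in for the AI-written code you just verified:

```python
def slugify(title):
    # Imagine this is the AI-generated function you tested manually.
    return "-".join(title.lower().split())


def test_slugify_matches_verified_output():
    # After confirming the output by hand, lock it in so regressions surface.
    assert slugify("Hello World Again") == "hello-world-again"


test_slugify_matches_verified_output()
print("characterization test passed")
```

The test doesn't prove the behavior is *right* in the abstract; it proves the behavior stays the same as the version a human signed off on.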
This resonates with my experience, and it's also a refreshingly honest take: pushing back on heavy upfront process isn't laziness, it's just an engineer's natural drive to build things and feel productive.
> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.
> I care about this. I don't want to push slop
They clearly didn't care about that. They only cared about non-stop line-of-code generation and shipping anything fast. Otherwise they wouldn't have needed weeks to realise they weren't reading or testing this code - it's obvious from the outset.
Maybe their approach to this changed and that's fine, but at the beginning they very much did not care, and I feel people only keep saying they do because otherwise they'd have to be the one to admit the emperor isn't wearing clothes.
But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.
The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.
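One way to do that (my assumption about a mechanism, not something the comment prescribes) is commit-message trailers: the criteria live in `git log`, so they travel with the code and any agent can recover them later. The repo, commit message, and trailer name here are all invented for illustration:

```shell
# Demo in a throwaway repo so the commands are self-contained.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ship the feature with its acceptance criteria as a trailer
# in the commit message (second -m paragraph).
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty \
    -m "Add bulk-delete endpoint" \
    -m "Acceptance-Criteria: deleting 1000 keys returns per-key status"

# Later, recover the criteria from history before touching the feature.
criteria=$(git log -1 --format='%(trailers:key=Acceptance-Criteria,valueonly)')
echo "$criteria"
```

Unlike a Notion doc or a PR comment, this survives repo clones, mirrors, and tooling changes, because it's part of the history itself.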
maybe it still sends you to the same valley, but there are so many parameters and dimensions that I don't think that's very likely without it also being correct
LLMs don't actually have a reward system like some other ML models.