Agents that run while I sleep

Posted by aray07 20 hours ago

Agents that run while I sleep (www.claudecodecamp.com)

375 points | 422 comments | page 4

vidimitrov 18 hours ago

He admits the real hole himself: "this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass."

But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.

The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.
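One way to picture this is to store each criterion as a Git trailer at the end of the commit message. The sketch below is only an illustration: the function name and the `Acceptance-Criterion` trailer key are made up here, not a Git convention or anything proposed in the thread.

```python
from typing import List


def commit_message_with_criteria(subject: str, body: str, criteria: List[str]) -> str:
    """Build a commit message whose trailers record the acceptance criteria."""
    if not subject.strip():
        raise ValueError("subject must not be empty")

    lines = [subject.strip(), ""]
    if body.strip():
        lines += [body.strip(), ""]

    # "Acceptance-Criterion" is a hypothetical trailer key chosen for this sketch.
    for criterion in criteria:
        one_line = " ".join(criterion.split())  # trailers are kept on a single line
        lines.append(f"Acceptance-Criterion: {one_line}")

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(commit_message_with_criteria(
        "Add password reset flow",
        "Reset links expire after one use.",
        ["Reset email is sent for a known address", "Used link shows an error"],
    ))
```

Because trailers are plain `Key: value` lines in the last paragraph of the message, tools such as `git interpret-trailers` should be able to read them back later, so the criteria stay with the history rather than in a separate doc.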

dwaltrip 18 hours ago

Did AI write this?

vidimitrov 17 hours ago

Nope - though I’ll take it as a compliment either way. It’s a problem I’ve been sitting with for a while, so the answer came out more formed than I expected. You disagree?

rrvsh 16 hours ago

It's actually a pretty good idea/framework for writing commit descriptions, especially for smaller changes that don't have any nuances to note in the commit.

svstoyanovv 15 hours ago

Why only small changes tho? I think it can also work with larger changes if you commit more regularly. And with agentic coding or even with autonomous agentic coding, you need to do it regularly and create these contextual checkpoints, no?

dwaltrip 13 hours ago

It has that punchy, breathless cadence... shrugs

storus 19 hours ago

Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.

8note 14 hours ago

even in one agent, a different starting prompt will have you tracing a very different path through the model.

maybe it still sends you to the same valley, but there are so many parameters and dimensions that i don't think two different paths are very likely to land in the same place without also being correct

xandrius 17 hours ago

I think people are misunderstanding reward functions and LLMs.

LLMs don't actually have a reward system like some other ML models.

storus 15 hours ago

They are trained with one, and when you look at DPO you can say they contain an implicit one as well.

throwatdem12311 14 hours ago

It’s superstition that using a different slop generator to “review” the slop from a different brand of slop generator somehow makes things better. It’s slop all the way down.

storus 14 hours ago

https://github.com/karpathy/llm-council

https://ui.adsabs.harvard.edu/abs/2025arXiv250214815C/abstra...

https://www.arxiv.org/abs/2509.23537

https://www.aristeidispanos.com/publication/panos2025multiag...

https://arxiv.org/abs/2305.14325

https://arxiv.org/abs/2306.05685

https://arxiv.org/abs/2310.19740v1

olalonde 9 hours ago

Somewhat unrelated but are there good boilerplate/starter repos that are optimized for agent based development? Setting up the skills/MCPs/AGENTS.md files seems like a lot of work.

hermit_dev 14 hours ago

It's an interesting problem. Even though it's described here by a single person, I think it's shared across the board by larger corporations at scale. I know, for example, that this was being seen with game devs and the Godot engine. So many people were uploading unverified work done by AI that people just can't keep up with it. And maybe some of it's good, but how do you vet all the crap out? No one knows what's being written anymore (and non-devs can code now too, which is amazing, but part of the problem that we introduced). I think the future of being a developer will be more about verifying code integrity and working with AI to ensure it meets those standards, rather than actually being in the driver's seat. Not sexy, but we're handing the keys over willingly, yet AI is only interpreting the intent. It's going to get things wrong no matter what we do.

rurban 10 hours ago

This is TDD? Tests first, then code? I do the docs first, then the tests, then the code. I have for years.

What he describes is like that. Just that the plan step is suggesting docs, not writing actual docs.

godelski 10 hours ago

TDD has always been flawed. Tests can't give you complete coverage; they are always incomplete. Though every time I say this people think I'm against tests. I'm just saying tests can't prove correctness. You'd have to be a lunatic to think they are proofs. Even crazier is having the LLMs write their own tests and thinking that that's proof. I'm sure it improves things, but proofs are a different beast altogether.

Seems things still haven't changed in half a century

https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD288...

UK-Al05 5 hours ago

It's not meant to give you complete coverage. It's meant to guide you toward meeting the acceptance criteria.

rurban 9 hours ago

Of course tests are not proofs. For proofs I do 'make verify' :)

Tests just catch the most simple mistakes, edge cases and some regressions.

OsrsNeedsf2P 19 hours ago

Our app is a desktop integration and last year we added a local API that could be hit to read and interact with the UI. This unlocked the same thing the author is talking about - the LLM can do real QA - but it's an example of how it can be done even in non-web environments.

Edit: I even have a skill called release-test that does manual QA for every bug we've ever had reported. It takes about 10 hours to run but I execute it inside a VM overnight so I don't care.

8note 14 hours ago

i got me a windows mcp setup running in a sandbox, so it can look at screenshots, see the UIA, and click things either by coordinate or by UIA.

i let it run overnight against a windows app i was working on, and that got it from mostly not working to mostly working.

the loop was

1. look at the code and specs to come up with tests
2. predict the result
3. try it
4. compare the prediction against the result
5. file bug report, or call it a success

and then switch to bug fixing, and go back around again. Worked really well in Gemini CLI with the giant context window
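The predict, try, compare loop above can be sketched roughly as follows. This is a minimal illustration rather than the commenter's actual setup: `Check`, `qa_pass`, and the string-based results are hypothetical names and simplifications, and the `run` callable stands in for whatever drives the UI.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    predicted: str          # step 2: what the code and specs say should happen
    run: Callable[[], str]  # step 3: performs the action and returns what was observed


def qa_pass(checks: List[Check]) -> List[str]:
    """Run every check once and return a bug report for each mismatch."""
    bug_reports = []
    for check in checks:
        observed = check.run()
        # Step 4: compare the prediction against the observed result.
        if observed != check.predicted:
            # Step 5: file a bug report; a match counts as a success.
            bug_reports.append(
                f"{check.name}: predicted {check.predicted!r}, observed {observed!r}"
            )
    return bug_reports


if __name__ == "__main__":
    checks = [
        Check("save file", "saved", lambda: "saved"),
        Check("open dialog", "dialog shown", lambda: "crash"),
    ]
    for report in qa_pass(checks):
        print(report)
```

Writing the prediction down before the run is the point of the design: a mismatch is then either a bug in the app or a misreading of the spec, and both are worth a report.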

jc-myths 14 hours ago

Solo founder here, shipping a real product built mostly with AI. The code review thing is real but my actual daily pain is different. AI lies about being done. It'll say "implemented" and what it actually did is add a placeholder with a TODO comment. Or it silently adds a fallback path that returns hardcoded data when the real API fails, and now your app "works" but nothing is real.

I've also given it explicit rules like "never use placeholder images, always generate real assets" — and it just... ignores them sometimes. Not always. Sometimes. Which is worse, because you can't trust it but you also can't not use it.

The 80% it writes is fine. The problem is you still have to verify 100% of it.

cube00 13 hours ago

Have you tried using an additional agent to verify the outputs? It seems that can help if the supervising agent has a small context demand on it (i.e., run this command, make sure it returns 0, invoke the main coding agent with the error message if it doesn't).
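A supervisor of that kind can be very small. The sketch below assumes the coding agent is reachable through a plain callback; `supervise`, `notify_agent`, and `max_rounds` are hypothetical names for illustration, not part of any agent tool's API.

```python
import subprocess
import sys
from typing import Callable, List


def supervise(command: List[str], notify_agent: Callable[[str], None], max_rounds: int = 3) -> bool:
    """Run `command`; on a non-zero exit, pass the error output to the coding agent and retry."""
    for _ in range(max_rounds):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Hand only the error text to the agent, keeping the supervisor's own context small.
        notify_agent(result.stderr or result.stdout)
    return False


if __name__ == "__main__":
    # Stand-in for a real check such as a build or test command.
    failing = [sys.executable, "-c", "import sys; sys.stderr.write('build failed'); sys.exit(1)"]
    ok = supervise(failing, notify_agent=lambda msg: print("agent gets:", msg), max_rounds=2)
    print("passed:", ok)
```

The retry cap is there so an agent that never fixes the failure cannot loop all night.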

jc-myths 9 hours ago

Yeah I've experimented with that pattern. The meta-agent approach works for catching obvious stuff, like "did the build pass" or "does this file actually exist." But the harder bugs are semantic. The agent writes a function that returns the right shape of data but with wrong values, or adds a fallback that masks the real failure. A supervising agent reading the same code often has the same blind spots.

What's worked better for me is building verification into the workflow itself, like explicit test assertions the agent has to pass before it can claim "done," plus a rule that any API call must show a real response, not a mock. Basically treating the AI like a junior dev who needs guard rails, not a senior who just needs a code review.
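As a rough illustration of such a guard rail, the sketch below refuses a "done" claim when the tests failed or when the changed source still contains placeholder markers. The pattern list and function names are assumptions made up for this example, and a text scan like this only catches the obvious cases, not wrong values or masked failures.

```python
import re
from typing import List, Tuple

# Hypothetical markers; a real project would tune this list.
PLACEHOLDER_PATTERNS = [r"\bTODO\b", r"\bFIXME\b", r"placeholder", r"NotImplementedError"]


def find_placeholders(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line that matches a placeholder pattern."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(re.search(pattern, line, re.IGNORECASE) for pattern in PLACEHOLDER_PATTERNS):
            hits.append((number, line.strip()))
    return hits


def may_claim_done(source: str, tests_passed: bool) -> bool:
    """Allow "done" only if the tests passed and no placeholder markers remain."""
    return tests_passed and not find_placeholders(source)


if __name__ == "__main__":
    changed = "def price():\n    # TODO: call the real API\n    return 42\n"
    print(find_placeholders(changed))
    print(may_claim_done(changed, tests_passed=True))
```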

overfeed 17 hours ago

> At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

To everyone who plans on automating themselves out of a job by taking the human element out: this is the endgame that management wants, replacing your (expensive and non-tax-optimized) labor with scalable Opex.

hinkley 15 hours ago

It's also delusional.

throwaway7783 14 hours ago

Regarding the self-congratulation machine - I simply use a different Claude Code session to do the reviews. There is no self-congratulation; if anything, it is overly critical at times. Works well.

Honestly, sometimes the harnesses, specs, some predefined structure for skills etc. all feel like over-engineering. 99% of the time a bloody prompt will do. Claude Code is capable of planning, spawning sub-agents, writing tests and so on.

A Claude.md file with general guidelines about our repo has worked extraordinarily well, without any external wrappers, harnesses or special prompts. Even the MD file has no specific structure, just instructions or notes in English.

lateforwork 20 hours ago

> When Claude writes tests for code Claude just wrote, it's checking its own work.

You can have Gemini write the tests and Claude write the code. And have Gemini do review of Claude's implementation as well. I routinely have ChatGPT, Claude and Gemini review each other's code. And having AI write unit tests has not been a problem in my experience.

xandrius 17 hours ago

I don't think that's necessary, just make sure the context is not shared. A pretty good model can handle both sides well enough.

aray07 20 hours ago

yeah i have started using codex to do my code reviews and it helps to have “a different llm” - i think one of my challenges has been that unit tests are good but not always comprehensive. you still need functional tests to verify the spec itself.
