Agents that run while I sleep

Posted by aray07 17 hours ago

Agents that run while I sleep (www.claudecodecamp.com)

344 points | 391 comments | page 2

Aachen 3 hours ago|

And here I am turning my computer off at night to save energy, while others run a few extra ones for... for what, anyway? If you're working on problems real people are having (diseases, climate change, poverty, etc.) then sure, but setting back the energy transition for a blog post and your personal brand, as OP seems to do? How's that not criminal?

ionwake 3 hours ago||

I found your post interesting, I'm just trying to understand your POV.

If you are on a sinking ship would you not do your best to position yourself?

Or do you see your actions morally equivalent to others regardless of scale?

Aachen 2 hours ago||

What sinking ship?

macgyverismo 3 hours ago|||

It's not criminal as the power usage is (assumed to be) paid for. It is criminal or at least problematic that the cost of power does not include (negative) externalities, we should strive to change that.

lazystar 3 hours ago||

> How's that not criminal

Well, a) it's a hobby, and b) this is still a free country/free society.

wartywhoa23 2 hours ago|||

If I were tasked with stripping this country/world of all remaining freedom, I'd surely let bullshit like this proliferate in ordo ab chao mode, where the exact line between ordo and chao is only known to me and my henchmen, and just wait till defeated enjoyers of miserable remnants of said freedom crouch begging me to rob them of that chaos-inducing freedom.

Aachen 2 hours ago|||

I could see the comparison to hobbies which pollute the environment, but in general people do tend to vote for reducing freedom where it harms others

simonpure 10 hours ago||

I've been impressed by Google Jules since the Gemini 3.1 Pro update. Sometimes it's been working on a task for 4h. I've now put it in a ralph loop using a GitHub Action to call itself and auto-merge PRs after the linter, formatter and tests pass. It does still occasionally want my approval, but most of the time I just say Sounds great!

It's currently burning through the TESTING.md backlog: https://github.com/alpeware/datachannel-clj

afro88 17 hours ago||

I guess to reach this point you have already decided you don't care what the code looks like.

Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

tdeck 9 hours ago||

If you haven't reviewed the code yet, how can you say it did 4 weeks of work in 2 days? You haven't verified the correctness, and besides reviewing the code is part of the work.

afro88 2 hours ago||

That's what I was getting at. With the review and potential rework time, we could be looking at over the original 4 week estimate. So then what's the point in using long running unsupervised agents if it ends up taking longer than doing it in small chunks?

eikenberry 13 hours ago|||

The proper solution is to treat the agent generated code like assembly... IE. don't review it. Agents are the compiler for your inputs (prompts, context, etc). If you care about code quality you should have people writing it with AI help, not the other way around.

lbreakjai 13 hours ago|||

> Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Same as before. Small PRs, accept that you won't ship a month of code in two days. Pair program with someone else so the review is just a formality.

The value of the review is _also_ for someone else to check if you have built the right thing, not just a thing the right way, which is exponentially harder as you add code.

dumpsterdiver 11 hours ago|||

You’re not alone. I went from being a mediocre security engineer to a full time reviewer of LLM code reviews last week. I just read reports and report on incomplete code all day. Sometimes things get humorously worse from review to review. I take breaks by typing out the PoCs the LLMs spell out for me…

krater23 8 hours ago||

I'm a security engineer too, and if it really gets to the point where I only review LLM code, I'll refuse to do it for less than double my hourly rate.

akshaysg 16 hours ago|||

I've been thinking a lot about this!

Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).

IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.

kg 16 hours ago|||

It sounds like you know this but what happened is that you didn't do 4 weeks of work over 2 days, you got started on 4 weeks of work over 2 days, and now you have to finish all 4 weeks worth of work and that might take an indeterminate amount of time.

If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.

You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.
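A minimal sketch of that backpressure idea, assuming a hypothetical `ReviewGate` that an agent runner consults before starting new work. The class, its method names and the limit of 3 are illustrative only and not taken from any real tool.

```python
class ReviewGate:
    """Refuse new agent work while too many PRs are awaiting review."""

    def __init__(self, max_in_review: int = 3) -> None:
        self.max_in_review = max_in_review
        self.in_review = set()  # ids of PRs currently waiting on a human

    def can_start(self) -> bool:
        # Backpressure: generation is gated on review capacity.
        return len(self.in_review) < self.max_in_review

    def open_pr(self, pr_id: str) -> bool:
        """Register a new PR; return False (and do nothing) if the gate is full."""
        if not self.can_start():
            return False
        self.in_review.add(pr_id)
        return True

    def finish_review(self, pr_id: str) -> None:
        # Merged or rejected: either way a review slot frees up.
        self.in_review.discard(pr_id)


gate = ReviewGate(max_in_review=3)
results = [gate.open_pr(f"pr-{i}") for i in range(4)]
print(results)           # [True, True, True, False]
gate.finish_review("pr-0")
print(gate.can_start())  # True
```

The fourth PR is refused until a reviewer clears one of the first three, so tokens are only spent when there is capacity to look at the result.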

Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?

Ideally you're still better off, you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.

kwanbix 16 hours ago|||

So you have become a reviewer instead of a programmer? Is that so? Honest question. And if so, what is the advantage of looking at code for 12 hours instead of coding for 12?

woah 12 hours ago||

Build features faster. Granted, this exposes the difference between people who like to finish projects and people who like to get paid a lot of money for typing on a keyboard.

krater23 8 hours ago||

Bullshit! Your project isn't finished as long as there are obvious major bugs that you can't fix because you don't understand the code.

logicchains 16 hours ago|||

>Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.
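Roughly, that loop could look like the sketch below. `check` and `fix` are hypothetical callables standing in for agent calls, and the `max_rounds` cap is an added assumption so the loop cannot spin forever.

```python
from typing import Callable

Issue = tuple[str, str]  # (file path, checklist item that was violated)


def review_until_clean(
    files: list[str],
    checklist: list[str],
    check: Callable[[str, list[str]], list[str]],
    fix: Callable[[Issue], None],
    max_rounds: int = 5,
) -> list[Issue]:
    """Check each file against the checklist, fix what is found, and repeat.

    Returns the issues reported by the most recent check (empty if clean).
    """

    def find() -> list[Issue]:
        return [(path, msg) for path in files for msg in check(path, checklist)]

    issues = find()
    rounds = 0
    while issues and rounds < max_rounds:
        for issue in issues:
            fix(issue)
        issues = find()  # re-check after every round of fixes
        rounds += 1
    return issues


# Toy stand-ins so the sketch runs: a "file" is flagged while it contains TODO.
contents = {"a.py": "def f(): ...  # TODO", "b.py": "def g(): ..."}


def fake_check(path: str, checklist: list[str]) -> list[str]:
    return [item for item in checklist if item == "no TODOs" and "TODO" in contents[path]]


def fake_fix(issue: Issue) -> None:
    path, _ = issue
    contents[path] = contents[path].replace("TODO", "done")


remaining = review_until_clean(list(contents), ["no TODOs"], fake_check, fake_fix)
print(remaining)  # []
```

Anything still in the returned list after the last round is what would get escalated to a human.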

aray07 16 hours ago|||

yeah honestly that's what i am struggling with too and I don't have a good solution. However, I do think we are going to see more of this - so it will be interesting to see how we are going to handle this.

i think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. started building a claude skill for this (https://github.com/opslane/verify)

afro88 15 hours ago||

It's a nice idea, but how do you know the agent is aligned with what it thinks the intent is?

8note 12 hours ago||

or moreso, what happens at compact boundaries where the agent completely forgets the intent

zer00eyz 16 hours ago||

> how do you review all the code?

Code review is a skill, as is reading code. You're going to quickly learn to master it.

> It's like 20k of line changes over 30-40 commits.

You run it, in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.

> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, inline documentation (cross referenced) and digestible chunks.

But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.

afro88 15 hours ago||

Oh I didn't mean literally how do I review code. I meant, if an agent can write a lot of code to achieve a large task that seemingly works (from manual testing), what's the point if we haven't really solved code review? There's still that bottleneck no matter how fast you can get working code down.

daxfohl 15 hours ago||

Sounds like we've just gotten into lazy mode where we believe that whatever it spits out is good enough. Or rather, we want to believe it, and convince ourselves that some simple guardrail we put up will make it true, because God forbid we have to use our own brain again.

What if instead, the goal of using agents was to increase quality while retaining velocity, rather than the current goal of increasing velocity while (trying to) retain quality? How can we make that world come to be? Because TBH that's the only agentic-oriented future that seems unlikely to end in disaster.

rglover 14 hours ago|

You can't. To retain and improve quality requires care. Very few if any of the people setting stuff like this up truly care about delivering a quality result (any result is the real goal). Unless there's some incentive to care, quality will be found among the exceedingly rare people/businesses.

cadamsdotcom 4 hours ago||

Code and Claude Code hooks can conditionally tell the model anything:

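    #!/usr/bin/env python3
    import sys

    # Exit code 2 signals a blocking error; the message goes to stderr so the model sees it.
    print("fix needed: method ABC needs a return type annotation on line 45", file=sys.stderr)
    sys.exit(2)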

Claude Code will show that output to the model. This lets you enforce anything from TDD to a ban on window.alert() in code - deterministically.

This can be the basis for much more predictable enforcement of rules and standards in your codebase.
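As a sketch of the window.alert() ban: the script below assumes the hook receives a JSON payload on stdin with a `tool_input` object, and that the written text sits under `content` (whole-file writes) or `new_string` (edits). Those field names are assumptions to verify against the hooks documentation, and the function names are made up for illustration.

```python
import json
import re
import sys

BANNED = re.compile(r"\bwindow\.alert\s*\(")


def find_violations(text: str) -> list[str]:
    """Return one message per line of the written text that calls window.alert()."""
    return [
        f"fix needed: remove window.alert() on line {n}"
        for n, line in enumerate(text.splitlines(), start=1)
        if BANNED.search(line)
    ]


def run_hook(payload: str) -> tuple[int, str]:
    """Map a hook payload (JSON text) to (exit code, message for the model)."""
    tool_input = json.loads(payload).get("tool_input", {})
    # Assumed field names; line numbers are relative to the text being written.
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    problems = find_violations(text)
    return (2, "\n".join(problems)) if problems else (0, "")


def main() -> int:
    code, message = run_hook(sys.stdin.read())
    if message:
        print(message, file=sys.stderr)  # stderr is what gets fed back on exit 2
    return code
```

Installed as a hook script, the file would end with `sys.exit(main())`; that call is left out here so the functions can be imported and tested on their own.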

Once you get used to code based guardrails, you’ll see how silly the current state of the art is: why do we pack the context full of instructions, distract the model from its task, then act all surprised when it doesn’t follow them perfectly!

tdeck 9 hours ago||

> A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.

I can't understand the mindset that would lead someone not to have realized this from the beginning.

TonyAlicea10 15 hours ago||

You can find approaches that improve things, but there's always going to be a chance that your code is terrible if you let an LLM generate it and don't review it with human eyes.

But review fatigue and resulting apathy is real. Devs should instead be informed if incorrect code for whatever feature or process they are working on would be high-risk to the business. Lower-risk processes can be LLM-reviewed and merged. Higher risk must be human-reviewed.
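One way to picture that split, as a sketch only: assume the business has tagged certain path prefixes as high-risk, and route each change by what it touches. The prefixes and names below are hypothetical.

```python
from enum import Enum


class Reviewer(Enum):
    LLM = "llm"
    HUMAN = "human"


# Hypothetical: path prefixes the business has marked as high-risk.
HIGH_RISK_PREFIXES = ("billing/", "auth/", "migrations/")


def route_review(changed_files: list[str]) -> Reviewer:
    """A change needs human review if it touches any high-risk path."""
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in changed_files):
        return Reviewer.HUMAN
    return Reviewer.LLM


print(route_review(["docs/readme.md", "ui/button.tsx"]))      # Reviewer.LLM
print(route_review(["ui/button.tsx", "billing/invoice.py"]))  # Reviewer.HUMAN
```

Path prefixes are a crude proxy for business risk; the point is only that the routing decision is made up front, not left to whoever is least tired.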

If the business you're supporting can't tolerate much incorrectness (at least until discovered), then guess what - you aren't going to get much speed increase from LLMs. I've written about and given conference talks on this over the past year. Teams can improve this problem at the requirements level: https://tonyalicea.dev/blog/entropy-tolerance-ai/

jeff_antseed 2 hours ago||

the overnight cost thing is real. "$200 in 3 days" is actually pretty tame compared to what happens when you have agents spawning sub-tasks without a budget cap.

the part that doesn't get talked about enough: most people are hitting a single provider API and treating it as fixed cost. but inference pricing varies a lot across providers for the same model. we've seen 3-5x spreads for equivalent quality on commodity models.

so half the cost problem is architectural (don't let agents spin unboundedly) and the other half is just... shopping around. not glamorous but real.
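a rough sketch of both halves, with made-up names and made-up prices: a shared `Budget` that every agent and sub-task has to draw from, and a `cheapest` helper that picks between providers quoting different prices for the same model.

```python
class BudgetExceeded(RuntimeError):
    pass


class Budget:
    """Shared spending cap that every agent and sub-task draws from."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Check before spending so the cap is never crossed.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"would spend {self.spent_usd + cost_usd:.2f} of {self.limit_usd:.2f} USD"
            )
        self.spent_usd += cost_usd


def cheapest(prices_per_mtok: dict[str, float]) -> str:
    """Pick the provider with the lowest price for the same model."""
    return min(prices_per_mtok, key=prices_per_mtok.get)


budget = Budget(limit_usd=10.0)
budget.charge(4.0)
budget.charge(4.0)
try:
    budget.charge(4.0)  # a third sub-task would cross the cap
except BudgetExceeded as exc:
    print("stopped:", exc)  # stopped: would spend 12.00 of 10.00 USD

# Illustrative prices only (USD per million tokens).
print(cheapest({"provider_a": 3.0, "provider_b": 0.9, "provider_c": 1.5}))  # provider_b
```

the important bit is that sub-tasks share the parent's `Budget` object instead of getting their own, otherwise the cap means nothing.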

jdlshore 17 hours ago||

Pet peeve: this post misunderstands “TDD.” What it really describes is acceptance tests.

TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.

TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”

Thank you for coming to my TED^H^H^H TDD talk.

wnevets 17 hours ago||

> TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice.

I would like to emphasize that feedback includes being alerted to breaking something you previously had working in a seemingly unrelated/impossible way.

hinkley 11 hours ago||

Accidentally mutating an input is always a 'fun' way to trigger spooky action at a distance.
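A small illustration with made-up function names: the buggy helper sorts the caller's list in place, so code elsewhere that relied on the original order breaks without ever being touched.

```python
def top_three_buggy(scores: list[int]) -> list[int]:
    scores.sort(reverse=True)  # sorts the caller's list in place
    return scores[:3]


def top_three(scores: list[int]) -> list[int]:
    return sorted(scores, reverse=True)[:3]  # sorted() builds a new list


arrival_order = [40, 95, 10, 70]
top_three_buggy(arrival_order)
print(arrival_order)  # [95, 70, 40, 10]: the caller's ordering is gone

arrival_order = [40, 95, 10, 70]
top_three(arrival_order)
print(arrival_order)  # [40, 95, 10, 70]
```

A test that asserts the input is unchanged after the call is what catches this early.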

hinkley 11 hours ago||

suggestion: TeDD talk.

Lasang 11 hours ago|

The concept of long-running background agents sounds appealing, but the real challenge tends to be reliability and task definition rather than raw model capability.

If an agent runs unattended for hours, small errors compound quickly. Even simple misunderstandings about file structure or instructions can derail the whole process.
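A back-of-the-envelope way to see the compounding, under the simplifying assumption that each step succeeds independently with the same probability (the 99% figure is illustrative, not a measurement):

```python
def chance_all_steps_succeed(p_step: float, n_steps: int) -> float:
    """Probability that n independent steps all succeed."""
    return p_step ** n_steps


for n in (10, 100, 500):
    print(n, round(chance_all_steps_succeed(0.99, n), 3))
# 10 0.904
# 100 0.366
# 500 0.007
```

Even at 99% per step, a run of a few hundred unattended steps is more likely than not to contain at least one mistake.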
