Posted by aray07 17 hours ago
If you are on a sinking ship would you not do your best to position yourself?
Or do you see your actions morally equivalent to others regardless of scale?
Well, a) it's a hobby, and b) this is still a free country/free society.
It's currently burning through the TESTING.md backlog: https://github.com/alpeware/datachannel-clj
Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?
Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.
One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
Same as before. Small PRs, accept that you won't ship a month of code in two days. Pair program with someone else so the review is just a formality.
The value of the review is _also_ for someone else to check if you have built the right thing, not just a thing the right way, which is exponentially harder as you add code.
Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).
IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.
If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.
You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.
Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?
Ideally you're still better off, you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.
Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.
i think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. started building a claude skill for this (https://github.com/opslane/verify)
Code review is a skill, as is reading code. You're going to quickly learn to master it.
> It's like 20k of line changes over 30-40 commits.
You run it, in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.
> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, in line documentation (cross referenced) and digestible chunks.
But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.
What if instead, the goal of using agents was to increase quality while retaining velocity, rather than the current goal of increasing velocity while (trying to) retain quality? How can we make that world come to be? Because TBH that's the only agentic-oriented future that seems unlikely to end in disaster.
#!python
print(“fix needed: method ABC needs a return type annotation on line 45”
import os
os.exit(2)
Claude Code will show that output to the model. This lets you enforce anything from TDD to a ban on window.alert() in code - deterministically.
This can be the basis for much more predictable enforcement of rules and standards in your codebase.
Once you get used to code based guardrails, you’ll see how silly the current state of the art is: why do we pack the context full of instructions, distract the model from its task, then act all surprised when it doesn’t follow them perfectly!
I can't understand the mindset that would lead someone not to have realized this from the beginning.
But review fatigue and resulting apathy is real. Devs should instead be informed if incorrect code for whatever feature or process they are working on would be high-risk to the business. Lower-risk processes can be LLM-reviewed and merged. Higher risk must be human-reviewed.
If the business you're supporting can't tolerate much incorrectness (at least until discovered), than guess what - you aren't going to get much speed increase from LLMs. I've written about and given conference talks on this over the past year. Teams can improve this problem at the requirements level: https://tonyalicea.dev/blog/entropy-tolerance-ai/
the part that doesn't get talked about enough: most people are hitting a single provider API and treating it as fixed cost. but inference pricing varies a lot across providers for the same model. we've seen 3-5x spreads for equivalent quality on commodity models.
so half the cost problem is architectural (don't let agents spin unboundedly) and the other half is just... shopping around. not glamorous but real.
TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.
TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”
Thank you for coming to my TED^H^H^H TDD talk.
I would like to emphasize that feedback includes being alerted to breaking something you previously had working in a seemly unrelated/impossible way.
If an agent runs unattended for hours, small errors compound quickly. Even simple misunderstandings about file structure or instructions can derail the whole process.