It's a shame that AI coding tools have become such a polarizing issue among developers. I understand the reasons, but I wish there had been a smoother path to this future. The early LLMs like GPT-3 could sort of code enough for it to look like there was a lot of potential, and so there was a lot of hype to drum up investment and a lot of promises made that weren't really viable with the tech as it was then. This created a large number of AI skeptics (of whom I was one, for a while) and a whole bunch of cynicism and suspicion and resistance amongst a large swathe of developers. But could it have been different? It seems a lot of transformative new tech is fated to evolve this way. Early aircraft were extremely unreliable and dangerous and not yet worthy of the promises being made about them, but eventually with enough evolution and lessons learned we got the Douglas DC-3, and then in the end the 747.
If you're a developer who still doesn't believe that AI tools are useful, I would recommend you go read Mitchell's post, and give Claude Code a trial run like he did. Try and forget about the annoying hype and the vibe-coding influencers and the noise and just treat it like any new tool you might put through its paces. There are many important conversations about AI to be had, it has plenty of downsides, but a proper discussion begins with close engagement with the tools.
Our tooling just went through a complete refresh in less than three years, and it leaves heads spinning. People are confused, fighting for or against it, torn even between 2025 and 2026. I know I was.
People are still searching for a way to describe it, from 'agentic coding' to 'vibe coding' to 'modern AI-assisted stack'.
We don't call architects 'vibe architects' even though they copy-paste four fifths of your next house and work from a library of standard designs!
We don't call builders 'vibe builders' for using earth-moving machines instead of a shovel...
When was the last time you reviewed the machine code produced by a compiler? ...
The real issue this industry is facing is the phenomenal speed of change. But what are we really doing? That's right: programming.
- Pull requests
- Merge requests
- Code review
I feel like I’m taking crazy pills. Are SWEs supposed to move away from code review, one of the core activities of the profession? Code review is as fundamental to software engineering as double-entry bookkeeping is to accounting. Yes, we know that functional code can be generated at incredible speeds. Yes, we know that apps and whatnot can be bootstrapped from nothing by “agentic coding”.
We need to read this code, right? How can I deliver code to my company without security and reliability guarantees that, at their core, come from me knowing what I’m delivering line-by-line?
The workflow automation and better (and model-directed) context management are all obvious in retrospect but a lot of people (like myself) were instead focused on IDE integration and such vs `grep` and the like. Maybe multi-agent with task boards is the next thing, but it feels like that might also start to outrun the ability to sensibly design and test new features for non-greenfield/non-port projects. Who knows yet.
I think it's still very valuable for someone to dig into the underlying models periodically (insofar as the APIs even expose the same level of raw stuff anymore) to get a feel for what's reliable to one-shot vs what's easily correctable by a "ran the tests, saw it was wrong, fixed it" loop. If you don't have a good sense of that, it's easy to get overambitious and end up with something you don't like, at least if you're the sort of person who cares at all about what the code looks like.
But that speed makes a pretty significant difference in experience.
If you wait a couple minutes and then give the model a bunch of feedback about what you want done differently, and then have to wait again, it gets annoying fast.
If the feedback loop is much tighter things feel much more engaging. Cursor is also good at this (investigate and plan using slower/pricier models, implement using fast+cheap ones).
This is the key one I think. At one extreme you can tell an agent "write a for loop that iterates over the variable `numbers` and computes the sum" and it'll do this successfully, but the scope is so small there's not much point in using an LLM. At the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad.
A lot of successful LLM adoption for code is about finding this sweet spot. Overly specific instructions don't make you any more productive, and with overly broad instructions you end up redoing too much of the work.
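To make that sweet spot concrete: for me a right-sized chunk looks like a contract the agent can fill in and I can verify at a glance, a signature plus a test. (The names below, including `normalize_orders`, are made up purely for illustration.)

```python
# Hypothetical example of a "sweet spot" task: small enough to review in
# seconds, big enough that delegating it actually saves typing and thinking.
def normalize_orders(orders: list[dict]) -> dict[str, int]:
    """Collapse order lines into {sku: total_qty}; this is the part the agent writes."""
    totals: dict[str, int] = {}
    for line in orders:
        totals[line["sku"]] = totals.get(line["sku"], 0) + line["qty"]
    return totals


def test_normalize_orders_merges_duplicates():
    orders = [
        {"sku": "A", "qty": 2},
        {"sku": "A", "qty": 3},
        {"sku": "B", "qty": 1},
    ]
    assert normalize_orders(orders) == {"A": 5, "B": 1}
```

Anything smaller than that I'd just type myself; anything much bigger and I'm back to redoing the agent's assumptions.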
It cognitively feels very similar to other classic programming activities, like modularization at any level from architecture to code units/functions, thoughtfully choosing how to lay out and chunk things. It's always been one of the things that make programming pleasurable for me, and some of that feeling returns when slicing up tasks for agents.
I'm starting to think of projects now as a tree structure where the overall architecture of the system is the main trunk, from there you have the sub-modules, and eventually you get to implementations of functions and classes. The goal of the human working with the coding agent is to keep full editorial control of the main trunk and the main sub-modules and delegate as many of the smaller branches as possible.
Sometimes you're still working out the higher-level architecture, too, and you can use the agent to prototype the smaller bits and pieces which will inform the decisions you make about how the higher-level stuff should operate.
I agree. This is how I see it too. It's more like a shortcut to an end result that's very similar to (or much better than) what I would've reached by typing it myself.
The other day I realised that I'm using my experience to steer it away from bad decisions a lot more than I'd noticed. It feels like it does all the real work, but I have to remember that my/our (decades of) experience writing code is playing a part as well.
I'm genuinely confused when people come in at this point and say that it's impossible to do this and produce good output and end results.
Or maybe something quite different, where these early-era agentic tooling strategies become either unneeded or even actively detrimental.
I think anyone who has worked on a serious software project would say that this means it would be polling you constantly.
Even if we posit that an LLM is equivalent to a human, humans constantly clarify requirements/architecture. IMO on both of those fronts the correct path often reveals itself over time, rather than being knowable from the start.
So in this scenario it seems like you'd be dealing with constant pings and need to really make sure your understanding of the project is growing with the LLM's development efforts as well.
To me this seems like the best case for the current technology: the models have been getting better and better at doing what you tell them in small chunks, but you still need to be deciding what they should be doing. These chunks don't feel as though they're getting bigger unless you're willing to accept slop.
What this misses, of course, is that you can just have the agent do this too. Agents are great at making project plans, especially if you give them a template to follow.
Amusingly, this was my experience in giving Lovable a shot. The onboarding process was literally just setting me up for failure by asking me up front for a detailed description of the app I was attempting to build.
Taking it piece by piece in Claude Code has been significantly more successful.
Maybe there’s something about not having to context-switch between natural language and code that just makes it _feel_ easier sometimes
The more detailed I am in breaking down chunks, the easier it is for me to verify and the more likely I am to get output that isn't 30% wrong.
But not so good at making (robust) new features out of the blue
The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).
What ended up working for me was treating chat as where I shape the plan (tradeoffs, invariants, failure modes) and treating the agent as something that does narrow, reviewable diffs against that plan. The human job stays very boring: run it, verify it, and decide what’s actually acceptable. That separation is what made it click for me.
Once I got that loop stable, it stopped being a toy and started being a lever. I’ve shipped real features this way across a few projects (a git like tool for heavy media projects, a ticketing/payment flow with real users, a local-first genealogy tool, and a small CMS/publishing pipeline). The common thread is the same: small diffs, fast verification, and continuously tightening the harness so the agent can’t drift unnoticed.
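To make the "harness" part concrete, here's a minimal sketch of that loop, assuming a git repo whose test suite runs via `pytest`; `apply_agent_diff` is a placeholder for however you actually drive the agent, not a real API:

```python
# Minimal sketch of the "narrow diff, fast verification" loop described above.
# Assumes a git repo whose tests run via `pytest`; apply_agent_diff is a
# placeholder for whatever mechanism asks the agent for a small, focused diff.
import subprocess


def tests_pass() -> bool:
    return subprocess.run(["pytest", "-q"]).returncode == 0


def apply_agent_diff(task: str) -> None:
    """Placeholder: request a small diff for `task` and apply it to the working tree."""
    raise NotImplementedError


def run_task(task: str) -> bool:
    apply_agent_diff(task)
    if tests_pass():
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", f"agent: {task}"], check=True)
        return True
    # Drift gets rejected wholesale: throw the diff away instead of salvaging it.
    subprocess.run(["git", "checkout", "--", "."], check=True)
    subprocess.run(["git", "clean", "-fd"], check=True)
    return False
```

The human review still happens on each committed diff; the harness just makes sure nothing that fails verification lands silently.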
Yeah, I would get patterns where initial prototypes were promising, then we developed something that was 90% of the way to the design goals, and then as we tried to push in the last 10%, drift would start breaking down, or even just forgetting, the 90%.
So I would get to 90% and basically start a new project with that as the baseline to add to.
These patterns seem to be picking up speed in the general population; makes the human race seem quite easily hackable.
If the human race were not hackable then society would not exist, we'd be the unchanging crocodiles of the last few hundred million years.
Have you ever found yourself speaking a meme? Had a catchy tune repeating in your head? Started spouting nation-state-level propaganda? Found yourself in a crowd trying to burn a witch at the stake?
Hacking the flow of human thought isn't that hard, especially across populations. Hacking any one particular human's thoughts is harder unless you have a lot of information on them.
With where the models are right now, you still need a human in the loop to make sure you end up with code that you (and your organisation) actually understand. The bottleneck has gone from writing code to reading code.
This has always been the bottleneck. Reviewing code is much harder and gets worse results than writing it, which is why reviewing AI code is not very efficient. The time required to understand code far outstrips the time to type it.
Most devs don’t do thorough reviews. Check that the variable names seem OK, make sure there are no obvious typos, ask for a comment, and call it good. For a trusted teammate this is actually fine, and it’s why they’re so valuable! For an AI, it’s a slot machine, and trusting it is equivalent to letting your coworkers/users do your job so you can personally move faster.
"Please let us sit down and have a reasonable conversation! I was a skeptic, too, but if all skeptics did what I did, they would come to Jesus as well! Oh, and pay the monthly Anthropic tithe!"
I do run multiple models at once now. On different parts of the code base.
I keep the less boring tasks for myself and outsource all of the slam dunks, then review. I often use another model to validate the previous model's work while doing so myself.
I still `git reset` quite often, but I'm finding more and more ways to avoid getting to that point as I get to know the tools better.
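For the "multiple models on different parts of the code base" part, one pattern that fits (a sketch, not a prescription; the paths and branch names are made up) is giving each agent its own `git worktree`, so a bad run means deleting a worktree rather than a `git reset` in your main checkout:

```python
# Sketch: one git worktree per agent/task, so parallel runs can't stomp on each
# other and discarding a bad run never touches the main checkout.
# Paths and branch names below are illustrative only.
import subprocess


def new_agent_workspace(task_slug: str) -> str:
    path = f"../agents/{task_slug}"
    subprocess.run(
        ["git", "worktree", "add", "-b", f"agent/{task_slug}", path],
        check=True,
    )
    return path  # point the agent/model at this directory


def discard_workspace(task_slug: str) -> None:
    subprocess.run(
        ["git", "worktree", "remove", "--force", f"../agents/{task_slug}"],
        check=True,
    )
    subprocess.run(["git", "branch", "-D", f"agent/{task_slug}"], check=True)
```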
Autocompleting our brains! What a crazy time.