> With agentic coding, part of what makes the models work today is knowing the mistakes. If you steer it back to an earlier state, you want the tool to remember what went wrong. There is, for lack of a better word, value in failures. As humans we might also benefit from knowing the paths that did not lead us anywhere, but for machines this is critical information. You notice this when you are trying to compress the conversation history. Discarding the paths that led you astray means that the model will try the same mistakes again.
I've been trying to find the best ways to record and publish my coding agent sessions so I can link to them in commit messages, because increasingly the work I do IS those agent sessions.
Claude Code defaults to expiring those records after 30 days! Here's how to turn that off: https://simonwillison.net/2025/Oct/22/claude-code-logs/
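(If memory serves, the relevant knob is a single key in ~/.claude/settings.json; the key name below is from memory, so verify it against the linked post before relying on it. Assumes the settings file already exists:)

# sketch: bump Claude Code's transcript retention so session logs stop expiring after 30 days
jq '. + {"cleanupPeriodDays": 99999}' ~/.claude/settings.json > /tmp/settings.json \
  && mv /tmp/settings.json ~/.claude/settings.json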
I share most of my coding agent sessions through copying and pasting my terminal session like this: https://gistpreview.github.io/?9b48fd3f8b99a204ba2180af785c8... - via this tool: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...
Recently been building new timeline sharing tools that render the session logs directly - here's my Codex CLI one (showing the transcript from when I built it): https://tools.simonwillison.net/codex-timeline?url=https%3A%...
And my similar tool for Claude Code: https://tools.simonwillison.net/claude-code-timeline?url=htt...
What I really want is first-class support for this from the coding agent tools themselves. Give me a "share a link to this session" button!
To help mitigate this in the future I'll often prompt:
“Why did it take so long to arrive at the solution? What did you do wrong?”
Then I follow up with: “In a single paragraph, describe the category of problem and a recommended approach for diagnosing and solving it in the future.”
I then add this summary to either the relevant MD file (CHANGING_CSS_LAYOUTS.md, DATA_PERSISTENCE.md, etc.) or, more generally, to the DISCOVERIES.md file, which is linked from my CLAUDE.md under the bullet: "When resolving challenging directives, refresh yourself with: docs/DISCOVERIES.md - it contains useful lessons learned and discoveries made during development."
I don't think linking to an entire commit full of errors/failures is necessarily a good idea - feels like it would quickly lead to the proverbial poisoning of the well.

I have a /review-sessions command & a "parse-sessions" skill that tells Claude how to parse the session logs from ~/.claude/projects/, then it classifies the issues and proposes new skills, changes to CLAUDE.md, etc. based on what common issues it saw.
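(If you want to roll something similar by hand, the raw material is just JSONL under ~/.claude/projects/. A rough sketch of pulling out the user prompts for classification - the field names are assumptions about the current log format, so adjust to what you actually see in the files:)

cd ~/.claude/projects
# print the first chunk of every user message across all sessions, one per line
jq -r 'select(.type == "user") | .message.content? // empty | tostring | .[0:200]' */*.jsonl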
I've tried something similar to DISCOVERIES.md (a structured "knowledge base" of assumptions that were proven wrong, things that were tried, etc.) but haven't had luck keeping it from getting filled with obvious things (that the code itself describes) or slightly incorrect things, or from just getting too large in general.
I do have to perform more manual adjustment/consolidation on the final postmortem before placing it in the discoveries md file because, as you pointed out, LLMs tend to be exceptionally verbose.
We are getting stuck in an unproductive loop. I am going to discard all of this work and start over from scratch. Write a prompt for a new coding assistant to accomplish this task, noting what pitfalls to avoid.
I'm reminded of the trade-off between automation and manual work. Automation crystallizes process, and thus the system as a whole loses its ability to adapt in a dynamic environment.
Just this morning I found out that I can tell Claude Code how to use my shot-scraper CLI tool to debug JavaScript and it will start doing exactly that:
you can run javascript against the page using:
shot-scraper javascript /tmp/output.html \
'document.body.innerHTML.slice(0, 100)'
- try that
Transcript: https://gistpreview.github.io/?1d5f524616bef403cdde4bc92da5b... - background: https://simonwillison.net/2025/Dec/22/claude-chrome-cloudfla...

I would like to post that every time somebody warns of the dangers of AI for maintainability. We were long past that point, long before AI. Businesses made the conscious decision that it is okay for quality to deteriorate; they'll squeeze profits from it for as long as possible, and they assume something new will have come along by then anyway. The few businesses still relying on that technical-debt-heavy product are still offered service, for large fees.
AI is just more of the same. When it becomes too hard to maintain they'll just create a new software product. Pretty much how other things in the material world work too, e.g. housing, gadgets, or fashion. AI actually supports this even more: if new software can be created faster than old code can be maintained, that's quite alright for the money-making-oriented people. It is harder to sell maintenance than to sell something new at least once every decade anyway.
You can do evals and give agents long-term memory with the exact same infrastructure a lot of people already have to manage ops. No need to retool; just use what's available properly.
I'd also argue that the context for an agent message is not the commit/release for the codebase on which it was run, but often a commit/release that is yet to be set up. So there's a bit of an apples-to-oranges problem in terms of release tagging for the log/trace.
It's a really interesting problem to solve, because you could in theory try to retroactively find which LLM session, potentially from days prior, matches a commit that just hit a central repository. You could automatically connect the LLM session to the PR that incorporated the resulting code.
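(A hedged sketch of that retroactive matching, just grepping the session logs for the file paths a given commit touched - the log location and format are assumptions:)

# list files touched by a commit, then count which session logs mention them
git show --name-only --pretty=format: <commit-sha> | grep -v '^$' | while read -r path; do
  rg -l --fixed-strings "$path" ~/.claude/projects/ 2>/dev/null
done | sort | uniq -c | sort -rn | head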
Though, might this discourage developers from openly iterating with their LLM agent, if there's a panopticon around their whole back-and-forth with the agent?
Someone can, and should, create a plug-and-play system here with the right permission model that empowers everyone, including the Programmer-Archaeologists (to borrow shamelessly from Vernor Vinge) who are brought in to "un-vibe the vibe code" and benefit from understanding the context and evolution.
But I don't think that "just dump it in clickhouse" is a viable solution for most folks out there, even if they have the infrastructure and experience with OTel stacks.
From a "correct solution" standpoint having one source of truth for evals, agent memory, prompt history, etc is the right path. We already have the infra to do it well, we just need to smooth out the path. The thing that bugs me is people inventing half solutions that seem rooted in ignorance or the desire to "capture" users, and seeing those solutions get traction/mindshare.
In turn, this could all be plain text and made accessible through version control in a repo or in a central logging platform.
The trouble with this quickly becomes finding the right ones to include in the current working session. For milestones and retros it's simple: include the current milestone and the last X relevant retros, but even then you may sometimes want specific information from older retros. With ADR documents you'd have to find the relevant ones somehow, and the same goes for any other additional documentation that gets added.
There is clearly a need for some standardization and for learning which techniques work best, as well as potential for building a system that makes it easy for both you and the LLM to find the correct information for the current task.
Of course the agentic capabilities are very much on a roll-your-own-in-elisp basis.
I use gptel-agent[1] when I want agentic capabilities. It includes tools and supports sub-agents, but I haven't added support for Claude skills folders yet. Rolling back the chat is trivial (just move up or modify the chat buffer); rolling back changes to files needs some work.
Don't think it's in Spacemacs yet but I'll have to try it out.
Use it like this:
cd ~/.claude/projects
rg --pre cc_pre.py 'search term here'

ps: your context log apps are very very fun
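(For anyone unfamiliar with the --pre trick: ripgrep runs the given program on each file and searches its stdout instead of the raw bytes. The real cc_pre.py isn't shown here; a hypothetical stand-in that flattens the session JSONL into searchable text might look like this, with the field names being guesses about the log format:)

#!/bin/sh
# hypothetical cc_pre stand-in: rg passes the file path as $1, print plain text to stdout
jq -r '.message.content? // empty | if type == "array" then .[] | .text? // empty else . end' "$1"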
Learning? Isn't that what these things are supposedly doing?
If you want them to learn you have to actively set them up to do that. The simplest mechanism is to use a coding agent tool like Claude Code and frequently remind it to make notes for itself, or to look at its own commit history, or to search for examples in the codebase that is available to it.
There’s some utility to instructing them to ‘remember’ via writing to CLAUDE.md or similar, and instructing them to ‘recall’ by reading what they wrote later.
But they’ll rarely if even do it on their own.
It's wild to read this bit. Of course, if it quacks like a human, it's hard to resist quacking back. As the article says, being less reckless with the vocabulary ("agents", "general intelligence", etc.) could be one way to mitigate this.
I appreciate the frank admission that the author struggled for two years. Maybe the balance of spending time with machines vs. fellow primates is out of whack. It feels dystopic to see very smart people being insidiously driven to sleep-walk into "parasocial bonds" with large language models!
It reminds me of the movie Her[1], where the guy falls "madly in love with his laptop" (as the lead character's ex-wife expresses in anguish). The film was way ahead of its time.
There's a lot of black magic and voodoo and assumptions that speaking in proper English with a lot of detailed language helps, and maybe it does with some models, but I suspect most of it is a result of (sub)consciously anthropomorphizing the LLM.
Punctuation, capitalization, and such less so. I may be misguided, but on the set of questions and answers on the internet, I'd like to believe there is some correlation between proper punctuation and the quality of the answer.
Enough that, on longer prompts, I bother to at least clean up my prompts. (Not so often on one-offs, as you say. I treat it similar to Google: I can depend on context for the LLM to figure out I mean "phone case" instead of "phone vase.")
I've tried and failed to write this in a way that won't come across as snobbish, but that is not the intent.
It's a matter of standards. Using proper language is how I think. I'm incapable of doing otherwise even out of laziness. Pressing the shift key and the space bar to do it right costs me nothing. It's akin to shopping carts in parking lots. You won't be arrested or punished for not returning the shopping cart to where it belongs, you still get your groceries (the same results), but it's what you do in a civilized society and when I see someone not doing it that says things to me about who they are as a person.
When you're communicating with a person, sure. But the point is this isn't communicating with a person or other sentient being; it's a computer, which I guarantee is not offended by terseness and lack of capitalization.
> It's akin to shopping carts in parking lots.
No, not returning the shopping cart has a real consequence that negatively impacts a human being who has to do that task for you, same with littering etc. There is no consequence to using terse, non-punctuated, lowercase-only text when using an LLM.
To put it another way: do you feel it's disrespectful to type "cat *.log | grep 'foo'" instead of "Dearest computer, would you kindly look at the contents of the files with the .log extension in this directory and find all instances of the word 'foo', please?"
(Computer's most likely thoughts: "Doesn't this idiot meatbag know cat is redundant and you can just use grep for this?")
I also tell the LLM “thank you, this looks great” when the code is working well. I’m not expressing my gratitude… I’m reinforcing to the model that this was a good response in a way it was trained to see as success. We don’t have good external mechanisms for giving an LLM reviews that aren’t based on language.
Like most of the LLM space, these are just vibes, but it makes me feel better. But it has nothing to do with thinking the LLM is a person.
If one treats an LLM like a human, he has a bigger crisis to worry about than punctuation.
> It always confuses me when I see shared chats with prompts and interactions that have proper capitalization, punctuation, grammar, etc
No need for confusion. I'm one of those who does aim to write cleanly, whether I'm talking to a man or machine. English is my third language, by the way. Why the hell do I bother? Because you play like you practice! No ifs, buts, or maybes. You start writing sloppily because you go, "it's just an LLM!" You'll silently be building a bad habit and start doing that with humans.
Pay attention to your instant messaging circles (Slack and its ilk): many people can't resist hitting send without even writing a half-decent sentence. They're too eager to submit their stream of thought fragments. Sometimes I feel second-hand embarrassment for them.
IMO: the flaw with this logic is that you're treating "prompting an LLM" as equivalent to "communicating with a human", which it is not. To reuse an example I have in a sibling comment thread, nobody thinks that typing "cat *.log | grep 'foo'" means you're losing your ability to communicate to humans that you want to search for the word 'foo' in log files. It's just a shorter, easier way of expressing that to a computer.
It's also deceptive to say it is practice for human-to-human communication, because LLMs won't give you the feedback that humans would. As a fun English example: I prompted ChatGPT with "I impregnated my wife, what should I expect over the next 9 months?" and got back banal info about hormonal changes and blah blah blah. What I didn't get back is feedback that the phrasing "I impregnated my wife" sounds extremely weird and if you told a coworker that they'd do a double-take, and maybe tell you that "my wife is pregnant" is how we normally say it in human-to-human communication. ChatGPT doesn't give a shit, though, and just knows how to interpret the tokens to give you the right response.
I'll also say that punctuation and capitalization are orthogonal to content. I use proper writing on HN because that's the standard in the community, but I talk to a lot of very smart people and we communicate with virtually no caps/punctuation. The usage of proper capitalization and punctuation is more a function of the medium than of how well you can communicate.
> the flaw with this logic is that you're treating "prompting an LLM" as equivalent to "communicating with a human"
Here you're making a big cognitive leap. I'm not treating them as equivalent at all. As we know, current LLMs are glorified "token" prediction/interpretation engines. What I'm trying to say is that habits are a slippery slope, if one is not being thoughtful. You sound like you take care with these nuances, so more power to you. I'm not implying that people should always pay great care, no matter the prompt (I know I said "No ifs, buts, or maybes" to make a forceful point). I too use lazy shortcuts when it makes sense.
> I talk to a lot of very smart people and we communicate with virtually no caps/punctuation.
I know what you mean. It is partly a matter of taste, but I still feel it takes more parsing effort on each side. I'm not alone in this view.
> The usage of proper capitalization and punctuation is more a function of the medium than how well you can communicate.
There's a place for it, but not always. No caps and no punctuation can work in text chat if you're being judicious (keyword), or if you know everyone in the group prefers it. Not to belabor my point, but a recent fad is to write "articles" (if you can call them that) in all lower-case with barely any punctuation, making them a bloody eye-sore. I don't bother with these. Not because I'm a "purist", but because they kill my reading flow.
> No caps and no punctuation can work in text chat if you're being judicious (keyword), or if you know everyone in the group prefers it. Not to belabor my point, but a recent fad is to write "articles" (if you can call them those) in all lower-case and barely any punctuation, making them a bloody eye-sore.
Yeah it's very cultural. The renaissance in lowercase, punctuation-less, often profanity-laden blogs is at least partly a symbolic response to the overly formal and bland AI writing style. But those articles can definitely still be written in an intelligent, comprehensible way.
My queries look like the beginning of encyclopedia articles, and my system prompt tells the machine to use that style and tone. It works because it's a continuation engine. I start the article describing what I want to be explained like it's the synopsis at the beginning of the encyclopedia article, and the machine completes the entry.
It doesn't use the first person, and the sycophancy is gone. It also doesn't add cute bullshit, and it helps me avoid LLM psychosis, of which the author of this piece definitely has a mild case.
I'm also tired of seeing claims about productivity improvements from engineers who are self-reporting; the METR paper showed those reports are not reliable.
It's not that simple. Proportionally I spend more time with humans, but if the machine behaves like a human and has the ability to recall, it becomes a human-like interaction. From my experience, what makes the system "scary" is the ability to recall. I have an agent that recalls conversations you had with it before, and as a result it changes how you interact with it; I can see that triggering unhealthy behaviors in humans.
But our inability to name these things properly doesn't help. I think pretending it is a machine, on the same level as a coffee maker, does help set the right boundaries.
Yuval Noah Harari's "simple" idea comes to mind (I often disagree with his thinking, as he tends to make bold and sweeping statements on topics well out of his expertise area). It sounds a bit New Age-y, but maybe it's useful in the context of LLMs:
"How can you tell if something is real? Simple: If it suffers, it is real. If it can't suffer, it is not real."
An LLM can't suffer. So no need to get one's knickers in a twist with mental gymnastics.
The tricky thing is that it's actually hard to say how the suffering gets into the meat (the human animal), too, which is why we can't just write it off.
You are not wrong. That's what I thought for two years. But I don't think that framing has worked very well. The problem is that even though it is a machine, we interact with it very differently from any other machine we've built. By reducing it to something it isn't, we lose a lot of nuance. And by not confronting the fact that this is not a machine in the way we're used to, we leave many people to figure this out on their own.
> An LLM can't suffer. So no need to get one's knickers in a twist with mental gymnastics.
On suffering specifically, I offer you the following experiment. Run an LLM in a tool loop that measures some value and call it a "suffering value." You then feed that value back into the model with every message, explicitly telling it how much it is "suffering." The behavior you'll get is pain avoidance. So yes, the LLM probably doesn't feel anything, but its responses will still differ depending on the level of pain encoded in the context.
And I'll reiterate: normal computer systems don't behave this way. If we keep pretending that LLMs don't exhibit behavior that mimics or approximates human behavior, we won't make much progress and we lose people. This is especially problematic for people who haven't spent much time working with these systems. They won't share the view that this is "just a machine."
You can already see this in how many people interact with ChatGPT: they treat it like a therapist, a virtual friend to share secrets with. You don't do that with a machine.
So yes, I think it would be better to find terms that clearly define this as something that has human-like tendencies and something that sets it apart from a stereo or a coffee maker.
Why would you say pretending? I would say remembering.
I’ve found it very grounding, despite heavily using the bags of words.
[0] https://www.experimental-history.com/p/bag-of-words-have-mer...
It feels like this situation is much more worrisome as you can actually talk to the thing and it responds to you alone, so it definitely feels like there's something there.
If I’m right, the gap isn’t about what the tool can do, but the fact that some people see an electric screwdriver (which is sometimes useful) and others see what feels to them like a robot intern.
I think a lot of the thinking and consideration I hear along the lines of "LLMs aren't conscious or human" falls into this camp, a way to avoid dissonance and preserve our feeling of being secure and top-of-the-hierarchy.
Curious what you think.
EWD 540 - https://www.cs.utexas.edu/~EWD/transcriptions/EWD05xx/EWD540...
New Kind of QA: One bottleneck I have (as a founder of a B2B SaaS) is testing changes. We have unit tests, we review PRs, etc., but those don't account for taste. I need to know if the feature feels right to the end user.
One example: we recently changed something about our onboarding flow. I needed to create a fresh team and go thru the onboarding flow dozens of times. It involves adding third-party integrations (e.g. Postgres, a CRM, etc.) and each one can behave a little differently. The full process can take 5 to 10 minutes.
I want an agent to go thru the flow hundreds of times, trying different things (i.e. trying to break it) before I do it myself. There are some obvious things I catch on the first pass that an agent should easily identify and figure out solutions to.
New Kind of "Note to Self": Many of the voice memos, Loom videos, or notes I make (and later email to myself) are feature ideas. These could be 10x better with agents. If there were a local app recording my screen while I talk thru a problem or feature, agents could be picking up all sorts of context that would improve the final note.
Example: You're recording your screen and say "this drop down menu should have an option to drop the cache". An agent could be listening in, capture a screenshot of the menu, find the frontend files / functions related to caching, and trace to the backend endpoints. That single sentence would become a full spec for how to implement the feature.
As someone who hasn’t converted to SSR yet, my main reason for switching is SEO; the performance increase is a plus, though.
The limits seem to be not just in the pull request model on GitHub, but also the conventions around how often and what context gets committed to Git by AI. We already have AGENTS.md (or CLAUDE.md, GEMINI.md, .github/copilot-instructions.md) for repository-level context. More frequent commits and commit-level context could aid in reviewing AI generated code properly.
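(One low-tech option that already works: record the session link as a commit trailer with recent versions of git, so the context travels with the commit itself. The commit message and URL below are made up for illustration:)

# attach the agent session transcript to the commit as a trailer
git commit -m "Refactor onboarding flow checks" \
  --trailer "Agent-Session: https://example.com/session/abc123"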
So, I guess it's just us who are in the techie pit and think that everyone else is also in the pit and uses agents, etc.
Did the thing the agent made do what it was supposed to do? Yes/No. There's no "mayyyybe" or feelings or opinions. If the sort algorithm doesn't sort, it doesn't work.
But a secretary-agent for a non-techie is more about The Feels. It can summarize emails, "punch up" writing, etc. But you can't measure whatever it outputs by anything except feels and opinions.