Posted by vismit2000 17 hours ago
> "We collect self-reported familiarity with AI coding tools, but we do not actually measure differences in prompting techniques."
Many people drive cars without being able to explain how cars work. Or use devices like that. Or interact with people who's thinking they can't explain. Society works like that, it is functional, does not work by full understanding. We need to develop the functional part not the full understanding part. We can write C without knowing the machine code.
You can often recognize a wrong note without being able to play the piece, spot a logical fallacy without being able to construct the valid argument yourself, catch a translation error with much less fluency than producing the translation would require. We need discriminative competence, not generative.
For years I maintained a library for formatting dates and numbers (prices, ints, ids, phones), it was a pile of regex but I maintained hundreds of test cases for each type of parsing. And as new edge cases appeared, I added them to my tests, and iterated to keep the score high. I don't fully understand my own library, it emerged by scar accumulation. I mean, yes I can explain any line, but why these regexes in this order is a data dependent explanation I don't have anymore, all my edits run in loop with tests and my PRs are sent only when the score is good.
Correctness was never grounded in understanding the implementation. Correctness was grounded in the test suite.
But the fundamentals all cars behave the same way all the time. Imagine running a courier company where sometimes the vehicles take a random left turn.
> Or interact with people who's thinking they can't explain
Sure but they trust those service providers because they are reliable . And the reason that they are reliable is that the service providers can explain their own thinking to themselves. Otherwise their business would be chaos and nobody would trust them.
How you approached your library was practical given the use case. But can you imagine writing a compiler like this? Or writing an industrial automation system? Not only would it be unreliable but it would be extremely slow. It's much faster to deal with something that has a consistent model that attempts to distill the essence of the problem, rather than patching on hack by hack in response to failed test after failed test.
I think being a programmer is closer to being an aircraft pilot than a car driver.
But isn't the corrections of those errors that are valuable to society and get us a job?
People can tell they found a bug or give a description about what they want from a software, yet it requires skills to fix the bugs and to build software. Though LLMs can speedup the process, expert human judgment is still required.
If you know that you need O(n) "contains" checks and O(1) retrieval for items, for a given order of magnitude, it feels like you've all the pieces of the puzzle needed to make sure you keep the LLM on the straight and narrow, even if you didn't know off the top of your head that you should choose ArrayList.
Or if you know that string manipulation might be memory intensive so you write automated tests around it for your order of magnitude, it probably doesn't really matter if you didn't know to choose StringBuilder.
That feels different to e.g. not knowing the difference between an array list and linked list (or the concept of time/space complexity) in the first place.
When it comes to fundamentals, I think it's still worth the investment.
To paraphrase, "months of prompting can save weeks of learning".
Tests only cover cases you already know to look for. In my experience, many important edge cases are discovered by reading the implementation and noticing hidden assumptions or unintended interactions.
When something goes wrong, understanding why almost always requires looking at the code, and that understanding is what informs better tests.
Instead, just learning concepts with AI and then using HI (Human Intelligence) & AI to solve the problem at hand—by going through code line by line and writing tests - is a better approach productivity-, correctness-, efficiency-, and skill-wise.
I can only think of LLMs as fast typists with some domain knowledge.
Like typists of government/legal documents who know how to format documents but cannot practice law. Likewise, LLMs are code typists who can write good/decent/bad code but cannot practice software engineering - we need, and will need, a human for that.
> AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively supervise AI remains unclear. Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation -- particularly in safety-critical domains.
I assistance produces significant productivity gains across professional domains, particularly for novice workers.
We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average.
Are the two sentences talking about non-overlapping domains? Is there an important distinction between productivity and efficiency gains? Does one focus on novice users and one on experienced ones? Admittedly did not read the paper yet, might be clearer than the abstract.
The research question is: "Although the use of AI tools may improve productivity for these engineers, would they also inhibit skill formation? More specifically, does an AI-assisted task completion workflow prevent engineers from gaining in-depth knowledge about the tools used to complete these tasks?" This hopefully makes the distinction more clear.
So you can say "this product helps novice workers complete tasks more efficiently, regardless of domain" while also saying "unfortunately, they remain stupid." The introductiory lit review/context setting cites prior studies to establish "ok coders complete tasks efficiently with this product." But then they say, "our study finds that they can't answer questions." They have to say "earlier studies find that there were productivity gains" in order to say "do these gains extend to other skills? Maybe not!"
I learned a lot more in a short amount of time than I would've stumbling around on my own.
Afaik its been known for a long time that the most effective way of learning a new skill, is to get private tutoring from an expert.
But that's what "impairs learning" means.
Personally, I’ve never been learning software development concepts faster—but that’s because I’ve been offloading actual development to other people for years.
I don't necessarily think that writing more code means you get better coder. I automate nearly all my tests with AI and large chunk of bugfixing as well. I will regularly ask AI to propose an architecture or introduce a new pattern if I don't have a goal in my mind. But in these last 2 examples, I will always redesign the entire approach to be what I consider a better, cleaner interface. I don't recall AI ever getting that right, but must admit I asked AI in the first place cos I didn't know where to start.
If I had to summarize, I would say to let AI implement coding, but not API design/architecture. But at the same time, you can only get good at those by knowing what doesn't work and trying to find a better solution.
How exactly? Do you tell the agent "please write a test for this" or do you also feed it some form of spec to describe what the tested thing is expected to do? And do these tests ever fail?
Asking because the first option essentially just sets the bugs in stone.
Wouldn't it make sense to do it the other way around? You write the test, let the AI generate the code? The test essentially represents the spec and if the AI produces sth which passes all your tests but is still not what you want, then you have a test hole.
I care more about the code than the tests. Tests are verification of my work. And yes, there is a risk of AI "navigating around" bugs, but I found that a lot of the time AI will actually spot a bug and suggest a fix. I also review each line to look for improvements.
Edit: to answer your question, I will typically ask it to test a specific test case or few test cases. Very rarely will I ask it to "add tests everywhere". Yes, these tests frequently fail and the agent will fix on 2nd+ iteration after it runs the tests.
One more thing to add is that a lot of the time agent will add a "dummy" test. I don't really accept those for coverage's sake.
A follow-up:
> I care more about the code than the tests.
Why is that? Your (product) code has tests. Your test (code) doesn't. So I often find that I need to pay at least as much attention to my tests to ensure quality.
I find tests easier to write. Your function(s) may be hundred lines long, but the test is usually setup, run, assert.
I don't have much experience beyond writing unit/integration tests, but individual test cases seem to be simpler than the code they test (linear, no branches).
I can iterate on entire approaches in the same amount of time it would have taken to explore a single concept before.
But AI is an amplifier of human intent- I want a code base that’s maintainable, scalable, etc., and that’s a different than YOLO vibe coding. Vibe engineering, maybe.
Then also switching arch approaches quickly when i find some code strategies that are not correctly ergonomic. Splitting of behaviors and other refactors are much lower cost now.
Sometimes I wonder if people who make statements like this have ever actually casually browsed Twitter or reddit or even attempted a "large" application themselves with SOTA models.
An example: I vibecoded myself a Toggl Track clone yesterday - it works amazingly but if I had to rewrite e.g. the PDF generation code by myself I wouldn't have a clue!
Previous title: "Anthropic: AI Coding shows no productivity gains; impairs skill development"
The previous title oversimplified the claim to "all" developers. I found the previous title meaningful while submitting this post because most of the false AI claims of "software engineer is finished" has mostly affected junior `inexperienced` engineers. But I think `junior inexperienced` was implicit which many people didn't pick.
The paper makes a more nuanced claim that AI Coding speeds up work for inexperienced developers, leading to some productivity gains at the cost of actual skill development.
> Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average.
The library in question was Python trio and the model they used was GPT-4o.
I found that Claude wasn't too great at first at it and returned a lot of hallucinated methods or methods that existed in Pandas but not Polars. I chalk this up to context blurring and that there's probably a lot less Polars code in the training corpus.
I found it most useful for quickly pointing me to the right documentation, where I'd learn the right implementation and then use it. It was terrible for the code, but helpful as a glorified doc search.
Like the architecture work and making good quality specs, working on code has a guiding effect on the coding agents. So in a way, it also benefits to clarify items that may be more ambiguous in the spec. If I write some of the code myself, it will make fewer assumptions about my intent when it touches it (especially when I didn't specify them in the architecture or if they are difficult to articulate in natural language).
In small iterations, the agent checks back for each task. Because I spend a lot of time on architecture, I already have a model in my mind of how small code snippets and feature will connect.
Maybe my comfort with reviewing AI code comes form spending a large chunk of my life reverse engineering human code, to understand it to the extent that complex bugs and vulnerabilities emerge. I've spent a lot of time with different styles of code writing from awful to "this programmer must have a permanent line to god to do this so elegantly". The models is train on that, so I have a little cluster of neurons in my head that's shaped closely enough to follow the model's shape.
1. AI help produced a solution only 2m faster, and
2. AI help reduced retention of skill by 17%
(thankfully market dynamics and OSS alternatives will probably stop this but it's not a guarantee, you need like at least six viable firms before you usually see competitive behavior)
I don't think that's true? From what I understand most labs are making money from subscription users (maybe not if you include training costs, but still, they're not selling at a loss).
>(thankfully market dynamics and OSS alternatives will probably stop this but it's not a guarantee, you need like at least six viable firms before you usually see competitive behavior)
OpenAI is very aggressive with the volume of usage you can get from Codex, Google/DeepMind with Gemini. Anthropic reduced the token price with the latest Opus release (4.5).