But one thing that has scared me the most is society's broader trust in LLM output. I believe that for software engineers it's really easy to see whether it's being useful or not -- we can just run the code and see if the output is what we expected; if not, iterate and continue. There's still a professional looking at what it produces.
For more day-to-day usage by the general public, on the other hand, it's getting really scary. I've had multiple members of my family using AI to ask for medical advice, life advice, and other things where I still see hallucinations daily, yet the answers are so convincing that it's hard for them not to trust them.
I have already seen fake quotes, fake investigations, and fake news spread by LLMs that have affected decisions (maybe not crucial ones yet, but time will tell), and that's a danger most software engineers just gloss over.
Accountability is a big asterisk that everyone seems to ignore
That is not the reality we're living in. Doctors barely give you 5 minutes even if you get an appointment days or weeks in advance. There is just nobody to ask. The alternatives today are
1) Don't ask, rely on yourself, definitely worse than asking a doctor
2) Ask an LLM, which gets you 80-90% of the way there.
3) Google it and spend hours sifting through sponsored posts and scams, often worse than relying on yourself.
The hallucinations that happen are massively outweighed by the benefits people get by asking them. Perfect is the enemy of good enough, and LLMs are good enough.
Much more important also is that LLMs don't try to scam you, don't try to fool you, don't look out for their own interests. Their mistakes are not intentional. They're fiduciaries in the best sense, just like doctors are, probably even more so.
1. People around us
2. TV and newspapers
3. Random people on the internet and their SEO-optimized web pages
Books and experts have been less popular. LLMs are an improvement.
Unless somebody is using them to generate authoritative-sounding human-sounding text full of factoids and half-truths in support of a particular view.
Then it becomes about who can afford more LLMs and more IPs to look like individual users.
When the appreciable-fraction-of-GDP money tap turns off, there's going to be enormous pressure to start putting a finger on the scale here.
And AI spew is theoretically a fantastic place to insert almost subliminal contextual adverts in a way that traditional advertising can only dream about.
Imagine if it could start gently shilling a particular brand of antidepressant if you started talking to it about how you're feeling lonely and down. I'm not saying you should do that, but people definitely do.
And then multiply by every question you do ask. Ask whether you need new tyres: "Yes, you should absolutely change tyres every year, whether noticeably worn or not. KwikFit are generally considered the best place to have this done. Of course I know you have a Kia Picanto - you should consider that actually a Mercedes C class is up to 200% lighter on tyre wear. I have searched and found an exclusive 10% offer at Honest Jim's Merc Mansion, valid until 10pm. Shall I place an order?"
Except it'll be buried in a lot more text and set up with more subtlety.
Yeah, back in the day before monetization Internet pages were informative, reliable and ad-free too.
Even most hobby AI projects mostly seem to have an eye on being a side hustle or CV buffing.
Perhaps it's because even in the 90s you could serve a website for basically free (once you had the server). AI today has a noticeable per-user cost.
This is untrue. There's a huge landscape of locally-hosted AI stuff, and they're actually doing real interesting research. The problem is that 99% of it is pornography-focused, so understandably it's very underground.
I despise all of this. For the moment though, before all this is implemented, it's perhaps a brief golden age of LLM usefulness. (And I'm sure LLMs will remain useful for many things, but there will be entire categories where they're ruined by pay-to-play, the same as happened with Google search.)
Doctors already shill for big pharma. There are trust issues all the way down.
Nonetheless, we must somehow build trust in others and denounce the undeserving. Some humans deserve trust. Will these AI models?
This is not the norm worldwide.
Big pharma corps are multinational powerhouses, who behave like all other big corps, doing whatever they can to increase profits. It may not be direct product placement, kickbacks, or bribery on the surface, but how about an expense-paid trip to a sponsored conference or a small research grant? Soft money gets their foot in the door.
The Internet was 80%-90% accurate to begin with.
Then the Internet became worth money. And suddenly that accuracy dropped like a stone.
There is no reason to believe that ML/AI isn't going to speedrun that process.
There is zero chance LLMs will just stroll into this space with "Kinda sorta mostly right" answers, even with external verification.
Doctors will absolutely resist this, because it means the impending end of their careers. Insurers don't care about cost savings because insurers and care providers are often the same company.
Of course true AGI will eventually - probably quite soon - become better at doctoring than many doctors are.
But that doesn't mean the tech will be rolled out to the public without a lot of drama, friction, mistakes, deaths, and traumatic change.
The western world is already solving this, but not through letting LLMs prescribe (because that's a non-starter for liability reasons).
Instead, nurses and allied health professionals are getting prescribing rights in their fields (under doctors, but still it scales much better).
This is so naive, especially since both Google and OpenAI openly confess to manipulating the data for their own agenda (ads, but not only ads).
AI is a skilled liar
You can always pride yourself on playing with fire, but the more humble attitude would be to avoid it at all costs.
LLMs don't try to scam/fool you, LLM providers do.
Remember how Grok bragged that Musk had the “potential to drink piss better than any human in history” and was the “ultimate throat goat,” whose “blowjob prowess edges out” Donald Trump’s. Grok also posited that Musk was more physically fit than LeBron James, and that he would have been a better recipient of the 2016 porn industry award than porn star Riley Reid.
I had a chuckle reading all of these.
They follow their corporations instead. Just look at the status-quoism of the free "Google AI" and the constant changes in Grok, where xAI is increasingly locking down Grok, perhaps to stay in line with EU regulations. But Grok is also increasingly pro-billionaire.
Copilot was completely locked down on anything political before the 2024 election.
They all scam you according to their training and system prompts. Have you seen the minute change in the system prompt that led to MechaHitler?
Hallucinations and sycophancy are still an issue, 80-90% is being generous I think.
I know these aren't issues with the LLM itself, but rather with the implementation and the companies behind them (since there are open models as well). But what stops LLMs from being enshittified by corporate needs?
I've seen this very recently with Grok: people were asking trolley-problem-like questions comparing Elon Musk to anything, and Grok chose Elon Musk most of the time, probably because it is embedded in the system prompt or training [1].
[1] https://www.theguardian.com/technology/2025/nov/21/elon-musk...
> where every person can ask a doctor their questions 10 times a day and instantly get an accurate response.
Why in god's name would you need to ask a doctor 10 questions every day? How is this in any way germane to this issue?
In any first-world country you can get a GP appointment free of charge either on the day or with a few days' wait, depending on the urgency. Not to mention emergency care / 112 any time day or night if you really need it. This exists and has existed for decades in most vaguely social-democratic countries in the world (but not only those). So you can get professional help from someone, there's no (absurd) false choice between either "asking the stochastic platitude generator" and "going without healthcare".
But I know right, a functioning health system with the right funding, management, and incentives! So boring! Yawn yawn, not exciting. GP practices don't get trillions of dollars in VC money.
> Ask an LLM, which gets you 80-90% of the way there.
This is such a ridiculous misrepresentation of the current state of LLMs that I don't even know how to continue a conversation from here.
Are you really under the assumption that this is a first-world perk?
The reality to compare against, though, is not one where people regularly get access to true networking experts (though I'm sure it feels like that when the holidays come around!). Compared to the random blogs and search results people are likely to come across on their own, the LLM is usually a decent step up. I'm reminded how I'd know of some very specific forums, email lists, or chat groups to go to for real expert advice on certain networking questions, e.g. issues with certain Wi-Fi radios on embedded systems, yet what I see people sharing (even among technical audiences like HN) are blog posts from a random guy making extremely unhelpful recommendations and completely invalid claims, getting upvotes and praise.
With things like asking AI for medical advice... I'd love it if everyone had unlimited time with an unlimited pool of the world's best medical experts as the standard. What we actually have is a world where people already go to Google and read whatever they want to read (which is most often not the quality stuff by experts, because we're not good at recognizing that even when we can find it), because they either doubt the medical experts they talk to or the good medical experts are too expensive to get enough time with. From that perspective, I'm not so sure people asking AI for medical advice is actually a bad thing so much as a highlight of how hard and concerning it already is for most people to get time with, or trust, medical experts.
To take it to an extreme, it's basically saying "people already get little or bad advice, we might as well give them some more bad advice."
I simply don't buy it.
That said, it definitely feels as though keeping a coherent picture of what is actually happening is getting harder, which is scary.
The concern, I think, is that for many that “discard function” is not, “Is this information useful?”. Instead: “Does this information reinforce my existing world view?”
That feedback loop and where it leads is potentially catastrophic at societal scale.
Are LLMs "democratized" yet, though? If not, then it's just-as-likely that LLMs will be steered by their owners to reinforce an echo-chamber of their own.
For example, what if RFK Jr launched an "HHS LLM" - what then?
As much as this is true, and doctors can certainly profit (here in my country they don't get any kind of sponsor money AFAIK, other than charging very high rates), there is still accountability.
We have built a society based on rules and laws; if someone does something that harms you, you can follow a path to at least hold someone accountable (or try to).
The same cannot be said about LLMs.
I mean, there is some accountability if they go wildly off the rails, but in general if the doctor gives a prognosis based on a tiny amount of the total corpus of evidence, they are covered. Works well if you have the common issue, but it can quickly go wrong if you have the uncommon one.
This is not the same scale of problem.
Elina listened in on the speech and was surprised :)...
https://www.aftonbladet.se/nyheter/a/gw8Oj9/ebba-busch-anvan...
Ebba apologized, great, but it raises the question: how many fake quotes and how much misguided information are already being acted on? If crucial decisions can be made based on incorrect information, they will be. Murphy's law!
There is a vast gap between the output happening to be what you expect and code being actually correct.
That is, in a way, also the fundamental issue with LLMs: They are designed to produce “expected” output, not correct output.
The output is correct but only for one input.
The output is correct for all inputs but only with the mocked dependency.
The output looks correct but the downstream processors expected something else.
The output is correct for all inputs with real-world dependencies and is in the correct structure for downstream processors, but it never gets registered with the schema, gets filtered out, and it all gets deleted in prod.
While implementing the correct function, you fail to notice that the correct-in-every-way output doesn't conform to that thing Tom said, because you didn't code it yourself but instead let the LLM do it. The system works flawlessly with itself but the final output fails regulatory compliance.
I didn't mean they get it right the first time, or that it is correct; I mean that you can run it and test it to see if it does what you want in the way you want.
The same cannot be said for other topics like medical advice, life advice, etc.
The point is how verifiable the output the LLM gives is, and therefore how useful it is.
They slow down software delivery on aggregate, so no. They have a therapeutic effect on developer burnout though. Not sure it's worth it, personally. Get a corporate ping-pong table or something like that instead.
Humans have a long history of being prone to believe and parrot anything they hear or read, from other humans, who may also just be doing the same, or from snake-oil salesmen preying on the weak, or woo-woo believers who aren't grounded in facts or reality. Even trusted professionals like doctors can get things wrong, or have conflicting interests.
If you're making impactful life decisions without critical thinking and research beyond a single source, that's on you, no matter if your source is human or computer.
Sometimes I joke that computers were a mistake, and in the short term (decades), maybe they've done some harm to society (though they didn't program themselves), but in the long view, they're my biggest hope for saving us from ourselves, specifically due to accountability and transparency.
So the number of anti-vaxxers is going to plummet drastically in the following decade, I guess.
> Depends if they use lobotomized bots like Grok...
What are you on about?
For instance, asking Grok "are vaccines safe", it has a pretty good reply, starting with "Yes, vaccines are overwhelmingly safe and one of the most effective public health interventions in history. Extensive scientific evidence from decades of research, including rigorous clinical trials, post-licensure monitoring, and systematic reviews by organizations like the WHO, CDC, NIH, and independent bodies, shows that the benefits of vaccination far outweigh the risks for individuals and populations." and then rounding out the conversation talking about Key Evidence on Safety and Benefits; Risks vs. Benefits; Addressing Concerns.
https://grok.com/share/c2hhcmQtNA_69e20553-2558-46be-9f21-6a...
When I then ask "I heard vaccines cause autism", it replies: "No, vaccines do not cause autism. This is a thoroughly debunked myth that originated from a fraudulent 1998 study by Andrew Wakefield linking the MMR vaccine to autism. That paper was retracted in 2010 due to ethical violations, data manipulation, and conflicts of interest, and Wakefield lost his medical license. Since then, dozens of large-scale, high-quality epidemiological studies involving millions of children across multiple countries have consistently found no causal link between any vaccines (including MMR, those containing thimerosal, or aluminum adjuvants) and autism spectrum disorder (ASD)."
Seems pretty good to me.
With enough conviction I'm sure one could more or less jailbreak Grok to say whatever you wanted about anything, but at least along the way Grok is providing better refutations than the average human this hypothetical person would otherwise talk to.
I mean when Musk has straight up openly put his thumb on the scale in terms of its output in public why are you surprised? Trust is easily lost and hard to gain back.
I'm not a fan of this phrasing. Use of the terms "resistance" and "skeptics" implies they were wrong. It's important we don't engage in revisionist history that allows people in the future to say "Look at the irrational fear programmers had of AI, which turned out to be wrong!" The change occurred because LLMs are useful for programming in 2025 and the earliest versions weren't for most programmers. It was the technology that changed.
It's easy to declare "victory" when you're only talking about the maximalist position on one side ("LLMs are totally useless!") vs the minimalist position on the other side ("LLMs can generate useful code"). The AI maximalist position of "AI is going to become superintelligent and make all human work and intelligence obsolete" has certainly not been proven.
The LLM skeptics claim LLM usefulness is an illusion. That the LLMs are a fad, and they produced more problems than they solve. They cite cherry picked announcements showing that LLM usage makes development slower or worse. They opened ChatGPT a couple times a few months ago, asked some questions, and then went “Aha! I knew it was bad!” when they encountered their first bad output instead of trying to work with the LLM to iterate like everyone who gets value out of them.
The skeptics are the people in every AI thread claiming LLMs are a fad that will go away when the VC money runs out, that the only reason anyone uses LLMs is because their boss forces them to, or who blame every bug or security announcement on vibecoding.
I also believe the current generations of LLMs (transformers) are technical dead ends on the path to real AGI, and the more time we spend hyping them, the less research/money gets spent on discovering new/better paths beyond transformers.
I wish we could go back to complaining about Kubernetes, focusing on scaling distributed systems, and solving more interesting problems than comparing winnings on a stochastic slot machine. I wish our industry was held to higher standards than jockeying bug-ridden MVP code out the door as quickly as possible.
Almost no human could port 3000 lines of Python to JavaScript and test it in their spare time while watching TV and decorating a Christmas tree. Almost no human you could employ would do a good job of it for $6/hour and have it done in 5 hours. How is that "ignorance or a sense of desperation" and "not actually useful"?
But this is cherry-picking.
In the grand scheme of the work we all collectively do, very few programming projects entail something even vaguely like generating an Nth HTML parser in a language that already has several wildly popular HTML parsers--or porting that parser into another language that has several wildly popular HTML parsers.
Even fewer tasks come with a library of 9k+ tests to sharpen our solutions against. (Which itself wouldn't exist without experts trodding this ground thoroughly enough to accrue them.)
The experiments are incredibly interesting and illuminating, but I feel like it's verging on gaslighting to frame them as proof of how useful the technology is when it's hard to imagine a more favorable situation.
Granted, but this reads a bit like a headline from The Onion: "'Hard to imagine a more favourable situation than pressing nails into wood' said local man unimpressed with neighbour's new hammer".
I think it's a strong enough example to disprove "they're an interesting phenomenon that people have convinced themselves MUST BE USEFUL ... either through ignorance or a sense of desperation". Not enough to claim they are always useful in all situations or to all people, but I wasn't trying for that. You (or the person I was replying to) basically have to make the case that Simon Willison is ignorant about LLMs and programming, is desperate about something, or is deluding himself that the port worked when it actually didn't, to keep the original claim. And I don't think you can. He isn't hyping an AI startup, he has no profit motive to delude him. He isn't a non-technical business leader who can't code being baffled by buzzwords. He isn't new to LLMs and wowed by the first thing. He gave a conference talk showing that LLMs cannot draw pelicans on bicycles so he is able to admit their flaws and limitations.
> "But this is cherry-picking."
Is it? I can't use an example where they weren't useful or failed. It makes no sense to try and argue how many successes vs. failures, even if I had any way to know that; any number of people failing at plumbing a bathroom sink don't prove that plumbing is impossible or not useful. One success at plumbing a bathroom sink is enough to demonstrate that it is possible and useful - it doesn't need dozens of examples - even if the task is narrowly scoped and well-trodden. If a Tesla humanoid robot could plumb in a bathroom sink, it might not be good value for money, but it would be a useful task. If it could do it for $30 it might be good value for money as well even if it couldn't do any other tasks at all, right?
Chuffed you picked this example to ~sneer about.
There's a near-infinite list of problems one can solve with a hammer, but there are vanishingly few things one can build with just a hammer.
> You (or the person I was replying to) basically have to make the case that Simon Willison is ignorant about LLMs and programming, is desperate about something, or is deluding himself that the port worked when it actually didn't, to keep the original claim.
I don't have to do any such thing.
I said the experiments were both interesting and illuminating and I meant it. But that doesn't mean they will generalize to less-favorable problems. (Simon's doing great work to help stake out what does and doesn't work for him. I have seen every single one of the posts you're alluding to as they were posted, and I hesitated to reply here because I was leery someone would try to frame it as an attack on him or his work.)
> Is it? I can't use an example where they weren't useful or failed.
https://en.wiktionary.org/wiki/cherry-pick
(idiomatic) To pick out the best or most desirable items
from a list or group, especially to obtain some advantage
or to present something in the best possible light.
(rhetoric, logic, by extension) To select only evidence which supports an argument,
and reject or ignore contradictory evidence.
> any number of people failing at plumbing a bathroom sink don't prove that plumbing is impossible or not useful. One success at plumbing a bathroom sink is enough to demonstrate that it is possible and useful - it doesn't need dozens of examples - even if the task is narrowly scoped and well-trodden.

This smells like sleight of hand.
I'm happy to grant this (with a caveat^) if your point is that this success proves LLMs can build an HTML parser in a language with several popular source-available examples and thousands of tests (and probably many near-identical copies of the underlying HTML specs as they evolve) with months of human guidance^ and (with much less guidance) rapidly translate that parser into another language with many popular source-available answers and the same test suite. Yes--sure--one example of each is proof they can do both tasks.
But I take your GP to be suggesting something more like: this success at plumbing a sink inside the framework an existing house with plumbing provides is proof that these things can (or will) build average fully-plumbed houses.
^Simon, who you noted is not ignorant about LLMs and programming, was clear that the initial task of getting an LLM to write the first codebase that passed this test suite took Emil months of work.
> If a Tesla humanoid robot could plumb in a bathroom sink, it might not be good value for money, but it would be a useful task. If it could do it for $30 it might be good value for money as well even if it couldn't do any other tasks at all, right?
The only part of this that appears to have been done for about $30 was the translation of the existing codebase. I wouldn't argue that accomplishing this task for $30 isn't impressive.
But, again, this smells like sleight of hand.
We have probably plumbed billions of sinks (and hopefully have billions or even trillions more to go), so any automation that can do one for $30 has clear value.
A world with a billion well-tested HTML parsers in need of translation is likely one kind of hell or another. Proof an LLM-based workflow can translate a well-tested HTML parser for $30 is interesting and illuminating (I'm particularly interested in whether it'll upend how hard some of us have to fight to justify the time and effort that goes into high-quality test suites), but translating them obviously isn't going to pay the bills by itself.
(If the success doesn't generalize to less favorable situations that do pay the bills, this clearly valuable capability may be repriced to better reflect how much labor and risk it saves relative to a human rewrite.)
Therefore LLMs are useful. Q.E.D. The claim "people who say LLMs are useful are deluded" is refuted. Readers can stop here, there is no disagreement to argue about.
> "But I take your GP to be suggesting something more like: this success at plumbing a sink inside the framework an existing house with plumbing provides is proof that these things can (or will) build average fully-plumbed houses."
Not exactly; it's common to see people dismiss internet claims of LLMs being useful. Here[1] is a specific dismissal that I am thinking of, where various people are claiming that LLMs are useful and the HN commenter investigated and says the LLMs are useless, the people are incompetent, and others are hand-writing a lot of the code. No data is provided for the readers to make any judgement one way or the other. Emil taking months to create the Python version could be dismissed this way as well, assuming a lot of hand-writing of code in that time. Small scripts can be dismissed with "I could have written that quickly" or "it's basically regurgitating from StackOverflow".
Simon Willison's experiment is a more concrete example. The task is clearly specified, not vague architecture design. The task has a clear success condition (the tests). It's clear how big the task is and it's not a tiny trivial toy. It's clear how long the whole project took and how long GPT ran for, there isn't a lot of human work hiding in it. It ran for multiple hours generating a non-trivial amount of work/code which is not likely to be a literal example regurgitated from its training data. The author is known (Django, Datasette) to be a competent programmer. The LLM code can be clearly separated from any human involvement.
Where my GP was going is that the experiment is not just another vague anecdote, it's specific enough that there's no room left for dismissing it how the commenter in [1] does. It's untenable to hold the view that "LLMs are useless" in light of this example.
> (repeat) "But I take your GP to be suggesting something more like: this success at plumbing a sink inside the framework an existing house with plumbing provides is proof that these things can (or will) build average fully-plumbed houses."
The example is not proof that these things can do anything else, but why would you assume they can't do tasks of similar complexity? Through time we've gone from "LLMs don't exist" to "LLMs exist as novelties and toys (GPT-1 2018)" to "LLMs might be useful but might not be". If things keep progressing we will get to "LLMs are useful". I am taking the position that we are past that point, and I am arguing that position. We are definitely into the time "they are useful". Other people have believed that for a long time. Not just useful for that task, but for tasks of that kind of complexity.
Sometime between GPT-1 babbling (2018) and today (Q4 2025) the GPTs and the tooling improved from not being able to do this task to yes being able to do this task. Some refinement, some chain of thought, some enlarged context, some API features, some CLI tools.
Since one can't argue that LLMs are useless by giving a single example of a failure, to hold the view that LLMs are useless, one would need to broadly dismiss whole classes of examples by the techniques in [1]. This specific example can't be dismissed in those ways.
> "If the success doesn't generalize to less favorable situations that do pay the bills"
Most bill-paying code in the world is CRUD, web front end, business logic, not intricate parsing and computer-science fundamentals. I'm expecting that "AI slop" is going to be good enough for managers no matter how objectionable programmers find it. If I order something online and it arrives, I don't care if the order form was Ruby on Rails emailing someone who copied the order docs into a Google Spreadsheet using an AI-generated If This Then That workflow, and as long as the error rate and credit card chargeback rate are low enough, nor will the company owners. Even though there are tons of examples of companies having very poor systems and still being in business, I don't have any specific examples, so I wouldn't argue this vehemently - but the world isn't waiting for LLMs to be as 'useful' as HN commenters are waiting for before throwing spaghetti at the wall and letting 'Darwinian natural selection' find the maximum level of slop the markets will tolerate.
----
On that note, a pedantic bit about cherry-picking: there's a difference between cherry-picking as a thing, and cherry-picking as a logical fallacy / bad-faith argument. e.g. if someone claims "Plants are inedible" and I point to cabbage and say it proves the claim is false, you say I'm cherry-picking cabbage and ignoring poisonous foxgloves. However, foxgloves existing - and a thousand other inedible plants existing - does not make edible cabbage stop existing. Seeing the ignored examples does not change the conclusion "plants are inedible" is false, so ignoring those things was not bad. Similarly "I asked GPT5 to port the Linux kernel to Rust and it failed" does not invalidate the html5 parser port.
Definition 2 is bad form; e.g. saying "smoking is good for you, here is a study which proves it" is a cherry-picking fallacy because if the ignored-studies were seen, they would counter the claim "smoking is good for you". Hiding them is part of the argument, deceptively.
"LLMs are useless and only a deluded person would say otherwise" is an example of the former; it's countered by a single example of a non-deluded person showing an LLM doing something useful. It isn't a cherry-picking fallacy to pick one example because no amount of "I asked ChatGPT to port Linux to Rust and it failed" makes the HTML parser stop existing and doesn't change the conclusion.
Even for prototyping, using wireframe software would be faster.
a) why maintain instead of making it all disposable? This could be like a dishwasher asking who is going to wash all the mass-manufactured paper cups. Use future-LLM to write something new which does the new thing.
> They're undeniably useful in software development
> I've fixed countless bugs in a tiny fraction of the time
> I get the most reliable results
> This works extremely well and reliably in producing high quality results.
If there's one common thing in comments that seem to be astroturfing for LLM usage, it's that they use lots of superlative adjectives in just one paragraph.
To be honest, it makes no difference in my life whether or not you believe what I'm saying. And from my perspective, it's just a bit astounding to read people authoritatively claiming that LLMs are not useful for software development. It's like being told over the phone that restaurant X doesn't have a pasta dish while I'm sitting at restaurant X eating a pasta dish. It's just weird, but I understand that maybe you haven't gone to the resto in a while, or didn't see the menu item, or maybe you just have something against this restaurant for some weird reason.
Go to the docs: fast page load. Then blank, wait a full second, page loads again. This does not feel like high quality. You think it does because LLM go brrrrrrrr, never complains, says you're smart. The resulting product is frustrating.
Reading these comments during this period of history is interesting because a lot of us actually have found ways to make them useful, acknowledging that they’re not perfect.
It’s surreal to read claims from people who insist we’re just deluding ourselves, despite seeing the results
Yeah they’re not perfect and they’re not AGI writing the code for us. In my opinion they’re most useful in the hands of experienced developers, not juniors or PMs vibecoding. But claiming we’re all just delusional about their utility is strange to see.
This is why it's so important to have data. So far I have not seen any evidence of a 'Cambrian explosion' or 'industrial revolution' in software.
The claim was that they’re useful at all, not that it’s a Cambrian explosion.
"In God we trust, all others must bring data."
just imagine how the skeptics feel :p
For what it's worth:
* I agree with you that LLMs probably aren't a path to AGI.
* I would add that I think we're in a big investment bubble that is going to pop, which will create a huge mess and perhaps a recession.
* I am very concerned about the effects of LLMs in wider society.
* I'm sad about the reduced prospects for talented new CS grads and other entry-level engineers in this world, although sometimes AI is just used as an excuse to paper over macroeconomic reasons for not hiring, like the end of ZIRP.
* I even agree with you that LLMs will lead to some maintenance nightmares in the industry. They amplify engineers' ability to produce code, and there are a lot of bad engineers out there, as we all know: plenty of cowboys/cowgirls who will ship as much slop as they can get away with. They shipped unmaintainable mess before, and they will ship three times as much now. I think we need to be very careful.
But if you are an experienced engineer who is willing to be disciplined and careful with your AI tools, they can absolutely be a benefit to your workflow. It's not easy: you have to move up and down a ladder of how much you rely on the tool, from true vibe coding for throwaway, use-once helper scripts for some dev or admin task with a verifiable answer, all the way up to hand-crafting critical business logic and only using the agent to review it and try to break your implementation.
You may still be right that they will create a lot of problems for the industry. I think the ideal situation for using AI coding agents is at a small startup where all the devs are top-notch, have many years of experience, care about their craft, and hold each other to a high standard. Very very few workplaces are that. But some are, and they will reap big benefits. Other places may indeed drown in slop, if they have a critical mass of bad engineers hammering on the AI button and no guard-rails to stop them.
This topic arouses strong reactions: in another thread, someone accused me of "magical thinking" and "AI-induced psychosis" for claiming precisely what TFA says in the first paragraph: that LLMs in 2025 aren't the stochastic parrots of 2023. And I thought I held a pretty middle of the road position on all this: I detest AI hype and I try to acknowledge the downsides as well as the benefits. I think we all need to move past the hype and the dug-in AI hate and take these tools seriously, so we can identify the serious questions amidst the noise.
I think that’s where they’re most useful, for multiple reasons:
- programming is very formal. Either the thing compiles, or it doesn't. It's straightforward to provide some "reinforcement" learning based on that (a rough sketch of such a reward signal follows this list)
- there’s a shit load of readily available training data
- there’s a big economic incentive; software developers are expensive
When the ROI in training the next model is realised to be zero or even negative, then yes the money will run out. Existing models will soldier on for a while as (bankrupt) operators attempt to squeeze out the last few cents/pennies, but they will become more and more out of date, and so the 'age of LLMs' will draw to a close.
I confess my skeptic-addled brain initially (in hope?) misread the title of the post as 'Reflections on the end of LLMs in 2025'. Maybe we'll get that for 2026!
"Ah-hah you stopped when this tool blew your whole leg off. If you'd stuck with it like the rest of us you could learn to only take off a few toes every now and again, but I'm confident that in time it will hardly ever do that."
Yes, because everyone who uses LLMs simply writes a prompt and then lets it write all of the code for them without thinking! Vibecoding amirite!?
That's good to hear, but I have been called an AI skeptic a lot on hn, so not everyone agrees with you!
I agree though, there's a certain class of "AI denialism" which pretends that LLMs don't do anything useful, which in almost-2026 is pretty hard to argue.
It has been entertaining to see how Yudkowsky and the rationalist community spent over a decade building around these AI doom arguments, then they squandered their moment in the spotlight by making crazy demands about halting all AI development and bombing data centers.
To say that any prediction about the future shape of a technology is 'untenable' is pretty silly. Unless you've popped back in a time machine to post this.
The context was the article quoted, not HN comments.
I’ve been called all sorts of things on HN and been accused of everything from being a bot to a corporate shill here. You can find people applying labels and throwing around accusations in every thread here. It doesn’t mean much after a while.
There's value there, but there's also a lot of hype that will pass, just like the AGI nonsense that companies were promising their current/next model will reach.
First, you find them useful but not intelligent. That is a bit of a contradiction. Basically anyone who has used AI seriously knows that while it can be used to regurgitate generic filler and bootstrap code, it can also be used to solve complex domain-specific problems that are not at all part of its training data. This by definition makes it intelligent, and it means we know the LLM understands the problem it was given. It would be disingenuous for me not to mention how wrong an LLM can be and how much it hallucinates, so obviously the thing has flaws and is not a superintelligence. But you have to judge the entire spectrum of what it does. It gets things right and it gets things wrong, and getting something complex right makes it intelligent, while getting something wrong does not preclude it from intelligence.
Second, most non-skeptics aren't saying all human work is going to be obsolete; no one can predict the future. But you've got to be blind if you don't see the trendline of progress. Literally look at the progress of AI for the past 15 years. You have to be next-level delusional if you can't project another 15 years and see that a superintelligence, or at least an intelligence comparable to humans, is a reasonable prediction. Most skeptics like you ignore the trendline and cling to what Yann LeCun said about LLMs being stochastic parrots. It is very likely something with human intelligence exists in the future and in our lifetimes; whether or not it's an LLM remains to be seen, but we can't ignore where the trendlines are pointing.
That's a very easy way to get everyone to pinky promise that they absolutely love AI to the ends of the earth
But the skeptics and anti-AI commenters are almost as active as ever, even as we enter 2026.
The debate about the usefulness of LLMs has grown into almost another culture war topic. I still see a constant stream of anti-AI comments on HN and every other social platform from people who believe the tools are useless, the output is always unusable, people who mock any idea that operator skill has an impact on LLM output, or even claims that LLMs are a fad that will go away.
I’m a light LLM user ($20/month plan type of usage) but even when I try to share comments about how I use LLMs or tips I’ve discovered, I get responses full of vitriol and accusations of being a shill.
I butted heads with many earlier on, and they did nothing to challenge that frame meaningfully. What did change is my perception of the set of tasks that don't require "intelligence". And the intuition pump for that is pretty easy to start — I didn't suppose that Deep Blue heralded a dawn of true "AI", either, but chess (and now Go) programs have only gotten even more embarrassingly stronger. Even if researchers and puzzle enthusiasts might still find positions that are easier for a human to grok than a computer.
You are attacking a strawman. Almost nobody claims that LLMs are useless or you can never use their output.
It’s not a strawman. It’s everywhere on HN.
> LLMs have certainly become extremely useful for Software Engineers
> LLMs are useful for programming in 2025
> Do LLMs make bad code: yes all the time (at the moment zero clue about good architecture). Are they still useful: yes, extremely so.
If your comment is not a strawman, show me where people actually claim what you say they do.
Lots of things are "useful for programming". Switching to a comfier chair is more useful for programming than any LLM.
We were sold vibe coding, and that's what managers want.
People will go from skeptic to dread/anxiety, to either acceptance or despair. We are witnessing the disruption of a profession in real time and it will create a number of negative effects.
I don't think it's fair to say that at all. How are LLMs not statistical models that predict tokens? It's a big oversimplification but it doesn't seem wrong, the same way that "computers are electricity running through circuits" isn't a wrong statement. And in both cases, those statements are orthogonal to how useful they are.
No one says "computers are JUST electricity running through circuits" because no one tries to argue the computer itself is "thinking" or has some kind of being. No one tries to argue that when you put the computer to sleep it is actually doing a form of "sleeping".
The mighty token though produces all kinds of confused nonsense.
There are LLMs as in "the blob of coefficients and graph operations that runs on a GPU whenever there's an inference", which is absolutely "a statistical model that predicts tokens", and LLMs as in "the online apps that iterate, have access to an entire automated Linux environment that can run $LANGUAGE scripts, do web queries when an intermediate statistical output contains too many maybes, and use the result to drive further inference".
Modern LLMs are trained via reinforcement learning where the training objective is no longer maximum next token probability.
They still produce tokens sequentially (ignoring diffusion models for now) but since the objective is so different thinking of them as next token predictors is more wrong than right.
Instead one has to think of them as trying to fit their entire output to the model learnt in the reinforcement phase. That's how reasoning in LLMs works so well.
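Loosely, the difference between the two objectives looks something like this toy sketch; real training code is vastly more involved, and log_probs/reward here are just assumed inputs.

    from typing import List

    def next_token_loss(log_probs: List[float]) -> float:
        """Pretraining-style objective: each token is penalised independently
        for not matching the reference continuation."""
        return -sum(log_probs) / len(log_probs)

    def rl_sequence_loss(log_probs: List[float], reward: float) -> float:
        """REINFORCE-style objective, used in far more elaborate forms in RL
        post-training: the whole sampled answer gets a single reward, and every
        token in it is pushed up or down by that one score."""
        return -reward * sum(log_probs)

The generation loop still emits one token at a time, but the thing being optimised is the score of the whole answer, not the probability of each next token in isolation.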
You’re not willing to have good faith discussion. You took the worst possible interpretation of my statement and crafted a terse response to shut me down. I only did two things. First I explained myself… then I called you out for what you did while remaining civil. I don’t skirt around HN rules as a means to an end, which is what I believe you’re doing? I’m ok with what you’re doing… but I will call it out.
I’m not a victim of anything. But you are definitely a perpetrator and instigator.
So technical means something like this: in a technical sense you are a stochastic parrot. You are also technically an object. But in everyday language we don't call people stochastic parrots or objects because language is nuanced and the technical meaning is rarely used at face value and other meanings are used in place of the technical one.
So when people use a term in conversation and go by the technical meaning it's usually either very strange or done deliberately to deceive. Sort of like how you claim you don't know what "technically" means and sort of how you deliberately misinterpreted my words as "inflammatory" when I did nothing of the sort.
I hope you learned something basic about English today! Good day to you sir!
I am not. I'm sorry you feel this way about yourself. you are more than a next token predictor
Humans ARE next token predictors technically and we are also more than that. That is why calling someone a next token predictor is a mischaracterization. I think we are in agreement you just didn’t fully understand my point.
But the claim for LLMs are next token predictors is the SAME mischaracterization. LLMs are clearly more than next token predictors. Don’t get me wrong LLMs aren’t human… but they are clearly more than just a next token predictor.
The whole point of my post is to point out how the term stochastic parrot is weaponized to dismiss LLMs and mischaracterize and hide the current abilities of AI. The parent OP was using the technical definition as an excuse to use the word as a means to achieve his own ends, namely being "against" AI. It's a pathetic excuse; I think it's clear the LLM has moved beyond a stochastic parrot and there are just a few stragglers left who can't see that AI is more than that.
You can be "against" AI, that's fine, but don't mischaracterize it... argue and make your points honestly and in good faith. Using the term stochastic parrot, and even what the other poster did in an attempt to accuse me of inflammatory behavior, is just tactics and manipulation.
It was possible to create things in gpt-3.5. The difference now is it aligns with the -taste- of discerning programmers, which has a little, but not everything, to do with technological capability.
This... doesn't match the field reports I've seen here, nor what I've seen from poking around the repos for AI-powered Show HN submissions.
What changed was the use of RLVR training for programming, resulting in "reasoning" models that are now attempting to optimize for a long-horizon goal (i.e. bias generation towards "reasoning steps" that during training led to a verified reward), as opposed to earlier LLMs where RL was limited to RLHF.
So, yeah, the programmers who characterized early pre-RLVR coding models as of limited use were correct. Now the models are trained differently and developers find them much more useful.
The difficulty of using RL more generally to promote reasoning is that in the general case it's hard to define correctness and therefore quantify a reward for the RL training to use.
I think this gets to the crux of the issue with LLMs for coding (and indeed 'test-oriented development'). For anything beyond the most basic level of complexity (i.e. anything actually useful), code cannot be verified by compiling and running it. It can only be verified - to a point - by skilled human inspection and comprehension. That is the essence of code really: a definition of action, given by humans, to a machine for running with /a priori/ unenumerated inputs. Otherwise it is just a fancy lookup table. By definition, then, not all inputs and expected outputs can be tabulated, tested for, or rewarded for.
As far as using the trained model to generate code, then of course it's up to the developer to do code reviews, testing, etc as normal, although of course an LLM can be used to assist writing test cases etc as well.
Ah, hence the "HF" angle.
The way RLHF works is that a smallish amount of feedback data of A/B preferences from actual humans is used to train a preference model, and this preference model is then used to generate RL rewards for the actual RLHF training.
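Boiled down, that preference-model step is usually a Bradley-Terry style loss. A toy sketch with made-up names follows; in reality the two scores come from a neural reward model evaluated over large batches of human A/B comparisons, not two scalars.

    import math

    def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
        """Push the reward model to score the human-preferred answer above
        the rejected one."""
        # probability the reward model assigns to "chosen beats rejected"
        p_chosen = 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
        return -math.log(p_chosen)

The reward model trained this way is then what supplies the RL rewards during the actual RLHF phase, as described above.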
RLHF has been around for a while and is what tamed base models like GPT 3 into GPT 3.5 that was used for the initial ChatGPT, making it behave in more of an acceptable way!
RLVR is much more recent, and is the basis of the models that do great at math and programming. If you talk about reasoning models being RL trained then it's normally going to imply RLVR, but it seems there's a recent trend of people calling it RLVR to be more explicit.
> The fundamental challenge in AI for the next 20 years is avoiding extinction.
Seems almost absurd without further, concrete justification.
LLMs are still quite useful, I'm glad they exist and honestly am still surprised more people don't use them in software. Last year I was very optimistic that LLMs would entirely change how we write software by making use of them as a fundamental part of our programming tool kit (in a similar way that ML fundamentally changed the options available to programmers for solving problems). Instead we've just come up with more expensive ways to extend the chat metaphor (the current generation of "agents" is disappointingly far from the original intent of agents in AI/CS).
The thing I am increasingly confused about is why so many people continue to need LLMs to be more than they obviously are. I get why crypto boosters exist: if I have 100 BTC, I have a very clear interest in getting others to believe they are valuable. But with "AI", I don't quite get, for the non-VC/founder, why it matters that people start foaming at the mouth over AI rather than just using it for the things it's good at.
Though I have some growing sense that this need is related to another trend I've personally started to witness: AI psychosis is very real. I personally know an increasing number of people who are spiraling into an LLM-induced hallucinated world. The most shocking was someone talking about how losing human relationships is inevitable because most people can't keep up with those enhanced by AI acceleration. On the softer end, I know more and more people who quietly confess how much they let AI work as a perpetual therapist, guiding their every decision (which is more than most people would let a human therapist guide their decisions).
This is a ridiculous statement. A simple example of the huge difference is context size.
GPT-4 was, what, 8K? Now we’re in the millions with good retention. And this is just context size, let alone reasoning, multimodality, etc.
I have quizzed it with three books (total more than 1500 pages) and it gave great answers.
Initially yes when they released 2 million context with Gemini 1.5 it wasn’t effective.
Try it with Gemini 3 pro/flash now.
Why can't an AGI be inherently classless, unconcerned with profit or scarcity, and inherently "arc-ing toward justice"?
Because that isn't good news for nerds who think they rightly sit at the top of a meritocracy. An evil AGI is one that confirms tech is the ultimate unconquerable power that only the tech elite can even hope to master.
Well, let's see how all the economics play out. LLMs might be really useful, but as far as I can see, none of the AI companies are making money on inference alone. We might be hitting a plateau in capabilities, with money being raised on the vision of this being godlike tech that will change the world completely. Sooner or later the costs will have to meet reality.
The numbers aren’t public, but from what companies have indicated it seems inference itself would be profitable if you could exclude all of the R&D and training costs.
But this debate about startups losing money happens endlessly with every new startup cycle. Everyone forgets that losing money is an expected operating mode for a high growth startup. The models and hardware continue to improve. There is so much investment money accelerating this process that we have plenty of runway to continue improving before companies have to switch to full profit focus mode.
But even if we ignore that fact and assume they had to switch to profit mode tomorrow, LLM plans are currently so cheap that even a doubling or tripling isn’t going to be a problem. So what if the monthly plans start at $40 instead of $20 and the high usage plans go from $200 to $400 or even $600? The people using these for their jobs paying $10K or more per month can absorb that.
That’s not going to happen, though. If all model progress stopped right now the companies would still be capturing cheaper compute as data center buildouts were completed and next generation compute hardware was released.
I see these predictions as the current equivalent of all of the predictions that Uber was going to collapse when the VC money ran out. Instead, Uber quietly settled into steady operation, prices went up a little bit, and people still use Uber a lot. Uber did this without the constant hardware and model improvements that LLM companies benefit from.
LLMs have a short shelf-life. They don't know anything past the day they're trained. It's possible to feed or fine-tune them a bit of updated data but its world knowledge and views are firmly stuck in the past. It's not just news - they'll also trip up on new syntax introduced in the latest version of a programming language.
They could save on R&D but I expect training costs will be recurring regardless of advancements in capability.
I'm not gonna dig out the math again, but if AI usage follows the popularity path of cell phone usage (which seems to be the case), then trillions invested has a ROI of 5-7 years. Not bad at all.
Now you have a world of people who have become accustomed to using AI for tons of different things, and the enshittification starts ramping up, and you find out how much people are willing to pay for their ChatGPT therapist.
They don’t have to spend all their cash at once on the 30GW of data centers commitments.
Why go on the internet and tell stupid lies?
Having good-quality dev tools is non-negotiable, and I have a feeling that a lot of people are going to find out the hard way that reliability, and the tool not being owned by a profit-seeking company, are the #1 things you want in your environment.
This was the missed point on why GPT5 was such an important launch (quality of models and vibes aside). It brought the model sizes (and hence inference cost) to more sustainable numbers. Compared to previous SotA (GPT4 at launch, or o1/3 series), GPT5 is 8x-12x cheaper! I feel that a lot of people never re-calibrated their views on inference.
And there's also another place where you can verify your take on inference - the 3rd party providers that offer "open" models. They have 0 incentive to subsidise prices, because people that use them often don't even know who serves them, so there's 0 brand recognition (say when using models via openrouter).
These 3rd-party providers have all converged towards a price point per billion parameters. And you can check those prices and get an idea of what would be profitable and at what sizes. Models like dsv3.2 are really, really cheap to serve for what they provide (at least gpt5-mini equivalent, I'd say).
So yes, labs could totally become profitable with inference alone. But they don't want that, because there's an argument to be made that the best will "keep it all". I hope, for our sake as consumers, that it isn't the case. And so far this year it seems that it's not the case. We've had all 4 big labs one-up each other several times, and they're keeping each other honest. And that's good for us. We get frontier-level offerings at 10-25$/MTok (Opus, gpt5.2, gemini3pro, grok4), and we get highly capable yet extremely cheap models at 1.5-3$/MTok (gemini3-flash, gpt-minis, grok-fast, etc)
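To put those $/MTok figures in perspective, a quick back-of-the-envelope sketch; the token counts and price tiers below are invented for illustration, not measurements of any particular model or provider.

    def session_cost(input_tokens: int, output_tokens: int,
                     in_price_per_mtok: float, out_price_per_mtok: float) -> float:
        """Dollar cost of one session given prices per million tokens (MTok)."""
        return ((input_tokens / 1e6) * in_price_per_mtok
                + (output_tokens / 1e6) * out_price_per_mtok)

    # Hypothetical heavy coding session: 2M tokens in, 200K tokens out.
    print(session_cost(2_000_000, 200_000, 1.5, 6.0))    # cheap tier:    ~$4.20
    print(session_cost(2_000_000, 200_000, 15.0, 75.0))  # frontier tier: ~$45.00

Even the frontier-tier number is small next to what an hour of a developer's time costs, which is the whole argument about inference margins in a nutshell.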
Whenever I ask a SOTA model about architecture recommendations, and frame the problem correctly, I get top notch answers every single time.
LLMs are terrific software architects. And that’s not surprising, there has to be tons of great advice on how to correctly build software in the training corpus.
They simply aren’t great software architects by default.
I spend a couple of hours per week teaching software architecture to a junior on my team, because he doesn't yet have the experience to ask correctly, let alone assess the quality of the answer from the LLM.
And, as much as what I’ve just said is hyperbolically pessimistic, there is some truth to it.
In the UK a bunch of factors have coincided to put the brakes on hiring, especially at smaller and mid-size businesses. AI is the obvious one that gets all the press (although how much it's really to blame is open to question in my view), but the recent rise in employer NI contributions and now (anecdotally) the employee rights bill have come together to make companies quite gun-shy when it comes to hiring.
Programming is more like math than creative writing. It's largely verifiable, which is where RL is repeatedly proven to eventually achieve significantly better than human intelligence.
Our saving grace, for now, is that it's not entirely verifiable because things like architectural taste are hard to put into a test. But I would not bet against it.
This is true for everything, any tool you might use. Competent users of tools understand how they work and thus their limitations and how they're best put to work.
Incompetents just fumble around and sometimes get things working.
For me LLMs are a waste of time.
I know a lot of people who do it.
It’s basically the same idea but faster.
This makes me think: I wonder if Goodhart's law[1] may apply here. I wonder if, for instance, optimizing for speed may produce code that is faster but harder to understand and extend. Should we care or would it be ok for AI to produce code that passes all tests and is faster? Would the AI become good at creating explanations for humans as a side effect?
And if Goodhart's law doesn't apply, why not? Is it because we're only doing RLVR fine-tuning on the last layers of the network, so the generality of the pre-training is not lost? And if this is the case, could this be a limitation in not being able to be creative enough to come up with move 37?
This is generally true for code optimised by humans, at least for the sort of mechanical low level optimisations that LLMs are likely to be good at, as opposed to more conceptual optimisations like using better algorithms. So I suspect the same will be true for LLM-optimised code too.
Superoptimizers have been around since 1987: https://en.wikipedia.org/wiki/Superoptimization
They generate fast code that is not meant to be understood or extended.
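In spirit, the classic approach is just exhaustive search against a specification. Here is a toy version for an invented three-instruction machine rather than real machine code; real superoptimizers search actual instruction sets and use SAT/SMT solvers to prove equivalence instead of spot-checking inputs.

    from itertools import product

    # Toy instruction set operating on a single accumulator.
    OPS = {
        "double": lambda x: x * 2,
        "inc":    lambda x: x + 1,
        "neg":    lambda x: -x,
    }

    def run(program, x):
        for op in program:
            x = OPS[op](x)
        return x

    def superoptimize(spec, max_len=4, tests=range(-5, 6)):
        """Return the shortest instruction sequence matching `spec` on the
        test inputs, searching programs in order of increasing length."""
        for length in range(1, max_len + 1):
            for program in product(OPS, repeat=length):
                if all(run(program, x) == spec(x) for x in tests):
                    return program
        return None

    # e.g. find a sequence computing 4*x + 2
    print(superoptimize(lambda x: 4 * x + 2))  # ('double', 'inc', 'double')

The output is minimal by construction but carries no trace of intent, which is exactly why nobody expects to read or extend it.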
When people use LLMs to improve their code, they commit their output to Git to be used as source code.
Until ~2022 there was a clear line between human-generated code and computer-generated code. The former was generally optimized for readability and the latter was optimized for speed at all cost.
Now we have computer-generated code in the human layer and it's not obvious what it should be optimized for.
It should be optimized for readability by AI. If a human wants to know what a given bit of code does, they can just ask.
I'm not super up-to-date on all that's happening in AI-land, but in this quote I can find something that most techno-enthusiasts seem to have decided to ignore: no, code is not free. Immense resources (energy, water, materials) go into these data centers in order to produce this "free" code, and the material consequences are terribly damaging to thousands of people. With the further construction of data centers to feed this free vibe-coding style, we're further destroying parts of the world. Well done, AGI loverboys.
Not really. Most corn grown in the US isn’t even fit for consumption. It is primarily used for fermenting bioethanol.
- drive to the store or to work
- take a shower
- eat meat
- fly on vacation
And so on... thanks!
If you don't do that, and are a homesteader, then yes, you are a very small minority outlier. (Assuming you aren't ordering supplies delivered instead of driving to the store.)
> Eat meat.
Yes, not eating meat is in the minority.
> Fly on vacation.
So, don't vacation, walk to vacation, or drive to vacation? 1/3 are also consumptive.
It seems you are either a very significant outlier, or you're being daft. I'm curious which. Would you mind clarifying?
For holidays, we did a cycling holiday with our children. They loved it!
I don’t at all feel like an outlier, many friends do similar things.
We have a backward orange fool running things for gems like this: https://news.ycombinator.com/item?id=46357881
But it's just as much local political issues as national around here.
It's not the case that every form of writing has to be an academic research paper. Sometimes people just think things, and say them – and they may be wrong, or they may be right. And they sometime have some ideas that might change how you think about an issue as a result.
Accomplishment in one field does not make one an expert, nor even particularly worth listening to, in any other. Certainly it doesn't remove the burden of proof or the necessity to make an actual argument based on more than simply insisting something is true.
The creator of Redis.
> woah buddy this persons opinion isn’t worth anything more than a random homeless person off the street. they’re not an expert in this field
Is there a term for this kind of pedantry? Obviously we can put more weight behind the words a person says if they’ve proven themselves trustworthy in prior areas - and we should! We want all people to speak and let the best idea win. If we fallback to only expert opinions are allowed that’s asking to get exploited. And it’s also important to know if antirez feels comfortable spouting nonsense.
This is like a basic cornerstone of a functioning society. Though, I realize this “no man is innately better than another, evaluate on merit” is mostly a western concept which might be some of my confusion.
no, you shouldn't
this is how you end up with crap like vaccine denialism going mainstream
"but he's a doctor!"
We've got Avi Loeb on mainstream podcasts and TV spouting baseless alien nonsense. He's preeminent in his field, after all.
Focus on what you understand. If you don't understand, learn more.
His entirely unsupported statements about AGI are pretty useless, for instance.
So many people assume AGI is possible, yet no one has a concrete path to it or even a concrete definition of what it or what form it might take.
This one is bizarre, if true (I'm not convinced it is).
The entire purpose of the attention mechanism in the transformer architecture is to build this representation, in many layers (conceptually: in many layers of abstraction).
> 2. NOT have any representation about what they were going to say.
The only place for this to go is in the model weights. More parameters means "more places to remember things", so clearly that's at least a representation.
Again: who was pushing this belief? Presumably not researchers, these are fundamental properties of the transformer architecture. To the best of my knowledge, they are not disputed.
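For reference, the core of the mechanism being discussed fits in a few lines: each layer re-mixes every token's vector with information from the other tokens, and those mixed vectors are the internal "representation" in question. This is a bare-bones single-head sketch with no masking, multi-head splitting, or layer norms.

    import numpy as np

    def attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
        """Single-head scaled dot-product attention.
        x: (seq_len, d_model) token vectors; Wq/Wk/Wv: (d_model, d_head) projections."""
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])           # relevance of token j to token i
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
        return weights @ V  # each token's new vector blends information from all tokens

Stacked over dozens of layers, these blended vectors are precisely the kind of intermediate representation the quoted claims say the models lack.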
> I believe [...] it is not impossible they get us to AGI even without fundamentally new paradigms appearing.
Same, at least for the OpenAI AGI definition: "An AI system that is at least as intelligent as a normal human, and is able to do any economically valuable work."
> The entire purpose of the attention mechanism in the transformer architecture is to build this representation, in many layers (conceptually: in many layers of abstraction).
I think this is really about a hidden (i.e. not readily communicated) difference in what the word "meaning" means to different people.
I think that's the usual understanding of how transformer architectures work, at the level of math.