Posted by simonw 7 days ago
HN is a bit weird because it's got 99 articles about how evil LLMs are and one article that's like "oh hey I asked an LLM questions and got some answers" and people are like "wow amazing".
Not that I mind. I assume Simon just wanted to share some cool nerdy stuff and there's nothing wrong with the blog post. It's just surprising that it's posted not once but twice on HN and is on the front page when there's so much anti-AI sentiment otherwise.
Often the results were bad, so the answer was bad.
GPT-5 Thinking (and o3 before it, but very few people tried o3) does a whole lot better than that. It runs multiple searches, then evaluates the results and runs follow-up searches to try to get to a credible result.
This is new and worth writing about. LLM search doesn't suck any more.
FWIW Gemini at least has been pretty good at this since late 2024 IMO.
As for where things are now, I just ran a comparison with ChatGPT 5 in thinking mode against Google search's AI mode across a few questions. They performed about the same on the searches I tried and returned substantially the same answers, with some minor variation here or there. Google search is maybe an order of magnitude faster. Google obviously has an advantage here: full access to its own search and ranking index.
And of course the ability to make multiple searches and reason about them has been available for months, maybe almost a year, as deep research mode. I guess the novelty now is that you wait less time and get research that's less deep.
The results look reasonable? It’s a good start, given how long it takes to hear back from our doctor on questions like this.
There is one big failure mode though - ChatGPT hallucinates the middle of simple textual OCR tasks!
I will feed ChatGPT a simple computer hardware invoice with 10 items - the first few items come out perfect, then come plausible but fake middle items (like MSI 4060 16GB instead of Asus 5060 Ti 16GB), and the last few items are again correct.
If you start prompting with hints, the model keeps making up other models and manufacturers; it apologizes and comes up with an incorrect Gigabyte 5070.
I can forgive mistaking 5060 for 5080 - see https://www.theguardian.com/books/booksblog/2014/may/01/scan... . However, how can the model completely misread the manufacturers??
This would be trivially fixed by reverting to a Tesseract-based pipeline, like ChatGPT used to do.
PS Just tried it again: for the 3rd item it gave Kingston instead of the correct GSKILL as the RAM manufacturer.
Basically ChatGPT sort of OCRs like a human would: it scans the start, confabulates the middle, and gets the footer correct.
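One cheap way to catch this failure mode is to cross-check the LLM's transcription against a deterministic OCR pass (e.g. Tesseract) and flag lines that diverge. A minimal sketch in Python's stdlib `difflib` - the invoice items and the 0.8 similarity threshold here are illustrative assumptions, not real data:

```python
import difflib

# Hypothetical example: line items from a deterministic OCR pass
# (e.g. Tesseract) vs. items an LLM returned for the same invoice.
ocr_items = [
    "Asus 5060 Ti 16GB",
    "GSKILL 32GB DDR5",
    "Samsung 990 Pro 2TB",
]
llm_items = [
    "Asus 5060 Ti 16GB",
    "Kingston 32GB DDR5",  # confabulated manufacturer
    "Samsung 990 Pro 2TB",
]

def flag_mismatches(ocr, llm, threshold=0.8):
    """Pair up lines and flag any pair whose similarity falls below threshold."""
    flagged = []
    for a, b in zip(ocr, llm):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio < threshold:
            flagged.append((a, b, round(ratio, 2)))
    return flagged

# Flags the GSKILL/Kingston pair; identical lines pass untouched.
print(flag_mismatches(ocr_items, llm_items))
```

This only catches disagreements between the two transcriptions, of course - it assumes the deterministic pass is the more trustworthy of the two, which for printed invoices it usually is.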
...
I used to play games on my computer a lot. Not so much anymore, don't really want to lock myself in a room alone and play games. I have kids and a wife, and it feels isolative.
But back then I would, and often my hardware was too underpowered to experience the game in its full glory. I would often spend hours and hours just honing settings, config, and environment to get the game running at peak capability on my machine.
At some point, I would reach a zenith: some perfect arrangement of settings and environment that gave me the game running at top quality on my machine (or as close to top as I could get). The experience was joyous. So enjoyable that I often didn't even play the game, except maybe to test the boundaries of its performance at that level.
Reading this article made me sad for people who don't put in work toward some sort of accomplishment, even one that amounts to nothing. And it made me think of my own experience with it. Accomplishment for its own sake is still accomplishment. And it's still self-realization, which is important to existing.
The context: I was rushing for a train, I ran into Starbucks at the station for a coffee, I noticed they didn't have cake pops and the staff member didn't appear to know what they were.
I see three choices here:
1. Since I'm mildly curious about Starbucks and cake pop availability in the UK, I get on the train, open up my laptop and dedicate realistically a solid half hour or more to figuring out what's going on.
2. I fire off a research question at GPT-5 Thinking on my mobile phone.
3. I don't do any research at all and leave my mild curiosity unsatiated.
Realistically, I think the choices are between 2 and 3. I was never going to perform a full research project on this myself.
See also: AI-enhanced development makes me more ambitious with my projects, which I wrote in March 2023 and has aged extremely well. https://simonwillison.net/2023/Mar/27/ai-enhanced-developmen...
I do plenty of deep dive research projects myself into topics both useful and pointless - my blog is full of them!
Now I can take on even more.
Alternatively, you could have spent that half hour on the train exercising your own creativity to try and satisfy your curiosity. Whether you're right or wrong doesn't really matter, because as you acknowledge it's not really important enough to you to matter. Picking (2) eliminates all the possible avenues that might have led you down.
I'm not saying one is better than the other, just that you're approaching the criticism on the basis of axioms that represent a narrow viewpoint: That of someone who has to be "right" about the things they are curious about, no matter how trivial.
I spent my half hour on the train satiating all sorts of other things instead (like the identity of that curious looking building in Reading).
> Picking (2) eliminates all the possible avenues that might have led you down.
I don't think that's the case. Using GPT-5 for the Cake Pop question led me down a bunch of avenues I may never have encountered otherwise - the structure of Starbucks in the UK, the history of their Cake Pops rollout, the fact that checking nutritional and allergy details on their website is a great way to get an "official" list of their products independent of what's on sale in individual stores - and it sparked me to run a separate search for their Cookies and Cream cake pop and find out it had been discontinued in the US.
Not bad for typing a couple of prompts on my phone and then spending a few extra minutes with the results after the research task had completed.
Now multiply that by a dozen plus moments of curiosity per day and my intellectual life feels genuinely elevated - I'm being exposed to so many more interesting and varied avenues than if I was manually doing all of the work on a smaller number of curiosities myself.
I don't disagree: I just posited that there are other ways to satisfy it, and that there is an opportunity cost to the path you've chosen, one you don't seem very aware of because your curiosity and desire to be correct are tightly coupled. But that doesn't actually have to be the case. It has its pros and cons.
Now I'm more of an "it's the journey, not the destination" guy, so accelerating the journey doesn't appeal to me as much as it used to, because the journey is where I get the most value. That change in my perspective is what motivated me to comment.
But anyway, you clearly enjoy it and do great work, so all the best with it!
Example query: a keyboard stand with a built-in music (sheet) stand.
-- Disclaimer --
It might be connected to the web enshittification process, which has been under way for quite some time already.