Posted by simonw 6 days ago
I’ve found ChatGPT and other LLMs can struggle to evaluate evidence - to understand the biases behind sources - e.g. taking data from a sketchy think tank as gospel. I’ve also found in my work that the more reasoning, the more hallucination, especially when gathering many statistics.
That plus the usual sycophancy can cause the model to really want to find evidence to support your position. Even if you don’t think you’re asking a leading question, it can really want to answer your question in the affirmative.
I always ask ChatGPT to directly cite and evaluate sources. And try to get it in the mindset of comparing and contrasting arguments for and against. And I find I must argue against its points to see how it reacts.
More here https://softwaredoug.com/blog/2025/08/19/researching-with-ag...
I'd love it if it had a confidence rating based on the sources it found or something, but I imagine that would be really difficult to get right.
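A naive version of that rating isn't hard to sketch, for what it's worth; the hard part is deciding the per-source reliability numbers in the first place. A toy illustration in Python (the domains, weights, and the averaging rule below are all made up for the example):

# Toy sketch: score an answer by the hand-assigned reliability of its cited sources.
# The weights themselves are the genuinely hard part to get right.
SOURCE_WEIGHTS = {
    "nature.com": 0.9,
    "gov.uk": 0.85,
    "wikipedia.org": 0.7,
    "sketchy-thinktank.example": 0.2,
}

def confidence(cited_domains, default=0.3):
    # Average the reliability of the cited domains; unknown domains count low.
    if not cited_domains:
        return 0.0
    return sum(SOURCE_WEIGHTS.get(d, default) for d in cited_domains) / len(cited_domains)

print(confidence(["nature.com", "wikipedia.org"]))    # 0.8
print(confidence(["sketchy-thinktank.example"]))      # 0.2

Even the toy version shows why it's hard: someone has to maintain those weights, and a high-reputation domain can still host a wrong claim.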
AI really needs better source validation. Not just to combat the hallucination of sources (which Gemini seems to do 80% of the time), but also to combat low-quality sources that happen to correlate well with the question in the prompt.
It's similar to Google having to fight SEO spam blogs, they now need to do the same in the output of their models.
No API access though so you're stuck talking with it through the webapp.
Kagi also tells you the percentages “used” for each source and cites them in line.
It’s not perfect, but it makes it a lot easier to narrow down what you want to get out of your prompt.
When LLMs really started to show themselves, there was a big debate about what is truth, with even HN joining in on heated debates on the number of sexes or genders a dog may have and if it was okay or not for ChatGPT to respond with a binary answer.
On one hand, I did find those discussions insufferable, but the deeper question - what is truth and how do we automate the extraction of truth from corpora - is super important and has somehow completely disappeared from the LLM discourse.
If you ask for information which is e.g. academic or technical it would cite information and compare different results, etc, without any extra prompt or reminder.
Grok 4 (at the initial release) was just reporting information in the articles it found without any analysis.
Claude Opus 4 also seems bad: I asked it to give a list of JS libraries of a certain kind in deep research mode, and it returned a document focused on market share and usage statistics. Looks like it stumbled upon some articles of that kind and got carried away by it. Quite bizarre.
So GPT-5 is really good in comparison. Maybe not perfect in all situations, but perhaps better than an average human.
Alas, the average human is pretty bad at these things.
This is what I keep finding: it mostly repeats surface-level "common knowledge." It usually takes a few back-and-forths to get to whether or not something is actually true - asking for the numbers, asking for the sources, asking for the excerpt from the sources where they actually provide that information, verifying to make sure it's not hallucinating, etc. A lot of the time, it turns out its initial response was completely wrong.
I imagine most people just take the initial (often wrong) response at face value, though, especially since it tends to repeat what most people already believe.
This cuts both ways. I have yet to find an opinion or fact I could not make ChatGPT agree with as if objectively true. Knowing how to trigger (im)partial thought is a skill in and of itself and something we need to be teaching in school asap. (Which some already are, in one way or another.)
There was a discussion about Wikipedia here recently where a lot of people who are active on the site argued against people taking the claims there with a grain of salt and verifying the accuracy for themselves.
We can teach these things until the cows come home, but it's not going to make a difference if people say it's a good idea and then immediately do the opposite.
If you mean whether Wikipedia is unreliable: that's a different story; everything is unreliable. Wikipedia just happens to be potentially less unreliable than many (typically) (if used correctly) (#include caveats.h).
Sources are like power tools. Use them with respect and caution.
You are very optimistic.
Look at all other skills we are trying to teach in school. 'Critical thinking' has been at the top of nearly every curriculum you can point a finger at for quite a while now. To minimal effect.
Or just look at how much math we are trying to teach the kids, and what they actually retain.
Critical thinking is a much more general skill which is applicable anywhere, thus quicker to be 'buried' under other learned behavior.
This skill has an obvious trigger; you're using AI, which means you should be aware of this.
I have a liberal arts background. So I use the term research to mean gathering evidence, evaluating its trustworthiness and biases, and avoiding thinking errors related to evaluating evidence (https://thedecisionlab.com/biases).
LLMs can fall prey to these problems as well. Usually it’s not just “reasoning” that gives you trouble. It’s the reasoning about evidence. I see this with Claude Code a lot. It can sometimes create some weird code, hallucinating functionality that doesn’t exist, all because it found a random forum post.
I realize though that the term is pretty overloaded :)
Same here. But it often produces broken or bogus links.
I'll take one of your examples: Britannica to seed Wikipedia. I searched for "wikipedia encyclopedia brtannica". In less than 1 second, I got search results back.
I spend maybe 30 seconds scanning the page: past the Wikipedia article on Encyclopedia Britannica, past the Encyclopedia article about Wikipedia, past a Reddit thread comparing them, past the Simple English Wikipedia article on Britannica, and past the Britannica article on Wiki. OK, there it is, the link to "Wikipedia:WikiProject Encyclopaedia Britannica", which answers your question.
Then to answer your follow up, I spend a couple more seconds to search Wikipedia for Wikipedia, and find in the first paragraph that it was founded in 2001.
So, let's say a grand total of 60 seconds of me searching, skimming, and reading the results. The actual searching was maybe 2 or 3 seconds of time total, once on Google, and once on Wikipedia.
Compared to nearly 3 minutes for ChatGPT to grind through all of that, plus the time for you to read it, and hopefully verify by checking its references because it can still hallucinate.
And what did you pay for the privilege of doing that? How much extra energy did you burn for this less efficient response? I wish that when linking to chat transcripts like you do, ChatGPT would show you the token cost of that particular chat.
So yeah, it's possible to do search with ChatGPT. But it seems like it's slower and less efficient than searching and skimming yourself, at least for this query.
That's generally been my impression of LLMs; it's impressive that they can do X. But when you add up all the overhead of asking them to do X, having them reason about it, checking their results, following up, and dealing with the consequences of any mistakes, the alternative of just relying on plain old search and your own skimming seems much more efficient.
* "Rubber bouncy at Heathrow removal" on Google had 3 links, including the one about SFO from which chatGPT took a tangent. While ChatGPT provided evidence for the latest removal date being of 2024, none was provided for the lower bound. I saw no date online either. Was this a hallucination?
* A reverse image lookup of the building gave me the blog entry, but also an Alamy picture of the Blade (admittedly this result could have been biased by the fact that the author had already identified the building as the Blade)
* The Starbucks cake pop Google search led me to https://starbuckmenu.uk/starbucks-cake-pop-prices/. I will add that the author bitching to ChatGPT about ChatGPT's hidden prompts in the transcript is hilarious.
I get why people prefer ChatGPT. It will do all the boring work of curating the internet for you, to provide you with a single answer. It will also hallucinate every now and then, but that seems to be a price people are willing to pay and ignore, just like the added cost compared to a single Google search. Now I am not sure how this will evolve.
Back in the day, people would tell you to be wary of the Internet and that Wikipedia thing, and that you could get all the info you need from a much more reliable source at the library anyways, for a fraction of the cost. I guess that if LLMs continue to evolve, we will face the same paradigm shift.
Firstly, if we don't remove the Google AI summary then, as you rightly say, it makes the experience 10x worse. They still try to give an answer quickly, but the AI takes up a ton of space and is mostly terrible.
Googling for a GitHub repository just now, Google linked me to 3 resources, none of which was the actual page. One was a clone with the same name, another a garbage link, but luckily the 3rd was a Reddit post by the same person which linked to the correct page.
GPT does take a lot longer, but the main advantage for me depends on the scope of what you're looking for. In the above example I didn't mind Google, because the 3 links opened fast and I could scan and click through to find what I was looking for, i.e. I wanted the information right now.
But then let's say I'm interested in something a bit deeper, for example how did they do the unit movement in StarCraft 2? This is a well known question, so the links/info you get from either Google or GPT are all great. If I was searching this topic via Google I'd then have to copy or bookmark the main topics to continue my research on them. Doing it via GPT it returns the same main items, but I can very easily tell it to explain all those topics in turn, have it take the notes, find source code, etc.
Of course, as in your example, if you're a doctor and you're googling symptoms, or perhaps the real-world location of ABC, then the specter of hallucination is a dangerous thing which you want to avoid at all costs. But for myself I find that I can as easily filter LLM mistakes as I can noise/errors from manual searches.
My guess for the future of the Internet is that in N years there will be no such thing as manually searching for anything; everything will be assistant-driven via LLM.
* Bouncy people mover. Some Google searching turns up the SFO article that you liked. Trying to pin down the exact dates is harder. ChatGPT maybe did narrow down the time frame quicker than I could through a series of Google searches.
* The picture of the building. Go to Google lens, paste in the image, less than a second later I get results. Of course, the exact picture in this article comes up on top, but among the other results I get a mix of two different buildings, one of which is identified as the Blade, one Independence Temple. So a few seconds here between searching and doing my own quick visual scan of the results.
* Starbucks UK Cake Pops: This one is harder to pin down fully with a quick Google search. I am able to find that they were fairly recently introduced in the UK after my second search. It looks like ChatGPT gave you a bunch of extra output, some of which you didn't like, because you then spent a while trying to reverse engineer its system prompt rather than doing any actual follow-up on the question itself.
* Official name of the University of Cambridge: search gave me Wikipedia, which contains the official name and a link to a reference on the University's page. Pretty quick to solve with Google Search/Wikipedia.
* Exeter quay. I searched for "waterfront exeter cliff building" and found this result towards the top of the results: https://www.exeterquay.org/milestones/ which explains "Warehouses were added in 1834 [Cornish's] and 1835 [Hooper's], with provision for storing tobacco and wine and cellars for cider and silk were cut into the cliffs downstream." You seemed to be a lot more entertained by ChatGPT's persistence in finding more info, but for satisfying curiosity about the basic question, I got an answer pretty quickly via Google.
* Aldi vs Lidl: this is a much more subjective question, so whether the results you get via a quick Google search meet your needs, vs. the summary of subjective results you get via ChatGPT, is more of a question only you can answer. I do find some Reddit threads and similar with a quick Google search.
* Book scanning. You asked specifically about destructive book scanning. You can do a quick search of each of the labs and "book scanning" and find the same lack of results that ChatGPT gives you. Maybe takes a similar amount of time to how long it spent thinking. You pretty much only find references to Anthropic doing destructive book scanning, and Google doing mostly non-destructive scanning.
Anyhow, the results are mixed. For a bunch of these, I found an answer quicker via a Google search (or Google Lens search), and doing some quick scanning/filtering myself. A few of them, I feel like it was a wash. A couple of them actually do take more iteration/research, the bouncy travelator being the most extreme example, I think; narrowing down the timeline on my own would take a lot of detailed looking through sources.
As far as I can tell the Google + Wikipedia solution gets the name of Cambridge University wrong: Wikipedia lists it as "The Chancellor, Masters and Scholars of the University of Cambridge" whereas GPT-5 correctly verified it to be "The Chancellor, Masters, and Scholars of the University of Cambridge" (note that extra comma) as listed on https://www.cam.ac.uk/about-the-university/how-the-universit...
I tried to reverse engineer the system prompt in the cake pop conversation https://chatgpt.com/share/68bc71b4-68f4-8006-b462-cf32f61e7e... purely because I got annoyed at it for answering "haha I believe you" - I particularly disliked the lower case "haha" because I've seen it switch to lower case (even the lower case word "i") in the past and I wanted to know what was causing it to start talking in the same way that Sam Altman tweets.
With thinking it took longer (just shy of two minutes), but it compared a variety of different sources and came back with numbers and with each statement in the summary sourced.
I’ve used GPT a bunch for finding things like bin information on the council site that I just couldn’t easily find myself. I’ve also sent it off to dig through PRs, specs and more for Matrix, where it found the features and experimental flags required to solve a problem I had. Reading that many proposals and checking what’s been accepted is a massive pain, and it solved this while I went to make a coffee.
Why create a fancy infrastructure for this new universal thing, when the old thing already does it more reliably and with fewer steps?
It fails completely for complex political or investigative questions where there is no clear answer. Reading a single Wikipedia page is usually a better use of one's time:
You don't have to pretend that you are parallelizing work (which is just for show) while waiting three min for the "AI" answer. You practice speed reading and memory retention. You enhance your own semantic network instead of the network owned and controlled by oligopoly members.
It's not going away, ever.
I just wish the business models could justify a confidence level being attached to the response.
https://www.fortressofdoors.com/researchers-beware-of-chatgp...
Your view is grinding a political axe and I don't think you're in a position to objectively assess whether ChatGPT failed in this case.
Also what “axe” am I grinding? The findings are specifically inconvenient for my political beliefs, not confirming my priors! My priors would be flattered if Silagi was correct about everything but the primary sources definitively prove he’s exaggerating.
> You published a blog about that opinion, and you want ChatGPT to accept your view.
False, and I address this multiple times in the piece. I don’t want ChatGPT to mindlessly agree with me, I want it to discover the primary source documents.
So just zooming out, that's not the right sort of setup for being an impartial researcher. And in your blog post your disagreements come off to me as wanting a sort of purity with respect to Georgism that I wouldn't expect to be reflected in the literature.
I like Kant, but it would be a bit like me saying ChatGPT was fundamentally wrong to consider John Rawls a Kantian because I can point to this or that paper where he diverges from Kant. I could even write a blog post describing this and pointing to primary sources. But Rawls is considered a Kantian for good reason, and it would (in my opinion) be misleading for me to call that a big ChatGPT failure mode just because it didn't take my view on my pet subject as seriously as I wanted.
The literature — the primary source documents — do not in fact support a maximalist Georgist case! This is what I have been trying to say!!!
You are accusing me of the exact opposite thing I’m arguing for!!! The historical case the primary sources show is inconvenient for my political movement!
The failure of ChatGPT is not that it disagrees with any opinion of mine, but that it does not surface primary source documents. That’s the issue.
It’s baffling to be accused of confirmation bias when I point out research findings that go against what would be maximally convenient for my own cause.
But often people who believe in a given doctrine will see differences as more important than they objectively are. For example, just to continue with socialism, it's common for socialist believers to argue that this or that country is or isn't socialist in a way that disagrees with mainstream historians.
I'm sure there are other examples, for example people disagreeing about which bands are punk or hardcore. A music historian would likely cast a wider net. Fans who don't listen to many other types of music might cast a very narrow net.
The Silagi paper makes a factual claim: that there was only one significant tax in the German colony of Kiautschou, a single tax on land.
The direct primary sources reveal that this is not the case. There were multiple taxes, most significantly large tariffs. Additionally there were two taxes on land, not one -- a conventional land value tax, and a "land increment" or capital gains tax.
These are not minor distinctions. These are not matters of subjective opinion. These are clear, verifiable questions of fact. The Silagi paper does not acknowledge them.
ChatGPT, in the early trials I graded, does not even acknowledge the German primary sources. You keep saying that I am upset it doesn't agree with me.
I am saying the chief issue is that ChatGPT does not even discover the relevant primary sources. That is far more important than whether it agrees with me.
> For example, just to continue with socialism, it's common for socialist believers to argue that this or that country is or isn't socialist in a way that disagrees with mainstream historians.
Notice you said "historians." Plural. I expect a proper researcher to cite more than ONE paper, especially if the other papers disagree, and even if it has a preferred narrative, to at least surface to me that there is in fact disagreement in the literature, rather than to just summarize one finding.
Also, if the claims are being made about a piece of German history, I expect it to cite at least one source in German, rather than to rely entirely on one single English-language source.
The chief issue is that ChatGPT over-cites one single paper and does not discover primary source documents. That is the issue. That is the only issue.
> I am saying you are seeing distinctions as more important than the rest of the literature and concluding that the literature is erroneous.
And I am saying that ChatGPT did not in fact read the "rest of the literature." It is literally citing ONE article, and other pieces that merely summarize that same article, rather than all of the primary source documents. It is not in fact giving me anything like an accurate summary of the literature.
I am not saying "The literature is wrong because it disagrees with me." I am saying "one paper, the only one ChatGPT meaningfully cites, is directly contradicted by the REST of the literature, which ChatGPT does not cite."
A truly "research grade" or "PhD grade" intelligence would at the very least be able to discover that.
I hear you that this is about finding sources, but even perfect coverage of primary sources wouldn’t remove the need for judgment. We’d still have to define what counts as "Georgist," "inspired by George," and "significant" as a tax. Those are contestable choices. What you have is a thesis about the evidence, potentially a strong one, but it isn’t an indisputable fact.
On sourcing: I’m aware ChatGPT won’t surface every primary source, and I’m not sure that should be the default goal. In many fields (e.g., cancer research), the right starting point is literature reviews and meta-analyses, not raw studies. History may differ, but many primary sources live offline in archives, and the digitized subset may not be representative. Over-weighting primary materials in that context can mislead. Primary sources also demand more expertise to interpret than secondary syntheses; Wikipedia itself cautions about this: https://en.wikipedia.org/wiki/Wikipedia:Identifying_and_using...
To be clear, I’m not saying you’re wrong about the tax or that Silagi is right. I’m saying that framing this as a “pathological failure” overstates the situation. What I see is a legitimate disagreement among competent researchers.
I wonder if asking ChatGPT in German would make a difference.
Switching to GPT5 Thinking helps a little, but it often misses things that it wouldn't when I was using o3 or o1.
As an example, I asked it if there were any incidents involving Botchan in an Onsen. This is a text that is readily available and must have been trained on; in the book, Botchan goes swimming in the onsen, and is then humiliated when, the next time he comes back, there is a sign saying "No swimming in the Onsen".
GPT-5 gives me this, which is subtly wrong:
> In the novel, when Botchan goes to Dōgo Onsen, he notes the posted rules of the bath. One of them forbids things like:
> “No swimming in the bath.” (泳ぐべからず)
> “No roughhousing / rowdy behavior.” (無闇に騒ぐべからず)
> Botchan finds these signs funny because he’s exactly the sort of hot-headed, restless character who might be tempted to splash around or make noise. He jokes in his narration that it seems as though the rules were written specifically to keep people like him out.
Incidentally, Dogo Onsen still has the "No swimming" sign, or at least it did when I went 10 years ago.
I'll play devil's advocate and say that I think the Codex CLI included with the Plus subscription is pretty good (quality-wise). However, after using it, it suddenly told me I couldn't use it for a week, with no warning. Claude is a bit more reasonable there.
There is value in pruning the search tree because the deeper nodes are usually not reputable. I know you have cause to believe that "Wilhelm Matzat" is reputable, but I don't think that can be assumed generally. If you were to force GPT to blindly accept counterpoints from people, the debate would never end. And there has to be a pruning point at which GPT accepts this tradeoff: a less reputable or less well-known source may occasionally have a correct point, but taking analyses from such sources means being incorrect more often.
You could go infinitely deep into any analysis and you will always have seemingly correct points on both sides. I think it is valid for GPT to prune the search at a point where it converges to what society at large believes. I'm okay with this tradeoff.
If we’re going to claim it is PhD level, it should be able to do “deep” research AND think critically about source credibility, just as a PhD would. If it can’t do that, they shouldn’t brand it that way.
Also it’s not like I’m taking Matzat’s word for anything. I can read the primary source documents myself! He’s also hardly an obscure source, he’s just not listed on Wikipedia.
Having an LLM generate search strings and then summarize the results does that research up front and automatically, I need only click the sources to verify. Kagi Assistant does this really well.
But, like the parent, I’m using the Kagi assistant.
So the answer here might be that "search for 5 things and pull the relevant results" works incredibly well, but first you have to build an extremely good search engine that lets the user filter out spam sites.
That said, this isn’t magic, it’s just automated an hour of googling. If the content doesn’t exist you won’t find it.
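For what it's worth, that "search for a few things and pull the relevant results" loop is simple enough to sketch. Roughly (llm() and web_search() below are hypothetical stand-ins for whatever model and search API is actually being used):

def research(question, llm, web_search, n_queries=5, per_query=3):
    # 1. Have the model propose a handful of distinct search strings.
    queries = llm(
        f"Write {n_queries} diverse web search queries for: {question}\n"
        "One per line, no numbering."
    ).splitlines()[:n_queries]

    # 2. Pull the top results for each query, keeping URLs so the user can verify.
    sources = []
    for q in queries:
        sources.extend(web_search(q)[:per_query])  # e.g. [{"url": ..., "snippet": ...}, ...]

    # 3. Summarize against those results only, with inline [n] citations.
    context = "\n".join(f"[{i}] {s['url']}: {s['snippet']}" for i, s in enumerate(sources))
    return llm(
        "Answer the question using only the sources below, citing them as [n].\n"
        f"{context}\n\nQuestion: {question}"
    )

The quality of web_search() is doing most of the work there, which is exactly the point above: the wrapper is trivial, the search engine underneath isn't.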
I recently added the following to my custom instructions to get the best of both worlds:
# Modes
When the user enters the following strings you should follow the following mode instructions:
1. "xz": Use the web tool as needed when developing your answer.
2. "xx": Exclusively use your own knowledge instead of searching the internet.
By default use mode "xz". The user can switch between modes during a chat session. Stay with the current mode until the user explicitly switches modes.
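With that in place, starting a message with "xx" (e.g. "xx explain what this error means") keeps the answer to the model's own knowledge, while "xz" - or no prefix at all, since it's the default - lets it use the web tool, and you can flip modes mid-conversation just by sending the other string.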
I keep switching between both but I think I'm starting to prefer the lighter one that is based on the sources instead.
From what I can tell, they are pretty damn big.
Grok 4 is quite large too.
Have you just hallucinated that?
"Do deep internet research and thinking to present as much evidence in favor of the idea that JRR Tolkein's Lord of the Rings trilogy was inspired by Mervyn Peake's Gormenghast series."
https://chatgpt.com/share/68bcd796-bf8c-800c-ad7a-51387b1e53...
A while ago I bragged at a conference about how ChatGPT had "solved" something... Yeah, we know, it's from Wikipedia and it's wrong :)
Formulating the state of your current knowledge graph, which was just amplified by ChatGPT's research, might be a way to offset the loss of XP ... the XP that comes with grinding at whatever level kids currently find themselves ...
Relevant blog post: https://housefresh.com/beware-of-the-google-ai-salesman/
GPT-4o and most other AI-assisted search systems in the past worked how you describe: they took the top 10 search results and answered uncritically based on those. If the results were junk the answer was too.
GPT-5 Thinking doesn't do that. Take a look at the thinking trace examples I linked to - in many of them it runs a few searches, evaluates the results, finds that they're not credible enough to generate an answer and so continues browsing and searching.
That's why many of the answers take 1-2 minutes to return!
I frequently see it dismiss information from social media and prefer to go to a source with a good reputation for fact-checking (like a credible newspaper) instead.
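In rough pseudocode, the difference looks something like this - a sketch of the behavior as I read it from the visible thinking traces, not anything OpenAI has documented (search(), looks_credible(), enough_evidence(), reformulate(), and answer_from() are all hypothetical helpers):

def old_style_answer(question):
    # GPT-4o-era pattern: one search, answer from whatever comes back.
    return answer_from(question, search(question))

def thinking_style_answer(question, max_rounds=6):
    # GPT-5 Thinking pattern (as observed): keep searching, filtering, and
    # reformulating until the collected sources look credible enough.
    collected, query = [], question
    for _ in range(max_rounds):
        results = [r for r in search(query) if looks_credible(r)]
        collected += results
        if enough_evidence(question, collected):
            break
        query = reformulate(question, collected)
    return answer_from(question, collected)

Those extra rounds are where the 1-2 minutes go.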
Credibility is only one side of the story. In many cases, at least for my curiosity-driven research, I happen to search for something very niche, so to find anything related at all, an LLM needs to find semantic equivalence between the topic in the query and what the found pages are discussing or explaining.
One recent example: in a flat-style web discussion, it may be interesting to somehow visually mark a reply if the comment is from a user who was already in the discussion (at least GP or GGP). I wanted to find some thoughts or talk about this. I had almost no luck with Perplexity, which probably brute-forced dozens of result pages for semantic equivalence comparison, and I also "was not feeling/getting lucky" with Google using keywords, the AROUND operator, and so on. I'm sure there are a couple of blogs and web-technology forums where this was really discussed, but I'm not sure the current indexing technology is semantically aware at scale.
It's interesting that sometimes Google is still better, for example, when a topic I’m researching has a couple of specific terms one should be aware of to discuss it seriously. Making them mandatory (with quotes) may produce a small result set to scan with my own eyes.
How do you know it did not make it up? Are you an expert in the field?
You're now here telling us how it gave you the right answer, which seems to mostly be due to it confirming your bias.
Navigating their feature set is… fun.
I select "GPT-5 Thinking" from the model picker and make sure its regular search tool is enabled.
Not sure if you tend to edit your posts, but it could be worth clarifying.
Btw — my colleagues and I all love your posts. I’ll quit fanboying now lol.
Small nit, Simon: satisfying curiosity is the important endeavor.
<3
In the former, the research feels genuine and in the latter it feels hollow and probably fake.
However, the non-thinking search is total garbage. It searches once, and then gives up or hallucinates if the results don't work out. I asked it the same question, and it says that the information isn't publicly available.
The other ones will do the thing I want: search a bunch, digest the results, and give me a quick summary table or something.
It's annoying when it's so confident making up nonsense.
IMO ChatGPT is just a league above when it comes to reliability.
I like ChatGPT as a product more, but Gemini does well on many things that ChatGPT struggles with a little more. Just my anecdotes.
Which is, in my opinion, the #1 metric an LLM should strive for. It can take quite some time to get anything out of an LLM, and if the model turns out to be unreliable/untrustworthy, the value of its output is lost.
It's weird that modern society (in general) so blindly buys in to all of the marketing speak. AI has a very disruptive effect on society, only because we let it happen.
Is the fundamental problem that it weights all sources equally so a bunch of non-experts stating the wrong answer will overpower a single expert saying the correct answer?
> FWIW Deep Research doesn’t run on whatever you pick in the model selector. It’s a separate agent that uses dedicated o‑series research models: full mode runs on o3; after you hit the full‑mode cap it auto‑switches to a lightweight o4‑mini version. The picker governs normal chat (and the pre‑research clarifying Qs), not the research engine itself.
"It's not the Deep [Re]Search or Agent Mode. I select 'GPT-5 Thinking' from the model picker and make sure its regular search tool is enabled."