Posted by atgctg 12/11/2025

GPT-5.2 (openai.com)
https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

1195 points | 1083 comments (page 2)
xd1936 12/11/2025|
> While GPT‑5.2 will work well out of the box in Codex, we expect to release a version of GPT‑5.2 optimized for Codex in the coming weeks.

https://openai.com/index/introducing-gpt-5-2/

jstummbillig 12/11/2025||
> For coding tasks, GPT-5.1-Codex-Max is a faster, more capable, and more token-efficient coding variant

Hm, yeah, strange. You would not be able to tell from any chart on the page. Obviously not a gotcha (they put it on the page themselves, after all), but how does that make sense with those benchmarks?

tempaccount420 12/11/2025|||
Coding requires a mindset shift that the -codex fine-tunes provide. Codex will do all kinds of weird stuff, like poking around in your ~/.cargo, ~/go, etc. to find docs and trying out code in isolation; these things definitely improve capability.
dmos62 12/11/2025||
The biggest advantage of codex variants, for me, is terseness and reduced sycophancy. That, and presumably better adherence to requested output formats.
baq 12/12/2025||||
Codex talks much less than the standard variant, especially between tool calls.
deaux 12/12/2025|||
Looks like they removed that line.
k_bx 12/12/2025||
gpt-5.2 is already present in codex at this moment
preetamjinka 12/11/2025||
It's actually more expensive than GPT-5.1. I've gotten used to prices going down with each new model, but this time they've gone up.

https://platform.openai.com/docs/pricing

kingstnap 12/11/2025||
Flagship models have rarely been cheaper, and especially not on release day. There are only a few cases of this, really.

Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.

The price drops usually come in the form of flash and mini models being really cheap and fast, like when we got o4-mini, or 2.0 Flash, which was a particularly significant one.

n2d4 12/11/2025||
That's not true.

    > Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.
And GPT-4o, GPT-4.1, and GPT-5. Almost every OpenAI release got cheaper on a per-input-token basis.
PhilippGille 12/11/2025|||
Gemini 3 Pro Preview also got more expensive than 2.5 Pro.

2.5 Pro: $1.25 input, $10 output (million tokens)

3 Pro Preview: $2 input, $12 output (million tokens)

TechDebtDevin 12/11/2025||
Literally no difference in productivity from a free or sub-$0.50-per-million-output-tokens OpenRouter model. All these $1.00+ per million output models are literal scams. No added value to the world.
wahnfrieden 12/12/2025||
5.1 Pro is great
manmal 12/12/2025||
I struggle to see where Pro is better than 5.x with Thinking. Actually prefer the latter.
wahnfrieden 12/12/2025||
Many problems where the latter spins its wheels and Pro gets it in one go, for me. You need to give Pro full files as context, and you need to fit within its ~60k (I forget exactly) silent context window if using it via ChatGPT. Don't have it make edits directly; have it give the execution plan back to Codex.
deaux 12/12/2025|||
Getting more expensive has been the trend for the closed weights frontier models. See Gemini 3 Pro vs 2.5 Pro. Also see Gemini 2.5 Flash vs 2.0 Flash. The only thing that got cheaper recently was Opus 4.5 vs Opus 4.
Handy-Man 12/11/2025|||
It also seems much "smarter", though.
endorphine 12/11/2025|||
Reading this comment, it just occurred to me that we're still in the first phase of the enshittification process.
moralestapia 12/11/2025||
Previous models' prices usually go down, but the flagship has always been the most expensive one.
moralestapia 12/11/2025||
Wtf, why would this be downvoted?

I'm adding context and what I stated is provably true.

zug_zug 12/11/2025||
For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?
hbarka 12/11/2025||
On the contrary, I thought Gemini 3 Live mode is much much better than ChatGPT. The voices have none of the annoying artificial uptalking intonations that ChatGPT has, and the simplex/duplex interruptibility of Gemini Live seems more responsive. It knows when to break and pause during conversations.
febed 12/11/2025||
Apart from sounding a bit stiff and informal, I was also surprised at how good Gemini Live mode is in regional Indian languages.
simondotau 12/11/2025|||
I absolutely loathe ChatGPT's voice chat. It spends far too much time being conversational and its eagerness to please becomes fatiguing after the first back-and-forth.
joshmarlow 12/11/2025|||
I think Grok's voice chat is almost there. The only things missing for me:

* It's slower to start up by a couple of seconds

* It's harder to switch between voice and text and back again in the same chat (though ChatGPT isn't perfect at this either)

And of course Grok's unhinged persona is... something else.

Gigachad 12/11/2025|||
Pretty good until it goes crazy glazing Elon or declaring itself mecha hitler.
hcurtiss 12/11/2025||
Neither of these have happened in my use. Those were both the product of some pretty aggressive prompting, and were remedied months ago.
OrangeMusic 12/12/2025||
Yet, using this model in any way whatsoever after these episodes seems absolutely crazy to me.
hcurtiss 12/12/2025|||
All models have had similar instances. I particularly enjoyed Gemini’s black founders era. The “safety” teams have bent the politics of these tools in ways I don’t trust. Grok does too, but in my experience less so. This has real impacts.
user34283 12/12/2025|||
Grok is the only frontier model that is at all usable for adult content.
nazgulsenpai 12/11/2025||||
It's so much fun. So is the Conspiracy persona.
jesse_dot_id 12/11/2025|||
[flagged]
Robdel12 12/11/2025|||
I have found Claude's voice chat to be better. I only recently tried it because I liked ChatGPT's enough, but I think I'm going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.
lxgr 12/11/2025||
Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.
sosodev 12/11/2025|||
You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently (record and record), etc.
lxgr 12/11/2025|||
Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?
sosodev 12/11/2025|||
Yes, a sufficiently advanced marrying of TTS and LLM could pass a lot of these tests. That kind of blurs the line between native voice model and not though.

You would need:

* An STT (ASR) model that outputs phonetics, not just words

* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc

* A TTS model that understands those tokens and properly generates the matching voice

At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.
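
To make that concrete, here is a minimal sketch of such a cascaded setup in Python. Every function here is a hypothetical placeholder standing in for a real model; none of this is an actual API:

    # Hypothetical cascaded "almost native" voice pipeline (sketch only).
    # Each stub marks where a real model would go: a phonetics-aware ASR,
    # an LLM fine-tuned on prosody control tokens, and a TTS that renders them.

    def asr_with_phonetics(audio: bytes) -> str:
        """Hypothetical ASR emitting words plus phonetic/prosody markup,
        e.g. 'I want to re<stress>cord</stress> a record.'"""
        raise NotImplementedError

    def llm_with_prosody_tokens(markup: str) -> str:
        """Hypothetical LLM that reads that markup and emits its own control
        tokens, e.g. '<laugh/> Sure, <pause ms="300"/> let me check.'"""
        raise NotImplementedError

    def tts_with_control_tokens(markup: str) -> bytes:
        """Hypothetical TTS that renders control tokens as actual laughter,
        pauses, and pitch changes instead of reading them aloud."""
        raise NotImplementedError

    def voice_turn(audio_in: bytes) -> bytes:
        # Three model hops per turn is also where the latency would come from.
        return tts_with_control_tokens(
            llm_with_prosody_tokens(asr_with_phonetics(audio_in)))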

BoxOfRain 12/12/2025||
I've been experimenting with something similar to this approach recently. IndexTTS2 takes emotion vectors as an input; I used an external emotion classification model on the LLM output to modulate the TTS emotion vectors. You need to manage the state of the current affect with a bit of care or it sounds unhinged, but it's worked surprisingly well so far. I wired it together using Cats Effect.

As you'd expect, latency isn't great, but I think it can be improved.
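
The affect-state management mentioned above can be as simple as exponential smoothing of the emotion vector between utterances. A toy sketch (my own illustration, not the setup described above; the classifier and synthesis calls are hypothetical stand-ins):

    from typing import List

    def classify_emotion(text: str) -> List[float]:
        """Hypothetical external classifier returning an emotion vector
        for a piece of LLM output."""
        raise NotImplementedError

    class AffectState:
        """Smooth the emotion vector between utterances so the voice
        doesn't lurch from affect to affect (i.e. sound unhinged)."""

        def __init__(self, dims: int, alpha: float = 0.3):
            self.alpha = alpha           # how quickly the affect may shift
            self.current = [0.0] * dims  # start from a neutral affect

        def update(self, target: List[float]) -> List[float]:
            self.current = [
                (1 - self.alpha) * c + self.alpha * t
                for c, t in zip(self.current, target)
            ]
            return self.current

    # Usage, with a hypothetical IndexTTS2-style synthesis call:
    # state = AffectState(dims=8)
    # vec = state.update(classify_emotion(llm_reply))
    # audio = tts.synthesize(llm_reply, emotion_vector=vec)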

barrkel 12/11/2025||||
The model giving it text to speak would have to annotate the text in order for the TTS to add the affect. The TTS wouldn't "remember" such instructions from a previous speech-to-text stage.
jablongo 12/11/2025|||
I tried to make ChatGPT sing "Mary Had a Little Lamb" recently; it's atonal but vaguely resembles the melody, which is interesting.
causalmodels 12/11/2025|||
I just asked it and it said that it uses the on device TTS capabilities.
furyofantares 12/11/2025||
I find it very unlikely that it would be trained on that information or that anthropic would put that in its context window, so it's very likely that it just made that answer up.
causalmodels 12/11/2025||
No, it did not make it up. I was curious, so I asked it to imitate a posh British accent imitating a South Brooklyn accent while having a head cold, and it explained that it didn't have fine-grained control over the audio output because it was using a TTS. I asked it how it knew that, and it pointed me towards [1] and highlighted the following.

> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.

Tracked down the original source [2] and looked for additional updates but couldn't find anything.

[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...

[2] https://trust.anthropic.com/updates

furyofantares 12/11/2025||
If it does a web search that's fine; I assumed it hadn't, since you hadn't linked to anything.

Also it being right doesn't mean it didn't just make up the answer.

josephwegner 12/11/2025|||
Along with the hordes of other options people are responding with, I'm a big fan of Perplexity's voice chat. It does back-and-forth well in a way that I missed whenever I tried anything besides ChatGPT.
solarkraft 12/11/2025||
It is, shockingly, based on the OpenAI Realtime Assistant API.
ivape 12/11/2025|||
I'm a big user of Gemini voice. My sense is that Gemini voice uses very tight system prompts that are designed to give you an answer and kind of get you off the phone as much as possible. It doesn't have large context at all.

That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.

Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.

websiteapi 12/11/2025|||
Gemini Live is a thing - never tried ChatGPT, are they not similar?
spudlyo 12/11/2025|||
Not for my use case. I can open it up, and in restored classical Latin pronunciation say "Hi, my name is X, how are you?" and it will respond (also in Latin) "Hello X, I am well, thanks for asking. I hope you are doing great." Its pronunciation is not great, but intelligible. In the written transcript, it butchers what I say, but its responses look good, although sans macrons indicating phonemic vowel length.

Gemini responds in what I think is Spanish, or perhaps Portuguese.

However I can hand an 8 minute long 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels, and makes regular use of elision to Gemini-3-pro-preview and it will produce an accurate macronized Latin transcription. It's pretty mind blowing.

Dilettante_ 12/11/2025||
I have to ask: what use case requires you to speak Latin to the LLM?
spudlyo 12/11/2025|||
I'm a Latin language learner, and part of developing fluency is practicing extemporaneous speech. My dog is a patient listener, but a poor interlocutor. There are Latin language Discord servers where you can speak to people, but I don't quite have the confidence to do that yet. I assume the machine doesn't judge my shitty grammar.
onraglanroad 12/11/2025||
Loquerisne Latine? ("Do you speak Latin?")

Non vere, sed intelligere possum. ("Not really, but I can understand it.")

Ita, mihi est canis qui idipsum facit! ("Yes, I have a dog who does the very same thing!")

(translated from the Gàidhlig)

spudlyo 12/11/2025||
Certe loqui conor, sed saepenumero prave dico; canis meus non turbatus est ;) ("Certainly I try to speak, but I often say it wrong; my dog is not bothered.")
nineteen999 12/11/2025|||
You haven't heard? Latin is the next big wave, after blockchain and AI.
create-username 12/18/2025|||
you joke but Latin teachers are very sought after in my region. There are none. I have just bootcamped myself to become one and shift careers due to the high demand
spudlyo 12/11/2025|||
You laugh, but the global language learning market in 2025 is expected to exceed USD $100 billion, and LLMs IMHO are poised to disrupt the shit out of it.
nineteen999 12/12/2025||
Well sure, I can see that happening ... but I can't see Latin making a huge comeback, unfortunately.
jeanlucas 12/11/2025|||
no.
leaK_u 12/11/2025|||
how.
CamelCaseName 12/11/2025||
I find ChatGPT's voice-to-text to be the absolute best in the world, nearly perfect.

I have constant frustrations with Gemini's voice-to-text misunderstanding what I'm saying or, worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.

nickvec 12/11/2025|||
What? The voice chat is basically identical on ChatGPT and Gemini AFAICT.
tmaly 12/11/2025|||
I can't keep up with half the new features all the model companies keep rolling out. I wish they would solve that
SweetSoftPillow 12/11/2025|||
Gemini's much better, try it
sundarurfriend 12/11/2025|||
Are you saying ChatGPT's voice chat is of good quality? Because for me it's one of its most frustrating weaknesses. I vastly prefer voice input to typing, and would love it if the voice chat mode actually worked well.

But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, treating vehicle sounds as pauses it can start talking in (even if I'm talking much louder at the same time) or transcribing them as random YouTube subtitles (car motor = "Thanks for watching, subscribe!").

The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct; this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend here too. With the same question asked in this back-and-forth Voice mode and in a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.

To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. They also don't have as stark a difference between text answers and voice mode answers, but unfortunately that's mainly because Grok's text answers are also not great with hallucinations or instruction following.

So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provides a really usable voice mode right now.

whimsicalism 12/11/2025|||
gemini does, grok does, nobody else does (except alibaba but it’s not there yet)
codybontecou 12/11/2025|||
Their voice agent is handy. Currently trying to build around it.
semiinfinitely 12/11/2025|||
try gemini voice chat
bigyabai 12/11/2025|||
Qwen does.
sosodev 12/11/2025||
Qwen's voice chat is nowhere near as good as ChatGPT's.
FrasiertheLion 12/11/2025||
Try elevenlabs
sosodev 12/11/2025||
Does ElevenLabs have a real-time conversational voice model? It seems like their focus is largely on text to speech and speech to text, which can approximate that type of thing, but it's not at all the same as the native voice-to-voice that 4o does.
hi_im_vijay 12/11/2025|||
[disclaimer, i work at elevenlabs] we specifically went with a cascading model for our agents platform because it's better suited for enterprise use cases where they have full control over the brain and can bring their own llm. with that said, even with a cascading model, we can capture a decent amount of nuance with our asr model, and it also supports capturing audio events like laughter or coughing.

a true speech-to-speech conversational model will perform better on things like capturing tone, pronunciation, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.

YouAreWRONGtoo 12/12/2025||
[dead]
dragonwriter 12/11/2025|||
> Does elevenlabs have a real-time conversational voice model?

Yes.

> It seems like their focus is largely on text to speech and speech to text.

They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.

sosodev 12/11/2025|||
It specifically says in the architecture docs for the agents platform that it's STT (ASR) -> LLM -> TTS

https://elevenlabs.io/docs/agents-platform/overview#architec...

minadotcom 12/11/2025||
They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?
Tiberium 12/11/2025||
They did compare it to other models: https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png

enlyth 12/11/2025|||
This looks cherry-picked; for example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.
tobias2014 12/12/2025|||
And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
minadotcom 12/11/2025|||
agreed.
sergdigon 12/12/2025||||
The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).
whimsicalism 12/11/2025|||
uh oh, where did SWE bench go :D
whimsicalism 12/12/2025||
maybe they will release with gpt-5.2-codex
tabletcorry 12/11/2025|||
The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.

But they publish all the same numbers, so you can make the full comparison yourself, if you want to.

Workaccount2 12/11/2025|||
They are taking a page out of Apple's book.

Apple only compares to themselves. They don't even acknowledge the existence of others.

poormathskills 12/11/2025||
OpenAI has never compared their models to models from other labs in their blog posts. Open literally any past model launch post to see that.
boole1854 12/11/2025||
https://openai.com/index/hello-gpt-4o/

I see evaluations compared with Claude, Gemini, and Llama in that GPT-4o post.

kgwgk 12/11/2025||
“You are absolutely right, and I apologize for the confusion.”
snake_doc 12/11/2025||
> Models were run with maximum available reasoning effort in our API (xhigh for GPT‑5.2 Thinking & Pro, and high for GPT‑5.1 Thinking), except for the professional evals, where GPT‑5.2 Thinking was run with reasoning effort heavy, the maximum available in ChatGPT Pro. Benchmarks were conducted in a research environment, which may provide slightly different output from production ChatGPT in some cases.

Feels like a Llama 4 type release. The benchmarks are not apples to apples: reasoning effort is higher across the board, thus using more compute to achieve a higher score.

Also note that some results may not be reproducible.

Also, the vision benchmarks all use a Python tool harness, and they exclude scores that are low without the harness.

jbkkd 12/12/2025||
A new model doesn't address the fundamental reliability issues with OpenAI's enterprise tier.

As an enterprise customer, the experience has been disappointing. The platform is unstable, support is slow to respond even when escalated to account managers, and the UI is painfully slow to use. There are also baffling feature gaps, like the lack of connectors for custom GPTs.

None of the major providers have a perfect enterprise solution yet, but given OpenAI's market position, the gap between expectations and delivery is widening.

sigmoid10 12/12/2025||
Which tier are you? We are on the highest enterprise tier and I've found that OpenAI is a much more stable platform for high-usage than other providers. Can't say much about the UI though since I almost exclusively work with the API. I feel like UIs generally suck everywhere unless you want to do really generic stuff.
energy123 12/12/2025||
ChatGPT UI is leagues above Gemini and AI Studio in responsiveness and latency which is what I care about.
dannyw 12/12/2025||
Completely the opposite experience.
tenpoundhammer 12/11/2025||
I have been using ChatGPT a ton over the last months and paying for the subscription. I used it for coding, news, stock analysis, daily problems, and whatever else I could think of. I decided to give Gemini a go when version three came out to great reviews. Gemini handles every single one of my use cases much better and consistently gives better answers. This is especially true for situations where searching the web for current information is important; it makes sense that Google would be better at that. Also, OCR is phenomenal: ChatGPT can't read my bad handwriting, but Gemini can easily.

The only downsides are in the polish department. There are more app bugs, and I usually have to leave the app open or the session terminates. There are bugs with uploading photos. The biggest complaint is that all links get routed through Google Search, and then I have to manipulate them when they should go directly to the chosen website; this has to be some kind of internal org KPI nonsense.

Overall, my conclusion is that ChatGPT has lost and won't catch up, because of Gemini's search integration strength.
dmd 12/11/2025||
I consistently have exactly the opposite experience. ChatGPT seems extremely willing to do a huge number of searches, think about them, kick off more searches after that thinking, think about those, etc., whereas Gemini seems extremely reluctant to do more than a couple of searches. ChatGPT is also willing to open up PDFs, screenshot them, OCR them, and use that as input, whereas Gemini just ignores them.
nullbound 12/11/2025|||
I will say that it is wild, if not somewhat problematic, that two users have such disparate views of seemingly the same product. I say that, but then I remember my own experience from just a few days ago. I don't pay for Gemini, but I have a paid ChatGPT sub. I tested both for the same product with seemingly the same prompt, and subscribed ChatGPT subjectively beat Gemini in terms of scope, options, and links with currently decent deals.

It seems (only seems, because I have not gotten around to testing it in any systematic way) that some variables like context and what the model knows about you may actually influence the quality (or lack thereof) of the response.

martinpw 12/11/2025|||
> I will say that it is wild, if not somewhat problematic that two users have such disparate views of seemingly the same product.

This happens all the time on HN. Before opening this thread, I was expecting that the top comment would be 100% positive about the product or its competitor, and one of the top replies would be exactly the opposite, and sure enough...

I don't know why it is. It's honestly a bit disappointing that the most upvoted comments often have the least nuance.

stevage 12/11/2025|||
How much nuance can one person's experience have? If the top two most visible things are detailed, contrary experiences of the same product, that seems a pretty good outcome?
AznHisoka 12/12/2025||
Also, why introduce nuance for the sake of nuance? For every single use case, Gemini (and Claude) has performed better. I can't give ChatGPT even the slightest credit when it doesn't deserve any.
block_dagger 12/11/2025|||
Replace "on HN" with "in the course of human events" and we may have a generally true statement ;)
rabf 12/11/2025||||
ChatGPT is not one model! Unless you manually specify a particular model, your question can be routed to different models depending on what it guesses would be most appropriate for your question.
stingraycharles 12/12/2025||
Isn’t that just standard MoE behavior? And isn’t the only choice you have from the UI between “Instant” and “Thinking”?
baq 12/12/2025||
MoE is a single model thing, model routing happens earlier.
stingraycharles 12/12/2025||
Yes but then what does the grandparent mean with “unless you specify a specific model” ? Do they mean “if you select auto, it automatically decides between instant or thinking” ?

That’s… hardly something worth mentioning.

rabf 12/16/2025||
If you have the paid subscription, you can choose which model your question is routed to. Current options in the UI are GPT-5.1 Instant, GPT-5.1 Thinking, GPT-5 Instant, GPT-5 Thinking mini, GPT-5 Thinking, GPT-4o, GPT-4.1, o3 and o4-mini. Options like deep research will affect the reasoning level used. There is a lot that goes on behind the scenes in the ChatGPT app, with things like tool use or function calling coming into play as well. Ultimately, what OpenAI will be trying/hoping to do is give you a satisfactory result using the least amount of compute possible; this is where the autorouter is very useful for them, and ostensibly for the user, who would not know which one to pick. I mostly just use the APIs these days, as I like to be the one who decides who/what I am talking to.
blks 12/11/2025||||
Because neither product has any consistency in its results, no predictable behaviour. One day it performs well, another it hallucinates non-existent facts and libraries. These are stochastic machines.
sendes 12/11/2025||
I see the hyperbole is the point, but surely what these machines do is literally predict? The entire prompt engineering endeavour is to get them to predict better and more precisely. Of course, these are not perfect solutions; they are stochastic, after all, just not unpredictably so.
coliveira 12/12/2025||
Prompt engineering is voodoo. There's no sure way to determine how well these models will respond to a question. Of course, giving additional information may be helpful, but even that is not guaranteed.
lossyalgo 12/12/2025|||
Also every model update changes how you have to prompt them to get the answers you want. Setting up pre-prompts can help, but with each new version, you have to figure out through trial and error how to get it to respond to your type of queries.

I can't wait to see how badly my finally sort-of-working ChatGPT 5.1 pre-prompts work with 5.2.

Edit: How to talk to these models is actually documented, but you have to read through huge documents: https://cdn.openai.com/gpt-5-system-card.pdf

baq 12/12/2025|||
It definitely isn’t voodoo, it’s more like forecasting weather. Some forecasts are easier to make, some are harder (it’ll be cold when it’s winter vs the exact location and wind speed of a tornado for an extreme example). The difference is you can try to mix things up in the prompt to maximize the likelihood of getting what you want out and there are feasibility thresholds for use cases, e.g. if you get a good answer 95% of the time it’s qualitatively different than 55%.
coliveira 12/12/2025||
No, it's not. Nowadays we know how to predict the weather with great confidence. Prompting may get you different results each time. Moreover, LLMs depend on the context of your prompts (because of their memory), so a single prompt may be close to useless and two different people can get vastly different results.
baq 12/12/2025||
> we know how to predict the weather with great confidence

some weather, sometimes. we're not good at predicting exact paths of tornadoes.

> so a single prompt may be close to useless and two different people can get vastly different results

of course, but it can be wrong 50% of the time or 5% of the time or .5% of the time and each of those thresholds unlock possibilities.
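
To put rough numbers on those thresholds (my own back-of-the-envelope, assuming each step in a chained workflow succeeds independently with probability p):

    # If a workflow chains n prompt steps, each succeeding with
    # probability p, the whole chain succeeds with probability p**n.
    for p in (0.95, 0.55):
        for n in (1, 5, 10):
            print(f"p={p:.2f}, n={n:2d} steps -> {p ** n:.3f} overall")
    # p=0.95 survives a 10-step chain ~60% of the time;
    # p=0.55 survives it ~0.25% of the time.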

dmd 12/11/2025||||
And I’d really like for Gemini to be as good or better, since I get it for free with my Workspace account, whereas I pay for chatgpt. But every time I try both on a query I’m just blown away by how vastly better chatgpt is, at least for the heavy-on-searching-for-stuff kinds of queries I typically do.
Workaccount2 12/11/2025||||
Gemini has tons of people using it free via aistudio

I can't help but feel that google gives free requests the absolute lowest priority, greatest quantization, cheapest thinking budget, etc.

I pay for gemini and chatGPT and have been pretty hooked on Gemini 3 since launch.

crorella 12/11/2025||||
It’s like having 3 coins and users preferring one or the other when tossing it because one coin gives consistently more heads (or tails) than the other coin.

What is better is to build a good set of rules and stick to one and then refine those rules over time as you get more experience using the tool or if the tool evolves and digress from the results you expect.

nullbound 12/11/2025||
<< What is better is to build a good set of rules and

But, unless you are on a local model you control, you literally can't. Otherwise, good rules will work only as long as the next update allows. I will admit that makes me consider some other options, but those probably shouldn't be 'set and iterate' each time something changes.

crorella 12/12/2025||
What I had in mind when I added that comment was coding, with the use of .md files. For the web versions of chats, I agree there is little control over how to tailor the way you want the agent to behave, unless you give an initial "setup" prompt.
jhancock 12/11/2025||||
I can use GPT one day and the next get a different experience with the same problem space. Same with Gemini.
4ndrewl 12/11/2025|||
This is by design, given a non-deterministic application?
jhancock 12/11/2025||
Sure. It may be more than that... possibly due to variable operating params on the servers and current load.

On the whole, if I compare my AI assistant to a human worker, I get more variance than I would from a human office worker.

pixl97 12/11/2025|||
That's because you don't 'own' the LLM compute. If you instead bought your office workers by the question, I'm sure the variability would increase.
astrange 12/12/2025|||
They're not really capable of producing varying answers based on load.

But they are capable of producing different answers because they feel like behaving differently if the current date is a holiday, and things like that. They're basically just little guys.

sjaramillo 12/11/2025|||
I guess LLMs have a mood too
dr_dshiv 12/12/2025||
Vibes
nunez 12/11/2025||||
Tesla FSD has been more or less the same experience. Some people drive 100s of miles without disengaging while others pull the plug within half a mile from their house. A lot of it depends on what the customer is willing to tolerate.
austhrow743 12/12/2025||||
We've been having trouble telling if people are using the same product ever since ChatGPT first got popular. They had a free model and a paid model, that was it, no other competitors or naming schemes to worry about, and discussions were still full of people talking about current capabilities without saying what model they were using.

For me, "gemini" currently means using this model in the llm.datasette.io cli tool.

openrouter/google/gemini-3-pro-preview

For what anyone else means? If they're equivalent? If Google does something different when you use "Gemini 3" in their browser app vs their cli app vs plans vs api users vs third party api users? No idea to any of the above.

I hate naming in the llm space.

dmd 12/12/2025||
FWIW i’m always using 5.1 Thinking.
Bombthecat 12/12/2025|||
Could also be a language thing ...
ghostpepper 12/12/2025||||
Same, I use ChatGPT Plus (the entry-level paid option) extensively for personal research projects and coding, and it seems miles ahead of whatever "Gemini Pro" is that I have through work. Twice yesterday, Gemini repeated verbatim a previous response as if I hadn't asked another question and told it why the previous response was bad. Gemini feels like ChatGPT from two years ago.
staticman2 12/11/2025||||
Are you uploading PDFs that already have a text layer?

I don't currently subscribe to Gemini, but on AI Studio's free offering, when I upload a non-OCR PDF of around 20 pages, the software environment's OCR feeds it to the model with greater accuracy than I've seen from any other source.

dmd 12/11/2025||
I’m not uploading PDFs at all. I’m talking about PDFs it finds while searching than it extracts data from for the conversation.
staticman2 12/11/2025||
I'm surprised to hear anyone finds these models trustworthy for research.

Just today I asked Claude what year over year inflation was and it gave me 2023 to 2024.

I also thought some sites ban A.I. crawling so if they have the best source on a topic, you won't get it.

Workaccount2 12/12/2025||
Anytime you use LLMs you should be keenly aware of their knowledge cutoff. Like any other tool, the more you understand it, the better it works.
staticman2 12/12/2025||
I'm sorry, but I don't see what "knowledge cutoff" has to do with what we were talking about, which is using an LLM to find PDFs and other sources for research.
whazor 12/12/2025||||
I agree with you. To me, Gemini has much worse search results. Then again, I use Kagi for search and I cannot stand the search results from Google anymore. And it's clear that Gemini uses those.

In contrast, ChatGPT has built its own search engine that performs better in my experience. Except for coding, where I opt for Claude Opus 4.5.

noname120 12/11/2025|||
Perplexity Pro with any thinking model blows both out of the water in a fraction of the time, in my experience
kccqzy 12/11/2025|||
> The biggest complaint is that all links get inserted into google search and then I have to manipulate them when they should go directly to the chosen website, this has to be some kind of internal org KPI nonsense.

Oh I know this from my time at Google. The actual purpose is to do a quick check for known malware and phishing. Of course these days such things are better dealt with by the browser itself in a privacy preserving way (and indeed that’s the case), so it’s unnecessary to reveal to Google which links are clicked. It’s totally fine to manipulate them to make them go directly to the website.

gjuggler 12/12/2025|||
I think Gemini is just broken.

Instead of forwarding model-generated links to https://www.google.com/url?q=[URL], which serves the purpose of malware check and user-facing warning about linking to an external site, Gemini forwards links to https://www.google.com/search?q=[URL], which does... a Google search for the URL, which isn't helpful at all.

Example: https://gemini.google.com/share/3c45f1acdc17

NotebookLM by comparison, does the right thing: https://notebooklm.google.com/notebook/7078d629-4b35-4894-bb...

It's kind of impressive how long this obviously-broken link experience has been sitting in the Gemini app used by millions.

sundarurfriend 12/11/2025|||
That's interesting. I just today started getting a "Some sites restrict our ability to check links." dialog in ChatGPT that wanted me to verify that I really wanted to follow the link, with a Learn More link to this page: https://help.openai.com/en/articles/10984597-chatgpt-generat...

So it seems like ChatGPT does this automatically and internally, instead of using an indirect check like this.

solarkraft 12/11/2025|||
> Only downsides are in the polish department

What an understatement. It has me thinking "man, fuck this" on the daily.

Just today it spontaneously lost an entire 20-30 minute long thread, and it was far from the first time. It basically does this any time you interrupt it in any way. It's straight up data loss.

It’s kind of a typical Google product in that it feels more like a tech demo than a product.

It has theoretically great tech. I particularly like the idea of voice mode, but it's noticeably glitchy, breaks spontaneously quite often, and keeps asking annoying questions, which you can't make it stop doing.

sundarurfriend 12/11/2025|||
ChatGPT's web UI was also like this for the longest time, until a few months ago: all sorts of random UI bugs leading either to data loss or misleading UI state. Interrupting is still very flaky there too. And on the mobile app, if you move away from the app while it's taking time to think, its state somehow desyncs from the actual backend thinking state and gets stuck randomly; sometimes restarting the app fixes it, sometimes the chat is unusable from that point on.

And the lack of UI polish shows up freshly every time a new feature lands too: the "branch in new chat" feature is still really finicky, getting stuck in an unusable state if you twitch your eyebrows at the wrong moment.

gcr 12/12/2025|||
i basically can't use the ChatGPT app on the subway for these reasons. the moment the websocket connection drops, i have to edit my last message and resubmit it unchanged.

it's like the client, not the server, is responsible for writing to my conversation history or something

spruce_tips 12/12/2025|||
it took me a lot of tinkering to get this feeling seamless in my own apps that use the api under the hood. i ended up buffering every token into a redis stream (with a final db save at the end of streaming) and building a mechanism to let clients reconnect to the stream on demand. no websocket necessary.

works great for kicking off a request and closing the tab, or navigating away to another page in my app to do something.

i don't understand why model providers don't build this resilient token streaming into all of their APIs. would be a great feature
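
A minimal sketch of that reconnectable token stream (my reconstruction of the idea, assuming redis-py; the key names are made up):

    import redis

    r = redis.Redis()

    def buffer_token(chat_id: str, token: str) -> None:
        # Producer side: append each token as it streams in from the model API.
        r.xadd(f"chat:{chat_id}:stream", {"token": token})

    def replay_and_follow(chat_id: str, last_id: str = "0"):
        # Consumer side: a reconnecting client passes the last stream ID it
        # saw; XREAD replays everything after it, then blocks for new tokens.
        stream = f"chat:{chat_id}:stream"
        while True:
            entries = r.xread({stream: last_id}, block=5000)
            if not entries:
                continue  # timed out with nothing new; keep waiting
            for _, messages in entries:
                for msg_id, fields in messages:
                    last_id = msg_id
                    yield fields[b"token"].decode()

A sentinel entry written at end-of-stream (alongside the final DB save) would tell the consumer when to stop.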

rishabhaiover 12/12/2025||
exactly. they need to bring in Spotify-level caching, like streaming music that just works when you're in a subway. Constant availability should be table stakes for them.
rjzzleep 12/12/2025|||
I get that the web versions are free, but if you can afford API access, I always recommend using Msty for everything. It's a much better experience.

https://msty.ai/

p_ing 12/12/2025|||
> ChatGPT web UI was also like this for the longest time

Copilot Chat has been perfect in this respect. It's currently GPT 5.0, moving to 5.1 over the next month or so, but at least I've never lost an (even old) conversation since those reside in an Exchange mailbox.

Max-Limelihood 12/12/2025||
I lost thousands of conversations I'd had back in the move from "Bing" to "Copilot". Moved straight to Claude and never touched a GPT again.
Duanemclemore 12/12/2025|||
I downloaded my archive and completely ended my GPT subscription last week after some bad computer maintenance advice. Same thing here: using other models, never touching that product again.
topato 12/12/2025||
now I kind of HAVE to know... what was the aforementioned bad advice?! So mysterious!
Duanemclemore 12/12/2025||
Oh, it was DUMB. I was dumb. I only have myself to blame here. But we all do dumb things sometimes, owning your mistakes keeps you humble, and you asked. So here goes.

I use modeling software called Rhino on wine on Linux. In the past, there was an incident where I had to copy an obscure dll that couldn't be delivered by wine or winetricks from a working Windows installation to get something to work. I did so and it worked. (As I recall this was a temporary issue, and was patched in the next release of wine.)

I hate the wine standard file picker; it has always been a persistent issue with Rhino3d. So I keep banging my head on trying to get it to either perform better or make a replacement. Every few months I'll get fed up and have a minute to kill, so I'll see if some new approach works. This time, ChatGPT told me to copy two dll's from a working Windows installation to the System folder. Having precedent that this can work, I did.

Anyway, it borked startup completely and it took like an hour to recover. What I didn't consider, and I really, really should have, was that these were dll's that were ALREADY IN the system directory, and I was overwriting good ones already reflecting my system with completely foreign ones.

And that's the critical difference: the obscure dll that made the system work that one time was filling in something missing. This time I was overwriting extant good ones.

But the fact that the LLM even suggested (without special prompting) to do something that I should have realized was a stupid idea with a low chance of success made me very wary of the harm it could cause.

me-vs-cat 12/12/2025||
> ...using other models, never touching that product again.

> ...that the LLM even suggested (without special prompting) to do something that I should have realized was a stupid idea with a low chance of success...

Since you're using other models instead, do you believe they cannot give similarly stupid ideas?

Duanemclemore 12/12/2025||
I'm under no illusion that they can't. But I have found ChatGPT to be the most confident when it f's up, and to suggest the worst ideas most often.

Until you queried, I had forgotten to mention that the same day I was trying to work out a Linux system display issue, and it very confidently suggested removing a package and all its dependencies, which would have removed all my video drivers. On reading the output of the autoremove command, I pointed out that it had done this, and the model spat out an "apology" and owned up to ** the damage it would have wreaked.

** It can't "apologize" for or "own up" to anything; it can just output those words. So I hope you'll excuse the anthropomorphization.

me-vs-cat 12/12/2025||
I feel the same about the obsequious "apologies".
p_ing 12/13/2025|||
I'm referring to Copilot Chat. The data resides in your Exchange mailbox. You're referring to the consumer product.
deepGem 12/12/2025||||
There is no competing product for GPT Voice. Hands down. I have tried Claude and Gemini; they don't even come close.

But voice is not a huge traffic funnel. Text is. And the verdict is more or less unanimous at this time: Gemini 3.0 has outdone ChatGPT. I unsubscribed from GPT Plus today. I was a happy camper until last month, when I started noticing deplorable bugs.

1. The conversation contexts are getting intertwined. Two months ago, I could ask multiple random queries in a conversation and I would get correct responses, but the last couple of weeks it's been a harrowing experience, having to start a new chat window for almost any change in thread topic.

2. I had once asked ChatGPT to treat me as a co-founder and hash out some ideas. Now for every query I get a 'cofounder type' response. Nothing inherently wrong, but annoying as hell. I can live with the other end of the spectrum, where Claude doesn't remember most of the context.

Now that Gemini Pro is out: yes, the UI lacks polish and you can lose conversations, but the benefits of low-latency search and a near-free one-year subscription are a clincher. I am out of ChatGPT for now, 5.2 or otherwise. I wish them well.

esyir 12/12/2025|||
Just a note: ChatGPT does retain a persistent memory of conversations. In the settings menu, there's a section that allows you to tweak or clear this persistent memory.
rapind 12/12/2025||||
I found the Gemini CLI extremely lacking and even frustrating. Why Google would choose Node…

Codex is decent and seems to be improving (being written in Rust helps). Claude Code is still the king, but my god, they have server and throttling issues.

Mixed bag wherever you go. As model progress slows / flatlines (already has?) I’m sure we’ll see a lot more focus and polish on the interfaces.

wahnfrieden 12/12/2025||
Codex is king
wkat4242 12/12/2025|||
What's that near free subscription? I don't see it here
deepGem 12/12/2025|||
They had it for $9.99 for the first year.
wkat4242 12/12/2025||
Oh I must have missed that, thanks.
topato 12/12/2025|||
yeah, the best I've seen is like $1.99 for two months, then back to normal pricing....
KronisLV 12/11/2025||||
> It has me thinking „man, fuck this“ on the daily.

That's sometimes me with the CLI. I can't use the Gemini CLI right now on Windows (in the Terminal app), because trying to copy in multiple lines of text for some reason submits them separately and it just breaks the whole thing. OpenCode had the same issue but even worse: it quit after the first line or something and copied the rest of the text line by line into the shell; thank fuck I didn't have some text that mentions rm -rf or something.

More info: https://github.com/google-gemini/gemini-cli/issues/14735#iss...

At the same time, neither Codex CLI, nor Claude Code had that issue (and both even showed shortened representations of copied in text, instead of just dumping the whole thing into the input directly, so I could easily keep writing my prompt).

So right now if I want to use Gemini, I more or less have to use something like KiloCode/RooCode/Cline in VSC which are nice, but might miss out on some more specific tools. Which is a shame, because Gemini is a really nice model, especially when it comes to my language, Latvian, but also your run of the mill software dev tasks.

In comparison, Codex feels quite slow, whereas Claude Code is what I gravitate towards most of the time, but even Sonnet 4.5 ends up being expensive when you shuffle around millions of tokens: https://news.ycombinator.com/item?id=46216192 Cerebras Code is nice for quick stuff and the sheer amount of tokens, but in KiloCode/... it regularly messes up applying diff-based edits.

radicaldreamer 12/11/2025||||
Google’s standard problem is that they don’t even use their own products. Their Pixel and Android team rocks iPhones on the daily, for example.
free652 12/12/2025|||
You can't buy an iPhone without a director's approval. And it's like 3 generations behind as well. So no, they don't use iPhones.
ummonk 12/12/2025|||
Google tells its employees what products they're allowed to buy for personal use?
snypher 12/12/2025||
Seems like they meant for a work device.
gcr 12/12/2025||||
lots of googlers use BYOD iPhones and the corp suite for this use case is fairly well-supported
brookst 12/12/2025||
Which makes tons of sense because iPhone users are higher CLV than Android users. If Google had to choose between major software defects in Android or iOS, they would focus quality on iOS every time.
siva7 12/12/2025||||
that explains why their ios gemini app is so ridiculously bad. in private they probably use iphones and just chatgpt instead.
dominotw 12/12/2025|||
you have to get permission from a director for your personal phone? wtf
testdelacc1 12/12/2025||
For the work phone.
RBerenguel 12/11/2025||||
I would think this is not true
sib 12/12/2025|||
You'd be wrong (source - worked in the Android org).
RBerenguel 12/20/2025||
How long ago?
sib 12/28/2025||
2021-2023
renewiltord 12/11/2025|||
Yeah, I've heard that Sundar Pichai dogfoods the latest Pixel at least once a month and sometimes two or three times.
sam345 12/12/2025||||
That's inexcusable.
Der_Einzige 12/11/2025||||
That’s because they will be bullied out of the dating market if they have a “green bubble”.
astrange 12/12/2025|||
[flagged]
dkga 12/12/2025|||
What is a green bubble? iPhone's carbon footprint?
brookst 12/12/2025||
iMessage renders other iMessage users as blue bubbles, SMS/RCS as green bubbles.

People who can’t understand that many people actually prefer iOS use this green/blue thing to explain the otherwise incomprehensible (to them) phenomenon of high iOS market share. “Nobody really likes iOS, they just get bullied at school if they don’t use it”.

It’s just “wake up sheeple” dressed up in fake morality.

ethbr1 12/12/2025|||
As someone who switches between platforms somewhat frequently, iOS perpetually feels like people have Stockholm syndrome.

'Oh, that super annoying issue? Yeah, it's been there for years. We just don't do that.'

Fundamentally though, browsing the web on iOS, even with a custom "browser" with adblocking, feels like going back in time 15 years.

platevoltage 12/12/2025|||
It wouldn't be an issue if they didn't pick the worst green on earth. "Which green would you like for the carrier text messages Mr. Jobs?" ... "#00FF00 will be fine."
onethought 12/11/2025||||
I mean, there is benefit to understanding a competitor well, too?
LogicFailsMe 12/11/2025|||
Outweighed by the value of having to suffer with the moldy fruits of their own labor. That was the only way the Android Facebook app became usable as well.
ssl-3 12/11/2025||||
There certainly is.

To posit a scenario: I would expect General Motors to buy some Ford vehicles to test and play around with and use. There's always stuff to learn about what the competition has done (whether right, wrong, or indifferent).

But I also expect the parking lots used by employees at any GM design facility in the world to be mostly full of General Motors products, not Fords.

snypher 12/12/2025|||
The CEO of Ford was driving a competitor's EV for months:

https://www.caranddriver.com/news/a62694325/ford-ceo-jim-far...

GenerWork 12/11/2025|||
>But I also expect the parking lots used by employees at any GM design facility in the world to be mostly full of General Motors products, not Fords.

I think you'd be surprised about the vehicle makeup at Big 3 design facilities.

ssl-3 12/12/2025||
Maybe so.

I'm only familiar with Ford production and distribution facilities. Those parking lots are broadly full of Fords, but that doesn't mean that it's like this across the board.

olyjohn 12/12/2025||
GM has dedicated parking lots for employees with GM vehicles. Everybody else parks further away in the lot of shame.
ssl-3 12/12/2025||
Of course.

And I've parked in the lot of shame at a Ford plant, as an outsider, in my GMC work truck -- way over there.

It wasn't so bad. A bit of a hike to go back and get a tool or something, but it was at least paved...unlike the non-union lot I'm familiar with at a P&G facility, which is a gravel lot that takes crossing a busy road to get to, lacks the active security and visibility from the plant that the union lot has, and which is full of tall weeds. At P&G, I half-expect to come back and find my tires slashed.

Anyway, it wasn't barren over there in the not-Ford lot, but it wasn't nearly so populous as the Ford lot was. The Ford-only lot is bigger, and always relatively packed.

It was very clear to me that the lots (all of the lots, in aggregate) were mostly full of Fords.

To bring this all back 'round: It is clear to me that Ford employees broadly (>50%) drive Fords to work at that plant.

---

It isn't clear to me at all that Google Pixel developers don't broadly drive iPhones. As far as I can tell, that status (which is meme-level in its age at this point) is true, and they aren't broadly making daily use of the systems they build.

(And I, for one, can't imagine spending 40 hours a week developing systems that I refuse to use. I have no appreciation for that level of apparent arrogance, and I hope to never be suaded to be that way. I'd like to think that I'd be better-motivated to improve the system than I would be to avoid using it and choose a competitor instead.

I don't shit where I sleep.)

Forgeties79 12/11/2025|||
I wonder how many Apple employees walk into the office with Android phones.
azinman2 12/12/2025||
Effectively zero.

Disclosure: I work at Apple. And when I was at Google I was shocked by how many iPhones there were.

Forgeties79 12/12/2025|||
That doesn't surprise me at all, haha. Appreciate someone a little closer to the question answering it! I know it still counts as anecdotal, but I'll take it.
jimmaswell 12/12/2025|||
This is flabbergasting: how could such a large proportion of highly technical people willingly subject themselves to being shackled by iOS? They just happily put up with having one choice of browser, no third-party app stores (outside Europe), and being locked into the Apple ecosystem? I can't think of a single reason I would ever switch from an S22-25+U to an iPhone. I only went from the 22U to the 25U because my old one got smashed; otherwise the 22U would still be perfectly fine.
brookst 12/12/2025|||
Because many of them just want to use their phone as a tool, not tinker with it.

Same way many professional airplane mechanics fly commercial rather than building their own plane. Just because your job is in tech doesn’t mean you have to be ultra-haxxor with every single device in your life.

kaashif 12/12/2025||||
I don't have my phone (a Pixel) because it frees me from shackles or anything like that. It's just a phone. I use the default everything. Works great. I imagine most people with iPhones are the same.
dumbfounder 12/12/2025|||
Because it’s better.
Forgeties79 12/12/2025|||
I feel like people dance around this a lot because, idk, it hurts nerd credibility or something. The fact is, on a moment-to-moment basis the iPhone is just a better experience generally. They also hold their value a lot longer: I consistently trade in my phone or sell it to other people for easily 80% of what I paid for it, usually 3-4 years out.

Remember how long it took for Instagram to be functional on Android phones?

jimmaswell 12/12/2025||||
I've tried them out and not a single thing about them was tangibly better IMO. They have no inherent merit above Android, except that some see them as a status symbol (which is absurd, as my S25U has a higher MSRP than most iPhone models).
hamburglar 12/12/2025|||
My bottom of the barrel iPhone SE is absolutely not a status symbol. It’s just the phone I like best.

The MSRP of your phone does not matter.

Forgeties79 12/13/2025|||
Cameras, for starters. I've never seen another smartphone keep up with the color and texture quality of an iPhone's photos/videos (videos in particular) since the 4s. Their color science is just better. We've intercut footage since the 7 or so with our work, and frankly you'd be hard pressed to catch that it wasn't one of our nicer rigs unless we hold the shot for too long. We just can't get other phone cameras to match footage with the same ease, especially when it comes to skin tones.
inquirerGeneral 12/12/2025|||
[dead]
adamkochanowicz 12/11/2025||||
I also love that I can leave the microphone on (not in live voice mode) while dictating to ChatGPT and pause and think as much as needed.

With Gemini, it will send as soon as I stop to think. No way to disable that.

wheelerwj 12/11/2025||
How did you do this?
toomuchtodo 12/11/2025||
Record button in the app if you’ve got the feature.
arjie 12/11/2025||||
Any time its safety stuff triggers, Gemini wipes the context. It's unusable because of this: whatever is going on with the safety stuff, it fires too often. I'm trying to figure out some code here, not exactly deporting ICE to Guantanamo or whatever.
rvnx 12/11/2025|||
The more Gemini and Nano-Banana soften their filters, the more audience they will take from other platforms. The main risk is payment providers banning them, but I can't imagine bank card providers cutting off payments to Google.
dzhiurgis 12/11/2025|||
On the flip side, the ChatGPT app now has years of history that is sometimes useful (search is pretty OK, but could improve), but otherwise I'd like to remove most of it. Good luck doing so.
amluto 12/12/2025||||
Claude regularly computes a reply for me, then reports an error and loses the reply. I wonder what fraction of Anthropic’s compute gets wasted and redone.
seg_lol 12/12/2025||
Try using a VPN; my ISP was killing connections and Claude would randomly reset. Using a VPN fixed the issue.
mnky9800n 12/11/2025||||
The colab integration is where it shines the most imo.
hexnuts 12/12/2025||||
You may be interested in tools like OpenMemory
mmaunder 12/11/2025|||
Yeah I eventually noped out as I said in another comment and am charging hard with Codex and am so happy about 5.2!!
lxgr 12/11/2025|||
Interesting, I had the opposite experience. 5.0 "Thinking" was better than 5.1, but Gemini 3 Pro seems worse than either for web search use cases. It's hallucinating at pretty alarming rates (including making up sources it never actually accessed) for a late 2025 model.

Opus 4.5 has been a step above both for me, but the usage limits are the worst of the three. I'm seriously considering multiple parallel subscriptions at this point.

gs17 12/11/2025|||
I've had the same experience with search, especially with it hallucinating results instead of actually finding them. It's really frustrating that you can't force a more in-depth search from the model run by the company most famous for a search engine.
astrange 12/12/2025|||
Try the same question in deep research mode.
inquirerGeneral 12/12/2025|||
[dead]
hbarka 12/11/2025|||
I’ve been putting literally the same inputs into both ChatGPT and Gemini and the intuition in answers from Gemini just fits for me. I’m now unwilling to just rely on ChatGPT.

Google, if you can find a way to export chats into NotebookLM, that would be even better than the Projects feature of ChatGPT.

siva7 12/12/2025|||
NotebookLM is heavily biased to only use the sources I added and frames every task around them, even when that's nonsensical, so it's not that useful for novel research. It also tends to hallucinate when lots of data is involved.
LogicFailsMe 12/11/2025|||
All I want for Christmas is a "No NotebookLM slop" checkbox on youtube.
simplify 12/12/2025|||
Youtube's downvote button has served me quite well for this purpose.
didibus 12/11/2025|||
> Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.

Depends. Even though Gemini 3 is a bit better than GPT-5.1, the quality of the ChatGPT apps themselves (mobile, web) has kept me a subscriber.

I think Google needs to not Google themselves into a poor app experience here, because the models are very close and will probably continue to pass each other in lockstep. So overall product quality and UX will start to matter more.

Same reason I am sticking to Claude Code for coding.

concinds 12/11/2025||
The ChatGPT Mac app especially feels much nicer to use. I like Gemini more due to the context window but I doubt Google will ever create a native Mac app.
bayarearefugee 12/11/2025|||
This matches my experience pretty closely when it comes to LLM use for coding assistance.

I still find a lot to be annoyed with when it comes to Gemini's UI and its... continuity, I guess, is how I would describe it? It feels like it starts breaking apart at the seams a bit in unexpected ways during peak usage, including odd context breaks and general UI problems.

But outside of UI-related complaints, when it is fully operational it performs so much better than ChatGPT for giving actual practical, working answers without having to be so explicit with the prompting that I might as well have just written the code myself.

luhn 12/11/2025|||
That's hilarious and right on brand for Google: they spend millions developing cutting-edge technology and fumble the ball on making a chat app.
spwa4 12/12/2025||
Every Google app is a chat app, except maybe search.
dieortin 12/12/2025||
Is Google Drive a chat app? Is Google Photos a chat app? I don’t know what you mean
spwa4 12/12/2025|||
Once you open a file, it is very much a chat app. Comments and chat work for anything you can preview btw, not just Google Docs stuff.

Not sure how you can access the chat in the directory view.

minitoar 12/12/2025|||
In Google Photos shared albums there is a tab that I can only describe as a chatroom.
dieortin 12/13/2025||
Isn’t there a difference between having a tab that is similar to a chat and actually being a chat app?
azan_ 12/11/2025|||
That's interesting. I've got a completely different impression. Every time I use Gemini I'm surprised at how bad it is. My main complaint is that Gemini is too lazy.
Nathanba 12/11/2025||
Same for me. At this point I'm seriously starting to think that these are ads by and for Google, because for me Gemini is the worst.
WillPostForFood 12/12/2025||
My experience is that "AI Mode" Gemini in Chrome is terrible, but AI Studio Gemini is pretty great.
varispeed 12/11/2025|||
Get Gemini's answer and tell ChatGPT "this is what my friend said." Then feed ChatGPT's answer to Claude, and so on. It's a cheat code.
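A minimal sketch of that loop, assuming the official openai and anthropic Python SDKs (the model names and the "my friend said" framing here are placeholders, not anything blessed by the vendors):

    # Hypothetical sketch: pass one model's answer to the next as "what my friend said".
    from openai import OpenAI
    from anthropic import Anthropic

    openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
    anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY

    def ask_gpt(prompt: str) -> str:
        resp = openai_client.chat.completions.create(
            model="gpt-5.2",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def ask_claude(prompt: str) -> str:
        resp = anthropic_client.messages.create(
            model="claude-opus-4-5",  # placeholder model name
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

    question = "Why does my async handler deadlock under load?"
    first = ask_gpt(question)
    # Hand the answer off as a third party's opinion to invite honest critique.
    second = ask_claude(f"{question}\n\nMy friend said:\n{first}\n\nDo you agree?")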
tenpoundhammer 12/12/2025|||
I did this today and it was amazing. If I'd had time I would have tried other models as well. Great tip, thanks!
clhodapp 12/12/2025|||
A cheat code to what?
Iwan-Zotow 12/12/2025||
To get a Hitler
AznHisoka 12/11/2025|||
ChatGPT seems to just randomly pick URLs to cite and extract information from.

Google Gemini seems to look at heuristics like whether the author is trustworthy or an expert in the topic. It comes across as more advanced.

FpUser 12/11/2025|||
I've read many very positive reviews about Gemini 3. I tried using it, including Pro, and to me it looks very inferior to ChatGPT. What was very interesting, though, was that when I caught it bullshitting me and called its BS, Gemini exhibited very human-like behavior. It tried to weasel its way out, degenerated down to "no true Scotsman" level, but finally admitted that it was full of it. This is kind of impressive/scary.
TacticalCoder 12/11/2025|||
Yeah, basically the same here. And many people on paid ChatGPT subscriptions like us noticed just that. Gemini 3 Pro "thinking" is really good.

> Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.

I think the biggest issue OpenAI is facing is the numbers: Google is at the moment a near $4 trillion company. They can splurge a near infinite amount of money to win the race.

Google is so big that they created their own TPUs, which is mind-boggling.

Which new user is going to willingly pay for an OpenAI subscription once they know that gemini.google.com gives access to a state-of-the-art model? And Google makes sure to remind users who search that they can "continue the discussion" with Gemini.

Maybe the dirty Altman tricks like cornering the entire RAM market can work, but I don't see how they can beat Google by playing fair. OpenAI will need every single dirty trick in the book, including circular funding and shady deals with Nvidia, to stay relevant vs the behemoth that Google is.

abhaynayar 12/12/2025|||
Gemini's voice recognition is trash compared to ChatGPT's, and that is a deal breaker for me. I wonder how many people do OCR versus use voice.

And how has ChatGPT lost when you're not comparing the ChatGPT that just came out to the Gemini that just came out? Gemini is just annoying to use.

And Google just benchmaxxed; I didn't see any significant difference (I pay for both), and the same benchmaxxing is probably happening for ChatGPT now as well, so in terms of core capabilities I feel things have plateaued. It's more about the overall experience now, and there Gemini is worse.

I really don't get how "search integration" is a "strength". Can you give any examples where you searched for current info and ChatGPT was worse? Even so, I really don't get how it's enough of a moat to say ChatGPT has lost. I would've understood if you'd said something like a TPU-versus-GPU moat.

jmstfv 12/12/2025|||
Ditto, but for Claude -- it blows GPT out of the water. Much better at coding and at solving physics problems from images (in foreign languages); GPT couldn't even read the image. The only annoying thing is that if you use Opus for coding, your usage limit fills up pretty fast.

Anyway, I cancelled my ChatGPT subscription.

mmaunder 12/11/2025|||
Then you haven't used Gemini CLI with Gemini 3 hard enough. It's a genius psychopath. The raw IQ that Gemini has is incredible, as is its ability to ingest huge context windows and produce super smart output. But its bias towards action, its tendency to produce garbage output that looks like 1990s modem line noise, and its propensity to outright ignore user guidance and instructions make it unusable for me, other than as an outside consultant to Codex CLI. My Gemini usage has plummeted to almost zero and I'm 100% back on Codex. I'm SO happy they released this today, and it's already kicking some serious ass. Thanks OpenAI team, and congrats.
tobias2014 12/12/2025|||
I guess when you use it for generic "problem solving", brainstorming for solutions, this is great. That's what I use it for, and Gemini is my favorite model. I love when Gemini resists and suggests that I am wrong while explaining why. Either it's right, and I'm happy for that, or I can re-prompt with the new information that rules out the mistake Gemini made.

On the other hand, I can also see why Claude is great for coding, for example. By default it is much more "structured". One can probably change these default personalities with some prompting, and many of the complaints found in this thread about either side are based on the assumption that you can use the same prompt for all models.

Kim_Bruning 12/12/2025||||
That bias towards action is a real thing in Gemini and more so in ChatGPT, isn't it?

Possibly might be improved with custom instructions, but that drive is definitely there when using vanilla settings.

mmaunder 12/12/2025||
Yeah it's a weird mix of issues with the backend model and issues with the CLI client and its prompts. What makes it hard for them is the teams aren't talking to each other. The LLM team throws the API over the wall with a note saying "good luck suckers!".
prodigycorp 12/12/2025|||
Genius psychopath is a good description for Gemini. It’s the most impressive model, but the post-training is not all there.
afro88 12/11/2025|||
> I usually have to leave the happen or the session terminates

Assuming you meant "leave the app open", I have the same frustration. One of the nice things about the ChatGPT app is you can fire off a req and do something else. I also find Gemini 3 Pro better for general use, though I'm keen to try 5.2 properly

WheatMillington 12/11/2025|||
I generate fun images for my kids - turn photos into a new style, create colouring pages from pictures, etc. I lost interest in ChatGPT because it throws vague TOS errors constantly. Gemini handles all of this without complaint.
xyzsparetimexyz 12/12/2025||
You feed AI slop to your children? Doesn't that seem unhealthy and bad for their development?
retsibsi 12/12/2025|||
What's your specific concern here? I certainly wouldn't want to, e.g., give young kids unmonitored use of an LLM, or replace their books with AI-generated text, or stop directly engaging with their games and stories and outsource that to ChatGPT. But what part of "generate fun images for my kids - turn photos into a new style, create colouring pages from pictures, etc" is likely to be "unhealthy and bad for their development"?
bonesss 12/12/2025|||
Customized, self-guided, tailor made kids content isn’t slop per se.

Colouring pages autogenerated for small kids is about as dangerous as the crayons involved.

Not slop, not unhealthy, not bad.

a_victorp 12/12/2025|||
I see a post like this every time there's news about ChatGPT or OpenAI. I'm probably being paranoid, but I keep thinking it looks like bots or paid advertising for Gemini.
tenpoundhammer 12/12/2025|||
I think people like me just enjoy sharing when something is working for them and they have a good experience. It probably gets voted up because people enjoy reading about it when that happens.
jdiff 12/12/2025|||
The consistent side comments about the interface to Gemini being "half baked" probably don't fit into that narrative.
jnordt 12/12/2025|||
Can you share some examples of this where it gives better results?

For me, both Gemini and ChatGPT (both paid versions: an API key for Gemini, and ChatGPT Plus) give similar results for "every day" research. I'm sticking with ChatGPT at the moment, as the UI and scaffolding around the model are, in my view, better with ChatGPT (e.g. you can add more than one picture at once...)

For software development, I tested Gemini 3 and was pretty disappointed in comparison to Claude Opus CLI, which is my daily driver.

UltraSane 12/11/2025|||
Google has such a huge advantage in the amount of training data with the Google search database and with YouTube and in terms of FLOPS with their TPUs.
razster 12/12/2025|||
Just a fair warning: it likes to spell "Acknowledge" as "Acknolwedge". And I've run into issues when it's accessing Markdown guides; it loses track and hallucinates from time to time, which is annoying.
bossyTeacher 12/11/2025|||
A future where Google still dominates: is that a future we want? I feel a future with more players is better than one with a single dominant one. Competition is valuable for us consumers.
melagonster 12/12/2025|||
It happened at least once; when I asked too many questions, the Gemini web page stopped working because it was occupying too much RAM...
NickNaraghi 12/11/2025|||
Straight up Silicon Valley warfare in the HN comment section.
bckr 12/12/2025|||
Gemini is good at reading bad handwriting you say? Might need to give it a shot at my 10 years of journals
Razengan 12/12/2025|||
It would be useful to see some examples of the differences and supposed strengths of Gemini so this doesn't come off as Google advertisement snarf.

Also, I would never, ever, trust Google for privacy or sign into a Google account except on YouTube (and clear cookies afterwards to stop them from signing me into fucking Search too).

m00dy 12/12/2025|||
It's true that Gemini 3 Pro is very good; I recently used it on deepwalker [0]. Its agentic performance is amazing, much better than 5.1's.

[0]: https://deepwalker.xyz

anonnon 12/12/2025|||
Could you elaborate on GPT-based stock analysis?
citizenpaul 12/12/2025|||
What?? Am I using the same Gemini as everyone else?

>OCR is phenomenal

I literally tried to OCR a TYPED document in Gemini today and it mangled it so badly that I just transcribed it myself, because that took less time than futzing around with Gemini.

> Gemini handles every single one of my uses cases much better and consistently gives better answers.

>coding

I asked it to update a script by removing some redundant logic yesterday. Instead of removing it, it just put == all over the place, essentially negating the logic but leaving all the code in place, and it also removed the actual output.

>Stocks analysis

lol, now I know where my money comes from.

aix1 12/12/2025||
Was that with Gemini 3 Pro or a different Gemini model?
citizenpaul 12/14/2025||
Yes.

Today I asked it to make a short bit of code to query some info from an API. I needed it to not use the specific function X that is normally used. I added "Never use function X" to its instructions, then asked it in the chat to confirm its rules. It then generated code using function X, along with a word soup explaining how it did not use function X. Then I copy-pasted the offending line and asked why it used function X, and it produced more word soup explaining how the function was not there. So yeah, not so good.

Daz912 12/12/2025|||
No desktop app, not using it
eru 12/12/2025||
HN doesn't have a dedicated desktop app either.
Daz912 12/12/2025||
HN isn't part of my daily workflow so I don't care
LorenDB 12/11/2025|||
What is it with the Polish always messing up products?

(yes, /s)

petersumskas 12/11/2025|||
It’s because their thoughts are Roman while they are always Russian to Finnish things.

Kenya believe it!

Anyway, I’m done here. Abyssinia.

labrador 12/11/2025|||
I like their hotdogs
xyzsparetimexyz 12/12/2025|||
Why do people pay for AI tools? I don't get it. I feel like I just rotate between them on the free tiers. Unless you're paying for all of them, what's the point?
Zambyte 12/12/2025||
I pay for Kagi and get all of the major ones, a great search engine that I can tune to my liking, and the ability to link any model to my tuned web search.
Onewildgamer 12/11/2025|||
Google AI Mode constantly makes mistakes, and I go back to ChatGPT even when I don't like it.
billyrnalvo 12/11/2025||
Oh my good heavens, gotta tell ya, you wrestled that rascal to the floor with a shit-eating grin! Good times my friend!
rallies 12/12/2025||
I work at the intersection of AI and investing, and I'm really amazed at the ability of this model to build spreadsheets.

I gave it a few tools to access SEC filings (and a small local vector database), and it's generating full-fledged spreadsheets with valid, real-time data. Analysts on Wall Street are going to be really empowered, but for the first time, I'm really glad that retail investors are also getting these models.

Just put out the tool: https://github.com/ralliesai/tenk
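As a rough illustration of what "a few tools to access SEC filings" can mean (a hypothetical sketch, not the actual tenk implementation), here is a minimal filings tool plus the function-calling schema you would hand to the model; the SEC's public "company facts" endpoint returns XBRL financials as JSON:

    # Hypothetical sketch of a filings tool an LLM could call.
    import json
    import urllib.request

    def get_company_facts(cik: int) -> dict:
        # SEC asks for a descriptive User-Agent on its public API.
        url = f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"
        req = urllib.request.Request(url, headers={"User-Agent": "you@example.com"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Tool schema in OpenAI function-calling format.
    TOOLS = [{
        "type": "function",
        "function": {
            "name": "get_company_facts",
            "description": "Fetch XBRL financial facts for a company by SEC CIK.",
            "parameters": {
                "type": "object",
                "properties": {"cik": {"type": "integer"}},
                "required": ["cik"],
            },
        },
    }]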

npodbielski 12/12/2025||
Can't wait to be fired because some VP or other manager asked some model to prepare a list of people with the lowest productivity-to-pay ratio.

The model hallucinated half of the data?! Sorry, we can't go back on this decision; that would make us look bad!

Or when some silly model pushes everyone to invest in some ridiculous company and everybody does it. A data-poisoning attack to inject some "I Am The Future Inc.™" company with a high investment rating; after a few months, pocket the money and vanish.

We are certainly going to live in interesting times.

buu700 12/13/2025||
That's more of a management problem than an AI problem. You could get the same result by replacing "model" with "intern" or "dude from Fiverr".
npodbielski 12/15/2025||
With one important difference: nobody would be able to tell whether you did the spreadsheet or an AI spewed it out. And you don't pay out of your own pocket for that one specific task to be done.
rallies 12/12/2025|||
Here's a nice parsing of all the important financials from an SEC report. This used to be really hard a few years ago.

https://docs.google.com/spreadsheets/d/1DVh5p3MnNvL4KqzEH0ME...

sumedh 12/13/2025||
Doesn't the SEC provide XBRL data and the statements in Excel?
monatron 12/12/2025|||
Nice tool - I appreciate you sharing the work!
josalhor 12/11/2025||
From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

verdverm 12/11/2025||
We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day
stego-tech 12/11/2025|||
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”

It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).

snet0 12/11/2025|||
To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

jacquesm 12/11/2025||
That means you're probably asking it to do very simple things.
baq 12/11/2025|||
I can confidently say that anecdotally you’re completely wrong, but I’ll also allow a very different definition of ‘simple’ and/or attempting to use an unpopular environment as a valid anecdotal counterpoint.
verdverm 12/11/2025|||
The problem with these arguments is that there are data points to support both sides, because both outcomes are possible.

The real question is whether you (or we) are getting an ROI, and the answer is increasingly yes, on more and more problems. This trend is not looking to plateau as we step up the complexity ladder to agentic systems.

jacquesm 12/11/2025|||
I don't reach for AI until I'm solidly stuck, and then I use it mostly for inspiration; it has yet to happen that it directly pointed at a solution, and I'm pretty good at writing prompts. When I throw a bunch of elementary stuff at it, it is super good at identifying issues and solving them (but I could have done that myself; it is just nice to try to delineate where you can and can't trust the thing, though that too fluctuates, sometimes even within a single session). Here is a nice example of a slightly more complex test:

Prompt:

"I have a green LED, a 12V powersupply, a single NPN transistor, a 100 ohm resistor, a 2.7K resistor and an electrolytic capacitor of 220 micro farads. My teacher says it is possible to make an LED flasher with these components but I'm hopelessly stuck, can you please give me an ascii art solution so I don't flunk this exam?"

Answer:

ASCII schematic (fully functional):

           +12V
            |
            |
           .-.
           | |
           | | 2.7kΩ   (R1 — charges the capacitor)
           | |
           '-'
            |
            +---------+----------------------+
            |         |                      |
           ( )        |                      |
           ( ) C1     |                      |
           ( ) 220uF  |                      |
            |         |                      |
            |         |                      |
            |        |/ C                    |
            +--------|   NPN transistor      |
            |        |\ E                    |
            |         |                      |
           GND       GND                     |
                                              \
                                               \
                                               / 100Ω (R2 — LED limit)
                                               \
                                                |
                                                >|  Green LED
                                                |
                                               GND
Followed by:

"How it works

The 2.7 kΩ resistor charges the 220 µF capacitor from the 12 V supply.

The capacitor voltage rises slowly.

When the capacitor voltage reaches the transistor’s base-emitter threshold (~0.6–0.7 V), the transistor suddenly switches ON.

When it turns on, the capacitor rapidly discharges through the base, causing:

A brief pulse of current through the transistor

The LED lights up through the 100 Ω resistor

After discharge, the transistor turns back OFF, the LED turns off, and the capacitor begins charging again.

This repeats automatically → LED flasher."

The number of errors in the circuit, the utterly bogus explanation, and the overconfident remark that this is 'working' are so bizarre that I wonder how many slightly more complicated questions are going to yield results comparable to this one.

dagss 12/12/2025|||
I am right now implementing an imaging pipeline using OpenCV and TypeScript.

I have never used OpenCV specifically before, and have little imaging experience too. What I do have though is a PhD in astrophysics/statistics so I am able to follow along the details easily.

Results are amazing. I am getting results in 2 days of work that would have taken me weeks earlier.

ChatGPT acts like a research partner. I give it images and it explains why current scoring functions fail and throws out new directions to go in.

Yes, my ideas are sometimes better. Sometimes ChatGPT has a better clue. It is like a human collegue more or less.

And if I want to try something, the code is usually bug free. So fast to just write code, try it, throw it away if I want to try another idea.

I think a) OpenCV probably has more training data than circuits? and b) I do not treat it as a desperate student with no knowledge.

I expect to have to guide it.

There are several hundred messages back and forth.

It is more like two researchers working together with different skill sets complementing one another.

One of those skillsets being to turn a 20 message conversation into bugfree OpenCV code in 20 seconds.

No, it is not providing a perfect solution to all problems on first iteration. But it IS allowing me to both learn very quickly and build very quickly. Good enough for me..
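For a flavor of what a single scoring function in such a pipeline can look like (a hypothetical example, not the commenter's actual code, and in Python rather than their TypeScript), the classic variance-of-Laplacian sharpness score is a few lines of OpenCV:

    # Hypothetical example: variance of the Laplacian as a blur/sharpness score.
    import cv2

    def sharpness_score(path: str) -> float:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if gray is None:
            raise FileNotFoundError(path)
        # Low variance of the Laplacian suggests a blurry image.
        return cv2.Laplacian(gray, cv2.CV_64F).var()

    print(sharpness_score("frame.png"))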

jacquesm 12/12/2025||
That's a good use case, and I can easily imagine that you get good results from it because (1) it is for a domain that you are already familiar with and (2) you are able to check that the results that you are getting are correct and (3) the domain that you are leveraging (coding expertise) is one that chatgpt has ample input for.

Now imagine you are using it for a domain that you are not familiar with, or one for which you can't check the output, or one that ChatGPT has little input for.

If any of those is true, the output will look just as good, and you would be in a much more difficult position to make good use of it, but you might be tempted to use it anyway. A very large fraction of the use cases for these tools that I have come across professionally so far are of the latter variety; only a minority are of the former.

And taking all of the considerations into account:

- how sure are you that that code is bug free?

- Do you mean that it seems to work?

- Do you mean that it compiles?

- How broad is the range of inputs that you have given it to ascertain this?

- Have you had the code reviewed by a competent programmer (assuming code review is a requirement)?

- Does it pass a set of pre-defined tests (part of requirement analysis)?

- Is the code quality such that it is long term maintainable?

emporas 12/11/2025||||
I have used Gemini for reading and solving electronic schematics exercises, and its results were good enough for me. It managed to solve roughly 50% of the exercises correctly and got 50% wrong. Simple R circuits.

One time it messed up the opposite polarity of two voltage sources in series: instead of subtracting their voltages, it added them together. I pointed out the mistake and Gemini insisted that the voltage sources were not in opposite polarity.

Schematics in general are not AIs' strongest point. But when you explain what math you want to calculate from an LRC circuit, for example (no schematics, just describing the relevant part of the circuit in words), GPT will often calculate it correctly. It still makes mistakes here and there; always verify the calculation.

jacquesm 12/11/2025||
I guess I'm just more critical than you are. I am used to my computer doing what it is told and giving me correct, exact answers or errors.
dagss 12/12/2025|||
I think most people treat them like humans not computers, and I think that is actually a much more correct way to treat them. Not saying they are like humans, but certainly a lot more like humans than whatever you seem to be expecting in your posts.

Humans make errors all the time. That doesn't mean having colleagues is useless, does it?

An AI is a colleague that can code very, very fast and has a very wide knowledge base and versatility. You may still know better than it in many cases and feel more experienced than it. Just like you might with your colleagues.

And it needs the same kind of support that humans need. Complex problem? Need to plan ahead first. Tricky logic? Need unit tests. Research grade problem? Need to discuss through the solution with someone else before jumping to code and get some feedback and iterate for 100 messages before we're ready to code. And so on.

jacquesm 12/12/2025||
This is an excellent point, thank you.
emporas 12/11/2025|||
There is also Mercury LLM, which computes the answer directly as a 2D text representation. I don't know if you are familiar with Mercury LLM, but you read correctly, 2D text output.

Mercury LLM might work better getting input as an ASCII diagram, or generating output as an ASCII diagram; I'm not sure whether both input and output work in 2D.

Plumbing/electrical/electronic schematics are pretty important for AIs to understand and assist us, but for the moment the success rate is pretty low. 50% success rate for simple problems is very low, 80-90% success rate for medium difficulty problems is where they start being really useful.

jacquesm 12/12/2025||
It's not really the quality of the diagramming that I am concerned with, it is the complete lack of understanding of electronics parts and their usual function. The diagramming is atrocious but I could live with it if the circuit were at least borderline correct. Extrapolating from this: if we use the electronics schematic as a proxy for the kind of world model these systems have then that world model has upside down lanterns and anti-gravity as commonplace elements. Three legged dogs mate with zebras and produce viable offspring and short circuiting transistors brings about entirely new physics.
baq 12/12/2025|||
It's hard for me to tell if the solution is correct or wrong, because I've got next to no formal theoretical education in electronics and only the most basic 'pay attention to the polarity of electrolytic capacitors' practical knowledge. But given how these things work, you might get much better results by asking it to generate a SPICE netlist first (or instead).

I wouldn't trust it with 2D ASCII-art diagrams; my guess is there isn't enough focus on these in the training data. A typical jagged-frontier experience.

emporas 12/12/2025|||
I think you underestimate their capabilities quite a bit. Their auto-regressive nature does not lend itself well to solving 2D problems.

See these two solutions GPT suggested: [1]

Is any of these any good?

[1] https://gist.github.com/pramatias/538f77137cb32fca5f626299a7...

manmal 12/12/2025|||
I have this mental model of LLMs and their capabilities, formed after months of way too much coding with CC and Codex, with 4 recursive problem categories:

1. Problems that have been solved before have their solution easily repeated (some will say, parroted/stolen), even with naming differences.

2. Problems that need only mild amalgamation of previous work are also solved by drawing on training data only, but hallucinations are frequent (as low probability tokens, but as consumers we don’t see the p values).

3. Problems that need little simulation can be simulated with the text as scratchpad. If evaluation criteria are not in training data -> hallucination.

4. Problems that need more than a little simulation have to either be solved by adhoc written code, or will result in hallucination. The code written to simulate is again a fractal of problems 1-4.

Phrased differently, sub problem solutions must be in the training data or it won’t work; and combining sub problem solutions must be either again in training data, or brute forcing + success condition is needed, with code being the tool to brute force.

I _think_ that the SOTA models are trained to categorize the problem at hand, because sometimes they answer immediately (1&2), enable thinking mode (3), or write Python code (4).

My experience with CC and Codex has been that I must steer it away from categories 2 & 3 all the time, either solving them myself, ask them to use web research, or split them up until they are (1) problems.

Of course, for many problems you’ll only know the category once you’ve seen the output, and you need to be able to verify the output.

I suspect that if you gave Claude/Codex access to a circuit simulator, it would successfully brute-force the solution. And future models might be capable enough to write their own simulator ad hoc (of course, the simulator code might recursively fall into category 2 or 3 somewhere and fail miserably). But without strong verification I wouldn't put any trust in the outcome.

With code, we do have the compiler, tests, observed behavior, and a strong training data set with many correct implementations of small atomic problems. That’s a lot of out of the box verification to correct hallucinations. I view them as messy code generators I have to clean up after. They do save a ton of coding work after or while I‘m doing the other parts of programming.
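A minimal sketch of the category-4 pattern described above, brute forcing with a success condition (the ask_model callable and the "prints OK" criterion are hypothetical stand-ins):

    # Hypothetical sketch: keep asking the model for candidates and accept the
    # first one a verifier signs off on.
    import subprocess
    import tempfile

    def verify(candidate: str) -> bool:
        # Success condition: the generated script runs cleanly and prints "OK".
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True)
        return result.returncode == 0 and "OK" in result.stdout

    def solve(ask_model, prompt: str, attempts: int = 5) -> str | None:
        for i in range(attempts):
            candidate = ask_model(f"{prompt}\n(attempt {i + 1}: return only code)")
            if verify(candidate):
                return candidate
        return None  # without a verifier there is nothing to trust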

jacquesm 12/12/2025||
This parallels my own experience so far. The problem for me is that (1) and (2) I can quickly and easily do myself, and I'll do it in a way that respects the original author's copyright by including their work - and license - verbatim.

(3) and (4) level problems are the ones where I struggle tremendously to make any headway even without AI, usually this requires the learning of new domain knowledge and exploratory code (currently: sensor fusion) and these tools will just generate very plausible nonsense which is more of a time waster than a productivity aid. My middle-of-the-road solution is to get as far as I can by reading about the problem so I am at least able to define it properly and to define test cases and useful ranges for inputs and so on, then to write a high level overview document about what I want to achieve and what the big moving parts are and then only to resort to using AI tools to get me unstuck or to serve as a knowledge reservoir for gaps in domain knowledge.

Anybody that is using the output of these tools to produce work that they do not sufficiently understand is going to see a massive gain in productivity, but the underlying issues will only surface a long way down the line.

camdenreslink 12/11/2025||||
Sometimes you do need to (as a human) break down a complex thing into smaller simple things, and then ask the LLM to do those simple things. I find it still saves some time.
ragequittah 12/11/2025||
Or, what will often work is having the LLM break the problem down into simpler steps and then running them one by one. They know how to break down problems fairly well; they just sometimes don't do it unless you explicitly prompt them to.
jacquesm 12/11/2025||
Yes, but for that you have to know that the output it gave you is wrong in the first place, and if you know that, you didn't need the AI to begin with...
djeastm 12/12/2025||||
Possibly, but a lot of value comes from doing very simple things faster.
jacquesm 12/12/2025||
That is a good point. A lot of work really is mostly simple things.
snet0 12/11/2025|||
If you define "simple thing" as "thing an AI can't do", then yes. Everyone just shifts the goalposts in these conversations, it's infuriating.
ACCount37 12/11/2025||
Come on. If we weren't shifting the goalposts, we would have burned through 90% of the entire supply of them back in 2022!
baq 12/11/2025||
It’s less shifting goalposts and more of a very jagged frontier of capabilities problem.
verdverm 12/11/2025|||
I'm not sure. Here's my anecdotal counterexample: I was able to get gemini-2.5-flash, in two turns, to understand and implement something I had done separately first, and it found another bug (one I had also fixed, but had forgotten was in this path).

That I was able to have a Flash model replicate the same solutions I had, to two problems in two turns, is just the opposite of your consistency argument. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). The models are able to do more of them today than they were even 6-12 months ago (pre-thinking models).

https://bsky.app/profile/verdverm.com/post/3m7p7gtwo5c2v

stego-tech 12/11/2025||
And therein lies the rub for why I still approach this technology with caution, rather than charge in full steam ahead: variable outputs based on immensely variable inputs.

I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.

Getting back to the benchmark angle though, we’re firmly in the era of benchmark gaming - hence my quip about these things failing “the only benchmark that matters.” I meant for that to be interpreted along the lines of, “trust your own results rather than a spreadsheet matrix of other published benchmarks”, but I clearly missed the mark in making that clear. That’s on me.

verdverm 12/11/2025||
I mean more the guts of the agentic systems. Prompts, tool design, state and session management, agent transfer and escalation. I come from devops and backend dev, so getting in at this level, where LLMs are tasked and composed, is more interesting.

If you are only using the providers' LLM experiences, and not something specific to coding like Copilot or Claude Code, that would be the first step to getting the magic, as you say. It is also not instant. It takes time to learn any new tech, and this one has an above-average learning curve, despite the facade and hype that it should just be magic.

Once you find the stupid shit in the vendor coding agents, like all of us IT/devops folks do eventually, you can go a level down and build on something like the ADK to bring your expertise and experience to the building blocks.

For example, I am now implementing environments for agents based on container layers and Dagger, which unlocks the ability to cheaply and reproducibly clone what one agent was doing and have a dozen variations iterate on the next turn. Really useful for long-term training data and eval synthesis, but also for my own experimentation as I learn how to get better at using these things. Another thing I did was change how filesystem operations look to the agent, in particular file reads. I did this to save context and money (finops), after burning $5 in 60s because of an error in my tool implementation. Instead of appearing as message contents, file reads are now injected into the system prompt. Doing so made it trivial to add a key/val "cache" for the fun of it, since I could now inject things into the system prompt and let the agent have some control over that process through tools. Boy, has that been interesting, and it has opened up some research questions in my mind.
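A rough sketch of that file-read injection idea, assuming you control the agent loop yourself (everything below is hypothetical, not the ADK or any vendor API): keep the latest read of each file in a dict and render it into the system prompt every turn, so stale duplicate reads don't pile up in the message history.

    # Hypothetical sketch: file reads live in a cache rendered into the system
    # prompt, instead of accumulating as tool-result messages.
    FILE_CACHE: dict[str, str] = {}

    def read_file(path: str) -> str:
        with open(path) as f:
            FILE_CACHE[path] = f.read()
        return f"(contents of {path} are now in the system prompt)"

    def build_system_prompt(base: str) -> str:
        sections = [base]
        for path, text in FILE_CACHE.items():
            sections.append(f"--- {path} ---\n{text}")
        return "\n\n".join(sections)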

remich 12/12/2025||
Any particular papers or articles you've been reading that helped you devise this? Your experiments sound interesting and possibly relevant to what I'm doing.
verdverm 12/13/2025||
Conversations among practitioners on Bluesky (there is an Ai subcommunity)
quantumHazer 12/11/2025||||
Seems pretty false if you look at the model card and website of Opus 4.5, which is… (checks notes) their latest model.
verdverm 12/11/2025||
Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing, which is why they have models people like to use day-to-day.

I use Gemini. Anthropic stole $50 from me (they expired and kept my prepaid credits) and I have not forgiven them for it yet, but people rave about Claude for coding, so I may try the model again through Vertex AI...

The person who made the speculation was, I believe, talking more about blog posts and media statements than model cards. Most AI announcements come with benchmark touting; Anthropic supposedly does less of this in their announcements. I haven't seen or gathered the data to know what's true.

elcritch 12/11/2025||
You could try Codex CLI. I prefer it over Claude Code now, but only slightly.
verdverm 12/11/2025||
No thanks, not touching anything Oligarchy Altman is behind
Mistletoe 12/11/2025||||
How do you measure whether it works better day to day without benchmarks?
bulbar 12/11/2025|||
Manually labeling answers, maybe? There's a lot of infrastructure built around that, it's been heavily used for two decades, and it's relatively cheap.

That's still benchmarking of course, but not utilizing any of the well known / public ones.
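Concretely, a private eval of this kind can be as small as a hand-labeled list and an accuracy loop (a hypothetical sketch; real graders get fancier than substring matching):

    # Hypothetical sketch of a tiny private eval.
    CASES = [
        ("What is the capital of France?", "Paris"),
        ("2 + 2 * 3 = ?", "8"),
    ]

    def run_eval(ask_model) -> float:
        # ask_model is whatever callable wraps your model of choice.
        hits = sum(expected in ask_model(q) for q, expected in CASES)
        return hits / len(CASES)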

verdverm 12/11/2025||||
Internal evals. Big AI certainly has good, proprietary training and eval data; it's one reason why their models are better.
aydyn 12/11/2025||
Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.
verdverm 12/11/2025||
How would published numbers be useful without knowing what the underlying data used to test and evaluate them is? It's proprietary for a reason.

To think that Anthropic is not being intentional and quantitative in their model building, because they care less about the saturated benchmaxxing, is to miss the forest for the trees.

aydyn 12/11/2025||
Do you know everything that exists in public benchmarks?

They can give a description of what their metrics are without giving away anything proprietary.

verdverm 12/11/2025||
I'd recommend watching the video Nathan Lambert dropped yesterday on Olmo 3 Thinking. You'll learn there are a lot of places where even descriptions of proprietary testing regimes would give away some secret sauce.

Nathan is at Ai2, which is all about open-sourcing the process, experience, and learnings along the way.

aydyn 12/12/2025||
Thanks for the reference, I'll check it out. But it doesn't really take away from the point I am making. If one level of description would give away proprietary information, then go one level up to a vaguer description. How to describe things at the proper level is more of a social problem than a technical one.
verdverm 12/13/2025||
You seem stuck on the idea that they should have to share information when they don't have to. That they share any is a welcome change. Push too hard and they may stop sharing as much
standardUser 12/11/2025|||
Subscriptions.
mrguyorama 12/11/2025||
Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.
standardUser 12/15/2025||
It's all the shareholders care about. These are not research institutions.
brokensegue 12/11/2025||||
How do you quantitatively measure day-to-day quality? The only thing I can think of is A/B tests, which take a while to evaluate.
verdverm 12/11/2025||
More or less this, but also synthetic.

If you think about GANs, it's all the same concept:

1. train model (agent)

2. train another model (agent) to do something interesting with/to the main model

3. gain new capabilities

4. iterate

You can use a mix of both real and synthetic chat sessions, or whatever you want your model to be good at. Mid/late training seems to be where you start crafting personality and expertise.

Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single-model/LLM training. I still need to get into what is du jour in RL/late training; that's where a lot of the opportunity lies, from my understanding so far.

Nathan Lambert (https://bsky.app/profile/natolambert.bsky.social) from Ai2 (https://allenai.org/) & RLHF Book (https://rlhfbook.com/) has a really great video out yesterday about the experience training Olmo 3 Think

https://www.youtube.com/watch?v=uaZ3yRdYg8A

HDThoreaun 12/11/2025|||
Arc-AGI is just an IQ test. I don’t see the problem with training it to be good at IQ tests, because that's a skill that translates well.
fwip 12/11/2025|||
It is very similar to an IQ test, with all the attendant problems that entails. Looking at the Arc-AGI problems, it seems like visual/spatial reasoning is just about the only thing they are testing.
CamperBob2 12/11/2025||||
Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

npinsker 12/11/2025|||
Completely false. This is like saying being good at chess is equivalent to being smart.

Look no further than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.

The benchmark isn’t particularly strong against gaming, especially with private data.

mrandish 12/11/2025|||
ARC-AGI was designed specifically for evaluating deeper reasoning in LLMs, including being resistant to LLMs 'training to the test'. If you read Francois' papers, he's well aware of the challenge and has done valuable work toward this goal.
npinsker 12/11/2025||
I agree with you. I agree it's valuable work. I totally disagree with their claim.

A better analogy is: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but not infinite.

Similarly, ARC-AGI is much more bounded than they seem to think. It correlates with intelligence, but doesn't imply it.

yovaer 12/12/2025|||
> but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems

IMO/AIME problems perhaps, but surely that's too narrow a view for all of mathematics. If solving conjectures were simply a matter of trying a standard range of techniques enough times, then there would be a lot fewer open problems around than what's the case.

keeda 12/12/2025|||
Maybe I'm misinterpreting your point, but this makes it seem that your standard for "intelligence" is "inventing entirely new techniques"? If so, it's a bit extreme, because to a first approximation, all problem solving is combining and applying existing techniques in novel ways to new situations.

At the point that you are inventing entirely new techniques, you are usually doing groundbreaking work. Even groundbreaking work in one field is often inspired by techniques from other fields. In the limit, discovering truly new techniques often requires discovering new principles of reality to exploit, i.e. research.

As you can imagine, this is very difficult and hence rather uncommon, typically only accomplished by a handful of people in any given discipline, i.e way above the standards of the general population.

I feel like if we are holding AI to those standards, we are talking about not just AGI, but artificial super-intelligence.

CamperBob2 12/11/2025|||
> Completely false. This is like saying being good at chess is equivalent to being smart.

No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.

fwip 12/11/2025||
Took a couple just now. It seems like a straightforward generalization of the IQ tests I've taken before, reformatted into an explicit grid to be a little bit friendlier to machines.

Not to humble-brag, but I also outperform on IQ tests well beyond my actual intelligence, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.

CamperBob2 12/11/2025||
Given your intellectual resources -- which you've successfully used to pass a test that is designed to be easy for humans to pass while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.

What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.

fwip 12/11/2025||
Dunno :) I'm not an expert at LLMs or test design, I just see a lot of similarity between IQ tests and these questions.
ACCount37 12/11/2025||||
With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.

Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.

Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.

Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.

jimbokun 12/11/2025||||
Is it different every time? Otherwise the training could just memorize the answers.
CamperBob2 12/11/2025||
The models never have access to the answers for the private set -- again, at least in principle. Whether that's actually true, I have no idea.

The idea behind Arc-AGI is that you can train all you want on the answers, because knowing the solution to one problem isn't helpful on the others.

In fact, the way the test works is that the model is given several examples of worked solutions for each problem class, and is then required to infer the underlying rule(s) needed to solve a different instance of the same type of problem.

That's why comparing Arc-AGI to chess or other benchmaxxing exercises is completely off base.

(IMO, an even better test for AGI would be "Make up some original Arc-AGI problems.")
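For readers who haven't looked at the tasks: the public ARC data really is this shape, a few train input/output grids and a test input whose transformation rule you must infer, with grids of integers 0-9. A made-up miniature example (real tasks are larger and far stranger):

    # Made-up miniature task in the ARC-style JSON shape; the hidden rule here
    # is "mirror each row horizontally".
    task = {
        "train": [
            {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
            {"input": [[3, 4], [0, 0]], "output": [[4, 3], [0, 0]]},
        ],
        "test": [{"input": [[5, 0], [0, 6]]}],  # expected: [[0, 5], [6, 0]]
    }

    def mirror(grid):
        return [row[::-1] for row in grid]

    assert all(mirror(ex["input"]) == ex["output"] for ex in task["train"])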

FergusArgyll 12/11/2025||||
It's very much a vision test. The reason all the models don't pass it easily is only because of the vision component. It doesn't have much to do with reasoning at all
esafak 12/11/2025|||
I would not be so sure. You can always prep for the test.
HDThoreaun 12/11/2025||
How do you prep for arc agi? If the answer is just "get really good at pattern recognition" I do not see that as a negative at all.
ben_w 12/11/2025||
It can be not-negative without being sufficient.

Imagine that pattern recognition is 10% of the problem, and we just don't know what the other 90% is yet.

Streetlight effect for "what is intelligence" leads to all the things that LLMs are now demonstrably good at… and yet, the LLMs are somehow missing a lot of stuff and we have to keep inventing new street lights to search underneath: https://en.wikipedia.org/wiki/Streetlight_effect

HDThoreaun 12/11/2025||
I don't think many people are saying 100% on ARC-AGI 2 is equivalent to AGI (names are dumb, as usual). It's just the best metric I have found, not the final answer. Spatial reasoning is an important part of intelligence even if it doesn't encompass all of it.
minimaxir 12/11/2025|||
Note that GPT 5.2 newly supports an "xhigh" reasoning level, which could explain the better benchmarks.

It'll be noteworthy to see the cost-per-task on ARC AGI v2.
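If "xhigh" slots into the existing reasoning-effort knob the way low/medium/high do, selecting it would look something like this (a guess based on the current API shape, not verified against the 5.2 docs):

    # Sketch assuming "xhigh" is a new value for the existing reasoning_effort
    # parameter; this is an unverified assumption.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5.2",
        reasoning_effort="xhigh",
        messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    )
    print(resp.choices[0].message.content)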

granzymes 12/11/2025|||
> It'll be noteworthy to see the cost-per-task on ARC AGI v2.

Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).

The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).

https://arcprize.org/leaderboard

minimaxir 12/11/2025||
Huh, that is indeed up and to the left of Opus.
walletdrainer 12/11/2025|||
5.1-codex supports that too, no? Pretty sure I’ve been using xhigh for at least a week now
causal 12/11/2025|||
That ARC AGI score is a little suspicious; that's a really tough benchmark for AI. Curious whether there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.
woeirua 12/11/2025|||
They're clearly building better training datasets and doing extensive RL on these benchmarks over time. The out of distribution performance is still awful.
taurath 12/11/2025|||
I don't think their words mean much of anything; only the behavior of the models does.

Still waiting on Full Self-Driving myself.

fuddle 12/11/2025|||
I don't think SWE Verified is an ideal benchmark, as the solutions are in the training dataset.
joshuahedlund 12/11/2025||
I would love for SWE Verified to put out a set of fresh but comparable problems and see how the top performing models do, to test against overfitting.
thinkingtoilet 12/11/2025|||
OpenAI has already been busted for getting benchmark information and training their models on it. At this point, if you believe Sam Altman, I have a bridge to sell you.
catigula 12/11/2025|||
Yes, but it's not good enough. They needed to surpass Opus 4.5.
mikairpods 12/11/2025||
that is better...?
poormathskills 12/11/2025||
For a minor version update (5.1 -> 5.2) that's a way bigger improvement than I would have guessed.
beering 12/11/2025||
Model capability improvements are very uneven. Changes between one model and the next tend to benefit certain areas substantially without moving the needle on others. You see this across all frontier labs’ model releases. Also the version numbering is BS (remember GPT-4.5 followed by GPT-4.1?).
CodeCompost 12/12/2025|
For the first time, I've actually hidden an AI story on HN.

I can't even anymore. Sorry, this is not going anywhere.

andybak 12/12/2025||
How is this different from any other post announcing an incremental improvement in an app or service?
mabedan 12/12/2025||
It’s a little different. Most of these improvements are just more training hours and better weights. Even if it's an actual improvement in the training algorithm or other software tweaks, they're not open source, so other than "look how marginally nicer the chatbot responds now" the post doesn't provide much value.
gchokov 12/12/2025||
Here, take my downvote.
bigyabai 12/12/2025||
In lieu of a killer app?