Posted by fortran77 4/4/2025
ARC-AGI is the main reason why I don't trust static benchmarks.
If you don't have an essentially infinite set to draw your validation data from, then a large enough model will memorize it as part of its developer team's KPIs.
Forget all these fancy benchmarks. If you want to saturate any model today, give it a string and a grammar and ask it to generate the string from the grammar. I've had _every_ model fail this on regular grammars with strings more than 4 characters long.
LLMs are the solution to natural language, which is a huge deal. They aren't the solution to reasoning, which is still best solved with what used to be called symbolic AI before it started working, e.g. SAT solvers.
"Write a history of the Greek language but reverse it, so that one would need to read it from right to left and bottom to top."
ChatGPT wrote the history and showed absolutely no awareness, let alone "understanding", of the second half of the prompt.
civilization Mycenaean the of practices religious and economic, administrative the into insights invaluable provides and B Linear as known script the in recorded was language Greek the of form attested earliest The
As someone else in this thread nicely put it, the tools are being sold as a hop, skip, and jump away from AGI. They clearly aren't. ChatGPT tells us to "ask anything." I did that. There is no 'there' there with these tools. They aren't even dumb.
Edit: scratch that, it thought there was a six letter word starting with "trs" and then changed its mind to "tre" when I guessed "e." Hilarious.
Edit: realized just now that my summary of the 'test' failed to specify the request fully: the letters need to be reversed, too. Maybe I'm just bad with AI tools, because I didn't even get a response that 'this like looked' (i.e. reversed the order of the words).
It might work in an agent system where it can make and execute code to solve problems.
Right, the current paradigm of requiring an LLM to do arbitrary-digit multiplication will not work, and we shouldn't need it to. If your task is "do X" and it can be reliably accomplished with "write a python program to do X", that's good enough as far as I'm concerned. It's preferable, in fact.
Btw Chollet has said basically as much. He calls them “stored programs” I think.
I think he is onto something. The right atomic unit for approaching these problems is probably not the token, at least at first. Higher-level abstractions should be refined into specific components, similar to the concept of diffusion.
The ChatGPT input field still says ‘Ask anything’, and that is what I shall do.
__________________
Answers: $1
Thoughtful Answers: $5
Correct Answers: $50
Dumb Looks are Free
But in that case, why an LLM? If we want question-answer machines to be reliable, they must have the skills, "counting" being just a basic example.
----
What's 494547645908151+7640745309351279642?
ChatGPT said: The sum of 494,547,645,908,151 and 7,640,745,309,351,279,642 is:
7,641,239,857,997,187,793
----
(7,641,239,856,997,187,793 is the correct answer)
> Let's calculate: 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> Answer: 7,641,239,856,997,187,793
For what it’s worth, people are also pretty bad at math compared to calculators. We are slow and error prone. That’s ok.
What I was (poorly) trying to say is that I don’t care if the neural net solves the problem if it can outsource it to a calculator. People do the same thing. What is important is reliably accomplishing the goal.
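For the sum above, the "outsource it to a calculator" route is a one-liner. A minimal sketch in plain Python (arbitrary-precision integers, so the result is exact):

    # The "write a Python program to do X" route for the addition above.
    a = 494_547_645_908_151
    b = 7_640_745_309_351_279_642

    total = a + b
    print(f"{total:,}")   # 7,641,239,856,997,187,793 -- matches the corrected answer above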
This backlash of pointing out LLM failures is a reaction to the overblown hype. We don't expect a statistical-language-processing-gadget to do math well, but then people need to stop claiming they're something other than statistical-language-processing-gadgets.
I'm not sure I understand what that means - could you explain please?
Now you can imagine giving an LLM arbitrary validity rules for generating text. I think that’s what they mean by “grammar”.
LLMs operate on tokens, which are words or word fragments, so they have limited ability to work on a letter-by-letter basis. They can't reliably count letters in a sentence, for example. "Give it a string and a grammar and ask it to generate the string from the grammar" can't be done by inference alone because of this: they would generate tokens that don't match the grammar.
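To make the token point concrete, here is a small sketch assuming the tiktoken package (a BPE tokenizer used for OpenAI models); the exact splits depend on the encoding, so the pieces in the comment are illustrative:

    # Why letter-level tasks are awkward for LLMs: the model sees tokens, not
    # characters. Assumes the `tiktoken` package; splits vary by encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["strawberry", "end_company"]:
        token_ids = enc.encode(word)
        pieces = [enc.decode([t]) for t in token_ids]
        print(word, "->", pieces)   # e.g. chunks like 'str'/'aw'/'berry', not single letters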
But you can use a grammar-based sampler and it'll generate valid strings just fine. llama.cpp can easily do this if you provide an EBNF grammar specification.
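As a toy illustration of the idea (plain Python with a made-up grammar, not llama.cpp's actual sampler): expand nonterminals at random and every output is valid by construction.

    import random

    # A made-up toy CFG, in the same spirit as the one discussed in this thread.
    GRAMMAR = {
        "start": [["city", "route"]],
        "route": [["to", "city"], ["to", "city", "route"]],
        "city":  [["Rome"], ["Paris"], ["London"]],
    }

    def generate(symbol="start", depth=0, max_depth=8):
        """Expand `symbol` by picking productions at random, capping recursion depth."""
        if symbol not in GRAMMAR:            # terminal: emit as-is
            return [symbol]
        productions = GRAMMAR[symbol]
        # Past the depth cap, always take the first production so expansion bottoms out.
        production = productions[0] if depth >= max_depth else random.choice(productions)
        return [tok for sym in production for tok in generate(sym, depth + 1, max_depth)]

    print(" ".join(generate()))   # e.g. "Rome to Paris to London" -- grammatical by construction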
Changing my tests from the strings I was interested in to common words of four or more letters _did_ improve the ability of reasoning LLMs to get the right answer, at the cost of the context exploding to thousands of tokens.
Unfortunately I can't tell you by how much because the couple of dozen tests I did after reading your post ate my $50 I keep in an account for these types of things.
The following question ate through 8k thinking tokens to get the right answer in Claude 3.7 Sonnet (extended thinking):
---
Given the following grammar:
<start> ::= <path>
<path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
<routes> ::= <path> | end_route <company>
<company> ::= end_company | <path>
Is the following sentence valid: Rome Paris Rome end_path Rome London end_path end_company
---
Incidentally it got the right answer no less than 4 times in the thinking token stream. I'd not seen this model act like this before.
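For comparison, a symbolic check of the same question is a few lines of Python. A sketch of a brute-force CFG membership test (written here for illustration, not anything the model produced); it reports the sentence as invalid:

    from functools import lru_cache

    # The grammar from the question, with the angle brackets dropped.
    GRAMMAR = {
        "start":   [["path"]],
        "path":    [["Rome", "path"], ["Paris", "path"], ["London", "path"],
                    ["end_path", "routes"]],
        "routes":  [["path"], ["end_route", "company"]],
        "company": [["end_company"], ["path"]],
    }

    SENTENCE = "Rome Paris Rome end_path Rome London end_path end_company".split()

    @lru_cache(maxsize=None)
    def reachable(symbol, start):
        """All input positions reachable after deriving `symbol` starting at `start`."""
        if symbol not in GRAMMAR:            # terminal: must match the next token
            matched = start < len(SENTENCE) and SENTENCE[start] == symbol
            return frozenset({start + 1}) if matched else frozenset()
        ends = set()
        for production in GRAMMAR[symbol]:
            positions = {start}
            for sym in production:           # thread positions through the production
                positions = {e for p in positions for e in reachable(sym, p)}
            ends |= positions
        return frozenset(ends)

    valid = len(SENTENCE) in reachable("start", 0)
    print("valid" if valid else "invalid")   # prints: invalid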
Sounds like a use-case for property testing: https://en.wikipedia.org/wiki/Software_testing#Property_test...
That seems to be because LLMs aren't able to follow procedures (e.g. reliably counting).
I feel like this description really buries the lede on Chollet's expertise. (For those who don't know, he's the creator of and lead contributor[0] to Keras)
Seems to me that a lot of folks are enjoying having an LLM rewrite their email or whatever, but I wonder how many are actually buying the rest of it? The companies themselves sure aren’t helping.
"This ride was longer and harder than usual" - no sh*t, the map, elevation profile and my legs have already informed me.
"You set 3 new PRs" - I can see that with one flick of the thumb thank you.
"Consider a rest day" - consider? None of my job, spouse, equipment or body is too keen on doing that again for a while.
It's somehow LESS helpful than just having a pointer to which part of the text to revisit.
It basically just restates the question/problem. Even worse than that is that it's an essentially STATIC note for each question yet it appears to be REAL-TIME GENERATED each time. I guess that could just be for appearances but it's just dumb all the way around really.
I saw the Apple Intelligence presentation a while ago, and in the span of five minutes they had someone asking the assistant to expand a one-liner into an e-mail, and then someone receiving a long e-mail and asking the AI to summarize it.
We spun GPUs to expand, then spun them again to summarize. Gold.
Francois is out to push the boundaries of science and help create models that are truly more intelligent.
There are a number of skill signals we demand from an intelligence.
Mind you: some of them are achieved - like the ability to interpret pronouns (Hinton's "the trophy will not enter the case: it's too big" vs "the trophy will not enter the case: it's too small").
Others, we meet occasionally when we are not researching said requirements systematically: one example is that detective game described at https://news.ycombinator.com/item?id=43284420 - a simple game of logic that intelligences are required to be able to solve (...and yet, again some rebutted that humans would fail etc.).
It remains important though that those working modules are not clustered (solving specific tasks and remaining unused otherwise): they must be intellectual keys adapted into use in the most general cases they can be helpful in. That's important in intelligence. So, even the ability to solve "revealing" tasks is not enough - the way in which the ability works is crucial.
s/intellectually lazy/hype maxing for fundraising/
I don’t know where this persistent myth comes from, but it has to go.
Part of, explicitly; not the whole solution, quite 100% explicitly. The TL;DR is "LLMs can't do it alone, program synthesis leveraging LLMs is my bet". Not "Maybe not LLMs but they'll certainly help us get there!", quite the opposite! Hence: well, TFA. And the intellectually lazy quote we are explicitly discussing. And anything Chollet has said on the subject. [^1]
[^1]"LLMs won’t lead to AGI - $1,000,000 Prize to find true solution" - https://www.dwarkesh.com/p/francois-chollet - 1.5 hours with the gent
1.5 hours with Chollet on "LLMs won't lead to AGI - $1,000,000 Prize to find true solution" (linked above).
Published June 2024, and by December, well... we can all agree there's an ARC-AGI-2 now.
1a) it's not AI, it's an LLM. The companies that create/train/operate them may (wink-wink) pitch them as "AI" with half-truths, but we (here) know it's LLMs "all the way down".
1b) just like I disliked the "autopilot" in Teslas because it was never autopilot.
2) I know that I wanted to write some software tools, and I have been successful at this for the past many months, and I got top-shelf tools that work, do their tasks, send alerts, etc. etc. And I am not the only one. So if the purpose is to "show it's a stupid AI"... well... it's not AI... so yeah. If the purpose is "it is not perfect", yes, because it draws a hand with 10 fingers. What else is new?
LLMs are a tool, still under development, still early in the curve, they can do A-B-C well but not X-Y-Z well (or at all). Congratulations :)

I completely agree, they are a tool, and a decently useful tool. They are not early in the curve, they're about flat at this point.
https://chatgpt.com/share/67ef43f4-3b88-800d-a5a3-e3ffea178f...
(Me trying to describe a desk top with a fold down hinged top, and it just drawing whatever)