Posted by fortran77 4/4/2025
ARC-AGI is the main reason why I don't trust static benchmarks.
If you don't have an essentially infinite set to draw your validation data from, then a large enough model will memorize it as part of its developer team's KPIs.
Forget all these fancy benchmarks. If you want to saturate any model today, give it a string and a grammar and ask it to generate the string from the grammar. I've had _every_ model fail this on regular grammars with strings more than 4 characters long.
LLMs are the solution to natural language, which is a huge deal. They aren't the solution to reasoning, which is still best solved with what used to be called symbolic AI before it started working, e.g. SAT solvers.
"Write a history of the Greek language but reverse it, so that one would need to read it from right to left and bottom to top."
ChatGPT wrote the history and showed absolutely no awareness, let alone "understanding", of the second half of the prompt.
civilization Mycenaean the of practices religious and economic, administrative the into insights invaluable provides and B Linear as known script the in recorded was language Greek the of form attested earliest The
As someone else in this thread nicely put it, the tools are being sold as a hop, skip, and jump away from AGI. They clearly aren't. ChatGPT tells us to "ask anything." I did that. There is no 'there' there with these tools. They aren't even dumb.
Edit: scratch that, it thought there was a six letter word starting with "trs" and then changed its mind to "tre" when I guessed "e." Hilarious.
Edit: realized just now that my summary of the 'test' failed to specify the request fully: the letters need to be reversed, too. Maybe I'm just bad with AI tools, because I didn't even get a response that 'this like looked' (i.e. reversed the order of the words).
It might work in an agent system where it can make and execute code to solve problems.
Right, the current paradigm of requiring an LLM to do arbitrary-digit multiplication will not work, and we shouldn't need it to. If your task is "do X" and it can be reliably accomplished with "write a python program to do X", that's good enough as far as I'm concerned. It's preferable, in fact.
Btw Chollet has said basically as much. He calls them “stored programs” I think.
I think he is onto something. The right atomic unit for approaching these problems is probably not the token, at least at first. Higher-level abstractions should be refined into specific components, similar to the concept of diffusion.
The ChatGPT input field still says ‘Ask anything’, and that is what I shall do.
__________________
Answers: $1
Thoughtful Answers: $5
Correct Answers: $50
Dumb Looks are Free
But in that case, why an LLM? If we want question-answer machines to be reliable, they must have the skills, "counting" being just a basic example.
----
What's 494547645908151+7640745309351279642?
ChatGPT said: The sum of 494,547,645,908,151 and 7,640,745,309,351,279,642 is:
7,641,239,857,997,187,793
----
(7,641,239,856,997,187,793 is the correct answer)
> Let's calculate: 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> Answer: 7,641,239,856,997,187,793
For what it’s worth, people are also pretty bad at math compared to calculators. We are slow and error prone. That’s ok.
What I was (poorly) trying to say is that I don’t care if the neural net solves the problem if it can outsource it to a calculator. People do the same thing. What is important is reliably accomplishing the goal.
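For the sum above, the "outsource it to a calculator" route is a one-liner. A minimal sketch in plain Python (arbitrary-precision integers, so the result is exact):

    # The "write a Python program to do X" route for the addition above.
    a = 494_547_645_908_151
    b = 7_640_745_309_351_279_642

    total = a + b
    print(f"{total:,}")   # 7,641,239,856,997,187,793 -- matches the corrected answer above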
This backlash of pointing out LLM failures is a reaction to the overblown hype. We don't expect a statistical-language-processing-gadget to do math well, but then people need to stop claiming they're something other than statistical-language-processing-gadgets.
I'm not sure I understand what that means - could you explain please?
Now you can imagine giving an LLM arbitrary validity rules for generating text. I think that’s what they mean by “grammar”.
LLMs operate on tokens, which are words or word fragments, so they have limited ability to work on a letter-by-letter basis. They can't reliably count letters in a sentence, for example. "Give it a string and a grammar and ask it to generate the string from the grammar" can't be done by inference alone because of this: they would generate tokens that don't match the grammar.
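To make the token point concrete, here is a small sketch assuming the tiktoken package (a BPE tokenizer used for OpenAI models); the exact splits depend on the encoding, so the pieces in the comment are illustrative:

    # Why letter-level tasks are awkward for LLMs: the model sees tokens, not
    # characters. Assumes the `tiktoken` package; splits vary by encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["strawberry", "end_company"]:
        token_ids = enc.encode(word)
        pieces = [enc.decode([t]) for t in token_ids]
        print(word, "->", pieces)   # e.g. chunks like 'str'/'aw'/'berry', not single letters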
But you can use a grammar-based sampler and it'll generate valid strings just fine. llama.cpp can easily do this if you provide an EBNF grammar specification.
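As a toy illustration of the idea (plain Python with a made-up grammar, not llama.cpp's actual sampler): expand nonterminals at random and every output is valid by construction.

    import random

    # A made-up toy CFG, in the same spirit as the one discussed in this thread.
    GRAMMAR = {
        "start": [["city", "route"]],
        "route": [["to", "city"], ["to", "city", "route"]],
        "city":  [["Rome"], ["Paris"], ["London"]],
    }

    def generate(symbol="start", depth=0, max_depth=8):
        """Expand `symbol` by picking productions at random, capping recursion depth."""
        if symbol not in GRAMMAR:            # terminal: emit as-is
            return [symbol]
        productions = GRAMMAR[symbol]
        # Past the depth cap, always take the first production so expansion bottoms out.
        production = productions[0] if depth >= max_depth else random.choice(productions)
        return [tok for sym in production for tok in generate(sym, depth + 1, max_depth)]

    print(" ".join(generate()))   # e.g. "Rome to Paris to London" -- grammatical by construction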
Changing my tests from the strings I was interested in to common words of four or more letters _did_ improve the ability of reasoning LLMs to get the right answer, at the cost of the context exploding to thousands of tokens.
Unfortunately I can't tell you by how much because the couple of dozen tests I did after reading your post ate my $50 I keep in an account for these types of things.
The following question ate through 8k thinking tokens to get the right answer in Claude 3.7 Sonnet (extended thinking):
---
Given the following grammar:
<start> ::= <path>
<path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
<routes> ::= <path> | end_route <company>
<company> ::= end_company | <path>
Is the following sentence valid: Rome Paris Rome end_path Rome London end_path end_company
---
Incidentally it got the right answer no less than 4 times in the thinking token stream. I'd not seen this model act like this before.
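For comparison, a symbolic check of the same question is a few lines of Python. A sketch of a brute-force CFG membership test (written here for illustration, not anything the model produced); it reports the sentence as invalid:

    from functools import lru_cache

    # The grammar from the question, with the angle brackets dropped.
    GRAMMAR = {
        "start":   [["path"]],
        "path":    [["Rome", "path"], ["Paris", "path"], ["London", "path"],
                    ["end_path", "routes"]],
        "routes":  [["path"], ["end_route", "company"]],
        "company": [["end_company"], ["path"]],
    }

    SENTENCE = "Rome Paris Rome end_path Rome London end_path end_company".split()

    @lru_cache(maxsize=None)
    def reachable(symbol, start):
        """All input positions reachable after deriving `symbol` starting at `start`."""
        if symbol not in GRAMMAR:            # terminal: must match the next token
            matched = start < len(SENTENCE) and SENTENCE[start] == symbol
            return frozenset({start + 1}) if matched else frozenset()
        ends = set()
        for production in GRAMMAR[symbol]:
            positions = {start}
            for sym in production:           # thread positions through the production
                positions = {e for p in positions for e in reachable(sym, p)}
            ends |= positions
        return frozenset(ends)

    valid = len(SENTENCE) in reachable("start", 0)
    print("valid" if valid else "invalid")   # prints: invalid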
Sounds like a use-case for property testing: https://en.wikipedia.org/wiki/Software_testing#Property_test...
That seems to be because LLMs aren't able to follow procedures (e.g. reliably counting).
I feel like this description really buries the lede on Chollet's expertise. (For those who don't know, he's the creator of and lead contributor[0] to Keras)
Seems to me that a lot of folks are enjoying having an LLM rewrite their email or whatever, but I wonder how many are actually buying the rest of it? The companies themselves sure aren’t helping.
"This ride was longer and harder than usual" - no sh*t, the map, elevation profile and my legs have already informed me.
"You set 3 new PRs" - I can see that with one flick of the thumb thank you.
"Consider a rest day" - consider? None of my job, spouse, equipment or body is too keen on doing that again for a while.
It's somehow LESS helpful than just having a pointer to which part of the text to revisit.
It basically just restates the question/problem. Even worse than that is that it's an essentially STATIC note for each question yet it appears to be REAL-TIME GENERATED each time. I guess that could just be for appearances but it's just dumb all the way around really.
I saw the Apple Intelligence presentation a while ago, and in the span of five minutes they had someone asking the assistant to expand a one-liner into an e-mail, and then someone receiving a long e-mail and asking the AI to summarize it.
We spun GPUs to expand, then spun them again to summarize. Gold.
Francois is out to push the boundaries of science and help create models that are truly more intelligent.
There are a number of skill signals we demand from an intelligence.
Mind you: some of them are achieved - like the ability to interpret pronouns (Hinton's "the trophy will not enter the case: it's too big" vs "the trophy will not enter the case: it's too small").
Others, we meet occasionally when we are not researching said requirements systematically: one example is that detective game described at https://news.ycombinator.com/item?id=43284420 - a simple game of logic that intelligences are required to be able to solve (...and yet, again some rebutted that humans would fail etc.).
It remains important though that those working modules are not clustered (solving specific tasks and remaining unused otherwise): they must be intellectual keys adapted into use in the most general cases they can be helpful in. That's important in intelligence. So, even the ability to solve "revealing" tasks is not enough - the way in which the ability works is crucial.
s/intellectually lazy/hype maxing for fundraising/
I don’t know where this persistent myth comes from, but it has to go.
Part of, explicitly; not the whole solution, quite 100% explicitly. The TL;DR is "LLMs can't do it alone, program synthesis leveraging LLMs is my bet". Not "Maybe not LLMs but they'll certainly help us get there!", quite the opposite! Hence: well, TFA. And the intellectually lazy quote we are explicitly discussing. And anything Chollet has said on the subject. [^1]
[^1]"LLMs won’t lead to AGI - $1,000,000 Prize to find true solution" - https://www.dwarkesh.com/p/francois-chollet - 1.5 hours with the gent
1.5 hours with Chollet on "LLMs won't lead to AGI - $1,000,000 Prize to find true solution" (linked above).
Published June 2024, and by December, well... we can all agree there's an ARC-AGI-2 now.
1a) it's not AI, it's an LLM. The companies that create/train/operate them may (wink-wink) pitch them as "AI" with half-truths, but we (here) know it's LLMs "all the way down".
1b) just like I disliked the "autopilot" in Teslas because it was never autopilot.
2) I know that I wanted to write some software tools, and I have been successful at this for the past many months, and I got top-shelf tools that work, do their tasks, send alerts, etc. etc. And I am not the only one. So if the purpose is to "show it's a stupid AI"... well... it's not AI... so yeah. If the purpose is "it is not perfect", yes, because it draws a hand with 10 fingers. What else is new?
LLMs are a tool, still under development, still early in the curve, they can do A-B-C well but not X-Y-Z well (or at all). Congratulations :)

I completely agree, they are a tool, and a decently useful tool. They are not early in the curve, they're about flat at this point.
https://chatgpt.com/share/67ef43f4-3b88-800d-a5a3-e3ffea178f...
(Me trying to describe a desk top with a fold down hinged top, and it just drawing whatever)