Posted by fortran77 10 hours ago
ARC-AGI is the main reason I don't trust static benchmarks.
If you don't have an essentially infinite set to draw your validation data from, then a large enough model will memorize it as part of its developer team's KPIs.
Forget all these fancy benchmarks. If you want to saturate any model today, give it a string and a grammar and ask it to generate the string from the grammar. I've had _every_ model fail this on regular grammars with strings more than 4 characters long.
LLMs are the solution to natural language, which is a huge deal. They aren't the solution to reasoning, which is still best solved with what used to be called symbolic AI before it started working, e.g. SAT solvers.
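For a concrete sense of what that kind of symbolic reasoning looks like, here's a minimal sketch using the z3-solver Python package (my own toy example, not the parent's):

    # Encode a tiny constraint problem and let the solver do the exact reasoning.
    from z3 import Bools, Solver, Or, Not, sat

    a, b, c = Bools("a b c")
    s = Solver()
    s.add(Or(a, b))        # a or b must hold
    s.add(Or(Not(a), c))   # a implies c
    s.add(Not(c))          # c is false

    if s.check() == sat:
        print(s.model())   # e.g. a = False, b = True, c = False
    else:
        print("unsatisfiable")

The solver either produces a model that provably satisfies every constraint or reports that none exists; there is no "mostly right" answer.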
"Write a history of the Greek language but reverse it, so that one would need to read it from right to left and bottom to top."
ChatGPT wrote the history and showed absolutely no awareness, let alone "understanding," of the second half of the prompt.
Edit: scratch that, it thought there was a six letter word starting with "trs" and then changed its mind to "tre" when I guessed "e." Hilarious.
civilization Mycenaean the of practices religious and economic, administrative the into insights invaluable provides and B Linear as known script the in recorded was language Greek the of form attested earliest The
Edit: realized just now that my summary of the 'test' failed to specify the request fully: the letters need to be reversed, too. Maybe I'm just bad with AI tools, because I didn't even get a response that 'this like looked' (i.e. reversed the order of the words).
It might work in an agent system where it can make and execute code to solve problems.
Right, the current paradigm of requiring an LLM to do arbitrary-digit multiplication will not work, and we shouldn't need it to. If your task is "do X" and it can be reliably accomplished with "write a python program to do X", that's good enough as far as I'm concerned. It's preferable, in fact.
Btw Chollet has said basically as much. He calls them “stored programs” I think.
I think he is onto something. The right atomic unit for approaching these problems is probably not the token, at least at first. Higher-level abstractions should be refined down to specific components, similar to the concept of diffusion.
The ChatGPT input field still says ‘Ask anything’, and that is what I shall do.
__________________
Answers: $1
Thoughtful Answers: $5
Correct Answers: $50
Dumb Looks are Free
But in that case, why an LLM? If we want question-answering machines to be reliable, they must have the skills, which include "counting" just as a basic example.
For what it’s worth, people are also pretty bad at math compared to calculators. We are slow and error prone. That’s ok.
What I was (poorly) trying to say is that I don’t care if the neural net solves the problem if it can outsource it to a calculator. People do the same thing. What is important is reliably accomplishing the goal.
----
What's 494547645908151+7640745309351279642?
ChatGPT said: The sum of 494,547,645,908,151 and 7,640,745,309,351,279,642 is:
7,641,239,857,997,187,793
----
(7,641,239,856,997,187,793 is the correct answer)
> Let's calculate: 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> 494,547,645,908,151 + 7,640,745,309,351,279,642 = 7,641,239,856,997,187,793
> Answer: 7,641,239,856,997,187,793
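For what it's worth, Python's arbitrary-precision integers settle this in one line, which is exactly the "write a program to do X" pattern mentioned upthread:

    # Python ints are arbitrary precision, so the exact sum is one expression.
    a = 494_547_645_908_151
    b = 7_640_745_309_351_279_642
    print(f"{a + b:,}")  # 7,641,239,856,997,187,793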
Sounds like a use-case for property testing: https://en.wikipedia.org/wiki/Software_testing#Property_test...
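A minimal property-test sketch with the Hypothesis library (ask_llm_to_add is a hypothetical wrapper around whatever model you're testing, not a real API):

    # Property test: the model's answer must equal Python's exact arithmetic
    # for randomly generated inputs, not just for a fixed benchmark set.
    from hypothesis import given, settings, strategies as st

    def ask_llm_to_add(a: int, b: int) -> int:
        """Hypothetical: send "What's {a}+{b}?" to a model and parse the reply."""
        raise NotImplementedError

    @settings(max_examples=50, deadline=None)
    @given(st.integers(min_value=0, max_value=10**20),
           st.integers(min_value=0, max_value=10**20))
    def test_llm_addition(a, b):
        assert ask_llm_to_add(a, b) == a + b

Because the inputs are generated fresh on every run, there is nothing static for a model to have memorized.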
That seems to be because LLMs aren't able to reliably follow procedures (e.g. counting).
I'm not sure I understand what that means - could you explain please?
Now you can imagine giving an LLM arbitrary validity rules for generating text. I think that’s what they mean by “grammar”.
LLMs operate on tokens, which are words or word fragments; they have limited ability to work on a letter-by-letter basis. They can't reliably count letters in a sentence, for example. "Give it a string and a grammar and ask it to generate the string from the grammar" can't be done by inference alone because of this: they would generate tokens that don't match the grammar.
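You can see the token boundaries directly with the tiktoken package (my example; the exact split depends on the encoding):

    # Show how a word is split into tokens; the model "sees" these chunks,
    # not individual letters, which is why letter-level tasks are hard.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print([enc.decode([i]) for i in ids])  # e.g. ['str', 'aw', 'berry']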
But you can use a grammar-based sampler and it'll generate valid strings just fine. llama.cpp can easily do this if you provide a grammar specification in its GBNF format.
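A rough sketch of what that looks like with the llama-cpp-python bindings (the model path is a placeholder and I'm going from memory on the LlamaGrammar API, so treat it as a sketch rather than working code):

    # Constrained decoding: sampling is masked so only tokens that keep the
    # output inside the grammar are ever emitted.
    from llama_cpp import Llama, LlamaGrammar

    gbnf = 'root ::= ("Rome " | "Paris " | "London ")* "end_path"'
    grammar = LlamaGrammar.from_string(gbnf)

    llm = Llama(model_path="model.gguf")  # placeholder path
    out = llm("Give me a route:", grammar=grammar, max_tokens=32)
    print(out["choices"][0]["text"])      # output is forced to match the grammar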
Changing my tests from the strings I was interested in to common words of four or more letters _did_ improve the ability of reasoning LLMs to get the right answer, at the cost of the context exploding to thousands of tokens.
Unfortunately I can't tell you by how much because the couple of dozen tests I did after reading your post ate my $50 I keep in an account for these types of things.
The following question ate through 8k thinking tokens to get the right answer in Claude 3.7 Sonnet Extended:
---
Given the following grammar:
<start> ::= <path>
<path> ::= Rome <path> | Paris <path> | London <path> | end_path <routes>
<routes> ::= <path> | end_route <company>
<company> ::= end_company | <path>
Is the following sentence valid: Rome Paris Rome end_path Rome London end_path end_company
---
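For what it's worth, membership in this grammar can be decided without an LLM at all; here's a small backtracking recognizer sketch (my own code, assuming the grammar exactly as written above):

    # Backtracking recognizer: try to derive the sentence from <start>.
    GRAMMAR = {
        "start":   [["path"]],
        "path":    [["Rome", "path"], ["Paris", "path"], ["London", "path"],
                    ["end_path", "routes"]],
        "routes":  [["path"], ["end_route", "company"]],
        "company": [["end_company"], ["path"]],
    }

    def derive(symbol, tokens, pos):
        """Return every position reachable after expanding `symbol` at `pos`."""
        if symbol not in GRAMMAR:  # terminal symbol
            return {pos + 1} if pos < len(tokens) and tokens[pos] == symbol else set()
        ends = set()
        for production in GRAMMAR[symbol]:
            positions = {pos}
            for part in production:
                positions = {q for p in positions for q in derive(part, tokens, p)}
            ends |= positions
        return ends

    sentence = "Rome Paris Rome end_path Rome London end_path end_company".split()
    print(len(sentence) in derive("start", sentence, 0))  # True iff the sentence is valid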
Incidentally it got the right answer no less than 4 times in the thinking token stream. I'd not seen this model act like this before.
I feel like this description really buries the lede on Chollet's expertise. (For those who don't know, he's the creator of and lead contributor[0] to Keras)
Seems to me that a lot of folks are enjoying having an LLM rewrite their email or whatever, but I wonder how many are actually buying the rest of it. The companies themselves sure aren't helping.
"This ride was longer and harder than usual" - no sh*t, the map, elevation profile and my legs have already informed me.
"You set 3 new PRs" - I can see that with one flick of the thumb thank you.
"Consider a rest day" - consider? None of my job, spouse, equipment or body is too keen on doing that again for a while.
I saw the Apple Intelligence presentation a while ago, and in the span of five minutes they had someone asking the assistant to expand a one-liner into an e-mail, and then someone receiving a long e-mail and asking the AI to summarize it.
We spun GPUs to expand, then spun them again to summarize. Gold.
It's somehow LESS helpful than just having a pointer to which part of the text to revisit.
It basically just restates the question/problem. Even worse, it's an essentially STATIC note for each question, yet it appears to be GENERATED IN REAL TIME each time. I guess that could just be for appearances, but it's dumb all the way around, really.
Francois is out to push the boundaries of science and help create models that are truly more intelligent.
Supposedly, they validated it upon release by showing each task to at most nine people and only keeping the ones that at least two people got correct in two tries. But still, they have had to subsequently fix more than a dozen of them.
There are a number of skill signals we demand from an intelligence.
Mind you: some of them are achieved - like the ability to interpret pronouns (Hinton's "the trophy will not enter the case: it's too big" vs "the trophy will not enter the case: it's too small").
Others, we meet occasionally when we are not researching said requirements systematically: one example is the detective game described at https://news.ycombinator.com/item?id=43284420 - a simple game of logic that an intelligence should be able to solve (...and yet, again, some rebutted that humans would fail it, etc.).
It remains important though that those working modules are not siloed (solving specific tasks and remaining unused otherwise): they must be intellectual keys adapted into use in the most general cases they can be helpful in. That's important in intelligence. So, even the ability to solve "revealing" tasks is not enough - the way in which the ability works is crucial.
https://chatgpt.com/share/67ef43f4-3b88-800d-a5a3-e3ffea178f...
(Me trying to describe a desk top with a fold down hinged top, and it just drawing whatever)
"News just in: journalist for the Atlantic stops reasoning and drifts in a world of feelings after neural hijacking, as he perceives abilities as some kind of threat".
> Human cognitive diversity [...] when that diversity is already so abundant, do you really want to?
We definitely need intelligence.