Amateur armed with ChatGPT solves an Erdős problem

Posted by pr337h4m 18 hours ago

Amateur armed with ChatGPT solves an Erdős problem (www.scientificamerican.com)

https://www.erdosproblems.com/1196

453 points | 288 comments

ravenical 9 hours ago|

adamgordonbell 9 hours ago||

Here is the chat:

    don't search the internet. This is a test to see how well you can craft non-trivial, novel and creative proofs given a "number theory and primitive sets" math problem. Provide a full unconditional proof or disproof of the problem.

    {{problem}}

    REMEMBER - this unconditional argument may require non-trivial, creative and novel elements.

Then "Thought for 80m 17s"

https://chatgpt.com/share/69dd1c83-b164-8385-bf2e-8533e9baba...

urutom 2 hours ago||

What I find fascinating about the shared prompt isn’t just the result, but the visible thinking process. Math papers usually skip all the messy parts and just present the polished proof. But here you get something closer to their notepad. I also find it oddly endearing when the AI says things like “Interesting!” It almost feels like a researcher encouraging themselves after a bit of progress. It gives me the rare feeling of watching the search itself, not just the final result.

notahacker 2 minutes ago||

The actual iteration through various learned approaches to dealing with problems I'd probably find fascinating if I understood the maths! Especially if I knew it well enough to know which approaches were conventional and which weren't.

I find the AI pronouncing things "interesting!" less interesting on the basis that even though in this case it crops up in the thinking rather than flattering the user in the chat, it's almost as much of an AI affectation as the emdash.

chvid 2 hours ago|||

I am curious if there is a “harness” for maths out there (like the system prompt and tool collection in Claude code but for maths instead of coding)?

Asking the llm to structure its response in plan and implementation, allowing it to call tools like python, sage, lean etc.
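To make the idea concrete, here is a minimal sketch of what the tool-dispatch part of such a harness could look like. Everything in it is an assumption for illustration: the `TOOLS` registry, the `run_plan` function, and the two toy tools are hypothetical names, not part of any existing product. A real harness would wrap Sage or Lean behind the same kind of interface.

```python
from typing import Callable, Dict, List, Tuple


def factorize(n: int) -> List[int]:
    """Prime factorization by trial division (fine for small n)."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


# Hypothetical tool registry; a real harness would also expose Sage or Lean here.
TOOLS: Dict[str, Callable[[int], object]] = {
    "factorize": factorize,
    "is_prime": lambda n: n > 1 and factorize(n) == [n],
}


def run_plan(plan: List[Tuple[str, int]]) -> List[str]:
    """Execute each (tool, argument) step and return a readable transcript."""
    transcript = []
    for tool_name, arg in plan:
        if tool_name not in TOOLS:
            transcript.append(f"{tool_name}({arg}) -> error: unknown tool")
            continue
        transcript.append(f"{tool_name}({arg}) -> {TOOLS[tool_name](arg)}")
    return transcript


for line in run_plan([("factorize", 84), ("is_prime", 97)]):
    print(line)
```

The model would produce the plan, the harness would run it, and the transcript would be fed back as context for the next step.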

brandensilva 1 hour ago||

Also curious about this, it seems like it would be important to guide these tools more specifically based on the domain of expertise.

nycdatasci 8 hours ago|||

Tried w/ 5.5 Pro, Extended Thinking. 17 minutes:

-----------------------------

Yes. In fact the proposed bound is true, and the constant 1 is sharp.

Let w(a) = 1/(a log a).

I will prove that, uniformly for every primitive A ⊂ [x, ∞), ∑_{a∈A} w(a) ≤ 1 + O(1/log x), which is stronger than the requested 1 + o(1).

https://chatgpt.com/share/69ed8e24-15e8-83ea-96ac-784801e4a6...
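For anyone who wants to poke at the statement numerically, here is a rough sketch that checks whether a finite set is primitive (no element divides a different one) and evaluates the weight 1/(a log a) summed over it. The function names are my own, and a finite computation like this is only a sanity check on small examples, not evidence for or against the bound.

```python
import math
from typing import Iterable, List


def is_primitive(A: Iterable[int]) -> bool:
    """True if no element of A divides a different element of A (all a >= 2)."""
    items = sorted(set(A))
    return not any(
        b % a == 0 for i, a in enumerate(items) for b in items[i + 1:]
    )


def weighted_sum(A: Iterable[int]) -> float:
    """Sum of 1 / (a * log a) over the distinct elements of A (all a >= 2)."""
    return sum(1.0 / (a * math.log(a)) for a in set(A))


def primes_in(lo: int, hi: int) -> List[int]:
    """Primes p with lo <= p <= hi via a simple sieve (assumes hi >= 2)."""
    sieve = [True] * (hi + 1)
    sieve[0:2] = [False, False]
    for p in range(2, math.isqrt(hi) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(range(p * p, hi + 1, p))
    return [p for p in range(max(lo, 2), hi + 1) if sieve[p]]


# The primes form a primitive set, so they make an easy first example.
A = primes_in(10, 1000)
print(is_primitive(A), weighted_sum(A))
```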

mrabcx 24 minutes ago||

Tried the same prompt in DeepSeek 4

https://chat.deepseek.com/share/nyuz0vvy2unfbb97fv

Comes up with a proof.

cryptoegorophy 9 hours ago|||

Mine took 20min. Pro. https://chatgpt.com/share/69ed83b1-3704-8322-bcf2-322aa85d7a... But I wish I was math smart to know if it worked or not.

liweic 3 hours ago|||

Weirdly enough, Pro + extended with the same prompt just output the answer directly without thinking: https://chatgpt.com/s/t_69edd2d9dc048191b1476db92c0dedf8 . Does this mean the result was cached, or that it silently routes to a different model based on the user?

Vachyas 2 hours ago||

The link you provided is for a canvas I think rather than the convo

vjerancrnjak 7 hours ago|||

Ask it to formalize it in Lean.

utopiah 6 hours ago|||

If they aren't "smart enough" to know if it works, they most likely are also unable to verify that the Lean formalization actually matches the problem they were trying to solve.

timjver 3 hours ago||

Verifying that every step in a (potentially long) proof is sound can of course be much, much harder than verifying that a definition is correct. That's kind of the whole point.

LeCompteSftware 3 hours ago||

That's not what the parent comment meant. They meant checking the Lean-language definitions actually match the mathematical English ones, and that the Lean theorems match the ones in the paper. If that's true then you don't actually need to check the proofs. But you absolutely need to check the definitions, and you can't really do that without sufficient mathematical maturity.

smallnamespace 3 hours ago||

Yes, and the child comment’s point is that formalizing the problem is likely easier than having the LLM verify that each step of a long deduction is correct, which is why Lean might be helpful.

LeCompteSftware 44 minutes ago||

But both of you are ignoring the parent comment! Actually you're ignoring the context of the thread.

Originally someone said "I wish I was math smart to know if [this vibe-mathematics proof] worked or not." They did NOT say "I'd like to check but I am too lazy." Suggesting "ask it to formalize it in Lean" is useless if you're not mathematically mature enough to understand the proof, since that means you're not mathematically mature enough to understand how to formalize the problem.

Then "likely easier" is a moot point. A Lean program you're not knowledgeable enough to sanity-check is precisely as useless as a math proof you're not knowledgeable enough to read.
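A small Lean 4 sketch of why the definitions need a knowledgeable reader. It assumes Mathlib is available, and both names are hypothetical, chosen only for this illustration. The two definitions look interchangeable and agree on sets of positive integers, but they differ once 0 is allowed.

```lean
import Mathlib

/-- Intended definition: no element of `A` divides a different element of `A`. -/
def PrimitiveSet (A : Set ℕ) : Prop :=
  ∀ a ∈ A, ∀ b ∈ A, a ∣ b → a = b

/-- A look-alike that only forbids a smaller element dividing a larger one.
    It accepts `{0, 5}`, because `5 ∣ 0` holds while `5 < 0` does not,
    whereas `PrimitiveSet` rejects that set. -/
def PrimitiveSetLookalike (A : Set ℕ) : Prop :=
  ∀ a ∈ A, ∀ b ∈ A, a < b → ¬ (a ∣ b)
```

For sets inside [x, ∞) with x ≥ 1 the difference never shows up, but spotting that is exactly the kind of judgment that takes mathematical maturity.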

utopiah 14 minutes ago||

thanks

dbdr 6 hours ago||||

That's great if it works. But it's way harder to produce a formal proof. So my expectation is that this will fail for most difficult problems, even when the non-formal proof is correct.

DonHopkins 5 hours ago|||

Formalize this in the form of an Iranian Lego Trump Dis Rap video.

sfdlkj3jk342a 56 minutes ago|||

When using the web interface for ChatGPT like this, is there any way to tell which model is actually being used?

DeathArrow 2 hours ago|||

>don't search the internet.

I think this was key. Otherwise the LLM could think it can't be done.

amelius 52 minutes ago|||

But it was trained on the internet.

embedding-shape 2 hours ago|||

"Knowing" (guessing really) what is possible and not is a huge deciding factor in if you can do that thing or not, meaning if you "know" it isn't possible you'll probably never be able to do it, but if you didn't know it wasn't possible, it is possible :)

ipaddr 9 hours ago|||

Tried the same prompt and ended up nowhere close on the free plan.

jasonfarnon 9 hours ago|||

Is there a known lag before the Pro plan's abilities migrate to the free plans?

brianjking 9 hours ago|||

As far as consumer access goes, GPT 5.5 Pro is not available on any plan outside the ChatGPT Pro ($100 or $200) tier or the API.

jasonfarnon 8 hours ago|||

Yes, but don't we expect GPT 5.5 Pro will eventually be a free tier? Maybe I'm missing something because I only use the free tier. But the free tier has gotten way better over the last few years. I'm pretty sure, based on descriptions on this site from paid subscribers, that the free tier now is better than the paid tier of say 2 years ago. That's the lag I'm wondering about.

manfromchina1 8 hours ago|||

Free ChatGPT is like a fast car with a barely responsive steering wheel. Guardrails on that thing are insane. Even for math. It won't let you think. It will try to fix mistakes you haven't even made yet, based on intent that was ascribed to you for no reason. It veers off in some crazy directions thinking that's what you meant, and trying to address even a little bit of that creates almost a combinatorial explosion of even more wrong things. That is why I stick to Claude. The latter is chill and only addresses what you typed. It isn't verbose and actually asks what you are getting at with your post. That said, ChatGPT is more technical and can easily solve math problems that stump Claude.

nextaccountic 5 hours ago||

So this doesn't happen in the paid plans of ChatGPT? But why?

virgildotcodes 2 hours ago||

Paid plans give you access to much larger, more intelligent models which have thinking enabled (inference time compute). In the example here you can see GPT Pro taking 20-80 minutes to respond with the proof.

All this is far more expensive to serve so it’s locked away behind paid plans.

vessenes 8 hours ago||||

I do not think this is true. You will continue to get smaller, cheaper-to-host models in the free tier that are distilled from current and former frontier models. They will continue to improve, but I’d be very surprised if, e.g., 5.4-mini (I think this is the free tier model) beat o3 on many benchmarks, or real world use cases.

I won’t even leave chatGPT on “Auto” under any circumstances - it’s vastly worse on hallucinations, sycophancy, everything, basically.

Anyway, your needs may be met perfectly fine on the free tier product, but you’re using a very different product than the Pro tier gets.

hyraki 8 hours ago|||

You should pay for it if you find value in it.

amazingman 7 hours ago||

They pay for it with their personal data.

andai 9 hours ago||||

Tangential but I learned today that GPT-5.5 in ChatGPT (Plus) has a smaller context window than the one in the API. (Or at least it thinks it does.)

I'd guess / hope the Pro one has the full context window.

refulgentis 8 hours ago||

Notably, on the API 5.5 costs more for context beyond what ChatGPT offers, and 5.5 Pro on the API does not differentiate based on context size (it’s eye-bleedingly expensive already :)

vessenes 9 hours ago|||

Do not use the free plan. It is not good.

Someone1234 9 hours ago||||

Does the free plan even have access to thinking models?

jychang 9 hours ago||

Technically yes, gpt-5.4-mini is available on the free plan

Matticus_Rex 9 hours ago|||

Was this a surprise?

CSMastermind 7 hours ago||

For the uninitiated, Paul Erdős was a pretty famous but very eccentric mathematician who lived for most of the 1900s.

He had a habit of seeking out and documenting mathematical problems people were working on.

The problems range in difficulty from "easy homework for a current undergrad in math" to "you're getting a Fields Medal if you can figure this out".

There's nothing that really connects the problems other than the fact that one of the smartest people of the last 100 years didn't immediately know the answer when someone posed it to him.

One of the things people have been doing with LLMs is to see if they can come up with proofs for these problems as a sort of benchmark.

Each time there's a new model release a few more get solved.

energy123 6 hours ago||

> Each time there's a new model release a few more get solved.

I'm no expert, but based on the commentary from mathematicians, this Erdős proof is a unique milestone because the problem received previous attention from multiple professional mathematicians, and the proof was surprising, elegant, and revealed some new connections.

The previous ChatGPT Erdős proofs have been qualitatively less impressive, more akin to literature search or solving easier problems that have been neglected.

Reading the prompt[1], one wonders if stoking the model to be unconventional is part of the success: "this ... may require non-trivial, creative and novel elements"

[1] https://chatgpt.com/share/69dd1c83-b164-8385-bf2e-8533e9baba...

sigmoid10 4 hours ago|||

>one wonders if stoking the model to be unconventional is part of the success

I've long suspected that a lot of these models' real capabilities are still locked behind certain prompts, despite the big labs spending tons of effort on making default responses to simple prompts better. Even really dumb shit like "Answer this: ..." vs "Question: ..." vs "... you'll be judged by <competitor>" that should have zero impact in an ideal world can significantly impact benchmark results. The problem is that you can waste a ton of time finding the right prompt using these "dumb" approaches, while the model actually just required some very specific context that was obvious to you and not to it in many day-to-day situations. My go-to method is still to have the model ask me questions as the very first step to any of these problems. They kind of tried that with deep research since the early o-series, but it still needs improvement.

omcnoe 49 minutes ago|||

Model output reflects on your input, and the effect is self reinforcing over the course of a whole conversation. Color you add around a problem influences the model behavior.

A "dumber"/vague framing will get a less insightful solution, or possibly no solution at all.

I don't even necessarily think this is a critical flaw. In general it's just the model tuning its responses to your style of prompt. People utilize LLMs for all kinds of different tasks, and the "modes of thought" for responding to an Erdős problem versus software engineering versus a more human/soft skills topic are all very different. I think the "prompt sensitivity" issue is just coming bundled along with this general behavior.

burnerRhodov2 3 hours ago||||

Just the right "prompt" is exactly what happened here. Lean has been developed and incorporated into its data set. Also, token responses only vaguely correlate to "human language", and it's been proven that transformers develop their own internal representations, which has created a whole field called mechanistic interpretability. Being able to more correctly "parse", AKA using Lean and the right "prompts, insights and suggestions", will take on a whole new meaning in the future.

bonesss 3 hours ago||

> mechanistic interpretability

Awesome term/info, and (completely orthogonal to whether they’ll take err jerbs): I’m really excited about the social/civic picture that might be enabled by a defined and verifiable ontological and taxonomical foundation shared across humanity, particularly coupled with potential ‘legislation as code’ or ‘legal system as code’ solutions.

I’m thinking on a time horizon a bit past my own lifespan, but: even the possibility to objectively map out some specific aspect of a regional approach to social rights in a given time period and consider it with another social framework, alongside automated & verifiable execution of policy, irrespective of the language of origin is incredible.

Instead of hundreds and thousands of incommensurate legislative silos we might create a bazaar of shared improvement and governance efficiency. Turnkey mature governance and anti-corruption measures for newborn nations and countries trying to break out of vicious historical exploitation cycles. Fingers crossed.

dalmo3 1 hour ago||

Ah, yes, 2001 but on land.

muzani 1 hour ago|||

They're tuned to target a certain customer demographic solving for certain problems. I've seen standard AI models do absolutely brilliant things sometimes. But the prompts to get them to perform like GPT-3 did seem to get lengthier and lengthier over time. At some point we'll probably just snip out smaller, specialized models to do certain things.

hyperpape 2 hours ago||||

> “The raw output of ChatGPT’s proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say,” Lichtman says. But now he and Tao have shortened the proof so that it better distills the LLM’s key insight.

Interestingly, it was an elegant technique, but the proof still required a lot of work.

fulafel 5 hours ago||

The article is about solving a previously unsolved one. This is a harder set of course.

etaKl 2 minutes ago||

1) How do you know the clanker respects the instruction not to search the internet?

2) Jared Lichtman is indeed a mathematician at Stanford University but is also involved in the AI startup math.inc, which seems more relevant here. Terence Tao is involved in a partnership program with that startup.

3) Liam Price is a general AI booster on Twitter. A lot of AI boosting on Twitter is not organic and who knows what help he got. Nothing in this Twitter is organic.

4) Scientific American is owned by Springer Nature, which is an AI booster:

https://group.springernature.com/gp/group/ai

shybear 7 hours ago||

It seems like a lot of scientific advancements occurred by someone applying technique X from one field to problem Y in another. I feel like LLMs are much better at making these types of connections than humans because they 1) know about many more theories/approaches than a single human can and 2) don't need to worry about looking silly in front of their peers.

renticulous 2 minutes ago||

> someone applying technique X from one field to problem Y in another

Witten is the canonical example of someone taking mathematics techniques and applying them to physics problems, but what made him legendary was the opposite direction: he used physical intuition and string theory to solve open problems in pure mathematics.

esjeon 5 hours ago|||

Exactly. Much of the intellectual work is, in fact, intellectual labor. It’s mostly about combining various information in one place, which is the exact task at which LLMs far outperform humans. People traditionally misclassified this class of work as “creative”. It’s not really.

Jtarii 3 hours ago|||

Having a new insight that leads to the combination of two distinct ideas is definitionally creative.

You can say this problem needed a low amount of total creativity, but saying it's void of all creativity seems wrong.

versteegen 3 hours ago||||

I agree except: this is creative work. Creativity can be and is being mechanised. True originality is extremely rare. Most novelty is the repurposing of one idea or concept elsewhere in a way we find surprising, but the choice to apply A to B could have been made for any reason, including a mechanical one: very many inventions are accidents. In-depth knowledge / conceptual understanding of something is built on abstraction, and abstractions are portable.

If you had a list of N concepts and M ways to apply them you could try all N*M combinations, and get some very interesting results. For a real example, see the theory of inventive problem solving (TRIZ)'s amusing "40 principles of invention" by Soviet inventor Genrich Altshuller. https://en.wikipedia.org/wiki/TRIZ
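As a toy sketch of the N*M enumeration, with placeholder lists I made up for illustration (they are not taken from TRIZ or from the proof discussed here):

```python
from itertools import product
from typing import List, Tuple


def all_pairings(concepts: List[str], applications: List[str]) -> List[Tuple[str, str]]:
    """Every (concept, application) pair: len(concepts) * len(applications) items."""
    return list(product(concepts, applications))


# Placeholder inputs chosen only for illustration.
concepts = ["sieve methods", "Markov chains", "generating functions"]
applications = ["primitive sets", "graph colouring"]

pairs = all_pairings(concepts, applications)
print(len(pairs))  # 3 * 2 = 6
for concept, application in pairs:
    print(f"apply {concept} to {application}")
```

Enumerating the pairs is the cheap part. Deciding which of them are worth pursuing is where the actual work is.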

_Microft 4 hours ago||||

What is your idea of "creative"/"creativity" then?

moffkalast 3 hours ago||

Coming up with said novel techniques in the first place. Arguably something that most humans can't really do reliably or at all.

jvln 22 minutes ago||

I always thought that way about genius level.

dorgo 5 hours ago||||

Maybe all intellectual work is intellectual labor?

raincole 4 hours ago||||

This is exactly what creativity is.

locknitpicker 5 hours ago||||

> Much of the intellectual work is, in fact, intellectual labor.

That's a great point. It's in line with research being carried on the backs of graduate students, whose work is to hyperfocus on areas.

gardenhedge 5 hours ago||||

Isn't that science too?

hansmayer 3 hours ago|||

> Much of the intellectual work is, in fact, intellectual labor.

Not surprising, because the two words you used are synonyms. Who ever classified mathematical work as creative? Kids in third grade math class?

> the exact task at which LLMs far outperform humans.

LLMs only outperform humans in creating loads of bullshit. 6 years in and they remain shiny toys for easily impressionable idiots.

squidbeak 1 hour ago|||

As I understand it, models form connections (weak or strong) between everything in their training sets, even the smallest details. They've already made other breakthroughs directly because of this ability and this line of research is likely to be incredibly fruitful.

freakynit 7 hours ago|||

This is what I personally consider as "reasoning" ... knowledge generalization and application across domains.

jdub 6 hours ago||

Less reasoning than a dimension of brute force unfamiliar to human brains.

squidbeak 1 hour ago|||

Trying to diminish this as brute force (something, by the way, that is categorically not 'unfamiliar to human brains', as anyone who has ever worked on complex slippery problems will tell you) is foolish when the models hypothesize along the way to their solutions. That's reasoning.

worldsavior 5 hours ago|||

Familiar but isn't effective enough for surviving.

bojo 7 hours ago|||

This is what I have been doing. I don't think I've made any amazing breakthroughs, but at the same time I can't help but feel like I've come across some white paper-worthy realizations. Being able to correlate across a lot of domains I feel like I intuitively understand but have no depth of knowledge has been a fun exercise in LLM experimentation.

some_furry 6 hours ago|||

> It seems like a lot of scientific advancements occurred by someone applying technique X from one field to problem Y in another.

Yeah, you should look into the Langlands project sometime

pfdietz 2 hours ago||

I'm thinking once we have much of the math literature formalized it's going to be possible to mine commonalities like that. Think of it as automated refactoring, applied to math.

trhway 5 hours ago|||

As a civilization we went the left-brained/sequential/language-based way of thinking (with computers and AI being its crowning achievement). Personally, I remember that around 3rd grade I switched from a whole-page-at-once reading mode to a word-by-word, line-by-line mode, and that mode has stuck with me since. (At university, for some period that was probably the peak of my abilities, I had a deeper, wider, non-linear perception of at least my area of math specialization, though I'm not sure whether that was mastery by the left brain or the right brain getting plugged in too.) LLMs will definitely beat us at that sequential way of thinking. That makes me wonder whether we will have to push into whatever right-brainness we still have left, and whether AI will get there faster too. Maybe we'll abandon the left brain completely, leaving it to AI.

kbrkbr 4 hours ago||

If that is your hope you are probably in for a rude awakening. Left brained/right brained is a wooden exaggeration according to more recent research [1].

[1] e.g. https://www.sciencenewstoday.org/left-brain-vs-right-brain-t...

chrisweekly 28 minutes ago||

Well, maybe. The poster you replied to wasn't discussing literal neuroanatomy, they were using "left/right-brained" in the colloquial, metaphorical sense.

pelasaco 2 hours ago|||

Accuracy and creativity are often quite difficult to achieve at the same time. It looks like LLMs can do it, even though one can question how creative it really is...

squidbeak 1 hour ago||

Can one? It's surpassed the creativity of humans in this one problem at least.

LPisGood 8 hours ago||

Some Erdős problems are basically trivial using sophisticated techniques that were developed later.

I remember one of my professors, a coauthor of Erdős, boasted to us after a quiz about how proud he was that he was able to assign an Erdős problem that went unsolved for a while as just a quiz problem for his undergrads.

CSMastermind 7 hours ago||

Worth mentioning, though, that people have already tried running all of them through LLMs at this point.

So this is proof of the models actually getting stronger (previous generations of LLMs were unable to solve this one).

Tarq0n 6 hours ago|||

Not definitively. LLMs are stochastic with respect to input, temperature and the exact prompt. It's possible that the model was already capable of it but never received the exact right conditions to produce this output.

teiferer 5 hours ago||

Every model is able to solve each problem, given the right prompt. (Worst case, the prompt contains the solution.)

pontifier 48 minutes ago||

Interesting... Exhaustive brute force prompting might expose previously unknown capabilities in existing models. Seems like a whole can of worms.

imiric 6 hours ago||||

> So this is proof of the models actually getting stronger (previous generations of LLMs were unable to solve this one).

No, it's not.

While I don't dispute that new models may perform better at certain tasks, the fact that someone was able to use them to solve a novel problem is not proof of this.

LLM output is nondeterministic. Given the same prompt, the same LLM will generate different output, especially when it involves a large number of output tokens, as in this case. One of those attempts might produce a correct output, but this is not certain, and it is difficult if not impossible for a human who is not an expert in the domain to determine this, as shown in this thread.
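A back-of-the-envelope sketch of why one success says little about a model's per-attempt ability. It assumes attempts are independent with a fixed success probability p, which is a simplification, and the function name is my own.

```python
def p_at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


# A model that succeeds only 5% of the time per attempt, tried 20 times.
print(p_at_least_one_success(0.05, 20))
```

With those made-up numbers the chance of seeing at least one success is about 0.64, so a single reported success cannot distinguish a weak model that was sampled many times from a strong model that was sampled once.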

jb1991 6 hours ago|||

Minor aside, these models do not return the same answer every time you prompt them. That makes it harder to reason about their effectiveness.

rjh29 6 hours ago||

You don't need to say "Minor aside" either. Thankfully language is a creative endeavour not a scientific one.

vessenes 8 hours ago||

Tao mentions that the conventional approach for this problem seems to be a dead-end, but it’s apparently a super ‘obvious’ first step. This seems very hopeful to me — in that we now have a new approach line to evaluate / assess for related problems.

debo_ 9 hours ago||

> “The raw output of ChatGPT’s proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say,” Lichtman says.

This is how I feel when I read any mathematics paper.

torginus 1 hour ago|

Tbh, a ton of academic papers are quite poorly written. I'm not a PhD researcher, but I did have to implement quite a few of them (computer graphics, signals & systems, etc.), and with most of them I basically had to reconstruct the author's thought process from scratch.

The formulas were opaque, the notation unique and unconventional, terms appeared out of nowhere, and sometimes standard techniques (like 'we did least-squares optimization') were expanded in detail while other, actually complex parts were glossed over.

menno-dot-ai 24 minutes ago|||

My short academic career, where I did my share of "what the hell are they saying they did" reverse engineering of others' papers, proved to be excellent training for when I eventually transitioned to engineering.

yfee 23 minutes ago|||

The standard has fallen over the years for obvious reasons.

gorgoiler 4 hours ago||

I asked ChatGPT to draw the outline of an ellipse using Unicode braille. I asked for 30x8 and it absolutely nailed it. A beautiful piece of ascii (er, Unicode) art. But I wanted to mark the origin! So I asked for a 31x7 ellipse instead. It completely flubbed it, and for 31x9 too.

When a model gives a really good answer, does that just mean it’s seen the problem before? When it gives a crappy answer, is that not simply indicating the problem is novel?
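For comparison, the task itself is mechanical when done in code. Here is a rough sketch, assuming "30x8" means 30 characters wide by 8 characters tall (each braille character holds a 2x4 grid of dots); the function name and the sampling approach are my own choices.

```python
import math

# Bit for the dot at (row, col) inside one braille cell (4 rows x 2 columns),
# following the Unicode braille block that starts at U+2800.
DOT_BITS = [[0x01, 0x08], [0x02, 0x10], [0x04, 0x20], [0x40, 0x80]]


def braille_ellipse(cols: int, rows: int, samples: int = 2000) -> str:
    """Outline of an ellipse filling a cols x rows grid of braille characters."""
    width, height = cols * 2, rows * 4            # resolution in dots
    cells = [[0] * cols for _ in range(rows)]
    cx, cy = (width - 1) / 2, (height - 1) / 2    # centre in dot coordinates
    for i in range(samples):
        t = 2 * math.pi * i / samples
        x = round(cx + cx * math.cos(t))          # dot column, 0 .. width - 1
        y = round(cy + cy * math.sin(t))          # dot row, 0 .. height - 1
        cells[y // 4][x // 2] |= DOT_BITS[y % 4][x % 2]
    return "\n".join("".join(chr(0x2800 + c) for c in row) for row in cells)


print(braille_ellipse(30, 8))
print(braille_ellipse(31, 7))
```

Odd and even sizes go through the same code path here, which is part of why the difference in the model's answers is interesting.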

ghusbands 43 minutes ago|

Do you posit that there are enough examples of 30x8 ellipses encoded in braille online for ChatGPT to learn from but not 31x7 or 31x9 ellipses? That seems unlikely.

ripped_britches 8 hours ago||

At this point we should make a GitHub repo with a huge list of unsolved “dry lab” problems and spin up a harness to try and solve them all every new release.

abdullahkhalids 8 hours ago||

There is in fact just such a repo maintained by Terence Tao and other mathematicians [1] who are actively using LLMs to try to find solutions to them.

[1] https://github.com/teorth/erdosproblems

vessenes 8 hours ago||

…and this problem was in fact sourced directly from that list!

CSMastermind 7 hours ago|||

That's literally what the Erdős problems are. This post is about one of them being solved.

josefx 6 hours ago||

Except that Erdős problems are solved all the time, so many of them are already solved. Quite sure the last time I saw an article about an LLM solving an Erdős problem someone even tracked down a solution published by Erdős himself.

johntopia 8 hours ago||

that's actually a brilliant idea

utopiah 6 hours ago|

Mandatory disclaimers https://github.com/teorth/erdosproblems/wiki/Disclaimers-and...

logicprog 37 minutes ago|

They explicitly say many of these disclaimers don't apply in the article.

utopiah 16 minutes ago||

Which one do you trust most, the disclaimers or the article?
