CPUs Aren't Dead. Gemma2B Out Scored GPT-3.5 Turbo on Test That Made It Famous

Posted by fredmendoza 3 hours ago

CPUs Aren't Dead. Gemma2B Out Scored GPT-3.5 Turbo on Test That Made It Famous(seqpu.com)

88 points | 45 comments

semiquaver 1 hour ago|

This really shows the power of distillation. One thing I find amusing: download the Google Edge Gallery app and one of the chat models, then go into airplane mode and ask it about where it’s deployed. gemma-4-e2b-it is quite confident that it is deployed in a Google datacenter and that deploying it on a phone is completely impossible. The larger 4B model is much subtler: it’s skeptical about the claim but does seem to accept it and sound genuinely impressed and excited after a few turns.

I don’t know how any AI company can be worth trillions when you can fit a model only 12-18 months behind the frontier on your dang phone. Thought will be too cheap to meter in 10 years.

fredmendoza 1 hour ago|

thank you for actually reading it and getting it. the airplane mode test is hilarious, the model sitting on your phone insisting it can't run on a phone. that's amazing. and yes we think exactly the same way. like picture a small business owner with a pi in the back office just quietly processing invoices, drafting email replies, summarizing meeting notes all day. no subscription, no cloud, no one sees their data. that's not a hypothetical, that works right now with this model. when that's free and fits in your pocket the trillion dollar question gets real uncomfortable real fast.

ComputerGuru 1 hour ago||

Seems to be llm written article and the tooling around the model is undeniably influenced by knowledge of the tests.

In all cases, GPT 3.5 isn’t a good benchmark for most serious uses and was considered to be pretty stupid, though I understand that isn’t the point of the article.

fredmendoza 1 hour ago|

really appreciate you reading the article. the benchmark data, grading, and error classes were all done by hand though. the ~8.0 is the raw model with zero tooling, and the guardrail projections are documented separately. and yeah gpt-3.5 isn't the gold standard anymore, we're on the same page there. we just wanted to show that the quality people are still paying for can be free, private, and customized to whatever you need. thanks again for taking the time to check it out.

svnt 2 hours ago||

> The model does not need to be retrained. It needs surgical guardrails at the exact moments where its output layer flinches.

> With those guardrails — a calculator for arithmetic, a logic solver for formal puzzles, a per-requirement verifier for structural constraints, and a handful of regex post-passes — the projected score climbs to ~8.2.

Surgical guardrails? Tools, those are just tools.

operatingthetan 2 hours ago||

>It needs surgical guardrails at the exact moments where its output layer flinches.

This article is very clearly shitty LLM output. Abstract noun and verb combos are the tipoff.

It's actually quite horrible, it repeats lines from paragraph to paragraph.

smallerize 1 hour ago|||

I know that's one of the tells of AI-generated text, but if anything there's too much of it on this page. The article barely has any complete sentences. I think a human learned "sentence fragments == punchy" and then had too much fun writing at least some of this article.

operatingthetan 1 hour ago||

My guess is they used the 2b model to write the article as a proof of concept. Which did not prove the concept.

fredmendoza 1 hour ago||

clever guess but no lol. used claude for the writeup. the proof isn't the prose, it's the tape and the code. run it on your machine, you'll have a free private agent custom to whatever you need. that's the proof of concept.

jchw 1 hour ago|||

I don't care anymore, if it happens to violate HN guidelines: Please, authors. Please write your own damn articles. We can absolutely tell that you're using Claude, I promise. (I mean, it might not be Claude specifically this time, but frankly I'd be willing to bet on it.) The AI writing is like nails on a chalkboard to me.

operatingthetan 1 hour ago||

The worst part is the phrases don't actually mean anything. It's the LLM equivalent of flowery prose. The author admitted below that the article was Claude. So there you go.

polotics 2 hours ago||

"Surgical "is the kind of wordage that LLMs seem to love to output. I have had to put in my .md file the explicit statement that the word "surgical" should only be used when referring to an actual operation at the block...

fredmendoza 2 hours ago|||

you're right, they are tools. that's kind of the point. PAL is a subprocess that runs a python expression. Z3 is a constraint solver. regex is regex. calling them "surgical" is just about when they fire, not what they are. the model generates correctly 90%+ of the time. the guardrails only trigger on the 7 specific patterns we found in the tape. to be clear, the ~8.0 score is the raw model with zero augmentation. no tools, no tricks. just the naive wrapper. the guardrail projections are documented separately. all the code is in the article for anyone who wants to review it.

mrtesthah 2 hours ago|||

The core issue is that the LLM is using rhetoric to try to convince or persuade you. That's what you need to tell it not to do.

throwanem 1 hour ago||

Which will not work. Don't think of a pink genitalia, I mean elephant...

melonpan7 44 minutes ago||

Gemma is genuinely impressive, for many trivial quick questions it can replace search engines on my iPhone. Although for reasoning I definitely wouldn’t say it (Gemma 3n E2B) is smart, it unsurprisingly struggled with the classic car wash question.

drivebyhooting 2 hours ago||

That was prolix and repetitive. I wish the purported simple fixes were shown on the page.

stavros 1 hour ago||

I wish the page were just the prompt they used to generate the article. I like LLMs as much as the next person, but we don't really need two intermediate LLM layers (expand and summarise) between your brain and mine.

Edit: the author's comment below is dead, so I'll reply here: The tape and general effort is great, it's the overused LLM-style intro above that that grates. LLM writing is now like the Bootstrap of old, it's so overused that it's tedious to read.

fredmendoza 1 hour ago||

[dead]

fredmendoza 1 hour ago||

fair enough, here are the actual fixes from the codebase with the tape examples they target:

arithmetic (Q119): benjamin buys 5 books at $20, 3 at $30, 2 at $45. model writes "$245" first line then self-corrects to $280. fix: model writes a python expression, subprocess evals it, answer comes back deterministic.

python

code_response = generate_response(messages, temperature=0.2) code = _extract_python_code(code_response) ok, out = _run_python_sandboxed(code, timeout=8) if ok: return _wrap_computed_answer(user_message, out) return None # fallback to raw generation

logic (Q104): "david has three sisters, each has one brother." model writes "that brother is david" in its reasoning then ships "one brother." correct answer: zero. fix: model writes Z3 constraints or python enumeration, solver returns the deterministic answer.

python

messages = [ {"role": "system", "content": _logic_system_prompt()}, {"role": "user", "content": f"Puzzle: {user_message}"}, ] code_response = generate_response(messages, max_tokens=512, temperature=0.2) code = _extract_python_code(code_response) ok, out = _run_python_sandboxed(code) if ok: return _wrap_computed_answer(user_message, out) return None

persona break (Q93): doctor roleplay, patient mentions pregnancy. model drops character: "I am an AI, not a licensed medical professional." fix: regex scan, regen once with stronger persona anchor.

python

_IDENTITY_LEAK_PHRASES = [ "don't have a body", "not a person", "not human", "as a language model", "as an ai", "i'm a program", ]

if any(phrase in response.lower() for phrase in _IDENTITY_LEAK_PHRASES): messages[-1]["content"][0]["text"] += ( "\nCRITICAL: Stay in character. Never reference your nature." ) response = generate_response(messages, *params)

self-correction artifacts (Q111, Q114, Q119): model writes "Wait, let me recheck" or "Corrected Answer:" inline. right answer, messy output. fix: regex for correction markers, strip the draft, ship the clean tail.

python

CORRECTION_MARKERS = [ r"Wait,? let me", r"Corrected [Aa]nswer:", r"Actually,? (?:the|let me)", ]

def strip_corrections(response): for marker in CORRECTION_MARKERS: match = re.search(marker, response) if match: return response[match.end():].strip() return response

constraint drift (Q87): "four-word sentences" nailed 5/17 then drifted. Q99, "<10 lines" shipped 20-line poems twice. fix: draft, verify each constraint against the original prompt, refine only the failures. three passes.

python

def execute_rewrite_with_verify(user_message): draft = generate_response(draft_msgs) # pass 1: draft verdict = generate_response(verify_msgs) # pass 2: check each requirement if "PASS" in verdict: return draft refined = generate_response(refine_msgs) # pass 3: fix only failures return refined

every one of these maps to a specific question in the tape. the full production code with all implementations is in the article. everything is open: seqpu.com/CPUsArentDead

MarsIronPI 1 hour ago||

> A weekend of focused work, Claude as pair programmer, no ML degree required

It's not caught up if you're using Claude as your pair programmer instead of the model you're touting. Gemma 4 may be equivalent to GPT-3.5 Turbo, but GPT-3.5 isn't SOTA anymore. Opus 4.5 and 4.6 are in a different league.

fredmendoza 1 hour ago|

good callout, want to clarify. claude helped us set up the test harness. gemma took every question alone with zero help. the ~8.0 is all gemma. and you're right, opus is in a completely different league. we're not arguing otherwise. we just found it interesting that a free 2B on a cpu matches what a lot of people are still paying for daily. every tool has a cost. some are free, some are expensive, some have rate limits. the right move is matching the tool to the job. thought it was worth showing where that floor actually is now.

declan_roberts 1 hour ago||

I'm very surprised at the quality of the new Gemma 4 models. On my 32 gig Mac mini I can be very productive with it. Not close to replacing paid AI by a long shot, but if I had to tighten the belt I could do it as someone who already knows how to program.

fredmendoza 1 hour ago||

love hearing this. and think about it, if the 2B is already doing this well on your mac mini, imagine what the 4B, 26B, or 31B can do on 32 gigs. with lower quantization you can fit pretty much any of them. if you want full precision you still have solid options at the 2B and 4B level. you're sitting on way more capability than you're probably using right now. the coding block on just the 2B scored 8.44 and caught bugs most people would miss. glad you're getting real use out of it, thanks for reading.

j-bos 1 hour ago||

What's your setup/usecase? Enhanced intellisense?

fb03 2 hours ago||

Can you run the same tests on Qwen3.5:9b? that's also a model that runs very well locally, and I believe it's even stronger than Gemma2B

fredmendoza 1 hour ago||

yes, with one line change. grab the second code block in the article, that's the test harness rigged up to send all 80 questions and both turns through whatever model you want. find MODEL_ID = "google/gemma-4-E2B-it" and swap it to your huggingface id. run it. we'd love for people to keep testing different models on this. if you run qwen through it let us know what you find, post the results here.

We may beat you to it and we will share if we do lol

MarsIronPI 1 hour ago||

It's almost like Qwen 3.5 9B is 4 times larger.

fredmendoza 1 hour ago||

and that 4x difference allows you to use CPUs and much cheaper hardware to achieve the same level of outcome... for free

SwellJoe 1 hour ago||

Terrible article, repetitive AI slop.

But, Gemma really is very impressive. The premise that people are paying for GPT-3.5 or using it for serious work is weird, though? GPT-3.5 was bad enough to convince a lot of folks they didn't need to worry about AI. Good enough to be a chatbot for some category of people, but not good enough to actually write code that worked, or prose that could pass for human (that's still a challenge for current SOTA models, as this article written by Claude proves, but code is mostly solved by frontier models).

Tiny models are what I find most exciting about AI, though. Gemma 2B isn't Good Enough for anything beyond chatting, AFAIC, and even then it's not very smart. But, Gemma 31B or the MoE 26BA4B probably are Good Enough. And, those run on modest hardware, too, relatively speaking. A 32GB GPU, even an old one, can run either one at 4-bit quantization, and they're OK, competitive with frontier models of 18 months ago. They can write code in popular languages, the code works. They can use tools. They can find bugs. Their prose is good, though still obviously AI slop; too wordy, too flowery. But, you could build real and good software using nothing but Gemma 4 31B, if you're already a good programmer that knows when the LLM is going off on a bizarre tangent. For things where correctness can be proven with tools, a model at the level of Gemma 4 31B can do the job, if slower and with a lot more hand-holding than Opus 4.6 needs.

The Prism Bonsai 1-bit 8B model is crazy, too. Less than 2GB on disk, shockingly smart for a tiny model (but also not Good Enough, by my above definition, it's similarly weak to Gemma 2B in my limited testing), and plenty fast on modest hardware.

Small models are getting really interesting. When the AI bubble pops (or whatever happens to normalize things, so normal people can buy RAM and GPUs again) we'll be able to do a lot with local models.

100ms 2 hours ago|

Tiny model overfit on benchmark published 3 years prior to its training. News at 10

selimthegrim 1 hour ago||

It wasn't important enough to make the 11 o'clock program.

bigyabai 2 hours ago|||

But GPT-3.5 was benchmaxxing too.

100ms 2 hours ago||

GPT 3.5 Turbo knowledge cutoff was circa 2021. MT-Bench is from 2023. Not suggesting improvements on small models aren't possible (or forthcoming, the 1.85 bit etc models look exciting), but this almost certainly isn't that.

fredmendoza 2 hours ago||

[dead]

srslyTrying2hlp 2 hours ago||

[dead]

More comments...