
Posted by fredmendoza 1 day ago

CPUs Aren't Dead. Gemma2B Outscored GPT-3.5 Turbo on the Test That Made It Famous (seqpu.com)
88 points | 45 comments
roschdal 1 day ago|
I yearn for the days when I can program on my PC with a programming llm running on the CPU locally.
glitchc 1 day ago||
You can do that now. Qwen2.5-Coder and gpt-oss-20b are pretty good for local coding help.
yazaddaruvala 1 day ago|||
I’ve been using Google AI Edge Gallery on my M1 MacBook with Gemma4B with very good results for random python scripts.

Unfortunately I still need to copy-paste the code into a file + terminal command. Annoying, but it works.

fredmendoza 1 day ago|||
you're honestly not that far off. the coding block on this model scored 8.44 with zero help. it caught a None-init TypeError on a code review question that most people would miss. one question asked for O(n) and it just went ahead and shipped O(log(min(m,n))) on its own. it's not copilot but it's free, it's offline, and it runs on whatever you have. there's a 30-line chat.py in the article you can copy and run tonight.
luxuryballs 1 day ago|||
You can do it on a laptop today, faster with gpu/npu, it’s not going to one shot something complex but you can def pump out models/functions/services, scaffold projects, write bash/powershell scripts in seconds.
trgn 1 day ago||
we need sqlite for llms
philipkglass 1 day ago||
I think that we're getting there. I put together a workstation in early 2023 with a single 4090 GPU. I did it to run things like BERT and YOLO image classifiers. At that point the only "open weights" LLM was the original Llama from Meta, and even that was open-weights only because it was leaked. It was a very weak model by today's standards.

With the same hardware I now get genuine utility out of models like Qwen 3.5 for categorizing and extracting unstructured data sources. I don't use small local models for coding since frontier models are so much stronger, but if I had to go back to small models for coding too they would be more useful than anything commercially available as recently as 4 years ago.

fredmendoza 1 day ago||
we found something interesting and wanted to share it with this community.

we wanted to know how google's gemma 4 e2b-it — 2 billion parameters, bfloat16, apache 2.0 — stacks up against gpt-3.5 turbo. not in vibes. on the same test. mt-bench: 80 questions, 160 turns, graded 1-10 — what the field used to grade gpt-3.5 turbo, gpt-4, and every major model of the last three years. we ran gemma through all of it on a cpu. 169-line python wrapper. no fine-tuning, no chain-of-thought, no tool use.

gpt-3.5 turbo scored 7.94. gemma scored ~8.0. 87x fewer parameters, on a cpu — the kind already in your laptop.

but the score isn't what we want to talk about. what's interesting is what we found when we read the tape.

we graded all 160 turns by hand. (when we used ai graders on the coding questions, they scored responses as gpt-4o-level.) the failures aren't random. they're specific, nameable patterns at concrete moments in generation. seven classes.

cleanest example: benjamin buys 5 books at $20, 3 at $30, 2 at $45. total is $280. the model writes "$245" first, then shows its work — 100 + 90 + 90 = 280 — and self-corrects. the math was right. the output token fired before the computation finished. we saw this on three separate math questions — not a fluke, a pattern.

the fix: we gave it a calculator. model writes a python expression, subprocess evaluates it, result comes back deterministic. ~80 lines. arithmetic errors gone. six of seven classes follow the same shape — capability is there, commit flinches, classical tool catches the flinch. z3 for logic, regex for structural drift, ~60 lines each. projected score with guardrails: ~8.2. the seventh is a genuine knowledge gap we documented as a limitation.
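
the calculator shape in plain code — a minimal sketch of our own, not the article's ~80 lines. assumes the model emits a bare arithmetic expression; a character whitelist plus a fresh subprocess keeps the evaluation deterministic and contained:

```python
import re
import subprocess
import sys

# digits, whitespace, and arithmetic operators only — rejects names, quotes, dunders
SAFE_EXPR = re.compile(r"^[\d\s+\-*/().%]+$")

def evaluate_expression(expr: str, timeout: float = 2.0) -> str:
    """Deterministically evaluate an arithmetic expression the model emitted.

    The expression is checked against a character whitelist, then run in a
    fresh python subprocess so a malformed expression can't touch our process.
    """
    if not SAFE_EXPR.match(expr):
        raise ValueError(f"refusing non-arithmetic expression: {expr!r}")
    result = subprocess.run(
        [sys.executable, "-c", f"print({expr})"],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise ValueError(result.stderr.strip())
    return result.stdout.strip()
```

on the benjamin question this turns "5*20 + 3*30 + 2*45" into "280" every time, no token-commit involved.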

one model, one benchmark, one weekend. but it points at something underexplored.

this model is natively multimodal — text, images, audio in one set of weights. quantized to Q4_K_M it's 1.3GB. google co-optimized it with arm and qualcomm for mobile silicon. what runs it now:

phones: iphone 14 pro+ (A16), mid-range android 2023+ with 6GB+ ram

tablets: ipads m-series, galaxy tab s8+, pixel tablet — anything 6GB+

single-board: raspberry pi

laptops: anything from the last 5-7 years, 8GB+ ram

edge/cloud: cloudflare containers, $5/month — scales to zero, wakes on request

google says e2b is the foundation for gemini nano 4, already on 140 million android devices. the same model that matched gpt-3.5 turbo, on phones in people's pockets.

think about what that means: a pi in a conference room listening to meetings, extracting action items with sentiment, saving notes locally — no cloud, no data leaving the building. an old thinkpad routing emails. a mini-pc running overnight batch jobs on docs that can't leave the network. a phone doing translation offline.

google designed e2b for edge from the start — per-layer embeddings, hybrid sliding-window/global attention to keep memory low. if a model designed for phones scores higher than turbo on the field's standard benchmark, cpu-first model design is a real direction, not a compromise.

the gpu isn't the enemy. it's a premium tool. what we're questioning is whether it should be the default — because what we observed looks more like a software engineering problem than a compute problem. cs already has years of tools that map onto these failure modes. the models may have just gotten good enough to use them. the article has everything: every score, every error class with tape examples, every fix, the full benchmark harness with all 80 questions, and the complete telegram bot code. run it yourself, swap in a different model, or just talk to the live bot — raw model, no fixes, warts and all.

we don't know how far this extends beyond mt-bench or whether the "correct reasoning, wrong commit" pattern has a name. we're sharing because we think more people should be looking at it. everything is open. the code is in the article. tear it apart.

ComputerGuru 1 day ago||
Was the hand grading done fully blinded?

(Also this comment is ai generated so I’m not sure who I’m even asking.)

fredmendoza 1 day ago||
Fred, nice to meet you. The grading model had no idea what was being tested. We used separate accounts to compartmentalize. The Claude grader was guessing GPT-3.5 Turbo or GPT-4 by the end. On the coding block it consistently scored responses as GPT-4o level. We followed the MT-Bench grading guidelines as published by the team that created them. Did the research, followed the book, had no horse in the race. Every score and every response is published in the tape so you can regrade the whole thing yourself if you want. And this is me typing, I'm just a guy in LA who spent a weekend running 80 questions through a 2B model and thought the results were interesting enough to share.
simianwords 1 day ago||
[dead]
FergusArgyll 1 day ago||
Poster's comment is dead. It may be llm-assisted but should prob be vouched for anyway as long as the story isn't flagged.
fredmendoza 1 day ago|
appreciate the vouch but come on lol. we ran 80 questions, graded 160 turns by hand, documented 7 error classes, open sourced all the code, and put a live bot up for people to test. writing this post up took me hours. everyone is a critic lol.
FergusArgyll 1 day ago||
I didn't mean it as criticism. I was trying to convince flaggers that it should be vouched - even if they don't like the tone or content of the comment - due to it coming from you.
invariantjason 1 day ago|
[dead]