Posted by yunusabd 15 hours ago

Show HN: State of the Art of Coding Models, According to Hacker News Commenters (hnup.date)
Hello HN,

I was away from my computer for two weeks, and after coming back and reading the latest discussions on HN about coding assistants (models, harnesses), I felt very out of the loop. My normal process would have been to keep reading and figure out the latest and greatest from people's comments, but I wanted to try and automate this process.

Basically the goal is to get a quick overview of which coding models are popular on HN. A next iteration could also scan for harnesses that people use, or for info on self-hosting and hardware setups.

I wrote a short intro on the page about the pipeline that collects and analyzes the data, but feel free to ask for more details or check the Google Sheet for more info.
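To give a rough idea, the collection step boils down to something like this (a simplified sketch; the model list is illustrative and the real pipeline does more filtering):

    import requests

    MODELS = ["claude", "gpt", "gemini", "qwen", "kimi"]  # illustrative list

    def fetch_comments(query, pages=3):
        # Public Algolia HN Search API; each hit is a comment matching the query.
        comments = []
        for page in range(pages):
            r = requests.get(
                "https://hn.algolia.com/api/v1/search_by_date",
                params={"query": query, "tags": "comment", "page": page},
            )
            r.raise_for_status()
            comments += [h["comment_text"] for h in r.json()["hits"]
                         if h.get("comment_text")]
        return comments

    mentions = {m: fetch_comments(m) for m in MODELS}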

https://hnup.date/hn-sota

119 points | 60 comments
gobdovan 12 hours ago|
Before harnesses, I'd fix the methodology/claims. A saner methodology would be to look for comments that compare two models, say 'gpt5.5>opus4.7', and infer context ('ctx:frontend', for example). With your current methodology, a comment like 'opus 4.6 was very smart, opus4.7 is a disappointing upgrade to 4.6' would lead normal aspect-based sentiment analysis to mis-assign the sentiment between 4.6 and 4.7. But considering you have <300 mentions total, you'd probably be better off scraping some other websites as well. I'd also drop the SotA claim entirely and downgrade the mentions to measuring something like visibility rather than performance.
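Roughly what I mean, as a toy sketch (the regex and example comments are made up):

    import re
    from collections import Counter

    # Infer a ranking from explicit pairwise comparisons like
    # "gpt5.5>opus4.7" instead of per-mention sentiment.
    PAIR = re.compile(r"([\w.\-]+)\s*>\s*([\w.\-]+)")

    def wins(comments):
        score = Counter()
        for c in comments:
            for winner, loser in PAIR.findall(c):
                score[winner] += 1
                score[loser] -= 1
        return score.most_common()

    print(wins(["gpt5.5>opus4.7 for frontend", "opus4.7>qwen3 on tests"]))
    # [('gpt5.5', 1), ('opus4.7', 0), ('qwen3', -1)]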
yunusabd 11 hours ago||
That's fair, my immediate concern would be that there would be very few comments comparing any two models, so the data would be very anecdotal.

The context would be really nice to have, but reading the comments myself, it often just isn't very clear what exactly users are building or which programming language they are using.

I think analyzing more comments is promising. If you get enough data, you can generalize across use cases and get more meaningful ratings. The obvious lever is including more posts, although it might hit diminishing returns. I'll play around with it.

For the context, I want to try giving Gemini a "scratch pad", where it can note down strengths and weaknesses per model that it finds in the comments. Something like "some users say that model x is good for writing tests". Then on each run, I let it update the scratch pad and publish the results as more of a qualitative analysis.
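Very roughly, something like this (the prompt, storage format, and call_gemini are all placeholders for the real pieces):

    import json

    def update_scratch_pad(pad_path, new_comments, call_gemini):
        # call_gemini is a stand-in for the real (batched) Gemini call.
        try:
            with open(pad_path) as f:
                pad = json.load(f)  # e.g. {"model x": ["good for writing tests"]}
        except FileNotFoundError:
            pad = {}
        prompt = (
            "Your existing notes on each model's strengths/weaknesses:\n"
            + json.dumps(pad, indent=2)
            + "\n\nUpdate the notes based on these new comments, reply as JSON:\n"
            + "\n---\n".join(new_comments)
        )
        pad = json.loads(call_gemini(prompt))  # assumes the model replies with JSON
        with open(pad_path, "w") as f:
            json.dump(pad, f, indent=2)
        return pad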

For the wording, I'd like to keep a certain amount of clickbait, sorry ;)

skeptrune 7 hours ago||
What a win it is for open source that qwen and kimi show up on this at all.
yakkomajuri 14 hours ago||
"Prompts an LLM" -> which LLM?

I saw you're using Gemini for the sentiment rating (which I guess you picked because it's not often mentioned and thus "neutral"? lol)

But would be interesting to get more details overall

yunusabd 13 hours ago|
It's actually ChatGPT at the moment for the first filtering step, for no other reason than having a code snippet ready that I could point Cursor at (I know, so 2025). The Gemini call is using batch processing, so it's handled differently.
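The filter step is basically this (simplified; the model name and prompt are stand-ins for what's actually in the pipeline):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def mentions_coding_model(comment: str) -> bool:
        # Cheap yes/no gate before the batched Gemini sentiment call.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder for whatever the pipeline uses
            messages=[
                {"role": "system",
                 "content": "Answer YES or NO: does this comment discuss "
                            "an AI coding model or assistant?"},
                {"role": "user", "content": comment},
            ],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")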
jesse_dot_id 9 hours ago||
I suspect companies are deploying bots to shift sentiment around their products. I find metrics like this to be largely useless vs. actually just trying stuff out.
jatins 5 hours ago||
So no one's using Gemini on HN?
pbgcp2026 13 hours ago||
So, it's a webpage with 3 paragraphs and a simple chart. It has:

1) terrible color scheme - fine, I switch to reader mode

2) shitloads of JS - fine, NoScript works, page breaks

3) fancy "design" with a simple graph but unreadable X-axis labels - fine, I can use screen zoom for that... only to see 3x "Claude O..." LOL, are we playing a guessing game?

4) ... "LxxxLxxx - Learn languages with YouTube!"
nonameiguess 3 hours ago||
Something that has been interesting to me for my entire life is the geek/jock cultural split in the US that seemed to crescendo in the 80s with the rise of popular nerd films and then the 90s when software started taking over the world. Being a pretty athletic kid who lettered in four sports, won a state championship, but also won math tournaments and spelling bees, it felt artificial to me. Plenty of high-level athletes have always been into video games, anime, and comic books, and are just as smart as people who can't run without tripping themselves and never learned to throw or catch any kind of ball.

Now it seems like it's come full circle from the other direction, too. We always had fandom elements in computing nerd culture. Editor wars. Language wars. Framework wars. Now that software tooling has become nearly human-like, mercurial, unpredictable, inconsistent in performance and experience from week to week, software developers have turned into sports scouts and ESPN talking heads, going so far as to make continually updating live power rankings the way commentators try to predict in season which team is looking most like they'll win the championship that year. You're in the position talent evaluators were in roughly the late 90s, relying mostly on eye test and rough proxy measures of raw potential. Simon Willison applies the pelican test the way draft combines put athletes through shuttle drills and test vertical leap to try and predict how well they'll do in real gameplay.

It leaves me wondering when we'll have the Bill James style analytics breakthrough in software talent evaluation or if such a thing is even possible. At least with athletes, practice can make them better and injury and age can make them worse, but you can't just silently swap out an entirely different mind and body under the same name and face. You guys are trying to assess the performance of constantly moving targets that can and do change capabilities and characteristics on a daily basis.

Hari2028 11 hours ago||
How noisy is the sentiment classification? Feels like that could skew results a lot
yunusabd 11 hours ago|
From the comments that I've checked manually it's pretty good. You can go to the "User Ratings" tab in the Google Sheet and check some comments to get an idea.
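If you want to put a number on it at some point, the plan would be to hand-label a random sample and compute agreement, something like this (toy labels; sklearn's kappa is just one option):

    from sklearn.metrics import cohen_kappa_score

    # Hand labels vs. LLM labels on the same sampled comments (toy data).
    human = ["pos", "neg", "pos", "neu", "neg"]
    llm   = ["pos", "neg", "neu", "neu", "neg"]

    print(cohen_kappa_score(human, llm))  # 1.0 = perfect agreement, 0 = chance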
input_sh 3 hours ago||
Terrible metric that tells absolutely nothing about what's state-of-the-art. You might as well call this list the most astroturfed models on HN.
julianlam 8 hours ago|
Interesting that Gemma 4 didn't crack the top 10.

I've been experimenting with the 26B-A4B model with some surprisingly good results (both in inference speed and code quality — 15 tok/s, flying along!), vs my last few experiments with Devstral 24B. Not sure whether I can fit that 35B Qwen model everybody's so keen on in my 32GB of unified RAM.
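Back-of-envelope for the 35B (assuming a 4-bit quant, which is what I'd run; numbers are rough):

    # ~0.5 bytes/param at 4-bit, plus a rough allowance for KV cache/runtime
    params = 35e9
    weights_gb = params * 0.5 / 1e9   # ~17.5 GB of weights
    overhead_gb = 4                   # KV cache + runtime, rough guess
    print(weights_gb + overhead_gb)   # ~21.5 GB: tight but plausible on 32GB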

However I think I may be in the minority of HN commenters exploring models for local inference.

asnelt 1 hour ago|
Can you elaborate on your setup? What harness are you using with Gemma 4 on your 32GB machine?