Posted by spectraldrift 15 hours ago

Gemini 3.5 Flash(blog.google)
https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...
759 points | 534 comments | page 7
andrewstuart 14 hours ago|
The benchmark that matters: can it actually program as well as Claude Code?

If not, then I'm not using it.

Cancelled my account 3 months ago; only Claude Code-level capability would bring me back.

cmrdporcupine 13 hours ago|
I spent 10 minutes with it in their new "agy" CLI tool and immediately found it is nowhere close to GPT 5.5 high in codex. It was sloppy and made poor assumptions in its analysis. It would have produced a mess if I let it go ahead with its plan. And it was just like previous versions of Gemini with poor tool use (e.g. "I wrote a file with the plan..." but file was never written.)

For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)

They're still months behind OpenAI and Anthropic on coding.

Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).

I do use Gemini for "lifestyle" AI usage (web research etc) tho.

danny094 11 hours ago||
so google is just trying to be cool in 2026 huh
hubraumhugo 14 hours ago||
Just updated my HN Wrapped project with it and it does well on my totally unscientific LLM humor benchmark: https://hn-wrapped.kadoa.com
amarant 13 hours ago|
Lol, nice project! I liked the xkcd-style comic the most!

I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!

bakugo 15 hours ago||
Triple the price of the last Flash model ($3 -> $9 per 1M output). Quickly approaching Sonnet prices.

Feels like the AI pricing noose is tightening sooner rather than later.

kristopolous 12 hours ago||
I've built a tool to track these.

Relatively speaking here's where it's at:

    score  age  size    name
    44.2   97   large   GLM-5 (Reasoning)
    44.7   187  -       GPT-5.1 (high)
    44.9   29   -       Qwen3.6 Max Preview
    45     0    -       Gemini 3.5 Flash
    45.5   27   large   MiMo-V2.5-Pro
    45.6   75   -       GPT-5.4 (low)

This is from Artificial Analysis, using https://github.com/day50-dev/aa-eval-email/blob/main/art-ana...

I really don't know why people downvote me. What do I need to say to make things for free that people like? Sincere question. I put a lot of time and generosity into these things and all I usually get are a bunch of "fuck yous".

This is honestly an existential issue for me. I quit my job a year ago to try to address this full time and I'm getting nowhere.

kridsdale3 9 hours ago||
Buddy, this tone may be why.

We genuinely don't understand what your post is about. What is this tool? What do these numbers represent? Why are things sorted in that order?

You haven't really communicated anything at all. I am interested, I'd like to understand. Write a more complete post, please.

kristopolous 8 hours ago||
Are you familiar with https://artificialanalysis.ai/leaderboards/models

The JSON on the page has a coding index result that it hides from the table.

That's what this exposes: a ranking by coding index, from the leading evals company, of basically every model that matters, in an easy-to-parse format. You can feed it into model-routing harnesses in real time so that, for instance, your agents can dynamically upgrade themselves to better models as they come out, or cost-optimize based on eval results.

I do stuff like this, give it away for free and it's either ignored or makes people angry...

I really wish I didn't piss people off with my sincerity, but somehow it always goes down that way.

I really appreciate your time. Thank you so much.
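
For anyone wondering what "feed into a routing harness" could look like, here is a minimal sketch in Python. It is not taken from the linked script: `Row`, `ROWS`, and `pick_model` are hypothetical names, the rows are copied from the table posted earlier in this thread, and the `allowed` and `max_age_days` filters are assumed policies, not anything Artificial Analysis or the tool defines.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Row(NamedTuple):
    score: float   # coding index, rounded to one decimal
    age_days: int  # days since release
    name: str


# Rows copied from the table earlier in the thread (size column omitted).
ROWS = [
    Row(44.2, 97, "GLM-5 (Reasoning)"),
    Row(44.7, 187, "GPT-5.1 (high)"),
    Row(44.9, 29, "Qwen3.6 Max Preview"),
    Row(45.0, 0, "Gemini 3.5 Flash"),
    Row(45.5, 27, "MiMo-V2.5-Pro"),
    Row(45.6, 75, "GPT-5.4 (low)"),
]


def pick_model(
    rows: Iterable[Row],
    allowed: Optional[Set[str]] = None,
    max_age_days: Optional[int] = None,
) -> Optional[Row]:
    """Return the highest-scoring row that passes the filters, or None."""
    candidates = [
        r for r in rows
        if (allowed is None or r.name in allowed)
        and (max_age_days is None or r.age_days <= max_age_days)
    ]
    return max(candidates, key=lambda r: r.score, default=None)
```

A harness could call `pick_model(ROWS, allowed=models_you_have_keys_for)` each time the list refreshes and switch when the answer changes. A cost-based filter would need price data, which the table above does not carry.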

esafak 10 hours ago||
I see no 'score' or 'age' mentioned in your script. What does age signify, and how are the two calculated?
kristopolous 9 hours ago||
This isn't obvious?

    "\(
        10 * (.codingIndex // 0) | round / 10
    ) \(
      (
        now - (
        .releaseDate |
          try ( strptime("%Y-%m-%d") | mktime )
          catch (now + 86400)
      ) ) / 86400 | floor

Real question. I see 86400 and I know it's time... That might just be me.

I'm not being an ass. I don't know how to talk to people, or how to tell when I think I'm being clear but am actually being cryptic.
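
For readers who don't speak jq, the excerpt above can be read as two small calculations. This is a rough Python equivalent, a sketch and not the script itself: the function names `score` and `age_days` are made up here, and `now` is passed in explicitly so the result is reproducible. The score is the coding index rounded to one decimal (missing counts as 0); the age is whole days since the release date, where 86400 is the number of seconds in a day. An unparseable date is treated as one day in the future, which comes out as an age of -1.

```python
import math
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def score(coding_index):
    """Coding index rounded to one decimal place; a missing index counts as 0."""
    value = coding_index if coding_index is not None else 0
    # floor(x + 0.5) rounds halves up for non-negative input, like jq's `round`.
    return math.floor(value * 10 + 0.5) / 10


def age_days(release_date, now):
    """Whole days between release_date ("YYYY-MM-DD", UTC) and now.

    Returns -1 when the date is missing or unparseable, mirroring the
    excerpt's fallback of "now + 86400" (a release one day in the future).
    """
    try:
        released = datetime.strptime(release_date, "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
    except (TypeError, ValueError):
        return -1
    return math.floor((now - released).total_seconds() / SECONDS_PER_DAY)
```

So a model released today has age 0, which matches the Gemini 3.5 Flash row in the table.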

mrbungie 8 hours ago||
It is kind of noisy because the release recency, which is what your "age" column actually represents, is not important data for the comparison you are trying to make.

Also, the message we should get from that table is not really obvious.

kristopolous 8 hours ago||
Okay I think there's a familiarity delta. I constantly run into this

I know artificial analysis quite well as the gold standard in llm evals.

But I guess they're still obscure

I didn't think they were.

The age is important because new techniques keep being developed and so it is a very rough indicator of the size/cost/efficiency trade-off.

How old a model is is a major indicator of what you can expect from it.

I really need to develop a better sense for what people know. That's only one of my problems

Thanks for engaging with me

mrbungie 11 minutes ago||
> I know artificial analysis quite well as the gold standard in llm evals.

I also know them, but it took me a while to realise you were publishing their data in that table. I don't think it was clear.

> The age is important because new techniques keep being developed and so it is a very rough indicator of the size/cost/efficiency trade-off.

Yes, but you are already including the name of the model; your potential audience for the table already knows each model's release history, and therefore its age, at least roughly.

nightski 15 hours ago||
AI being a product is not the future. It's more like an operating system that deserves to be open and free (aka Linux). Unless that happens we are in for a very dystopian future. I wish I had the intelligence, resources and/or connections to try and make that happen.
lugu 13 hours ago|
What we need today is a standard local API (think of it as a POSIX extension), so that each desktop app that needs AI to enhance a feature can simply call it. This way, those apps will need to handle the case where AI is not available. This will empower users.
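
A minimal sketch of the "app must handle AI being absent" idea, in Python. Everything here is hypothetical: `LocalAI`, `get_local_ai`, and `preview` are illustrative names, not an existing OS or POSIX interface. The point is only that the feature degrades to a plain, non-AI code path instead of failing.

```python
from typing import Optional, Protocol


class LocalAI(Protocol):
    """Hypothetical interface a system-wide local AI service might expose."""

    def summarize(self, text: str) -> str: ...


def get_local_ai() -> Optional[LocalAI]:
    """Hypothetical lookup; returns None when no AI service is installed."""
    return None


def preview(text: str, ai: Optional[LocalAI], limit: int = 40) -> str:
    """Use the AI summary when available, otherwise fall back to truncation."""
    if ai is not None:
        try:
            return ai.summarize(text)
        except Exception:
            pass  # service present but failing: use the plain path below
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."
```

An app would call `preview(body, get_local_ai())` and behave the same whether or not the user has a model installed, just with a less clever preview.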
charcircuit 9 hours ago||
All major operating systems (Windows, macOS, iOS, and Android) have local APIs for using AI.
hedora 8 hours ago||
Why would I use those instead of just grabbing a model from Hugging Face? Are they as good as Qwen 30B?
charcircuit 2 hours ago||
Because it is simpler for an application developer to just use an OS API than to figure out some third-party thing and set it up. Each platform has several different models for different things, so I can't give a comparison.
stan_kirdey 14 hours ago||
EXPENSIVE ._.
uejfiweun 12 hours ago||
This is funny, I was randomly using Gemini today and I was astounded by how good the responses from Flash were. I guess this must be the reason why.
casey2 13 hours ago||
I think the field moved to agents too fast. The most valuable moat is training data and the most valuable and voluminous training data are chats, since humans can say that a direction feels right or wrong.
dsabanin 7 hours ago|
No matter what Google does, for some reason the agentic performance of their models is missing something. I hope this release is stronger. We need more competition.