Posted by sethkim 7/3/2025
We don't have accurate price signals externally because Google, in particular, has been very aggressive about treating pricing as a competitive exercise rather than anything tethered to costs.
For quite some time, their pricing updates were across the board exactly 2/3 the cost of OpenAI's equivalent model.
[^1] "If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help."
[^2] "Google's decision to raise the price of Gemini 2.5 Flash wasn't just a business decision; it was a signal to the entire market." by far the biggest giveaway, the other tells are repeated fanciful descriptions of things that could be real, that when stacked up, indicate a surreal, artifical, understanding of what they're being asked to write about, i.e. "In a move that at first went unnoticed,"
Oh, I noticed. I've also complained about how Gemini 2.0 Flash is 50% more expensive than Gemini 1.5 Flash for small requests.
Also, I'm sure that if Google wanted to price Gemini 2.5 Flash lower, they could. The reason they won't is that there is almost zero competition in the sub-10-cents-per-million-input-tokens range. Google's answer for that range is 2.5 Flash Lite, which they say is equivalent to 2.0 Flash at the same cost. It might even work out a bit cheaper if you factor in automatic context caching.
Also, the point about the quadratic increase is valid, but it's not as simple as the article states, due to caching (a rough sketch below shows the effect). And if it were a big issue, Google would impose tiered pricing like they do for Gemini 2.5 Pro.
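To make that concrete, here's a back-of-envelope sketch of how much of the quadratic attention cost survives when a long shared prefix is cached. The model dimension and token counts are illustrative assumptions, not Google's numbers:

```python
# Back-of-envelope sketch (illustrative numbers, not Google's): how much of
# the quadratic attention cost survives when a long shared prefix is cached.
def attention_score_flops(new_tokens: int, cached_tokens: int, d_model: int = 4096) -> float:
    """Rough FLOPs for attention scores in one layer: each new token attends
    over (cached + new) keys. With no cache this is the classic n^2 term."""
    return 2.0 * new_tokens * (cached_tokens + new_tokens) * d_model

n = 100_000  # a long prompt
cold = attention_score_flops(new_tokens=n, cached_tokens=0)
warm = attention_score_flops(new_tokens=2_000, cached_tokens=n - 2_000)
print(f"cold prefill: {cold:.2e} FLOPs; cached prefix: {warm:.2e} FLOPs "
      f"({warm / cold:.1%} of cold)")
```

With these toy numbers, re-sending a 100k-token prompt with only 2k new tokens costs about 2% of the cold prefill, which is why flat quadratic-cost arguments oversimplify.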
And for what it's worth, I've been playing around with Gemma E4B on together.ai. It takes 10x as long as Gemini 2.5 Flash Lite and it sucks at multilingual tasks. But other than that it seems to produce acceptable results, and it's way cheaper.
That assumes pricing and price drops only ever reflected cost reductions from technical advancements. While those certainly played a role, it disregards the role investment money plays.
Maybe we've hit a wall in the "Moore's law for AI", or maybe it's just harder to justify these massive investments when all you have to show for them are marginal improvements in the eyes of investors, who are becoming increasingly anxious to get their money back.
Llama 4 Maverick is 16x 17B, so about 67 GB in size; the equivalency is 400 billion parameters.
Llama 4 Behemoth is 128x 17B, about 245 GB in size; the equivalency is 2 trillion parameters.
I don't have the resources to test these, unfortunately, but they are claiming Behemoth is superior to the best SaaS options via internal benchmarking.
Comparatively, DeepSeek R1 671B is 404 GB in size, with pretty similar benchmarks.
But compare DeepSeek R1 32B to any model from 2021 and it's going to be significantly superior.
So we have the quality of models increasing while the resources they need decrease. In 5-10 years, do we get an LLM that loads onto a 16-32 GB video card and is simply capable of doing it all?
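For a rough sense of what that question amounts to, here's a quick sketch of the VRAM needed at different quantization levels; the 20% overhead factor is a loose assumption for KV cache and activations, not a measured figure:

```python
# Rough sketch (my assumptions, not vendor specs): what fits on a 16-32 GB
# card at a given quantization, counting weights plus ~20% headroom for
# KV cache and activations.
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for params in (8, 32, 70):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: ~{vram_gb(params, bits):.0f} GB")
```

By this estimate a 32B model at 4-bit lands around 19 GB, so the question is really whether a model of that size class can ever be "capable of doing it all".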
I think the best of both worlds is a sufficiently capable reasoning model with access to external tools and data that can perform CPU-based lookups for information that it doesn't possess.
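A minimal sketch of that loop, with every name (call_llm, lookup) a hypothetical stand-in rather than any real API:

```python
# Minimal sketch of the "reasoning model + CPU-side lookups" idea. All names
# here (call_llm, lookup) are hypothetical stand-ins, not any real API.
def lookup(query: str) -> str:
    """Stand-in for a cheap CPU-based lookup (local index, SQLite, etc.)."""
    knowledge = {"capital of France": "Paris"}
    return knowledge.get(query, "no result")

def call_llm(messages: list[dict]) -> dict:
    """Stub standing in for a real model call: asks for a lookup once,
    then answers using the tool result."""
    if messages[-1]["role"] == "tool":
        return {"content": f"The answer is {messages[-1]['content']}."}
    return {"tool": "lookup", "query": "capital of France"}

def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        reply = call_llm(messages)
        if "tool" in reply:  # model wants external data it doesn't possess
            messages.append({"role": "tool", "content": lookup(reply["query"])})
        else:
            return reply["content"]

print(answer("What is the capital of France?"))
```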
"Sir, I'm delighted to report that the productivity and insights gained outclass anything available from four years ago. We are clearly winning."
Personally, I'm rooting for RWKV / Mamba2 to pull through, somehow. There's been some work done to increase their reasoning depth, but transformers still beat them without much effort.
In terms of neurobiology, the Transformer architecture is more in line with the highly interconnected, global receptive field of neurons.
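A toy contrast between the two families: transformer attention keeps a growing KV cache (that global receptive field), while RWKV/Mamba-style models fold each token into a fixed-size state. This is simplified, unnormalized linear-attention math, not the actual RWKV or Mamba update:

```python
# Toy, unnormalized linear-attention recurrence, in the family that
# RWKV/Mamba-style models build on. Not the actual RWKV or Mamba update;
# it only illustrates fixed-size state vs. a transformer's growing KV cache.
import numpy as np

d = 8
rng = np.random.default_rng(0)
S = np.zeros((d, d))                   # the entire recurrent state
for _ in range(1000):                  # longer sequences cost time, not memory
    k, v, q = rng.standard_normal((3, d))
    S += np.outer(k, v)                # fold the new token into the state
    y = q @ S                          # read out: O(d^2) per token, O(n) total
print(S.shape, y.shape)                # state stays (8, 8) at any length
```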
Since Gemini CLI was recently released, many people on the "free" tier noticed that their sessions immediately got downgraded from Gemini 2.5 Pro to Flash "due to high utilization". I asked Gemini itself about this, and it reported that the finite GPU/TPU resources in Google's cloud infrastructure can get oversubscribed for Pro usage. Google (no secret here) has a subscription option for higher-tier customers to request guaranteed provisioning for the Pro model. As capacity is approached, they must throttle lower-tier (including free) sessions down to the less resource-intensive models.
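From the client side, that throttling looks like the fallback pattern below. The client.generate call and CapacityError are hypothetical placeholders, not the real Gemini CLI internals:

```python
import time

class CapacityError(Exception):
    """Hypothetical stand-in for a 429/oversubscribed response."""

MODELS = ["gemini-2.5-pro", "gemini-2.5-flash"]  # preferred model first

def generate_with_fallback(client, prompt: str, retries: int = 2) -> str:
    for model in MODELS:
        for attempt in range(retries):
            try:
                return client.generate(model=model, prompt=prompt)
            except CapacityError:
                time.sleep(2 ** attempt)  # brief backoff, then retry
        # this model stayed oversubscribed; degrade to the next, cheaper one
    raise RuntimeError("all models throttled")
```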
Price is one lever to pull once capacity becomes constrained. Yet, as the top-voted comment on this post explains, it's not honest to simply label this as a price increase. They raised Flash pricing on input tokens but lowered pricing on output tokens up to certain limits -- which gives credence to the theory that they are trying to shape demand to better match their capacity.
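A quick back-of-envelope illustration of that demand-shaping reading; the preview and GA prices used here ($0.15 -> $0.30 per million input tokens, $3.50 -> $2.50 per million output tokens) are widely reported figures, taken as assumptions:

```python
# Back-of-envelope: whether the 2.5 Flash repricing is an increase depends on
# your input:output token mix. Prices are per million tokens, assumed from
# widely reported preview vs. GA figures.
OLD = {"in": 0.15, "out": 3.50}
NEW = {"in": 0.30, "out": 2.50}

def cost(prices: dict, m_in: float, m_out: float) -> float:
    return prices["in"] * m_in + prices["out"] * m_out

for m_in, m_out, label in [(10, 0.5, "input-heavy (classification)"),
                           (1, 2, "output-heavy (generation)")]:
    old, new = cost(OLD, m_in, m_out), cost(NEW, m_in, m_out)
    print(f"{label}: ${old:.2f} -> ${new:.2f} ({(new - old) / old:+.0%})")
```

Under these assumptions, input-heavy batch workloads get about 31% more expensive while output-heavy ones get about 26% cheaper, which is exactly the demand-shaping shape you'd expect.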
Context size is the real killer when you look at running open source alternatives on your own hardware. Has anything even come close to the 100k+ range yet?
One of the clearest examples is DeepSeek v3. DeepSeek has mentioned that its pricing of $0.27/$1.10 per million input/output tokens carries an 80% profit margin, so their cost is roughly 90% below the price of Gemini Flash. And Gemini Flash is very likely a smaller model than DeepSeek v3.
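Checking that arithmetic with the numbers in the comment; the Gemini 2.5 Flash GA prices ($0.30 input / $2.50 output per million tokens) are assumed for the comparison:

```python
# Checking the comment's arithmetic: if DeepSeek's $0.27/$1.10 per million
# tokens carries an 80% margin, implied cost is price * (1 - 0.80). The
# Gemini 2.5 Flash GA prices ($0.30/$2.50) are assumed for comparison.
deepseek_price = {"in": 0.27, "out": 1.10}
margin = 0.80
gemini_price = {"in": 0.30, "out": 2.50}

for kind in ("in", "out"):
    cost = deepseek_price[kind] * (1 - margin)
    print(f"{kind}: implied DeepSeek cost ${cost:.3f}/M "
          f"= {1 - cost / gemini_price[kind]:.0%} below Gemini Flash price")
```

That works out to about 82% below on input and 91% below on output, so the "roughly 90% less" claim holds for output tokens under these assumptions.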