Posted by sethkim 7/3/2025
We don't have accurate price signals externally because Google, in particular, has been very aggressive about treating pricing as a competitive exercise rather than anything tethered to costs.
For quite some time, their pricing updates were across the board exactly 2/3 the cost of OpenAI's equivalent model.
[^1] "If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help."
[^2] "Google's decision to raise the price of Gemini 2.5 Flash wasn't just a business decision; it was a signal to the entire market." by far the biggest giveaway, the other tells are repeated fanciful descriptions of things that could be real, that when stacked up, indicate a surreal, artifical, understanding of what they're being asked to write about, i.e. "In a move that at first went unnoticed,"
Oh, I noticed. I've also complained about how Gemini 2.0 Flash is 50% more expensive than Gemini 1.5 Flash for small requests.
Also, I'm sure that if Google wanted to price Gemini 2.5 Flash lower, they could. The reason they won't is that there is almost zero competition in the sub-10-cents-per-million-input-tokens range. Google's answer for that range is 2.5 Flash Lite, which they say is equivalent to 2.0 Flash at the same cost. It might even work out a bit cheaper if you factor in automatic context caching.
Also, the point about the quadratic increase is valid, but it's not as simple as the article states, due to caching (a rough sketch below shows the effect). And if it were a big issue, Google would impose tiered pricing like they do for Gemini 2.5 Pro.
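To make that concrete, here's a back-of-envelope sketch of how much of the quadratic attention cost survives when a long shared prefix is cached. The model dimension and token counts are illustrative assumptions, not Google's numbers:

```python
# Back-of-envelope sketch (illustrative numbers, not Google's): how much of
# the quadratic attention cost survives when a long shared prefix is cached.
def attention_score_flops(new_tokens: int, cached_tokens: int, d_model: int = 4096) -> float:
    """Rough FLOPs for attention scores in one layer: each new token attends
    over (cached + new) keys. With no cache this is the classic n^2 term."""
    return 2.0 * new_tokens * (cached_tokens + new_tokens) * d_model

n = 100_000  # a long prompt
cold = attention_score_flops(new_tokens=n, cached_tokens=0)
warm = attention_score_flops(new_tokens=2_000, cached_tokens=n - 2_000)
print(f"cold prefill: {cold:.2e} FLOPs; cached prefix: {warm:.2e} FLOPs "
      f"({warm / cold:.1%} of cold)")
```

With these toy numbers, re-sending a 100k-token prompt with only 2k new tokens costs about 2% of the cold prefill, which is why flat quadratic-cost arguments oversimplify.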
And for what it's worth, I've been playing around with Gemma E4B on together.ai. It takes 10x as long as Gemini 2.5 Flash Lite and it sucks at multilingual tasks. But other than that it seems to produce acceptable results, and it's way cheaper.
That assumes pricing and price drops only ever reflected cost reductions from technical advancements. While those certainly played a role, it disregards the role investment money plays.
Maybe we've hit a wall in the "Moore's law for AI", or maybe it's just harder to justify these massive investments when all you have to show for them are marginal improvements in the eyes of investors, who are becoming increasingly anxious to get their money back.
Llama 4 Maverick is 16x 17B, so about 67 GB in size; the equivalency is 400 billion parameters.
Llama 4 Behemoth is 128x 17B, about 245 GB in size; the equivalency is 2 trillion parameters.
I don't have the resources to test these, unfortunately, but they are claiming Behemoth is superior to the best SaaS options via internal benchmarking.
Comparatively, DeepSeek R1 671B is 404 GB in size, with pretty similar benchmarks.
But compare DeepSeek R1 32B to any model from 2021 and it's going to be significantly superior.
So we have the quality of models increasing while the resources they need decrease. In 5-10 years, do we get an LLM that loads onto a 16-32 GB video card and is simply capable of doing it all?
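For a rough sense of what that question amounts to, here's a quick sketch of the VRAM needed at different quantization levels; the 20% overhead factor is a loose assumption for KV cache and activations, not a measured figure:

```python
# Rough sketch (my assumptions, not vendor specs): what fits on a 16-32 GB
# card at a given quantization, counting weights plus ~20% headroom for
# KV cache and activations.
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for params in (8, 32, 70):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: ~{vram_gb(params, bits):.0f} GB")
```

By this estimate a 32B model at 4-bit lands around 19 GB, so the question is really whether a model of that size class can ever be "capable of doing it all".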
I think the best of both worlds is a sufficiently capable reasoning model with access to external tools and data that can perform CPU-based lookups for information that it doesn't possess.
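A minimal sketch of that loop, with every name (call_llm, lookup) a hypothetical stand-in rather than any real API:

```python
# Minimal sketch of the "reasoning model + CPU-side lookups" idea. All names
# here (call_llm, lookup) are hypothetical stand-ins, not any real API.
def lookup(query: str) -> str:
    """Stand-in for a cheap CPU-based lookup (local index, SQLite, etc.)."""
    knowledge = {"capital of France": "Paris"}
    return knowledge.get(query, "no result")

def call_llm(messages: list[dict]) -> dict:
    """Stub standing in for a real model call: asks for a lookup once,
    then answers using the tool result."""
    if messages[-1]["role"] == "tool":
        return {"content": f"The answer is {messages[-1]['content']}."}
    return {"tool": "lookup", "query": "capital of France"}

def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        reply = call_llm(messages)
        if "tool" in reply:  # model wants external data it doesn't possess
            messages.append({"role": "tool", "content": lookup(reply["query"])})
        else:
            return reply["content"]

print(answer("What is the capital of France?"))
```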
"Sir, I'm delighted to report that the productivity and insights gained outclass anything available from four years ago. We are clearly winning."
Personally, I'm rooting for RWKV / Mamba2 to pull through, somehow. There's been some work done to increase their reasoning depth, but transformers still beat them without much effort.
In terms of neurobiology, the Transformer architecture is more in line with the highly interconnected, global receptive field of neurons.
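A toy contrast between the two families: transformer attention keeps a growing KV cache (that global receptive field), while RWKV/Mamba-style models fold each token into a fixed-size state. This is simplified, unnormalized linear-attention math, not the actual RWKV or Mamba update:

```python
# Toy, unnormalized linear-attention recurrence, in the family that
# RWKV/Mamba-style models build on. Not the actual RWKV or Mamba update;
# it only illustrates fixed-size state vs. a transformer's growing KV cache.
import numpy as np

d = 8
rng = np.random.default_rng(0)
S = np.zeros((d, d))                   # the entire recurrent state
for _ in range(1000):                  # longer sequences cost time, not memory
    k, v, q = rng.standard_normal((3, d))
    S += np.outer(k, v)                # fold the new token into the state
    y = q @ S                          # read out: O(d^2) per token, O(n) total
print(S.shape, y.shape)                # state stays (8, 8) at any length
```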
Since Gemini CLI was recently released, many people on the "free" tier noticed that their sessions immediately got downgraded from Gemini 2.5 Pro to Flash "due to high utilization". I asked Gemini itself about this, and it reported that the finite GPU/TPU resources in Google's cloud infrastructure can get oversubscribed for Pro usage. Google (no secret here) has a subscription option for higher-tier customers to request guaranteed provisioning for the Pro model. As capacity is approached, they must throttle lower-tier (including free) sessions down to the less resource-intensive models.
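From the client side, that throttling looks like the fallback pattern below. The client.generate call and CapacityError are hypothetical placeholders, not the real Gemini CLI internals:

```python
import time

class CapacityError(Exception):
    """Hypothetical stand-in for a 429/oversubscribed response."""

MODELS = ["gemini-2.5-pro", "gemini-2.5-flash"]  # preferred model first

def generate_with_fallback(client, prompt: str, retries: int = 2) -> str:
    for model in MODELS:
        for attempt in range(retries):
            try:
                return client.generate(model=model, prompt=prompt)
            except CapacityError:
                time.sleep(2 ** attempt)  # brief backoff, then retry
        # this model stayed oversubscribed; degrade to the next, cheaper one
    raise RuntimeError("all models throttled")
```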
Price is one lever to pull once capacity becomes constrained. Yet, as the top-voted comment on this post explains, it's not honest to simply label this as a price increase. They raised Flash pricing on input tokens but lowered pricing on output tokens up to certain limits -- which gives credence to the theory that they are trying to shape demand to better match their capacity.
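A quick back-of-envelope illustration of that demand-shaping reading; the preview and GA prices used here ($0.15 -> $0.30 per million input tokens, $3.50 -> $2.50 per million output tokens) are widely reported figures, taken as assumptions:

```python
# Back-of-envelope: whether the 2.5 Flash repricing is an increase depends on
# your input:output token mix. Prices are per million tokens, assumed from
# widely reported preview vs. GA figures.
OLD = {"in": 0.15, "out": 3.50}
NEW = {"in": 0.30, "out": 2.50}

def cost(prices: dict, m_in: float, m_out: float) -> float:
    return prices["in"] * m_in + prices["out"] * m_out

for m_in, m_out, label in [(10, 0.5, "input-heavy (classification)"),
                           (1, 2, "output-heavy (generation)")]:
    old, new = cost(OLD, m_in, m_out), cost(NEW, m_in, m_out)
    print(f"{label}: ${old:.2f} -> ${new:.2f} ({(new - old) / old:+.0%})")
```

Under these assumptions, input-heavy batch workloads get about 31% more expensive while output-heavy ones get about 26% cheaper, which is exactly the demand-shaping shape you'd expect.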
Context size is the real killer when you look at running open source alternatives on your own hardware. Has anything even come close to the 100k+ range yet?
One of the clearest examples is DeepSeek v3. DeepSeek has mentioned that its pricing of $0.27/$1.10 per million input/output tokens carries an 80% profit margin, so their cost is roughly 90% below the price of Gemini Flash. And Gemini Flash is very likely a smaller model than DeepSeek v3.
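Checking that arithmetic with the numbers in the comment; the Gemini 2.5 Flash GA prices ($0.30 input / $2.50 output per million tokens) are assumed for the comparison:

```python
# Checking the comment's arithmetic: if DeepSeek's $0.27/$1.10 per million
# tokens carries an 80% margin, implied cost is price * (1 - 0.80). The
# Gemini 2.5 Flash GA prices ($0.30/$2.50) are assumed for comparison.
deepseek_price = {"in": 0.27, "out": 1.10}
margin = 0.80
gemini_price = {"in": 0.30, "out": 2.50}

for kind in ("in", "out"):
    cost = deepseek_price[kind] * (1 - margin)
    print(f"{kind}: implied DeepSeek cost ${cost:.3f}/M "
          f"= {1 - cost / gemini_price[kind]:.0%} below Gemini Flash price")
```

That works out to about 82% below on input and 91% below on output, so the "roughly 90% less" claim holds for output tokens under these assumptions.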