And one of the biggest ironies of AI scaling is that where scaling succeeds the most in improving efficiency, we realize it the least, because we don't even think of it as an option. An example: a Transformer (or RNN) is not the only way to predict text. We have scaling laws for n-grams and text perplexity (most famously, from Jeff Dean et al at Google back in the 2000s), so you can actually ask the question, 'how much would I have to scale up n-grams to achieve the necessary perplexity for a useful code writer competitive with Claude Code, say?' This is a perfectly reasonable, well-defined question, as high-order n-grams could in theory write code given enough data and big enough lookup tables, and so it can be answered. The answer will look something like 'if we turned the whole earth into computronium, it still wouldn't be remotely enough'. The efficiency ratio is not 10:1 or 100:1 but closer to ∞:1. The efficiency gain is so big no one even thinks of it as an efficiency gain, because you just couldn't do it before using AI! You would have humans do it, or not do it at all.
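A quick back-of-the-envelope sketch of why the n-gram route blows up (my illustrative numbers, not anything from the original comment):

```python
# Rough sizing of an n-gram lookup table. Vocabulary size and the orders
# chosen are assumptions for illustration only.
import math

vocab = 50_000                 # assumed tokenizer vocabulary size
for n in (3, 5, 20, 100):      # n-gram order; useful code needs long contexts
    # An order-n model conditions on vocab**(n-1) possible contexts.
    log10_contexts = (n - 1) * math.log10(vocab)
    print(f"n={n:>3}: ~10^{log10_contexts:.0f} possible contexts")

# For scale: ~10^80 atoms in the observable universe, so even a maximally
# sparse table for n in the hundreds is physically out of reach.
```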
Here is NOAA on the improvements:
> 8% better predictions for track, and 10% better predictions for intensity, especially at longer forecast lead times — with overall improvements of four to five days.(1)
I’d love someone to explain what these measurements mean though. Does better track mean 8% narrower angle? Something else? Compared to what baseline?
And am I reading this right that that improvement is measured at the point 4-5 days out from landfall? What’s the typical lead time for calling an evacuation, more or less than four days?
(1)https://www.noaa.gov/news/new-noaa-system-ushers-in-next-gen...
E.g. you want to find a really good design. Designs are fairly easy to generate, but expensive to evaluate and score. Say we can quickly generate millions of designs, but evaluating one can take 100 ms to 1 s, with simulations that are not easy to parallelize on GPUs. We ended up training models that try to predict said score. They don't predict things perfectly, but you can be 99% sure that a design's actual score is within a certain distance of the predicted score.
So if normally you want to get the 10 best designs out of your 1 million, we can now first have the model predict the best 1000, and you can be reasonably certain your top 10 is a subset of these 1000. So you only need to run your simulation on those 1000.
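A minimal sketch of that shortlist-then-verify pattern. Everything here is a placeholder (the toy "simulation", the 16 design parameters, a random forest standing in for whatever surrogate they actually trained), not the commenter's real pipeline:

```python
# Sketch: use a cheap learned surrogate to shortlist candidates, then run
# the expensive simulation only on the shortlist.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def expensive_score(design):
    """Stand-in for a 100 ms - 1 s simulation; here just a cheap function."""
    return -np.sum((design - 0.3) ** 2)

# 1. Label a small training set with the expensive simulator.
train_designs = rng.random((5_000, 16))            # 16 design parameters each
train_scores = np.array([expensive_score(d) for d in train_designs])

surrogate = RandomForestRegressor(n_estimators=100, n_jobs=-1)
surrogate.fit(train_designs, train_scores)

# 2. Score 1M candidate designs with the surrogate (fast, approximate).
candidates = rng.random((1_000_000, 16))
predicted = surrogate.predict(candidates)

# 3. Keep the predicted top 1000 and only simulate those.
shortlist = np.argsort(predicted)[-1000:]
true_scores = np.array([expensive_score(candidates[i]) for i in shortlist])
top10 = shortlist[np.argsort(true_scores)[-10:]]
print("Top-10 design indices:", top10)
```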
This training part of LLMs is still mostly Greek to me, so if anyone could explain that claim as true or false and the reasons why, I'd appreciate it.
OpenAI has had $20B in revenue this year, and it seems likely to me they have spent considerably less than that on compute for training GPT5. Probably not $5M, but quite possibly under $1B.
There's no reason you couldn't generate training data for a model by getting output from another model. You could even get the probability distribution of output tokens from the source model and train the target model to repeat that probability distribution, instead of a single word. That'd be faster, because instead of it learning to say "Hello!" and "Hi!" from two different examples, one where it says hello and one where it says hi, you'd learn to say both from one example that has a probability distribution of 50% for each output.
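A minimal sketch of that soft-label idea (standard knowledge distillation with a KL loss; random logits and shapes are placeholders, not any specific lab's recipe):

```python
# Sketch: train a student to match a teacher's full next-token distribution
# instead of a single "correct" token. Model definitions are assumed/omitted.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # student_logits, teacher_logits: (batch, vocab) for the next token.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student): zero when the student reproduces the teacher's
    # distribution, e.g. 50% "Hello!" / 50% "Hi!" learned from one example.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy usage with random logits standing in for real model outputs.
vocab = 32_000
student_logits = torch.randn(4, vocab, requires_grad=True)
teacher_logits = torch.randn(4, vocab)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```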
Sometimes DeepSeek said its name is ChatGPT. This could be because they used Q&A pairs from ChatGPT for training, or because they scraped conversations other people posted where they were talking to ChatGPT. Or for unknown reasons where the model just decided to respond that way, like mixing up the semantics of wanting to say "I'm an AI" with all the scraped data referring to AI as ChatGPT.
Short of admission or leaks of DeepSeek training data it's hard to tell. Conversely, DeepSeek really went hard into an architecture that is cheap to train, using a lot of weird techniques to optimize their training process for their hardware.
Personally, I think they did. Research shows that a model can be greatly improved with a relatively-small set of high quality Q&A pairs. But I'm not sure the cost evaluation should be influenced that much, because the ChatGPT training price was only paid once, it doesn't have to be repaid for every new model that cribs its answers.
"Area for future improvement: developers continue to improve the ensemble’s ability to create a range of forecast outcomes."
Someone else noted the models are fairly simple.
My question is "what happens if you scale up to attain the same levels of accuracy throughout? Will it still be as efficient?"
My reading is that these models work well in other regions but I reserve a certain skepticism because I think it's healthy in science, and also because I think those ultimately in charge have yet to prove reliable judges of anything scientific.
I've done some work in this area, and the answer is probably 'more efficient, but not quite as spectacularly efficient.'
In a crude, back-of-the-envelope sense, AI-NWP models run about three orders of magnitude faster than notionally equivalent physics based NWP models. Those three orders of magnitude divide approximately evenly between three factors:
1. AI-NWP models produce much sparser outputs compared to physics-based models. That means fewer variables and levels, but also coarser timesteps. If a model has to run roughly 12x as often to produce an output every 30 minutes rather than every 6 hours, that's an order of magnitude right there.
2. AI-NWP models are "GPU native," while physics-based models emphatically aren't. Hypothetically running physics-based models on GPUs would gain most of an order of magnitude back.
3. AI-NWP models have fantastic arithmetic intensity (FLOPs per byte moved) compared to physics-based NWP models, since the former are "matrix-matrix multiplications all the way down." Traditional NWP models perform relatively little work per grid point in comparison, which puts them on the wrong (badly memory-bandwidth-limited) side of the roofline plots.
I'd expect a full-throated AI-NWP model to give up most of the gains from #1 (to have dense outputs), and dedicated work on physics-based NWP might close the gap on #2. However, that last point seems much more durable to me.
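A rough sketch of the roofline argument in point 3, with purely illustrative FLOP and byte counts rather than profiles of any real NWP code:

```python
# Sketch: compare arithmetic intensity (FLOPs per byte moved) of a dense
# matmul block vs. a simple finite-difference stencil update.
def matmul_intensity(n, bytes_per_elem=4):
    flops = 2 * n**3                         # n x n matmul: n^3 multiply-adds
    bytes_moved = 3 * n**2 * bytes_per_elem  # read A, B; write C (ideal reuse)
    return flops / bytes_moved

def stencil_intensity(flops_per_point=13, bytes_per_elem=4):
    # 7-point stencil, assuming each point is read once and written once
    # thanks to caching of neighbours.
    bytes_moved = 2 * bytes_per_elem
    return flops_per_point / bytes_moved

print(f"matmul (n=4096): ~{matmul_intensity(4096):.0f} FLOPs/byte")
print(f"stencil update:  ~{stencil_intensity():.1f} FLOPs/byte")
# A modern GPU needs on the order of tens of FLOPs/byte to stay compute-bound,
# so the matmul-heavy model sits on the flat part of the roofline while the
# stencil-style model is stuck against the memory-bandwidth wall.
```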
Even when you include training, the payoff period is not that long. Operational NWP is enormously expensive because high-resolution models run under soft real-time deadlines; having today's forecast tomorrow won't do you any good.
The bigger problem is that traditional models have decades of legacy behind them, and getting them to work on GPUs is nontrivial. That means that in a real way, AI model training and inference comes at the expense of traditional-NWP systems, and weather centres globally are having to strike new balances without a lot of certainty.
There's an interesting parallel to Formula One, where there are limits on the computational resources teams can use to design their cars, and where they can use an aerodynamic model that was previously trained to get pretty good outcomes with less compute use in the actual design phase.
Assuming you’re not throwing the whole thing out after one forecast, it is probably better to reduce runtime energy usage even if it means using more for one-time training.
One of the big benefits of both the single-run (AIGFS) and ensemble (AIGEFS) models is their speed and the reduced computation time required. Weather modeling is hard, and these models should be used as complementary to deterministic models, as they all have their own strengths and weaknesses. They run at the same 0.25-degree resolution as the ECMWF AIFS models, which were introduced earlier this year and have been successful[4].
Edit: The Spring 2025 forecasting experiment results are available here[6].
[1] https://www.weatherbell.com/
[2] https://www.youtube.com/watch?v=47HDk2BQMjU
[3] https://www.youtube.com/watch?v=DCQBgU0pPME
[4] https://www.ecmwf.int/en/forecasts/dataset/aifs-machine-lear...
[5] https://www.tropicaltidbits.com/analysis/models/
[6] https://repository.library.noaa.gov/view/noaa/71354/noaa_713...
Even before LLMs got big, a lot of the machine learning research being published consisted of models that underperformed SOTA (which was the case for weather modeling for a long time!) or models that are far, far larger than they need to be (e.g., this [1] Nature paper using 'deep learning' for aftershock prediction being bested by this [2] Nature paper using one neuron).
I'm not saying this is an LLM, margalabargala is not saying this is an LLM. They only said they hoped that they did not integrate an LLM into the weather model, which is a reasonable and informed concern to have.
Sigmar is correctly pointing out that they're using a transformer model, and that transformers are effective for modeling things other than language. (And, implicitly, that this _isn't_ adding a step where they ask ChatGPT to vibe check the forecast.)
The quoted NOAA Administrator, Neil Jacobs, published at least one falsified report during the first Trump administration to save face for Trump after he claimed Hurricane Dorian would hit Alabama.
It's about as stupid as replacing magnetic storage tapes with SSDs or HDDs, or using a commercial messaging app for war communications and adding a journalist to it.
It's about as stupid as using .unwrap() in production software impacting billions, or releasing a buggy and poorly-performing UX overhaul, or deploying a kernel-level antivirus update to every endpoint at once without a rolling release.
But especially, it's about as stupid as putting a language model into a keyboard, or an LLM in place of search results, or an LLM to mediate deals and sales in a storefront, or an LLM in a $700 box that is supported for less than a year.
Sometimes, people make stupid decisions even when they have fancy titles, and we've seen myriad LLMs inserted where they don't belong. Some of these people make intentionally malicious decisions.
Which is surprising to me because I didn't think it would work for this; they're bad at estimating uncertainty for instance.
FGN (the model that is 'WeatherNext 2'), FourCastNet 3 (NVIDIA's offering), and AIFS-CRPS (the model from ECMWF) have all moved to train on whole ensembles, using a continuous ranked probability score (CRPS) loss function. Minimizing the CRPS minimizes the integrated squared difference between the cumulative distribution functions of the prediction and the truth, so it's effectively teaching the model to have uncertainty proportional to its expected error.
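For reference, a small sketch of the CRPS itself, using the standard ensemble estimator; the exact loss implementations in FGN / FourCastNet 3 / AIFS-CRPS differ in detail:

```python
# Sketch: CRPS estimated from a finite ensemble, per variable and grid point,
# via CRPS = E|X - y| - 0.5 * E|X - X'|.
import numpy as np

def crps_ensemble(members, obs):
    """members: (M,) ensemble forecasts; obs: scalar truth."""
    members = np.asarray(members, dtype=float)
    # E|X - y|: how far the ensemble sits from the truth on average.
    term1 = np.mean(np.abs(members - obs))
    # 0.5 * E|X - X'|: rewards an ensemble whose spread matches its error.
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

print(crps_ensemble([2.0, 2.5, 3.1, 1.8], obs=2.4))
```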
GenCast is a more classic diffusion-based model trained on a mean-squared-error-type loss function, much like any of the image diffusion models. Nonetheless it performed well.
They aren't, but both of them are transformer models.
nb GAN usually means something else (Generative Adversarial Network).
I was looking at this part in particular:
> And while Transformers [48] can also compute arbitrarily long-range computations, they do not scale well with very large inputs (e.g., the 1 million-plus grid points in GraphCast’s global inputs) because of the quadratic memory complexity induced by computing all-to-all interactions. Contemporary extensions of Transformers often sparsify possible interactions to reduce the complexity, which in effect makes them analogous to GNNs (e.g., graph attention networks [49]).
Which kind of makes a soup of the whole thing and suggests that LLMs/Graph Attention Networks are "extensions to transformers" and not exactly transformers themselves.
(Well, not necessarily architecture. Training method?)
For Gencast ('WeatherNext Gen', I believe), the repository provides instructions and caveats (https://github.com/google-deepmind/graphcast/blob/main/docs/...) for inference on GPU, and it's generally slower and more memory intensive. I imagine that FGN/WeatherNext 2 would also have similar surprises.
Training is also harder. DeepMind has only open-sourced the inference code for its first two models, and getting a working, reasonably-performant training loop written is not trivial. NOAA hasn't retrained its weights from scratch, but the fine-tuning they did re: GFS inputs still requires the full training apparatus.
> so what’s AI about this that wasn’t AI previously?
The weather models used today are physics-based numerical models. The machine learning models from DeepMind, ECMWF, Huawei and others are a big shift from the standard numerical approach used over the last few decades.
So are they essentially training a neural net on a bunch of weather data and getting a black box model that is expensive to train but comparatively cheap to run?
Are there any other benefits? Like is there a reason to believe it could be more accurate than a physics model with some error bars?
Surprisingly, the leading AI-NWP forecasts are more accurate than their traditional counterparts, even at large scales and long lead times (i.e. the 5-day forecast).
The reason for this is not at all obvious, to the point I'd call it an open question in the literature. Large-scale atmospheric dynamics are a well-studied domain, so physics-based models essentially have to be getting "the big stuff" right. It's reasonable to think that AI-NWP models are doing a better job at sub-grid parameterizations and local forcings because those are the 'gaps' in traditional NWP, but going from "improved modelling of turbulence over urban and forest areas" (as a hypothetical example) to "improvements in 10,000 km-scale atmospheric circulation 5 days later" isn't as certain.
This isn't actually true, unless you're considering ML to be just linear regression, in which case we have been using "AI" for >100 years. "Advanced ML" with NN is what's being showcased here.
I suspect the nail in the coffin was the hurricane season, where NOAA's model was basically beaten by every major AI model. [0]
The GFS also just had its worst year in predicting hurricane paths since 2005. [1] That’s not a trend you want to continue.
[0] https://arstechnica.com/science/2025/11/googles-new-weather-...
[1] https://www.local10.com/weather/hurricane/2025/11/03/this-hu...
The answer: AI is not even covered, at least at the undergrad level. This is just a sample of one, so are any other universities educating future meteorologists on this subject?
https://www.nco.ncep.noaa.gov/pmb/products/gens/
https://www.emc.ncep.noaa.gov/emc/pages/numerical_forecast_s...
pywgrib https://www.cpc.ncep.noaa.gov/products/people/lxu/cookbook/a... containerized https://hub.docker.com/repository/docker/jmarks213/container...
We know how the current admin views science and with the cuts to NOAA done this year, I expect that trend to continue and widen. At least where I am, we get to see both.
A quick search didn't turn up anything about the model's skill or resolution, though I'm sure the data exists.