
Posted by mayerwin 8 hours ago

Arena AI Model ELO History (mayerwin.github.io)
Hi HN,

I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.

We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.
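For context on what those ratings mean: Arena-style scores come from pairwise human votes between models. A minimal sketch of the classic online Elo update conveys the intuition (note: LM Arena actually fits Bradley-Terry-style ratings over all battles rather than updating sequentially; the `k=32` constant and function name here are illustrative, not Arena's method):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """One classic online Elo update.
    score_a: 1.0 if A wins the vote, 0.5 for a tie, 0.0 if B wins.
    k=32 is an illustrative constant, not Arena's actual setting."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Two equally rated models; A wins the head-to-head vote:
elo_update(1000, 1000, 1.0)  # -> (1016.0, 984.0)
```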

Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a lot of iterations to get the chart to look nice on mobile as well. Optional dark mode included.
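The one-curve-per-lab logic described above can be sketched roughly like this (the data shape, names, and values are hypothetical placeholders, not the project's actual code or schema):

```python
from collections import defaultdict

# Hypothetical input: (date, lab, model, rating) rows from the
# scraped leaderboard history. Values are made up for illustration.
history = [
    ("2024-01", "LabA", "a-1", 1200),
    ("2024-01", "LabA", "a-2", 1180),
    ("2024-01", "LabB", "b-1", 1210),
    ("2024-02", "LabA", "a-2", 1230),
    ("2024-02", "LabB", "b-1", 1190),
]

def flagship_curves(rows):
    """For each lab, keep only the highest-rated model per date,
    yielding one continuous (date, rating) series per lab."""
    best = defaultdict(dict)  # lab -> {date: best rating seen}
    for date, lab, model, rating in rows:
        current = best[lab].get(date)
        if current is None or rating > current:
            best[lab][date] = rating
    return {lab: sorted(series.items()) for lab, series in best.items()}

curves = flagship_curves(history)
# curves["LabA"] -> [("2024-01", 1200), ("2024-02", 1230)]
```

Collapsing to the per-date maximum is what makes generational jumps (a new flagship overtaking the old one) and slow decays (the top model's own rating drifting down) visible on a single line.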

However, I have a specific data blind spot that I'm hoping this community might have insights on.

Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts, safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.

Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?

I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback, or pointers to datasets!

61 points | 48 comments
Thomashuet 4 hours ago
It seems to be a US-only thing; Chinese models and Mistral don't show any downward trend.
TurdF3rguson 1 hour ago
Sure they do. Most models are on a downward trend because newer models are moving into top spots.
patall 4 hours ago
Wouldn't it be really weird if an open-weight model dropped in performance? The weights don't change, so any drop would rather point to the Elo ranking itself.
refulgentis 5 hours ago
Is this slop? It has wildly aggressive language that agrees with a subset of pop sentiment, re: models being “nerfed”. It promises to reveal this nerfing. Then it goes on to… provide an innocuous mapping of LM Arena scores that always go up?
ninjalanternshk 2 hours ago
It links to the GitHub repo for the project, and while it’s not inconceivable that an AI bot would create and populate a functioning public GitHub repo, it’s pretty unlikely.