
Posted by codexon 4 hours ago

Top AI models fail at >96% of tasks (www.zdnet.com)
13 points | 5 comments
tfehring 1 hour ago
The success rate is already up from 2.5% in Q3 2025 to 3.75% with Opus 4.5 (November 2025), and is presumably even higher with Opus 4.6 and/or GPT-5.3-Codex: https://www.remotelabor.ai
codexon 4 hours ago
This paper creates a new benchmark composed of real remote work tasks sourced from the freelancing website Upwork. The best commercial LLMs, including Opus, GPT, Gemini, and Grok, were tested.

The models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given their performance on other micro-benchmarks, their results on this benchmark will probably not be much different.

kolinko 2 hours ago
They didn't test Opus at all, only Sonnet.

One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." I can't imagine how Opus 4.5 could've failed that.

Venn1 4 hours ago
ChatGPT: when you want spellcheck to argue with you.
zb3 2 hours ago
You think they don't? You think AI can replace programmers, today?

Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051