Posted by latchkey 23 hours ago
It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this but at the same token (pun-intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.
So much compute is under utilized waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired on the right prompts
So I could see something like this where the neural chipset has an LLM that cant be so easily updated baked into it, until you get a new device
I also think the dynamic would be really different if model inference can run at ridiculous speeds. You could make a genetic algorithm loop around it, so it can generate a population of proposals at each step, then have those tested and whittled down iteratively. If inference happens at thousands of tokens per second, then from user perspective it would still be really fast, and even a small model could solve complex problems.
even having something like opus 4.8 locally would completely change the landscape