Posted by xaskasdf 3 days ago
Had to build a custom quantized format (PSNT), hack endianness, write a tokenizer pipeline, and most of the PS2 SDK from scratch (releasing that separately). The model itself is also custom — a 10M param Llama-style architecture I trained specifically for this.
And it works. On real hardware.
I doubt the VUs can help with inference given their small scratchpad sizes and instruction set though, haha.
Curious about two things if you can share:

- what's your per-token latency on real hardware?
- how much quality loss came from PSNT quantization vs the fp16 baseline?

Either way, this is peak hacker energy — shipping on actual hardware makes it 10x cooler.
PS: Thank you! And I forgot to mention: PSNT also supports BitNet models, though they work like crap for now.
Very cool that it supports BitNet too, even if results are rough right now — feels like there's a lot of room to tune there over time. When you do fix tok/sec, are you planning to post per-stage timings too (tokenizer, weight stream, matmul, sampling)? Would be awesome to see where the biggest bottleneck is on real hardware.