Posted by matt_d 3 days ago
> Abstract: [...] Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization;
> (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions.
> Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.
> [...] flash-kmeans achieves up to 17.9X end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33X and over 200X, respectively.
k-means clustering > Algorithms > Variations: https://en.wikipedia.org/wiki/K-means_clustering#Variations
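For anyone trying to picture the two ideas in the abstract, here is a rough CPU-side NumPy sketch. It is not the paper's CUDA kernels, and the function names `assign_online_argmin` and `update_sorted` and the `chunk` parameter are made up for illustration. The first function keeps a running best distance and index per point while streaming over centroid chunks, so the full N x K distance matrix is never stored. The second sorts points by label and reduces each contiguous segment, which stands in for replacing per-point scatter-adds.

```python
import numpy as np

def assign_online_argmin(X, C, chunk=64):
    """Nearest-centroid labels without materializing the full N x K matrix."""
    n = X.shape[0]
    best_d = np.full(n, np.inf)            # running minimum squared distance
    best_k = np.zeros(n, dtype=np.int64)   # running argmin
    x_sq = (X * X).sum(axis=1)
    for start in range(0, C.shape[0], chunk):
        Cb = C[start:start + chunk]
        # Squared distances to this chunk only: ||x||^2 - 2 x.c + ||c||^2
        d = x_sq[:, None] - 2.0 * (X @ Cb.T) + (Cb * Cb).sum(axis=1)[None, :]
        k_local = d.argmin(axis=1)
        d_local = d[np.arange(n), k_local]
        better = d_local < best_d
        best_d[better] = d_local[better]
        best_k[better] = k_local[better] + start
    return best_k

def update_sorted(X, labels, C_old):
    """Centroid update via sort + segmented sum instead of scatter-add."""
    order = np.argsort(labels, kind="stable")   # points grouped by cluster
    sorted_labels = labels[order]
    # First position of each cluster that actually has points
    present, starts = np.unique(sorted_labels, return_index=True)
    sums = np.add.reduceat(X[order], starts, axis=0)
    counts = np.diff(np.append(starts, len(labels)))
    C_new = C_old.copy()                        # empty clusters keep old centroid
    C_new[present] = sums / counts[:, None]
    return C_new
```

One Lloyd iteration is then `labels = assign_online_argmin(X, C)` followed by `C = update_sorted(X, labels, C)`. Leaving empty clusters at their old centroid is a choice made for this sketch, not something the abstract specifies.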
Also, analogous to flash attention: a linear speedup in the big-O sense, under the typical algorithmic-complexity computing model, can be a polynomial speedup in measured wall-clock time due to memory hierarchy differences.
Still small compared to exponential differences, but for an NP-Hard problem, a linear 100x speedup is the difference between practically computable vs. not. There are a ton of things I'd wait 2 hours for that I wouldn't wait a week for.
From what I've seen, I had the impression that Yinyang k-means was the best way to take advantage of the sparsity.