Posted by grep_it 3 days ago
https://scikit-learn.org/stable/modules/generated/sklearn.ma...
I think other methods are more fashionable today
https://scikit-learn.org/stable/modules/manifold.html
particularly multidimensional scaling, but personally I think t-SNE plots are less pathological (they don't have as many of those crazy cusps that make me think it's projecting down from a higher-dimensional surface that's near-parallel to the page).
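If you want to eyeball the two projections yourself, here's a minimal sketch using scikit-learn's TSNE; the vectors are random stand-ins, not real document embeddings:

    # toy sketch: project stand-in high-dimensional vectors with t-SNE
    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 768))  # stand-in for document vectors

    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], s=5)
    plt.show()

Swapping TSNE for sklearn.manifold.MDS on the same data is the quickest way to compare the two layouts.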
After processing documents with BERT, I really like the clusters generated by the simple, old k-means algorithm:
https://scikit-learn.org/stable/modules/generated/sklearn.cl...
It has the problem that it always finds 20 clusters if you set k=20, so a cluster that really oughta be one big cluster might get treated as three little clusters, but the clusters I get from it reflect the way I see things.
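To make the k=20 point concrete, here's a rough sketch; the embeddings are random placeholders standing in for whatever your BERT pipeline produces:

    # k-means returns exactly n_clusters clusters, natural or not
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    doc_vectors = rng.normal(size=(2000, 768))  # placeholder embeddings

    km = KMeans(n_clusters=20, n_init=10, random_state=0)
    labels = km.fit_predict(doc_vectors)
    print(np.bincount(labels))  # 20 cluster sizes, even for structureless data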
You have three points nearby, and a fourth a bit more distant. Point 4's best match is point 1, but point 1's best matches are points 2 and 3.
redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose
montrose 0.8640020787715912
redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg
pg 0.8639097809791565
So while cosine similarity is commutative, the quantization steps lead to slightly different results. But the difference is 0.000092, which in practical terms is not important. Redis can use non-quantized vectors via the NOQUANT option in VADD, but this makes the vector elements take 4 bytes per component: given that the recall difference is minimal, it is almost never worth it.
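For reference, and if I'm reading the vector sets docs right, NOQUANT is chosen at VADD time; the key name and values here are made up for illustration:

    redis-cli VADD hn_fingerprint_fp32 VALUES 3 0.1 0.2 0.3 pg NOQUANT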
This is impressive and scary. Obviously I had to create a throwaway to say this.
don't +0.9339
It's also a tool for wannabe impersonators to hone their writing-style mimicry skills!
(The Satawalese language has 460 speakers, most of whom live on Satawal Island in the Federated States of Micronesia.)
What a profiler would do to identify someone, I imagine, requires much more, like the ability to recognize someone's tendency to play the victim to leverage social advantage in awkward situations.
But limiting it to the top couple hundred words probably does limit me to sounding like a pretentious dickhole, as I often use "however", "but", and "isn't". Corrections are a little too frequent in my post history.
I'd expect things might be a tiny bit looser in precision if something small like stop words were removed. Though it'd be interesting to do the opposite: if you measured only stop words, would that show a unique cadence?
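A quick sketch of the stop-words-only idea, using scikit-learn's built-in English stop word list; the two example users are invented:

    # fingerprint texts by stop-word frequencies alone
    from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import normalize

    corpus = {
        "user_a": "well i don't think that it was ever really about that",
        "user_b": "it was the best of times, it was the worst of times",
    }
    vec = CountVectorizer(vocabulary=sorted(ENGLISH_STOP_WORDS))
    X = normalize(vec.fit_transform(list(corpus.values())), norm="l1")
    print(cosine_similarity(X))  # cadence similarity from stop words alone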
tablespoon is close, but is missing a top-50 mutual (mikeash). In some ways this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
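One standard way to quantify that property is hubness: count how often each point appears in everyone else's top-k lists (the k-occurrence). Anti-hubs are exactly the points that pick close neighbours who don't pick them back. A sketch on random stand-in vectors:

    # k-occurrence: how many top-k lists each point shows up in
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))  # stand-in for user fingerprint vectors

    k = 50
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
    _, idx = nn.kneighbors(X)
    k_occ = np.bincount(idx[:, 1:].ravel(), minlength=len(X))  # col 0 is self
    print(k_occ.min(), k_occ.mean(), k_occ.max())  # k_occ << k marks an anti-hub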