Posted by grep_it 4/16/2025
don't +0.9339
This is impressive and scary. Obviously I had to create a throwaway to say this.
It's also a tool for wannabe impersonators to hone their style-mimicking skills!
(The Satawalese language has 460 speakers, most of whom live on Satawal Island in the Federated States of Micronesia.)
What a profiler would do to identify someone, I imagine, requires much more. Like the ability to recognize someone's tendency to play the victim to leverage social advantage in awkward situations.
But limiting it to the top couple hundred words probably does limit me to sounding like a pretentious dickhole, as I often use "however", "but", and "isn't". Corrections are a little too frequent in my post history.
I'd expect things might be a tiny bit looser with precision if something small like stop words were removed. Though it'd be interesting to do the opposite: if you were only measuring stop words, would that show a unique cadence?
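For anyone who wants to try the stop-words-only idea, here is a minimal sketch. It is not how the tool in the post works; the stop-word list, `stopword_profile`, and `cosine` are all made up for illustration, and a real experiment would want a much longer list and far more text per user.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, purely illustrative.
STOP_WORDS = ["the", "a", "and", "but", "however", "is", "isn't",
              "of", "to", "i", "it", "that"]

def stopword_profile(text):
    """Relative frequency of each stop word, ignoring every other word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in STOP_WORDS)
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in STOP_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

a = stopword_profile("However, it isn't the case. But I think that it is.")
b = stopword_profile("The cat and the dog went to the park of the town.")
print(cosine(a, a), cosine(a, b))
```

Whether the resulting profiles are distinctive enough to count as a "cadence" is exactly the open question; the sketch only shows how one might measure it.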
tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
Also note that vector similarity is not reciprocal: one item can have a top-scoring match, yet that match may have many more items nearer to it, as in 2D space when you have a cluster of points and one point nearby but a bit farther out.
Unfortunately, I don't think this technique works very well for actually discovering duplicate accounts, because people often post just a few comments from fake accounts. So there is not enough data, except in the case where someone consistently uses another account to cover their identity.
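A toy illustration of that asymmetry, with hypothetical 2D points (three in a tight cluster, plus one outlier `x`). The `rank` helper and the "rank gap" are just one possible way to put a number on "I'm closer to them than they are to me"; nothing here comes from the original tool.

```python
import math

# Hypothetical 2D points: a tight cluster (a, b, c) and an outlier x.
points = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (0.0, 0.1),
    "x": (1.0, 0.0),
}

def neighbours(name):
    """Other points sorted from nearest to farthest (Euclidean distance)."""
    origin = points[name]
    others = [n for n in points if n != name]
    return sorted(others, key=lambda n: math.dist(origin, points[n]))

def rank(of, in_list_of):
    """1-based position of `of` in the neighbour list of `in_list_of`."""
    return neighbours(in_list_of).index(of) + 1

nearest_to_x = neighbours("x")[0]
# Rank gap: how much further down x sits in its nearest neighbour's list.
gap = rank("x", nearest_to_x) - rank(nearest_to_x, "x")
print(nearest_to_x, rank("x", nearest_to_x), gap)  # prints: b 3 2
```

Here `b` is the top match for `x`, but `x` is last in `b`'s list because `a` and `c` are much closer to `b`. A gap of zero would mean the relationship is mutual at the same rank.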
EDIT: at the end of the post I added the visual representations of pg and montrose.
I worked on a search engine for patents that used the first. Our evaluations showed it was much better than other patent search engines, and we had no trouble selling it because customers could feel the difference in demos.
I tried dimensional reduction on the BERT vectors, and in every case I tried it made relevance worse. (BERT has already learned a lot, which is being thrown away; there isn't more to learn from my particular documents.)
I don't think either of these helps with "finding articles authored by the same person", because one assumes the same person always uses the same words, whereas documents about the topic use synonyms that will be turned up by (1) and (2). There is a large literature on determining authorship based on style:
https://en.wikipedia.org/wiki/Stylometry
[1] With https://sbert.net/ this is so easy.