Posted by grep_it 3 days ago
Also note that top matches under vector similarity are not reciprocal: A's top-scoring item can be B, while B has many other items nearer to it than A. Think of 2D space, where you have a cluster of points and one point nearby but set a bit apart.
Unfortunately I don't think this technique works very well for actually discovering duplicate accounts, because oftentimes people post just a few comments from fake accounts, so there is not enough data. The exception is when someone consistently uses another account to cover their identity.
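A minimal sketch of that 2D picture, in case it helps. It assumes plain Euclidean distance rather than the cosine similarity used on the actual style vectors, and the four points and the nearest_neighbor helper are made up for illustration:

```python
import numpy as np

def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf  # a point is not its own neighbor
    return int(np.argmin(dists))

# Hypothetical data: a tight cluster (indices 0-2) plus one point set apart (index 3).
points = np.array([
    [0.0, 0.0],
    [0.2, 0.0],
    [0.1, 0.15],
    [1.0, 0.0],
])

outlier = 3
best = nearest_neighbor(points, outlier)  # the outlier's top match
back = nearest_neighbor(points, best)     # that match's own top match

print(f"top match of {outlier} is {best}")
print(f"top match of {best} is {back}")
print("reciprocal" if back == outlier else "not reciprocal")
```

The distance between any two points is the same in both directions, but the ranking is not: the outlier's best match sits inside the cluster and has closer neighbors of its own.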
EDIT: at the end of the post I added the visual representations of pg and montrose.
I worked on a search engine for patents that used the first, our evaluations showed it was much better than other patent search engines and we had no trouble selling it because customers could feel the difference in demos.
I tried dimensionality reduction on the BERT vectors, and in every case I tried it made relevance worse. (BERT has already learned a lot, which is being thrown away, and there isn't more to learn from my particular documents.)
I don't think either of these helps with "finding articles authored by the same person", because one assumes the same person always uses the same words, whereas documents about the same topic use synonyms that will be turned up by (1) and (2). There is a big literature on determining authorship based on style:
https://en.wikipedia.org/wiki/Stylometry
[1] With https://sbert.net/ this is so easy.
https://news.ycombinator.com/item?id=43662951
https://news.ycombinator.com/item?id=43662889
If you keep this up, we're going to have to ban you again.
If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.
Edit: ChatGTP, my bad
Not very useful for newer users like me :/
https://antirez.com/hnstyle?username=gfd&threshold=20&action...
zawerf (Similarity: 0.7379)
ghj (Similarity: 0.7207)
fyp (Similarity: 0.7197)
uyt (Similarity: 0.7052)
I typically abandon an account once I reach 500 karma, since that unlocks the ability to downvote. I'm now very self-conscious about the words I overuse...