Posted by grep_it 4/16/2025
- remove very high-frequency, non-specific words from the comparison bags, because they don’t distinguish much, carry less semantic value, and may skew the data
- remove stop words (NLP definition of stop words)
- perform stemming/tokenization/depluralization etc (again, NLP standard)
- implement commutativity and transitivity in the similarity function
- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity
- consider word bigrams, etc
- weight variations and misspellings higher as distinguishing signals
What are your ideas?
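For anyone who wants to play with a few of the suggestions above, here is a minimal sketch, not the tool's actual method. It assumes a tiny hand-written stop-word list, a deliberately crude depluralizer standing in for a real stemmer, and cosine similarity over word and bigram counts. The names `STOP_WORDS`, `crude_stem`, `features`, and `cosine` are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real NLP lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "this"}


def crude_stem(word):
    """Very rough depluralization; a stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def features(text, use_bigrams=True):
    """Bag of stemmed words (stop words removed) plus optional word bigrams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop stop words first, then stem what is left.
    tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    bag = Counter(tokens)
    if use_bigrams:
        # Bigrams are formed over the filtered tokens, so they can span removed stop words.
        bag.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return bag


def cosine(a, b):
    """Cosine similarity of two count bags; symmetric by construction."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    x = features("The cats chase the dogs in the yard")
    y = features("A cat chases a dog")
    print(round(cosine(x, y), 3), round(cosine(y, x), 3))
```

Cosine gives you the commutative part for free, since `cosine(x, y)` equals `cosine(y, x)`. Transitivity is a different story: A being close to B and B being close to C does not force A close to C, so that would need something extra, such as clustering on top of the pairwise scores.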
Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.
Maybe it's some "like attracts like" phenomenon.
Is there anything that can be inferred from that? Is my writing less unique, so it ends up being more similar to more people?
Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side effect of his prolific posting, so that he matches better with a lot more people?
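One way to poke at the prolific-posting hypothesis is a toy simulation. This is only a sketch under a strong assumption: every simulated user draws words from the same Zipf-like distribution, so any difference in score comes purely from how much text they wrote. The vocabulary, sample sizes, and the `simulated_user` helper are all invented here and say nothing about how the actual tool computes its scores.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two word-count bags."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def simulated_user(rng, n_words, vocab, weights):
    """Draw n_words from a shared word distribution (hypothetical stand-in for a comment history)."""
    return Counter(rng.choices(vocab, weights=weights, k=n_words))


rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = [f"w{i}" for i in range(2000)]
weights = [1.0 / (rank + 1) for rank in range(len(vocab))]  # Zipf-like word frequencies

small_a = simulated_user(rng, 200, vocab, weights)
small_b = simulated_user(rng, 200, vocab, weights)
large_a = simulated_user(rng, 20000, vocab, weights)
large_b = simulated_user(rng, 20000, vocab, weights)

sim_small = cosine(small_a, small_b)  # two occasional posters
sim_mixed = cosine(small_a, large_a)  # occasional vs. prolific poster
sim_large = cosine(large_a, large_b)  # two prolific posters

print(f"small vs small: {sim_small:.3f}")
print(f"small vs large: {sim_mixed:.3f}")
print(f"large vs large: {sim_large:.3f}")
```

In this toy setup, the bigger samples have less sampling noise, so their vectors sit closer to the shared underlying distribution and tend to score higher against everyone. Real users obviously do not share one distribution, so treat this as a hint that volume alone can inflate scores, not as proof that it explains the >0.87 matches.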
Thanks for the interesting tool!
Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?