Posted by grep_it 4/16/2025

Reproducing Hacker News writing style fingerprinting (antirez.com)
325 points | 155 comments
keepamovin 4/17/2025|
We can improve this. antirez has made a highly compelling PoC, but it could be refined for authorship attribution, judging by the number of misses reported in the comments here and by how it compares to the greater accuracy of the original post antirez refers to. I’m no expert, but some ideas:

- remove super high frequency non specific words from the comparison bags, because they don’t distinguish much, have less semantic value and may skew the data

- remove stop words (NLP definition of stop words)

- perform stemming/tokenization/depluralization etc (again, NLP standard)

- implement commutativity and transitivity in the similarity function

- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity

- consider word bigrams, etc

- weight variations and misspellings higher as distinguishing signals

What are your ideas?
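
A few of the ideas above (stop-word removal, word bigrams, and a similarity function that gives the same score in both directions) can be sketched in Python. This is only a minimal illustration, not antirez's implementation: the tiny stop-word list, the `tokenize`, `features`, and `cosine` helpers, and the sample strings are all assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real NLP lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "this"}


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def features(text):
    """Bag of words plus word bigrams, computed after dropping stop words."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(words + bigrams)


def cosine(a, b):
    """Cosine similarity between two Counters; cosine(a, b) == cosine(b, a)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical sample comments, just to show the call pattern.
x = features("I think the tool is neat, but the matches are noisy")
y = features("The matches are noisy, but I think it is a neat tool")
print(cosine(x, y))
```

Cosine similarity is symmetric by construction, which covers the commutativity point. Transitivity is not something a pairwise score gives you for free, so it would need a separate step such as clustering.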

declan_roberts 4/17/2025||
This is exactly why HN needs to allow us to delete accounts.
gkbrk 4/17/2025|
It wouldn't change anything though. Unless you delete your comment / account a few minutes after you post, it's gonna get scraped and saved into a DB almost instantly. After that, the fact that HN deleted them won't save you from this.
qsort 4/16/2025||
Have you tried to analyze whether there is a correlation between "closeness" according to this metric and how often users chat in the same thread? I recognize some usernames that are reported as being similar to me, I wonder if there's some kind of self-selection at play.
selcuka 4/17/2025|
Maybe we like comments written closer to our style more, and that motivates us to respond to them.
MivLives 4/17/2025||
Managed to find an alt I forgot I made and gave up using years ago. I do wonder about other high up people. Like what about our mutual histories makes us have similar word usage? Are we from the same areas or did we hang out in similar places online?
seabombs 4/17/2025||
This is a bit tangential but I've noticed lots of comments aping the style of Matt Walsh. Not just on HN either, but probably more here than other places I visit.

Anyway, I guess this would be useful to cluster the "Matt Walsh"-y commenters together.

brookst 4/17/2025|
Matt Walsh? I mean, sure, maybe he’s your guy. Or maybe he’s… not. Matt Levine, though, that’s the style to ape.
wild_egg 4/16/2025||
Very cool. Also a bit surprising — two of my matches are people I know IRL.
antirez 4/16/2025|
Are you all from the same town? Another user reported this finding.
wild_egg 4/16/2025||
We had all met in the same city but weren't originally from there or live there any longer.

Maybe some "like attracts like" phenomenon.

jackphilson 4/17/2025|||
Very interesting phenomenon. I feel like the term 'phenomenon' is too unsubstantial for something like this.
LinuxBender 4/16/2025||
I think it would be interesting to run this tool against Reddit, 4chan and Tweeter to find astroturf accounts. Does it look like a real browser to those sites or would it be blocked?
ziddoap 4/16/2025||
I noticed that in my top 20 similar users, the similarity rank/score/whatever are all >~0.83. However, randomly sampling from users in this thread, some top 20s are all <~0.75, or all roughly 0.8, etc.

Is there anything that can be inferred from that? Is my writing less unique, so it ends up being more similar to more people?

Also, someone like tptacek has a top 20 with matches all >0.87. Would this be a side effect of his prolific posting, so he matches better with a lot more people?

antirez 4/16/2025|
It's not "less unique" as the structure of the sentence is what matters: the syntax. But you simply tend to use words with balanced frequency. It's not a bad thing.
ziddoap 4/16/2025||
Yeah, definitely not a bad thing. This just piqued my curiosity and is in a field I'm not super familiar with, so I'm just trying to wrap my head around it.

Thanks for the interesting tool!

lnauta 4/17/2025||
That makes me wonder two things. Firstly, if you can use this to find LLM-generated content, which I guess would need similar instructions. Imagine instructing it to talk like a pirate, it would be quite different from a generic response.

Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?

giancarlostoro 4/16/2025|
I tried my name, and I don't think a single "match" is any of my (very rarely used) throwaway alts ;) I guess I have a few people I talk like?
delichon 4/16/2025||
I got 3 correct matches out of 20, and I've had about 6 accounts total (using one at a time), with at least a fair number of comments in each. I guess that means that my word choices are more outliers than yours or there is just more to match. So it's not really good enough to reliably identify alt accounts, but it is quite suggestive.
giancarlostoro 4/16/2025||
I think if you rule out insanely common words, it might get scary accurate.
lolinder 4/16/2025||
Actually, the way that these things work is usually by focusing exclusively on the usage patterns of very common (top 500) words. You get better results by ignoring content words in favor of the linking words.
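
Comparing users only on how often they use very common words, as described above, could look roughly like this in Python. It is a simplified sketch, not the code behind the tool: `top_words`, `profile`, `cosine`, and the toy `users` data are hypothetical, and stylometry methods such as Burrows' Delta also standardize the frequencies (z-scores) before comparing them, which is left out here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(texts, n=500):
    """Return the n most frequent words across all texts (mostly linking words)."""
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t))
    return [w for w, _ in counts.most_common(n)]


def profile(text, vocab):
    """Relative frequency of each vocabulary word in one user's text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy data (made up); a real run would use every comment by each user.
users = {
    "alice": "well i think that the tool is neat but it is not perfect",
    "bob": "i think it is neat and that the tool is not bad",
    "carol": "wow cool stuff love seeing projects like these here",
}
vocab = top_words(users.values(), n=10)
profiles = {name: profile(text, vocab) for name, text in users.items()}
print(cosine(profiles["alice"], profiles["bob"]))
print(cosine(profiles["alice"], profiles["carol"]))
```

Restricting the vocabulary to the most frequent words means topic-specific vocabulary mostly drops out, so the score reflects habits of phrasing rather than what the comments are about.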
giancarlostoro 4/16/2025||
Interesting, I think it also doesn't help that outside of a throwaway once in a blue moon, I don't really use alts...
antirez 4/16/2025|||
When they are rarely used (a small amount of total words produced), they don't have meaningful statistical info for a match, unfortunately. A few users here reported finding actual duplicated accounts they used in the past.
nozzlegear 4/17/2025||
I've had several accounts over the last decade, but this wasn't able to find any of the old ones, even after expanding the results to 50 users. I personally chalk it up to my own writing style changing (intentionally and unintentionally) over the years.