Posted by grep_it 4/16/2025

Reproducing Hacker News writing style fingerprinting (antirez.com)
325 points | 155 comments | page 4
Lerc 4/16/2025||
Used More Often by dang.

don't +0.9339

GenshoTikamura 4/17/2025||
Such a nice scientific way to detect and mute those who go against the agenda's grain, oh I mean don't contribute anything meaningful to the community
Uptrenda 4/17/2025||
I knew that this was possible but I always thought it took much more... effort? How do we mitigate this, then? Run our posts through an LLM?
throAwOfCou 4/17/2025||
I rotate hn accounts every year or two. In my top 4, I found 3 old alts.

This is impressive and scary. Obviously I had to create a throwaway to say this.

alganet 4/16/2025||
Cool tool. It's a shame I don't have other accounts to test it.

It's also a tool for wannabe impersonators to hone their writing-style mimicry skills!

shakna 4/16/2025|
I don't have other accounts, but still matched at 85+% accuracy for a half dozen accounts. Seems I don't have very original thoughts or writing style.
Frieren 4/16/2025|||
My guess is that people from the same region and similar background will have more and closer "alters". So, if you are a Californian-American, there are many people on HN who will write similarly to you. If you are a Satawalese speaker, you may be quite alone in your own group.

(The Satawalese language has 460 speakers, most of whom live on Satawal Island in the Federated States of Micronesia.)

pc86 4/16/2025||
You couldn't have just picked a European country, you had to flex on us with Satawalese? :)
alganet 4/16/2025|||
It's a fingerprinting tool, not a profiling tool. You can't draw such conclusions from it.

What a profiler would do to identify someone, I imagine, requires much more. Like the ability to recognize someone's tendency of playing the victim to leverage social advantage in awkward situations.

shakna 4/16/2025||
85% is surprisingly high for fingerprinting, hence the self-deprecation rather than insulting the author by poking at the tool's efficacy. I wouldn't have expected my Australian spelling, Oxford comma, or cadence to be anything close to the Californian Rust enthusiasts I apparently match against. Especially as there's no normalization happening - so even the Burrows-Delta method shouldn't match my use of "gaol" or "humour" that often.

But limiting to the top couple hundred words probably does leave me sounding like a pretentious dickhole, as I often use "however", "but", and "isn't". Corrections are a little too frequent in my post history.

I'd expect precision might be a tiny bit looser if something small like stop words were removed. Though it'd be interesting to do the opposite: if you were only measuring stop words, would that show a unique cadence?
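
A rough sketch of that stop-word idea, combined with the kind of z-score normalization Burrows' Delta uses: the stop-word list, the toy texts, and the function names below are placeholders, not anything from the article.

    # Burrows-style Delta restricted to stop words: z-score each stop-word
    # frequency against the whole population, then compare authors by mean
    # absolute difference (lower = more similar cadence).
    import numpy as np

    STOPWORDS = ["the", "and", "but", "however", "of", "to", "a", "in", "that", "is"]

    def stopword_profile(text):
        words = text.lower().split()
        total = max(len(words), 1)
        return np.array([words.count(w) / total for w in STOPWORDS])

    def burrows_delta(profiles, a, b):
        matrix = np.array(list(profiles.values()))
        mu, sigma = matrix.mean(axis=0), matrix.std(axis=0) + 1e-9
        za = (profiles[a] - mu) / sigma
        zb = (profiles[b] - mu) / sigma
        return float(np.mean(np.abs(za - zb)))

    profiles = {user: stopword_profile(text) for user, text in {
        "alice": "the cat sat on the mat but it is not however a mat that is new",
        "bob":   "however the dog is in the yard and that is a dog to behold",
    }.items()}
    print(burrows_delta(profiles, "alice", "bob"))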

alganet 4/16/2025||
I don't know dude, don't take it personally.
wizzwizz4 4/16/2025||
PhasmaFelis and mikeash have all matches mutual for the top 20, 30, 50 and 100. Are there other users like this? If so, how many? What's the significance of this, in terms of the shape of the graph?

tablespoon is close, but has a missing top 50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
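
One way to make the property concrete, as a sketch only: a user has "all matches mutual at k" if every one of their top-k neighbours also lists them in its own top-k. The `vectors` map of per-user style vectors below is hypothetical, standing in for the real data.

    # Is `user` inside the top-k list of every one of user's own top-k neighbours?
    import numpy as np

    def top_k(vectors, user, k):
        u = vectors[user]
        sims = {other: float(np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w)))
                for other, w in vectors.items() if other != user}
        return sorted(sims, key=sims.get, reverse=True)[:k]

    def all_mutual(vectors, user, k):
        return all(user in top_k(vectors, n, k) for n in top_k(vectors, user, k))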

tptacek 4/16/2025|
This is an interesting and well-written post but the data in the app seems pretty much random.
antirez 4/16/2025|
Thank you, tptacek. Thanks to the Internet Archive's cached results for "pg" from the post of three years ago, I was able to verify that the entries for "pg" are quite similar. Consider that it captures just the statistical patterns in very common words, so you are not likely to see users that you believe are "similar" to yourself. Notably: montrose may really be a secondary account of PG, and was also found as a cross-reference in the original work of three years ago.

Also note that vector similarity is not reciprocal: an item's top-scoring match may itself have many other items nearer to it, like in 2D space when you have a cluster of points and another point nearby but a bit apart.
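
A tiny numeric illustration of that asymmetry, with made-up coordinates rather than anything from the actual dataset: B is the point closest to A, but B sits in a cluster, so A is not B's closest point.

    # Nearest-neighbour relationships are not symmetric.
    import numpy as np

    points = {
        "A": np.array([10.0,  0.0]),   # the point sitting a bit apart
        "B": np.array([ 6.0,  0.0]),   # part of a small cluster with C and D
        "C": np.array([ 5.5,  0.2]),
        "D": np.array([ 5.8, -0.3]),
    }

    def nearest(name):
        dists = {k: np.linalg.norm(v - points[name]) for k, v in points.items() if k != name}
        return min(dists, key=dists.get)

    print(nearest("A"))  # B  (B is A's closest point)
    print(nearest("B"))  # D  (B's cluster-mates are closer to it than A is)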

Unfortunately I don't think this technique works very well for actual duplicate-account discovery, because oftentimes people post just a few comments from their fake accounts, so there is not enough data, except in the case where someone consistently uses another account to cover their identity.

EDIT: at the end of the post I added the visual representations of pg and montrose.

PaulHoule 4/16/2025|||
If you want to do document similarity ranking in general, finding nearby points in word-frequency space works, but not as well as (1) applying an autoencoder or another dimensional reduction technique to the vectors, or (2) running a BERT-like model and pooling over the documents [1].
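
A minimal sketch of approach (1), with TruncatedSVD standing in as the "other dimensional reduction technique" rather than an autoencoder; the documents are placeholders, not patent data.

    # Reduce sparse word-frequency vectors to a small dense space, then rank by
    # cosine similarity in the reduced space. Toy documents only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "A method for ranking patents by semantic similarity.",
        "Ranking patent documents with learned embeddings.",
        "Recipe for sourdough bread with a long cold proof.",
    ]

    X = TfidfVectorizer().fit_transform(docs)          # word-frequency space
    Z = TruncatedSVD(n_components=2).fit_transform(X)  # reduced dense space

    print(cosine_similarity(Z))                        # nearby points = similar docs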

I worked on a search engine for patents that used the first; our evaluations showed it was much better than other patent search engines, and we had no trouble selling it because customers could feel the difference in demos.

I tried dimensional reduction on the BERT vectors, and in every case I tried it made relevance worse. (BERT has already learned a lot that would be thrown away; there isn't more to learn from my particular documents.)

I don't think either of these helps with the "finding articles authored by the same person" problem, because that task assumes the same person always uses the same words, whereas documents about the same topic use synonyms that will be turned up by (1) and (2). There is a big literature on determining authorship based on style:

https://en.wikipedia.org/wiki/Stylometry

[1] With https://sbert.net/ this is so easy.
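
For reference, a minimal sketch of approach (2) using sentence-transformers; the model name and documents are assumptions for illustration, not the ones from the patent project.

    # Pooled BERT-style embeddings via sentence-transformers, then cosine similarity.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "A method for ranking patents by semantic similarity.",
        "Ranking patent documents with learned embeddings.",
        "Recipe for sourdough bread with a long cold proof.",
    ]

    # encode() pools token embeddings into one fixed-size vector per document.
    emb = model.encode(docs, normalize_embeddings=True)

    print(util.cos_sim(emb, emb))  # pairwise document similarity matrix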

antirez 4/16/2025||
Indeed, but my problem is: all those vector databases (including Redis!) are always thought of as useful in the context of learned embeddings, BERT, CLIP, ... But I really wanted to show that vectors are very useful and interesting outside that space. Now, I like encoders very much as well, but I have the feeling that Vector Sets, as a data structure, need to be presented as a general tool. So I really cherry-picked a use case that I liked and where neural networks were not present.

Btw, Redis Vector Sets natively support dimensionality reduction by random projection, in case the vector is too redundant. Yet, in my experiments, I found that using binary quantization (also supported) is a better way to save CPU/space compared to RP.
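
A minimal sketch of those two options, assuming a Redis 8 server with Vector Sets and redis-py's generic execute_command; the key names, toy vectors, and the exact VADD/VSIM argument order are my reading of the Vector Sets docs, not code from the post.

    # Load per-user style vectors into two Vector Sets: one using random-projection
    # reduction at insert time (REDUCE), one keeping full dimensionality but
    # storing each component as a single bit (BIN). Toy 3-dim vectors only.
    import redis

    r = redis.Redis()

    user_vectors = {
        "user_a": [0.031, 0.012, 0.044],
        "user_b": [0.029, 0.013, 0.041],
        "user_c": [0.002, 0.080, 0.001],
    }

    for user, vec in user_vectors.items():
        # Random projection from 3 dims down to 2 at insertion time.
        r.execute_command("VADD", "hn:rp", "REDUCE", 2,
                          "VALUES", len(vec), *vec, user)
        # Binary quantization: full dimensionality, 1 bit per component.
        r.execute_command("VADD", "hn:bin",
                          "VALUES", len(vec), *vec, user, "BIN")

    # Top 2 most similar users to user_a in the binary-quantized set.
    print(r.execute_command("VSIM", "hn:bin", "ELE", "user_a",
                            "WITHSCORES", "COUNT", 2))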