
Posted by grep_it 4/16/2025

Reproducing Hacker News writing style fingerprinting (antirez.com)
325 points | 155 comments
SnorkelTan 4/17/2025|
I remember the original post the author is referring to. I was captivated by it and thought it was cool. When I ran the original tool mentioned in the post, it detected one of my alts that I had forgotten about. OP's newer implementation, using different methodologies, did not detect the alt. For reference, the alt was created in 2010 and its last post was in 2012. Perhaps my writing style has changed?
SchemaLoad 4/17/2025|
I usually just create a new account every time I get a new computer or reinstall the OS. I thought most of the results here were noise, but on closer inspection it found 10 accounts I forgot I had. Actually incredible, and a little scary, how well it works.
wruza 4/17/2025||
Dang's analysis was funny:

don't site comment we here post that users against you're

Quite a stance, man :)

And me clearly inarticulate and less confident than some:

it may but that because or not and even these

I noticed that randomly remembered usernames tend to produce either lots of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression about them.

Boogie_Man 4/16/2025||
No matches higher than .7-something and no mutual matches. Let's go boys, I'm a special unique snowflake.
morkalork 4/16/2025||
I wonder if such an analysis could tease apart the authors of intentionally anonymous publications. Things like peer review notes for papers or legal opinions (afaik in countries that are not the USA, the authors of a dissenting supreme court decision are not named).
atiedebee 4/16/2025||
It looks like I don't use the word "and" very often. I do notice that I tend to avoid concatenating sentences like that, although it is likely there just isn't enough data on my account, as I haven't been on HN for that long.
0xWTF 4/16/2025||
There are some interesting similarities in o.g. accounts aaronsw, pg, and jedberg.

  - aaronsw and jedberg share danielweber
  - aaronsw and jedberg share wccrawford
  - aaronsw and pg share Natsu
  - aaronsw and pg share mcphage
byearthithatius 4/16/2025||
This is so cool. The user who talks most like me (and I can confirm he does) is ajb257.
nottorp 4/16/2025||
Interesting, the top 3 accounts most similar to me are two US users and an Australian. I'm Romanian (and living in Romania). I probably read too many books and too much news in English :)

Well, that, and I've worked a lot with Americans over text-based communication...

jmward01 4/16/2025||
I think an interesting use of this is potentially finding LLMs trained to mimic the style of a person. Unfortunately, these days just because a post has my style doesn't mean it was me. I promise I am not a bot. Honest.
formerly_proven 4/16/2025|
I'm surprised no one has made this yet with a clustered visualization.
PaulHoule 4/16/2025||
Personally I like this approach a lot

https://scikit-learn.org/stable/modules/generated/sklearn.ma...

I think other methods are more fashionable today

https://scikit-learn.org/stable/modules/manifold.html

particularly multidimensional scaling, but personally I think t-SNE plots are less pathological (they don't have as many of those crazy cusps that make me think it's projecting down from a higher-dimensional surface that is near-parallel to the page)

After processing documents with BERT, I really like the clusters generated by the simple and old k-means algorithm

https://scikit-learn.org/stable/modules/generated/sklearn.cl...

It has the problem that it always finds 20 clusters if you set k=20, and a cluster which really ought to be one big cluster might get treated as three little clusters, but the clusters I get from it reflect the way I see things.
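For reference, a rough sketch of that kind of pipeline (the embeddings below are random stand-ins for BERT output, and the parameters are just illustrative):

  # Illustration only: embed, cluster with k-means, project with t-SNE.
  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.manifold import TSNE

  # Stand-in for BERT-style document embeddings (e.g. 768 dimensions).
  embeddings = np.random.rand(500, 768)

  # k-means returns exactly k clusters whether or not k is "right".
  labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(embeddings)

  # 2D t-SNE projection for plotting, to be colored by cluster label.
  coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)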

antirez 4/16/2025|||
Redis supports random projection to a lower dimensionality, but the reality is that projecting a 350d vector into 2d is nice yet does not remotely capture the "reality" of what is going on. Still, it is a nice idea to try at some point. However, I would do it with more than the top 350 words: when I used 10k words it captured interests much more than style, so a 2D projection of that is going to be much more interesting, I believe.
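Not the Redis code path, just the idea: a minimal sketch of random projection, where the vectors are random stand-ins for the per-user style vectors:

  # Illustration only: project 350d vectors to 2D with a fixed random matrix.
  import numpy as np

  rng = np.random.default_rng(42)
  style_vectors = rng.random((1000, 350))    # stand-in for per-user vectors
  projection = rng.normal(size=(350, 2))     # random Gaussian projection matrix
  coords = style_vectors @ projection        # 1000 x 2, ready to scatter-plot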
layer8 4/16/2025||
Given that some matches are “mutual” and others are not, I don’t see how that could translate to a symmetric distance measure.
antirez 4/16/2025||
Imagine the 2D space: it has the same property!

You have three points close to each other, and a fourth one a bit more distant. Point 4's best match is point 1, but point 1's best matches are points 2 and 3.
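A tiny numeric sketch of the same effect (the points are made up):

  # Made-up 1D points: nearest-neighbor relations need not be mutual.
  import numpy as np

  points = np.array([0.0, 1.0, 1.5, 5.0])    # point 4 sits far to the right
  for i, p in enumerate(points):
      dists = np.abs(points - p)
      dists[i] = np.inf                       # ignore self
      print(f"point {i + 1} -> nearest is point {np.argmin(dists) + 1}")
  # point 4's nearest neighbor is point 3, but point 3's nearest is point 2.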

layer8 4/16/2025||
Good point, but the similarity score between mutual matches is still different, so it doesn’t seem to be a symmetric measure?
antirez 4/16/2025||
Your observation is really acute: the small difference is due to quantization. When we search for element A, which is int8-quantized by default, the code path de-quantizes it, then re-quantizes it and searches. This produces a small loss of precision, like this:

  redis-cli -3 VSIM hn_fingerprint ELE pg WITHSCORES | grep montrose

  montrose 0.8640020787715912

  redis-cli -3 VSIM hn_fingerprint ELE montrose WITHSCORES | grep pg

  pg 0.8639097809791565

So while cosine similarity is commutative, the quantization steps lead to slightly different results. But the difference is .000092, which in practical terms is not important. Redis can use non-quantized vectors via the NOQUANT option in VADD, but this makes the vector elements use 4 bytes per component: given that the recall difference is minimal, it is almost always not worth it.
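For the curious, a toy illustration of the effect (not Redis's actual quantization code, just an int8 round trip on random vectors):

  # Illustration only: quantizing the query to int8 and back nudges cosine similarity.
  import numpy as np

  def quantize_roundtrip(v):
      scale = np.abs(v).max() / 127.0
      return np.round(v / scale).astype(np.int8) * scale

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

  rng = np.random.default_rng(0)
  a, b = rng.normal(size=350), rng.normal(size=350)
  print(cosine(a, b))                        # exact similarity
  print(cosine(quantize_roundtrip(a), b))    # query re-quantized: slightly off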
