Posted by dnhkng 9 hours ago
This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?
MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).
This is pretty much orthogonal to that; it works with both dense and MoE models, since it repeats 'vertical' sections of the transformer stack (rough sketch at the end of this comment).
The results are of course open to interpretation, but they suggest to me that the models develop 'organs' for processing different types of data, and that without duplicating the whole 'organ' you don't get the benefits.
This is quite different from what you usually see, which comes from layer-ablation experiments. Thoughts?
I will make another post if the topic is popular; it's pretty geeky though, even more than my usual blog posts...
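For the curious, a quick, untested sketch of what I mean by repeating a 'vertical' section. `TinyBlock` and the slice indices are made-up stand-ins for real transformer layers; with a real model you'd slice its pretrained layer list the same way:

    import torch
    import torch.nn as nn

    class TinyBlock(nn.Module):
        # Stand-in for one transformer layer (attention details omitted).
        def __init__(self, d):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

        def forward(self, x):
            return x + self.ff(self.norm(x))

    def repeat_vertical_slice(layers, start, end, times=2):
        # Build a deeper stack by repeating layers[start:end] `times` times.
        # The repeated blocks share weights here (cheapest variant);
        # copy.deepcopy them instead if they should diverge during fine-tuning.
        new_layers = list(layers[:start])
        for _ in range(times):
            new_layers += list(layers[start:end])
        new_layers += list(layers[end:])
        return nn.ModuleList(new_layers)

    d, n_layers = 64, 12
    base = nn.ModuleList([TinyBlock(d) for _ in range(n_layers)])
    deeper = repeat_vertical_slice(base, start=4, end=8)  # 12 -> 16 blocks

    x = torch.randn(2, 10, d)
    for blk in deeper:
        x = blk(x)
    print(len(base), len(deeper), x.shape)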
Hopefully the cost per GPU will drop soon and we'll see people properly play, but frankly the "middle section" of a model, layers 2(ish) to (n-1)(ish), can be shuffled up/down and left/right and still perform well.
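Roughly the plumbing for that shuffle, with untrained stand-in blocks (so this only demonstrates the mechanics; on a real pretrained model you'd measure perplexity rather than output drift):

    import random
    import torch
    import torch.nn as nn

    d, n = 64, 12
    # Stand-in stack: residual blocks in place of real transformer layers.
    layers = nn.ModuleList([nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d)) for _ in range(n)])

    def run(stack, x):
        for blk in stack:
            x = x + blk(x)  # residual connection, as in a transformer
        return x

    # Shuffle only the "middle section": keep the first two and the last block fixed.
    middle = list(range(2, n - 1))
    random.shuffle(middle)
    order = [0, 1] + middle + [n - 1]
    shuffled = nn.ModuleList([layers[i] for i in order])

    x = torch.randn(2, 10, d)
    ref, alt = run(layers, x), run(shuffled, x)
    print(order, float((ref - alt).norm() / ref.norm()))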
The fun one will be an LLM router for LLM layers, picking the best stack of layers to apply to each input, but frankly that would need the years and years of training that the author hints at.
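Something like this toy, hypothetical version of routing over layers (hard argmax routing per input; a trainable version would need a differentiable relaxation and, as said, a lot of training):

    import torch
    import torch.nn as nn

    class LayerRouter(nn.Module):
        # A small gate looks at the pooled hidden state and picks which
        # block to apply next, per sequence, for a fixed number of steps.
        def __init__(self, d, blocks):
            super().__init__()
            self.blocks = nn.ModuleList(blocks)
            self.gate = nn.Linear(d, len(blocks))

        def forward(self, x, steps=4):
            for _ in range(steps):
                scores = self.gate(x.mean(dim=1))   # (batch, n_blocks)
                idx = scores.argmax(dim=-1)         # hard routing, per sequence
                out = torch.stack([self.blocks[int(i)](x[b]) for b, i in enumerate(idx)])
                x = x + out
            return x

    d = 64
    blocks = [nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d)) for _ in range(6)]
    router = LayerRouter(d, blocks)
    print(router(torch.randn(2, 10, d)).shape)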
The one that's still out of grasp is how to combine/manipulate per-layer K,V caches into a globally coherent state. I.e. if layers can be moved up/down, why can't the cached K,V be swapped/combined with different projections? Global K,V caches work, but they have to be _huge_ in order to prevent model collapse even on something as simple as owt (OpenWebText).
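The shape of the experiment I have in mind, purely hypothetical: a learned adapter that remaps one layer's cached K/V into another layer's space, so a relocated layer could reuse the cache. The adapter would still need to be trained somehow, e.g. distilled against the target layer's true cache.

    import torch
    import torch.nn as nn

    n_heads, d_head, seq = 4, 16, 10
    d_kv = n_heads * d_head

    # Per-layer KV caches for the same prefix (normally produced by the model).
    cache = {
        0: (torch.randn(1, seq, d_kv), torch.randn(1, seq, d_kv)),  # (K, V) at layer 0
        1: (torch.randn(1, seq, d_kv), torch.randn(1, seq, d_kv)),  # (K, V) at layer 1
    }

    class KVAdapter(nn.Module):
        # Maps (K, V) cached at one layer into another layer's KV space.
        def __init__(self, d):
            super().__init__()
            self.k_proj = nn.Linear(d, d, bias=False)
            self.v_proj = nn.Linear(d, d, bias=False)

        def forward(self, k, v):
            return self.k_proj(k), self.v_proj(v)

    # "Move" layer 1: feed it layer 0's cache, remapped by the adapter.
    adapter = KVAdapter(d_kv)
    k0, v0 = cache[0]
    k_hat, v_hat = adapter(k0, v0)
    print(k_hat.shape, v_hat.shape)  # same shapes as the original cache entries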