
Posted by dnhkng 7 hours ago

Show HN: How I Topped the HuggingFace Open LLM Leaderboard on Two Gaming GPUs(dnhkng.github.io)
183 points | 65 comments
d0100 2 hours ago|
I wonder if joining layers from the "organs" of different models could further enhance the results
dnhkng 3 hours ago||
Here's an extract, the core TL;DR for a feel of the article.

"And now for the weirdness: There was never the case where any Transformer layer would have seen the output from a future layer!

Layer 10 is trained on layer 9’s output distribution. Layer 60 is trained on layer 59’s. If you rearrange them — feeding layer 60’s output into layer 10 — you’ve created a distribution the model literally never saw during training.

The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.

Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.

Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that’s robust to architectural rearrangement. The fact that Goliath 120B was built from 16-layer blocks made me suspect the input and output ‘processing units’ were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn’t work.

If that was true, maybe I didn’t need to teach a model new facts to make it smarter. I didn’t need fine-tuning. I didn’t need RLHF. I just needed to give it more layers to think with."
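
To make the block structure concrete, here is a toy sketch of a Goliath-style interleaved layout (hypothetical block size and overlap for illustration only, not the actual recipe): alternate 16-layer blocks from two donor models, stepping back so adjacent blocks share an overlapping range.

```python
# Hypothetical frankenmerge layout: interleave 16-layer blocks from two
# donor models A and B, overlapping by 8 layers. Only layer indices are
# computed here -- no real weights are loaded.

def frankenmerge_layout(n_layers=80, block=16, overlap=8):
    """Return (model, start, end) blocks that alternate between donors,
    each new block starting `overlap` layers before the previous one ended."""
    layout = []
    start, model = 0, "A"
    while start < n_layers:
        end = min(start + block, n_layers)
        layout.append((model, start, end))
        if end == n_layers:
            break
        model = "B" if model == "A" else "A"
        start = end - overlap  # step back to create the overlap
    return layout

layout = frankenmerge_layout()
for model, lo, hi in layout:
    print(f"model {model}: layers {lo}-{hi - 1}")
```

Note how every seam feeds one model's layer-N output into the other model's layer N-8 input, a distribution neither donor ever saw in training.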

dongecko 3 hours ago||
What a great read! You got me at the base64 oddity. I also stumbled over this, while trying to dodge some LLM limitation. (was trying to generate images in a time before multimodal was a thing. it only worked to a degree).
cootsnuck 5 hours ago||
Super cool. Love seeing these writeups of hobbyists getting their hands dirty, breaking things, and then coming out on the other side of it with something interesting.
kovek 4 hours ago||
Is this similar to send 48656c6c6f2c20686f772061726520796f753f in the prompt? As done here: https://youtu.be/GiaNp0u_swU?si=m7-LZ7EYxJCw0k1-
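
(For reference, the string in that video is hex-encoded ASCII rather than Base64; a quick decode shows the hidden prompt:)

```python
# The payload from the linked video is plain hex (Base16), not Base64.
payload = "48656c6c6f2c20686f772061726520796f753f"
print(bytes.fromhex(payload).decode("ascii"))  # Hello, how are you?
```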
dnhkng 3 hours ago|
Yes, I was using Base64 to 'jailbreak' LLMs back in the day (so similar), and that's what led me to the hypothesis, and months of GPU use to find the optimal layer duplication!
Aditya_Garg 4 hours ago||
Wild stuff and great read

Do you think karpathy's autoresearch would be useful here?

janalsncm 3 hours ago|
Based on Karpathy’s writeup the auto research would not have found this. He tells the agent to improve the model and training loop with a five minute time limit, but honestly this “hack” is so far out of distribution that it seems really unlikely an agent would find this.
goodmythical 5 hours ago||
Isn't this similar to models that have "double check the answer"?

First pass runs your input through, second pass runs its output as input?

Just, in double-checking it presumably runs the entire stack, while you're trying to skip the translation steps and only double-check the logic?

sva_ 4 hours ago||
I don't think it's mathematically equivalent, or even close, because the context/logprobs will be very different, since you only produce one token per pass. I'd say the token itself carries a lot less information than the signal propagating through the residual stream of transformer blocks.
dnhkng 5 hours ago||
Maybe, but the interesting thing for me is that this only works with specific 'chunks' of the transformer layer stack. More or less than the optimal leads to worse performance.
tjwei 5 hours ago||
Really interesting discovery, especially the part about base64. Reminds me of this: Transformer Layers as Painters https://arxiv.org/abs/2407.09298
blourvim 7 hours ago||
I am not really an ML dev, so I don't understand most of it. It does sound ridiculous that it would even work. Brilliant work and a great article, I enjoyed reading it.

This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?

dnhkng 5 hours ago|
No worries, happy to discuss anyway :)

MoE (mixture of experts) is an architecture that forces sparsity (not all 'neurons' are active during the forward pass).

This is pretty much orthogonal to that; it works with dense and MoE models, by repeating 'vertical' sections of the transformer stack.
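
As a toy illustration (the layer indices here are made up, not the ranges from the post, which were found experimentally): a self-merge keeps the input and output layers once and repeats a middle chunk to deepen the stack.

```python
# Toy sketch of layer duplication / self-merge: repeat a middle chunk
# of a model's own layers. Indices are hypothetical for illustration.

def self_merge(n_layers=48, repeat_range=(16, 32), times=2):
    """Return the layer-index order of the merged model, with the chunk
    in repeat_range appearing `times` times in total."""
    lo, hi = repeat_range
    head = list(range(0, hi))                    # layers 0..hi-1 (first copy included)
    middle = list(range(lo, hi)) * (times - 1)   # extra copies of the chunk
    tail = list(range(hi, n_layers))             # remaining layers, once
    return head + middle + tail

order = self_merge()
print(len(order))  # 64 layers: 48 original + one extra 16-layer copy
```

Since the repeated chunk reuses existing weights, the merged model needs no training at all, only more memory and compute per forward pass.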

lordmathis 4 hours ago|
That's cool. I tried the b64 thing on my local qwen3.5 27b without access to tools and it did it.