Show HN: How I Topped the HuggingFace Open LLM Leaderboard on Two Gaming GPUs

Posted by dnhkng 6 hours ago

Show HN: How I Topped the HuggingFace Open LLM Leaderboard on Two Gaming GPUs(dnhkng.github.io)

183 points | 65 comments

dnhkng 6 hours ago|

Author here. I found that duplicating a specific block of 7 middle layers in Qwen2-72B, without modifying any weights, improved performance across all Open LLM Leaderboard benchmarks and took #1. As of 2026, the top 4 models on that leaderboard are still descendants.

The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.

The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.

Happy to answer questions.

3abiton 24 minutes ago||

Man, that was such an enjoyable read. I loved your story on the wild server hunt, back when it was posted on r/localllama. I think one thing that is missing from the whole AI "discussion" is this train of thought of how we go from abstract mathetmatical formulation to intuitive understanding of the underlying functionality, and you showcased it beautifully in this article. Similarly to 3blue1brown who also did an amazing series on transformers. Kudos!

Balinares 3 hours ago|||

The idea that there may be a cognitive lingua franca hiding in the layers is fascinating and gives me hope for a neat idea: pluggable knowledge banks.

MoE notwithstanding, a model trained on the whole Internet and a few hundred thousands stolen books carries way more knowledge than is actually needed for any given workflow. It would be great if we could ship slimmed down models into which we'd plug the knowledge banks useful for today's work, and only those.

It would also mean that you could keep a model's knowledge fresh without retraining the whole of it.

gitpusher 2 hours ago|||

> pluggable knowledge banks.

plugs in knowledge bank LLM: ... I know kung fu.

pennomi 48 minutes ago||||

Agreed, I suspect that LLMs in the future will have separate (possibly standardized) decoding/encoding layers that plug into logic layers.

dormento 2 hours ago|||

This is interesting. Would this mean less space for hallucination as well (depending on the breadth of knowledge applied to a specific task)?

oliver_dr 1 hour ago||

[dead]

phn 1 hour ago|||

A fascinating thing for me after reading this is: how can it be that the "circuit input" is compatible with its output to the point where the performance improves? The training process never saw this particular connection just like it didn't see layer 60 output into layer 3 or whatever.

Great read, makes you wonder what else is encoded in these models that might be useful!

rapatel0 4 hours ago|||

I think you may have cracked latent space reasoning. I've had a hunch that something like this would work, but couldn't figure out how the training would back propagate. But you've shown that you just need to duplicate existing layers.

Have you tried a simple inline loop over the duplicated layers? Would be interesting to see performance. Also, would be interesting to compare with a MOE model. See if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.

dnhkng 3 hours ago|||

Yes, I've tried duplicating indvidual layers, but its not useful.

I think this hasn't been tried before because it's totally unintuitive that feeding the output from later layers into previous ones would actually do anything. And in fact, it usually is detrimental. I guess it takes really bored hobbyists with too much compute to check this stuff.

I have done some interesting work on applying multiple layer duplications in different regions of the model too, going so far as to train a meta-model (actually just XGBoost) to predict the merges. Seems to work, buts thats a whole other blog post.

This works with MoE, and yes, I would be interested in looking into this in more detail. But my wife might disagree with this time sink...

rapatel0 19 minutes ago||

Clarification. Duplicating multiple groups of layers in a "reasoning" loop

Normal:

  L1 -> L2 -> L3 -> L4 -> out

Unrolled (current framing):

  L1 -> [L2->L3] -> [L2->L3] -> L4 -> out

Looped (proposed):

       --<--loop----
       |           |

  L1 -> [L2->L3] x N --> L4 -> out

"reasoning loop"

Note: ascii rendering HN is not trivial

skerit 2 hours ago|||

This is kind of what LoopLM is doing, no? https://arxiv.org/abs/2510.25741

digdugdirk 4 hours ago|||

Super cool! Do you do any analysis or have any tools that help you identify these circuits? I came across this [1] recently, and wanted to try to identify specifically strong "circuits" in what seems to be a similar way to what you did.

[1] https://weightwatcher.ai/

dnhkng 4 hours ago||

I build my own analysis tools. I'm just finishing up running the current generation of LLMs (MiniMax M2.5 and the Qwen3.5 family), and then I will put it all on Github.

It less 'tool', than an assorted set of scripts, tailored to my unusual hardware setup. But it should be easy to extend; I would have released this earlier but I had the (stupid) idea to 'write a paper' on this. Aiming for that delayed this a year. Blogs are the way to go (for me).

user_7832 2 hours ago|||

Thanks for the post, really cool stuff you did!

Extra thanks for making it written in a readable and approachable way! I don't have much of a background in this topic, but still managed to understand about 70-80% of it :) You're a good writer

afpx 2 hours ago|||

Thank you so much for sharing this in a delightful blog post. One of the more enjoyable things I've read in a while. Very motivating!

jauntywundrkind 4 hours ago|||

The dual GH200 build was amazing. Awesome to see someone with such talent & flare in one area also doing great in another area. Thanks for noting that that was you. https://news.ycombinator.com/item?id=46222237

naasking 4 hours ago||

This layer duplication strikes me as a bit of "poor man's" version of looped language models:

https://ouro-llm.github.io/

Pretty cool though. LLM brain surgery.

dnhkng 3 hours ago||

Agrees, but one thing to note:

I really think from the experiments that 'organs' (not sure what to term this), develop during massive pretraining. This also means maybe looping the entire models is actually not efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?

This would give 'organs' space to develop.

radarsat1 17 minutes ago||

it also reminds me a bit of this diffusion paper [1] which proposes having an encoding layer and a decoding layer but repeats the middle layers until a fixed point is reached. but really there is a whole field of "deep equilibrium models" that is similar. it wouldn't be surprising if large models develop similar circuits naturally when faced with enough data.

finding them on the other hand is not easy! as you've shown, i guess brute force is one way.. it would be nice to find a short cut but unfortunately as your diagrams show, the landscape isn't exactly smooth.

I would also hypothesize that different circuits likely exist for different "problems" and that these are messy and overlapping so the repeated layers that improve math for example may not line up with the repeated layers that improve poetry or whatever, meaning the basic layer repetition is too "simple" to be very general. that said you've obviously shown that there is some amount of generalizing at work, which is definitely interesting.

[1] https://arxiv.org/abs/2401.08741

mysteria 33 minutes ago||

The astounding thing about Goliath wasn’t that is was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.

This wasn't something I really dug into in great detail but I remember my surprise back then at how all those merged models and those "expanded" models like Goliath still generated coherent output. IMO those were more community models made by small creators for entertainment rather than work, and only really of interest to the local LLM groups on Reddit, 4chan, and Discord. People might briefly discuss it on the board and say "that's cool" but papers aren't being written and it's less likely for academics or corpo researchers to notice it.

That being said I wonder if it's possible to combine the layers of completely different models like say a Llama and a Qwen and still get it to work.

Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.

Would using grammar parsing help here by forcing the LLM to only output the expected tokens (i.e. numbers)? Or maybe on the scoring side you could look at the actual probabilities per token to see how far the correct digit is.

Lerc 1 hour ago||

I have had broadly the same intuitions on the use of middle layers, but haven't had much luck with the tiny models that I can run on my hardware.

There's a video on YouTube https://www.youtube.com/watch?v=pDsTcrRVNc0

about a looping layer models, after watching that I poured some thoughts off the top of my head into a comment which, of course, promptly sunk without a trace. I'll repost the gist of them here.

If you gain benefit from looping layers, at some level every layer of parameters is in front of and behind every other, the conclusion must be that the order of the layers does not need to be fixed at all.

If you cycle through the layers multiple times, are you doing so for the benefit of a particular layer on a particular problem. If so, can you skip the other layers that don't add on repetition. If you can skip (and you can know when to skip), and you can repeat (and know when to repeat)

What you would need is a mechanism which can decide which layer is needed next. Is that then not a looping single layer MOE model? Storing the layers as a wide set of selectable options rather than a deep set of unconditional layers. You would be picking what the next layer should be (or exit the loop) the threshold for exit drops each iteration so it always eventually exits. With a tunable 'how hard to think' knob to adjust the threshold.

iamjackg 38 minutes ago||

I find the concept of LLM "brain surgery" fascinating, precisely because of how opaque the network is. One of the first things I did back when llama.cpp first got vision model support was hack the code to zero out (or otherwise modify) random numbers in the image embedding generated by the projector and then ask the LLM to describe the image. It was absolutely fascinating.

It would go from a normal description of the item in the picture to suddenly seeing people clapping in the background that were not there, or making up some other stuff. I kinda stopped after a while, but I should pick that back up and do a more coherent experiment to see if I can find any correlation between vector dimensions and "meaning."

Havoc 3 hours ago||

Crazy writeup.

Author is right about the base64 part. Does seem weird that it can decode and understand it at same time. And I guess what makes it weird that we just sorta accept that for say English and German this works ie normal use but when framed as base64 then it suddenly stops feeling intuitive

dinobones 2 hours ago|

why tho? it's just an alternate alphabet/set of symbols.

dnhkng 2 hours ago||

Because its generally expected that models only work 'in distribution', i.e. they work on stuff they have previously seen.

They almost certainly have never seen regular conversations in Base64 in their training set, so its weird that it 'just works'.

Does that make sense?

dormento 2 hours ago|||

For all we know, AI tech companies could theoretically have converted all of the "acquired" (ahem!) training set material into base64 and used it for training as well, just like you would encode say japanese romaji or hebrew written in the english alphabet.

dtj1123 1 hour ago|||

Unlikely that every company would have bothered to do this.

idiotsecant 48 minutes ago|||

'Yes, I know we already trained on all that data, but now I want you to convert to base64 and train it again! at enormous cost!'

broDogNRG 42 minutes ago|||

[dead]

hmokiguess 3 hours ago||

I really enjoyed reading this. I feel like generalists intuitively experience this exact thing so much throughout their lives because they must have this neuroanatomy you describe. There’s a certain geometry to knowledge that makes possible for this orthogonal movement and it is really fascinating to me. Thank you for publishing this, you made my day!

dnhkng 3 hours ago|

Thanks!

BloodAndCode 30 minutes ago||

Did you try repeating the same mid-layer block more than once?

If the gain comes from giving the model another pass over its internal representation, I'd expect some sort of diminishing-returns curve as you add more repeats. But if those layers form a spevific circuit, running it multiple times might actually break the computation.

It would be really interesting to see which of those regims the model falls into.

dnhkng 26 minutes ago|

Yes!

I tried that pretty early on, the its basically never good. Its described in the the section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...

efromvt 16 minutes ago||

If you found two disjoint sections that seemed positive on their own, did you try looping both separately in the same model? Wondering how localized the structures are.

hex4def6 1 hour ago||

I've gotta say, this writeup gives me an itchy feeling. It really does feel like poking around a synthetic brain at this point.

You could make the argument it's closer to the blocks of a CPU compared with a brain, and it's no different to copy-pasting some IP block for eg, HW JPEG decoding. But I feel like the difference here is we're 'discovering' these blocks / organs. They weren't designed, they were evolved.

dnhkng 23 minutes ago|

At some point I will clean up and share the dynamic layer modification code for oobabooga Text-Generation-WebuUI.

You can enter the setting, and apply new re-layering architectures. Its very weird chatting with these brain-damaged models.

WithinReason 4 hours ago||

Here is a paper that made a similar observation recently:

https://www.alphaxiv.org/abs/2512.19941

dnhkng 4 hours ago||

Thanks for the link!

I think that these models have to learn to efficiently use their parameters, and the best way to do that is 'evolve' (yes, a bad word for it), structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets up boost performance in s more experimentally rigorous way

WithinReason 4 hours ago||

I think the recurrence is a consequence of using a residual connection, seems like that makes the representation stay consistent across layers

tgw43279w 4 hours ago||

Very cool, thanks for sharing! Recovering 96% using just two blocks on IMN-1k, wow!

tgw43279w 5 hours ago|

That was a fun read! The base64 decoding and encoding is quite interesting. A parallel: these models are surprisingly robust to heavy word mangling, back in 2023 people used this trick to jailbreak the models very often, but what was more surprising is that they even understand it. I always thought of it this way there must be some circuitry in the model that maps these almost unrecognizable words/sentences into their rectified versions. But what your base64 also shows is the fact thy can also encode them back as well! (However models are known to not be able to produce mangled output that looks convincingly random. I think the base64 transformation is more mechanical in this regard and hence it‘s easier to do the reverse for them.) So your layer circuit hypothesis aligns pretty well with my mental model of how these models work based on the interpretability work I am familiar with! I really also like the way you used the heatmaps as a tool to derive layer insights, very intuitive! But it’s really surprising that you can simply duplicate layers and achieve better results that generalize! This is some research grade effort! I’m confident you could publish this in NeurIPS or ICML if you put it into a paper! I‘m quite impressed! Great work!

More comments...