Posted by realberkeaslan 19 hours ago

LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language? (dnhkng.github.io)
118 points | 35 comments | page 2
yodon 15 hours ago|
Apologies if I missed this in the article (or in the first article in the series) - what happens if you add two copies of the layer set? Does performance improve over adding one copy of the layer set?
dnhkng 15 hours ago|
Author here: That was done in this blog post, in the beam search. I started with the best re-layer configs and iteratively added more blocks, including the same block multiple times, during a long beam search.

It turns out this does not help (somewhat surprisingly).
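A beam search over layer orderings like the one described can be sketched roughly as follows. This is an illustrative toy, not the author's code: the block indices, beam width, and the stand-in `probe_score` function are all assumptions; in the real search the score would come from running the re-layered model on a probe set.

```python
# Hypothetical sketch: states are sequences of transformer block indices
# (repeats allowed), scored by an evaluation probe. Illustrative only.

NUM_BLOCKS = 8   # blocks in the base model (assumption)
BEAM_WIDTH = 4
MAX_LEN = 12     # maximum number of blocks in a re-layered config

def probe_score(config):
    # Stand-in for evaluating the re-layered model on a probe set.
    # Toy heuristic: reward in-order blocks, penalize immediate repeats.
    score = 0.0
    for prev, cur in zip(config, config[1:]):
        score += 1.0 if cur == prev + 1 else -0.5
        if cur == prev:
            score -= 0.5
    return score

def beam_search():
    beam = [[b] for b in range(NUM_BLOCKS)]
    while len(beam[0]) < MAX_LEN:
        # Extend every config in the beam by every possible block,
        # then keep only the top-scoring candidates.
        candidates = [cfg + [b] for cfg in beam for b in range(NUM_BLOCKS)]
        candidates.sort(key=probe_score, reverse=True)
        beam = candidates[:BEAM_WIDTH]
    return beam[0]

best = beam_search()
```

Inserting a duplicated block into a config is just appending an index that already appears in the sequence, so the same machinery covers both reordering and repetition.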

coppsilgold 10 hours ago|||
It's possible that the gains come despite the noise the coarse process introduces. After two repetitions, the noise may overwhelm the advantage.

The residual connections resemble the Euler method (this observation led to Neural ODEs, IIRC), which isn't known to be particularly accurate. If the model has been trained with a particular number of layers, adding more layers will also add a lot of noise.
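The analogy above: a residual block x ← x + f(x) is one explicit-Euler step of the ODE dx/dt = f(x) with step size 1, so stacking or duplicating blocks changes the discretization of the same trajectory. A toy numeric sketch (the scalar f and step count are illustrative assumptions):

```python
import math

# Residual stacking as explicit Euler integration of dx/dt = f(x).
# With f(x) = -0.5 * x, the exact solution is x(t) = x0 * exp(-0.5 * t).

def f(x):
    return -0.5 * x

x = 1.0          # toy scalar "activation"
steps = 4        # number of "layers"
h = 1.0 / steps  # step size; a plain residual stack corresponds to h = 1

for _ in range(steps):
    x = x + h * f(x)   # residual update == one Euler step

# Euler is only first-order accurate: x approximates exp(-0.5) but does
# not equal it, and changing the number of steps (layers) perturbs the
# computed trajectory rather than refining the same answer.
exact = math.exp(-0.5)
```

This is why inserting extra layers into a model trained at a fixed depth can be read as perturbing an integration scheme the weights were tuned for.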

Ultimately, the LLM will need to be fine-tuned with the loops, or a looped architecture trained from scratch, such as <https://ouro-llm.github.io>. Unfortunately, they made the mistake of looping the entire LLM rather than just the center portion.

skyde 14 hours ago|||
Actually not surprised. I guess this works for the same reason "say it twice" [1] works. Because LLMs are trained as causal language models, past tokens cannot attend to future tokens; one copy of the layer set solves this. [1] https://arxiv.org/html/2512.14982v1
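The causal-masking point can be made concrete with a toy attention mask. This is a generic illustration of the "say it twice" idea, not code from the linked paper; the prompt length is an arbitrary choice:

```python
import numpy as np

# Under causal masking, token i may only attend to tokens j <= i.
# If a prompt of length n is repeated, every token in the second copy
# sits after the *entire* first copy, so it can attend to all of it.

n = 3                      # toy prompt length
T = 2 * n                  # prompt said twice
mask = np.tril(np.ones((T, T), dtype=bool))  # True = may attend

first_copy = range(0, n)
second_copy = range(n, T)

# A token in the first copy cannot see later tokens of its own copy...
assert not mask[0, n - 1]
# ...but every token in the second copy sees the whole first copy.
assert all(mask[i, j] for i in second_copy for j in first_copy)
```

By the same logic, running the input through a duplicated layer stack gives later computation a view of representations that the first pass has already fully formed.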
JPLeRouzic 16 hours ago||
Has anyone started to implement this technique in Llama.cpp or similar inference tool?
dnhkng 16 hours ago|
There was some work done on this a while back, during the FrankenMerge craze of '23.

I am working with TurboDerp to integrate this into the Exllama v3 format.

sigbottle 13 hours ago||
Wow, super interesting keywords. Are you an ML researcher? What kind of experiments do you do?
lostmsu 16 hours ago||
How's the reproducibility of the results? E.g., average score over 10 runs vs. the original.
dnhkng 15 hours ago|
Author here: The code is up on GitHub.

The probes I used seem to help identify good configurations, but are quite noisy. A small probe set was initially used to make the scan tractable, and the higher-ranked models were then retested on a set ~10x larger.
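The two-stage evaluation described can be sketched as follows. Everything here is a simulation for illustration (the config names, probe sizes, and noise model are assumptions, not the author's setup):

```python
import random

# Two-stage noisy evaluation: score all configs on a small probe set,
# then re-test only the top-ranked ones on a ~10x larger set, where the
# noise is smaller. Hypothetical sketch.

random.seed(0)

def evaluate(true_quality, probe_size):
    # Simulated probe: a noisy estimate whose noise shrinks with set size.
    noise = random.gauss(0, 1.0 / probe_size ** 0.5)
    return true_quality + noise

# 100 candidate re-layer configs with hidden "true" quality.
configs = {f"config_{i}": random.random() for i in range(100)}

# Stage 1: cheap, noisy scan over everything.
coarse = {name: evaluate(q, probe_size=20) for name, q in configs.items()}
shortlist = sorted(coarse, key=coarse.get, reverse=True)[:10]

# Stage 2: re-test the shortlist on a ~10x larger probe set.
fine = {name: evaluate(configs[name], probe_size=200) for name in shortlist}
best = max(fine, key=fine.get)
```

The design choice: the coarse pass only has to rank well enough that good configs land in the shortlist; the expensive, lower-noise pass then separates them.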

_lex 16 hours ago|
We've discovered the language. It changes the economics of computing.

As in, this entire cloud buildout is unnecessary because it becomes like using a calculator.

Reach out to chat.

cjameskeller 15 hours ago|
Would you be willing to elaborate? I would be curious to hear more.
_lex 13 hours ago||
Shoot me an email and let's jump on a call. I'll blow your mind.