Hypernetworks: Neural Networks for Hierarchical Data

QueensGambit 13 hours ago|

Factorization is key here. It separates dataset-level structure from observation-level computation so the model doesn't waste capacity rediscovering structure.

I've been arguing the same for code generation. LLMs flatten parse trees into token sequences, then burn compute reconstructing hierarchy as hidden states. Graph transformers could be a good solution for both: https://manidoraisamy.com/ai-mother-tongue.html

stephantul 22 hours ago||

What a good post! I loved the takeaways at the end of each section.

I think it would maybe get more traction if the code was in pytorch or JAX. It’s been a long while since I’ve seen people use Keras.

joefourier 1 day ago||

Odd that the author didn’t try giving a latent embedding to the standard neural network (or modulated the activations with a FiLM layer) and had static embeddings as the baseline. There’s no real advantage to using a hypernetwork and they tend to be more unstable and difficult to train, and scale poorly unless you train a low rank adaptation.

mkmccjr 1 day ago||

Hello. I am the author of the post. The goal of this was to provide a pedagogical example of applying Bayesian hierarchical modeling principles to real world datasets. These datasets often contain inherent structure that is important to explicitly model (eg clinical trials across multiple hospitals). Oftentimes a single model cannot capture this over-dispersion but there is not enough data to split out the results (nor should you).

The idea behind hypernetworks is that they enable Gelman-style partial pooling to explicitly modeling the data generation process while leveraging the flexibility of neural network tooling. I’m curious to read more about your recommendations: their connection to the described problems is not immediately obvious to me but I would be curious to dig a bit deeper.

I agree that hypernetworks have some challenges associated with them due to the fragility of maximum likelihood estimates. In the follow-up post, I dug into how explicit Bayesian sampling addresses these issues.

yobbo 23 hours ago||

I think a latent embedding is almost equivalent to the article's hypernetwork, which I assume as y = (Wh + c)v + b, where h is a dataset-specific trainable vector. (The article uses multiple layers ...)

keepamovin 14 hours ago|

This is actually the way to AGI, ngl. Come back when it lands and see that it's right.