AlphaGenome: AI for better understanding the genome

Posted by i_love_limes 4 days ago

AlphaGenome: AI for better understanding the genome(deepmind.google)

511 points | 170 comments

Kalanos 3 days ago|

The functional predictions related to "non-coding" variants are big here. Non-coding regions, referred to as the dark genome, produce regulatory non-coding RNA's that determine the level of gene expression in a given cell type. There are more regulatory RNA's than there are genes. Something like 75% of expression by volume is ncRNA.

dekhn 3 days ago||

There is a big long-running argument about what "functional" means in "non-coding" parts of the genome. The deeper I pushed into learning about the debate the less confident I became of my own understanding of genomics and evolution. See https://www.sciencedirect.com/science/article/pii/S096098221... for one perspective.

wespiser_2018 3 days ago||

It's possible that the "functional" aspect of non-coding RNA exists on a time scale much larger that what we can assay in a lab. The sort of "junk DNA/RNA" hypothesis: the ncRNA part of the genome is material that increases fitness during relative rare events where it's repurposed into something else.

On a millions or billions of year time frame, the organisms with the flexibility of ncRNA would have an advantage, but this is extremely hard to figure out with a "single point in time" view point.

Anyway, that was the basic lesson I took from studying non-coding RNA 10 years ago. Projects like ENCODE definitely helped, but they really just exposed transcription of elements that are noisy, without providing the evidence that any of it is actually "functional". Therefore, I'm skeptical that more of the same approach will be helpful, but I'd be pleasantly surprised if wrong.

cysteinechapel 8 hours ago||

Such an advantage that is rare and across such long time scales would be so small on average that it would be effectively neutral. Natural selection can only really act on fitness advantages greater than on the order of the inverse of effective population size, which for large multicellular organisms such as animals, is low. Most of this is really just noisy transcription/binding/etc.

For example, we don't keep transposons in general because they're useful, which are almost half of our genomes, and are a major source of disruptive variation. They persist because we're just not very good at preventing them from spreading, we have some suppressive mechanisms but they don't work all the time, and there's a bit of an arms race between transposons and host. Nonetheless, they can occasionally provide variation that is beneficial.

RivieraKid 4 days ago||

I wish there's some breakthrough in cell simulation that would allow us to create simulations that are similarly useful to molecular dynamics but feasible on modern supercomputers. Not being able to see what's happening inside cells seems like the main blocker to biological research.

bglazer 4 days ago||

Molecular dynamics describes very short, very small dynamics, like on the scale of nanoseconds and angstroms (.1nm)

What you’re describing is more like whole cell simulation. Whole cells are thousands of times larger than a protein and cellular processes can take days to finish. Cells contain millions of individual proteins.

So that means that we just can’t simulate all the individual proteins, it’s way too costly and might permanently remain that way.

The problem is that biology is insanely tightly coupled across scales. Cancer is the prototypical example. A single mutated letter in DNA in a single cell can cause a tumor that kills a blue whale. And it works the other way too. Big changes like changing your diet gets funneled down to epigenetic molecular changes to your DNA.

Basically, we have to at least consider molecular detail when simulating things as large as a whole cell. With machine learning tools and enough data we can learn some common patterns, but I think both physical and machine learned models are always going to smooth over interesting emergent behavior.

Also you’re absolutely correct about not being able to “see” inside cells. But, the models can only really see as far as the data lets them. So better microscopes and sequencing methods are going to drive better models as much as (or more than) better algorithms or more GPUs.

fainpul 4 days ago||

> A single mutated letter in DNA in a single cell can cause a tumor that kills a blue whale.

Side note: whales rarely get cancer.

https://en.wikipedia.org/wiki/Peto's_paradox

https://www.youtube.com/watch?v=1AElONvi9WQ

mbeavitt 3 days ago|||

Simulating the real world at increasingly accurate scales is not that useful, because in biology - more than any other field - our assumptions are incorrect/flawed most of the time. The most useful thing simulations allow us to do is directly test those assumptions and in these cases, the simpler the model the better. Jeremy Gunawardena wrote a great piece on this: https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007...

tim333 3 days ago|||

It's a main aim at DeepMind. I hope they succeed as it could be very useful.

andrewchoi 4 days ago|||

The folks at Arc are trying to build this! https://arcinstitute.org/news/virtual-cell-model-state

dekhn 4 days ago||

STATE is not a simulation. It's a trained graphical model that does property prediction as a result of a perturbation. There is no physical model of a cell.

Personally, I think arc's approach is more likely to produce usable scientific results in a reasonable amount of time. You would have to make a very coarse model of the cell to get any reasonable amount of sampling and you would probably spend huge amounts of time computing things which are not relevant to the properties you care amount. An embedding and graphical model seems well-suited to problems like this, as long as the underlying data is representative and comprehensive.

ahns 4 days ago|||

You may enjoy this, from a top-down experimental perspective (https://www.nikonsmallworld.com/galleries/small-world-in-mot...). Only a few entries so far show intracellular dynamics (like this one: https://www.nikonsmallworld.com/galleries/2024-small-world-i...), but I always enjoy the wide variety of dynamics some groups have been able to capture, like nervous system development (https://www.nikonsmallworld.com/galleries/2018-small-world-i...); absolutely incredible.

RivieraKid 3 days ago||

Very interesting, thanks.

eleveriven 4 days ago|||

What's missing feels like the equivalent of a "fast-forward" button for cell-scale dynamics

t_serpico 4 days ago|||

'Seeing' inside cells/tissues/organs/organisms is pretty much most modern biological research.

j7ake 4 days ago|||

Why simulate? We can already do it experimentally

tim333 3 days ago||

You can't see what's going on in most cases.

m3kw9 4 days ago|||

I believe this is where quantum computing comes in but could be a decade out, but AI acceleration is hard to predict

noduerme 4 days ago||

I wish there were more interest in general in building true deterministic simulations than black boxes that hallucinate and can't show their work.

b0a04gl 3 days ago||

1mbp context makes so much sense here wow. genome's flat yeah but reg stuff's like.. all over : loops, timing, chromatin state. model needs that whole view just to even line it up right. giving it enough space to rewire what the cell's already doing. and the transformer memory just clicks here and actually fits.

jebarker 4 days ago||

I don't think DM is the only lab doing high-impact AI applications research, but they really seem to punch above their weight in it. Why is that or is it just that they have better technical marketing for their work?

331c8c71 4 days ago||

This one seems like well done research but in no way revolutionary. People have been doing similar stuff for a while...

Gethsemane 4 days ago||

Agreed, there’s been some interesting developments in this space recently (e.g. AgroNT). Very excited for it, particularly as genome sequencing gets cheaper and cheaper!

I’d pitch this paper as a very solid demonstration of the approach, and im sure it will lead to some pretty rapid developments (similar to what Rosettafold/alphafold did)

tim333 4 days ago|||

They have been at it for a long time and have a lot of resources courtesy of Google. Asking perplexity it says the alphafold 2 database took "several million GPU hours".

kridsdale3 4 days ago||

It's also a core interest of Demis.

eleveriven 4 days ago|||

Other labs are definitely doing amazing work too, but often it's either more niche or less public-facing

forgotpwagain 4 days ago|||

DeepMind/Google does a lot more than the other places that most HN readers would think about first (Amazon, Meta, etc). But there is a lot of excellent work with equal ambition and scale happening in pharma and biotech, that is less visible to the average HN reader. There is also excellent work happening in academic science as well (frequently as a collaboration with industry for compute). NVIDIA partners with whoever they can to get you committed to their tech stack.

For instance, Evo2 by the Arc Institute is a DNA Foundation Model that can do some really remarkable things to understand/interpret/design DNA sequences, and there are now multiple open weight models for working with biomolecules at a structural level that are equivalent to AlphaFold 3.

nextos 4 days ago|||

In biology, Arc Institute is doing great novel things.

Some pharmas like Genentech or GSK also have excellent AI groups.

331c8c71 4 days ago||

Arc have just released a perturbation model btw. If it reliably beats linear benchmarks as claimed it is a big step

https://arcinstitute.org/news/virtual-cell-model-state

daveguy 4 days ago|||

Well, they are a Google organization. Being backed by a $2T company gives you more benefits than just marketing.

jebarker 4 days ago||

Money and resources are only a partial explanation. There’s some equally and more valuable companies that aren’t having nearly as much success in applied AI.

sidibe 4 days ago||

There are more valuable companies but there aren't companies with more resources. If apple wanted to turn all their cash pile into something like Google's infrastructure it would still take years

inquirerGeneral 4 days ago||

[dead]

another_twist 3 days ago||

So very similar approach to Conformer - convolution head for downsampling and transformer for time dependencies. Hmm, surprising that this idea works across application domains.

xipho 4 days ago||

"To ensure consistent data interpretation and enable robust aggregation across experiments, metadata were standardized using established ontologies."

Can't emphasize enough about how DNA requires human data curation to make things work, even from day one alignments models were driven based on biological observations. Glad to see UBERON, which represents a massive amount of human insight and data curation of what is for all intents and purposes a semantic-web product (OWL based RDF at the heart) playing a significant role.

kylehotchkiss 3 days ago||

I'm somewhat a noob here, but does this model have good understanding of things like OvRFs, methylation, etc, or is it strictly a sequence pattern matching thingy?

seydor 4 days ago||

this is such an interesting problem. Imagine expanding the input size to 3.2Gbp, the size of human genome. I wonder if previously unimaginable interactions would occur. Also interesting how everything revolves around U-nets and transformers these days.

pfisherman 4 days ago||

You would not need much more than 2 megabases. The genome is not one contiguous sequence. It is organized (physically segregated) into chromosomes and topologically associated domains. IIRC 2 megabases is like the 3 sd threshold for interactions between cis regulatory elements / variants and their effector genes.

eleveriven 3 days ago|||

Even just modeling 3D genome organization or ultra-long-range enhancers more realistically could open up new insights

teaearlgraycold 4 days ago||

> Also interesting how everything revolves around U-nets and transformers these days.

To a man with a hammer…

TeMPOraL 4 days ago|||

Or to a man with a wheel and some magnets and copper wire...

There are technologies applicable broadly, across all business segments. Heat engines. Electricity. Liquid fuels. Gears. Glass. Plastics. Digital computers. And yes, transformers.

SV_BubbleTime 4 days ago|||

Soon we’ll be able to get the whole genome up on the blockchain. (I thought the /s was obvious)

mountainriver 4 days ago|

With the huge jump in RNA prediction seems like it could be a boon for the wave of mRNA labs

iandanforth 4 days ago|

Those outside the US at least ...

TechDebtDevin 4 days ago||

I've been saying we need a rebranding of mRNA in the USA its coming.

divbzero 4 days ago||

“in situ therapeutics”

More comments...