Posted by sohamrj 5 hours ago

Show HN: Gemini can now natively embed video, so I built sub-second video search (github.com)
Gemini Embedding 2 can project raw video directly into a 768-dimensional vector space alongside text. No transcription, no frame captioning, no intermediate text. A query like "green car cutting me off" is directly comparable to a 30-second video clip at the vector level.
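
To make "directly comparable at the vector level" concrete, here is a minimal sketch of ranking clips against a text query by cosine similarity. It assumes both the query and the clips have already been embedded into the same space. The 3-dimensional vectors, the file names, and the helper names `cosine_similarity` and `rank_clips` are made up for illustration; real vectors would be 768-dimensional and come from the embedding model, and this is not necessarily how the CLI scores matches.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_clips(query_vec, clip_vecs):
    """Return (clip_name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in clip_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy placeholder vectors (hypothetical); real embeddings would have 768 dimensions.
query = [0.9, 0.1, 0.0]              # stands in for the embedded text query
clips = {
    "clip_0001.mp4": [0.8, 0.2, 0.1],  # stands in for an embedded video chunk
    "clip_0002.mp4": [0.0, 0.1, 0.9],
}
ranked = rank_clips(query, clips)
print(ranked[0][0])  # name of the best-matching clip
```

A vector database does the same kind of nearest-neighbor lookup, just with an index so it stays fast over many hours of footage.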

I used this to build a CLI that indexes hours of footage into ChromaDB, then searches it with natural language and auto-trims the matching clip. Demo video on the GitHub README. Indexing costs ~$2.50/hr of footage. Still-frame detection skips idle chunks, so security camera / sentry mode footage is much cheaper.
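
For a rough idea of why skipping idle chunks saves money, here is a sketch of one possible still-frame check plus the cost arithmetic. The frame-difference method, the threshold, the 30-second chunk length, and the names `mean_abs_diff`, `is_idle`, and `estimate_cost` are all assumptions for illustration, not a description of what the CLI actually does. Frames are represented as flat lists of grayscale pixel values to keep the example dependency-free.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized grayscale frames."""
    total = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return total / len(frame_a)

def is_idle(frames, threshold=2.0):
    """Treat a chunk as idle if every consecutive frame pair differs by less than threshold.

    A chunk with fewer than two frames has no pairs to compare and counts as idle.
    """
    return all(mean_abs_diff(prev, cur) < threshold
               for prev, cur in zip(frames, frames[1:]))

def estimate_cost(chunks, chunk_seconds=30, usd_per_hour=2.50, threshold=2.0):
    """Return the names of non-idle chunks and the estimated cost of embedding only those."""
    active = [name for name, frames in chunks if not is_idle(frames, threshold)]
    cost = len(active) * chunk_seconds / 3600 * usd_per_hour
    return active, cost

# Two toy chunks of two tiny 4-pixel frames each (hypothetical data).
chunks = [
    ("parked",  [[10, 10, 10, 10], [10, 11, 10, 10]]),   # almost no change between frames
    ("driving", [[10, 10, 10, 10], [200, 50, 90, 10]]),  # large change between frames
]
active, cost = estimate_cost(chunks)
print(active, round(cost, 4))
```

At ~$2.50/hr, a 30-second chunk costs about $0.02 to embed, so every idle chunk that gets skipped comes straight off the bill.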

135 points | 42 comments
SpaceManNabs 3 hours ago
> No transcription, no frame captioning, no intermediate text.

If there is text in the video (like a caption or whatever), will the embedding capture that? Never thought about this before.

If the video has audio, does the embedding capture that too?

sohamrj 3 hours ago
Yes to both. The embedding is over raw video frames, so anything visible (text, signs, captions) gets captured in the vector. And Gemini Embedding 2 extracts the audio track and embeds it alongside the visual frames. So a query like 'someone yelling' would theoretically match on audio. My dashcam footage doesn't have audio though, so I haven't tested that side yet.
7777777phil 3 hours ago
Today I learned that Gemini can now natively embed video.

Cool project, thanks for sharing!

Aeroi 4 hours ago
Very cool. Does anybody have obvious use cases for this?
sohamrj 4 hours ago
Dashcam and home security footage are the two main ones I can think of.

It's a bit expensive right now, so it's not that practical at scale. But once the embedding model comes out of public preview, and we hopefully get a local equivalent, this will be a lot more practical.

giozaarour 3 hours ago
I think a good use case would be searching for certain products or videos across social media (TikTok and Instagram). It could be especially useful for shopping.
vidarh 3 hours ago
Branding/marketing monitoring companies would be all over this.
hebelehubele 3 hours ago
State surveillance
wahnfrieden 3 hours ago
Worker surveillance
klntsky 4 hours ago
Why not skip the text conversion? Is it usable at all?
sohamrj 4 hours ago
Gemini Embedding 2 converts video straight to vectors. In this case, dashcam clips don't have audio to transcribe, and even if they did, it would be useless for the search.
password4321 3 hours ago
What are the state-of-the-art (SOTA) audio models right now?