Posted by lukeinator42 9 hours ago

Meta Segment Anything Model 3 (ai.meta.com)
280 points | 57 comments
daemonologist 5 hours ago|
First impressions are that this model is extremely good - the "zero-shot" text prompted detection is a huge step ahead of what we've seen before (both compared to older zero-shot detection models and to recent general purpose VLMs like Gemini and Qwen). With human supervision I think it's even at the point of being a useful teacher model.

I put together a YOLO tune for climbing hold detection a while back (trained on 10k labels) and this is 90% as good out of the box - just misses some foot chips and low contrast wood holds, and can't handle as many instances. It would've saved me a huge amount of manual annotation though.
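The teacher-model workflow described above (a zero-shot detector generating labels for a YOLO tune) can be sketched roughly like this. This is a hedged illustration: `detections` is a stand-in for whatever boxes SAM 3 actually returns (its real API may differ), and `to_yolo_label` is a hypothetical helper for writing YOLO-format label files.

```python
# Sketch: turn a teacher model's pixel-space boxes into YOLO label lines.
# The detector call itself is omitted; `detections` stands in for its output.

def to_yolo_label(box, img_w, img_h, class_id=0):
    """Convert an (x1, y1, x2, y2) pixel box into a YOLO label line:
    class cx cy w h, all normalized to [0, 1]."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Hypothetical teacher output ("climbing hold" boxes) for one 640x480 image:
detections = [(100, 120, 180, 200), (300, 50, 360, 110)]
lines = [to_yolo_label(b, 640, 480) for b in detections]
```

With human review of the generated labels (as the comment suggests), this replaces most of the manual annotation pass.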

rocauc 4 hours ago|
As someone who works on a platform users have used to label 1B images, I'm bullish SAM 3 can automate at least 90% of the work. Data prep is flipped to models being human-assisted instead of humans being model-assisted (see "autolabel" https://blog.roboflow.com/sam3/). I'm optimistic the majority of users can now start by deploying a model and then curating data, instead of the inverse.
gs17 5 hours ago||
The 3D mesh generator is really cool too: https://ai.meta.com/sam3d/ It's not perfect, but it seems to handle occlusion very well (e.g. a person in a chair can be separated into a person mesh and a chair mesh) and it's very fast.
Animats 5 hours ago|
It's very impressive. Do they let you export a 3D mesh, though? I was only able to export a video. Do you have to buy tokens or something to export?
WhiteNoiz3 3 hours ago|||
The models it creates are gaussian splats, so if you are looking for traditional meshes you'd need a tool that can create meshes from splats.
bahmboo 3 hours ago||
Are you sure about that? They say "full 3D shape geometry, texture, and layout" which doesn't preclude it being a splat but maybe they just use splats for visualization?
FeiyouG 48 minutes ago||
In their paper they mention using a "latent 3D grid" internally, which can be converted to a mesh or Gaussian splat using a decoder. The spatial layout of the points shown in the demo doesn't resemble a Gaussian splat either.
modeless 4 hours ago||||
The model is open weights, so you can run it yourself.
TheAtomic 4 hours ago|||
I couldn't download it. The model appears to be comparable to Sparc3D, Hunyuan, etc., but without a download, who can say? It is much faster though.
visioninmyblood 3 hours ago||
you can download it at https://github.com/facebookresearch/sam3. for 3d https://github.com/facebookresearch/sam-3d-objects

I actually found the easiest way was to run it for free to see if it works for my use case of person deidentification https://chat.vlm.run/chat/63953adb-a89a-4c85-ae8f-2d501d30a4...

bahmboo 3 hours ago||
Like the models before it, it struggles with my use case of tracing circuit board features. It's great with a pony on the beach but really isn't made for more rote industrial-type applications. With proper fine-tuning it would probably work much better, but I haven't tried that yet. There are good examples online though.
squigz 2 hours ago|
Wow that sounds like a really interesting use-case for this. Can you link to some of those examples?
bahmboo 1 hour ago||
I don't have anything specific to link to but you could try it yourself with line art. Try something like a mandala or a coloring book type image. The model is trying to capture something that encompasses an entity. It isn't interested in the subfeatures of the thing. Like with a mandala it wants to segment the symbol in its entirety. It will segment some subfeatures like a leaf shaped piece but it doesn't want to segment just the lines such that it is a stencil.

I hope this makes sense and I'm using terms loosely. It is an amazing model but it doesn't work for my use case, that's all!

clueless 5 hours ago||
With an avg latency of 4 seconds, this still couldn't be used for real-time video, correct?

[Update: should have mentioned I got the 4-second figure from the roboflow.com links in this thread]

Etheryte 4 hours ago||
Didn't see where you got those numbers, but surely that's just a problem of throwing more compute at it? From the blog post:

> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
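The arithmetic on those two figures is worth making explicit: 30 ms per image is comfortably real-time, while the 4 s figure (a hosted-demo round trip, not the model itself) is not. A quick sanity check:

```python
# Throughput implied by the blog post's 30 ms/image claim (H200 GPU),
# versus a 4 s hosted-demo round trip.

per_image_s = 0.030
fps = 1 / per_image_s            # ~33 frames per second, above 30 fps video rate

hosted_latency_s = 4.0
frames_behind = hosted_latency_s * 30  # frames of 30 fps video elapsed per request
```

So the gap is deployment overhead (queueing, network, shared hardware), not the model's inference cost.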

featureofone 36 minutes ago||
The SAM models are great. I used the latest version when building VideoVanish ( https://github.com/calledit/VideoVanish ) a video-editing GUI for removing or making objects vanish from videos.

That used SAM 2, and in my experience SAM 2 was more or less perfect—I didn’t really see the need for a SAM 3. Maybe it could have been better at segmenting without input.

But the new text prompt input seems nice; much easier to automate stuff using text input.

Benjamin_Dobell 4 hours ago||
For background removal (at least my niche use case of background removal of kids drawings — https://breaka.club/blog/why-were-building-clubs-for-kids) I think birefnet v2 is still working slightly better.

SAM3 seems to trace the images less precisely — it'll discard bits where kids draw outside the lines, which is okay, but it also seems to struggle around sharp corners and includes a bit of the white page that I'd like cut out.

Of course, SAM3 is significantly more powerful in that it does much more than simply cut out images. It seems to be able to identify what these kids' drawings represent. That's very impressive; AI models are typically trained on photos and adult illustrations, and they struggle with children's drawings. So I could perhaps still use this for identifying content, giving kids more freedom to draw what they like, and then unprompted attach appropriate behavior to their drawings in-game.

hodgehog11 5 hours ago||
This is an incredible model. But once again, we find an announcement for a new AI model with highly misleading graphs. That SA-Co Gold graph is particularly bad. Looks like I have another bad graph example for my introductory stats course...
fzysingularity 6 hours ago||
SAM3 is cool - you can already do this more interactively on chat.vlm.run [1], and do much more. It's built on our new Orion [2] model; we've been able to integrate with SAM and several other computer-vision models in a truly composable manner. Video segmentation and tracking is also coming soon!

[1] https://chat.vlm.run

[2] https://vlm.run/orion

visioninmyblood 5 hours ago|
Wow this is actually pretty cool, I was able to segment out the people and dog in the same chat. https://chat.vlm.run/chat/cba92d77-36cf-4f7e-b5ea-b703e612ea...
luckyLooking 3 hours ago|||
Even works with long range shots. https://chat.vlm.run/chat/e8bd5a29-a789-40aa-ae31-a510dc6478...
fzysingularity 4 hours ago|||
Nice, that's pretty neat.
8f2ab37a-ed6c 2 hours ago||
Couple of questions for people in-the-know:

* Does Adobe have their version of this for use within Photoshop, with all of the new AI features they're releasing? Or are they using this behind the scenes?
* If so, how does this compare?
* What's the best-in-class segmentation model on the market?

____tom____ 1 hour ago|
Ok, I tried converting a body to 3D, which it seems to do well, but it just gives me the image, and I see no way to export or use it. I can rotate it, but that's it.

Is there some functionality I'm missing? I've tried Safari and Firefox.

nmfisher 1 hour ago||
I didn't look too close but it wouldn't surprise me if this was intentional. Many of these Meta/Facebook projects don't have open licenses so they never graduate from web demos. Their voice cloning model was the same.
FeiyouG 1 hour ago||
If you open inspect element you can download the blob there. It is a .ply file and you can view it in any splat viewer.
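Once you have the blob, you can peek inside it without a viewer by reading the PLY header; splat files typically carry extra per-vertex properties (opacity, scales, rotations) beyond x/y/z, so the property list tells you what you've got. A minimal sketch, assuming an ASCII header as in standard PLY (the `sample` bytes below are illustrative, not Meta's actual output):

```python
# Sketch: parse a .ply header to list element counts and vertex properties.
# Works on the header of both ASCII and binary PLY files, since the header
# itself is always plain text terminated by "end_header".

def parse_ply_header(data: bytes):
    header = data.split(b"end_header")[0].decode("ascii", errors="replace")
    elements, vertex_props = {}, []
    current = None
    for line in header.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "element":          # e.g. "element vertex 12345"
            current = parts[1]
            elements[current] = int(parts[2])
        elif parts[0] == "property" and current == "vertex":
            vertex_props.append(parts[-1])  # property name is the last token
    return elements, vertex_props

# Illustrative header bytes, not real SAM 3D output:
sample = (b"ply\nformat binary_little_endian 1.0\n"
          b"element vertex 3\n"
          b"property float x\nproperty float y\nproperty float z\n"
          b"property float opacity\n"
          b"end_header\n")
elements, props = parse_ply_header(sample)
```

If the property list is just x/y/z (plus normals/colors), it's more likely a plain point cloud or mesh than a splat.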