I really like the approach Nomic take: their most recent models are available via their API or as open weights for non-commercial use only (unless you buy a license). They later relicense their older models under Apache 2.0 licenses.
This gives me confidence that I can continue to use my calculated vectors in the future even if Nomic's hosted model is no longer available, because I can run the local one instead.
Nomic Embed Vision 1.5, for example, started out as CC-BY-NC-4.0 but was later relicensed to Apache 2.0: https://www.nomic.ai/blog/posts/nomic-embed-vision
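To make that insurance policy concrete: once the weights are out, the fallback is just running the model locally. A minimal sketch, assuming the Apache-2.0 text sibling (nomic-embed-text-v1.5) and the sentence-transformers package; the task prefixes follow Nomic's model card.

```python
from sentence_transformers import SentenceTransformer

# Open-weights fallback: runs entirely locally, no API dependency.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                            trust_remote_code=True)

# Nomic's embedding models expect a task prefix on every input.
doc_vecs = model.encode(["search_document: Embedding models turn text into vectors."])
query_vecs = model.encode(["search_query: what do embedding models do?"])
```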
Elliott here from Cohere.
We benchmarked against Nomic's models on a collection of datasets spanning text-only, image-only, and mixed modalities. Without publishing additional benchmarks, I am confident in saying that our model is more performant.
At Cohere, we have not deprecated any of our embedding models since we started (I know because I've been there that long), and if we were to start doing so, I would take into account the need to ensure our users still have a way of accessing those models.
One aspect that isn't factored in here is efficiency. Yes, there are strong open-weight models, but if you're punching in the 7B+ weight class, your serving requirements are vastly different from a throughput-efficiency perspective (as is your query-inference speed).
All food for thought. That being said, if, for your use case, Nomic Embed Vision 1.5 is better than Embed v4.0, I'm happy to hop on a call to discuss where the differential may be.
This matters for embedding models because I'm presumably building up a database of many millions of vectors for later similarity comparisons - so I need to know I'll be able to embed an arbitrary string in the future in order for that investment to still make sense.
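Roughly the workflow I mean, as a sketch - embed() here is a hypothetical stand-in for whatever model produced the stored vectors, which is exactly the dependency I'm worried about:

```python
import numpy as np

# A vector database built up over months of embedding documents.
doc_vectors = np.load("doc_vectors.npy")  # shape (n_docs, dim), L2-normalized

def search(query: str, k: int = 10):
    # embed() must be the SAME model that produced doc_vectors; if that
    # model becomes unavailable, the whole database is stranded.
    q = embed(query)  # hypothetical call returning a normalized vector
    scores = doc_vectors @ q  # cosine similarity via dot product
    return np.argsort(scores)[::-1][:k]
```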
Size doesn't matter much to me, I don't even need to be able to run that model, it's more about having an insurance policy for my own peace of mind.
(Even a covenant that says "in the event that Cohere goes out of business this model will be made available under license X" would address this itch for me.)
That being said, since I do manage our Search and Retrieval offering, if we were to deprecate any of our embedding models (which is generally the risk with closed-source models), I would make sure that there is an "escape hatch" for users.
Heard on what your concerns are though :)
Thanks for engaging! Apologies for the delay but HN seems to have throttled my account from posting so I'm answering as fast as I can (or they will let me).
You're right in the sense that I could wake up tomorrow and Cohere could lay me off, fire me, or I could quit! All of these are possible, but the reason I don't want to publicly commit, particularly on our policy of open-sourcing our models if our business is a going concern or if we deprecate our models, is as follows:
1) Cohere is not a going concern
2) I haven't thought about deprecating any of our embedding models, for the reason that simonw stated!
I wouldn't say I'm a Cohere employee playing PR - I'm responsible for all the search and embedding models and products at Cohere, and I care deeply about how our users perceive, understand, and use our models/products. I'm actually really excited that there is so much engagement this time around (a far cry from 2021, when I started).
For reference, in terms of policies: for our SaaS API, I wrote our model deprecation policy (https://docs.cohere.com/docs/deprecations) and have only deprecated our Rerank-v2.0 models, largely because they were stateless.
Again - happy for all the engagement. Heard on the things we can improve on!
I believe the term "going concern" means exactly the opposite of what you were trying to say here. Generally, comments about pedantry aren't helpful or interesting, but this case was amusing to me in the context of assuring people Cohere is likely to stay around by boldly stating that Cohere is at risk of becoming insolvent or ceasing operations ("Cohere is not a going concern"). Beyond that, I think it's pretty interesting how understandable it is to look at the term without knowing its meaning and assume the presence of the word "concern" must mean people are concerned about it going [bankrupt?].
I'm sure given the context nobody got the wrong impression. If anything, it makes me wonder if the term could, at least in informal contexts, reach a point of semantic inversion.
I don't mean to phrase this in a hostile way, but then what is even the point of posting? Your word means nothing. You are not in a position to promise anything. You could wake up one morning and find yourself laid off with all your accounts terminated.
And the fact that a Cohere employee is playing PR trying to deflect this issue gives me less faith, not more.
Andriy, co-founder at Nomic here! Congrats on Embed v4 - the more embeddings the merrier!
Embed v1.5 is a 1.5-year-old model!
You should check out our latest comparable open-weights, multimodal embedding model that's designed for text, PDFs and images! I can't directly say anything about relative performance to Embed v4 as you guys didn't publish evals on the Vidore-V2 open benchmark!
We actually did internally run benchmarks against your models since they are open-weights - however, when looking at the license on the 3bn multimodal model (https://huggingface.co/nomic-ai/nomic-embed-multimodal-3b/bl...) we're not permitted to include the results for the marketing of products/services. Rest assured, we know how our model stacks up against yours :)
In any case, while we didn't publish evals on Vidore-V2, we did benchmark on it internally.
Since we focus on enterprise use cases, we made sure to include training data from domains like the ones you mentioned above. Fine-tuning may be helpful in very specific use cases, and we do offer that as a customization service (just not available via SaaS).
Looks like I'll stay on [bge-m3](https://huggingface.co/BAAI/bge-m3)
For example, Google's model only supports 30 text tokens [1]!!
This is definitely a welcome addition.
Any pointers to similarly powerful embedding models? I'm looking specifically for text and images. I wish there were also one that could do audio and video, but I don't think that exists.
[1] https://cloud.google.com/vertex-ai/generative-ai/docs/embedd...
I also built this into a version of an open-source read-it-later app.
You can check it out here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...
It literally has the entire IaC stack for you to deploy it yourself.
Anecdotal evidence points to benchmarks correlating with result quality for data I've dealt with. I haven't spent a lot of time comparing results between models, because we were happy with the results after trying a few and tuning some settings.
Unless my dataset lines up really well with a benchmark's dataset, creating my own benchmark is probably the only way to know which model is "best".
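If I did build one, it wouldn't need to be elaborate. A sketch, assuming a handful of hand-labeled query-to-relevant-document pairs from my own data and a hypothetical embed_with(model_name, texts) helper that returns normalized vectors:

```python
import numpy as np

def recall_at_k(model_name, documents, labeled_pairs, k=5):
    """labeled_pairs: list of (query, index_of_relevant_document)."""
    doc_vecs = embed_with(model_name, documents)  # hypothetical helper
    hits = 0
    for query, relevant_idx in labeled_pairs:
        q = embed_with(model_name, [query])[0]
        top_k = np.argsort(doc_vecs @ q)[::-1][:k]
        hits += int(relevant_idx in top_k)
    return hits / len(labeled_pairs)
```

Run that over each candidate model with the same pairs, and the "best" model is whichever wins on your data, not on a leaderboard.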
It feels like embedding content that large -- especially in dense texts -- will lead to loss of fidelity/signal in the output vector.
For example, "The configuration mentioned above is critical" now "knows" what configuration is being referenced, along with which project and anything else talked about in the document.
Voyage-3-large is a text-only model and much larger than Embed-v4. If you want to unlock multimodality with Voyage-3-large, you'd have to either OCR your data (usually with really bad results) or use a VLM to parse it into textual descriptions (this works alright, but the cost of a VLM will jack up your data-preprocessing costs).
Not to mention an image is optimistically 50 KB, while the same page represented as Markdown is maybe 2–5 KB. When you're talking about pulling in potentially hundreds of pages, that's a 10–20x increase in storage, memory usage, and network overhead.
I do wish they had a more head-to-head comparison with voyage. I think they're the de facto king of proprietary embeddings and with Mongo having bought them, I'd love to migrate away once someone can match their performance.
I looked at the NDCG numbers and thought they came from the same dataset, since Voyage and Cohere both used NDCG. I now realize they were separate benchmarks with the same evaluation metric.
You’re right, there’s no other way to compare embeddings than a benchmark.
It's just that what the benchmarks used by Voyage and Cohere track might not be relevant to your own needs.
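For reference, NDCG is just position-discounted relevance normalized by the best possible ordering, which is why identical NDCG values computed on different datasets aren't comparable. A minimal sketch:

```python
import math

def dcg(relevances):
    # Relevance at rank i (1-indexed) is discounted by log2(i + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    best = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / best if best > 0 else 0.0

# Graded relevance of the top five results returned by some model:
print(ndcg_at_k([3, 2, 0, 1, 0], k=5))  # ~0.985
```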
We're switching to the V4 to store unified embeddings of our products. From the early tests we ran, this should help with edge case relevancy (i.e. when a product's image and text mismatch, thus creating a greater need for multi-modal embeddings) and improve our search speed by ~100ms.
While we benchmarked internally on BEIR, we opted not to report our model on MTEB for the following reasons:
1) MTEB has been gamed - if you look at this model (https://huggingface.co/voyageai/voyage-3-m-exp) on the MTEB leaderboard, it's an intermediate checkpoint of Voyage-3-Large that was finetuned on datasets representative of the MTEB datasets.
2) If you look at the recent datasets in MMTEB, you'll find quite a lot of machine-translated or "weird" datasets that are noisy.
In general, for our search models, we benchmark on these public academic datasets, but we definitely do not try to hill-climb in this direction, as we find it has little correlation with real use cases.
The current state of chunking and transforming is such a mess.
Drop me an email at elliott@cohere.ai
Do they just host open-source models, so you can get them up and running faster?
If so, what’s their moat?
What prevents AWS from doing the same thing?