I really like the approach Nomic take: their most recent models are available via their API or as open weights for non-commercial use only (unless you buy a license). They later relicense their older models under Apache 2.0 licenses.
This gives me confidence that I can continue to use my calculated vectors in the future even if Nomic's hosted model is no longer available, because I can run the local one instead.
Nomic Embed Vision 1.5, for example, started out as CC-BY-NC-4.0 but was later relicensed to Apache 2.0: https://www.nomic.ai/blog/posts/nomic-embed-vision
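To make that insurance policy concrete: once the weights are out, the fallback is just running the model locally. A minimal sketch, assuming the Apache-2.0 text sibling (nomic-embed-text-v1.5) and the sentence-transformers package; the task prefixes follow Nomic's model card.

```python
from sentence_transformers import SentenceTransformer

# Open-weights fallback: runs entirely locally, no API dependency.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                            trust_remote_code=True)

# Nomic's embedding models expect a task prefix on every input.
doc_vecs = model.encode(["search_document: Embedding models turn text into vectors."])
query_vecs = model.encode(["search_query: what do embedding models do?"])
```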
Elliott here from Cohere.
We benchmarked against Nomic's models on a collection of datasets spanning text-only, image-only, and mixed modalities. Without publishing additional benchmarks, I am confident in saying that our model is more performant.
At Cohere, we have not deprecated any of our embedding models since we started (I know because I've been there that long), and if we were to start doing so, I would take into account the need to ensure our users still have a way of accessing those models.
One aspect that isn't factored in here is efficiency. Yes, there are strong open-weight models, but if you're punching in the 7B+ weight class, your serving requirements are vastly different from a throughput-efficiency perspective (as is your query-inference speed).
All food for thought. That being said, if, for your use case, Nomic Embed Vision 1.5 is better than Embed v4.0, I'm happy to hop on a call to discuss where the differential may be.
This matters for embedding models because I'm presumably building up a database of many millions of vectors for later similarity comparisons - so I need to know I'll be able to embed an arbitrary string in the future in order for that investment to still make sense.
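Roughly the workflow I mean, as a sketch - embed() here is a hypothetical stand-in for whatever model produced the stored vectors, which is exactly the dependency I'm worried about:

```python
import numpy as np

# A vector database built up over months of embedding documents.
doc_vectors = np.load("doc_vectors.npy")  # shape (n_docs, dim), L2-normalized

def search(query: str, k: int = 10):
    # embed() must be the SAME model that produced doc_vectors; if that
    # model becomes unavailable, the whole database is stranded.
    q = embed(query)  # hypothetical call returning a normalized vector
    scores = doc_vectors @ q  # cosine similarity via dot product
    return np.argsort(scores)[::-1][:k]
```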
Size doesn't matter much to me, I don't even need to be able to run that model, it's more about having an insurance policy for my own peace of mind.
(Even a covenant that says "in the event that Cohere goes out of business this model will be made available under license X" would address this itch for me.)
That being said, since I do manage our Search and Retrieval offering, if we were to deprecate any of our embedding models (which is generally the risk with closed-source models), I would make sure that there is an "escape hatch" for users.
Heard on what your concerns are though :)
Thanks for engaging! Apologies for the delay but HN seems to have throttled my account from posting so I'm answering as fast as I can (or they will let me).
You're right in the sense that I could wake up tomorrow and Cohere could lay me off, fire me, or I could quit! All of these are possible, but the reason I don't want to publicly commit, particularly on our policy of open-sourcing our models if our business is a going concern or if we deprecate our models, is as follows:
1) Cohere is not a going concern
2) I haven't thought about deprecating any of our embedding models, for the reason that simonw stated!
I wouldn't say I'm a Cohere employee playing PR - I'm responsible for all the search and embedding models and products at Cohere, and I care deeply about how our users perceive, understand, and use our models/products. I'm actually really excited that there is so much engagement this time around (a far cry from 2021, when I started).
For reference, in terms of policies: for our SaaS API, I wrote our model deprecation policy (https://docs.cohere.com/docs/deprecations) and have only deprecated our Rerank-v2.0 models, largely because they were stateless.
Again - happy for all the engagement. Heard on the things we can improve on!
I believe the term "going concern" means exactly the opposite of what you were trying to say here. Generally, comments about pedantry aren't helpful or interesting, but this case was amusing to me in the context of assuring people Cohere is likely to stay around by boldly stating that Cohere is at risk of becoming insolvent or ceasing operations ("Cohere is not a going concern"). Beyond that, I think it's pretty interesting how understandable it is to look at the term without knowing its meaning and assume the presence of the word "concern" must mean people are concerned about it going [bankrupt?].
I'm sure given the context nobody got the wrong impression. If anything, it makes me wonder if the term could, at least in informal contexts, reach a point of semantic inversion.
I don't mean to phrase this in a hostile way, but then what is even the point of posting? Your word means nothing. You are not in a position to promise anything. You could wake up one morning and find yourself laid off with all your accounts terminated.
And the fact that a Cohere employee is playing PR trying to deflect this issue gives me less faith, not more.
Andriy, co-founder at Nomic here! Congrats on Embed v4 - the more embeddings the merrier!
Embed v1.5 is a 1.5-year-old model!
You should check out our latest comparable open-weights, multimodal embedding model that's designed for text, PDFs and images! I can't directly say anything about relative performance to Embed v4 as you guys didn't publish evals on the Vidore-V2 open benchmark!
We actually did internally run benchmarks against your models since they are open-weights - however, when looking at the license on the 3bn multimodal model (https://huggingface.co/nomic-ai/nomic-embed-multimodal-3b/bl...) we're not permitted to include the results for the marketing of products/services. Rest assured, we know how our model stacks up against yours :)
In any case, while we didn't publish evals on Vidore-V2, we did benchmark on it internally.
Since we focus on enterprise use cases, we made sure to include training data from domains like the ones you mentioned above. Fine-tuning may be helpful in very specific use cases, and we do offer that as a customization service (just not available via SaaS).
Looks like I'll stay on [bge-m3](https://huggingface.co/BAAI/bge-m3)
For example, Google's model only supports 30 text tokens [1]!!
This is definitely a welcome addition.
Any pointers to similarly powerful embedding models? I'm looking specifically for text and images. I wish there were also one that could do audio and video, but I don't think that exists.
[1] https://cloud.google.com/vertex-ai/generative-ai/docs/embedd...
I also built this into a version of an open-source read-it-later app.
You can check it out here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...
It literally has the entire IaC stack for you to deploy it yourself.
Anecdotal evidence points to benchmarks correlating with result quality for data I've dealt with. I haven't spent a lot of time comparing results between models, because we were happy with the results after trying a few and tuning some settings.
Unless my dataset lines up really well with a benchmark's dataset, creating my own benchmark is probably the only way to know which model is "best".
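If I did build one, it wouldn't need to be elaborate. A sketch, assuming a handful of hand-labeled query-to-relevant-document pairs from my own data and a hypothetical embed_with(model_name, texts) helper that returns normalized vectors:

```python
import numpy as np

def recall_at_k(model_name, documents, labeled_pairs, k=5):
    """labeled_pairs: list of (query, index_of_relevant_document)."""
    doc_vecs = embed_with(model_name, documents)  # hypothetical helper
    hits = 0
    for query, relevant_idx in labeled_pairs:
        q = embed_with(model_name, [query])[0]
        top_k = np.argsort(doc_vecs @ q)[::-1][:k]
        hits += int(relevant_idx in top_k)
    return hits / len(labeled_pairs)
```

Run that over each candidate model with the same pairs, and the "best" model is whichever wins on your data, not on a leaderboard.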
It feels like embedding content that large -- especially in dense texts -- will lead to loss of fidelity/signal in the output vector.
For example, "The configuration mentioned above is critical" now "knows" what configuration is being referenced, along with which project and anything else talked about in the document.
Voyage-3-large is a text-only model and much larger than Embed-v4. If you want to unlock multimodality with Voyage-3-large, you'd have to either OCR your data (usually with really bad results) or use a VLM to parse it into textual descriptions (this works alright, but the cost of a VLM will jack up your data-preprocessing costs).
Not to mention an image is optimistically 50 KB, while the same page represented as Markdown is maybe 2–5 KB. When you're talking about pulling in potentially hundreds of pages, that's a 10–20x increase in storage, memory usage, and network overhead.
I do wish they had a more head-to-head comparison with voyage. I think they're the de facto king of proprietary embeddings and with Mongo having bought them, I'd love to migrate away once someone can match their performance.
I looked at the NDCG numbers and thought they came from the same dataset, since Voyage and Cohere both used NDCG. I now realize they were separate benchmarks with the same evaluation metric.
You’re right, there’s no other way to compare embeddings than a benchmark.
It's just that what the benchmarks used by Voyage and Cohere track might not be relevant to your own needs.
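For reference, NDCG is just position-discounted relevance normalized by the best possible ordering, which is why identical NDCG values computed on different datasets aren't comparable. A minimal sketch:

```python
import math

def dcg(relevances):
    # Relevance at rank i (1-indexed) is discounted by log2(i + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    best = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / best if best > 0 else 0.0

# Graded relevance of the top five results returned by some model:
print(ndcg_at_k([3, 2, 0, 1, 0], k=5))  # ~0.985
```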
We're switching to the V4 to store unified embeddings of our products. From the early tests we ran, this should help with edge case relevancy (i.e. when a product's image and text mismatch, thus creating a greater need for multi-modal embeddings) and improve our search speed by ~100ms.
While we benchmarked internally on BEIR, we opted not to report our model on MTEB for the following reasons:
1) MTEB has been gamed - if you look at this model (https://huggingface.co/voyageai/voyage-3-m-exp) on the MTEB leaderboard, it's an intermediate checkpoint of Voyage-3-Large that was finetuned on datasets representative of the MTEB datasets.
2) If you look at the recent datasets in MMTEB, you'll find quite a lot of machine-translated or "weird" datasets that are noisy.
In general, for our search models, we benchmark on these public academic datasets, but we definitely do not try to hill-climb in this direction, as we find it has little correlation with real use cases.
The current state of chunking and transforming is such a mess.
Drop me an email at elliott@cohere.ai
Do they just host open-source models, so you can get them up and running faster?
If so, what’s their moat?
What prevents AWS from doing the same thing?