I recently learned that traditionally in Shipibo culture, ayahuasca was never meant to be given to "the normal mind". Instead, the maestras would be the ones to take the ayahuasca, using it to guide them in diagnosing people dealing with various sicknesses.
These maestras were also ranked by how many different plants they'd done a dieta on. A dieta is kinda similar to fasting: you can't shower with soap, you can't have sex, you can't have too much salt or seasoning, you can't be exposed to too much smoke, you can't have alcohol, etc. And you take that specific plant throughout that time. Basically, you want to eliminate any conflicting variables so you can experience the plant as purely as possible and understand its effects. Traditionally these dietas could last over a year, but modern-day maestros typically do them for just a few weeks.
I don't really have a point to this. Just found it fascinating how deeply and strictly they study certain plant medicines and wanted to share.
(Fwiw I've accumulated a couple years' worth of dieta under my belt and am well aware of the restrictions! It's indeed very fascinating; I've been pretty serious about it the last few years and I've barely scratched the surface.)
FYI - Lens on Android does in-place language translation, including attempting to render the result in the same or a similar font to the one the original text is printed in.
Unfortunately, I don't think Lens can be used in an automated batch-translation mode to convert an entire book or multiple pages.
And that translation is likely only a rough approximation, as words often don't translate directly. Adding an extra layer (Spanish -> English) seems like another imperfect abstraction on top, due to the language gap.
Of course, your efforts are targeting a niche, so people will likely understand the attempt and be thankful. I hope this suggestion isn't too forward, but since this is an electronic version, you could allow some way for the original Spanish to be shown if desired. That sort of functionality would be quite helpful; even non-native Spanish speakers might get a clearer picture.
What tools are you using to extract all of this?
If the spacing and columns of the images are consistent, I'd think ImageMagick would allow you to automate extraction by column (e.g., cutting the individual pages up), and OCR could then get to work.
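A minimal sketch of that column cut, assuming ImageMagick 7 (the magick command), consistent two-column pages, and directory names of my own invention:

    # Split each page into two equal columns (50% width, full height).
    # The %d in the output name yields page-1-col0.png and page-1-col1.png.
    mkdir -p cols
    for f in pages/page-*.png; do
      magick "$f" -crop 50%x100% +repage "cols/$(basename "$f" .png)-col%d.png"
    done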
For the Shipibo side, I'd want to turn off all LLM interpretation. That tends to use known groupings of words to probabilistically determine the best match, and that'd wreak havoc in this case.
Back to the images: once you have ImageMagick chop and sort, writing a very short script to iterate over the pages, display them, and prompt with y/n would be a massive time saver. Doing so at each step would be helpful.
For example, one step: cut off the header and footer, save to a dir, using helpful naming conventions (page-1 and page-1-noheader_footer). You could then use ImageMagick to combine page-1 and page-1-noheader_footer side by side.
Now run a simple bash vet script. Each of 500 pages pops up, you instantly see the original and the cut result, and you hit y or n. One could go through 500 pages like this in 10 to 20 minutes, and you'd be left with a small subset of pages that didn't get cut properly (extra large footer or whatever). If it's down to 10 pages or some such, that's an easy tweak and fix for those.
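A sketch of that vet loop, assuming ImageMagick 7 and feh as the viewer (any image viewer would do; paths and names are illustrative):

    #!/usr/bin/env bash
    # Vet each cut against its original: view side by side, press y or n.
    mkdir -p vet redo
    for f in pages/page-*.png; do
      base=$(basename "$f" .png)
      # Original on the left, cut result on the right.
      magick "$f" "cut/${base}-noheader_footer.png" +append "vet/${base}.png"
      feh "vet/${base}.png" & viewer=$!
      read -r -n1 -p "${base} ok? [y/n] " ans; echo
      kill "$viewer" 2>/dev/null
      [ "$ans" = "y" ] || cp "$f" redo/   # misfires get a manual pass later
    done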
Once done, you could do the same for column cuts. You'd already have all the scripts, so it's just tweaking.
I'm mentioning all of this because a combo of automation plus human intervention is often the best approach to something like this.
Anyhow, good luck!
I would love to create a JSON version of it that would essentially have a bunch of fields for each word (Shipibo/Spanish/English word, definition, example, type of word, etc). It's further complicated by how words can be modified in Shipibo (it's actually a very technical language: words can take any number of prefixes and suffixes to change their meaning and their precision. In their "icaros", the healing songs they sing in ceremony, the most technical use of the language is considered to be the most beautiful. Essentially poetry from their "medical" jargon).
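For what it's worth, a minimal sketch of what one entry might look like; the field names are just my reading of the description above, not a real schema:

    {
      "shipibo": "...",
      "spanish": "...",
      "english": "...",
      "type": "noun | verb | ...",
      "affixes": ["prefixes/suffixes attested for this root"],
      "definition": "...",
      "example": { "shipibo": "...", "spanish": "...", "english": "..." }
    }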
I've done some human-in-the-loop attempts but still come up short in one way or another (I end up getting frustrated and throwing my hands up after seeing how much time I dump on it). So I figure this will remain a good test as the tools (and my prompting abilities) get better. It's definitely not urgent for me.
https://urn.digitalarkivet.no/URN:NBN:no-a1450-rk10101508282...
and the output wasn't even recognizably Danish.
Just out of pity I gave it a birthday card from my sister written in very readable modern handwriting, and while it managed to make the contents of that readable, the errors it made reveal that it has very little contextual intelligence. Even if ! and ? can be hard to tell apart sometimes, they weren't here, and you do not usually start a birthday letter with "Happy Birthday brother?"
> the output wasn't even recognizably Danish
How would you know that it's good then?
It seems like the EU in general should be heavily invested in Mistral's development, but it doesn't seem like they are.
I don't know... it feels like this sort of area, while not nearly as sexy as video production or coding or (etc.), is one where reaching a better-than-human performance level should be easier for these kinds of workloads.
Until then, they seem to be able to keep enough talent in the EU to train reasonably good models. The kernel is there, which seems like the attainable goal.
Are they? IIRC their best model is still worse than gpt-oss-120B?
Though I haven't checked other benchmarks, and they only report SWE-bench.
Of course, Mistral currently has an insane free tier: 1 billion tokens per month for each(?) of their models.
This goes to show how leaders at Mistral don't quite get that they are not as special as they seem to think. Anthropic or OpenAI also require their talent to relocate, but with stakes that at least carry a high reward: $500k or $1M a year is a good start that may be worth the move.
The best talent has been regularly leaving Paris and London, India and China for decades. With the US closing its borders, they definitely have a chance to lure some of it.
Would you find it compelling to move your whole life for ~100k EUR when you can make as much or more in your home city, with a job that is almost certainly more stable?
And I meant the Europeans. People in the EU don't have a culture of moving between cities or countries unless they have a really strong reason to, e.g. they can't find a job at home.
> would it really be that surprising if there was more unallocated talent in the EU, at this point?
I am pretty sure there is. It has changed over the course of the last few years, primarily because of COVID and companies becoming willing to offer remote contracts, but the market is still far from able to utilize that talent.
Southern and Eastern Europeans certainly do.
The EU is extremely invested in Mistral's development: half of the effort is finding ways to tax them (hello Zucman tax), and the other half is wondering how to regulate them (hello AI Act).
Maybe. I think it will be to our benefit when the bubble pops that we are not heavily invested; there's no harm in investing a little.
EDIT: you can try it yourself for free at https://console.mistral.ai/build/document-ai/ocr-playground once you create a developer account! Fingers crossed to see how well it works for my use case.
Regular Gemini Thinking can actually get 70-80% of the documents correct, except for lots of mistakes on given names. ChatGPT maybe understands like 50-60%.
This Mistral model butchered the whole text, literally not a word was usable. To the point I think I'm doing something wrong.
The test document: https://files.fm/u/3hduyg65a5
The model might need tuning in order to be effective; this is normal for releases of image-mode models, and after a couple of days there will be properly set up endpoints to test from, so it might be much better than you think. Or it could be really bad with turn-of-the-19th-century Portuguese cursive.
We were mind blown how good Gemini was at it.
Huge timesaver.
> can someone help folks at Mistral find more weak baselines to add here? since they can't stomach comparing with SoTA....
> (in case y'all wanna fix it: Chandra, dots.ocr, olmOCR, MinerU, Monkey OCR, and PaddleOCR are a good start)
Their failure modes are also vastly different. VLM-based extraction can misread entire sentences or miss entire paragraphs; Sonnet 3 had that issue. Computer-vision models instead make in-word typos.
Edit: Gemini 2.0 was good enough for VLM cleanup, and now 2.5 or above with structured output makes reconstruction even easier.
On their website, the benchmarks say “Multilingual (Chinese), Multilingual (East-asian), Multilingual (Eastern europe), Multilingual (English), Multilingual (Western europe), Forms, Handwritten, etc.” However, there's no reference to the benchmark data.
I'm still hoping for improved locally hosted models: qwen3-vl:30b-a3b-thinking-q4_K_M is already really good.
- paddleOCR-VL
- olmOCR-2
- chandra
- dots.ocr
I kind of miss having leaderboard sections or an arena for OCR and CV models, and for the providers hosting them. They're neglected on both Artificial Analysis and OpenRouter.
https://www.ocrarena.ai/leaderboard
Hasn't been updated for Mistral, but so far Gemini seems to top the leaderboard.
Getting the wrong answer really quickly is not the best goal.
E.g., with Gemini 3.0 Flash it might seem that model pricing increased only slightly compared to Gemini 2.5 Flash, until you test it and see that what used to be 258 input tokens per 384x384 image is now around 3x more.
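Rough arithmetic, assuming the 3x figure holds per tile: 258 × 3 ≈ 774 input tokens per 384x384 tile, so a page that tiles into, say, 8 such tiles goes from ~2,064 to ~6,192 input tokens before a single output token is generated.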
Now I have to figure out how large a page can be.
It took an hour and a half to install 12 gigabytes of PyTorch dependencies that can't even run on my device, and then it told me it had some sort of versioning conflict. (I think I was supposed to use uv, but I had run out of steam by that point.)
Maybe I should have asked Claude to install it for me. I gave Claude root on a $3 VPS, and it seems to enjoy the sysadmin stuff a lot more than I do...
Incidentally, I had a similar experience installing Open WebUI... It installed 12 GB of PyTorch crap. I rage-quit and deleted the whole thing, and replicated the functionality I actually needed in 100 lines of HTML... Too bad I can't do that with OCR ;)
But yes, in general, you want to use uv. Otherwise, the next Python application you install WILL break the last one you installed.
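For what it's worth, a minimal sketch of the uv workflow (the package name here is a placeholder, not the actual tool):

    # Each tool gets its own isolated environment, so installs can't clobber each other.
    uv tool install some-ocr-tool        # hypothetical package name
    uvx some-ocr-tool scan.png           # or run it one-off, without installing
    # For a cloned repo, a project-local venv:
    uv venv && uv pip install -r requirements.txt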
I suppose you could use gemini-cli as a substitute for proper Python virtual environment management, always letting it fix whatever broke since the last time you tried to run the program, but that'd be like burning down a rainforest to toast a marshmallow.
I don’t know how they can make this statement with a 79% accuracy rate. For any serious use case, this is an unacceptable number.
I work with scientific journals, and issues like 2.9±0.5 being misread as 29±0.5 are something we regularly run into. It means we can never fully trust automated processes and require human verification at every step.
What matters is whether this is better than the competition/alternatives. Of course nobody is just going to take the output as-is; if you do that, that's your problem.
If I am wildly off, I am happy to learn.
The previous version achieved up to 99% accuracy in multiple benchmarks, already better than most OCR software.