Posted by joshdickson 1 day ago
Today I’m excited to launch OpenNutrition: a free, ODbL-licensed nutrition database of everyday generic, branded, and restaurant foods; a search engine that can browse the web to import new foods; and a companion app that bundles the database and search into a free macro tracking app.
Consistently logging the foods you eat has been shown to support long-term health outcomes (1)(2), but doing so easily depends on having a large, accurate, and up-to-date nutrition database. Free, public databases are often out-of-date, hard to navigate, and missing critical coverage (like branded restaurant foods). User-generated databases can be unreliable or closed-source. Commercial databases come with ongoing, often per-seat licensing costs, and usage restrictions that limit innovation.
As an amateur powerlifter and long-term weight loss maintainer, I care deeply about helping others pursue their health goals. After exiting my previous startup last year, I wanted to investigate using LLMs to create the database and infrastructure required to make a great food logging app, cost-engineered for free and accessible distribution, because I believe the availability of these tools is a public good. That led to the dataset I’m releasing today; nutritional data is public record, and its organization and dissemination should be, too.
What’s in the database?
- 5,287 common everyday foods, 3,836 prepared and generic restaurant foods, and 4,182 distinct menu items from ~50 popular US restaurant chains; foods have standardized naming, consistent numeric serving sizes, estimated micronutrient profiles, descriptions, and citations/groundings to USDA, AUSNUT, FRIDA, CNF, etc., when possible.
- 313,442 of the most popular US branded grocery products with standardized naming, parsed serving sizes, and additive/allergen data, grounded in branded USDA data; the most popular 1% have estimated micronutrient data, with the goal of full coverage.
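To make the shape of these entries concrete, here is a rough sketch of how the dataset could be consumed in Python. The file name and column names below (name, serving_size_g, calories, source_citations) are illustrative assumptions, not necessarily the exact export schema; check the download page for the real format.

    import csv

    # Hypothetical file/column names -- see the download page for the
    # actual export schema.
    with open("opennutrition_foods.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Keep only entries that carry a citation/grounding to a source database.
    grounded = [r for r in rows if r.get("source_citations")]

    # Simple sanity check: nothing real exceeds ~9 kcal per gram (pure fat),
    # so flag anything above that for manual review.
    suspect = [
        r for r in grounded
        if r.get("calories") and r.get("serving_size_g")
        and float(r["calories"]) > 9.5 * float(r["serving_size_g"])
    ]
    print(f"{len(grounded)} grounded entries, {len(suspect)} flagged for review")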
Even the largest commercial databases can be frustrating to work with when searching for foods or customizations without existing coverage. To solve this, I created a real-time version of the same approach used to build the core database that can browse the web to learn about new foods or food customizations if needed (e.g., a highly customized Starbucks order). There is a limited demo on the web, and in-app you can log foods with text search, via barcode scan, or by image, all of which can search the web to import foods for you if needed. Foods discovered via these searches are fed back into the database, and I plan to publish updated versions as coverage expands.
- Search & Explore: https://www.opennutrition.app/search
- Methodology/About: https://www.opennutrition.app/about
- Get the iOS App: https://apps.apple.com/us/app/opennutrition-macro-tracker/id...
- Download the dataset: https://www.opennutrition.app/download
OpenNutrition’s iOS app offers free essential logging and a limited number of agentic searches, plus expenditure tracking and ongoing diet recommendations like those in best-in-class paid apps. A paid tier ($49/year) unlocks additional searches and features (data backup, prioritized micronutrient coverage for logged foods) and helps fund further development and broader library coverage.
I’d love to hear your feedback, questions, and suggestions—whether it’s about the database itself, a really great/bad search result, or the app.
1. Burke et al., 2011, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3268700/
2. Patel et al., 2019, https://mhealth.jmir.org/2019/2/e12209/
This is not a dataset. This is an insult to the very idea of data. This is the most anti-scientific post I have ever seen voted to the top of HN. Truth about the world is not derived from three LLMs stacked on top of each other in a trenchcoat.
Adding an LLM to this just adds an unnecessary layer of complexity, for what benefit? For street cred?
Millions of people use food logging apps to drive behavioral change and help adhere to healthy lifestyles. I believe there is immense societal good in continuing to offer improved tools to accomplish this, especially for free, and that's why I created the project and chose to open source the data.
https://www.opennutrition.app/about#current-state-of-nutriti...
https://www.reddit.com/r/ABoringDystopia/comments/1jq8kzl/th...
This workflow, this motivation, this business model, this marketing is an affront to truth itself.
This is data from the world that has been altered and augmented with stuff from a model. The informational content has been altered by stuff not from the world. Therefore it’s no longer data, according to the above definition.
That isn’t to say that it can’t be useful, or anything like that. But it’s _not_ information collected from the world. And that’s why people who care about science and a strict definition of data would be offended by calling this a dataset.
What is the source of that nutritional data?
Some of them are fundamentalists, and no amount of reason will reach them (read the comments on the Ghibli-style images to get a sample), others are opposed for very self-interested reasons: "It is difficult to get a man to understand something when his income depends on his not understanding it"
Yesterday I vibe-coded a DNS server in Python from scratch in half a day (!), and it works extremely well after a few minutes spent manually improving a specific edge case for reverse DNS with AAAA records: dig -x queries use the exploded form under ip6.arpa, while I think it's better for the AAAA entries to keep using the compressed form, so I wanted to generate the reverse entries algorithmically from the AAAA and A records.
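If it helps anyone, the standard library already does the exploded-nibble dance. Here's a minimal sketch (the zone dict is made up) of deriving PTR names from compressed AAAA/A values with ipaddress.reverse_pointer, which produces exactly the form that dig -x queries for:

    import ipaddress

    # Made-up forward records, stored in compressed form in the zone.
    forward = {
        "host1.example.com.": "2001:db8::1",   # AAAA
        "host2.example.com.": "192.0.2.10",    # A
    }

    # reverse_pointer expands IPv6 addresses into the full nibble form
    # under ip6.arpa (and octet form under in-addr.arpa for IPv4).
    reverse = {
        ipaddress.ip_address(addr).reverse_pointer + ".": name
        for name, addr in forward.items()
    }

    for ptr, name in sorted(reverse.items()):
        print(f"{ptr} PTR {name}")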
Just ignore them, as your approach is sound: I have experience creating, curating and improving datasets with LLMs.
Like vibe coding, it works very well if you know what you are doing: here, you just have to use statistics to leverage the non-deterministic aspects of AI in your favor.
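Concretely, one way to do that (a sketch, with ask_llm as a placeholder for whatever model call you're making) is to sample the same question several times and aggregate with robust statistics, escalating only when the runs disagree:

    import statistics

    def ask_llm(prompt: str) -> float:
        # Placeholder for one non-deterministic model call returning a
        # single numeric estimate (e.g. mg of calcium per 100 g).
        raise NotImplementedError

    def estimate(prompt: str, n: int = 7, max_spread: float = 0.15):
        samples = [ask_llm(prompt) for _ in range(n)]
        mid = statistics.median(samples)
        spread = (max(samples) - min(samples)) / mid if mid else float("inf")
        # A wide spread across runs means the model is guessing: escalate
        # to a stronger model or manual review instead of trusting the median.
        return mid, spread <= max_spread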
Good luck with your app!
This is true of so very many things involving computers (and tools in general, really) and LLMs are no exception. Just like any tool, "knowing what you are doing" is the really important part, but so many folks are convinced that these "AI" things can do the thinking part for them, and it's just not the case (yet). You gotta know what you're doing and how to properly use the tool to avoid a lotta the "foot-guns" and get the most benefit outta these things.
Not representative for me because I don't have US food, but since it's AI-enhanced I can't compare my stuff with the stuff in the "dataset" and be sure whether that's a US vs. Germany thing.
It looks like for unsweetened oat milk:
https://www.opennutrition.app/search/unsweetened-oat-milk-mt...
...it is leaning into a citation from the Australian Nutrient Database (e.g. Oat beverage, fluid, unfortified. Australian Nutrient Database. Public Food Key F006132. ), which is what I instructed it to do if it thought there was an exact match from a governmental database.
It's possible this is a poor general source for oat milk, or that it's not the beverage the entry is intended to represent. I'll check it out, thank you for the report.
>> Foods discovered via these searches are fed back into the database,
Aren’t LLMs also unreliable? How do you ensure the new content is from an authoritative, accurate source? How do you ensure the numbers that make it into the database are actually what the source provided?
According to the Methodology/About page
>> The LLM is tasked with creating complete nutritional values, explicitly explaining the rationale behind each value it generates. Outputs undergo rigorous validation steps,
Those rigorous validation steps were also created with LLMs, correct?
>> whose core innovations leveraged AI but didn’t explicitly market themselves as “AI products.”
Odd choice for an entirely AI-based service. First thought I had after reading that was: must be because people don’t trust AI-generated information. Seems disingenuous to minimize the AI aspect in marketing when this product only exists because of AI.
Great idea though, thanks for giving it a shot!
Not really. I do explain in the methodology post how good o1-pro is at the task, but there was a lot of manual effort involved in coming to that conclusion, including my own review of the LLM's reasoning, and even still, o1-pro is not perfect.
>> Outputs undergo rigorous validation steps, including cross-checking with advanced auditing models such as OpenAI’s o1-pro, which has proven especially proficient at performing high-quality random audits.
>> there was a lot of manual effort involved in coming to that conclusion with my own effort to review the LLM's reasoning
So, the randomly audited entries seemed reasonable to you – not even the data itself, just the reasoning about the generated data. Did the manual reviews stop once things started looking good enough? Are the audits ongoing, to fill out the rest of the dataset? Would those be manually double-checked as well?
>> I became interested in exploring how recent advances in generative AI could enable entirely new kinds of consumer products—ones whose core innovations leveraged AI but didn’t explicitly market themselves as “AI products.”
Once again: Why not market this as an AI product? This is LLMs all the way down.
People are already interested in using this dataset. I was. Now, LLM-generated “usually close enough to not be actively harmful” data is being distributed as a source for any and all to use. I think your disclaimer is excellent. Does your license require an equivalent disclaimer be provided by those using this data?
Poor phrasing on my end -- yes, absolutely the end data as well as the reasoning, as the reasoning tends to include the final answer.
Maybe I should! Appreciate the feedback.
It looks like a lot of work and goodwill went into this, and I can see how it could be useful to a fitness-focused audience.
You control the messaging on the site and in your apps, and you make it clear that this is not authoritative data. Everything built on top of this needs to have the same messaging, but it has probably been ingested into multiple LLMs already.
I think some sort of licensing requirement that the LLM source of this data be prominently disclosed will not keep this from becoming a source of truth for other datasets, products, and services; but, it is still worth the effort. All you can do is all you can do, right?
> TL;DR: They are estimates from giving an LLM (generally o3 mini high due to cost, some o1 preview) a large corpus of grounding data to reason over and asking it to use its general world knowledge to return estimates it was confident in, which, when escalating to better LLMs like o1-pro and manual verification, proved to be good enough that I thought they warranted release.
Most of the data being close enough to be better than nothing and not actively harmful + a disclaimer and the author is absolved of all responsibility!
Even better, this will now be used in all sorts of other apps, analyses, and for training other LLMs! And I expect all of those will also prominently include “all of this was generated by an LLM” disclaimers. For sure.
1. Generic, non-branded foods
2. Simple prepared foods that ease food entry
3. Restaurant foods
4. Micronutrients beyond those reported by the brand.
OFF is a fantastic project but OpenNutrition is really trying to fit a different niche. OFF does what it does very well; I would never be able to use it to track my food intake.
We're happy to cover more use-cases, so feel free to join the project and contribute your time/coding skills to help us solve those issues. https://slack.openfoodfacts.org or https://forum.openfoodfacts.org or directly https://github.com/openfoodfacts
Appreciate the feedback!
> I wanted to investigate the possibility of using LLMs
ah, yeah, I guess it makes sense then...
Edit: Should be patched in Desktop Safari now.
The first item I manually looked up has about double the calories listed in the "dataset" versus reality: Honey Bunches of Oats, Honey Roasted.
OpenNutrition: https://www.opennutrition.app/search/honey-bunches-of-oats-h...
Via Manufacturer: https://www.honeybunchesofoats.com/product/honey-bunches-of-...
If you wouldn't mind DM'ing me the barcode you're looking at that would be helpful to understand what the nature of the discrepancy is.
How can a large egg (50 g) contain 147 g choline?
https://www.opennutrition.app/search/eggs-eeG7JQCQipwf
Additionally, on https://www.opennutrition.app/search/brown-lentils-VwKWF7CQq... it says:
> Unlike larger legumes, they require no pre-soaking and cook in 20-30 minutes, making them ideal for soups, stews, and salads
That is not necessarily true. In my experience, brown lentils do require pre-soaking; otherwise you have to cook them for a long time, as opposed to red lentils (which are done in under 15 minutes, no pre-soaking needed), although red lentils taste more like yellow peas.
In any case, I think this could be really useful, once accurate enough. One could even implement other features on top, such as a calorie tracker and so forth, but that is a huge project on its own.
I wish you luck!
BTW, when you hover over the ingredients, you just get back the name. Are you guys going to do something with it in the future? Right now there is visual feedback (the cursor changes), but it is not useful yet. I am not entirely sure what I would have expected; perhaps a description of what the ingredient is, and, upon clicking on it, information gathered from various sources like examine.com and what have you. That would be a huge change on its own, though; a short description on mouse hover should work for now and would not be a huge change.
Right now you'll see that aggregated on some items like this where the reported data is an ensemble of all of the linked resources: https://www.opennutrition.app/search/eggs-eeG7JQCQipwf
Frankly, I just couldn't justify the additional time and monetary expense of doing that in case I released this initial version and nobody cared or found it useful. This dataset was also compiled before tools like Claude Citations came out, which could make it easier. That is the nature of AI-driven data; I think this is useful now, and it is also the worst it will ever be.
Keep it as accurate as possible, and maintainable, and then it will be easy to add larger features. If no one else does, I might add a calorie tracker of some sort, it would be helpful to my mom. It is helpful as it is even now. How difficult would it be to add translations right now? She might look for "tojás" which is "egg" in Hungarian, and I would like her to be able to do that at some point.
Really easy to use (just scan the barcode and you get easily digestible data about the product), it has every product imaginable, also analyzes cosmetics, and best of all, all the basic functionality is free.
Not affiliated, been using it for years at this point and now it's an essential partner when going shopping. That they let people decide their own premium pricing per year is just icing on the cake.
So, very little nutrient info beyond calories and protein. No info about micronutrients. No info about minerals, vitamins, amino acids, fatty acids.
It's useless for nutrition tracking since if you're eating packaged food, then you already have that information yourself.
It doesn't answer basic questions like "I ate 100g of extra firm tofu, how did it move me towards my daily mineral/vitamin targets?"
Many items do have these things.
https://world.openfoodfacts.org/product/5060495116377/huel-b...
But consider that OpenFoodFacts can't give you that info on just about anything else, especially not basic foods like "apples" or "tofu" or "chicken breast".
I'm not dumping on the project. It's really useful to have a database of packaged food labels. It's just not trying to solve this problem.
I was looking at this page: https://www.opennutrition.app/search/original-shells-cheese-... and saw the amino acid, vitamin, and mineral sections; there are many things listed that aren't covered by the official nutritional data. These entries also have very precise numbers, but I'm not sure where and how they're derived and whether I could put any serious weight on them. I'd love to hear more if you're willing to share!
You can read about the background on how I did them in more detail in the about/methodology section: https://www.opennutrition.app/about (see "Technical Approach")
My guess is that this dataset is probably more accurate on the whole than many datasets used by the kinds of calorie-tracking apps that outsource their collection of nutrition information to users. But an analysis would be required.
Regardless, the only workable approach is to describe the provenance of your data and explain what steps have been taken to ensure accuracy. Then anyone who wants to use the data can account for that information.