Posted by rettichschnidi 2 days ago
I understand the analogy to other types of critical data often not included in open-source distros (e.g. Quake III's source is GPL but its resources like textures are not, as mentioned in the article). The distinction is that in these cases the data does not clarify anything about the functioning of the engine, nor does its absence obscure anything. So by my earlier smell test it makes sense to say Quake III is open source.
But open-sourcing a transformer ANN without the training data tells us almost nothing about the internal functioning of the software. The exact same source code might be a medical diagnosis machine, or a simple translator. It does not pass my smell test to say this counts as "open source." It makes more sense to say that ANNs are data-as-code programming paradigms, glued together by a bit of Python. An analogy would be if id released its build scripts and announced Quake III was open-source, but claimed the .cpp and .h files were proprietary data. The batch scripts tell you a lot of useful info - maybe even that Q3 has a client-server architecture - but they don't tell you that the game is an FPS, let alone the tricks and foibles in its renderer.
Training data simply does not help you here. Our existing architectures are not explainable or auditable in any meaningful way, training data or no training data.
I don't necessarily agree and suggest the Open Source Definition could be extended to cover data in general (media, databases, and yes, models) with a single sentence, but the lowest risk option is to not touch something that has worked well for a quarter century.
The community is starting to regroup and discuss possible next steps over at https://discuss.opensourcedefinition.org
To the extent the problem is intractable, I think it mostly reflects that LLMs have an enormous amount of training data and do an enormous amount of things. But for a given specific problem the training data can tell you a lot:
- whether there is test contamination with respect to LLM benchmarks or other assessments of performance
- whether there's any CSAM, racist rants, or other things you don't want
- whether LLM weakness in a certain domain is due to an absence of data or if there's a more serious issue
- whether LLM strength in a domain is due to unusually large amounts of synthetic training data and hence might not generalize very reliably in production (this is distinct from test contamination - it is issues like "the LLM is great at multiplication until you get to 8 digits, and after 12 digits it's useless")
- investigating oddness like that LeetMagikarp (or whatever) glitch in ChatGPT
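To make the first bullet concrete, here is a minimal sketch of a test-contamination check, assuming you actually have the training documents as plain text. The function names (`ngrams`, `contamination_rate`) and the whitespace tokenization are my own illustrative choices, not any lab's actual pipeline; real checks use proper tokenizers and far larger indexes.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    benchmark_items: Iterable[str],
    training_docs: Iterable[str],
    n: int = 8,
) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training docs."""
    # Hypothetical in-memory index; a real corpus would need something disk-backed.
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    items = list(benchmark_items)
    if not items:
        return 0.0
    flagged = sum(1 for item in items if ngrams(item, n) & train_grams)
    return flagged / len(items)
```

Without the training data there is nothing to pass as `training_docs`, which is the whole point: the weights alone do not let an outsider run this kind of audit.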
They're really a kind of database. Perhaps a better way to think about it is in terms of "commons". Consider how Creative Commons licenses are explicit about requirements like attribution, noncommercial, share-alike, etc.; that feels like a useful model for talking about weights.
If you want to talk about the openness and accessibility of these systems I'd just ditch the "source" part and create some new criteria for what makes an AI model open.
I don't really know what you're referring to....
This makes sense. What the OSI gets right here is that the artifact that is made open source is the weights. Making modifications to the weights is called fine tuning and does not require the original training data any more than making modifications to a piece of source code requires the brain of the original developer.
Tens of thousands of people have fine-tuned these models for as long as they've been around. Years ago I trained GPT-2 to produce text resembling Shakespeare. For that I needed Shakespeare, not GPT-2's training data.
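A toy sketch of what I mean, assuming a one-feature linear model stands in for the "foundation model". The `fine_tune` function and the numbers are purely illustrative; nothing here is how GPT-2 is actually trained. The only inputs are the released weights and the new task data:

```python
from typing import List, Sequence, Tuple


def fine_tune(
    weights: Sequence[float],
    bias: float,
    new_data: Sequence[Tuple[Sequence[float], float]],
    lr: float = 0.05,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Continue training a linear model from existing weights on new data only."""
    w = list(weights)
    b = bias
    for _ in range(epochs):
        for x, y in new_data:
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            # Plain SGD step on squared error; the original training set is never touched.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# "Pretrained" weights encode y = x; the new task is y = 2x + 1.
pretrained_w, pretrained_b = [1.0], 0.0
task_data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, task_data)
```

Whatever data produced `pretrained_w` never appears in the function signature, which is the analogy to needing Shakespeare rather than GPT-2's corpus.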
The training data is properly part of the development process of the open source artifact, not part of the artifact itself. Some open source companies (GitLab) make their development process fully open. Most don't, but we don't call IntelliJ Community closed source on the grounds that they don't record their meetings and stream them for everyone to watch their planning process.
Edit: Downvotes are fine, but please at least deign to respond and engage. I realize that I'm expressing a controversial opinion here, but in all the times I've posted similar no one's yet given me a good reason why I'm wrong.
In short, your argument doesn’t work because source code is to binaries as training data is to MLMs. Source code is the closest comparison we have with training data, and the useless OSI claims that’s only a “benefit” not a “requirement”. This isn’t a stance meant for long term growth but for maintaining a moat of training data for “AI” companies.
Because the binaries were not licensed under a FOSS license?
Also, as I note in another comment [0], source code is the preferred form of a piece of software for making modifications to it. The same cannot be said about the training data, because getting from that to weights costs hundreds of millions of dollars in compute. Even the original companies prefer to fine-tune their existing foundation models for as long as possible, rather than starting over from training data alone.
> In short, your argument doesn’t work because source code is to binaries as training data is to MLMs.
I disagree. Training data does not allow me to recreate an LLM. It might allow Jeff Bezos to recreate an LLM, but not me. But weights allow me to modify it, embed it, and fine tune it.
The weights are all that really matters for practical modification in the real world, because in the real world people don't want to spend hundreds of millions to "recompile" Llama when someone already did that, any more than people want to rewrite the Linux kernel from scratch based on whiteboard sketches and mailing list discussions.
So a URL to the data? To download the data? Or what? Someone says "Just scrape the data from the web yourself." And a skilled person doesn't need a URL to the source data? No source? Source: The entire WWW?
"Open source" should include the training code and the data. Anything you need to train from scratch or fine tune. Otherwise it's just a binary artifact.
Just create a couple more for AI, one with training data, one without.
Holy grail thinking, finding "the one and only open" license instead of "an open" license, is in a sense anti-open.
and since what humans say is more horrible than good, the whole thing is a garbage mine
Go talk to the crews who have been maintaining the Concise Oxford for the last number of centuries, or the French government and the department in charge of regulating the French language, remembering that the French all but worship their language.
There you will find, perhaps, insight, or terror at the idea of creating a standard, consistent, concise, and usable LLM.
anything else is closed source. it's as simple as that.
At the end of the day, what threatens OpenAI is falling apart before they hit the runway. They can't lose the Microsoft deal, they can't lose more founders (almost literally at this point) and they can't afford to let their big-ticket partnerships collapse. They are financially unstable even by Valley standards - one year in a down market could decimate them.