OSI readies controversial open-source AI definition

Posted by rettichschnidi 2 days ago

OSI readies controversial open-source AI definition (lwn.net)

114 points | 133 comments

aithrowawaycomm 2 days ago|

What I find frustrating is that this isn't just about pedantry - you can't meaningfully audit an "open-source" model for security or reliability problems if you don't know what's in the training data. I believe that should be the "know it when I see it" test for open-source: has enough information been released for a competent programmer (or team) to understand how the software actually works?

I understand the analogy to other types of critical data often not included in open-source distros (e.g. Quake III's source is GPL but its resources like textures are not, as mentioned in the article). The distinction is that in these cases the data does not clarify anything about the functioning of the engine, nor does its absence obscure anything. So by my earlier smell test it makes sense to say Quake III is open source.

But open-sourcing a transformer ANN without the training data tells us almost nothing about the internal functioning of the software. The exact same source code might be a medical diagnosis machine, or a simple translator. It does not pass my smell test to say this counts as "open source." It makes more sense to say that ANNs are data-as-code programming paradigms, glued together by a bit of Python. An analogy would be if id released its build scripts and announced Quake III was open-source, but claimed the .cpp and .h files were proprietary data. The batch scripts tell you a lot of useful info - maybe even that Q3 has a client-server architecture - but they don't tell you that the game is an FPS, let alone the tricks and foibles in its renderer.

lolinder 2 days ago|

> I believe that should be the "know it when I see it" test for open-source: has enough information been released for a competent programmer (or team) to understand how the software actually works?

Training data simply does not help you here. Our existing architectures are not explainable or auditable in any meaningful way, training data or no training data.

samj 2 days ago|||

That's why Open Source analyst RedMonk now "do not believe the term open source can or should be extended into the AI world." https://redmonk.com/sogrady/2024/10/22/from-open-source-to-a...

I don't necessarily agree and suggest the Open Source Definition could be extended to cover data in general (media, databases, and yes, models) with a single sentence, but the lowest risk option is to not touch something that has worked well for a quarter century.

The community is starting to regroup and discuss possible next steps over at https://discuss.opensourcedefinition.org

aithrowawaycomm 2 days ago||||

I don't think your comment is really true; LLM providers and researchers have been a bit too eager to claim their software is mystically complex. Anthropic's research is shedding light on interpretability, there has been good work done on the computational complexity side, and I am quite confident that the issue is LLMs' newness and complexity, not that the problem is actually intractable (or specifically "more intractable" than other hopelessly complex software like Facebook or Windows).

To the extent the problem is intractable, I think it mostly reflects that LLMs have an enormous amount of training data and do an enormous amount of things. But for a given specific problem the training data can tell you a lot:

- whether there is test contamination with respect to LLM benchmarks or other assessments of performance

- whether there's any CSAM, racist rants, or other things you don't want

- whether LLM weakness in a certain domain is due to an absence of data or if there's a more serious issue

- whether LLM strength in a domain is due to unusually large amounts of synthetic training data and hence might not generalize very reliably in production (this is distinct from test contamination - it is issues like "the LLM is great at multiplication until you get to 8 digits, and after 12 digits it's useless")

- investigating oddness like that LeetMagikarp (or whatever) glitch in ChatGPT
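
To make the first bullet concrete, here is a minimal sketch of a contamination check, assuming you actually have the training documents available. The function names, the whitespace tokenization, the n-gram length, and the toy data are all illustrative assumptions, not any lab's real decontamination pipeline:

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_report(benchmark_items, training_docs, n=5):
    """Return (rate, flagged): the fraction of benchmark items that share at
    least one n-gram with the training docs, and the list of those items."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


# Toy, made-up data purely for illustration.
training_docs = [
    "forum post: what is the capital of france ? the answer is paris , obviously",
    "an unrelated document about sewing patterns for jackets",
]
benchmark = [
    "What is the capital of France ? The answer is Paris",
    "How many legs does a spider have ? The answer is eight",
]

rate, flagged = contamination_report(benchmark, training_docs)
print(rate, flagged)
```

On this toy data only the first benchmark item is flagged. The point is just that a check like this is impossible to run at all without access to the training data.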

blackeyeblitzar 2 days ago|||

But training data can itself be examined for biases, and the curation of data also brings in bias. Auditing the software this way doesn’t require explainability in the way you’re talking about.

Legend2440 2 days ago||

Does "open-source" even make sense as a category for AI models? There isn't really a source code in the traditional sense.

atq2119 1 day ago||

There's code for training and inference that could be open-source. For the weights, I agree that open-source doesn't make sense as a category.

They're really a kind of database. Perhaps a better way to think about it is in terms of "commons". Consider how creative commons licenses are explicit about requirements like attribution, noncommercial, share-alike, etc.; that feels like a useful model for talking about weights.

mistrial9 2 days ago|||

I have heard government people talk about "the data is open-source", meaning it has public, no-cost download points for the data files, e.g. CSV or other formats.

Barrin92 2 days ago|||

I had the same thought. "Source Code" is a human readable and modifiable set of instructions that describe the execution of a program. There are obviously parts of an AI system that include literal code, usually a bunch of Python scripts or whatever, to interact with and build the thing, but most of it is on the one hand data, and on the other an artifact, the AI model, and neither is really source code.

If you want to talk about the openness and accessibility of these systems I'd just ditch the "source" part and create some new criteria for what makes an AI model open.

paulddraper 2 days ago||

Yeah, it's like an open-source jacket.

I don't really know what you're referring to....

echoangle 1 day ago||

An Open Source jacket actually makes more sense to me than an open source LLM. I generally understand hardware to be open source when all design files are available (for example CAD models of a case and KiCad files for a PCB). If the patterns of a jacket were available in an editable standard-format file, you could argue that’s an open source jacket.

lolinder 2 days ago||

> Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

This makes sense. What the OSI gets right here is that the artifact that is made open source is the weights. Making modifications to the weights is called fine tuning and does not require the original training data any more than making modifications to a piece of source code requires the brain of the original developer.

Tens of thousands of people have fine-tuned these models for as long as they've been around. Years ago I trained GPT-2 to produce text resembling Shakespeare. For that I needed Shakespeare, not GPT-2's training data.
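
As a toy illustration of that point (a sketch, not how GPT-2 fine-tuning is actually implemented): the snippet below takes already-released weights for a one-feature linear model and adapts them with plain gradient descent on new data only. The model, the `fine_tune` and `mse` helpers, and all the numbers are hypothetical stand-ins; real fine-tuning uses a deep learning framework and vastly larger models.

```python
def mse(w, b, data):
    """Mean squared error of y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, b, data, lr=0.05, epochs=500):
    """Adjust existing weights (w, b) on new data by gradient descent.
    Note that the data the weights were originally fit on is never used."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical "released weights"; whatever data produced them is unknown.
released_w, released_b = 2.0, 0.0

# New task data, generated from y = 3x + 1 (the "Shakespeare" of this toy).
new_data = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]

tuned_w, tuned_b = fine_tune(released_w, released_b, new_data)
print(mse(released_w, released_b, new_data), mse(tuned_w, tuned_b, new_data))
```

The loss on the new data drops sharply, and the only inputs were the released weights and the new data.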

The training data is properly part of the development process of the open source artifact, not part of the artifact itself. Some open source companies (GitLab) make their development process fully open. Most don't, but we don't call IntelliJ Community closed source on the grounds that they don't record their meetings and stream them for everyone to watch their planning process.

Edit: Downvotes are fine, but please at least deign to respond and engage. I realize that I'm expressing a controversial opinion here, but in all the times I've posted similar no one's yet given me a good reason why I'm wrong.

tourmalinetaco 2 days ago||

By your logic, many video games have been “open source” for decades because tools were accessible to modify the binary files in certain ways. We lacked the source code, but that’s just “part of the development process”, and maybe parts like comments were lost during the compiling process, but really, why isn’t it open source? Tens of thousands have modified the binaries as long as they’ve been around, and for that I needed community tools, not the source code.

In short, your argument doesn’t work because source code is to binaries as training data is to MLMs. Source code is the closest comparison we have with training data, and the useless OSI claims that’s only a “benefit” not a “requirement”. This isn’t a stance meant for long term growth but for maintaining a moat of training data for “AI” companies.

lolinder 2 days ago||

> By your logic, many video games have been “open source” for decades because tools were accessible to modify the binary files in certain ways. We lacked the source code, but that’s just “part of the development process”, and maybe parts like comments were lost during the compiling process, but really, why isn’t it open source?

Because the binaries were not licensed under a FOSS license?

Also, as I note in another comment [0], source code is the preferred form of a piece of software for making modifications to it. The same cannot be said about the training data, because getting from that to weights costs hundreds of millions of dollars in compute. Even the original companies prefer to fine-tune their existing foundation models for as long as possible, rather than starting over from training data alone.

> In short, your argument doesn’t work because source code is to binaries as training data is to MLMs.

I disagree. Training data does not allow me to recreate an LLM. It might allow Jeff Bezos to recreate an LLM, but not me. But weights allow me to modify it, embed it, and fine tune it.

The weights are all that really matters for practical modification in the real world, because in the real world people don't want to spend hundreds of millions to "recompile" Llama when someone already did that, any more than people want to rewrite the Linux kernel from scratch based on whiteboard sketches and mailing list discussions.

[0] https://news.ycombinator.com/item?id=41951945

koolala 2 days ago||

"sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system".

So a URL to the data? To download the data? Or what? Someone says "Just scrape the data from the web yourself." And a skilled person doesn't need a URL to the source data? No source? Source: The entire WWW?

mensetmanusman 2 days ago||

The 1000 lines of code is open source, the $100,000,000 in electricity costs to train is not.

JoshTriplett 1 day ago||

In the early days of Open Source, many people didn't have access to a computer, and those who had access to a computer often didn't have access to development tools. The aspirations of early Open Source became more and more feasible as more people had access to technology, but the definitions still targeted developers.

echelon 1 day ago||

Training costs will come down. We already have hacks for switching mathematical operators and precision. We originally had to program on room-sized computers, yet now we all have access.

"Open source" should include the training code and the data. Anything you need to train from scratch or fine tune. Otherwise it's just a binary artifact.

pabs3 1 day ago||

I prefer the Debian policy about this:

https://salsa.debian.org/deeplearning-team/ml-policy

rdsubhas 1 day ago||

There are already hundreds of OSI licenses for source code.

Just create a couple more for AI, one with training data, one without.

Holy grail thinking, finding "the one and only open" license instead of "an open" license, is in a sense anti-open.

metalman 1 day ago||

Call it what it is: a search engine, feeding back extracts from real human interaction, using targeted advertising data to refine the responses.

And since what humans say is more horrible than good, the whole thing is a garbage mine.

Go talk to the crews who have been maintaining the Concise Oxford for the last number of centuries, or the French government and the department in charge of regulating the French language, remembering that the French all but worship their language.

There you will find, perhaps, insight, or terror, at the idea of creating a standard, consistent, concise, and usable LLM.

a-dub 2 days ago||

the term "open source" means that all of the materials that were used to create a distribution are available to inspect and modify.

anything else is closed source. it's as simple as that.

chrisfosterelli 2 days ago|

I imagine that Open AI (the company) must really not like this.

talldayo 2 days ago|

I hate OpenAI but Sam Altman is probably giddy with excitement watching the Open Source pundits fight about weights being "good enough". He's suffered the criticism over his brand for years but they own the trademark and openly have no fucks to give about the matter. Founding OpenAI more than 5 years before Open AI was defined is probably another perverse laurel he wears.

At the end of the day, what threatens OpenAI is falling apart before they hit the runway. They can't lose the Microsoft deal, they can't lose more founders (almost literally at this point) and they can't afford to let their big-ticket partnerships collapse. They are financially unstable even by Valley standards - one year in a down market could decimate them.