Posted by rettichschnidi 2 days ago
I quite like the comment that was left on the article. I know that with some models you can tweak the weights without the source data, but it does seem like you are more restricted without the actual dataset.
Personally, the data seems to be part of the source in this case. The model is derived from the data itself; the weights are the artifact of training. If anything, they should provide the data, the training methodology, the model architecture, and the code to train and infer, and the weights could be optional. The weights are basically equivalent to a built artifact, like compiled software.
And that means commercially, people would pay for the cost of training. I might not have the resources to "compile" it myself, aka, run the training, so maybe I pay a subscription to a service that did.
When we're dealing with source code, the cost of getting from source -> binary is minimal. The entire Linux kernel builds in two hours on one modest machine. Since it's cheap to compile and the source code is itself legible, the source code is the preferred form for making modifications.
This doesn't work when we try to apply the same reasoning to `training data -> weights`. "Compilation" in this world costs hundreds of millions of dollars per run. The cost of "compilation" alone means that the preferred form for making modifications can't possibly be the training data, even for the company that built the thing in the first place. As for the data itself, it's a far cry from source code—we're talking tens of terabytes of data at a minimum, which is likewise infeasible to work with on a regular basis. The weights must be the preferred form for making modifications for simple logistical reasons.
Importantly, the weights are the preferred form for modifications even for the companies that built them.
I think a far more reasonable analogy, to the extent that any are reasonable, is that the training data is all the stuff that the developers of the FOSS software ever learned, and the thousands of computer-hours spent on training are the thousands of man-hours spent coding. The entire point of FOSS is for a few experts to do all that work once and then we all can share and modify the output of those years of work and millions of dollars invested as we see fit, without having to waste all that time and money doing it over again.
We don't expect the authors of the Linux kernel to document their every waking thought so we could recreate the process they used to produce the kernel code... we just thank them for the kernel code and contribute to it as best we can.
No, that's not why weights are object code. Binary vs. text is irrelevant.
Weights are object code because training data is declarative source code defining the desired behavior of the system and training code is a compiler which takes that source code and produces a system with the desired behavior.
Now, the behavior produced is less exactly known from the source code than is the case with traditional programming, but the function is the same.
You could have a system where the training and inference code were open source and the model specified by the weights itself was not — that would be like having a system where the software was not open source, but the compiler used to build it and the runtime library it relies on were. But one shouldn't confuse that with an open source model.
What do you do with the fact that no one (including the companies who do the initial training) modifies the training data when they want to modify the work? Are the weights not the preferred form for modifying a model?
People do take and modify training datasets for new models. It's less common for modifications to foundation models where you aren't also changing the architecture, because it's not necessary for efficient additive changes (the most common kind of change), and because training datasets for foundation models are rarely shared. It is commonly done by first parties when the change involves changing the architecture as well: you can't make an additive change to the existing trained model and need to train from scratch, but you also want to address issues with the training data — expanding the scope, improving quality, etc. — without starting over on the data itself. Meanwhile, there is research on fine-tuning for subtractive changes (removing concepts from a trained model) because, at least for third parties, while fine-tuning is available, altering the training data and retraining a foundation model from scratch usually isn't an option.
Certainly, people doing derivatives of non-foundation models (LoRA, finetunes, etc.) often reuse and modify training sets used by earlier non-foundation models of the same type, and model sharing sites with an open-source preference facilitate dataset sharing to support this.
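To make the "additive change" path concrete, here's a minimal sketch of a LoRA-style fine-tune using the `peft` library (the model name is only an example; you'd swap in whatever open-weights model and dataset you have):

```python
# Hypothetical sketch: attach small LoRA adapter matrices to released
# weights and fine-tune only those -- no access to the original training
# data is required for this kind of additive change.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)

# Only the adapter parameters are trainable; the base weights stay frozen.
model.print_trainable_parameters()
# ...then train `model` on your own dataset with a normal training loop.
```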
Now scale that up and consider: at what point would such a project start being "FOSS" in your book without actually providing its sources under an appropriate license?
The intention behind "preferred form for modification" is to put you as a user on the same level as the copyright holder. This construct works well in a world where compiling is cheap; where it isn't, it may require some refinement to preserve the intention behind it. The copyright holder could decide to modify the learning set before pressing the "start" button, you can't.
At the moment when the copyright holder stops ever recompiling the code from scratch and starts just patching binaries.
We are at that point with LLMs.
> The intention behind "preferred form for modification" is to put you as a user on the same level as the copyright holder.
Exactly. And the copyright holders for these LLMs do not ever "recompile". They create brand new works that aren't derivatives at all, but when it comes to modifying the existing work they invariably fine-tune it rather than retraining it.
So when the copyright holder considers the work done and stops changing it at all, it's now FOSS too?
I'll repeat myself, as you ignored the important part:
> The copyright holder could decide to modify the learning set before pressing the "start" button, you can't.
Even if the copyright holder does not intend to retrain their model, you are not in the same position as them. The choices they made at initial training are now ingrained in the model, putting them at an advantage over anyone else if you can't inspect and change those choices. They got to decide; you did not. Your only option to be in a similar position is to start from scratch.
If you wanted to write a project in Rust you would have needed to be there at the beginning, too. Same if you wanted to make it a web app versus native. There are dozens and dozens of decisions that can only be made at the beginning of a project and will require completely reworking it if you're receiving it later.
If a project needed to put all future users on equal footing with where the copyright holder was at the beginning of the project in order to be open source, there can be no open source. The creator of the project invariably made decisions that cannot be undone later without redoing all the work.
It's written on the OSI page about license approval:
""" The license does not have terms that structurally put the licensor in a more favored position than any licensee. """
Equal footing is the whole idea behind Free Software. Open Source places its emphasis elsewhere, but in practice it's essentially about the same thing.
> If you wanted to write a project in Rust you would have needed to be there at the beginning, too.
I can take the source code, inspect it and rewrite it line-by-line.
I can't take a closed model, inspect the data it was trained on and retrain it, even if I had enough money and resources to do so.
Whether the copyright holder will ever want to retrain it themselves is irrelevant. They have the information needed to do so if they wanted, I don't.
Trying to draw an equivalency between code and weights is [edited for temperament, I guess] not right. They are built from the source material supplied to an algorithm. Weights are data, not code.
Otherwise, everyone on the internet would be an author, and would have a say in the licensing of the weights.
By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense. Analogies will always fail, which is why "preferred form for making modifications" is the yardstick we use, not vague attempts at drawing analogies between completely different development processes.
> They are built from the source material supplied to an algorithm. Weights are data, not code.
As Lispers know well, code is data and data is code. You can't draw a line in the sand and definitively say that on this side of the line is just code and on that side is just data.
In terms of how they behave, weights function as code that is executed by an interpreter that we call an inference engine.
I'm not comfortable with calling the resulting weights "open source", since people can't look at a set of weights and understand all of the components in the same way as actual source code. It's more like "freeware". You might be able to disassemble it with some work, but otherwise it's an incomprehensible thing you can run and have for free. I think it would be more appropriate to co-opt the term "open source" for weights generated from freely available material, because then there is no confusion whether the "source" is open.
And this is what I think everyone is actually dancing around: I suspect the insistence on publishing the training data has very little to do with a sense of purity around the definition of Open Source and everything to do with frustrations about copyright and intellectual property.
For that same reason, we won't see open source models by this definition any time soon: the legal questions around data usage are profoundly unsettled, and no company can afford to publicize the complete set of data that they trained on until those questions are settled.
My personal ethic says that intellectual property is a cancer that sacrifices knowledge and curiosity on the altar of profit, so I'm not overly concerned about forcing companies to reveal where they got the data. If they're releasing the resulting weights under a free license (which, notably, Llama isn't) then that's good enough for me.
It's totally fine if we don't have many (or any) models meeting the definition of open source! How hard is it to use a different term that actually applies?
The people on my side of the argument seem to be saying: "do not misapply these words", not "do not give away your weights".
Insisting on calling a model with undisclosed sources "open source" has what benefit? Marketing? That's really all I can think of... that it's to satisfy the goals of propagandists.
Such obligations are designed to mitigate the inherent risks that AI can pose to individuals and society.
The AI Act exempts open source from such scientific scrutiny because it's already transparent.
BUT if OSI defines black boxes as "open source", they open a loophole that will be exploited to harm people without being held accountable.
So it's not just marketing, but dangerous corporate capture.
The OSI’s definition matches the legal definition in the EU and California (and common use). If the OSI says open data only, it will just be ignored. (If people are upset about the current use, they can make the free vs. open distinction we do in software to keep the pedantic definition contained.)
The whole reason FOSS exists is because of frustrations about copyright and intellectual property, anything else is derived from that, so I'm not sure what your point is.
> By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense.
To me the weights map to assembly, and the training data + training code map to source code + compiler. Sure, you can hand me assembly, and with the assembly I may be able to execute the model/program, but having it does not mean that I can stare at it and learn from it, nor modify it with a reasonable understanding of what's going to change.
I've got to add that the situation feels even worse than assembly, because assembly, whether hand-coded or mutilated by an optimizing compiler, still does something very specific and deterministic. The weights of a model make things equivalent to programming without booleans: just seemingly random numbers, with checks for inequalities to get a binary decision.
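To make that concrete, here's a toy sketch of what I mean (the weights are random stand-ins, which is rather the point):

```python
import torch

# The "boolean" in a neural net is a threshold over an opaque real-valued
# score: there is no named, legible condition to read, only arithmetic.
w, b = torch.randn(512), torch.randn(())

def decide(x):              # x: some 512-dim input vector
    score = w @ x + b       # seemingly random numbers, combined
    return bool(score > 0)  # the inequality is the only "if"
```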
In contrast, the weights are the preferred form for modification, even for the company that built it. They only very rarely start a brand new training run from scratch, and when they do so it's not to modify the existing work, it's to make a brand new work that builds on what they learned from the previous model.
If the company makes the form of the work that they themselves use as the primary artifact freely available, I'm not sure why we wouldn't call the work open.
Preferred is obviously not a particularly strong line.
If someone ships object code for a bunch of stable modules, and only provides the source for code that's expected to be changed, is that really open?
“Preferred” gets messy quick. Not sure how that can be defined in any consistent way. Models are going to get more & more complex. Training with competitive models, etc.
I think you either have it all, or it isn't really open. Or only some demarcated subset is.
I don't think your argument holds any water.
Now that I've beat my head against this issue for a while, I think it's best summed up as: weights are a binary artifact, not source of any kind.
No one trains an existing model from scratch, even those who have access to all of the data to do so. There's just no compelling reason to retrain a model to make a change when you have the weights already—fine tuning is preferred by everyone.
The only people I've seen who've asserted otherwise are random commenters on the internet who don't really understand the tech.
> ...fine tuning is preferred by everyone
How do you know this? Did you take a survey? When? What if preferences change or there is no consensus?
> The only people I've seen who've asserted otherwise are random commenters on the internet who don't really understand the tech.
There are lots of things that can be done with the training set that don't involve retraining the entire model from scratch. As a random example, I could perform a statistical analysis over a portion of the training set and find a series of vectors in token-space that could be used to steer the model. Something like this can be done without access to the training data, but does it work better? We don't know because it hasn't been tried yet.
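For a concrete flavor of the kind of analysis I mean, here's a hypothetical sketch — note it contrasts mean hidden activations rather than the token-space statistics I described, and the dataset slices are stand-ins:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def mean_hidden(texts, layer=6):
    # Average the layer-`layer` hidden states over all tokens of all texts.
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Two contrasting slices standing in for portions of a training set.
formal = ["Dear Sir or Madam, I write to inquire about your services."]
casual = ["hey what's up, just checking in"]

# The difference of means is a crude steering direction; applying it during
# generation would mean adding a scaled copy of it to that layer's output
# via a forward hook (omitted for brevity).
steer = mean_hidden(formal) - mean_hidden(casual)
```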
But none of that really matters, because what we're discussing is the philosophy of open source. I think it's a really bad take to say that something is open source because it's in a "preferred" format.
Preferred form and under a free license. Llama isn't open source, but that's because the license has restrictions.
As for if it's a bad take that the preferred form matters—take it up with the GPL, I'm just using their definition:
> The “source code” for a work means the preferred form of the work for making modifications to it.
Despite the fact that people keep insisting on the buzzword "AI" to describe these large neural networks, they are more succinctly defined as approximate computer programs. The means by which we create them is a relatively standardized family of statistical modeling algorithms paired with a dataset they are meant to emulate in their output.
A computer program that's specified in logic is already a usable representation that can be used to understand every aspect of the functioning code in its entirety, albeit some of it may be hard to understand. You don't need to consult the original programmer at all, let alone read their mind.
In contrast, a function that is approximated in the manner described needs the training data to replicate or make sense of it; the data is in fact even necessary to assess whether the model is cheating at the benchmarks its creators assess it against. The weights themselves are a functional approximation, not a functional description.
For the purposes of the ethos of free and open source software, it is obvious that training data must be included. However, this argument is also deployed in various other places, like intellectual property disputes, and is equally stupid there. Just because we use the term "learning" to describe these systems doesn't mean it makes sense for the law to treat them as people. It is both nonsensical and harmful to say that no human can be held responsible for what an "AI" model does, but that somehow they are "just like people learning from experience" when it benefits tech companies to believe that.
You're applying reproducibility unevenly, though.
The Linux kernel source code cannot feasibly be reproduced, but it can be copied and modified. The Mistral weights cannot feasibly be reproduced, but they can be copied and modified. Why is the kernel code open source while the Mistral weights are not?
Reproducibility is clearly not the deciding factor.
Now I get that “Identical” is a bit more nebulous when it comes to LLMs due to their inherent nondeterminism, but let’s take it to mean the executable itself, not the results produced by the executable.
No, I'm using the strict definition "capable of being reproduced", where reproduce means "to cause to exist again or anew". In and of itself the word doesn't comment on whether you're starting from source code or anything else, it just says that something must be able to be created again.
Yes, in the context of compilation this tends to refer to reproducible builds (which is a whole rabbit hole of its own), but here we're not dealing with two instances of compilation, so it's not appropriate to use the specialized meaning. We're dealing with two artifacts (a set of C files and a set of weights) that were each produced in different ways, and we're asking whether each one can be reproduced exclusively from data that was released alongside the artifact. The answer is that no, neither the source files or the weights can be reproduced given data that was released alongside them.
So my question remains: on what basis can we say that the weights are not open source but the C files are? Neither can be reproduced from data released alongside them, and both are the preferred form which the original authors would choose to make modifications to. What distinguishes them?
I think it goes to show how hard it is to make analogies between the two fields.
Maybe it is just not source at all. Open or closed.
It is data. Like a csv of addresses and coordinates that were collated from different sources that say are no longer available.
It is a very philosophical topic. What if machines got faster and you could train Llama in 5 minutes, and an SSD could hold all the training data? Then it would feel more like a compiled artifact than data. Not releasing the training data would then feel like hiding something.
Now I know it seems like I'm taking the opposite side of my original take here, but come on - you can't genuinely believe that my being unable to immediately produce a byte-for-byte copy of the Linux kernel, even if my build behaves 99.999% the same, is even remotely the same as not being able to reproduce an "open" LLM.
It's open source because they licensed the preferred form of the work for making modifications under a FOSS license. That's it. Reproducibility of that preferred form from scratch doesn't factor into it.
Really the conversation should be reframed to be something along the lines of “is it even ethical for these companies to offer their LLM as anything other than open source”? The answer, if you look into what they do, is “probably not”. Arguing about the technicalities of whether they follow the letter of whatever rule or regulation is probably a waste of time. Because it is completely obvious to anyone who understands how this works that these models are built and sold off the backs of centuries of open source work not licensed or intended to be used for profit.
Agreed, but I'm personally of the opinion that this is true for all intellectual endeavors. Intellectual property is the great sin of our generation, and I hope we eventually learn better.
And I think you've hit at the heart of the matter: the push for open source training data has never been about the definition of open source, it's always been a cover for complaints about where the data was sourced from. Which is also why we won't see it any time soon—not until the lawsuits wind their way through the courts, and even then only if the results are favorable towards training.
source code -> compile -> kernel binary. That binary is what can be reproduced, given the source code.
We don't have the equivalent for Mistral:
source code (+ training data) -> training -> weights
Training is too expensive for the training data to be the preferred form for making modifications to the work. Given that, the weights themselves are the closest thing these things have to "source code".
And this is where the reproducibility argument falls apart: on what basis can we insist that the preferred form for modifying an LLM (the weights) must be reproducible to be open source but the preferred form for modifying a piece of regular software (the code) can be open sourced as is, with none of the processes used to produce the code?
In order for the weights to take all the training data and embed it in the model, by definition, some data must be lost. That data can't be recovered, no matter how much you fine tune the model. Because we can't, we don't know how alignment gets set, or the extent of it.
The closest thing these things have to source code is the source code and training data used to create the model. Because that's what's used to create the model. How big a system is necessary to train it doesn't factor in. It used to take many days to compile the Linux kernel, and many people at the time didn't have access to systems that could even compile it.
First, licenses matter. Photoshop.exe is closed source first and foremost because the license says so.
Second and more importantly for this discussion, Adobe doesn't prefer to work with hexedit, they prefer to work with the source code.
OpenAI prefers to fine tune their existing models rather than train new ones. They fine tune regularly, and have only trained from scratch four times total, with each of those being a completely new model, not a modification.
That means the weights of an LLM are the preferred form for modification, which meets the GPL's definition of 'source code':
> The “source code” for a work means the preferred form of the work for making modifications to it.
I think this is a decent point. Is your FOSS project actually open source if your 3D assets were made in Fusion or Adobe?
Similarly, how open is a hardware project if you post only finalized STLs? What about with and without Fusion source files?
You can still replicate the project. You can even do relatively minor edits to the STL. Is that open or not?
The entire point of FOSS is to preserve user freedom. Avoiding pointless waste of repeated work is a side effect of applying that freedom.
It would feel entirely on point for things that require ungodly amounts of money and resources to even start considering exercising your freedoms on to not be considered FOSS, even if that aspect isn't considered by currently accepted definitions.
I realize I'm the one who used the combo acronym first, but this is a discussion about the OSI, which exists to champion the cynical company-centric version of the movement, and for that version my description is accurate.
I suppose I will have to stop writing 'F/OSS'. I'll probably use the term 'open-source' less and less, and maybe stop altogether.
Really? Hmm yeah maybe you’re right, but for some reason, said that way it somehow starts to seem a little disappointing and depressing. Maybe I’m reading it differently than you intended. I always considered the point of FOSS to be about the freedoms to use, study, customize, and share software, like to become an expert, not to avoid becoming an expert. But if the ultimate goal of all that is just a big global application of DRY so that most people rely on the corpus without having to learn as much, I feel like that is in a way antithetical to open source, and it could have a big downside or might end up being a net negative, but I dunno…
I think it is better to compare with something really big and fast-evolving, e.g. Chromium. It will take a day to compile it. (~80,000 seconds vs. ~8 seconds for a convenient/old Pascal program.)
The same cannot be said for LLM weights, as evidenced by the fact that even the enormous megacorps that put these things out tend to follow up by fine tuning the weights (using different training data) rather than retraining from scratch.
Thus it is either too early to define "open" for AI. Or "open" must be truly open. Though it remains not practically achievable at home or even at small companies.
And by today's standards, a PDP-11 is quite comparable with the cost of the server farms used in training.
And yet Emacs was released under GPL.
So the economic argument is pretty myopic.
The same is not true of these models. To my knowledge no company has retrained a model from scratch to make a modification to it. They make new models, but these are fundamentally different works with different parameter counts and architectures. When they want to improve on a model that they already built, they fine tune the weights.
If that's what companies that own all the IP do, that tells me that the weights themselves are the preferred form for making modifications, which makes them source code under the GPL's definition.
So much they forget the basics of the discipline.
What do you think cross validation is for?
To compare different weights obtained from different initializations, different topologies, different hyper-parameters... all trained from the same training dataset.
Even for LLM, have you ever tried to reduce the size of the vocabulary of, say, Llama?
No?
Yet it's a totally reasonable modification.
What's the preferred form to make modifications like this?
Can you do it fine tuning llama weights?
No.
You need training data.
That's why training data is the preferred form for making modifications: whatever the AI (hyped or not), it's the only form that lets you make all the modifications you want.
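To make the vocabulary example concrete, here's a rough sketch of why the weights alone aren't enough (toy sizes; the names are illustrative, not a real API):

```python
import torch
import torch.nn as nn

vocab, hidden = 32_000, 512             # Llama-like vocab, toy hidden size
embed = nn.Embedding(vocab, hidden)     # input embeddings, one row per token
lm_head = nn.Linear(hidden, vocab, bias=False)  # output projection

# The easy, weights-only part: keep a subset of the token rows.
keep_ids = torch.arange(16_000)         # pretend we chose half the vocab
small_embed = nn.Embedding(len(keep_ids), hidden)
small_embed.weight.data = embed.weight.data[keep_ids]
small_head = nn.Linear(hidden, len(keep_ids), bias=False)
small_head.weight.data = lm_head.weight.data[keep_ids]

# The hard part: every string now tokenizes differently (dropped tokens must
# be respelled from the survivors), so the model needs retraining on a corpus
# tokenized with the new vocabulary. That corpus is the training data.
```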
There's a much simpler analogy: a photo
You can't have an "Open source photo" because that would require shipping everything (but the camera) that shows up in the photo so that someone could "recreate" the photo
It doesn't make sense.
A public domain photo is enough
You could have a programming language whose compiler is a superoptimizer that's very slow and is also stochastic, and it would amount to the same thing in practice.
The community is starting to regroup at https://discuss.opensourcedefinition.org because the OSI's own forums are now heavily censored.
I encourage you to join the discussion about the future of Open Source, the first option being to keep everything as is.
Didn't personally know they even had one. ;)
The legal system in the US doesn't provide them any other options but to act.
Now we're seeing that maybe putting all that trust and responsibility in one entity wasn't such a great idea.
Plus we still have FSF's definition and DFSG.
Well, another org is getting directors' salaries while open source writers get nothing.
And going by the quotations in TFA, it seems the FSF's thinking about this is clear and nuanced, as usual:
> [T]he FSF makes a distinction between non-free and unethical in this case:
> > It may be that some nonfree ML have valid moral reasons for not releasing training data, such as personal medical data. In that case, we would describe the application as a whole as nonfree. But using it could be ethically excusable if it helps you do a specialized job that is vital for society, such as diagnosing disease or injury.
If they end up needing new terminology to describe this case, I'm sure they will devise some -- and it will be more explicit than a moniker like 'shared source'.
I wonder who has legal liability for the closed-data generated weights and some of the rubbish they spew out, since users will be unable to change the source-data inputs and will only be able to tweak these compiled-model outputs.
Is such tweaking analogous to having a car resprayed, with the manufacturer then washing their hands of any liability over design safety?
> We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer.
(I didn't read the paper. The ordinary version of this point is already compelling imo, given the current state of the art of reverse-engineering large models.)
> I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.
My impression is that LLMs are very much the latter-case, with respect to unwanted behaviors. You can't audit them, you can't secure them against malicious inputs, and whatever limited steering we have over the LSD-trip-generator involves a lot of arbitrary trial and error and hoping our luck holds.
On the other hand, if the data isn't open, you should probably use the term "open weights", not "open source". They're so close.
We risk giving AI the same opportunity to grow in an open direction, and by our own hand. Massive own goal.
I thought it was thanks to a lot of software developers’ uncompensated labor. Silly me.
Tangential, but I wonder how well an AI performs when trained on genuine human data, versus a synthetic data set of AI-generated texts.
If performance when trained on the synthetic data set is close to that when trained on the original human dataset – this could be a good way to "launder" the original training data and reduce any potential legal issues with it.
It makes sense, as any bias in the model-generated synthetic data will just get magnified as models are continuously trained on that biased data.
the word "people" is so striking here... teams and companies, corporations and governments.. how can the cast of characters be so completely missed. An extreme opposite to a far previous era where one person could only be their group member. Vocabulary has to evolve in deliberations.
You have some sort of engine that runs the model. That's like the JVM, and the JIT.
And you have the program that takes the training data and trains the model. That's your compiler, your javac, your Makefile and your make.
And you have the training data itself, that's your source code.
Each of the above pieces has its own source code. And the training set is also source code.
All those pieces have to be open to have a fully open system.
If only the training data is open, that's like having the source, but the compiler is proprietary.
If everything but the training set is open, well, that's like giving me gcc and calling it Microsoft Word.
If I can't reproduce the model, I'm beholden to whoever trained it.
>"If you're explaining, you're losing."
That is an interesting point, but isn't this the same organization that makes "open source" vs. "source available" a topic? e.g. why Winamp wouldn't be open source?
I don't think you can even call a trained AI model "source available." To me the "source" is the training data. The model is as much of a binary as machine code. It doesn't even feel right to have it GPL licensed like code. I think it should get the same license you would give to a fractal art released to the public, e.g. CC.
I think dictionaries are copyrightable, however?
Heck, a regular old binary is much less opaque than “open” weights. You can at least run it through a disassembler and slowly, dreadfully, figure out how it works. Just look at the game emulator community.
For open weight AI models, is there anything close to that?
I wonder how could anyone be an open source enthusiast, distrusting source code they can't verify, and yet a LLM enthusiast, trusting a huge configuration file that can't be debugged.
Granted, I don't have a lot of knowledge about LLMs. From what I know, there are some tools that can tell you the confidence/stickiness of certain parts of the generated output, e.g. "for a prompt like this, this word WILL appear almost every time, while this other word will almost never appear." I think there was something similar for image generation that could tell what areas of an image stemmed from what terms in the prompt. I have no idea how this information is derived, but it doesn't feel like there are many end-user tools for this. Maybe the AI researchers have access to more powerful tooling.
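For what it's worth, the per-token "confidence" part is straightforward to sketch with an open-weights model (a minimal illustration using GPT-2 via the transformers library):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The capital of France is Paris"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Probability the model assigned to each token, given the ones before it.
probs = torch.softmax(logits[0, :-1], dim=-1)
for pos, token_id in enumerate(ids[0, 1:]):
    p = probs[pos, token_id].item()
    print(f"{tok.decode(token_id.item()):>12}  p={p:.3f}")
```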
For source code I can just open a file in notepad.exe to inspect it. I think that's the standard.
If, for example, a computer program was written in an esoteric language that used image files instead of text files as source code, I don't think you could consider that program "open source" unless the image format it used was also open, e.g. PNG. If it was some proprietary format, people couldn't create tools for it, so they couldn't actually do anything with the image blob, which restricts their freedoms.
Huh, then this will be a useful definition.
The FSF position is untenable. Sure, it’s philosophically pure. But given a choice between a practical definition and a pedantically-correct but useless one, people will use the former. Irrespective of what some organisation claims.
> would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate
Not how language works.
Natural languages are parsimonious; they reuse related words. In this case, the closest practical analogy to open-source software has the lower barrier to entry. Hence, it will win.
There is no place for defining open source as requiring available data. In software, too, this problem is solved by using "free software" for the extreme definition. The practical competition is between the Facebook "model available with restrictions" definition and this one.
previously on: https://news.ycombinator.com/item?id=41791426
It's really interesting to contrast this "outsider" definition of open AI with people with real money at stake: https://news.ycombinator.com/item?id=41046773
I guess this is a question of what we want out of "open source". Companies want to make money. Their asset is data, access to customers, hardware and integration. They want to "open source" models, so that other people improve their models for free, and then they can take them back, and sell them, or build something profitable using them.
The idea is that, like with other software, eventually, the open source version becomes the best, or just as good as the commercial ones, and companies that build on top no longer have to pay for those, and can use the open source ones.
But if what you want out of "open source" is open knowledge, peeking at how something is built, and being able to take that and fork it for your own. Well, you kind of need the data. And your goal in this case is more freedom, using things that you have full access to inspect, alter, repair, modify, etc.
To me, both are valid, we just need a name for one and a name for the other, and then we can clearly filter for what we are looking for.
I don’t need a board to tell me what’s open.
And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.
I don’t need people to tell me that.
The "OSI approved" stamp has turned into a Ministry-of-Magic-style approved-thinking situation that feels gross to me.
We'll end up with like 5 versions of the same "open source" model, all performing differently because they're all built with their own dataset. And yet, none of those will be considered a fork lol?
I don't know what the obsession is either. If you don't want to give others permission to use and modify everything that was used to build the program, why are you trying to trick me into thinking you are, and still calling it open source?
Because there is an exemption clause in the EU AI Act for free and open source AI.
Making training exactly reproducible locks off a lot of optimizations; you are practically not going to get bit-for-bit reproducibility for nontrivial models.
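Even just within PyTorch, opting into determinism means switching off fast paths. A sketch of the usual knobs (and none of this helps across different GPUs, drivers, or library versions):

```python
import os, random
import numpy as np
import torch

# Must be set before CUDA initializes; required by some cuBLAS ops.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

random.seed(0); np.random.seed(0); torch.manual_seed(0)

torch.use_deterministic_algorithms(True)       # error on nondeterministic ops
torch.backends.cudnn.benchmark = False         # no autotuning for fast kernels
torch.backends.cuda.matmul.allow_tf32 = False  # full-precision matmuls only
```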
Similarly, if you run the scripts and they produce the model, then it's Open Source that happens to be AI.
To quote Bruce Perens (definition author): the training data IS the source code. Not a perfect analogy but better than a recipe calling for unicorn horns (e.g., FB/IG social graphs) and other toxic candy (e.g., NYT articles that will get users sued).
This is the new cracker/hacker, GIF-pronunciation, crypto(currency)/crypto(graphy) molehill. Like, sure, nobody forces you to recognise any word. But the common usage already precludes open training data—and that will only get more ensconced as more contracts and jurisdictions embrace it.
In marketing terms, a simple market communication, consistently and diligently applied in varied contexts and over time, can and usually will take hold, despite the untold number of individuals who shake their fists at the sky or cut with clever and cruel words that few hear, IMHO.
OSI branding and market communications seem very likely to me to be effective in the future, even if the content is exactly what is being objected to here so vehemently.