Posted by pabs3 10/27/2024

Codeberg Reconsidering OSI License Approval in Terms of Use (codeberg.org)
52 points | 15 comments
rettichschnidi 10/27/2024|
EU AI Act (https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:...) as of today (CTRL + F "open-source"):

> (89) Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, in particular towards the provider that has used or integrated them, when those tools, services, processes, or AI components are made accessible under a free and open-source licence. ...

> Article 2, 12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under Article 5 or 50.

Let's see if the EU AI Act will be adjusted in the same spirit as discussed in the linked discussion.

weebull 10/27/2024|
> > ..., should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, ...

What does that mean?

TheNewsIsHere 10/27/2024||
It reads to me that you can’t pass on upstream AI vendor/platform/service requirements downstream to users/customers/end parties.

Which would be separate from any legislated requirements or limitations.

pabs3 10/27/2024||
I like Debian's policy for libre AI:

https://salsa.debian.org/deeplearning-team/ml-policy/

rettichschnidi 10/27/2024|
«ToxicCandy Model» is a great term!
RobotToaster 10/27/2024||
Given that most AI is trained on data scraped from the internet (most of which isn't open source), isn't it basically impossible to release an entire training dataset under an open source licence?
regularfry 10/27/2024||
That would, I suspect, be the point. If your AI is trained on non-free content, the implication is that it would be impossible for it to be released with an open source licence. So don't do that, the argument goes: only use content that has been released with a sufficiently free licence that republishing it in your dataset is not a problem. And as a side effect, you have to show that there isn't any "misappropriated" content in your training set. That side effect is what gets some people excited here.

I don't agree with that position legally, but I do mechanically. The point of the GPL family (to pick one random type of licence) is that the end user should have the capability to modify the product to their own ends, and I don't think fine-tuning provides enough capability to qualify.

pabs3 10/27/2024|||
It has been done before; e.g., the original RNNoise was trained on proprietary data, and later there was a crowd-sourced effort to record new data and release it under libre licenses.

https://github.com/xiph/rnnoise/

guerrilla 10/27/2024||
They could release as much as is necessary to recreate it: the crawlers or list of links they used, and the configuration or scripts used to drive the training. Nobody is asking for the entire web in their git repo, only the ability to retrain from scratch, possibly with modifications.
tourmalinetaco 10/27/2024|||
This is incredibly unrealistic. Imagine hundreds to thousands of individuals scraping thousands of websites, DDoSing them because their mid-tier “open source” LLM project gave them a list of links to “recreate” their dataset. It’s far more sustainable to create a dataset filled with GPL, public domain, and other permissively licensed data rather than nuke half the Internet’s bandwidth. And that’s ignoring the fact that scraping it yourself does not actually grant you a permissive license; that’s like saying that by watching a movie in theaters you have a legal right to sell that movie.
echoangle 10/27/2024|||
Not really, because there’s no guarantee that it will be available in the future. A script to download the data doesn’t mean I can reliably recreate the data in 5 years; I wouldn’t call that open source. To me, the data itself needs to be published.
guerrilla 10/27/2024||
Oh well, they did their best. You can't expect them to do better than what is possible. Enough Nirvana fallacy here.
echoangle 10/27/2024||
I’m not expecting them to do the impossible, but they shouldn’t call it open source then. Either you provide all the data and call it open source, or you don’t provide the data because it is proprietary and don’t call the model open source.
guerrilla 10/27/2024||
Well, I'm glad you exist to push the Overton window even further anyway. A lot of people are trying to claim that what's being pushed now (opaque data) is open source. I'll be satisfied if I can at least approximate the training with whatever is online at the time I do it.
guerrilla 10/27/2024||
It's heartening to see people take this seriously. Let's hope many more stand up for the basic ontology and spirit of free software.