Is legal the same as legitimate: AI reimplementation and the erosion of copyleft

Posted by dahlia 5 hours ago

Is legal the same as legitimate: AI reimplementation and the erosion of copyleft (writings.hongminhee.org)

134 points | 131 comments | page 2

wccrawford 3 hours ago|

"Antirez closes his careful legal analysis as though it settles the matter. Ronacher acknowledges that “there is an obvious moral question here, but that isn't necessarily what I'm interested in.” Both pieces treat legal permissibility as a proxy for social legitimacy. "

This whole article is just complaining that other people didn't have the discussion he wanted.

Ronacher even acknowledged that it's a different discussion, and not one they were trying to have at the moment.

If you want to have it, have it. Don't blast others for not having it for you.

wizzwizz4 3 hours ago|

Having this discussion involves blasting others for not considering it. Consider the rest of the paragraph you quoted:

> But law only says what conduct it will not prevent—it does not certify that conduct as right. Aggressive tax minimization that never crosses into illegality may still be widely regarded as antisocial. A pharmaceutical company that legally acquires a patent on a long-generic drug and raises the price a hundredfold has not done something legal and therefore fine. Legality is a necessary condition; it is not a sufficient one.

amarant 3 hours ago||

If the discussion inherently cannot be had without blasting innocent bystanders, I don't think it's a discussion worth having.

It might even be morally abhorrent to have such a discussion in the first place!

kazinator 3 hours ago||

You can't put a copyright and MIT license on something you generated with AI. It is derived from the work of many unknown, uncredited authors.

Think about it; the license says that copies of the work must be reproduced with the copyright notice and licensing clauses intact. Why would anyone obey that, knowing it came from AI?

Countless instances of such licenses were ignored in the training data.

harshreality 20 minutes ago||

When learning is sufficiently atomized and recombined, creations cease to be "derived from" in a legal sense.

A Lego sculpture is copyrighted. Lego blocks are not. The threshold between blocks and sculpture is not well-defined, but if an AI isn't prompted specifically to attempt to mimic an existing work, its output will be safely on the non-copyrighted side of things.

A derivative work is separately copyrightable, but redistribution needs permission from the original author too. Since that usually won't be granted or would be uneconomical, the derivative work can't usually be redistributed.

AI-produced material is inherently not copyrightable, but not because it's a derivative work.

moralestapia 1 hour ago||

Courts have already ruled that AI-generated work belongs to the public domain. So, even the MIT license does not apply.

danbruc 2 hours ago||

Why are people even having problems with sharing their changes to begin with? Just publishing them somewhere does not seem too expensive. Is it the risk of accidentally including stuff that is not supposed to become public? Or are people regularly changing codebases completely and do not want to make that effort freely available, maybe especially to competitors? I would have assumed that the common case is adding a missing feature here, tweaking something there. If you turn the entire thing on its head, why not write your own alternative solution from scratch?

dleslie 3 hours ago||

IMHO, the API and Test Suite, particularly the latter, define the contract of the functional definition of the software. It almost doesn't matter what that definition looks like so long as it conforms to the contract.

There was an issue where Google did something similar with the JVM, and ultimately it came down to whether or not Oracle owned the copyright to the header files containing the API. It went all the way to the US Supreme Court, and they ruled in Google's favour, finding that the API wasn't the implementation, and that the amount of shared code was so minimal as to be irrelevant.

They didn't anticipate that in less than half a decade we'd have technology that could _rapidly_ reimplement software given a strong functional definition and a contract-enforcing test suite.
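To make the "test suite as contract" idea concrete, here is a minimal sketch. Everything in it is made up for illustration: the `CONTRACT` cases, the label strings, and both detector functions are hypothetical and are not chardet's API or tests. It only shows that two differently written implementations can satisfy the same behavioural contract.

```python
from typing import Callable, List, Tuple

# The "contract": input/expected-output pairs written against the public
# interface only. These cases are invented for this sketch.
CONTRACT: List[Tuple[bytes, str]] = [
    (b"hello", "ascii"),
    ("h\u00e9llo".encode("utf-8"), "utf-8"),
    (b"\xff\xfe\xfa", "unknown"),
]

Detector = Callable[[bytes], str]


def detect_by_decoding(data: bytes) -> str:
    """Implementation A: try strict decoders from narrowest to widest."""
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "unknown"


def detect_by_byte_scan(data: bytes) -> str:
    """Implementation B: inspect byte values first, decode only if needed."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


def satisfies_contract(impl: Detector) -> bool:
    """True if impl returns the expected label for every contract case."""
    return all(impl(data) == expected for data, expected in CONTRACT)
```

Both functions pass `satisfies_contract`, even though they share no code. That is the sense in which the contract, rather than any one implementation, pins down the behaviour.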

nicole_express 3 hours ago||

Not a lawyer, but my understanding is: In theory, copyright only protects the creative expression of source code; this is the point of the "clean room" dance, that you're keeping only the functional behavior (not protected by copyright). Patents are, of course, an entirely different can of worms. So using an LLM to strip all of the "creative expression" out of source code but create the same functionality feels like it could be equivalent enough.

I like the article's point of legal vs. legitimate here, though; copyright is actually something of a strange animal to use to protect source code, it was just the most convenient pre-existing framework to shove it in.

dathinab 1 hour ago|

> this is the point of the "clean room" dance

which is the actual relevant part: they didn't do that dance AFAIK

AI is a tool, they set it up to make a non-verbatim copy of a program.

Then they feed it the original software (AFAIK).

Which makes it a side-by-side copy, as in the original source was used as a reference to create the new program. That tends to be seen as a derived work even if the result is very different.

IMHO They would have to:

1. create a specification of the software _without looking at the source code_, i.e. by observing its behavior (plus an interface description). That is, you give the AI access to run the program, but not to look inside it. I really don't think they did this, because even with AI it's a huge pain: you normally can't just brute-force all combinations of inputs and instead need a scientific model=>test=>refine loop (which AI can do, but it can take long and get stuck, so you want it human-assisted, and the human can't have inside knowledge of the program).

2. then generate a new program from the specification, and only from it. No git history, no original source code access, no program access, no shared AI state or anything like that.

Also, to go the extra mile on legal risk avoidance, do both steps human-assisted, and use unrelated third parties without inside knowledge for each step.

While this does majorly cut the cost of a clean room approach, it still isn't cost-free. And it is still a legal minefield if done by a single person, especially if they have enough familiarity to potentially remember specific pieces of code verbatim.
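A rough sketch of the two steps, purely as illustration: `original_program`, the probe strings, and `reimplementation` are all hypothetical stand-ins. In a real clean-room setup the two steps would be carried out by separate parties rather than living in one file. The point is only that the spec comes from running the program, and that the spec is only as good as the probes.

```python
from typing import Callable, Dict, Iterable


def original_program(text: str) -> str:
    """Stand-in for the existing program; treated as a black box below."""
    return "-".join(text.lower().split())


def observe_behavior(
    run: Callable[[str], str], probes: Iterable[str]
) -> Dict[str, str]:
    """Step 1: build a spec by running the program, never by reading it."""
    return {probe: run(probe) for probe in probes}


def reimplementation(text: str) -> str:
    """Step 2: written from the recorded spec alone (hypothetical result)."""
    words = [word for word in text.lower().split(" ") if word]
    return "-".join(words)


def conforms(impl: Callable[[str], str], spec: Dict[str, str]) -> bool:
    """Check an implementation against every recorded observation."""
    return all(impl(given) == expected for given, expected in spec.items())


# First round of probes: the reimplementation matches everything observed.
spec = observe_behavior(original_program, ["Hello World", "  a  b "])
first_round_ok = conforms(reimplementation, spec)

# A later probe with a tab exposes a gap the first spec never covered,
# which is why a model => test => refine loop is needed.
wider_spec = observe_behavior(original_program, ["a\tb"])
second_round_ok = conforms(reimplementation, wider_spec)
```

Here `first_round_ok` is true and `second_round_ok` is false: the reimplementation splits on spaces only, and nothing in the first set of observations could have revealed that.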

RaffaelCH 29 minutes ago|||

> Then they feed it the original software (AFAIK).

My understanding is they did do the dance. From the article: "He fed only the API and the test suite to Claude and asked it to reimplement the library from scratch."

One could still make the argument that using the test suite was a critical contributing factor, but it is not a part of the resulting library. So in my uninformed opinion, it seems to me like the clean room argument does apply.

nicole_express 1 hour ago|||

Well sure they didn't do the dance, but you don't have to do the dance. The reason to do it is that it's a good defense in a lawsuit. Like you say, all of this is a legal minefield.

So my understanding was that the original code was specifically not fed into Claude. It was almost certainly part of its training data, though, which complicates things; but if that's fair use, then it's not relevant? If training's not fair use and taints the output, then new-chardet is a derivative of a lot of things, not just old-chardet...

This is all new legal ground. I'm not sure anyone will go to court over chardet, but for something that's an actual money-maker or an FSF flagship project like readline, that's a lot more likely.

bjt 3 hours ago||

> If source code can now be generated from a specification, the specification is where the essential intellectual content of a GPL project resides. Blanchard's own claim—that he worked only from the test suite and API without reading the source—is, paradoxically, an argument for protecting that test suite and API specification under copyleft terms.

This is an interesting reversal in itself. If you make the specification protected under copyright, then the whole practice of clean room implementations is invalid.

kccqzy 3 hours ago||

> When GNU reimplemented the UNIX userspace, the vector ran from proprietary to free. Stallman was using the limits of copyright law to turn proprietary software into free software. […] The vector in the chardet case runs the other way.

That’s just your subjective opinion, with which many other people would disagree. I bet Armin Ronacher would agree that an MIT licensed library is even freer than an LGPL licensed library. To them, the vector is running from free to freer.

t43562 3 hours ago||

Why does anyone need his new library? They can do what he did and make their own.

I'm glad we can fork things at a point and thumb our noses at those who wish to cash in on others' work.

warkdarrior 3 hours ago|

Why would I make my own? The new library is released under MIT license and faster than the old one.

t43562 2 hours ago||

If you decide to improve it in any way to fit your needs you can merely tell your own AI to re-implement it with your changes. Then it's proprietary to you.

strongpigeon 3 hours ago|

I feel like the licenses that suffer the most aren't the GPL, but the ones like SSPL. If your code can be re-implemented easily and legally by AWS using an LLM, why risk publishing it?

It does feel like open source is about to change. My hunch is that commercial open source (beyond the consultation model) risks disappearing. Though I'd be happy to be proven wrong.
