
Posted by m463 4 days ago

FSF statement on copyright infringement lawsuit Bartz v. Anthropic (www.fsf.org)
211 points | 104 comments
briandw 2 hours ago|
I'm really confused by the FSF statement here. The court ruled that the use of copyrighted information is fair use. The issue is that Anthropic pirated (obtained illegally) copyrighted work and that was the offense. FSF books are free to download and store etc. The license says: "This is a free license allowing use of the work for any purpose without payment." So how can they claim that their rights were infringed when the court ruled that the problem was the illegal downloading of copyrighted work? It's impossible to illegally download a FSF book.
MajorArana 4 hours ago||
Thank you FSF!

The hero we need, but not the hero we deserve..

The issue is that every CS master's student & AI researcher knows how to build a SOTA LLM.. But only a few companies have the resources.

The process:

(1) steal as much data from the internet as possible (data is everything)

(2) raise incomprehensible amounts of money

(3) find a location where you can take over the energy grid for training

(4) put a black box around it so nobody can see the weights

(5) charge users $$$ to use

(6) retrain models with user session data (opt in by default)

(7) peek around at how users are using, (maybe) change policies to stop them from using that way, and (maybe) rapidly develop features for that use case.

(Sorry that last one is jaded and not fair - just included to give you a picture of what could be happening with this sort of tech) …

The entire premise of the product is “built on the backs of any & everyone who has ever published a work”

margalabargala 2 hours ago|
> The entire premise of the product is “built on the backs of any & everyone who has ever published a work”

Do any products exist which are not built on uncompensated work of other people in the past?

Generally speaking societies do better when knowledge is shared and not hoarded.

Hoarding knowledge via legal constructs is great at concentrating wealth to the hoarder at the expense of everyone else.

We should restore copyright to its original term lengths.

I agree with the stance of Anthropic et al that these models should be built with all possible information.

I agree with the stance of the FSF that the resulting models should be as freely usable/available as possible.

tpxl 1 hour ago||
> Generally speaking societies do better when knowledge is shared and not hoarded.

These companies do even better because we're not allowed to share the knowledge (read: illegally copy protected works) and they are.

margalabargala 1 hour ago||
Exactly.
teeray 7 hours ago||
> It is a class action lawsuit… the parties agreed to settle instead of waiting for the trial…

It would be nice if members of the class could vote to force a case to trial. For the typical token settlement amount, I’m sure many would rather have the precedent-setting case instead.

ksherlock 6 hours ago|
If/when you get a postcard/spam email that you're included in a potential class action lawsuit settlement, you can opt out of the class (in which case you preserve your legal rights to sue separately) or file comments with the Court.
teeray 4 hours ago||
You can, but then you lose the power of a collective and have to manage a lawsuit yourself. If you are being represented as part of a group, then you should have means to direct that representation.
bbor 2 hours ago||
Surely some firms choose to hold referendums already, but I could see that being a good law! As Better Call Saul explored in its early seasons, the interests of the large law firm can easily diverge significantly from the interests of the plaintiffs.
bobokaytop 11 hours ago||
The framing of 'share your weights freely' as a remedy is interesting but underspecified. The FSF's argument is essentially that training on copyrighted code without permission is infringement, and the remedy should be open weights. But open weights don't undo the infringement -- they just make a potentially infringing artifact publicly available. That's not how copyright remedies work. What they're actually asking for is more like a compulsory license, which Congress would have to create. The demand for open weights as a copyright remedy is a policy argument dressed up as a legal one.
wongarsu 10 hours ago||
In GPL cases for software, making the offending proprietary code publicly available under the GPL has been the usual outcome.

But whether you can actually be compelled to do that isn't well tested in court. Challenging whether the GPL is enforceable in that way leads you down the path that you had no valid license at all, and for past GPL offenders that would have been the worse outcome. AI companies could change that.

pessimizer 1 hour ago|||
> But open weights don't undo the infringement -- they just make a potentially infringing artifact publicly available.

This is true when talking about the infringement of the copyrights of others. But when discussing the infringement of GPL copyleft, making a potentially infringing artifact publicly available likely satisfies the license conditions.

The evil is that this case was settled, and before being settled was decided in a way contrary to all previous copyright decisions. The courts decided that rap records had to clear every single sample, thereby basically destroying the art form, but now you can literally feed every book into a blender, piece another book together out of the pieces, and sell it.

Hip-hop when it peaked with the Bomb Squad was such a frenetic mix of so many recognizable, unrecognizable, and transformed sources that it doesn't resemble anything that was made after the decisions against Biz Markie and De La Soul. Afterwards, you just licensed one song, slightly cut it up, and rapped over it. It was just a new way to sell old shit to young people unfamiliar with it.

Now you can literally just train a machine on the same stuff, and it's legal. A machine transformation was elevated over human creativity, simply because rich people wanted it.

simoncion 8 hours ago||
> The framing of 'share your weights freely' as a remedy is interesting but underspecified. The FSF's argument is essentially that training on copyrighted code without permission is infringement, and the remedy should be open weights.

Ignoring the fact that the statement doesn't talk about FSF code in the training data at all, [0] are you sure about that? From the start of the last of the statement's three paragraphs:

  Obviously, the right thing to do is protect computing freedom: share complete training inputs with every user of the LLM, together with the complete model, training configuration settings, and the accompanying software source code. Therefore, we urge Anthropic and other LLM developers that train models using huge datasets downloaded from the Internet to provide these LLMs to their users in freedom.
This seems to me to be consistent with the FSF's stance of "You told the computer how to do it. The right thing to do is to give the humans operating that computer the software, input data, and instructions that they need to do it, too.".

[0] In fact, it talks about the inclusion of a book published under the terms of the GNU FDL, [1] which requires distribution of modified copies of a covered work to -themselves- be covered by the GNU FDL.

[1] <https://www.gnu.org/licenses/fdl-1.3.html>

latexr 10 hours ago||
What weak, counter-productive messaging. This is like having a bully punch you in the face and responding with “hey man, I’m not going to do anything about this, I’m not even going to tell an adult, but I’d urge you to consider not punching me in the face”. Great news for the bully! You just removed one concern from their mind, essentially giving them permission to be as bad to you as they want.
nazgulsenpai 2 hours ago|
It's the FSF and their licensing is what it is. What other messaging would be consistent with the foundation's mission?
latexr 1 hour ago||
They could not mention they usually don’t sue and that they are small and “have to pick [their] battles”, which effectively means “there will be no repercussions from our side, we won’t even consider trying, so continue to do as you please and even worse”.

Saying nothing is an option. It is very possible (and the FSF has done it) to put yourself into a weaker position by saying something.

You don’t have to lie, but you don’t have to volunteer unprompted that you don’t have a hand to play, either.

nazgulsenpai 48 minutes ago||
Thanks for explaining, that's fair.
Topfi 11 hours ago||
A related topic I have thought about in the past is whether LLM-derived code would necessitate release under a copyleft license because of the training data. I never saw a cogent analysis that explained why or why not this is the case, beyond the practicality that models have already been utilized in closed source codebases…
mjg59 11 hours ago||
The short answer is that we don't know. The longer answer based purely on this case is that there's an argument that training is fair use and so copyleft doesn't have any impact on the model, but this is one case in California and doesn't inherently set precedent in the US in general and has no impact at all on legal interpretations in other countries.
bragr 10 hours ago||
The dearth of case law here still makes a negative outcome for FSF pretty dangerous, even if they don't appeal it and set precedent in higher courts. It might not be binding, but every subsequent case will be able to cite it, potentially even in other common law countries that lack case law on the topic.

And then there is the chilling effect. If FSF can't enforce their license, who is going to sue to overturn the precedent? Large companies, publishers, and governments have mostly all done deals with the devil now. Joe Blow random developer is going to get a strip mall lawyer and overturn this? Seems unlikely

adampunk 4 hours ago||
I don't think this argument is a winner. It fails on a few grounds:

First, unless you can point to regurgitation of memorized code, you're not able to make an argument about distribution or replication. This is part of the problem that most publishers are having with prose text and LLMs. Modern LLMs don't memorize Harry Potter like GPT-3 did. The memorization older models showed came from problems in the training data, e.g. Harry Potter and people writing about Harry Potter are extraordinarily over-represented. It's similar to how with Stable Diffusion you could prompt for anything in the region of "Van Gogh's Starry Night" and get it, since it was in the training data 50-100 different ways. You can't reliably do this with Opus or GPT-5. If they're not redistributing the code verbatim, they're not in violation of the license. One could argue that the models produce "derivative works," but...

The derivative works argument is inapt. The point of it is to disrupt someone's end-run around the license by saying that building on top of GPL code is not enough to non-GPL it. We imagine this will still work for LLMs because of the GPL's virality--I can't enclose a critical GPL module in non-GPL code and not release the GPL code. But the models aren't DOING THAT. They're not reaching for XYZ GPL'd project to build with. They're vibing out a sparsely connected network of information about literally trillions of lines of software. What comes out is a mishmash of code from here and there, and only coincidentally resembles GPL code, when it does. In order to make this argument work, you need a theory of how LLMs are trained and operate that supports it. Regardless of whether or not one of those theories exists, in court, you'd need to show that your theory was better than the company's expert witness's theory. Good luck.

Second, infringement would need discovery to uncover and would be contingent on user input. This is why the NYT sued for deleted user prompts to ChatGPT--the plaintiffs can't show in public that the content is infringing, so they need to seek discovery to find evidence. That's only going to work in cases where you survive a motion to dismiss--which is EXACTLY where a few of these suits have failed. You need to show first that you can succeed on the merits, then you proceed. That will cut down many of these challenges since they just can't show the actual infringement.

Third, and I think this is the most important, the license protections here are enforced by *copyright*. For copyright it very much matters if something is lifted verbatim vs modified. It is not like patent protection; for copyright, things like clean room design have been shown to matter to real courts on real matters. In further contrast to patents, copyright doesn't care if the outcome is close. That's very much a concern for patents. If I patent a gizmo and you produce a gizmo that operates through nearly identical mechanisms to those I patented, then you can be sued--they don't need to be exact. If I write a novel about a boy wizard with glasses who takes a train to a school in Scotland and you write a novel about a boy wizard with glasses who takes a boat to a school in Inishmurray, I can't sue you for copyright infringement. You need to copy the words I wrote and distribute them to rise to a violation.

Topfi 39 minutes ago|||
> Modern LLMs don't memorize harry potter like GPT3 did. [...] You can't reliably do this with Opus or GPT5.

If you try any modern LLM, you will find that you can. Easily [0], reliably [1], consistently [2]. All these examples are with models released in 2025/26.

[0] https://arxiv.org/html/2601.02671

[1] https://arxiv.org/abs/2506.12286

[2] https://ai.stanford.edu/blog/verbatim-memorization/

adampunk 14 minutes ago||
So, do they have to do anything special to those models in order to get them to regurgitate ~100%? Any special prompts they needed to use to get Sonnet to cough that up?

What is the real copyright risk of there being an arcane procedure to sometimes recover most of a text? So far it’s nothing. Which is what I’m saying. Pragmatically this is a loser of an argument in a courtroom. It is too easy for the chain of reasoning to be disrupted, and even undisrupted, the argument for model maker liability is attenuated.

themafia 1 hour ago|||
> unless you can point to regurgitation of memorized code

I have, on many occasions, gotten an LLM to do just this. It's not particularly hard. In the most recent case Google's search bar LLM happily regurgitated a Digital Ocean article as if it was its own output. Searching for some strings in the comments located the original page and it was a 95% match between origin and output.

> The memorization older models showed came from problems in the training data,

And what proof do you have that they "fixed" this? And what was the fix?

> harry potter and people writing about harry potter

I'm not sure that's how you get GPT to reproduce upwards of 85% of Harry Potter novels.

> Second, infringement would need discovery to uncover and would be contingent on user input.

That's not at all how copyright infringement works. That would be if you wanted to prove malice and get triple damages. Copyright infringement is an exceptionally simple violation of the law. You either copied, or you did not.

> For copyright it very much matters if something is lifted verbatim vs modified.

Transformation is a valid defense for _some_ uses. It is not for commercial uses. Using LLM generated code for commercial purposes is a hazard.

adampunk 21 minutes ago||
This must be why all of these copyright plaintiffs are having tremendous days in court! If even half of this were correct, they wouldn’t be losing in summary judgment.

We have yet to see a single judgment come down against a model maker for distributing the gist of content. We have yet to see a single judgment come down against a model maker for infringement at all.

Copyright is just an inapt tool here. It’s not going to do the job. It is not as though big interests have not tried to use this tool. It just doesn’t reflect what’s actually happening and it’s going to lose again and again.

We can imagine a theoretical legal regime where what is done with large language models counts as copyright infringement, we just don’t live in a world where that regime holds.

kavalg 11 hours ago||
It looks like the stance of the FSF is that copyleft should propagate to trained LLMs

> "Therefore, we urge Anthropic and other LLM developers that train models using huge datasets downloaded from the Internet to provide these LLMs to their users in freedom"

mjg59 11 hours ago|
No, it looks like the stance of the FSF is that models should be free as a matter of principle, the same as their stance when it comes to software. Nothing in the linked post contradicts the description that the judgement was that the training was fair use.
jamesnorden 8 hours ago||
The FSF seems toothless when it comes to actually enforcing anything regarding license violations.
phendrenad2 5 hours ago||
Huh, I've been waiting for the FSF to say something about the current big issue: mandatory operating-system-level age verification. Maybe now that they've meddled in a copyright lawsuit that has no broader ramifications for the public (the people they supposedly fight for), they can get back to that.
charcircuit 11 hours ago|
>share complete training inputs with every user of the LLM

They don't have the rights to distribute the training data.

zelphirkalt 10 hours ago|
So if a user can bring an LLM to output a copy of some training data, then the ones who distribute the LLM are engaging in illegal activity?
charcircuit 10 hours ago||
It isn't illegal, as an LLM is transformative.
anthk 1 hour ago||
So are awk and sed. Good luck convincing any judge/lawyer.