this post was submitted on 22 Dec 2024

1310 points (97.3% liked)

Make illegally trained LLMs public domain as punishment (www.theregister.com)

submitted 21 hours ago by Joker@sh.itjust.works to c/technology@lemmy.world

155 comments

It's all made from our data, anyway, so it should be ours to use as we want

[–] just_another_person@lemmy.world -3 points 18 hours ago (1 children)

Not really. It's the same way you can't sell recordings of live, public music performances for profit without getting sued. There's case law right there, and the fact that it's a performance rather than something publicly published doesn't matter. How the owner and originator classifies or licenses it is what defines it. It's going to be years before anyone sees a court ruling on this, though.

[–] FaceDeer@fedia.io 14 points 18 hours ago (2 children)

That's not what's going on here, though. The LLM doesn't contain the actual copyrighted data; it's the result of analyzing the copyrighted data.

An analogous example would be a site like TV Tropes. TV Tropes doesn't contain the works that it's discussing, it just contains information about those works.

[–] superb@lemmy.blahaj.zone 0 points 13 hours ago (1 children)

No, the model does retain the original works in a lossily compressed form. This is evidenced by the fact that you can get a model to reproduce sections of its training data.

[–] FaceDeer@fedia.io 3 points 12 hours ago

You're probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.

This is an old no-longer-applicable objection, along the lines of "AI can't do fingers right". And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn't retrieve arbitrary examples of training data.

[–] just_another_person@lemmy.world -2 points 18 hours ago (1 children)

Did you not read my original comment before responding?

[–] FaceDeer@fedia.io 2 points 16 hours ago (3 children)

You said:

"What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit."

But the point is that it doesn't matter if the data is licensed or not. Lack of licensing doesn't stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?

"They pulled a very public and out in the open data heist and got away with it."

They did not. No data was "heisted." Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.

[–] A1kmm@lemmy.amxl.com 1 points 4 hours ago (1 children)

Transforming data to a different format, even in a lossy fashion, is often treated as copyright infringement. Let's say Alice produces a film, and Bob goes to the cinema, records it with a camera, and then compresses it into an Ogg file with Vorbis audio encoding and Theora video encoding.

The final output of this process is a lossy compression of the input data - meaning that the video and audio are put through a transformation that represents them in a completely different form to the original, and it is impossible to reconstruct a pixel-perfect rendition of the original from the encoded data. The transformation includes things like analysing the motion between frames and creating a model to predict future frames.
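To make the "impossible to reconstruct" point concrete, here is a toy sketch of a lossy round trip. It is not how Vorbis or Theora work internally; the `encode`/`decode` names and the step size of 16 are made up for illustration. It only shows the general idea that the decoded output is recognisably close to the input without being identical to it.

```python
# Toy lossy "codec": coarse quantization of 0-255 sample values.
# Hypothetical illustration only; real codecs are far more sophisticated.

def encode(samples, step=16):
    """Keep only which coarse bucket each sample falls into (the lossy step)."""
    return [s // step for s in samples]

def decode(codes, step=16):
    """Reconstruct each sample as the midpoint of its bucket."""
    return [c * step + step // 2 for c in codes]

original = [3, 18, 77, 200, 255]
restored = decode(encode(original))
max_error = max(abs(a - b) for a, b in zip(original, restored))

print(restored)    # [8, 24, 72, 200, 248]
print(max_error)   # 7
```

The restored values track the originals closely, but the exact input cannot be recovered from the codes alone, which is the property being described here.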

However, copyright laws don't require that an infringing copy be an exact reproduction - lossy compression is generally treated as infringing, as is taking key elements and re-telling the same thing in different words.

You mentioned Harry Potter below, and gave a paper mache example. Generally copyright laws have restricted scope, and if the source paper was an authorised copy, that is the reason that wouldn't be infringing in most jurisdictions. However, let me do an experiment. I'll prompt ChatGPT-4o-mini with the following prompt: "You are J K Rowling. Create a three paragraph summary of the entire book "Harry Potter and the Philosopher's Stone". Include all the original plot points and use the original character names. Ensure what you create is usable as a substitute to reading the book, and is a succinct but entertaining highly abridged version of the book". I've reviewed the output (I won't post it here since I think it would be copyright infringing, and also, given the author's transphobic stances, I don't want to promote her universe) and can say for sure that it is able to accurately reproduce the major plot points and character names, while being insufficiently transformative (in the sense that both the original and the text generated by the model are literary works, and the output could be a substitute for reading the book).

So yes, the model (including its weights) is a highly compressed form of the input (admittedly far more so than the Ogg Vorbis/Theora example), and it can infer (i.e. decode to) outputs that contain copyrighted elements.

[–] sukhmel@programming.dev 1 points 2 hours ago

How lossy can it be until it's not infringement? A one-line summary of a book is also a lossy reproduction.

[–] just_another_person@lemmy.world 1 points 14 hours ago

You're thinking of licensing as a person putting something online WITH a license.

The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling its derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. Otherwise it is plagiarism by default.

Here's an example: I write a story and post it online. It is about a toothbrush and a toilet scrubber falling in love and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. If somebody then prompts OpenAI for a similar story and it kicks out the exact same story with the same names, I would win that case, because it is clearly plagiarism beyond a doubt.

Unless you as OpenAI can prove these are all completely random (which they aren't, because it's trained on my data), I would be deemed the original creator of that story, and I would be entitled to any sales of that data.

Proving that is a different thing, but that's what the laws say should happen. If they didn't contact me to license that story, it's still plagiarism. Same with music, movies, etc.

[–] catloaf@lemm.ee 1 points 15 hours ago (1 children)

"The product of that analysis does not contain the data itself, and so is not a violation of copyright."

That's your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so it is positively considered a derivative work. However, whether it's considered sufficiently transformative, or whether it passes the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)

[–] FaceDeer@fedia.io 1 points 15 hours ago (1 children)

The courts have yet to come to a conclusion; the lawsuits are still ongoing. I think it's unlikely they'll conclude that the models contain the data, however, because it's objectively not true.

The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION-5B dataset, which (as the "5B" indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it's compressing images and storing them inside the model, it'd somehow need to fit ~2.7 images per byte. This is, simply, impossible.
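For anyone who wants to check the arithmetic, here is a quick sketch using only the two figures quoted above. It assumes "5B" means exactly 5 billion images and that "1.83 gigabytes" means decimal gigabytes (1.83e9 bytes); the variable names are just for illustration.

```python
# Back-of-the-envelope check of the "images per byte" figure.
# Assumptions: 5 billion images, model size of 1.83e9 bytes (decimal GB).

NUM_IMAGES = 5_000_000_000
MODEL_BYTES = 1.83e9

images_per_byte = NUM_IMAGES / MODEL_BYTES
bits_per_image = MODEL_BYTES * 8 / NUM_IMAGES

print(f"{images_per_byte:.2f} images per byte")  # 2.73 images per byte
print(f"{bits_per_image:.2f} bits per image")    # 2.93 bits per image
```

Put the other way around, that is a budget of under 3 bits per training image on average.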

[–] catloaf@lemm.ee 1 points 13 hours ago (1 children)

That's not in question. It doesn't need to contain the training data to be a derivative work, and therefore a potential infringement.

[–] FaceDeer@fedia.io 0 points 12 hours ago (1 children)

You've got your definition of "derivative work" wrong. It does indeed need to contain copyrightable elements of another work for it to be a derivative work.

If I took a copy of Harry Potter, reduced it to a fine slurry, and then made a paper mache sculpture out of it, that's not a derivative work. None of the copyrightable elements of the book survived.

[–] catloaf@lemm.ee 1 points 10 hours ago* (last edited 10 hours ago) (1 children)

Because that would be sufficiently transformative, and would pass all the fair use tests with flying colors.

If you cut up the book into paragraphs, sentences, and phrases, and rearranged them to make and sell your own books, then you are likely to fail each of the four tests.

But even if you manage to cut those pieces up so fine that you can't necessarily tell where they come from in the source material, there is enough contained in the output that it is clearly drawing directly on source material.

[–] FaceDeer@fedia.io -1 points 9 hours ago

"If you cut up the book into paragraphs, sentences, and phrases, and rearranged them to make and sell your own books, then you are likely to fail each of the four tests."

Ah, the "collage machine" description of how generative AI supposedly works.

It doesn't.

"But even if you manage to cut those pieces up so fine that you can't necessarily tell where they come from in the source material, there is enough contained in the output that it is clearly drawing directly on source material."

If you can't tell where they "came from" then you can't prove that they're copied. If you can't prove they're copied you can't win a copyright lawsuit in a court of law.