522 points · submitted 09 Jul 2023 by L4s@lemmy.world to c/technology@lemmy.world

Two authors sued OpenAI, accusing the company of violating copyright law. They say OpenAI used their work to train ChatGPT without their consent.

top 50 comments
[-] OldGreyTroll@kbin.social 124 points 1 year ago

If I read a book to inform myself, put my notes in a database, and then write articles, it is called "research". If I write a computer program to read a book and put the notes in my database, it is called "copyright infringement". Is the problem that there just isn't a meatware component? Or is it that the OpenAI computer isn't doing a good enough job of following the "three references" rule to avoid plagiarism?

[-] bioemerl@kbin.social 79 points 1 year ago

Yeah. There are valid copyright claims, because there are times that ChatGPT will reproduce stuff like code line for line over 10, 20, or 30 lines, which is really obviously a violation of copyright.

However, just pulling in a story from context and then summarizing it? That's not a copyright violation, that's a book report.

[-] nlogn@lemmy.world 54 points 1 year ago

Or is it that the OpenAI computer isn’t doing a good enough job of following the “three references” rule to avoid plagiarism?

This is exactly the problem. Months ago I read that AI could have free access to all public source code on GitHub without respecting its licenses.

Many developers have decided to abandon GitHub for alternatives, not realizing that in the end AI training can just as easily access their public repos on other platforms as well.

What should be done is to regulate this training. That, however, is not convenient for companies, because the more data the AI ingests, the more its knowledge expands and the more it "helps" the people who ask it for information.

[-] bioemerl@kbin.social 42 points 1 year ago

It's incredibly convenient for companies.

Big companies like OpenAI can easily afford to download big data sets from companies like Reddit and DeviantArt, which already have permission to freely use whatever work you upload to their websites.

Individual creators do not have that ability, and regulating this would only force AI into the domain of these big companies even more than it already is.

Regulation would be a hideously bad idea that would lock these powerful tools behind the shitty web APIs that nobody has control over but the company in question.

Imagine a future world of magical new-age technology, and Facebook owns all of it.

Do not allow that to happen.

[-] mydataisplain@lemmy.world 20 points 1 year ago

Is it practically feasible to regulate the training? Is it even necessary? Perhaps it would be better to regulate the output instead.

It will be hard to know whether any particular GET request is ultimately used to train an AI or to train a human. But it's currently easy to see if a particular output is plagiarized (https://plagiarismdetector.net/), and that's also much easier to enforce. We don't need to care if or how any particular model plagiarized work. We can just check whether plagiarized work was produced.

That could be implemented directly in the software, so it doesn't even output plagiarized material; a rough sketch of such a check is below. The legal framework around it is also clear and fairly well established. Instead of creating regulations around training, we can use the existing regulations around the human who tries to disseminate copyrighted work.
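Something like this naive word-level n-gram overlap check could be a starting point (the function names, the 8-word window, and the 20% threshold are all made up for illustration, not anyone's actual implementation):

```python
# Naive output-side plagiarism check: flag generated text whose
# word n-grams overlap heavily with a corpus of protected texts.

def ngrams(text: str, n: int = 8):
    """Yield word-level n-grams of the text, lowercased."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def looks_plagiarized(output: str, protected: list[str],
                      n: int = 8, threshold: float = 0.2) -> bool:
    """True if a large share of the output's n-grams appear
    verbatim in any protected text."""
    out_grams = set(ngrams(output, n))
    if not out_grams:
        return False
    return any(
        len(out_grams & set(ngrams(text, n))) / len(out_grams) >= threshold
        for text in protected
    )

corpus = ["it was the best of times it was the worst of times"]
candidate = "It was the best of times it was the worst of times indeed"
print(looks_plagiarized(candidate, corpus))  # True: mostly verbatim overlap
```

A generator could simply refuse to return anything that trips a check like this, the same way a human-facing plagiarism detector flags submitted essays.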

That's also consistent with how we enforce copyright on humans. There's no law against looking at other people's work and memorizing entire sections. It's also generally legal to reproduce other people's work (e.g. for backups). It only potentially becomes illegal if someone distributes it, and it's only plagiarism if they claim it as their own.

[-] Kilamaos@lemmy.world 7 points 1 year ago

Plus, any regulation to limit this now means that anyone not already in the game will never break through. It's going to be the domain of the current players for years, if not decades. So I'm not sure what's better: the current wild west where everyone can make something, or it being exclusive to the already-big players, who get to close the door behind them.

[-] Wander@kbin.social 18 points 1 year ago* (last edited 1 year ago)

Say I see a book that sells well. It's in a language I don't understand, but I use a thesaurus to replace lots of words with synonyms. I switch some sentences around, and maybe even mix pages from similar books into it. I then go and sell this book (still not knowing what the book actually says).

I would call that copyright infringement. The original book didn't inspire me, it didn't teach me anything, and I didn't add any of my own knowledge into it. I didn't produce any original work, I simply mixed a bunch of things I don't understand.

That's what these language models do.

[-] magic_lobster_party@kbin.social 13 points 1 year ago

The fear is that the books are in one way or another encoded into the machine learning model, and that the model can somehow retrieve excerpts of these books.

Part of the training process of the model is to learn how to plagiarize the text word for word. The training input is basically “guess the next word of this excerpt”. This is quite different from how humans do research; a toy sketch of the objective follows.
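(A bigram model "trained" by counting word pairs — nothing like a real LLM's scale, but the same guess-the-next-word idea, on made-up example text:)

```python
# "Guess the next word": learn next-word statistics from text,
# then predict by picking the most frequently observed follower.
from collections import Counter, defaultdict

training_text = "the cat sat on the mat and the cat slept"
words = training_text.split()

counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1  # after `prev`, how often does `nxt` follow?

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in training."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" ("cat" follows "the" twice above)
```

The objective literally rewards reproducing the training text, which is why a large enough model can end up regurgitating passages verbatim.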

To what extent the books are encoded in the model is difficult to know. OpenAI isn’t exactly open about their models. Can you make ChatGPT print out entire excerpts of a book?

It’s quite a legal gray zone. I think it’s good that this is tried in court, but I’m afraid the court might have too little technical competence to make a ruling.

[-] nyakojiru@lemmy.dbzer0.com 12 points 1 year ago* (last edited 1 year ago)

What about the fact that they are making billions from that "reading" and "storage" of other people's copyrighted information? They need to at least pay royalties. This is like Google's behavior: using people's data from "free" products to make billions. I would say they also need to pay people for the free data they crawled and monetized.

[-] dedale@kbin.social 80 points 1 year ago* (last edited 1 year ago)

AI fear is going to be the trojan horse for even harsher and stupider 'intellectual property' laws.

[-] bioemerl@kbin.social 48 points 1 year ago* (last edited 1 year ago)

Yeah, they don't just want the right to control who copies their work and distributes it to other people; they want to control who's able to actually read and learn from their work.

It's asinine, and we should be rolling back copyright, not making it stricter. This life-of-the-author-plus-70-years thing is bullshit.

[-] RedCowboy@lemmy.world 32 points 1 year ago

Copyright of code/research is one of the biggest scams in the world. It hinders development and only exists so the creator can make money; plus, it locks knowledge behind a paywall.

[-] Pseu@kbin.social 9 points 1 year ago

Researchers pay for publication, and then the publisher doesn't pay for peer review, then charges the reader to read research that they basically just slapped on a website.

It's the publisher middlemen that need to be ousted from academia, the researchers don't get a dime.

[-] Wander@kbin.social 8 points 1 year ago

It's generally not the creator who gets the money.

[-] Pseu@kbin.social 17 points 1 year ago* (last edited 1 year ago)

Remember, Creative Commons licenses often require attribution if you use the work in a derivative product, and sometimes require ShareAlike. Without those protections, there would be basically nothing to stop a large firm from copying a work and calling it its own.

Rolling back copyright protection in these areas would enable large companies with traditional copyright systems to wholesale take over open source projects, to the detriment of everyone. Closed source software isn't going to be available to AI scrapers, so this only really affects open source projects and open data, exactly the sort of people who should have more protection.

[-] magic_lobster_party@kbin.social 9 points 1 year ago

There’s also the GPL, which states that derivatives of GPL code can only be used in GPL software, and that GPL software must also be open source.

ChatGPT is likely trained on GPL code. Does that mean all code ChatGPT generates is GPL?

I wouldn’t be surprised if there were an update to the GPL making it clear that any machine learning model trained on GPL code must also be GPL.

[-] babelspace@kbin.social 11 points 1 year ago

I wish I could get through to people who fear AI copyright infringement on this point.

[-] kescusay@lemmy.world 36 points 1 year ago

I think this is exposing a fundamental conceptual flaw in LLMs as they're designed today. They can't seem to simultaneously respect intellectual property / licensing and be useful.

Their current best use case - that is to say, a use case where copyright isn't an issue - is dedicated instances trained on internal organization data. For example, Copilot Enterprise, which can be configured to use only the enterprise's data, without any public inputs. If you're only using your own data to train it, then copyright doesn't come into play.

That's been implemented where I work, and the best thing about it is that you get suggestions already tailored to your company's coding style. And its suggestions improve the more you use it.

But AI for public consumption? Nope. Too problematic. In fact, public AI has been explicitly banned in our environment.

To be honest, I hope they win. While my passion is technology, I am not a fan of artificial intelligence at all! Decision-making is best left to human beings. I can see where AI has its place, like in gaming or some other things, but to mainstream it and use it to decide whose resume is going to be viewed and/or who will be hired? Hell no.

[-] HumbertTetere@feddit.de 24 points 1 year ago

use it to decide whose resume is going to be viewed and/or who will be hired

Luckily that's far removed from ChatGPT, and entirely independent of the question of whether copyrighted works may be used to train conversational AI models.

[-] Chailles@lemmy.world 13 points 1 year ago

You don't need AI to unfairly filter out résumés; companies have been doing that for years. And the argument that a human would always make the best decision doesn't really hold. A human is biased and limited. They can only do so much, and if you make someone go through 100 résumés, you're basically just throwing out all the applicants who happen to be in the middle of the pile, since they don't stand out compared to the first and last applicants in the eyes of the human mind.

[-] ulu_mulu@lemmy.world 11 points 1 year ago

I'm not against artificial intelligence, it could be a very valuable tool, but that's nowhere near a valid reason to break laws the way OpenAI has done. That's why I too hope the authors win.

[-] bioemerl@kbin.social 14 points 1 year ago

What laws are you saying they've broken?

[-] ulu_mulu@lemmy.world 10 points 1 year ago* (last edited 1 year ago)

Copyright. This is apparently not the first time they've been sued for it (violating copyright is a crime).

[-] bioemerl@kbin.social 13 points 1 year ago

Scraping the web is legal and training AI on data is also legal.

[-] ulu_mulu@lemmy.world 12 points 1 year ago* (last edited 1 year ago)

Reusing the content you scraped, if copyright protected, is not.

Edit: unless you get the authorization of the original authors, but OpenAI didn't even ask. That's why it's a crime.

[-] GnothiSeauton@lemmy.world 16 points 1 year ago

Sounds like fair use to me.

[-] bioemerl@kbin.social 8 points 1 year ago

Yeah, it is. The only protection in copyright is against derivative works, and an AI is no more a derivative of a book than your brain is after you've read one.

The only exception would be if you managed to overtrain and encode the contents of the book inside the model file. That's not what happened here, because the ChatGPT output was a summary.

The only valid claim here is that the books were not supposed to be on the public internet in the first place, and it's likely that OpenAI got them by scraping some piracy website.

At that point you just have to hold them liable for that act of piracy, not treat the model release as an act of copyright violation.

[-] burrp@burrp.xyz 19 points 1 year ago

I'd love to know the source for the works that were allegedly violated. Presuming OpenAI didn't scour zlib/libgen for the books, where on the net were the cleartext copies of their writings stored?

Being stored in cleartext publicly on the net does not grant OpenAI the right to misuse their art, but the authors need to go after the entity that leaked their works.

[-] jaywalker@lemmy.world 8 points 1 year ago

That’s not how copyright works, though. Just because someone else “leaked” the work doesn’t absolve OpenAI of responsibility. The authors are free to go after whomever they want.

[-] burrp@burrp.xyz 8 points 1 year ago

You misunderstood. I said the public availability does not grant OpenAI the right to use content improperly. The authors should also sue the party who leaked their works without license.

[-] trial_and_err@lemmy.world 16 points 1 year ago

ChatGPT has entire books memorised. You can (or at least could, when I tried a few weeks back) make it print entire pages of, for example, Harry Potter.

[-] dhork@lemmy.world 15 points 1 year ago

There's an additional question: who holds the copyright on the output of an algorithm? I don't think it is copyrightable at all. The bot doesn't really add anything to the output; it's just a fancy search engine. In the US in particular, the agency in charge of copyrights has been quite insistent that copyright can only be granted to the output of a human.

So when an AI incorporates parts of copyrighted works into its output, how can that not be infringement?

[-] cerevant@lemmy.world 9 points 1 year ago

How can you write a blog post reviewing a book you read without copyright infringement? How can you post a plot summary to Wikipedia without copyright infringement?

I think these blanket conclusions about AI consuming content being automatically infringing are wrong. What is important is whether or not the output is infringing.

[-] dhork@lemmy.world 12 points 1 year ago* (last edited 1 year ago)

You can write that blog post because you are a human, and your summary qualifies for copyright protection, because it is the unique output of a human based on reading the copyrighted material.

But the US authorities are quite clear that a work which is purely AI generated can never qualify for copyright protection. Yet since it is based on the synthesis of works under copyright, it can't really be considered public domain either. Otherwise you could ask the AI, "Write me a summary of this book that has exactly the same number of words", and likely get a direct copy of the book that is clear of copyright.

I think these AI companies are going to face a reckoning when it is ruled that they misappropriated all this content they never explicitly licensed, and that all their output is infringing by definition.

[-] totallynotarobot@lemmy.world 15 points 1 year ago

Can’t reply directly to @OldGreyTroll@kbin.social because of that “language” bug, but:

The problem is that they then sell the notes in that database for giant piles of cash. Props to you if you’re profiting off your research the way OpenAI can profit off its model.

But yes, the lack of meat is an issue. If I read that article right, it’s not the one being contested here though. (IANAL and this is the only article I’ve read on this particular suit, so I may be wrong).

[-] sjatar@sjatar.net 10 points 1 year ago* (last edited 1 year ago)

Was also going to reply to them!

"Well if you do that you source and reference. AIs do not do that, by design can't.

So it's more like you summarized a bunch of books. Pass it of as your own research. Then publish and sell that.

I'm pretty sure the authors of the books you used would be pissed."

Again cannot reply to kbin users.

"I don't have a problem with the summarized part ^^ What is not present for a AI is that it cannot credit or reference. And that is makes up credits and references if asked to do so." @bioemerl@kbin.social

[-] bioemerl@kbin.social 10 points 1 year ago* (last edited 1 year ago)

It is 100% legal and common to sell summaries of books to people. That's what a reviewer does. That's what Wikipedia does in the plot section of literally every Wikipedia page about every book.

This is also ignoring the fact that ChatGPT is a hell of a lot more than a bunch of summaries.

[-] totallynotarobot@lemmy.world 10 points 1 year ago

Good point, attribution is a non-trivial part of it.

[-] totallynotarobot@lemmy.world 8 points 1 year ago

@owf@kbin.social can’t reply directly to you either, same language bug between lemmy and kbin.

That’s a great way to put it.

Frankly idc if it’s “technically legal,” it’s fucking slimy and desperately short-term. The aforementioned chuckleheads will doom our collective creativity for their own immediate gain if they’re not stopped.

[-] jecxjo@midwest.social 11 points 1 year ago* (last edited 1 year ago)

The only question I have for content creators of any kind who are worried about AI: do you go after every human who consumed your content when they create anything remotely connected to your work?

I feel like we have a bias towards humans: unless someone is actively trying to steal your ideas or concepts, we ignore the fact that your content is distilled into some neurons in their brain and becomes part of everything they create from that point forward. Would someone with an eidetic memory be forbidden from consuming your work, since they could internally reference your material when creating their own?

[-] Eccitaze@yiffit.net 9 points 1 year ago

The problem with AI as it currently stands is that it has no actual comprehension of the prompt, no ability to make leaps of logic, and no ability to extend and build upon existing work to legitimately transform it, except by using other works already fed into its model. All it can do is blend a bunch of shit together to make something that meets a set of criteria. There's little fundamental difference between what ChatGPT does and what a procedurally generated game like most roguelikes does--the only real difference is that ChatGPT uses a prompt while a roguelike uses an RNG seed (see the sketch below). In both cases, though, the resulting product is limited solely to the assets available to it, and if I made a roguelike that used assets ripped straight from Mario, Zelda, Mass Effect, Crash Bandicoot, Resident Evil, and Undertale, I'd be slapped with a cease and desist fast enough to make my head spin.
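To make the analogy concrete, here's a minimal seeded generator (the asset names are invented for illustration). Same seed in, same output out, and nothing can ever appear that wasn't in the asset pool:

```python
import random

# The generator can only ever recombine the assets it was given,
# just as a model can only recombine what was in its training data.
ASSETS = {
    "monster": ["goblin", "slime", "skeleton"],
    "weapon": ["rusty sword", "oak staff"],
    "room": ["damp cave", "ruined library"],
}

def generate_encounter(seed: int) -> str:
    """Deterministically 'generate' an encounter from a seed
    (the seed playing the role a prompt plays for ChatGPT)."""
    rng = random.Random(seed)
    return (f"a {rng.choice(ASSETS['monster'])} with a "
            f"{rng.choice(ASSETS['weapon'])} in a {rng.choice(ASSETS['room'])}")

print(generate_encounter(42))  # the same seed always yields the same encounter
```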

The fact that OpenAI stole content from everybody in order to make its model doesn't make it less infringing.

[-] ClamDrinker@lemmy.world 13 points 1 year ago* (last edited 1 year ago)

That's incorrect. Sure, it has no comprehension of what the words it generates actually mean, but it does understand the patterns that can be found in the words. Ask an AI to talk like a pirate, and suddenly it knows how to transform words to sound pirate-like. It can also combine data from different texts about similar topics to generate new responses that never existed in the first place.

Your analogy is a little flawed too: if you mixed all the elements in a transformative way and didn't re-use any materials as-is, even if you called it Mazefecootviltale, then as long as the original material was transformed sufficiently, you haven't infringed on anything. LLMs don't get trained to recreate existing works (which would make them only capable of producing infringing works), but to predict the best next word (or even part of a word) based on the input information. It's definitely possible to guide an AI towards specific source material using keywords that only exist in that material, which could be infringing, but in general its output is so generalized that it's inherently transformative.

[-] phx@lemmy.ca 9 points 1 year ago

If you're doing research, there are actually some limits on the use of the source material, and you're supposed to be citing said sources.

But yeah, there's plenty of stuff where there needs to be a firm line between what a random human can do versus an automated intelligent system with potential unlimited memory/storage and processing power. A human can see where I am in public. An automated system can record it for permanent record. An integrated AI can tell you detailed information about my daily activities including inferences which - even if legal - is a pretty slippery slope.

[-] mojo@lemm.ee 9 points 1 year ago

They definitely should follow through with this, but it's part of a broader issue: we need to be able to prevent data scraping in general. That is a significantly harder problem, though.

[-] randomdude567@lemmy.world 8 points 1 year ago* (last edited 1 year ago)

I don't really understand why people are so upset by this. Except in the case of networks trained on someone's stolen art style, people shouldn't be getting mad. OpenAI has practically the entire internet as its source, so GPT ends up with so much information that any specific author barely has an effect on the output. OpenAI isn't stealing people's art, because it is not copying the artwork; it is using it to train models. Imagine getting sued for looking at reference artwork before creating your own.

[-] whereisk@lemmy.world 21 points 1 year ago

Unless you grant personhood to those statistical inference models, the analogy falls flat.

We're talking about a corporation using copyrighted data to feed their database to create a product.

If you were ever in a copyright negotiation you'd see that everything is relevant: intended use, audience size, sample size, projected income, length of usage, mode of transmission, quality etc.

They've negotiated none of it, and worst of all, they commercialised it. I'd consider them to be in real trouble.

[-] assassin_aragorn@lemmy.world 10 points 1 year ago

Not to mention, if we're going to judge them based on personhood, then companies need to treat it like a person. They can't have it both ways: either pay it a fair human wage for its work, or it isn't a person.

Frankly, the fact that the follow-up question would be "well what's it going to do with the money?" tells us it isn't a person.
