this post was submitted on 22 Dec 2024
1301 points (97.3% liked)

Technology


It's all made from our data, anyway, so it should be ours to use as we want

[–] werefreeatlast@lemmy.world 3 points 16 minutes ago

I want to have a personal LLM that learns all my interests from my files and the websites I visit. I just want to ask it the stuff I'd otherwise have to remember.

[–] buzz86us@lemmy.world 0 points 20 minutes ago

I really don't care about AI used on designs for generic products.

[–] x0x7@lemmy.world 2 points 1 hour ago

So if I make a better car using customer feedback, are the rights to the car really theirs because their opinions went partially into the end product?

IP is a joke anyway. If you put information out into the world you don't own it. Sorry, you can't have it both ways. You can't simultaneously support torrenting movies (I do, and I assume you do too) while also claiming you own your comments on the internet and no one can "pirate" them.

[–] Blackmist@feddit.uk 4 points 2 hours ago

They don't mean your data, silly. They don't give a fuck about that.

They mean other huge corporations' data.

[–] RandomVideos@programming.dev 3 points 2 hours ago (1 children)

Wouldn't that give people who use it for bad things easier access? It should be made illegal to create these models if they don't legally have access to that data.

[–] Ajen@sh.itjust.works 1 points 39 minutes ago (1 children)

The "illegally trained LLMs" they're talking about are trained on copyrighted data that they didn't have permission to use; this isn't about LLMs that have been trained to do illegal things. OpenAI (ChatGPT) is being sued because there is a lot of evidence that they used copyrighted content for training, like NY Times articles. OpenAI is so profitable that they'll probably see these lawsuits as a business expense and keep doing it. Most people won't sue anyway...

[–] RandomVideos@programming.dev 1 points 24 minutes ago (1 children)

i know that by illegally trained LLMs they are talking about training on copyrighted data (by "legally have access to", i meant that they are legally allowed to train AI on it).

It's ridiculous that companies can just ignore laws

[–] Ajen@sh.itjust.works 1 points 13 minutes ago

Oh, I'm not sure what you meant in your first comment then?

[–] nutsack@lemmy.world 24 points 6 hours ago* (last edited 6 hours ago) (3 children)

intellectual property doesn't really exist in most of the world. they don't give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore...

it's arbitrary law that is designed to protect corporations and it's generally unenforceable.

[–] C126@sh.itjust.works 1 points 1 hour ago* (last edited 1 hour ago)

So true. IP only helps the corps and slows tech development. Contracts, NDAs, and trade secrets are all you really need to keep your ideas safe. If you want your country to develop fast, get rid of any IP laws.

[–] FlyingSquid@lemmy.world 1 points 3 hours ago

they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

Unless it's their intellectual property, whereupon it's suddenly a whole different story. I'm sure you knew that.

[–] echodot@feddit.uk 9 points 6 hours ago (1 children)

But they're not developing AI in those countries; they're developing it mostly in the US. In the US, copyright law is enforced.

[–] dsilverz@thelemmy.club 3 points 3 hours ago

There is a lot of AI development happening in China. Doubao (from ByteDance, the same company behind TikTok), DeepSeek and Qwen are some examples of Chinese LLMs.

[–] Magnetic_dud@discuss.tchncs.de 15 points 7 hours ago

I used Whisper to create subs for a video, and in a section with relaxing instrumental music it filled the subtitles, on repeat, with:

La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un'ipnosi non verbale

Clearly stolen from this Dr. Paret YouTube channel where he's selling hypnosis lessons in Italian. Probably in one or more videos he had subs stating this over the same relaxing instrumental music that I used, and the model assumed the sound corresponded to that text.
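A looping line like the one above can at least be spotted after the fact. The following is a minimal sketch, not anything Whisper itself provides: it assumes the subtitles are already available as hypothetical `(start, end, text)` tuples and flags any run where the same text repeats more than a chosen number of times in a row. The function name and the `max_repeats` threshold are illustrative assumptions.

```python
from itertools import groupby


def flag_repeated_runs(segments, max_repeats=2):
    """Split subtitle segments into (kept, flagged).

    `segments` is assumed to be a list of (start, end, text) tuples.
    A run of consecutive segments with identical text (ignoring case and
    surrounding whitespace) longer than `max_repeats` is flagged as a
    likely loop; everything else is kept unchanged.
    """
    kept, flagged = [], []
    # groupby only groups *consecutive* items with the same key,
    # which is exactly the "stuck on repeat" pattern we want to catch.
    for _text, run in groupby(segments, key=lambda s: s[2].strip().lower()):
        run = list(run)
        if len(run) > max_repeats:
            flagged.extend(run)
        else:
            kept.extend(run)
    return kept, flagged


segments = [
    (0.0, 2.0, "Hello"),
    (2.0, 4.0, "Same line"),
    (4.0, 6.0, "Same line"),
    (6.0, 8.0, "Same line"),
    (8.0, 10.0, "Bye"),
]
kept, flagged = flag_repeated_runs(segments)
```

This only catches exact repeats; it says nothing about where the repeated text came from, and a legitimately repeated chorus or chant would be flagged too, so the flagged segments are better reviewed by hand than deleted automatically.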

[–] ClamDrinker@lemmy.world 28 points 11 hours ago (3 children)

Although I'm a firm believer that most AI models should be public domain or open source by default, the premise of "illegally trained LLMs" is flawed. Because there really is no assurance that LLMs currently in use are illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact analyzing publicly viewable information is a pretty deep rooted freedom that provides a lot of positives to the world.

The idea of... well, ideas being copyrightable should shake the boots of anyone in this discussion. Especially since, when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

The underlying technology has more than enough good uses that banning it would simply cause it to flourish wherever it is not banned, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn't tip further in their favor, to the point where AI technology only exists to benefit them.

If the model is built on the corpus of humanity, then humanity should benefit.

[–] barsoap@lemm.ee 5 points 3 hours ago* (last edited 3 hours ago)

As per torrentfreak

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but in an older paper two databases are referenced; “Books1” and “Books2”. The first one contains roughly 63,000 titles and the latter around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

Should be easy to defend against, right-out trivial: OpenAI, just tell us what those Books1 and Books2 databases are. Where you got them from, the licensing contracts with publishers that you signed to give you access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.

...crickets. They pirated the lot of it, otherwise they would already have gotten that case thrown out. It's US startup culture, plain and simple: "move fast and break laws", get lots of money, and use that money to pay the best lawyers to abuse the shit out of the US court system.

[–] patatahooligan@lemmy.world 5 points 4 hours ago (3 children)

the AI companies have a pretty good defense in the fact analyzing publicly viewable information is a pretty deep rooted freedom that provides a lot of positives to the world

They are not "analyzing" the data. They are feeding it into a regurgitating mechanism. There's a big difference. Their defense is only "good" because AI is being misrepresented and misunderstood.

I agree that we shouldn't strive for more strict copyright. We should fight for a much more liberal system. But as long as everyone else has to live by the current copyright laws, we should not let AI companies get away with what they're doing.

[–] ClamDrinker@lemmy.world 0 points 13 minutes ago

They are not “analyzing” the data. They are feeding it into a regurgitating mechanism. There’s a big difference. Their defense is only “good” because AI is being misrepresented and misunderstood.

I really kind of hope you're kidding here. Because this has got to be the most roundabout way of saying they're analyzing the information. Just because you think it does so to regurgitate (which I have yet to see any good evidence for, at least for the larger models), does not change the definition of analyzing. And by doing so you are misrepresenting it and showing you might just have misunderstood it, which is ironic. And doing so does not help the cause of anyone who wishes to reduce the harm from AI, as you are literally giving ammo to people to point to and say you are being irrational about it.

[–] gazter@aussie.zone 2 points 3 hours ago (2 children)

I've never really delved into the AI copyright debate before, so forgive my ignorance on the matter.

I don't understand how an AI reading a bunch of books and rearranging some of those words into a new story is different to a human author reading a bunch of books and rearranging those words into a new story.

Most AI art I've seen has been... unique, to say the least. To me, the results tend to be different enough from the art they were trained on to not be a direct ripoff, so personally I don't see the issue.

[–] catloaf@lemm.ee 1 points 1 hour ago (1 children)

The for-profit large-scale media blender is the problem. When it's a human writing Harry Potter fan fiction, it's fine. When a company sells a tool for you to write thousands of trash "books" for profit, it's a problem.

[–] ClamDrinker@lemmy.world 1 points 26 minutes ago

Which is why the technology itself isn't the issue, but those willing to use it in unethical ways. AI is an invaluable tool to those with limited means, unlike big corporations.

[–] trashgirlfriend@lemmy.world 1 points 2 hours ago (1 children)

ML algorithms aren't capable of producing anything new, they can only ever produce a mishmash of copies of existing works.

If you feed a generative model a bunch of physics research papers, it won't create a new valid physics research paper, just a mishmash of jargon from existing papers.

[–] ClamDrinker@lemmy.world 1 points 19 minutes ago* (last edited 19 minutes ago)

You say it's not capable of producing anything new, but then give an example of it creating something new. You just changed the goal from "new" to "valid" in the next sentence. Looking at AI for "valid" information is silly, but looking at it for "new" information is not. Humans do this kind of information mixing all the time. It's why fan works are a thing, and why most creative people have influences they credit with being where they are today.

Nobody alive today isn't tainted by the ideas they've consumed in copyrighted works, but we do not bat an eye if you use that in a transformative manner. And AI already does this transformation much better than humans do since it's trained on that much more information, diluting the pool of sources, which effectively means less information from a single source is used.

[–] Landless2029@lemmy.world 0 points 2 hours ago

Not to mention patent laws are bullshit.

There are law offices that exist specifically to fuck with people over patent and copyright law.

There are also cases where people use copyright and patent law to hold us back. I can't find the article, but some religious jerk patented connecting a sex toy to a computer via USB. Thankfully someone got around this with Bluetooth and cell phones. Otherwise I imagine the camgirl and LDR market for toys would've been hit with products 10 years sooner.

[–] echodot@feddit.uk 3 points 6 hours ago* (last edited 6 hours ago) (1 children)

Banning AI is out of the question. Even the EU accepts that and they tend to be pretty ban heavy, unlike the US.

But it's important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it's a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students wants to build an AI right now, they can't, because acquiring training data is a legal nightmare, a Twilight Zone of law.

[–] barsoap@lemm.ee 1 points 3 hours ago* (last edited 3 hours ago)

AI is right-out unregulated in the EU unless and until you actually use it for something where it becomes relevant. Then you've got, at the lower end, labelling requirements (if your customer service is an AI chat, say that it's an AI chat), up to heavy, heavy requirements when you use it for stuff like sifting through job applications, where the burden of proof that the AI isn't e.g. racist is on you. Or, for that matter, using it to reject health insurance claims; I think we saw some news lately out of the US about what can happen when you do that.

OpenAI's copyright case isn't really going to make the legal situation any clearer: we already know that using pirated content to train stuff isn't legal, because you're not accessing it legitimately. The case isn't about the "are computers allowed to learn from public sources just as humans are" question.

[–] interdimensionalmeme@lemmy.ml 50 points 15 hours ago (1 children)

It's not punishment. LLMs do not belong to them; they belong to all of humanity. Tear down the enclosing fences.

This is our common heritage, not OpenAI's private property

[–] echodot@feddit.uk 1 points 5 hours ago

It doesn't matter anyway, we still need the big companies to bankroll AI. So it effectively does belong to them whatever we do.

Hopefully at some point people can get the processor requirements to something sane and AI development opens up to us all.

[–] Dkarma@lemmy.world 10 points 11 hours ago

Another clown dick article by someone who knows fuck all about ai

[–] circuitfarmer@lemmy.sdf.org 60 points 17 hours ago (5 children)

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I'm not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

[–] barsoap@lemm.ee 1 points 2 hours ago

We essentially do have the death penalty for corporations, it's called being declared a criminal organisation.

[–] cypherpunks@lemmy.ml 22 points 15 hours ago* (last edited 2 hours ago) (1 children)

"Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data"

I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

I liked the author's earlier very-unlikely-to-be-met-demand activism last year better:

I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

...which at least yielded the amusingly misleading headline "OpenAI ordered to delete ChatGPT over false death claims" (it's technically true: a court didn't order it, but a guy who goes by the name "That One Privacy Guy" while blogging on LinkedIn did).

[–] fmstrat@lemmy.nowsci.com 77 points 18 hours ago (6 children)

So banks will be public domain when they're bailed out with taxpayer funds, too, right?

[–] ArchRecord@lemm.ee 54 points 18 hours ago (2 children)

They should be, but currently it depends on the type of bailout, I suppose.

For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank's assets, and now effectively owns the bank.

[–] just_another_person@lemmy.world 119 points 21 hours ago* (last edited 18 hours ago) (39 children)

It won't really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

[–] MrKurteous@feddit.nu 2 points 7 hours ago

Just a little note about the word "model": in the article it's used in a way that actually includes the weights, and I think this is the usual way of using it! If you change the weights, you get a different model, though the two models will have the same structure.

Anyway, you make good points!
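The point about weights can be illustrated with a toy example. This is only a sketch under simplifying assumptions: the "structure" here is a hypothetical one-variable linear function, nothing like an actual LLM, and the class and helper names are made up for illustration. It shows two models that share a structure but, because their weights differ, are different models that give different outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """Toy model: fixed structure (y = w * x + b) plus its weights w and b."""
    w: float
    b: float

    def predict(self, x: float) -> float:
        return self.w * x + self.b


def same_structure(a, b) -> bool:
    # In this toy setup, "structure" just means "same class".
    return type(a) is type(b)


model_a = LinearModel(w=2.0, b=1.0)
model_b = LinearModel(w=-0.5, b=3.0)

print(same_structure(model_a, model_b))  # same structure
print(model_a == model_b)                # but not the same model
print(model_a.predict(4.0), model_b.predict(4.0))
```

Publishing only the class definition would correspond to releasing the structure without the weights; in this sense, making a "model" public means releasing the values of `w` and `b` as well.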
