OpenAI strikes Reddit deal to train its AI on your posts (www.theverge.com)

submitted 8 months ago by return2ozma@lemmy.world to c/technology@lemmy.world

[–] myliltoehurts@lemm.ee 152 points 8 months ago (2 children)

So they filled reddit with bot generated content, and now they're selling back the same stuff likely to the company who generated most of it.

At what point can we call an AI inbred?

[–] orca@orcas.enjoying.yachts 90 points 8 months ago (2 children)

This is actually a thing. It's called "Model Collapse". You can read about it here.

[–] FaceDeer@fedia.io 21 points 8 months ago (2 children)

"Model collapse" can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.

[–] Ghostalmedia@lemmy.world 15 points 8 months ago (1 children)

A model trained on jokes about bacon, narwhals, and rage comics.

[–] FaceDeer@fedia.io 1 points 8 months ago (2 children)

By "old archives" I mean everything from 2022 and earlier.

[–] BakerBagel@midwest.social 13 points 8 months ago (1 children)

But there were still bots making shit up back then. r/SubredditSimulator was pretty popular for a while, and repost and astroturfing bots were a problem for decades on Reddit.

[–] FaceDeer@fedia.io 1 points 8 months ago

Existing AIs such as ChatGPT were trained in part on that data, so obviously they've got ways to make it work. They filtered out some stuff, for example: the "glitch tokens" such as SolidGoldMagikarp were evidence of that.

[–] Ghostalmedia@lemmy.world 1 points 8 months ago

I SAID RAGE COMICS

[–] mint_tamas@lemmy.world 2 points 8 months ago (1 children)

That paper is yet to be peer reviewed or released. I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?

[–] barsoap@lemm.ee 1 points 8 months ago* (last edited 8 months ago) (1 children)

That paper is yet to be peer reviewed or released.

Never doing either (release as in submit to journal) isn't uncommon in maths, physics, and CS. Not to say that it won't be released but it's not a proper standard to measure papers by.

I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?

Quoth:

If each linear model is instead fit to the generated targets of all the preceding linear models i.e. data accumulate, then the test squared error has a finite upper bound, independent of the number of iterations. This suggests that data accumulation might be a robust solution for mitigating model collapse.

Emphasis on "finite upper bound, independent of the number of iterations" by doing nothing more than keeping the non-synthetic data around each time you ingest new synthetic data. This is an empirical study, so of course it's not proof; you'll have to wait for theorists to have their turn for that one. But it's darn convincing and should henceforth be the null hypothesis.
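For anyone who wants to poke at the quoted claim, here is a toy sketch of "replace" versus "accumulate" for a one-dimensional linear model. It is not the paper's code: the function names, the slope-only model without intercept, and the values of `n`, `sigma`, and the iteration count are all assumptions picked for illustration. Each round fits a slope by least squares and then uses that fitted slope to generate the next round's targets; round one uses real data from the true slope.

```python
import random

def simulate(iterations, n, sigma, accumulate, rng, w_true=1.0):
    """Squared error of the final fitted slope after repeated fit-then-generate rounds."""
    sxx = sxy = 0.0          # least-squares sufficient statistics of the training set
    w_gen = w_true           # round 1 targets come from the true slope (real data)
    for _ in range(iterations):
        bxx = bxy = 0.0      # statistics of the batch generated this round
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)
            y = w_gen * x + rng.gauss(0.0, sigma)
            bxx += x * x
            bxy += x * y
        if accumulate:
            # keep every earlier batch (real and synthetic) in the training set
            sxx += bxx
            sxy += bxy
        else:
            # replace: train only on the newest batch
            sxx, sxy = bxx, bxy
        w_gen = sxy / sxx    # slope-only least-squares fit; generates the next batch
    return (w_gen - w_true) ** 2

def mean_error(accumulate, trials=200, iterations=50, n=20, sigma=1.0, seed=0):
    """Average the final squared error over independent trials."""
    rng = random.Random(seed)
    total = sum(simulate(iterations, n, sigma, accumulate, rng) for _ in range(trials))
    return total / trials

replace_err = mean_error(accumulate=False)
accum_err = mean_error(accumulate=True)
print(f"replace:    {replace_err:.3f}")
print(f"accumulate: {accum_err:.3f}")
```

In this toy setting the replace error behaves like a random walk and keeps growing with the number of rounds, while the accumulate error stays small because round t only nudges the fit by roughly 1/t of that round's noise. That only illustrates the linear-model case from the quote; it says nothing by itself about large language models.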

Btw, did you know that no one ever proved (or at least hadn't last I checked) that reversing, determinising, reversing, and determinising again a DFA minimises it? Not proven yet widely accepted as true, crazy, isn't it? But, wait, no, people actually proved it on a napkin. It's not interesting enough to do a paper about.
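The reverse, determinise, reverse, determinise construction is commonly known as Brzozowski's algorithm, and it is short enough to sketch. The tuple layout `(states, alphabet, delta, starts, finals)` and every function name below are assumptions made for this example, not any library's API. Transitions map `(state, symbol)` to a set of target states, so the same structure can hold both NFAs and DFAs; the example automaton is a deliberately redundant four-state DFA for "strings over a, b that end in a".

```python
from collections import deque

def reverse(nfa):
    """Flip every transition and swap start and final states."""
    states, alphabet, delta, starts, finals = nfa
    rdelta = {}
    for (src, sym), dsts in delta.items():
        for dst in dsts:
            rdelta.setdefault((dst, sym), set()).add(src)
    return states, alphabet, rdelta, set(finals), set(starts)

def determinise(nfa):
    """Subset construction, keeping only subsets reachable from the start set."""
    _, alphabet, delta, starts, finals = nfa
    start = frozenset(starts)
    seen = {start}
    queue = deque([start])
    ddelta = {}
    while queue:
        subset = queue.popleft()
        for sym in alphabet:
            target = frozenset(t for s in subset for t in delta.get((s, sym), ()))
            ddelta[(subset, sym)] = {target}   # the empty set acts as the dead state
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfinals = {s for s in seen if s & finals}
    return seen, alphabet, ddelta, {start}, dfinals

def minimise(dfa):
    """Reverse, determinise, reverse, determinise."""
    return determinise(reverse(determinise(reverse(dfa))))

def accepts(dfa, word):
    """Run a complete DFA (single start state, single-target transitions) on a word."""
    _, _, delta, starts, finals = dfa
    (state,) = starts
    for sym in word:
        (state,) = delta[(state, sym)]
    return state in finals

# Redundant DFA: states 0/2 are equivalent, and so are 1/3.
redundant = (
    {0, 1, 2, 3},
    "ab",
    {
        (0, "a"): {1}, (0, "b"): {2},
        (1, "a"): {1}, (1, "b"): {2},
        (2, "a"): {3}, (2, "b"): {0},
        (3, "a"): {3}, (3, "b"): {0},
    },
    {0},
    {1, 3},
)
minimal = minimise(redundant)
print(len(minimal[0]))  # number of states in the minimised DFA
```

The first reverse-and-determinise pass yields a DFA for the reversed language ("starts with a"), and the second pass brings it back to the original language with equivalent states merged.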

[–] mint_tamas@lemmy.world 2 points 8 months ago (1 children)

Peer review, for all its flaws, is a good minimum before a paper is worth taking seriously.

In your original comment you said that model collapse can be easily avoided with this technique, which is notably different from it being mitigated. I’m not saying that these findings are not useful, just that you are overselling them a bit with this wording.

[–] barsoap@lemm.ee 0 points 8 months ago

It was someone different who said that. There's a chance the authors might've gotten some claim wrong because their maths and/or methodology is shoddy, but it's a large and diverse set of authors, so that's unlikely. Fraud in CS empirics is generally unheard of; I mean, what are you going to do when challenged, claim that the dog ate the program you ran to generate the data? There are shenanigans amounting to the equivalent of p-hacking, especially in papers from commercial actors trying to sell stuff, but that's not the case here, either.

CS academics generally submit papers to journals more because of publish or perish than the additional value formal peer review offers. It's on the internet, after all. By all means, if you spot something in the paper that's wrong then be right on the internet.

[–] noodlejetski@lemm.ee 5 points 8 months ago

I prefer "Habsburg AI".

[–] restingboredface@sh.itjust.works 18 points 8 months ago

I wonder if OpenAI or any of the other firms have thought to put in any kind of stipulations about monitoring and moderating Reddit content to reduce AI-generated posts and the risk of model collapse.

Anybody who's looked at Reddit in the past 2 years especially has seen the impact of AI pretty clearly. If I was running OpenAI I wouldn't want that crap contaminating my models.