this post was submitted on 22 Aug 2023
762 points (95.7% liked)


OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling's Harry Potter series

A new research paper laid out ways in which AI developers should try and avoid showing LLMs have been trained on copyrighted material.

[–] assassin_aragorn@lemmy.world 1 points 1 year ago (1 children)

I think they probably have criteria for what's used to train it, but they don't keep a list of what material was used. I believe they've said in the past they don't have that information.

Another thing with AI: these models fall apart after a few generations of being trained on AI-generated content. If they have no way of telling whether content is AI-generated or not, they're sitting on a ticking time bomb. At some point the models will heavily degrade in quality because of it. The question, I guess, is what % of the training material can be AI-generated before it causes problems.

This does mean, however, that AI-generated material can never become a substantial % of all the content out there. Whenever there's too much, the algorithms will fall apart, and they probably won't recover until that content falls back below a certain %.
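
To make the degradation idea a bit more concrete, here is a toy sketch. It is nothing like how an LLM is actually trained; it assumes the "model" is just a normal distribution fitted to its training set, and the function name, sample size and generation count are all made up for illustration. Each generation is fitted to a mix of its predecessor's output and fresh "human" data, and the fitted spread is tracked over time.

```python
import random
import statistics


def simulate_generations(ai_fraction, generations=500, sample_size=20, seed=0):
    """Toy stand-in for training a model on earlier models' output.

    Each generation fits a normal distribution (mean, std) to its training
    set. The next training set mixes samples drawn from that fit (the
    "AI-generated" share) with fresh samples from the true N(0, 1)
    distribution (the "human" share). Returns the fitted std per generation.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n_ai = round(ai_fraction * sample_size)

    # Generation 0 is trained on purely "human" data.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # population std of this training set
        stds.append(sigma)

        synthetic = [rng.gauss(mu, sigma) for _ in range(n_ai)]
        human = [rng.gauss(0.0, 1.0) for _ in range(sample_size - n_ai)]
        data = synthetic + human
    return stds


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.0):
        final_std = simulate_generations(fraction)[-1]
        print(f"AI fraction {fraction:.0%}: final fitted std = {final_std:.3g}")
```

In this toy setup, a model fed only its own output loses almost all of its spread over a few hundred generations, while mixing fresh data back in keeps it from collapsing. It says nothing about where the real threshold sits for actual language models.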

[–] kava@lemmy.world 1 points 1 year ago (1 children)

> but they don’t keep a list of what material was used. I believe they’ve said in the past they don’t have that information.

I will look into this. I feel like that's quite an oversight. Perhaps it's easier to just tell the public otherwise because of legal questions like the ones we are discussing. I would have kept everything in storage so the same data could be used to re-train updated models or what have you.

I think it's an interesting thing you bring up. There will be a sort of split in the corpus of human works: before ~2023 and after ~2023. Work from before that point will be more or less legitimate to use as training data. Everything after it will be tainted.

Honestly the implications go further than that. For one, I don't trust that there is a human behind any comment I see online anymore. Especially in topics and areas that I feel are likely to be astroturfed - like politics.

[–] assassin_aragorn@lemmy.world 1 points 1 year ago

> Perhaps it’s easier to just tell the public otherwise because of the legal questions like we are discussing.

Very possible. I think they don't want to keep things in storage because then they indisputably need to pay for it.

Agreed on the human element too. The Reddit protests were eye-opening for me because of the supposed "pro-Reddit/anti-mod" crowd that showed up as a vocal minority. They popped up out of nowhere, and in some cases they were verified as AI bots.