this post was submitted on 27 May 2024
Technology
No, the truth is it's impossible even then. If the result involves randomness at its most fundamental level, then it's not reliable whatever you do.
Sure, the AI is never going to understand what it's doing or why, but training it on better datasets certainly WILL improve the results.
Garbage in, garbage out.
The problem is that the way they combine things is determined by probability, so even if you train it on the greatest, bestest data, the LLM is still going to hallucinate, because it's combining multiple sources word by word (roughly), guided only by probabilities derived from language, not logic.
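To make the "word by word, guided only by probabilities" point concrete, here is a rough sketch using a toy bigram model. This is an assumption-heavy stand-in: a real LLM is vastly more complex, and the corpus and function names below are made up for illustration. Every training sentence is true, yet sampling can still stitch together false ones.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus: every sentence in it is true.
CORPUS = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "rome is the capital of italy",
]

def train_bigrams(sentences):
    """Count which word follows which; <s> and </s> mark sentence boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, rng, max_words=20):
    """Sample one word at a time, weighted by how often it followed the previous word."""
    word, out = "<s>", []
    for _ in range(max_words):
        followers = counts[word]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

model = train_bigrams(CORPUS)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(model, rng) for _ in range(200)]

# Sentences the model produced that never appeared in the training data.
invented = sorted(set(samples) - set(CORPUS))
print(invented)
```

The model only knows that a country name tends to follow "of", not which country goes with which city, so it can pair a city with the wrong country even though nothing in the data says so.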
Yes, I understand that. But I'm fairly certain the quality of the data will still have a massive influence over how much and how egregiously that happens.
Basically, what I'm saying is, training your AI on a corpus of shitposts instead of factual information seems like a good way to increase the frequency and magnitude of such hallucinations.
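Both sides of this can be illustrated with a small sketch. It again assumes a toy bigram model rather than anything like a real LLM, and the corpora and the `truth_rate` helper are hypothetical. It computes the exact probability that a generated sentence is one of the known facts, first for a clean corpus and then for one diluted with junk.

```python
from collections import Counter, defaultdict

# Hypothetical data: two true statements and some low-quality posts.
FACTS = ["paris is in france", "berlin is in germany"]
SHITPOSTS = ["paris is in germany", "berlin is in narnia", "berlin is in narnia"]

def train(sentences):
    """Count next-word frequencies; <s> and </s> mark sentence boundaries."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def sentence_probability(counts, sentence):
    """Probability that word-by-word sampling yields exactly this sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        total = sum(counts[prev].values())
        if total == 0:
            return 0.0
        p *= counts[prev][nxt] / total
    return p

def truth_rate(corpus):
    """Total probability mass the trained model puts on the true statements."""
    model = train(corpus)
    return sum(sentence_probability(model, fact) for fact in FACTS)

print("clean corpus:", truth_rate(FACTS))
print("with shitposts:", truth_rate(FACTS + SHITPOSTS))
```

In this toy setup the clean corpus still falls well short of a truth rate of 1, and adding the junk pushes it lower, which fits both points: better data helps, but it does not remove the problem.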
Yeah, true.
If you train your LLM exclusively on Nazi literature (to pick a wild example), don't expect it to end up, by chance, making points similar to Marx's Das Kapital.
(Personally, I think what might be really funny - in the sense of laughter-inducing - would be to purposefully train an LLM exclusively on a specific kind of weird material.)
Yeah, I mean that’s basically what GPT4Chan did, which someone else already mentioned ITT.
Basically, this guy took a dataset of several gigabytes worth of archived posts from /pol/ and trained a model on that, then hooked it up to a chatbot and let it loose on the board. You can see the results in this video.
That was hilarious!
Thanks for the link.