this post was submitted on 24 May 2025
1352 points (99.0% liked)

[–] essteeyou@lemmy.world 43 points 15 hours ago (2 children)

This is surely trivial to detect. If the number of pages on a site exceeds some absurdly high threshold, just drop all of that site's data from the training set (the kind of filter sketched below).

It's not like I can afford to compete with OpenAI on bandwidth, and they're already burning through money without a care.
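
As a rough sketch of that cutoff, assuming crawl records arrive as (url, text) pairs; the function name filter_crawl and the threshold value are made up for illustration:

```python
# Hypothetical per-domain page-count filter; the cutoff is arbitrary.
from collections import Counter
from urllib.parse import urlparse

def filter_crawl(records, max_pages_per_domain=1_000_000):
    """Keep only records whose domain stays under the page-count cutoff.

    records: iterable of (url, text) pairs from a crawl.
    """
    records = list(records)
    pages_per_domain = Counter(urlparse(url).netloc for url, _ in records)
    return [
        (url, text)
        for url, text in records
        if pages_per_domain[urlparse(url).netloc] <= max_pages_per_domain
    ]

# Demo with a deliberately tiny cutoff so the oversized "tarpit" domain is dropped:
crawl = [(f"https://tarpit.example/page{i}", "junk") for i in range(10)]
crawl.append(("https://real-site.example/about", "hello"))
print(len(filter_crawl(crawl, max_pages_per_domain=5)))  # -> 1
```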

[–] bane_killgrind@slrpnk.net 23 points 14 hours ago (2 children)

Yeah, sure, but where do you draw the line on normally structured data when your goal is to grab as much as possible?

Markov chains are an amazingly simple way to generate data like this, and with a little extra logic stacked on top the output is going to be indistinguishable from real large data sets.
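
To give a sense of how little it takes, here is a minimal word-level Markov chain generator; the seed string and function names are placeholders, not anything from an actual tarpit:

```python
# Minimal word-level Markov chain generator; SEED_TEXT is placeholder filler.
import random
from collections import defaultdict

SEED_TEXT = (
    "the quick brown fox jumps over the lazy dog and the lazy dog sleeps "
    "while the quick fox runs through the quiet field near the brown barn"
)

def build_chain(text):
    """Map each word to the list of words that follow it in the seed text."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, length=40):
    """Random-walk the chain to produce superficially plausible filler text."""
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:                      # dead end: restart anywhere
            word = random.choice(list(chain))
        else:
            word = random.choice(followers)
        output.append(word)
    return " ".join(output)

if __name__ == "__main__":
    print(generate(build_chain(SEED_TEXT)))
```

Scale the seed corpus up and link the generated pages to one another, and a crawler that doesn't cap per-site depth will happily keep walking.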

[–] Valmond@lemmy.world 18 points 12 hours ago (1 children)

Imagine the staff meeting:

You: we didn't gather any data because it was poisoned.

Corposhill: we collected 120 TB from harry-potter-fantasy-club.il alone!!

Boss: hmm, who am I going to keep...

[–] yetAnotherUser@lemmy.ca 11 points 11 hours ago* (last edited 11 hours ago)

The boss fires both, "replaces" them with AI, and tries to sell the corposhill's dataset to companies whose AIs write generic fantasy novels.

[–] Aux@feddit.uk -1 points 11 hours ago

The AI won't ever see the Markov chains - a trap site like that gets dropped at the crawling stage.

[–] Korhaka@sopuli.xyz 1 points 10 hours ago* (last edited 10 hours ago) (1 children)

You can compress multiple TB of nothing, with the occasional meme thrown in, down to a few MB.
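
For a feel of the ratio, here is a small sketch that gzips a gibibyte of zero bytes in memory; the sizes are illustrative:

```python
# Illustrative only: highly repetitive data compresses by roughly 1000x with gzip.
import gzip
import io

ORIGINAL_SIZE = 1024 ** 3            # 1 GiB of zero bytes
CHUNK = b"\x00" * (1024 * 1024)      # feed it in 1 MiB chunks to limit memory

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    for _ in range(ORIGINAL_SIZE // len(CHUNK)):
        gz.write(CHUNK)

compressed = buf.tell()
print(f"{ORIGINAL_SIZE / 1024**2:.0f} MiB -> {compressed / 1024**2:.2f} MiB "
      f"({ORIGINAL_SIZE / compressed:.0f}x smaller)")
```

Plain gzip tops out around a 1000:1 ratio on data like this, so getting from terabytes down to a few megabytes would take nested archives or filler generated on the fly rather than one flat file.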

[–] essteeyou@lemmy.world 1 points 3 hours ago* (last edited 3 hours ago)

When I deliver it as a response to a request, I can serve the gzipped version if nothing else (sketched below). But to get to the point where I'm actually poisoning an AI, I'm assuming it's going to take gigabytes of data transfer that I pay for.

At best I'm adding to the power consumption of AI.

I wonder, can I serve it ads and get paid?
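
Going back to the gzipped-response point above, here is a minimal sketch of serving a pre-compressed payload with Content-Encoding: gzip, using only Python's standard library; the payload size, port, and handler name are invented for the example:

```python
# Hypothetical tarpit endpoint: pre-compress a large repetitive payload once,
# then serve the small compressed body with "Content-Encoding: gzip" so the
# expansion cost lands on the client that chooses to decompress it.
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = gzip.compress(b"\x00" * (100 * 1024 * 1024))  # ~100 MiB -> ~100 KiB

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A politer version would check the request's Accept-Encoding header
        # before claiming the body is gzip-encoded.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()
```

The bytes the server actually pays to send are just len(PAYLOAD); the blow-up happens on the client side after it decompresses.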