Reddit signs $60M contract allowing AI company to train its models on the social media platform's content (www.reuters.com)

submitted 9 months ago by minnix@lemux.minnix.dev to c/technology@beehaw.org

[–] lemmyingly@lemm.ee 4 points 9 months ago (1 children)

If an instance is defederated, the owners can just spin up a new instance.

I always think of what you've said about Lemmy whenever people start talking about how Lemmy is more privacy-focused than Reddit.

As one of your replies said, hundreds or even thousands of people have a copy of your data on Lemmy: the instance owners. If you decide you've shared too much information, you end up having to ask every owner to delete that nugget of information, and realistically there is nothing to enforce it. This is one benefit of the walled garden of places like Reddit because they are legally obligated to delete the information especially in places like the EU.

[–] SorteKanin@feddit.dk 4 points 9 months ago (1 children)

This is one benefit of the walled garden of places like Reddit because they are legally obligated to delete the information especially in places like the EU.

In theory yes, but anyone can also scrape Reddit for all its posts and comments (and someone likely is), and nobody is making them delete the data. And then there's stuff like the Internet Archive complicating things further.

[–] lemmyingly@lemm.ee 1 points 9 months ago (3 children)

While it's true that anyone can scrape data off Reddit, I think it's more of a pain: before the API updates, the rate limit was 2 API calls per second. You also have to find or create a scraper. With Lemmy, you follow the instructions (copy and paste) on join-lemmy.org to create your instance and you're done. With both methods you have to configure it to subscribe to communities, so they're about the same on that front.
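For a rough sense of what that rate limit means in practice, here is a back-of-the-envelope sketch. Only the 2 calls per second figure comes from the comment above; the 100 items per call and the one million comments in the usage line are assumptions picked purely for illustration, and `scrape_seconds` is a made-up helper name.

```python
import math


def scrape_seconds(total_items: int,
                   items_per_call: int = 100,      # assumed page size, not from the thread
                   calls_per_second: float = 2.0   # rate limit quoted in the comment
                   ) -> float:
    """Lower bound on the time needed to page through total_items under a rate limit."""
    calls = math.ceil(total_items / items_per_call)
    return calls / calls_per_second


# Hypothetical workload: one million comments.
seconds = scrape_seconds(1_000_000)
print(f"{seconds / 3600:.1f} hours")
```

This ignores network latency, retries and the work of writing the scraper itself, so treat it as a floor rather than an estimate.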

In the EU at least there is a right to be forgotten, so yeah, Reddit and other platforms are forced to delete the data on request. I'm not sure how the same can be applied to a distributed network like Lemmy.

There were publicly available archives of Reddit. The last time I checked, you couldn't find the latest submissions and comments. Maybe things have changed, maybe newer alternatives have appeared.

[–] tryptaminev@feddit.de 3 points 9 months ago (1 children)

The right to be forgotten only applies to personal information, e.g. information that can be associated with information that could be used to identify you.

Since you usually have an email for signup, that would make the data fall under personal information. But Reddit could just delete the email address and your user name and show something like:

[deleted]
When does the Narwhal bacon?
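A minimal sketch of that kind of anonymisation, assuming a toy `Comment` record with `author`, `email` and `body` fields. These names are hypothetical and say nothing about how Reddit actually stores or deletes data.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Comment:
    # Hypothetical record layout, for illustration only.
    author: Optional[str]
    email: Optional[str]
    body: str


def anonymize(comment: Comment) -> Comment:
    """Drop the identifying fields but keep the comment text."""
    return replace(comment, author=None, email=None)


def render(comment: Comment) -> str:
    """Show a placeholder where the author name used to be."""
    name = comment.author if comment.author is not None else "[deleted]"
    return f"{name}\n{comment.body}"


original = Comment("some_user", "user@example.com", "When does the Narwhal bacon?")
print(render(anonymize(original)))
```

Note that this only cuts the link between the account and the text. Anything identifying inside the body itself is left untouched.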

And well, it is pretty difficult to find out if, when and where there are backups that still contain your information and could be given to the AI model trainers too. To find these things out, we'd need a precedent case that makes a data protection agency investigate Reddit thoroughly.

[–] lemmyingly@lemm.ee 1 points 9 months ago

They have to delete either all of the data or just the data that associates content with you. The latter applies if the company has a genuine reason to keep the content, which a forum generally does.

If the content cannot be associated with you then does it matter if the content is present on the website?

[–] Kichae@lemmy.ca 2 points 9 months ago (1 children)

Creating a new instance only gets you access to content that users of your instance have subscribed to, and then mostly only content that comes in after subscription (I believe Lemmy primes the pump a bit on community subs, pulling in a handful of posts at the time of discovery, but discovery is done by users). So, there's a limit on what you can scrape with your own private instance, and you're taking a bit of a bet on which communities will yield what you're looking for in the future.

It'd be easier and more reliable to just crawl the network and scrape it the old-fashioned way.

[–] lemmyingly@lemm.ee 1 points 9 months ago

"If you search for a community first time, 20 posts are fetched initially. Only if at least one user on your instance subscribes to the remote community, will the community send updates to your instance. Updates include:

New posts, comments
Votes
Post, comment edits and deletions
Mod actions"

So you create a single user and subscribe to all communities of interest.
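The behaviour quoted above can be modelled with a toy sketch. This is not Lemmy's code or API: `LocalInstance` and its methods are hypothetical names, and treating the 20 initial posts as the most recent ones is my assumption, since the quote doesn't say which 20 are fetched.

```python
from dataclasses import dataclass, field

INITIAL_FETCH = 20  # posts fetched on first search, per the quoted text


@dataclass
class LocalInstance:
    """Toy model of what one instance stores for one remote community."""
    subscribers: set = field(default_factory=set)
    stored_posts: list = field(default_factory=list)

    def discover(self, remote_posts: list) -> None:
        # First search: only an initial batch is pulled in
        # (assumed here to be the most recent posts).
        self.stored_posts = list(remote_posts[-INITIAL_FETCH:])

    def subscribe(self, user: str) -> None:
        self.subscribers.add(user)

    def receive_update(self, post: str) -> bool:
        # Updates only arrive if at least one local user subscribes.
        if not self.subscribers:
            return False
        self.stored_posts.append(post)
        return True


history = [f"old-{i}" for i in range(500)]  # made-up backlog
inst = LocalInstance()
inst.discover(history)
inst.receive_update("new-1")  # not stored, nobody is subscribed yet
inst.subscribe("scraper")
inst.receive_update("new-2")  # stored
print(len(inst.stored_posts))
```

In this model a single subscribed user is enough to receive everything new, but the older backlog never arrives, which is the limit Kichae describes.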

I probably downplayed the difficulty you'll run into setting up a Lemmy instance if you do something out of order or don't quite have the host set up correctly. Although I do think it's easier than pigging about with web crawlers.