this post was submitted on 02 Apr 2025
710 points (99.0% liked)

Technology

[–] Mubelotix@jlai.lu 31 points 6 days ago (4 children)

Doesn't make any sense. Why would you crawl Wikipedia when you can just download a dump as a torrent?

[–] ChairmanMeow@programming.dev 25 points 6 days ago

AI bros aren't that smart.

[–] mke@programming.dev 6 points 6 days ago* (last edited 6 days ago)

Apparently the dump doesn't include media, though there's ongoing discussion within wikimedia about changing that. It also seems likely to me that AI scrapers don't care about externalizing costs onto others if it might mean a competitive advantage (e.g. most recent data, not having to spend time and resources developing dedicated ingestion systems for specific sites).

I want to stress this: it's not that "tech bros" are just stupid (even though a lot of them are revoltingly unappreciative of the giants whose shoulders they stand on); it's that they don't care.

[–] cabillaud@lemmy.world 3 points 6 days ago (1 children)

To have the most recent data?

[–] umbraroze@slrpnk.net 4 points 6 days ago

To just have the most recent data within reasonable time frame is one thing. AI companies are like "I must have every single article within 5 minutes they get updated, or I'll throw my pacifier out of the pram". No regard for the considerations of the source sites.

[–] Kolrami@lemmy.world 2 points 6 days ago

There's a chance this isn't being done by someone who only wants Wikipedia's data. As the number of websites you scrape increases, your desire to use the easy tools loses out to creating the most general tool that can look at most webpages.

[–] prototype_g2@lemmy.ml 19 points 6 days ago* (last edited 6 days ago) (1 children)

Feel like this belongs in !fuck_ai@lemmy.world

Think I should cross-post?

[–] Tea@programming.dev 4 points 6 days ago (1 children)
[–] Abrinoxus@thelemmy.club 18 points 6 days ago

These fucking companies... downloading a torrent of Anna's Archive but crawling Wikipedia. Scourge of mankind.

[–] DicJacobus@lemmy.world 12 points 6 days ago

When I imagine a future with AI ruining the world, I always thought it was going to be some Skynet/CABAL/HAL9000 type of thing

Not this sad, boring, depressing type shit

[–] thisbenzingring@lemmy.sdf.org 199 points 1 week ago (4 children)

what assholes .. just fucking download the full package and quit hitting the URL

[–] XTL@sopuli.xyz 1 points 4 days ago

Scraper bots don't read instructions, they just follow links

[–] cm0002@lemmy.world 112 points 1 week ago (2 children)

Right‽ This is ridiculously stupid when you can download the entirety of Wikipedia in a single package and parse it to your heart's desire
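
For anyone wondering what "parse it" might look like in practice, here is a minimal sketch that streams pages out of a MediaWiki-style XML export instead of fetching them over HTTP. The function name `iter_pages`, the tiny inline `SAMPLE`, and the namespace version in it are assumptions made for illustration; a real dump is far larger and would typically be opened with something like `bz2.open(path, "rb")`.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs one page at a time from an XML export."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's subtree to limit memory use


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Example</title><revision><text>Hello, world.</text></revision></page>
  <page><title>Second</title><revision><text /></revision></page>
</mediawiki>"""

for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, len(wikitext))
```

One local pass over the file like this puts no load on the live site at all.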

[–] TheTechnician27@lemmy.world 87 points 1 week ago* (last edited 1 week ago)

Not only that, but we make it goddamn trivial for not just Wikipedia but for other Wikimedia projects. Doing this is just stealing, without the attribution and share-alike that the CC BY-SA 4.0 license demands, and then on top of that kicking down the ladder for people who actually want to use Wikimedia and not the hallucinatory slop they're trying to supplant it with. LLM companies have caused incalculable damage to critical thinking, the open web, the copyleft movement, and the climate.

[–] ChaoticCookie@sh.itjust.works 19 points 1 week ago

Yay interrobang :D

[–] Glitchvid@lemmy.world 32 points 1 week ago

The amount of stupid AI scraping behavior I see even on my small websites is ridiculous, they'll endlessly pound identical pages as fast as possible over an entire week, apparently not even checking if the contents changed. Probably some vibe coded shit that barely functions.
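
For reference, "checking if the contents changed" is cheap to do with HTTP conditional requests. Below is a minimal sketch of the client-side bookkeeping, with no actual networking; `CachedPage`, `conditional_headers`, and `apply_response` are hypothetical names, and it assumes the server sends `ETag` or `Last-Modified` headers and that the caller passes response headers with exactly that capitalization.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedPage:
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedPage]) -> Dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def apply_response(cached: Optional[CachedPage], status: int,
                   headers: Dict[str, str], body: bytes) -> CachedPage:
    """Keep the cached copy on 304; otherwise store the new body and validators."""
    if status == 304 and cached is not None:
        return cached
    return CachedPage(body=body,
                      etag=headers.get("ETag"),
                      last_modified=headers.get("Last-Modified"))
```

A crawler that sends these headers gets back an empty 304 response for unchanged pages instead of the full page every time.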

[–] krigo666@lemmy.world 120 points 1 week ago (4 children)

Laws should be passed in all countries requiring AI crawlers to request permission before crawling any target site. I have no pity for AI "thieves" that get their models poisoned. F...ing plague, as if the adware and spyware weren't enough...

[–] catloaf@lemm.ee 22 points 1 week ago (3 children)

An HTTP request is a request. Servers are free to rate limit or deny access
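
As a rough illustration of the rate-limiting side, here is a minimal token-bucket sketch. The class name `TokenBucket` and its parameters are made up for this example, and a real server would keep one bucket per client key (for instance per IP or per API token) and answer with HTTP 429 when `allow()` returns False.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The optional `now` argument exists only so the behaviour can be tested with a fake clock.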

[–] grysbok@lemmy.sdf.org 2 points 6 days ago (1 children)

Bots lie about who they are, ignore robots.txt, and come from a gazillion different IPs.

[–] catloaf@lemm.ee 2 points 6 days ago

That's what ddos protection is for.

[–] FaceDeer@fedia.io 20 points 1 week ago (2 children)

And Wikimedia, in particular, is all about publishing data under open licenses. They want the data to be downloaded and used by others. That's what it's for.

[–] cupcakezealot@lemmy.blahaj.zone 84 points 1 week ago (2 children)

wikipedia should install ai mazes on their servers

[–] KumaSudosa@feddit.dk 12 points 6 days ago (1 children)

Not in this case, to be fair. The only concern is cost - since Wiki wouldn't be opposed to them getting their actual data - and AI mazes are designed to safeguard more sensitive data, not to reduce cost

[–] upandatom@lemmy.dbzer0.com 2 points 6 days ago (1 children)

Nice analysis. Need more smart people like you in the world

[–] KumaSudosa@feddit.dk 3 points 6 days ago

I agree with that assessment!

[–] inbeesee@lemmy.world 1 points 6 days ago (1 children)
[–] skulblaka@sh.itjust.works 11 points 6 days ago (1 children)

Nepenthes does about the same thing but isn't managed by a corp.

[–] brognak@lemm.ee 4 points 6 days ago

There's also Anubis, but it uses proof of work not a maze.

https://anubis.techaro.lol/
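
To show the general idea behind proof of work, here is a hashcash-style sketch: the client must find a nonce whose hash has enough leading zero bits, which is expensive to find but cheap for the server to check. This is an illustration of the technique only, not Anubis's actual scheme; the challenge format, the function names, and the difficulty value are all assumptions.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until one meets the difficulty."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


nonce = solve("demo-challenge", difficulty=12)
print(nonce, verify("demo-challenge", nonce, 12))
```

Each extra bit of difficulty roughly doubles the expected work for the client, while verification stays at one hash.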

[–] cdkg@lemm.ee 52 points 1 week ago (8 children)
[–] theblips@lemm.ee 3 points 6 days ago

I don't know about stopping entirely. I built a pretty cool RAG system for internal use in my company, it very much facilitates navigating very large amounts of text data.

[–] andybytes@programming.dev 20 points 1 week ago (7 children)

I still struggle with a use case for artificial intelligence in my own life. I play around with it all and I'm just like, it doesn't do a good job. Also, I think humanity is missing the plot, you know? Like, we don't need government. If government isn't going to do government. Government serves the people, not corporations. Or at least it should. I don't know, I think we're entering in times. At some point, I think people will pray for nuclear war, because life will be so miserable. That it would be better than just to end it all.

[–] infinitesunrise@slrpnk.net 6 points 6 days ago

AI has niches but they're exactly that: Niches. Small duct tape tasks for fudging over "hard problems" where manual code would result in a worse outcome and take far more time. Little esoteric problem spaces, which notably don't actually require you to use several states worth of electrical power training on a 50PB dataset of anime titties.

An example: I have a name generator in my game that strings together several consonant+vowel phoneme pairs into a name. This means that the names are always pronounceable, but often the spelling looks really unintuitive. Eg Joosiffe, which the player would likely pronounce as Joseph. However, the leap we do in our head between those two spellings is a process of declassifying phonemes and then re-classifying phonemes, and is actually a "hard problem" from a coding perspective due to the unintuitive, multifarious complexities of written, spoken, and conceptualized human language. Adding this step to my name generator in code would be a project of its own, larger than the game itself, and wouldn't ever work nearly as well as it needed to. But relatively small (30MB) AI models that do this with something like 99.8% satisfaction already exist. They didn't require a data center's worth of resources to train, and since they're academic projects they have licenses that allow them to be used for free in a game.
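
A minimal sketch of the first half of that (the consonant+vowel stringing, not the re-spelling step) could look like the following. The phoneme lists and the function name `generate_name` are invented for illustration and are not taken from the actual game.

```python
import random
from typing import Optional

# Hypothetical phoneme inventories; a real game would tune these.
CONSONANTS = ["b", "d", "f", "j", "k", "l", "m", "n", "r", "s", "t", "v", "z"]
VOWELS = ["a", "e", "i", "o", "u", "oo", "ee"]


def generate_name(syllables: int = 3, rng: Optional[random.Random] = None) -> str:
    """Join consonant+vowel pairs so the result is always pronounceable."""
    rng = rng if rng is not None else random.Random()
    parts = [rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(syllables)]
    return "".join(parts).capitalize()


print(generate_name(3, random.Random(42)))
```

Passing a seeded `random.Random` makes the output reproducible, which helps if names need to stay stable across saves.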

[–] Bytemeister@lemmy.world -1 points 6 days ago

Actual AI?

Imagine your phone knows that you have a business meeting downtown today. It's already reserved a parking space for you, set your car to warm up before you leave and looped your contact in on your ETA, along with automatically notifying you of any delays. Then, your kid wakes up this morning with a horrible toothache, you ask your phone what to do and it rings up your family dentist, who has a full schedule today, but makes you a referral nearby. You agree to try that other dentist today, and your AI books an appointment, checks your meeting today, coordinates with their AIs and approves a 15 minute delay so you can get to the dentist. It also notifies your kid's school of their absence and has their teachers' AI automatically queued up to send transcripts, notes and homework assignments from today's classes.

That's the kind of stuff actual AI can do. Overgrown autocorrect? It's basically a multi-billion dollar Magic Eightball.

[–] collapse_already@lemmy.ml 41 points 1 week ago (8 children)

And the quality of the AI output sucks. I was recently looking for information about positive convention for yaw, pitch, and roll in aircraft. I was looking at az and yaw and got reasonable results from the AI, but when I looked at pitch and el all of the results were about elevator pitches. Even when I spelled out elevation it insisted on elevator pitches. I scroll past the AI results as a matter of principle, but I usually look at them so I have something specific to complain about when people ask why I am so virulently anti-AI.

[–] Telorand@reddthat.com 22 points 1 week ago (4 children)

AI: The "pen that can write in zero gravity" when pencils exist.

[–] Alk@sh.itjust.works 44 points 1 week ago (6 children)

Well I get the analogy, but also I think they didn't use pencils because of the graphite and complications with filtering air or something.

[–] catloaf@lemm.ee 28 points 1 week ago (3 children)

Graphite is conductive. A short circuit and fire are Very Bad.

[–] andybytes@programming.dev 21 points 1 week ago (2 children)

This is an example of corporate terrorism sponsored by our own government. Elon Musk loves to see himself as the villain in Ready Player One. And this is not a joke you can look it up. Big tech is waging war against American citizens, and no longer do we have any control of our government, and the Democrats will not save us. The electoral processes will not save us. This is just hard for some people to accept, that's why things have to fall apart before they get a clue. Unfortunately, those that are wiser are going to feel the flames first.
