This is an automated archive made by the Lemmit Bot.
The original was posted on /r/datahoarder by /u/GreenTeaBD on 2024-10-27 08:56:26+00:00.
Hi everyone, I wanted to bounce an idea off people and see how/if this could work.
I think we're getting close to the point where storage is cheap enough for individuals to keep their own copies of the old Internet as archived on the Wayback Machine. Not soon-soon, but 5 to 10 years maybe? At least if we chop it up into a few chunks. I've been seeing stories around here about people expecting HDD capacity to make some real jumps soon, so who knows?
The Wayback Machine is huge, 99+ PB. Looking at their data, in 2023 they had 735 billion pages archived. Obviously there's no practical way for everyone to have all of that, but in earlier years the number is a lot smaller. In 2003 they had only 11 billion pages archived, and that jumps to 30 billion in 2004. That 2003/2004 point also seems like a good (though somewhat arbitrary) line to draw in the sand for "old Internet" vs "new Internet" (or at least "can be mirrored by a normal person maybe sometime soon" Internet vs "can't" Internet). I might be wrong here, but 2003/2004 feels like about the time everyone started getting broadband and the Internet changed drastically.
That's not the whole picture either: pre-broadband websites were much smaller. Low-res images and a whole lot less JavaScript and other stuff kept pages small, maybe 50KB to 100KB a page. They had to be, since anything more was brutal over dial-up. The Internet itself was a lot smaller, too.
So, take 2003's 11 billion pages and assume 100KB a page (a dangerous assumption, but it's all the data I have to work with, and this is a rough estimate): 11 billion × 100KB comes out to about 1.1PB for the total Wayback Machine archive of the old Internet.
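For anyone who wants to play with the numbers, here's a rough sketch of that back-of-the-envelope math in Python. The page count and the 50KB to 100KB per-page range come from the post; the function name and the choice of decimal units (1KB = 1,000 bytes, 1PB = 10^15 bytes) are just assumptions for illustration, and the result is only as good as the average page size guess.

```python
# Rough size estimate for the "old Internet" slice of the Wayback Machine.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 PB = 10**15 bytes).
KB = 1_000
PB = 1_000 ** 5


def archive_size_pb(pages, avg_page_kb):
    """Return the estimated archive size in petabytes."""
    return pages * avg_page_kb * KB / PB


# 11 billion pages archived as of 2003 (figure quoted in the post).
pages_2003 = 11_000_000_000

# Guessed average page size range for pre-broadband sites: 50KB to 100KB.
low = archive_size_pb(pages_2003, 50)
high = archive_size_pb(pages_2003, 100)

print(f"Estimated size: {low:.2f} PB to {high:.2f} PB")
```

Swapping in a different page count or average page size is the easy way to see how sensitive the estimate is to that 100KB guess.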
So, what do I want to do here? 1.1PB is still a lot, I'm at 120TB right now... But that feels reachable soon enough. I worry about the Internet Archive dying sometime, maybe not soon but in the future. Who knows what could happen. The old Internet is important to me, it's our digital heritage. It needs to be kept safe.
Does anyone think it would be possible to make this a shareable archive, in the future, so that the old internet can be downloaded as one big chunk, shared among everyone who feels like having it, and therefore be more safely preserved?
I think obviously it can, but the big problem is: would archive.org go along with this? I doubt they would be happy with me, as just some guy, blasting the whole archive and scraping everything from '96 to 2003. But if this were a coordinated project with the goal of further preservation in mind, would they go along with it? I've seen some people associated with IA post around here, so if they have any input I'd be interested in it, or if they could correct my estimates.
Would people even be interested in this? I am, but I'm an incredibly weird guy so who knows. I'm not thinking of this as a project to start now but we'll see where storage technology goes in the coming years.
I gotta admit, I also thought of this whole thing because I use theoldnet's proxy in my emulated Windows 98 SE P100 install and thought it would be cool as hell to have a local mirror that's insanely fast, or just something to poke through for hours and make more searchable.