Collection of stories about useful scraper robots

This community collects stories about robots harvesting data from the net (typically the web) and performing a beneficial service.

1

In Belgium, real estate listings mostly omit addresses. This is extremely annoying for consumers looking to buy or rent, because they are forced to engage with the landlord or seller just to learn the address. It wastes a lot of time: you must register on a site and disclose your email address, then wait for someone to reply with the address (and often they don't, or they insist on a phone call so they can hear your voice, which can go badly if you don't speak the local language)†.

The published listings tend to disclose only the approximate neighborhood the dwelling is in (useless for my needs, because you have no way of knowing whether it's near a tram stop you actually use). But there are exceptions: maybe 5-10% of listings include an address. I decided to ignore the majority and consider only those with an address. That meant that, to get a decent number of choices, I had to scrape every real estate site covering my city and harvest just the listings with addresses.
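For flavor, here's roughly the shape of those per-site scrapers. Everything concrete in it (the URL, the CSS selectors) is made up; every real portal needed its own selectors, which is also why these tools rot so quickly.

```python
import requests
from bs4 import BeautifulSoup

def scrape_listings(search_url, card_sel, address_sel, price_sel):
    """Fetch one results page and keep only the listings that publish a street address."""
    html = requests.get(search_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for card in soup.select(card_sel):
        address = card.select_one(address_sel)
        if address is None:
            continue  # the ~90-95% of listings without an address get dropped here
        price = card.select_one(price_sel)
        link = card.find("a")
        kept.append({
            "address": address.get_text(strip=True),
            "price": price.get_text(strip=True) if price else None,
            "url": link["href"] if link else search_url,
        })
    return kept

# Hypothetical site; the selectors differ on every portal and break whenever the markup changes.
listings = scrape_listings(
    "https://example-immo.example/search?city=mycity",
    card_sel="article.listing",
    address_sel=".address",
    price_sel=".price",
)
```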

Then I used a geocoding API to convert the addresses to GPS coordinates. From there, I scraped the public transport websites: for every geocoded address, the tool grabbed all weekday public transport routes, including trams and transfer times. It then added the walking time at both ends, to and from the stops on each route, to derive the shortest door-to-door time.
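A minimal sketch of that step, assuming geopy's Nominatim geocoder (the post doesn't say which geocoding API was used) and assuming the scraped transit data has already been boiled down to per-route ride, transfer, and walking minutes:

```python
from geopy.geocoders import Nominatim

# Assumption: Nominatim; any geocoder that turns "street, city" into (lat, lon) would do.
geocoder = Nominatim(user_agent="flat-hunter")

def to_coords(address):
    """Convert a street address into (lat, lon), or None if geocoding fails."""
    loc = geocoder.geocode(address)
    return (loc.latitude, loc.longitude) if loc else None

def door_to_door_minutes(route):
    """Walk to the first stop, ride (including transfers), walk from the last stop."""
    return route["walk_to_min"] + route["ride_min"] + route["transfer_min"] + route["walk_from_min"]

def best_commute(routes):
    """Shortest door-to-door time over all scraped weekday routes for one dwelling."""
    return min(door_to_door_minutes(r) for r in routes)
```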

I also wanted to be within a certain cycling time of the city center, to make sure I didn't end up too far out. That was calculated with an API.
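The post only says "an API". One option is an OSRM-style routing server, sketched below; the server URL, its bicycle profile, and the city-center coordinates are all assumptions, not what was actually used.

```python
import requests

# Assumptions: an OSRM instance with a bike profile, and placeholder center coordinates.
OSRM_BASE = "https://osrm.example/route/v1/bike"
CITY_CENTER = (50.0, 4.0)  # (lat, lon) of whatever counts as "the center"

def cycling_minutes_from_center(lat, lon):
    """Bike travel time in minutes from the city center to a dwelling."""
    url = f"{OSRM_BASE}/{CITY_CENTER[1]},{CITY_CENTER[0]};{lon},{lat}"
    data = requests.get(url, params={"overview": "false"}, timeout=30).json()
    return data["routes"][0]["duration"] / 60  # OSRM reports seconds

def close_enough_to_center(lat, lon, max_minutes=20):
    return cycling_minutes_from_center(lat, lon) <= max_minutes
```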

The tool also applied the usual filters, like budget. I ended up selecting the dwelling with the shortest commute that still satisfied the proximity-to-center constraint.
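The selection itself then boils down to a filter plus a min(); a sketch, assuming each candidate carries the numbers computed above:

```python
def pick_dwelling(candidates, max_rent, max_center_bike_min):
    """Apply the ordinary filters, then take the shortest commute among what's left."""
    eligible = [
        c for c in candidates
        if c["rent"] <= max_rent and c["center_bike_min"] <= max_center_bike_min
    ]
    return min(eligible, key=lambda c: c["commute_min"]) if eligible else None
```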

The only problem with my approach was that one listing used a fake address. My tool trusted the addresses, and some jackass published bogus info that led me to a place that was occupied and unavailable. When I called to ask "where are you?", he said "down the street.. I gave an address that was close but incorrect".. WTF. It was far enough off to screw up the public transport option.

Anyway, this would have been impossible without scraping all those websites. I had freedom and power that's denied to every other consumer trapped in the UIs of the real estate sites. But by the next time I need a dwelling, the tool will certainly be broken, given how rapidly websites change and how increasingly anti-bot they have become. I think I built that tool during the last moment when the web was still relatively open access.

Everyone is generally forced to look for a place close to work. But closeness in straight-line distance does not translate into a short tram commute, because the routes are chaotic: you could be fairly close yet need 2 or 3 transfers. One interesting thing I noticed was that a dwelling on the complete opposite side of the city was reasonable, because it was near a train station and needed no transfers. Trains are the fastest, with far fewer stops, and there are also express buses (fewer stops) and normal buses. So intuition is too inaccurate.

† The point of contact is often a real estate agent or property manager with many listings. So if you call or write to ask for the addresses of many listings, the same person sees all your requests and ignores every one of them, because they assume you are not serious. They think: what kind of person looks all over the place.. surely they only want one or two neighborhoods. So this bullshit blocks consumers from searching for a place to live in a way that accounts for public transport schedules. They want to force you to choose where to live based on everything other than the address.

2

I would never use the typical kind of shared bike you can just leave anywhere, because AFAIK those are exclusively for Google pawns. But the kind with stations doesn't need an app. So I scraped all the bicycle station locations into a db and used an OpenStreetMap-based API to grab the elevation of each station. If the destination station was at a higher elevation than the source station, my lazy ass would take the tram. Hey, gimme a break.. these shared bikes are heavy as fuck because they're built to take abuse from the general public.
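A sketch of the downhill check, with the Open-Elevation lookup endpoint standing in for "an OpenStreetMap-based API" (an assumption; the post doesn't say which elevation service was actually used):

```python
import requests

# Assumption: Open-Elevation; any service mapping (lat, lon) to metres above sea level works.
ELEVATION_API = "https://api.open-elevation.com/api/v1/lookup"

def elevation(lat, lon):
    """Elevation in metres for one point."""
    resp = requests.get(ELEVATION_API, params={"locations": f"{lat},{lon}"}, timeout=30)
    return resp.json()["results"][0]["elevation"]

def bike_or_tram(src_station, dst_station):
    """Take the heavy shared bike only if the destination isn't uphill from the source."""
    return "tram" if elevation(*dst_station) > elevation(*src_station) else "bike"

# Stations are (lat, lon) tuples pulled from the scraped station db; values here are placeholders.
print(bike_or_tram((50.85, 4.35), (50.84, 4.36)))
```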

It was fun just cruising these muscle bikes downhill. I was probably a big contributor to the surplus of bicycles at low elevations and the shortages up high. The bike org later introduced a policy of giving a bonus credit to anyone who parks at a high station, to incentivize more people to ride uphill.

3

I recall an inspirational story about a woman who tried many dating sites, all of which lacked the filters and features she needed to find the right guy. So she wrote a scraper bot to harvest profiles, plus software that narrowed down the selection and proposed a candidate. She ended up marrying him.

It's a great story. I have no link ATM and my search came up dry, but I did find this:

https://www.ted.com/talks/amy_webb_how_i_hacked_online_dating/transcript?subtitle=en

I can’t watch videos right now. It could even be the right story but I can’t verify.

I wonder if she made a version 2.0 that periodically scrapes new profiles and checks whether her husband re-appears on a dating site, alerting her to the anomaly.

Anyway, the point of this new community is to showcase beneficial bots and to move people off the flawed idea that all bots are malicious. We need more advocacy for beneficial bots.