this post was submitted on 19 Nov 2023
32 points (88.1% liked)

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ

submitted 11 months ago* (last edited 11 months ago) by Set8@lemmy.world to c/piracy@lemmy.dbzer0.com
 

Around a month ago I posted a poll on this sub asking for feedback on a coomer.su and kemono.su scraper I've been developing, and this post is an update to share where development is going.

For anybody unaware, I have been working on a scraper that lets you mass-download posts from creators on both kemono and coomer. Bulk downloading is not a built-in feature of their websites, which I found to be somewhat stupid, so I set out to create my own tool.
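The mass-download idea can be sketched as offset-based pagination over a creator's post list. This is an illustrative Python sketch, not the project's actual C# code, and the `fetch_page` URL shape (`?o=<offset>`) is an assumption about how the sites paginate, not their documented API:

```python
import json
from urllib.request import Request, urlopen

def fetch_all_posts(fetch_page, page_size=50):
    """Collect every post for a creator by walking offset-based pages.

    `fetch_page(offset)` must return a list of post dicts; an empty
    list signals the end of the creator's feed.
    """
    posts, offset = [], 0
    while True:
        page = fetch_page(offset)
        if not page:
            break
        posts.extend(page)
        offset += page_size
    return posts

def make_http_fetcher(base_url):
    """Build a fetch_page hitting `base_url?o=<offset>` (hypothetical URL shape)."""
    def fetch_page(offset):
        req = Request(f"{base_url}?o={offset}",
                      headers={"User-Agent": "example-scraper/0.1"})
        with urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))
    return fetch_page
```

Separating the paging loop from the HTTP fetcher also makes the loop trivially testable with a fake page source.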

In my previous post, I talked about the basic features the scraping software would have, and many people pointed out that similar software already exists for this. After taking a look at the software provided to me, I felt it did not meet my expectations and quality standards, so I continued forward with this project.

The major driving factor of this scraper is the built-in translator integrated directly into the codebase, which allows post titles and descriptions to be seamlessly translated as they are scraped, courtesy of Google Translate. The feature has exceeded my expectations. The only downside is Google's rate limit, which can kick in if you translate too many words; in practice this only happens with post descriptions and takes upwards of 1,000 words to trigger, so I feel it is acceptable in its current state. For now there is a toggle in the code for translating post descriptions, which defaults to off, and I may add automatic service switching in the future. The translator lets speakers of any language scrape the PartySites, which is invaluable if your language isn't widely used on the sites.
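A minimal sketch of that translation step (again in Python for illustration): `translate` stands in for whatever Google Translate client the project actually wires in, the description toggle defaults to off, and the 1,000-word cutoff mirrors the rate-limit observation above:

```python
def translate_post(post, translate, max_words=1000, include_description=False):
    """Return a copy of `post` with its title (and optionally its
    description) run through `translate`, any callable str -> str.

    Descriptions are skipped by default, and very long descriptions
    are left untranslated to stay under the service's rate limit.
    """
    out = dict(post)
    out["title"] = translate(post["title"])
    desc = post.get("description", "")
    if include_description and desc and len(desc.split()) <= max_words:
        out["description"] = translate(desc)
    return out
```

Because `translate` is just a callable, the same logic works unchanged if a different translation service is swapped in later.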

I've also ported the codebase over to a C# .NET 6 class library for developers, allowing them to create their own scraping software if desired. The project currently has an attached GUI that I am working on refining for the general public.

As I've stated before, the concept of this project is extremely simple (the codebase itself compiles to a meager 18 KB excluding libraries), so it surprises me that nobody has built this to an acceptable standard yet.

I plan to release this scraper in the coming weeks, once some bugs are sorted out and, possibly, Discord support is added.

Please let me know what you'd like to see in this, as feedback is always appreciated.

all 11 comments
[–] anzo@programming.dev 8 points 11 months ago (1 children)

What about gallery-dl? It's on GitHub and apparently supports both websites

[–] Set8@lemmy.world 4 points 11 months ago

I have taken a look at it in the past; it is extremely bare-bones and does not have translation support, which is a major feature of this project

Also, I've noticed that with bulk projects like gallery-dl that support a massive number of websites, individual sites can often be neglected simply because there are too many to manage

[–] arr@lemmy.dbzer0.com 7 points 11 months ago* (last edited 11 months ago) (1 children)

Have you asked the operator of those sites if they are fine with this?

It would be a shame if they were to be taken down because of people scraping the site causing too much traffic costs or something.

[–] Set8@lemmy.world 1 points 11 months ago (1 children)

Hello,

I had this same thought, and as I've stated in the original post, when this goes public the creators are more than welcome to shoot me a message on GitHub and I'd happily remove it.

This project, however, keeps HTTP requests to a minimum and isn't very different from a normal user browsing the website. The only real load is on their CDN server, which is probably designed for high-traffic environments.

Out of respect for the developers, I can also modify the user agent of the HTTP requests so they could filter them based specifically on this application if that's an approach they'd be okay with.
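Setting a distinctive User-Agent like that is a one-liner with Python's stdlib; a sketch for illustration (the agent string and URL are hypothetical, not the project's real identifier):

```python
from urllib.request import Request

# Hypothetical identifier; a real one would point at the actual repo.
APP_UA = "PartyScraper/1.0 (+https://example.invalid/repo)"

def build_request(url):
    """Create a request tagged with a distinctive User-Agent so site
    operators can identify (and, if they choose, filter or rate-limit)
    this scraper's traffic specifically."""
    return Request(url, headers={"User-Agent": APP_UA})
```

Server-side, the operators could then match that token in their access logs or reverse-proxy rules without affecting normal browser traffic.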

[–] arr@lemmy.dbzer0.com 13 points 11 months ago (1 children)

when this goes public the creators are more then welcome to shoot me a message on GitHub and I’d happily remove it.

The only real load cost is on their CDN server which is probably designed for high traffic environments.

I can also modify the user agent of the HTTP requests so they could filter them based specifically on this application if that’s an approach they’d be okay with.

Why not just message them at their contact email address and ask in advance whether your assumption about their CDN server is true, whether you should set a specific user agent, etc.? Then they wouldn't have to waste time figuring out what's happening, writing and deploying filtering/rate-limiting logic, or finding the repository and contacting you on GitHub.

[–] Set8@lemmy.world 5 points 11 months ago

You do have a point. I'll look into this.

[–] CJOtheReal@ani.social 2 points 11 months ago

I would like to see that.

[–] MigratingtoLemmy@lemmy.world 0 points 11 months ago (1 children)

Could you take a look at deepl.com's API? It's supposed to be better than Google Translate for European languages

[–] Set8@lemmy.world 2 points 11 months ago (1 children)

DeepL is a paid API unfortunately.

[–] MigratingtoLemmy@lemmy.world 1 points 11 months ago