- cross-posted to:
- technology@beehaw.org
- technology@lemmy.world
Fail2ban should add all those scraper IPs, and we need to just flat out block them. Or send them to those mazes. Or redirect them to themselves lol
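For anyone who wants to roll their own ban list before handing it to Fail2ban or a firewall, a rough sketch of the idea; the log path and threshold are placeholders, and a real Fail2ban jail would do this with filters and regexes rather than a script like this:

```python
# Hypothetical sketch: scan an access log for IPs making an unreasonable
# number of requests and print them, one per line, so they can be fed to
# fail2ban, ipset, or a firewall script. Path and threshold are made up.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # assumption: default nginx log location
THRESHOLD = 1000                         # requests per log file; tune for your traffic

ip_re = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = ip_re.match(line)
        if match:
            counts[match.group(1)] += 1

# most_common() is sorted descending, so stop at the first IP under the threshold.
for ip, hits in counts.most_common():
    if hits < THRESHOLD:
        break
    print(ip)
```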
Assuming we could build a new internet from the ground up, what would be the solution? IPFS for load-balancing?
AI scraping is so cancerous. I host a public RedLib instance (redlib.nadeko.net), and due to BingBot and Amazon bots my instance was always rate limited because the number of requests they make is insane. What makes me even angrier is that these fucking fuckers use free, privacy-respecting services to access Reddit and scrape it. THEY CAN'T BE SO GREEDY. Hopefully, blocking their user-agent works fine ;)
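For what it's worth, user-agent blocking at its simplest is just substring matching. A toy sketch of the check (the bot names below are examples, not a complete list, and a dishonest scraper can spoof its user-agent anyway):

```python
# Rough illustration of user-agent blocking, the approach mentioned above.
# The substrings are examples of crawler names, not a verified or complete list.
BLOCKED_AGENTS = ("bingbot", "amazonbot", "gptbot", "ccbot")

def should_block(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a listed scraper."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BLOCKED_AGENTS)

# Example: in a web app or reverse-proxy hook, reject matches with a 403.
assert should_block("Mozilla/5.0 (compatible; bingbot/2.0)")
assert not should_block("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Firefox/128.0")
```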
I too read Drew DeVault’s article the other day and I’m still wondering how the hell these companies have access to “tens of thousands” of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?
deleted by creator
There are residential IP providers that sell scraping operations access to thousands of IPs from the same ranges as real users. They route traffic through these IPs via malware, hacked routers, "free" VPN clients, etc. If you block the IP range for one of these addresses, you'll also block real users.
Yep, it hit many Lemmy servers as well, including mine. I had to block multiple Alibaba subnets to get things back to normal. But I'm expecting the next spam wave.
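If anyone wants to script the same kind of subnet check, a minimal sketch; the CIDR ranges below are placeholders rather than verified Alibaba allocations, so look the real ones up in whois/RIR data before blocking anything:

```python
# Sketch of subnet-level blocking. The CIDR ranges are placeholders only.
from ipaddress import ip_address, ip_network

BLOCKED_SUBNETS = [ip_network(cidr) for cidr in ("47.74.0.0/16", "8.208.0.0/16")]

def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside any blocked subnet."""
    ip = ip_address(addr)
    return any(ip in subnet for subnet in BLOCKED_SUBNETS)

print(is_blocked("47.74.1.2"))    # True with the placeholder ranges above
print(is_blocked("203.0.113.7"))  # False (TEST-NET-3 documentation range)
```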
ELI5 why the AI companies can't just clone the git repos and do all the slicing and dicing (running `git blame` etc.) locally instead of running expensive queries on the projects' servers?
Too many people overestimate the actual capabilities of these companies.
I really do not like saying this because it lacks a lot of nuance, but 90% of programmers are not skilled in their profession. This is not to say they are stupid (though they likely are, see cat-v/harmful), but they do not care about efficiency or gracefulness, as long as the job gets done.
You assume they are using source control (which is unironically unlikely), you assume they know they can run a server locally (which I pray they do), and you assume their deadlines allow them to think about actual solutions to problems (which they probably don't).
Yes, they get paid a lot of money. But that does not say much about skill in an age of apathy and lawlessness.
Because that would cost them money; "abusing" someone else's infrastructure is much cheaper.
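To make the "just clone it once" idea from the question above concrete, a rough sketch (the repo URL and file name are placeholders): after the initial clone, every `git blame` hits local disk instead of the project's servers.

```python
# Sketch: mirror a repo locally once, then do all the slicing and dicing there.
import subprocess
from pathlib import Path

REPO_URL = "https://example.org/some/project.git"  # placeholder URL
CLONE_DIR = Path("project-mirror")

if not CLONE_DIR.exists():
    subprocess.run(["git", "clone", REPO_URL, str(CLONE_DIR)], check=True)
else:
    subprocess.run(["git", "-C", str(CLONE_DIR), "fetch", "--all"], check=True)

# Any amount of blame/log analysis now runs against the local copy,
# not the upstream web UI. README.md is assumed to exist in the repo.
blame = subprocess.run(
    ["git", "-C", str(CLONE_DIR), "blame", "README.md"],
    capture_output=True, text=True, check=True,
)
print(blame.stdout[:500])
```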
It’s also a huge problem for library/archive/museum websites. We try so hard to make data available to everyone, then some rude bots come along and bring the site down. Adding more resources just uses more resources–the bots expand to fill the container.
Removed by mod
Nepenthes
It's a Markov-chain-based text generator, which could be difficult for people to implement on repos depending on how they're hosting them. Regardless, any sensibly built crawler will have rate limits. This means that although Nepenthes is an interesting thought exercise, it's only going to do anything to crawlers knocked together by people who haven't thought about it, not the big companies with the real resources that are likely having the biggest impact.
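As a rough illustration of the Markov-chain idea (this is a toy, not the actual Nepenthes code): build a table of which words follow which, then walk it to emit endless plausible-looking junk for a crawler to chew on.

```python
# Toy Markov-chain babbler in the spirit of a scraper tarpit.
import random
from collections import defaultdict

def build_chain(corpus: str) -> dict[str, list[str]]:
    """Map each word to the words that follow it in the corpus."""
    words = corpus.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def babble(chain: dict[str, list[str]], length: int = 50) -> str:
    """Generate plausible-looking nonsense by walking the chain."""
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        word = random.choice(chain.get(word) or list(chain))
        out.append(word)
    return " ".join(out)

corpus = "the quick brown fox jumps over the lazy dog and the quick dog naps"
print(babble(build_chain(corpus)))
```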
Removed by mod
Any way of slowing things down or wasting resources is a gain, I guess.