Great for the fediverse! I suspect that these changes to Twitter and Reddit are mainly a response to the growing hunger of generative artificial intelligence companies that are hoovering up data, basically for free. Change is never easy, but I’m optimistic that this is the break open source and federated communities needed to start taking off. I hope people can see the value in decentralizing and help support these open source projects financially so that they can really start to scale. The reality is, scaling is expensive, and we all need to help where we can. These AI companies will not hesitate to suck up federated data too. If we want to live in an ad-free world, it’s gonna cost us.
Question: Can’t AI companies just as easily hoover up language content from the fediverse? Or is it something that we just kind of accept but don’t care about since it isn’t eating into fediverse finances?
yeah, fediverse platforms not only have no measures against scraping, they willingly send out content in a computer-readable way. kind of the whole point of federation. and we can’t really stop them: even if we clamp down on federation, we’d only hurt ourselves.
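to make the "computer-readable" bit concrete, here's a rough sketch, assuming a post is served as an ActivityStreams-style JSON object. the sample object, its example.social URLs and the `extract_text` helper are all made up for illustration, not taken from any real instance:

```python
import json
import re
from html import unescape

# Hypothetical example object, shaped like an ActivityStreams "Note".
SAMPLE_NOTE = """
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Note",
  "id": "https://example.social/notes/1",
  "attributedTo": "https://example.social/users/alice",
  "content": "<p>Hello, &amp; welcome to the fediverse!</p>"
}
"""


def extract_text(raw: str) -> str:
    """Return the plain text of a Note-like JSON object."""
    obj = json.loads(raw)
    if obj.get("type") != "Note":
        raise ValueError("not a Note object")
    # Strip HTML tags first, then decode entities such as &amp;
    no_tags = re.sub(r"<[^>]+>", "", obj.get("content", ""))
    return unescape(no_tags).strip()


print(extract_text(SAMPLE_NOTE))
```

no html layout to reverse-engineer, no scripts to run: the text is one `json.loads` away.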
besides, up until the latest change twitter was still easy to scrape (and now the problem is that even registered users can’t see that much of it), and reddit is trivial to scrape even without the api. yes, that includes new reddit too. there’s very little you can do against scraping in an open space, especially against someone wielding the full power of chatgpt, and even less so if you want to keep your site accessible to blind people.
People actually already found a way around the rate limit. Opera GX even implemented a fix in their desktop browser.
lmao, you know you fucked up when a browser pushes an update specifically to circumvent your rate limits
but yeah, if opera can do it, i highly doubt that openai can’t easily do it either. the ai concerns are posturing (and probably a personal grudge, given that elon was a co-founder of openai until he left its board in 2018); the real issue is somewhere between incompetence and attempted monetization.
For Reddit, API calls put near infinitely less load on the servers than scraping.
i’m actually kinda interested in how that could work. a regular user using “near infinitely less” resources than a scraping engine sounds like some absolutely stupid design, either on reddit’s or the scraping engine’s side
When using the API you just request what you’re looking for. With scraping you load everything repeatedly.
except most of the weight of the site is in easily cacheable assets that don’t get reloaded at all. probably not even loaded to begin with, since even though new reddit is a single-page app, it does have seed data in the html content itself, which a well-written scraper (or one that automatically parses the site with chatgpt) can easily extract. constantly reloading styles and scripts would be a ridiculously stupid design on the scraper’s part, and on reddit’s if they necessitated it.
the html page itself is slightly heavier than just the json data, but compared to all the images and videos real clients load and the giant piles of tracking data being sent back every second, a scraper is def going to be lighter. plus the site does a full reload every time you enter a new subreddit; that navigation doesn’t go through the api for some reason.
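here's roughly what i mean by pulling the seed data out of the html, as a minimal sketch. the page below is invented: the `seed-data` script id, the json shape and the asset paths are assumptions for illustration, not reddit's actual markup.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the script id "seed-data" and the JSON shape are
# invented for this sketch, not Reddit's actual markup.
SAMPLE_HTML = """
<html><head><link rel="stylesheet" href="/big.css"></head>
<body>
<div id="app"></div>
<script id="seed-data" type="application/json">
{"posts": [{"title": "first post", "score": 12}, {"title": "second post", "score": 3}]}
</script>
<script src="/huge-bundle.js"></script>
</body></html>
"""


class SeedDataParser(HTMLParser):
    """Collects the text inside <script id="seed-data">, ignores the rest."""

    def __init__(self):
        super().__init__()
        self._in_seed = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "seed-data":
            self._in_seed = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_seed = False

    def handle_data(self, data):
        if self._in_seed:
            self.chunks.append(data)


def extract_seed(html: str) -> dict:
    """Parse the page once and return the embedded JSON as a dict."""
    parser = SeedDataParser()
    parser.feed(html)
    parser.close()
    return json.loads("".join(parser.chunks))


seed = extract_seed(SAMPLE_HTML)
print([post["title"] for post in seed["posts"]])
```

note that the scraper only ever fetches the one html document. the stylesheet and the script bundle are referenced in the page but never requested.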
I would assume it’s even worse for the fediverse considering the limited resources we have to run the servers. I wonder how the devs/server owners will handle this.
I don’t think (completely wild guess here) AI content crawlers should have any more impact than the dozens and dozens of search spiders that make up most of my own site’s traffic.
The impact was magnified for Twitter because it generates so much new content every second. That wasn’t an issue when Twitter had a nice, properly cached API and it shouldn’t be an issue for fediverse instances either because we have RSS and caching and we’re not so stupid as to turn those off. Like, what kind of moron would do that?
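A minimal sketch of why a cached feed stays cheap even under heavy crawling, assuming the server answers conditional requests with ETags. The `make_etag` and `respond` helpers and the five-minute `max-age` are hypothetical, not taken from any actual fediverse server:

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    """Derive a validator from the response body (ETags are quoted strings)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]):
    """Return (status, headers, payload) for a GET of a cached feed."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=300"}
    if if_none_match == etag:
        # The client (or crawler) already holds this version: send no body.
        return 304, headers, b""
    return 200, headers, body


feed = b"<rss><channel><item>hello</item></channel></rss>"

# First fetch: full body plus a validator.
status, headers, payload = respond(feed, None)
print(status, len(payload))

# Repeat fetch with the validator: empty 304 response.
status2, _, payload2 = respond(feed, headers["ETag"])
print(status2, len(payload2))
```

A crawler that polls the same feed repeatedly costs the server a hash comparison and a few header bytes, as long as the content has not changed.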
The issue comes when those AI bots start commenting and posting here. From what I understand, bots are a large reason why Beehaw keeps defederating from instances with open registration: bots are difficult to moderate without good moderation tools.
to be fair, that latter argument about the magnified effect on twitter operates under the assumption that elon wasn’t just lying to cover up that he didn’t pay his google cloud bill. the users who view and create that content still put a much higher load on the servers than AI scrapers that want to read it once and save it somewhere for training