345: Scrape or Be Scraped

Arvid:

This is just proper implementation, you know, from a developer perspective, but from a business perspective I wouldn't need to do this. I could just, you know, barrage these servers. Hey, I'm Arvid and you're listening to The Bootstrapped Founder. With my podcast data-scanning business PodScan, I'm constantly scraping the web for terabytes of audio and metadata, but now I find myself in a wild cat-and-mouse game trying to protect my own valuable data from the very aggressive AI companies out there doing the exact same thing. It's a bizarre tightrope walk.

Arvid:

I need data to be freely available for my business, but I'm also setting up defenses against scrapers, all while wondering if I could turn those digital intruders into paying customers. Welcome to the weird world of web scraping in the AI age. PodScan, like I said, ingests terabytes of audio data every month, hundreds of gigabytes a day, so that's quite a bit of data. I'm constantly checking millions of RSS feeds several times a day, and I'm gathering data from public APIs and websites that have additional data I need to enrich all the information on PodScan, both in the web interface and the API.

Arvid:

And you could say that, for most intents and purposes, I am scraping the web. I mean, I'm using certain interfaces that are meant to provide this information, but there are other things that someone might consider to be scraping. And up until a couple of months ago, this seemed to me like a perfectly fine thing to do. Then all the big AI platforms started scraping the web in ways that I had never seen before, and that changed something in me. These platforms often disregard rules put in place to protect the content owned by the people who run the servers being scraped, and the traffic they create can leave website operators paying thousands of dollars in bandwidth bills before they can stop it. That is generally considered to be a lot of damage.

Arvid:

OpenAI and Anthropic and all the big players are doing a lot of scraping, and I kinda get that, right? You need data for training your large language models. And if you're competing against another company that scrapes a lot, that pulls in data from all over the Internet because they're trying to build the next big model that knows even more than the one before, I get that you would be very aggressive about trying to ingest more and more data. It's a competitive necessity in many ways. And this access to data, I think at least, has always been part of the Internet.

Arvid:

Any publicly available piece of data is pretty much fair game for people who download it without impeding the service it's available through. And recently, there have been several interesting lawsuits decided in favor of scrapers, against companies that tried to prevent scraping of their publicly available platforms. This is not somebody hacking into a system and stealing data. This is just somebody running a script over a big platform or directory and collecting all the information on it. Obviously, these platforms don't like it, for the reasons I outlined: it costs a lot of money to serve that traffic, and the data gets used by other people in ways you may not have intended. But it's out there; that's part of the web.

Arvid:

And if you zoom out a bit, there's another thing that happened very recently: the legal back-and-forth around the Internet Archive and its effort to scan books and make them available on the Internet. That shines yet another light on one of the core features of the web that both scraping and making stuff available are part of. To me, the Internet is a gigantic copy machine. From the early days, every interaction on the Internet was one of duplicating data, and you do this every day when you go to a website: your browser requests a copy of the site from a server and transports it over to your computer.

Arvid:

And that mindset was so prevalent when it came to making things accessible in the beginning that everything was kinda made accessible. The tools that were built were meant to mirror other websites. If you look at wget, one of the basic commands used on the command line on many Linux machines, on Macs, and wherever, there is a -m option, a mirror option, built into even the earliest versions of that tool. Mirroring a website, following all the links on it and copying everything into a structure that you can browse locally or host again, that is part of the idea of the Internet. At least it was in the beginning. Then the corporate world took over and intellectual property challenges were finally tackled, but both sides kinda tackled them.

Arvid:

Right? It wasn't just the companies building things like DRM systems that made their way into mainstream applications and browsers; tools like LimeWire and Napster and BitTorrent also came to the front and allowed people to share, going around these restrictions. Restriction has been fighting with distribution ever since, and open standards that exist to this day are still battling protectionism and suffocating data exchanges. I feel torn on this issue, because to me, the public availability of data, and being allowed to download it, is central to the business functionality of PodScan.

Arvid:

If you take a larger view here, the whole podcast ecosystem is built on this. It's all built on RSS feeds that are publicly available and open for downloading. So it has to be open. It has to be scrapable, or downloadable if you wanna use another term, for it to even work. It's an interesting field of tension that I've only recently come to understand as I've been working in this space.

Arvid:

And I find myself mired in this dilemma because I both need the data to be available out there, and I wanna make sure that I don't pull more data than necessary. I still believe that the Internet should be a place where people use caution and take an approach that is useful to all involved parties, an approach that is mutually beneficial. But that's where the whole AI company situation comes in. These businesses don't temper their aggression when it comes to collecting data, and aggression is really what I would call it.

Arvid:

There was an article I read about somebody who, within a couple of days, had a bot from a single LLM provider scrape tens of terabytes of data from their website, and that's not okay. Right? That's really not the way you should interact with the free and open servers out there. And these companies understand that the more data they ingest, the more competitive their models become. So they get that part.

Arvid:

It's just that they don't care about your bill, your energy bill, your traffic bill; to them, that does not matter. And consequently, all of this has turned into a cat-and-mouse game where people adjust their robots.txt files to show LLM bots which parts of their websites they're allowed to download, and then crawlers either ignore these rules, or kinda follow them, or create a new user agent to bypass the robots file. It's just back and forth all the time. There are server operators out there who even try to actively slow down AI crawlers with honeypot tools to make sure they don't go on their website or somebody else's website. So it's a fight.
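To make the robots.txt side of that cat-and-mouse game concrete, here's a minimal sketch using Python's standard-library parser. The bot names and rules below are illustrative examples, not any particular site's actual file.

```python
# A minimal sketch of how robots.txt is supposed to gate crawlers,
# using Python's standard-library parser. Rules and bot names are
# illustrative examples only.
from urllib.robotparser import RobotFileParser

# Rules a site might publish to keep an LLM crawler away from its content:
robots_txt = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /transcripts/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks before fetching; an aggressive one simply doesn't.
print(parser.can_fetch("GPTBot", "https://example.com/transcripts/123"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/about"))      # True
```

The catch is that robots.txt is purely advisory: the cat-and-mouse game exists precisely because nothing forces a crawler to run a check like this before fetching.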

Arvid:

It's a war about data. And this situation made me think about what data I wanna make available on PodScan, my business that is built on data, and how I can build it defensively to protect all this valuable and expensive data that I have collected. Because all the transcripts on PodScan, in the database and on the API, have been run through a GPU somewhere, right? One way or another, I paid for every single word in there, and I don't want it to just end up in some random chatbot built on GPT-8.0 or whatever. I don't want this to be out there. So I made a couple of choices, and I protect my data in some ways.

Arvid:

I'm just gonna explain this to you, because it's kinda two sides of a coin here, like this whole conversation really is. The first thing I did was make the choice that there is gonna be no publicly available directory of podcasts on PodScan for the time being. You can't just go to the website and see all podcasts everywhere with all episodes and all transcripts. That does not exist. All data on PodScan is behind a login, an account with a trial period.

Arvid:

So I know that access is limited, right? It prevents anonymous scraping, because people need to sign up, and it allows me to trace and ban suspicious accounts. I even use a Laravel package that checks for throwaway email providers and doesn't allow those people to sign up to my service. And I also actively send verification emails, so you can't just use a random garbage email, log in, do your thing, and then create a new account; that does not work. So I'm trying to prevent this at the source, at the user level, so that I at least know there's a real address behind each account. That might be a person signing up with Apple or some kind of anonymizer, but at least it's not a completely throwaway garbage email.
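As a sketch of that signup check: the real implementation is an unnamed Laravel package, so this Python version, with its made-up domain list and helper names, only illustrates the mechanism.

```python
# Illustrative throwaway-email check at signup. The domain list here is
# a tiny made-up sample; real disposable-provider lists have thousands
# of entries and are updated constantly.
DISPOSABLE_DOMAINS = {"mailinator.com", "10minutemail.com", "guerrillamail.com"}

def is_throwaway(email: str) -> bool:
    """True if the address uses a known disposable-email provider."""
    domain = email.rsplit("@", 1)[-1].lower()
    return domain in DISPOSABLE_DOMAINS

def can_sign_up(email: str) -> bool:
    # Reject throwaway addresses outright; everyone else still has to
    # confirm a verification email before the account becomes usable.
    return not is_throwaway(email)

assert can_sign_up("founder@example.com")
assert not can_sign_up("bot123@mailinator.com")
```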

Arvid:

That way, there's somebody out there I can talk to or find out more about; I can ban the IP or ban the domain or whatever. I have some measurable potential for action in all of this. And this extends to how people use the platform as well, because I also implement quite a bit of rate limiting. I have rate limits on all pages by default, but very strict ones on certain pages of PodScan, which makes it unfeasible for scrapers to try to download the entire database quickly. Right now I have almost 2.5 million podcasts on there, with 12 to 13 million episodes; I didn't check the exact number today.

Arvid:

So that's, all in all, tens of millions of data points, and the rate limit, I think, is 10 requests a minute. It would take over a million minutes to download all of this from one account. So it's really not feasible to try to download the whole thing, and it stops people from abusing the system or even just making the server struggle. That doesn't happen. Rate limiting prevents it from the start.
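For illustration, here's a minimal sliding-window limiter in Python. PodScan is a Laravel application and would use its own throttling middleware, so treat this as a sketch of the mechanism, not the production code.

```python
# A minimal in-memory sliding-window rate limiter: at most MAX_REQUESTS
# per account per minute. Production systems would back this with Redis
# or framework middleware, but the mechanism is the same.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS = 10  # roughly the per-minute limit mentioned above

_hits: dict[str, list[float]] = defaultdict(list)

def allow_request(account_id: str) -> bool:
    """Return False (caller responds with HTTP 429) once an account
    exceeds MAX_REQUESTS within the rolling window."""
    now = time.monotonic()
    recent = [t for t in _hits[account_id] if now - t < WINDOW_SECONDS]
    _hits[account_id] = recent
    if len(recent) >= MAX_REQUESTS:
        return False
    recent.append(now)
    return True
```

The arithmetic checks out: at 10 requests a minute, roughly 15 million data points comes to about 1.5 million minutes, which is close to three years of nonstop requests from a single account.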

Arvid:

And to disincentivize people even more, I'm using encoded IDs all throughout the system, particularly on the API, where it would otherwise be very easy to program something that quickly iterates over numbers. Usually, IDs in databases are sequential numbers: 1, 2, 3, 4, 5, and so on; a new item comes in, it takes the next number. But you don't really wanna expose that on your API, because then people could just write a script that counts from 1 to a million and tries to download every item as fast as it can. I don't want this.

Arvid:

So I'm encoding all my IDs. I use a hash-style encoded version of the IDs that I can deterministically encode and decode for podcasts and episodes, which makes it difficult for scrapers to enumerate content. It's not impossible; people could probably figure out how things are encoded and decoded. But for somebody building a scraper that tries to scrape everything, spending days or weeks to figure this out is not worth it. So between forcing people to log in, rate limiting, and using encoded IDs, that is how I protect my data in the system as it exists.
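A Hashids-style scheme fits that description. Here's a sketch using the hashids Python library; the specific library and the salt are assumptions, since the episode doesn't name the actual implementation.

```python
# Sketch of non-enumerable public IDs with the hashids library
# (pip install hashids). The salt is a placeholder; in practice it
# would be a long secret kept out of source control.
from hashids import Hashids

hashids = Hashids(salt="placeholder-secret-salt", min_length=10)

def encode_id(db_id: int) -> str:
    """Turn a sequential database ID into an opaque public token."""
    return hashids.encode(db_id)

def decode_id(token: str) -> int | None:
    """Reverse the encoding; returns None for tokens we never minted."""
    decoded = hashids.decode(token)
    return decoded[0] if decoded else None

token = encode_id(12345)          # an opaque string, e.g. "B0qjrWzJxN"
assert decode_id(token) == 12345  # round-trips for tokens we issued
```

Unlike sequential numbers, tokens like these can't be walked with a simple counting loop, which is the whole point.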

Arvid:

And on the flip side, as someone who needs to collect data myself, I try to be a good netizen, a citizen of the web, by minimizing repetition, by deduplicating, and by figuring out how to avoid overloading other people's servers. Like I said, I'm checking millions of podcast feeds every day, so I'm trying to be very mindful of what that means. I spread out my requests throughout the day and I use various techniques to reduce the amount of data transferred. Here's what I do; I'm just gonna go through this list to show you what I think web services should be doing the moment they scrape.

Arvid:

First off, I minimize data flow. I have a central queue system for downloading feeds and audio files, both of them, and I spread out my requests to avoid overwhelming servers. I have my own servers to take care of, but I also try to never download anything more than once. So I have a queue, my back-end systems pick one item, process it, and then it never happens again. I then have access to the data for a while, I cache things a lot, and I try not to overwhelm services.
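Here's the shape of that queue in Python: deduplicated URLs plus an ID-based offset that spreads checks across the day (the semi-randomized spreading comes up again a bit further on). Names and the scheduling rule are made up for illustration.

```python
# Sketch of a deduplicating download queue. Each URL is fetched at most
# once per cycle, and each feed gets a stable second-of-day slot derived
# from its ID so that millions of checks spread evenly over 24 hours.
import hashlib

SECONDS_PER_DAY = 86_400

def schedule_offset(feed_id: int) -> int:
    """Hash the feed's database ID into a stable second-of-day offset."""
    digest = hashlib.sha256(str(feed_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % SECONDS_PER_DAY

class DownloadQueue:
    def __init__(self) -> None:
        self._pending: list[str] = []
        self._seen: set[str] = set()

    def enqueue(self, url: str) -> None:
        # Deduplicate: never queue the same URL twice in one cycle.
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def next_url(self) -> str | None:
        """Workers pull from here; an item handed out never re-enters."""
        return self._pending.pop(0) if self._pending else None
```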

Arvid:

Because here's the thing: for any particular podcast, it's fine if I check it once a day. Downloading a feed and checking if there's a new item, that's alright. But the servers I'm accessing are not people's home servers. Right? These are businesses that host podcasts, like Transistor.fm, where this podcast is hosted. And Transistor's servers host hundreds of thousands of podcasts.

Arvid:

So the moment I try to check all of those, which I do, I really have to make sure that I'm not just barraging the service with requests. I spread the checks out throughout the day, semi-randomized by the ID these feeds have in my database, and I try to fetch each one only once. I also try to make sure that if things haven't changed, I don't download the feed at all. For that, I use HTTP features like Last-Modified dates and ETag headers, which let me send a request that effectively says: give me this feed, but only send the full thing if it's been modified since my last check, or if its contents differ from this ETag. That is very helpful. I think this has reduced my overall traffic by 90%, because most podcasts don't release every day, or even every couple of days.

Arvid:

And if you just send those headers in a request, the server can just say: nope, nothing new, and that's all I get. I just get a 304 Not Modified back instead of a 200 with the full file. It's really useful.

Arvid:

It also really reduces the amount of data that I have to download, which goes into my traffic bill, and run through the CPU to parse, which frees up CPU time for other things. And talking about server responses: if I get a whiff of the server being overloaded, which tends to come as a 429 error or a 503, I back off and stop requesting data from that server for a day. That's the idea: okay, for this podcast, I'm not gonna check again for another day; this seems to be too much right now.
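Sketched with Python's requests library, conditional fetching plus backoff looks roughly like this. The header mechanics are standard HTTP; the one-day backoff policy and all names are illustrative.

```python
# Sketch of a polite feed fetch: conditional headers avoid re-downloading
# unchanged feeds, and 429/503 responses trigger a one-day backoff.
import requests

class ServerBusy(Exception):
    """Raised when the server signals overload; caller schedules a retry."""
    def __init__(self, retry_after_seconds: int):
        self.retry_after_seconds = retry_after_seconds

def fetch_feed(url: str, etag: str | None, last_modified: str | None):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag               # "only if contents differ"
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # "only if newer than this"

    response = requests.get(url, headers=headers, timeout=30)

    if response.status_code == 304:
        return None  # nothing new; the whole download was skipped
    if response.status_code in (429, 503):
        raise ServerBusy(retry_after_seconds=86_400)  # don't ask again today

    response.raise_for_status()
    # Remember the validators so the next request can be conditional too.
    return (
        response.content,
        response.headers.get("ETag"),
        response.headers.get("Last-Modified"),
    )
```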

Arvid:

That, to me, is proper API handling. But, you know, it is to me. And finally, there's something that I just wanna share because it's so weird. RSS, the feed, the data feed, which is just XML, has a wonderful specification: you can specify the days and the hours in which somebody should not check your feed.

Arvid:

It's funny, because you specify this in the feed itself. Somebody has to check your feed to read that data, and then not check again during those times. I think this is a consequence of some publishers, some companies in the past, thinking: hey, if we're not gonna be in the office from 6 PM to 6 in the morning, why should anybody update from our feed? Nothing is gonna happen, right? So we're just gonna put the days and the times when we don't want people to access the data in there, and well, hopefully they don't. I support this too in my crawlers, which is funny because so few podcasts use it, but some do.
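Those are the <skipDays> and <skipHours> elements from the RSS 2.0 spec, which defines the hours in GMT. Here's a minimal sketch of honoring them; the feed snippet is made up.

```python
# Sketch of honoring RSS <skipHours>/<skipDays> in a crawler. The feed
# snippet is a made-up example; per the RSS 2.0 spec, hours are in GMT.
import xml.etree.ElementTree as ET

FEED_XML = """
<rss version="2.0"><channel>
  <title>Example Show</title>
  <skipDays><day>Saturday</day><day>Sunday</day></skipDays>
  <skipHours><hour>0</hour><hour>1</hour><hour>2</hour></skipHours>
</channel></rss>
"""

def should_skip(channel: ET.Element, weekday: str, hour_gmt: int) -> bool:
    """True if the publisher asked crawlers not to check at this time."""
    skip_days = {d.text for d in channel.findall("skipDays/day")}
    skip_hours = {int(h.text) for h in channel.findall("skipHours/hour")}
    return weekday in skip_days or hour_gmt in skip_hours

channel = ET.fromstring(FEED_XML).find("channel")
print(should_skip(channel, "Saturday", 14))  # True  (weekend: skip all day)
print(should_skip(channel, "Monday", 1))     # True  (overnight hour)
print(should_skip(channel, "Monday", 14))    # False (fine to check)
```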

Arvid:

And then, you know, it just saves some bandwidth on all of these servers. The thing with all of this is that it's just proper implementation from a developer perspective; from a business perspective, I wouldn't need to do it. I could just barrage these servers, because they're already under some load from all the other tools trying to get those RSS feeds and download those files. It probably wouldn't even register if I didn't bother. But in aggregate, our behavior as developers, as people who build on top of the ecosystem, really matters. So to me this is not just a clean implementation, it's also an ecosystem-friendly implementation.

Arvid:

And I wish this were the case for these massive AI companies. But the more people normalize using AI technology and expect data from all over the web to be part of it, the more of a problem this will be for every founder who wants to create something that people come to for free or as a lead magnet, because these companies are trying to vacuum it all up. It's gonna be a big challenge to balance providing valuable information with protecting your data from being sucked into these AI systems. And it's a challenge either way. Right? Balancing this is always gonna be hard.

Arvid:

Either you protect it too much, so people don't really get to see it, because everything that is presented to a browser is kind of scrapable, and people use browsers to look at stuff: if the browser can't see it, people can't see it. Or you're not protecting it enough, and then it just winds up being part of the next GPT or the next Claude. So it definitely is an issue. But from my perspective as a founder, and this is something I haven't really seen many people talk about, I think there is an interesting business potential in this situation, an upside. It's about what happens when I detect scrapers from various companies; I just recently figured this out.

Arvid:

Well, I can just reach out to them and sell them the data directly, in a way that works for both parties. And this could lead, and hopefully will lead, to business relationships with AI training companies, or companies that host or tune AI systems, that end up being fair relationships where data is exchanged, not just scraped and collected. So what I'm considering right now is implementing features that alert me when new scrapers appear on my website, and maybe even setting up a honeypot, just a directory of the top podcasts, to detect when scrapers start collecting this information, and then acting on it by opening up opportunities to sell data through the PodScan firehose or through daily exports; maybe that's something they would want to ingest into their latest training runs, whatever. The idea is that when they come, I can reach out with an actual business case. So scraping presents interesting avenues, both for gathering information if you're running a software business and for facilitating data ingestion in more mutually beneficial business relationships.
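Since this is explicitly still an idea being considered, here's only a rough sketch of what such a honeypot alert could look like; every name and threshold is hypothetical.

```python
# Hypothetical honeypot alerting: serve a directory page that normal
# users rarely crawl in bulk, count hits per client, and alert once a
# client has enumerated suspiciously many of them.
from collections import Counter
from datetime import datetime, timezone

honeypot_hits: Counter[str] = Counter()

def record_honeypot_hit(client_id: str, alert_threshold: int = 50) -> None:
    """Count honeypot-page visits per client (account, IP, or user agent)
    and alert exactly once when the threshold is crossed."""
    honeypot_hits[client_id] += 1
    if honeypot_hits[client_id] == alert_threshold:
        alert_founder(client_id)

def alert_founder(client_id: str) -> None:
    # In practice: an email or chat ping with the client's details, so
    # outreach with a data-licensing offer (or a ban) can follow.
    stamp = datetime.now(timezone.utc).isoformat()
    print(f"[{stamp}] possible scraper detected: {client_id}")
```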

Arvid:

And for me, with PodScan, I think this is very important, because data is the value in my business. As we navigate this new landscape, and all of us have to navigate it at some point: if you have a public-facing website, people will try to look at it and figure out what's going on; they're probably already scraping it. I think it's critical that we find a balance between protecting the valuable data that keeps our businesses running and leveraging it, in some way or another, for potential business opportunities.

Arvid:

And even when it feels like we're moving in opposite directions at the same time, not wanting to be scraped while scraping ourselves, not wanting to give stuff away but having to give it away, that's just what it is. That, in essence, is the balance that every founder has to strike. And that's it for today. Thank you so much for listening to The Bootstrapped Founder. Really appreciate it.

Arvid:

You can find me on Twitter at arvidkahl, A-R-V-I-D-K-A-H-L. Find my books and my Twitter course there, and if you wanna support me and this show, please tell everyone you know about podscan.fm and its cool podcast alerting features, and leave a rating and a review by going to ratethispodcast.com/founder. It makes a massive difference if you show up there, because then the podcast will show up in other people's feeds, and that makes me really happy. Any of this really helps the show. Thank you so much for listening.

Arvid:

Have a wonderful day, and bye bye.
