346: When Podcasts Attack: The Unexpected Challenges of External Data

Arvid:

And you have to test at true scale. You cannot let yourself be fooled by tests on subset data. You have to find ways to validate your systems against realistic data volumes, and you have to understand that even a small percentage of a large number is itself a large number. Hey, I'm Arvid, and welcome to The Bootstrapped Founder. As founders, we often focus on scaling our businesses in terms of customers or revenue or team size.

Arvid:

But what happens when the data your business relies on scales faster than anything else? That's the challenge that I very recently faced with PodScan, and it nearly brought the entire system to its knees. Nearly. If you're building a business that deals with large volumes of external data, buckle up. This story might save you from some very sleepless nights.

Arvid:

There are 3 crucial lessons before we dive in, all hard-earned learnings from getting things wrong initially. The first one is that observability is king. From day 1, you have to implement robust logging and monitoring in your system. You need to know exactly what it's doing at all times, especially when dealing with data at scale.

Arvid:

Number 2, queuing systems are your best friend. Build systems that can handle pressure without crumbling. Message queues and job workers are essential for managing overwhelming workloads. And number 3, a very hard-learned lesson here, database interactions matter. Be extremely careful with database queries, especially as your data grows.

Arvid:

Simple operations like counting items can become major bottlenecks. Now you can probably imagine that I, a self-proclaimed at best 1x, probably 0.5x developer, ran into each of these issues headfirst while building PodScan, and you would be right. So let me take you through the roller coaster of the past few weeks, where a single overlooked bug cascaded into a full-blown crisis, and how I clawed my way back to stability. PodScan does 3 main things. It ingests data by scanning all podcasts out there for new episodes.

Arvid:

That's number 1. It then downloads and transcribes these episodes. That's number 2. And finally, it makes the information from these transcripts available to users through alerts and APIs and webhooks. So in essence, we have collection, transcription, and distribution.

Arvid:

And I've been juggling these 3 balls since the beginning of PodScan. And for a while, things were running very smoothly. I just built a system that grew over time. We had a system that could handle hundreds of thousands of podcasts, and it was randomly checking throughout the day for new episodes. It kinda worked.

Arvid:

It worked enough. But as we grew, cracks started to appear. The main application server, which was responsible for both podcast checking and serving web requests at that time, began to strain under the load, even as all of the heavy transcription work, the stuff that uses a lot of AI and GPUs, was already delegated to a standalone back-end server fleet. But with millions of podcasts to monitor every single day, the sheer volume was overwhelming our resources on the main server, and I only had one there, really, because, you know, I'm trying to bootstrap this in some ways. It was clear.

Arvid:

We needed a more scalable solution to get this thing going. And 3 weeks ago, I decided to rebuild a critical part of PodScan.fm. The idea was pretty simple: create a dedicated microservice for podcast feed checking and take that out of the main service. That would take the load off our main server and allow us to scale the checking process independently.

Arvid:

Right? Instead of having to cut corners and only check certain feeds once a day or once a week, as I did because I just could not handle more than that, I could then spin up maybe 3 checking servers and check most podcasts several times a day. I was excited for this. It was a good idea. And this new system would be distributed.

Arvid:

It would be running on multiple servers in different locations, and it would be very much optimized for just one task: constantly scouring the web for new podcast episodes by scanning the RSS feeds of those particular podcasts. No more performance impact on our main application. It seemed perfect. So as I started coding, I added some fancy logic to determine the best times and frequencies for checking each feed, and I felt like a true optimization wizard. Little did I know I had just planted a massive time bomb in our system.

Arvid:

Here's where scale becomes a truly wicked problem. When you're testing on your local server with a few hundred or even a few thousand items, everything can look fine when you try it out. If you see roughly the right number of log entries, you assume all is well and your code is working, right? You have this loop that goes over all the podcasts in your system, and it just writes a line to the log file for each one of them, and they all scroll by. Looks like it's enough, but what if you're missing 20% of your checks, or 50%, or even 70%?
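
One cheap safeguard, in hindsight, is to validate coverage explicitly instead of eyeballing scrolling logs: with millions of feeds, silently missing even a few percent means tens of thousands of podcasts going unchecked every day. Here's a minimal sketch of what such a check could look like; the feed list and scheduler are hypothetical stand-ins, not PodScan's actual code.

```python
# Minimal coverage check: make sure every feed shows up in the day's schedule
# at least once, instead of trusting log output that merely "looks busy".
# `all_feed_ids` and `planned_checks_for_day` are hypothetical stand-ins.

def validate_coverage(all_feed_ids, planned_checks_for_day):
    scheduled_ids = {feed_id for feed_id, _planned_time in planned_checks_for_day}
    missing = set(all_feed_ids) - scheduled_ids
    coverage = 1 - len(missing) / len(all_feed_ids)
    print(f"coverage: {coverage:.1%}, missing: {len(missing)} feeds")
    assert not missing, "some feeds would never be checked today"
```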

Arvid:

It's not always obvious when you're dealing with large numbers. And that's exactly what happened. My new scheduling logic had a subtle flaw that caused a substantial number of podcasts to be checked far less frequently than I intended or sometimes not at all. But because we were still ingesting tens of thousands of new episodes every day, I didn't notice the problem when the code was deployed. I was confident in my working system.

Arvid:

Because it looked like it was working, I moved on to other tasks. I spent a week improving our data extraction system, and I was completely oblivious to that ticking time bomb in the system. And it started to show with a few user reports. Some customers noticed that their favorite shows weren't updating as frequently, and most just used the manual update feature as a workaround, and that worked pretty well. You can always go to the UI and schedule a manual check, and it will pick up the feed.

Arvid:

But for one person, that was not enough. One user, Nikita, who runs Master Plan, is also using PodScan to build a medical education platform that offers concise summaries of leading medical podcasts. It's really, really cool what he's building, but that's just an aside. He went above and beyond. He meticulously documented the missing episodes and feeds and presented me with undeniable evidence through our chat system that something was very wrong.

Arvid:

And at first, I thought it was an isolated incident, and I kinda brushed it off. I thought it would figure itself out, but Nikita was relentless and provided regular updates every couple of days on the state of these problematic feeds, with the IDs and the whole list. I know, it was really super helpful. He was probably the most outspoken customer about these issues and extremely helpful with the details. And as that data piled up, I could no longer ignore the truth that our core ingestion system was fundamentally broken in some way.

Arvid:

I didn't know how, but I knew something was amiss. So I dug into the logs, and what I found chilled me to the bone. Feeds that should have been checked multiple times a day, because that's the logic I built, weren't being checked at all. I even built code fragments into my system to provide extra logging just for the IDs of the podcasts that Nikita told me about. Nothing.

Arvid:

The system wasn't even attempting to scan them. My optimized scheduling logic that I was so proud of was failing spectacularly. And one morning, I woke up with a realization. I dreamt of math. That should tell you everything about me.

Arvid:

After so many days of not understanding why things had broken, I had a math dream. So I rushed to my computer, and that really means I woke up, I walked the dog, and I had a coffee. But after that, I rushed to my computer, tested my hypothesis, and felt a mix of both relief and dread wash over me, because I had found the bug. It was a simple math error. Well, simple.

Arvid:

It was a math error that was simple to see in hindsight, in how these scheduled checks were distributed throughout the day. For nearly 2 weeks, we'd been operating at somewhere between 30 and 40% of capacity without realizing it, and that affects the whole system, but I'll get to that in a second. I quickly rewrote the checking logic. I even used Claude AI to help me spot errors that I might have missed this time. I had it actually explain to me, and mathematically prove to me, that the algorithm I was going to implement would capture every single item in the list. That was really useful.
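
The exact algorithm isn't spelled out here, but as a minimal sketch of the idea, one way to spread checks across the day without dropping anything is to bucket feeds deterministically so every feed lands in exactly one slot per cycle. Numbers and names below are illustrative only, not PodScan's actual scheduler.

```python
# Illustrative scheduler: map each feed to exactly one hourly slot per cycle,
# so running every slot once per cycle provably covers the full feed list.

CHECKS_PER_DAY = 4                        # aim: every feed roughly every 6 hours
SLOTS_PER_CYCLE = 24 // CHECKS_PER_DAY    # 6 hourly slots per checking cycle

def feeds_due_in_hour(all_feed_ids, hour_of_day):
    slot = hour_of_day % SLOTS_PER_CYCLE
    # Every feed id falls into exactly one of the SLOTS_PER_CYCLE buckets,
    # so no feed can silently fall through the cracks.
    return [fid for fid in all_feed_ids if fid % SLOTS_PER_CYCLE == slot]
```

The nice property of a scheme like this is that coverage can be proven: the buckets partition the feed list, so cycling through all slots touches every feed.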

Arvid:

And I implemented extensive logging around it, and I deployed it, and I held my breath to see if things would work again. And over the next couple of hours and days, I watched with growing excitement as our system started behaving correctly again. Feeds were being reliably checked every 4 to 6 hours, just as intended. And even Nikita reached out to me to tell me that his alerts were coming in regularly again. But my elation was short-lived at that point.

Arvid:

I had solved one problem only to create a much bigger one. Because remember all those missed podcast episodes from the past 2 weeks that the system didn't catch? Well, they were about to hit our system like a tidal wave. In a single day, we experienced an influx of content that would normally be spread over 2 weeks. Fair.

Arvid:

Right? Because we didn't pick them up for 2 weeks, and now we did. Our carefully provisioned systems, sized for what we thought was normal load, were suddenly overwhelmed. The transcription queues overflowed because of all these new episodes that needed to be transcribed so we could check them for alerts. Then the data extraction services buckled under the strain, because for every new transcription that comes into the system, data needs to be extracted, and that needs to happen at a certain scale.

Arvid:

And then the alert system fell behind, and that frustrated our users and me, because now all of this is kinda queued up and you don't get a message for your podcast, like, minutes after it's released; you get it hours, maybe a day after it's released. And even though that's fine, it's not really critical, it is not what I promised. Right? It was like watching a series of dominoes fall. Each part of our pipeline that had been running smoothly at partial capacity now faced this 400 to 500% increase in workload because things just piled up.

Arvid:

And here's the kicker. Unlike many SaaS businesses that scale with paying customers, PodScan's core ingestion work is relatively static. We aim to process all podcasts out there regardless of our customer count. Doesn't really matter how many people use PodScan. We have 2,500,000 podcasts to ingest every single day.

Arvid:

This means we can't simply throttle or delay processing. The value prop that PodScan offers depends on comprehensive, timely coverage. Most software businesses have the luxury of being able to scale their operations along a metric that they can influence. Either it's the number of customers that they allow into the system or the number of projects that they allow their customers to have, maybe even the files that are hosted on a platform, but it's always a number that somewhat increases in tandem with the business itself. But the moment you work with external data, and I would roughly define this as data that is being created by others but desired by your customers, you run into scaling problems very quickly, like the one that I've been telling you about.

Arvid:

My friends over at Fathom Analytics can probably tell a few stories about this as well. They are bootstrapping a Google Analytics alternative, a much better one, and they have customers who have millions of page views per day, maybe even per hour. And that is external data. It comes at a volume that you have no control over, because you get a customer, and they might have a thousand people coming to their website, or they might have a million people coming to their website. You have to support that either way.

Arvid:

And you have to support any big number to keep that customer, or they will go to somebody who does support it. And that was what I had to do: stabilize things, keep my customers, show that I can handle the millions of podcasts out there. The next few weeks, if I can even remember them, were a blur of firefighting and optimization. I rebalanced my resources, I shifted servers from transcription to extraction, trying to clear the backlog without completely starving the other queues. It was quite a significant balancing act.

Arvid:

I scaled AI services, because the context-aware alerting system that I have in PodScan uses AI to filter relevant mentions, and that's pretty much what it is. It was suddenly processing 5x its normal volume because of all these things flowing in. If you don't know what this is, it's a specific feature that I've implemented, probably one of the most interesting things I've ever built. When a new transcript comes in from a podcast somewhere and a customer has an alert for a keyword, they can also add a question to that alert. So maybe if I were to use it for myself, I have Zero to Sold, my book, right, and I would have Zero to Sold as the keyword in there, but I know that some people have been doing webinars or, like, Twitter Spaces with the name Zero to Sold to do some marketing workshops, and I don't want that.

Arvid:

So instead of just having that keyword and getting alerted for a thing that I wouldn't need, I can add a question there saying, is this podcast episode talking about Arvid Kahl's book Zero to Sold? And that question needs an AI to check. Right? So I had to quickly provision additional AI resources, while keeping an eye on cost, to be able to handle the needs of the customers that had this context-aware alerting system turned on. And mind you, spinning up new servers is not an option for me right now.
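
To make that feature a bit more concrete, here's a rough sketch of how a keyword-plus-question check could be structured, with the cheap keyword match gating the expensive AI call. The function names and model call are placeholders, not PodScan's actual implementation.

```python
# Sketch of context-aware alerting: cheap keyword match first, then an AI
# yes/no question only for transcripts that actually contain the keyword.
# `ask_model` is a placeholder for whatever LLM call you use.

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError

def should_alert(transcript: str, keyword: str, question: str | None) -> bool:
    if keyword.lower() not in transcript.lower():
        return False                    # no keyword hit: no alert, no AI cost
    if question is None:
        return True                     # plain keyword alert, no AI needed
    answer = ask_model(
        f"{question}\n\nTranscript excerpt:\n{transcript[:4000]}\n"
        "Answer strictly yes or no."
    )
    return answer.strip().lower().startswith("yes")
```

Gating the model call behind the keyword match is what keeps the inference bill proportional to actual matches rather than to every transcript that flows in.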

Arvid:

I'm kinda semi-bootstrapped here, so I have to look at my money. Still, I can't just, like, spend thousands of dollars to deal with the queuing system. I had to use the resources that I had. So I had to take resources from transcription and extraction and put them into inference, which further skewed the balance of the whole system. And I implemented more sophisticated priority queuing, because I needed to ensure that the most critical podcasts, based on user interest and update frequency, were processed first.

Arvid:

And this is a tough one for me, because I love all shows. It's hard to ignore a podcast just because it's not popular. It might still be relevant to somebody in the system, but for the overall health of PodScan, it was unavoidable for me to reprioritize these kinds of things. I get so many podcasts every given day that are just somebody ranting about stuff, or, like, religious podcasts that just go through the Bible or the Quran or anything and, for half an hour, talk about a certain topic. I needed to deprioritize these things that come in with a certain regularity and are not very popular, to be able to get the Huberman Lab podcast in there, or, you know, the Tim Ferriss podcast or whatever, the very popular ones that a lot of people get results for in their alerts. And I don't like this. I would love to have a system that can scan them all at the same time with the same priority, but I had to build in kind of an 80/20 version that takes most of the most popular ones right now and then delays the other ones until there is space in the priority queue, which tends to be overnight.

Arvid:

As I'm looking into English-speaking podcasts exclusively right now, there are not too many being released overnight. So all of these low-priority podcasts, as I would probably call them, are being scanned and transcribed and extracted and inferred overnight, within a couple of hours after they're released. I had to build this, and I still don't feel great about it, but once you build something like this, you have to deal with it. You also have to build checks into the system, and that's what I did. I built adaptive systems that could automatically adjust these kinds of priorities to load fluctuations.

Arvid:

And that would also prevent future backlogs from spiraling out of control, I would think. Right? If I see that there are too many items in the queue, I automatically deprioritize more items based on certain numbers, like how many ratings they have on the iTunes store or whatever. Right? I have a kind of adaptive queuing system that looks at the queue as a whole across all the different services and then makes kind of smart choices to keep it balanced.
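
Here's a minimal sketch of what that kind of adaptive prioritization could look like: score each episode by the signals mentioned above, and once the backlog passes a threshold, defer the low scorers until overnight. All field names and thresholds here are illustrative guesses, not PodScan's actual logic.

```python
# Illustrative adaptive prioritization: score episodes by user interest and
# popularity, and defer low scorers when the backlog grows too deep.

BACKLOG_SOFT_LIMIT = 5_000          # hypothetical: above this, start deferring

def priority(episode) -> float:
    score = 10.0 * episode.matching_alert_count                  # user interest first
    score += min(episode.podcast.itunes_ratings, 1_000) / 100    # popularity, capped
    score += 2.0 if episode.podcast.updates_daily else 0.0       # update frequency
    return score

def schedule(episodes, queue_depth):
    ordered = sorted(episodes, key=priority, reverse=True)
    if queue_depth <= BACKLOG_SOFT_LIMIT:
        return ordered, []                           # process everything now
    cutoff = max(1, int(len(ordered) * 0.2))         # rough 80/20 split
    return ordered[:cutoff], ordered[cutoff:]        # the rest waits for overnight
```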

Arvid:

And finally, I had to deal with the database, because, you know, we have 14,000,000 episodes in the database and 4,000,000 podcasts. And as the data volume swells, even just a count query begins to slow down the system. I had to refactor several database interactions to maintain performance, because that's the tricky part. That's kinda what I said in the beginning too. You have to really look into how you interact with your database.

Arvid:

If you have, I don't know, a hundred items in the database, counting them goes very quickly. If a query just gets a hundred items, counting what that would yield is also very quick. But if it's a thousand, it might be a bit slower. If it's 10,000, it's definitely slower. And if the queue grows and grows and these kinds of calculations are part of it, you're looking at slower and slower queries, which make the whole system a little bit slower, which maybe adds a few more items to the queue, which turns into this kind of self-fulfilling prophecy of things becoming slower until, at some point, they consume all the resources in your system.
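
One common way out, sketched here under the assumption that an exact, real-time count isn't actually needed for queue decisions, is to cache the expensive count for a short window (or maintain it incrementally) instead of running it on every pass. The helpers are hypothetical, not PodScan's code.

```python
# Sketch: avoid running a full COUNT(*) over millions of rows on every loop.
# Cache the expensive count for a short TTL; a slightly stale number is fine
# for queue-balancing decisions. `run_expensive_count` is a placeholder.

import time

_count_cache = {"value": None, "fetched_at": 0.0}
COUNT_TTL_SECONDS = 60

def cached_queue_count(run_expensive_count):
    now = time.time()
    stale = now - _count_cache["fetched_at"] > COUNT_TTL_SECONDS
    if _count_cache["value"] is None or stale:
        _count_cache["value"] = run_expensive_count()    # e.g. SELECT COUNT(*) ...
        _count_cache["fetched_at"] = now
    return _count_cache["value"]
```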

Arvid:

It's kind of like a memory leak, a resource leak in that sense. So I had to deal with this a couple of times. Fortunately, I figured it out before things crashed, but I had to keep an eye on my database, like the metrics of the server and the queries in the database itself, just to keep that part from exploding and affecting everything else. So throughout this process, I was acutely aware of 2 things. First off, I only have 24 hours in any given day, and they're way too few. And then, PodScan is still kinda semi-bootstrapped.

Arvid:

Right? I got some funding, but in many ways, I'm still acting like a bootstrapped company. I couldn't just throw unlimited resources at the problem; I had to optimize, and every optimization had to balance performance with cost effectiveness. And that's quite a lot for a single developer plus marketer plus whatever I am to handle in a system like this. But as the dust settles and PodScan regains stability, I'm left with several hard-won insights, and I'm gonna share them all with you so that hopefully you might learn from this now.

Arvid:

And when you get into a similar situation, when you face a similar challenge, you know at least what can be done, what should be done, what should maybe not be done, but what can be done. The first one is to expect unexpected scale, because the moment you deal with external data sources, your scale is not determined by anything you control. Customer count, revenue, whatever. You have to plan for the full scope of data you might encounter with your customers, the things that they wanna see, the things that they bring, and provision resources that can handle it. Even when things are a bit shaky, even when you're deploying, even when there might be a little bug, you have to be able to deal with the data load that your business brings the moment you deal with external data sources. And you have to test at true scale. You cannot let yourself be fooled by tests on subset data.

Arvid:

You have to find ways to validate your systems against realistic data volumes, and you have to understand that even a small percentage of a large number is itself a large number. Scale is hard for humans. We just don't really get big things. They're hard for us to understand, and it's particularly hard when it's all digital information that we can't really see. Right?

Arvid:

It does not truly exist in a physical form. We can't feel the wealth of information. We just see a number, and that is really, really tough for our minds to grasp. So we need to deal with these things, and what I did was implement circuit breakers: building a safeguard that can detect and mitigate unusual spikes in data volume or processing time or whatever. At the very least, a circuit breaker should get your attention so you can intervene, right?

Arvid:

It may not actually do a circuit break in the system, but it has to pull you out of whatever you're doing, break your circuit, and allow you to manually deal with the problem. And to be able to do this well, you have to decouple critical systems. Our problems were compounded because a slowdown in one area, the ingestion, cascaded into user-facing alerts. And that is problematic. It's hard to do, but if you can, these systems should work independently.

Arvid:

So if you can design your architecture to isolate the potential failure points, that would be great. Right? It just sucks if some part of that chain breaks and the whole chain goes down. That's kinda why I talked about queuing systems in the beginning. If you have these decoupled systems and they all communicate via queues, at least the queue is kind of the backstop: it allows one system to produce however much data it can, and then the other system will take only as much as it can deal with.
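
As a minimal sketch of that decoupling, here's the basic producer/consumer shape using an in-process queue; in production you'd reach for RabbitMQ, a database-backed job queue, or whatever your framework offers, but the idea is the same. The transcribe step is a placeholder.

```python
# Producer/consumer sketch: the ingester pushes work onto a queue, and the
# worker pulls only as fast as it can actually process. A backlog piles up
# in the queue instead of taking the whole pipeline down with it.

import queue
import threading

transcription_queue: "queue.Queue[str]" = queue.Queue()

def ingester(episode_urls):
    for url in episode_urls:
        transcription_queue.put(url)       # never blocked by slow transcription

def transcribe(url: str):
    """Placeholder for the heavy GPU transcription step."""
    ...

def transcription_worker():
    while True:
        url = transcription_queue.get()    # pull work at the worker's own pace
        transcribe(url)
        transcription_queue.task_done()

threading.Thread(target=transcription_worker, daemon=True).start()
```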

Arvid:

That is usually a very good idea. Look into message queues. Most systems, most programming frameworks, have some kind of queuing system, or they support message queues such as RabbitMQ or ZeroMQ or anything like this. The moment you understand as a developer what a message queue is and how it allows you to scale and distribute work, you have a very, very helpful tool in your tool belt that can help you build systems that are more resilient to things like what happened to me. And the other thing you need, besides having the right architecture, is observability.

Arvid:

The sooner you can detect anomalies, the easier they are to fix. Comprehensive logging is always important, and that might suck because you have to scroll through a lot of logs, but it's very important for you to be able to understand how your system works. And then monitoring. These things are not optional at this scale. You have to have them at any scale, really, but they really come into play when you have external data. Once you have a system up and running, you should think about how you can compare the current data against the expected data, the data that you have from the prior weeks or so.

Arvid:

And how you can track error rates over time as well, and have them reported to you when things misbehave. Usually, I use a lot of these tools; things like Sentry or AppSignal or whatever these things are called are very, very useful, as they integrate into your code base and give you access to exceptions and even to profiling. You can see how certain queries work, how fast they are, what's slow, what's fast. That's really useful. So observability is really non-optional at any level, but particularly when you have external data.
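
As a small sketch of the "compare current against expected" idea: check today's ingestion volume against a rolling baseline from the prior weeks and ping a human when it drifts. A check like this would likely have flagged the 30-40% capacity drop within a day or two; the data source and notifier here are hypothetical stand-ins.

```python
# Sketch of an expected-vs-actual check: compare today's ingestion volume to
# a baseline from the prior two weeks and alert a human when it drifts.
# `daily_episode_counts` (oldest to newest) and `notify_me` are stand-ins.

def check_ingestion_volume(daily_episode_counts, notify_me, tolerance=0.25):
    today = daily_episode_counts[-1]
    baseline = sum(daily_episode_counts[-15:-1]) / 14      # prior 14 full days
    deviation = (today - baseline) / baseline
    if abs(deviation) > tolerance:
        notify_me(
            f"Ingestion is off by {deviation:+.0%} vs. the 14-day average "
            f"({today} vs. ~{baseline:.0f} episodes)"
        )
```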

Arvid:

And flexible infrastructure is the same. Right? The ability to quickly reallocate resources between these services saved us from a complete meltdown last week. I designed my systems with this flexibility in mind. The code for the transcription server, the extraction logic, and the context-aware question inference all runs in the same application.

Arvid:

I deploy one application to a computer, like some Ubuntu server with a GPU in it, and all I have to do is change a configuration value to shift resources. If I switch on transcription, it starts to transcribe. If I switch on inference or extraction, it does these things as well. Right? I usually restrict it to one of these 3, but if I need to, I can turn on the others, and they start to kind of balance out on that computer.
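
In sketch form, that kind of single-artifact, role-by-configuration setup can be as simple as a few flags deciding which worker loops a box starts. The flag names and loop bodies here are made up for illustration, not PodScan's actual configuration.

```python
# Sketch of one deployable whose role is decided by configuration: flip a
# flag and the same box becomes a transcription, extraction, or inference
# worker (or any combination). Flag names are illustrative only.

import os
import threading

def run_loop(role: str):
    """Placeholder: pull jobs of this type from the shared queue, forever."""
    ...

ROLE_FLAGS = {
    "ENABLE_TRANSCRIPTION": "transcription",
    "ENABLE_EXTRACTION": "extraction",
    "ENABLE_INFERENCE": "inference",
}

def start_enabled_roles():
    for flag, role in ROLE_FLAGS.items():
        if os.environ.get(flag) == "1":
            threading.Thread(target=run_loop, args=(role,), daemon=True).start()
```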

Arvid:

I could even automate that depending on load, but I'm not there yet. Maybe in the future, if I have another catastrophe, I might be able to do this. And finally, you have to understand your bottlenecks, because every system has limits. So you need to know where yours are and plan to address them before they become critical. So graph your performance data and keep an eye on systems that are at capacity.

Arvid:

That's always interesting, because if some RAM is full or if a CPU is at 100%, you will experience problems that will trickle into bottlenecks that then affect the whole system. So take a look at that. This whole experience has been very humbling, but it also kind of rekindled my passion for building resilient, scalable systems. PodScan is now more robust than ever, and I have clearer insights into its operational limits and capabilities. It's really cool.

Arvid:

All it cost me was 2 weeks of my life and a severe chunk of my sanity. So, you know. But for those of you building data-intensive businesses, do remember that the challenges you will face aren't always obvious from the start. They happen at a scale that you may not even understand or be able to comprehend, but they will happen. So stay vigilant, be prepared to adapt quickly, measure everything, and never stop learning from your system's behavior.

Arvid:

Like, track stuff over time and look at graphs over time. Usually, just a graph can be a great indicator of a development. Right? Look at the last 4 weeks of your system. Do you see something slowly creeping up?

Arvid:

That will be a problem 4 weeks in the future. Right? That is how you look at these things and keep an eye on your system. Building at this scale, I have noticed, is not for the faint of heart. But for those willing to embrace the complexity, the rewards, in terms of technical knowledge clearly, but also the ability to provide very unique value to users, are immeasurable.

Arvid:

So now, if you'll excuse me, I have a few million podcasts to process. And that's it for today. Thank you so much for listening to The Bootstrapped Founder. You can find me on Twitter at Arvid Kahl, A R V I D K A H L. You'll find my books there, and my Twitter course too.

Arvid:

If you want to support me and this show, please tell everyone you know about PodScan.fm and how well it runs. Leave a rating and a review by going to ratethispodcast.com/founder. It makes a massive difference if you show up there, because then the podcast will show up in other people's feeds, and they will learn about PodScan, which is really appreciated. Any of this will help the show. Thank you so much for listening.

Arvid:

Have a wonderful day, and bye bye.
