352: Running Lean at Scale

Arvid:

Hey, it's Arvid, and this is The Bootstrapped Founder. This episode is sponsored by Paddle.com. Yesterday, I shrunk the size of my production database from 4 terabytes to just under 1 terabyte. And I know it's still ginormous, but this change has been very impactful for PodScan in several ways. Something interesting happened last weekend that made me realize I needed to change how I think about scale.

Arvid:

While I was checking my monitoring dashboards, I noticed a pattern in PodScan's data ingestion that would significantly change how I approach orchestrating my bootstrapped business. So let me back up a bit and give you some context here. PodScan has grown to process millions of podcast episodes. I think we have 15, almost 16 million episodes in the database right now. And that database is now multiple terabytes in size.

Arvid:

It's 4-ish, 4-plus terabytes in size. That's all just text, all transcripts. It's a massive treasure trove of information, of conversations, of content, and it's very up to date. But looking at it as a founder, this creates an interesting tension for me: I need to keep the business running lean with low expenses while handling this ever-growing mountain of data.

Arvid:

And with about 6 months of runway left, which is what I'm looking at right now, I'm at the critical stage where every decision about scale really, really matters. The good news is that if the current subscriber growth continues, we will hit profitability within those 6 months. But here's the catch: we'll only get there if I can thread the needle between two competing forces, growing revenue, of course, making money, while keeping expenses in check, or better yet, finding ways to reduce costs without compromising the product. And back to that weekend observation that plays into this: I noticed something fascinating in my ingestion pattern. On weekdays, PodScan processes around 50,000 episodes every day, and roughly 40 to 42,000 of them are fresh releases that came out that day.

Arvid:

But on weekends, the number of new episodes drops dramatically, sometimes as low as 5 to 10k on a Saturday or Sunday morning. It's really crazy: it's 40k during the week, every single day, and then like 5 or 10k on the weekend. Nobody releases a podcast on the weekend. People release on Monday because they wanna catch people first thing in the week, or they release on Friday so people can listen on their way home or over the weekend, but people rarely ever release on the weekend. So that means during the week, PodScan is processing 25% more episodes than necessary, right? It's 40,000 fresh episodes and 50,000 processed.

Arvid:

And on weekends, we're running at nearly full capacity, mostly processing historical backlog, because that's what PodScan does if there's nothing fresh. It goes into the history of podcasts, grabs one, transcribes it, and keeps going back in time. When I look at the queue that I have for old episodes, I think we're about 2 months in the past. We're looking at, what is this, like, late July episodes right now that I didn't get to back then. That's how we run backwards through time.

Arvid:

And that's when it hit me. I think we're over-indexing the resources that we have right now on content that might never provide value to our users, both in terms of these older episodes that aren't really of much interest to someone who needs alerts for podcast mentions right now, and even within the freshly released episodes, there's a potential lack of value. And this realization led me to question the core assumption that I had about PodScan: do we really need to process every single podcast episode immediately? I wanted this from the beginning to be the thing that PodScan does, but is it required?

Arvid:

Because our users, they're mainly people tracking brand mentions and market trends. They're not really interested in every type of podcast content. Right? I'm scanning everything out there, but people are interested in some kinds of podcasts, the kind everybody wants to listen to, and not in others. And think about it: there are thousands of daily Bible readings, Quran or Torah readings, church sermons, music showcases, poetry podcasts, and that's all wonderful and valuable, but those things rarely trigger the kind of brand mentions or industry discussions that PodScan's users actively track right now.

Arvid:

Yet right now, we're giving them equal priority in our processing pipeline. So over the last couple of months, I've been experimenting with many different ways of optimizing for the growing scale of PodScan. You've been with me for this time on this podcast. And my new realizations fit right into those prioritization frameworks. So I had a couple of learnings here, just in approaching this and dealing with it, that might help you if you ever face similar challenges with data.

Arvid:

The first thing that I learned over the last couple of months, and particularly now, is that database management has to be smart. You have to understand where your data lives and how your data is accessed. I discovered that I can run a smallish CPU instance, like 4 or even just 2 virtual CPUs with a couple dozen gigabytes of memory and a lot of storage, obviously. That is enough because I carefully optimize every single query that hits that database. And this small database performs just as nicely as if I were throwing thousands of dollars at bigger hardware.

Arvid:

I could maybe have some more leniency with my queries, but I don't need to. It's not about having the most powerful database, it's about being smart with how you use it. So every time I look up data or pull out data, I do it in a way where I know that I could never, ever exhaust the computer that is actively hosting that database. And it goes as far as something that I call chunk size automation, or dynamism. I chunk every query that touches more than 10-ish database rows at a time.

Arvid:

So if I do an update over a whole lot of things, I do it in chunks. If I read data from the database for thousands of episodes, I do that in chunks as well. Because every single second, there are operations happening on my database. A lot of writing, because I get a lot of things coming from my back end, from my transcription API, and they write a lot of data. I need to chunk my own queries so that there's space for other things to happen too.
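
In code, the basic chunking idea looks roughly like this. This is a minimal sketch using Laravel's query builder, not PodScan's actual implementation; the Episode model and the processed flag are hypothetical stand-ins:

```php
<?php

use App\Models\Episode; // hypothetical model

// Instead of one giant UPDATE across thousands of rows, walk the table
// in fixed-size chunks keyed by ID. Each chunk is its own query, so
// writes from the transcription API can interleave between chunks.
Episode::query()
    ->where('processed', false) // hypothetical flag
    ->chunkById(500, function ($episodes) {
        foreach ($episodes as $episode) {
            $episode->update(['processed' => true]);
        }
    });
```

chunkById pages through results by primary key, which keeps each individual query small and is the safe way to chunk while also updating the rows you're filtering on.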

Arvid:

And for large-scale operations, I have built something that I call a self-aware system that automatically adjusts the chunk size based on processing time. So I start with 10 items or 100 items at a time, depending on what I do. For big, heavy computation things, this might be 10 items at a time. For simple lookups, it's 100 or 1,000 items at a time. And then I measure how long it takes to read or write that data.

Arvid:

And if an operation takes anywhere between 1 and 5 seconds, I keep the chunk size the same. Is it faster, done in under one second? I increase the size of the chunk by 20%. And the next chunk that goes in is just a bit bigger, so maybe that takes a bit longer and hits that one-to-five-second window, so I keep it there, or it's still under one second, so I give it 20% more.

Arvid:

Right? And if it's slower, if it takes 5 seconds or more, I decrease the chunk size until it hits that window between 1 and 5 seconds. And this is all automated. It's dynamic, and it has been a game changer for managing large-scale operations without overwhelming our resources, because the performance of the moment determines how heavy the load will be. I think this is called back pressure.
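
Sketched out, the feedback loop looks something like this, a minimal sketch rather than PodScan's actual code; processChunk is a hypothetical callable, and the 20% shrink factor is an assumption (the episode only specifies the 20% increase):

```php
<?php

// Self-adjusting chunk sizes: grow when a chunk finishes in under a
// second, shrink when it takes more than five, hold steady in between.
function processWithDynamicChunks(array $items, callable $processChunk, int $chunkSize = 100): void
{
    $offset = 0;
    while ($offset < count($items)) {
        $chunk = array_slice($items, $offset, $chunkSize);

        $start = microtime(true);
        $processChunk($chunk); // hypothetical: handles one batch
        $elapsed = microtime(true) - $start;

        if ($elapsed < 1.0) {
            // Too fast: the database has headroom, so grow by 20%.
            $chunkSize = (int) ceil($chunkSize * 1.2);
        } elseif ($elapsed > 5.0) {
            // Too slow: back off (20% is an assumed factor here),
            // but never drop below one item per chunk.
            $chunkSize = max(1, (int) floor($chunkSize * 0.8));
        }
        // Between 1 and 5 seconds: keep the chunk size as-is.

        $offset += count($chunk);
    }
}
```

That's the whole back-pressure idea: the measured performance of the last chunk decides how heavy the next one is allowed to be.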

Arvid:

This is a term that exists in the industry, but it works. Like, give your tools the means to understand how they perform, and let them slow down when the database is getting overwhelmed. And one thing I really wanna mention, because I think this has been my crowning achievement in data storage, one thing I did over the last week that I was so excited about, is that I compressed my data. Something I had on my backlog forever was to think about how I can compress text information better.

Arvid:

I was just writing it into the database for now, just for easy lookup and easy working with it, but probably the most impactful change that I made was implementing automatic compression for large text fields. There are two of these in the PodScan database. There's the actual transcript, like the whole thing with every single timestamp and one person saying something, then the next person saying something, you know, the transcript of a podcast. That was just text, and it's now compressed with gzip, then Base64-encoded and written into the LONGTEXT field, or whatever that field might be, in the MySQL database. And I do the same thing with the JSON data that holds the word-level timestamps, which is a lot of data.

Arvid:

JSON is pretty verbose when it comes to stuff like this, because keys are often long words, and if you just use the raw text, there are a lot of symbols, a lot of quotes, a lot of key names and brackets and all that. The moment you gzip it, you just compress it, it goes down to like 15% of the size. And I reduced the storage of all of my text, which is most of my database, by around 85%. And the beauty of this solution is that it's completely transparent to users. They don't see it at all, because it happens on the Laravel Eloquent level, on the ORM level, right, where the things that are in the database are mapped to models inside the application.
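
The scheme itself is just a gzip round trip with a Base64 wrapper so the result stays safely inside a text column. A tiny sketch (the $wordTimestamps data is a hypothetical example; actual ratios depend on the data, the roughly 15% figure is from the transcripts described above):

```php
<?php

// Hypothetical word-level timestamp data, the kind of verbose JSON
// that compresses extremely well.
$wordTimestamps = [
    ['word' => 'hello', 'start' => 0.42, 'end' => 0.61],
    ['word' => 'world', 'start' => 0.65, 'end' => 0.98],
    // ... thousands more entries per episode
];

$json = json_encode($wordTimestamps);
$stored = base64_encode(gzencode($json, 9)); // what goes into the column
$restored = gzdecode(base64_decode($stored)); // what comes back out

assert($restored === $json); // lossless round trip
```

One trade-off worth noting: Base64 adds roughly a third back on top of the compressed bytes. Storing raw binary in a BLOB column would be smaller still, but the text-safe encoding keeps the value easy to handle everywhere a string is expected.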

Arvid:

There are ways for Laravel to pull out data and manipulate it before it is exposed on the model that then gets involved in all the operations and sent to the front end. So there's always this tiny, like, one-millisecond piece of additional work now every single time a transcript is accessed, but it's so worth it. Bandwidth has gone down, speed has gone up. The database lookups are faster because the data is just more compressed, more dense. It's really, really useful. And my users, they don't see anything.
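
One way to hook in at that level is a custom Eloquent cast, which Laravel supports out of the box. This is a minimal sketch of the idea, assuming a hypothetical GzipText cast class, not PodScan's actual implementation:

```php
<?php

use Illuminate\Contracts\Database\Eloquent\CastsAttributes;
use Illuminate\Database\Eloquent\Model;

// Transparent compression at the ORM level: the rest of the app only
// ever sees plain text; the database only ever sees compressed text.
class GzipText implements CastsAttributes
{
    // Runs when the attribute is read off the model: decode + decompress.
    public function get(Model $model, string $key, mixed $value, array $attributes): ?string
    {
        return $value === null ? null : gzdecode(base64_decode($value));
    }

    // Runs before the attribute is written to the database: compress + encode.
    public function set(Model $model, string $key, mixed $value, array $attributes): ?string
    {
        return $value === null ? null : base64_encode(gzencode($value, 9));
    }
}

// Wiring it up on a hypothetical Episode model:
// class Episode extends Model
// {
//     protected $casts = [
//         'transcript' => GzipText::class,
//         'word_timestamps' => GzipText::class,
//     ];
// }
```

Because the cast sits between the database and the model's attributes, every read and write passes through it automatically, which is what makes the compression invisible to the rest of the application.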

Arvid:

They get the same high-quality data because it's losslessly compressed, but I'm using a fraction of the storage and bandwidth. It's really, really cool. And this has been an optimization I'm so proud of, because it's one of those moments where you implement it, you test it, and it works, and you deploy it to production and it works there as well. One of those rare moments where things don't explode. It's been really, really cool.

Arvid:

And if you ever deal with any kind of text, particularly if it's large JSON data that you don't really know the structure of, or that might be big or might be small, look into compression. Gzip is part of almost all frameworks out there, and you can very easily implement this on a level where people don't even need to see it. I'm so happy. Whenever I add any more text data to PodScan in any way, I'm going to automatically compress it, for sure. Why wouldn't I?

Arvid:

It's a little bit of additional computation, but, I mean, my main application server has, what, like, 36 CPU cores. I think it can handle a couple of gzip operations. That should be fine. So, yeah, these insights that I had over the last couple of weeks led to a pretty fundamental shift in how I'm approaching PodScan's growth, the ingestion side of things. Instead of trying to process everything immediately, I think we're moving more to an intelligent, decisive system that prioritizes content based on user needs.

Arvid:

That's where I need to be. This is a bootstrapped business. This is a SaaS on its way to profitability. The lofty goal of processing every single podcast episode? I'll get there. Right now, we cut some off.

Arvid:

And if people want a particular podcast, want to see those things, they can always manually add it to the system. That's already part of what PodScan can do. So if you find a podcast you really want the episodes for, you just click on it. That's pretty much all you need to do. The moment somebody navigates to a podcast in the interface or requests it through the API, a little note is made that maybe the next episode should be transcribed, you know, just to be sure.

Arvid:

So just by using the platform, people will kinda show the platform what they need. And with all of this, I think I can potentially cut server costs right now by up to 50%, without compromising the core value that we provide to customers. We might transcribe fewer episodes, but the value for our customers remains the same, because we still transcribe the right ones. They might even see results faster, because now all those non-interesting things get deferred until later, so the interesting things get done quicker. And this experience reminded me of an essential truth about bootstrapping: building features and acquiring customers are all part of it, but sometimes the most impactful work happens when you step back and just question your assumptions about how things should work, what your business should look like, what your big dream is.

Arvid:

Because maybe there's a step that you can take that gets you there without being the final step. For now, I'm focusing more on sales and outreach. I'll share more about this next week, some cool stuff coming up here, while I'm implementing these database optimizations. Because PodScan already has all the core features that it needs to be a valuable social listening tool. Like, the first name that I ever had in my mind for PodScan was "Google Alerts for podcasts", and that is exactly what it is.

Arvid:

The challenge right now is not adding more features. There's always interesting stuff, and I just released sentiment analysis a couple weeks ago, but that is stuff that is not core to the actual need of the customer, which is figuring out when they are mentioned. That helps, but it's not core. And the core things are here. It's now about making sure that I can sustainably deliver value to a growing user base and keep it growing. So I always have to remember that being a smart bootstrapper is not about having infinite resources.

Arvid:

That's the dream, but it's not real. It's about being thoughtful with the resources that I have and then continuously finding ways to do more with less. And sometimes the best way to scale is not to build more, though I would love to, but to build smarter. And one smart thing you can do is to go check out Paddle, the sponsor of this show. And I use Paddle, right? I'm not just peddling it, Paddle is actually what I use to collect the revenue that keeps PodScan running.

Arvid:

It's probably one of the most dependable services that I've ever used for my SaaS. They take care of payments and taxes, they collect outstanding invoices, they deal with refunds, all that stuff that would normally keep me up at night. Well, not anymore, because with Paddle making sure that my revenue comes in every month, I can now think about compressing podcast transcripts and developing dynamic chunking algorithms. It's a builder's dream, because somebody else has taken care of all the complicated stuff around money. So go to paddle.com to check them out. They have an amazing API and great integrations into most frameworks.

Arvid:

Highly, highly recommended. And now, back to building. That's it for today. Thank you so much for listening to The Bootstrapped Founder. You can find me on Twitter at arvidkahl, a r v i d k a h l, and you will find my books and my Twitter course there too.

Arvid:

If you wanna support me and this show, tell everybody you know about podscan.fm, particularly if they are a marketer or somebody who really needs information from podcasts, wants to go on podcasts, or wants to have a podcast. And leave a rating and a review for this show by going to ratethispodcast.com/founder. It makes a massive difference if you show up there. Any of this truly helps me and this show. Thank you so much for listening. Have a wonderful day, and bye bye.
