374: Indie Hacking Databases at Scale

Arvid:

Hey. It's Arvid, and you're listening to The Bootstrapped Founder. As indie hackers, we often start our journey with these simple dreams and these manageable datasets. Just a couple things, a couple customers, a couple projects. Right?

Arvid:

But what happens when success hits at some point and your modest database transforms into this data behemoth? Let me share my experience scaling PodScan from a simple database with a couple of items to a system handling millions of podcasts, tens of millions of episodes, and over a million new tracking rows daily. When you're dealing with a few hundred thousand rows and maybe 50 tables in your database max, life is good. Databases remain manageable, queries run smoothly, and everything feels under control. It's easy to build against that.

Arvid:

But success brings scale, hopefully, and scale brings a lot of challenges. In PodScan's case, it wasn't just about storing podcast data and metadata. We're talking about millions of podcasts and their episodes and transcripts and tracking charts and topic analysis. There's a lot of stuff. What started as a straightforward database with a couple relationships quickly evolved into this intricate web of highly interconnected data that needed to be instantly accessible.

Arvid:

The real challenge here isn't just the technical side. It's juggling database architecture and maintenance while also wearing all the other hats of a solo founder. I need to take care of marketing. I need to build product. So when you're handling marketing, sales, customer support, and product development, the last thing you want is your database becoming a full time job.

Arvid:

It just can't happen. Otherwise, your business is gonna fail. Yet that's exactly what can happen if you don't plan ahead. The first reality check for me while building databases for PodScan came when simple queries like "give me the latest item in this table" started taking minutes, sometimes even hours, and that was bad. That's when I learned the critical importance of database indices.

Arvid:

And I knew about this stuff, and I usually put them in, but there are sometimes surprises in there. So for any database at scale, an index on the relevant data is non negotiable. That's your fast access lifeline to frequently queried data like creation dates or podcast IDs, in my case. That needs to have an index. But here's the catch.

Arvid:

While setting up indices during table creation is pretty painless and just part of building a schema, adding them to existing tables with millions of rows, now that's a different beast entirely. Combined indices are even worse: that's when you have several columns, say your ID, your creation date, and some other field, and you combine them into one index because you have a query that needs all three. Wow. Setting that up often requires a full table scan, and that can freeze your production system for an extended period.

Arvid:

Could be seconds, could be minutes, could be hours, and you certainly don't want that. Adding columns is usually a quick thing, but if you have default values, that's a similar challenge, because you're essentially rewriting millions of rows if you put a default into every single row and you already have millions of them. I learned this lesson the hard way when trying to add a new index to track podcast episode publishing patterns. What I thought would be a quick database update (thankfully I tested it locally first, where it was already fairly slow with just a couple tens of thousands of rows) turned into a potential twenty-or-so-minute downtime situation, and that was completely unacceptable for a service providing API access to customers who rely on it.
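To make that concrete, here's a minimal MySQL sketch with a hypothetical, simplified episodes table (not PodScan's actual schema): defining an index up front versus bolting one on after the fact.

```sql
-- Hypothetical, simplified table; column and index names are illustrative.
CREATE TABLE episodes (
    id           BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    podcast_id   BIGINT UNSIGNED NOT NULL,
    published_at DATETIME NOT NULL,
    title        VARCHAR(500),
    transcript   LONGTEXT,
    -- Defining an index at creation time is painless: it's just part of the schema.
    INDEX idx_published_at (published_at)
);

-- Adding a combined index later, once the table holds millions of rows,
-- means building the index over all existing data. That can take a long
-- time and put serious load on a production database.
ALTER TABLE episodes
    ADD INDEX idx_podcast_published (podcast_id, published_at, id);

-- Adding a column with a default can be a similar story on some MySQL
-- versions and configurations, because existing rows effectively get rewritten.
ALTER TABLE episodes
    ADD COLUMN language VARCHAR(8) NOT NULL DEFAULT 'en';
```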

Arvid:

So I had to find a way out. Initially, I resigned myself to scheduling downtime for maintenance. That was before I had API customers. But as PodScan grew and started offering an API, downtime became unacceptable. You can't have an API with opening hours.

Arvid:

I mean, the European Union used to have that for their value added tax data, but that's not the point. Any SaaS business that offers an API as part of its service needs it to be there all the time. So at that point, I discovered AWS RDS's blue green deployments. That's a game changer for database maintenance. In case you didn't infer this, I run my database on AWS, and I use their hosted, managed database service RDS to keep this MySQL database running all the time, keep it upgraded, keep it updated, keep it backed up, and I'll get to these things, but that's where all my data lives.

Arvid:

And for each database in the system, AWS offers these blue green deployments which is something I didn't know until a couple months ago when I had to figure it out. And the concept is brilliant. You create a copy of your existing database and add it as a follower to your original one. And you then have two synchronized databases. On the new database, you perform your time intensive operations.

Arvid:

And this is the green version. That's what they call it. The blue one is the original that keeps serving requests while you update the green one in the background. And then you switch over when you're ready. AWS RDS handles this elegantly through DNS based switching, which is really, really cool and results in minimal disruption, because in a way you don't even have to change your configuration to change your database.

Arvid:

They just switch it internally. Right? There are two servers, they flip a switch, and then the old server gets a new IP and the new server gets the old IP, and all of a sudden your connection just has a little hiccup for a second or two and then things start working again. With PodScan currently being hit by 10 to 20 requests per second from all the internal and external API requests, I typically see maybe 25 to 50 errors during a switchover, which is not that much. Like, it's really, really fast for something that I don't even have to redeploy.

Arvid:

It's really cool. It's a testament to its reliability. I've done this many times so far, and it has worked every single time, very quickly. There's something else I found during all of these operations: if there's something particularly challenging, you can use a toolkit by a company called Percona.

Arvid:

They have a really, really cool SQL and database toolkit with a tool called pt-online-schema-change, which saved me countless hours a couple months ago. It allows structural changes without blocking reads and writes and without having to rebuild everything in the existing table. So it creates a copy of a table. That's the idea. It modifies not the data, just the structure of that table.

Arvid:

And then it modifies that structure to, I don't know, add an index or a particular new field with a default. Then it continuously copies data from the old table to the new one while keeping operations running on the old table in the background. It's really the same idea as a blue green deployment, but inside an existing database. And this approach has been particularly useful for creating a full text index, because man, adding a full text index to an existing table with millions of rows can take weeks. Like, I tried it.

Arvid:

It was impossible to add, particularly because I have so much text data. Right? Imagine that in my podcast episodes table, it's not just the name of the episode or maybe a description, which can be quite lengthy, but the full transcript, hours' worth of text in each of these items. Now imagine 20 million of these rows and trying to create a full text index that indexes every single word. Man, no. That took way too long.
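Conceptually, the pattern pt-online-schema-change automates looks roughly like this in plain SQL. This is a simplified sketch with hypothetical table names; the real tool additionally installs triggers and batches the copy so writes that happen during the migration aren't lost:

```sql
-- 1. Create an empty shadow table with the same structure as the original.
CREATE TABLE episodes_new LIKE episodes;

-- 2. Apply the expensive structural change to the empty copy, which is cheap.
ALTER TABLE episodes_new ADD FULLTEXT INDEX ft_transcript (transcript);

-- 3. Copy rows over in chunks so the original table keeps serving traffic.
--    (The real tool batches this and captures concurrent writes via triggers;
--    shown here as a single illustrative chunk.)
INSERT INTO episodes_new
SELECT * FROM episodes WHERE id BETWEEN 1 AND 100000;

-- 4. Atomically swap the tables once the copy has caught up, then clean up.
RENAME TABLE episodes TO episodes_old, episodes_new TO episodes;
DROP TABLE episodes_old;
```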

Arvid:

So pt-online-schema-change really helped me do this at that point, even on the green deployment. Like, that was the thing: I created a blue green deployment, and then inside the green copy I used this tool to make it possible. Yeah, data at this scale is a lot. I think the database is currently roughly over two terabytes in size, and that is already a reduced version of what it used to be. It used to be four terabytes, four to five, and I'm gonna get to that, but it's already shrunk and it's still massive.

Arvid:

So anything on this will take a long time. So why did I shrink it? Well, recently I faced an interesting challenge that perfectly illustrates the maintenance complexities at scale that you will undoubtedly encounter once you run into data like this. I built a system to move old transcripts from shows that are a couple months or years old into Amazon S3 as JSON documents, keeping only frequently accessed content in my database to keep it lean. The logic was simple: if an episode is older than, I don't know, a couple months and rarely accessed, maybe once a week or a month, why keep its transcript in the primary database?

Arvid:

Right? That doesn't really make sense. However, I discovered that deleting this data did not reclaim any space inside my database, even though I thought it would. And then I figured out that my AWS RDS storage still showed five terabytes when the actual data was only two terabytes. Now, I have a solution for this that I'm gonna tell you about, but here's the reason why.

Arvid:

By default, AWS RDS, if you use a MySQL database in there, keeps each table in its own file. I think you can configure the storage differently, and that might have changed things, but with the default, deleting rows just marks parts of that file as deleted, and those parts still contribute to the size of the file. The only way to actually reclaim the space is to run an OPTIMIZE TABLE command, which I did on a blue green deployment, right?
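As a rough sketch of what that looks like in practice, with a hypothetical schema and table name: you can check how much space is sitting in the table file as reclaimable, and then rebuild the table, which is what actually returns it to disk.

```sql
-- How much space inside the table file is marked free but not given back to disk?
SELECT table_name,
       ROUND(data_length / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(data_free  / 1024 / 1024 / 1024, 1) AS reclaimable_gb
FROM information_schema.tables
WHERE table_schema = 'podscan' AND table_name = 'episodes';

-- Rebuilding the table is what actually reclaims that space. For InnoDB this
-- copies the whole table, so on a multi-terabyte table it can run for days;
-- run it against the green copy, never against the live production database.
OPTIMIZE TABLE episodes;
```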

Arvid:

I created a copy of the blue database, making it green, and then ran OPTIMIZE TABLE on my podcast episodes table. And this operation took over three days to complete. It was crazy long. On day two, I was thinking, should I stop this? Is this ever gonna end?

Arvid:

But I was committed to seeing it through, and it successfully compressed the storage needs by three terabytes, from five down to two, which significantly reduced the monthly AWS bill I have to pay. Of course, this means paying for double the storage and double the compute during the optimization period, but the long term savings make it absolutely worthwhile. And you couldn't do this on a production system. It just could not work. You could not block a table for a copy operation for three days.

Arvid:

Right? You just can't do this. You need to have a replica or a blue green deployment for this. I don't know how I would do this all by myself if I were hosting and managing my database myself. I have no idea.

Arvid:

I would not be able to run these operations, which is why I highly recommend going with managed solutions from the start. It might not be the cheapest thing, but it turns out to be quite cheap in the end because I don't have to deal with things like replication and deployments. It's all built. It's all working. Right?

Arvid:

So that's my personal perspective. Of course, if you're not going to have massive amounts of data, you might not need this, and you might still be able to transfer your data into a managed database at some point. But you never know, and I like to make these decisions early so that I don't have to migrate data when it's just becoming complicated. And here's another thing: I talked about full text indices already, and that is complicated, because full text search is another scaling challenge for any business with data. MySQL offers full text indexing, and it's really useful and it works, but it doesn't work well on millions of text heavy rows, particularly after the fact, you know, when it takes weeks to build. But even if you had built it in from the start, search is very slow if there's so much of an index to go through.
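For reference, this is roughly what the built-in MySQL side of that looks like, assuming the hypothetical episodes table from the earlier sketches and a FULLTEXT index named ft_transcript on the transcript column; it works, but both building the index and searching it get painful at this scale.

```sql
-- Querying MySQL's built-in full text search. In natural language mode,
-- results come back sorted by relevance. The pain point is that the index
-- has to cover hours of transcript text per row across millions of rows.
SELECT id, title
FROM episodes
WHERE MATCH(transcript) AGAINST ('indie hacking' IN NATURAL LANGUAGE MODE)
LIMIT 50;
```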

Arvid:

So I opted for Meilisearch as an external search solution, and I self host it on a Hetzner server for cost efficiency, a couple hundred bucks a month. But it introduced a new complexity to the whole system, and that is synchronization. With up to 60,000 new transcripts arriving every day, each potentially containing hours of transcribed text, which is a lot of text, keeping the search engine in sync with the main database became a significant engineering challenge. You send data over and it has to be absorbed into that search database. You need to make sure you don't send the wrong data, and you need to make sure you only send data it can handle. It's just a lot.

Arvid:

Add multiple indices for different content types to that search database, like podcasts, topics, descriptions, episodes, and you're looking at a substantial data pipeline that needs careful management. And the synchronization process revealed its own bottlenecks. Meilisearch's ingestion can only handle so much data at once, again because there's just so much text in podcasts.

Arvid:

And when you're dealing with hours of transcribed text per episode, you need to carefully manage your indexing strategy. So I had to build a sophisticated queuing, waiting, and batching system just to keep everything in sync without overwhelming either side. Right? I don't wanna keep all of those things in memory and then have my main server run into memory overload. That's just so much data.

Arvid:

Fifty, sixty thousand episodes a day. I don't know why I did this. I don't know why I started this business, man. It's just so much going on, but it works because you just need to make solid choices along the way. And one of the lessons that I've learned repeatedly is that even high performance databases can sometimes make surprisingly poor decisions.

Arvid:

You need to tell your system what it's supposed to use. MySQL's query optimizer is generally quite reliable. You can send a query and it tries to find the right indices, but it doesn't always pick the optimal index for a query. And I have situations, still every couple of days, where a simple query will run for minutes because the index is there but the system just doesn't use it, until I add an actual index hint to the query, after which the whole thing executes in milliseconds using that index.
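Here's the shape of that fix in MySQL, with hypothetical table, column, and index names; FORCE INDEX (or the softer USE INDEX) tells the optimizer which index you already know is right, and EXPLAIN is how you verify what it actually picked.

```sql
-- The optimizer occasionally picks a poor plan for a query like this one,
-- and it crawls for minutes even though a fitting index exists...
SELECT id, title, published_at
FROM episodes
WHERE podcast_id = 42 AND published_at >= '2024-01-01'
ORDER BY published_at DESC
LIMIT 20;

-- ...so you hint at the combined index built for exactly this access path,
-- and the same query comes back in milliseconds.
SELECT id, title, published_at
FROM episodes FORCE INDEX (idx_podcast_published)
WHERE podcast_id = 42 AND published_at >= '2024-01-01'
ORDER BY published_at DESC
LIMIT 20;

-- EXPLAIN shows which index a query actually uses, which is how you catch this.
EXPLAIN SELECT id FROM episodes
WHERE podcast_id = 42 AND published_at >= '2024-01-01';
```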

Arvid:

That is something you really have to be on top of. You need an APM system, an application performance monitoring system, that shows you which things are getting slower so you can figure out whether your queries are actually using what they could be using. For me, this was recently particularly visible while dealing with my podcast chart tracking data. A month or so ago, PodScan introduced charts where you can see the top 200 podcasts in any country and any category, something that took me a long time to build because it's hard to build. But that means over a million new rows being added daily, because the combination of countries and categories alone is already around 10,000 combinations, and for each of those you add 200 items to the database. That's a lot of data.

Arvid:

Even simple queries can become problematic without proper optimization and proper index use. So I've learned to regularly audit query performance by looking at the metrics, the APM data I have in Sentry, and to not shy away from using index hints when the optimizer consistently makes suboptimal choices. One thing that's important: if you add an index after the fact because you see that a query might need a better or a different index, and you use migrations to change your database schema, make sure you add that manually created index to a migration at some point, or at the very least do a data-less export, a schema-only dump. So you know exactly which indices to recreate when you redeploy your database, create a staging environment, or set up a secondary production system, whatever it might be. Right?
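If you're not sure which indices have quietly accumulated on a table over time, MySQL will tell you, and that output (or a schema-only dump, for example via mysqldump with --no-data) is exactly what you want to fold back into your migrations. Hypothetical table name again:

```sql
-- Lists every index on the table, including any that were added by hand
-- directly in production.
SHOW INDEX FROM episodes;

-- Prints the full CREATE TABLE statement, indices included. Handy to diff
-- against what your migrations would produce on a fresh deploy.
SHOW CREATE TABLE episodes;
```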

Arvid:

Make sure those index choices you make along the way are cataloged and documented somewhere, because if you don't put them in, the next time you deploy, all of a sudden things will get super slow and you will have no idea what the reason is. So stay on top of that. And finally, when it comes to protecting my data from a catastrophic disaster, which has yet to happen, but you never know, I use a multi pronged approach. I'm talking about backups here, really. First off, I leverage the AWS RDS snapshot feature, where they regularly create backups of the full database, on an hourly basis I think, ready to be restored at any point.

Arvid:

I've tried this with a couple test databases. It does work, but it's a backup on the platform that it's made for. You know? Like, it's an AWS based backup for AWS data. So if AWS has a problem, then I have a problem.

Arvid:

Relying on this alone is a form of platform risk, which I would like to avoid. That's why, on a regular basis, I do a full data dump: the whole thing, I download it and store it off platform and on platform, in S3 or wherever. And since the database is terabytes in size, even compressed it's still a terabyte plus, this takes a while and it costs a bit to store. But it's absolutely worth having the data right with me, on a local disk in my house and on several cloud object storages. Like, Hetzner recently introduced an S3-compatible storage, and AWS obviously has one.

Arvid:

I think DigitalOcean may have one too. It would take me probably a day or so to recover from just the data dump, just to load it into a new database. But honestly, it beats not having off-site backups at all. Because if AWS ever explodes, which I don't think it will, because then everybody else is gonna have problems too, or if my data gets corrupted or the backups don't work or whatever, it's still a way to recover all this really, really valuable data and put it back into a database.

Arvid:

And also, it's just good to have the data to play with if I ever wanted to have a local system to import all this data. It's good to have a local copy. So for fellow indie hackers starting their database journey, here's my advice. Start with a hosted managed database service that's cost effective but reliable. Stick to the basics.

Arvid:

You really only choose between a local database, if you wanna use SQLite, which is just a file based database, really fast, really cool, it just doesn't scale well into replication, I think, or a hosted thing where other people make sure your data sticks around. I would just go with the hosted option. That's my personal perspective. And for that, you either use MySQL or Postgres.

Arvid:

And while PostgreSQL offers really attractive features like vector storage, which is great for AI applications because you can actually keep your embeddings in there, and native location data support, I think that is called PostGIS, where you can create maps and stuff, MySQL has served me and PodScan well over the last couple years, though it means finding alternative solutions for certain features. I think Postgres is also pretty good at full text search. I would still likely only use Meilisearch for full text even if I had a Postgres database, just because of the size of the data. But just saying: some databases have stuff like this built in, others do not.

Arvid:

You need to find the solution that includes as much as possible for you, but also the solution that you can work with best. It costs me a couple thousand dollars, like $2,000 to $3,000 a month, to run this database on AWS, and a few hundred to run my own Meilisearch server, under three k altogether when optimized for size. It was more expensive when it was five terabytes; now it's two, and that's definitely not as expensive. Another tip: keep your implementation database agnostic initially. Avoid special plugins or vendor specific features until you've found the revenue to commit long term, which is why I go with MySQL, because it's just very basic.

Arvid:

If I were using Postgres, I may already have used the vector feature, and that would mean I could only ever use Postgres there. Probably okay, but it's also a choice you need to make in the beginning. Just always ensure that you can quickly deploy read replicas should you need to. A read replica is pretty much what a blue green deployment is, but it's not meant to be switched over. It's a copy of your database that is read only, that is constantly fed new data from your main database, and just gets read from.

Arvid:

And it's so much faster because it doesn't have to wait for writes. Right? You can read, read, read, query, query, query. So if you have a lot of traffic spikes or a lot of people reading data, say you go viral all of a sudden and everybody wants to look at your homepage, it's good to be able to deploy a read replica and have your system just run queries against that.
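One small addition: if you do route reads to a replica, it's worth checking how far it lags behind the primary. On a reasonably recent MySQL replica that check is a single statement (older versions call it SHOW SLAVE STATUS):

```sql
-- Run on the read replica itself: shows replication health, including how far
-- (in seconds) it currently lags behind the primary.
-- Available as SHOW REPLICA STATUS on MySQL 8.0.22 and newer.
SHOW REPLICA STATUS;
```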

Arvid:

I found that being able to quickly spin up a read replica during high traffic periods has saved me multiple times in the past, back when things were still not very optimized, particularly when search engines suddenly start crawling our content more aggressively, or even worse, AI companies. Like, they don't really stop at the robots.txt. They just crawl, crawl, crawl. It's good to have a read replica that can just, you know, serve them, even though it's a cost for you.

Arvid:

It's better than having this kind of read heavy traffic bring your system to its knees. If I were to start PodScan today, with all the learnings from along the journey, I might choose PostgreSQL for its built in features like vector and location support. But the key lesson isn't about which database to choose. It's about understanding that your needs will evolve as you scale, and you can't really know precisely what you need from the beginning. The database that serves you well at launch might need to be complemented with additional services as you grow, like a search engine did for me.

Arvid:

Just remember to start with a simple database today and think about the fact that it might need to handle millions of records tomorrow or maybe in a month. Right? Plan for scale. Just don't over engineer. Consider just good practices in building queries, in building an index, in building the schema, and focus on building something that works now while keeping the door open for future optimization.

Arvid:

Use good systems. That's kinda what it is. Like, you could probably write your SQL by hand, but you might wanna have an object wrapper around the database, an ORM, something that takes care of this for you. For Laravel, there's one that's really, really cool.

Arvid:

It's called Eloquent. It's essentially a query builder and ORM. It's highly effective and gives me everything I need without having to write actual SQL. That is something most developers do anyway, obviously, but I highly recommend it: if you don't know about these systems, look into them and use them consistently.

Arvid:

You can be pretty sure that things are as optimized as they can be. Because all these things that I talked about, all these challenges of finding these solutions, that's what indie hacking is all about. Right? It's growing and adapting as our success demands it. That's why we're indie hackers.

Arvid:

The most important thing I've learned through this journey is that database management at scale is not just about technical solutions. It's a balance between the performance and maintenance effort and cost. Because I could just buy a much bigger database, and the queries would run faster. Right? You can do a lot of vertical scaling just by getting more cores and more RAM.

Arvid:

Things just happen much, much faster, but it's also super expensive. And as a solo founder, you need solutions that not only work well but don't require constant attention. So sometimes that means paying a bit more, maybe not too much, but a bit more for managed services. And other times, it means investing time in automation. The key is knowing which battles to fight and which ones to defer until they truly matter for your business and you can actually afford to fight them.

Arvid:

And that's it for today. Thank you so much for listening to The Bootstrapped Founder. You can find me on Twitter @arvidkahl, a r v i d k a h l. If you wanna support me and this show, please share podscan.fm with your professional peers and those who you think will benefit from tracking mentions of their brands, businesses, and names on podcasts out there. We have charts now.

Arvid:

Topics are coming. We have collections and lists. There's a lot of good stuff in PodScan. Check it out. PodScan is a near real time podcast database with a really, really good API, or so my customers tell me.

Arvid:

So please share the word with those who need to stay on top of the podcast ecosystem. Thank you so much for listening.
