The Bootstrapped Founder | Transcript: 319: My SaaS Server Exploded (& How I Salvaged It)


May 10, 2024 / 19:38 / E319

Arvid: 00:00

Welcome to The Bootstrapped Founder. Earlier this week, I finally had a day, a full day, to work on a massive refactor of PodScan's transcription queuing system. In case this is too technical for you, consider today's episode a nerdy deep dive into the heart of where my business builds its value. I'll explore the safeguards and mental exercises that I needed and developed to deal with sizable setbacks, and there will be a setback. And if you're interested in what a solopreneur's tech complexity can look like, stick around too.

Arvid: 00:31

I'll spare no details. This episode is sponsored by acquire.com. More on that later. So what happened? On Monday this week, for the very first time in almost 3 weeks, I had a full day to myself to just spend on building software.

Arvid: 00:45

The stuff that I like. I'm a software engineer. I love building. Didn't get to for a while. I was at MicroConf 2 weeks ago.

Arvid: 00:52

That was a full week that I needed just for the conference. And then the following week was full of calls that I had scheduled so I could catch up on the week that I missed at the conference. There were just a lot of interruptions happening throughout the last couple of weeks, which meant that I didn't have any meaningful long block of time to spend on my software business, PodScan. Every now and then I could build a little thing or respond to, you know, a customer message and change a little API field here or there. But I really didn't have time to just really think and really spend a couple of hours building something, testing it, deploying it. So after these 2 weeks, I chose a day where I would have a full day of opportunity to work on software.

Arvid: 01:34

But that actually took some work to get there. I had to reconfigure my schedule for this and I'll tell you what I mean. Not just my schedule, my whole scheduling approach. Because for the last couple of months I've been very open with my schedule. People could book calls almost every day of my week, through my Calendly link that I put into every email that I send to my customers.

Arvid: 01:54

Mostly because I really wanted the early PodScan users and early customers to find the perfect time for them to talk to me. Right? Right. Just a 15 minute call. Whenever you find time, talk to me.

Arvid: 02:05

I wanna hear from you. Because the idea here was I could hop on a call with a potential or paying customer whenever it suited them. I'm in a super early stage. I need lots of feedback, and every little piece of feedback really helps. It's really insightful.

Arvid: 02:18

So every early customer interaction, whether it had positive feedback or negative feedback, doesn't matter, is something I can build on later. Because if I put my time in, people will see that and they will build a relationship with me. And I wanted to open my schedule to everyone for as much as I could, and I did. But this also meant that I was constantly interrupted, because people scheduled their conversations with me whenever it best suited them, but never me.

Arvid: 02:45

And in the meantime, I also had a podcast to record, interviews to organize and to research, guests to invite to the show, and I had to be on Twitter all day. All of that still had to happen. I have to manage my editing, the transcribing of my stuff, the podcasts and my business, and a lot of extra stuff on certain days just adds overhead that eats into the time I can spend on building software. So I made a choice. I'd had enough of an open schedule, and I reduced the days available for any kind of call to 2 days a week max. All of my customer service conversations or my customer exploration conversations, discovery, that kind of stuff, now happen on Friday.

Arvid: 03:23

All my podcast recording now happens on Thursday. And I used to have one interview call a day on Tuesday, Wednesday, Thursday, or whatever. Now I have as many people as I can fit into that day with some time in between. So I'll stack my calls on Thursdays and Fridays, and that frees up the rest of the week to write code, do marketing, sales, all of that without constant interruptions. Obviously, there might still be the odd call every now and then, but I wanna focus my first 3 days of the week on being able to build stuff.

Arvid: 03:52

So after setting up this new schedule, I then set a day to work fully on my software project. And I finally had that day on Monday this week. So I chose the biggest issue that I had out of all the issues that I could work on. It's kind of an eat the frog moment. I chose to completely refactor the heart of my operation.

Arvid: 04:11

Now I will explain to you what that is because you may or may not know PodScan is the business that I'm currently building. It's a media monitoring and data platform for podcast transcripts. And therefore, it ingests thousands of hours of audio data every single hour. It transcribes them into text and then scans them for keywords to send to an API. It's a lot of data coming in, and it's not a very flexible system.

Arvid: 04:38

And whenever any of my now 24 back end servers that do the transcriptions have capacity, they ask my main server through an internal API if there's anything available, then my main server checks the database for an available project and sends it over. That's how it works. And this has grown over time. And it feels like it can be made more extensible and kinda less convoluted. I thought on Monday morning, let's tackle this.
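
The pull-based handoff described here, where a transcription server with spare capacity asks the main server for work, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PodScan's actual code: the `Episode` and `MainServer` names, the status values, and the in-memory list standing in for the database are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    id: int
    status: str = "pending"          # hypothetical status value
    claimed_by: Optional[str] = None


class MainServer:
    """Stands in for the main server and its database (assumption)."""

    def __init__(self, episodes: List[Episode]):
        self.episodes = list(episodes)

    def claim_next(self, worker_id: str) -> Optional[Episode]:
        """Hand the first available episode to the asking worker, or None."""
        for episode in self.episodes:
            if episode.status == "pending":
                episode.status = "transcribing"
                episode.claimed_by = worker_id
                return episode
        return None


server = MainServer([Episode(1), Episode(2)])
first = server.claim_next("worker-01")
second = server.claim_next("worker-02")
third = server.claim_next("worker-03")   # nothing left to hand out
```

In this sketch the workers drive the schedule: the main server never pushes work, it only answers when asked, which matches the "whenever they have capacity" description.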

Arvid: 05:03

And I had been thinking about this for a couple of weeks, taking a couple of notes every now and then, thinking about what I could do, what the structure of the new system would be. And I came up with an internal queue system, a system of many queues that make it easier for this kind of individual podcast episode to move through all these different stages of analysis. Because PodScan works like this. I get an audio file from some feed somewhere. Like some feed tells me, hey.

Arvid: 05:27

This new audio file has just been uploaded and this is the new podcast episode. I download it. I transcribe it. That's step 1. Then I run inference on it to extract certain information, maybe build a summary and ask questions of it through, like, an AI system, a couple of local LLMs that I run on all my 24 servers. That's step 2.

Arvid: 05:48

And then I scan for keywords and I send alerts to everybody who is a paying customer or trial customer of PodScan and has alerts set up. That's step 3. So transcription, inference, and scan. I have a couple more steps planned in my document there, but I was gonna start with these 3. Just even without expanding the system, build a better system that can be expanded on later.

Arvid: 06:10

And right now, they all kinda kick off each other. When an audio file is available for transcription, presented on the API, a back end server fetches it, and then responds with the full transcript a couple minutes later, and that kicks off the inference step. And then the server handles inference. It sends it back. And then scanning starts on the main server, and that's how it works.

Arvid: 06:28

It's a back and forth kinda system between 2 fleets of servers, and it works. But there's always room for improvement. Right? Well, that's what I thought on Monday morning. I thought about creating 3 queues in which these candidates live, one for transcription, one for inference, one for scanning, according to their stage of completeness.

Arvid: 06:46

Instead of looking in my database, I would just look into the queue, pick 1, and move it from one queue to the other depending on the stage that it's in. Simple enough. And I started building this because I finally had some time to do it. Right? I built the system over, must have been 4 hours, and used it locally, and I kind of tried to replicate the production system as much as I could, but, you know, it's still not the exact same thing.
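
The three-queue idea could be sketched like this. It is a minimal in-memory illustration, assuming one FIFO queue per stage. The stage names come from the episode, while `enqueue`, `advance`, and the `done` list are hypothetical.

```python
from collections import deque

STAGES = ["transcription", "inference", "scanning"]  # stage names from the episode

queues = {stage: deque() for stage in STAGES}
done = []


def enqueue(episode_id):
    """A new episode always enters at the first stage."""
    queues[STAGES[0]].append(episode_id)


def advance(stage):
    """Pop one item from `stage` and push it to the next stage (or to `done`)."""
    if not queues[stage]:
        return None
    episode_id = queues[stage].popleft()
    next_index = STAGES.index(stage) + 1
    if next_index < len(STAGES):
        queues[STAGES[next_index]].append(episode_id)
    else:
        done.append(episode_id)
    return episode_id


enqueue(101)
enqueue(102)
advance("transcription")   # 101 moves to the inference queue
advance("inference")       # 101 moves to the scanning queue
advance("scanning")        # 101 is finished
```

Note that the pop and the push in `advance` are two separate steps. In a real system with several servers, a failure between them would leave an item in neither queue. That is one way items can go missing in a design like this, not a diagnosis of what actually happened here.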

Arvid: 07:08

It worked pretty well. I spent some time on edge cases because I knew that production is a pretty complicated system. What if there's an error while the transcript is being created? What if one of those servers goes down? What if I have to retry it, but there's another step that needs to be done first?

Arvid: 07:22

That kind of stuff. And fortunately, my work has some structure. I'm not completely randomly building these things. I use the git flow model. If you're not aware of this, the idea in the git flow model is that whenever you commit code to your repository, you only commit to the main branch (or the master branch, as it used to be called back when git flow was invented) when the code is actually usable.

Arvid: 07:48

Anything that is kind of a longer project sits on its own branch, called develop, or on a feature branch for something specific. And, you know, you can build it there, and at the same time you can still fix bugs on the main branch should you need to. That's the idea. Right? And I have a development branch for these more complicated experiments, so I can just, you know, still fix things as I need to. And after about 6 hours of diving into the code and testing intensely, I was ready to pull it from development, that's the name of the branch, to the main server through deployment, and it worked.

Arvid: 08:20

There were a couple of bugs when I started it. There was something about a particular data point that I didn't think about that was slightly different on the production database, but it was fine, and I could quickly fix that. I tweeted about how well it worked. And I was kinda nervous because it was a pretty central functional part of PodScan, but it did its job. It worked.

Arvid: 08:41

It was still consuming stuff from the API. It was still sending out transcripts, still getting transcript data back, still sending alerts. So that was all working. And it's really critical because if the data ingestion and the transcript engine stuff doesn't work for my business, then nothing works. Right?

Arvid: 08:55

The API and notifications would break down. And that's the stuff people pay for. So I need that to function well. So I let it run for a bit and I started checking my metrics. And gradually, fewer episodes landed in the database.

Arvid: 09:09

Usually, I have around, like, 2,500 episodes an hour. Now it dropped to 2,100, and then 1,900. And then, like, half an hour later, it was at, like, 300. So 2,500 to 300, that's a significant drop. Right?

Arvid: 09:26

That's, like, 1 in 6. So one sixth of the actual amount of episodes. That's when I stopped. Something was wrong in the system. Some queue item didn't move the way it should and was lost somewhere, and I couldn't see what the problem was. And I couldn't just play around and delete jobs from my production system.

Arvid: 09:42

They all needed to work. It was live. It was a live system that interacted with live data. So I rolled back what I'd worked on for 6 hours to the version that I had in the morning before I started working on this. And I tweeted about that too because I think sharing setbacks like this is part of my approach to building in public.
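
The drop described above, from roughly 2,500 episodes an hour down to 300, is the kind of signal a simple throughput check could flag automatically. The following is a minimal sketch under stated assumptions: the function name and the 80% threshold are hypothetical, and only the hourly figures come from the episode.

```python
def throughput_drop(baseline_per_hour: float, current_per_hour: float,
                    min_ratio: float = 0.8):
    """Return (alert, ratio); alert is True when the current rate
    falls below `min_ratio` of the baseline rate."""
    if baseline_per_hour <= 0:
        raise ValueError("baseline must be positive")
    ratio = current_per_hour / baseline_per_hour
    return ratio < min_ratio, ratio


BASELINE = 2500                # usual episodes per hour, as mentioned in the episode
readings = [2100, 1900, 300]   # hourly rates observed after the deployment

# Hypothetical threshold: alert when fewer than 80% of the usual episodes arrive.
alerts = [throughput_drop(BASELINE, reading)[0] for reading in readings]
```

With these numbers the first reading stays above the assumed threshold and the next two trip the alert, so a check like this would have raised a flag well before the rate reached 300.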

Arvid: 09:58

And I was devastated. I was super frustrated, not just by the system failing, right, the errors that happened, but also by feeling that I had wasted a day, a day that I had fought so much to get. And I went upstairs, because I live in a basement, like any good software engineer, and I talked to Danielle, my partner. And as I explained my frustration, I realized something. I may have spent a few hours writing code that I won't use.

Arvid: 10:26

Sure. And that's frustrating. But I also had learned a lot about my product and my approach to making improvements for the product. Both the product and my approach had properties that I was not aware of before. Yes.

Arvid: 10:40

The queue system didn't work, but I learned how complicated my existing system is, how delicately it's balanced, and how many interference points existed as well. Like where if you just change something a little bit, a lot of things break. So that was a lot of insight into the technical debt slash technical complexity of my actual product here. And I also learned that my feature specification process was incomplete. Like adding another abstraction layer was not required.

Arvid: 11:10

The queuing system wasn't necessary. My existing state machine, which I wasn't even aware was a state machine until that point, was efficient enough if I ensured reliable state transitions between these states. And what I currently have has reliable state transitions. The queue system I built did not. Hence it failed.
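
One common way to get reliable state transitions is to write the allowed moves down explicitly and reject everything else. The sketch below assumes a single `state` field per episode. The state names and the retry edges are hypothetical, loosely following the transcription, inference, and scan steps from the episode.

```python
# Allowed transitions: each state maps to the states it may move to.
ALLOWED = {
    "pending": {"transcribing"},
    "transcribing": {"transcribed", "pending"},   # back to pending = retry
    "transcribed": {"inferring"},
    "inferring": {"inferred", "transcribed"},     # back to transcribed = retry
    "inferred": {"scanned"},
    "scanned": set(),                             # terminal state
}


class InvalidTransition(Exception):
    pass


def transition(episode: dict, new_state: str) -> dict:
    """Move the episode to `new_state`, refusing any transition not listed above."""
    current = episode["state"]
    if new_state not in ALLOWED[current]:
        raise InvalidTransition(f"{current} -> {new_state}")
    episode["state"] = new_state
    return episode


episode = {"id": 1, "state": "pending"}
for state in ["transcribing", "transcribed", "inferring", "inferred", "scanned"]:
    transition(episode, state)
```

Because every episode always has exactly one state, it cannot fall between two containers the way an item moved between queues can.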

Arvid: 11:28

And if I ever add more steps to this state machine, I will have to focus on improving the current state machine instead of replacing it. Like working with what I already have. So despite all the wasted hours that I was so frustrated about in the beginning, I learned so much from this attempt. I think it was absolutely worth it. And next time, I won't dive in with just a few notes and a Notion document.

Arvid: 11:52

I will look into what I currently have more properly and monitor the state machine. And honestly, this is something I realized: monitoring this, having observability into my system, that's a feature I need to build before I can replace systems like this. And one more remark about this whole process. I'm extremely lucky to have systems in place already to deal with a botched deployment that allow me to actually revert it. Like rolling it back was easy and it was possible.

Arvid: 12:22

And that's maybe the most important part. If you build software, build software that you can easily, or at least without struggling too much, roll back 1 or 2 deployments. That's sometimes all you need. And I use Laravel as my framework of choice, with PHP as my programming language and all that stuff that is underneath Laravel. And I use Laravel Forge for hosting, or for orchestration I guess, and Laravel Envoyer for deployments.

Arvid: 12:47

And these tools in concert allow me to very, very easily and quickly roll back to a prior version of the application. For PHP, it's really just files in a directory that you symlink in from one place to another. It's really simple. Right? You just kinda move the old directory back onto where the server looks for what it should show to the people out there that use the product, and it immediately goes back to working like that.
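
The symlink approach described above can be sketched in a few lines. This is a generic illustration of the technique, not Envoyer's actual implementation. The `activate` helper and the `release-1`, `release-2`, and `current` names are hypothetical, and it assumes a POSIX system where renaming over an existing symlink is atomic.

```python
import os
import tempfile


def activate(release_dir: str, current_link: str) -> None:
    """Point `current_link` at `release_dir` by renaming a fresh symlink over it."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current_link)   # rename replaces the old symlink in one step


# Hypothetical layout: one directory per release plus a "current" symlink
# that the web server would be configured to serve from.
root = tempfile.mkdtemp()
releases = []
for name in ("release-1", "release-2"):
    path = os.path.join(root, name)
    os.mkdir(path)
    releases.append(path)
current = os.path.join(root, "current")

activate(releases[1], current)   # deploy the new release
activate(releases[0], current)   # roll back to the previous one
```

The old release directory is never touched during a deploy, which is what makes going back cheap: a rollback is just another symlink switch.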

Arvid: 13:12

And Envoyer in particular let me revert to a prior deployment, like, within seconds, because I keep several dozen backup versions. I'm prepared. I have like 50 steps back should I ever need them. And reversibility is so important. So if you build software, if you build a software business and you have these experiments that you wanna run, you have these features that you wanna build, you absolutely need to think about reversibility. And often the database changes with these things in particular.

Arvid: 13:42

And in my case, there were a lot of changes in the database. I built new queues, and they needed a table to reflect it in the database. Doesn't matter if it's in a Redis store or if it's an SQL or MySQL or Postgres server. It doesn't really matter. Database changes happen as you build.

Arvid: 13:59

So you need to have non-destructive database migrations if you're building. Like in particular, if you start a new feature, if you wanna build something new and your database changes, I highly, highly recommend not to switch things inside the tables, but to add more tables, or not to change types, but to add new types, and then migrate data from one type to the other. That's my personal experience. You might think of it differently, but don't modify tables, add new ones. That allows you to roll back into the old data.
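
An additive migration of the kind recommended here might look like the following. It is a minimal sketch using SQLite for portability, not Laravel's migration API. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE episodes (id INTEGER PRIMARY KEY, title TEXT, status TEXT)")
conn.execute("INSERT INTO episodes (title, status) VALUES ('Ep 1', 'transcribed')")


def migrate_up(conn):
    """Additive change: create a new table and copy data into it.
    The existing `episodes` table is left exactly as it was."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episode_queue_entries ("
        " episode_id INTEGER PRIMARY KEY REFERENCES episodes(id),"
        " queue TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO episode_queue_entries (episode_id, queue) "
        "SELECT id, 'inference' FROM episodes WHERE status = 'transcribed'"
    )


def migrate_down(conn):
    """Rollback only drops what the migration added; no original data is lost."""
    conn.execute("DROP TABLE IF EXISTS episode_queue_entries")


migrate_up(conn)
queued = conn.execute("SELECT episode_id, queue FROM episode_queue_entries").fetchall()
migrate_down(conn)
original = conn.execute("SELECT id, title, status FROM episodes").fetchall()
```

Since the rollback touches only the new table, the old code can keep reading `episodes` as if the migration had never run.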

Arvid: 14:33

And there might be conflicts, but it's still better than having to wait for millions of rows. And I'm at that point right now. I have over 3,000,000 rows in the podcast database and over 5 million or so rows in my episodes database. At this point, it's gonna be hundreds of millions in the end. So any small change that goes through your development system within, like, 2 seconds is gonna take 2 hours on the actual full production system, and that impacts other things.

Arvid: 15:02

That impacts read speed. That impacts your, what is it, the sync time between your replicas, your read replicas if you have them. There are so many little things that can happen if you change a tiny thing in your database. Add on top of it.

Arvid: 15:16

That is always gonna be easier than changing things as you go. So this is my personal learning over now, what, 15 years of building software with migrations and databases. Make sure you don't change things as you build. And most migrations have this kind of forward backward mentality built into their systems, right, where you add a new table and there is, like, a step back where the table gets deleted, or you make a change and in the rollback feature you can kinda switch it back, but that often destroys data. Just don't do it.

Arvid: 15:46

Build another, add another property. Do something like this. Makes it easier. So I've experienced this very, very clearly over the last couple days, which is why I'm kinda hammering on the point here. But let's just look at this as a lesson in reframing.

Arvid: 16:01

Right? It was a frustrating loss of time, and I thought about it for a minute, removed myself from my office, talked to my partner, and all of a sudden I noticed that it was a massive learning opportunity to understand my product better and to understand my process better, which is, for a solopreneur like myself, very important. If you don't have people that constantly critique your stuff because you're the only person that works with yourself, you need to step out of your own way and look at your process to see if there's room for optimization. So next time something like this happens to you, you build something, doesn't work, you deploy something, everything explodes, look at it through the lens of emergent insights. If I hadn't tried to implement this, I wouldn't have learned about the existing complexity in my business.

Arvid: 16:48

And now I know more about myself and the business, and I have more experience with what I tried, what worked, and what didn't. Do this a few hundred times and you have all the moat you'll ever need in your business. Now you will know so much about how to build a business or a product like yours that every copycat out there is just gonna fail within the first 20 or 30 of these little problems themselves and then they're gonna move on to better things. That's the moat. You going through these experiments, learning from them and never seeing them as a failure, but always as an opportunity to get more insight, that is what makes all the difference.

Arvid: 17:27

And that's it for today. I wanna briefly thank my sponsor, acquire.com. Imagine you building this perfect software product that never has any bugs. Yeah. It's a dream.

Arvid: 17:39

Right? But, you know, like, you build a great SaaS, you have customers, and then you're generating consistent MRR. You're living the SaaS dream, pretty much. Problem is, it's just not working for you. You're not growing for whatever reason. And that might be your personal growth, might be business growth.

Arvid: 17:54

It's just that there's something that's lacking. Focus, skill, interest, you feel stuck and you don't know what to do. The story here unfortunately is that too often people think they should just keep working on it, but what happens is they just pay less attention. They stop doing things. Inaction happens.

Arvid: 18:12

And then the business becomes less and less valuable over time or, at worst, completely worthless. So if you find yourself at this point, or if you think your story is likely headed down this road, where it's just not working for you anymore, I would consider a third option, and that's selling your business on acquire.com to people who would like to build, who would like to take it to the next level. Capitalizing on the value of your time and just the skill set that you have right now is a smart move. And if your business doesn't fit anymore, well, you can exchange it for money because that's kinda how acquisitions work. So acquire.com is free to list.

Arvid: 18:48

You can always just check it out, see what you can do to make it more acquirable. The people over there have helped hundreds of founders already. So go to try.acquire.com/arvid and see for yourself if this is the right option for you. Thank you for listening to The Bootstrapped Founder today. I really appreciate it.

Arvid: 19:05

You can find me on Twitter at @arvidkahl, a r v i d k a h l. You'll find my books and my Twitter course there too. And if you wanna support me and this show, please subscribe to my YouTube channel, get the podcast that you're currently listening to in your podcast player of choice, and then leave a rating and a review by going to ratethispodcast.com/founder. Because that makes a massive difference. If you show up there and you support the show, then the podcast will show up in other people's feeds and support them on their journey.

Arvid: 19:32

Any of this will help the show. Thank you so much for listening. Have a wonderful day, and bye bye.

Creators and Guests

Host

Arvid Kahl

Empowering founders with kindness. Building in Public. Sold my SaaS FeedbackPanda for life-changing $ in 2019, now sharing my journey & what I learned.
