321: Unexpected Downtime: Stress as Enhancement vs. Stress as Panic

Arvid:

My therapist recently introduced me to the concept of stress as enhancement in exposure therapy. The basic idea is that a little bit of stress can focus you to a point where you make new neural connections and realize new and beneficial insights that help you change for the better. Well, I was able to put this to the test earlier this week, because all of a sudden, without warning or any indication really, PodScan slowed down to a crawl and then went down completely. And that's every SaaS founder's absolute nightmare. And yet, as frustrating as that was, I embraced the situation.

Arvid:

I got through it, and I came out the other side with a better product, happier customers, and a feeling of having learned something new and valuable. And that's what I will share with you today. I'll walk you through the event, what I did, and what came out of it. This episode is sponsored by acquire.com. More on that later.

Arvid:

Let's jump to Wednesday morning. I recently talked about how I have pushed all my calls and face-to-face engagements to Thursday and Friday. So Wednesday was gonna be one of those days where I could fully focus on product work. A high-profile customer of mine had recently sent me a DM and asked for a specific feature. Serving my pilot customers is one of my growth channels, because they tell all their successful founder friends about PodScan, so I got to work. I really wanted to build this.

Arvid:

So I started the morning quite relaxed and ready to make the product better. I had nothing else planned, just work on the product. And it was a fairly simple feature: being able to ignore specific podcasts when setting up an alert for keyword mentions. The founder in question got too many alerts for their own name being mentioned on their own show. So, you know, it was a clear value add for people like them to be able to ignore certain podcasts.

Arvid:

And it took maybe an hour or so to build this, including making these features available on the API, which I want to do whenever I make any changes to my alerting system, because I already know a lot of people are using my API to set up alerts for their own clients. It's really cool. And when I was done with this, after an hour or so, I was doing my usual pre-deployment checks, just to see if everything is okay. And one of my admin endpoints, one that I usually load in the browser just to look at a couple of metrics, was a bit slower than usual. And that was quite odd, because it had been reliably around 2 seconds, just the time to query the database and pull the data.

Arvid:

But now it took 10. That occasionally happens when the database is a bit busy with some massive import, but I wasn't running anything, and I didn't really know what was going on. And even after a few refreshes, this didn't let up. It only got worse. So I went to the home page of PodScan to check, and it didn't load for 15 seconds this time.

Arvid:

Or rather, it clearly didn't even connect for 15 seconds, because once it did, the site loaded immediately. And that left me slightly confused. So the server itself was responding as it should, but something on the network was causing a massive delay. What could that be? I hadn't changed anything.

Arvid:

And at this point, I could see where this was going. 10 seconds, 15 seconds, 30 seconds. This was going to end up with minute-long delays and the page essentially timing out. This was gonna be downtime. And my stress level started to rise.

Arvid:

I didn't panic. I knew that I had not made any changes to the system, but I felt like this needed my immediate attention, my focus for the next hour or so. So I took a few seconds to reflect on my situation. I just tried to sit calmly and think about it because I didn't need to type away immediately. It would be better to just think about it.

Arvid:

In the past, I probably would have easily spiraled into this kind of active panic where I would try to do something just to do something, thinking: what if the server is completely messed up? What if this is the end for PodScan? Right? That would have been my thinking in the past, when things like this happened with other services I ran.

Arvid:

I did not think about this this time. This time, like my therapist said, I was leaning into the stress, because you have a choice between stress as suffering and stress as leverage, and I chose to use it. So I stepped into a place of calm determination. I said to myself, this is a technical issue with a technical reason and a technical solution. The Internet is a complicated place, a weird network of tubes as they say, and my product is itself a complicated machine.

Arvid:

Something is wrong. I will figure it out. I will attempt to solve this calmly and without skipping steps. And I started with my emergency reaction step number 1, make sure it's a real problem. There's this old adage that just because things work for you, they might not work for others.

Arvid:

Right? And I think that's also true in reverse. If something is broken for you, it might still work for others. So that's when I went to webpagetest.org and ran a test on the main homepage from somebody else's server. And when I saw connectivity tanking on their end as well, I knew it was not just a me error.

Arvid:

Right? My internet here in the Canadian countryside is occasionally spotty, but this time it wasn't at fault. So whose error was it? I had a few candidates. It could have been my servers, my programs, right, my PHP application itself.

Arvid:

It could have been my Cloudflare settings, something with the encryption, something with the caching. Cloudflare itself could have issues. My database could have issues that caused a delay. My server hosting provider, Hetzner, or maybe the internet itself could have issues at this point. So to check which one was responsible, I went down this list in order of how proximate the service was to me.

Arvid:

So, like, how close would it be to something I could actually do anything about? I started with my server, which was hosted on the Hetzner Cloud. I logged into my server. I checked its vitals, like the CPU load, the RAM usage, and the disk space available, and it all looked pretty good. I quickly restarted the Nginx process and the PHP-FPM supervisor just to make sure the application was not caught up in something, but that yielded no results. Same problem.

Arvid:

And as huge timeouts like this can be caused by connectivity issues inside an application, I restarted my local Redis instance as well. No effect. I quickly restarted the whole server. Nothing changed. Same problem. So I then checked what would happen if I accessed my website from within the server that it was hosted on. Locally, the website responded immediately.

Arvid:

If I just, you know, sent a localhost request, that worked. But using its public URL, podscan.fm, the connection issue persisted. It still took, like, 20 seconds at this point for me to get a response back. And I knew then that this was probably an issue beyond my reach. That was not a common, you know, software-as-a-service deployment problem.
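For reference, that local-versus-public comparison can be scripted in a few lines of PHP with curl. This is a generic sketch of the technique, not PodScan's actual tooling; the URLs and the Host header are stand-ins for the real setup.

```php
<?php
// Hedged sketch: compare the local path and the public path to the same app.
// Hitting 127.0.0.1 with a Host header skips DNS, the CDN, and the network in
// between; hitting the public URL goes the whole way around. If the local call
// is fast and the public one hangs, the app is fine and the problem sits in
// front of the server.

function timeRequest(string $url, array $headers = []): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_NOBODY         => true, // we only care about timing, not the body
        CURLOPT_TIMEOUT        => 60,
    ]);
    curl_exec($ch);

    $timings = [
        'connect_sec' => curl_getinfo($ch, CURLINFO_CONNECT_TIME),
        'ttfb_sec'    => curl_getinfo($ch, CURLINFO_STARTTRANSFER_TIME),
    ];
    curl_close($ch);

    return $timings;
}

// Run on the server itself: local request with a Host header vs. the public URL.
var_dump(timeRequest('http://127.0.0.1/', ['Host: podscan.fm']));
var_dump(timeRequest('https://podscan.fm/'));
```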

Arvid:

That was something else. And funny enough, my stress levels went down a little bit. I knew that there were still a few avenues I could go down to check, but I had the feeling that this was something that was done by somebody else. It was someone else's doing, and it likely wasn't intentional or malicious. There was no sign of an attack, no DDoS or anything, nor were there resource issues.

Arvid:

There was something else going on, and it was, like, beyond my pay grade, as they say. But to make sure it was not on my end, I still logged into my RDS instance, where I keep my database. That's Amazon's RDS, short for Relational Database Service: managed database hosting.

Arvid:

And I checked the metrics there. If anything, they had gone down, which kind of makes sense, right? Fewer requests make it to the server, and that means fewer database calls. So it wasn't a database issue either. So with my own software stack working, for better or worse, perfectly, I went one step up the ladder.

Arvid:

Right? And looked into Cloudflare itself. Or rather, before I looked at Cloudflare, I took a breather. I just went and had a coffee. I told my partner that I was in firefighting mode and then went back into my office, so that everybody knew what was happening if they needed to interrupt me.

Arvid:

I told my dog. She didn't care, but, you know, that's how it is. I get to have a little office buddy. And I considered that this might be part of a larger issue. Like I said, it probably wasn't me; it was something somebody else was doing. So I checked the status pages for all the services that I use: Cloudflare, Hetzner, even AWS, who knows.

Arvid:

I even looked at Twitter and Hacker News to see if there was a widespread issue. Sometimes, if there's a large outage somewhere, you find it first on Hacker News or Twitter, and then it slowly gets onto the status pages of the services that are actually affected. But there was no mention of any issues. And my experience of running SaaS businesses for 10-ish years now told me that if something changes with the stack, some configuration was changed somewhere within the stack. That's how this happens. And since I didn't change anything, I looked into the upstream partners that serve the PodScan website, Cloudflare and Hetzner.

Arvid:

Both have their dedicated networks and connectivity rules. They might have changed something. And in the Cloudflare dashboard, I looked for notifications, warnings, or "you are over some kind of limit" messages, but there was nothing there. On Hetzner, same story. The server metrics looked good.

Arvid:

There was no warning anywhere. And I was kinda stumped, because if something had happened, if some configuration had been changed, say, as a response to me going over a limit or doing something they didn't want me to, they would have at least sent me a message. I checked my email too. Nothing there. I was stumped.

Arvid:

Yeah. I didn't know what to do. Either someone up the chain had some network issues that they didn't wanna tell anybody about, or the traffic to my server had been artificially slowed down in its origin network. That server was one of several Hetzner servers I run, and the only one of those actually experiencing the issues; I checked the other ones, and they were perfectly fine. So I did what I always do when I have no idea what to do.

Arvid:

I went to Twitter. And I shared that I had this issue and I didn't know what to do. And within minutes, a lot of ideas came in, and one caught my eye. Somebody explained that this might be silent connection throttling from the hosting provider. That was something they had experienced before when using Hetzner's service for scraping operations.

Arvid:

And now, PodScan isn't technically a scraping tool. Right? Even though it kind of does pull in a lot of RSS feeds for podcast analysis. So you could call it scraping. It does download a lot, but I wouldn't put it past Hetzner to have some kind of automated detection system that silently makes it harder for scraping operations to succeed on their platform.

Arvid:

But, you know, I don't think I'm one. I mean, here's the thing: I will never know if this was what happened, because I haven't found anything out and they haven't told me anything about it. And 10 minutes after I started looking into this particular possibility, things changed a little bit. Before they did, I ran one more test, using a company called Speed Vitals, to check the time to first byte from several locations around the world.

Arvid:

Like, how long does it take for the first response to come back from the server? And the first time I ran it, there were timeouts and 70-plus-second delays from all over the world. My website was effectively down for most of the world, and I felt surprisingly calm about it at this point. Of course, I was agitated and frustrated. Something that I care a lot about wasn't working, and soon my paying customers would notice.

Arvid:

They hadn't yet, really, because PodScan is an alerting tool and the alerts were still being sent out. Right? It was the incoming web traffic that had slowed down a lot, but this wouldn't work forever. So, again, I tried to be calm, tried to not go over a manageable stress level. I went upstairs.

Arvid:

I grabbed another hot beverage, not a coffee, because I'm addicted to that stuff, I guess. And I told Danielle about it. And she, with the experience of having run a SaaS business with me, calmly told me that she knew that I'd figure it out eventually. And I love her for that. Like, how lucky am I to have a calm and measured partner to keep me from spiraling out of control in these situations?

Arvid:

Because I care about the stuff I do, and I take it very seriously. But she was like, you can do it. You'll do it. Don't worry. We went through this before, and we did.

Arvid:

Like, we had issues like this with FeedbackPanda back in the day. And sometimes it took us days to fix them, and customers got super upset. Yet we still got to sell the business for millions of dollars. So, you know, we know that this can work. So back to the office I went, ready to keep working on this.

Arvid:

I ran the time to first byte test again, just to make sure that it was still a problem, and the site was back. Everything was fine. Every location reported sub-second response times. You thought I was stumped earlier when I didn't know what was going on? I was equally stumped when everything just fell back into place.

Arvid:

It was super weird. It was super strange. It was like someone had pulled the plug from the bathtub and the water was just flowing as if nothing had ever happened. I went back to my browser, and it was just like before, like, under a second to respond. I went into the logs, and my transcription servers had just started to grab new candidates again, to transcribe them, while finally being able to deposit the finished transcripts that they had completed but couldn't report back.

Arvid:

So I breathed a very big sigh of relief and immediately started working on moving away from Hetzner. Because mind you, I had no proof of this being some kind of shadow banning or silent throttling. I still don't know to this day. It might just have been a network congestion issue in the part of the data center that my VPS was in. But I knew I was relying on something that was not trustworthy because it was just one server on one cloud provider.

Arvid:

I knew I had to diversify. I would keep my Hetzner server in its current configuration and always keep it up to date: if I deploy something to the PodScan repository, it also goes to my Hetzner server. But it would be a backup server, just in case I need to switch back to it at some point. My main server would move into the cloud that I had known and trusted for a while and that already had my data: AWS. After all, I've been running my database there for the whole time that PodScan has been around, and it has been super reliable.

Arvid:

And I think there are added benefits, and I'll get to that in a second. But with Laravel Forge, which I use for provisioning and orchestrating servers, and Laravel Envoyer, from where I deploy new versions, spinning up an instance on AWS was extremely simple. I just needed to adjust a security group or two so Forge would be able to connect via SSH, but that was quickly done. Within 20 minutes, I had a fully functional copy of my main server running on AWS. I tested it under a subdomain of podscan.fm just to see, you know, if it could work as an alternative.

Arvid:

And this morning, the morning after, I guess, after it had been running idle for half a day, I finally made the switch through Cloudflare: pretty easy, remapping the IP from one server to the other and just hitting save. That was a lot of fun. It was an absolute joy to see traffic slowly shifting from the old server to the new.
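I did this by hand in the Cloudflare dashboard, but if you ever want to script that kind of IP remap, Cloudflare's v4 DNS API can update an A record in a single call. This is a hedged sketch with placeholder zone ID, record ID, token, and IP; it's not what happened behind the scenes here, just one way to automate the same switch.

```php
<?php
// Hedged sketch: repoint an A record from the old server to the new one via
// Cloudflare's v4 DNS API. Zone ID, record ID, API token, and the IP address
// are all placeholders; the dashboard's "save" button does the same thing.

$zoneId   = 'YOUR_ZONE_ID';
$recordId = 'YOUR_DNS_RECORD_ID';
$token    = getenv('CLOUDFLARE_API_TOKEN');

$ch = curl_init("https://api.cloudflare.com/client/v4/zones/{$zoneId}/dns_records/{$recordId}");
curl_setopt_array($ch, [
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => [
        "Authorization: Bearer {$token}",
        'Content-Type: application/json',
    ],
    CURLOPT_POSTFIELDS => json_encode([
        'type'    => 'A',
        'name'    => 'podscan.fm',
        'content' => '203.0.113.42', // new server's IP (placeholder)
        'proxied' => true,           // keep Cloudflare in front of the origin
    ]),
]);

echo curl_exec($ch); // returns a JSON envelope with "success": true on a good update
curl_close($ch);
```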

Arvid:

And through AWS's routing magic, everything also got much, much faster. With my database being in the same location as my application, the round trip time dropped significantly, and some of my queries were cut down to about 20% of their prior duration. It was really cool. You can feel it on the website too. It's extremely snappy. So I'm coming out of this quite horrible incident with renewed confidence in what I have built.

Arvid:

Because first off, the service never broke. It was unavailable, sure, but not because it was overloaded or because it was in some error state. It was underloaded, really, and it was just working.

Arvid:

It didn't get any requests, but it was working. One massive insight that I got from all this was that I made a pretty good choice, one that some people may call premature optimization, but it really paid off this time around. And that was to make every single request between my main server and my 20-ish or so transcription servers queue-based. That was a really good idea. When a transcription gets started, a transcription server pulls something from an API, just a URL to an audio file, which it then downloads.

Arvid:

That's what my transcription servers do. They get the URL to the audio file. They download the file. They transcribe it, and they send back the text. That's the main purpose.

Arvid:

So whenever a transcription is finally created, I don't just send off an HTTP request back to my API to save it to the database. That HTTP request itself is wrapped in a job, which runs on a queue, and it will be rerun multiple times if it fails. And using Laravel Horizon and the queue retry and backoff parameters that come with it, every request will be tried more than 10 times, and the server will wait for up to 10 minutes between these attempts. The final attempt waits for a full day. That way, things can crash all over the place or slow down or whatever, but the valuable transcription data that I pay a lot of money to produce on these GPU-based servers is safe in a queue, ready to eventually be consumed.
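Here is roughly what that pattern looks like in Laravel. This is a minimal sketch with made-up class, property, and endpoint names, not PodScan's actual code; the key pieces are the public $tries count and the backoff() schedule, which the queue workers (and Horizon) respect when a failed job gets retried.

```php
<?php
// Hedged sketch: wrap the "report the finished transcript" HTTP call in a
// queued job so it survives a slow or unreachable main server. Names and the
// endpoint are illustrative, not PodScan's real internals.

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\Http;

class ReportTranscript implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    // Try the request a dozen times before giving up for good.
    public int $tries = 12;

    public function __construct(
        public string $episodeId,
        public string $transcriptText,
    ) {}

    // Seconds to wait between attempts: quick retries first, then up to ten
    // minutes, and a full day before the very last attempt.
    public function backoff(): array
    {
        return [60, 120, 300, 600, 600, 600, 600, 600, 600, 600, 86400];
    }

    public function handle(): void
    {
        // Hypothetical internal endpoint on the main server.
        Http::timeout(30)
            ->post('https://api.example.com/internal/transcripts', [
                'episode_id' => $this->episodeId,
                'text'       => $this->transcriptText,
            ])
            ->throw(); // a thrown exception marks the attempt as failed, so the queue retries it
    }
}
```

Dispatching it with ReportTranscript::dispatch($episodeId, $text) puts the HTTP call on the queue instead of running it inline, so an unreachable API just means the job waits and retries rather than the transcript getting lost.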

Arvid:

That was a big learning. Queue-based communication between microservices, you might call them, or internal APIs, has been really, really helpful. Look into this kind of system if you run PHP, or something like RabbitMQ if you use anything else, or maybe even a Kafka queue or one of the AWS offerings; they have their own thing there too. It's really useful to have a message queue to communicate between things, because then data is at least persisted somehow. Right?

Arvid:

There are other problems that come with it, like you get duplicates sometimes. It's either zero-to-one or one-to-infinity: you have this choice between making sure that things get delivered at most once or at least once. So that's an issue with message queues, and if you go the at-least-once route, the receiving side has to tolerate duplicates, roughly like the sketch below.
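One common way to live with at-least-once delivery is to make the receiving end idempotent, so a duplicate message just refreshes the same row instead of creating a second one. A hedged sketch, with a hypothetical Transcript model and column names:

```php
<?php
// Hedged sketch: an idempotent write on the receiving side of the queue.
// If the same "transcript finished" message arrives twice, updateOrCreate
// matches on the episode ID and refreshes the row instead of duplicating it.
// Model and column names are assumptions, not PodScan's actual schema.

use App\Models\Transcript;
use Illuminate\Http\Request;

function storeTranscript(Request $request): void
{
    Transcript::updateOrCreate(
        ['episode_id' => $request->input('episode_id')], // natural key to deduplicate on
        ['text'       => $request->input('text')]        // payload to insert or refresh
    );
}
```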

Arvid:

But hey, if you have a system that can handle that internal complexity, and you can test for consistency between these states, you might wanna go down the message queue route. It's been really, really helpful. I also enjoyed how easy it was to move away from Hetzner as a cloud provider. And I'm not moving away completely. I still run several parts of the business there, like my search engine; that will stay there.

Arvid:

But it was really fun to just move the main server to AWS within a couple of minutes. I made the absolute right choice in trusting the Laravel ecosystem here. With Forge and Envoyer, deploying stuff and making on-the-fly changes to switch servers over was comfortable, reliable, and functional. It was really good. Really enjoyed it. And, ultimately, maybe the most important part here was that I was glad I kept my stress levels under control throughout all of this.

Arvid:

And that allowed me to stay level-headed when facing a problem I couldn't solve immediately myself. I grew from this experience. Like, I feel more confident in my product and in my ability to deal with it. And the slightly elevated but controlled stress of it all helped me focus on stepping through the steps that I knew I needed to take, calmly and without losing sight of the larger issue. One thing that I recommend is writing a post mortem, kind of like this podcast.

Arvid:

Right? The story of it. Just writing it down. Taking all your learnings and persisting them onto paper or into a document. Or just write something into your own business documentation.

Arvid:

I wrote a few more emergency standard operating procedures right after the problem was resolved. So they will help me or future me, I guess, when dealing with similar issues, hopefully, in an equally calm state of mind. And that's the important part. Right? You cannot control these externalities.

Arvid:

That's the nature of the term externality. They are things out of your control. Cloudflare, Hetzner, Forge, AWS, they all could do something, intentionally or unintentionally, that creates an issue for you, or more work, or a challenge. But running around in a panic won't solve that. That's the stress that gets to you.

Arvid:

What you wanna do is have a cup of tea, tell yourself that you got this, and then tackle it like the professional that you are. And that's it for today. I wanna briefly thank my sponsor, acquire.com. Imagine this. You are a founder who built a solid and reliable SaaS product that is working all the time or most of the time.

Arvid:

You've acquired customers, and you're generating really consistent monthly recurring revenue. But there's something, there's a problem. You feel you're not growing, personally or as a business, for whatever reason, maybe lack of focus or lack of skill or lack of interest, and you just don't know what to do. Well, the story that people would like to hear is that you buckled down and reignited the fire, but realistically, the situation you might be in might just be stressful and complicated. And too many times, the story here ends up being one of inaction and stagnation, until the business itself becomes less and less valuable and, over time, completely worthless.

Arvid:

So if you find yourself here already, or you think your story is likely headed down a similar road, I would consider a third option. And that is selling your business on acquire.com. Because if you capitalize on the value of your time here today, that is a smart move for you as a founder. And somebody else gets to benefit from it too, because they get a business that already does something really cool. So acquire.com is free to list.

Arvid:

They have helped hundreds of founders already. Go to try.acquire.com/arvid and see for yourself if this is the right option for you, right now or in the future. Thank you for listening to The Bootstrapped Founder today. You can find me on Twitter at @arvidkahl, that's a r v i d k a h l, and you'll find my books and my Twitter course there too.

Arvid:

If you wanna support me and the show, please subscribe to my YouTube channel, get the podcast in your podcast player of choice, and leave a rating and a review by going to ratethispodcast.com/founder. It makes a massive difference if you show up there, because then the podcast will show up in other people's feeds. Any of this will truly help the show and me. So that would be great. Thank you so much for listening.

Arvid:

Have a wonderful day. And bye bye.
