348: Observability in Software Businesses

Arvid:

Hey, it's Arvid and you're listening to The Bootstrapped Founder. I didn't see it coming. That is something that I had to admit to myself a few times recently. Over the last couple weeks, I've been experiencing several issues with PodScan that only surfaced because I didn't really have any observability on my system. Well, at least that's what I know now, because it always takes a while to see the bottom part of the iceberg.

Arvid:

And with PodScan, most scaling issues only showed themselves in a delayed fashion. That's the problem with this kind of system that I run. The problems that came up were consequences of issues much further down the line that happened much earlier. And this experience has taught me that observability isn't just a nice to have for my data software business, it is crucial. So as I spend more time building this heavily database driven AI based business on top of technologies that may be rather new and untested, I'm realizing the importance of robust observability.

Arvid:

This is especially true given that I'm building an architecture of a size that I've never built before. PodScan is the biggest thing I've ever done. Everything I learn is through doing and running into challenges and then facing them head on and hopefully finding a solution, and most of the time that actually works. So let me share my early stage learnings about system observability in this distributed data centric system of mine. Even if you don't have a software business or might not operate with millions of data feeds every single day, I think there's still something insightful here that I will, not just likely but guaranteed, take into my future business efforts.

Arvid:

So I guess it's also gonna be really, really good for you. Let's talk about being overconfident in your ability to see problems for starters. For the longest time, I thought, I'll see when things go wrong. I will notice it; it will not escape my sight. But as my systems got more complex, with more moving parts and individual components with varying scaling capacities, I very quickly realized that I need to find ways to either automatically detect and mitigate problems or recognize them early on.

Arvid:

Either of these needs to happen. Ideally, I wanna spot trends and patterns or moving thresholds so I can see that if something continues running for a couple more weeks and slowly becomes more problematic, well then I'm gonna have an issue 2 weeks from now. And this foresight ideally allows me to deal with potential problems proactively. But if I knew what problems were to come, I would make sure they would never happen. Right?

Arvid:

So why then do I still have issues? Well, if that level of prediction is not possible, then I need monitoring, observability in place that immediately alerts me of a problem, either a problem right at my doorstep or one in the making. So I'm trying to build all of these systems, but as usual with observability, one of the core problems here is that often you don't precisely know what to observe. For me, it really is a question. Like, I have a really sizable database and I have a search engine with its own database that is kind of structured differently.

Arvid:

And then I have this fleet of back end servers, over 30 servers that all have their own databases, and some of them have no databases but a lot of caching, which tends to also be a database. The question becomes, even just thinking about databases, which of these things do I need to observe to see if there are any issues? Should I look into every single one of them, or are 1 or 2 of the main ones enough for me to see if things might go out of whack size wise or if the data is corrupt or inserted correctly or incorrectly? Yeah, that's the question, right? Obviously, I can't observe every single thing everywhere.

Arvid:

That's just impossible; for performance reasons, you can't trace every single thing. So the first question that always comes up for me is, what are the things that could potentially cause trouble? And sometimes these future issues are extremely clear from the start, or at least there's a very high likelihood that these things might be problematic in the future. For example, when you're taking a long list of items, a whole list of all items of a certain kind, and then do something with that entire list, that tends to be something that, once at scale, becomes very problematic. And you'll see why I'm saying this, but just as an example, if you test something on your local computer and you have a 100 items or a 1000 items in your database, well then it's clearly not a problem to do an operation on each of them, right?

Arvid:

Do a map reduce or summarize or do an average or anything like this. There's enough space in RAM and you have enough resources for that. But what if you're loading a 100,000 items or 2,000,000 items? There might still be enough space in your memory for this because we have machines with like dozens if not hundreds of gigabytes now, but the operation on each of these will consume memory and time as well. So that's gonna multiply.
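To keep that multiplication from blowing up, the usual fix is to process the list in bounded chunks instead of loading everything at once. Here's a minimal Python sketch; the `fetch_chunk` callback and the in-memory `rows` list are hypothetical stand-ins for a real database query with LIMIT/OFFSET:

```python
def process_in_chunks(fetch_chunk, chunk_size=1000):
    """Compute an average over a huge table without holding it all in RAM."""
    total, count, offset = 0, 0, 0
    while True:
        # Each call pulls at most chunk_size rows, e.g. SELECT ... LIMIT ? OFFSET ?
        chunk = fetch_chunk(offset, chunk_size)
        if not chunk:
            break
        total += sum(chunk)
        count += len(chunk)
        offset += chunk_size
    return total / count if count else 0.0

# Hypothetical data source standing in for a database table.
rows = list(range(10_000))
avg = process_in_chunks(lambda off, n: rows[off:off + n])
```

Only one chunk is ever resident in application memory, so the same code that handles 1,000 rows locally also handles 2,000,000 in production without a tenfold RAM bill.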

Arvid:

So is it going to take the one second that it takes on your local computer once you push it to production, or is it going to take 10 minutes there and block the database for everything else? This might not be immediately obvious from the code that you're writing, and you'll have to learn this by just recognizing this concept of things maybe exploding as things scale. But every single time I find such a glitch, I realize that I kind of should have seen it coming, or I could have seen it coming. And for me, that's a tooling question. I can prepare for this.

Arvid:

The important thing is to understand that there are easy targets right from the get go that I should be monitoring. And if that is true, well then I should set up a kind of reporting system that has some sort of intake where I can reliably push reporting information into so that I can later look at it. There are many options for this, like Prometheus or the ELK stack; all of these are fine if you set them up and push data in there. It's always better to have data in your monitoring system than not to have it in there, even when everything is perfectly fine. For example, you could just push the number of items in any particular database as a JSON object or something into your chosen system whenever you do an interaction with that database. Whenever you do a major summarization or you count, you push that into your metrics database as well. Then you can use visualization tools to show either the number that you currently have or a trending graph of where that number is going, what the delta is between now and the last time you checked, like a week ago or 2 months ago. You get historical data.
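As a rough illustration of that "always push, even when things are fine" idea, here's a tiny Python sketch of a metrics intake. `MetricsSink` is a hypothetical stand-in for a real system like Prometheus or the ELK stack; the point is that every observation becomes a cheap, timestamped point you can later diff:

```python
import time


class MetricsSink:
    """Minimal stand-in for a metrics intake (Prometheus, ELK, etc.)."""

    def __init__(self):
        self.points = []

    def push(self, name, value, ts=None):
        # One JSON-serializable point per observation; cheap to emit on every count.
        self.points.append({"metric": name, "value": value, "ts": ts or time.time()})

    def delta(self, name):
        # Difference between the newest and oldest recorded value of a metric.
        series = [p["value"] for p in self.points if p["metric"] == name]
        return series[-1] - series[0] if len(series) >= 2 else 0


sink = MetricsSink()
sink.push("episodes.count", 1200, ts=1)  # hypothetical count from a week ago
sink.push("episodes.count", 1450, ts=2)  # hypothetical count from today
```

With points like these in place, a visualization tool can plot the trend, and `delta` gives you the "where is this number going" answer the episode describes.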

Arvid:

And depending on something very important, this information might make you sleep soundly or panic slightly. And that is context because the number alone won't really help you. Its current value itself tends to be a somewhat binary thing. Either it's bad or it's good. And that changes over time.

Arvid:

If you have 0 items in your database, that usually is a problem. Maybe not when you just start out, but couple years into the lifetime of your business, 0 items, problem. If you have 200,000,000 items in your database, that might be a problem too. Right? Because it's way too much, even for a lot of substantial services out there.

Arvid:

But it really doesn't matter if it's 5000 or 6000 or maybe 20,000, 30,000; there's a spectrum in between these extremes. And those thresholds, too little, too much, or just right, are yours to determine and constantly adjust. A newly founded SaaS might need a warning when the number of projects eclipses a few hundred, because you have like 10 customers. How can you have 200 projects? But a 2 decade old SaaS business, well, they can easily expect thousands of new projects to be created every month.
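Those adjustable "too little, too much, just right" bounds can be captured in a few lines. A minimal sketch, with made-up bounds for a young versus a mature SaaS:

```python
def check_threshold(value, low, high):
    """Classify a metric against adjustable bounds; revise them as the business grows."""
    if value < low:
        return "too low"    # e.g. 0 projects years into the business
    if value > high:
        return "too high"   # e.g. 200 projects with only 10 customers
    return "ok"


# Hypothetical bounds: a newly founded SaaS vs. a two-decade-old one.
young = {"low": 1, "high": 300}
mature = {"low": 1000, "high": 50000}

status_young = check_threshold(200, **young)    # normal for a small business
status_mature = check_threshold(200, **mature)  # alarming for a big one
```

The same raw number means different things depending on which bounds you feed in, which is exactly why the thresholds, not the metric, are what you keep revising.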

Arvid:

So revise your thresholds as they are reached recurringly; the moment you run into these thresholds more and more, you might need to adjust them. And some issues aren't as easily found as these things, the things that you learn as you build the business and the product. I experienced this a couple of months ago when I had a major memory leak with PodScan. It was debilitating. It was really, really bad.

Arvid:

And it caused my servers to actually stop working. That's how bad of a memory leak it was. And the RAM of my server, which was a sizable machine at that time, was just exhausted very quickly, over and over again. And every time I restarted the process, the PHP-FPM process that kept all of this stuff going, and sometimes even fully rebooted the machine, more and more memory would be consumed almost immediately. It was crazy.

Arvid:

Fortunately, I found a way to dampen the leak so I could spend my time investigating it. It was the PHP-FPM configuration. If you ever run into this and you are running on Laravel or have PHP going, PHP-FPM can be configured to restart processes after a number of requests. And I set that to something really low so it would constantly restart the processes internally, so they could never reach this massive RAM problem. But I eventually traced it to my internal caching logic; my metrics caching was at fault.

Arvid:

I built this, ironically, to figure out when things go wrong, and it itself caused an error. That was quite hilarious. It was implemented in a way that would load a lot of unnecessary data into RAM and keep it around for a bit. Right? It would load, like, the last hour's items, then add one item to that, and then persist this whole list into Redis.

Arvid:

So if the list was very long, like tens or hundreds of thousands of items, it would always load every single item into RAM, add one to it, and then kinda write it into Redis again. It was, frankly, not well implemented. I consider myself a 0.5x developer, at least on those days. And if enough processes were started with this data, obviously, that would cascade into a memory leak. So I fixed that by fixing the underlying caching logic, using Redis' internal commands instead of loading all the data, which is: give the new item to Redis and let it put it into its own data. And everything returned to normal throughout the whole system, because I used this in a lot of locations.
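The difference between the leaky pattern and the fix can be sketched in a few lines of Python. `FakeRedis` below is an in-memory stand-in (the real Redis client exposes analogous list commands like `LRANGE` and `RPUSH`); the point is where the memory cost lands:

```python
class FakeRedis:
    """In-memory stand-in for a Redis server's list commands."""

    def __init__(self):
        self.lists = {}

    def lrange(self, key):
        return list(self.lists.get(key, []))

    def rpush(self, key, item):
        self.lists.setdefault(key, []).append(item)


def append_item_leaky(store, key, item):
    # Leaky pattern: load the whole hour's list into application RAM
    # just to add one item, then write the whole thing back. O(n) per call.
    items = store.lrange(key)
    items.append(item)
    store.lists[key] = items


def append_item_fixed(store, key, item):
    # Fixed pattern: hand the new item to the store and let it append
    # server-side. O(1) memory in the application process.
    store.rpush(key, item)


leaky_store, fixed_store = FakeRedis(), FakeRedis()
for i in range(3):
    append_item_leaky(leaky_store, "last_hour", i)
    append_item_fixed(fixed_store, "last_hour", i)
```

Both produce the same list, but the leaky version holds the entire list in every worker process that touches it, which is exactly how tens of thousands of cached items turned into a fleet-wide memory leak.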

Arvid:

But now, whenever I build something that's doing an averaging or summarization or adding an item to a list operation, I try to figure out immediately how big that list could potentially get. And if I can maybe stop the list from being too big before it gets too big, either by chunking or using partials or pagination, wherever something is potentially massive in memory, I'm now looking at this. And this is what I mean. I do learn to spot problems by experiencing them along the way, which might mean you will run into them, and you will have to run into them, before you can adequately spot and solve them. It's one of these things where you have to run into the wall a couple times before you figure out where the door is, because you don't know where the door is.

Arvid:

You didn't even expect it there. So I've trained myself to see this as early as I could. And before that happened, I was like, oh, it'll be fine. Those processes will eventually free the memory and be good. And that was true most of the time, but it was just that they were not fast enough with it, and that cascaded.

Arvid:

The overwhelm of the system came from me just having too many moving parts for the system to self regulate. And I have found one thing, a little cheat here, I guess, or just a little tip on how to deal with this. Chat based LLM platforms like Anthropic's Claude are pretty good at spotting performance bottlenecks in code. You don't even have to give them the metrics; you can just give them the code, and then they can look at it and tell you, okay, your problem might be here, and you might want to rebuild this, maybe refactor this into that. So I sometimes throw the full source code of a core component of my app into Claude and tell it to investigate the 5 biggest potential performance risks or bugs in that code.

Arvid:

That is actually how I refactored this caching thing that broke when I had this memory leak. I took the whole module, the caching service that I had written, threw it into Claude and told it, hey, rewrite this so it doesn't explode on me. And it did. It found the right commands, commands for Redis that I'd never heard of before, and put them in, and then it worked. Sometimes the mere approach to how Claude argues for its answers allows me to see something that I hadn't thought of before. And I'm picky.

Arvid:

I don't take the full answer, I don't take every single thing that Claude tells me, but some of those things are actually quite useful, and I certainly hope that IDEs will integrate this as a constant background process soon. I would love to be able to code and have this thing tell me every morning, hey, I ran a couple experiments overnight and these things might be improved. I saw a couple things that may be problematic. I want that, my IDE, to work for me while I'm working. But for now, I guess I have to hunt for bottlenecks myself.

Arvid:

And generally, the problems that you run into and that might cascade into chaos are resource problems. It's rarely ever a logic problem itself that causes massive issues. And even if it is, it then becomes a resource problem, so you always have to deal with resource problems anyway. And resource problems that are code bound, where code impacts the performance of the system itself and not data integrity or accuracy, which I would call data problems, tend to fall into a few categories. It's either an issue of compute power: you're causing too much compute, your CPU is overwhelmed, and the system gets locked up because you just can't keep up with the calculations needed. Or it's a memory availability problem: you have too many things in memory, like my memory leak, and that causes the system to lock up by being unable to allocate more memory to critical processes. Or, probably one of the most common ones, it's a disk issue. Either in terms of operations, IOPS, like reading or writing too much data that your disk just cannot produce or persist fast enough, it can't handle the overwhelming amount of data that needs to be written or read, or, and this is the most likely one, size: you run out of hard drive space.

Arvid:

Now if you have 5% of your disk left and everybody's kind of fighting for the rest of it, your temp directory is trying to grab some of that space, or it's just full and your processes are trying to persist data to disk that they cannot persist. Those are the 3, right: compute, CPU; RAM; or hard drive. And then if you add GPUs, graphics cards, into the mix, the resource problem becomes even stronger, because graphics cards in particular are bottlenecks on most every system that I've encountered that uses a GPU, right? I use it a lot for transcription, obviously, but there are also vector databases or retrieval augmented generation, all these things that use GPU processing power, and the memory on the card becomes a bottleneck all in itself. It's like a little computer in a computer yet again.

Arvid:

The solution to most of these problems for me has been queue systems, putting queues in place. You can queue almost anything in your back end system and in your web application, and every framework available allows you to queue background processes using tried and tested methods. PHP, the thing that I'm using, has Laravel Horizon and Laravel's queues, Ruby has Sidekiq, Python has Celery, and I think JavaScript has BullMQ. There are all these super tested, highly performant libraries that can do queuing for you, and you should use them, because you can queue any regular calculation or operation of your system unless it belongs to a request that needs that data as an answer, and most of the time you don't really need this for heavy calculations. You can use placeholders. If you're building, for example, an AI based image product that generates AI images, you can do this in the background, right?

Arvid:

You put a placeholder loading image into your application and then fetch generated images as they are created. And you can do a lot of things in the background. As long as you restrict your queue to running only a couple of these operations at the same time, which then prevents runaway memory leaks, like mine, or other resource constraints that could affect other parts of your product. And queue systems are also innately measurable. They come with this tooling.
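That "only a couple of these operations at the same time" restriction is the heart of it: a queue with a capped worker pool. Here's a minimal Python sketch using the standard library; real frameworks like Horizon, Sidekiq, or Celery do the same thing with persistence and monitoring layered on top:

```python
import queue
import threading


def run_queued(jobs, workers=2):
    """Run jobs through a queue with a fixed pool, so at most `workers` run at once."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = q.get()
            if job is None:  # sentinel: no more work
                break
            out = job()
            with lock:
                results.append(out)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for job in jobs:
        q.put(job)       # jobs wait here instead of all running at once
    q.join()             # block until every queued job is done
    for _ in threads:
        q.put(None)      # shut the workers down
    for t in threads:
        t.join()
    return results


results = run_queued([lambda i=i: i * i for i in range(5)], workers=2)
```

No matter how many jobs arrive, only two execute concurrently, which is what prevents a burst of work from turning into a runaway memory or CPU problem.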

Arvid:

For Laravel, which I'm using for PodScan, the system is called Laravel Horizon, and that comes with its own endpoint where you can see how many processes are running, what they're doing, and how many are waiting in the queue. They even visualize this, and all these queuing tools come out of the box with features like that. So it's definitely very useful to build queuing as a first class citizen into your application. And if your resources are doing all right, then the queue will go through immediately anyway; you will barely notice it. It has almost no latency, they're optimized for this, right? And if you have a resource problem at any point, then the queue will help you deal with the problem until you find a solution by just queuing up the things that need to be done, right?

Arvid:

They're gonna be persisted somewhere. And most queuing tools also have notification systems built in for when queues are overwhelmed or when there are too many items in any given queue. This observability is built in from day 1, and it doesn't cost you anything. I highly recommend looking into queuing, just learning how to use it. And if you already use it, use it a little bit more. Often, just queuing a couple more things that you would have processed right there and then can really mean a lot more flexibility for the system. It also means you don't need as big a computer or as big a server or as big an instance of whatever you're using, because queuing allows you to use resources when they are available instead of having to provision a lot of resources just in case you have to do 10 things at the same time.

Arvid:

So whenever it comes to observability, you wanna be able to see historical data from when things are doing well and from when they're not so you can compare them. It doesn't really help to know only what the last 10 minutes of metrics were when you deal with an avalanche of errors without knowing what the normal state looks like. You need something to actually compare it to. So having access to historical data at any given time will help you investigate problems and solutions to see if they're doing what they're supposed to be doing. And if you implement the fix, well, you need to see if it goes back to normal levels.

Arvid:

Historical data does this. Personally, I try to track all of my relevant system metrics on a per minute basis. And I process them into a database that exists exclusively for those metrics. Every minute, a process runs that pulls all the sizes of my queues that I run, the number of items that I've cached over the last hour, the number of transcriptions I run, the extractions that I've done, the API requests that I got, the number of times my search engine got used, number of customers that sign up, all of this exists in a database where every minute, I get new information about every single one of these numbers. And this has been extremely helpful for me to plot them out and graph them so I can, over time, see developments, just see where things are going.
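A per-minute metrics collector like that can be very small. Here's a sketch using an in-memory SQLite database standing in for a dedicated metrics store; the metric names and values are made up for illustration:

```python
import sqlite3

# A database that exists exclusively for metrics (in-memory here for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts INTEGER, name TEXT, value REAL)")


def record_snapshot(conn, ts, snapshot):
    """Persist one per-minute snapshot of every tracked number."""
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?, ?)",
        [(ts, name, value) for name, value in snapshot.items()],
    )


# Two hypothetical minutes of PodScan-style metrics.
record_snapshot(conn, ts=60, snapshot={"queue.transcripts": 120, "signups": 3})
record_snapshot(conn, ts=120, snapshot={"queue.transcripts": 90, "signups": 5})

# Historical series for one metric, ready to plot or compare against today.
history = conn.execute(
    "SELECT ts, value FROM metrics WHERE name = 'queue.transcripts' ORDER BY ts"
).fetchall()
```

Run the `record_snapshot` call from a cron job or scheduled task every minute and you get exactly the per-minute historical series the episode describes, queryable at any time.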

Arvid:

It's really interesting with PodScan to see the ups and downs of the transcripts per hour. That's something that I did not expect, but I saw it in my data. Not just the transcripts that I transcribed, that's a fairly steady number because, you know, I have a fleet of 30 servers and they can only do so much. But even just how many new podcast episodes I detect every hour changes throughout the day, because new podcast episodes, at least English speaking ones, are usually released between 9 am and noon in the US rather than at night. Nobody releases their podcast at night, or few people do.

Arvid:

So in the morning and around noon, I get this massive avalanche of podcast episodes coming in and that affects my queues. Right? I can see them growing. And then in the afternoon, those queues start to become really small and the system can go into the backlog and go into older episodes and transcribe them. And this creates new situations because there's a different kind of quality that I want to use for transcribing older episodes and therefore I can use more of my energy to transcribe more of them at the same time, which again affects my stats and my metrics.

Arvid:

And I know this because I have the data, because I can track it. And recently I've started adding a lot of things I didn't track before. One of them is external systems that my system interacts with. Over the last week or so, I had a massive problem with my search database, which started when one of my users told me that they had a problem with certain items being missing from search but being present on the API. Like, the API uses search too, so sometimes they were present, sometimes they were not and that is a problem.

Arvid:

So I tracked it, and I traced it to a queue on the database, the search engine itself on my Meilisearch instance, being overwhelmed. There were a couple million items in that queue, like 6,000,000 items. Obviously, that kind of froze up, because even that queue is persisted somewhere, and 6,000,000 database items that contain a lot of transcript data just eat way too much disk, and there was a problem. I really didn't know why that happened, and because I wasn't tracking it before, I didn't know what the queue looked like a couple days before or a month before. All I could do was say, okay, I guess we just empty the queue, and in doing that, I had to reimport a significant number of items, like 6,000,000 or more; it was a lot of items. Had I known that development, had I seen the number and been able to see where it was at any given point along the way, I probably could have dealt with this much earlier, before it became a problem.

Arvid:

I probably would not have run into a situation where, for more than a day or 2, data wouldn't have been synchronized. And this experience taught me that I needed to track this information, external information, and make it part of the logic of my internal application itself. If you have different kinds of queues and your application can only really understand one kind of queue, its own internal queue, then if the other, external queue is not available or overwhelmed, it will still try to send data there, because it doesn't know. So I taught the system to see both queues, its own and the external ones. So now I am not sending new items to synchronize with my search database if it's already handling a sizable queue.

Arvid:

Now I'm waiting for that queue to go down and it needs to go below a certain threshold for me to send more items so it will never be over that threshold. And this helps me with backpressure and overflow prevention which I need to build into my system to guarantee this interaction between my internal system, my server, and the external system, the search engine queuing thing. And I think this is very common for any distributed architecture. You run into this problem all the time. There is something external that has its own little mind, and you need to make sure that it's not overwhelmed before you integrate it into your system or before you interact with it.
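That backpressure gate boils down to checking the external queue's depth before sending anything new. A minimal sketch, with a plain Python list standing in for the external search engine's indexing queue and a made-up threshold:

```python
def sync_items(pending, get_queue_depth, send, threshold):
    """Push pending items only while the external queue stays below the threshold."""
    sent = 0
    while pending and get_queue_depth() < threshold:
        send(pending.pop(0))  # hand one item to the external system
        sent += 1
    return sent               # whatever remains waits for the next sync run


# Fake external queue standing in for the search engine's indexing backlog.
external = []
pending = list(range(5))

# With a threshold of 3, the gate stops sending once the external queue fills up.
sent = sync_items(pending, lambda: len(external), external.append, threshold=3)
```

Run `sync_items` periodically and the external system is never pushed past its threshold; the leftover items simply stay in your own queue, which you already observe, instead of piling up invisibly somewhere else.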

Arvid:

And since we're talking about externalities, let's talk about the fact that sometimes things are just out of your control. Observability is great, but it's not only about visualization and hoping that you find patterns; it's about literally alerting you and getting you out of whatever you're doing if there's an actual problem with your business or product or one of the vendors that you use. From any single function inside your code base failing to AWS data centers being flooded, you need to know about anything along the way. If the RAM of any of my servers that I'm operating is over 80% for 5 minutes, I want to be informed. I wanna be sent an email, and I wanna be sent an email too if one of my domains goes down for a minute or when AWS has connectivity issues.
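A "RAM over 80% for 5 minutes" rule is just a sustained-threshold check over the most recent samples, which keeps a single noisy spike from paging you. A minimal sketch with hypothetical per-minute readings:

```python
def sustained_breach(samples, limit=80.0, window=5):
    """True if the last `window` samples are ALL above `limit` (e.g. RAM % per minute)."""
    return len(samples) >= window and all(s > limit for s in samples[-window:])


# Hypothetical per-minute RAM usage in percent; the last 5 minutes are all over 80.
readings = [70, 85, 86, 90, 88, 91]
alert = sustained_breach(readings, limit=80.0, window=5)  # time to send that email
```

Feed this one new reading per minute from whatever collects your metrics, and fire the email (or page, for critical systems) only when it flips to true.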

Arvid:

For certain critical issues, beyond that email, I also wanna get paged or get a call. And this alerting system that I've set up around these things recently helped me when I was reimporting all of this data into my search engine queue. I was importing so much data that it was overwhelming a system that I did not expect to be overwhelmed. My observability system for Horizon, right, the Laravel thing that handles the queuing mechanics, checks my local queue, and my application server had a problem because it kept succeeded jobs, like finished queue items, around for too long with too much data in them. The queue was growing, obviously, because I was posting all of these items into the queue for them to be sent over to my search engine, and the RAM of the system there was growing into the 80% range because the queue was configured to keep old successful jobs in memory too.

Arvid:

And when you import tens if not hundreds of thousands of items at the same time, that's a lot of stuff to keep in memory, even if it succeeds. So I needed to deal with this. I got an email around 10 pm; I had just gone to bed, I wasn't asleep yet, I was on my phone. So I saw the email, quickly got out of bed, and figured out what was going on. I made a configuration change to get those finished items out of memory immediately after they were dealt with, and then I went back to bed. Had I not done this, I probably could have experienced actual server faults, because memory again would have been so high that my application might have been starved of memory, and that's when things start to break down.

Arvid:

Avoided because of an alerting email. So it's really important to set those up, but nobody has the means to observe all the things all the time. So make sure you see critical errors, and make sure you don't see non critical ones. That's like my mantra with this. I wanna see critical stuff immediately, and anything that is non critical or that I can look into tomorrow is not allowed to alert me.

Arvid:

I have to check it out myself. It's kind of push and pull: push for critical, pull for non critical. And if you log everything, you will never look at any of it. And if you look at nothing, you won't be able to track things as they are happening. So it has to be somewhere in between.
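That push-for-critical, pull-for-non-critical split can be expressed as a tiny router. A sketch; `notify` stands in for whatever pages or emails you, and the backlog is whatever you review on your own schedule:

```python
def route_alert(severity, notify, backlog):
    """Push critical alerts immediately; queue everything else for a later review."""
    if severity == "critical":
        notify()                # page, email, phone call: interrupt me now
        return "pushed"
    backlog.append(severity)    # looked at tomorrow, on my schedule
    return "pulled"


pages = []
backlog = []
r1 = route_alert("critical", lambda: pages.append("page"), backlog)
r2 = route_alert("warning", lambda: pages.append("page"), backlog)
```

The discipline lives in the classification, not the code: everything you mark non-critical must genuinely be safe to ignore until tomorrow, or the router just becomes another noisy inbox.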

Arvid:

Always look at how things might scale. That's one of the biggest things I can tell you. Anything you build, imagine it today with the load that you currently have, then 10 times that, then a 1000 times that. Is it potentially problematic? Might there be an issue?

Arvid:

If so, you have to deal with this, or at least look at it, from day 1. Make sure you don't overwhelm your system in the future by loading too much data. It's a pet peeve of mine. And be prepared to adjust your thresholds for when things are too high or too low. Observability isn't just about seeing what's happening right now; it's about predicting and preventing issues before they become critical problems. For that, you need data.

Arvid:

And building and maintaining a complex distributed system, that's a journey of constant learning and adaptation. A few years from now, a few months from now, I'll probably have lots more to say about monitoring and alerting that I just don't understand yet; that will come with being exposed to these issues as they happen. But it's always worth looking into today, at any scale. By implementing robust observability practices, you are not just gonna solve today's problems, but you're setting yourself up for success as your system grows and evolves. So take a look.

Arvid:

And that's it for today. Thank you for listening to The Bootstrapped Founder. You can find me on Twitter at Arvid Kahl, a r v i d k a h l, and you'll find my books and my Twitter course there, too. If you wanna support me and the show, please tell everyone you know about PodScan.fm and leave a rating and a review by going to ratethispodcast.com/founder. Please go there right now and leave a rating and a review.

Arvid:

Ratethispodcast.com/founder. Really appreciate it. It makes a massive difference if you show up there because now the podcast will show up in other people's feeds, and that's where it should be. Any of this helps the show. Thank you so much for listening.

Arvid:

Have a wonderful day, and bye bye.

Creators and Guests

Arvid Kahl
Host
Empowering founders with kindness. Building in Public. Sold my SaaS FeedbackPanda for life-changing $ in 2019, now sharing my journey & what I learned.