337: Doing Things that Don’t Scale …Unintentionally

Arvid:

No matter what I build, no matter what system I run it on, I need to know how much is being consumed and what the trending patterns are. Hey, I'm Arvid, and you're listening to The Bootstrapped Founder. There's this famous saying by Paul Graham from one of his most well-known essays. It's a command to all founders and startup operators: do things that don't scale.

Arvid:

He explains that at the beginning of starting a business, there are things you should do as a founder that you know will never work once you have hundreds or thousands of customers, but it is precisely these things that make it happen eventually. The learning from these hands-on projects then allows you to build a repeatable process that can be automated or handed over to others. You do things that don't scale so you can do things that do scale. And this entrepreneurial wisdom works wonders in our indie hacker community. You see it in small indie hacker projects where the founder helps people integrate the system or service into their stack, and you see it on a larger scale too in the VC-funded world, like with Airbnb, which initially relied on the founders going to people's homes, taking photographs, and then listing the homes on their website.

Arvid:

Doing things at scale can only happen if you do things that don't scale first. That seems to be a truism for all kinds of entrepreneurial efforts. Over the last couple of days, though, I've been experiencing the technical flip side of that wisdom while building PodScan. I've been talking to a lot of clients and helping them implement the product, that's the business side of things, and I know that I could never do this with dozens or hundreds of these clients simultaneously. So far, I've been following that advice, and it has really worked in that department. But, unfortunately, it also bled into my development work, and that was scary. Because here's the problem with the difference between development and operating a business.

Arvid:

It's very easy in development to build something that works perfectly well for a couple of customers or your local testing environment, but has horrendous consequences when it actually runs in production and can create a lot of trouble down the road. Last week, here's the story, I noticed that every now and then my website got really, really slow, then recovered, then slowed down again. It still worked, but it had slowed down to a point that I didn't think was normal anymore. You know, like when you know how fast something should respond, but it's just twice as slow? That was the situation. I looked at the server and saw that every couple of minutes, one of the many processes in the pool that serves website requests would grab more and more memory until the process was either stopped, ran out of memory, or started impeding the function of other programs on the server.

Arvid:

It was a memory leak. And for a couple of days, I struggled just trying to keep that leak at bay. I tried everything: I set up a new server with more resources just to see if it was a resource-constraint problem, which it wasn't, and I checked my code and looked for potential reasons, but really struggled. I tried different configs and all of that. I tried restarting, and it was really hard to figure out.

Arvid:

It always happened again after a couple of minutes. Thanks to building in public and, I guess, through that having a lot of developers follow my journey, all I needed to do to get to the solution was to just talk about it, to just complain. I posted about it on Twitter, and I got some great feedback in all different kinds of directions, one of which helped me figure out the problem: I had built something that didn't scale. So here's what happened. I had created a caching system to track how many podcasts I transcribe every day and every hour, and that system kept all the IDs of those podcasts in memory so I could tell which podcasts had been transcribed and had data extracted from them over the last 24 hours. And that's some data.

Arvid:

It's not a lot of data, but it's some. Right? This works perfectly fine on any local development system. For me, that's my Mac Studio, where maybe a couple dozen podcasts an hour get transcribed, that's how fast the Mac Studio is. But in production, I have a fleet of over 30 servers that can transcribe up to 5,000 podcasts an hour. So, suddenly, I had thousands of IDs plus some additional data in the cache per hour.

Arvid:

And my daily cache, which is 24 times that, was over 100,000 items. Right? I transcribe over 100,000 podcasts a day, and I kept all of these IDs in there. And the issue here was not that there was a lot of data. Thousands, tens of thousands, even hundreds of thousands of IDs.

Arvid:

That is normal for computers to deal with, because they are really fast at calculating stuff. The issue was how I had implemented keeping track of these things. Instead of just sending an update command to my Redis instance, where I kept this cache, my system actually loaded every single item from the cache into the process that was currently serving a request. So I would read 100,000 items, then add one new one, the one that just came in, the one that I had just transcribed, and then save 100,001 items back to the cache. And that's about 30 megabytes of data: 30 megabytes read, one tiny little bit added, and then 30 megabytes plus one sent back.
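Here's a sketch of what that read-modify-write pattern looks like. The actual stack is PHP/Laravel, so this Python version with a plain dict standing in for the Redis connection is purely illustrative, and all names are hypothetical:

```python
import json

# Stand-in for the Redis cache: one key holding a JSON-serialized list.
store = {}

def track_transcription_naive(episode_id: str, key: str = "transcribed:24h") -> int:
    """Naive pattern: load the WHOLE list, append one ID, write it all back.

    Every call deserializes and reserializes the entire payload, so the cost
    (CPU, memory, bytes over the wire) grows linearly with cache size.
    """
    ids = json.loads(store.get(key, "[]"))  # read all N items into the process
    ids.append(episode_id)                  # add the single new one
    store[key] = json.dumps(ids)            # write back all N + 1 items
    return len(ids)

# At 100,000 IDs (~30 MB serialized), each of these calls shuffles the
# full payload through the web process just to record one new episode.
for i in range(5):
    track_transcription_naive(f"episode-{i}")
```

Each call is correct in isolation, which is exactly why the problem stays invisible on a development machine with a handful of items.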

Arvid:

And if you multiply this by the 5 to 10 requests per second I get from my 30 back-end servers that constantly send and pull stuff from the database, well, that created this massive ongoing memory consumption that slowed the process down to a point where later processes would have to wait until memory was loaded, and then they would load their memory on top, and that was the bottleneck. And this was a problem I could never have seen in my local desktop environment with my couple dozen things happening every hour. It only exists because of the resource requirements of my production system. I probably could have foreseen it had I spent more time thinking about how this would impact the system at large, but I didn't. I just built it a couple of months ago, and it has slowly crept up to a point where it now started causing issues.
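Back-of-the-envelope, those numbers stack up quickly. The 30 MB payload and the 5 to 10 requests per second come from the story above; the midpoint is my assumption:

```python
# Rough cost of the read-modify-write cache under production load.
payload_mb = 30       # serialized cache: ~100,000 IDs plus metadata
reqs_per_sec = 7.5    # midpoint of the quoted 5-10 requests per second

# Each request reads the full payload and writes it back,
# so roughly 2x the payload moves per request.
churn_mb_per_sec = payload_mb * 2 * reqs_per_sec
print(churn_mb_per_sec)  # hundreds of MB/s shuffled just to append single IDs
```

Even before any leak, that is a steady stream of allocation and serialization work competing with every other request the pool is serving.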

Arvid:

And to solve this, I asked Claude, Anthropic's AI, to rewrite my cache module in a memory-saving way. That was really what it was. I took the full source code of this module, threw it into Claude, and told it, hey, this is really memory intensive, make it use way less memory. And since I was already using Redis as the back-end caching system and Claude knew about this, Claude rebuilt it using Redis' internal memory management, the commands Redis offers to manipulate data in place more efficiently. I didn't know anything about Redis and its particular commands; Claude did all of this. But the result is that instead of loading every single item into an array, adding one to it, and writing it back, the system now uses Redis commands to add an item to an existing collection in Redis itself.
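I don't know which exact commands the rewrite ended up using, but the shape of the fix can be sketched with a Redis set and its SADD/SCARD/EXPIRE commands. This is Python in redis-py style rather than the actual PHP code, the client is stubbed so the sketch runs standalone, and the key name is made up:

```python
import time

class StubRedis:
    """Minimal stand-in for a redis-py client (sadd/scard/expire only),
    so this sketch runs without a server. With a real server you would
    use redis.Redis() instead."""
    def __init__(self):
        self._sets, self._ttls = {}, {}
    def sadd(self, key, *members):
        s = self._sets.setdefault(key, set())
        added = len(set(members) - s)
        s.update(members)
        return added
    def scard(self, key):
        return len(self._sets.get(key, set()))
    def expire(self, key, seconds):
        self._ttls[key] = time.time() + seconds

r = StubRedis()

def track_transcription(episode_id: str, key: str = "transcribed:24h") -> int:
    """Redis-native pattern: ship ONE member to the server and let Redis
    mutate the set in place. Constant memory in the app, one small command
    on the wire, no matter how many IDs the set already holds."""
    r.sadd(key, episode_id)       # add just the new ID, server-side
    r.expire(key, 24 * 60 * 60)   # keep the rolling 24h window fresh
    return r.scard(key)           # cardinality is computed by Redis

count = track_transcription("episode-42")
```

The design point is that the data structure lives where the data lives: the application never materializes the collection, it only sends deltas and asks for aggregates.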

Arvid:

So I just send over one item instead of 100,001. And this operation is much faster, involves almost zero memory usage in my application, and it showed immediately once I implemented this. And now I'm left wondering, after having found and solved this issue: should I have built something better from the start? Should I have seen this? Should I just deal with the fact that this will happen over and over again if I don't? And, honestly, am I building things that don't scale until they break, and then building something scalable?

Arvid:

Is that my approach, or should I always try to figure out every single possible contingency before I deploy something in my product? And that is a real question you will quite likely ask yourself at some point in your journey. Should I take the shortcut, or should I really spend days or weeks figuring this out? And usually the answer is yes to both, right? You should take the shortcut now, and then, in a background process in your mind, keep improving on the thing.

Arvid:

But for real, everybody has different answers to this, and everybody has different needs that shape those answers. Here's my answer: I will keep doing what I'm doing here. I will build things that work for me, then deploy them and see how they impact the system. But I won't shy away from deploying or releasing new features just because they might eventually put some kind of resource constraint on my servers.

Arvid:

I think it's a constant prototyping stage of my business right now anyway, on the way to undeniable product-market fit. And I would rather have introspection into my system, look at things as they happen, maybe crash a little and then deal with that, than never release a feature for fear that it might impact performance. Because a system that performs well but doesn't give people what they need is not a business, it's a project. But as a developer building larger systems, still with a lot to learn, I know that I can't foresee every contingency. It just doesn't work like that; scale has side effects that I cannot envision, like this one.

Arvid:

Like, who would have thought that loading a couple thousand IDs would eventually grow to such a big number that it would slow down the system, almost grinding it to a halt? I didn't think about this. I thought: memory? I have enough memory; I could just add more memory. But it's really not that. Right? It's the cascading slowness of allocating and freeing new memory that then slows everything else down.

Arvid:

Didn't matter how much memory I had, it would eventually catch up with the system. So that's the scale problem, and I have to ensure that I can keep an eye on resource usage. That's what I learned, right? No matter what I build, no matter what system I run it on, I need to know how much is being consumed and what the trending patterns are. How does my system behave over time? Is there, like, a sawtooth kind of pattern that usually shows there's a little problem, or is it a slow up and down?

Arvid:

Is it kind of a wave? That's alright, because resource consumption differs throughout the day as the system is used differently, but I just have to keep an eye on it. And for that, I've set up alerts and thresholds and monitoring systems. Ironically, though, the problem with this caching thing was that it was meant to be a monitoring system for something else, right? I was trying to track how many podcast episodes I was ingesting every single day so I could see if there was an up or down, if my back-end servers were still working.
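Such alerts don't have to be fancy. A minimal sketch in Python, where the threshold, window size, and units are all made-up numbers and not what PodScan actually uses:

```python
from collections import deque

# Rolling window of recent memory samples (e.g. MB per web process,
# collected once a minute); all numbers here are hypothetical.
WINDOW = 10
THRESHOLD_MB = 512

samples = deque(maxlen=WINDOW)

def record_sample(mb: float):
    """Record one sample; return an alert string if usage looks unhealthy."""
    samples.append(mb)
    if mb > THRESHOLD_MB:
        return f"memory above threshold: {mb:.0f} MB"
    # A window where every sample is higher than the last is the rising
    # edge of a sawtooth: worth a look even before the hard threshold.
    if len(samples) == WINDOW and all(
        b > a for a, b in zip(samples, list(samples)[1:])
    ):
        return f"memory rising for {WINDOW} straight samples"
    return None

alert = record_sample(480)  # below threshold, window not yet full -> no alert
```

The hard threshold catches the leak once it's bad; the trend check is what distinguishes a healthy daily wave from a sawtooth that keeps climbing.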

Arvid:

That was my kind of monitoring approach here, but this system then caused everything else to fall over, which is hilarious. It slowed down the whole system, and now I have built a monitoring system for my monitoring system, and that's the kind of stuff you have to deal with if you're building a SaaS. This is just what it is. So instead of trying to build things that scale forever from the start, I'm building this framework around my prototype and my features that tells me if things don't scale, if I run into issues. I will still try to build them as reliably as I can, and with this experience in mind, I will think more about the impact that any kind of data collection or data operation might have at a different scale.

Arvid:

What if it's not just 10 items in a list but 10,000? Will this be different? Will this make the system act differently? And then maybe should I approach it slightly differently? Should I chunk it so it's not all done at once?
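Chunking in that sense is cheap to sketch. Again in Python rather than the actual PHP, with an arbitrary batch size:

```python
from itertools import islice

def chunked(items, size=500):
    """Yield successive lists of at most `size` items, so a big collection
    is processed in bounded slices instead of all at once."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Process 10,000 hypothetical IDs in bounded batches instead of one
# 10,000-item operation that spikes memory.
ids = [f"episode-{i}" for i in range(10_000)]
batches = list(chunked(ids, size=500))
```

The point isn't the helper itself but the habit: any loop over "all the items" gets a bound on how many it touches at once.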

Arvid:

Should I set limits? That kind of stuff is more prevalent in my mind now, but I will still build what I need today and then deal with the technical debt of it later as things develop. And the PodScan back end, in this case, is generally a little bit special, because it doesn't really scale with the number of users of the product. It scales with the number of podcasts that are being released worldwide every day. Right? PodScan ingests all of those, transcribes them, and then does keyword search and puts them in a database, so there's full-text search around them as well.

Arvid:

And there's a ceiling here: only so many new episodes are released every day. But there also is a pretty high floor, and that is at least 30,000 podcast episodes released every single day. That's a thousand and then some an hour, and they bunch up throughout the middle of the day, so there's a lot going on, and I need to deal with numbers of that size today, not in some distant future. So doing things that don't scale, that's great for business operations, for early-stage business stuff, but building things that don't scale is really not ideal for technical implementations. Both for a system that can handle more load and for a business that can grow reliably, whatever you do that doesn't scale needs to eventually provide enough insight into how you could potentially make it scale. It's the same thing with my memory leak.

Arvid:

The problem here was that my implementation, my very naive and resource-hungry implementation, didn't consider what would happen at a larger scale, with a larger number of potential IDs in the array. But Redis is a tool built to do millions of operations per second. Right? I should have just used technology that already handles this. The big learning from this experience is to keep an eye on significantly larger numbers and to use tech that already has built-in features to facilitate this for me.

Arvid:

I should have looked into these Redis commands from the start instead of shuffling data around myself. I didn't, because I thought I would keep it abstract. Right? I would keep using PHP as it is, Laravel on top of it, and just build everything in there so that I could easily move from back ends like Redis to stuff like Memcached or whatever. But I'm now hitting a certain scale where that is not a good idea.

Arvid:

I have to use native tools that are meant to be high performers. And, yeah, that's what I've learned from all this, and I hope that this little insight into my issue motivates you to build now, to build things that work well enough and that will keep working even if there's more load on them, but to also make sure that you measure how they impact your system at large. That goes both for building software, obviously, where you can measure things in milliseconds and gigabytes and whatever, and for your actual business: is the thing that you're doing having a meaningful impact on your sales, your customer retention, your churn, the visitors to your website, or the people who convert there? Can you measure something there with the things that you do manually? That is important, because you will eventually have to automate it in some way: through people, through software, through process improvements, whatever it might be, you will have to give it to somebody else or something else.

Arvid:

And you can only build things at scale by building things that don't scale first, but you have to put guardrails around them. Keep learning, keep improving, and always be ready to adapt when your non-scaling solutions hit their limits. And that's it for today. Thank you for listening to The Bootstrapped Founder. You can find me on Twitter at arvidkahl, a r v i d k a h l.

Arvid:

You can find my books there too, of course. And if you wanna support me and this show, please tell everyone you know about podscan.fm and leave a rating and a review by going to ratethispodcast.com/founder. It makes a massive difference if you show up there, because then the podcast will show up in other people's feeds, and any of this will really help the show. Thank you so much for listening. Have a wonderful day, and bye bye.
