390: When to Choose Local LLMs vs APIs

Arvid:

Hello, everyone. It's Arvid, and welcome to The Bootstrapped Founder. Today, I wanna talk about something that's been on my mind lately, and probably yours too if you're building anything with AI. It's the age-old question, and I guess it is a couple of years of age at this point: should you run language models locally or just use APIs like OpenAI or Claude?

Arvid:

This episode is sponsored by Paddle.com, the merchant of record that has been responsible for allowing me to reach profitability a month ago. Paddle truly is more. That's MOR, the merchant of record, because their product has allowed me to focus on building a product people actually want to pay money for. That money is going into a lot of AI tools, and I'll talk about that today. But Paddle is really doing more for me.

Arvid:

They deal with taxes, reach out to customers with failed payments. They charge in people's local currency, all things that I don't need to focus on so I can really be present for my customers and their needs. It's amazing. Check out Paddle.com to learn more. Now I know what you're thinking.

Arvid:

Arvid, local AI versus APIs, that has been debated to death on Twitter over the last couple of weeks. But here's the thing. Most of these conversations seem to happen in a vacuum, a very academic vacuum full of theoretical scenarios, benchmark comparisons, that kind of stuff. What I wanna talk about and share with you today is what I have actually learned from building a real business with AI, making real decisions with real constraints, and sometimes making the wrong choices and learning from them. So I'll share that.

Arvid:

Let me start with a confession here. When I first started building Podscan, I was convinced I had to do everything with local language models. I mean, as a bootstrapped founder, the idea of keeping costs low, maintaining control, that was incredibly appealing. So I dove in, trying to handle everything locally. And I tweeted a lot about this.

Arvid:

Right? I shared all my benchmarks talking about this. I shared my numbers, how much it costs and all that. It was a lot of research that I needed to do, and I tried to do all of these things in public as much as I could. But here's what happened since.

Arvid:

And this might sound familiar if you've done this yourself. The cost savings that platforms like OpenAI and Anthropic have achieved just through their scale very quickly made it pretty clear that for my workload and data volume, it made no sense to rent more and more GPU resources to run local language models. I have 50,000 podcast episodes coming in every day, and I have really no control over how many there are. That's just how much stuff gets released every single day. And I need to deal with this.

Arvid:

So to do analysis on this, I need to have something that works at scale and is efficient. And local models don't really work like that. At this point, it became much more effective for me, and that must have been like three months ago when I realized this, to go API first instead of local LLM first, and then consider local deployment later if APIs went down or if I had specific capacity requirements. So don't get me wrong. There absolutely are situations where a local LLM shines for SaaS founders, or just if you wanna use it yourself.

Arvid:

But let me break this down based on what I've actually experienced. The first sweet spot is when you have very small tasks that need really quick decisions, things that don't require complex reasoning or extensive context. And I'm talking about little tiny fragments of work that need to be done in the context of a software business. For example, if you need a reasonable choice between options that can't be determined by simple Boolean logic or a basic algorithm, a small language model with fast inference can be perfect, whether it runs on CPU or GPU. Right?

Arvid:

You look at a piece of text and you wanna determine one thing about it that you cannot just find by looking for keywords. Even for something specific that is implied by the context of the text but not explicitly written in it, an LLM will be able to figure this out. And a local model, if it's not too much text, can do this extremely fast, even if you don't have a GPU in your computer. So instead of making a network call and dealing with API latency and cost, you get your answer instantly on your local server if you run a really, really small LLM. And here's a concrete example from my own business.

Arvid:

The first AI feature that I built was for transcribing audio clips for Podscan. This doesn't necessarily require an inference call to a remote API. It can run on the local GPU or CPU of a server. And that's how I started. Right?

Arvid:

The whole thing was, I didn't have too many parallel operations. It didn't exceed what my business at that stage could handle because I was just figuring it out. So I was running all of these transcription things on my Mac computer. And even before I started Podscan, I was running a different business. It was called Podline.

Arvid:

And for that, I did transcription too, of really short messages, maybe a minute or so long. So that doesn't really require a lot of resources. I was running this on the CPU of that server. It's like a $25 a month Hetzner server that runs the whole business. And it might take two to five minutes to transcribe that minute on the CPU, but it doesn't really matter.

Arvid:

Right? The message comes in, you transcribe it, and then you take the transcript and send a notification. A couple minutes, perfectly fine for such an async tool. So I didn't need to build like a complicated pipeline there. I could just have it run locally, and that is perfectly okay.
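
A minimal sketch of that kind of asynchronous, CPU-only transcription flow, assuming the open-source Whisper package. The episode doesn't name the exact model that was used, and the notification helper is a hypothetical placeholder:

```python
# Transcribe a short voice message on CPU, then send a notification with the
# transcript. Assumes `pip install openai-whisper`; model choice is illustrative.
import whisper

model = whisper.load_model("base")  # small model, fine on CPU for short clips

def notify_recipient(transcript: str) -> None:
    # Placeholder: in a real app this might send an email or a webhook.
    print(f"New voice message transcript: {transcript}")

def handle_incoming_message(audio_path: str) -> str:
    result = model.transcribe(audio_path)  # may take a few minutes on CPU, which is fine here
    transcript = result["text"].strip()
    notify_recipient(transcript)
    return transcript
```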

Arvid:

For these cases, it makes sense to keep the processing on your local server with local models that fit into RAM, and avoid the costs that would come from using someone else's resources. And this experience also taught me something interesting about CPU versus GPU inference. There are certain ways of using these models, particularly very modern models, where there's barely a difference in speed between CPU and GPU, specifically when you're dealing with low context windows and low token prompts. If you just need a yes or no answer, for example, a CPU and a GPU might be almost equal in terms of speed, provided that it's low context, right? It's not a lot of text.

Arvid:

It's not a lot of prompt. It's just: give me a yes or give me a no answering this question about this paragraph of text. It is really, really fast on the CPU. You don't need a complicated thing. But the moment you do multimodal work, things like images or audio, or some deep analysis with a lot of data, then a GPU-based system becomes much more effective and even required.
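
A minimal sketch of that kind of low-context yes/no check, assuming llama-cpp-python and a small quantized model. The model file and prompt format are illustrative; some models expect different prompt templates:

```python
# Tiny classification on a local model: short prompt, two or three output tokens,
# fast enough on CPU. Not the actual Podscan setup, just the shape of the idea.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3b-q4.gguf", n_ctx=512, verbose=False)

def is_about_topic(paragraph: str, topic: str) -> bool:
    prompt = (
        f"Text: {paragraph}\n"
        f"Question: Is this text about {topic}? Answer with yes or no.\n"
        "Answer:"
    )
    out = llm(prompt, max_tokens=3, temperature=0.0)  # deterministic, tiny output
    return "yes" in out["choices"][0]["text"].lower()
```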

Arvid:

The key insight here for me is if you have only 10 of these operations a day and you can wait a bit for the results, you don't necessarily need to go to a remote API or use a GPU to begin with. But here's where things get interesting, and this is where I learned some hard lessons. Scale changes everything. My first real challenge came when I started working with a lot of large transcripts. Because inference, like getting some information from a large transcript, really scales with the size of the context that's provided.

Arvid:

If you have a three-hour conversation, looking at Joe Rogan, in text form, and you wanna run some kind of analytics on it, some kind of data extraction, that turns into a very, very long computation process if you run a local language model. Whether on GPU or CPU, it just takes some time, unless you have an H100 graphics card, which costs several tens of thousands of dollars to own. Right? So that is not happening. And for any other graphics card that you run this on, or God forbid the CPU, this is minutes if not hours.

Arvid:

I was using llama.cpp with a Llama 3 billion or 7 billion parameter model, and while this was perfectly fine at the scale of a couple hundred operations a day, it became a bottleneck immediately at a couple thousand. And at tens of thousands a day, it became completely unbearable. I had to scale back the number of things that I pushed in. I had to make a choice, like, do I even analyze this transcript or do I not?

Arvid:

It became a problem. And here's the brutal truth about this. Given the unit economics of remote API platforms that offer AI inference, and there's way more than OpenAI and Anthropic's Claude, right, there are many, many providers, DeepSeek is one of them, and many that can host models remotely and do that very efficiently, it is often much more reliable and much cheaper to use those services than to run your own infrastructure at scale.
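
At that scale, the long-transcript extraction job is just a few lines against a hosted API. A minimal sketch using the OpenAI Python client; the model name and prompt are illustrative, not the actual Podscan pipeline:

```python
# Hand a long transcript to a hosted API for data extraction instead of
# running a local model against a huge context window.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_key_points(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Extract the main topics and guests mentioned in this podcast transcript."},
            {"role": "user", "content": transcript},  # can be hours of conversation in text form
        ],
    )
    return response.choices[0].message.content
```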

Arvid:

The economies of scale that companies like OpenAI and Anthropic achieve, they are just impossible to replicate when you're running a smaller operation. You cannot scale an inference cluster to the same level they can. Like, I don't even know how to start an inference cluster. I'm just running individual machines. Right?

Arvid:

Like, I don't have the knowledge, and it's not my job. Podscan is not an inference cluster building company. It is a podcast data company. And if I need inference, either I build something, kinda, or I use somebody's system that is so much better at it. And there are some scenarios where you absolutely need local language models regardless of cost or convenience out there too.

Arvid:

And I think it's important to think about those, because if you have customers that require SOC 2 compliance or any kind of privacy-based compliance, privacy in the first place, they will very likely not even allow you to send their data to external systems outside of your business. The fine print of the API usage terms for platforms like OpenAI often includes the fact that they can use the data you send to train their future systems. And for customers with strict privacy requirements, that is a non-starter. It's just not going to happen. It's about their control of their data.

Arvid:

There's another argument for local systems, and that's control of the system itself, both data control and model control. You control what model runs on your own system, what version, and when it runs. Nobody can turn off your model because they need to update their hardware, or because they decide to discontinue support or use a new model instead. If you have your own system, you can run it for as long as you have access to that computer. You can also tune your AI system to your customer's specific data and deploy exactly the right model with exactly the right fine-tuning on exactly the right data for them.

Arvid:

You can obviously also kind of do that with systems that have a platform offering, but it becomes more complicated to actually retain full control over it. And sometimes that's what your customers want to pay you for. So how do you actually make this decision between local models and remote APIs? Well, based on everything I've learned, here's the framework that I use. A couple questions.

Arvid:

The first one is scale. How many operations a day am I running? If it's under a few hundred, then local might work. If it's thousands, it's probably API time. The second question is speed requirements.

Arvid:

Can you wait a minute or two for results? If that's the case, use local. If you need instant responses or a lot of parallel responses, APIs will be your friend. The third one, obviously, like I just said, is privacy constraints. Do you have customers with strict compliance requirements?

Arvid:

Do you want to offer this? Then that might force your hand towards local. And context size is similar here too. Are you working with large documents, large transcripts, large images, video maybe? The larger the context, the more those APIs make sense unless you wanna invest a lot into hardware to be able to handle it.

Arvid:

And that is kind of the last question: resources. Do you already have them? Like, do you already have a GPU? If not, the upfront investment might not be worth it. Now mind you, Podscan ran on my Mac Studio's GPU for a couple of weeks when I started the project.

Arvid:

It was enough to do, what was it, like a hundred and twenty seconds of audio per second. So I could handle a couple hundred to a couple thousand podcast episodes a day, and that was enough just to get started. So if you have a good GPU in one of your local computers, you might actually use that for a bit. But the APIs are way more reliable. And I go API first for most things now.
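
To make the framework above concrete, here's a rough sketch that encodes the five questions as a helper function. The thresholds are rules of thumb taken from the episode, not hard limits, and the function and parameter names are made up:

```python
# Scale, speed, privacy, context size, resources: a back-of-the-envelope decision helper.
def choose_inference(ops_per_day: int,
                     can_wait_minutes: bool,
                     strict_privacy: bool,
                     large_context: bool,
                     has_gpu: bool) -> str:
    if strict_privacy:
        return "local"                 # compliance forces your hand
    if ops_per_day > 1000 or not can_wait_minutes:
        return "api"                   # scale and latency favor hosted APIs
    if large_context and not has_gpu:
        return "api"                   # big contexts are painful without serious hardware
    if ops_per_day < 300 and can_wait_minutes:
        return "local"                 # small, slow workloads can stay on your own box
    return "api"                       # default: API first, local as a fallback
```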

Arvid:

And I try to kind of save myself from becoming too dependent on it, because there's obviously dependency risk here, by having a local fallback option. So I have GPU-powered servers that can run these models if they need to. If not, they just keep transcribing along. Right? I'm trying to always use the GPU at 100%.

Arvid:

But if there is an inference job that needs to be done, they can also spin up a model real quick, do the job, and then kind of remove it from memory. And this gives me the best of both worlds, because these servers then also talk to APIs if they need to, to OpenAI, to Claude, whatever. I get the speed and cost effectiveness of these APIs for most normal operations, and I have the security and the backup, really, of local processing if and when it's needed. There is a benefit to these APIs. It's just an implementation benefit.
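
A simplified sketch of that API-first-with-local-fallback pattern. The model names, file path, and error handling are illustrative, not the actual Podscan setup, and loading and freeing the local model is shown in a very reduced form:

```python
# Try the hosted API first; only spin up a local model if the call fails,
# then drop it again so the GPU can go back to its main job.
from openai import OpenAI
from llama_cpp import Llama

client = OpenAI()

def run_inference(prompt: str) -> str:
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    except Exception:
        # API is down or rate-limited: load a local model, answer, free the memory.
        llm = Llama(model_path="models/llama-3b-q4.gguf", n_ctx=2048, verbose=False)
        out = llm(prompt, max_tokens=256)
        result = out["choices"][0]["text"]
        del llm  # release the model so the server can keep transcribing at full tilt
        return result
```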

Arvid:

It's very underappreciated, but it's standardization. There are OpenAI-specific parameter configurations that just work with most APIs. There's a standard on how you prompt a model. These are very easily mapped onto different providers because there is a standard. And competition, and services like AWS Bedrock where you can host AI models and then talk to them as if you were talking to OpenAI, help you not feel locked into a single provider.

Arvid:

Right? The vendor lock-in doesn't really exist here. And this standardization makes it easier to switch between providers or fall back to local models when necessary. Although, in my experience, the UX of writing local prompts for some models is much more finicky than the streamlined and well-documented API versions out there. Like, there are some models that need really weird prompt structures because they are not as customer-friendly, I would say, but you will figure that out either way.
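
In practice, that standardization means the same client code can point at OpenAI, at another OpenAI-compatible provider, or at a local llama.cpp server just by changing the base URL. The endpoint and model names below are examples, assuming a llama.cpp server running locally on port 8080:

```python
# Same code path, different targets: swap base_url to move between providers
# or fall back to a local OpenAI-compatible server.
from openai import OpenAI

openai_client = OpenAI()  # api.openai.com, key from OPENAI_API_KEY

local_client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama.cpp's built-in OpenAI-compatible server
    api_key="not-needed-locally",
)

def ask(client: OpenAI, prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # local servers typically ignore or remap this name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```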

Arvid:

Like, the potential cost savings and the control, particularly when it comes to privacy, and the need to understand it all well, will drive you to figure out how to prompt these models reliably. Look, the choice between local LLMs and APIs is not about being ideologically pure or following trends or whatever. It's just about making practical decisions for your business based on your specific constraints and requirements. So if you're just starting out, my advice is simple. Start with APIs.

Arvid:

You can always run a local model to test things out. And for any production system, if you don't have much traffic, it's not gonna be very expensive anyway. Get your product working. Validate your market. Understand your scale.

Arvid:

You can always move to fully local later if your specific situation demands it, but use APIs just to not have to deal with it, to be able to focus on your business logic. The key is to be honest with yourself about your constraints, both technical and business constraints, and make the choice that serves your customers best. Purism has no hold here. You don't need to build this locally just because you can. Sometimes the best choice is local, sometimes it's APIs, and sometimes it's a hybrid approach.

Arvid:

That depends on your business, not on what you think is cooler in terms of tech. And at the end of the day, the best AI infrastructure is the one that helps you build a sustainable business, like, from the start, one that serves your customers effectively and that you don't have to waste money on. Everything else is really just implementation details. And that's it for today. Thank you so much for listening to The Bootstrapped Founder.

Arvid:

You can find me on Twitter at @arvidkahl, a r v i d k a h l. If you wanna support me and this show, please share Podscan.fm with your professional peers and those who you think will benefit from tracking mentions of their brands, businesses, and names on podcasts out there. Podscan is a near real-time database of podcasts with a really stellar API. So please spread the word with those who you think need to stay on top of the podcast ecosystem. Thank you so much for listening.

Arvid:

Have a wonderful day, and bye bye.
