404: The Transcription Challenge: Building Infrastructure That Scales With The World
Hey, it's Arvid, and this is The Bootstrapped Founder. Today we'll talk about keeping up with an avalanche of audio data and how I built Podscan's transcription infrastructure. This episode is sponsored by paddle.com, my merchant-of-record payment provider of choice, who's been helping me focus on Podscan from day one. They're taking care of all the little things related to money so that founders like you and me can focus on building the things that only we can build, like a massive podcast transcription infrastructure. Paddle handles all the rest: sales tax, credit cards, those kinds of things.
Arvid:I don't need to deal with it because they do. I highly recommend checking it out, so please go to paddle.com and take a look. Now, when I started building the first prototype of Podscan, I very quickly realized that this was going to be a different business than any I had built before. The difference had everything to do with one fundamental challenge in this field. Unlike most software-as-a-service businesses, the resources that I would need from the start wouldn't scale with the number of customers I had, but would scale with something completely out of my control.
Arvid:The number of new podcast episodes being released worldwide every single day. So no matter if I had one customer or a hundred, if they wanted to track every podcast out there for a keyword, I needed to deal with this from day one. And that's hard, because if you've ever investigated the idea of stoicism, you'll know that there are certain things you can control that you should care about, and certain things that you cannot control that you shouldn't fret about at all. That's kind of the idea, a very rough description of stoicism here: deal with the things you can deal with, and don't whine about the others. So that's exactly what I did. I focused on what I could do to make transcribing every single podcast out there a reality, and I didn't complain about the fact that there are millions of shows out there, with tens of thousands of new episodes being released every day. That's kind of the framework here. I had to deal with it.
Arvid:I think I'm currently tracking 3,800,000 shows, and roughly every day there are somewhere between 30,000 and 70,000 episodes being released. It depends on the day of the week. And I want to talk about this herculean effort of building transcription infrastructure: how I got it from being extremely expensive to comparatively cheap and manageable, what the trade-offs were along the way, and how much the development of new technologies has impacted the feasibility of this entire project for me. Now, for my first prototype, I obviously didn't try to transcribe everything at once; I knew it didn't make sense to even try that. But I had found my source of podcast feed data, just a couple of good podcast feeds to try it out with, through the Podcast Index Project.
Arvid:Very interesting. If you're into podcasting, check it out. It's an open source approach to listing all the podcasts everywhere. It's free, and it's openly available as an API and a database of podcasts that provides where they're hosted, the names, descriptions, and links to episodes as well. I think.
Arvid:Maybe not necessarily all of them, but some. And they even have a full SQLite export, like four gigabytes of just one big file with all this data. Makes it very easy to jump-start any system. They also have a great API, and the Podcast Index API has a very useful endpoint for trending shows and newly released episodes. So my first prototype used that API, just grabbed the most recently released or most popular episodes, and transcribed those with the existing resources that I had. And when it comes to the tech, I'm just gonna share everything here, because why not?
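As a rough illustration of what that fetching step can look like, here is a minimal Python sketch against the Podcast Index API's recent-episodes endpoint. The auth scheme follows their public docs; the key, secret, and User-Agent string are placeholders you would replace with your own.

```python
import hashlib
import time

import requests

API_KEY = "YOUR_PODCASTINDEX_KEY"        # placeholder credentials
API_SECRET = "YOUR_PODCASTINDEX_SECRET"  # placeholder credentials

def recent_episodes(max_results: int = 100) -> list:
    """Fetch the most recently released episodes from the Podcast Index API."""
    now = str(int(time.time()))
    # The API authenticates requests with a SHA-1 hash of key + secret + timestamp.
    auth_hash = hashlib.sha1((API_KEY + API_SECRET + now).encode()).hexdigest()
    headers = {
        "User-Agent": "podscan-prototype/0.1",  # the API requires a User-Agent header
        "X-Auth-Key": API_KEY,
        "X-Auth-Date": now,
        "Authorization": auth_hash,
    }
    resp = requests.get(
        "https://api.podcastindex.org/api/1.0/recent/episodes",
        params={"max": max_results},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

for episode in recent_episodes(10):
    # Each item carries feed metadata plus the audio enclosure URL to download.
    print(episode.get("title"), "->", episode.get("enclosureUrl"))
```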
Arvid:I already had been experimenting with an open source library called Whisper for a previous project called Podline, a voice messaging tool for podcasts. That was the idea: I was gonna take in voice messages through the browser, transcribe them on the back end, and then send a notification to my customers. And I had found that Whisper, which is supposed to be run on GPUs, could also be run on a CPU, so without a graphics card at all, through a project called Whisper.cpp, albeit quite slowly. But for Podline, where I needed to occasionally transcribe a short one-minute clip, this worked perfectly.
Arvid:It may have taken five minutes to transcribe it on one CPU core, but that's okay. There are many cores in modern CPUs. And if it takes five minutes, sure, people will get the notification a bit later. That's alright. And since Podscan, my current business, was initially a marketing effort for Podline, because I wanted to know where people already talked about voice messaging, I built a tool that would figure out where people talked about it, and I already had all the tooling around.
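For context, a local transcription like that can be driven from a small script along the lines of the sketch below. The binary name, model path, and thread count are assumptions about a typical Whisper.cpp build, not the exact setup described here.

```python
import subprocess
from pathlib import Path

def transcribe_clip(audio_path: str, model: str = "models/ggml-base.bin") -> str:
    """Transcribe a short clip on the CPU by shelling out to a whisper.cpp build."""
    wav = f"{audio_path}.16k.wav"
    # whisper.cpp expects 16 kHz mono WAV input, so convert with ffmpeg first.
    subprocess.run(
        ["ffmpeg", "-y", "-i", audio_path, "-ar", "16000", "-ac", "1", wav],
        check=True,
    )
    # "./main" is the classic whisper.cpp CLI; -otxt writes the transcript to <input>.txt.
    subprocess.run(
        ["./main", "-m", model, "-f", wav, "-t", "4", "-otxt"],
        check=True,
    )
    return Path(f"{wav}.txt").read_text()

print(transcribe_clip("voice-message.mp3"))
```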
Arvid:But obviously, there's a stark difference in transcription scale here. Right? Podline needed to handle occasional short clips, but Podscan needed to reliably transcribe 50,000 shows per day. And those are often shows that go for forty to eighty minutes. Right?
Arvid:That's not just thirty seconds. That is hours of material. And if you look at Joe Rogan, who reliably puts out four-plus-hour shows, that system needed to be fast and good enough to get the whole conversation and transcribe it. So the first smart choice I needed to make was treating this as a queuing system, not as something that would happen synchronously with when stuff was released. I needed a queue of podcast episodes that would just wait to be transcribed, and whenever I had time and resources, I would transcribe the next one in descending priority. And this required a priority system to determine which episodes should be handled first.
Arvid:That is a whole thing that I could probably do a full episode on. I've come up with a system where I have three queues right now: high priority, middle priority, and low priority. And the high-priority shows would be the Joe Rogans of this world that get, like, preferential treatment, because I know that anything said on those shows, if it triggers an alert, would have the biggest impact on whatever my customers might need to do with it. So I need these episodes to be transcribed early. But then there are maybe mid-tier podcasts that can wait half an hour or so before they get transcribed, or that could even wait a couple of days because it's just not that important, and that moves them down in priority.
Arvid:There's also an immediate priority queue, which skips all the other queues for, like, custom retranscriptions. If I ever have an episode that needs to be retranscribed, or somebody really needs this episode right now, there's a bypass version. But effectively, that's the priority system that I have. And for that queue, my initial setup was really just one consumer, and that was the Mac Studio that I was developing the software on, which right now has a microphone attached that I'm speaking into to record this podcast. Like, I was running my full queue from my production system on a local computer. And running Whisper.cpp locally on a Mac is really cool, because it will use the GPU if it can connect to it, and the Mac's unified memory system, like the MPS system, is capable of running these models really, really quickly. And I was getting about 200 words per second, which is really something, and this meant I could fetch a couple hundred episodes per hour with some parallel processing on my local system.
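A minimal sketch of that tiered queue logic might look like the following; the `pop_oldest` helper is hypothetical and stands in for whatever the real queue store does.

```python
from enum import IntEnum

class Priority(IntEnum):
    IMMEDIATE = 0  # bypass queue: manual retranscriptions, "need it right now" requests
    HIGH = 1       # the Joe Rogans of the world, where alerts matter most
    MEDIUM = 2     # can wait half an hour or so
    LOW = 3        # can wait days; backfill and archival material

def next_episode(queue_store):
    """Return the next episode to transcribe, always draining higher tiers first."""
    for tier in Priority:
        episode = queue_store.pop_oldest(tier)  # hypothetical helper on the queue store
        if episode is not None:
            return episode
    return None  # nothing waiting; the worker sleeps and polls again
```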
Arvid:So then I realized that to deploy this as a business properly, I needed a transcription server running on a different cloud system, very likely, because I couldn't just keep it running locally at home. If my Internet ever goes out, my company wouldn't work. Right? So I started exploring companies that would offer access to computers with graphics cards, where I could install whatever stack I had locally and keep it running there twenty-four seven. So the first thing I tried was AWS with their G-type instances.
Arvid:The G stands for graphics cards, I presume. I don't know. But these were quite expensive instances that didn't really have much power for the work that I was doing. The ones I could afford, I think around $400 a month, just weren't powerful enough, particularly compared to my local server here. I would have preferred them to be either cheaper or better, so I quickly stepped away from AWS.
Arvid:And even to get them, you have to apply for quota there, and they have to verify it. It's quite hard to get in there. So I looked for alternative, easier solutions. I looked into Lambda Labs, which was one of the first reliable options for GPU systems that I found, and I used them for quite a while. Lambda was helpful because they offered different servers with different GPUs attached. So you could rent an H100, one of the most powerful NVIDIA GPUs at the time, for about a thousand bucks a month or a bit more, which was obviously quite expensive, or you could have an A100 or an A10, which were much cheaper and actually perfect for transcription purposes.
Arvid:So I spent a couple of months experimenting with my own personal money, testing whether an A10 would outperform an A100 or an H100, not in terms of raw throughput, but in terms of words per dollar. That's kind of the unit that I had. And I think I shared this on Twitter where I did some math there. I deployed my transcription systems to different hosts with different graphics cards, and I ran experiments with a varying number of parallel transcriptions just to see how it worked. And I found a working solution eventually.
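The comparison boils down to a simple ratio. The throughput and price figures below are illustrative placeholders, not the actual measurements from those experiments.

```python
# Illustrative numbers only: words-per-second as measured in a test run,
# and whatever the provider charges per month for that GPU.
gpus = {
    #         (words/sec, USD/month)
    "A10":   (180,  600),
    "A100":  (200, 1300),
    "H100":  (240, 2000),
}

SECONDS_PER_MONTH = 30 * 24 * 3600

for name, (wps, monthly_cost) in gpus.items():
    words_per_dollar = (wps * SECONDS_PER_MONTH) / monthly_cost
    print(f"{name}: {words_per_dollar:,.0f} words per dollar")
```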
Arvid:I think I settled on 12 to 16-ish servers with A10 graphics cards. That was just the best. These became my transcription fleet for a while, but even that got quite expensive, which made me realize that I needed to do something about the price, because that was also pre-funding for me. I didn't have any funding at that point just yet, and I was paying thousands of dollars a month out of my personal money, so I needed to figure something out. And the most effective thing that I did was to look for hosted servers outside of the services that are focused on renting out AI hardware.
Arvid:I just needed to look for other providers that had GPU-based servers and were not yet in the AI hype space. The AI-focused services tended to offer sizable graphics cards, the ones built for inference, which is great if you need impressive GPU power. But in most cases, for transcription, that's actually not what you need. You need some graphics card that can do some transcription, and the cheaper, the better, because transcription doesn't require a lot of VRAM. It just requires some time on a GPU.
Arvid:And I found this solution in Hetzner, the German company well known for being an affordable hosting company. They had just started offering GPU servers, and they also have an auction system where you can get really great hardware quite cheaply. The servers I use, I think they're called GEX44, have an RTX 4000 SFF Ada generation GPU, and I think they cost around €200 a month just to rent. And these servers are spectacular: they have 64 gigabytes of DDR4 RAM and four terabytes of disk space, for 200-ish dollars a month to rent. That's really cool.
Arvid:Like you can rent how much do I have right now? Like 10 ish of them and have a significant GPU based workload running twenty four seven on many different servers for $2,000 or less. The key insight from all these experiments was that transcription has very different requirements from other AI tasks like inference. You can run transcription quite reliably by using somewhere between four and twenty gigabytes of VRAM, depends on the model that you use, which is something that if you use Whisper, you can choose different models. Right?
Arvid:There's a tiny, a small, a medium, a large model, and they all use different size of gigabytes of this RAM that these GPUs use. And the smaller ones obviously are faster and use less of that RAM, you can run a couple in parallel. And it really doesn't mean that much if it's just a couple of gigabytes, but it's definitely enough to get the highest quality transcription data, particularly if you use a model called Whisper V3 Large Turbo. That's the one I currently use, fastest and best quality. And when I switched all my transcription servers from these a tens at Lambda Labs to the Hetzner systems, I picked up steam dramatically with these.
Arvid:It was so much more effective. So I could get by with half the number of servers and still have a higher throughput than before. So that's where I'm now. Self maintained servers, running transcription scripts twenty four seven on the Hetzner platform, being highly efficient over time. And the solution that I had with WhisperCPP, that was great in the beginning, but as Podscan started gaining customers, they had more requirements than just default transcription.
Arvid:So I needed diarization, which is a fancy term for determining different speakers in an audio file, and word level time stamps for precise interactivity. People wanted to be able to have, like, exact cuts in videos or in audio, I guess, so they can extract stuff. So they needed to know exactly where their sentence starts and where it ends. So for that and knowing who's speaking, I needed something bigger. So I switched from WhisperCPP to another implementation running on top of Faster Whisper, which is a library that uses these models more efficiently, that includes both diarization capabilities and granular timing data.
Arvid:But this revealed a couple of surprising technical challenges. So if you're into transcription, this is going to be very, very useful so you don't have to fall into these traps yourself. Diarization is more resource intensive than transcription. Detecting speakers takes much longer than actually transcribing what they're saying. You would think it would be easier to determine somebody speaking here than somebody else is speaking there, but it's actually harder to figure out if it's person one, two or three than it is to determine the actual words that this person is speaking and from the start I needed a careful prioritization system here because I only could diarize what I really needed if I know that a podcast has only one speaker and has had for the last 200 episodes well, don't need to diarize it and I can save over 50% of the time.
Arvid:But if it's a popular show with different guests all the time, then I guess I need to prioritize it and it's gonna take double the time. At scale, turning off diarization for some shows where it doesn't matter or has not yet proven to matter means that I can transcribe twice as many podcasts in any given day and that's massive if that means that I can do all the shows in a day and still have resources left then I can step back in time and get some of the older episodes that might be still very interesting for search purposes so that's the trade off that I'm dealing with here Finally, there's something that I learned after a couple of weeks of experimenting with this. GPU memory limits affect the quality of the transcription. If you do a lot of parallelized transcription, the GPU reaching its memory limit where you it gets full, that can cause transcription quality to decline. I initially had a thought like this graphics card has 20 gigabytes of reram.
Arvid:Each transcription process uses at most four gigabytes, so it can run five at a time, fill up the whole graphics card. Right? The whole RAM. That tends to be true most of the time that it works. But if one process runs a little bit longer, maybe it's a three hour Joe Rogan podcast yet again, and then another process spawns and five or six processes are fighting for memory or even just a five on there that should have four, one of them is like 4.2.
Arvid:Right? Quality quickly degrades on all of them simultaneously because there's something in there that just breaks down and then it just hallucinates stuff. I have since reduced parallel processes to two or three podcasts at any given time. There's a small chance the GPU isn't fully utilized when, you know, all of them spin up at the same time, but that's okay. Most of the time it's in full use anyway without quality degradation, and I would rather not risk it because I want these transcripts to be reliably good.
Arvid:Biggest learning in all of this has been that bigger GPUs aren't necessarily faster or better and not just from a words to dollar ratio just even from a usage of the GPU. Just because the GPU is bigger doesn't mean it's faster at transcribing, surprisingly. You would think, but it's not. When I ran transcription on my local machine and then on a 10 and a 100 GPUs, I got quite similar results, like always between a 150, 200 words per second. And those things cost $200 a month max but then I rented an h100 gpu and the word count stayed almost the same maybe going up to two twenty five to two fifty words per second but that gpu had five to 10 times the monthly cost and you couldn't really run it in parallel there either because then it would start degrading quality.
Arvid:So for transcription specifically it is way more effective to run on smaller and maybe slightly slower GPUs at scale and this has turned out to be the only feasible way for me to do this. And we're just talking about like self managed transcription here because there's an alternative that puts everything into perspective. If I were to transcribe all 50,000 episodes that come in every day using OpenAI's platform their AI platform their Whisper Endpoint there I would pay a 5 figure dollar amount every single day. After many months of optimizing and experimenting with transcription setups I have obviously not done this. I have turned the whole thing into just a few thousand dollars in expenses a month by having my own infrastructure.
Arvid:The cost savings are significant because when you run your own infrastructure, even though you aren't able to do as many parallel things as you could by using Whisper and OpenAI or other transcription systems like Deepgram, but instead of paying like a $100,000 a month, you pay four or two, right, if you do it well. The daily cost that these commercial models can incur for you is easily in the thousands of dollars. And I've gotten it down to just a 100 and change on a per day basis, which is quite significant. The biggest expense for Podscan at this point is not transcription capacity. You would think, right, that this would be the most impactful expense, but it's a database where all of this information is stored.
Arvid:And that's the next big challenge that I didn't ever think about in the beginning. Because when I initially started tracking a couple 100 podcasts, yeah, it was totally fine to have my SQL database store all of this data without doing anything specific around data storage. Right? Just throw it in and figure it out. But once I turn on the full fire hose of podcast data, all 50 ks a day, it became a massive challenge.
Arvid:If every transcript is like 200 kilobytes to one megabyte in text size, because that's what text is, it gets massive. Right? It can be megabytes of data. Again, Joe Rogan, thank you so much for filling my database with massive transcripts here. Then every day you're adding several gigabytes to your database so if you're trying to do full text search or quick lookups with some filtering this becomes a problem even if you have index there for a full text index or just regular string index, it is so hard to get this right.
Arvid:I had to build infrastructure that prevents my database from overgrowing or slowing down to a halt. Older transcripts are actually transferred to an s three based storage and loaded by the main process when they are requested by a user in the front end or in the API. I don't keep all my transcripts in the database because that would easily be six terabytes right now just in raw size, which is super expensive to maintain and super clunky for database access. Now all transcripts live on s three as JSON files and can be loaded on demand for anything older than a couple months for regular transcripts and anything older than just a couple days for the word level timestamp transcripts that we also save. That is probably the biggest one.
Arvid:JSON data for every second of a show. And this has been very helpful in ensuring that the database stays at least a little nimble in comparison. When it comes to search, I'm using an open search cluster, also in AWS, where I just pipe the full transcript in there and then have its own inverted index, I think that's what it's called, built to be able to search for full text there. We're not doing full text search in the database. We have an additional secondary database that we feed all these transcripts into and facilitate search by just having what is kind of an Elasticsearch fork deal with all of that.
Arvid:It would never never ever work in this MySQL database and probably also not in Postgres if I were to have a full text search there just because data is so massive. I was using MightySearch for a while and that also works like all these search engines that can deal with large text, they are good at it, but transcript data is so big that even those databases struggle a little bit. So you have to build something that works, You have to save them in a way that they can be looked up reliably and you have to save them in a way that they can be searched by too. Now there are other challenges, not just storage. There's also a quality problem.
Arvid:Yet again, podcasting is full of quality problems. There's no normal standard for quality in podcasts. And I mean the audio data for that matter. Some people record into what feels like a potato and others have extremely high end setups like this fine podcast. And you never know reliably which one you will encounter if you listen to one or if you try to transcribe it.
Arvid:So transcription systems expect at least a certain kind of quality, and they struggle with low quality audio or non speech content like music that people throw into this as well. So I had to implement a transcription quality checking system that tries to determine if a transcript is acceptable or if you need to retranscribe it with different settings. Whisper is pretty good by default, but there are edge cases where you need multiple attempts to get it right and then all costs money. Biggest problem, and that's probably also why Podscan is actually so impactful for the people using it, is that transcription systems like Whisper but also others struggle with names and brands. Anything that a human could easily get right from context they don't get right because they don't have context they just have a voice pattern and audio waveform and they don't get it right most of the time and what works really well here but is extremely expensive is taking the full transcript from whisper with all the little mistakes in there and having an AI do a pass over it with context from the podcast name the description and maybe prior episodes data and you get extremely high quality transcripts that way but at scale this costs several dollars per episode because imagine what this would mean to use an AI system let's say you have 500 kilobytes of text that is I don't two hours of a podcast.
Arvid:And you pipe that into even a cheap LLM that is hosted on, I don't know, on Anthropic or in OpenAI's platforms. So you have, like, half a million input tokens and then it does some stuff and then it has half a million output tokens. And that's the expensive part. Output tokens are probably the most expensive stuff for LLMs right now to create. And that does not work at scale because that, again, would cost me $50,000 a day.
Arvid:Like, where am I gonna take that money? Not happening. That might be very limited to a very limited number of shows, but even then, it gets super expensive. So that's an unsolved challenge. Currently, Whisper can take a 120 or so tokens of context, just like things like the title of the show, maybe the episode title, and couple of names for people that will be mentioned.
Arvid:So that's what I throw in. I just give it what I know is true about the episode. And back in the day, I experimented with giving it all the brands names from all the Podscan accounts that were subscribed, like all my customers, to give it as context to maybe find those better. But Whisper actually started finding these words in places where they weren't actually there. It was gaslighting me into believing that it found certain words where they didn't exist.
Arvid:So I quickly stopped piping all of these brand names in there. Since then, I only provide context that I can reliably infer that will be in that particular episode. And the big benefit of the system that I've built so far, like the whole installable on some VPC somewhere system, is that it's pretty easy to set up. It's a Laravel application that I can deploy through Laravel Forge onto any new server. I have an install script that fetches a couple Python libraries, and I can spin up a new server quite easily that then automatically attaches to my API and starts fetching and transcribing new episodes.
Arvid:It's really nice, quite scalable. It's not on a Docker container level scalable, but I'll get to that in the future too. And as PodScan's infrastructure grows, we can quickly add more systems so new episodes are transcribed faster with even more quality because, you know, they can be run-in less parallel with less little outlier errors faster. And as models evolve, they might even become better at transcribing. Eventually, I think I can increase the number to get diarized and get more good data that can then be fed into AI systems for what my customers want.
Arvid:When I first set up the fleet of servers to transcribe all my podcasts that I wanted to transcribe, all podcasts everywhere, it probably would have cost me $30,000 a month even on my own hardware. But I'm now at a point where through proper optimization and balancing my customer needs with the expense requirements, I can reliably capture the majority of podcasts at good quality for just a couple thousand dollars a month in expenses. And I think that's really cool. The fact that that is even possible for a sodopreneur to build, I don't think that would have been a thing a couple years ago, but the tools are all out there. It just takes a year and a half of $24.07 work.
Arvid:So yeah it takes a while but it still is possible. The key insight for all of this is that when you're building a business that scales with these factors outside of your control like the global output of an entire medium you just need to think differently about infrastructure optimization and trade offs. Sometimes the most expensive solution will not be the best one, like OpenAI's hosted whisper just doesn't work. And sometimes the constraints that you think are impossible to work with actually force you into more creative and ultimately better solutions. The kind of challenge that makes building businesses both terrifying and accelerating, that's exactly this.
Arvid:You can't control how many podcasts get published worldwide every day, but you can control how cleverly and effectively you solve the problems that stem from this. And that's it for today. Thank you so much for listening to The Boots of Founder. You can find me on Twitter at avidkal, a r v a d k a h l. If you wanna support me on this show, please share podscan.fm with your peers and those who you think will benefit from tracking brands, competitors, their products, all kinds of things, names on podcasts out there.
Arvid:Podscan is this near real time podcast database with a really solid integration system. We allow a lot of people to build solutions to get leads, to get information on their clients and all of that. So please share the word with those who need to stay on top of the podcast ecosystem. Thank you so much for listening. Have a wonderful day and bye bye.