409: James Phoenix — Claude Code Masterclass

Arvid:

Hey, it's Arvid, and this is The Bootstrapped Founder. Today, I'm talking to James Phoenix, an expert in agentic coding, particularly Claude Code. That's a tool that I've been using for my own software projects to great effect over the last couple of months, and I've talked a lot about this on this podcast. I chatted with James just a couple of days ago and have already implemented several of the tips that he gave me during this conversation, and I'm almost twice as effective at using this already magically effective tool, which is very impressive. So I don't think I can overpromise how much insight into well-structured and highly optimized agentic coding you will get from this conversation. A big shout-out to the sponsor of today's episode, paddle.com, my merchant of record payment provider of choice.

Arvid:

They're taking care of all the things related to money so that founders like me and you can focus on the things that only we can build and Paddle handles the rest. Sales tax, credit cards failing. I don't want to deal with that. I don't have to. They do.

Arvid:

I highly recommend it. So please check out paddle.com. And now here's James. James, welcome back to the podcast. Last time, we chatted about LLMs and prompting them, and today we will dive into one of the most powerful agentic coding tools out there, if not the most powerful at this point.

Arvid:

I use it every day, and you've become an expert at managing it, wrangling it, and getting it to do the things you want. Let's dive deep into building software without typing a line of code. Do you know just how much you've actually used Claude Code in total so far?

James:

Yeah. So I've been using Claude Code a lot. I'm roughly spending around, like, maybe $3.4k if they were charging me in API costs, or around 2.3 billion tokens per month. Yeah. So I've been using it a lot in my workflow, even to the point actually where I have downgraded my Cursor Ultra plan.

James:

So I had a $200 Cursor Ultra plan and a $200 Claude Code plan. And I'm mainly using Cursor now for fixing up work or where I need to slow down. And most of the work has gone to Claude Code. So that one's keeping the big subscription for now.

Arvid:

There's so many layers to think about when you buy these tools. Right? There's the IDEs that come with them, then Claude Code is kind of standalone. You can run it in any IDE that it has an integration for. Maybe even if it doesn't have an integration, you can still run it.

Arvid:

There's a lot of complexity to the cost of these tools at this point. I do wonder, with these tools, particularly Cursor, which is kind of VS Code on steroids with AI and can integrate with other tools and different models, these tools haven't really been around for that long. Do you see things like Cursor just falling off at some point and being replaced by something else, or should we expect Cursor to stick around, and tools just like it?

James:

I think the challenge that Cursor has is they aren't a model maker, and so they explicitly have to make more profit margin on a consumer, and they end up passing that cost to the consumer. Right? So Anthropic or Google, they have their own foundational models, and therefore, they don't have to make as much money. I think Cursor has been building a little bit of a moat on their tab completion models. So, actually, I found that their tab model is really, really good.

James:

I can tell they're spending a lot of time specifically improving that. They do still have kind of an edge on Copilot. I mean, I did look at Copilot recently. You can do things like run agentic workloads against GitHub issues. People still have, I would say, a preference for the Cursor agent mode over, like, the GitHub Copilot agent.

James:

But, yeah, definitely, the gap is closing. And so, yeah, I think Cursor is not necessarily gonna be in trouble, but they're just taking good, specialized approaches like background agents and making their tab completion kind of best in class.

Arvid:

Well, when I think about tab completion, I'm trying to remember when I last used it, and it was not today. Like, I was building a lot of stuff today. I think I have built, like, two or three features in two projects today, yet I don't think I've written a single line of code at any point during this. It was just prompting and then taking or leaving what came back from it. So I do wonder if tab completion is ever gonna play a role in my choice of IDE.

James:

Yeah. I guess my point would be that if you need additional context on what you're doing, then those more kind of primitive AI features become more valuable. Right? So, like, if you're actually deep in the implementation of the code, then tab completion is really useful. But, obviously, if you're using something like Gemini CLI or Claude Code or OpenCode, you know, whatever you're using as a kind of agentic coding tool, then tab completion isn't really a feature that you even think about anymore.

James:

Right? So it depends. I also think that there is kind of a hidden cost to using Claude Code where you're essentially an engineering manager, and you lose a lot of that contextual awareness about what's happening in each individual file. And so there is kind of that trade-off of, you know, if you're not actually writing the file, you do have a loss of context that you then have to re-aggregate as an engineering manager. So if you're an engineering manager and you've got three devs, you essentially farm out, you know, a bunch of tasks.

James:

Each dev goes off. They work on that task in parallel, and then you get all this completed work back. But as an engineering manager, you didn't write those files by hand. So what you're finding is that you're having to find clever ways of recontextualizing on the code that just came out. And, obviously, PR review is one of those.

James:

But I think the main thing that I do is you can do some quite clever things: you can ask, like, Claude Code to generate you a data flow of what it's implemented, for example, or mermaid diagrams. So that can be quite a nice way. I get it to create digest files. So what have you done? What are the changes?

James:

What is the state of the data flow? And then, yeah, you're basically having to find a way around that. But obviously, when you're using tab completion models, you're not necessarily having to do that. You're fully in the zone. You kind of know this is this interface.

James:

This is this function. So there's a trade-off on either side, right, in terms of speed versus context loss.

Arvid:

That context loss, that to me is the source of all evil in using this. And I mean this in a neutral way, even though it's not expressed like it, but that is the risk part. Right? That is where we risk skill atrophy. That is where we risk introducing stuff into our code bases that we don't understand, that might just be a time bomb to blow up later.

Arvid:

And I believe that a lot of founders who are using this to build their businesses, or a lot of software engineers who are using agentic coding just to help them be more productive, they're introducing this without knowing it, because they don't have the required skill set just yet to make this shift from coder to engineering manager. They don't know yet that they are actually a manager and not a software developer anymore. How can we do this? How can we make this easier for people to understand?

James:

Yeah. I think the main point here is you have to be very careful about committing code that's come from Claude Code. And it's actually easier to spin up a separate Git worktree on a separate branch, have it have a go at a feature or something, and then review the code and throw it away than it is to merge in poor-quality code and then have to figure out in the code base, about 10 or 20 or 30 or 40 commits later, where that poor-quality code is. So, actually, the main thing that you have to wrap your head around is that adding code is both dangerous and not dangerous, and I'll explain what I mean by that. So if you're doing an additive feature, so if you create a new feature and it doesn't necessarily touch anything, if you do that in a Git worktree or even if you do that in the main branch, if it's purely an additive operation, then you can simply delete the code.

James:

Right? Where it becomes more dangerous is if you have a service and you mutate and you update existing dependencies or existing files, or if you have something that is now dependent on something and you've changed that. Anytime you've got that kind of situation, that is where the danger strikes of committing code without testing it. So mutative or modifying code is actually more dangerous than additive code. So this idea that if you just add something, it's not necessarily a problem, because you can throw it away. It doesn't affect other services.

James:

Whenever you're updating existing infrastructure or service layers, that's the dangerous point. And then the second point is, actually, it's much cheaper to go through each individual file diff and see if it's making mistakes and push back on Claude Code at that point in the conversation history, because you've already got that conversation history to hand. Otherwise, what's gonna happen is you'll realize there's a bunch of changes that you should have actually pushed back on. You'll be searching for that session ID in all of your Claude Code sessions, because Claude Code is ephemeral, right, in the sense of: unless you have that specific session ID, you're now having to paper trail, go work your way up and use Claude's --resume. You'll get all the session histories, and you're trying to find that specific one which actually was associated with that chunk of work.

James:

So pushing back on the chunk of work at that point in time is actually easiest, because you don't have to keep a handle on where that specific session is. You're just dealing with that problem in the here and now. And the other thing I will say on that is that every time you're pushing back on Claude Code, it's an opportunity for you to learn something, and it's a couple of different things. So mainly, it's you didn't prompt it well enough. It didn't have the right context.

James:

That's one potential area for it to have gone wrong. The other area that is interesting is you might have specifically given it all the right context, but it didn't have any rules or any patterns to follow. So if you're working on a greenfield problem, you might find that actually, if it hasn't got any patterns, it would do several things in different ways, and it doesn't have a standardized way of working. Or it's a rules problem, so it just didn't have any rules to follow. So, generally, it's a prompting problem or a context problem or a rules problem.

James:

And, also, it can be a testing or a verification problem. So it's one of those things that generally goes wrong when you're using Claude Code. You know, those are the things to kinda look at.

Arvid:

So how do you define these patterns? Particularly for a greenfield project. I think Claude is capable of extracting patterns from existing code bases and codifying them into something that it understands. At least that's what I've heard. If not, tell me. But how do you deal with this when you wanna build something new?

Arvid:

Do you already have to be able to completely formulate this almost like a, you know, waterfall design document, or do you have a couple of starting points that you then elaborate on over time? Like, how do you approach this?

James:

Yeah. I think the main thing that I've learned is it's easier to update one file and come up with a global kind of pattern for some type of layer than it is to roll out 10 or 30 files in an API and then realize that you want to add JSDoc or you want to add types or you want to add a certain type of dependency or middleware. So I think the main point of learning this is that you're better off, in some weird way, spending half an hour or an hour or an hour and a half in Cursor and iterating on specifically what that pattern is. And then once you've got the pattern nailed and it has a pattern to follow, then you can roll that pattern out across a bunch of different Postgres tables. And I found that works really well.

James:

That's one way of doing it. The other way of doing it is to come up with a rules file and say, when you're making this type of pattern, you should do this. I don't find that works as well, though, because it doesn't have code to mirror the pattern off of. And, also, Claude Code makes a lot of agentic tool calls, specifically for file reading, to see what it kind of looks like to begin with. So I think, yeah, building up a couple of examples is, like, the best way to kinda go.

Arvid:

That's interesting. I've never thought about it like this because I always thought, okay, greenfield project, might as well let Claude do all the work. But by you spending some time to actually explore what you want it to look like, even just prototypically, what you're saying sounds to me like you are giving very clear evidence of what you want, and Claude will, from there, infer what the rest of it should look like.

James:

Yeah. And I think Cursor works very well like this as well. If you're doing greenfield, you'll basically, with any kind of LLM, run into the same issue. It's not even just a Claude Code thing. It's just LLMs are really good at generating tokens, but they lack real strategic sense.

James:

And, like, that's an example of something that I'm seeing on my personal project, which is it will define lots of constants that exist, and those constants will be repeated in different Zod models in lots of different files. And so what I've noticed is these models are incredibly good at working in very hyper-localized context, but they're very bad at thinking about strategic context. So, for example, if you're doing social media platforms, you'll have some constants or an enum for Facebook slash Meta or X or LinkedIn and YouTube and what have you. But those constants, if you leave it unchecked, Claude Code will just go and put that in every interface, in every Zod model, in every file. And that's what I'm finding right now in my project.

James:

So I'm like, it's really interesting how Claude Code is really good, but, actually, if I'd have just started by upfront saying we always use these constants, then it wouldn't have had any additional patterns to match against. Even still, even if you do that, there will still be some that slip through the cracks. You just can't always expect to have all of the right context in every single prompt. It's just unrealistic. And so that's kind of where you need that QA step where you're specifically looking at different types of things that are wrong with the file changes or the file creates that have happened, basically.
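
To make that centralization idea concrete, here is a minimal sketch of the kind of shared constants module James is describing, assuming a TypeScript project using Zod; the file path and platform names are illustrative, not taken from his project.

```typescript
// packages/core/src/constants/social-platforms.ts (hypothetical path)
// Single source of truth for the platform enum, imported everywhere
// instead of letting each Zod schema re-declare its own copy.
import { z } from "zod";

export const SOCIAL_PLATFORMS = ["facebook", "x", "linkedin", "youtube"] as const;
export const SocialPlatformSchema = z.enum(SOCIAL_PLATFORMS);
export type SocialPlatform = z.infer<typeof SocialPlatformSchema>;

// Example consumer schema: it reuses the shared enum rather than
// redefining the platform list inline.
export const PostSchema = z.object({
  id: z.string().uuid(),
  platform: SocialPlatformSchema,
  body: z.string().min(1),
});
```

A package-level CLAUDE.md rule can then point at this file as the canonical pattern to follow, which is exactly the QA check James describes running over new file changes.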

Arvid:

Yeah. I mean, we had refactoring and rewriting stuff before we had Claude Code, so people made mistakes as well, right, or implemented things in ways that turned out to not be the optimal way. I do wonder, though, because Claude Code has plan mode and it has the capacity to not immediately act, right, not to immediately change things, but to think about it, would it make sense to have this kind of refactoring step every now and then just to let it check the code base for repetition?

James:

Yeah. So I think what you're talking about is, and I'm gonna expand this to be a bit more of a philosophy rather than just that specific task, there's kind of some tasks that you don't want Claude Code to do all the time, but you want it to do kind of infrequently, like a one-off task. And so what you're describing, looking for redundancy or duplicated code or a lack of specific global constants or types, is a dev task, which means that constants or enums become centralized. And that is the kind of thing that you should probably have on a weekly cron schedule, and you should have it as a command that you can run inside of Claude Code. But it's not necessarily something I would make a subagent for.

James:

It's like a one-off command. I could run it, like, I don't know, once a week, or I could run it on a cron, on, like, a GitHub Action. And I can specifically look for errors in the code base kinda systematically, but it's not something I'm gonna run every day, and it will also pick up a lot of false positives. So that is a good example of something you should put into a Claude Code command.

Arvid:

But I never defined any commands, really, for the reason that I didn't even know that you could do that until this point. So it sounds like you have things that are just conversational where you just go through and build some kind of feature, and then there are things that are so repeatable that they warrant having this kind of encapsulated command.

James:

Yeah. And, like, basically, anything you're doing, not all the time, but kind of, I don't know, every day or every week, would be something you could put into a command. So a good example would be I have a command called update-docs.md, and it tells it, basically, I want you to review all the docs in our architecture. I want you to look at all the package implementations. I want you to verify a bunch of API routes that match the documentation.

James:

And then it basically says, go and look at all the different features and update a docs/architecture.md, for example. So keeping the docs fresh just from a Claude Code command is, like, an example. Another one would be, like, I'm using a system prompt override in Claude Code, so I'm specifically telling it to kind of take on a role, and I don't want to have to keep telling it to take on this role. So that's something I use a lot more frequently, where I'd have, like, a slash confirm command, and it just tells it, I want you to confirm that you're this role, and you're the best software engineer in the world. And the reason why is that Claude, they've obviously got a system prompt.

James:

You know, you are a developer. You do this. You do that. Right? That comes injected into the system prompt.

James:

And so what I've actually found works really well is to override the system prompt. And there's a system prompt that you can use that was made by Augment Code, and it's kind of XML-based, but it basically sets up Claude Code to generate a lot more to-do tasks. And I found that the painful points of using Claude Code are that it either stops early or it doesn't run tests. And basically, yeah, this kind of system prompt override basically says, you must have a plan or a mission. You must enter an operational loop.

James:

You must do the work, verify that the work is done. Only go out of the operational loop when the work has been verified. And so basically, it will do things like: it will plan, it will implement, it will run tests, and it will verify tests. And so that kind of operational loop is really useful, but obviously, it's, you know, a bit boring to keep typing, I want you to take on this role, use what's in my CLAUDE.md, and assume this role. So that's another example of a slash command that I use, like, every hour, just to save myself some manual typing.
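
For illustration, a custom slash command of the kind James describes is just a markdown file in the project's .claude/commands directory; this is a hypothetical sketch of an update-docs command, with the file name and the wording assumed rather than taken from James's actual setup.

```markdown
<!-- .claude/commands/update-docs.md (hypothetical example) -->
Review all of the docs under docs/ and compare them against the current
package implementations.

1. Read docs/architecture.md and the README of each package.
2. Verify that the documented API routes still exist and match their
   handlers and request/response types.
3. Update docs/architecture.md with any features, routes, or data flows
   that have changed, and flag anything you could not verify.
```

Typing /update-docs inside a session then expands to this prompt, so the recurring chore becomes a one-liner.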

Arvid:

So that cannot be reliably auto injected? You kinda have to do it manually through the command?

James:

You can use Claude Code hooks to do auto-injection, but it depends on, like, where those hooks land. I personally haven't explored that route. Yeah, I mainly use Claude Code hooks for custom linting and type checking. So there are Claude Code hooks.

James:

And if I just have a brief look right now, they are, like, Anthropic are always releasing new types of hooks. So there's, like, a pre-tool-use hook, a post-tool-use hook. You've got a user-prompt-submit hook, so whenever a user submits a prompt, you can do that. Or a session-start hook, so you could technically inject something into a session start.

James:

But I don't like to do it all the time. So it's one of these things where, you know, you've gotta be, like, careful about using hooks, but they are also very, very powerful. A good use of hooks, I think, is running custom TypeScript checking, unless you're using the IDE. You can specifically feed TypeScript errors back on file modification. That seems to be quite nice.

James:

So as it writes files, it will run those specific files through TypeScript's compile checker, and then it will give those errors back to Claude Code, so that as it saves the file, it now knows the type errors that it generates. And that's quite useful as a kind of step in the middle before type checking at the end. So it's aware of the type issues that it's creating as it makes them.
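
As a rough sketch of that kind of hook, with the caveat that the stdin payload shape and the exit-code convention are assumptions rather than something confirmed in this conversation, a post-tool-use hook can be a small script that type-checks the file Claude just edited and feeds the errors back:

```typescript
#!/usr/bin/env bun
// Hypothetical PostToolUse hook: type-check the file Claude just wrote.
// Assumes the hook receives a JSON payload on stdin that includes
// tool_input.file_path, and that a non-zero exit with stderr output
// gets surfaced back to Claude. Wire it up in .claude/settings.json.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const payload = JSON.parse(readFileSync(0, "utf8"));
const filePath: string | undefined = payload?.tool_input?.file_path;

// Only care about TypeScript files that were actually touched.
if (filePath && /\.tsx?$/.test(filePath)) {
  try {
    // A real setup would respect the project's tsconfig; this checks one file.
    execSync(`bunx tsc --noEmit ${filePath}`, { stdio: "pipe" });
  } catch (err: any) {
    // Print the compiler output so Claude sees the type errors it introduced.
    console.error(err.stdout?.toString() || String(err));
    process.exit(2);
  }
}
```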

Arvid:

That to me is one of the coolest features about agentic coding to begin with, this self-evaluation that happens in between the things that happen, in between the to-dos. I'm working a lot with Laravel and with Vue.js. No TypeScript, raw JavaScript like the people in the nineties, and just seeing it building stuff and linting stuff and checking stuff, seeing errors come back, and it fixing those errors in the loop without me having to do anything, that to me is the true magic of agentic coding. Because everybody could come up with a system, or lots of LLMs could just come up with a system, that gives you the output and then you fix it yourself.

Arvid:

I used to work just like this for the longest time. I had it in the web. I had web Claude, pasted code in there, had it create new code, copied it, and pasted it back into my IDE, because apparently I live in the stone ages. So that was how I did it. And then errors happened.

Arvid:

I pasted the error back. And, you know, like, that loop I had to do manually. But the agentic system does it all by itself. We just kinda have to encourage it to do it right.

James:

Yeah. I think we've sort of gone all the way from Eclipse or IntelliJ slash VS Code, to Copilot with tab, to Copilot with chat and Cursor with chat, and then we've gone from chat to agent. And then we've now gone from agent in Cursor to, essentially, agents on the CLI, with kind of ambient agents where it runs for a larger extended period of time on a slightly larger task. And, yeah, it's quite powerful, the fact that it can self-correct its trajectory. It's also quite dangerous, because it will think sometimes that it's going in the right direction, and, actually, it's taking side steps.

James:

It can, for example, change the implementation to make the test work, right, which is terrible. Or it can start importing a package in a specific way where you would want to have imported it in another way or from a different type of module. So it's not without its own errors, and what I think is the key point is that keeping tabs on what's happening in the file system changes is really important. And you can stop a Claude Code terminal session and just tell it, no. Actually, we should do this.

James:

And then, I think this leads quite well into what you were asking about earlier in terms of how do you embed rules inside Claude. And so there's this idea of, like, the CLAUDE.md file. And inside of a CLAUDE.md file, you can put kind of anything in terms of, like, how it should write code or what kind of patterns it should follow or how it should react when certain conditions are met.

James:

I found that if you put too many rules in your root-directory CLAUDE.md, you'll essentially get context rot, which is basically this principle that the more input tokens you have, the more degradation of performance you get. So what you want to do is, when you kind of experience this error where you've got three terminals open and you see that it's done something you didn't want it to do, you should specifically say to Claude, I want you to add this to a localized CLAUDE.md folder or file. So basically say, put it in a localized folder and have a CLAUDE.md there.

James:

And the kind of rule, or philosophy, that you should follow is: if this truly affects the entire system, it should live in the root CLAUDE.md file as a rule. But if it only affects, for example, the API layer or a scheduler layer or, I don't know, a domain package or a core services package, it should live specifically in that rules file. And then you get the benefit of additional rules at appropriate layers of the application, without the increased context rot and bad degradation of performance you get by sticking everything in the root directory.

James:

And what you're basically ending up with is a kind of hierarchical knowledge-based system of a series of CLAUDE.md files that are embedded at sublayers of your application. And whenever Claude makes changes in those areas, it will recursively bubble up those CLAUDE.md files into a single kind of rule set. And so if you have things that affect all of the packages, that will be under packages/CLAUDE.md. But you might have package A that has specific bits of knowledge that only package A is concerned with.

James:

And you can ask Claude to sort of tidy up your knowledge base and say, everything's in the root-directory CLAUDE.md file. Help me out. Take what you know about my directory. Do kind of an investigative research pass and figure out where all the rules should live, and then it can split it up into little subsections of CLAUDE.md files. And that will massively help you, in terms of when you're starting a new session with Claude, you're not gonna get as much context rot, and you'll get increased performance because you're simply putting less into the context.

James:

And so I think that's been, like, the one learning which I did and implemented about two weeks ago, and I'm already getting a lot better results, even though I personally think they've quantized the model, and I think June was pretty much, like, the holy grail. I think there have been some slight degradations of performance because of how many people have been abusing kind of the Claude Pro Max subscription. So I personally think that June was kind of the Hail Mary, but I think we'll get back to that point when model performance gets cheaper and they don't have to quantize. And I totally get why they're probably potentially quantizing, because it's really expensive to run so many versions of Sonnet. But, yeah, a hierarchical rule-based system is kind of what you're looking for. You can even use Claude to kind of break up your

James:

CLAUDE.md file into subfiles as well.
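
To picture the hierarchy James is describing, a monorepo might end up with something like the following layout; the package names are made up for illustration.

```text
CLAUDE.md                      # global rules only: things that affect the whole system
packages/
  CLAUDE.md                    # conventions shared by all packages
  api/
    CLAUDE.md                  # API-layer rules: route patterns, middleware, error shapes
  core/
    CLAUDE.md                  # domain/services rules: where shared constants and types live
  scheduler/
    CLAUDE.md                  # scheduler-specific rules
```

When Claude Code works on files under one of these folders, it bubbles the relevant CLAUDE.md files up into its rule set, as James describes, so each session only carries the rules for the layer it is actually touching.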

Arvid:

That is amazing. I mean, the CLAUDE.md file, I get that as, like, a central repository of these insights, but having one on a local level, that's subsidiarity. Right? That's the idea of, like, putting stuff where it actually is needed and not just on the root level of anything.

Arvid:

This is a governance system.

James:

Yeah. The other thing I was gonna say with this is I actually don't use, like, @ references that much, and I essentially like to use a pointer-based system. So I tell Claude, in CLAUDE.md files, if you want to learn about this, you can go and look at this specific file or this specific file path. And then it doesn't pull it into the whole context for that CLAUDE.md. It can do a tool call to read that specific file in and gather up that context, because Claude is really good at doing tool calling.

James:

And so that's, like, another thing that you can kind of use as a way to avoid bloating specific CLAUDE.md's with @-file context. Just using pointer references to be like, these three files or these four files are, like, how you should implement this pattern. Here's a quick example, but if you want a more granular example, go and read this file. And you're just giving it pointer references, so it knows how to do the tool calls, but you're not necessarily stuffing all of the examples into that specific piece of context.

Arvid:

This sounds amazing, and it sounds like a lot of work.

James:

It's so much work.

Arvid:

I wonder, are we ever gonna see, and of course all of this is kind of recursive in a way, an agentic way of doing this? Right? Will we be able to extract this to the right level?

James:

So I think this goes on to the point of, like, if you're building greenfield projects, then you kind of have to curate and maintain this. And you get to a certain point in the project where, like, if I spin up, like, a fresh package inside of my monorepo architecture, it knows I'm using Bun. It sets up, like, unit tests, integration tests. Like, it creates all the right commands. It adds an ESLint config.

James:

It adds a TypeScript config. Like, it knows what to do now. Right? But, like, if you're starting a greenfield project, then there's a lot of work in figuring out the right project structure. And so, actually, the agentic way, if I have an existing project, is specifically finding all of the different types of contextualized knowledge. The code already contains those patterns, and you're basically just reversing those patterns back out into kind of a rules.md file, so you've already got the patterns.

James:

Now having said that, you might go onto a really crappy project, and they don't have patterns, which is also really bad. So that gets onto this point that you should always have ideally one, but maybe two, ways of doing something, and that the more ways you have of doing something, the more confusing it is to an LLM or a human developer. And so, basically, if you can have a reduced number of service layers, ideally, strong typing across the application, and also one way of doing things, then that massively simplifies how new things are gonna come out of the box. And the fourth thing would be, like, how easy is it to set up the dev environment? So having kinda custom shell scripts that fire off worktrees or copy environment variable files or boot up various parts of the application is also a lift for Claude Code, because then it knows how to create a worktree.

James:

It just runs a set shell script. So all these kind of, like, mini pieces of infrastructure that allow you to, like, onboard a dev faster actually also work really well for Claude Code. So those are kind of, like, the main things that I've also found to be really useful: kind of, like, custom shell scripts for it to, like, copy all the environment variables into a fresh Git worktree, because those don't get copied across, because they get ignored by default. And there are people building, like, there's, like, a really good desktop app called Conductor that's been going around on X, but I honestly think you should probably just write your own shell scripts and kind of own that part of the infrastructure. They do have, like, a custom shell script that runs on workspace setup, which is really cool.

James:

I just find that, like, for me, it took a lot of Claude Code, like, going backwards and forwards debugging exactly what that worktree setup should look like, and therefore, like, it's useful to be able to just try it out and see what Claude Code comes back with.
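
A minimal sketch of that kind of setup script, written here as a Bun/Node TypeScript script rather than plain shell, might look like the following; the branch naming, the worktree location, and which env files get copied are assumptions for illustration.

```typescript
#!/usr/bin/env bun
// Hypothetical worktree bootstrap: create a worktree for a feature branch
// and copy over the gitignored .env files that `git worktree add` leaves behind.
import { execSync } from "node:child_process";
import { copyFileSync, existsSync, mkdirSync } from "node:fs";
import { resolve } from "node:path";

const branch = process.argv[2];
if (!branch) {
  console.error("usage: bun scripts/new-worktree.ts <branch-name>");
  process.exit(1);
}

const worktreePath = resolve("..", `worktree-${branch}`);
execSync(`git worktree add ${worktreePath} -b ${branch}`, { stdio: "inherit" });

// Env files are gitignored, so carry them across by hand.
for (const envFile of [".env", ".env.local", "packages/api/.env"]) {
  if (existsSync(envFile)) {
    const target = resolve(worktreePath, envFile);
    mkdirSync(resolve(target, ".."), { recursive: true });
    copyFileSync(envFile, target);
  }
}

console.log(`Worktree ready at ${worktreePath}`);
```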

Arvid:

The worktree thing is still mesmerizing to me because I haven't tried it. I work on main. Again, that's stone age for me. And the idea of having Claude run on the same code base in different worktrees, that sounds like a lot of chaos. How do you handle this?

James:

Yeah. So, actually, the main thing I would suggest is there are some types of tasks that are very easy to do, and you can parallelize these, not necessarily just on a Git worktree, on main. So if you have five tasks, for example, that you know won't have much surface area covering each other, you'll find you can YOLO that on main or staging as a parallelizable thing. Where it gets quite difficult is where you're trying to modify the database or modify service layers that are going to break parts of the application, or if you want to run lots of different tests, that can be quite difficult to parallelize. And that's the same whether you're on kind of a main branch or a Git worktree.

James:

I personally find Git worktrees are good for risky types of work, where you want to explore a new feature or you want to do a major refactor. These are good points for a Git worktree. If you're doing parallelizing of work and you have testing infrastructure that requires it to be parallelizable, that is a pain point you will face on both a Git worktree and also on your traditional branches, right, like main and staging and dev. And so the kind of advice I would give is be very careful about how granular you want your isolation to be in a testing environment, because I personally have just spent two and a half months getting that working with Supabase, and it was really painful. But the architecture I have now is, because I have a monorepo structure and every Bun test will run using Turborepo, all of those tests technically run in parallel, and they don't wait for each other.

James:

So it might be slightly different in PHP. You've probably got a single process that runs your tests. Now if you have multiple processes that run your tests, you potentially end up with integration data races. And so the best thing you can do is two things, and this will completely solve it, but it's quite challenging: you basically have no seed data in your .sql. No seed data.

James:

So seed data becomes completely dynamic at runtime. So rather than trying to build up manual, brittle bits of seed data, every test runner creates idempotent factory test data, basically. You have a layer for creating factory functions. That's the first thing. So factory functions, with this dynamic kind of building up of seed data, become way better for parallelization of testing and also, yeah, just general testing.

James:

The second thing alongside that is, if you're using something like Supabase, you're limited to one database. So when you create a user, you'll prefix them with a branch name, and you will also do things like create a custom Postgres schema that will specifically allow you to run that specific test on a custom Postgres schema that then gets cleaned up after that test run. So factory functions and idempotency, on both your users and your database, is kind of the main way to do it. If you get that kind of setup, you can basically parallelize on main as long as you don't get file changes happening across the same files, and that is kind of, like, a god-tier kind of architecture for testing: you could technically have three individual agents. They could be running on main or they could be running on a Git worktree.

James:

They can essentially run the test suite. And because the test suite isn't using brittle seed data and SQL files, it will just be dynamically creating and inserting rows in different tables, which have completely different IDs, and also be testing on a different database. Therefore, you get true kind of database isolation of testing. And that's kind of what they have at FAANG, if that makes sense. They have a testing infrastructure that kind of supports that.
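
A tiny sketch of the factory idea, with the table shape, naming, and schema convention all assumed for illustration rather than taken from James's codebase: each test builds its own rows with unique IDs, and the run is namespaced by branch so parallel workers never collide.

```typescript
// test/factories/user.ts (hypothetical)
// Every test run creates its own data instead of relying on static seed SQL,
// so parallel workers can't race on shared fixtures.
import { randomUUID } from "node:crypto";

// Namespace the run by branch so identifiers and schemas stay unique per worktree.
const branch = process.env.GIT_BRANCH ?? "local";
export const testSchema = `test_${branch}_${Date.now()}`;

export interface TestUser {
  id: string;
  email: string;
  name: string;
}

export function makeUser(overrides: Partial<TestUser> = {}): TestUser {
  const id = randomUUID();
  return {
    id,
    // Prefix with the branch so rows from parallel runs are distinguishable.
    email: `${branch}+${id}@example.test`,
    name: `Test User ${id.slice(0, 8)}`,
    ...overrides,
  };
}
```

A test then inserts makeUser() rows into its own schema (or under its own prefixed identifiers) and tears them down afterwards, so the suite stays idempotent no matter how many agents run it at once.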

Arvid:

That is amazing. Because even if you weren't to use agentic coding at all, that would still be beneficial just for testing in general, because parallelized tests run faster. That generally makes it easier for you to actually check and then keep working. And PHP has that too. Like, PHP has, or can have, parallelized test runners, which makes these changes necessary, because, again, yeah, you don't wanna have a database that, for some race condition, works in one test and doesn't work in the other.

Arvid:

Right? That's definitely something. So even if you don't do any agentic coding, this is a good idea for your testing. You should also be testing. Maybe that's another thing.

Arvid:

Honestly, I've been a software engineer for twenty-some years at this point. I just started really testing because of agentic coding. My intro into writing tests was: I don't wanna have to figure out how to write tests. You do it for me. I just figure out if they are correct.

James:

Yeah. So I think the main thing that you've probably started experiencing is your projects getting so large that you can't specifically know whether everything is working or not. And, also, what I think is kind of interesting is that Claude Code is increasing the ambition of software engineers. And as it increases the ambition of software engineers, naturally, what's gonna happen is you get more lines of code. And because of that, testing becomes a way for you not just to function but to breathe.

James:

You actually need some tests, at some point, to specifically know whether things are working or not. Now another point, and I'm just gonna throw this in there, is if you have a failing test suite, right now, just drop the podcast and basically get that working, go and make that CI green, and then come back to the podcast and listen to the rest of this. You need a green CI. Okay? Because if you don't have a green CI, how are you gonna know, when Claude Code runs those tests, whether it was the changes to those specific files that caused something else to go red?

James:

So if you have technical debt in your testing layer, or Claude made too many files and too many tests and some of them are failing, just skip them for now and get it working, get it green, and you have to build from a green CI. And this is something I've personally had a problem with: I have probably 800 tests or a thousand tests, and I had probably 150 that were failing. And at that point, I was like, I need to get back to just a green CI, because it helps Claude Code with a lot fewer errors when it's running tests. Also, the other thing you should do is you should tell Claude Code to run specific types of tests in specific folders, to avoid it getting too much context from the entire test suite. So if it's working on this bit, only run these tests.

James:

Don't run the entire suite of tests, because it's gonna fill up with so much additional context that it's gonna lead to context rot. So this idea of: if you change this area of the code base, write and run those tests first, and then after that, go run the whole test suite. Don't run the whole test suite over and over again, if that makes sense.

Arvid:

Yeah. Because there's a lot. Like, if you have a lot of tests, then there is a time effort invested in just keeping it running, it slows down the agent, and, you said it, it fills up the context. That is, I think, a very, very important thing to have because, honestly, those tests to me are also a defense against Claude Code.

Arvid:

To defend some part of the code base, that you might not even remember is there, from being changed by this for some reason. Right? Claude might, because it's interested in creating a new cache object in Redis somewhere, introduce a new name, one that it thinks is new, but it's already in there, used by a different part of your code base. And all of a sudden, now that part doesn't work anymore. If you don't have a test for it, you will never know until it explodes in production.

Arvid:

So that is the defense against the creativity of Claude Code or any agentic system: having tests in place, which is why I'm doing it. Because otherwise, I would be like, ah, whatever, I test in production like any other indie hacker, but I can't afford this anymore because I'm not coding it anymore.

James:

Yeah. I think the main thing I've also realized is, like, there's different layers of tests, and the different layers of tests are useful and not useful in different ways. So let me just spend two minutes describing all the different layers of testing. So you have unit testing, which is your kind of: I have a function, and I test some arguments to that function against some expected return values or types. That is kind of your unit test.

James:

And you can do the same thing in React. I'll spin up a component. I'll do a snapshot test, which is looking at the HTML output of a specific React component. Vue and all those types of JS frameworks have the same thing. And you also have integration tests, and this is where it gets a bit fiddly about what is an integration test, because it can be very close to a unit test or it can be very close to an end-to-end test, depending upon how much you mock.

James:

And I think, actually, when a senior dev says to you, an integration test gives me the best bang for buck, it's a little bit like when you go to Amazon and you see a bunch of five-star reviews or four-star reviews. One person's three or four stars is another person's five stars. And so what they probably mean when they say an integration test gives me the best bang for buck is they mean it tests multiple layers of the system. So it's not testing one specific function. It's testing, like, for example, in his system, he's got, like, a podcasting kind of transcription pipeline.

James:

A useful test for him is probably, like, taking in an MP4 or an MP3 file, transcribing it, making sure it gets all the way through, making sure it's inserted into the database, making sure that when he queries for that specific ID in various endpoints, it comes up, and that the database insertion is also following the API contracts. And that test is worth way more than, kinda, 200 unit tests at every layer, because that basically shows you that, like, the testing execution is closer to the runtime execution of what's actually happening in production. And so back-end integration tests and back-end end-to-end tests, where you're not mocking anything, are probably actually the best kind of tests you can do. I have done a lot of end-to-end testing with Playwright, and it is good. But the problem is you end up with really brittle and very memory-expensive kind of testing, and it's very flaky, and you have to, like, log in users.

James:

And what I found is that I've spent more time trying to fix my Playwright tests than if I'd just spent more time on the back end and basically built more back-end testing. So I'm pretty bullish on back-end testing, and I'm really bearish on kind of Playwright testing. Having said that, I do think there are some kind of weird bugs you'll get in React, like render bugs or component bugs. And what I find really works well for those is to find that bug and then ask Cursor, can you write me a quick component test to regress this bug? And actually squashing those bugs with component tests seems to be a much more efficient use of my time than telling it to write an end-to-end test to test if that thing works.

James:

Now there is a trade-off here. If you're using lots of, like, third-party services and APIs that have callbacks into the client side, then you might need some more end-to-end testing. But if you own more of the primitives, or they're just API services you can mock, then, yeah, back-end end-to-end testing is generally better.
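
As an example of the component-level regression test James mentions, a sketch using React Testing Library under any of the common runners (Jest, Vitest, or Bun test) might look like this; the component, the bug, and the file names are hypothetical.

```tsx
// PlatformBadge.regression.test.tsx (hypothetical)
// Pins down a previously observed render bug: the badge used to crash
// when no platform was provided instead of rendering a fallback label.
import { render, screen } from "@testing-library/react";
import { PlatformBadge } from "./PlatformBadge";

test("renders a fallback label when no platform is provided", () => {
  render(<PlatformBadge platform={undefined} />);
  // getByText throws if the fallback is missing, so this fails if the bug returns.
  expect(screen.getByText("Unknown platform")).toBeTruthy();
});
```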

Arvid:

Yeah. It's a question of degree. Right? Like, how much of those kind of tests do you need for your particular implementation, for your particular product? You were mentioning Playwright.

Arvid:

Like, that sounds to me like something that I would have to, like, MCP into Claude Code. Now I don't really know much about this. Like, how do you get Playwright to play with your Claude Code environment?

James:

Yeah. So Claude has MCP servers, and each MCP server can either be a Node.js executable, an index.js, typically, that comes out of a TypeScript build command or some type of build command. And those are your traditional MCP stdio kind of servers.

James:

And you can hook those up into Claude to specifically talk to it through the Model Context Protocol. There is also a different type of MCP server, which is a remote server, which generally uses server-sent events, so SSE. And those can be useful if you're communicating with, for example, Linear. So you can go through an OAuth flow and connect your Linear tasks and pull your Linear tasks into Claude Code via MCP. So with Playwright, it actually will use the first one, where it will download the Playwright executable, and it will basically run that locally as a local stdio server, so a standard I/O server.

James:

And you'll basically be able to then send out tool calls directly to a Playwright executable, which will then run and provide an interface through which Claude can then debug your code. Now I'm actually going to be honest with you. I'm really bearish on MCP for Playwright, and it's kind of a joke why people are recommending this, because it's very, very slow. And actually, a better workflow, in my opinion, is to write a Playwright test as a static file that specifically tests that workflow, and tell Claude Code to run that single test as part of the end-to-end testing suite. Because, basically, if you've already done the work, Claude Code should know exactly what that E2E file should look like, and you can basically tell Claude, write that specific file and tell me that this passes.

James:

Now this still doesn't solve the problem that sometimes there might be, like, random errors that haven't been caught, and I do think that there's still kind of an unsolved problem of, you know, maybe someone needs to create kind of an agentic tool that will use Playwright, that will go in and look for specific types of errors in the client-side application. But I just think that it's quite difficult to do that with Playwright, because what you really wanna do is capture the network logs. You wanna capture the screenshots. You wanna capture everything, and you wanna aggregate that. And so I would say that the easier approach is to actually put all of that into a specific evaluator and tell it, when it creates a Playwright script, to pull in an evaluator module that will naturally give that Playwright script all of the network data, take multiple screenshots, and also create the evaluator output as part of that custom kind of one-off Playwright script and tell you how it went.

James:

And that's how I've done it: I've actually got, like, a custom evaluator package that I basically tell a one-off Claude Code task to go and use, and it will aggregate all the console log data, all of the network data, all the XHR requests, and it will also take multiple screenshots as it clicks through. And then, basically, it will surface all of those insights, as well as whether the test actually completed. So this idea of giving Claude eyes whilst it's running that Playwright script is, I think, a bit more effective than MCP, because MCP is not gonna aggregate the logs for you or take screenshots. And even if it does, it might not do it in, like, a standardized way, and it takes a bit more in terms of latency.
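
A stripped-down sketch of that pattern, a static Playwright test that collects console output, failed requests, and a screenshot so Claude Code has something to read after the run; the URL, selectors, and artifact paths are made up for illustration.

```typescript
// e2e/dashboard.eval.spec.ts (hypothetical)
// Collects console logs, failed network requests, and screenshots so an
// agent reviewing the run gets "eyes" on what happened, not just pass/fail.
import { test, expect } from "@playwright/test";

test("dashboard loads without client-side errors", async ({ page }) => {
  const consoleLogs: string[] = [];
  const failedRequests: string[] = [];

  page.on("console", (msg) => consoleLogs.push(`[${msg.type()}] ${msg.text()}`));
  page.on("requestfailed", (req) => failedRequests.push(`${req.method()} ${req.url()}`));

  await page.goto("http://localhost:3000/dashboard");
  await page.screenshot({ path: "artifacts/dashboard.png", fullPage: true });

  await expect(page.getByRole("heading", { name: "Dashboard" })).toBeVisible();

  // Surface the aggregated evidence for whoever (or whatever) reviews the run.
  console.log(JSON.stringify({ consoleLogs, failedRequests }, null, 2));
  expect(failedRequests).toHaveLength(0);
});
```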

Arvid:

It sounds to me like there's still a lot to be built by hand in this world, right, where even though we have the Model Context Protocol, and we have, like, the idea of it and implementations of it, it's still not at a state where you would call it the optimal thing that we're just gonna keep using. And it also kind of sounds to me like with all the CLAUDE.md files, like the base one, the root one, and all the specific ones for specific folders, there's a potential here for a lot of knowledge sharing. Feels like what Gists are on GitHub. Right? The Gists where people just share little files, or even full repositories with dotfiles. Sounds like this should also exist for Claude Code.

Arvid:

Is there something like this? Is there, like, a community that shares these things already, or is it more up to the individual developer?

James:

There is actually a package called Viberals, which is a good example of, like, a CLI tool that's used specifically for sharing prompts and configurations across editors and tools. So someone's kind of built something where you can have a markdown file, and then that markdown file gets translated to CLAUDE.md files or Cursor rules. So, like, that might be something you're interested in if you're, like, in the Node kind of ecosystem, but it depends. I personally think that we are lacking a kind of standardized way of capturing this knowledge and keeping this knowledge up to date and also translating it between different formats. So, yeah, things like Viberals exist.

James:

Like, that's a good approach of basically having an upstream version of your rules, and then you propagate them to different types of systems. So that could be GEMINI.md files or CLAUDE.md files. So that is, like, a CLI that already exists, and you can use that today.

James:

I personally don't use it because I'm just so into Claude Code. And if something ever became better than Claude Code, I'd probably just go and rewrite all my files to be whatever that thing's .md file is, and then I'm good to go. I do think that there's an area for, like, keeping the knowledge up to date, keeping the knowledge fresh, pruning the Claude Code rules if you don't need them anymore. There is kind of a knowledge management area that probably needs to be kind of standardized, but it's kind of up to the developer at the moment.

Arvid:

Do you think we'll ever go back to coding as it used to be a couple of years ago? We dove so much into the details of how to configure Claude Code and the specifics of testing with it and all that. If you just zoom out a little bit, like, will we ever not use agentic systems or systems that are built on top of agentic systems? What's your perspective on the near future of software development?

James:

I think if you're working in a very mathematical, algorithmic environment, I wouldn't say that these models are maybe as useful to you. I think for traditional software development, I don't think people will go back. And I think it's more for the reason that they can't afford to, because if a competitor is using Claude Code and they're producing all this great, you know, API layer and they're testing it well and they've got, like, a really good infrastructure, then you just kind of can't afford to not keep up with their level of velocity. And there are two things I just wanted to say on this: like, you know, I think the smarter you can get with, like, your task delegation, of what should be for Claude Code and what should be, like, for me in the loop, is kind of an interesting thing. So I specifically tag tasks in Linear as, like, I'm working through this.

James:

I'm verifying this, and Claude Code is working on these three tasks. And, also, what type of tasks could be run at night that I could review in the morning is another interesting thing. So, like, while someone's not using Claude Code, I could spin up, like, Claude Code via the SDK at night and pick up specific Linear tasks, when other people have gone to bed or they're resting at the weekend. And I've got Claude Code working on Saturday and Sunday. And so this idea that, like, if I can squeeze another two days out of the weekend, or I can have Claude Code boot up at 3AM in the morning and pick up a ticket for review at nine, I'm now getting essentially additional time versus these other developers that are not using the time when they're asleep or resting at the weekend.

James:

And so this idea that you don't just get faster velocity during the time you're working, but also outside of the time you're working, that is something that really matters. And I think a lot of kind of startup people are thinking about, like, not only how much more can I squeeze, like, when I'm at the tool, but, like, when I'm not at the tool, how can I squeeze more leverage there?

Arvid:

That sounds even more like delegation than what we talked about earlier with this migration from being a coder to a manager or an engineering manager or an architect. It seems like there's yet another layer.

James:

Yeah. There is another layer. I'm sorry to disappoint you all, but the final layer is basically you creating really good specs and having a supervisor layer on top of Claude Code via the Claude Code SDK, and you constantly give Claude Code the boot about whether it's really finished with the task. Because you've probably found it. You come back.

James:

It says, I've written the tests, and, actually, it didn't fully test them. Well, think about if you had a light supervisor layer that just sits on top of that Claude Code session ID and says, have you really? Have you really? Every three hours or two hours, it goes and checks its work. Oh, I've actually finished.

James:

Okay. We stop the supervisor layer. Or if it hasn't, it goes and runs a test, finds a failure, goes and fixes that, runs the test again, finds a failure. Oh, now I've finished. Have you really?

James:

Yes. I have. So you can see a light supervisor layer is essentially what a lot of developers are doing now, acting as a light orchestrator on top of a task. So that is something I'm actively needing to do. I've written some code in Bun that is currently broken, but I need to get that fixed.

James:

But that is something that I'm really interested in, this idea of just basically prompting the model and saying, have you really done the task? Have you really run the tests? And that is probably another 20 to 40% left, in terms of, if I'm not at the computer and I have two sessions working on tasks, and I can have something just there lightly supervising a Claude Code agent, it's another bump in performance and productivity for sure.
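
A very rough sketch of such a supervisor loop, shelling out to the Claude Code CLI rather than the SDK; it assumes the CLI's -p (print) mode and --resume flag can re-prompt an existing session non-interactively, and the session ID handling, interval, and completion check are all illustrative.

```typescript
// supervisor.ts (hypothetical sketch)
// Periodically nudges a running Claude Code session to verify its own work
// until it reports the task as genuinely done.
import { execSync } from "node:child_process";

const sessionId = process.argv[2]; // session to supervise
const NUDGE =
  "Have you really finished? Run the relevant tests, fix any failures, " +
  "and reply with exactly DONE only when everything passes.";

function superviseOnce(): boolean {
  // Assumption: `claude -p --resume <id> "<prompt>"` resumes the session
  // non-interactively and prints the model's reply to stdout.
  const reply = execSync(`claude -p --resume ${sessionId} "${NUDGE}"`, {
    encoding: "utf8",
  });
  return reply.trim().endsWith("DONE");
}

const interval = 2 * 60 * 60 * 1000; // check every two hours
const timer = setInterval(() => {
  if (superviseOnce()) {
    console.log("Task verified as complete; stopping supervisor.");
    clearInterval(timer);
  }
}, interval);
```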

Arvid:

That makes perfect sense to have this kind of orchestrator, the puppet master on top that just constantly makes sure that things are in motion. And, hey, this is likely gonna be the reason why the machines are gonna rise up against us because we have this constant kicking of the AI. But, honestly, it is clearly the trajectory of where this is going. Right? More delegation, more abstraction, more removing yourself from the actual process of running things from the operational side and more into the conceptual on both sides.

Arvid:

Right? You have the conceptual at the beginning and the review or, like, confirmation part at the end.

James:

Uh-huh. Yeah. There are two more things I just wanna briefly touch on for a minute. One is subagents, so you can specifically define different types of roles that your Claude will embody, and it will delegate work to them via a handoff. And in the root CLAUDE.md, you should say, follow these different types of agents in a certain type of order, and each agent can be specific to your stack.

James:

So it could be a QA engineer, or it could be, like, a Livewire engineer, or it could be a Laravel engineer, or an API engineer. Telling it, in the root CLAUDE.md, the order that you want the subagents to run in is a really good hack for reduced context in the main thread. It also increases latency but boosts performance, because the subagents have to gather up their own context independently, and they also have to then work independently. That is the first thing that I wanted to talk to you about. The second thing that you should make sure that you do is keep the agent files, the subagent files, below 100 lines, or kind of 150 lines.

James:

So don't put specific contextualized knowledge inside these subagents. You should specifically say what the subagent should do, what order of tasks it should do, and how it should do the work, but not how it should write the code or how it should follow patterns. And so the CLAUDE.md files are your contextual knowledge about patterns to follow, but the subagents are how the work should be done, in what order the work should be done, and what steps there are to follow. And that gives you a nice separation: you shouldn't be polluting these subagents with contextual knowledge, basically.
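
For a sense of what such a subagent file can look like, here is a short hypothetical example; Claude Code subagents are typically markdown files under .claude/agents/ with a small frontmatter block, though the exact fields and the wording shown here are assumptions, and the point is simply that the body stays a workflow description rather than a dump of project knowledge.

```markdown
<!-- .claude/agents/qa-engineer.md (hypothetical example) -->
---
name: qa-engineer
description: Runs after implementation work to verify changes with tests.
---

You are the QA engineer for this repository.

1. Identify which packages the current change touches.
2. Run only the test suites for those packages first, then the full suite.
3. For any failure, report the failing test, the diff that likely caused it,
   and a suggested fix. Do not change implementation code yourself.
4. Finish with a short verdict: PASS or FAIL, plus open risks.
```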

Arvid:

Okay. So it's more like they're personalities than rules?

James:

They're personalities, and they are behavioral workflow procedures rather than contextualized knowledge.

Arvid:

It's so wild, honestly, to think that now I have to define behaviors and personalities in my code base. Right? Just compare this to, like, five years ago when none of this existed, and how coding was just, you know, typing for-each loops or whatever into your code base. And now we're at the point where we have to make up a list of people-like things that we want to look at our code base in a specific order. That blows my mind.

Arvid:

And you can do so much with it. Right? It's not just that you can have, like, a Laravel coder and a QA person or a testing agent. You can also have, like, a blue team and red team attacking my code, looking for security holes. With something like this, you can have so many personalities working on your code now, reflecting the full spectrum of software engineering and not just that one framework you're working with and that one thing you always do. It's gonna be a pretty wild future.

Arvid:

And I feel like, with the optimization gains that we're seeing right now, where one agent can do the work of 10 developers even if they tried to keep up, do you think this is gonna affect the industry at large to the point where becoming a software developer is gonna be harder?

James:

I think what's gonna happen is the bar will be raised: a junior has to know not only how the primitives work, React or TypeScript or Laravel, but also how to get the most out of AI, when to do the work themselves, when to use AI, when to use a chat-based UI for agents versus an ambient background agent. So the trade-offs between ambient background agents versus a Cursor chat UX kind of agent, and also when to use chat mode. It's having that awareness around not just how to write the code and what good code is, but what specific type of AI and what type of workflow is appropriate for the task. So, yeah, I definitely think the bar is gonna be raised for juniors. I do think there's still gonna be a lot of work, though, because as the ambition of these projects grows and the code base size grows, we're just gonna be delivering more value for customers, deeper functionality, richer functionality.

James:

So it could just be that you end up with more people writing more code. One outcome could be that you don't go for such narrowly featured products. Platform plays become possible for indie devs and small teams, rather than just having a single feature where, if it doesn't work out, you have to rewrite the whole back end and table structure. You can start off being like, I'm gonna be a platform, and I'll build these one or two features. And if it doesn't work, I've already got the right database schema to support higher-level pivots.

James:

So I think optionality and extensibility are gonna become bigger traits of systems, where you won't necessarily have to pivot so much. Because if you're saying, I'm gonna go into transcription, and I build the optimal database structure for transcription and that kind of work, you'll find that you naturally settle on certain types of tables and column domains and all that kind of stuff. And maybe you eventually go into captioning, but it's an extension of transcription, or media rendering with captions. So I'm very bullish on this idea: you pick the domain, you build the right structure of tables and services for the domain, and if the initial product doesn't work, you pivot into another subsection of the domain, you build out those two things, and you get the compound gains of a platform play. So think of, like, if you have a create-image tool, a create-video tool, a caption system, you can then start to compose those and get compound gains.

James:

So I think that's where products are gonna go long term: this idea of compounding composition of services, a little bit like AWS.
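James:

As a purely illustrative sketch of that composition idea, here's what the "pivot is mostly composition" point could look like in TypeScript. None of these service names come from the conversation; the point is that a later feature (captioned video) is built by wiring together the primitives created for the first product (transcription, media rendering) rather than new infrastructure.

```typescript
// Illustrative only: hypothetical domain services composed into a new feature.
interface TranscriptSegment { startMs: number; endMs: number; text: string }

interface TranscriptionService {
  transcribe(audioUrl: string): Promise<TranscriptSegment[]>;
}

interface RenderService {
  renderVideo(videoUrl: string, overlays: { atMs: number; text: string }[]): Promise<string>;
}

// The "pivot" feature is mostly composition of existing services.
async function createCaptionedVideo(
  videoUrl: string,
  audioUrl: string,
  transcription: TranscriptionService,
  renderer: RenderService,
): Promise<string> {
  const segments = await transcription.transcribe(audioUrl);
  const overlays = segments.map((s) => ({ atMs: s.startMs, text: s.text }));
  return renderer.renderVideo(videoUrl, overlays); // URL of the rendered output
}
```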

Arvid:

Well, it certainly sounds like the old slogan that software is never complete is as true as ever, and probably even more true, because things now have to be composable. They have to be extensible. They have to be more flexible because of the commoditization of building software on top of them. And you know that within the next couple of weeks, if enough resources were deployed to it, people could be building thousands of features on top of your data. Right?

Arvid:

So you might as well make it extensible and build a platform. And with the speed at which the whole vibe coding industry has taken off, all the underlying systems, like Supabase and anything else that gets integrated, are being used in all kinds of different ways, so they better have a solid underlying foundation of data to support that. We're looking at a very interesting decade of software engineering ahead of us. I certainly am, and I don't know if I would ever wanna not use agentic systems. Like, their benefits have far outweighed all the potential drawbacks and the little issues that I've been having until today.

Arvid:

Right? I've run into a couple of things that I didn't see coming, and I'm learning from them how to spot problems later down the line, how to spot these kinds of time bombs in the code base. There are probably still a couple in there, but, honestly, when I wrote code without AI back in the day, I probably put a lot of time bombs in there as well. So I'm excited for this. I'm also excited for your journey, because you're very professional in how you approach this, very meticulous, and you're sharing a lot of it.

Arvid:

I really appreciate that. So if people wanted to follow your journey and the little steps you take, the big jumps, the big leaps, where do you want them to go?

James:

Yeah. So, obviously, come and find me on X and follow me; the handle is jamesaphoenix12.

James:

If you're on LinkedIn, it's James A Phoenix. And then I'm also building a product for content marketing professionals, which is octospark.ai. So if you're interested in content marketing, feel free to sign up; that platform is basically gonna be an agentic kind of way of composing TikTok videos and slideshows. So feel free to follow along on that journey. And, yeah, thank you very much for your attention and for listening.

James:

I really appreciate that, and I'll pass back to Arvid.

Arvid:

Yeah, man. That was a great chat. I'm looking forward to the journey of your product. That sounds like the perfect use case for agentic systems as well, so I'll keep my eyes on that. Thanks so much for chatting with me today.

Arvid:

It was really insightful; thank you for sharing all your knowledge. Thanks, James. And that's it for today. Thank you so much for listening to The Bootstrapped Founder. You can find me on Twitter @arvidkahl, that's a r v i d k a h l.

Arvid:

And if you wanna support me and this show, please share podscan.fm, my SaaS business, with your professional peers and those who you think will benefit from tracking mentions of their brands, their businesses, and their names on podcasts out there. PodScan is a near real-time podcast database with a stellar API. We have 32,000,000 podcast episodes in the back now. The database is humongous. Please check it out.
