AI Agents Need Operations, Not Just Smarter Models

A lot of AI teams are still acting like model choice is the whole game. I don't buy that. Yes, better models help. I'm happy to use OpenAI. I'm happy to use Anthropic. If a stronger model gives me better reasoning or cleaner output, great. I'll take it.

But most of the pain I run into with AI agents has nothing to do with whether the model is slightly smarter. It has to do with whether the work survives real life. Can it recover when a tool call fails? Can it keep going when the job takes 20 minutes instead of 20 seconds? Can I tell the difference between progress, delay, and a dead run? That is the actual product.

The demo problem

A lot of agent systems look great in a demo because the happy path is easy to script. You give the model a clean prompt. The tool returns exactly what you expected. The output looks impressive. Everyone nods.

Then somebody uses the system for a real job. Now the API is slow. One dependency returns garbage. Another step needs approval. A browser session dies halfway through the run. The model is fine, but the system around it is brittle.

That is why I keep coming back to the same point I made in AI agent infrastructure matters. Clever output is nice. Dependable execution is better.

Long-running work changes the standard

The second an AI task becomes long-running, the standard changes. If a job takes thirty seconds, people will tolerate some mystery. If it takes twenty minutes, mystery becomes a problem. Nobody wants to stare at a spinner and wonder whether the system is still working. Nobody wants to re-run a job just because the UI went quiet. Nobody wants to discover an hour later that the agent failed on step three and never surfaced it.

At that point, trust becomes an operations problem. I want a system to tell me what it is doing, what already finished, what is blocked, and what it plans to do next. I want resumability. I want retries that are not stupid. I want checkpoints. I want visible state. Without that, an AI agent is just a slot machine with better branding.

The boring parts matter most

The teams that win here are not just going to have access to strong models. Everyone will. The teams that win will build the boring layer around them. That means things like:

- tracking state so work does not restart from zero
- surfacing progress so users do not have to guess
- handling retries with limits and context
- escalating gracefully when a human actually needs to step in
- preserving enough execution history to debug failures after the fact

None of that is sexy. That's exactly why it matters. The most useful AI systems will feel less like magic and more like good operations software. Calm. Clear. Recoverable.

Why this matters commercially

This is not just an engineering opinion. It changes whether people trust the product enough to use it on real work. According to McKinsey's State of AI research, companies are increasing AI adoption, but scaling value still depends on reliability, risk handling, and workflow integration. That tracks with what I see. Nobody cares if your agent looked smart once. They care if it can finish the job on a Tuesday afternoon when six other things are broken.

My takeaway

I think the AI industry still underrates operations. People talk about reasoning benchmarks because they are easy to screenshot. But if an agent can't stay alive, show its work, and recover from failure, the benchmark win does not matter much. Smarter models are useful. Trustworthy execution is what gets adopted.
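To make the "boring layer" concrete, here is a minimal sketch of checkpointing, bounded retries, and escalation. Everything in it is hypothetical: the step names, the JSON checkpoint file, and the run_step placeholder stand in for whatever tool or model calls a real agent would make.

```python
import json
import time
from pathlib import Path

CHECKPOINT = Path("run_state.json")  # hypothetical on-disk state file
MAX_RETRIES = 3

def load_state():
    # Resume from the last checkpoint instead of restarting from zero.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed": [], "history": []}

def save_state(state):
    # Persist visible state after every step so progress survives a crash.
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def run_step(name, state):
    # Placeholder for a real tool call or model call.
    print(f"running: {name}")
    return {"step": name, "ok": True}

def run_job(steps):
    state = load_state()
    for name in steps:
        if name in state["completed"]:
            print(f"skipping (already done): {name}")
            continue
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                result = run_step(name, state)
                state["completed"].append(name)
                state["history"].append(result)
                save_state(state)
                break
            except Exception as exc:
                # Bounded retries with context, then escalate to a human.
                state["history"].append({"step": name, "attempt": attempt, "error": str(exc)})
                save_state(state)
                if attempt == MAX_RETRIES:
                    raise RuntimeError(f"step '{name}' needs human attention") from exc
                time.sleep(2 ** attempt)

if __name__ == "__main__":
    run_job(["fetch_data", "summarize", "file_report"])
```

The specifics don't matter much. What matters is that state lives outside the process, so a crash or a restart resumes the run instead of starting it over, and the history is there when something needs debugging.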

Fragmented Analytics Create Fake Confidence

I keep seeing the same mistake in small portfolios. You split traffic across a few sites, a few products, a few dashboards, and suddenly everything looks more impressive than it really is. Not because the business is lying to you. Because the analytics setup is.

If you have one main site, one side project, one tool, and a few landing pages, it's really easy to open Google Analytics, Google Search Console, maybe PostHog, and convince yourself you've got traction everywhere. A little traffic here. A little growth there. One page up 40%. One tool got a few signups. One post pulled in clicks. Individually, every chart tells a hopeful story. Together, they can create fake confidence.

The problem isn't the data

The problem is fragmentation. When each property is small, every local peak feels meaningful. A blog post gets 120 visits instead of 40, and it looks like momentum. A tool gets 3 signups in a week instead of 1, and it feels like product-market fit is warming up. A page starts ranking for a handful of terms in Google Search, and now you're mentally writing the case study.

I've done this myself. The trap is that small numbers become emotionally loud when they're separated into different dashboards. You stop asking, "Is this portfolio actually working?" You start asking, "Which tiny win do I want to believe today?"

Why this gets dangerous fast

Fragmented analytics don't just hide weakness. They distort prioritization. If you check five dashboards every morning, you can always find one thing that looks alive. That makes it way too easy to avoid harder questions:

- Which asset is actually creating leverage?
- Which channel is producing repeatable demand?
- Which project deserves more time?
- Which thing should get killed?

Without a portfolio-level view, you can confuse motion for signal. That's especially true when you're a small operator. You don't have enough volume to let noise average itself out. One decent day can completely mess with your read on reality. The Google Analytics documentation is useful for understanding the mechanics, but it won't save you from telling yourself a flattering story. That's your job.

What I think matters more

I trust aggregated truth more than isolated wins. I'd rather know that the whole portfolio generated:

- 1,900 total sessions
- 42 email captures
- 11 meaningful conversions
- 2 assets responsible for most of the pull

than know that one page on one site had a great week. That second view feels better. The first view is more useful. For me, the useful question is usually: if I zoom out, is this whole system compounding or just twitching? Those are very different things.

The fix is boring

You don't need more dashboards. You need fewer stories. I think every small portfolio needs one simple operator view:

- total traffic across properties
- total conversions across properties
- top entry pages
- top conversion sources
- what changed week over week
- what actually earned more attention

Then underneath that, you can still use tools like Google Analytics, Google Search Console, PostHog, or Stripe for detail. But detail should support the main picture, not replace it. If the rollup says nothing meaningful moved, I don't really care that one page had a cute little spike.

The operator lesson

Small portfolios are noisy by default. That means the job is not just collecting more data. The job is building a view of reality that is hard to bullshit. A clean dashboard is not the same as a truthful one. If your numbers live in five places, your confidence probably does too.
I wrote recently about building a content system that removes excuses. This is the measurement version of the same idea. Fewer moving parts. Fewer places to hide. A shorter path to the truth. That is usually what better operations work looks like.
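As a sketch of what that operator view could look like, here is a minimal portfolio rollup. It assumes you have already exported a few weekly numbers per property from whatever tools you use (Google Analytics, Plausible, PostHog, Stripe); the property names and figures below are made up for illustration.

```python
# Hypothetical weekly export per property; in practice these numbers would
# come from your analytics and payments tools, not be typed in by hand.
properties = [
    {"name": "main-site", "sessions": 1400, "conversions": 8, "top_entry": "/pricing"},
    {"name": "side-project", "sessions": 350, "conversions": 2, "top_entry": "/"},
    {"name": "landing-pages", "sessions": 150, "conversions": 1, "top_entry": "/launch"},
]

def portfolio_rollup(props, last_week_sessions=0):
    total_sessions = sum(p["sessions"] for p in props)
    total_conversions = sum(p["conversions"] for p in props)
    # Rank properties by conversions so the assets doing the real work are obvious.
    by_pull = sorted(props, key=lambda p: p["conversions"], reverse=True)
    return {
        "total_sessions": total_sessions,
        "total_conversions": total_conversions,
        "top_assets": [p["name"] for p in by_pull[:2]],
        "week_over_week_sessions": total_sessions - last_week_sessions,
    }

print(portfolio_rollup(properties, last_week_sessions=1750))
```

The output is one small dict instead of five dashboards, which is the point: a single number set that is hard to spin, with the per-tool detail kept underneath it.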

Building in Public Without Analytics Is Just Vibes

Shipping every day feels productive. It also lies to you.

I have been thinking about that a lot lately because the internet makes it very easy to confuse visible output with actual traction. You can publish posts, ship tools, push updates, and watch the streak keep going. From the outside, it looks like momentum. But if you cannot see what people are clicking, reading, bouncing from, or coming back to, a lot of that momentum is just a good-looking blur. That is not a content problem. It is a measurement problem.

Output is not the same as signal

I think a lot of builders quietly do this. We tell ourselves that consistency is the hard part. And to be fair, it is hard. Most people never publish enough to learn anything. But once you are publishing consistently, the bottleneck changes. The question stops being, "Can I ship?" and becomes, "Can I tell what is actually working?"

Without analytics, you usually cannot. You are left with the weakest possible proxies. A post "felt" strong. A launch got a couple replies. A page seemed clear when you read it back. None of that is useless. But none of it is enough either. It is just intuition wearing a nicer shirt.

Building blind gets expensive fast

This matters even more when you are running a small operation. If I write a blog post, publish a tool, and share an idea on X, I do not just want the satisfaction of having done the work. I want to know where attention actually pooled. Did people spend time on the page? Did they click through to the tool? Did one idea pull better than another? Did traffic come from search, direct, or social? Did anything compound?

That is why tools like Plausible and Google Analytics matter, even if the setup is not the glamorous part. Measurement is not bureaucracy. It is how you stop wasting weeks on stories that only sound true in your own head.

I have learned this the annoying way. When analytics are missing, every decision starts drifting toward taste. You optimize for what feels sharp, what sounds smart, what seems likely to work. Sometimes that overlaps with reality. A lot of the time it does not. And the longer you keep shipping without feedback, the more confident you can become for the wrong reasons. That is a dangerous loop.

The real job is closing the loop

I think this is where a lot of "build in public" advice falls apart. People talk a lot about courage, speed, and volume. Fewer people talk about instrumentation. But the boring part is what turns output into a system. You need a loop:

- publish something
- measure what happened
- learn from the result
- change the next thing

Without that loop, you do not really have a content engine or a product engine. You have a posting habit. And a posting habit is better than silence. I will take that over endless planning every time. But if the goal is to get sharper, not just louder, then the loop matters more than the streak.

That is part of why I keep coming back to simple, legible systems. I wrote recently about why boring systems are a feature. This is the same idea in a different form. I do not need a giant dashboard religion. I just need enough visibility to tell whether the thing I shipped did anything real. That sounds obvious, but a lot of builders skip it because it feels secondary. It is not secondary. It decides whether your effort compounds.

Vibes are fine for drafts, not decisions

I still trust instinct. I still think taste matters. I still think you sometimes have to publish before the data exists. But instinct should help you make the first bet.
It should not be the only system you have for deciding what to do next. That is the line I care about more now. Write the post. Ship the page. Launch the tool. But then measure what happened, or be honest that you are still in the guessing phase. Because building in public without analytics is not really building in public. It is just publishing in the dark.

If You Still Have to Double-Check It, It Isn't Automated

A lot of people call something automated when what they really mean is faster. Those are not the same thing. If you still have to double-check every output, every recommendation, or every record before you can trust it, you didn't automate the job. You just changed the shape of the work.

I keep seeing this with AI tools for operators. The demo looks great. The model fills in the form. It summarizes the notes. It flags the likely issue. Everyone claps because the task that used to take ten minutes now takes two. But then the person using it still has to read the whole thing line by line to make sure it didn't hallucinate, skip a step, or confidently say something dumb.

At that point, the tool may be useful. But it is not automation. It's assisted drafting. And to be clear, assisted drafting can still be valuable. I'm not knocking it. Speed matters. Reducing blank-page friction matters. But if a manager still has to babysit every output, the real bottleneck did not disappear. It just moved downstream.

That's why I care a lot more about reliability than flair. When I'm building tools for operators, I want the default experience to feel safe. Clear inputs. Narrow scope. Fewer places for the system to go off the rails. The operator should not need to become the QA layer for the machine every single time.

This is especially true in messy business workflows. Compliance, payroll, food safety, onboarding, audit prep. These are not areas where "mostly right" feels good. If a record is wrong, or a required step gets skipped, someone ends up eating the cost. That's part of why I think the best AI use cases look boring from the outside. They do one job. They stay inside clear boundaries. They help with judgment only where it actually helps. The more a system depends on a human hovering over it, the less automated it really is.

I've written before about how AI makes bad process fail faster. I think this is the same lesson in a different wrapper. A sloppy process plus a fast model just gives you wrong answers at a higher volume. The bar should be higher than speed. The bar should be trust.

That doesn't mean every tool needs to run fully unattended. Sometimes human review is exactly the right call. But if human review is mandatory on every single run, then be honest about what you built. It's not automation. It's a co-pilot with a nervous supervisor sitting beside it.

I like the way Google's SRE book frames operational reliability. The point is not just to make systems work sometimes. The point is to make them dependable enough that people can build real processes around them. That's the standard I think AI builders should steal. Not "can the model do this once in a demo?" Can someone trust the workflow enough to stop re-checking the whole thing from scratch?

If the answer is no, that's fine. It might still be a useful product. But call it what it is. Useful is good. Reliable is better. And actual automation starts when the operator can finally take their hands off the wheel.
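As one illustration of what "clear inputs, narrow scope" can look like, here is a minimal sketch of a validation gate that only escalates to a human when a structured record fails basic checks. The field names, the acceptable range, and the food-safety framing are all hypothetical; the point is that review becomes the exception rather than a mandatory pass over every output.

```python
# Hypothetical required fields for a structured record produced by a model.
REQUIRED_FIELDS = {"temperature_f", "checked_by", "timestamp"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record can be accepted as-is."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    temp = record.get("temperature_f")
    if temp is not None and not (-10 <= temp <= 60):
        # Out-of-range values are exactly where "mostly right" gets expensive.
        problems.append(f"temperature out of expected range: {temp}")
    return problems

def handle(record: dict):
    problems = validate_record(record)
    if problems:
        # Escalate only the runs that fail checks; the rest flow through untouched.
        print("needs human review:", problems)
    else:
        print("accepted:", record)

handle({"temperature_f": 38, "checked_by": "dana", "timestamp": "2024-05-07T08:30:00"})
handle({"temperature_f": 120, "checked_by": "dana"})
```

A gate like this does not make the model smarter. It narrows the places where a wrong answer can slip through quietly, which is what lets the operator stop re-reading every single run.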

Restaurant Owners Don't Care About AI

Restaurant owners do not wake up wanting more AI in their business. They want fewer things to go wrong. They want the fridge temp logged. They want the sanitizer check done. They want to know the opening shift didn't miss something stupid that turns into a failed inspection later.

That's why I keep getting pulled toward compliance tools instead of flashy AI demos. The interesting thing is not the model. It's the consequence. If a tool helps someone avoid a health inspection problem, prevent a scramble, or make a manager's day less chaotic, they'll use it. If it just sounds futuristic, they won't. That's also why I think the boring systems angle matters so much. I wrote about that more in Boring Systems Are a Feature.

Lately I've been building LogChef, a food safety logging tool for restaurant teams. The point is not to impress anyone with AI. The point is to make the work clearer, faster, and harder to screw up.

That framing matters way beyond restaurants. A lot of AI products are still sold like magic tricks. But in real businesses, people usually buy relief. They buy fewer mistakes. They buy fewer dropped handoffs. They buy fewer moments where somebody says, "Wait, who was supposed to do that?" The teams that win with AI are usually not the ones chasing the flashiest demo. They're the ones using it to remove friction from work that already matters. That's a very different bar.

It also lines up with how regulators and operators think. The FDA Food Code is not asking whether your tooling is exciting. It cares whether the process is followed, documented, and repeatable. Same in a lot of B2B software. People talk about AI like the product. Most of the time it's just the engine inside the product. What the customer actually buys is confidence. They want to feel less exposed.

So when I'm thinking about what to build, I've started using a simple filter:

- Does this help someone avoid a real problem?
- Does it make a recurring job easier to complete correctly?
- Would someone still want this if I removed the word AI from the homepage?

If the answer to that last question is no, I get suspicious fast. I'm more interested in tools that quietly make a workday better than tools that generate a lot of hype for a week. That's usually where the real value hides.