AI Agents in Production Need Ops, Not Better Demos

Most AI agents in production fail after the happy path. Here’s why governance, state, permissions, and observability matter more than prompts.

“The demo worked perfectly” is still the fastest way to tell me nobody has actually shipped AI agents in production.

Sure. My espresso machine also works perfectly until I ask it for six cappuccinos back-to-back while the Wi-Fi dies and someone “helpfully” replaces the oat milk with almond. Same vibe. The distance between “look, it booked a meeting” and “this thing survives real users, weird permissions, stale data, retries, approvals, and one sleep-deprived ops person at 2 a.m.” is where most agent projects get dragged into the street and beaten.

I’m not impressed by agents that can reason in a sandbox. I’m impressed by agents that can survive chaos without turning into a very confident intern with root access.

That’s the whole take, really. Most teams aren’t failing to build agents. They’re failing to admit they’re building a tiny distributed system wearing a cool leather jacket. Sexy prompt in the front. Messy orchestration problem in the back. Like one of those Milan aperitivo spots with perfect lighting and a kitchen one ticket away from mutiny.

I’ve seen this movie too many times. A team gets the first 80% working, everyone claps, screenshots go into the pitch deck, somebody says “we should roll this out company-wide,” and then the last 20% walks in with a baseball bat. State. Permissions. Human approvals. Error handling. Ownership. Audit trails. Suddenly the “AI thing” is not an AI thing at all. It’s software engineering, product, security, and operations all yelling across the same dinner table.

And yes, I sound a little bitter because I’ve been that founder staring at logs at stupid o’clock thinking, madonna, the model was the easy part.

The Demo Is the Lie

The demo is theater.

Not fake, exactly. Just curated within an inch of its life. One session. One user. One clean prompt. One happy path. Nobody revokes a permission halfway through. Nobody uploads a cursed CSV from 2019. Nobody asks the agent to resume a workflow after step four exploded because Salesforce timed out and the approval webhook disappeared into the void.

That’s not production. That’s sleight of hand.

Real AI agents in production don’t just call tools and vibe their way through a workflow. They deal with multi-step tasks, persistent context, long-running jobs, approvals, retries, and failure recovery. They need to remember what happened three steps ago, know which state matters, and recover without acting like they woke up in a different timeline.
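If that sounds abstract, here's roughly what "remember what happened three steps ago" means in practice: checkpoint after every step, resume from the last checkpoint instead of starting over. A toy sketch, every name invented, nothing vendor-specific:

```python
import json
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    """Persistent record of a multi-step agent run (illustrative, not a real API)."""
    run_id: str
    steps: list = field(default_factory=list)    # completed steps, in order
    context: dict = field(default_factory=dict)  # durable facts the agent must not forget

    def checkpoint(self, step_name, output):
        # Record what happened so a crash between steps loses nothing
        self.steps.append({"step": step_name, "output": output})
        # In a real system this JSON goes to durable storage, not a variable
        return json.dumps({"run_id": self.run_id, "steps": self.steps,
                           "context": self.context})

    @classmethod
    def resume(cls, serialized):
        # Rehydrate from the last checkpoint instead of replaying from zero
        data = json.loads(serialized)
        return cls(run_id=data["run_id"], steps=data["steps"], context=data["context"])

state = WorkflowState(run_id="run-42", context={"account": "acme"})
snapshot = state.checkpoint("enrich_lead", {"status": "ok"})
resumed = WorkflowState.resume(snapshot)
assert resumed.steps[-1]["step"] == "enrich_lead"  # picks up after step one
```

The point isn't the ten lines of code. It's that "resume" is a design decision you make on day one, or a rewrite you suffer through on day ninety.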

AWS’s work on stateful agent runtimes for Amazon Bedrock makes this pretty obvious. The whole point is closing the gap between a nice demo and a real system by supporting persistent context, tool state, approvals, and error handling for long-running workflows. That’s not some boring implementation detail. That is the product. If the runtime has to exist because stateless demos collapse the moment reality shows up, then maybe the real question was never “can the model use a tool?” Maybe it was always “can the system survive contact with the world?”

Exactly.

A few weeks ago in New York, I had coffee with a founder who showed me an agent that could qualify leads, enrich them, draft outreach, and schedule follow-ups. In the demo, it looked like a junior SDR crossed with a caffeinated McKinsey analyst. Then I asked the one annoying question nobody wants asked: what happens if the enrichment API returns partial data, the CRM record is locked by another process, and legal requires approval before outbound messaging in Germany?

Silence.

Beautiful, expensive silence.

That’s the part nobody tweets. The first 80% of an agent demo is the appetizer. The last 20% is Saturday dinner service when two people called in sick and table 12 is gluten-free by religion. If you’ve ever shipped software, you already know where this goes. The hard part is not making the model do a clever trick once. The hard part is making the workflow predictable when the world gets rude.

And the world always gets rude.

AI Agents in Production Are Mostly an Operations Problem

Here’s the less sexy truth: production agents are mostly about control surfaces.

Not vibes. Not screenshots of “reasoning.” Control. Who can the agent act as? What systems can it touch? Which actions need approval? What gets logged? Who gets paged when it does something stupid? Who owns the blast radius when the wrong email goes to the wrong customer with the wrong attachment and the wrong level of confidence?

If nobody can answer that last one, the agent doesn’t have an owner. It has future legal discovery.

OpenAI says the same thing, just with cleaner diagrams. In its writing on AI value models and agent deployments, the pattern is pretty clear: the teams that succeed have identity and access controls, scoped permissions, observability, exception handling, and actual ownership. Notice what’s not on that list: “a really spicy prompt.” Because even OpenAI frames agents as end-to-end workflow orchestration across systems, not just text generation with ambition.
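To make "control surfaces" concrete, here's the smallest possible version: every tool call goes through a policy check, and risky actions get parked for a human. The tool names, scopes, and queue are all hypothetical:

```python
# Hypothetical control surface: every tool invocation passes a scope check,
# and anything on the risky list is routed to a human approval queue.
ALLOWED_TOOLS = {"crm_read", "calendar_write"}   # scoped per agent identity
NEEDS_APPROVAL = {"send_email", "issue_refund"}  # never fully autonomous

approval_queue = []

def invoke_tool(agent_id, tool, args):
    if tool not in ALLOWED_TOOLS | NEEDS_APPROVAL:
        # Out of scope entirely: deny, don't negotiate
        return {"status": "denied", "reason": f"{tool} not in scope for {agent_id}"}
    if tool in NEEDS_APPROVAL:
        # Park it for a human; the agent keeps working on something else
        approval_queue.append({"agent": agent_id, "tool": tool, "args": args})
        return {"status": "pending_approval"}
    return {"status": "executed", "tool": tool}  # audit log entry would go here

print(invoke_tool("sdr-agent", "crm_read", {}))   # executed
print(invoke_tool("sdr-agent", "send_email", {})) # parked for a human
print(invoke_tool("sdr-agent", "delete_db", {}))  # denied outright
```

Ten lines of policy beats a thousand words of prompt begging the model to please be careful.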

That’s the game.

I think a lot of teams are still emotionally attached to the fantasy that an agent is a smart coworker. It’s not. It’s software acting across systems with partial context, variable confidence, and a dangerous talent for sounding more sure than it should. Which, to be fair, is also true of many startup founders. I include myself here, unfortunately.

The mistake is assigning human expectations to what is fundamentally an operational system. If your agent touches Zendesk, HubSpot, NetSuite, Slack, Gmail, and some haunted internal database only one engineer understands, congratulations: you did not build a chatbot. You built a distributed system with side effects.

I learned this the annoying way. A couple of years ago, I assumed that once the intelligence layer got good enough, the rest would mostly be plumbing. Cute. Very optimistic. The plumbing is the whole thing. The intelligence just decides where the water wants to go. If the pipes are bad, you still flood the building.

And nobody claps for permissions architecture. Nobody quote-tweets your exception handling strategy. But the boring stuff is exactly what makes a system trustworthy enough for an actual business to use on a random Tuesday, not just in a boardroom demo with good lighting.

That’s why so many “production-ready” agents still feel fragile. They’re not underpowered. They’re under-governed.

Memory Without Guardrails Is Just a More Expensive Way to Be Wrong

Everybody wants stateful AI workflows until they realize state can preserve mistakes too.

Memory sounds magical in demos. “It remembers the user.” “It knows the account history.” “It can continue where it left off.” Bello. Amazing. Love that for us. But if retrieval is weak, grounding is sloppy, or permissions are too broad, memory just becomes a premium subscription to being wrong at scale.

AWS Prescriptive Guidance gets this right. Enterprise-grade agents combine grounded retrieval, reasoning, traceability, memory, IAM-based access control, and safety constraints. That list matters because it kills the fantasy that memory alone makes an agent useful. It doesn’t. Memory without guardrails is just a more expensive chaos goblin.

Make it concrete. Say you have an internal finance agent handling vendor approvals and invoice status. If it pulls the wrong policy doc, grabs stale payment terms from the wrong source, or surfaces a record the user shouldn’t even be allowed to see, that’s not a funny hallucination screenshot for LinkedIn. That’s a real business problem. Maybe a compliance problem. Maybe a “why is legal suddenly joining this call?” problem.

Same for support. Same for sales ops. Same for HR, where one permission leak can turn your cute internal assistant into a career-ending incident.

AWS’s guidance around Bedrock Knowledge Bases and guardrails is useful because it treats grounded retrieval like infrastructure, not decoration. The agent should pull from approved sources, operate with scoped access, and give you traceability into why it said what it said. That’s the grown-up version of memory. Not “it remembers stuff.” More like “it remembers the right stuff, from the right place, under the right permissions, with a paper trail.”

Huge difference.
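What "right stuff, right place, right permissions, paper trail" looks like in miniature: filter by permissions before retrieval, refuse rather than guess, and return sources with every answer. Document IDs, roles, and the matching logic here are all invented for illustration:

```python
# Hypothetical grounded-retrieval wrapper: the agent answers only from
# approved sources the caller is allowed to see, and every answer carries
# its provenance. Toy substring matching stands in for real retrieval.
DOCS = [
    {"id": "policy-2024", "acl": {"finance"}, "text": "Net-30 payment terms."},
    {"id": "hr-salaries", "acl": {"hr"},      "text": "Confidential comp data."},
]

def grounded_answer(query, user_roles):
    visible = [d for d in DOCS if d["acl"] & user_roles]  # permission filter FIRST
    hits = [d for d in visible if query.lower() in d["text"].lower()]
    if not hits:
        return {"answer": None, "sources": []}  # refuse rather than guess
    return {"answer": hits[0]["text"], "sources": [d["id"] for d in hits]}

result = grounded_answer("payment", {"finance"})
assert result["sources"] == ["policy-2024"]  # the receipt, not just the answer
leak_check = grounded_answer("comp", {"finance"})
assert leak_check["answer"] is None  # finance can't see HR docs, full stop
```

Note the ordering: the permission filter runs before retrieval, so a document the user can't see never even enters the candidate set. That's the difference between access control and access decoration.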

I’ll say it even more bluntly: if your agent can’t explain where a critical answer came from, I do not care how fluent it sounds. Fluency is cheap now. Traceability is expensive. Trust lives in the expensive part.

My nonna would probably say I can ruin even a beautiful meal by turning it into a lecture about process. Fair enough. But if the agent is touching enterprise systems, romance is overrated. I want receipts.

The Real Product Is the Dashboard, Not the Prompt

This is the part founders hate because it looks terrible in a demo day clip: once agents hit production, the most important interface is not the chat window.

It’s the dashboard.

The traces. The logs. The evals. The version history. The approval queue. The incident view. The rollback controls. The latency spikes. The tool failure rates. The tiny red badge telling you your “autonomous workflow” has been quietly failing for 37 minutes while everyone was busy posting screenshots of the prompt.

That’s the actual product. The prompt is one ingredient.

OpenAI’s AgentKit push makes this pretty clear too. The serious stuff is workflow versioning, guardrails, connectors, evals, and observability. In other words, if your strategy for AI agents in production is “we keep tweaking the system prompt until the vibes improve,” I have bad news. That’s not a reliability plan. That’s a superstition.

This is where agent observability and AI agent evals stop sounding like enterprise jargon and start sounding like oxygen. What failed? Where? How often? Under what conditions? After which model change? Against which user segment? With which tool chain? If you can’t answer those questions, you don’t have a product. You have a demo everyone is afraid to touch because nobody knows what breaks when.
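A failure taxonomy doesn't need to be fancy to be useful. Structure your traces just enough to count failures by surface, and the answer to "what broke, where, how often" falls out. Trace fields and tool names below are invented:

```python
from collections import Counter

# Hypothetical structured traces: enough fields to answer "what failed,
# where, how often, after which change" without rereading chat logs.
traces = [
    {"tool": "crm_update", "model_ver": "v3", "outcome": "timeout"},
    {"tool": "crm_update", "model_ver": "v3", "outcome": "ok"},
    {"tool": "enrich_api", "model_ver": "v3", "outcome": "partial_data"},
    {"tool": "crm_update", "model_ver": "v4", "outcome": "timeout"},
]

def failure_taxonomy(traces):
    # Group failures by (tool, outcome) so spikes show up per surface
    return Counter((t["tool"], t["outcome"]) for t in traces if t["outcome"] != "ok")

print(failure_taxonomy(traces).most_common())
```

If CRM timeouts dominate that counter, you need retries and backoff, not another week of prompt wordsmithing. The data tells you which problem you actually have.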

I’ve watched teams spend three weeks debating prompt wording and about 45 minutes thinking about failure taxonomy. That ratio is insane. Prompt tweaks matter, sure. But if you’re not collecting the right traces and building eval datasets around real workflows, you’re basically seasoning pasta without tasting the sauce. Very American behavior. Con affetto, but still.

Versioning matters more than people think, too. Once your agent handles live workflows, every prompt update, policy change, connector tweak, and retrieval adjustment is a production change. Rollback should be normal, not some shameful emergency move. The best teams I know treat agent releases like any other risky software deployment:

  • Staged rollout
  • Monitoring
  • Eval gates
  • Rollback ready

Boring? A little. Effective? Molto.
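Here's what an eval gate looks like as a release step, stripped to the bone. A candidate version only replaces the live one if it clears a score threshold on a fixed eval set; otherwise the old version keeps serving and "rollback" is a no-op. The versions, cases, and eval runner are all hypothetical:

```python
# Sketch of an eval-gated release: promote the candidate only if it clears
# the bar on recorded workflows; otherwise the live version stays put.
def run_evals(version, eval_cases):
    # Stand-in for replaying recorded workflows against the candidate version
    return sum(c["expected"] == c["got"][version] for c in eval_cases) / len(eval_cases)

def gated_release(live, candidate, eval_cases, min_score=0.9):
    score = run_evals(candidate, eval_cases)
    if score < min_score:
        return live, f"rollback: {candidate} scored {score:.2f}"  # old version keeps serving
    return candidate, f"promoted: {candidate} scored {score:.2f}"

cases = [
    {"expected": "approve",  "got": {"v1": "approve",  "v2": "approve"}},
    {"expected": "escalate", "got": {"v1": "escalate", "v2": "approve"}},
]
live, note = gated_release("v1", "v2", cases)
assert live == "v1"  # v2 scored 0.5, below the gate, so it never ships
```

The interesting property: the gate is symmetric. A prompt tweak, a connector change, and a model upgrade all go through the same door, because in production they're all the same kind of change.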

Same old startup lesson. New outfit.

If you can’t measure it, inspect it, and reverse it, you don’t own it.

My Unpopular Prediction: The Winners Will Be the Boring Teams

My bet is that the companies that win with AI agents in production will not be the ones with the flashiest demos.

They’ll be the boring teams.

I mean that as a compliment, by the way.

Boring means clear ownership. Scoped permissions. Human approvals where they matter. Versioning. Rollback. Auditability. Grounded retrieval. Evals tied to real workflows instead of benchmark cosplay. It means someone knows exactly what the system can do, what it cannot do, and what happens when it gets confused.

And the funny part is everyone serious is converging on the same answer. AWS’s Bedrock runtime work covers stateful execution, governance, and exception handling for long-running workflows, and its guidance hammers retrieval, traceability, IAM, memory, and guardrails. OpenAI’s business writing keeps hammering identity, permissions, observability, and ownership, while AgentKit leans into versioning, connectors, and eval infrastructure. Different packaging, same message: production readiness is governance plus observability plus controlled execution. Not model cleverness alone.

Autonomy is wildly overrated if reliability is trash.

Nobody cares that your agent is “fully autonomous” if Karen from finance has to clean up after it every Friday. Nobody cares that it can use 12 tools if nobody trusts it with one. Nobody cares that it sounds smart if the audit trail reads like a crime novel.

My guess? Within a year, “AI agents in production” stops meaning “it can use tools” and starts meaning “it can be audited, governed, and trusted under pressure.” That shift is good. Less glamorous, yes. Also a lot more real.

So before you tell me your company has AI agents in production, answer the ugly questions.

  1. What happens when the workflow stalls?
  2. Who approves risky actions?
  3. What can it access?
  4. How do you trace a bad decision?
  5. How do you roll it back?

If the answer is “we’re figuring that out,” congrats.

You don’t have an agent in production.

You have a very charismatic prototype.

And honestly, that might be the real dividing line over the next year: not whether your agent is smart enough, but whether your company is mature enough to deserve one.