AgentOps: Monitoring AI Agents in Production

By Jason Osajima, former VP of AI at a $250M manufacturer
Quick answer

AgentOps monitoring explained for manufacturers: what to track once AI agents go live, the metrics that matter, and why pilots die without it.

AgentOps is the discipline of running AI agents reliably after they ship: tracing every run, scoring output quality on a schedule, watching cost and latency, and alerting when any of those cross a threshold. It exists because agents fail silently — they return a confident wrong answer instead of crashing — so a "200 OK" tells you nothing about whether the work was right. Monitor four things — quality, behavior, cost, reliability — set numeric thresholds before launch, and you catch problems before the plant manager does.

I ran AI at a $250M furniture manufacturer. The agents that survived all had monitoring from day one. The ones that died were the ones we shipped blind, and they died the same way every time: something went wrong in production, nobody could see it or explain it, and trust collapsed.

This is not a niche worry. Gartner predicts more than 40% of agentic AI projects will be scrapped by the end of 2027, citing escalating costs, unclear value, and inadequate risk controls — findings drawn from a survey of over 3,400 organizations (Gartner, 2025). Inadequate risk controls is monitoring by another name.

Why agents need their own ops discipline

Traditional software is deterministic. Same input, same output, every time. You watch uptime and error codes and you're mostly covered. Agents break that model in three ways: they are non-deterministic, they fail silently, and they drift as data and models change.

The silent-failure problem is not theoretical. A Stanford RegLab study found leading models hallucinated on specific, verifiable legal questions between 69% and 88% of the time (Dahl et al., Journal of Legal Analysis, 2024). Drop an unmonitored agent into a domain it doesn't know cold and that's the failure rate you inherit — invisibly.

Drift is the slow killer. McKinsey's 2025 survey found measurement is still immature across most companies, and where tracking does exist, value realization rises and risk incidents fall (McKinsey, 2025). Monitoring isn't overhead. It's the thing that converts a pilot into value.

The four things to monitor

AgentOps covers four categories. Skip any one and you've got a blind spot that eventually bites. This maps cleanly to the Measure and Manage functions of the NIST AI Risk Management Framework, 2024, which calls for tracking performance, drift, and incidents after deployment — not just before.

1. Quality — is it right?

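The category that matters most and the one most teams skip, because it's the hardest. You can't eyeball thousands of outputs. So you sample and you score: eval accuracy on a held-out set, the human override rate, and the business metric the agent is supposed to move.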

One caution on numbers. Top models score 1–3% hallucination on clean, grounded benchmarks but commonly 3–20% or higher across mixed real-world tasks. Your held-out set has to look like your real inputs, not a vendor demo, or your eval lies to you.
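The sample-and-score loop can be sketched in a few lines. This is a minimal illustration, assuming exact-match grading is good enough for the task; `EvalCase`, `fake_agent`, the sample questions, and the 0.90 baseline are all hypothetical stand-ins, and many real tasks will need a rubric or a human grader instead of string comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One held-out case: a real input paired with its known-correct answer."""
    prompt: str
    expected: str


def eval_accuracy(cases: list[EvalCase], run_agent: Callable[[str], str]) -> float:
    """Return the fraction of cases the agent answers correctly (exact match)."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(
        1 for case in cases
        if run_agent(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return correct / len(cases)


def points_below_baseline(accuracy: float, baseline: float) -> float:
    """Percentage points the current accuracy sits below the launch baseline."""
    return max(0.0, (baseline - accuracy) * 100)


def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    answers = {"Lead time for SKU 1001?": "14 days", "Supplier for SKU 2002?": "Acme"}
    return answers.get(prompt, "unknown")


cases = [
    EvalCase("Lead time for SKU 1001?", "14 days"),
    EvalCase("Supplier for SKU 2002?", "Acme"),
    EvalCase("Reorder point for SKU 3003?", "250 units"),
    EvalCase("Lead time for SKU 4004?", "21 days"),
]
accuracy = eval_accuracy(cases, fake_agent)
drop = points_below_baseline(accuracy, baseline=0.90)
print(f"accuracy={accuracy:.2f}, drop={drop:.0f} pts below baseline")
```

Run on a schedule, the `drop` value is the number you compare against your written thresholds. The cases themselves should be pulled from real production inputs, for the reason given above.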

2. Behavior — what is it doing?

You need to see what the agent actually did, step by step, not just what it returned. That means a full trace of every run, including each tool and data call.

Tracing has a standard now, which means you're not locked to one vendor. The OpenTelemetry GenAI semantic conventions, 2025 define a common schema for spans, token usage, model metadata, and tool calls — so a trace from a LangChain agent looks the same as one from a raw model call. Instrument to that spec and you can swap observability tools without re-instrumenting.
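To make the shape of a trace concrete, here is a standard-library sketch that records two spans for one agent run: a model call and a tool call. It does not use the OpenTelemetry SDK. The `span` helper, the in-memory `RECORDED_SPANS` list, the model name, and the `lookup_inventory` tool are assumptions for illustration. The attribute keys follow the GenAI conventions as I understand them, but the spec is still evolving, so check the current version before relying on any key.

```python
import time
import uuid
from contextlib import contextmanager

# In-memory sink standing in for a real exporter (assumption for this sketch).
RECORDED_SPANS: list[dict] = []


@contextmanager
def span(name: str, trace_id: str, attributes: dict):
    """Record one step of an agent run as a span-like dict with a duration."""
    record = {
        "span_id": uuid.uuid4().hex[:16],
        "trace_id": trace_id,
        "name": name,
        "attributes": dict(attributes),
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        RECORDED_SPANS.append(record)


def run_agent_once(question: str) -> str:
    """Hypothetical agent run: one model call followed by one tool call."""
    trace_id = uuid.uuid4().hex  # shared by every span in this run
    with span("chat example-model", trace_id, {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model",  # placeholder model name
    }) as model_span:
        # A real agent would call the model here and copy usage from the response.
        # Word count is only a fake stand-in for a token count.
        model_span["attributes"]["gen_ai.usage.input_tokens"] = len(question.split())
        model_span["attributes"]["gen_ai.usage.output_tokens"] = 12
    with span("execute_tool lookup_inventory", trace_id, {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "lookup_inventory",  # hypothetical tool
    }):
        answer = "42 units on hand"  # stand-in for the tool result
    return answer


run_agent_once("How many units of SKU 1001 are on hand?")
for recorded in RECORDED_SPANS:
    print(recorded["name"], recorded["attributes"])
```

Every span in a run shares one `trace_id`, which is what lets you reconstruct the step-by-step sequence later. In production you would emit the same attributes through an OpenTelemetry tracer rather than a Python list.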

3. Cost — what is it spending?

Agents cost money per run, and the costs surprise people. Track token spend per use case and set budget alerts.
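Cost per run is simple arithmetic once token counts are logged. The sketch below assumes made-up prices per million tokens and hypothetical use-case names; substitute your provider's actual rates.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens; not any vendor's real rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


@dataclass(frozen=True)
class RunUsage:
    """Token usage for one agent run, tagged with the use case it served."""
    use_case: str
    input_tokens: int
    output_tokens: int


def run_cost(usage: RunUsage) -> float:
    """Dollar cost of a single run at the assumed prices."""
    return (
        usage.input_tokens * INPUT_PRICE_PER_MTOK
        + usage.output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


def cost_by_use_case(runs: list[RunUsage]) -> dict[str, float]:
    """Total spend per use case, so one expensive workflow can't hide in the sum."""
    totals: dict[str, float] = {}
    for run in runs:
        totals[run.use_case] = totals.get(run.use_case, 0.0) + run_cost(run)
    return totals


runs = [
    RunUsage("quote_drafting", input_tokens=20_000, output_tokens=1_000),
    RunUsage("quote_drafting", input_tokens=22_000, output_tokens=1_200),
    RunUsage("inventory_lookup", input_tokens=2_000, output_tokens=200),
]
for use_case, total in cost_by_use_case(runs).items():
    print(f"{use_case}: ${total:.4f}")
```

In this made-up data the input tokens dominate the bill: an agent that re-reads a large context on every step pays for it on every step, which is one way costs end up higher than expected.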

4. Reliability — is it up?

The traditional ops stuff still applies, because none of the above matters if the agent is down or too slow to use. Track p95 latency, uptime, and error rates.

What good looks like, in numbers

Vague monitoring is no monitoring. Set actual thresholds before launch and alert when they're crossed.

Signal               Healthy                      Investigate               Alert
Eval accuracy        At or above launch baseline  5+ pts below baseline     10+ pts below baseline
Human override rate  Stable or falling            Rising trend over a week  Doubles from baseline
Business metric      Moving toward target         Flat                      Reversing
Cost per run         Within budget                25% over                  50%+ over
Latency (p95)        Under user tolerance         Creeping up               Exceeds tolerance

The exact numbers differ by use case. The point is to write them down before launch, so a problem becomes a crossed threshold you can act on — not a vague feeling that the agent "seems worse lately."
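Written-down thresholds translate directly into code. The sketch below covers two rows of the table, eval accuracy and cost per run. The function names and the `Status` enum are hypothetical, and one choice is my assumption rather than the table's: values between "healthy" and "investigate" (for example 3 points below baseline, or 10% over budget) are treated as healthy.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    INVESTIGATE = "investigate"
    ALERT = "alert"


def eval_accuracy_status(accuracy_pct: float, baseline_pct: float) -> Status:
    """Classify eval accuracy by how many points it sits below the launch baseline."""
    drop = baseline_pct - accuracy_pct
    if drop >= 10:
        return Status.ALERT
    if drop >= 5:
        return Status.INVESTIGATE
    return Status.HEALTHY  # assumption: a drop under 5 points counts as healthy


def cost_status(cost_per_run: float, budget_per_run: float) -> Status:
    """Classify cost per run by how far it exceeds the budgeted cost."""
    if budget_per_run <= 0:
        raise ValueError("budget_per_run must be positive")
    overage = (cost_per_run - budget_per_run) / budget_per_run
    if overage >= 0.50:
        return Status.ALERT
    if overage >= 0.25:
        return Status.INVESTIGATE
    return Status.HEALTHY  # assumption: under 25% over budget counts as healthy


print(eval_accuracy_status(accuracy_pct=84, baseline_pct=90).value)
print(cost_status(cost_per_run=1.5, budget_per_run=1.0).value)
```

The trend-based rows (override rate, business metric, latency creep) need a time window as well as a threshold, so they are left out of this sketch.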

Pick p95 latency, not the average. The average hides the slow tail, and the slow tail is what makes a plant operator give up on the tool.
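A small example shows how far apart the two numbers can be. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions, and the latency data is invented: 18 fast runs and 2 slow ones.

```python
import math
import statistics


def p95(latencies_ms: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Illustrative data only: 18 runs at 0.8 s and 2 runs at 9 s.
latencies_ms = [800.0] * 18 + [9000.0] * 2
print(f"mean={statistics.mean(latencies_ms):.0f} ms, p95={p95(latencies_ms):.0f} ms")
```

With this data the mean is 1,620 ms, which sounds tolerable, while p95 is 9,000 ms. One run in ten makes the operator wait nine seconds, and only the percentile shows it.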

Monitoring is what earns write access

Here's the link to everything else. The reason you can safely let an agent move from read-only to writing back into your ERP or MES is that you can see what it's doing. Tracing, eval accuracy, and override rates are the evidence that an agent is reliable enough to trust with a real transaction. No monitoring, no write access — the two are tied together, and that boundary is the heart of integrating AI agents with your ERP and MES.

Monitoring also keeps human-in-the-loop honest. The approval step on high-stakes actions only works if you're tracking how often the human disagrees with the agent — a discipline I cover in human-in-the-loop AI for operations. A near-zero override rate means you can widen the agent's autonomy. A rising one means pull it back.

And the trace logs are your security and audit trail. When you need to prove what an agent did with sensitive data, or investigate a prompt-injection attempt, the trace is the record — which is why monitoring is load-bearing for the AI agent security risks manufacturers must manage.

How this fits formal governance

If your company is heading toward certification, monitoring is not a separate workstream. It's the operational layer that satisfies the standards.

The NIST AI RMF GenAI Profile, 2024 names confabulation (hallucination), data leakage, and information-integrity failures as risks to monitor continuously after deployment. The international management standard ISO/IEC 42001, 2023 goes further, requiring organizations to define what they monitor — accuracy, fairness, drift, resource use — and how often, as a documented control.

Read those requirements and they describe exactly the four categories above. Build the monitoring and you've built most of the evidence an auditor will ask for. Skip it and certification becomes a paperwork exercise with nothing real underneath. Either way, monitoring slots into the broader picture in AI governance for manufacturers.

Start simple, then build

You don't need a full observability platform on day one. You need these, in order.

  1. Logging and tracing — capture every run from the first deployment. Non-negotiable and cheap. Instrument to the OpenTelemetry GenAI conventions so you're not locked in.
  2. A standing eval set — real cases with known answers, run on a schedule. This is what catches drift.
  3. A human review sample — a daily handful of outputs, graded by someone who knows the work.
  4. The business-metric dashboard — the single number that justifies the agent's existence, visible to the sponsor.
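The human review sample in step 3 can be as simple as a seeded random draw from the day's run IDs. This is a sketch under assumptions: the `run-0001` style IDs, the sample size of 10, and seeding by date string are illustrative choices, not a prescribed method.

```python
import random


def daily_review_sample(run_ids: list[str], sample_size: int, seed: str) -> list[str]:
    """Pick a reproducible random handful of runs for human grading."""
    rng = random.Random(seed)  # seeding by date lets anyone re-derive the same pick
    k = min(sample_size, len(run_ids))  # tolerate quiet days with few runs
    return rng.sample(run_ids, k)


# Hypothetical run IDs for one day of traffic.
run_ids = [f"run-{i:04d}" for i in range(1, 501)]
todays_sample = daily_review_sample(run_ids, sample_size=10, seed="2025-01-15")
print(todays_sample)
```

Drawing at random matters more than the exact size. If reviewers pick the runs themselves, they tend to grade the interesting ones, and the sample stops representing what the agent does on an ordinary day.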

Add cost and latency alerting as volume grows. Tooling exists — there are AgentOps platforms built for exactly this — but the discipline matters more than the tool. A spreadsheet of eval results that someone reads every morning beats a fancy dashboard nobody opens.

This is the same gap that strands most pilots, which I break down in the AI pilot-to-production gap. The agents that get out of pilot and stay out are the monitored ones. AgentOps is how you keep trust after launch, and trust is the entire game.

Frequently asked questions

What is AgentOps?

AgentOps is the operational discipline of monitoring and maintaining AI agents in production. It covers tracing every run, scoring output quality against a held-out eval set, tracking cost and latency, and alerting when any signal crosses a defined threshold. Think of it as DevOps for non-deterministic systems: the work of keeping an agent reliable after it ships, not just building it.

How is monitoring AI agents different from monitoring normal software?

Normal software is deterministic, so uptime and error codes tell you most of what you need. Agents are non-deterministic and fail silently — they return confident wrong answers instead of crashing — so a successful HTTP response says nothing about correctness. You have to monitor output quality directly through evals and human review, plus watch for drift as data and models change over time.

What metrics should I track for an AI agent in production?

Track four categories: quality (eval accuracy, human override rate, the business metric it moves), behavior (full run traces, tool and data calls), cost (token spend per use case, budget alerts), and reliability (p95 latency, uptime, error rates). Set numeric thresholds for each before launch so a problem is a crossed line you can act on rather than a hunch. The exact targets depend on your use case, but they must be written down in advance.

Do I need a dedicated AgentOps platform to get started?

No. Start with four cheap things in order: logging and tracing on every run, a standing eval set run on a schedule, a small daily human-review sample, and one business-metric dashboard for your sponsor. Instrument tracing to the OpenTelemetry GenAI semantic conventions so you can adopt a platform later without re-instrumenting. The discipline matters more than the tool — a spreadsheet someone reads beats a dashboard nobody opens.

How does monitoring relate to letting an agent write to my ERP?

Monitoring is the evidence that earns write access. You can only safely let an agent move from read-only to writing back into your ERP or MES because tracing, eval accuracy, and override rates prove it's reliable enough to trust with a real transaction. No monitoring, no write access — and the trace logs double as the audit trail you'll need when something goes wrong or an auditor asks what the agent did.
