AgentOps: Monitoring AI Agents in Production
AgentOps monitoring explained for manufacturers: what to track once AI agents go live, the metrics that matter, and why pilots die without it.
AgentOps is the discipline of running AI agents reliably after they ship: tracing every run, scoring output quality on a schedule, watching cost and latency, and alerting when any of those cross a threshold. It exists because agents fail silently — they return a confident wrong answer instead of crashing — so a "200 OK" tells you nothing about whether the work was right. Monitor four things — quality, behavior, cost, reliability — set numeric thresholds before launch, and you catch problems before the plant manager does.
I ran AI at a $250M furniture manufacturer. The agents that survived all had monitoring from day one. The ones that died were the ones we shipped blind, and they died the same way every time: something went wrong in production, nobody could see it or explain it, and trust collapsed.
This is not a niche worry. Gartner predicts more than 40% of agentic AI projects will be scrapped by the end of 2027, citing escalating costs, unclear value, and inadequate risk controls, findings drawn from a survey of over 3,400 organizations (Gartner, 2025). "Inadequate risk controls" is, in large part, missing monitoring by another name.
Why agents need their own ops discipline
Traditional software is deterministic. Same input, same output, every time. You watch uptime and error codes and you're mostly covered. Agents break that model in three ways.
- Non-deterministic output. The same question can return a different answer on each run. "It returned 200 OK" tells you the pipe is open, not whether the answer was correct.
- Silent failure. An agent doesn't throw an exception when it's wrong. It returns a plausible, confident, incorrect answer. There's nothing to catch.
- Drift. Behavior changes over time as your data shifts, your inputs shift, or the underlying model gets updated. What worked in March quietly degrades by June.
The silent-failure problem is not theoretical. A Stanford RegLab study found leading models hallucinated on specific, verifiable legal questions between 69% and 88% of the time (Dahl et al., Journal of Legal Analysis, 2024). Drop an unmonitored agent into a domain it doesn't know cold and that's the scale of failure you risk inheriting, invisibly.
Drift is the slow killer. McKinsey's 2025 survey found measurement is still immature across most companies, and where tracking does exist, value realization rises and risk incidents fall (McKinsey, 2025). Monitoring isn't overhead. It's the thing that converts a pilot into value.
The four things to monitor
AgentOps covers four categories. Skip any one and you've got a blind spot that eventually bites. The four map cleanly to the Measure and Manage functions of the NIST AI Risk Management Framework (NIST, 2024), which call for tracking performance, drift, and incidents after deployment, not just before.
1. Quality — is it right?
The category that matters most and the one most teams skip, because it's the hardest. You can't eyeball thousands of outputs. So you sample and you score.
- Run evals on a held-out set continuously, not just before launch. Fix a set of real cases with known-correct answers and score them automatically on every model change and on a regular cadence. When accuracy drops, you see it before users do.
- Sample live outputs for human review. Pull a daily sample and have someone qualified grade it. Even 20 cases a day surfaces problems fast.
- Track the business metric the agent exists to move — hours saved, errors caught, tickets deflected. If that number isn't moving, quality is a footnote.
One caution on numbers. Top models score 1–3% hallucination on clean, grounded benchmarks but commonly 3–20% or higher across mixed real-world tasks. Your held-out set has to look like your real inputs, not a vendor demo, or your eval lies to you.
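To make the standing eval concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `toy_agent` stands in for a real agent call, the part numbers and answers are made up, and exact-match scoring is the simplest possible grader. Real evals usually need a more forgiving scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One held-out case: a real input and its known-correct answer."""
    question: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise isn't scored as an error."""
    return " ".join(text.lower().split())


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return accuracy in [0, 1]: the share of cases the agent answers correctly."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(normalize(agent(c.question)) == normalize(c.expected) for c in cases)
    return correct / len(cases)


def toy_agent(question: str) -> str:
    """Hypothetical stand-in for a real agent; it only knows two answers."""
    answers = {
        "lead time for part A-100?": "12 days",
        "safety stock for part B-200?": "40 units",
    }
    return answers.get(question, "unknown")


# Hypothetical held-out set; in practice these are real cases from your own operation.
HELD_OUT = [
    EvalCase("lead time for part A-100?", "12 days"),
    EvalCase("safety stock for part B-200?", "40 units"),
    EvalCase("supplier for part C-300?", "Acme Lumber"),
    EvalCase("reorder point for part D-400?", "75 units"),
]

accuracy = run_eval(toy_agent, HELD_OUT)
print(f"eval accuracy: {accuracy:.0%}")
```

Run the same function on every model change and on a fixed schedule, and store the result with a date. The trend is what tells you about drift, not any single score.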
2. Behavior — what is it doing?
You need to see what the agent actually did, step by step, not just what it returned.
- Trace every run. Capture inputs, the agent's reasoning steps, every tool or data source it touched, and the final output. When something breaks, you replay the trace instead of guessing.
- Watch tool and data calls. An agent calling the wrong endpoint or reading stale data shows up here before it shows up as a wrong answer to a user.
- Flag human overrides. Every time a person rejects or corrects an output, log it. A rising override rate is your earliest warning that quality is slipping.
Tracing has a standard now, which means you're not locked to one vendor. The OpenTelemetry GenAI semantic conventions (OpenTelemetry, 2025) define a common schema for spans, token usage, model metadata, and tool calls, so a trace from a LangChain agent has the same shape as one from a raw model call. Instrument to that spec and you can swap observability tools without re-instrumenting.
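As a rough picture of what "trace every run" produces, the sketch below records one run as a list of spans using only the Python standard library. It does not use the OpenTelemetry SDK. The `Span` and `RunTrace` classes, the model name, and the `erp_lookup` tool are hypothetical, and the `gen_ai.*` attribute keys reflect my reading of the GenAI conventions, which are still evolving, so check them against the current spec before relying on them.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    """One step in a run: a model call or a tool call."""
    name: str
    attributes: dict
    started_at: float = field(default_factory=time.time)
    duration_s: float = 0.0

    def end(self) -> None:
        """Stamp how long the step took (clamped at zero if the clock moves backwards)."""
        self.duration_s = max(0.0, time.time() - self.started_at)


@dataclass
class RunTrace:
    """Everything one agent run did, in order."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def start_span(self, name: str, attributes: dict) -> Span:
        span = Span(name=name, attributes=dict(attributes))
        self.spans.append(span)
        return span

    def to_json(self) -> str:
        """Serialize the run so it can be stored and replayed later."""
        return json.dumps(asdict(self), indent=2)


trace = RunTrace()

# Model call: attribute keys assumed from the GenAI semantic conventions.
llm_span = trace.start_span("chat", {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "example-model",  # hypothetical model name
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 96,
})
llm_span.end()

# Tool call: which system the agent touched.
tool_span = trace.start_span("execute_tool", {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "erp_lookup",  # hypothetical tool name
})
tool_span.end()

print(trace.to_json())
```

Even this much is enough to answer "what did it call, and with how many tokens" after the fact. Moving to a real OpenTelemetry exporter later mostly means keeping the same attribute keys.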
3. Cost — what is it spending?
Agents cost money per run, and the costs surprise people.
- Track token spend per agent and per use case. A chatty agent or a runaway loop can 10x your bill quietly.
- Set budget alerts. Pick a spend threshold that warns you before a bad prompt or an infinite loop runs up a bill overnight.
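The two cost checks above can be sketched in a few lines. The prices, agent names, token counts, and budgets here are all made-up placeholders; substitute your provider's actual rates and your own limits.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; use your provider's real rates.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single run in USD."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT


def spend_by_agent(runs: list[tuple[str, int, int]]) -> dict[str, float]:
    """Total spend per agent from (agent, input_tokens, output_tokens) records."""
    totals: dict[str, float] = defaultdict(float)
    for agent, input_tokens, output_tokens in runs:
        totals[agent] += run_cost(input_tokens, output_tokens)
    return dict(totals)


def over_budget(totals: dict[str, float], budgets: dict[str, float]) -> list[str]:
    """Agents whose spend exceeds their budget; an agent with no budget is always flagged."""
    return sorted(a for a, spent in totals.items() if spent > budgets.get(a, 0.0))


# Hypothetical day of usage: the quoting agent got stuck in a loop.
runs_today = [
    ("scheduler", 2_000, 500),
    ("scheduler", 2_000, 500),
    ("quoting", 400_000, 100_000),
]
daily_budget_usd = {"scheduler": 1.00, "quoting": 1.00}

totals = spend_by_agent(runs_today)
for agent in over_budget(totals, daily_budget_usd):
    print(f"BUDGET ALERT: {agent} spent ${totals[agent]:.2f} today")
```

The per-agent breakdown is the part that matters. A single company-wide total tells you the bill went up, but not which agent to look at.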
4. Reliability — is it up?
The traditional ops stuff still applies, because none of the above matters if the agent is down or too slow to use.
- Latency and uptime. An agent that takes 40 seconds to answer won't get used, no matter how accurate it is.
- Error and timeout rates on the integration layer and the model calls themselves.
What good looks like, in numbers
Vague monitoring is no monitoring. Set actual thresholds before launch and alert when they're crossed.
| Signal | Healthy | Investigate | Alert |
|---|---|---|---|
| Eval accuracy | At or above launch baseline | 5+ pts below baseline | 10+ pts below baseline |
| Human override rate | Stable or falling | Rising trend over a week | Doubles from baseline |
| Business metric | Moving toward target | Flat | Reversing |
| Cost per run | Within budget | 25% over budget | 50%+ over budget |
| Latency (p95) | Under user tolerance | Creeping up | Exceeds tolerance |
The exact numbers differ by use case. The point is to write them down before launch, so a problem becomes a crossed threshold you can act on — not a vague feeling that the agent "seems worse lately."
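Two rows of the table translate directly into code. This is a sketch under assumptions: the table leaves small gaps (an accuracy drop under 5 points, or a cost overage under 25%), and I treat those as healthy here. The function names and the sample numbers are hypothetical.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    INVESTIGATE = "investigate"
    ALERT = "alert"


def accuracy_status(baseline_pct: float, current_pct: float) -> Status:
    """Compare eval accuracy to the launch baseline, in percentage points."""
    drop = baseline_pct - current_pct
    if drop >= 10:
        return Status.ALERT
    if drop >= 5:
        return Status.INVESTIGATE
    return Status.HEALTHY  # assumption: a drop under 5 points counts as healthy


def cost_status(budget_per_run: float, actual_per_run: float) -> Status:
    """Compare actual cost per run to the budgeted cost per run."""
    if budget_per_run <= 0:
        raise ValueError("budget_per_run must be positive")
    overage = (actual_per_run - budget_per_run) / budget_per_run
    if overage >= 0.50:
        return Status.ALERT
    if overage >= 0.25:
        return Status.INVESTIGATE
    return Status.HEALTHY  # assumption: an overage under 25% counts as healthy


# Hypothetical readings for one agent.
print(accuracy_status(baseline_pct=92, current_pct=86).value)
print(cost_status(budget_per_run=0.10, actual_per_run=0.17).value)
```

Writing the thresholds as code forces the decisions the table lets you postpone, like what happens in the gaps between columns.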
Pick p95 latency, not the average. The average hides the slow tail, and the slow tail is what makes a plant operator give up on the tool.
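A small worked example shows how much the average can hide. The latency numbers below are invented: 18 fast responses and 2 very slow ones. The `percentile` helper uses the nearest-rank method, which is one of several common definitions.

```python
from statistics import mean


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 18 quick answers and 2 that hung.
latencies_s = [2.0] * 18 + [40.0] * 2

avg = mean(latencies_s)
p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
print(f"mean={avg:.1f}s  p50={p50:.1f}s  p95={p95:.1f}s")
# prints: mean=5.8s  p50=2.0s  p95=40.0s
```

A 5.8-second average looks tolerable. The p95 shows that one request in ten took 40 seconds, which is the experience the operator remembers.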
Monitoring is what earns write access
Here's the link to everything else. The reason you can safely let an agent move from read-only to writing back into your ERP or MES is that you can see what it's doing. Tracing, eval accuracy, and override rates are the evidence that an agent is reliable enough to trust with a real transaction. No monitoring, no write access — the two are tied together, and that boundary is the heart of integrating AI agents with your ERP and MES.
Monitoring also keeps human-in-the-loop honest. The approval step on high-stakes actions only works if you're tracking how often the human disagrees with the agent — a discipline I cover in human-in-the-loop AI for operations. A near-zero override rate means you can widen the agent's autonomy. A rising one means pull it back.
And the trace logs are your security and audit trail. When you need to prove what an agent did with sensitive data, or investigate a prompt-injection attempt, the trace is the record — which is why monitoring is load-bearing for the AI agent security risks manufacturers must manage.
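The "no monitoring, no write access" rule in this section can be written down as an explicit gate. This is only a sketch of one possible policy: the `Evidence` fields, the 5-point accuracy margin, and the "override rate has doubled" cutoff are assumptions borrowed from the threshold table, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Monitoring evidence for one agent over a review period."""
    tracing_enabled: bool
    eval_accuracy_pct: float
    baseline_accuracy_pct: float
    override_rate: float           # share of outputs a human rejected or corrected, 0..1
    baseline_override_rate: float


def allowed_mode(e: Evidence) -> str:
    """Return 'write' only when every monitoring signal supports it, else 'read_only'."""
    if not e.tracing_enabled:
        return "read_only"  # no trace means no audit trail, so no transactions
    if e.baseline_accuracy_pct - e.eval_accuracy_pct >= 5:
        return "read_only"  # accuracy has slipped 5+ points below baseline
    if e.override_rate >= 2 * e.baseline_override_rate and e.override_rate > 0:
        return "read_only"  # humans are overriding at double the baseline rate
    return "write"


# Hypothetical agents.
steady = Evidence(True, 93.0, 92.0, 0.02, 0.03)
slipping = Evidence(True, 91.0, 92.0, 0.08, 0.03)
blind = Evidence(False, 95.0, 92.0, 0.00, 0.03)

for name, evidence in [("steady", steady), ("slipping", slipping), ("blind", blind)]:
    print(name, "->", allowed_mode(evidence))
```

The blind agent is the instructive case. Its accuracy looks best of the three, but without tracing there is no way to verify what it did, so it stays read-only.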
How this fits formal governance
If your company is heading toward certification, monitoring is not a separate workstream. It's the operational layer that satisfies the standards.
The NIST AI RMF GenAI Profile (NIST, 2024) names confabulation (hallucination), data leakage, and information-integrity failures as risks to monitor continuously after deployment. The international management standard ISO/IEC 42001 (ISO/IEC, 2023) goes further, requiring organizations to define what they monitor (accuracy, fairness, drift, resource use) and how often, as a documented control.
Read those requirements and they describe exactly the four categories above. Build the monitoring and you've built most of the evidence an auditor will ask for. Skip it and certification becomes a paperwork exercise with nothing real underneath. Either way, monitoring slots into the broader picture in AI governance for manufacturers.
Start simple, then build
You don't need a full observability platform on day one. You need these, in order.
- Logging and tracing — capture every run from the first deployment. Non-negotiable and cheap. Instrument to the OpenTelemetry GenAI conventions so you're not locked in.
- A standing eval set — real cases with known answers, run on a schedule. This is what catches drift.
- A human review sample — a daily handful of outputs, graded by someone who knows the work.
- The business-metric dashboard — the single number that justifies the agent's existence, visible to the sponsor.
Add cost and latency alerting as volume grows. Tooling exists — there are AgentOps platforms built for exactly this — but the discipline matters more than the tool. A spreadsheet of eval results that someone reads every morning beats a fancy dashboard nobody opens.
This is the same gap that strands most pilots, which I break down in the AI pilot-to-production gap. The agents that get out of pilot and stay out are the monitored ones. AgentOps is how you keep trust after launch, and trust is the entire game.
Frequently asked questions
What is AgentOps?
AgentOps is the operational discipline of monitoring and maintaining AI agents in production. It covers tracing every run, scoring output quality against a held-out eval set, tracking cost and latency, and alerting when any signal crosses a defined threshold. Think of it as DevOps for non-deterministic systems: the work of keeping an agent reliable after it ships, not just building it.
How is monitoring AI agents different from monitoring normal software?
Normal software is deterministic, so uptime and error codes tell you most of what you need. Agents are non-deterministic and fail silently — they return confident wrong answers instead of crashing — so a successful HTTP response says nothing about correctness. You have to monitor output quality directly through evals and human review, plus watch for drift as data and models change over time.
What metrics should I track for an AI agent in production?
Track four categories: quality (eval accuracy, human override rate, the business metric it moves), behavior (full run traces, tool and data calls), cost (token spend per use case, budget alerts), and reliability (p95 latency, uptime, error rates). Set numeric thresholds for each before launch so a problem is a crossed line you can act on rather than a hunch. The exact targets depend on your use case, but they must be written down in advance.
Do I need a dedicated AgentOps platform to get started?
No. Start with four cheap things in order: logging and tracing on every run, a standing eval set run on a schedule, a small daily human-review sample, and one business-metric dashboard for your sponsor. Instrument tracing to the OpenTelemetry GenAI semantic conventions so you can adopt a platform later without re-instrumenting. The discipline matters more than the tool — a spreadsheet someone reads beats a dashboard nobody opens.
How does monitoring relate to letting an agent write to my ERP?
Monitoring is the evidence that earns write access. You can only safely let an agent move from read-only to writing back into your ERP or MES because tracing, eval accuracy, and override rates prove it's reliable enough to trust with a real transaction. No monitoring, no write access — and the trace logs double as the audit trail you'll need when something goes wrong or an auditor asks what the agent did.