AI Production Readiness Checklist for Plant Leaders
An AI production readiness checklist built for plant leaders: 7 gates covering data, accuracy, ownership, failure modes, and ROI before you go live.
An AI agent is production-ready when it has a baselined metric, has been tested on live messy data, has defined failure behavior, has a named human owner, and has live monitoring running before launch. Pass those gates and the agent keeps working after the demo ends. Skip them and your pilot becomes a 2am incident on second shift.
This is the checklist I wish I'd had before I put the first agent in front of a production line at a $250M manufacturer. We had a working pilot, a happy demo, and an executive who wanted it live by month-end. What we didn't have was an honest answer to "what happens when it's wrong at 2am on second shift?"
That gap is where pilots turn into incidents. And it's common. Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by end of 2025 (Gartner, 2024). McKinsey found more than 80% of organizations see no tangible EBIT impact from generative AI (McKinsey, 2025). The problem isn't the model. It's everything around it.
Why production is a different animal than a pilot
A pilot proves the agent can work. Production proves it will keep working when the integrator's gone, the data shifts, and nobody's watching the demo screen. Those are different jobs.
Plant leaders already know this rhythm. You don't commission a new line off a vendor demo. You run readiness checks: safety, capability, maintenance plan, operator training. An AI agent earns the same rigor or it doesn't go on the floor.
The software world figured this out too. Google's research team built a 28-point rubric for ML production readiness (Google Research, 2017) covering data, model, infrastructure, and monitoring. The checklist below translates that thinking into plant language. Seven gates, in the order I run them.
If you're earlier in the journey, start with the AI pilot-to-production gap to understand why so many stall before they get here.
Gate 1: The number is defined and baselined
Before anything technical, you need the metric and the before-state. If the agent is supposed to cut quote turnaround, you have measured current turnaround for at least two weeks. If it's flagging scrap, you have the current scrap rate by line and shift.
- Target metric named and tied to OEE, yield, OTD, or labor hours
- Baseline captured for 2+ weeks under normal conditions
- Success threshold written down (e.g., "reduce manual touches from 6 to 2")
- Break-even math done: cost of the agent vs. dollars recovered
No baseline, no go. You can't manage what you didn't measure, and finance will defund what you can't prove. For the full method, see how to calculate AI agent ROI in manufacturing.
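The break-even item in the list above is simple arithmetic, and it helps to write it down once. Here is a minimal sketch, assuming a one-time build cost, a flat monthly run cost, and savings that come only from fewer manual touches. The function name and every dollar figure are hypothetical placeholders, not benchmarks.

```python
def payback_months(upfront_cost, monthly_run_cost, monthly_savings):
    """Months until net savings cover the upfront cost, or None if they never do."""
    net_monthly = monthly_savings - monthly_run_cost
    if net_monthly <= 0:
        return None  # the agent costs more to run than it recovers
    return upfront_cost / net_monthly


# All figures below are hypothetical placeholders.
touches_before, touches_after = 6, 2   # manual touches per quote
minutes_per_touch = 5
quotes_per_month = 400
loaded_labor_rate = 45                 # dollars per hour

minutes_saved = (touches_before - touches_after) * minutes_per_touch * quotes_per_month
monthly_savings = minutes_saved * loaded_labor_rate / 60

print(monthly_savings)                                  # 6000.0
print(payback_months(60_000, 1_000, monthly_savings))   # 12.0
```

If the function returns None, the agent never pays back at the assumed savings, which is a Gate 1 fail on its own.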
Gate 2: Data is production-grade, not demo-grade
The pilot probably ran on a clean export. Production runs on the real feed. This is the gate that quietly kills the most projects. Gartner's analysts pointed straight at it: a lack of AI-ready data puts AI projects at risk (Gartner, 2025).
Before go-live, confirm the agent has been tested against the actual mess, not the sample.
- Tested on live production data, including nulls, dupes, and free-text fields
- Source systems documented (ERP, MES, SCADA, spreadsheets, email)
- Data refresh frequency matches the decision speed (real-time vs. nightly)
- Schema-change alerting in place, because someone will change a field
Google's production ML guidance makes the same point in plainer terms: training and serving should mimic each other as closely as possible (Google, 2024). A model trained on the clean export and served on the live feed is two different systems. Dig deeper in data readiness for AI in manufacturing.
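The checks in this gate can be sketched as a small audit that runs on every batch from the live feed. This is an illustration only: the field names, the `work_order` key, and the `audit_batch` function are assumptions, and a real feed would need checks specific to your ERP or MES.

```python
# Hypothetical schema for a scrap-reporting feed.
EXPECTED_FIELDS = {"work_order", "line", "shift", "scrap_qty"}


def audit_batch(rows, expected_fields=EXPECTED_FIELDS, key="work_order"):
    """Return data-quality findings for one batch of records (a list of dicts)."""
    findings = {
        "missing_fields": set(),     # expected fields absent from at least one row
        "unexpected_fields": set(),  # fields nobody told the agent about
        "null_count": 0,             # present-but-empty values in expected fields
        "duplicate_keys": set(),     # keys that appear more than once
    }
    seen = set()
    for row in rows:
        fields = set(row)
        findings["missing_fields"] |= expected_fields - fields
        findings["unexpected_fields"] |= fields - expected_fields
        findings["null_count"] += sum(
            1 for f in expected_fields & fields if row[f] in (None, "")
        )
        k = row.get(key)
        if k is not None:
            if k in seen:
                findings["duplicate_keys"].add(k)
            seen.add(k)
    return findings


sample = [
    {"work_order": "WO-1", "line": "A", "shift": 2, "scrap_qty": 3},
    {"work_order": "WO-1", "line": "A", "shift": 2, "scrap_qty": None},      # dupe + null
    {"work_order": "WO-2", "line": "B", "shift_code": "2nd", "scrap_qty": ""},  # renamed field
]
report = audit_batch(sample)
print(report)
```

A non-empty `missing_fields` or `unexpected_fields` set is the schema-change alert: someone renamed or added a column, and the agent should not keep consuming the feed silently.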
Gate 3: Accuracy is measured the way operators experience it
A 92% accuracy number is meaningless until you know what the 8% costs. Misclassifying a non-critical defect is a shrug. Missing a critical one ships bad product. Split your accuracy by consequence, not by average.
| Error type | Frequency | Cost per miss | Acceptable? |
|---|---|---|---|
| False positive (over-flag) | 6% | 4 min operator review | Yes |
| False negative (miss minor) | 1.5% | minor rework | Yes |
| False negative (miss critical) | 0.2% | escaped defect, recall risk | No — needs human gate |
If the expensive errors aren't rare enough, the agent runs in suggest mode with a human approving, not act mode, until it earns autonomy.
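To see why the average hides the problem, put a rough dollar figure on each row of the table. The sketch below uses the table's frequencies, expressed as errors per 1,000 decisions. The dollar costs are assumptions made up for illustration; the table itself only gives them in words.

```python
# (error type, errors per 1,000 decisions, ASSUMED dollar cost per error)
ERRORS = [
    ("false positive", 60, 3),        # 6%: about 4 min of operator review
    ("missed minor", 15, 25),         # 1.5%: minor rework
    ("missed critical", 2, 20_000),   # 0.2%: escaped defect, recall risk
]


def cost_per_thousand(errors):
    """Expected dollar cost of each error type per 1,000 decisions."""
    return {name: count * cost for name, count, cost in errors}


costs = cost_per_thousand(ERRORS)
worst = max(costs, key=costs.get)
print(costs)   # {'false positive': 180, 'missed minor': 375, 'missed critical': 40000}
print(worst)   # missed critical
```

Under these assumed costs, the rarest error carries almost all of the expected loss, which is why it gets the human gate.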
When to keep a human in the loop
The rule I use: the higher the cost of a single wrong action, the more a human stays on the gate. A scheduling suggestion can run autonomous. A scrap-or-ship decision on a safety-critical part does not, not at first. The NIST AI Risk Management Framework (NIST, 2023) frames this as a measure problem before it's a manage problem: quantify the risk, then decide how much autonomy it earns. More on the trade-offs in human-in-the-loop AI for operations.
Gate 4: Failure modes are designed, not discovered
This is the gate plant leaders get and software teams forget. Every machine on your floor has a defined failure behavior. A breaker trips. A valve fails closed. Your agent needs the same.
- What happens when the model is unsure? Define a confidence threshold that routes to a human.
- What happens when a source system goes down? The agent should fail safe and alert, not guess.
- What happens when output is obviously wrong? Operators need a one-click override and a way to flag it.
- What's the manual fallback? If the agent is offline, can the line still run? It must.
An agent with no defined failure mode isn't production-ready. It's an outage waiting for a trigger. This maps to the Manage function of the NIST AI Risk Management Framework: you allocate a planned response to each mapped risk before it happens, not after.
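The first two questions in the list above reduce to a routing rule the agent applies before it acts. Here is a minimal sketch, assuming the model returns a confidence score between 0 and 1. The `Prediction` type, the `route` function, the route names, and the 0.90 threshold are all hypothetical; set the threshold from your own Gate 3 numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    label: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def route(prediction: Optional[Prediction], source_ok: bool, threshold: float = 0.90) -> str:
    """Decide what happens to one agent output before anything acts on it."""
    if not source_ok or prediction is None:
        # Source system down or no output: hold the action and alert.
        # The line keeps running on the manual fallback.
        return "fail_safe_alert"
    if prediction.confidence < threshold:
        return "human_review"  # model is unsure, so a person decides
    return "auto"


print(route(Prediction("pass", 0.97), source_ok=True))    # auto
print(route(Prediction("pass", 0.62), source_ok=True))    # human_review
print(route(Prediction("pass", 0.99), source_ok=False))   # fail_safe_alert
```

Checking the source system before the confidence score is deliberate: a confident answer built on a dead feed is still a guess.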
Gate 5: A named human owns it
Every production agent needs an owner with allocated hours, the same way every line has an owner. Not the integrator. Not "IT." A named person.
- Owner named, with 2-4 hours/week allocated for monitoring
- Weekly review of accuracy, exception rate, and override rate
- Escalation path defined for when metrics drift
- Retraining trigger defined (e.g., override rate above 20%)
Ownership is also a governance requirement. ISO/IEC 42001:2023, the first international AI management system standard (ISO, 2023), is built around clear accountability and continual improvement — which only happens when a specific person is responsible for the system's behavior over time.
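The retraining trigger from the list above is easy to automate for the owner's weekly review. This sketch uses the 20% override rate from the example. The minimum sample size is an added assumption, so a slow week with a handful of decisions doesn't fire the trigger on noise.

```python
def override_rate(overrides, decisions):
    """Share of agent decisions that operators overrode."""
    if decisions == 0:
        return 0.0
    return overrides / decisions


def needs_retraining(overrides, decisions, trigger=0.20, min_decisions=50):
    """True when the override rate exceeds the trigger on a large enough sample."""
    if decisions < min_decisions:
        return False  # too few decisions to judge (assumed cutoff)
    return override_rate(overrides, decisions) > trigger


print(needs_retraining(25, 100))  # True
print(needs_retraining(20, 100))  # False: at the trigger, not above it
print(needs_retraining(5, 10))    # False: sample too small
```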
Gate 6: Operators are trained and bought in
The best agent on the floor fails if the people next to it don't trust it. I've watched a perfectly good quality agent get ignored because nobody explained what it did or how to override it. The model worked. The rollout didn't.
- Operators trained on what the agent does and doesn't do
- Override mechanism is one click and well understood
- Operators know how to flag bad output and see it gets acted on
- A skeptic on the floor has been walked through it (convert your loudest critic)
McKinsey's research is blunt on this: the workflow redesign and human adoption around the tool drive more EBIT impact than the model itself. Treat training as part of the system, not an afterthought. See AI change management for plant and ops teams for the rollout playbook.
Gate 7: Monitoring and ROI tracking are live before launch
You don't commission a line and check on it next quarter. Same here. The dashboard goes live before the agent does, not after the first surprise.
- Live dashboard: accuracy, throughput, exception rate, uptime
- ROI tracked against the Gate 1 baseline, updated weekly
- Alerting on drift, downtime, and threshold breaches
- A 30-day review scheduled with finance to confirm the dollars
Watch for drift, because the model degrades quietly
A model that was accurate at launch will degrade as the world changes around it. New product mix, a retooled line, a different supplier's material: the data shifts and accuracy slides. Google's guidance calls for validating model quality against both prior versions and fixed thresholds, so you catch sudden drops and slow degradations alike (Google, 2024). On the floor, "slow degradation" means an agent that's quietly getting worse while the dashboard still looks green if you're only watching uptime. Track accuracy over time, not just availability.
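One way to watch for both kinds of degradation is to compare each week's measured accuracy against three references: a fixed floor, the prior week, and the launch baseline. The sketch below is an illustration under assumed thresholds. The function name, alert names, and default values are hypothetical and should be tuned to your own process.

```python
def drift_alerts(weekly_accuracy, launch_accuracy,
                 floor=0.90, sudden_drop=0.03, slow_drop=0.02):
    """Return alert names for the latest week in a list of weekly accuracy values."""
    if not weekly_accuracy:
        return []
    alerts = []
    latest = weekly_accuracy[-1]
    if latest < floor:
        alerts.append("below_fixed_threshold")
    if len(weekly_accuracy) >= 2 and weekly_accuracy[-2] - latest > sudden_drop:
        alerts.append("sudden_drop_vs_prior_week")
    if launch_accuracy - latest > slow_drop:
        alerts.append("degraded_vs_launch")
    return alerts


# A slow slide: no single week looks alarming, but the total is 2.5 points.
slow = drift_alerts([0.95, 0.945, 0.94, 0.935, 0.925], launch_accuracy=0.95)
# A sudden drop after, say, a material change.
sudden = drift_alerts([0.95, 0.88], launch_accuracy=0.95)
print(slow)
print(sudden)
```

The slow-slide case is the one an uptime-only dashboard misses: every week-over-week change is small, and only the comparison against launch catches it.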
How to use this AI production readiness checklist
Run it as a gate, not a survey. Every item is pass/fail.
| Gate | Hard stop? | Can launch in suggest mode? |
|---|---|---|
| 1. Baselined metric | Yes | No |
| 2. Production-grade data | Yes | No |
| 3. Accuracy by consequence | No | Yes, with human gate |
| 4. Designed failure modes | Yes | No |
| 5. Named owner | No | Yes, with a date |
| 6. Operator training | No | Yes, with a date |
| 7. Live monitoring | No | Yes, with a date |
Any fail at Gate 1, 2, or 4 is a hard stop — those are the ones that cause incidents and defunding. Gates 3, 5, 6, and 7 can sometimes launch in suggest mode while you close them. Put a date on each open item.
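The table and rule above can be written as one decision function. This is a sketch of one reading, not a definitive policy. It assumes a missing gate result counts as a fail, and that an open soft gate without a due date blocks launch, which is a strict interpretation of "put a date on each open item."

```python
HARD_STOPS = {1, 2, 4}       # baselined metric, production-grade data, failure modes
ALL_GATES = range(1, 8)      # gates 1 through 7


def launch_decision(results, due_dates=None):
    """results maps gate number to True/False; due_dates maps open gates to a date string.

    Returns "go", "suggest_mode", or "no_go".
    """
    due_dates = due_dates or {}
    failed = {g for g in ALL_GATES if not results.get(g, False)}
    if failed & HARD_STOPS:
        return "no_go"
    if not failed:
        return "go"
    if any(g not in due_dates for g in failed):
        return "no_go"       # assumed rule: no date on an open item, no launch
    return "suggest_mode"


all_pass = {g: True for g in ALL_GATES}
print(launch_decision(all_pass))                                   # go
print(launch_decision({**all_pass, 2: False}))                     # no_go
print(launch_decision({**all_pass, 6: False}, {6: "2025-09-01"}))  # suggest_mode
print(launch_decision({**all_pass, 6: False}))                     # no_go
```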
The whole point: the pilot proves the agent can work. This checklist proves it will keep working when the demo is over, the integrator's gone, and it's second shift on a Tuesday.
Frequently asked questions
What is an AI production readiness checklist?
It's a pass/fail gate you run before putting an AI agent into live operations. It confirms the agent has a baselined success metric, has been tested on real production data, has defined failure behavior, has a named human owner, and has monitoring running before launch. The goal is to prove the agent will keep working after the demo, not just that it worked once.
How is production readiness different from a successful pilot?
A pilot runs on clean data, a narrow scope, and constant attention, and it only has to prove the agent can work. Production runs on messy live feeds, full volume, and no one watching, and it has to keep working for months. Gartner predicted that at least 30% of generative AI projects would be abandoned after the proof-of-concept stage by end of 2025, and an underestimated gap between pilot and production is a big part of why.
What are the most common reasons AI agents fail in production?
The big three are bad data, undefined failure modes, and no owner. Models trained on clean exports break on live feeds; Gartner ties many failures directly to a lack of AI-ready data. Agents without a designed fail-safe behavior become outages, and agents with no named owner drift unnoticed until the numbers break.
When should an AI agent run in suggest mode versus act mode?
Run in suggest mode — where a human approves each action — whenever the cost of a single wrong action is high, such as a scrap-or-ship decision on a safety-critical part. Let the agent act autonomously only after it has proven its expensive-error rate is rare enough on live data. The NIST AI Risk Management Framework frames this as measuring the risk before deciding how much autonomy the system earns.
What should I monitor after an AI agent goes live?
Track accuracy, throughput, exception rate, and uptime on a live dashboard, plus ROI against your pre-launch baseline. Most importantly, watch accuracy over time, because models degrade as your product mix, materials, or processes change. Google's production ML guidance recommends validating against both sudden and slow performance degradations rather than only checking that the system is up.