AI incidents jumped 56% in 2024. The Stanford AI Index counted 233 reported AI incidents that year, up from 149 in 2023, mostly in systems already in production rather than lab experiments.
Air Canada learned that the hard way. Its chatbot invented a non-existent bereavement fare policy, and a tribunal forced the airline to honor that fabricated policy, while its monitoring still showed a healthy, fast 200 OK service.
Systems like this now sit in front of customers, inside credit decisions, in medical support tools, and on search results pages. When they fail, they usually do not crash. They respond with wrong or unsafe answers that look fine at the protocol level, while your dashboards stay green. Traditional observability catches timeouts and error codes. It does not catch a model that is confidently wrong.
Most monitoring still focuses on technical health: latency, error rates, CPU and memory. That works for databases and APIs because those systems tend to fail loudly. Queries time out, endpoints return 500s, containers crash, and your graphs show clear spikes.
A language model that answers customer questions fails in a much quieter way. When it hallucinates a refund policy or invents a medical instruction, the HTTP response still looks healthy. You see a 200, normal response time, and no spike in standard error metrics. From the system’s point of view, everything succeeded. From the user’s point of view, something went very wrong.
Teams often hear about AI failures from customers, support tickets, or regulators rather than from alerts. The observability stack is watching infrastructure behavior, while the failure lives in the content and decisions the model produces.
For AI systems, you need to see what the model did, not just whether it responded. When a large language model misbehaves, the important details are the prompt you sent, the retrieved context, the tools it called, and the answer it produced. If you are not logging those, you are guessing whenever you investigate an incident.
Tokens are also the real cost unit. Without token counts per request and per feature, cost regressions only show up when you see a bill that is much higher than you expected.
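A minimal sketch of per-feature token accounting makes this concrete. The price table, model names, and feature names below are placeholder assumptions, not real provider rates:

```python
from collections import defaultdict

# Placeholder per-1K-token prices; substitute your provider's actual rates.
PRICE_PER_1K = {"small-model": 0.0005, "large-model": 0.01}

class TokenLedger:
    """Accumulates token usage and cost per feature so regressions are
    visible long before the monthly bill arrives."""

    def __init__(self):
        self.tokens = defaultdict(int)
        self.cost = defaultdict(float)

    def record(self, feature, model, prompt_tokens, completion_tokens):
        total = prompt_tokens + completion_tokens
        self.tokens[feature] += total
        self.cost[feature] += total / 1000 * PRICE_PER_1K[model]

ledger = TokenLedger()
ledger.record("support-chat", "large-model", 800, 200)
ledger.record("search-summary", "small-model", 400, 100)
print(ledger.tokens["support-chat"])          # 1000
print(round(ledger.cost["support-chat"], 4))  # 0.01
```

With per-feature buckets like this, a sudden jump in one feature's spend stands out the same day, instead of hiding inside the aggregate bill.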
Quality needs dedicated checks. Hallucinations, off-topic answers, unsafe content, and bias do not appear in Prometheus counters. You catch them with evaluations: simple rules, an LLM that scores outputs against your criteria, or periodic human review on sampled traffic. LLM observability is about monitoring what the model actually says, not just the transport around it.
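As a sketch of the "simple rules" tier, here is a rule-based check; the rule names and patterns are illustrative, and `judge_score` is a stub standing in for a real LLM-as-judge call:

```python
import re

# Illustrative rule names and patterns; real deployments would tune these
# to their own policies and risk areas.
RULES = {
    "mentions_refund_policy": re.compile(r"\brefund policy\b", re.I),
    "gives_medical_dosage": re.compile(r"\b\d+\s?mg\b", re.I),
}

def rule_flags(answer):
    """Return the names of every rule this answer trips."""
    return [name for name, pattern in RULES.items() if pattern.search(answer)]

def judge_score(answer):
    """Stub for an LLM-as-judge call that scores an answer against your
    criteria; here it simply passes anything the rules did not flag."""
    return 1.0 if not rule_flags(answer) else 0.0

print(rule_flags("Take 500mg twice daily."))  # ['gives_medical_dosage']
```

Rules catch the obvious, cheap-to-detect cases; the judge and periodic human review cover the nuanced ones the regexes cannot.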
Model behavior can also change underneath you. Cloud providers ship new versions, adjust safety filters, or retrain on new data. A prompt that was stable last week may start giving different answers because something upstream changed. If you are not tracking model versions and running basic regression checks, you only notice when users complain.
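A basic regression check can be as small as a set of golden prompts with recorded baseline answers. The prompt, baseline, and similarity threshold below are illustrative assumptions:

```python
from difflib import SequenceMatcher

# Golden answers recorded from a known-good model version (illustrative).
GOLDEN = {
    "What is your bereavement fare policy?":
        "We do not offer retroactive bereavement fares.",
}

def drifted(prompt, new_answer, threshold=0.6):
    """Flag a prompt whose current answer has drifted far from the
    recorded baseline, using simple string similarity."""
    baseline = GOLDEN[prompt]
    similarity = SequenceMatcher(None, baseline, new_answer).ratio()
    return similarity < threshold

same = "We do not offer retroactive bereavement fares."
print(drifted("What is your bereavement fare policy?", same))  # False
```

Running checks like this on a schedule, keyed by model version, turns "something upstream changed" from a user complaint into an alert.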
Most production AI applications are pipelines, not single calls. A request may go through retrieval, a model call, one or two tools, and post-processing. Problems can appear at any step: irrelevant retrieval, a tool that fails silently, or a bug in output handling. Monitoring has to see that chain, not just the outer HTTP request.
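One lightweight way to see that chain is to record a trace entry per pipeline step. This is a stdlib sketch, not a tracing library; the step names are hypothetical:

```python
import time
from contextlib import contextmanager

trace = []

@contextmanager
def step(name):
    """Record each pipeline stage's duration and outcome, so a silent
    failure in retrieval or a tool call is visible per step."""
    start = time.perf_counter()
    try:
        yield
        status = "ok"
    except Exception:
        status = "error"
        raise
    finally:
        trace.append({"step": name, "status": status,
                      "ms": (time.perf_counter() - start) * 1000})

with step("retrieval"):
    docs = ["policy.md"]          # stand-in for a real retrieval call
with step("model_call"):
    answer = "..."                # stand-in for a real model call

print([t["step"] for t in trace])  # ['retrieval', 'model_call']
```

In practice a tracing framework would emit these as spans, but even this flat list answers the key incident question: which step went wrong, and how long did each one take.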
For some AI systems, this level of monitoring is now a legal expectation. The EU AI Act treats uses such as credit scoring, hiring, medical diagnosis, and critical infrastructure as “high risk” and requires providers to run post-market monitoring and to report serious incidents within strict timelines. The law assumes you can collect, analyze, and explain how these systems behave in the field.
The United States has taken a softer route, but the message is similar. The NIST AI Risk Management Framework and its generative AI profile highlight continuous monitoring of performance, safety, and bias as part of normal operations. Financial and healthcare regulators are extending existing model risk rules to AI tools, which usually means keeping records, watching outcomes in production, and being ready to justify how decisions are made.
Data protection regulators also expect you to own what your AI does with personal data. GDPR enforcement has already produced multi-billion euro fines, and regulators will treat your AI stack as your responsibility, regardless of which vendor trained the underlying model.
To close the gap, you need monitoring that understands AI specifics but still fits into your existing operations. Start with logging that captures prompts, model versions, retrieved documents, tools used, and outputs, tied to the user or system that triggered each request. That turns "we saw something strange" into "this prompt and this retrieval step produced this answer".
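A minimal sketch of such a per-request record, with illustrative field names, could look like this:

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class LLMRequestLog:
    """One structured record per model call; field names are illustrative
    and would map onto your own schema."""
    user_id: str
    prompt: str
    model_version: str
    retrieved_docs: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    output: str = ""
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))

record = LLMRequestLog(
    user_id="u-123",
    prompt="What is the bereavement fare policy?",
    model_version="example-model-2025-01",
    retrieved_docs=["fares/bereavement.md"],
    output="Bereavement fares are not retroactive.",
)
print(json.dumps(asdict(record))[:60])
```

Serialized as JSON, each record can go into whatever log store you already run, and the `request_id` ties the prompt, retrieval, and answer back together during an incident.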
On top of this, add a simple evaluation loop. Take a sample of real traffic and score it for relevance, hallucinations, tone, and safety, using rules or an LLM as a judge. Turn those scores into metrics and alerts so quality issues show up on the same screens as latency and error rate.
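Deciding which requests to score can be deterministic rather than random, so the same request is always in or out of the sample and incidents are reproducible. A sketch, with an assumed 2% sample rate:

```python
import zlib

SAMPLE_RATE = 0.02  # score roughly 2% of live traffic

def sampled(request_id):
    """Hash-based sampling: a given request ID always lands in or out
    of the evaluation sample, which makes investigations repeatable."""
    return zlib.crc32(request_id.encode()) % 100 < SAMPLE_RATE * 100

scores = []
for i in range(1000):
    rid = f"req-{i}"
    if sampled(rid):
        scores.append(1.0)  # replace with a real relevance/safety score

print(0 < len(scores) < 100)  # roughly 2% of 1000 requests get scored
```

The scores collected this way become time-series metrics like any other, so a drop in relevance or safety alerts on the same dashboards as latency.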
Treat cost in the same structured way. Track token usage per request, per feature, and per model rather than just watching total spend. Set budgets and alerts for sudden jumps, and route simpler traffic to cheaper or smaller models when it does not hurt quality.
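A budget check and a simple routing rule can be sketched together; the threshold, model names, and prompt-length heuristic here are assumptions to be tuned against your own traffic:

```python
# Illustrative budget and model names; tune to your own traffic and spend.
DAILY_TOKEN_BUDGET = 5_000_000
CHEAP_MODEL, EXPENSIVE_MODEL = "small-model", "large-model"

def pick_model(prompt, tokens_used_today):
    """Route short prompts, or all traffic once the budget nears its
    limit, to the cheaper model."""
    if tokens_used_today > DAILY_TOKEN_BUDGET * 0.9:
        return CHEAP_MODEL
    return CHEAP_MODEL if len(prompt) < 200 else EXPENSIVE_MODEL

def budget_alert(tokens_used_today):
    """Fire when daily token usage exceeds the budget."""
    return tokens_used_today > DAILY_TOKEN_BUDGET

print(pick_model("Short question", 0))   # small-model
print(pick_model("x" * 500, 0))          # large-model
print(budget_alert(6_000_000))           # True
```

Prompt length is a crude routing signal; real systems might classify the request first, but the shape of the decision stays the same.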
Finally, write runbooks for AI-specific incidents. When hallucination rates spike, when a new model version shifts behavior, or when an evaluator flags bias for a particular workflow, the team should know who investigates, what gets rolled back, and what data to capture for audits.
Getting production monitoring right for AI systems means rethinking your observability stack. Here are the practical steps.
1. Set up comprehensive logging
- Save prompts, model versions, search results, tool calls, and outputs for every request
- Link everything to user or session IDs so you can reconstruct incidents
2. Run continuous quality checks
- Sample 1-2% of live requests and grade them for accuracy, tone, and safety
- Use simple rules for obvious problems, a lightweight model for nuanced cases
- Track scores over time and alert when they drop
3. Monitor costs and build runbooks
- Count tokens per request and feature, alert on spikes, and test routing strategies
- Document who handles hallucination versus retrieval failures, how to roll back models, and what tests to run before redeploying
4. Start simple and close the loop
- Use OpenTelemetry if you lack MLOps resources, as most LLM libraries already emit the right signals
- Let users flag bad responses directly in your app and feed those into your dashboard alongside latency and cost
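The user-feedback loop in step 4 can start as a rolling flag rate. This is a stdlib sketch with an assumed window size and alert threshold:

```python
from collections import deque

class FeedbackMonitor:
    """Rolling rate of user-flagged bad responses over the last N requests,
    so quality complaints surface next to latency and cost metrics."""

    def __init__(self, window=500, alert_rate=0.05):
        self.window = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, flagged):
        self.window.append(flagged)

    def flag_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def alerting(self):
        return self.flag_rate() > self.alert_rate

monitor = FeedbackMonitor(window=100, alert_rate=0.05)
for _ in range(95):
    monitor.record(False)
for _ in range(5):
    monitor.record(True)

print(monitor.flag_rate())  # 0.05
print(monitor.alerting())   # False: the rate must exceed the threshold
```

Exported as a gauge, this metric sits on the same dashboard as latency, so a spike in user flags reads exactly like a spike in error rate.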
AI systems fail differently from traditional software. Watching HTTP codes will not catch a model that invents policy details or ignores safety rules. Production AI monitoring requires logging prompts, running quality checks on live traffic, tracking token costs, and building incident processes for how LLMs actually break. Start with structured logging, add automated checks, and catch AI failures before your users do.