Generative AI June 2, 2026 12 min read

AI Agent Observability in Production: What to Instrument Before Launch

A production-grade observability checklist for AI agents — per-call traces, token cost, latency p50/p95/p99, refusal rates, tool-call success, eval drift, and the right tooling.

Manjeet Singh

Senior engineering team · Aiinfox

Agent observability is the single biggest difference between teams that ship AI and teams that keep promising to ship AI. The agents that survive production have telemetry on every call, every tool invocation, every refusal, and every cost-bearing decision. The agents that fail in production are the ones where the team learns about the regression from a customer complaint instead of an alert. The gap between those two states is one or two weeks of instrumentation work — and almost every engagement that does not do that work pays for it later in incident response.

We have taken over a dozen agent engagements after the original vendor walked away, and the pattern is consistent: no per-call traces, no token-cost tracking, no latency histogram, no refusal-rate dashboard, no tool-call success-rate breakdown. The system is a black box, and every prompt change is a leap of faith. Below is the observability checklist we instrument on every production agent — what to log, what to dashboard, what to alert on, and which tools we recommend for the stack.

1. Per-call audit traces — the structural foundation

Every agent invocation should produce a structured trace that captures the full call lifecycle: input prompt (with PII redacted), retrieval queries and results, tool calls and their responses, LLM input/output pairs at each step, refusal flags, and final output. Each span in the trace gets a timestamp, a duration, a cost, and a parent-child relationship so the full conversation tree is reconstructable.

  • Trace ID propagated end-to-end (request-in to response-out) so the full call is queryable by a single ID.
  • PII redaction at the trace-storage boundary, not the trace-emission boundary: the emitted span has to carry what the model actually saw, and redacting any earlier would lose that.
  • Tool-call payloads logged in full with parameters and responses — the most common debugging question is "what did the tool actually return on this call?"
  • Retrieval context logged with chunk IDs, similarity scores, and source document metadata — debugging RAG failures without this is guesswork.
  • Final output flagged for content-safety category, refusal type, citation completeness, format conformance.
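A minimal sketch of the trace structure described above, under stated assumptions: the `Span` fields, the `TraceStore` class, and the email-only `redact` helper are hypothetical names chosen for illustration, not the API of any particular tracing product. It shows a trace ID shared by every span, parent-child links, and redaction applied when the span is written to storage.

```python
import re
import time
import uuid
from dataclasses import dataclass, field, replace
from typing import Optional

# Hypothetical, deliberately simple pattern. Real PII redaction needs a proper detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Span:
    trace_id: str                     # shared by every span in one agent invocation
    name: str                         # e.g. "retrieval", "llm", "tool:crm"
    parent_id: Optional[str] = None   # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.time)
    duration_s: float = 0.0
    cost_usd: float = 0.0
    payload: dict = field(default_factory=dict)


def redact(value):
    """Recursively mask email addresses inside strings, dicts, and lists."""
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


class TraceStore:
    """In-memory stand-in for a trace backend."""

    def __init__(self):
        self._spans = []

    def write(self, span: Span) -> None:
        # Redaction happens here, at the storage boundary; the caller's span is untouched.
        self._spans.append(replace(span, payload=redact(span.payload)))

    def by_trace(self, trace_id: str):
        """Return every stored span for one trace ID, in write order."""
        return [s for s in self._spans if s.trace_id == trace_id]

    def children(self, span_id: str):
        """Return the direct children of a span, for rebuilding the call tree."""
        return [s for s in self._spans if s.parent_id == span_id]
```

Querying `by_trace` with a single ID returns the whole call, and walking `children` from the root span reconstructs the tree.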

2. Token cost tracking — per call, per category, per tenant

Token cost is the line item that surprises every CFO at month three of an AI deployment. Without per-call cost tracking, the team finds out about a 4x cost blow-out when the bill arrives, and tracing back which deployment, tenant, or prompt change caused it is forensic work. With per-call tracking, the cost regression is visible the moment it starts.

  • Input tokens, output tokens, prompt-cache hit/miss, and per-1k-token rate logged on every LLM call.
  • Dollar cost computed per call, aggregated daily by deployment, tenant, query category, and model.
  • Cost-per-conversation rolled up across multi-turn sessions — the per-call number can hide expensive conversations.
  • Alerts on day-over-day cost delta above a configured threshold (e.g., 25%) so cost regressions surface within 24 hours, not 30 days.
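The bullets above can be sketched roughly as follows. The model name and the per-1k-token rates in `RATES` are placeholders, not real provider prices, and the `LlmCall` record is a hypothetical shape for whatever the LLM client actually logs.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-1k-token rates in USD. Substitute your provider's real prices.
RATES = {
    "example-model": {"input": 0.003, "cached_input": 0.0003, "output": 0.015},
}


@dataclass
class LlmCall:
    day: str                   # e.g. "2026-06-02"
    tenant: str
    model: str
    input_tokens: int          # total prompt tokens, cached or not
    cached_input_tokens: int   # the part of input_tokens served from the prompt cache
    output_tokens: int


def call_cost(call: LlmCall) -> float:
    """Dollar cost of one call, pricing cached and uncached input separately."""
    rate = RATES[call.model]
    uncached = call.input_tokens - call.cached_input_tokens
    return (
        uncached * rate["input"]
        + call.cached_input_tokens * rate["cached_input"]
        + call.output_tokens * rate["output"]
    ) / 1000


def daily_cost(calls):
    """Aggregate dollar cost by (day, tenant)."""
    totals = defaultdict(float)
    for call in calls:
        totals[(call.day, call.tenant)] += call_cost(call)
    return dict(totals)


def cost_regressed(yesterday: float, today: float, threshold: float = 0.25) -> bool:
    """True when the day-over-day increase exceeds the threshold (25% by default)."""
    if yesterday <= 0:
        return today > 0
    return (today - yesterday) / yesterday > threshold
```

The same aggregation key can be extended with deployment, query category, or model; a conversation ID on `LlmCall` would give the cost-per-conversation rollup.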

Concrete example: on our [voice agent stack](/case-studies/voice-agent), per-call cost tracking surfaced a prompt-cache miss regression within four hours of a deployment that added a per-request user-context block to the prompt. Without the dashboard, the cache miss would have cost roughly $1,800/day before the next manual audit. With the dashboard, the regression was reverted before lunch.

3. Latency histograms — p50, p95, p99 at every span

Aggregate "average latency" is the dashboard metric that hides production problems. The honest measurement is a histogram with p50, p95, and p99, and the same histogram broken down per span (STT, retrieval, LLM, tool call, TTS) so the slow span is visible. A 1.2-second average latency with a 4-second p99 means 1% of requests take four seconds or longer, and that slow 1% is where the most damaging complaints come from.

  • Per-span latency histograms (retrieval, LLM first-token, LLM full-response, tool call, downstream API).
  • Per-tenant or per-deployment slicing: one customer's slow Postgres can masquerade as a global regression.
  • Cold-start latency tracked separately from steady-state — a 600ms cold start hides in averages but is the first thing a new user sees.
  • Latency-vs-cost scatter plot for prompt iteration — the cheapest prompt is sometimes the slowest, the fastest is sometimes the most expensive, and the curve has a knee that is worth finding.
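As a rough illustration of per-span percentiles, the sketch below uses the nearest-rank definition on raw samples. That choice is an assumption for clarity; a production metrics backend would normally compute percentiles from histogram buckets instead. The function names are hypothetical.

```python
from collections import defaultdict


def percentile(values, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def span_latency_summary(samples):
    """samples: iterable of (span_name, latency_seconds) pairs, one per recorded span."""
    by_span = defaultdict(list)
    for name, latency in samples:
        by_span[name].append(latency)
    return {
        name: {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
        for name, latencies in by_span.items()
    }
```

Feeding the summary one sample list per tenant, or one for cold starts and one for steady state, gives the slicing described in the bullets.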

4. Refusal-rate dashboards — by category and severity

Refusals are not failures. A refusal is the system saying "I do not have grounded evidence for an answer, so I am declining." A healthy agent has a measured refusal rate. An agent with 0% refusals is almost certainly answering some questions it has no grounding for; an agent with 50% refusals is over-suppressing. The right rate depends on the workload, but it should be measured and tracked.

Refusals should be categorized: out-of-scope (the user asked something the agent is not designed for), insufficient retrieval (the retrieval returned no high-quality matches), safety-critical (the query triggered a safety-policy refusal), and ambiguous (the user's query is ambiguous and the agent asked for clarification). Each category has different operational implications.
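One way to make those four categories measurable is to tag each call's outcome and compute a rate per category. The enum and function names below are illustrative assumptions, not a standard taxonomy from any library.

```python
from collections import Counter
from enum import Enum


class Refusal(Enum):
    OUT_OF_SCOPE = "out_of_scope"
    INSUFFICIENT_RETRIEVAL = "insufficient_retrieval"
    SAFETY_CRITICAL = "safety_critical"
    AMBIGUOUS = "ambiguous"


def refusal_rates(outcomes):
    """outcomes: one entry per call, a Refusal member or None when the agent answered.

    Returns each category's share of all calls, plus the overall refusal rate.
    """
    total = len(outcomes)
    counts = Counter(o for o in outcomes if o is not None)
    rates = {r.value: (counts[r] / total if total else 0.0) for r in Refusal}
    rates["overall"] = sum(counts.values()) / total if total else 0.0
    return rates
```

Tracking the categories separately matters because a rise in `insufficient_retrieval` points at the index, while a rise in `out_of_scope` points at how users are being routed to the agent.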

5. Tool-call success rates — the silent failure mode

Agentic systems that call tools (CRM lookups, payment APIs, calendar bookings, ticket creation) have a class of failure most prompt-only systems do not: the tool call succeeded at the protocol level but did not accomplish what the user asked for. The booking landed in the wrong calendar, the ticket was created with the wrong priority, the lookup returned stale data. Without per-tool-call success metrics, these failures are invisible until users complain.

  • Per-tool invocation logged with parameters, response, success/failure flag, and downstream system reference (booking ID, ticket ID, payment intent ID).
  • Tool-call retry count tracked — repeated retries on the same call indicate a downstream system degradation that the agent is masking from the user.
  • Tool-call success rate broken down by tool: a 96% success rate on the calendar tool and a 73% success rate on the CRM tool are two different problems.
  • Sampling of tool-call traces for human review — a 2% sample, reviewed weekly, catches semantic failures that protocol-level success metrics miss.
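The per-tool breakdown and the review sample can be sketched as below. The record shape (a dict with `tool` and `ok` keys) is an assumption for illustration; the `ok` flag only captures protocol-level success, which is exactly why the random sample for human review exists alongside it.

```python
import random
from collections import defaultdict


def success_rate_by_tool(calls):
    """calls: iterable of dicts with a 'tool' name and a boolean 'ok' flag."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for call in calls:
        total[call["tool"]] += 1
        if call["ok"]:
            ok[call["tool"]] += 1
    return {tool: ok[tool] / total[tool] for tool in total}


def sample_for_review(calls, fraction=0.02, seed=None):
    """Keep each call independently with probability `fraction` (2% by default)."""
    rng = random.Random(seed)
    return [call for call in calls if rng.random() < fraction]
```

Because each call is sampled independently, the sample size varies from week to week around 2% of traffic; a fixed-size sample per tool is a reasonable alternative when some tools are rarely called.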

6. Eval-vs-production drift detection

The eval set passes in CI. The system goes to production. Three weeks later, the production accuracy is 6 percentage points lower than the eval set predicted. The eval set is now lying to the team, and nobody knows yet. This is eval-vs-production drift, and it is one of the most insidious failure modes in shipped AI.

The fix is automated production sampling. Take a 1-5% sample of production traffic, run it through the eval-set rubric (LLM-as-judge or human review for a sub-sample), and compare the production-sample accuracy against the golden-set accuracy. When they diverge by more than the noise threshold, the eval set needs expanding to cover the categories production traffic is hitting. See our [LLM eval harness 101 post](/blog/llm-eval-harness-101) for the eval-set design pattern.
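A small sketch of the comparison step, assuming the production sample has already been graded. The "noise threshold" here is one possible choice, not the only one: roughly two standard errors of the difference between two proportions. The function name and return shape are hypothetical.

```python
import math


def drift_check(golden_correct, golden_total, prod_correct, prod_total, z=2.0):
    """Compare golden-set accuracy with graded production-sample accuracy.

    The gap counts as drift when it exceeds z standard errors of the
    difference between the two proportions (an assumed noise model).
    """
    golden_acc = golden_correct / golden_total
    prod_acc = prod_correct / prod_total
    std_err = math.sqrt(
        golden_acc * (1 - golden_acc) / golden_total
        + prod_acc * (1 - prod_acc) / prod_total
    )
    gap = golden_acc - prod_acc
    return {
        "golden_accuracy": golden_acc,
        "production_accuracy": prod_acc,
        "gap": gap,
        "noise_threshold": z * std_err,
        "drifted": abs(gap) > z * std_err,
    }
```

With 500 golden examples at 92% and 1,000 graded production samples at 86%, the 6-point gap sits well outside the threshold. The same 6-point gap on 100 and 50 examples does not, which is a reminder that small samples cannot distinguish drift from noise.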

7. Cost vs accuracy curves — for prompt and model iteration

Every prompt iteration and model swap shifts the agent on a cost-vs-accuracy curve. Without instrumented observability, teams iterate blind: they swap from a 7B to a 70B model and notice the accuracy went up, without noticing the cost tripled. Or they tune a prompt down for cost and notice the cost dropped without realizing accuracy regressed on a category they were not testing.

A production observability stack lets the team plot every iteration on the curve: x-axis is per-call cost, y-axis is eval-set accuracy, each point is a deployment. The dominated points (lower accuracy and higher cost than another point) are easy to discard. The Pareto-frontier points are the candidates worth running production A/B on.
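Discarding dominated deployments is a few lines of code. This sketch treats a point as dominated when another point is at least as cheap and at least as accurate, and strictly better on one of the two, which is slightly broader than the "lower accuracy and higher cost" wording above because it also discards ties on one axis. The tuple layout is an assumption.

```python
def pareto_frontier(deployments):
    """deployments: list of (name, cost_per_call, accuracy) tuples.

    Returns the names of the non-dominated deployments, in input order.
    """
    frontier = []
    for name, cost, accuracy in deployments:
        dominated = any(
            other_cost <= cost
            and other_acc >= accuracy
            and (other_cost < cost or other_acc > accuracy)
            for _, other_cost, other_acc in deployments
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

For example, with deployments at (cost 0.01, accuracy 0.80), (0.03, 0.90), (0.04, 0.85), and (0.02, 0.80), only the first two survive: the third costs more than the second for less accuracy, and the fourth costs more than the first for the same accuracy.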

8. Per-tenant or per-customer slicing

Multi-tenant agents — most enterprise deployments — fail differently per tenant. One customer's data corpus is well-structured and yields 96% retrieval accuracy; another's is messy and yields 72%. Without per-tenant slicing on the dashboards, the global average looks healthy while one customer's experience is degraded. With per-tenant slicing, the tenant-specific problem surfaces immediately.
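To illustrate how a healthy global average can hide one degraded tenant, here is a minimal slicing sketch. The `(tenant, hit)` record shape and the 80% floor are assumptions for the example.

```python
from collections import defaultdict


def accuracy_by_tenant(results):
    """results: iterable of (tenant, hit) pairs, where hit is True for a correct retrieval."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tenant, hit in results:
        totals[tenant] += 1
        if hit:
            hits[tenant] += 1
    return {tenant: hits[tenant] / totals[tenant] for tenant in totals}


def degraded_tenants(results, floor=0.80):
    """Tenants whose accuracy falls below the floor, sorted by name."""
    return sorted(t for t, acc in accuracy_by_tenant(results).items() if acc < floor)
```

With 100 calls at 96% for one tenant and 25 calls at 72% for another, the blended accuracy is still about 91%, so only the per-tenant view flags the second customer.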

This is especially important for any agent shipped to a regulated environment — see our [HIPAA AI deployment checklist](/blog/hipaa-ai-deployment-checklist) and [UK GDPR AI development guide](/blog/uk-gdpr-ai-development-practical-guide-2026) for the audit-trail requirements that per-tenant logging needs to satisfy.

9. Alerting — what to page on, what to dashboard

Not every metric warrants a page. The discipline is to alert on metrics that require human intervention within an hour, and to dashboard everything else. Pages that fire on metrics that resolve themselves train the on-call to ignore the page, which kills the entire alerting system.

  • Page: p95 latency above the contract threshold sustained for 5 minutes (user-visible degradation).
  • Page: refusal rate above 2x the baseline sustained for 15 minutes (retrieval or model regression).
  • Page: tool-call success rate below 90% on any single tool sustained for 15 minutes (downstream failure).
  • Page: cost-per-call above 2x the baseline sustained for 30 minutes (cache miss or prompt regression).
  • Dashboard only: accuracy drift, category-level eval scores, conversation-length distribution, cold-start latency.
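The "sustained for N minutes" condition in the paging rules above is what keeps self-resolving blips from paging anyone. A hedged sketch, assuming one metric reading per minute and a hypothetical `Rule` record; real alerting systems express the same idea in their own rule syntax.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    threshold: float
    sustain_minutes: int
    above: bool = True   # False means "page when the metric stays below the threshold"


def should_page(rule: Rule, per_minute_values) -> bool:
    """per_minute_values: one reading per minute, most recent last.

    Pages only when every reading in the trailing window breaches the threshold.
    """
    window = per_minute_values[-rule.sustain_minutes:]
    if len(window) < rule.sustain_minutes:
        return False  # not enough history yet to call the breach sustained
    if rule.above:
        return all(value > rule.threshold for value in window)
    return all(value < rule.threshold for value in window)
```

A p95 rule would be `Rule("p95_latency_s", 2.0, 5)` for a hypothetical 2-second contract threshold, and the tool-call rule would be `Rule("crm_success_rate", 0.90, 15, above=False)`.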

10. The handoff requirement — observability the customer can operate

Observability that only the vendor can read is observability the customer does not own. On every engagement, the dashboards, alerts, and traces should be in a system the customer's team can log into, query, and modify after the vendor leaves. This is part of the takeover audit checklist — see our [vendor takeover audit signs](/blog/ai-vendor-takeover-audit-signs) post for the full delivery contract.

Practically: the observability stack runs in the customer's cloud account (or a shared account with explicit access), the dashboards are exported as code (Grafana JSON, Braintrust project export), the alerting routes to the customer's on-call rotation, and the runbook explains what each alert means and how to triage it.

Wrapping up — observability is the engagement

An agent without observability is a science experiment that happens to run in production. Observability is what turns the experiment into a system the team can operate, debug, and improve over time. The instrumentation cost is one to two weeks of engineering at the start of the engagement. The cost of not instrumenting is paid later, in months of incident response and lost customer trust.

If you are shipping an agentic system and want a 30-minute call to review your observability instrumentation against the 10-point checklist above — [book a discovery call](/contact-us). We will return a written assessment inside 72 hours: what is missing, what to instrument first, and which platform fits the workload. The same artifact we deliver on every production engagement.

Tagged: AI agent observability, LLM observability, agent tracing, Braintrust Langfuse Phoenix, token cost tracking, agent latency monitoring