AI Agent Observability: How to Instrument and Monitor Agents in Production
AI agent observability closes the gap between what your agents do and what you can see. A practical guide to distributed tracing, metrics, and tooling.
AI agent observability is the practice of capturing the full decision trail of an autonomous agent as it runs in production: every LLM call, every tool invocation, every retrieval step, every branch in its reasoning path — recorded as structured telemetry you can query, alert on, and replay. Standard infrastructure monitoring tells you a request succeeded or failed. Agent observability tells you why — which tool the agent chose, what the model inferred at each step, how much that run cost in tokens, and whether the reasoning chain would hold up to scrutiny. Without it, you are operating AI systems blind: you see outputs but cannot diagnose wrong answers, cost spikes, latency regressions, or safety violations. Our guide to LLMOps in enterprise production covers the full deployment lifecycle; this post drills into the observability layer specifically.
The adoption gap makes this urgent. A 2026 survey found that 73% of enterprise engineering teams now require AI agent monitoring in production as a compliance or operational risk requirement — yet only about 15% of GenAI deployments are actually instrumented with meaningful observability today. The gap exists because agent observability is more complex than infrastructure monitoring: agents are stateful, non-deterministic, multi-step, and expensive. Failures are semantic as often as they are operational. A failed HTTP call appears in your error budget; an agent that confidently fabricates an answer while completing the request does not.
This guide covers the full observability stack for production AI agents: the three core pillars, how to instrument using OpenTelemetry GenAI conventions, how to choose between Langfuse, Arize Phoenix, and Datadog LLM Observability, how to design SLOs for agents, and how multi-agent and MCP workflows change the tracing model. It builds on our guide to OpenTelemetry for enterprise infrastructure observability, which covers the infrastructure layer — this post covers the agent-specific telemetry that sits on top.
What Is AI Agent Observability and Why Is It Different from Infrastructure Monitoring?
Traditional application observability focuses on infrastructure-level signals: latency, error rate, throughput, CPU and memory usage. These metrics tell you whether a system is healthy at the operational level. AI agent observability adds three dimensions that infrastructure monitoring fundamentally cannot capture.
- →Reasoning path visibility: a multi-step agent completes a user request through a chain of inferences, tool calls, and retrieval operations. Each step can succeed operationally while failing semantically — an agent that calls the right API but misinterprets the response will return a plausible-looking wrong answer. Tracing the full reasoning chain lets you identify exactly where the semantic failure occurred.
- →Cost and token accounting: each LLM call has a token cost that compounds across multi-step agentic workflows. Without per-span token tracking, cost spikes are invisible until they appear on your cloud bill. Agent observability instruments token usage as a first-class metric, making cost optimization data-driven rather than guesswork.
- →Non-determinism and drift: the same agent prompt with the same input can produce different outputs across runs, model versions, and as model behavior evolves over time. Infrastructure monitoring has no concept of this. Agent observability captures quality signals — relevance scores, factuality evals, latency distributions — and alerts when they shift.
The Three Pillars of Production Agent Observability
Enterprise agent observability stacks organize around three functional pillars. Most teams implement the first pillar within weeks of deploying agents; the second and third follow as the team moves from confirming basic function to operating at scale with quality and cost accountability.
- →Distributed tracing: every LLM call, tool invocation, retrieval, and reasoning step is recorded as a span with structured attributes — model name, token counts, latency, input and output, tool parameters, success or failure. Spans link into trace hierarchies that reconstruct the complete execution path of any agent run from start to finish.
- →Evaluation and scoring: automated quality scoring runs either inline during execution or offline against sampled production traces. This covers LLM-as-judge relevance scoring, factuality checks against ground truth, toxicity filters, and latency and cost SLO compliance. Evaluation closes the loop from confirming the agent ran to confirming the agent ran correctly.
- →Alerting, dashboards, and incident response: observability is only useful if it triggers action. Agent observability feeds structured telemetry into your alerting layer — Datadog monitors, PagerDuty policies, Slack notifications — so that quality regressions, cost spikes, and latency violations open incidents the same way infrastructure failures do.
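To make the first pillar concrete, here is a minimal sketch of how flat span records link into a trace hierarchy. It uses only the Python standard library. The `Span` class, the span names, and the attribute values are illustrative assumptions, not the data model of any particular backend.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the run
    name: str
    attributes: dict = field(default_factory=dict)


def build_tree(spans):
    """Group spans by parent so the run can be walked from the root down."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children


def render(span, children, depth=0):
    """Return the execution path as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in children[span.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


# Hypothetical spans from one agent run, in arbitrary arrival order.
spans = [
    Span("c", "a", "execute_tool search_orders", {"tool.success": True}),
    Span("a", None, "invoke_agent support_agent"),
    Span("b", "a", "chat example-model", {"gen_ai.usage.input_tokens": 812}),
    Span("d", "c", "retrieval orders_index"),
]

root, children = build_tree(spans)
print("\n".join(render(root, children)))
```

A real backend would also order siblings by start time. The sketch keeps arrival order to stay short.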
How to Instrument AI Agents with OpenTelemetry GenAI Conventions
OpenTelemetry's GenAI Semantic Conventions are the vendor-neutral standard for AI agent telemetry. Published in stable form in 2026, they define span operations for LLM calls (chat), agent invocations (invoke_agent), and tool executions (execute_tool), each identified by the gen_ai.operation.name attribute, along with a standardized attribute vocabulary covering token usage, model identity, finish reason, and agent metadata. Major vendor backends (Datadog, Honeycomb, New Relic) support these conventions, and widely used frameworks and SDKs such as LangChain, CrewAI, LlamaIndex, and the Anthropic SDK are covered natively or through auto-instrumentation packages.
- →Install OpenLLMetry alongside your existing OpenTelemetry SDK. OpenLLMetry provides auto-instrumentation for OpenAI, Anthropic, Bedrock, LangChain, LlamaIndex, and CrewAI with near-zero code changes. It emits spans conforming to the GenAI semantic conventions and routes them to whatever OTel backend you already run.
- →Wrap agentic workflows in parent spans. For each user-initiated agent run, open a root span at the entry point and let all LLM calls, tool invocations, and retrieval steps emit as child spans. Without a root span, traces fragment into disconnected unit-level spans with no way to reconstruct the full workflow.
- →Propagate W3C TraceContext across every service boundary. Multi-agent architectures and MCP-enabled workflows cross process, service, and network boundaries. Pass trace context through every hop so that sub-agent invocations, tool calls, and retrieval service calls all appear as children of the originating agent run.
- →Record token usage at every LLM span. The GenAI conventions define gen_ai.usage.input_tokens and gen_ai.usage.output_tokens for this (earlier drafts named them gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens, which are now deprecated). Capture them at the individual span level, not aggregated at the workflow level, so you can correlate token cost to specific steps in the reasoning chain.
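The root-span and per-span token steps can be sketched without any tracing library, which makes the mechanics easier to see. The example below is a standard-library stand-in, not the OpenTelemetry API. The `span` helper, the `FINISHED` list, `fake_llm_call`, and the tool and model names are all hypothetical. It uses the gen_ai.usage.input_tokens and gen_ai.usage.output_tokens attribute names (the current names for what older instrumentation emits as prompt and completion tokens).

```python
import contextvars
import uuid
from contextlib import contextmanager

# Tracks the span that is currently open in this execution context.
_current = contextvars.ContextVar("current_span", default=None)
FINISHED = []  # stand-in for an exporter; a real setup ships spans to a backend


@contextmanager
def span(name, **attributes):
    """Open a span whose parent is whichever span is already active."""
    parent = _current.get()
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": dict(attributes),
    }
    token = _current.set(record)
    try:
        yield record
    finally:
        _current.reset(token)
        FINISHED.append(record)


def fake_llm_call(prompt):
    """Hypothetical model call returning text plus token counts."""
    return {"text": "ok", "input_tokens": len(prompt.split()), "output_tokens": 1}


def run_agent(question):
    # Root span: one per user-initiated run, so every step shares a trace_id.
    with span("invoke_agent support_agent"):
        with span("chat example-model") as llm_span:
            reply = fake_llm_call(question)
            # Token usage is recorded on the individual LLM span.
            llm_span["attributes"]["gen_ai.usage.input_tokens"] = reply["input_tokens"]
            llm_span["attributes"]["gen_ai.usage.output_tokens"] = reply["output_tokens"]
        with span("execute_tool lookup_order", order_id="A-1"):
            pass  # the tool body would run here
    return reply["text"]


run_agent("where is my order")
for s in FINISHED:
    print(s["name"], s["parent_id"], s["attributes"])
```

With the OpenTelemetry Python API, the same nesting comes from `tracer.start_as_current_span(...)` and the attributes from `span.set_attribute(...)`. Auto-instrumentation packages fill in the LLM spans so that only the root span needs to be written by hand.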
Choosing an Observability Backend: Langfuse, Arize Phoenix, and Datadog
Most production teams run a two-layer observability stack: a specialized LLM and agent observability platform for trace visualization, evaluation, and prompt management, paired with the broader infrastructure observability layer for cross-system alerting and SLO dashboards. The right primary platform depends on whether your team's priority is self-hosted control, evaluation depth, or enterprise feature parity.
- →Langfuse: open-source, self-hostable on Postgres and ClickHouse, framework-agnostic via OpenTelemetry or its native SDK, strong on prompt versioning, dataset management, and cost tracking. Best for teams that require full data residency control, operate under strict compliance requirements, or want LLM observability without sending production data to a SaaS vendor.
- →Arize Phoenix: open-source, OpenTelemetry-native with OpenInference semantic conventions, strong on LLM-as-judge evaluations, behavioral drift detection, and bias analysis. Best for teams that prioritize deep evaluation capabilities or are already invested in the Arize ML observability ecosystem.
- →Datadog LLM Observability: enterprise SaaS, native integration with the existing Datadog APM and infrastructure stack, now fully supports OpenTelemetry GenAI semantic conventions. Best for teams already on Datadog that want a unified view across agent behavior and the underlying infrastructure.
- →OpenLLMetry (instrumentation-only): if you own your full observability stack and route telemetry to Grafana, Jaeger, or another self-hosted backend, OpenLLMetry provides the instrumentation layer without coupling you to any SaaS platform. Use this when your organization has an existing OTel infrastructure investment and wants to minimize vendor lock-in.
Designing SLOs and Alerts for AI Agents
Service-level objectives for AI agents require a different model than for deterministic APIs. A traditional SLO measures latency and error rate: infrastructure-bound signals that are checked against a simple pass/fail threshold. Agent SLOs must account for quality, cost, and non-deterministic variance alongside operational signals. Defining these SLOs before your agents reach production scale is what separates teams that operate AI responsibly from those that discover problems through user complaints.
- →Define per-workflow latency budgets: agent workflows that span multiple LLM calls, tool invocations, and retrieval steps have compound latency. Set p95 and p99 latency targets at the workflow level — not just the individual LLM call level — and alert when multi-step traces breach the budget.
- →Track quality regression with automated evaluation: define 2-3 LLM-as-judge checks relevant to your use case — response relevance, factuality, safety filter pass rate — and run them offline against a daily sample of production traces. Alert when quality scores drop more than a defined threshold from the trailing 7-day average.
- →Cost SLOs prevent budget blowouts: define a maximum acceptable token spend per workflow invocation and per 24-hour period. Alert at 80% of budget. Uncontrolled agentic loops in production can exhaust API credit limits in hours. Our post on controlling agentic AI costs at enterprise scale covers the cost governance framework that token-level observability feeds into.
- →Track tool failure rates per tool: in multi-tool agents, each tool invocation has its own reliability profile. A tool that fails silently — returning a partial or empty response — can cause the agent to hallucinate rather than report the failure. Per-tool failure rate metrics let you identify unreliable tools before they corrupt downstream agent behavior.
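The four checks above reduce to small, testable functions. The following is a minimal sketch using only the standard library. The function names, the nearest-rank percentile method, and every threshold in the example data are assumptions chosen for illustration, so tune them to your own workflows.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_breached(workflow_latencies_s, p95_budget_s):
    """True when the workflow-level p95 exceeds its budget."""
    return percentile(workflow_latencies_s, 95) > p95_budget_s


def cost_alert(tokens_spent, token_budget, warn_fraction=0.8):
    """Return 'ok', 'warn' (at 80% of budget by default), or 'breach'."""
    if tokens_spent >= token_budget:
        return "breach"
    if tokens_spent >= warn_fraction * token_budget:
        return "warn"
    return "ok"


def quality_regressed(today_score, trailing_scores, max_drop):
    """True when today's score is more than max_drop below the trailing mean."""
    return mean(trailing_scores) - today_score > max_drop


def tool_failure_rates(calls):
    """calls: iterable of (tool_name, succeeded). Returns failure rate per tool."""
    totals, failures = {}, {}
    for tool, ok in calls:
        totals[tool] = totals.get(tool, 0) + 1
        failures[tool] = failures.get(tool, 0) + (0 if ok else 1)
    return {tool: failures[tool] / totals[tool] for tool in totals}


# Hypothetical data for one workflow type.
latencies = [2.1, 2.4, 3.0, 2.2, 9.5, 2.8, 2.6, 3.1, 2.9, 2.3]
trailing_7_days = [0.90, 0.91, 0.89, 0.92, 0.90, 0.88, 0.90]
calls = [("search", True), ("search", False), ("search", True),
         ("search", True), ("crm", True)]

print(latency_breached(latencies, p95_budget_s=8.0))
print(cost_alert(tokens_spent=850_000, token_budget=1_000_000))
print(quality_regressed(0.78, trailing_7_days, max_drop=0.05))
print(tool_failure_rates(calls))
```

In practice these functions would run over spans queried from your observability backend, with the results routed to the same alerting layer as infrastructure monitors.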
Observability in Multi-Agent and MCP Workflows
Single-agent architectures present a tractable observability problem: one entry point, one reasoning chain, one trace. Multi-agent architectures compound this: an orchestrator agent spawns sub-agents, each of which invokes its own tools, calls LLMs independently, and produces results that feed back into the orchestrator's reasoning. Without deliberate trace propagation, each sub-agent's work appears as a disconnected trace, and reconstructing the full workflow from production data becomes intractable.
MCP-enabled workflows add another dimension: tool calls cross process and service boundaries to MCP servers that may themselves call LLMs or spawn additional operations. Maintaining trace continuity through MCP requires W3C TraceContext propagation in every MCP client and server hop — which is not yet default behavior in all MCP client libraries and must be implemented explicitly. Our guide to multi-agent AI orchestration patterns in production covers the architectural patterns; this layer adds the observability requirement that every inter-agent communication carries trace context.
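Explicit propagation is less work than it sounds, because the W3C `traceparent` value is a single string of the form `version-traceid-parentid-flags`. The sketch below builds and parses it with the standard library only. The `inject` and `extract` helpers are hypothetical names, and `carrier` is a plain dict standing in for HTTP headers or whatever request metadata field your MCP client and server agree on. How that metadata travels over a given MCP transport is an assumption left to your implementation.

```python
import re
import secrets

# Version 00: 32 hex chars of trace id, 16 of parent span id, 2 of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    return secrets.token_hex(8)  # 16 hex characters


def inject(trace_id, span_id, carrier, sampled=True):
    """Write the caller's context into outgoing request metadata."""
    flags = "01" if sampled else "00"
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return carrier


def extract(carrier):
    """Read the caller's context on the receiving side; None if absent or malformed."""
    match = _TRACEPARENT.match(carrier.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero ids are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Orchestrator side: attach context before calling a sub-agent or MCP server.
trace_id = secrets.token_hex(16)
orchestrator_span = new_span_id()
headers = inject(trace_id, orchestrator_span, {})

# Sub-agent side: continue the same trace as a child of the orchestrator span.
ctx = extract(headers)
child_span = {
    "trace_id": ctx["trace_id"],
    "parent_id": ctx["parent_id"],
    "span_id": new_span_id(),
}
print(headers["traceparent"], child_span)
```

If `extract` returns `None`, the receiving side should start a new trace rather than fail the request, which keeps a missing header from turning into an outage.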
Common Observability Mistakes That Leave Agent Behavior Invisible
Teams moving AI agents to production repeat the same instrumentation omissions. Each one creates a specific blind spot that surfaces as a mystery incident.
- →Instrumenting only the LLM call, not the workflow: teams add a trace SDK call around the model invocation but do not wrap the full agentic workflow in a parent span. The result is disconnected LLM call records with no way to reconstruct which step in a multi-step workflow produced a given output.
- →Not recording tool inputs and outputs: tool parameters and return values are the most diagnostic attributes in an agent trace. An agent that retrieves the wrong data and then reasons correctly from it looks like a correct run at the LLM span level. Recording tool inputs and outputs is the only way to surface this class of failure.
- →Aggressive sampling from day one: observability teams often default to aggressive probabilistic (head-based) sampling to control telemetry volume. For AI agents, that discards traces at random before their outcome is known, including the rare long-tail runs that contain the most interesting failures. Start at 100% sampling for the first 30 days, then calibrate rates after you understand your trace volume and failure distribution.
- →Treating observability as a post-deployment activity: instrumentation added after an agent reaches production requires a code change, a redeployment, and a maintenance window — exactly when you most need visibility. Instrument during development, validate traces in staging, and enter production with observability already operational.
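The second mistake, missing tool inputs and outputs, is usually the cheapest to fix, because one decorator can cover every tool. Here is a minimal standard-library sketch. The `traced_tool` decorator, the `TOOL_SPANS` list, the "empty" status, and the `search_orders` tool are hypothetical names for illustration, not part of any SDK.

```python
import functools
import time

TOOL_SPANS = []  # stand-in for an exporter


def traced_tool(func):
    """Record arguments, return value, duration, and outcome for every tool call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {
            "name": f"execute_tool {func.__name__}",
            "input": {"args": args, "kwargs": kwargs},
        }
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            record.update(status="error", error=repr(exc))
            raise
        else:
            # An empty result is flagged rather than counted as success, so
            # silent failures show up in per-tool metrics.
            is_empty = result in (None, "", [], {})
            record.update(status="empty" if is_empty else "ok", output=result)
            return result
        finally:
            record["duration_s"] = time.perf_counter() - start
            TOOL_SPANS.append(record)
    return wrapper


@traced_tool
def search_orders(customer_id):
    """Hypothetical tool that finds nothing for unknown customers."""
    return [] if customer_id == "unknown" else [{"order": "A-1"}]


search_orders("c-42")
search_orders("unknown")
print([(s["name"], s["status"]) for s in TOOL_SPANS])
```

Separating "empty" from "ok" is a design choice: a tool that returns nothing without raising is exactly the case where the agent may fill the gap with a fabricated answer.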
Frequently Asked Questions
What is the difference between AI agent observability and LLM evaluation?
Observability captures what your agent actually did in production — traces, spans, token counts, latency, tool invocations. Evaluation scores whether those actions were correct, relevant, or safe. Observability is the data collection layer; evaluation is the analysis layer that runs on top of it. You cannot do useful evaluation without observability infrastructure, but observability alone does not tell you whether the agent's behavior was acceptable.
Does OpenTelemetry work for AI agent observability?
Yes. OpenTelemetry's GenAI Semantic Conventions, published in stable form in 2026, define standardized span types and attributes for LLM calls, agent invocations, and tool executions. Major vendor backends including Datadog, Honeycomb, and New Relic support these conventions. OpenLLMetry provides auto-instrumentation that emits GenAI-convention-compliant spans for OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI with minimal code changes.
How do I maintain trace continuity across multi-agent and MCP workflows?
Use W3C TraceContext propagation at every service and process boundary. Each sub-agent invocation and every MCP tool call should receive the parent trace context, carried as the traceparent and tracestate HTTP headers or as equivalent request metadata on non-HTTP transports, so that its spans attach as children of the originating agent run. Without explicit context propagation, multi-agent workflows produce disconnected trace fragments that cannot be correlated after the fact.
What SLOs should I define for AI agents in production?
Define at minimum: a p95 and p99 latency target at the workflow level, a daily token spend budget per workflow type, a quality regression alert threshold based on automated LLM-as-judge scoring, and a per-tool failure rate threshold for each tool the agent invokes. These four SLOs cover the operational, cost, quality, and reliability dimensions of agent behavior in production.
Which observability platform should I choose for AI agents?
Choose Langfuse if you require self-hosted data residency or have strict compliance requirements. Choose Arize Phoenix if your team prioritizes deep behavioral evaluation and drift detection. Choose Datadog LLM Observability if you are already on Datadog and want unified infrastructure and agent observability on one platform. All three support OpenTelemetry, which means migrating between backends without re-instrumenting is feasible if your requirements change.
How Belsoft Approaches AI Agent Observability
Belsoft designs and implements production-grade AI agent systems where observability is instrumented from day one, not retrofitted after an incident. We wire OpenTelemetry GenAI tracing, evaluation pipelines, and SLO alerting into every agent deployment we deliver — so engineering teams can operate AI agents with the same operational confidence as any other production system. If your team is deploying AI agents without a clear observability strategy, talk to us before your first unexplained incident.
We work with engineering teams across SaaS, enterprise software, and AI product development to build agent systems that are observable, cost-controlled, and production-safe. Explore our AI & Automation services to see how we approach agent implementation end to end.
“You cannot fix what you cannot see. Instrument your agents before they reach production, not after your first unexplained incident.”
Written by Belsoft Team