AI Agent Observability
Your agent failed in production. Here's the black box.
Applied AI Engineer / AI Solutions Architect / Forward Deployed Engineer — agents fail in production at 3 AM, and current tooling (Datadog, LangSmith, n8n run history) can't tell you which tool call consumed the budget, what the prompt looked like at the exact failing step, or how it differed from the last run that worked.
Nobody renders an agent run as a clickable, replayable, diff-able timeline the way Sentry renders a stacktrace. Infrastructure monitoring knows the Lambda timed out; it doesn't know the prompt drifted 4,400 tokens between Monday and Wednesday. That's the white space.
A wall of JSON. The failure message says context_length_exceeded. Nobody knows why tonight's run is 4,000 tokens heavier than Monday's. The postmortem is a meeting.
One click opens the run inspector. The tool-call tree draws itself. The failing node glows amber. "Compare to last success" splits the prompt diff — a single 4,390-token compliance_policy_v3 → v4 injection added Tuesday. Postmortem done in three clicks.
context_overflow, guardrail_reject, tool_timeout: you name these only after you've been paged for them.run_8f2a1c · contract-review-agent · 03:14 · FAILED. Click the red node. Click "Compare to last run that worked." The diff shows a single contiguous 4,390-token block added at inject_context. That's the regression. Three clicks, full postmortem.Static Astro site. All data is synthetic and pre-recorded — no backend, no API keys, no live model calls. The "edit & replay" affordance routes to a bundled fixture, not a live model. Diff rendering via diff-match-patch (4 kb gzipped). Deploys as a Cloudflare Pages subpath.