AI Agent Observability

Tracewell

Your agent failed in production. Here's the black box.

Buyer & pain

Applied AI Engineer / AI Solutions Architect / Forward Deployed Engineer — agents fail in production at 3 AM, and current tooling (Datadog, LangSmith, n8n run history) can't tell you which tool call consumed the budget, what the prompt looked like at the exact failing step, or how it differed from the last run that worked.

Problem

Nobody renders an agent run as a clickable, replayable, diff-able timeline the way Sentry renders a stacktrace. Infrastructure monitoring knows the Lambda timed out; it doesn't know the prompt drifted 4,400 tokens between Monday and Wednesday. That's the white space.

Before

A wall of JSON. The failure message says context_length_exceeded. Nobody knows why tonight's run is 4,000 tokens heavier than Monday's. The postmortem is a meeting.

After

One click opens the run inspector. The tool-call tree draws itself. The failing node glows amber. "Compare to last success" splits the prompt diff — a single 4,390-token compliance_policy_v3 → v4 injection added Tuesday. Postmortem done in three clicks.

What the 30-second demo proves

  1. Failure-mode vocabularycontext_overflow, guardrail_reject, tool_timeout: you name these only after you've been paged for them.
  2. Hierarchical data as legible UI — a nested tool-call tree with per-node token cost rendered in one screen; the exact take-home FDE and Solutions Architect roles assign.
  3. Root-cause thinking, not log-tailing — failure-mode clustering surfaces one root cause behind six incidents across 50 synthetic runs.
  4. Observability designed in — token spend, latency, and replay affordance per step; matches the n8n/StockX "instrument for auditability" ask verbatim.

How it works

  1. Timeline view — 50 synthetic agent runs (3 agents, 7 days) with status, failure-mode badges, and token totals. Failure-mode sidebar clusters runs by signature: FP-01 · 6 runs · contract-review-agent · context_overflow.
  2. Run inspector — click any row to open a split-panel. Left: tool-call tree with per-node token cost and latency. Right: prompt snapshot, model output, and a character-level diff against the last successful run of the same step.
  3. Signature momentrun_8f2a1c · contract-review-agent · 03:14 · FAILED. Click the red node. Click "Compare to last run that worked." The diff shows a single contiguous 4,390-token block added at inject_context. That's the regression. Three clicks, full postmortem.
  4. Synthetic replay — the replay tab accepts edits and re-runs against a recorded response fixture. Labeled clearly: "Replay is synthetic — edits route to a recorded response fixture."

Integration surface

Static Astro site. All data is synthetic and pre-recorded — no backend, no API keys, no live model calls. The "edit & replay" affordance routes to a bundled fixture, not a live model. Diff rendering via diff-match-patch (4 kb gzipped). Deploys as a Cloudflare Pages subpath.

Skills proven

Roles this maps to

Try the live demo → dallascrilley.com