GenAI Observability Platform
AI Engineering · Data Engineering
Overview
A purpose-built observability platform for production GenAI workloads. The system captures every LLM call across any provider (including direct API integrations and multi-provider routing services) and surfaces token usage, response times, full inputs/outputs, and cost metrics through an intuitive interface designed for navigating millions of calls per day.
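The production schema is the client's own, but the kind of per-call record such a platform captures can be sketched as a small dataclass. All field names here are illustrative assumptions, not the actual data model:

```python
from dataclasses import dataclass, field, asdict
import time
import uuid

@dataclass
class CallEvent:
    # Illustrative fields only; the real schema is domain-specific.
    provider: str           # e.g. a direct API or a routing service
    model: str
    agent: str              # which agent in the workflow made the call
    workflow_id: str
    input: str              # full input payload
    output: str             # full model response
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cost_usd: float
    call_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

event = CallEvent(
    provider="openai", model="gpt-4o", agent="planner", workflow_id="wf-123",
    input="Summarize the report...", output="The report covers...",
    prompt_tokens=812, completion_tokens=143,
    latency_ms=2140.5, cost_usd=0.0041,
)
record = asdict(event)  # JSON-serializable dict, ready for the pipeline
```

Capturing provider, model, agent, and workflow on every event is what makes the later cost and performance breakdowns possible along each of those dimensions.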
Challenge
The client operated complex multi-agent workflows across multiple LLM providers and routing services, generating millions of GenAI calls daily. Existing observability tools weren't built for LLM workloads. They couldn't correlate calls across providers, trace agent decision chains, or surface token-level cost breakdowns. The client also needed the instrumentation to have near-zero performance impact on their production agents, ruling out synchronous logging approaches.
Solution
We built Workflows: a fully async observability pipeline designed for minimal instrumentation overhead. A lightweight SDK captures call metadata, full inputs, and full outputs, then pushes events to SQS, decoupling telemetry collection from agent execution. An ingestion service processes the queue and indexes structured call data into Elasticsearch, optimized for high-cardinality queries across provider, model, agent, workflow, and time dimensions.

The frontend provides intuitive drill-down from high-level workflow traces to individual call payloads, with full input/output inspection at every level. Built-in reporting lets the client generate output-based analytics, compare model performance across versions and releases, and track quality and cost trends over time.

The entire system was designed around the client's domain: data models, navigation patterns, terminology, and reporting workflows all reflect how their team actually operates, not a generic dashboard bolted on after the fact.
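The fire-and-forget pattern at the heart of the SDK can be sketched as follows: the agent's hot path does only a non-blocking queue put, while a background thread batches events and ships them. The `send_batch` callable stands in for an SQS `send_message_batch` call (an assumption here; an in-memory sink is used below so the sketch is self-contained):

```python
import json
import queue
import threading

class AsyncTelemetry:
    """Minimal sketch of an async telemetry pipeline. The hot path
    (record) never blocks or raises into agent code; a daemon thread
    drains the queue and hands batches to `send_batch`."""

    def __init__(self, send_batch, batch_size=10, flush_interval=0.5):
        self._q = queue.Queue(maxsize=10_000)
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, event: dict) -> None:
        # Non-blocking: drop the event on overflow rather than stall the agent.
        try:
            self._q.put_nowait(event)
        except queue.Full:
            pass

    def _drain(self) -> None:
        batch = []
        # Exit only once shutdown is requested AND the queue is empty.
        while not (self._stop.is_set() and self._q.empty()):
            try:
                batch.append(self._q.get(timeout=self._flush_interval))
            except queue.Empty:
                pass
            # Flush when the batch is full or there is nothing left to wait for.
            if batch and (len(batch) >= self._batch_size or self._q.empty()):
                self._send_batch([json.dumps(e) for e in batch])
                batch = []

    def close(self) -> None:
        self._stop.set()
        self._worker.join()

# In production, send_batch would wrap an SQS client's send_message_batch;
# here a plain list demonstrates the flow end to end.
shipped = []
telemetry = AsyncTelemetry(send_batch=shipped.extend, batch_size=2)
for i in range(4):
    telemetry.record({"call_id": i, "provider": "openai"})
telemetry.close()  # all four events ship without ever blocking the caller
```

Because the only work on the agent's thread is an in-memory enqueue, per-call overhead stays negligible regardless of queue depth downstream, which is the property that makes the sub-millisecond instrumentation target achievable.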
Results
1M+ daily GenAI calls tracked
Provider-agnostic: any LLM, any routing service
<1ms instrumentation overhead
Sub-second query latency at scale