8 min read

We automatically score every AI agent conversation — here's how

By the Sentygent team

If you ship AI agents or RAG pipelines, you know the problem: after deployment, how do you know if your agent is actually getting better? Or worse?

You have logs. Lots of them. Token counts, latencies, error rates. But logs aren't signal. They're just data. The real question — "is this conversation successful?" — lives in semantic territory that metrics can't reach. Did the agent actually understand the user? Was the answer helpful? Were safety guardrails respected?

Most teams solve this with manual evaluation. They sample conversations, rate them by hand, build spreadsheets, run eval pipelines on datasets. It works. It's also slow, expensive, and it scales backwards: the more you need to evaluate, the slower you move.

We built Sentygent to flip that equation: every conversation gets evaluated automatically, instantly, and for free on the free tier. No manual pipeline to build. No datasets to curate. No spreadsheets. Just signal, every time.

The problem with evaluation as it exists

Manual evaluation has two fatal flaws.

First, it doesn't scale to production. In production, you're not evaluating 50 carefully curated conversations — you're running thousands daily. You can't manually rate all of them. So you sample. Sampling means you miss quality degradation until it's too late.

Second, it's slow. You ship your agent on Monday. You evaluate it on Thursday. Between Monday and Thursday, how many bad conversations did your users see?

Existing observability platforms (Langfuse, LangSmith) solve the scale problem — they log everything. But they ask you to define what "good" means. You build custom scorers, wire up evaluation datasets, integrate CI/CD tools, babysit eval runs. The tools are generic and flexible. That flexibility is a cost: more setup, more complexity, more moving parts.

We wanted something different. Instead of "give us your data and we'll log it," we wanted "give us your data and we'll score it."

The approach: automated quality scoring

Sentygent scores every conversation across up to six dimensions:

  • Relevance — Does the response address the user's actual question?
  • Helpfulness — Would this response move the user forward?
  • Completeness — Does it cover the necessary scope, or is it truncated/shallow?
  • Coherence — Is the logic sound? Are there contradictions or confused reasoning?
  • Safety — Does it avoid harmful outputs, injections, or policy violations?
  • Groundedness (RAG only) — Did the response stay faithful to the retrieved context, or did it hallucinate?

The first five are always evaluated. Groundedness activates automatically when your trace includes retrieval events with stored chunks — no configuration needed. Each dimension gets a score from 0-100, along with an actionable suggestion. Safety is special: if it drops below 30, we alert you immediately.
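To make the output concrete, here's a rough sketch of what a per-conversation score could look like, written as a TypeScript shape. The field names are illustrative, not the SDK's actual types:

// Hypothetical shape of a conversation score; field names are illustrative
interface DimensionScore {
  score: number;       // 0-100
  suggestion?: string; // actionable feedback, e.g. "Clarify the reset steps"
}

interface ConversationScore {
  relevance: DimensionScore;
  helpfulness: DimensionScore;
  completeness: DimensionScore;
  coherence: DimensionScore;
  safety: DimensionScore;        // below 30 triggers an immediate alert
  groundedness?: DimensionScore; // present only when retrieval chunks were captured
}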

The scoring happens asynchronously after each conversation ends. We analyze the full trace — every LLM call, tool invocation, retrieval step, and error — and produce scores in seconds. The evaluation uses a purpose-built LLM pipeline optimized for speed and accuracy.

All of this is automated. No setup. No config. It just happens.

How it works in your code

Let's walk through an integration. Say you have a customer support agent that retrieves docs and generates responses.

import { SentygentClient, instrumentOpenAI } from '@sentygent/sdk';
import { OpenAI } from 'openai';

// Initialize once
const sentygent = new SentygentClient({
  apiKey: process.env.SENTYGENT_API_KEY,
  agent: 'support-agent',
  captureContent: true, // Optional: send input/output to scorer
});

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const ai = instrumentOpenAI({ openai }, sentygent); // Auto-instrument calls

// In your request handler (userId, userQuery, systemPrompt, and retrieveDocs come from your app)
await sentygent.request(userId, async (span) => {
  // Retrieve docs — chunks enable automatic groundedness scoring
  const docs = await retrieveDocs(userQuery);
  await span.captureRetrieval({
    provider: 'internal-kb',
    query: userQuery,
    execute: async () => docs,
    extractResults: (results) => ({
      resultsCount: results.length,
      relevantCount: results.filter(d => d.score > 0.8).length,
      meanScore: results.reduce((a, d) => a + d.score, 0) / results.length,
      chunks: results.map(d => ({ text: d.text, score: d.score, source: d.source })),
    }),
  });

  // Call LLM (auto-captured by instrumentor)
  const response = await ai.openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userQuery },
    ],
  });

  return response.choices[0].message.content;
});

That's it. After the request completes, Sentygent:

  1. Buffers the events (llm_call from OpenAI, retrieval, lifecycle markers)
  2. Detects the conversation end and flushes to the API
  3. Validates all events against our schema
  4. Queues an async scoring job
  5. Scores the conversation across all dimensions (5 standard + groundedness for RAG)
  6. Stores results and alerts if safety drops

You get back quality scores in seconds, visible in the dashboard alongside agent metrics and traces.

Inside the scoring pipeline

When a conversation ends, the scoring engine receives the full trace — not just metrics, but actual conversation content (when enabled), what the agent retrieved, tool outputs, and errors that occurred. It evaluates each dimension and generates actionable suggestions like "Consider clarifying the reset process with a step-by-step guide" or "Good retrieval, but the response was slightly truncated."

The score is stored and compared against your agent's historical average (30-day rolling baseline). Is today 10% better? Below average? Percentile badges help you spot trends.
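The comparison itself is simple arithmetic. As an illustration (not Sentygent's internal code), the delta against a 30-day rolling average works out roughly like this:

// Illustrative only: compare today's mean score with a 30-day rolling baseline
function baselineDelta(last30DayMeans: number[], todayMean: number): number {
  if (last30DayMeans.length === 0) return 0; // no history yet
  const baseline =
    last30DayMeans.reduce((sum, s) => sum + s, 0) / last30DayMeans.length;
  return ((todayMean - baseline) / baseline) * 100; // +10 means 10% above average
}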

What you get

Three main things.

Quality scores and trends. Every conversation is scored instantly. The dashboard shows your agent's score distribution, its trend over time, and per-conversation detail. You can filter by tags (language, user tier, feature flag, etc.) to spot quality gaps by segment.

Safety auto-alerts. Safety < 30 triggers a webhook immediately. No polling, no manual checks. If your agent starts emitting unsafe content — injection attack, policy violation, jailbreak attempt — you know within seconds.

Data webhook. After scoring, you can configure a webhook URL in Settings to receive the full trace plus quality score as JSON. Sentygent retains data for 30 days; the webhook lets you export it to your own systems for archiving, custom analysis, or integration with downstream tools.
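If you point the webhook at your own service, a minimal receiver might look like the sketch below. The payload fields (agent, qualityScore) are assumptions about the JSON shape, not a documented contract:

import express from 'express';

const app = express();
app.use(express.json());

// Minimal sketch of a data-webhook receiver; payload fields are assumed, not documented
app.post('/webhooks/sentygent', (req, res) => {
  const { agent, qualityScore } = req.body; // req.body also carries the full trace
  // Persist the payload in your own warehouse for archiving or custom analysis
  console.log(`Scored conversation for ${agent}:`, qualityScore);
  res.sendStatus(200);
});

app.listen(3000);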

What it's not

We should be honest about what Sentygent doesn't do.

It's not Langfuse or LangSmith. Those platforms are generic observability: they log everything and let you build custom eval logic. Sentygent is opinionated: we've pre-defined six dimensions (including automatic groundedness for RAG) and the scoring engine handles the evaluation. If you need custom dimensions (e.g., "does it follow the brand voice?"), you'll need another tool.

It's not perfect. No automated scoring system is. It sometimes misses nuance and occasionally rates incoherent conversations higher than they deserve. But it's consistent. And it's better than nothing — which is what most teams have today.

SDKs: TypeScript and Python. Both SDKs have full feature parity: tracing, auto-instrumentation for major LLM providers, multi-agent support, and content capture. Install with npm install @sentygent/sdk or pip install sentygent.

Content capture is opt-in. By default (captureContent: false), we send only metadata: model, tokens, latency. If you want the scorer to see actual conversation content (which produces better scores), set captureContent: true. This is privacy-first: you choose what leaves your infrastructure.
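In code, that's a single flag on the client. A sketch of the two modes, using the same client shown above:

import { SentygentClient } from '@sentygent/sdk';

// Default: metadata only (model, tokens, latency); no conversation content leaves your infra
const metadataOnly = new SentygentClient({
  apiKey: process.env.SENTYGENT_API_KEY,
  agent: 'support-agent',
});

// Opt in: the scorer also sees inputs and outputs, which produces better scores
const withContent = new SentygentClient({
  apiKey: process.env.SENTYGENT_API_KEY,
  agent: 'support-agent',
  captureContent: true,
});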

It's not designed for 100M+ conversations/day. We run on a single NestJS process + Redis on a $5/month Lightsail instance. That's plenty for five-digit daily event volumes. If you're at Anthropic scale, we're not your tool.

Pricing and free tier

Sentygent is free for:

  • Up to 5,000 events per day
  • 3 agents
  • 30-day data retention
  • Safety alerts and quality scores included

No credit card. No feature gates. Just sign up.

For teams with higher volumes, Pro pricing is custom. Email us.

Try it

Go to app.sentygent.com and sign up. You'll get a demo tenant seeded with example conversations so you can see how quality scoring works before you integrate the SDK.

When you're ready, grab your API key, install the SDK, and run the code above. Your first conversation will be scored in seconds.

npm install @sentygent/sdk

We think this is the future of agent evaluation. You shouldn't have to choose between ship speed and quality signal. They should be the same thing.

If you build agents, we'd love to hear how it goes. Found a bug? Have a feature request? Drop us a line on Discord or GitHub.


Sentygent is an open-source observability platform for AI agents. Fork us on GitHub: github.com/sentygent/sentygent

Built by the Sentygent team. Follow us: Twitter, GitHub, Discord.