Evals

Score every answer. Catch the regressions.

Run code checks and LLM judges against live production traffic, each scored from 0 to 1, so quality is a number you watch, not a vibe you hope for.

Start free · See pricing

0.94 avg score

90% pass rate

tone · groundedness · valid-json · 6.1k scored

Two kinds of checks

Code checks and LLM judges, side by side.

Deterministic checks for the things that must be exactly right (valid JSON, no PII, schema conformance) and model-graded judges for the fuzzy stuff like tone and groundedness.

Code evals run inline, no model cost
LLM judges grade tone, helpfulness, and groundedness
Every eval scored 0–1 with a pass threshold you set
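
As a rough illustration of the list above, here is a minimal sketch of a deterministic code check that yields a 0–1 score and is compared against a pass threshold. It is not Foglamp's API: the names `EvalResult`, `valid_json_check`, and `run_eval` and the default threshold are hypothetical, chosen only to show the idea.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    name: str
    score: float   # always clamped to the range 0.0 to 1.0
    passed: bool   # True when score meets the threshold


def valid_json_check(output: str) -> float:
    """Deterministic check: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0


def run_eval(name: str, check: Callable[[str], float],
             output: str, threshold: float = 0.5) -> EvalResult:
    """Run one check, clamp its score to [0, 1], and apply the threshold.

    The threshold default is an assumption for this sketch; in practice
    it is whatever value you choose per eval.
    """
    score = min(1.0, max(0.0, check(output)))
    return EvalResult(name=name, score=score, passed=score >= threshold)


if __name__ == "__main__":
    print(run_eval("valid-json", valid_json_check, '{"status": "ok"}'))
    print(run_eval("valid-json", valid_json_check, '{status: ok'))
```

An LLM judge could plug into the same `run_eval` shape by returning a fractional score (say, 0.7 for groundedness) instead of a strict 0.0 or 1.0.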

Score distribution (chart, scores from 0.0 to 1.0)

On real traffic

Evals where your users actually are.

Don't grade a static test set and hope. Foglamp scores sampled production traces, so your pass rate reflects what real users are getting today.

Sample a fixed rate or score everything
Drill from a failing score to the exact trace
Trend pass rate per agent over time
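
To make the sampling and pass-rate ideas above concrete, here is a minimal sketch under stated assumptions. The functions `should_score` and `summarize`, the trace IDs, and the threshold value are hypothetical and do not reflect Foglamp's actual interface.

```python
import random
from typing import List, Optional, Tuple


def should_score(sample_rate: float, rng: random.Random) -> bool:
    """Decide whether to score a trace.

    sample_rate=1.0 scores everything; sample_rate=0.0 scores nothing.
    """
    return rng.random() < sample_rate


def summarize(scored: List[Tuple[str, float]],
              threshold: float) -> Tuple[Optional[float], List[str]]:
    """Return (pass rate, IDs of failing traces) for scored traces.

    `scored` holds (trace_id, score) pairs with scores in [0, 1].
    The failing IDs are what you would drill into from a bad score.
    """
    if not scored:
        return None, []
    failing = [trace_id for trace_id, score in scored if score < threshold]
    rate = (len(scored) - len(failing)) / len(scored)
    return rate, failing


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the sketch is repeatable
    traces = [f"trace-{i}" for i in range(1000)]
    sampled = [t for t in traces if should_score(0.1, rng)]
    print(f"sampled {len(sampled)} of {len(traces)} traces")

    scored = [("t1", 0.9), ("t2", 0.4), ("t3", 1.0), ("t4", 0.8)]
    print(summarize(scored, threshold=0.8))
```

Grouping the scored pairs by agent and by day before calling `summarize` would give the per-agent trend described above.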

Explore traces

Pass rate · answer-groundedness

80% pass rate

tone · groundedness · valid-json · 6.1k scored

Support Agent · Claude Fable 5 · Passed

plan

search_docs

generateText

Cost $0.0041 · Latency 2.3s · Tokens 1,842 · Evals 94%

Your agents are running in the fog.

What they cost, when they break, what they say. You can't see any of it. One prompt turns the light on.
