Evals
Score every answer. Catch the regressions.
Run code checks and LLM judges against live production traffic and score every response from 0 to 1, so quality is a number you watch, not a vibe you hope for.
0.94 avg score
90% pass rate
tone · groundedness · valid-json · 6.1k scored
Two kinds of checks
Code checks and LLM judges, side by side.
Deterministic checks for the things that must be exactly right — valid JSON, no PII, schema conformance — and model-graded judges for the fuzzy stuff like tone and groundedness.
- Code evals run inline, no model cost
- LLM judges grade tone, helpfulness, and groundedness
- Every eval scored 0–1 with a pass threshold you set
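As a rough illustration of a code eval with a pass threshold, here is a minimal sketch in Python. It is not Foglamp's API: `EvalResult`, `valid_json`, and `run_eval` are hypothetical names invented for this example, and the 0.8 threshold is an arbitrary choice.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    name: str
    score: float  # always clamped to the 0-1 range
    passed: bool


def valid_json(output: str) -> float:
    """Deterministic check: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def run_eval(name: str, check: Callable[[str], float],
             output: str, threshold: float) -> EvalResult:
    """Run one check, clamp its score to [0, 1], and apply the threshold."""
    score = min(1.0, max(0.0, check(output)))
    return EvalResult(name=name, score=score, passed=score >= threshold)


# Hypothetical usage: the same harness scores a good and a bad response.
good = run_eval("valid-json", valid_json, '{"answer": 42}', threshold=0.8)
bad = run_eval("valid-json", valid_json, "not json at all", threshold=0.8)
```

An LLM judge could plug into the same `run_eval` shape by passing a function that calls a model and returns a number between 0 and 1. That part is left out here because it depends on the model client you use.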
Score distribution (chart: scores from 0.0 to 1.0)
On real traffic
Evals where your users actually are.
Don't grade a static test set and hope. Foglamp scores sampled production traces, so your pass rate reflects what real users are getting today.
- Sample at a fixed rate or score everything
- Drill from a failing score to the exact trace
- Trend pass rate per agent over time
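To make the sampling and pass-rate ideas concrete, here is a small Python sketch. It assumes each trace has a string ID and that sampling is decided by hashing that ID. `should_score` and `pass_rate` are hypothetical helpers for illustration, not how Foglamp is known to implement this.

```python
import hashlib
from typing import Sequence


def should_score(trace_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether a trace falls in the sample.

    Hashing the ID means the same trace always gets the same decision.
    A sample_rate of 1.0 scores everything and 0.0 scores nothing.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def pass_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of scores at or above the threshold (0.0 if there are none)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical usage: sample roughly 10% of traces, then summarize scores.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_score(t, sample_rate=0.10)]
rate = pass_rate([0.9, 0.5, 1.0, 0.8], threshold=0.8)
```

Hash-based sampling is one design choice among several. Its main appeal is that rescoring the same traffic selects the same traces, which keeps the pass-rate trend comparable from one run to the next.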
Pass rate · answer-groundedness
80% pass rate
Your agents are running in the fog.
Cost, latency, errors, eval scores — all there, all invisible. Wrap your model and turn the light on.