AI Engineering

March 20, 2025

How to Evaluate LLMs Before Shipping to Production

The Problem with Vibes-Based Testing

Most teams test their LLM features by trying some examples, seeing that they look good, and shipping. This works — until it doesn't.

The problem is that LLMs produce different outputs for the same input across runs. They degrade when you change the prompt slightly. They fail on edge cases you didn't anticipate. And when you're evaluating by "does this look good to me," you're optimizing for the examples you tried, not the distribution of inputs your users will actually send.

Production is unforgiving. Users find the edge cases your manual testing missed. Without a rigorous evaluation framework, you don't know if a prompt change improved things or made them worse, and you have no way to catch regressions.

Here's the framework we use.

Layer 1: Unit Evaluations

Unit evals are the foundation. They're specific input-output pairs where you have a ground truth expectation.

For each capability your LLM feature needs, write a set of test cases:

Input: the exact user input (or a representative sample)
Expected output: what a correct response looks like — either an exact string or a set of conditions
Assertion: how you check whether the output matches

Example for a document summarization feature:

Input: a specific 500-word technical document
Expected: summary that includes the main conclusion, is under 100 words, doesn't include any facts not in the original
Assertion: word count check + fact-check against the original

Start with 20-30 unit evals per capability. Grow this set as you find failures in production.

Layer 2: Regression Sets

Your regression set is a collection of inputs that have caused problems before — either in testing or in production. Every time you fix a bug, add the failing input to the regression set.

Run your regression set every time you change a prompt, change a model, or change any part of the pipeline that touches the LLM. A change that fixes one thing and breaks another is worse than not changing anything.

This sounds obvious but most teams don't do it. They fix a bug, eyeball a few examples, and ship. Regression sets catch the cases you stopped looking at.

Layer 3: Adversarial Inputs

Your unit evals cover the happy path. Adversarial inputs test the edges.

Adversarial categories to consider:

Empty or minimal inputs: empty string, one word, a single period
Very long inputs: text that exceeds what you expect, or approaches context limits
Off-topic inputs: questions or content completely unrelated to the intended use case
Prompt injection attempts: inputs that try to override the system prompt or jailbreak the model
Multilingual inputs: especially if your product is English-only
Inputs with typos and poor formatting: real users don't write clean text

The goal isn't to handle every adversarial case perfectly. It's to understand how your system degrades and ensure it fails gracefully (a polite error message) rather than catastrophically (confidently wrong output or security breach).

Layer 4: Latency and Cost Benchmarks

Quality isn't the only dimension that matters in production.

For every LLM feature, measure:

P50, P95, P99 latency: Median isn't enough. The tail matters. A P99 of 30 seconds will destroy the user experience even if median is 2 seconds.
Token consumption per request: Input tokens + output tokens. This drives cost.
Cost per feature invocation: Actual dollar cost, using current model pricing.

Project this to your expected usage volume. A feature that costs $0.05 per call is fine at 1,000 calls/month. At 100,000 calls/month, that's $5,000/month for one feature.

Run these benchmarks against the same set of inputs every time you change the model or prompt. Cost and latency can change significantly with seemingly minor prompt changes.

Layer 5: A/B Testing in Production

Unit evals tell you if your change is correct. A/B testing tells you if it's better for real users.

For high-traffic features, run model changes as controlled experiments:

10% of traffic sees the new version
Both versions log their outputs
Compare quality metrics (if you have them) and downstream behavior (did the user take the action you wanted?)

This is especially important for model upgrades. A new model version might score higher on your evals but produce outputs that real users prefer less. Real-world behavior is the ground truth.

Layer 6: LLM-as-Judge

For outputs that are difficult to evaluate programmatically (long-form text, nuanced responses), use another LLM as an evaluator.

The pattern:

Take an input + the model's output
Send both to an evaluator LLM with a rubric
The evaluator scores the output on each dimension of your rubric
Aggregate scores across your test set

This doesn't replace human evaluation, but it scales it. You can run LLM-as-judge evaluations across thousands of examples in minutes.

The evaluator model should be different from (and ideally more capable than) the model being evaluated. Using GPT-4 to evaluate GPT-4 outputs leads to systematic blind spots.

Recommended Tools

promptfoo: Open source, works locally, supports multiple providers. Excellent for unit evals and regression testing. Free.
LangSmith: Tracing + evaluation from LangChain. Best-in-class UI for debugging chains and agents. Has a free tier.
Braintrust: Evaluation platform with a strong emphasis on LLM-as-judge workflows. Good for teams that need shared visibility into eval results.

Pick one and stick with it. The discipline of running evals matters more than which tool you use.

The Minimum Viable Eval Suite

If you're shipping your first LLM feature and don't have time for all of this:

Write 20 unit evals covering your most important use cases
Write 10 adversarial inputs for edge cases
Measure latency and cost per call
Commit to growing the eval suite every time a production issue is found

That's it. Start here. Add layers as your product matures.

The teams that build evaluation infrastructure early ship with confidence. The ones that don't spend their time firefighting production issues they could have caught in testing.

Meet the author

Tequity Team