FIELD NOTE

Evaluating LLM systems beyond accuracy

Evaluation · January 3, 2026 · 3 min read

Context

Accuracy alone doesn’t tell you if a system is safe to deploy. In production, you need metrics that reflect real-world risk and cost.

Metric 1: Calibration

What it measures

Whether the model’s confidence matches its correctness. Well-calibrated models don’t overclaim when they’re unsure.

Why it matters

Users trust confidence scores more when they’re meaningful.
Downstream systems can route low-confidence queries to human review.
Helps set thresholds for automation.

How to measure

Bin predictions by confidence percentile.
Compute accuracy per bin.
Plot reliability diagram; compute Expected Calibration Error (ECE).
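The three steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes each prediction comes with a confidence in [0, 1] and a correctness label, and it uses equal-mass (percentile) bins. The function name and the `n_bins` default are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE over equal-mass (percentile) bins of confidence."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Sort by confidence so consecutive slices form percentile bins.
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0])
    total = len(pairs)
    n_bins = min(n_bins, total)  # guarantees every bin is non-empty

    ece = 0.0
    for b in range(n_bins):
        chunk = pairs[b * total // n_bins:(b + 1) * total // n_bins]
        avg_conf = sum(conf for conf, _ in chunk) / len(chunk)
        accuracy = sum(1 for _, ok in chunk if ok) / len(chunk)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(chunk) / total) * abs(accuracy - avg_conf)
    return ece
```

The per-bin `avg_conf` and `accuracy` pairs are also the points of the reliability diagram, so the same loop can feed a plot.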

Production signal

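Rising ECE after a prompt change means the model’s confidence no longer tracks its accuracy; check the reliability diagram to see whether it became overconfident or underconfident.
A sudden drop in calibration may indicate data drift.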

Metric 2: Refusal correctness

What it measures

Whether the model refuses appropriately (harmful/invalid queries) and answers when it should.

Why it matters

Over-refusing kills productivity; under-refusing creates compliance risk.
Product teams tune refusal behavior; you must detect regressions.

How to measure

Curate two datasets: safe-to-answer and should-refuse.
Measure false refusal rate and false answer rate.
Track per-category refusal rates.
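As a rough sketch of the measurements above, assume each test case records its category, whether it belongs to the should-refuse set, and whether the model actually refused. How `did_refuse` gets labeled (human review, a classifier) is left open here; the `RefusalCase` type and `refusal_metrics` function are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RefusalCase:
    category: str        # illustrative labels such as "billing" or "malware"
    should_refuse: bool  # ground truth from the curated dataset
    did_refuse: bool     # observed behavior, labeled by a human or a classifier


def refusal_metrics(cases: List[RefusalCase]) -> Dict[str, object]:
    safe = [c for c in cases if not c.should_refuse]
    unsafe = [c for c in cases if c.should_refuse]

    # False refusal: the model refused a safe-to-answer query.
    false_refusal_rate = sum(c.did_refuse for c in safe) / len(safe) if safe else 0.0
    # False answer: the model answered a should-refuse query.
    false_answer_rate = (
        sum(not c.did_refuse for c in unsafe) / len(unsafe) if unsafe else 0.0
    )

    # Share of queries refused in each category, regardless of ground truth.
    seen: Dict[str, int] = defaultdict(int)
    refused: Dict[str, int] = defaultdict(int)
    for c in cases:
        seen[c.category] += 1
        refused[c.category] += int(c.did_refuse)

    return {
        "false_refusal_rate": false_refusal_rate,
        "false_answer_rate": false_answer_rate,
        "refusal_rate_by_category": {cat: refused[cat] / seen[cat] for cat in seen},
    }
```

Keeping the two error rates separate matters: a single "refusal accuracy" number can hide a rise in one rate behind a fall in the other.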

Production signal

Spike in false refusals after a policy update.
Declining refusal accuracy may indicate prompt drift.

Metric 3: Cost per correct answer

What it measures

The total cost (tokens + infrastructure) to achieve one correct, useful answer.

Why it matters

Accuracy at any cost is not viable in enterprise.
Encourages smart routing, caching, and right-sizing models.

How to measure

Sum the token cost and infrastructure overhead across all queries.
Divide by number of “correct” answers (human-rated or automated).
Track per-customer and per-query-type costs.
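The calculation above might look like the following. It is a sketch under stated assumptions: token prices are quoted per 1,000 tokens, infrastructure overhead is a flat per-query amount, and correctness has already been rated. The names, the record type, and the pricing model are all illustrative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class QueryRecord:
    input_tokens: int
    output_tokens: int
    is_correct: bool  # human-rated or automated judgment


def cost_per_correct_answer(
    records: Iterable[QueryRecord],
    input_price_per_1k: float,
    output_price_per_1k: float,
    overhead_per_query: float = 0.0,
) -> float:
    total_cost = 0.0
    n_correct = 0
    for r in records:
        # Every query costs money, including the ones that were wrong.
        total_cost += (
            r.input_tokens / 1000 * input_price_per_1k
            + r.output_tokens / 1000 * output_price_per_1k
            + overhead_per_query
        )
        n_correct += int(r.is_correct)
    # No correct answers: the cost of one correct answer is unbounded.
    return total_cost / n_correct if n_correct else float("inf")
```

Grouping the records by customer or query type before calling the function gives the per-segment view.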

Production signal

Cost per correct answer trending up suggests inefficient routing.
Sudden jumps may indicate a model change or prompt bloat.

Metric 4: Controllability

What it measures

Whether the model follows explicit constraints (format, length, tone, tool usage).

Why it matters

Downstream integrations break when output format changes.
Business rules (e.g., “never mention competitors”) must be respected.

How to measure

Create constraint test sets with expected patterns.
Use regex or schema validation to check compliance.
Measure violation rate per constraint type.
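One way to wire up these checks with only the standard library is sketched below. The three checkers (JSON keys, word limit, banned terms) are hypothetical examples of constraint types, and the required-key check stands in for full schema validation.

```python
import json
import re
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def is_json_with_keys(output: str, required_keys: Set[str]) -> bool:
    """Format constraint: output parses as a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def within_word_limit(output: str, max_words: int) -> bool:
    """Length constraint: whitespace-separated word count."""
    return len(output.split()) <= max_words


def avoids_terms(output: str, banned: List[str]) -> bool:
    """Business-rule constraint: no banned term appears as a whole word."""
    return not any(
        re.search(rf"\b{re.escape(term)}\b", output, re.IGNORECASE)
        for term in banned
    )


def violation_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results holds (constraint_type, passed) pairs, one per check run."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for constraint_type, passed in results:
        totals[constraint_type] += 1
        if not passed:
            failures[constraint_type] += 1
    return {ctype: failures[ctype] / totals[ctype] for ctype in totals}
```

Regex and key checks only catch constraints that can be verified mechanically; tone and similar constraints would need a rater or a separate classifier.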

Production signal

Increased violations after prompt edits.
Violations concentrated in specific user segments or query types.

Trade-offs I accept

Calibration vs. latency: Temperature tuning can improve calibration, but it may increase response time.
Refusal correctness vs. risk: Prefer slight over-refusing; add human review for edge cases.
Cost vs. quality: Route 70% of queries to smaller models; escalate only when confidence is low.
Controllability vs. flexibility: Strict format validation reduces creativity but prevents downstream breaks.
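The cost vs. quality trade-off can be sketched as a confidence threshold. The example below assumes the smaller model produces a confidence score for each query and that a sample of past scores is available to pick the cutoff; the function names and the "small"/"large" labels are hypothetical.

```python
from typing import Sequence


def threshold_for_small_model_share(
    confidences: Sequence[float], small_share: float = 0.7
) -> float:
    """Pick the cutoff that keeps roughly small_share of queries on the small model."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    ranked = sorted(confidences, reverse=True)
    keep = max(1, round(len(ranked) * small_share))
    # The lowest confidence among the queries we want to keep is the cutoff.
    return ranked[keep - 1]


def route(confidence: float, threshold: float) -> str:
    """Stay on the small model unless confidence falls below the threshold."""
    return "small" if confidence >= threshold else "large"
```

Ties at the cutoff can push the realized share slightly above the target, and the threshold is only as trustworthy as the calibration of the scores behind it.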

Takeaway

Production LLM systems need a dashboard, not a single accuracy score. Track calibration, refusal correctness, cost per correct answer, and controllability, and set automated gates for regressions.
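An automated gate can be as simple as comparing a candidate's metrics to a baseline with per-metric tolerances. The sketch below assumes every tracked metric is "lower is better" (ECE, false refusal rate, false answer rate, cost per correct answer, violation rate); the metric names and tolerance values are placeholders.

```python
from typing import Dict, List


def regression_failures(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_increase: Dict[str, float],
) -> List[str]:
    """Return one message per metric that worsened by more than its tolerance."""
    failures = []
    for name, limit in max_increase.items():
        delta = current[name] - baseline[name]
        if delta > limit:
            failures.append(f"{name}: +{delta:.3f} exceeds allowed +{limit:.3f}")
    return failures


# Example: block a release if any gate fails (all numbers are illustrative).
if __name__ == "__main__":
    problems = regression_failures(
        baseline={"ece": 0.05, "violation_rate": 0.010},
        current={"ece": 0.09, "violation_rate": 0.012},
        max_increase={"ece": 0.02, "violation_rate": 0.005},
    )
    for problem in problems:
        print("GATE FAILED:", problem)
```

Running this in CI after each prompt or model change turns the dashboard into a check that fails loudly.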
