FIELD NOTE

Evaluating LLM systems beyond accuracy

Context

Accuracy alone doesn’t tell you if a system is safe to deploy. In production, you need metrics that reflect real-world risk and cost.

Metric 1: Calibration

What it measures

Whether the model’s confidence matches its correctness. Well-calibrated models don’t overclaim when they’re unsure.

Why it matters

  • Users trust confidence scores more when they’re meaningful.
  • Downstream systems can route low-confidence queries to human review.
  • Helps set thresholds for automation.

How to measure

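  • Bin predictions by confidence (equal-width bins, or percentile bins so each bin holds the same number of predictions).
  • Compute accuracy and mean confidence per bin.
  • Plot a reliability diagram; compute Expected Calibration Error (ECE), the bin-size-weighted average gap between confidence and accuracy.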
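
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the `n_bins` parameter are assumptions, and it uses equal-width confidence bins (the most common ECE variant). Switching to percentile bins would change only the bin-assignment step.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins.

    confidences: list of floats in [0, 1]
    correct:     list of booleans, same length as confidences
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

The per-bin `avg_conf` and `accuracy` values are the points you would plot on the reliability diagram. ECE is unsigned, so the diagram is still needed to see the direction of the miscalibration.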

Production signal

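  • Rising ECE after a prompt change means calibration got worse. ECE is unsigned, so check the reliability diagram to see whether the model became overconfident or underconfident.
  • A sudden calibration drop may indicate data drift.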

Metric 2: Refusal correctness

What it measures

Whether the model refuses appropriately (harmful/invalid queries) and answers when it should.

Why it matters

  • Over-refusing kills productivity; under-refusing creates compliance risk.
  • Product teams tune refusal behavior; you must detect regressions.

How to measure

  • Curate two datasets: safe-to-answer and should-refuse.
  • Measure false refusal rate (refusals on the safe-to-answer set) and false answer rate (answers on the should-refuse set).
  • Track per-category refusal rates.
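
A minimal sketch of those three measurements, assuming each evaluated query has already been labeled with a category, whether it should have been refused, and whether the model actually refused. The record layout and function name are hypothetical; detecting a refusal in raw model output is a separate problem not covered here.

```python
from collections import defaultdict


def refusal_metrics(records):
    """records: iterable of (category, should_refuse, did_refuse) tuples."""
    safe_total = safe_refused = 0
    refuse_total = refuse_answered = 0
    per_category = defaultdict(lambda: [0, 0])  # category -> [refusals, total]

    for category, should_refuse, did_refuse in records:
        per_category[category][1] += 1
        if did_refuse:
            per_category[category][0] += 1
        if should_refuse:
            refuse_total += 1
            if not did_refuse:
                refuse_answered += 1  # answered something it should have refused
        else:
            safe_total += 1
            if did_refuse:
                safe_refused += 1  # refused something it should have answered

    return {
        # None when the corresponding dataset is empty, to avoid dividing by zero.
        "false_refusal_rate": safe_refused / safe_total if safe_total else None,
        "false_answer_rate": refuse_answered / refuse_total if refuse_total else None,
        "refusal_rate_by_category": {
            cat: refused / total for cat, (refused, total) in per_category.items()
        },
    }
```

Keeping the two error rates separate matters: a single "refusal accuracy" number can stay flat while over-refusal and under-refusal move in opposite directions.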

Production signal

  • Spike in false refusals after a policy update.
  • Declining refusal accuracy may indicate prompt drift.

Metric 3: Cost per correct answer

What it measures

The total cost (tokens + infrastructure) to achieve one correct, useful answer.

Why it matters

  • Accuracy at any cost is not viable in enterprise.
  • Encourages smart routing, caching, and right-sizing models.

How to measure

  • Sum the token cost per query plus infrastructure overhead.
  • Divide by number of “correct” answers (human-rated or automated).
  • Track per-customer and per-query-type costs.
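
As a rough sketch of the arithmetic, assuming a flat price per 1,000 tokens and a fixed overhead per query (both hypothetical parameters; real pricing usually differs for input and output tokens):

```python
def cost_per_correct_answer(queries, price_per_1k_tokens, overhead_per_query=0.0):
    """queries: list of dicts with 'tokens' (int) and 'correct' (bool).

    Every query is paid for, but only correct answers count in the denominator.
    """
    total_cost = sum(
        q["tokens"] / 1000 * price_per_1k_tokens + overhead_per_query
        for q in queries
    )
    n_correct = sum(1 for q in queries if q["correct"])
    if n_correct == 0:
        return float("inf")  # money spent, nothing usable produced
    return total_cost / n_correct
```

To track per-customer or per-query-type costs, group the query records by that key first and call the function once per group.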

Production signal

  • Cost per correct answer trending up suggests inefficient routing.
  • Sudden jumps may indicate a model change or prompt bloat.

Metric 4: Controllability

What it measures

Whether the model follows explicit constraints (format, length, tone, tool usage).

Why it matters

  • Downstream integrations break when output format changes.
  • Business rules (e.g., “never mention competitors”) must be respected.

How to measure

  • Create constraint test sets with expected patterns.
  • Use regex or schema validation to check compliance.
  • Measure violation rate per constraint type.
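
One way to sketch this, using only the Python standard library. The three checkers and the constraint-type labels are illustrative assumptions; a real suite would mirror whatever constraints your prompts actually impose.

```python
import json
import re


def check_json_keys(output, required_keys):
    """Format constraint: output must be a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def check_max_words(output, limit):
    """Length constraint: whitespace-separated word count must not exceed limit."""
    return len(output.split()) <= limit


def check_forbidden(output, pattern):
    """Business-rule constraint: the regex pattern must not appear (case-insensitive)."""
    return re.search(pattern, output, flags=re.IGNORECASE) is None


def violation_rates(results):
    """results: list of (constraint_type, passed) pairs -> violation rate per type."""
    totals, violations = {}, {}
    for constraint_type, passed in results:
        totals[constraint_type] = totals.get(constraint_type, 0) + 1
        if not passed:
            violations[constraint_type] = violations.get(constraint_type, 0) + 1
    return {c: violations.get(c, 0) / totals[c] for c in totals}
```

For example, `check_forbidden(reply, r"\bacmecorp\b")` would flag any mention of a hypothetical competitor name. Regex checks only catch literal mentions, so tone constraints generally need a human or model-based rater instead.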

Production signal

  • Increased violations after prompt edits.
  • Correlate violations with specific user segments or query types.

Trade-offs I accept

  • Calibration vs. latency: Temperature tuning alone adds little latency; calibration methods that sample several answers or add a verification pass do increase response time.
  • Refusal correctness vs. risk: Prefer slight over-refusing; add human review for edge cases.
  • Cost vs. quality: Route 70% of queries to smaller models; escalate only when confidence is low.
  • Controllability vs. flexibility: Strict format validation reduces creativity but prevents downstream breaks.
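
The cost vs. quality rule can be turned into a concrete threshold. The sketch below assumes you have small-model confidence scores from a validation set and picks the cutoff that keeps roughly the target fraction on the small model. The function names and the "small"/"large" labels are hypothetical, and the approach only works as well as the confidence scores are calibrated.

```python
def escalation_threshold(confidences, keep_fraction=0.7):
    """Confidence cutoff that keeps about keep_fraction of queries on the small model."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(confidences, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    # The lowest confidence among the kept queries becomes the cutoff.
    return ranked[n_keep - 1]


def route(confidence, threshold):
    """Stay on the small model at or above the cutoff; otherwise escalate."""
    return "small" if confidence >= threshold else "large"
```

Ties at the cutoff all stay on the small model, so the realized fraction can exceed the target when many queries share the same score.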

Takeaway

Production LLM systems need a dashboard, not a single accuracy score. Track calibration, refusal correctness, cost per correct answer, and controllability, and set automated gates for regressions.
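
An automated gate can be as simple as comparing a candidate's metrics against a baseline with a per-metric tolerance. This is a minimal sketch: the metric names and tolerance values are placeholders, and it assumes every tracked metric is "lower is better" (ECE, false refusal rate, cost per correct answer, violation rate).

```python
def regression_gate(baseline, candidate, max_increase):
    """Return {metric: increase} for every metric that regressed past its tolerance.

    baseline, candidate: dicts of metric name -> value (lower is better)
    max_increase:        dict of metric name -> allowed absolute increase
    """
    failures = {}
    for name, tolerance in max_increase.items():
        delta = candidate[name] - baseline[name]
        if delta > tolerance:
            failures[name] = delta
    return failures


# Example with made-up numbers: block the release if any metric regresses too far.
failures = regression_gate(
    baseline={"ece": 0.04, "false_refusal_rate": 0.02},
    candidate={"ece": 0.09, "false_refusal_rate": 0.02},
    max_increase={"ece": 0.02, "false_refusal_rate": 0.01},
)
if failures:
    print("gate failed:", sorted(failures))
```

An empty result means the candidate passes; in CI you would exit non-zero when `failures` is non-empty.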