FIELD NOTE

Why most GenAI demos fail in production

Context

Demos impress. Production systems endure. The gap isn't the technology; it's constraints, data, and operational discipline.

Demo fallacy 1: Single-turn optimism

What you see in demos

  • One perfect-looking answer to a curated question.
  • No history, no state, no session context.
  • No adversarial inputs.

What production looks like

  • Multi-turn conversations with references to prior turns.
  • Users rephrase, correct, and challenge answers.
  • Edge cases: typos, slang, code snippets, tables.

Why demos fail

  • Prompts optimized for a single canonical query.
  • No session handling or context window management.
  • No guardrails for ambiguous or harmful inputs.

Mitigation

  • Session-aware design: Store and summarize prior turns.
  • Prompt templates for rephrasing and correction: Detect when the user shifts intent.
  • Guardrail layers: Input sanitization, refusal handling, fallback paths.
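
One way to picture "store and summarize prior turns" is the minimal sketch below. It keeps the last few turns verbatim and folds older ones into a running summary. The names `SessionMemory` and `naive_summarize` are hypothetical, and the summarizer is a truncating placeholder; a real system would probably call a model at that step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def naive_summarize(previous_summary: str, turns: List[Turn]) -> str:
    """Placeholder summarizer (assumption): keep the first 80 chars of each turn."""
    lines = [f"{role}: {text[:80]}" for role, text in turns]
    # Drop the previous summary if it is empty, then join everything.
    return "\n".join(filter(None, [previous_summary, *lines]))


@dataclass
class SessionMemory:
    max_recent_turns: int = 4
    summarize: Callable[[str, List[Turn]], str] = naive_summarize
    summary: str = ""
    recent: List[Turn] = field(default_factory=list)

    def add_turn(self, role: str, text: str) -> None:
        self.recent.append((role, text))
        overflow = len(self.recent) - self.max_recent_turns
        if overflow > 0:
            # Fold the oldest turns into the running summary.
            self.summary = self.summarize(self.summary, self.recent[:overflow])
            self.recent = self.recent[overflow:]

    def build_context(self) -> str:
        """Return the text to prepend to the next prompt."""
        parts = []
        if self.summary:
            parts.append("Summary of earlier turns:\n" + self.summary)
        parts.extend(f"{role}: {text}" for role, text in self.recent)
        return "\n".join(parts)
```

The point of the split is that the prompt size stays bounded no matter how long the session runs, while references to recent turns still resolve against verbatim text.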

Demo fallacy 2: Data abundance assumption

What you see in demos

  • 10–20 clean documents in a vector store.
  • Perfect OCR, clean metadata, no versioning.
  • No PII, no access controls.

What production looks like

  • Thousands to millions of noisy documents.
  • Versioned, with conflicting updates.
  • PII/PHI, retention policies, tenant isolation.

Why demos fail

  • No chunking strategy for long docs.
  • No deduplication or version conflict resolution.
  • No privacy pipeline before indexing.

Mitigation

  • Robust preprocessing: OCR cleanup, version conflict detection.
  • Privacy pipeline: PII detection, redaction, audit logging.
  • Incremental indexing: Support updates without full recompute.
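
The privacy and incremental-indexing points can be sketched together, under heavy assumptions. Redaction here covers only email addresses via a simple regex, which is far from a full PII detector, and the "index" is a plain dictionary standing in for a vector store. `IncrementalIndex` and `redact` are hypothetical names.

```python
import hashlib
import re
from typing import Dict

# Simplified email pattern (assumption): real PII detection needs far more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Replace email addresses before anything reaches the index."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class IncrementalIndex:
    def __init__(self) -> None:
        self.hashes: Dict[str, str] = {}  # doc_id -> hash of redacted content
        self.store: Dict[str, str] = {}   # doc_id -> redacted text (vector store stand-in)

    def upsert(self, doc_id: str, text: str) -> str:
        """Index a document only if its redacted content changed."""
        clean = redact(text)
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if self.hashes.get(doc_id) == digest:
            return "unchanged"  # skip re-embedding
        status = "updated" if doc_id in self.hashes else "added"
        self.hashes[doc_id] = digest
        self.store[doc_id] = clean
        return status
```

Hashing the redacted text rather than the raw text means the index never holds a fingerprint of the unredacted document, and unchanged documents cost nothing on re-ingest.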

Demo fallacy 3: Ignoring latency and cost

What you see in demos

  • 5–10 seconds per query is acceptable.
  • No cost tracking; model choice is “best available.”
  • No rate limiting or quota enforcement.

What production looks like

  • p95 latency under 2.5 s, often under 1 s for internal tools.
  • Cost per query must be predictable and budgeted.
  • Rate limits per user/tenant.

Why demos fail

  • No caching, no model tiering.
  • Synchronous chains; no parallelism.
  • No observability to detect regressions.

Mitigation

  • Tiered model routing: Fast model for simple queries, strong model for complex ones.
  • Semantic caching: Reuse cached answers for frequent queries and close paraphrases of them.
  • Cost quotas: Enforce per-tenant budgets with alerts.
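
A rough sketch of tiered routing combined with a per-tenant budget follows. Everything in it is illustrative: the tier names, the per-call costs in `MODEL_COST`, the word-count heuristic in `pick_tier`, and the 80% alert threshold are assumptions, not measured values or any provider's pricing.

```python
from dataclasses import dataclass

# Hypothetical per-call costs in USD; real prices depend on the provider.
MODEL_COST = {"fast": 0.001, "strong": 0.02}


class BudgetExceeded(Exception):
    """Raised when a call would push a tenant past its budget."""


def pick_tier(query: str, max_simple_words: int = 12) -> str:
    """Toy heuristic: short single-line queries go to the fast model."""
    looks_complex = len(query.split()) > max_simple_words or "\n" in query
    return "strong" if looks_complex else "fast"


@dataclass
class TenantBudget:
    limit_usd: float
    spent_usd: float = 0.0
    alert_ratio: float = 0.8  # assumed alert threshold

    def charge(self, amount: float) -> None:
        if self.spent_usd + amount > self.limit_usd:
            raise BudgetExceeded(f"would exceed limit of {self.limit_usd} USD")
        self.spent_usd += amount

    def near_limit(self) -> bool:
        """True once spending crosses the alert threshold."""
        return self.spent_usd >= self.alert_ratio * self.limit_usd


def route(query: str, budget: TenantBudget) -> str:
    """Pick a tier, charge the tenant, and return the chosen tier name."""
    tier = pick_tier(query)
    budget.charge(MODEL_COST[tier])
    return tier
```

Charging before the model call, rather than after, is what makes the quota enforceable: a request that would break the budget never reaches the expensive tier.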

Demo fallacy 4: No governance or change management

What you see in demos

  • One prompt, one model, deployed manually.
  • No evaluation, no regression testing.
  • No audit trail.

What production looks like

  • Prompt registry, versioned configs.
  • Automated evals before deploy.
  • Immutable logs for compliance.

Why demos fail

  • Manual changes cause regressions.
  • No way to roll back quickly.
  • No way to trace which change caused a failure.

Mitigation

  • Prompt/model registry: All changes tracked and reviewed.
  • Automated eval gates: Block deploys that regress metrics.
  • Observability dashboards: Real-time metrics and alerting.
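
An automated eval gate can be as small as a comparison between baseline and candidate metrics. The sketch below assumes higher is better for every metric and treats a missing metric as a regression; the metric names and the 0.01 tolerance are made up for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return metrics where the candidate is worse than baseline by more than tolerance.

    Assumes higher is better. A metric missing from the candidate counts as a regression.
    """
    regressions = []
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None or cand_value < base_value - tolerance:
            regressions.append(name)
    return regressions


def deploy_allowed(baseline: Dict[str, float], candidate: Dict[str, float]) -> bool:
    """Gate: allow the deploy only if nothing regressed."""
    return not find_regressions(baseline, candidate)
```

For example, with a baseline of `{"accuracy": 0.90, "groundedness": 0.85}` and a candidate of `{"accuracy": 0.905, "groundedness": 0.80}`, the gate blocks the deploy because `groundedness` dropped by more than the tolerance.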

Trade-offs I accept

  • Latency vs. quality: Use smaller models for 60% of queries; route to larger models only when confidence is low.
  • Cost vs. coverage: Cache aggressively; accept slight staleness for high-traffic queries.
  • Speed vs. governance: Add automated checks, but keep manual override paths with audit trails.
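
The "accept slight staleness" trade-off maps naturally onto a cache with a time-to-live. The sketch below is a simplified stand-in: it matches queries on normalized text (lowercased, whitespace collapsed), not on embedding similarity, so it is not a true semantic cache. `TTLCache` is a hypothetical name, and the clock is injectable only to make the behavior easy to test.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(query.lower().split())


class TTLCache:
    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, query: str, answer: str) -> None:
        self._entries[normalize(query)] = (self.clock(), answer)

    def get(self, query: str) -> Optional[str]:
        key = normalize(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self.clock() - stored_at > self.ttl_seconds:
            # Entry is older than the staleness we are willing to accept.
            del self._entries[key]
            return None
        return answer
```

The TTL is the staleness budget made explicit: a longer value saves more model calls on high-traffic queries, at the price of serving answers that may lag behind the underlying data.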

Takeaway

Production AI isn't about the best single answer; it's about consistent, safe, and cost-aware answers at scale. Demos skip the hard parts.