Back to Field Notes

FIELD NOTE

Why most GenAI demos fail in production

ProductionJanuary 5, 20263 min read

Context

Demos impress. Production systems endure. The gap isn't tech - it's constraints, data, and operational discipline.

Demo fallacy 1: Single-turn optimism

What you see in demos

One perfect-looking answer to a curated question.
No history, no state, no session context.
No adversarial inputs.

What production looks like

Multi-turn conversations with references to prior turns.
Users rephrase, correct, and challenge answers.
Edge cases: typos, slang, code snippets, tables.

Why demos fail

Prompts optimized for a single canonical query.
No session handling or context window management.
No guardrails for ambiguous or harmful inputs.

Mitigation

Session-aware design: Store and summarize prior turns.
Prompt templates for rephrase/correction: Detect user intent shifts.
Guardrail layers: Input sanitization, refusal handling, fallback paths.

Demo fallacy 2: Data abundance assumption

What you see in demos

10–20 clean documents in a vector store.
Perfect OCR, clean metadata, no versioning.
No PII, no access controls.

What production looks like

Thousands to millions of noisy documents.
Versioned, with conflicting updates.
PII/PHI, retention policies, tenant isolation.

Why demos fail

No chunking strategy for long docs.
No deduplication or version conflict resolution.
No privacy pipeline before indexing.

Mitigation

Robust preprocessing: OCR cleanup, version conflict detection.
Privacy pipeline: PII detection, redaction, audit logging.
Incremental indexing: Support updates without full recompute.

Demo fallacy 3: Ignoring latency and cost

What you see in demos

5–10 seconds per query is acceptable.
No cost tracking; model choice is “best available.”
No rate limiting or quota enforcement.

What production looks like

p95 < 2.5s, often < 1s for internal tools.
Cost per query must be predictable and budgeted.
Rate limits per user/tenant.

Why demos fail

No caching, no model tiering.
Synchronous chains; no parallelism.
No observability to detect regressions.

Mitigation

Tiered model routing: Fast model for simple queries, strong model for complex.
Semantic caching: Cache frequent queries and their results.
Cost quotas: Enforce per-tenant budgets with alerts.

Demo fallacy 4: No governance or change management

What you see in demos

One prompt, one model, deployed manually.
No evaluation, no regression testing.
No audit trail.

What production looks like

Prompt registry, versioned configs.
Automated evals before deploy.
Immutable logs for compliance.

Why demos fail

Manual changes cause regressions.
No way to roll back quickly.
No way to trace which change caused a failure.

Mitigation

Prompt/model registry: All changes tracked and reviewed.
Automated eval gates: Block deploys that regress metrics.
Observability dashboards: Real-time metrics and alerting.

Trade-offs I accept

Latency vs. quality: Use smaller models for 60% of queries; route to larger models only when confidence low.
Cost vs. coverage: Cache aggressively; accept slight staleness for high-traffic queries.
Speed vs. governance: Add automated checks, but keep manual override paths with audit trails.

Takeaway

Production AI isn't about the best single answer - it's about consistent, safe, and cost-aware answers at scale. Demos skip the hard parts.

Back to Field Notes