FIELD NOTE
Why most GenAI demos fail in production
3 min read
Context
Demos impress. Production systems endure. The gap isn't tech - it's constraints, data, and operational discipline.
Demo fallacy 1: Single-turn optimism
What you see in demos
- One perfect-looking answer to a curated question.
- No history, no state, no session context.
- No adversarial inputs.
What production looks like
- Multi-turn conversations with references to prior turns.
- Users rephrase, correct, and challenge answers.
- Edge cases: typos, slang, code snippets, tables.
Why demos fail
- Prompts optimized for a single canonical query.
- No session handling or context window management.
- No guardrails for ambiguous or harmful inputs.
Mitigation
- Session-aware design: Store and summarize prior turns.
- Prompt templates for rephrase/correction: Detect user intent shifts.
- Guardrail layers: Input sanitization, refusal handling, fallback paths.
Demo fallacy 2: Data abundance assumption
What you see in demos
- 10–20 clean documents in a vector store.
- Perfect OCR, clean metadata, no versioning.
- No PII, no access controls.
What production looks like
- Thousands to millions of noisy documents.
- Versioned, with conflicting updates.
- PII/PHI, retention policies, tenant isolation.
Why demos fail
- No chunking strategy for long docs.
- No deduplication or version conflict resolution.
- No privacy pipeline before indexing.
Mitigation
- Robust preprocessing: OCR cleanup, version conflict detection.
- Privacy pipeline: PII detection, redaction, audit logging.
- Incremental indexing: Support updates without full recompute.
Demo fallacy 3: Ignoring latency and cost
What you see in demos
- 5–10 seconds per query is acceptable.
- No cost tracking; model choice is “best available.”
- No rate limiting or quota enforcement.
What production looks like
- p95 < 2.5s, often < 1s for internal tools.
- Cost per query must be predictable and budgeted.
- Rate limits per user/tenant.
Why demos fail
- No caching, no model tiering.
- Synchronous chains; no parallelism.
- No observability to detect regressions.
Mitigation
- Tiered model routing: Fast model for simple queries, strong model for complex.
- Semantic caching: Cache frequent queries and their results.
- Cost quotas: Enforce per-tenant budgets with alerts.
Demo fallacy 4: No governance or change management
What you see in demos
- One prompt, one model, deployed manually.
- No evaluation, no regression testing.
- No audit trail.
What production looks like
- Prompt registry, versioned configs.
- Automated evals before deploy.
- Immutable logs for compliance.
Why demos fail
- Manual changes cause regressions.
- No way to roll back quickly.
- No way to trace which change caused a failure.
Mitigation
- Prompt/model registry: All changes tracked and reviewed.
- Automated eval gates: Block deploys that regress metrics.
- Observability dashboards: Real-time metrics and alerting.
Trade-offs I accept
- Latency vs. quality: Use smaller models for 60% of queries; route to larger models only when confidence low.
- Cost vs. coverage: Cache aggressively; accept slight staleness for high-traffic queries.
- Speed vs. governance: Add automated checks, but keep manual override paths with audit trails.
Takeaway
Production AI isn't about the best single answer - it's about consistent, safe, and cost-aware answers at scale. Demos skip the hard parts.