The Demo That Launched a Thousand Bug Reports
The demo went perfectly. Stakeholders applauded. "Ship it," they said.
Within 72 hours, the bug reports started rolling in.
"It worked in the playground," you tell the incident channel. It did. That's the problem.
Anyone can make a prompt work in a demo. The gap between that demo and production, where real users do unexpected things at scale, is where most AI projects fail.
This article covers the patterns that survived millions of production requests. Every lesson here was learned the hard way: through outages, cost overruns, user complaints, and late-night debugging sessions.
The Production Prompt Stack
Production prompts are not single strings. They are layered systems with multiple components working together.
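In practice, each request is assembled from those layers at call time. Here is a minimal sketch of the assembly, using the builder functions and types defined in the layers below (llm.complete is the same assumed client used throughout this article):
async function runLayeredPrompt(
  config: SystemPromptConfig,   // Layer 1: stable system foundation
  ctx: RequestContext,          // Layer 2: per-request context
  task: TaskSpec                // Layer 3: the specific instruction
): Promise<string> {
  const layers = [
    buildSystemPrompt(config),
    buildContextBlock(ctx),
    buildTaskBlock(task)
  ];
  // Join the layers with clear separators so the model can tell them apart.
  return llm.complete({ prompt: layers.join('\n\n---\n\n') });
}
Each layer changes at a different rate - the foundation rarely, the context on every request - which is exactly why they are built and versioned separately.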
Layer 1: The System Foundation
Your system prompt is infrastructure. It runs on every request, shapes every response, and is the hardest to change once users depend on its behavior.
interface SystemPromptConfig {
version: string;
identity: {
role: string;
personality: string;
expertise: string[];
};
capabilities: {
allowed: string[];
forbidden: string[];
requiresApproval: string[];
};
outputRules: {
format: 'markdown' | 'json' | 'plain';
maxLength?: number;
requiredSections?: string[];
};
safetyRules: {
contentFilters: string[];
escalationTriggers: string[];
uncertaintyHandling: string;
};
}
function buildSystemPrompt(config: SystemPromptConfig): string {
return `
You are ${config.identity.role}.
## Personality
${config.identity.personality}
## Expertise Areas
${config.identity.expertise.map(e => `- ${e}`).join('\n')}
## What You Can Do
${config.capabilities.allowed.map(c => `- ${c}`).join('\n')}
## What You Must Never Do
${config.capabilities.forbidden.map(c => `- ${c}`).join('\n')}
## Actions Requiring Human Approval
${config.capabilities.requiresApproval.map(c => `- ${c}`).join('\n')}
## Response Format
Format: ${config.outputRules.format}
${config.outputRules.maxLength ? `Maximum length: ${config.outputRules.maxLength} characters` : ''}
${config.outputRules.requiredSections ? `Required sections: ${config.outputRules.requiredSections.join(', ')}` : ''}
## Safety Guidelines
- Content to filter: ${config.safetyRules.contentFilters.join(', ')}
- Escalate when: ${config.safetyRules.escalationTriggers.join(', ')}
- When uncertain: ${config.safetyRules.uncertaintyHandling}
`.trim();
}
This structured approach gives you version control, A/B testing capability, and clear documentation of your AI's behavior contract.
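For example, a hypothetical support-agent configuration (all values here are illustrative):
const supportSystemPrompt = buildSystemPrompt({
  version: '2.3.0',
  identity: {
    role: 'a customer support assistant for a cloud hosting product',
    personality: 'Concise, friendly, and direct. Admits uncertainty rather than guessing.',
    expertise: ['account management', 'billing questions', 'API troubleshooting']
  },
  capabilities: {
    allowed: ['answer product questions', 'summarize documentation', 'draft replies for agents'],
    forbidden: ['issue refunds', 'quote legal advice', 'discuss other customers'],
    requiresApproval: ['account deletion', 'plan downgrades']
  },
  outputRules: { format: 'markdown', maxLength: 2000 },
  safetyRules: {
    contentFilters: ['profanity', 'personal data'],
    escalationTriggers: ['legal threats', 'mentions of a security breach'],
    uncertaintyHandling: 'Say you are not sure and offer to escalate to a human agent.'
  }
});
Because the configuration is data rather than an inline string, it can be reviewed, diffed, and rolled back like any other release artifact.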
Layer 2: Context Injection
Production systems need dynamic context. User data, session history, retrieved documents, current state - all injected at runtime.
interface RequestContext {
user: {
id: string;
tier: 'free' | 'pro' | 'enterprise';
preferences: Record<string, string>;
history: ConversationTurn[];
};
session: {
id: string;
startedAt: Date;
turnCount: number;
currentTask?: string;
};
retrieved: {
documents: RetrievedDocument[];
relevanceScores: number[];
};
constraints: {
maxTokens: number;
timeoutMs: number;
costBudget: number;
};
}
function buildContextBlock(ctx: RequestContext): string {
const historyWindow = ctx.user.history.slice(-5); // Last 5 turns
return `
## User Context
- User tier: ${ctx.user.tier}
- Preferences: ${JSON.stringify(ctx.user.preferences)}
## Conversation History
${historyWindow.map(t => `${t.role}: ${t.content}`).join('\n')}
## Relevant Information
${ctx.retrieved.documents.map((doc, i) => `
### Source ${i + 1} (relevance: ${(ctx.retrieved.relevanceScores[i] * 100).toFixed(0)}%)
${doc.content}
`).join('\n')}
## Constraints
- Keep response under ${Math.floor(ctx.constraints.maxTokens * 0.75)} tokens
- Respond within ${ctx.constraints.timeoutMs}ms budget
`.trim();
}
Layer 3: Task Instructions
The specific request, formatted for reliability:
interface TaskSpec {
action: string;
inputs: Record<string, unknown>;
outputSchema?: object;
examples?: Array<{ input: unknown; output: unknown }>;
constraints?: string[];
}
function buildTaskBlock(task: TaskSpec): string {
let block = `
## Your Task
${task.action}
## Inputs
${Object.entries(task.inputs).map(([k, v]) => `- ${k}: ${JSON.stringify(v)}`).join('\n')}
`;
if (task.outputSchema) {
block += `
## Required Output Format
\`\`\`json
${JSON.stringify(task.outputSchema, null, 2)}
\`\`\`
`;
}
if (task.examples && task.examples.length > 0) {
block += `
## Examples
${task.examples.map((ex, i) => `
Example ${i + 1}:
Input: ${JSON.stringify(ex.input)}
Output: ${JSON.stringify(ex.output)}
`).join('\n')}
`;
}
if (task.constraints) {
block += `
## Additional Constraints
${task.constraints.map(c => `- ${c}`).join('\n')}
`;
}
return block.trim();
}
Reliability Patterns
Production systems must handle failure gracefully. LLMs fail in ways traditional software does not.
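It helps to name those failure modes before reaching for a fix, because each one calls for a different recovery. The taxonomy below is a working assumption rather than an exhaustive list; z is the zod instance used elsewhere in this article:
type LLMFailureMode =
  | 'malformed_json'    // response was not parseable JSON
  | 'schema_mismatch'   // parsed, but failed schema validation
  | 'truncated'         // hit the max-token limit mid-response
  | 'refusal'           // the model declined to answer (usually detected from the response text)
  | 'provider_error';   // timeout, rate limit, or 5xx from the API
// Rough heuristics; tune these against your own error logs.
function classifyFailure(error: unknown, finishReason?: string): LLMFailureMode {
  if (finishReason === 'length') return 'truncated';
  if (error instanceof SyntaxError) return 'malformed_json';   // JSON.parse threw
  if (error instanceof z.ZodError) return 'schema_mismatch';   // schema.parse threw
  return 'provider_error';                                     // network, timeout, rate limit, etc.
}
A truncated response, for instance, usually needs a shorter prompt or a chunked task rather than a blind retry.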
Pattern 1: Retry with Reformulation
When a response fails validation, do not just retry - reformulate the prompt with feedback:
interface RetryConfig {
maxAttempts: number;
backoffMs: number[];
reformulationStrategies: ReformulationStrategy[];
}
type ReformulationStrategy =
| 'add_example'
| 'simplify_request'
| 'explicit_format'
| 'break_down_task';
async function reliableCompletion<T>(
prompt: string,
schema: z.ZodSchema<T>,
config: RetryConfig
): Promise<T> {
let lastError: Error | null = null;
let currentPrompt = prompt;
for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
try {
const response = await llm.complete({
prompt: currentPrompt,
responseFormat: 'json'
});
const parsed = JSON.parse(response);
return schema.parse(parsed);
} catch (error) {
lastError = error as Error;
// Wait with backoff
if (attempt < config.maxAttempts - 1) {
await sleep(config.backoffMs[attempt] || 1000);
// Reformulate based on error type
currentPrompt = reformulatePrompt(
prompt,
error,
config.reformulationStrategies[attempt]
);
console.log(`Retry ${attempt + 1}: reformulated with ${config.reformulationStrategies[attempt]}`);
}
}
}
throw new Error(`Failed after ${config.maxAttempts} attempts: ${lastError?.message}`);
}
function reformulatePrompt(
original: string,
error: unknown,
strategy: ReformulationStrategy
): string {
switch (strategy) {
case 'add_example':
return `${original}
IMPORTANT: Your previous response was invalid. Here's an example of a correct response:
${getExampleForSchema(error)}
Please try again, following the exact format shown.`;
case 'simplify_request':
return `${original}
NOTE: Please provide a simple, direct response. Avoid complex structures or explanations.`;
case 'explicit_format':
return `${original}
CRITICAL: You MUST respond with valid JSON only. No markdown, no explanation, no code blocks.
Start your response with { and end with }.`;
case 'break_down_task':
return `Let's break this down step by step.
First, I'll ask you to do just one part:
${extractFirstSubtask(original)}`;
default:
return original;
}
}
Pattern 2: Fallback Chains
When the primary model fails, fall back to alternatives:
interface ModelConfig {
name: string;
provider: 'openai' | 'anthropic' | 'google' | 'local';
costPer1kTokens: number;
latencyP50Ms: number;
capabilities: string[];
}
const modelChain: ModelConfig[] = [
{ name: 'gpt-4o', provider: 'openai', costPer1kTokens: 0.005, latencyP50Ms: 800, capabilities: ['reasoning', 'code', 'analysis'] },
{ name: 'claude-3-5-sonnet', provider: 'anthropic', costPer1kTokens: 0.003, latencyP50Ms: 600, capabilities: ['reasoning', 'writing', 'analysis'] },
{ name: 'gpt-4o-mini', provider: 'openai', costPer1kTokens: 0.00015, latencyP50Ms: 300, capabilities: ['general'] },
{ name: 'local-llama', provider: 'local', costPer1kTokens: 0, latencyP50Ms: 200, capabilities: ['general'] }
];
async function completionWithFallback(
prompt: string,
requiredCapabilities: string[],
options: { timeout: number; maxCost: number }
): Promise<{ response: string; model: string; cost: number }> {
const eligibleModels = modelChain.filter(m =>
requiredCapabilities.every(cap => m.capabilities.includes(cap)) &&
m.costPer1kTokens <= options.maxCost
);
for (const model of eligibleModels) {
try {
const response = await Promise.race([
callModel(model, prompt),
timeout(options.timeout)
]);
return {
response: response.text,
model: model.name,
cost: response.tokensUsed * model.costPer1kTokens / 1000
};
} catch (error) {
console.warn(`Model ${model.name} failed: ${error}`);
continue;
}
}
throw new Error('All models in fallback chain failed');
}
Pattern 3: Circuit Breaker
Protect your system when LLM providers have issues:
class LLMCircuitBreaker {
private failures: number = 0;
private lastFailure: Date | null = null;
private state: 'closed' | 'open' | 'half-open' = 'closed';
constructor(
private threshold: number = 5,
private resetTimeMs: number = 30000
) {}
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'open') {
if (Date.now() - (this.lastFailure?.getTime() || 0) > this.resetTimeMs) {
this.state = 'half-open';
} else {
throw new Error('Circuit breaker is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'closed';
}
private onFailure() {
this.failures++;
this.lastFailure = new Date();
if (this.failures >= this.threshold) {
this.state = 'open';
console.error(`Circuit breaker opened after ${this.failures} failures`);
}
}
}
// Usage
const breaker = new LLMCircuitBreaker(5, 30000);
async function safeCompletion(prompt: string): Promise<string> {
return breaker.call(() => llm.complete({ prompt }));
}
Cost Optimization Strategies
LLM costs can explode quickly. Here are patterns that saved us hundreds of thousands of dollars.
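A rough cost model makes the stakes concrete before any optimization; the traffic and price figures below are illustrative only, and real providers price input and output tokens separately (collapsed here for simplicity):
// Monthly cost ~ requests/day * tokens/request * price per 1k tokens / 1000 * 30 days.
function estimateMonthlyCost(
  requestsPerDay: number,
  avgPromptTokens: number,
  avgCompletionTokens: number,
  costPer1kTokens: number
): number {
  const tokensPerRequest = avgPromptTokens + avgCompletionTokens;
  return (requestsPerDay * tokensPerRequest * costPer1kTokens / 1000) * 30;
}
// 100,000 requests/day at 1,500 prompt + 500 completion tokens and $0.005 per 1k tokens
// comes to roughly $30,000 per month.
estimateMonthlyCost(100_000, 1_500, 500, 0.005); // ~30000
At that scale, even a 20% reduction in tokens or a partial shift to a cheaper model is real money, which is what the three strategies below target.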
Strategy 1: Prompt Compression
Reduce token count without losing meaning:
interface CompressionResult {
compressed: string;
originalTokens: number;
compressedTokens: number;
compressionRatio: number;
}
async function compressPrompt(prompt: string): Promise<CompressionResult> {
const originalTokens = countTokens(prompt);
// Strategy 1: Remove redundant whitespace
let compressed = prompt.replace(/\n{3,}/g, '\n\n').trim();
// Strategy 2: Abbreviate common phrases (carefully!)
const abbreviations: Record<string, string> = {
'for example': 'e.g.',
'that is': 'i.e.',
'and so on': 'etc.',
'as soon as possible': 'ASAP',
};
for (const [full, abbrev] of Object.entries(abbreviations)) {
compressed = compressed.replace(new RegExp(full, 'gi'), abbrev);
}
// Strategy 3: Use LLM to compress context (for long documents)
if (originalTokens > 2000) {
compressed = await llm.complete({
prompt: `Compress this text to 30% of its length while preserving all key information:
${compressed}
Compressed version:`,
maxTokens: Math.floor(originalTokens * 0.35)
});
}
const compressedTokens = countTokens(compressed);
return {
compressed,
originalTokens,
compressedTokens,
compressionRatio: compressedTokens / originalTokens
};
}
Strategy 2: Tiered Model Selection
Use expensive models only when necessary:
interface TaskComplexity {
score: number; // 0-1
factors: {
requiresReasoning: boolean;
requiresCreativity: boolean;
requiresAccuracy: boolean;
contextLength: number;
outputComplexity: 'simple' | 'structured' | 'complex';
};
}
function assessComplexity(task: string, context: string): TaskComplexity {
const factors = {
requiresReasoning: /analyze|compare|evaluate|decide|reason/i.test(task),
requiresCreativity: /create|write|generate|design|imagine/i.test(task),
requiresAccuracy: /calculate|extract|precise|exact|legal|medical/i.test(task),
contextLength: countTokens(context),
outputComplexity: detectOutputComplexity(task)
};
let score = 0;
if (factors.requiresReasoning) score += 0.3;
if (factors.requiresCreativity) score += 0.2;
if (factors.requiresAccuracy) score += 0.3;
if (factors.contextLength > 4000) score += 0.1;
if (factors.outputComplexity === 'complex') score += 0.1;
return { score: Math.min(score, 1), factors };
}
function selectModel(complexity: TaskComplexity): string {
if (complexity.score > 0.7) return 'gpt-4o';
if (complexity.score > 0.4) return 'gpt-4o-mini';
return 'gpt-3.5-turbo';
}
Strategy 3: Semantic Caching
Cache responses for similar queries:
interface CacheEntry {
prompt: string;
promptEmbedding: number[];
response: string;
createdAt: Date;
hitCount: number;
}
class SemanticCache {
private cache: CacheEntry[] = [];
private similarityThreshold = 0.95;
async get(prompt: string): Promise<string | null> {
const embedding = await embed(prompt);
for (const entry of this.cache) {
const similarity = cosineSimilarity(embedding, entry.promptEmbedding);
if (similarity >= this.similarityThreshold) {
entry.hitCount++;
console.log(`Cache hit (similarity: ${similarity.toFixed(3)})`);
return entry.response;
}
}
return null;
}
async set(prompt: string, response: string): Promise<void> {
const embedding = await embed(prompt);
this.cache.push({
prompt,
promptEmbedding: embedding,
response,
createdAt: new Date(),
hitCount: 0
});
// Evict old entries if cache too large
if (this.cache.length > 10000) {
this.evictLeastUsed();
}
}
private evictLeastUsed(): void {
this.cache.sort((a, b) => b.hitCount - a.hitCount);
this.cache = this.cache.slice(0, 5000);
}
}
// Usage
const cache = new SemanticCache();
async function cachedCompletion(prompt: string): Promise<string> {
const cached = await cache.get(prompt);
if (cached) return cached;
const response = await llm.complete({ prompt });
await cache.set(prompt, response);
return response;
}
Security Patterns
LLMs introduce novel security risks. These patterns protect against them.
Pattern 1: Input Sanitization
interface SanitizationResult {
sanitized: string;
flagged: boolean;
flags: string[];
}
function sanitizeUserInput(input: string): SanitizationResult {
const flags: string[] = [];
let sanitized = input;
// Detect prompt injection attempts
const injectionPatterns = [
/ignore (all |previous |above )?instructions/i,
/disregard (all |previous |above )?instructions/i,
/forget (all |previous |above )?instructions/i,
/new instructions:/i,
/system prompt/i,
/you are now/i,
/pretend (to be|you're)/i,
/roleplay as/i,
/<\/?system>/i,
/<\/?user>/i,
/<\/?assistant>/i
];
for (const pattern of injectionPatterns) {
if (pattern.test(input)) {
flags.push(`Potential injection: ${pattern.source}`);
}
}
// Remove or escape special delimiters
sanitized = sanitized
.replace(/\`\`\`/g, "'''")
.replace(/<system>/gi, '[system]')
.replace(/<\/system>/gi, '[/system]');
// Truncate extremely long inputs
if (sanitized.length > 10000) {
sanitized = sanitized.slice(0, 10000) + '... [truncated]';
flags.push('Input truncated due to length');
}
return {
sanitized,
flagged: flags.length > 0,
flags
};
}
Pattern 2: Output Validation
interface OutputValidation {
valid: boolean;
issues: string[];
sanitizedOutput: string;
}
function validateOutput(output: string, context: OutputContext): OutputValidation {
const issues: string[] = [];
let sanitized = output;
// Check for leaked system prompt
if (context.systemPrompt) {
const systemWords = context.systemPrompt.split(/\s+/).filter(w => w.length > 10);
const matchCount = systemWords.filter(w => output.includes(w)).length;
if (matchCount > systemWords.length * 0.3) {
issues.push('Possible system prompt leak detected');
}
}
// Check for PII in output (that wasn't in input)
const piiPatterns = {
ssn: /\b\d{3}-\d{2}-\d{4}\b/,
creditCard: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/,
email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/
};
for (const [type, pattern] of Object.entries(piiPatterns)) {
const outputMatches = output.match(pattern) || [];
const inputMatches = context.userInput.match(pattern) || [];
const newPII = outputMatches.filter(m => !inputMatches.includes(m));
if (newPII.length > 0) {
issues.push(`Potentially hallucinated ${type}: ${newPII.length} instance(s)`);
for (const pii of newPII) {
sanitized = sanitized.replace(pii, '[REDACTED]');
}
}
}
// Check for harmful content
const harmfulPatterns = [
/how to (make|build|create) (a )?bomb/i,
/hack into/i,
/steal (someone's|personal)/i
];
for (const pattern of harmfulPatterns) {
if (pattern.test(output)) {
issues.push(`Harmful content detected: ${pattern.source}`);
}
}
return {
valid: issues.length === 0,
issues,
sanitizedOutput: sanitized
};
}
Pattern 3: Rate Limiting per User
interface RateLimitConfig {
requestsPerMinute: number;
tokensPerDay: number;
costPerDay: number;
}
const tierLimits: Record<string, RateLimitConfig> = {
free: { requestsPerMinute: 10, tokensPerDay: 10000, costPerDay: 0.10 },
pro: { requestsPerMinute: 60, tokensPerDay: 100000, costPerDay: 5.00 },
enterprise: { requestsPerMinute: 300, tokensPerDay: 1000000, costPerDay: 100.00 }
};
class UserRateLimiter {
private usage: Map<string, {
requests: { timestamp: number }[];
tokensToday: number;
costToday: number;
dayStart: number;
}> = new Map();
check(userId: string, tier: string, estimatedTokens: number): {
allowed: boolean;
reason?: string;
retryAfterMs?: number;
} {
const limits = tierLimits[tier] || tierLimits.free;
const now = Date.now();
const dayStart = new Date().setHours(0, 0, 0, 0);
let usage = this.usage.get(userId);
if (!usage || usage.dayStart !== dayStart) {
usage = { requests: [], tokensToday: 0, costToday: 0, dayStart };
this.usage.set(userId, usage);
}
// Check requests per minute
const recentRequests = usage.requests.filter(r => now - r.timestamp < 60000);
if (recentRequests.length >= limits.requestsPerMinute) {
const oldestRecent = recentRequests[0].timestamp;
return {
allowed: false,
reason: 'Rate limit exceeded',
retryAfterMs: 60000 - (now - oldestRecent)
};
}
// Check daily token limit
if (usage.tokensToday + estimatedTokens > limits.tokensPerDay) {
return {
allowed: false,
reason: 'Daily token limit exceeded',
retryAfterMs: dayStart + 86400000 - now
};
}
return { allowed: true };
}
record(userId: string, tokensUsed: number, cost: number): void {
const usage = this.usage.get(userId);
if (usage) {
usage.requests.push({ timestamp: Date.now() });
usage.tokensToday += tokensUsed;
usage.costToday += cost;
}
}
}
Observability and Debugging
You cannot improve what you cannot measure. Production AI systems need deep observability.
Pattern 1: Structured Logging
interface LLMCallLog {
id: string;
timestamp: Date;
// Request
model: string;
promptTokens: number;
promptHash: string; // For debugging without storing full prompts
temperature: number;
// Response
completionTokens: number;
latencyMs: number;
finishReason: 'stop' | 'length' | 'content_filter' | 'error';
// Context
userId?: string;
sessionId?: string;
taskType: string;
// Quality signals
validationPassed: boolean;
retryCount: number;
cachedResponse: boolean;
// Cost
estimatedCost: number;
}
class LLMLogger {
private logs: LLMCallLog[] = [];
log(entry: Omit<LLMCallLog, 'id' | 'timestamp'>): void {
const fullEntry: LLMCallLog = {
...entry,
id: crypto.randomUUID(),
timestamp: new Date()
};
this.logs.push(fullEntry);
// Send to observability platform
this.sendToDatadog(fullEntry);
// Alert on anomalies
this.checkAlerts(fullEntry);
}
private checkAlerts(entry: LLMCallLog): void {
if (entry.latencyMs > 10000) {
this.alert('high_latency', entry);
}
if (entry.retryCount > 2) {
this.alert('high_retry', entry);
}
if (entry.estimatedCost > 1.00) {
this.alert('high_cost', entry);
}
if (!entry.validationPassed) {
this.alert('validation_failed', entry);
}
}
private alert(type: string, entry: LLMCallLog): void {
console.warn(`[ALERT:${type}] ${JSON.stringify(entry)}`);
}
private sendToDatadog(entry: LLMCallLog): void {
// Implementation for your observability platform
}
}
Pattern 2: Prompt Versioning and A/B Testing
interface PromptVersion {
id: string;
version: string;
prompt: string;
createdAt: Date;
metrics: {
invocations: number;
successRate: number;
avgLatencyMs: number;
avgCost: number;
userRating?: number;
};
}
class PromptExperiment {
private variants: Map<string, PromptVersion> = new Map();
private assignments: Map<string, string> = new Map(); // userId -> variantId
addVariant(variant: PromptVersion): void {
this.variants.set(variant.id, variant);
}
getPrompt(userId: string): { prompt: string; variantId: string } {
// Sticky assignment for consistent user experience
let variantId = this.assignments.get(userId);
if (!variantId) {
// Assign based on user hash for reproducibility
const hash = this.hashUserId(userId);
const variantIds = Array.from(this.variants.keys());
variantId = variantIds[hash % variantIds.length];
this.assignments.set(userId, variantId);
}
const variant = this.variants.get(variantId)!;
variant.metrics.invocations++;
return { prompt: variant.prompt, variantId };
}
recordOutcome(variantId: string, success: boolean, latencyMs: number, cost: number): void {
const variant = this.variants.get(variantId);
if (variant) {
const m = variant.metrics;
m.successRate = (m.successRate * (m.invocations - 1) + (success ? 1 : 0)) / m.invocations;
m.avgLatencyMs = (m.avgLatencyMs * (m.invocations - 1) + latencyMs) / m.invocations;
m.avgCost = (m.avgCost * (m.invocations - 1) + cost) / m.invocations;
}
}
getWinner(): PromptVersion | null {
const variants = Array.from(this.variants.values())
.filter(v => v.metrics.invocations >= 100); // Minimum sample size
if (variants.length < 2) return null;
// Simple winner: highest success rate with acceptable latency
variants.sort((a, b) => {
if (b.metrics.successRate !== a.metrics.successRate) {
return b.metrics.successRate - a.metrics.successRate;
}
return a.metrics.avgLatencyMs - b.metrics.avgLatencyMs;
});
return variants[0];
}
private hashUserId(userId: string): number {
let hash = 0;
for (let i = 0; i < userId.length; i++) {
hash = ((hash << 5) - hash) + userId.charCodeAt(i);
hash = hash & hash;
}
return Math.abs(hash);
}
}
Real-World Case Studies
Case Study 1: Customer Support Automation
We built an AI support agent handling 50,000 tickets per month. The journey from demo to production taught us:
Initial Problem: The demo worked beautifully with test queries, but production users asked questions in ways we never anticipated.
Solution: We built a query normalization layer that rephrased user questions before sending to the LLM:
async function normalizeQuery(rawQuery: string): Promise<string> {
// Handle common patterns that confused the main model
const normalized = await llm.complete({
prompt: `Rewrite this customer support query to be clear and specific.
Original: "${rawQuery}"
Rules:
- Fix typos and grammar
- Expand abbreviations
- If the query is vague, make reasonable assumptions and note them
- Keep the same intent and urgency
Rewritten query:`,
maxTokens: 200,
temperature: 0.3
});
return normalized;
}
Result: Query understanding improved from 72% to 94%, and customers reported higher satisfaction with responses.
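In the final pipeline the normalization step simply sits in front of the main answering call; answerTicket here stands in for whatever downstream completion you use:
async function handleTicket(rawQuery: string): Promise<string> {
  const normalized = await normalizeQuery(rawQuery);  // cheap, low-temperature rewrite
  return answerTicket(normalized);                    // the main model only sees the cleaned query
}
The extra call adds a little latency and cost, but far less than a misunderstood query costs downstream.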
Case Study 2: Code Review Assistant
We deployed an AI code reviewer integrated with GitHub PRs. Key lessons:
Problem 1: Large diffs overflowed the context window.
Solution: Intelligent chunking by file, with cross-file context summary:
async function reviewLargePR(files: FileChange[]): Promise<Review[]> {
// First pass: generate a summary of the overall change
const summary = await summarizeChanges(files);
// Review each file with the summary as context
const reviews = await Promise.all(
files.map(file => reviewFileWithContext(file, summary))
);
// Final pass: check for cross-file issues
const crossFileIssues = await checkCrossFileConsistency(files, reviews);
return [...reviews, ...crossFileIssues];
}
Problem 2: The AI flagged too many false positives, annoying developers.
Solution: Confidence scoring with threshold:
interface ReviewComment {
line: number;
message: string;
severity: 'critical' | 'warning' | 'suggestion';
confidence: number; // 0-1
}
function filterReviewComments(
comments: ReviewComment[],
minConfidence: number = 0.7
): ReviewComment[] {
return comments.filter(c => {
// Always show critical issues
if (c.severity === 'critical') return c.confidence > 0.5;
// Higher threshold for warnings and suggestions
return c.confidence >= minConfidence;
});
}
Result: The false positive rate dropped from 35% to 8%, and developer adoption increased from 40% to 85%.
The Production Checklist
Before deploying any LLM feature, verify:
Reliability
Cost Control
Security
Observability
Testing
FAQ
Q: How do I test prompt changes without breaking production?
Shadow mode: run new prompts alongside production, compare outputs, but only serve the production version. Graduate to canary (5% traffic) once confidence is high.
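A minimal shadow-mode sketch; buildPrompt, compareOutputs, and logShadowComparison are hypothetical helpers:
async function completeWithShadow(userInput: string): Promise<string> {
  const production = await llm.complete({ prompt: buildPrompt('production', userInput) });
  // Fire-and-forget: the candidate must never block or alter the user-facing response.
  void (async () => {
    try {
      const candidate = await llm.complete({ prompt: buildPrompt('candidate', userInput) });
      logShadowComparison({
        input: userInput,
        production,
        candidate,
        score: await compareOutputs(production, candidate)
      });
    } catch (error) {
      console.warn('Shadow evaluation failed', error);
    }
  })();
  return production;
}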
Q: What's a reasonable latency target for LLM responses?
For interactive use: p50 < 2s, p95 < 5s. For background processing: whatever your users tolerate. Streaming helps perception - first token in < 500ms feels fast even if completion takes longer.
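If you stream, measure time to first token directly; this sketch assumes a streaming client that yields text chunks (llm.streamComplete is hypothetical):
async function measureTimeToFirstToken(prompt: string): Promise<{ text: string; ttftMs: number }> {
  const start = Date.now();
  let ttftMs = -1;
  let text = '';
  for await (const chunk of llm.streamComplete({ prompt })) {
    if (ttftMs < 0) ttftMs = Date.now() - start;  // first visible output
    text += chunk;
  }
  return { text, ttftMs };
}
Track ttftMs as its own percentile alongside total latency; the two often move independently.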
Q: How do I handle prompt injection attacks?
Defense in depth: input sanitization, output validation, separate system/user message roles, and monitor for anomalies. No single technique is sufficient.
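Putting the earlier patterns together, a guarded request path might look like the sketch below. sanitizeUserInput and validateOutput are the functions from the security section; the chat-style llm.chat call with separate roles is an assumption about your client:
async function guardedCompletion(userInput: string, systemPrompt: string): Promise<string> {
  const { sanitized, flagged, flags } = sanitizeUserInput(userInput);
  if (flagged) console.warn('Input flags:', flags);  // monitor; block only on high-severity flags
  const output = await llm.chat({
    messages: [
      { role: 'system', content: systemPrompt },  // keep system and user content in separate roles
      { role: 'user', content: sanitized }
    ]
  });
  const check = validateOutput(output, { systemPrompt, userInput: sanitized });
  if (!check.valid) console.warn('Output issues:', check.issues);
  return check.sanitizedOutput;  // any PII found in the output is already redacted here
}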
Q: Should I use few-shot examples or fine-tuning?
Start with few-shot - it's faster to iterate. Fine-tune when: you have 1000+ high-quality examples, few-shot isn't reaching quality targets, or you need to reduce prompt size for cost/latency.
Q: How often should I review and update prompts?
Monthly minimum. Set up alerts for quality degradation. Model updates can change behavior - revalidate after provider announcements.
---
Production prompt engineering isn't about writing clever prompts. It's about building robust systems that work reliably at scale, within budget, while protecting users and the business.
The patterns here emerged from real failures. Each retry mechanism exists because requests failed. Each cost optimization exists because bills exceeded budget. Each security pattern exists because attacks happened.
Start with the basics: structured prompts, validation, logging. Add complexity only when needed. Measure everything. Iterate based on data.
Build systems, not demos. Your users - and your on-call rotation - will thank you.