The Demo That Launched a Thousand Bug Reports
The demo went perfectly. Stakeholders applauded. "Ship it," they said.
Within 72 hours, the bug reports started rolling in.
"It worked in the playground," you tell the incident channel. It did. That's the problem.
Anyone can make a prompt work in a demo. The gap between that demo and production, where real users do unexpected things at scale, is where most AI projects fail.
This article covers the patterns that survived millions of production requests. Every lesson here was learned the hard way: through outages, cost overruns, user complaints, and late-night debugging sessions.
The Production Prompt Stack
Production prompts are not single strings. They are layered systems with multiple components working together.
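In practice, each request is assembled from those layers at call time. Here is a minimal sketch of the assembly, using the builder functions and types defined in the layers below (llm.complete is the same assumed client used throughout this article):
async function runLayeredPrompt(
  config: SystemPromptConfig,   // Layer 1: stable system foundation
  ctx: RequestContext,          // Layer 2: per-request context
  task: TaskSpec                // Layer 3: the specific instruction
): Promise<string> {
  const layers = [
    buildSystemPrompt(config),
    buildContextBlock(ctx),
    buildTaskBlock(task)
  ];
  // Join the layers with clear separators so the model can tell them apart.
  return llm.complete({ prompt: layers.join('\n\n---\n\n') });
}
Each layer changes at a different rate - the foundation rarely, the context on every request - which is exactly why they are built and versioned separately.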
Layer 1: The System Foundation
Your system prompt is infrastructure. It runs on every request, shapes every response, and is the hardest to change once users depend on its behavior.
interface SystemPromptConfig {
version: string;
identity: {
role: string;
personality: string;
expertise: string[];
};
capabilities: {
allowed: string[];
forbidden: string[];
requiresApproval: string[];
};
outputRules: {
format: 'markdown' | 'json' | 'plain';
maxLength?: number;
requiredSections?: string[];
};
safetyRules: {
contentFilters: string[];
escalationTriggers: string[];
uncertaintyHandling: string;
};
}
function buildSystemPrompt(config: SystemPromptConfig): string {
return `
You are ${config.identity.role}.
## Personality
${config.identity.personality}
## Expertise Areas
${config.identity.expertise.map(e => `- ${e}`).join('\n')}
## What You Can Do
${config.capabilities.allowed.map(c => `- ${c}`).join('\n')}
## What You Must Never Do
${config.capabilities.forbidden.map(c => `- ${c}`).join('\n')}
## Actions Requiring Human Approval
${config.capabilities.requiresApproval.map(c => `- ${c}`).join('\n')}
## Response Format
Format: ${config.outputRules.format}
${config.outputRules.maxLength ? `Maximum length: ${config.outputRules.maxLength} characters` : ''}
${config.outputRules.requiredSections ? `Required sections: ${config.outputRules.requiredSections.join(', ')}` : ''}
## Safety Guidelines
- Content to filter: ${config.safetyRules.contentFilters.join(', ')}
- Escalate when: ${config.safetyRules.escalationTriggers.join(', ')}
- When uncertain: ${config.safetyRules.uncertaintyHandling}
`.trim();
}
This structured approach gives you version control, A/B testing capability, and clear documentation of your AI's behavior contract.
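For example, a hypothetical support-agent configuration (all values here are illustrative):
const supportSystemPrompt = buildSystemPrompt({
  version: '2.3.0',
  identity: {
    role: 'a customer support assistant for a cloud hosting product',
    personality: 'Concise, friendly, and direct. Admits uncertainty rather than guessing.',
    expertise: ['account management', 'billing questions', 'API troubleshooting']
  },
  capabilities: {
    allowed: ['answer product questions', 'summarize documentation', 'draft replies for agents'],
    forbidden: ['issue refunds', 'quote legal advice', 'discuss other customers'],
    requiresApproval: ['account deletion', 'plan downgrades']
  },
  outputRules: { format: 'markdown', maxLength: 2000 },
  safetyRules: {
    contentFilters: ['profanity', 'personal data'],
    escalationTriggers: ['legal threats', 'mentions of a security breach'],
    uncertaintyHandling: 'Say you are not sure and offer to escalate to a human agent.'
  }
});
Because the configuration is data rather than an inline string, it can be reviewed, diffed, and rolled back like any other release artifact.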
Layer 2: Context Injection
Production systems need dynamic context. User data, session history, retrieved documents, current state - all injected at runtime.
interface RequestContext {
user: {
id: string;
tier: 'free' | 'pro' | 'enterprise';
preferences: Record<string, string>;
history: ConversationTurn[];
};
session: {
id: string;
startedAt: Date;
turnCount: number;
currentTask?: string;
};
retrieved: {
documents: RetrievedDocument[];
relevanceScores: number[];
};
constraints: {
maxTokens: number;
timeoutMs: number;
costBudget: number;
};
}
function buildContextBlock(ctx: RequestContext): string {
const historyWindow = ctx.user.history.slice(-5); // Last 5 turns
return `
## User Context
- User tier: ${ctx.user.tier}
- Preferences: ${JSON.stringify(ctx.user.preferences)}
## Conversation History
${historyWindow.map(t => `${t.role}: ${t.content}`).join('\n')}
## Relevant Information
${ctx.retrieved.documents.map((doc, i) => `
### Source ${i + 1} (relevance: ${(ctx.retrieved.relevanceScores[i] * 100).toFixed(0)}%)
${doc.content}
`).join('\n')}
## Constraints
- Keep response under ${Math.floor(ctx.constraints.maxTokens * 0.75)} tokens
- Respond within ${ctx.constraints.timeoutMs}ms budget
`.trim();
}
Layer 3: Task Instructions
The specific request, formatted for reliability:
interface TaskSpec {
action: string;
inputs: Record<string, unknown>;
outputSchema?: object;
examples?: Array<{ input: unknown; output: unknown }>;
constraints?: string[];
}
function buildTaskBlock(task: TaskSpec): string {
let block = `
## Your Task
${task.action}
## Inputs
${Object.entries(task.inputs).map(([k, v]) => `- ${k}: ${JSON.stringify(v)}`).join('\n')}
`;
if (task.outputSchema) {
block += `
## Required Output Format
\`\`\`json
${JSON.stringify(task.outputSchema, null, 2)}
\`\`\`
`;
}
if (task.examples && task.examples.length > 0) {
block += `
## Examples
${task.examples.map((ex, i) => `
Example ${i + 1}:
Input: ${JSON.stringify(ex.input)}
Output: ${JSON.stringify(ex.output)}
`).join('\n')}
`;
}
if (task.constraints) {
block += `
## Additional Constraints
${task.constraints.map(c => `- ${c}`).join('\n')}
`;
}
return block.trim();
}
Reliability Patterns
Production systems must handle failure gracefully. LLMs fail in ways traditional software does not.
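It helps to name those failure modes before reaching for a fix, because each one calls for a different recovery. The taxonomy below is a working assumption rather than an exhaustive list; z is the zod instance used elsewhere in this article:
type LLMFailureMode =
  | 'malformed_json'    // response was not parseable JSON
  | 'schema_mismatch'   // parsed, but failed schema validation
  | 'truncated'         // hit the max-token limit mid-response
  | 'refusal'           // the model declined to answer (usually detected from the response text)
  | 'provider_error';   // timeout, rate limit, or 5xx from the API
// Rough heuristics; tune these against your own error logs.
function classifyFailure(error: unknown, finishReason?: string): LLMFailureMode {
  if (finishReason === 'length') return 'truncated';
  if (error instanceof SyntaxError) return 'malformed_json';   // JSON.parse threw
  if (error instanceof z.ZodError) return 'schema_mismatch';   // schema.parse threw
  return 'provider_error';                                     // network, timeout, rate limit, etc.
}
A truncated response, for instance, usually needs a shorter prompt or a chunked task rather than a blind retry.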
Pattern 1: Retry with Reformulation
When a response fails validation, do not just retry - reformulate the prompt with feedback:
interface RetryConfig {
maxAttempts: number;
backoffMs: number[];
reformulationStrategies: ReformulationStrategy[];
}
type ReformulationStrategy =
| 'add_example'
| 'simplify_request'
| 'explicit_format'
| 'break_down_task';
async function reliableCompletion<T>(
prompt: string,
schema: z.ZodSchema<T>,
config: RetryConfig
): Promise<T> {
let lastError: Error | null = null;
let currentPrompt = prompt;
for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
try {
const response = await llm.complete({
prompt: currentPrompt,
responseFormat: 'json'
});
const parsed = JSON.parse(response);
return schema.parse(parsed);
} catch (error) {
lastError = error as Error;
// Wait with backoff
if (attempt < config.maxAttempts - 1) {
await sleep(config.backoffMs[attempt] || 1000);
// Reformulate based on error type
currentPrompt = reformulatePrompt(
prompt,
error,
config.reformulationStrategies[attempt]
);
console.log(`Retry ${attempt + 1}: reformulated with ${config.reformulationStrategies[attempt]}`);
}
}
}
throw new Error(`Failed after ${config.maxAttempts} attempts: ${lastError?.message}`);
}
function reformulatePrompt(
original: string,
error: unknown,
strategy: ReformulationStrategy
): string {
switch (strategy) {
case 'add_example':
return `${original}
IMPORTANT: Your previous response was invalid. Here's an example of a correct response:
${getExampleForSchema(error)}
Please try again, following the exact format shown.`;
case 'simplify_request':
return `${original}
NOTE: Please provide a simple, direct response. Avoid complex structures or explanations.`;
case 'explicit_format':
return `${original}
CRITICAL: You MUST respond with valid JSON only. No markdown, no explanation, no code blocks.
Start your response with { and end with }.`;
case 'break_down_task':
return `Let's break this down step by step.
First, I'll ask you to do just one part:
${extractFirstSubtask(original)}`;
default:
return original;
}
}
Pattern 2: Fallback Chains
When the primary model fails, fall back to alternatives:
interface ModelConfig {
name: string;
provider: 'openai' | 'anthropic' | 'google' | 'local';
costPer1kTokens: number;
latencyP50Ms: number;
capabilities: string[];
}
const modelChain: ModelConfig[] = [
{ name: 'gpt-4o', provider: 'openai', costPer1kTokens: 0.005, latencyP50Ms: 800, capabilities: ['reasoning', 'code', 'analysis'] },
{ name: 'claude-3-5-sonnet', provider: 'anthropic', costPer1kTokens: 0.003, latencyP50Ms: 600, capabilities: ['reasoning', 'writing', 'analysis'] },
{ name: 'gpt-4o-mini', provider: 'openai', costPer1kTokens: 0.00015, latencyP50Ms: 300, capabilities: ['general'] },
{ name: 'local-llama', provider: 'local', costPer1kTokens: 0, latencyP50Ms: 200, capabilities: ['general'] }
];
async function completionWithFallback(
prompt: string,
requiredCapabilities: string[],
options: { timeout: number; maxCost: number }
): Promise<{ response: string; model: string; cost: number }> {
const eligibleModels = modelChain.filter(m =>
requiredCapabilities.every(cap => m.capabilities.includes(cap)) &&
m.costPer1kTokens <= options.maxCost
);
for (const model of eligibleModels) {
try {
const response = await Promise.race([
callModel(model, prompt),
timeout(options.timeout)
]);
return {
response: response.text,
model: model.name,
cost: response.tokensUsed * model.costPer1kTokens / 1000
};
} catch (error) {
console.warn(`Model ${model.name} failed: ${error}`);
continue;
}
}
throw new Error('All models in fallback chain failed');
}
Pattern 3: Circuit Breaker
Protect your system when LLM providers have issues:
class LLMCircuitBreaker {
private failures: number = 0;
private lastFailure: Date | null = null;
private state: 'closed' | 'open' | 'half-open' = 'closed';
constructor(
private threshold: number = 5,
private resetTimeMs: number = 30000
) {}
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'open') {
if (Date.now() - (this.lastFailure?.getTime() || 0) > this.resetTimeMs) {
this.state = 'half-open';
} else {
throw new Error('Circuit breaker is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'closed';
}
private onFailure() {
this.failures++;
this.lastFailure = new Date();
if (this.failures >= this.threshold) {
this.state = 'open';
console.error(`Circuit breaker opened after ${this.failures} failures`);
}
}
}
// Usage
const breaker = new LLMCircuitBreaker(5, 30000);
async function safeCompletion(prompt: string): Promise<string> {
return breaker.call(() => llm.complete({ prompt }));
}
Cost Optimization Strategies
LLM costs can explode quickly. Here are patterns that saved us hundreds of thousands of dollars.
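A rough cost model makes the stakes concrete before any optimization; the traffic and price figures below are illustrative only, and real providers price input and output tokens separately (collapsed here for simplicity):
// Monthly cost ~ requests/day * tokens/request * price per 1k tokens / 1000 * 30 days.
function estimateMonthlyCost(
  requestsPerDay: number,
  avgPromptTokens: number,
  avgCompletionTokens: number,
  costPer1kTokens: number
): number {
  const tokensPerRequest = avgPromptTokens + avgCompletionTokens;
  return (requestsPerDay * tokensPerRequest * costPer1kTokens / 1000) * 30;
}
// 100,000 requests/day at 1,500 prompt + 500 completion tokens and $0.005 per 1k tokens
// comes to roughly $30,000 per month.
estimateMonthlyCost(100_000, 1_500, 500, 0.005); // ~30000
At that scale, even a 20% reduction in tokens or a partial shift to a cheaper model is real money, which is what the three strategies below target.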
Strategy 1: Prompt Compression
Reduce token count without losing meaning:
interface CompressionResult {
compressed: string;
originalTokens: number;
compressedTokens: number;
compressionRatio: number;
}
async function compressPrompt(prompt: string): Promise<CompressionResult> {
const originalTokens = countTokens(prompt);
// Strategy 1: Remove redundant whitespace
let compressed = prompt.replace(/\n{3,}/g, '\n\n').trim();
// Strategy 2: Abbreviate common phrases (carefully!)
const abbreviations: Record<string, string> = {
'for example': 'e.g.',
'that is': 'i.e.',
'and so on': 'etc.',
'as soon as possible': 'ASAP',
};
for (const [full, abbrev] of Object.entries(abbreviations)) {
compressed = compressed.replace(new RegExp(full, 'gi'), abbrev);
}
// Strategy 3: Use LLM to compress context (for long documents)
if (originalTokens > 2000) {
compressed = await llm.complete({
prompt: `Compress this text to 30% of its length while preserving all key information:
${compressed}
Compressed version:`,
maxTokens: Math.floor(originalTokens * 0.35)
});
}
const compressedTokens = countTokens(compressed);
return {
compressed,
originalTokens,
compressedTokens,
compressionRatio: compressedTokens / originalTokens
};
}
Strategy 2: Tiered Model Selection
Use expensive models only when necessary:
interface TaskComplexity {
score: number; // 0-1
factors: {
requiresReasoning: boolean;
requiresCreativity: boolean;
requiresAccuracy: boolean;
contextLength: number;
outputComplexity: 'simple' | 'structured' | 'complex';
};
}
function assessComplexity(task: string, context: string): TaskComplexity {
const factors = {
requiresReasoning: /analyze|compare|evaluate|decide|reason/i.test(task),
requiresCreativity: /create|write|generate|design|imagine/i.test(task),
requiresAccuracy: /calculate|extract|precise|exact|legal|medical/i.test(task),
contextLength: countTokens(context),
outputComplexity: detectOutputComplexity(task)
};
let score = 0;
if (factors.requiresReasoning) score += 0.3;
if (factors.requiresCreativity) score += 0.2;
if (factors.requiresAccuracy) score += 0.3;
if (factors.contextLength > 4000) score += 0.1;
if (factors.outputComplexity === 'complex') score += 0.1;
return { score: Math.min(score, 1), factors };
}
function selectModel(complexity: TaskComplexity): string {
if (complexity.score > 0.7) return 'gpt-4o';
if (complexity.score > 0.4) return 'gpt-4o-mini';
return 'gpt-3.5-turbo';
}
Strategy 3: Semantic Caching
Cache responses for similar queries:
interface CacheEntry {
prompt: string;
promptEmbedding: number[];
response: string;
createdAt: Date;
hitCount: number;
}
class SemanticCache {
private cache: CacheEntry[] = [];
private similarityThreshold = 0.95;
async get(prompt: string): Promise<string | null> {
const embedding = await embed(prompt);
for (const entry of this.cache) {
const similarity = cosineSimilarity(embedding, entry.promptEmbedding);
if (similarity >= this.similarityThreshold) {
entry.hitCount++;
console.log(`Cache hit (similarity: ${similarity.toFixed(3)})`);
return entry.response;
}
}
return null;
}
async set(prompt: string, response: string): Promise<void> {
const embedding = await embed(prompt);
this.cache.push({
prompt,
promptEmbedding: embedding,
response,
createdAt: new Date(),
hitCount: 0
});
// Evict old entries if cache too large
if (this.cache.length > 10000) {
this.evictLeastUsed();
}
}
private evictLeastUsed(): void {
this.cache.sort((a, b) => b.hitCount - a.hitCount);
this.cache = this.cache.slice(0, 5000);
}
}
// Usage
const cache = new SemanticCache();
async function cachedCompletion(prompt: string): Promise<string> {
const cached = await cache.get(prompt);
if (cached) return cached;
const response = await llm.complete({ prompt });
await cache.set(prompt, response);
return response;
}
Security Patterns
LLMs introduce novel security risks. These patterns protect against them.
Pattern 1: Input Sanitization
interface SanitizationResult {
sanitized: string;
flagged: boolean;
flags: string[];
}
function sanitizeUserInput(input: string): SanitizationResult {
const flags: string[] = [];
let sanitized = input;
// Detect prompt injection attempts
const injectionPatterns = [
/ignore (all |previous |above )?instructions/i,
/disregard (all |previous |above )?instructions/i,
/forget (all |previous |above )?instructions/i,
/new instructions:/i,
/system prompt/i,
/you are now/i,
/pretend (to be|you're)/i,
/roleplay as/i,
/<\/?system>/i,
/<\/?user>/i,
/<\/?assistant>/i
];
for (const pattern of injectionPatterns) {
if (pattern.test(input)) {
flags.push(`Potential injection: ${pattern.source}`);
}
}
// Remove or escape special delimiters
sanitized = sanitized
.replace(/\`\`\`/g, "'''")
.replace(/<system>/gi, '[system]')
.replace(/<\/system>/gi, '[/system]');
// Truncate extremely long inputs
if (sanitized.length > 10000) {
sanitized = sanitized.slice(0, 10000) + '... [truncated]';
flags.push('Input truncated due to length');
}
return {
sanitized,
flagged: flags.length > 0,
flags
};
}
Pattern 2: Output Validation
interface OutputValidation {
valid: boolean;
issues: string[];
sanitizedOutput: string;
}
function validateOutput(output: string, context: OutputContext): OutputValidation {
const issues: string[] = [];
let sanitized = output;
// Check for leaked system prompt
if (context.systemPrompt) {
const systemWords = context.systemPrompt.split(/\s+/).filter(w => w.length > 10);
const matchCount = systemWords.filter(w => output.includes(w)).length;
if (matchCount > systemWords.length * 0.3) {
issues.push('Possible system prompt leak detected');
}
}
// Check for PII in output (that wasn't in input)
const piiPatterns = {
ssn: /\b\d{3}-\d{2}-\d{4}\b/,
creditCard: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/,
email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/
};
for (const [type, pattern] of Object.entries(piiPatterns)) {
const outputMatches = output.match(pattern) || [];
const inputMatches = context.userInput.match(pattern) || [];
const newPII = outputMatches.filter(m => !inputMatches.includes(m));
if (newPII.length > 0) {
issues.push(`Potentially hallucinated ${type}: ${newPII.length} instance(s)`);
for (const pii of newPII) {
sanitized = sanitized.replace(pii, '[REDACTED]');
}
}
}
// Check for harmful content
const harmfulPatterns = [
/how to (make|build|create) (a )?bomb/i,
/hack into/i,
/steal (someone's|personal)/i
];
for (const pattern of harmfulPatterns) {
if (pattern.test(output)) {
issues.push(`Harmful content detected: ${pattern.source}`);
}
}
return {
valid: issues.length === 0,
issues,
sanitizedOutput: sanitized
};
}
Pattern 3: Rate Limiting per User
interface RateLimitConfig {
requestsPerMinute: number;
tokensPerDay: number;
costPerDay: number;
}
const tierLimits: Record<string, RateLimitConfig> = {
free: { requestsPerMinute: 10, tokensPerDay: 10000, costPerDay: 0.10 },
pro: { requestsPerMinute: 60, tokensPerDay: 100000, costPerDay: 5.00 },
enterprise: { requestsPerMinute: 300, tokensPerDay: 1000000, costPerDay: 100.00 }
};
class UserRateLimiter {
private usage: Map<string, {
requests: { timestamp: number }[];
tokensToday: number;
costToday: number;
dayStart: number;
}> = new Map();
check(userId: string, tier: string, estimatedTokens: number): {
allowed: boolean;
reason?: string;
retryAfterMs?: number;
} {
const limits = tierLimits[tier] || tierLimits.free;
const now = Date.now();
const dayStart = new Date().setHours(0, 0, 0, 0);
let usage = this.usage.get(userId);
if (!usage || usage.dayStart !== dayStart) {
usage = { requests: [], tokensToday: 0, costToday: 0, dayStart };
this.usage.set(userId, usage);
}
// Check requests per minute
const recentRequests = usage.requests.filter(r => now - r.timestamp < 60000);
if (recentRequests.length >= limits.requestsPerMinute) {
const oldestRecent = recentRequests[0].timestamp;
return {
allowed: false,
reason: 'Rate limit exceeded',
retryAfterMs: 60000 - (now - oldestRecent)
};
}
// Check daily token limit
if (usage.tokensToday + estimatedTokens > limits.tokensPerDay) {
return {
allowed: false,
reason: 'Daily token limit exceeded',
retryAfterMs: dayStart + 86400000 - now
};
}
return { allowed: true };
}
record(userId: string, tokensUsed: number, cost: number): void {
const usage = this.usage.get(userId);
if (usage) {
usage.requests.push({ timestamp: Date.now() });
usage.tokensToday += tokensUsed;
usage.costToday += cost;
}
}
}
Observability and Debugging
You cannot improve what you cannot measure. Production AI systems need deep observability.
Pattern 1: Structured Logging
interface LLMCallLog {
id: string;
timestamp: Date;
// Request
model: string;
promptTokens: number;
promptHash: string; // For debugging without storing full prompts
temperature: number;
// Response
completionTokens: number;
latencyMs: number;
finishReason: 'stop' | 'length' | 'content_filter' | 'error';
// Context
userId?: string;
sessionId?: string;
taskType: string;
// Quality signals
validationPassed: boolean;
retryCount: number;
cachedResponse: boolean;
// Cost
estimatedCost: number;
}
class LLMLogger {
private logs: LLMCallLog[] = [];
log(entry: Omit<LLMCallLog, 'id' | 'timestamp'>): void {
const fullEntry: LLMCallLog = {
...entry,
id: crypto.randomUUID(),
timestamp: new Date()
};
this.logs.push(fullEntry);
// Send to observability platform
this.sendToDatadog(fullEntry);
// Alert on anomalies
this.checkAlerts(fullEntry);
}
private checkAlerts(entry: LLMCallLog): void {
if (entry.latencyMs > 10000) {
this.alert('high_latency', entry);
}
if (entry.retryCount > 2) {
this.alert('high_retry', entry);
}
if (entry.estimatedCost > 1.00) {
this.alert('high_cost', entry);
}
if (!entry.validationPassed) {
this.alert('validation_failed', entry);
}
}
private alert(type: string, entry: LLMCallLog): void {
console.warn(`[ALERT:${type}] ${JSON.stringify(entry)}`);
}
private sendToDatadog(entry: LLMCallLog): void {
// Implementation for your observability platform
}
}
Pattern 2: Prompt Versioning and A/B Testing
interface PromptVersion {
id: string;
version: string;
prompt: string;
createdAt: Date;
metrics: {
invocations: number;
successRate: number;
avgLatencyMs: number;
avgCost: number;
userRating?: number;
};
}
class PromptExperiment {
private variants: Map<string, PromptVersion> = new Map();
private assignments: Map<string, string> = new Map(); // userId -> variantId
addVariant(variant: PromptVersion): void {
this.variants.set(variant.id, variant);
}
getPrompt(userId: string): { prompt: string; variantId: string } {
// Sticky assignment for consistent user experience
let variantId = this.assignments.get(userId);
if (!variantId) {
// Assign based on user hash for reproducibility
const hash = this.hashUserId(userId);
const variantIds = Array.from(this.variants.keys());
variantId = variantIds[hash % variantIds.length];
this.assignments.set(userId, variantId);
}
const variant = this.variants.get(variantId)!;
variant.metrics.invocations++;
return { prompt: variant.prompt, variantId };
}
recordOutcome(variantId: string, success: boolean, latencyMs: number, cost: number): void {
const variant = this.variants.get(variantId);
if (variant) {
const m = variant.metrics;
m.successRate = (m.successRate * (m.invocations - 1) + (success ? 1 : 0)) / m.invocations;
m.avgLatencyMs = (m.avgLatencyMs * (m.invocations - 1) + latencyMs) / m.invocations;
m.avgCost = (m.avgCost * (m.invocations - 1) + cost) / m.invocations;
}
}
getWinner(): PromptVersion | null {
const variants = Array.from(this.variants.values())
.filter(v => v.metrics.invocations >= 100); // Minimum sample size
if (variants.length < 2) return null;
// Simple winner: highest success rate with acceptable latency
variants.sort((a, b) => {
if (b.metrics.successRate !== a.metrics.successRate) {
return b.metrics.successRate - a.metrics.successRate;
}
return a.metrics.avgLatencyMs - b.metrics.avgLatencyMs;
});
return variants[0];
}
private hashUserId(userId: string): number {
let hash = 0;
for (let i = 0; i < userId.length; i++) {
hash = ((hash << 5) - hash) + userId.charCodeAt(i);
hash = hash & hash;
}
return Math.abs(hash);
}
}
Real-World Case Studies
Case Study 1: Customer Support Automation
We built an AI support agent handling 50,000 tickets per month. The journey from demo to production taught us:
Initial Problem: The demo worked beautifully with test queries, but production users asked questions in ways we never anticipated.
Solution: We built a query normalization layer that rephrased user questions before sending to the LLM:
async function normalizeQuery(rawQuery: string): Promise<string> {
// Handle common patterns that confused the main model
const normalized = await llm.complete({
prompt: `Rewrite this customer support query to be clear and specific.
Original: "${rawQuery}"
Rules:
- Fix typos and grammar
- Expand abbreviations
- If the query is vague, make reasonable assumptions and note them
- Keep the same intent and urgency
Rewritten query:`,
maxTokens: 200,
temperature: 0.3
});
return normalized;
}
Result: Query understanding improved from 72% to 94%, and customers reported higher satisfaction with responses.
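In the final pipeline the normalization step simply sits in front of the main answering call; answerTicket here stands in for whatever downstream completion you use:
async function handleTicket(rawQuery: string): Promise<string> {
  const normalized = await normalizeQuery(rawQuery);  // cheap, low-temperature rewrite
  return answerTicket(normalized);                    // the main model only sees the cleaned query
}
The extra call adds a little latency and cost, but far less than a misunderstood query costs downstream.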
Case Study 2: Code Review Assistant
We deployed an AI code reviewer integrated with GitHub PRs. Key lessons:
Problem 1: Large diffs overflowed the context window.
Solution: Intelligent chunking by file, with cross-file context summary:
async function reviewLargePR(files: FileChange[]): Promise<Review[]> {
// First pass: generate a summary of the overall change
const summary = await summarizeChanges(files);
// Review each file with the summary as context
const reviews = await Promise.all(
files.map(file => reviewFileWithContext(file, summary))
);
// Final pass: check for cross-file issues
const crossFileIssues = await checkCrossFileConsistency(files, reviews);
return [...reviews, ...crossFileIssues];
}
Problem 2: The AI flagged too many false positives, annoying developers.
Solution: Confidence scoring with threshold:
interface ReviewComment {
line: number;
message: string;
severity: 'critical' | 'warning' | 'suggestion';
confidence: number; // 0-1
}
function filterReviewComments(
comments: ReviewComment[],
minConfidence: number = 0.7
): ReviewComment[] {
return comments.filter(c => {
// Always show critical issues
if (c.severity === 'critical') return c.confidence > 0.5;
// Higher threshold for warnings and suggestions
return c.confidence >= minConfidence;
});
}
Result: The false positive rate dropped from 35% to 8%, and developer adoption increased from 40% to 85%.
The Production Checklist
Before deploying any LLM feature, verify:
Reliability
Cost Control
Security
Observability
Testing
FAQ
Q: How do I test prompt changes without breaking production?
Shadow mode: run new prompts alongside production, compare outputs, but only serve the production version. Graduate to canary (5% traffic) once confidence is high.
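A minimal shadow-mode sketch; buildPrompt, compareOutputs, and logShadowComparison are hypothetical helpers:
async function completeWithShadow(userInput: string): Promise<string> {
  const production = await llm.complete({ prompt: buildPrompt('production', userInput) });
  // Fire-and-forget: the candidate must never block or alter the user-facing response.
  void (async () => {
    try {
      const candidate = await llm.complete({ prompt: buildPrompt('candidate', userInput) });
      logShadowComparison({
        input: userInput,
        production,
        candidate,
        score: await compareOutputs(production, candidate)
      });
    } catch (error) {
      console.warn('Shadow evaluation failed', error);
    }
  })();
  return production;
}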
Q: What's a reasonable latency target for LLM responses?
For interactive use: p50 < 2s, p95 < 5s. For background processing: whatever your users tolerate. Streaming helps perception - first token in < 500ms feels fast even if completion takes longer.
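If you stream, measure time to first token directly; this sketch assumes a streaming client that yields text chunks (llm.streamComplete is hypothetical):
async function measureTimeToFirstToken(prompt: string): Promise<{ text: string; ttftMs: number }> {
  const start = Date.now();
  let ttftMs = -1;
  let text = '';
  for await (const chunk of llm.streamComplete({ prompt })) {
    if (ttftMs < 0) ttftMs = Date.now() - start;  // first visible output
    text += chunk;
  }
  return { text, ttftMs };
}
Track ttftMs as its own percentile alongside total latency; the two often move independently.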
Q: How do I handle prompt injection attacks?
Defense in depth: input sanitization, output validation, separate system/user message roles, and monitor for anomalies. No single technique is sufficient.
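Putting the earlier patterns together, a guarded request path might look like the sketch below. sanitizeUserInput and validateOutput are the functions from the security section; the chat-style llm.chat call with separate roles is an assumption about your client:
async function guardedCompletion(userInput: string, systemPrompt: string): Promise<string> {
  const { sanitized, flagged, flags } = sanitizeUserInput(userInput);
  if (flagged) console.warn('Input flags:', flags);  // monitor; block only on high-severity flags
  const output = await llm.chat({
    messages: [
      { role: 'system', content: systemPrompt },  // keep system and user content in separate roles
      { role: 'user', content: sanitized }
    ]
  });
  const check = validateOutput(output, { systemPrompt, userInput: sanitized });
  if (!check.valid) console.warn('Output issues:', check.issues);
  return check.sanitizedOutput;  // any PII found in the output is already redacted here
}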
Q: Should I use few-shot examples or fine-tuning?
Start with few-shot - it's faster to iterate. Fine-tune when: you have 1000+ high-quality examples, few-shot isn't reaching quality targets, or you need to reduce prompt size for cost/latency.
Q: How often should I review and update prompts?
Monthly minimum. Set up alerts for quality degradation. Model updates can change behavior - revalidate after provider announcements.
---
Production prompt engineering isn't about writing clever prompts. It's about building robust systems that work reliably at scale, within budget, while protecting users and the business.
The patterns here emerged from real failures. Each retry mechanism exists because requests failed. Each cost optimization exists because bills exceeded budget. Each security pattern exists because attacks happened.
Start with the basics: structured prompts, validation, logging. Add complexity only when needed. Measure everything. Iterate based on data.
Build systems, not demos. Your users - and your on-call rotation - will thank you.