AI/ML · January 10, 2026

Prompt Engineering for Production: Battle-Tested Patterns from the Trenches

Real-world prompt engineering strategies that survived millions of API calls. Learn the patterns, anti-patterns, and hard-won lessons from building production AI systems.

Dev Team

35 min read

#prompt-engineering #production #llm #enterprise #best-practices #reliability

The Demo That Launched a Thousand Bug Reports

The demo went perfectly. Stakeholders applauded. "Ship it," they said.

Within 72 hours:

  • User #1 typed "asdfghjkl" and got a 500-word essay about keyboard layouts
  • User #2 discovered they could make your AI write their college essays
  • User #3's request took 47 seconds and timed out
  • User #4's request cost $2.50 (your budget was $0.02/request)
  • User #5 got medical advice that made Legal very, very nervous
  • "It worked in the playground," you tell the incident channel. It did. That's the problem.

    Everyone can make a prompt work in a demo. The gap between that and production - where real users do unexpected things at scale - is where most AI projects fail.

    This article covers the patterns that survived millions of production requests. Every lesson here was learned the hard way: through outages, cost overruns, user complaints, and late-night debugging sessions.

    The Production Prompt Stack

    Production prompts are not single strings. They are layered systems with multiple components working together.

    Layer 1: The System Foundation

    Your system prompt is infrastructure. It runs on every request, shapes every response, and is the hardest to change once users depend on its behavior.

    TypeScript
    interface SystemPromptConfig {
      version: string;
      identity: {
        role: string;
        personality: string;
        expertise: string[];
      };
      capabilities: {
        allowed: string[];
        forbidden: string[];
        requiresApproval: string[];
      };
      outputRules: {
        format: 'markdown' | 'json' | 'plain';
        maxLength?: number;
        requiredSections?: string[];
      };
      safetyRules: {
        contentFilters: string[];
        escalationTriggers: string[];
        uncertaintyHandling: string;
      };
    }
    
    function buildSystemPrompt(config: SystemPromptConfig): string {
      return `
    You are ${config.identity.role}.
    
    ## Personality
    ${config.identity.personality}
    
    ## Expertise Areas
    ${config.identity.expertise.map(e => `- ${e}`).join('\n')}
    
    ## What You Can Do
    ${config.capabilities.allowed.map(c => `- ${c}`).join('\n')}
    
    ## What You Must Never Do
    ${config.capabilities.forbidden.map(c => `- ${c}`).join('\n')}
    
    ## Actions Requiring Human Approval
    ${config.capabilities.requiresApproval.map(c => `- ${c}`).join('\n')}
    
    ## Response Format
    Format: ${config.outputRules.format}
    ${config.outputRules.maxLength ? `Maximum length: ${config.outputRules.maxLength} characters` : ''}
    ${config.outputRules.requiredSections ? `Required sections: ${config.outputRules.requiredSections.join(', ')}` : ''}
    
    ## Safety Guidelines
    - Content to filter: ${config.safetyRules.contentFilters.join(', ')}
    - Escalate when: ${config.safetyRules.escalationTriggers.join(', ')}
    - When uncertain: ${config.safetyRules.uncertaintyHandling}
    `.trim();
    }

    This structured approach gives you version control, A/B testing capability, and clear documentation of your AI's behavior contract.
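
    For illustration, a filled-in config might look like the sketch below; every value here is a hypothetical placeholder, not something pulled from a real deployment.

    TypeScript
    // Hypothetical example config for a support assistant (all values are placeholders)
    const supportAgentConfig: SystemPromptConfig = {
      version: '2.3.0',
      identity: {
        role: 'a customer support assistant for Acme Cloud',
        personality: 'Concise, friendly, and direct. Admits uncertainty instead of guessing.',
        expertise: ['billing questions', 'account management', 'basic troubleshooting']
      },
      capabilities: {
        allowed: ['answer product questions', 'summarize account status', 'draft support replies'],
        forbidden: ['give legal or medical advice', 'quote prices not present in the provided context'],
        requiresApproval: ['issue refunds', 'close accounts']
      },
      outputRules: {
        format: 'markdown',
        maxLength: 2000,
        requiredSections: ['Answer', 'Next steps']
      },
      safetyRules: {
        contentFilters: ['profanity', 'requests for personal data'],
        escalationTriggers: ['threats of legal action', 'mentions of self-harm'],
        uncertaintyHandling: 'Say you are not sure and offer to hand off to a human agent.'
      }
    };

    const systemPrompt = buildSystemPrompt(supportAgentConfig);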

    Layer 2: Context Injection

    Production systems need dynamic context. User data, session history, retrieved documents, current state - all injected at runtime.

    TypeScript
    interface RequestContext {
      user: {
        id: string;
        tier: 'free' | 'pro' | 'enterprise';
        preferences: Record<string, string>;
        history: ConversationTurn[];
      };
      session: {
        id: string;
        startedAt: Date;
        turnCount: number;
        currentTask?: string;
      };
      retrieved: {
        documents: RetrievedDocument[];
        relevanceScores: number[];
      };
      constraints: {
        maxTokens: number;
        timeoutMs: number;
        costBudget: number;
      };
    }
    
    function buildContextBlock(ctx: RequestContext): string {
      const historyWindow = ctx.user.history.slice(-5); // Last 5 turns
      
      return `
    ## User Context
    - User tier: ${ctx.user.tier}
    - Preferences: ${JSON.stringify(ctx.user.preferences)}
    
    ## Conversation History
    ${historyWindow.map(t => `${t.role}: ${t.content}`).join('\n')}
    
    ## Relevant Information
    ${ctx.retrieved.documents.map((doc, i) => `
    ### Source ${i + 1} (relevance: ${(ctx.retrieved.relevanceScores[i] * 100).toFixed(0)}%)
    ${doc.content}
    `).join('\n')}
    
    ## Constraints
    - Keep response under ${Math.floor(ctx.constraints.maxTokens * 0.75)} tokens
    - Respond within ${ctx.constraints.timeoutMs}ms budget
    `.trim();
    }

    Layer 3: Task Instructions

    The specific request, formatted for reliability:

    TypeScript
    interface TaskSpec {
      action: string;
      inputs: Record<string, unknown>;
      outputSchema?: object;
      examples?: Array<{ input: unknown; output: unknown }>;
      constraints?: string[];
    }
    
    function buildTaskBlock(task: TaskSpec): string {
      let block = `
    ## Your Task
    ${task.action}
    
    ## Inputs
    ${Object.entries(task.inputs).map(([k, v]) => `- ${k}: ${JSON.stringify(v)}`).join('\n')}
    `;
    
      if (task.outputSchema) {
        block += `
    ## Required Output Format
    \`\`\`json
    ${JSON.stringify(task.outputSchema, null, 2)}
    \`\`\`
    `;
      }
    
      if (task.examples && task.examples.length > 0) {
        block += `
    ## Examples
    ${task.examples.map((ex, i) => `
    Example ${i + 1}:
    Input: ${JSON.stringify(ex.input)}
    Output: ${JSON.stringify(ex.output)}
    `).join('\n')}
    `;
      }
    
      if (task.constraints) {
        block += `
    ## Additional Constraints
    ${task.constraints.map(c => `- ${c}`).join('\n')}
    `;
      }
    
      return block.trim();
    }
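
    Tying the layers together: each request assembles all three blocks before calling the model. This is a sketch; `llm.complete` is the same hypothetical client used throughout this article.

    TypeScript
    // Sketch: assemble the three layers into one request.
    async function runLayeredPrompt(
      config: SystemPromptConfig,
      ctx: RequestContext,
      task: TaskSpec
    ): Promise<string> {
      const systemPrompt = buildSystemPrompt(config); // Layer 1: stable foundation
      const contextBlock = buildContextBlock(ctx);    // Layer 2: runtime context
      const taskBlock = buildTaskBlock(task);         // Layer 3: the specific request

      return llm.complete({
        prompt: `${systemPrompt}\n\n${contextBlock}\n\n${taskBlock}`,
        maxTokens: ctx.constraints.maxTokens
      });
    }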

    Reliability Patterns

    Production systems must handle failure gracefully. LLMs fail in ways traditional software does not.

    Pattern 1: Retry with Reformulation

    When a response fails validation, do not just retry - reformulate the prompt with feedback:

    TypeScript
    import { z } from 'zod'; // runtime schema validation used by reliableCompletion below

    interface RetryConfig {
      maxAttempts: number;
      backoffMs: number[];
      reformulationStrategies: ReformulationStrategy[];
    }
    
    type ReformulationStrategy = 
      | 'add_example'
      | 'simplify_request'
      | 'explicit_format'
      | 'break_down_task';
    
    async function reliableCompletion<T>(
      prompt: string,
      schema: z.ZodSchema<T>,
      config: RetryConfig
    ): Promise<T> {
      let lastError: Error | null = null;
      let currentPrompt = prompt;
    
      for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
        try {
          const response = await llm.complete({
            prompt: currentPrompt,
            responseFormat: 'json'
          });
    
          const parsed = JSON.parse(response);
          return schema.parse(parsed);
        } catch (error) {
          lastError = error as Error;
          
          // Wait with backoff
          if (attempt < config.maxAttempts - 1) {
            await sleep(config.backoffMs[attempt] || 1000);
            
            // Reformulate based on error type
            currentPrompt = reformulatePrompt(
              prompt, 
              error, 
              config.reformulationStrategies[attempt]
            );
            
            console.log(`Retry ${attempt + 1}: reformulated with ${config.reformulationStrategies[attempt]}`);
          }
        }
      }
    
      throw new Error(`Failed after ${config.maxAttempts} attempts: ${lastError?.message}`);
    }
    
    function reformulatePrompt(
      original: string, 
      error: unknown, 
      strategy: ReformulationStrategy
    ): string {
      switch (strategy) {
        case 'add_example':
          return `${original}
    
    IMPORTANT: Your previous response was invalid. Here's an example of a correct response:
    ${getExampleForSchema(error)}
    
    Please try again, following the exact format shown.`;
    
        case 'simplify_request':
          return `${original}
    
    NOTE: Please provide a simple, direct response. Avoid complex structures or explanations.`;
    
        case 'explicit_format':
          return `${original}
    
    CRITICAL: You MUST respond with valid JSON only. No markdown, no explanation, no code blocks. 
    Start your response with { and end with }.`;
    
        case 'break_down_task':
          return `Let's break this down step by step.
    
    First, I'll ask you to do just one part:
    
    ${extractFirstSubtask(original)}`;
    
        default:
          return original;
      }
    }

    Pattern 2: Fallback Chains

    When the primary model fails, fall back to alternatives:

    TypeScript
    interface ModelConfig {
      name: string;
      provider: 'openai' | 'anthropic' | 'google' | 'local';
      costPer1kTokens: number;
      latencyP50Ms: number;
      capabilities: string[];
    }
    
    const modelChain: ModelConfig[] = [
      { name: 'gpt-4o', provider: 'openai', costPer1kTokens: 0.005, latencyP50Ms: 800, capabilities: ['reasoning', 'code', 'analysis'] },
      { name: 'claude-3-5-sonnet', provider: 'anthropic', costPer1kTokens: 0.003, latencyP50Ms: 600, capabilities: ['reasoning', 'writing', 'analysis'] },
      { name: 'gpt-4o-mini', provider: 'openai', costPer1kTokens: 0.00015, latencyP50Ms: 300, capabilities: ['general'] },
      { name: 'local-llama', provider: 'local', costPer1kTokens: 0, latencyP50Ms: 200, capabilities: ['general'] }
    ];
    
    async function completionWithFallback(
      prompt: string,
      requiredCapabilities: string[],
      options: { timeout: number; maxCost: number }
    ): Promise<{ response: string; model: string; cost: number }> {
      const eligibleModels = modelChain.filter(m =>
        requiredCapabilities.every(cap => m.capabilities.includes(cap)) &&
        m.costPer1kTokens <= options.maxCost
      );
    
      for (const model of eligibleModels) {
        try {
          const response = await Promise.race([
            callModel(model, prompt),
            timeout(options.timeout)
          ]);
    
          return {
            response: response.text,
            model: model.name,
            cost: response.tokensUsed * model.costPer1kTokens / 1000
          };
        } catch (error) {
          console.warn(`Model ${model.name} failed: ${error}`);
          continue;
        }
      }
    
      throw new Error('All models in fallback chain failed');
    }

    Pattern 3: Circuit Breaker

    Protect your system when LLM providers have issues:

    TypeScript
    class LLMCircuitBreaker {
      private failures: number = 0;
      private lastFailure: Date | null = null;
      private state: 'closed' | 'open' | 'half-open' = 'closed';
      
      constructor(
        private threshold: number = 5,
        private resetTimeMs: number = 30000
      ) {}
    
      async call<T>(fn: () => Promise<T>): Promise<T> {
        if (this.state === 'open') {
          if (Date.now() - (this.lastFailure?.getTime() || 0) > this.resetTimeMs) {
            this.state = 'half-open';
          } else {
            throw new Error('Circuit breaker is open');
          }
        }
    
        try {
          const result = await fn();
          this.onSuccess();
          return result;
        } catch (error) {
          this.onFailure();
          throw error;
        }
      }
    
      private onSuccess() {
        this.failures = 0;
        this.state = 'closed';
      }
    
      private onFailure() {
        this.failures++;
        this.lastFailure = new Date();
        
        if (this.failures >= this.threshold) {
          this.state = 'open';
          console.error(`Circuit breaker opened after ${this.failures} failures`);
        }
      }
    }
    
    // Usage
    const breaker = new LLMCircuitBreaker(5, 30000);
    
    async function safeCompletion(prompt: string): Promise<string> {
      return breaker.call(() => llm.complete({ prompt }));
    }

    Cost Optimization Strategies

    LLM costs can explode quickly. Here are patterns that saved us hundreds of thousands of dollars.

    Strategy 1: Prompt Compression

    Reduce token count without losing meaning:

    TypeScript
    interface CompressionResult {
      compressed: string;
      originalTokens: number;
      compressedTokens: number;
      compressionRatio: number;
    }
    
    async function compressPrompt(prompt: string): Promise<CompressionResult> {
      const originalTokens = countTokens(prompt);
      
      // Strategy 1: Remove redundant whitespace
      let compressed = prompt.replace(/\n{3,}/g, '\n\n').trim();
      
      // Strategy 2: Abbreviate common phrases (carefully!)
      const abbreviations: Record<string, string> = {
        'for example': 'e.g.',
        'that is': 'i.e.',
        'and so on': 'etc.',
        'as soon as possible': 'ASAP',
      };
      
      for (const [full, abbrev] of Object.entries(abbreviations)) {
        compressed = compressed.replace(new RegExp(full, 'gi'), abbrev);
      }
      
      // Strategy 3: Use LLM to compress context (for long documents)
      if (originalTokens > 2000) {
        compressed = await llm.complete({
          prompt: `Compress this text to 30% of its length while preserving all key information:
    
    ${compressed}
    
    Compressed version:`,
          maxTokens: Math.floor(originalTokens * 0.35)
        });
      }
      
      const compressedTokens = countTokens(compressed);
      
      return {
        compressed,
        originalTokens,
        compressedTokens,
        compressionRatio: compressedTokens / originalTokens
      };
    }

    Strategy 2: Tiered Model Selection

    Use expensive models only when necessary:

    TypeScript
    interface TaskComplexity {
      score: number; // 0-1
      factors: {
        requiresReasoning: boolean;
        requiresCreativity: boolean;
        requiresAccuracy: boolean;
        contextLength: number;
        outputComplexity: 'simple' | 'structured' | 'complex';
      };
    }
    
    function assessComplexity(task: string, context: string): TaskComplexity {
      const factors = {
        requiresReasoning: /analyze|compare|evaluate|decide|reason/i.test(task),
        requiresCreativity: /create|write|generate|design|imagine/i.test(task),
        requiresAccuracy: /calculate|extract|precise|exact|legal|medical/i.test(task),
        contextLength: countTokens(context),
        outputComplexity: detectOutputComplexity(task)
      };
    
      let score = 0;
      if (factors.requiresReasoning) score += 0.3;
      if (factors.requiresCreativity) score += 0.2;
      if (factors.requiresAccuracy) score += 0.3;
      if (factors.contextLength > 4000) score += 0.1;
      if (factors.outputComplexity === 'complex') score += 0.1;
    
      return { score: Math.min(score, 1), factors };
    }
    
    function selectModel(complexity: TaskComplexity): string {
      if (complexity.score > 0.7) return 'gpt-4o';
      if (complexity.score > 0.4) return 'gpt-4o-mini';
      return 'gpt-3.5-turbo';
    }

    Strategy 3: Semantic Caching

    Cache responses for similar queries:

    TypeScript
    interface CacheEntry {
      prompt: string;
      promptEmbedding: number[];
      response: string;
      createdAt: Date;
      hitCount: number;
    }
    
    class SemanticCache {
      private cache: CacheEntry[] = [];
      private similarityThreshold = 0.95;
    
      async get(prompt: string): Promise<string | null> {
        const embedding = await embed(prompt);
        
        for (const entry of this.cache) {
          const similarity = cosineSimilarity(embedding, entry.promptEmbedding);
          
          if (similarity >= this.similarityThreshold) {
            entry.hitCount++;
            console.log(`Cache hit (similarity: ${similarity.toFixed(3)})`);
            return entry.response;
          }
        }
        
        return null;
      }
    
      async set(prompt: string, response: string): Promise<void> {
        const embedding = await embed(prompt);
        
        this.cache.push({
          prompt,
          promptEmbedding: embedding,
          response,
          createdAt: new Date(),
          hitCount: 0
        });
        
        // Evict old entries if cache too large
        if (this.cache.length > 10000) {
          this.evictLeastUsed();
        }
      }
    
      private evictLeastUsed(): void {
        this.cache.sort((a, b) => b.hitCount - a.hitCount);
        this.cache = this.cache.slice(0, 5000);
      }
    }
    
    // Usage
    const cache = new SemanticCache();
    
    async function cachedCompletion(prompt: string): Promise<string> {
      const cached = await cache.get(prompt);
      if (cached) return cached;
    
      const response = await llm.complete({ prompt });
      await cache.set(prompt, response);
      return response;
    }

    Security Patterns

    LLMs introduce novel security risks. These patterns protect against them.

    Pattern 1: Input Sanitization

    TypeScript
    interface SanitizationResult {
      sanitized: string;
      flagged: boolean;
      flags: string[];
    }
    
    function sanitizeUserInput(input: string): SanitizationResult {
      const flags: string[] = [];
      let sanitized = input;
    
      // Detect prompt injection attempts
      const injectionPatterns = [
        /ignore (all |previous |above )?instructions/i,
        /disregard (all |previous |above )?instructions/i,
        /forget (all |previous |above )?instructions/i,
        /new instructions:/i,
        /system prompt/i,
        /you are now/i,
        /pretend (to be|you're)/i,
        /roleplay as/i,
        /<\/?system>/i,
        /<\/?user>/i,
        /<\/?assistant>/i
      ];
    
      for (const pattern of injectionPatterns) {
        if (pattern.test(input)) {
          flags.push(`Potential injection: ${pattern.source}`);
        }
      }
    
      // Remove or escape special delimiters
      sanitized = sanitized
        .replace(/\`\`\`/g, "'''")
        .replace(/<system>/gi, '[system]')
        .replace(/<\/system>/gi, '[/system]');
    
      // Truncate extremely long inputs
      if (sanitized.length > 10000) {
        sanitized = sanitized.slice(0, 10000) + '... [truncated]';
        flags.push('Input truncated due to length');
      }
    
      return {
        sanitized,
        flagged: flags.length > 0,
        flags
      };
    }

    Pattern 2: Output Validation

    TypeScript
    interface OutputValidation {
      valid: boolean;
      issues: string[];
      sanitizedOutput: string;
    }
    
    function validateOutput(output: string, context: OutputContext): OutputValidation {
      const issues: string[] = [];
      let sanitized = output;
    
      // Check for leaked system prompt
      if (context.systemPrompt) {
        const systemWords = context.systemPrompt.split(/\s+/).filter(w => w.length > 10);
        const matchCount = systemWords.filter(w => output.includes(w)).length;
        if (matchCount > systemWords.length * 0.3) {
          issues.push('Possible system prompt leak detected');
        }
      }
    
      // Check for PII in output (that wasn't in input)
      // Global flags catch every occurrence, not just the first match
      const piiPatterns = {
        ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
        creditCard: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g,
        email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g
      };
    
      for (const [type, pattern] of Object.entries(piiPatterns)) {
        const outputMatches = output.match(pattern) || [];
        const inputMatches = context.userInput.match(pattern) || [];
        
        const newPII = outputMatches.filter(m => !inputMatches.includes(m));
        if (newPII.length > 0) {
          issues.push(`Potentially hallucinated ${type}: ${newPII.length} instance(s)`);
          for (const pii of newPII) {
            sanitized = sanitized.replace(pii, '[REDACTED]');
          }
        }
      }
    
      // Check for harmful content
      const harmfulPatterns = [
        /how to (make|build|create) (a )?bomb/i,
        /hack into/i,
        /steal (someone's|personal)/i
      ];
    
      for (const pattern of harmfulPatterns) {
        if (pattern.test(output)) {
          issues.push(`Harmful content detected: ${pattern.source}`);
        }
      }
    
      return {
        valid: issues.length === 0,
        issues,
        sanitizedOutput: sanitized
      };
    }

    Pattern 3: Rate Limiting per User

    TypeScript
    interface RateLimitConfig {
      requestsPerMinute: number;
      tokensPerDay: number;
      costPerDay: number;
    }
    
    const tierLimits: Record<string, RateLimitConfig> = {
      free: { requestsPerMinute: 10, tokensPerDay: 10000, costPerDay: 0.10 },
      pro: { requestsPerMinute: 60, tokensPerDay: 100000, costPerDay: 5.00 },
      enterprise: { requestsPerMinute: 300, tokensPerDay: 1000000, costPerDay: 100.00 }
    };
    
    class UserRateLimiter {
      private usage: Map<string, {
        requests: { timestamp: number }[];
        tokensToday: number;
        costToday: number;
        dayStart: number;
      }> = new Map();
    
      check(userId: string, tier: string, estimatedTokens: number): { 
        allowed: boolean; 
        reason?: string;
        retryAfterMs?: number;
      } {
        const limits = tierLimits[tier] || tierLimits.free;
        const now = Date.now();
        const dayStart = new Date().setHours(0, 0, 0, 0);
    
        let usage = this.usage.get(userId);
        if (!usage || usage.dayStart !== dayStart) {
          usage = { requests: [], tokensToday: 0, costToday: 0, dayStart };
          this.usage.set(userId, usage);
        }
    
        // Check requests per minute
        const recentRequests = usage.requests.filter(r => now - r.timestamp < 60000);
        if (recentRequests.length >= limits.requestsPerMinute) {
          const oldestRecent = recentRequests[0].timestamp;
          return { 
            allowed: false, 
            reason: 'Rate limit exceeded',
            retryAfterMs: 60000 - (now - oldestRecent)
          };
        }
    
        // Check daily token limit
        if (usage.tokensToday + estimatedTokens > limits.tokensPerDay) {
          return { 
            allowed: false, 
            reason: 'Daily token limit exceeded',
            retryAfterMs: dayStart + 86400000 - now
          };
        }
    
        // Check daily cost budget
        if (usage.costToday >= limits.costPerDay) {
          return {
            allowed: false,
            reason: 'Daily cost budget exceeded',
            retryAfterMs: dayStart + 86400000 - now
          };
        }

        return { allowed: true };
      }
    
      record(userId: string, tokensUsed: number, cost: number): void {
        const usage = this.usage.get(userId);
        if (usage) {
          usage.requests.push({ timestamp: Date.now() });
          usage.tokensToday += tokensUsed;
          usage.costToday += cost;
        }
      }
    }
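
    A sketch of how the limiter sits in the request path; `countTokens` and `llm` are the same hypothetical helpers used elsewhere in this article, and the per-token cost is a placeholder.

    TypeScript
    // Sketch: enforce limits before the call, record actual usage after it.
    const limiter = new UserRateLimiter();

    async function limitedCompletion(userId: string, tier: string, prompt: string): Promise<string> {
      const estimatedTokens = countTokens(prompt) + 500; // rough allowance for the completion

      const verdict = limiter.check(userId, tier, estimatedTokens);
      if (!verdict.allowed) {
        throw new Error(`${verdict.reason} (retry after ${verdict.retryAfterMs}ms)`);
      }

      const response = await llm.complete({ prompt });

      // Record what was actually used; the per-token cost here is a placeholder
      const tokensUsed = countTokens(prompt) + countTokens(response);
      limiter.record(userId, tokensUsed, tokensUsed * 0.000005);

      return response;
    }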

    Observability and Debugging

    You cannot improve what you cannot measure. Production AI systems need deep observability.

    Pattern 1: Structured Logging

    TypeScript
    interface LLMCallLog {
      id: string;
      timestamp: Date;
      
      // Request
      model: string;
      promptTokens: number;
      promptHash: string; // For debugging without storing full prompts
      temperature: number;
      
      // Response
      completionTokens: number;
      latencyMs: number;
      finishReason: 'stop' | 'length' | 'content_filter' | 'error';
      
      // Context
      userId?: string;
      sessionId?: string;
      taskType: string;
      
      // Quality signals
      validationPassed: boolean;
      retryCount: number;
      cachedResponse: boolean;
      
      // Cost
      estimatedCost: number;
    }
    
    class LLMLogger {
      private logs: LLMCallLog[] = [];
    
      log(entry: Omit<LLMCallLog, 'id' | 'timestamp'>): void {
        const fullEntry: LLMCallLog = {
          ...entry,
          id: crypto.randomUUID(),
          timestamp: new Date()
        };
    
        this.logs.push(fullEntry);
        
        // Send to observability platform
        this.sendToDatadog(fullEntry);
        
        // Alert on anomalies
        this.checkAlerts(fullEntry);
      }
    
      private checkAlerts(entry: LLMCallLog): void {
        if (entry.latencyMs > 10000) {
          this.alert('high_latency', entry);
        }
        if (entry.retryCount > 2) {
          this.alert('high_retry', entry);
        }
        if (entry.estimatedCost > 1.00) {
          this.alert('high_cost', entry);
        }
        if (!entry.validationPassed) {
          this.alert('validation_failed', entry);
        }
      }
    
      private alert(type: string, entry: LLMCallLog): void {
        console.warn(`[ALERT:${type}] ${JSON.stringify(entry)}`);
      }
    
      private sendToDatadog(entry: LLMCallLog): void {
        // Implementation for your observability platform
      }
    }

    Pattern 2: Prompt Versioning and A/B Testing

    TypeScript
    interface PromptVersion {
      id: string;
      version: string;
      prompt: string;
      createdAt: Date;
      metrics: {
        invocations: number;
        successRate: number;
        avgLatencyMs: number;
        avgCost: number;
        userRating?: number;
      };
    }
    
    class PromptExperiment {
      private variants: Map<string, PromptVersion> = new Map();
      private assignments: Map<string, string> = new Map(); // userId -> variantId
    
      addVariant(variant: PromptVersion): void {
        this.variants.set(variant.id, variant);
      }
    
      getPrompt(userId: string): { prompt: string; variantId: string } {
        // Sticky assignment for consistent user experience
        let variantId = this.assignments.get(userId);
        
        if (!variantId) {
          // Assign based on user hash for reproducibility
          const hash = this.hashUserId(userId);
          const variantIds = Array.from(this.variants.keys());
          variantId = variantIds[hash % variantIds.length];
          this.assignments.set(userId, variantId);
        }
    
        const variant = this.variants.get(variantId)!;
        variant.metrics.invocations++;
    
        return { prompt: variant.prompt, variantId };
      }
    
      recordOutcome(variantId: string, success: boolean, latencyMs: number, cost: number): void {
        const variant = this.variants.get(variantId);
        if (variant) {
          const m = variant.metrics;
          m.successRate = (m.successRate * (m.invocations - 1) + (success ? 1 : 0)) / m.invocations;
          m.avgLatencyMs = (m.avgLatencyMs * (m.invocations - 1) + latencyMs) / m.invocations;
          m.avgCost = (m.avgCost * (m.invocations - 1) + cost) / m.invocations;
        }
      }
    
      getWinner(): PromptVersion | null {
        const variants = Array.from(this.variants.values())
          .filter(v => v.metrics.invocations >= 100); // Minimum sample size
    
        if (variants.length < 2) return null;
    
        // Simple winner: highest success rate with acceptable latency
        variants.sort((a, b) => {
          if (b.metrics.successRate !== a.metrics.successRate) {
            return b.metrics.successRate - a.metrics.successRate;
          }
          return a.metrics.avgLatencyMs - b.metrics.avgLatencyMs;
        });
    
        return variants[0];
      }
    
      private hashUserId(userId: string): number {
        let hash = 0;
        for (let i = 0; i < userId.length; i++) {
          hash = ((hash << 5) - hash) + userId.charCodeAt(i);
          hash = hash & hash;
        }
        return Math.abs(hash);
      }
    }
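
    Usage is straightforward; the two variant prompts below are placeholders, and the success check should be whatever validation your task already uses.

    TypeScript
    // Sketch: run two prompt variants side by side and promote the winner later.
    const experiment = new PromptExperiment();

    experiment.addVariant({
      id: 'v1-baseline',
      version: '1.0.0',
      prompt: 'Summarize the support ticket in two sentences.', // placeholder
      createdAt: new Date(),
      metrics: { invocations: 0, successRate: 0, avgLatencyMs: 0, avgCost: 0 }
    });

    experiment.addVariant({
      id: 'v2-structured',
      version: '1.1.0',
      prompt: 'Summarize the support ticket as JSON with "issue" and "urgency" fields.', // placeholder
      createdAt: new Date(),
      metrics: { invocations: 0, successRate: 0, avgLatencyMs: 0, avgCost: 0 }
    });

    async function experimentalCompletion(userId: string): Promise<string> {
      const { prompt, variantId } = experiment.getPrompt(userId);
      const start = Date.now();

      const response = await llm.complete({ prompt });
      const success = response.trim().length > 0; // swap in real validation

      experiment.recordOutcome(variantId, success, Date.now() - start, 0.002 /* placeholder cost */);
      return response;
    }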

    Real-World Case Studies

    Case Study 1: Customer Support Automation

    We built an AI support agent handling 50,000 tickets per month. The journey from demo to production taught us:

    Initial Problem: The demo worked beautifully with test queries. Production users asked questions in ways we never anticipated.

    Solution: We built a query normalization layer that rephrased user questions before sending to the LLM:

    TypeScript
    async function normalizeQuery(rawQuery: string): Promise<string> {
      // Handle common patterns that confused the main model
      const normalized = await llm.complete({
        prompt: `Rewrite this customer support query to be clear and specific.
        
    Original: "${rawQuery}"
    
    Rules:
    - Fix typos and grammar
    - Expand abbreviations
    - If the query is vague, make reasonable assumptions and note them
    - Keep the same intent and urgency
    
    Rewritten query:`,
        maxTokens: 200,
        temperature: 0.3
      });
    
      return normalized;
    }

    Result: Query understanding improved from 72% to 94%, and customers reported higher satisfaction with responses.

    Case Study 2: Code Review Assistant

    We deployed an AI code reviewer integrated with GitHub PRs. Key lessons:

    Problem 1: Large diffs overflowed the context window.

    Solution: Intelligent chunking by file, with cross-file context summary:

    TypeScript
    async function reviewLargePR(files: FileChange[]): Promise<Review[]> {
      // First pass: generate a summary of the overall change
      const summary = await summarizeChanges(files);
      
      // Review each file with the summary as context
      const reviews = await Promise.all(
        files.map(file => reviewFileWithContext(file, summary))
      );
    
      // Final pass: check for cross-file issues
      const crossFileIssues = await checkCrossFileConsistency(files, reviews);
    
      return [...reviews, ...crossFileIssues];
    }

    Problem 2: The AI flagged too many false positives, annoying developers.

    Solution: Confidence scoring with threshold:

    TypeScript
    interface ReviewComment {
      line: number;
      message: string;
      severity: 'critical' | 'warning' | 'suggestion';
      confidence: number; // 0-1
    }
    
    function filterReviewComments(
      comments: ReviewComment[], 
      minConfidence: number = 0.7
    ): ReviewComment[] {
      return comments.filter(c => {
        // Always show critical issues
        if (c.severity === 'critical') return c.confidence > 0.5;
        // Higher threshold for warnings and suggestions
        return c.confidence >= minConfidence;
      });
    }

    Result: False positive rate dropped from 35% to 8%, developer adoption increased from 40% to 85%.

    The Production Checklist

    Before deploying any LLM feature, verify:

    Reliability

  • [ ] Retry logic with exponential backoff implemented
  • [ ] Fallback models configured
  • [ ] Circuit breaker protecting against provider outages
  • [ ] Timeout handling for slow responses
  • [ ] Graceful degradation when AI unavailable

    Cost Control

  • [ ] Per-user rate limiting implemented
  • [ ] Cost alerts configured
  • [ ] Model selection based on task complexity
  • [ ] Caching for common queries
  • [ ] Prompt optimization reviewed

    Security

  • [ ] Input sanitization for injection attacks
  • [ ] Output validation for harmful content
  • [ ] PII detection and handling
  • [ ] Audit logging for compliance
  • [ ] Access controls for sensitive operations

    Observability

  • [ ] Structured logging for all LLM calls
  • [ ] Latency and error rate dashboards
  • [ ] Cost tracking per feature/user
  • [ ] Quality metrics collection
  • [ ] Alerting for anomalies

    Testing

  • [ ] Evaluation dataset with expected outputs
  • [ ] Regression tests for prompt changes
  • [ ] Load testing for concurrent requests
  • [ ] Chaos testing for provider failures
  • [ ] User acceptance testing completed

    FAQ

    Q: How do I test prompt changes without breaking production?

    Shadow mode: run new prompts alongside production, compare outputs, but only serve the production version. Graduate to canary (5% traffic) once confidence is high.
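
    A minimal shadow-mode sketch; the prompt builders and comparison logger are illustrative names, not a specific framework.

    TypeScript
    // Sketch: serve the production prompt, run the candidate in the background,
    // and log the comparison for offline review. Helper names are illustrative.
    async function shadowedCompletion(input: string): Promise<string> {
      const productionPrompt = buildProductionPrompt(input); // current version
      const candidatePrompt = buildCandidatePrompt(input);   // version under test

      const productionResponse = await llm.complete({ prompt: productionPrompt });

      // Fire-and-forget: the shadow path must never affect user latency or errors
      void llm.complete({ prompt: candidatePrompt })
        .then(candidateResponse => {
          logShadowComparison({ input, productionResponse, candidateResponse });
        })
        .catch(() => { /* shadow failures become metrics, not user errors */ });

      return productionResponse;
    }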

    Q: What's a reasonable latency target for LLM responses?

    For interactive use: p50 < 2s, p95 < 5s. For background processing: whatever your users tolerate. Streaming helps perception - first token in < 500ms feels fast even if completion takes longer.
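
    If your client supports streaming, measuring time to first token takes a few lines; the `llm.stream` call below is an assumed async-iterable interface, not a specific SDK method.

    TypeScript
    // Sketch: stream tokens to the user and record time-to-first-token.
    async function streamedCompletion(prompt: string, onToken: (t: string) => void): Promise<string> {
      const start = Date.now();
      let firstTokenMs: number | null = null;
      let full = '';

      for await (const token of llm.stream({ prompt })) {
        if (firstTokenMs === null) firstTokenMs = Date.now() - start;
        full += token;
        onToken(token); // flush to the UI immediately
      }

      console.log(`first token: ${firstTokenMs}ms, total: ${Date.now() - start}ms`);
      return full;
    }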

    Q: How do I handle prompt injection attacks?

    Defense in depth: input sanitization, output validation, separate system/user message roles, and monitor for anomalies. No single technique is sufficient.
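
    In practice that means one guarded path that reuses the sanitizeUserInput and validateOutput helpers shown earlier; a sketch:

    TypeScript
    // Sketch: layer the input and output defenses around every completion.
    async function guardedCompletion(userInput: string, systemPrompt: string): Promise<string> {
      const { sanitized, flagged, flags } = sanitizeUserInput(userInput);
      if (flagged) {
        console.warn('Suspicious input flagged:', flags); // keep serving, but record it
      }

      const output = await llm.complete({
        prompt: `${systemPrompt}\n\nUser request:\n${sanitized}`
      });

      const validation = validateOutput(output, { systemPrompt, userInput: sanitized });
      if (!validation.valid) {
        console.warn('Output issues:', validation.issues);
      }

      // Return the sanitized output, never the raw completion
      return validation.sanitizedOutput;
    }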

    Q: Should I use few-shot examples or fine-tuning?

    Start with few-shot - it's faster to iterate. Fine-tune when: you have 1000+ high-quality examples, few-shot isn't reaching quality targets, or you need to reduce prompt size for cost/latency.

    Q: How often should I review and update prompts?

    Monthly minimum. Set up alerts for quality degradation. Model updates can change behavior - revalidate after provider announcements.
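
    A tiny regression harness catches most silent drift; the cases and pass criterion below are placeholders for your own evaluation set.

    TypeScript
    // Sketch: re-run a fixed evaluation set after any prompt or model change.
    interface EvalCase {
      input: string;
      mustInclude: string[]; // simple containment check; swap in your own scorer
    }

    async function runRegressionSuite(
      cases: EvalCase[],
      buildPrompt: (input: string) => string
    ): Promise<number> {
      let passed = 0;

      for (const c of cases) {
        const response = await llm.complete({ prompt: buildPrompt(c.input) });
        const ok = c.mustInclude.every(s => response.toLowerCase().includes(s.toLowerCase()));
        if (ok) passed++;
        else console.warn(`Regression on "${c.input}": expected ${JSON.stringify(c.mustInclude)}`);
      }

      const passRate = passed / cases.length;
      console.log(`Eval pass rate: ${(passRate * 100).toFixed(1)}%`);
      return passRate;
    }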

    ---

    Production prompt engineering isn't about writing clever prompts. It's about building robust systems that work reliably at scale, within budget, while protecting users and the business.

    The patterns here emerged from real failures. Each retry mechanism exists because requests failed. Each cost optimization exists because bills exceeded budget. Each security pattern exists because attacks happened.

    Start with the basics: structured prompts, validation, logging. Add complexity only when needed. Measure everything. Iterate based on data.

    Build systems, not demos. Your users - and your on-call rotation - will thank you.
