BackendJanuary 6, 2026

Event-Driven Architecture: Building Reactive Systems at Scale

Master event-driven architecture with Kafka, event sourcing, CQRS, and saga patterns for building resilient, scalable distributed systems.

DT

Dev Team

28 min read

#event-driven#kafka#cqrs#event-sourcing#distributed-systems
Event-Driven Architecture: Building Reactive Systems at Scale

Black Friday, 2:47 PM

The dashboard turns red. Your order service is timing out. Customers are abandoning carts. The payment provider is responding in 8 seconds instead of 200ms - and because your checkout flow calls payment synchronously, every order request is now queued behind that 8-second wall.

"Can we just... skip the email confirmation step?" someone asks. You can't. The notification service is hardcoded into the order flow. Skipping it means redeploying the order service - during peak traffic.

By 3:15 PM, you've lost $340,000 in abandoned carts. The postmortem will use phrases like "tight coupling" and "single point of failure." Your architecture diagram looks like spaghetti.

Here's what went wrong: every service depended on every other service being available and fast. Failures cascaded. The system was fragile precisely because it was "simple."

Event-driven architecture offers an alternative. Instead of services calling each other synchronously, services emit events when interesting things happen. Other services subscribe to events they care about. The order service publishes "OrderPlaced." Inventory, payment, shipping, and notifications all react independently.

If the notification service is slow? Orders still complete. If you need to add a loyalty points service? Subscribe to events - no changes to the order service required.

This simple inversion has profound implications for scalability, resilience, and evolvability.

The Event-Driven Mental Model

Understanding event-driven architecture requires shifting your mental model from commands to facts.

Commands are requests to do something: "Create this order," "Process this payment." They might fail. They expect a response.

Events are facts about what happened: "Order was created," "Payment was processed." They already happened. They are immutable.

In an event-driven system, services communicate through events. A service does its work, then publishes events describing what happened. Other services react to those events as they see fit. The publisher does not know or care who is listening.

TypeScript
// Command (request-response thinking)
interface CreateOrderCommand {
  customerId: string;
  items: OrderItem[];
}

// Response
interface CreateOrderResult {
  success: boolean;
  orderId?: string;
  error?: string;
}

// Event (event-driven thinking)
interface OrderCreated {
  eventId: string;
  timestamp: Date;
  orderId: string;
  customerId: string;
  items: OrderItem[];
  totalAmount: number;
}

This distinction matters. Commands couple the sender to the receiver. Events decouple them completely.

Event Streaming with Apache Kafka

Kafka has become the backbone of event-driven systems at scale. It is not a traditional message queue - it is a distributed commit log that enables both message passing and event streaming.

Key Kafka Concepts

Topics: Named streams of events. You might have topics for orders, payments, shipments, user-activities.

Partitions: Topics are split into partitions for parallelism. Events with the same key go to the same partition, ensuring ordering for related events.

Consumer Groups: Multiple consumers in a group share the work of processing a topic. Each partition is consumed by exactly one consumer in the group.

Retention: Unlike queues, Kafka retains messages for a configurable period. Consumers can replay history.

TypeScript
import { Kafka, Producer, Consumer } from 'kafkajs';

// Producer: Publishing events
class OrderEventPublisher {
  private producer: Producer;

  async publishOrderCreated(order: Order): Promise<void> {
    const event: OrderCreatedEvent = {
      eventId: generateUUID(),
      eventType: 'OrderCreated',
      timestamp: new Date(),
      aggregateId: order.id,
      payload: {
        customerId: order.customerId,
        items: order.items,
        totalAmount: order.totalAmount
      }
    };

    await this.producer.send({
      topic: 'orders',
      messages: [{
        key: order.id,  // Ensures ordering for this order
        value: JSON.stringify(event),
        headers: {
          'event-type': 'OrderCreated',
          'correlation-id': order.correlationId
        }
      }]
    });
  }
}

// Consumer: Reacting to events
class InventoryEventHandler {
  private consumer: Consumer;

  async start(): Promise<void> {
    await this.consumer.subscribe({ topic: 'orders', fromBeginning: false });

    await this.consumer.run({
      eachMessage: async ({ topic, partition, message }) => {
        const event = JSON.parse(message.value.toString());
        
        switch (event.eventType) {
          case 'OrderCreated':
            await this.reserveInventory(event.payload);
            break;
          case 'OrderCancelled':
            await this.releaseInventory(event.payload);
            break;
        }
      }
    });
  }

  private async reserveInventory(order: OrderPayload): Promise<void> {
    for (const item of order.items) {
      await this.inventoryService.reserve(item.productId, item.quantity);
    }
    
    // Publish our own event
    await this.publishInventoryReserved(order.orderId);
  }
}

Event Sourcing: Events as the Source of Truth

Traditional systems store current state: "This order has 3 items and is in 'shipped' status." If you want to know how it got there, you check logs - if you even have them.

Event sourcing flips this: you store events, not state. Current state is derived by replaying events. "Order was created," "Item was added," "Item was removed," "Order was paid," "Order was shipped."

> Pro tip: Think of event sourcing like Git for your data. You don't store "the current file" - you store every commit. You can always rebuild any state from the history.

TypeScript
// Events describe what happened
type OrderEvent =
  | { type: 'OrderCreated'; orderId: string; customerId: string; timestamp: Date }
  | { type: 'ItemAdded'; orderId: string; productId: string; quantity: number; price: number }
  | { type: 'ItemRemoved'; orderId: string; productId: string }
  | { type: 'OrderPaid'; orderId: string; paymentId: string; amount: number }
  | { type: 'OrderShipped'; orderId: string; trackingNumber: string };

// Aggregate rebuilds state from events
class OrderAggregate {
  private id: string;
  private customerId: string;
  private items: Map<string, OrderItem> = new Map();
  private status: OrderStatus = 'draft';
  private totalPaid: number = 0;

  // Apply events to rebuild state
  apply(event: OrderEvent): void {
    switch (event.type) {
      case 'OrderCreated':
        this.id = event.orderId;
        this.customerId = event.customerId;
        this.status = 'created';
        break;
        
      case 'ItemAdded':
        this.items.set(event.productId, {
          productId: event.productId,
          quantity: event.quantity,
          price: event.price
        });
        break;
        
      case 'ItemRemoved':
        this.items.delete(event.productId);
        break;
        
      case 'OrderPaid':
        this.totalPaid += event.amount;
        if (this.totalPaid >= this.calculateTotal()) {
          this.status = 'paid';
        }
        break;
        
      case 'OrderShipped':
        this.status = 'shipped';
        break;
    }
  }

  // Rebuild from event stream
  static fromEvents(events: OrderEvent[]): OrderAggregate {
    const order = new OrderAggregate();
    for (const event of events) {
      order.apply(event);
    }
    return order;
  }
}

Benefits of Event Sourcing

Complete audit trail: Every change is recorded. Compliance and debugging become trivial.

Time travel: Rebuild state at any point in time. "What did this order look like yesterday?"

Event replay: Fix a bug in your projection logic? Replay events with the fixed code.

Multiple views: Derive different read models from the same events. A dashboard view, a search index, an analytics pipeline - all from the same event stream.

The Cost of Event Sourcing

Event sourcing is not free:

Complexity: Simple CRUD becomes event modeling. Teams need new skills.

Eventual consistency: Read models lag behind writes. Not every application tolerates this.

Event schema evolution: Events are forever. Changing them requires versioning strategies.

Storage growth: Events accumulate. Snapshots mitigate but add complexity.

Use event sourcing when the benefits justify the costs - typically for complex domains where audit trails, temporal queries, or multiple views matter.

CQRS: Separating Reads from Writes

Command Query Responsibility Segregation splits your model into two parts: commands (writes) and queries (reads).

The write model is optimized for business logic and consistency. It validates commands, enforces invariants, and generates events.

The read model is optimized for queries. It is denormalized, indexed, and shaped exactly how consumers need it.

TypeScript
// Write side: Command handling
class OrderCommandHandler {
  async handle(command: CreateOrderCommand): Promise<Result<OrderId>> {
    // Validate business rules
    const customer = await this.customerRepository.find(command.customerId);
    if (!customer.isActive) {
      return Result.fail('Customer is not active');
    }

    // Check inventory
    for (const item of command.items) {
      const available = await this.inventoryService.checkAvailability(item.productId);
      if (available < item.quantity) {
        return Result.fail(`Insufficient inventory for ${item.productId}`);
      }
    }

    // Create aggregate and apply command
    const order = new OrderAggregate();
    order.create(command.customerId, command.items);

    // Persist events
    await this.eventStore.save(order.id, order.uncommittedEvents);

    // Publish events
    for (const event of order.uncommittedEvents) {
      await this.eventBus.publish(event);
    }

    return Result.ok(order.id);
  }
}

// Read side: Projection
class OrderReadModelProjection {
  async handle(event: OrderEvent): Promise<void> {
    switch (event.type) {
      case 'OrderCreated':
        await this.db.orders.insert({
          id: event.orderId,
          customerId: event.customerId,
          status: 'created',
          itemCount: 0,
          totalAmount: 0,
          createdAt: event.timestamp
        });
        break;

      case 'ItemAdded':
        await this.db.orders.update(event.orderId, {
          $inc: { 
            itemCount: 1, 
            totalAmount: event.price * event.quantity 
          }
        });
        await this.db.orderItems.insert({
          orderId: event.orderId,
          productId: event.productId,
          quantity: event.quantity,
          price: event.price
        });
        break;

      case 'OrderShipped':
        await this.db.orders.update(event.orderId, {
          status: 'shipped',
          shippedAt: event.timestamp
        });
        break;
    }
  }
}

// Query side: Optimized reads
class OrderQueryService {
  async getOrderSummary(orderId: string): Promise<OrderSummary> {
    // Direct read from optimized read model
    return this.db.orders.findOne({ id: orderId });
  }

  async getCustomerOrders(customerId: string): Promise<OrderSummary[]> {
    // Pre-computed, indexed, ready to serve
    return this.db.orders
      .find({ customerId })
      .sort({ createdAt: -1 })
      .limit(50);
  }

  async getOrdersReadyToShip(): Promise<OrderSummary[]> {
    // Purpose-built for this specific query
    return this.db.orders.find({ 
      status: 'paid',
      inventoryReserved: true 
    });
  }
}

CQRS shines when read and write patterns diverge significantly - many reads, complex queries, different scaling needs.

The Saga Pattern: Distributed Transactions

In a monolith, a transaction can span multiple operations. In distributed systems, we cannot have distributed transactions (they do not scale). The Saga pattern provides an alternative.

A saga is a sequence of local transactions. Each step publishes events that trigger the next step. If a step fails, compensating transactions undo previous steps.

TypeScript
// Saga definition
interface OrderSaga {
  steps: SagaStep[];
  compensations: Map<string, CompensationStep>;
}

const createOrderSaga: OrderSaga = {
  steps: [
    {
      name: 'ReserveInventory',
      action: async (ctx) => {
        await inventoryService.reserve(ctx.orderId, ctx.items);
      },
      compensation: async (ctx) => {
        await inventoryService.release(ctx.orderId, ctx.items);
      }
    },
    {
      name: 'ProcessPayment',
      action: async (ctx) => {
        const paymentId = await paymentService.charge(ctx.customerId, ctx.amount);
        ctx.paymentId = paymentId;
      },
      compensation: async (ctx) => {
        await paymentService.refund(ctx.paymentId);
      }
    },
    {
      name: 'CreateShipment',
      action: async (ctx) => {
        await shippingService.createShipment(ctx.orderId, ctx.address);
      },
      compensation: async (ctx) => {
        await shippingService.cancelShipment(ctx.orderId);
      }
    }
  ]
};

// Saga orchestrator
class SagaOrchestrator {
  async execute(saga: OrderSaga, context: SagaContext): Promise<SagaResult> {
    const completedSteps: string[] = [];

    try {
      for (const step of saga.steps) {
        await step.action(context);
        completedSteps.push(step.name);
        
        // Emit progress event
        await this.eventBus.publish({
          type: 'SagaStepCompleted',
          sagaId: context.sagaId,
          step: step.name
        });
      }

      return { success: true };

    } catch (error) {
      // Compensate in reverse order
      for (const stepName of completedSteps.reverse()) {
        const step = saga.steps.find(s => s.name === stepName);
        try {
          await step.compensation(context);
        } catch (compensationError) {
          // Log and alert - compensation failures need manual intervention
          await this.alertService.critical(`Compensation failed for ${stepName}`);
        }
      }

      return { success: false, error: error.message };
    }
  }
}

Choreography vs. Orchestration

Choreography: Services react to events independently. No central coordinator. More decoupled but harder to understand the overall flow.

Orchestration: A saga orchestrator directs the flow. Easier to understand and monitor but introduces a coordinator dependency.

Most production systems use orchestration for complex workflows and choreography for simpler event reactions.

Idempotency: The Non-Negotiable Requirement

> If you only remember one thing: In event-driven systems, messages can - and will - be delivered multiple times. Your handlers must be idempotent.

Network issues, consumer restarts, and rebalancing all cause redelivery. Processing the same event twice should have the same effect as processing it once. ("Why did we charge the customer three times?" is not a question you want to answer.)

TypeScript
class IdempotentEventHandler {
  private processedEvents: Set<string>; // In production, use a database

  async handle(event: Event): Promise<void> {
    // Check if already processed
    if (await this.isProcessed(event.eventId)) {
      console.log(`Event ${event.eventId} already processed, skipping`);
      return;
    }

    // Process the event
    await this.processEvent(event);

    // Mark as processed
    await this.markProcessed(event.eventId);
  }

  private async isProcessed(eventId: string): Promise<boolean> {
    return this.db.processedEvents.exists({ eventId });
  }

  private async markProcessed(eventId: string): Promise<void> {
    await this.db.processedEvents.insert({
      eventId,
      processedAt: new Date()
    });
  }
}

For database operations, use natural idempotency when possible:

TypeScript
// Idempotent: Uses upsert with the event's intended state
async function handleOrderShipped(event: OrderShippedEvent): Promise<void> {
  await db.orders.updateOne(
    { id: event.orderId },
    { 
      $set: { 
        status: 'shipped', 
        trackingNumber: event.trackingNumber,
        shippedAt: event.timestamp 
      }
    },
    { upsert: false }  // Don't create if doesn't exist
  );
}

Monitoring Event-Driven Systems

Event-driven systems require different monitoring approaches.

Consumer Lag

The gap between latest published event and latest consumed event. Growing lag indicates consumers cannot keep up.

TypeScript
// Kafka consumer lag monitoring
async function checkConsumerLag(groupId: string): Promise<LagReport> {
  const admin = kafka.admin();
  const offsets = await admin.fetchOffsets({ groupId, topics: ['orders'] });
  const topicOffsets = await admin.fetchTopicOffsets('orders');

  return offsets.map(partition => ({
    partition: partition.partition,
    currentOffset: partition.offset,
    latestOffset: topicOffsets.find(t => t.partition === partition.partition).offset,
    lag: topicOffsets.find(t => t.partition === partition.partition).offset - partition.offset
  }));
}

Event Processing Metrics

  • Events processed per second
  • Processing latency (p50, p95, p99)
  • Error rate
  • Retry rate
  • Dead letter queue size
  • Distributed Tracing

    Trace events across services using correlation IDs:

    TypeScript
    interface TracedEvent {
      eventId: string;
      correlationId: string;  // Ties related events together
      causationId: string;    // ID of the event that caused this one
      timestamp: Date;
      // ... event payload
    }

    Best Practices Checklist

  • [ ] Design events as facts - Events describe what happened, not what should happen. They're immutable.
  • [ ] Include all necessary context - Events should be self-contained. Consumers shouldn't need to call back to the producer.
  • [ ] Version your events - Use schema registries. Plan for evolution from day one.
  • [ ] Handle failures gracefully - Dead letter queues, retry policies, and alerting for stuck events.
  • [ ] Monitor consumer lag - It's your primary health indicator.
  • [ ] Test event handlers in isolation - Mock the event bus for unit tests.
  • [ ] Document event schemas and flows - Event-driven systems are harder to understand than request-response. Documentation is essential.
  • When NOT to Use Event-Driven Architecture

    > Watch out: Event-driven isn't always the answer. Avoid it when:

  • Simple CRUD apps - The overhead isn't worth it for basic data entry applications.
  • Strong consistency required - If you can't tolerate eventual consistency, stick with synchronous calls.
  • Small team, tight deadline - EDA has a learning curve. Don't adopt it under time pressure.
  • Debugging skills lacking - Distributed tracing and async debugging are prerequisites.
  • FAQ

    Q: Kafka vs. RabbitMQ vs. SQS - which should I use?

    Kafka for high-throughput event streaming and replay. RabbitMQ for complex routing and lower latency. SQS for simple queuing with AWS integration. Most teams should start with managed services (SQS, Cloud Pub/Sub) and graduate to Kafka when they hit scale.

    Q: How do I handle events that arrive out of order?

    Design handlers to be order-independent when possible. When order matters, use partition keys to ensure related events go to the same partition (Kafka guarantees ordering within a partition).

    Q: What's a reasonable retention period for events?

    Start with 7 days for operational events, 30+ days for events you might need to replay. Event sourcing stores events forever by design.

    Q: How do I migrate from request-response to event-driven?

    Start with one non-critical flow. Keep the synchronous path as fallback. Gradually shift traffic to events as you build confidence. Don't big-bang migrate.

    Q: My consumer lag keeps growing. What do I do?

    Add more consumers (scale horizontally), optimize handler performance, or batch process events. Check if one slow downstream dependency is blocking everything.

    ---

    Event-driven architecture isn't universally better than request-response. It's better for specific problems: high scalability needs, loose coupling requirements, complex workflows, and systems that benefit from event replay. Choose the right tool for the job - and when events are that tool, build with patterns that make them reliable.

    Share this article

    💬Discussion

    🗨️

    No comments yet

    Be the first to share your thoughts!

    Related Articles