AI/ML•January 9, 2026

RAG Architecture Deep Dive: Building Knowledge-Augmented AI

Master Retrieval-Augmented Generation with vector databases, embedding strategies, chunking techniques, and production optimization patterns.

Dev Team

28 min read

#rag #vector-database #embeddings #llm #ai #pinecone

"Where Did You Get That Information?"

Your AI chatbot just told a customer that your premium plan includes unlimited API calls. It doesn't. Never has.

"Where did you get that information?" your support lead asks, staring at the chat transcript.

The answer: nowhere. The model hallucinated it. Confidently. In a way that sounds completely plausible. Now you have a customer expecting features that don't exist and a legal team asking hard questions.

This is the knowledge problem. LLMs are remarkable at reasoning, but their knowledge is frozen at training time. Ask about last week's events - hallucination. Ask about your internal docs - hallucination. Ask for citations - invented references that don't exist.

RAG (Retrieval-Augmented Generation) fixes this by changing how we use language models. Instead of asking the model to recall information, we retrieve relevant documents at query time and include them in the prompt. The model becomes a reasoning engine over provided context, not a knowledge oracle.

Suddenly your AI can access yesterday's information. It can work with proprietary documents. It can cite real sources. And you can update its knowledge without retraining.

Here's how to build RAG systems that actually work in production.

How RAG Works: The Mental Model

Understanding RAG requires understanding two distinct phases: indexing and retrieval.

The Indexing Phase happens offline, before any user queries. You take your knowledge base - documents, articles, database records, whatever - and process it into a format optimized for retrieval:

  • Document Loading: Ingest documents from various sources (files, databases, APIs, web scraping)
  • Chunking: Split documents into smaller pieces that fit in context windows
  • Embedding: Convert each chunk into a high-dimensional vector that captures its semantic meaning
  • Indexing: Store these vectors in a database optimized for similarity search

The Retrieval Phase happens at query time:

  • Query Embedding: Convert the user's question into the same vector space as your documents
  • Similarity Search: Find the chunks whose vectors are closest to the query vector
  • Context Assembly: Combine retrieved chunks into a coherent context
  • Generation: Pass the context and query to the LLM to generate a response

The magic is in the similarity search step. Semantic similarity search finds relevant content even when the exact words differ. A query about "authentication" will find documents discussing "login," "credentials," and "identity verification" - because in the embedding space, these concepts are neighbors.
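
Put together, the query-time flow fits in a few lines. Here is a minimal sketch of that loop, where `embedQuery`, `vectorStore.search`, and `llm.complete` are placeholder helpers standing in for your embedding model, vector database client, and LLM client:

TypeScript
// Minimal retrieve-then-generate loop (placeholder clients, not a specific SDK)
async function answerQuery(question: string): Promise<string> {
  // 1. Query embedding: project the question into the same vector space as the chunks
  const queryVector = await embedQuery(question);

  // 2. Similarity search: fetch the nearest chunks from the vector database
  const chunks = await vectorStore.search(queryVector, { topK: 5 });

  // 3. Context assembly: concatenate the retrieved chunks
  const context = chunks.map(c => c.content).join('\n\n---\n\n');

  // 4. Generation: let the LLM reason over the provided context
  return llm.complete({
    prompt: `Answer using only this context:\n\n${context}\n\nQuestion: ${question}`
  });
}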

    Chunking: The Underrated Art

    Chunking seems straightforward - just split documents into smaller pieces, right? In practice, chunking strategy has enormous impact on retrieval quality, and getting it right requires thought.

    The Chunking Dilemma

    Small chunks are more precise. A 200-token chunk about authentication will be highly relevant to authentication queries and will not dilute the context with unrelated information.

    But small chunks lose context. A chunk that says "this approach significantly improves performance" is useless without knowing what "this approach" refers to. Context that spans multiple chunks gets fragmented.

    Large chunks preserve context but reduce precision. A 2000-token chunk might contain a paragraph about authentication buried in discussion of other topics. It will be retrieved for authentication queries, but most of the chunk will be noise.

    Chunking Strategies

    Fixed-Size Chunking is the simplest approach. Split every N tokens with some overlap to preserve context at boundaries:

    TypeScript
    // `tokenize`/`detokenize` are assumed helpers from your tokenizer of choice (e.g. tiktoken).
    function fixedSizeChunk(text: string, chunkSize: number, overlap: number): string[] {
      const tokens = tokenize(text);
      const chunks: string[] = [];
      const step = Math.max(1, chunkSize - overlap); // guard against a non-positive step
      
      for (let i = 0; i < tokens.length; i += step) {
        const chunk = tokens.slice(i, i + chunkSize);
        chunks.push(detokenize(chunk));
      }
      
      return chunks;
    }

    Simple, but it ignores document structure. You might split mid-sentence or separate a heading from its content.

    Semantic Chunking respects document structure. Split on paragraph boundaries, section headers, or other natural breakpoints:

    TypeScript
    // `tokenCount` and `MAX_CHUNK_SIZE` are assumed token-counting helpers/constants from your own setup.
    function semanticChunk(document: Document): Chunk[] {
      const chunks: Chunk[] = [];
      
      for (const section of document.sections) {
        // Keep section headers with their content
        let currentChunk = `## ${section.title}\n\n`;
        
        for (const paragraph of section.paragraphs) {
          if (tokenCount(currentChunk + paragraph) > MAX_CHUNK_SIZE) {
            chunks.push({ content: currentChunk, metadata: section.metadata });
            currentChunk = `## ${section.title} (continued)\n\n`;
          }
          currentChunk += paragraph + '\n\n';
        }
        
        if (currentChunk.trim()) {
          chunks.push({ content: currentChunk, metadata: section.metadata });
        }
      }
      
      return chunks;
    }

    Recursive Chunking handles nested structures by splitting at progressively finer levels until chunks fit the size limit:

    TypeScript
    const splitters = [
      '\n\n\n',  // Major sections
      '\n\n',     // Paragraphs
      '\n',        // Lines
      '. ',         // Sentences
      ' ',          // Words
    ];
    
    function recursiveChunk(text: string, maxSize: number, level: number = 0): string[] {
      if (tokenCount(text) <= maxSize) {
        return [text];
      }
      
      if (level >= splitters.length) {
        // Hard split if nothing else works
        return fixedSizeChunk(text, maxSize, 0);
      }
      
      const parts = text.split(splitters[level]);
      const chunks: string[] = [];
      let current = '';
      
      for (const part of parts) {
        if (tokenCount(current + splitters[level] + part) <= maxSize) {
          current += (current ? splitters[level] : '') + part;
        } else {
          if (current) chunks.push(...recursiveChunk(current, maxSize, level + 1));
          current = part;
        }
      }
      
      if (current) chunks.push(...recursiveChunk(current, maxSize, level + 1));
      return chunks;
    }

    Practical Recommendations

    After much experimentation, here is what works:

  • Chunk size 500-1000 tokens for most use cases
  • Overlap 50-100 tokens to preserve boundary context
  • Include metadata (source, title, date) in chunk storage
  • Consider parent-child chunks: Store larger context chunks and retrieve them when their child chunks match (a sketch of this pattern follows this list)
  • Test with real queries: The best chunking strategy depends on how users actually query your content
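
    To make the parent-child recommendation concrete, here is a rough sketch of the pattern. The `vectorStore` and `parentStore` clients, the field names, and the token budgets are illustrative assumptions, not a specific library's API.

    TypeScript
    // Children are small, precise chunks that get embedded and searched;
    // parents are the larger sections that get sent to the LLM as context.
    interface ParentChunk {
      id: string;
      content: string;        // e.g. a full ~2000-token section
    }
    
    interface ChildChunk {
      id: string;
      parentId: string;       // link back to the enclosing parent
      content: string;        // e.g. a ~300-token passage
    }
    
    async function retrieveWithParents(queryVector: number[], topK: number): Promise<ParentChunk[]> {
      // Search over the small child chunks for precision...
      const children: ChildChunk[] = await vectorStore.search(queryVector, { topK });
      
      // ...then return the deduplicated parents for generation context
      const parentIds = [...new Set(children.map(c => c.parentId))];
      return parentStore.getMany(parentIds);
    }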

    Embeddings: Meaning in Numbers

    Embeddings are the secret sauce that makes semantic search possible. They transform text into dense vectors where similar meanings cluster together.

    How Embeddings Work

    An embedding model takes text (a word, sentence, or paragraph) and outputs a vector - typically 384 to 1536 dimensions. These models are trained so that semantically similar texts produce similar vectors.

    The magic is that this similarity captures meaning, not just word overlap. The embedding for "The CEO resigned yesterday" will be similar to "The company's leader stepped down recently" despite sharing almost no words.
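
    Similarity between vectors is typically measured with cosine similarity, which compares the angle between vectors rather than their magnitude. A minimal implementation:

    TypeScript
    // Cosine similarity between two embedding vectors: values near 1 mean very
    // similar meaning, values near 0 mean unrelated. Assumes equal dimensions.
    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0;
      let normA = 0;
      let normB = 0;
      
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }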

    Choosing an Embedding Model

    Several factors matter:

    Dimension Count: Higher dimensions can capture more nuance but require more storage and computation. 768-1536 dimensions work well for most applications.

    Training Data: Models trained on specific domains (code, legal, medical) often outperform general-purpose models in those domains.

    Sequence Length: Some models truncate at 512 tokens; others handle 8192+. Match to your chunk sizes.

    Performance: Embedding latency matters for real-time applications. Some models are 10x faster than others.

    Popular choices include:

    TypeScript
    // OpenAI embeddings - excellent quality, API cost
    const openaiEmbedding = {
      model: 'text-embedding-3-large',
      dimensions: 3072,
      maxTokens: 8191
    };
    
    // Cohere - good multilingual support
    const cohereEmbedding = {
      model: 'embed-english-v3.0',
      dimensions: 1024,
      maxTokens: 512
    };
    
    // Open source - no API costs, self-hosted
    const sentenceTransformer = {
      model: 'all-MiniLM-L6-v2',
      dimensions: 384,
      maxTokens: 256
    };

    Embedding Best Practices

  • Use the same model for indexing and querying. Different models produce incompatible vector spaces.
  • Normalize vectors for cosine similarity. Most vector databases do this automatically.
  • Consider instruction-tuned embeddings. Some models let you specify whether text is a query or a document, improving retrieval quality.
  • Batch embedding requests to reduce API calls and latency.

    TypeScript
    import OpenAI from 'openai';
    
    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
    
    async function embedBatch(texts: string[], batchSize: number = 100): Promise<number[][]> {
      const embeddings: number[][] = [];
      
      for (let i = 0; i < texts.length; i += batchSize) {
        const batch = texts.slice(i, i + batchSize);
        const response = await openai.embeddings.create({
          model: 'text-embedding-3-small',
          input: batch
        });
        embeddings.push(...response.data.map(d => d.embedding));
      }
      
      return embeddings;
    }

    Vector Databases: Where Knowledge Lives

    Vector databases are specialized systems for storing and searching embeddings. They make the "find similar vectors" operation fast - typically milliseconds even with millions of vectors.

    Vector Database Options

    Pinecone: Fully managed, easy to start, scales well. Good default choice for production.

    TypeScript
    import { Pinecone } from '@pinecone-database/pinecone';
    
    const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
    const index = pinecone.index('knowledge-base');
    
    // Upsert vectors
    await index.upsert([{
      id: 'doc-1-chunk-3',
      values: embedding,
      metadata: { source: 'handbook.pdf', page: 42 }
    }]);
    
    // Query
    const results = await index.query({
      vector: queryEmbedding,
      topK: 10,
      includeMetadata: true,
      filter: { source: { $eq: 'handbook.pdf' } }
    });

    Weaviate: Open source, supports hybrid search natively, has built-in vectorization.

    Qdrant: Open source, excellent filtering capabilities, efficient memory usage.

    pgvector: PostgreSQL extension. Good if you want to keep everything in Postgres.

    SQL
    -- pgvector setup
    CREATE EXTENSION vector;
    
    CREATE TABLE documents (
      id SERIAL PRIMARY KEY,
      content TEXT,
      embedding vector(1536),
      metadata JSONB
    );
    
    CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
      WITH (lists = 100);
    
    -- Query
    SELECT id, content, 1 - (embedding <=> $1) AS similarity
    FROM documents
    ORDER BY embedding <=> $1
    LIMIT 10;

    Indexing Strategies

    Vector search speed depends on the index algorithm:

    Flat/Brute Force: Compares query against every vector. Exact results but slow at scale.

    IVF (Inverted File): Clusters vectors and only searches relevant clusters. Fast but may miss some results.

    HNSW (Hierarchical Navigable Small World): Graph-based navigation. Excellent speed-accuracy tradeoff.

    For most production systems, HNSW provides the best balance. Pinecone and most managed services use it automatically.

    Hybrid Search: Best of Both Worlds

    Pure semantic search has a weakness: it can miss exact keyword matches. If a user searches for "error code 4532" they need documents containing that exact code, not documents semantically similar to error concepts.

    Hybrid search combines semantic search with traditional keyword search:

    TypeScript
    async function hybridSearch(query: string, options: SearchOptions): Promise<SearchResult[]> {
      // Parallel semantic and keyword search
      const [semanticResults, keywordResults] = await Promise.all([
        semanticSearch(query, options.topK),
        keywordSearch(query, options.topK)
      ]);
      
      // Reciprocal Rank Fusion to combine results
      const scores = new Map<string, number>();
      const k = 60; // RRF constant
      
      semanticResults.forEach((result, rank) => {
        const current = scores.get(result.id) || 0;
        scores.set(result.id, current + 1 / (k + rank));
      });
      
      keywordResults.forEach((result, rank) => {
        const current = scores.get(result.id) || 0;
        scores.set(result.id, current + 1 / (k + rank));
      });
      
      // Sort by combined score
      return Array.from(scores.entries())
        .sort((a, b) => b[1] - a[1])
        .slice(0, options.topK)
        .map(([id, score]) => ({ id, score }));
    }

    Many vector databases now support hybrid search natively. Weaviate, for example, lets you specify the balance:

    TypeScript
    const results = await client.graphql.get()
      .withClassName('Document')
      .withHybrid({
        query: 'authentication best practices',
        alpha: 0.75  // 75% semantic, 25% keyword
      })
      .withLimit(10)
      .do();

    Retrieval Quality: The Make-or-Break Factor

    Retrieval quality determines RAG quality. If you retrieve irrelevant chunks, the best LLM in the world cannot produce good answers.

    Measuring Retrieval Quality

    Precision: What percentage of retrieved chunks are actually relevant?

    Recall: What percentage of relevant chunks were retrieved?

    Mean Reciprocal Rank (MRR): How high do relevant results rank?

    Build an evaluation set of queries with known relevant documents, and track these metrics:

    TypeScript
    interface RetrievalEval {
      query: string;
      relevantDocIds: string[];
    }
    
    function evaluateRetrieval(
      results: SearchResult[],
      evalCase: RetrievalEval  // `eval` is a reserved name in strict mode, so use another identifier
    ): { precision: number; recall: number; mrr: number } {
      const retrieved = new Set(results.map(r => r.id));
      const relevant = new Set(evalCase.relevantDocIds);
      
      const truePositives = [...retrieved].filter(id => relevant.has(id)).length;
      
      const precision = truePositives / retrieved.size;
      const recall = truePositives / relevant.size;
      
      // Find rank of first relevant result
      const firstRelevantRank = results.findIndex(r => relevant.has(r.id));
      const mrr = firstRelevantRank >= 0 ? 1 / (firstRelevantRank + 1) : 0;
      
      return { precision, recall, mrr };
    }

    Improving Retrieval

  • Query Expansion: Rephrase the query multiple ways and combine results

    TypeScript
    async function expandedSearch(query: string): Promise<SearchResult[]> {
      const expansions = await llm.complete({
        prompt: `Generate 3 alternative phrasings of this search query: "${query}"
        Output as JSON array of strings.`
      });
      
      const allQueries = [query, ...JSON.parse(expansions)];
      const allResults = await Promise.all(allQueries.map(q => search(q)));
      
      return fuseResults(allResults);
    }
  • Reranking: Use a cross-encoder to rerank initial results

    TypeScript
    async function searchWithReranking(query: string, topK: number): Promise<SearchResult[]> {
      // Get more candidates than needed
      const candidates = await semanticSearch(query, topK * 3);
      
      // Rerank with cross-encoder
      const reranked = await reranker.rank(
        query,
        candidates.map(c => c.content)
      );
      
      return reranked.slice(0, topK);
    }
  • Metadata Filtering: Let users constrain search to specific sources, dates, or categories
  • Query Classification: Route different query types to different retrieval strategies (a rough sketch follows this list)
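
    As an illustration of query classification, here is a rough routing sketch. The category names, the classification prompt, and the `llm.complete` helper are assumptions for the example; `semanticSearch`, `keywordSearch`, and `hybridSearch` are the functions sketched earlier.

    TypeScript
    type QueryKind = 'factual' | 'code-lookup' | 'troubleshooting';
    
    async function routeQuery(query: string): Promise<SearchResult[]> {
      // Cheap classification call; a small model is usually enough here
      const raw = await llm.complete({
        prompt: `Classify this query as factual, code-lookup, or troubleshooting: "${query}". Reply with one word.`
      });
      const kind = raw.trim() as QueryKind;
      
      switch (kind) {
        case 'code-lookup':
          return keywordSearch(query, 10);            // exact identifiers matter most
        case 'troubleshooting':
          return hybridSearch(query, { topK: 10 });   // mix of error codes and concepts
        default:
          return semanticSearch(query, 10);           // general conceptual questions
      }
    }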

    Generation: Turning Context into Answers

    Once you have retrieved relevant chunks, you need to generate a response that uses them effectively.

    Prompt Structure

    A well-structured RAG prompt has clear sections:

    TypeScript
    function buildRAGPrompt(query: string, chunks: Chunk[]): string {
      const context = chunks
        .map((c, i) => `[Source ${i + 1}: ${c.metadata.title}]\n${c.content}`)
        .join('\n\n---\n\n');
      
      return `You are a helpful assistant that answers questions based on the provided context.
    
    CONTEXT:
    ${context}
    
    INSTRUCTIONS:
    - Answer the question based ONLY on the context provided
    - If the context doesn't contain enough information, say so
    - Cite sources using [Source N] notation
    - Be concise but complete
    
    QUESTION: ${query}
    
    ANSWER:`;
    }

    Handling Insufficient Context

    Sometimes retrieved chunks do not contain the answer. The model should acknowledge this rather than hallucinate:

    TypeScript
    const prompt = `...
    If the context does not contain information to answer the question, respond with:
    "I don't have enough information in my knowledge base to answer this question. 
    The available documents discuss [brief summary of what IS available], 
    but don't cover [what the user asked about]."
    ...`;

    Citation and Attribution

    For trustworthiness, responses should cite sources:

    TypeScript
    interface RAGResponse {
      answer: string;
      citations: Citation[];
      confidence: number;
    }
    
    interface Citation {
      sourceId: string;
      title: string;
      excerpt: string;
      relevance: number;
    }
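
    A simple way to populate these citations is to map the [Source N] markers the prompt asks for back to the retrieved chunks. This is a sketch; the `sourceId` metadata field and the fixed relevance value are illustrative assumptions:

    TypeScript
    function extractCitations(answer: string, chunks: Chunk[]): Citation[] {
      // Collect every [Source N] marker that appears in the generated answer
      const cited = new Set<number>();
      for (const match of answer.matchAll(/\[Source (\d+)\]/g)) {
        cited.add(parseInt(match[1], 10));
      }
      
      return [...cited]
        .filter(n => n >= 1 && n <= chunks.length)
        .map(n => ({
          sourceId: chunks[n - 1].metadata.sourceId,  // assumes a sourceId field in chunk metadata
          title: chunks[n - 1].metadata.title,
          excerpt: chunks[n - 1].content.slice(0, 200),
          relevance: 1  // placeholder: the retrieval score is a better signal
        }));
    }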

    Production Considerations

    Building RAG for production requires attention to several additional concerns.

    Latency Budget

    RAG adds latency: embedding the query, searching the vector DB, and including context in the prompt. A typical breakdown:

  • Query embedding: 50-100ms
  • Vector search: 10-50ms
  • LLM generation: 500-2000ms

    To meet user expectations:

  • Cache common query embeddings (a minimal cache sketch follows this list)
  • Use faster embedding models for real-time
  • Stream LLM responses
  • Pre-compute popular queries
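
    For the first item, a minimal in-memory sketch looks like this; `embedQuery` stands in for your embedding call, and in production you would likely back this with Redis and a TTL:

    TypeScript
    const embeddingCache = new Map<string, number[]>();
    
    async function cachedEmbedQuery(query: string): Promise<number[]> {
      const key = query.trim().toLowerCase();
      
      const cached = embeddingCache.get(key);
      if (cached) return cached;           // cache hit: skip the embedding call entirely
      
      const embedding = await embedQuery(key);
      embeddingCache.set(key, embedding);  // cache miss: embed once, reuse afterwards
      return embedding;
    }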

    Cost Management

    RAG costs come from embeddings, vector database operations, and longer prompts (more context = more tokens):

    TypeScript
    interface RAGCostEstimate {
      embeddingCost: number;    // Per query
      vectorSearchCost: number; // Per query
      llmCost: number;          // Depends on context size
      storageCost: number;      // Monthly, based on corpus size
    }
    
    function estimateCost(config: RAGConfig): RAGCostEstimate {
      const embeddingCost = 0.0001; // $0.0001 per 1K tokens
      const vectorSearchCost = config.vectorDB.costPerQuery;
      const llmCost = (config.contextTokens + config.avgResponseTokens) * 0.00003;
      const storageCost = config.corpusSize * config.vectorDB.costPerGB;
      
      return { embeddingCost, vectorSearchCost, llmCost, storageCost };
    }

    Keeping Knowledge Fresh

    Your knowledge base is only valuable if it stays current:

    TypeScript
    async function incrementalIndexUpdate(changes: DocumentChange[]): Promise<void> {
      for (const change of changes) {
        switch (change.type) {
          case 'create':
          case 'update':
            const chunks = await chunkDocument(change.document);
            const embeddings = await embedBatch(chunks.map(c => c.content));
            await vectorDB.upsert(chunks.map((c, i) => ({
              id: `${change.documentId}-${i}`,
              values: embeddings[i],
              metadata: c.metadata
            })));
            break;
          
          case 'delete':
            await vectorDB.deleteByFilter({
              documentId: change.documentId
            });
            break;
        }
      }
    }

    Measuring Success

    Track these metrics to know if your RAG system is working:

    Retrieval Metrics

  • Retrieval latency (p50, p95)
  • Chunks retrieved per query
  • Cache hit rate

    Quality Metrics

  • User satisfaction (thumbs up/down)
  • Answer groundedness (is the answer supported by retrieved context? see the sketch after this list)
  • Citation accuracy (do citations match claims?)
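
    Groundedness is hard to measure exactly, but a rough LLM-as-judge check is a workable starting point. The prompt and scoring scale here are illustrative, and `llm.complete` is a placeholder for your LLM client:

    TypeScript
    async function groundednessScore(answer: string, context: string): Promise<number> {
      // Ask a judge model how well the answer is supported by the retrieved context
      const verdict = await llm.complete({
        prompt: `Context:\n${context}\n\nAnswer:\n${answer}\n\n` +
          `On a scale from 0 to 1, how fully is the answer supported by the context? ` +
          `Reply with a single number.`
      });
      
      const score = parseFloat(verdict);
      return Number.isNaN(score) ? 0 : Math.min(1, Math.max(0, score));
    }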

    Business Metrics

  • Queries per user
  • Time to resolution (if support use case)
  • Escalation rate (if automation use case)

    Common Pitfalls and Solutions

    Pitfall: Retrieving irrelevant chunks

    Solution: Improve chunking, tune similarity thresholds, add reranking

    Pitfall: Missing information that exists in the corpus

    Solution: Better chunking to avoid fragmentation, query expansion, hybrid search

    Pitfall: Model ignoring context and hallucinating

    Solution: Stronger prompting, explicit instruction to cite sources, lower temperature

    Pitfall: Context window overflow

    Solution: Smarter chunk selection, summarization of retrieved content, longer context models

    Pitfall: Stale information

    Solution: Automated index updates, metadata filtering by date, source freshness signals

    The Future of RAG

    RAG is evolving rapidly. Trends to watch:

  • Long-context models that can ingest entire documents directly
  • Learned retrieval where the retrieval model is fine-tuned end-to-end
  • Multi-modal RAG that retrieves images, tables, and code alongside text
  • Agentic RAG where the system decides what to retrieve and when

    But the fundamentals will not change: understand your documents, chunk them thoughtfully, embed them accurately, retrieve them efficiently, and generate responses that users can trust.

    RAG is not just a technique - it is the bridge between the vast knowledge locked in your documents and the reasoning power of large language models. Build that bridge well, and you unlock tremendous value.
