RAG

Retrieval Augmented Generation (RAG) - Unveiling the Simplicity and Magic of an LLM Innovation

In today’s landscape, large language models (LLMs) are catalyzing transformative applications across various industries. Among these, Retrieval Augmented Generation (RAG) stands out for its straightforward yet powerful approach, delivering impressive outcomes across numerous scenarios. This blog will explore RAG from its foundational principles to its practical applications.

Introduction

Retrieval Augmented Generation (RAG) is a hybrid approach that enhances traditional language model generation by dynamically retrieving and integrating external information during the generation process. Building a RAG model involves selecting a robust knowledge base, integrating it with a language model, and tuning the system so that relevant information is incorporated dynamically at generation time.

What Does RAG Do?

The RAG Pipeline

The RAG pipeline is a sequence of steps that blends retrieval mechanisms with generative capabilities. Here's a detailed breakdown of each stage:

Knowledge Base Creation

The foundation of RAG is a well-organized knowledge base. Building one involves collecting, curating, and structuring data so that it can effectively support the model's retrieval needs. Data is typically sourced from databases, documents, or other textual sources, then cleaned, standardized, and indexed.
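
As a minimal sketch of the cleaning and chunking step (the chunk size, overlap, helper names, and sample document below are illustrative assumptions, not recommendations):

# A minimal sketch of cleaning and chunking raw documents before indexing.
import re

def clean_text(text):
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

def chunk_text(text, chunk_size=500, overlap=50):
    # Slice the text into overlapping character windows so content cut at a
    # chunk boundary still appears in the neighboring chunk
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

documents = ["  Green tea  is high in antioxidants...  "]  # hypothetical raw source
chunks = [piece for doc in documents for piece in chunk_text(clean_text(doc))]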

Here are some major classes of knowledge bases, along with details on their characteristics and use cases:

Vector Databases

Vector databases store data in the form of vectors, i.e. representations of data points in a high-dimensional space. The data is partitioned into chunks, each chunk is encoded into a vector using an embedding model (often a deep learning model such as a transformer), and the resulting vectors are then indexed in the vector database.

Vector databases are most commonly used for RAG models that rely on semantic search, where the similarity between vectors directly correlates with the relevance of information in response to a query. For instance, consider a database containing addresses, where the query is a list of addresses with typographical errors and the objective is to retrieve the most accurately matched correct address. In this scenario, a vector database is an ideal choice, because the similarity between the vectors generated from the addresses efficiently highlights the most relevant information in response to the query.

Popular building blocks include Faiss for vector indexing and search, dense embedding models such as BERT, sparse representations such as TF-IDF, and toolkits such as NLTK for text preprocessing.
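
Below is a minimal sketch of how cleaned chunks might be embedded and indexed. It assumes the sentence-transformers and faiss-cpu packages are installed; the model name is just one common choice, and any embedding model or vector store could be substituted:

# A minimal sketch: embed text chunks and index them in Faiss.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "Green tea is high in antioxidants that can improve brain function.",
    "Some studies link green tea to increased weight loss.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)                   # shape: (num_chunks, dim)

index = faiss.IndexFlatL2(embeddings.shape[1])      # exact L2 nearest-neighbor index
index.add(np.asarray(embeddings, dtype="float32"))  # store one vector per chunk

For a large collection, an approximate index (for example IVF or HNSW) would usually replace the exact IndexFlatL2 shown here.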

Graph Databases

Graph databases use graph structures to represent and store data. The relationships among data points, which can be extracted by an LLM, are stored directly as edges. This makes graph databases highly suitable for complex queries that involve relationships.

Graph databases are commonly used in scenarios where relationships between data points are crucial, such as linking concepts across documents or understanding hierarchical structures within the data. For example, a graph database is instrumental when implementing a personalized recommendation system: nodes in the graph represent users, products, and categories, while edges denote actions such as purchases or views. This setup allows the system to track user interactions and dynamically update connections between products and users.
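
To make this concrete, here is a minimal sketch that uses networkx as a lightweight stand-in for a real graph database; all node names and relations are hypothetical:

# A minimal sketch of a recommendation-style graph with networkx.
import networkx as nx

G = nx.DiGraph()
G.add_edge("alice", "green_tea", relation="purchased")
G.add_edge("alice", "matcha_powder", relation="viewed")
G.add_edge("green_tea", "beverages", relation="belongs_to")
G.add_edge("matcha_powder", "beverages", relation="belongs_to")

# Everything Alice has interacted with, and how
for _, product, data in G.out_edges("alice", data=True):
    print(product, data["relation"])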

The graphrag package in Python is a commonly used implementation.

Information Retrieval

After a user provides a query in natural language, the retrieval process fetches the most relevant information from the knowledge base in response to that query.

The retrieval mechanism pinpoints segments of indexed data that are semantically similar to the user's embedded query. In a vector database, for instance, retrieval identifies the nearest neighbors to the query vector, which then serve as context for the large language model (LLM). In a graph database, by contrast, retrieval extracts a sub-knowledge base, specifically a sub-graph, centered on the entity mentioned in the query. Either way, the goal is to ensure that the retrieved information is highly relevant and contextually appropriate to the user's needs.

Retrieval Approaches
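
One common approach is dense retrieval against a vector index. Continuing the hypothetical model, index, and chunks from the vector database sketch above, the query is embedded with the same model and matched to its nearest neighbors:

# A minimal sketch of dense retrieval, reusing the hypothetical `model`,
# `index`, and `chunks` defined in the vector-database example above.
import numpy as np

query = "What are the health benefits of drinking green tea?"
query_vector = model.encode([query])

# Fetch the top-k nearest chunks; k=2 is arbitrary here
distances, indices = index.search(np.asarray(query_vector, dtype="float32"), 2)
retrieved_snippets = [chunks[i] for i in indices[0]]
print(retrieved_snippets)

For a graph knowledge base, the analogous step would be extracting a sub-graph around the entity mentioned in the query (for example with networkx's ego_graph). Hybrid approaches that combine sparse (keyword) and dense (vector) retrieval are also common.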

Contextual Integration

Once the relevant information has been retrieved in the previous step, it is combined with the initial query to form a new, enriched input context that enhances the response's relevance and accuracy. This process involves aligning the retrieved information with the query and ensuring that the combination is meaningful.

Response Generation

At this point, the model combines the information it has retrieved with its pre-existing knowledge to produce responses that are not only coherent but also contextually relevant. This involves weaving together insights from various sources to ensure both accuracy and relevance. The response is crafted to be informative while also matching the user’s original query.

Here's a naive code-level example of the contextual integration and response generation steps:

# Let's assume we have a user query and some retrieved text snippets that might contain the answer.

user_query = "What are the health benefits of drinking green tea?"
retrieved_snippets = [
    "Green tea is high in antioxidants that can improve the function of your body and brain.",
    "Some studies show that green tea leads to increased weight loss and helps in preventing cardiovascular diseases."
]

# Preprocess and combine the information
import nltk
from nltk.tokenize import sent_tokenize

# Download the tokenizer data required for sentence tokenization
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by newer versions of NLTK

# Function to concatenate the query and retrieved snippets into a single context
def create_context(query, snippets):
    # Split each snippet into sentences so they join cleanly into one context string
    sentences = [s for snippet in snippets for s in sent_tokenize(snippet)]
    return query + " " + " ".join(sentences)

# Generate the combined context
combined_context = create_context(user_query, retrieved_snippets)
print("Combined Context for the Model:", combined_context)

# Use a Language Model to Generate an Answer
from transformers import pipeline

# Initialize the question answering pipeline
qa_pipeline = pipeline('question-answering')

# Use the model to generate an answer
answer = qa_pipeline({
    'question': user_query,
    'context': combined_context
})

print("Answer:", answer['answer'])

Note that this is just a very simple example of the last two steps, included for better understanding. The performance of this code will be limited by the choice of language model and the naive context construction. In a real implementation, I would personally recommend using ChatPromptTemplate in LangChain for more effective results.
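
For reference, here is a rough sketch of what that might look like. It assumes the langchain-openai package is installed and an OpenAI API key is configured; the prompt wording and model name are illustrative, and user_query and retrieved_snippets are reused from the example above:

# A rough sketch of contextual integration with ChatPromptTemplate.
# Assumes langchain, langchain-openai, and a configured OPENAI_API_KEY.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# Chain the prompt, the chat model, and a plain-string output parser
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

answer = chain.invoke({
    "context": "\n".join(retrieved_snippets),
    "question": user_query,
})
print(answer)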

Evaluation

Evaluating the performance of Retrieval Augmented Generation (RAG) models is crucial for determining their effectiveness and identifying areas for improvement. Personally, I highly recommend the RAGAS package: it allows you to generate a synthetic evaluation dataset for assessing your RAG pipeline and to score the RAG answers against metrics such as faithfulness, answer relevancy, context precision, and context recall.

RAGAS even allows you to monitor the performance of RAG in production. In a real implementation, after evaluating your RAG model, common next steps include refining the retrieval algorithm and enhancing the language model integration to further improve the RAG system.
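
As a rough sketch of what a RAGAS evaluation run might look like (the exact imports, metric names, and dataset columns vary between RAGAS versions, and the sample rows below are made up, so treat this as illustrative rather than a drop-in script):

# A rough sketch of evaluating a RAG pipeline with RAGAS.
# By default RAGAS uses an LLM judge, so it assumes an API key (e.g. OpenAI)
# is configured in the environment.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["What are the health benefits of drinking green tea?"],
    "answer": ["Green tea is rich in antioxidants and may support heart health."],
    "contexts": [[
        "Green tea is high in antioxidants that can improve the function of your body and brain.",
    ]],
    "ground_truth": ["Green tea provides antioxidants and may reduce cardiovascular risk."],
})

results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)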

Summary

In summary, RAG enhances models to produce responses that are both contextually rich and informed by up-to-date or domain-specific data not included in their initial training sets. RAG's widespread adoption can be attributed to several key advantages: it grounds answers in external sources, which reduces hallucination; it lets the knowledge base be updated without retraining the model; and it makes it easier to trace an answer back to the documents that support it.

However, despite these benefits, RAG also has notable drawbacks and limitations: answer quality depends heavily on the quality of the knowledge base and the retriever, the extra retrieval step adds latency and system complexity, and routing sensitive data through an external knowledge store can raise privacy concerns.

In cases where the application requires deep expertise in a narrowly defined domain, very fast response times, or adherence to stringent data privacy standards, fine-tuning would be a better choice than RAG.
