A Comprehensive Guide to Context Retrieval in LLMs


LLM limitations and addressing them through RAG

While Generative AI technologies like ChatGPT possess impressive capabilities, using them for business use cases requires access to company knowledge that was not part of the model’s training data. This limits LLM adoption, especially where responses need to stay consistent with a predefined knowledge base. To complicate matters further, LLMs tend to generate false but plausible-sounding statements when they lack specific information, a phenomenon referred to as hallucination. You can check out this blog to learn more about hallucinations. In this blog, we will look at how relevant context is retrieved and supplied to LLMs.

To tackle these challenges, the Retrieval Augmented Generation (RAG) approach combines Information Retrieval with thoughtfully designed prompts. This equips LLMs with accurate, up-to-date, and relevant information from external knowledge sources, enabling them to generate well-informed responses tailored to specific domains even though their training data remains static. RAG pipelines also support tasks such as Question-Answering over documents, opening up diverse commercial applications.

Retrieval Augmented Generation, process and implementation

The Retrieval Augmented Generation (RAG) architecture operates as a three-stage pipeline. The first stage is data preparation: loading, indexing, and storing data in a Vector Database. The retrieval stage then fetches information relevant to the task at hand from the database. In the final generation stage, the system produces the output by combining the retrieved data with the task’s requirements. The quality of the output depends on the quality of the data and the retrieval strategy employed.

An illustration of a RAG pipeline with an LLM

Frameworks like Langchain and LlamaIndex have made it fairly easy to develop RAG Pipelines. You can utilize their modules and corresponding functions to load structured and unstructured data and convert documents into indexes.

The indexed documents are then transformed into embeddings: numerical representations of semantic meaning, where objects that are closer in the vector space are more similar. There are many embedding API options available, including ones from OpenAI, AWS, and Azure. The generated embeddings can be stored in a Vector Database alongside other attributes.
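As a minimal sketch of this step, the snippet below chunks a document and generates embeddings with OpenAI’s embeddings API. The file name, chunk size, and model name are illustrative assumptions, and any embedding provider could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Naive fixed-size chunking purely for illustration;
# real pipelines typically use smarter, overlap-aware splitters.
document = open("company_handbook.txt").read()  # hypothetical file
chunks = [document[i:i + 1000] for i in range(0, len(document), 1000)]

# Embed every chunk; "text-embedding-3-small" is an assumed model choice.
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
embeddings = [item.embedding for item in response.data]

print(f"Embedded {len(chunks)} chunks as {len(embeddings[0])}-dimensional vectors")
```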

Vector Databases are specialized databases built to handle high-dimensional vector data and are optimized for storing and querying embeddings. Unlike standalone vector indexes, they combine the functionality of traditional databases with extremely fast search over large volumes of vector embeddings, a capability scalar-based databases lack.

The fundamental distinction between traditional relational databases and modern vector databases lies in their optimized data types. Relational databases excel in structured data storage, while vector databases, in addition to structured data, efficiently store unstructured data like text, images, or audio along with their vector embeddings. Weaviate is an open-source, AI-native vector database that helps developers create intuitive and reliable AI-powered applications and RAG pipelines.
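A minimal sketch of storing and querying embeddings with the Weaviate Python client (v3-style syntax) might look like the following. The `Document` class name, the local endpoint, and the `chunks`, `embeddings`, and `query_embedding` variables are assumptions for illustration, reusing the embedding step above.

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # assumes a local Weaviate instance

# Create a class that stores text alongside externally computed vectors.
client.schema.create_class({
    "class": "Document",
    "vectorizer": "none",  # we supply our own embeddings
    "properties": [{"name": "text", "dataType": ["text"]}],
})

# Store each chunk together with its embedding.
for chunk, embedding in zip(chunks, embeddings):
    client.data_object.create(
        data_object={"text": chunk},
        class_name="Document",
        vector=embedding,
    )

# Retrieve the chunks whose vectors are closest to a query embedding.
results = (
    client.query.get("Document", ["text"])
    .with_near_vector({"vector": query_embedding})
    .with_limit(3)
    .do()
)
```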

The next step is Information Retrieval and Response Synthesis. A general retrieval process in a RAG pipeline consists of a similarity search: the distance between the query embedding and the document embeddings (typically stored in a Vector Database) is calculated using a metric such as Cosine or Euclidean distance (the choice depends on the underlying embedding model), and a nearest-neighbour search (e.g. KNN) returns the closest documents. We will discuss advanced retrieval techniques in the next section. The retrieved data and the original query are then plugged into the prompt used to generate the final output.
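To make the similarity-search step concrete, here is a small self-contained sketch using NumPy. The toy vectors are made up, and a real system would query a vector database rather than compute similarities in memory.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product of L2-normalised vectors.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    similarities = docs @ query
    return np.argsort(similarities)[::-1][:k]

# Toy example: 4 documents in a 3-dimensional embedding space.
doc_vecs = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
print(top_k_by_cosine([0.15, 0.85, 0.05], doc_vecs, k=2))  # indices of the 2 closest docs
```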

Advanced Retrieval Techniques

The general RAG pipeline is very effective and works fairly well for a lot of use cases. However, in some cases, a customized retrieval pipeline can improve the performance of the system significantly. This includes approaches like Query Rewriting, Hybrid Search, and Reranking.

Hybrid Search

Hybrid Search combines multiple search algorithms to improve the accuracy and relevance of the retrieved output. Combining sparse and dense vector search lets you pair robust keyword algorithms like BM25 with semantic (vector) search for deeper semantic understanding. Weaviate supports hybrid search out of the box, with the ability to choose the ranking method, balance keyword and vector search, and apply other customizations. Langchain and LlamaIndex can also be used to perform Hybrid Search with Weaviate.
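As a rough sketch, a hybrid query over the `Document` class from earlier, using the v3-style Weaviate Python client, might look like this. The `alpha` value shown is just an illustrative choice that balances keyword against vector search.

```python
# Hybrid search: BM25 keyword matching fused with vector similarity.
results = (
    client.query.get("Document", ["text"])
    .with_hybrid(
        query="How does photosynthesis work?",
        alpha=0.5,  # 0 = pure keyword search, 1 = pure vector search
    )
    .with_limit(3)
    .do()
)
```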

Reranking

LLMs have been found to exhibit positional bias: information provided earlier in the prompt is given more weight. This means the most relevant information should appear first in the context, which we can achieve by reranking the search outputs.

The Retrieval Model returns multiple documents (top-k), which are fed into the LLM for final output generation or Response Synthesis. A simple approach is to use an LLM with a prompting strategy to rerank these documents by their relevance to the task and query.

Pointwise Methods measure the relevance between the query and each document by prompting the LLM to generate a confidence score, which can then be used to rerank the documents.
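A minimal sketch of pointwise reranking with a prompted LLM is shown below. The prompt wording, model name, and score-parsing logic are all assumptions; a production system would add more robust parsing and error handling.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def relevance_score(query: str, document: str) -> float:
    """Ask the LLM for a 0-10 relevance score for a single document."""
    prompt = (
        "On a scale of 0 to 10, how relevant is the following document to the query?\n"
        f"Query: {query}\nDocument: {document}\n"
        "Respond with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())

def pointwise_rerank(query: str, documents: list[str]) -> list[str]:
    # Score each document independently, then sort by descending relevance.
    return sorted(documents, key=lambda d: relevance_score(query, d), reverse=True)
```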

Listwise Methods involve prompting the LLM with all the retrieved documents at once and asking it to rerank them. For larger retrieval outputs, a sliding-window approach is used. However, although Listwise Methods are very efficient, they are heavily prone to positional bias, which motivates the Pairwise approach.

Pairwise Methods prompt the LLM with the documents in pairs and derive an overall order from the pairwise rankings. This reduces the order-sensitivity issues of Listwise Methods but requires more time and computation. Relatively small LLMs can be fine-tuned to perform reranking, resulting in lower compute costs and more efficient reranking.

However, general-purpose LLMs are not specialised for reranking, which can make these approaches unreliable. A better way to perform reranking is to use the Cohere Rerank API or Sentence Transformer models, both supported out of the box by Weaviate. You can also fine-tune Cohere’s Rerank model to further improve domain performance.
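For reference, calling Cohere’s Rerank API from Python might look roughly like the sketch below. The model name, example documents, and response handling are assumptions based on recent versions of the cohere SDK; check Cohere’s documentation for the current identifiers.

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")  # insert your Cohere key here

documents = [
    "Chlorophyll absorbs light energy in plant leaves.",
    "The stock market closed higher today.",
    "Photosynthesis converts water and CO2 into glucose and oxygen.",
]

# Rerank the retrieved documents against the query.
response = co.rerank(
    query="How does photosynthesis work?",
    documents=documents,
    top_n=2,
    model="rerank-english-v3.0",  # assumed model name; verify against Cohere docs
)

for result in response.results:
    print(result.index, result.relevance_score)
```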

Query Rewriting

We need to ensure that the retrieval model receives relevant and sufficient information, even when the original query is ambiguous. Query Rewriting can be divided into the following two parts:

  • Ad-hoc Retrieval: Querying is the first step in the search funnel, and in ad-hoc retrieval the aim is to address vocabulary mismatches between queries and documents. LLMs with refined prompt templates can rewrite the query to reduce these gaps. A popular technique, HyDE (Hypothetical Document Embeddings), uses an LLM to generate a hypothetical document/answer whose embedding is used for the Vector Database lookup instead of the raw query’s (see the sketch after this list). Similarly, Query2Doc is an LLM-based query rewriting technique that instructs the model to produce a passage that essentially answers the given query, which is then used for retrieval.
An illustration of the HyDE technique
  • Conversational Search: Here, the retrieval system engages in a conversation with the user to clarify what kind of documents are needed, and this conversation serves as input for query rewriting. A well-structured prompt can summarise the conversation and generate the query required for retrieval.
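Here is a minimal sketch of the HyDE idea using the OpenAI API together with the retrieval helpers from earlier; the prompt wording and model names are illustrative assumptions.

```python
from openai import OpenAI

client_llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

query = "How does photosynthesis work?"

# Step 1: ask the LLM to write a hypothetical answer/document for the query.
hypothetical = client_llm.chat.completions.create(
    model="gpt-4o-mini",  # assumed model choice
    messages=[{
        "role": "user",
        "content": f"Write a short passage that answers the question: {query}",
    }],
).choices[0].message.content

# Step 2: embed the hypothetical answer instead of the raw query.
hyde_embedding = client_llm.embeddings.create(
    model="text-embedding-3-small",
    input=[hypothetical],
).data[0].embedding

# Step 3: use the hypothetical answer's embedding for the vector lookup
# (e.g. with the near_vector query from the Weaviate example above).
```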

Other approaches include Query Decomposition, i.e. breaking the query down into simpler sub-queries and retrieving for each one individually, which can be very useful for multi-hop queries.

Evaluation of Retrieved Context

One of the most crucial aspects of building a RAG system is its design: pairing the most cost-effective retrieval setup with your LLM for optimal efficiency. Having evaluation metrics for your design choices can make this job significantly easier.

Some open-source tools can reduce this effort significantly. One of them is UpTrain, which helps determine LLM and prompt effectiveness using multiple metrics. You can also check out this blog to learn how to evaluate your LLM applications, and our docs to get started with UpTrain. It provides functions to analyse relevance, factual accuracy, and other metrics for LLMs, which can help you understand hallucinations and other problems in your system.

Context Relevance is the metric we will focus on, as it reflects the quality of retrieval. Here is a demonstration of using UpTrain Evals to get a quantitative and qualitative evaluation of Context Relevance.

Using UpTrain to evaluate LLM generated responses

Here’s an example of context based evaluation done through UpTrain:

from uptrain import EvalLLM, Evals
import json

# Each entry contains the user question and the context retrieved for it.
data = [
    {
        'question': 'How does photosynthesis work?',
        'context': 'Like tiny solar factories, plants trap sunlight in their leaves using a green pigment called chlorophyll. This light energy fuels a magical process called photosynthesis, where plants combine water and carbon dioxide to make their own food (glucose) and release precious oxygen back into the air. This energy-packed glucose fuels their growth and keeps them thriving, making them the powerhouses of our planet\'s ecosystems.'
    }
]

OPENAI_API_KEY = "sk-********************"  # Insert your OpenAI key here

# UpTrain uses an LLM under the hood to score how relevant the context is to the question.
eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)

res = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE]
)

print(json.dumps(res, indent=3))

You can get the complete code here

Conclusion

This blog offers a comprehensive guide to context retrieval in LLMs and introduces RAG as a solution for addressing their limitations. RAG combines Information Retrieval with thoughtfully designed prompts to supply LLMs with accurate and up-to-date information from external knowledge sources, which enhances the quality of the generated responses.

One key component of RAG is the Vector Database, such as Weaviate, which stores and queries the embeddings that represent the semantic meaning of objects. Weaviate is optimized for exactly this workload, making it well suited to managing high-dimensional vector data.

We also see the importance of evaluating the retrieved context in RAG systems. UpTrain is an open-source tool that provides evaluation metrics, including context relevance, to assess the effectiveness of LLMs and prompts. By analyzing relevance and factual accuracy, UpTrain helps identify and address hallucination issues in LLM applications.

