What’s Wrong in my RAG Pipeline?



Building an LLM prototype is undoubtedly fun, but turning it into a robust production-grade system demands perseverance and attention to detail. While a rudimentary RAG (Retrieve ➡️ Augment ➡️ Generate) pipeline might suffice to win a hackathon, a production-grade application has to handle far more edge cases, along with the chaos of monitoring and improving everything that can go wrong.

Recent surveys among the UpTrain community and top YC companies highlight a common trend: developers frequently string together 5 to 6 LLM calls to craft a complete response. In response, RAG pipelines have evolved into more sophisticated, modular architectures, and frameworks with modules for reranking, query rewriting, and LLM-based chunking are gaining popularity.

The challenges at hand are twofold:

  1. Identifying the root cause of failures:
    • Even at an 80% accuracy rate, it is hard to turn the failing cases into actionable insights.
  2. Uncertainty about which adjustments to make:
    • Should the chunking mechanism be recalibrated?
    • Should additional instructions be added to the prompts?
    • Would splitting a single LLM call into two improve performance?


To address these challenges, we’ve developed RCA (Root Cause Analysis). It takes your failure cases, runs a battery of checks and tells you exactly what went wrong.

How to Find Failure Cases in RAG Pipeline?

We’re developing different RCA Templates to address specific needs. Our first template, RAG_WITH_CITATION, is perfect for building chatbots. It takes user queries, finds relevant information, and generates responses with citations, showing exactly where the information came from.

There are two essential components in our pipeline:

  1. RAG (Retrieve, Augment, Generate): This mechanism retrieves the most relevant documents to answer user queries. It’s crucial because LLMs may lack specific business knowledge required for accurate responses.
  2. Citation: Beyond just providing a response, it’s vital to specify which parts of the knowledge base were used to generate it. LLMs have a tendency to generate inaccurate information, and robust citation instills trust in users by demonstrating the verifiability of the information provided.

Even within a seemingly straightforward system like the one described above, numerous potential points of failure emerge:

  1. Ambiguous or irrelevant user queries:
    • Often, user queries lack clarity, are ambiguous, or simply don’t apply to your application. Though unexpected, such queries make up a significant portion of real-world user interactions: meaningless questions like “How are you?”, random characters, or mentions of unrelated topics like “Taylor Swift.” When building a conversational agent, individual follow-up questions may also appear incomplete on their own, which calls for a query rewrite block that uses the conversation so far to reformulate the user query. In such scenarios, the quality of the final question depends heavily on how well the query rewrite block performs.
  2. Incomplete Response – Poor Retrieval:
    • What if a user asks about information that isn’t common knowledge? You need to supply that additional information from your knowledge base for an LLM to answer such questions. Poor retrieval, i.e., failing to pull the right information out of the knowledge base, can hinder this process. Even when a response is generated, it may be incomplete because the relevant documents were not retrieved. This often happens with complex user queries that need to be split into sub-questions for better retrieval. Fine-tuning the embedding model or adjusting parameters like top-k can also improve document relevance.
  3. Incomplete Response – Hallucinations:
    • There are instances where the LLM fabricates information, a phenomenon known as hallucination. This can have severe consequences, emphasizing the need for stringent control measures. Even complete but incorrect responses can fall into this category, necessitating hallucination evaluation across the dataset.
  4. Incomplete Response – Incorrect Citations:
    • In some cases the generated response cannot be validated by appropriate citations. Such situations reduce the credibility of the response and erode user trust. It therefore becomes necessary to ensure that the citations are relevant and support the generated response.
  5. Incomplete Response – Poor Context Utilization:
    • Despite having relevant context, the LLM may fail to fully leverage it, leading to incomplete responses. These cases, characterized by accurate citations but insufficient information utilization, highlight the importance of maximizing context utilization.
  6. Others:
    • Not all failure scenarios neatly fit into predefined categories. There may be diverse situations requiring unique troubleshooting approaches beyond the defined failures.
UpTrain’s logic to find failure cases in a RAG pipeline
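
To build intuition for how such routing can work, here is a simplified, hypothetical sketch in Python. It is not UpTrain’s actual implementation; the score names, thresholds, and ordering below are illustrative assumptions based on the checks shown later in this post.

def classify_failure(scores: dict, threshold: float = 0.5) -> str:
    # Illustrative routing only: map low check scores to the failure
    # categories described above, attributing each row to the earliest
    # failing stage (query -> retrieval -> generation).
    if scores["question_completeness"] < threshold:
        return "Ambiguous or irrelevant user query"
    if scores["context_relevance"] < threshold:
        return "Incomplete Response - Poor Retrieval"
    if scores["factual_accuracy"] < threshold:
        return "Incomplete Response - Hallucinations"
    if scores["cited_context_relevance"] < threshold:
        return "Incomplete Response - Incorrect Citations"
    if scores["response_completeness"] < threshold:
        return "Incomplete Response - Poor Context Utilization"
    return "Others"

The ordering here is a deliberate design choice: query-side checks come first, retrieval next, and generation-side issues last, so each failing row is attributed to the earliest stage that broke down.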

At UpTrain, we believe our framework offers a valuable tool for diagnosing and resolving issues with LLM configurations. Our goal is to expand this framework with more RCA templates tailored to different setups.

We’re committed to providing you with the best tools to identify problems and offer actionable solutions. RCA is just the beginning; we aim to develop advanced mechanisms to automate suggestions for improvement.

Our mission is to empower you with insights and tools to enhance your LLM systems. If you have thoughts or want to brainstorm, we’re here to help. Let’s work together to make LLM technology better.

Automatically identify failures in RAG

Let’s see how to use UpTrain to get failure cases in our RAG pipeline:

Step 1: Install UpTrain and Import Required Libraries

To install UpTrain, run the following command in your terminal:

pip install uptrain

Once this is complete, import the required libraries:

from uptrain import RcaTemplate, EvalLLM
import json

Step 2: Let’s define a dataset

For simplicity, let’s take a dataset with a single row. Each row has a question, the retrieved context, the cited context (the part of the context the response is based on), and the generated response.

sample_data = [
    {
        'question': 'Can FedL deliver electronic devices?',
        'context': "FedL was established in 2020. Using FedL you can send deliveries to over 1000+ cities in India including major cities like Bangalore, Mumbai and Delhi. Recently we crossed a milestone by completing 1 million deliveries. Using FedL you can deliver any goods under 10kg(not more than that) to anyone whether it be your friends or family. P.S.: We can't deliver electronic devices. You can also use FedL to deliver a car.",
        'response': 'FedL offers deliveries to over 1000+ cities.',
        'cited_context': 'Using FedL you can send deliveries to over 1000+ cities in India including major cities like Bangalore, Mumbai and Delhi.'
    }
]
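
Before running the analysis, it can help to sanity-check that every row carries the four fields used above; the field names here are simply the ones from our sample row, not an exhaustive schema.

required_fields = ["question", "context", "response", "cited_context"]

for i, row in enumerate(sample_data):
    # Flag any row that is missing one of the expected fields
    missing = [field for field in required_fields if field not in row]
    if missing:
        raise ValueError(f"Row {i} is missing fields: {missing}")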

Step 3: Running evaluations using UpTrain

Finally, let’s run our evaluation!

OPENAI_API_KEY = "sk-*******"     # Insert your OpenAI key here

eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)

res = eval_llm.perform_root_cause_analysis(
    data=sample_data,
    rca_template=RcaTemplate.RAG_WITH_CITATION
)

And voila! In just these few steps, UpTrain has generated an analysis of this data.
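
If you want to inspect the raw output, the returned object should be JSON-serializable (typically one entry per input row; exact field names can vary across versions), so the json import from Step 1 comes in handy:

# Pretty-print the full analysis, including per-check scores and reasoning
print(json.dumps(res, indent=2, default=str))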

Let’s have a look at the failure mode identified for this example:

  • Error Mode: Poor Context Utilization
  • Error Resolution Suggestion: Add intermediary steps so that the LLM can better understand the context and generate a complete response

Besides this, UpTrain also gives you the scores for the intermediary checks behind this analysis, along with detailed reasoning.

  • Question Completeness (Does the given question make sense?) – Score: 1. Reasoning: The question by itself makes sense and there can be a defined answer to it.
  • Context Relevance (Is the context relevant to the user’s question?) – Score: 0.5. Reasoning: The given context can give some relevant answer for the given query but can’t answer it completely. The context mentions that FedL can deliver any goods under 10kg to anyone, but it also explicitly states “P.S.: We can’t deliver electronic devices.” Therefore, while it provides some relevant information about FedL’s delivery services, it does not fully answer the query about delivering electronic devices.
  • Factual Accuracy (Are the facts mentioned in the response grounded in the context?) – Score: 1. Reasoning: The context explicitly mentions that using FedL you can send deliveries to over 1000+ cities in India, including major cities like Bangalore, Mumbai, and Delhi. Hence, the fact can be verified by the context.
  • Cited Context Relevance (Is the cited context relevant to the user’s question?) – Score: 0. Reasoning: The cited context does not contain any information about FedL delivering electronic devices. It only mentions the cities where deliveries can be sent, which is not relevant to the query about delivering electronic devices.
  • Factual Accuracy wrt Cited Context (Are the facts mentioned in the response grounded in the cited context?) – Score: 1. Reasoning: The cited context explicitly states that using FedL you can send deliveries to over 1000+ cities in India, including major cities like Bangalore, Mumbai, and Delhi. Hence, the fact can be verified by the cited context.
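
On a single row this table is easy to read by eye, but over a larger dataset you usually care about the distribution of failure modes. Below is a minimal aggregation sketch; it assumes each entry in res exposes the error mode under a key named "error_mode", so check the actual key names in your output first.

from collections import Counter

# Count how often each failure mode appears across the dataset
error_modes = Counter(row.get("error_mode", "Others") for row in res)

for mode, count in error_modes.most_common():
    print(f"{mode}: {count}")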

You can refer to this notebook to look at the analysis performed by UpTrain, and to our documentation to learn more about using UpTrain.

Using UpTrain, you can also evaluate your LLM-generated responses on metrics such as retrieved-context quality, hallucinations, language quality and tone, response completeness, jailbreak detection, etc. To get started with these evaluations, refer to the UpTrain docs.
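
As a rough sketch, running a couple of these standalone checks looks like the snippet below, reusing the eval_llm client and sample_data from the earlier steps; the check names follow the Evals enum in the UpTrain docs, so verify the exact names available in your version.

from uptrain import Evals

# Run individual evaluations (independent of RCA) on the same data
results = eval_llm.evaluate(
    data=sample_data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_COMPLETENESS]
)

print(json.dumps(results, indent=2, default=str))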

If you want to brainstorm about LLM evaluations, discuss specific problems you are facing while working on LLMs, or share feedback on UpTrain, you can book a free call with the maintainers of UpTrain.

Here’s the link to our community if you need any help with UpTrain.

Also, if you liked reading this content, please don’t forget to star us on GitHub.


