
RAG: Retrieval-Augmented Generation

RAG stands for Retrieval-Augmented Generation. It's a technique that enhances large language models (LLMs) by allowing them to access and incorporate external knowledge sources during the generation process. It is used in chatbots to answer questions based on specific text, internal company documents, code documentation, news articles, etc.

Article reader chatbot

I have built a chatbot, Article Reader, where users can upload a document or a link (PDF or website) and get answers based on that document alone. Unlike regular LLM chatbots, it will not answer questions that go beyond the shared document. A screenshot of the tool is shown below:

[Screenshot: the Article Reader chatbot]

Why is RAG Required?

  1. Overcoming Limitations of LLMs: LLMs are trained on massive datasets, but they may not have access to the most up-to-date or specific information. RAG addresses this by providing LLMs with a way to access and utilize external knowledge bases. In our case, we can share internal company documents or a company's annual report and ask questions based on the shared document alone.
  2. Improved Accuracy and Relevance: By incorporating external knowledge, RAG helps LLMs generate more accurate and relevant responses, especially when dealing with factual questions or domain-specific topics.  
  3. Enhanced Contextualization: RAG enables LLMs to better understand the context of a query by considering external information, leading to more coherent and informative responses.  
  4. Up-to-Date Information: RAG allows LLMs to access the latest information from external sources, ensuring that their responses are current and relevant.  

How RAG Works

  1. Retrieval: The LLM receives a query and retrieves relevant documents or passages from an external knowledge base.  
  2. Augmentation: The retrieved information is integrated into the LLM's input, either by concatenating it with the original query or by feeding it directly to the model.  
  3. Generation: The LLM generates a response based on the augmented input, incorporating the external knowledge into its output.

In essence, RAG combines the strengths of retrieval-based models (for finding relevant information) and generative models (for generating human-like text), creating a more powerful and versatile AI system.
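The sketch below shows this loop in a few lines of Python. It is only schematic: knowledge_base and llm are hypothetical placeholders with a similarity_search and an invoke method respectively; the rest of this article builds concrete versions using Qdrant or FAISS for retrieval and OpenAI or a local model for generation.

def answer_with_rag(query, knowledge_base, llm):
    # 1. Retrieval: find the chunks most similar to the query
    relevant_chunks = knowledge_base.similarity_search(query, k=3)
    # 2. Augmentation: stitch the retrieved chunks into the prompt
    context = "\n".join(chunk.page_content for chunk in relevant_chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # 3. Generation: the LLM answers from the augmented prompt
    return llm.invoke(prompt)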

Let us now build two RAG models end-to-end, one using OpenAI and the other using models that run entirely locally.

RAG using OpenAI

Let us say we want to read a person's resume and find out the years of experience of the person. We can use the Article Reader chatbot to upload the resume and ask the question.

Let us first read a resume.

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
import openai
import config

def extract_text_from_pdf(pdf_file):
    """Read every page of a PDF and return the concatenated text."""
    pdf_text = ''
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        pdf_text += page.extract_text()
    if len(pdf_text) == 0:
        raise ValueError('No text extracted from this pdf.')
    return pdf_text
text_article = extract_text_from_pdf("Harsha_Achyuthuni_resume.pdf")
print("The first 100 chars in the resume are:\n", text_article[:100])
The first 100 chars in the resume are:
 Achyuthuni Sri Harsha
Data Scientist, Deloitte, Imperial College London, IIMB, harshaash.com
E-mail:
print("The total number of words in the document are: ", len(text_article.split()))
The total number of words in the document are:  1611

Let us define the question that we want to ask:

query = "How many years of experience does the candidate have?"

Retrieval

The first step of RAG is dense retrieval. This technique leverages the power of embeddings – unique numerical representations of text – to efficiently search for the most relevant information within a document.

To achieve this, we break down the document into smaller, manageable chunks. Each chunk is then transformed into a dense vector (its embedding), capturing the essence of its meaning. This collection of embeddings forms an index, enabling the system to swiftly locate the chunks that most closely align with the user's query (which is also converted into an embedding).

This is a four-step process:
1. Get the text and split it into chunks
2. Embed the chunks
3. Build a search index
4. Search

Chunking

There are many ways to split the document into chunks, such as:
1. Each sentence is a chunk
2. Each paragraph is a chunk (a simple sketch of this is shown below)
3. The document is split evenly, with a similar number of tokens in each chunk
4. Some chunks lose meaning if we do not include the text around them, so we can add context by including some text before and after the chunk
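As an illustration of option 2, paragraph-level chunking can be approximated with a simple split on blank lines (assuming the extracted text preserves them); the implementation below uses token-based chunking with an overlap instead.

# Illustration only: paragraph-level chunking by splitting on blank lines
paragraph_chunks = [p.strip() for p in text_article.split('\n\n') if p.strip()]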

Let us split the document into chunks of 100 tokens, with an overlap of 25 tokens between adjacent chunks.

# Split the text based on tokens into chunks
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=100, # Maximum chunk size in tokens
    chunk_overlap=25 # Overlap of 25 tokens between adjacent chunks
)
split_texts = text_splitter.split_text(text_article)
split_texts = [Document(page_content=chunk) for chunk in split_texts]
print("The total number of chunks in the document are: "+str(len(split_texts)))
The total number of chunks in the document are: 34

For a chunk size of 100 tokens, the image below shows how the text has been chunked; the highlighted yellow and red regions are separate chunks, with the common part being the overlap.

[Image: two highlighted chunks from the resume, with the overlapping text shared between them]
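We can also see the overlap directly by printing the end of one chunk and the start of the next (the indices 0 and 1 are chosen just for illustration):

# The end of the first chunk reappears at the start of the second chunk:
# this shared text is the 25-token overlap
print(split_texts[0].page_content[-150:])
print('-----')
print(split_texts[1].page_content[:150])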

Embedding and building search index

To generate embeddings, I am using OpenAI's text-embedding-3-large model. Each chunk is converted into an embedding vector.

These embeddings are then stored using Qdrant, an open-source vector database with a free cloud tier. Qdrant is specifically designed to store and search through large collections of embeddings, making it a good fit for this RAG system: it is what we query against during the retrieval phase.

# Create the embedding model
embeddings = OpenAIEmbeddings(api_key = config.open_api_key, model="text-embedding-3-large")

Let us store these embeddings in a Qdrant collection called my_chat_documents.

# Initialise vector database
qdrant_client = QdrantClient(
    url=config.qdrant_url, 
    api_key=config.qdrant_api_key,
)
# Delete existing collection
qdrant_client.delete_collection(collection_name="my_chat_documents")

# Store embeddings in vector db
qdrant = QdrantVectorStore.from_documents(
    split_texts,
    embeddings,
    url=config.qdrant_url, 
    api_key=config.qdrant_api_key,
    prefer_grpc=True,
    collection_name="my_chat_documents",
)

[collection.name for collection in qdrant_client.get_collections().collections]
['my_chat_documents']

I am going to use the vector database to find the chunks closest to the query. This is done by creating an embedding vector for the query with the same embedding model and finding its nearest neighbours using a similarity score such as cosine similarity.
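Under the hood, the similarity score is a comparison between the query embedding and each chunk embedding. A minimal illustration of cosine similarity, reusing the embeddings object defined above (the chunk index is arbitrary):

import numpy as np

# Embed the query and one chunk with the same embedding model,
# then compute the cosine similarity between the two vectors
query_vector = np.array(embeddings.embed_query(query))
chunk_vector = np.array(embeddings.embed_documents([split_texts[0].page_content])[0])
cosine_sim = np.dot(query_vector, chunk_vector) / (
    np.linalg.norm(query_vector) * np.linalg.norm(chunk_vector)
)
print(round(float(cosine_sim), 3))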

# Load the existing collection of document embeddings
qdrant = QdrantVectorStore.from_existing_collection(
  embedding=embeddings,
  collection_name="my_chat_documents",
  url=config.qdrant_url, api_key=config.qdrant_api_key
)
# Find closest chunks
relevant_document_chunks = qdrant.similarity_search(query=query,k=3)
context_list = [d.page_content for d in relevant_document_chunks]

Augmentation

The top three chunks are:

print("\n-----------------------\n".join(context_list))
Summary
Harsha has 7 years of experience in Data Science and Machine learning. As a senior consultant at
Deloitte, he likes solving complex business problems using data, statistics, technology and business
understanding. He has worked on projects across the data science spectrum, including regression,
classification, deep learning, machine learning, AI, ChatGPT, reinforcement learning, optimization,
unsupervised learning and streaming machine learning.
-----------------------
Achyuthuni Sri Harsha
Data Scientist, Deloitte, Imperial College London, IIMB, harshaash.com
E-mail: achyuthuni.sri.harsha@gmail.com ✼Phone/Whatsapp: +91-9019413416
LinkedIn: sri-harsha-achyuthuni ✼Website: www.harshaash.com
Summary
Harsha has 7 years of experience in Data Science and Machine learning. As a senior consultant at
-----------------------
Solutions that he built are currently deployed at Walmart, Rolls Royce and Dr Reddys. He has also
built, presented and converted POCs across many Fortune 500 clients.
He has a masters from Imperial College London in Business Analytics and is an alumnus of IIM
Bangalore.
Technical skills
Analytics Python, R, CPLEX
Data Engineering SQL, Alteryx
Visualisation Tableau, HTML, Javascript

The first two of the three retrieved chunks mention the years of experience and are relevant. Joining the three chunks gives a comprehensive, augmented context that can be used for generation.

augmented_search_context = ". ".join(context_list)

Generation

In the prompt, we will provide an additional variable named context, which contains the retrieved chunks that are relevant to answering the question.

qna_system_message = '''You will be provided with a text, and your task is to answer the question based on the text alone.
    If you are unable to answer or are doubtful, please say "I don't know"'''
qna_user_message_template = """
###Context
Here are some documents that are relevant to the question mentioned below.
{context}

###Question
{question}
"""
prompt = [
    {'role':'system', 'content': qna_system_message},
    {'role': 'user', 'content': qna_user_message_template.format(
         context=augmented_search_context,
         question=query
        )
    }
]
openai.api_key = config.open_api_key
response = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=prompt,
    temperature=0
)

prediction = response.choices[0].message.content.strip()
print(prediction)
The candidate has 7 years of experience.

The RAG architecture was able to find the correct answer.

RAG with local models

Many organizations prioritize data privacy and security, often prohibiting the sharing of sensitive information with external services. This presents a significant challenge for traditional RAG implementations, which often rely on cloud-based models and external knowledge sources.

To address these concerns, I have implemented a RAG architecture using entirely local models. This approach eliminates the need for internet connectivity and ensures that all data processing and model interactions occur within the organization's secure internal systems.

Downloading a pre-trained LLM locally: open-source LLMs such as Meta's Llama can be run on local hardware; here I download Microsoft's Phi-3-mini-4k-instruct in GGUF format and run it through llama.cpp.

!wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf
Phi-3-mini-4k-instr 100%[===================>]   2.23G  39.2MB/s    in 57s
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",
    n_gpu_layers=-1,
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False
)

For embedding the text, the BAAI/bge-small-en-v1.5 model is used.

from langchain_community.embeddings import HuggingFaceEmbeddings

# Embedding Model for converting text to numerical representations
embedding_model = HuggingFaceEmbeddings(
    model_name='BAAI/bge-small-en-v1.5'
)

A local vector database (FAISS) is then created using this embedding model.

from langchain_community.vectorstores import FAISS

# Split the text into individual lines
texts = text_article.split('\n')

# Strip whitespace and drop empty lines
texts = [t.strip(' \n') for t in texts if t.strip()]

# Create a local vector database
db = FAISS.from_texts(texts, embedding_model)
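Before wiring the retriever into a chain, we can sanity-check what the local vector store returns for our query (the exact lines retrieved will depend on the document):

# Quick check: which lines does FAISS consider closest to the query?
for doc in db.similarity_search(query, k=3):
    print(doc.page_content)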

Similar to the previous implementation, I have created the prompt such that the context is provided along with the query.

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA


# Create a prompt template
template = """<|user|>
Relevant information:
{context}

Provide a concise answer to the following question using the relevant information provided above:
{question}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# RAG Pipeline
rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever(),
    chain_type_kwargs={
        "prompt": prompt
    },
    verbose=True
)

The answer to the query is:

rag.invoke(query)
> Finished chain.

{'query': 'How many years of experience does the candidate have?',
 'result': ' The candidate has 7 years of experience in Data Science and Machine learning.'}

RAG evaluation

RAG responses can be evaluated on various parameters, like fluency, perceived utility, citation recall and precision, faithfulness, relevance and groundedness. Let us look at relevance and groundedness in detail.

We can use the LLM-as-a-judge method to check the quality of the RAG system.
1. Groundedness: If the answer is based solely on the context provided
2. Relevance: If the answer is relevant and answers all aspects of the query

groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""

groundedness_prompt = [
    {'role':'system', 'content': groundedness_rater_system_message},
    {'role': 'user', 'content': user_message_template.format(
        question=query,
        context=augmented_search_context,
        answer=prediction
        )
    }
]

response = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=groundedness_prompt,
    temperature=0
)

print(response.choices[0].message.content)
Steps to evaluate the answer:
1. Identify the specific information in the context related to the candidate's experience.
2. Check if the answer provided matches the exact number of years of experience mentioned in the context.
3. Determine if any additional information not present in the context is included in the answer.

Explanation:
1. The context clearly states that Harsha has 7 years of experience in Data Science and Machine learning.
2. The answer provided states "The candidate has 7 years of experience," which directly matches the information given in the context.
3. The answer does not include any additional information beyond what is provided in the context.

The answer follows the metric completely by accurately stating the number of years of experience the candidate has based on the information given in the context.

### Evaluation: 5
### Score: 5
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""

relevance_prompt = [
    {'role':'system', 'content': relevance_rater_system_message},
    {'role': 'user', 'content': user_message_template.format(
        question=query,
        context=augmented_search_context,
        answer=prediction
        )
    }
]

response = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=relevance_prompt,
    temperature=0
)

print(response.choices[0].message.content)
1. Identify the main aspects of the question: The main aspect of the question is to determine the number of years of experience the candidate has.

2. Look for relevant information in the context: Search for details related to the candidate's experience in the context provided.

3. Determine if the answer contains the specific number of years of experience: Check if the answer explicitly states the number of years of experience the candidate has.

4. Evaluate if the answer addresses the main aspect of the question: Assess whether the answer accurately provides the number of years of experience as requested in the question.

Explanation:
The context clearly states that Harsha has 7 years of experience in Data Science and Machine learning. The AI generated answer directly addresses the main aspect of the question by stating "The candidate has 7 years of experience." The answer is relevant as it provides the specific number of years of experience the candidate has based on the information in the context.

Therefore, the answer follows the metric of relevance completely by addressing the main aspect of the question accurately.

###Final Rating
5 - The metric is followed completely
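Both judge responses end with a numeric rating inside free text. To use these ratings programmatically, a small illustrative helper (not part of the original pipeline) could extract the last 1-5 score mentioned in the response; applied to the two outputs above, it returns 5 in both cases.

import re

def extract_score(judge_response):
    # Take the last standalone digit between 1 and 5 in the judge's reply,
    # which in the outputs above is the final score
    matches = re.findall(r'\b([1-5])\b', judge_response)
    return int(matches[-1]) if matches else None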

Written with assistance from Generative AI

References

  1. Hands-On Large Language Models, Chapter 8 - Semantic Search, Jay Alammar & Maarten Grootendorst
  2. AI Expert Bootcamp, Great Learning