Overview
We’ll walk through a basic RAG (Retrieval-Augmented Generation) example using Python, Pinecone, and Ollama. By the end of this tutorial you will have a working RAG setup to start experimenting with.
Prerequisites
Repository with the code
If you want to play around with the end result of this blog post, the repository with the code is here:
Setup
Create a new folder for your project, along with an empty notebook:
mkdir rag-application/ && cd rag-application/ && touch main.ipynb
We will use pipenv to make sure the dependencies you install are the same ones needed to run the code.
Create a Pipfile at the root of the folder with the following contents; this will make sure we can install all the dependencies that we need.
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
ollama = "*"
pinecone-client = "==2.2.4"
python-dotenv = "==1.0.0"
sentence-transformers = "==2.2.2"
ipykernel = "==6.28.0"
pinecone = "==7.0.0"

[dev-packages]

[requires]
python_version = "3.11"
If you don’t have pipenv installed, click here to install it.
Run the following command:
$ pipenv install
You should now see a Pipfile.lock file in your project’s folder if everything ran successfully.
Initializing environment variables
Initialize the Pinecone environment variable by copying your Pinecone API key into your .env file.
To create the file:
touch .env
To add the key to it:
PINECONE_API_KEY=[INSERT_YOUR_KEY_HERE]
First Python cell
Then open main.ipynb in your editor. Preferably use VS Code or Cursor, since those two editors have support for Python notebooks.
A quick overview of the dependencies we’re using in the first Python cell that we will add:
- Pinecone: The vector database we will use for this example.
- Ollama: The tool we will use to run an LLM locally for this example.
Vector databases store each record as an embedding vector, which makes it easy to query the database by comparing how close a given query’s embedding is to the ones that are stored.
Ollama runs LLMs locally; the models it serves are quantized so they can run with limited RAM on a consumer-grade laptop.
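To make “close” concrete, here is a minimal sketch of cosine similarity, the kind of distance measure vector databases rely on. It uses made-up three-dimensional vectors in plain Python; a real embedding model such as llama-text-embed-v2 produces vectors with hundreds or thousands of dimensions:
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means similar, close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only.
dog_joke = [0.9, 0.1, 0.2]
puppy_joke = [0.8, 0.2, 0.3]
tax_form = [0.1, 0.9, 0.7]

print(cosine_similarity(dog_joke, puppy_joke))  # high: related records sit close together
print(cosine_similarity(dog_joke, tax_form))    # low: unrelated records sit far apart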
In your Python notebook, make sure you pick the correct Python kernel that was created when running pipenv install:
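If the pipenv environment’s kernel doesn’t appear in the picker, one way to register it manually is via ipykernel (already in our Pipfile); the kernel name rag-example below is just an example:
pipenv run python -m ipykernel install --user --name rag-example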
Create a new Python cell with the following:
# Import to interface with the vector database.
from pinecone import Pinecone

# Import the chat interface for the LLM.
from ollama import chat
from ollama import ChatResponse

# Load environment variables from the .env file.
from dotenv import load_dotenv

load_dotenv()
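Optionally, you can add a quick sanity check (not part of the original flow, assuming the .env file created earlier) to fail fast if the key wasn’t picked up:
import os

# Optional sanity check: fail fast if the key wasn't loaded from .env.
assert os.getenv("PINECONE_API_KEY"), "PINECONE_API_KEY was not loaded"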
Creating the Vector Database index
Next, we’ll create the index in Pinecone; this is what lets you add records to the database.
The model that we pass as a parameter when creating the index refers to the machine learning model used to generate text embeddings. In simple terms, it maps your data into a large vector of floats such that, when plotted in multidimensional space, all the records that are related end up close to each other.
Create a new Python cell with the following:
import os

pinecone_api_key = os.getenv('PINECONE_API_KEY')

pc = Pinecone(api_key=pinecone_api_key)

index_name = "rag-example"

# Create the index only if it doesn't exist yet, so this cell can be re-run safely.
if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "llama-text-embed-v2",
            "field_map": {"text": "chunk_text"}
        }
    )
Reading the dataset and uploading it to the vector database
Download this file below and place it in the project’s root folder:
Create batches in memory for the data that we want to upload. We do this to prepare for uploading the records to Pinecone, because Pinecone rate-limits record creation on a basic account.
import json

records_batches = {}

with open('./reddit_jokes.json', 'r') as file:
    json_reader = json.load(file)

i = 0
batch = 0
for joke in json_reader:
    # Start a new batch every 50 records to stay under the rate limit.
    if i % 50 == 0 and i != 0:
        batch += 1
    records_batches.setdefault(batch, [])
    i += 1
    # Combine the joke's title and body into a single chunk of text to embed.
    full_joke = joke["title"] + '\n\n' + joke["body"]
    records_batches[batch].append({
        "_id": f"rec{i}",
        "chunk_text": full_joke
    })

print(f"Number of joke record batches: {len(records_batches.keys())}")
Finally, we can upload the records using the batches we just created; this step will take a few minutes to complete.
import time

# Retrieve the connection to the index.
dense_index = pc.Index(index_name)
pinecone_index_namespace = "reddit_jokes_namespace"

# Insert records in batches, pausing between batches to respect the rate limit.
for batch_index, batch_data in records_batches.items():
    dense_index.upsert_records(pinecone_index_namespace, batch_data)
    time.sleep(10)

# Print the index's stats to confirm that the records were created.
stats = dense_index.describe_index_stats()
print(stats)
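The fixed 10-second sleep between batches is the simplest way to stay under the rate limit. If you still see rate-limit errors, retrying each batch with exponential backoff is a common alternative; the sketch below reuses the same upsert_records call, and the retry policy and broad exception handling are assumptions rather than Pinecone specifics:
import time

def upsert_with_backoff(index, namespace, batch, max_retries=5):
    """Retry a batch upsert with exponentially growing pauses between attempts."""
    delay = 1
    for attempt in range(max_retries):
        try:
            index.upsert_records(namespace, batch)
            return
        except Exception:  # Catch-all for simplicity; Pinecone's SDK raises more specific errors.
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # Back off: 1s, 2s, 4s, 8s, ...

for batch_index, batch_data in records_batches.items():
    upsert_with_backoff(dense_index, pinecone_index_namespace, batch_data)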
Putting it all together with an LLM
Set up the RAG system and put it all together with an LLM (Ollama in this case).
= "reddit_jokes_namespace"
pinecone_index_namespace = "Share with me the best jokes about dogs."
query
= dense_index.search(
results =pinecone_index_namespace,
namespace={
query"top_k": 3,
"inputs": {
'text': query
}
}
)
= []
augmentation_data
print(f"Got {len(results['result']['hits'])} results from the Vector Database")
for hit in results['result']['hits']:
'fields']['chunk_text'])
augmentation_data.append(hit[
= '\n\n\n====================================\n\n\n'
separator = separator.join(augmentation_data)
augmentation_data
# The prompt template interpolates the query and the retrieved context,
# so it must be an f-string.
response: ChatResponse = chat(model='llama3.2', messages=[
    {'role': 'system',
     'content': 'You are a helpful assistant',
    },
    {'role': 'system',
     'content': f"""
     The user's question is:
     {query}

     Approach this task step-by-step, take your time and do not skip steps.

     Based on the user's question and using the following data as input:
     {augmentation_data}

     1. Read each paragraph from the previous augmentation text.
     2. Create a response that summarizes ONLY the text relevant to the user's query.
     3. Create a message so the user can laugh at the jokes found in the data.
     """
    },
    {'role': 'user',
     'content': query,
    }
])

print(response['message']['content'])
Conclusion
And that’s it! Now you have a working RAG setup; from here you can go in multiple directions. RAG has become a fundamental part of how product and engineering teams get value out of LLMs today.
If your team needs help with RAG or anything related to LLMs or machine learning, check out my Consulting page!