RAG ChatBot using YouTube Transcripts

Motivation:

One day, I was looking for some good health advice and gradually landed on ChatGPT (bad idea). Then I thought: what if Andrew Huberman himself gave me the advice? But I am not going to spend two hours watching his videos to get my questions answered when I can spend a few days creating an Andrew Huberman bot that answers them (Disclaimer: not my idea, just a recreation 😛).

LLMs are great at summarizing, but one won’t take in the transcripts of those 244+ videos just to answer whether my teeth will decay quickly after those late-night snacks 🙈. Here’s where RAG (Retrieval-Augmented Generation) comes in! We retrieve all the information about tooth decay and the sugar in those snacks from the video transcripts, augment my question with this information to build a great prompt, and let the LLM generate a great answer. Well, not that great: I am going to lose my teeth early.

Plan:

Get the data:

I believe in humble beginnings, so we will start by fetching the transcript of a single video: a podcast episode named “Controlling Sugar Cravings & Metabolism with Science-Based Tools”. For this, we can use the ‘youtube_transcript_api’ library.
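Here is a minimal sketch of that step, not the original code. The helper names `fetch_segments` and `join_segments` are made up for illustration, and the library's interface has changed between releases, so the sketch checks which style of call is available (an assumption about the installed version).

```python
def fetch_segments(video_id):
    """Return transcript segments as a list of dicts with a "text" key."""
    # Imported lazily so join_segments() works without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    if hasattr(YouTubeTranscriptApi, "fetch"):
        # Newer releases: instance method returning a FetchedTranscript.
        return YouTubeTranscriptApi().fetch(video_id).to_raw_data()
    # Older releases: static method returning a list of dicts.
    return YouTubeTranscriptApi.get_transcript(video_id)


def join_segments(segments):
    """Join caption segments into one string, collapsing stray whitespace."""
    cleaned = (" ".join(seg["text"].split()) for seg in segments)
    return " ".join(text for text in cleaned if text)
```

Calling `join_segments(fetch_segments(video_id))` with the ID taken from the video's URL should give one long string of transcript text.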

Data Preprocessing:

Once we have the transcript text, we clean the strings and split them into sentences using the ‘spacy’ library. By the end of this step, we have well-formed sentences.
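A rough sketch of the cleaning and sentence splitting could look like this. The cleaning rule (dropping bracketed cues such as [Music]) and the choice of spaCy's rule-based `sentencizer` are my assumptions, and `clean_text` and `split_sentences` are hypothetical names.

```python
import re


def clean_text(text):
    """Drop bracketed caption cues such as [Music] and collapse whitespace."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split cleaned text into sentences with spaCy's rule-based sentencizer."""
    # Imported lazily; a blank English pipeline needs no model download.
    import spacy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    return [sent.text.strip() for sent in nlp(text).sents if sent.text.strip()]
```

The sentencizer relies on punctuation, so transcripts without punctuation would need a different splitting strategy.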

Chunking:

Now we group these sentences into small paragraphs (chunks), which we will give to the LLM as a knowledge base to answer our questions. Here we will create chunks of 15 sentences.
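The grouping itself is just list slicing. A small sketch, with `chunk_sentences` as an illustrative name:

```python
def chunk_sentences(sentences, chunk_size=15):
    """Group consecutive sentences into chunks of at most chunk_size sentences."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]
```

The last chunk may be shorter than 15 sentences when the total is not a multiple of 15.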

Embeddings:

Now we have chunks of sentences, and we will also have our question. Our next goal is to find the chunks related to that question, and for that we compute an embedding for each chunk.

These embeddings are numeric (vector) representations of the chunks, which we compare against the embedding of our question to calculate similarity.

Embedding these chunks takes a lot of time, so we calculate the embeddings once and store them for later use. At this step, we would normally use a vector database to store the embeddings. Vector databases help calculate similarity efficiently, and we will dive into them later. For now, we have a small knowledge base, so we can get away with saving the embeddings in a CSV file and searching linearly over every vector, rather than looking only at a specific set of vector clusters as a vector database would.

So here we just store them in a CSV file.
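A sketch of embedding and saving, under loud assumptions: the post does not name the embedding model, so the `sentence-transformers` library and the `all-mpnet-base-v2` model are my picks for illustration, and storing each vector as a JSON list in an `embedding` column is just one workable layout.

```python
import csv
import json


def embed_chunks(chunks, model_name="all-mpnet-base-v2"):
    """Embed text chunks; the library and model name are assumptions."""
    # Imported lazily because loading the model is slow and optional here.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(chunks).tolist()


def save_embeddings(path, chunks, embeddings):
    """Write one row per chunk, with the vector stored as a JSON list."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["chunk", "embedding"])
        for chunk, vector in zip(chunks, embeddings):
            writer.writerow([chunk, json.dumps([float(x) for x in vector])])
```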

We can later load these into a data frame using:
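A minimal sketch of the loading step, assuming the CSV has an `embedding` column where each cell holds a JSON list of numbers (that layout and the function names are assumptions, not the original code):

```python
import json


def parse_embedding(cell):
    """Turn a stored JSON list such as "[0.1, 0.2]" back into floats."""
    return [float(x) for x in json.loads(cell)]


def load_embeddings(path):
    """Read the CSV into a data frame and stack the vectors into a tensor."""
    # Imported lazily so parse_embedding() works without pandas or torch.
    import pandas as pd
    import torch

    df = pd.read_csv(path)
    df["embedding"] = df["embedding"].apply(parse_embedding)
    embeddings = torch.tensor(df["embedding"].tolist(), dtype=torch.float32)
    return df, embeddings
```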

Here we first read the CSV file. Then we store the embedding vectors in a PyTorch tensor, which is similar to a NumPy array. As we are only using the CPU, it won’t make much difference, but it will if we use a GPU.

Calculate Vector Similarity:

In the next part, we will retrieve the top 5 text chunks most related to our question. First, we create a vector embedding of our question.

To calculate the similarity, we take the dot product of the question vector with every chunk vector. When the vectors are normalized to unit length, this is the same as the cosine similarity. Then we select the 5 vectors with the highest scores.
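To make the scoring concrete, here is a dependency-free sketch that normalizes the vectors, takes dot products, and returns the best scores with their positions. `top_k_cosine` is an illustrative name; with PyTorch, the same idea is usually a matrix product followed by `torch.topk`, which also returns a values tensor and an indices tensor.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def top_k_cosine(query, vectors, k=5):
    """Return (values, indices) of the k vectors most similar to the query."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    indices = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    values = [scores[i] for i in indices]
    return values, indices
```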

Here the indices tensor holds the positions of the top 5 knowledge base chunks, and the values tensor holds their similarity scores. We will use these top 5 chunks as the knowledge base for the LLM to answer our question.

Prompt Engineering:

To generate the required information effectively, we have to create a great prompt. We have to give the LLM a persona, context, task, and knowledge base to answer the question. Here’s one for our bot:
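The original prompt is not reproduced here, so the wording below is an illustrative stand-in that shows the four ingredients (persona, context, task, knowledge base). `build_prompt` is a hypothetical name.

```python
def build_prompt(question, chunks):
    """Assemble persona, context, task, and knowledge base into one prompt."""
    knowledge = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        # Persona
        "You are a health podcast assistant who explains science clearly.\n"
        # Context
        "The excerpts below come from podcast transcripts.\n"
        # Task
        "Answer the question using only these excerpts. "
        "If they do not contain the answer, say so.\n\n"
        # Knowledge base
        f"Excerpts:\n{knowledge}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```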

The output of this function is the prompt we pass to the LLM. Here I am going to use the Gemini-Pro LLM.

Now, while bulking on the above foods at midnight, I am going to create a FastAPI backend server and a chat UI in Streamlit.
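A hedged sketch of the LLM call, assuming the `google-generativeai` package and an API key in a `GOOGLE_API_KEY` environment variable (both assumptions). The model ID `gemini-pro` may not be available anymore, so treat it as a placeholder. The `generate` parameter lets you swap in any other function that maps a prompt to text.

```python
import os


def gemini_generate(prompt, model_name="gemini-pro"):
    """Send the prompt to Gemini; package, env var, and model ID are assumptions."""
    # Imported lazily so ask() can be used with another backend.
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)
    return model.generate_content(prompt).text


def ask(prompt, generate=gemini_generate):
    """Return the model's reply with surrounding whitespace removed."""
    return generate(prompt).strip()
```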

Ingesting data into a Vector Database

Now that we have this thing working, let's do it for all his 200+ videos. But we aren’t going to use a CSV for this. We professional, we use a VECTOR DATABASE. The one we are going to use is ChromaDB.

So here's what I stole. First, we create a ChromaDB client.

If we just used Client(), everything would be stored in RAM for that session, so once you exit the session you lose the DB. Losing those expensive embeddings in this economy would be disastrous. So we use PersistentClient(), which stores these embeddings on disk in an SQLite database.

Then we create a collection. Here we can add document chunks with metadata such as the source of each text chunk. Then we can query the collection for the top n results.

In our case, the metadata source will be the YouTube video ID.
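Put together, the ChromaDB part might look like the sketch below. The storage path, the collection name, the ID scheme, and the `source` metadata key are illustrative assumptions. Since no embeddings are passed in, Chroma would embed the documents with its default embedding function.

```python
def build_records(video_id, chunks):
    """Create unique ids and source metadata for one video's chunks."""
    ids = [f"{video_id}-{i}" for i in range(len(chunks))]
    metadatas = [{"source": video_id} for _ in chunks]
    return ids, list(chunks), metadatas


def ingest(video_id, chunks, db_path="chroma_db", collection_name="transcripts"):
    """Add one video's chunks to a persistent Chroma collection."""
    # Imported lazily so build_records() works without chromadb installed.
    import chromadb

    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(name=collection_name)
    ids, documents, metadatas = build_records(video_id, chunks)
    collection.add(ids=ids, documents=documents, metadatas=metadatas)
    return collection
```

Querying is then a call like `collection.query(query_texts=[question], n_results=5)`.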

Ingesting YouTube transcripts:

Now we need the links to all those 200+ videos. Manually? Naah, we use the YouTube Data API.

We use the /playlistItems endpoint to get all the uploads of Andrew’s YouTube channel. This endpoint is paginated.

On our first call, the API returns data for up to 50 videos along with a next-page token, which we send in the following call to get the next set of videos, and so on until no token is returned. Here’s the diagrammatical explanation:
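In code, the pagination loop can be sketched like this, using only the standard library. `fetch_page` and `collect_video_ids` are hypothetical names, and the playlist ID and API key are values you would supply yourself.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/playlistItems"


def fetch_page(playlist_id, api_key, page_token=None):
    """Request one page (up to 50 items) of a playlist."""
    params = {
        "part": "contentDetails",
        "playlistId": playlist_id,
        "maxResults": 50,
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def collect_video_ids(get_page):
    """Follow nextPageToken until the API stops returning one."""
    video_ids, token = [], None
    while True:
        page = get_page(token)
        video_ids.extend(
            item["contentDetails"]["videoId"] for item in page.get("items", [])
        )
        token = page.get("nextPageToken")
        if not token:
            return video_ids
```

A call such as `collect_video_ids(lambda token: fetch_page(playlist_id, api_key, token))` would then walk through every page of the uploads playlist.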

So, let’s combine all our knowledge now to do the same thing for all 226 videos, with 40 minutes to kill.

So now we will be able to query all of the videos.

Results:

As we added the source of each text chunk while ingesting data into the vector DB, we can cite the source videos with the highest similarity scores.
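Turning those sources into links is a small step. This sketch assumes the result dictionary returned by a ChromaDB query for a single question, with the video ID stored under a `source` metadata key (the key name and `source_links` are assumptions).

```python
def source_links(result):
    """Build de-duplicated YouTube links from a single-question query result."""
    # Chroma nests results per query; index 0 is our only question.
    metadatas = result["metadatas"][0]
    seen, links = set(), []
    for meta in metadatas:
        video_id = meta["source"]
        if video_id not in seen:
            seen.add(video_id)
            links.append(f"https://www.youtube.com/watch?v={video_id}")
    return links
```

Chroma returns the closest matches first, so the links keep the order of the most similar chunks.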

References:

Daniel Bourke: https://www.youtube.com/watch?v=qN_2fnOPY-M
His code: https://github.com/mrdbourke/simple-local-rag
Streamlit UI: https://www.youtube.com/watch?v=QzFMqQCCicI&t=6s