
RAG | Indexing Phase

Berika Varol Malkoçoğlu
4 min read · Nov 22, 2024

Indexing is another step of the RAG pipeline, the one that makes it possible to use LLMs on specific data. The referenced documents are split into chunks, and each chunk is represented as a vector in a vector space. This step helps us match an incoming query with the most relevant information in the documents.
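To make the chunk, embed and match flow concrete, here is a minimal sketch in plain Python. It is only an illustration under loud assumptions: the `embed` function is a toy word-count stand-in for a real embedding model, and the names `chunk_text`, `build_index` and `retrieve`, the sample document and the chunk size are all hypothetical and not taken from any library.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=8):
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(document, chunk_size=8):
    """Indexing: store each chunk together with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(document, chunk_size)]


def retrieve(index, query):
    """Return the chunk whose vector is most similar to the query vector."""
    query_vector = embed(query)
    return max(index, key=lambda item: cosine(query_vector, item[1]))[0]


# Hypothetical sample document, chosen so each sentence becomes one chunk.
document = (
    "Paris is the capital city of France. "
    "The Nile is a river in Africa. "
    "Python is a popular programming language."
)
index = build_index(document, chunk_size=7)
best_chunk = retrieve(index, "Which river is in Africa?")
print(best_chunk)
```

In a real system the word-count vectors would be replaced by the output of a trained embedding model, and the list would typically be replaced by a vector store, but the shape of the flow stays the same.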

Image created by GPT-4o

There are 3 basic steps we need to know for indexing.

1. Vector Embedding

Vector embedding is the process of transforming high-dimensional data, such as text, images or other complex data, into lower-dimensional, meaningful numerical vectors. Because similar inputs end up as nearby vectors, computers can compare and manipulate the data with ordinary numerical operations.

  • By representing each word (e.g. “king”) as a vector, a Word2Vec model can capture the similarity between related words such as “queen” and “king”.
  • For example, the word “king” could be represented by a vector like this: [0.5, -0.4, 0.7, 0.8, 0.9, -0.7, -0.6].
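One common way to measure how similar two word vectors are is cosine similarity. The sketch below uses the “king” vector from the example above; the “queen” and “apple” vectors are invented for illustration only and do not come from a trained Word2Vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# "king" is the example vector from the text above.
king = [0.5, -0.4, 0.7, 0.8, 0.9, -0.7, -0.6]
# Hypothetical vectors, made up for this sketch (not real model output).
queen = [0.6, -0.3, 0.6, 0.7, 0.9, -0.6, -0.7]
apple = [-0.8, 0.5, 0.1, -0.6, -0.2, 0.9, 0.3]

sim_king_queen = cosine_similarity(king, queen)
sim_king_apple = cosine_similarity(king, apple)

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these made-up numbers, “king” and “queen” score close to 1 while “king” and “apple” score below 0, which is the kind of pattern a trained model is expected to produce for related and unrelated words.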

Why Do We Use Vector Embedding?

  • Dimensionality reduction: High-dimensional data increases computational costs and can degrade the performance of algorithms. This processing step solves the…


Written by Berika Varol Malkoçoğlu

PhD | Data Scientist | Lecturer | AI Researcher
