Vector search has become a widely adopted approach to search in recent years. By using ReactiveSearch pipelines, we can add stages that re-rank results using kNN with just a few lines of code.
Before we start, why this?
This pipeline makes it possible to index vector data without asking the user to supply it. Imagine a case where the index data consists of various fields like Name, Age, etc. Our requirement is that when an indexing request comes in, we convert the Name field to a vector and store it in the index request as vector_data.
Why is a vector even necessary? A vector representation lets us build pipelines like the kNN search one, where search requests use the stored vector data to find results.
Index Requirements
This how-to guide uses OpenSearch for the demo. In order for the data to be stored in the index, the index needs to know that the vector_data field will be of the vector type knn_vector. Not just that, the dimension of the vector field also needs to be specified.
The dimension can differ between vector fields; it depends on the utility that converts the string (or any other type of) data to a vector. In this example, we will use OpenAI's embeddings, which have 1536 dimensions, so we need to set the dimension of the vector field accordingly.
It can be set by sending the following request to OpenSearch when creating the index:
PUT /{index_name}
with the following body
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "vector_data": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib"
        }
      }
    }
  }
}
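Putting the route and body together, the index can be created with a single request. Here is a sketch, assuming the index is named amazon_reviews and CLUSTER_ID stands for the cluster URL, as elsewhere in this guide:

curl -X PUT 'CLUSTER_ID/amazon_reviews' -H "Content-Type: application/json" -d '{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "vector_data": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib"
        }
      }
    }
  }
}'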
NOTE that the opensearch-knn plugin will have to be installed in the OpenSearch cluster. This plugin is installed by default in all complete OpenSearch installations but is not part of the minimal OpenSearch package. Read more about the plugin here
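To check whether the plugin is present, you can list the plugins installed on the cluster; the output should include an opensearch-knn entry for each node:

curl -X GET 'CLUSTER_ID/_cat/plugins?v'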
Assumptions
There are various algorithms that can be run on top of the data to get a vector representation of it. In this case, for the sake of example, we will be using OpenAI's embedding algorithm to find the vector representation of the data. It is important that we use the same algorithm while indexing the data as well as while searching it, otherwise the results will not be correct.
This means that, while indexing, we will have to run the fields that we want to store as vectors through this algorithm. We will also need to run the search query through this algorithm to get the vector representation of the query.
Data Set
In order to show this pipeline working in action, we are going to use the Amazon Review Dataset. This dataset contains reviews of dog food products on Amazon. Out of all the fields present in the dataset, we will use the Summary and Text fields to index as vector data. What this means is that our stored array will be a vector representation of the Summary and Text field strings joined using a comma (,).
NOTE that the comma will not change the meaning of the embeddings since it's a special character and will not be converted to vector data.
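For reference, an indexing request body for a document from this dataset might look like the following (the field values are illustrative):

{
  "Summary": "Great dog food",
  "Text": "My dog loves this food and finishes every bowl."
}

The string passed to the embeddings utility would then be Great dog food,My dog loves this food and finishes every bowl.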
Using OpenAI Embeddings
The OpenAI API requires an API key in order to access it. This API key can be generated by signing up at https://platform.openai.com/signup. Once signed up, click on Personal in the top right corner and click View API keys.
This API key will have to be passed to the pipeline so that it can use the API to get the data embeddings.
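The pre-built stage we use later takes care of calling OpenAI, but for context, a raw request to OpenAI's embeddings endpoint looks like the following sketch. It assumes the text-embedding-ada-002 model, which returns 1536-dimensional vectors matching the mapping above:

curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "Great dog food,My dog loves this food and finishes every bowl.", "model": "text-embedding-ada-002"}'

The embedding is returned in the data[0].embedding field of the JSON response.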
Pre Setups
Now that we know how we are going to implement the kNN index, let's start with the basic setup. We will override the _doc endpoint for the index amazon_reviews.
The _doc endpoint is the endpoint that ElasticSearch/OpenSearch accepts indexing requests at.
The file will be defined in the following way:
enabled: true
description: Index pipeline to store vectorized data
routes:
  - path: /amazon_reviews/_doc
    method: POST
    classify:
      category: elasticsearch
      acl: index
envs:
  openAIApiKey: <your-api-key>
  method: POST
Environment Variables
We are passing the OpenAI API key through envs so that it can be used in any stage where necessary. This is the openAIApiKey variable.
Stages
Now that we have the basic pipeline defined, let's get started with the stages. We will use a few pre-built stages in this pipeline.
Pre-built stages are provided by ReactiveSearch to utilize functions from the ReactiveSearch API, like hitting ElasticSearch or translating an RS query to an ES query.
We will have the following stages defined:
- authorization
- fetch embeddings
- index data
Authorization
This is one of the most important steps in the pipeline. Using this stage we will make sure the user is passing proper credentials to hit the endpoint they are trying to access.
This is a pre-built stage provided by ReactiveSearch and can be leveraged in the following way:
- id: "authorize user"
use: "authorization"
Fetch Embeddings
Now that we have authorized the user that's making the request, we can fetch the embeddings for the request body passed and update the body with the embeddings. This can be done simply by using the pre-built stage openAIEmbeddingsIndex.
- id: fetch embeddings
  use: openAIEmbeddingsIndex
  inputs:
    apiKey: "{{openAIApiKey}}"
    inputKeys:
      - Summary
      - Text
    outputKey: vector_data
  continueOnError: false
This is a stage provided by ReactiveSearch for OpenAI-specific usage. It's very easy to use and takes care of reading from the request body, getting the embeddings using the OpenAI API, and updating the request body accordingly.
Read more about this stage here
In the above stage, we are passing the apiKey input by reading it dynamically from the envs that are defined at the top of the pipeline.
Besides that, there are two more inputs specified.
inputKeys indicates which keys from the request body the embeddings should be fetched for. In our example, as stated above, we use the Summary and Text keys, and thus the inputKeys array contains those two. These two keys will be extracted and joined using a comma (,) and then passed to the OpenAI API in order to get the vector embedding for them.
outputKey indicates the key where the output will be written. In simple words, this is the key that will be injected into the request body with the vector data that was fetched from OpenAI. In this example, it is set to vector_data since in the mappings we have defined the vector field as vector_data. This can be found in the Index Requirements section of this how-to doc.
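To make this concrete, here is roughly how the stage transforms an indexing request body. The embedding values below are illustrative and truncated; a real embedding has 1536 dimensions.

Before the stage runs:

{"Summary": "dog food", "Text": "good food for my dog"}

After the stage runs:

{"Summary": "dog food", "Text": "good food for my dog", "vector_data": [0.0023, -0.0091, 0.0154, ...]}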
Index Data
Now that we have the vector data ready and merged into the request body, we can send the index request to OpenSearch. This can be done by using the pre-built stage elasticsearchQuery. The needs field below ensures this stage runs only after fetch embeddings has completed, so the body already contains vector_data.
- id: index data
  use: elasticsearchQuery
  needs:
    - fetch embeddings
Complete Pipeline
The complete pipeline is defined as follows:
enabled: true
description: Index pipeline to store vectorized data
routes:
  - path: /amazon_reviews/_doc
    method: POST
    classify:
      category: elasticsearch
      acl: index
envs:
  openAIApiKey: <your-api-key>
  method: POST
stages:
  - id: authorize user
    use: authorization
  - id: fetch embeddings
    use: openAIEmbeddingsIndex
    inputs:
      apiKey: "{{openAIApiKey}}"
      inputKeys:
        - Summary
        - Text
      outputKey: vector_data
    continueOnError: false
  - id: index data
    use: elasticsearchQuery
    needs:
      - fetch embeddings
Create the pipeline
Now that we have the whole pipeline defined, we can create the pipeline by hitting the ReactiveSearch instance.
The URL we will hit is /_pipeline with a POST request.
The above endpoint expects a multipart/form-data body with the pipeline key containing the pipeline file. All the scriptRef files can be passed as separate keys in the form data and will be parsed by the API automatically. Read more about this endpoint here
We can create the pipeline with the following request:
The below request assumes all the files mentioned in this guide are present in the current directory.
curl -X POST 'CLUSTER_ID/_pipeline' -H "Content-Type: multipart/form-data" --form "pipeline=@pipeline.yaml"
Testing the Pipeline
We can now hit the index endpoint for amazon_reviews and see if the data is getting converted to a vector.
curl -X POST CLUSTER_ID/amazon_reviews/_doc -H "Content-Type: application/json" -d '{"Summary": "dog food", "Text": "good food for my dog"}'
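If everything is wired up correctly, the stored document will contain the generated embedding. A quick way to verify is to search the index and confirm the vector_data field is present in the returned hit:

curl -X GET 'CLUSTER_ID/amazon_reviews/_search' -H "Content-Type: application/json" -d '{"query": {"match_all": {}}}'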