Natural language document retrieval with Python and ElasticSearch

Johan Jublanc
5 min read · Jan 13, 2024


This article shows how to use Elasticsearch and its Python SDK to build a vector index and retrieve documents with natural-language queries.

In the example, we will attempt to search for movies using a natural language approach. We will generate vectors for movie descriptions and store them within an Elasticsearch index.

We can then embed a query (a short text from the user) and use a nearest-neighbor search algorithm (here, HNSW) to find the closest matches in our vector space.

You can find the illustrative project here: https://github.com/JJublanc/movie_vector_search

Set up your Elastic Cloud account

Create an account on Elastic Cloud

We’ll utilize Elastic Cloud to deploy our cluster and manage it, monitor our indices, check errors and more.

Adjust your deployment to be able to use ML in your index

The “deployment” refers to the process of setting up and configuring an Elasticsearch cluster on a server or cloud infrastructure. It involves defining the hardware resources, node configurations, and other settings needed to run Elasticsearch effectively. A deployment ensures that Elasticsearch is properly installed, configured, and ready to handle data indexing, searching, and other operations.

To enable vectorization with language models, you must add a machine learning instance to your deployment.

Get your credentials

From your deployment, access the Elastic Cloud console to retrieve the user and password credentials needed to connect to your cluster.

You also have to collect your cloud_id, then add this information to a .env file.

ELASTIC_USER="elastic"
ELASTIC_PASSWORD=<your password>
ELASTIC_CLOUD_ID=<your cloud id>

Create your first index

First of all, get some data

I utilized the readily available IMDb dataset, which can be easily accessed on Kaggle through the following link: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?resource=download

You can add your archive to a folder called /data in your project.

Set your index

You will require a mapping to define your index schema. In this case, the schema is straightforward, consisting of just three essential fields: id, title, and overview, all of which are text fields.

You can check that it has been correctly created in your Elasticsearch console.

Then populate your index

You can easily verify its success on the Elasticsearch console as well. Simply select your index and switch to the ‘Documents’ tab.

Prepare to add a vector field to your new index

Import your model

You can choose from a list of models available on Hugging Face that are compatible with Elasticsearch pipelines. Once you have chosen one, import the model before using it. In our case it is “sentence-transformers/all-minilm-l12-v2”.

Be careful when you call the model in your pipeline: its name is slightly different, “sentence-transformers__all-minilm-l12-v2” (the slash becomes a double underscore and the letters are lowercased). This subtlety can be tricky and potentially confusing, especially for first-time users.

Here is an example of code doing it.
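A sketch of the import, based on eland's PyTorch model API; exact keyword names may vary between eland versions, so treat this as illustrative. The `es_model_name` helper is ours and just mirrors eland's renaming rule mentioned above:

```python
import tempfile


def es_model_name(hub_model_id: str) -> str:
    """Mirror eland's renaming: '/' becomes '__' and letters are lowercased."""
    return hub_model_id.replace("/", "__").lower()


def import_embedding_model(
    es, hub_model_id: str = "sentence-transformers/all-MiniLM-L12-v2"
) -> str:
    # pip install 'eland[pytorch]' -- imports kept local because they pull in torch
    from eland.ml.pytorch import PyTorchModel
    from eland.ml.pytorch.transformers import TransformerModel

    tm = TransformerModel(model_id=hub_model_id, task_type="text_embedding")
    with tempfile.TemporaryDirectory() as tmp_dir:
        model_path, config, vocab_path = tm.save(tmp_dir)
        ptm = PyTorchModel(es, tm.elasticsearch_model_id())
        ptm.import_model(
            model_path=model_path, config_path=None, vocab_path=vocab_path, config=config
        )
        ptm.start()  # deploy on the ML node so an ingest pipeline can call it
    return tm.elasticsearch_model_id()
```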

To achieve this, we leveraged the eland package, which empowers us with machine learning functionality within Elasticsearch.

Create your pipeline and create your new index with embeddings

With your model in place, you can now create a pipeline for it to generate embeddings when populating a new index.

Create your new index

With the pipeline in place, you can proceed to create a new index. To achieve this, we’ll utilize the existing one and incorporate a vectorization step facilitated by the previously defined pipeline.

It’s worth noting that this process not only creates the new index, but can also write any documents that fail during ingestion to a dedicated failure index. This lets you pinpoint issues, understand what went wrong, and determine the necessary steps for resolution.

For example, documents that fail at the inference step will appear there along with their error messages.

Finally query your vector Index

Elasticsearch uses the HNSW (Hierarchical Navigable Small World) algorithm, a state-of-the-art choice for approximate nearest-neighbor search. Within this framework, you can configure the algorithm's parameters and run queries against your index. Be sure to specify the field you wish to retrieve, which, in our case, is the ‘title’ field.

For instance, we search for movies with descriptions that closely resemble that of ‘To Mars by A-Bomb: The Secret History of Project Orion.’ So we send its description to the search engine and get the closest neighbors.

We retrieve several space-related movies, including the movie whose description we used as the query. This result is quite reassuring regarding the performance of our search.

That’s it!

Do not hesitate to contact me if you want to know more about this project. You can find the code here: https://github.com/JJublanc/movie_vector_search
