Introduction
Elasticsearch is a powerful, scalable search engine that lets you store, search, and analyze large volumes of data in near real time. One efficient way to load data into Elasticsearch is through its Bulk API.
In this approach, a Docker image reads data from files and sends it to Elasticsearch in bulk, saving time and resources compared with indexing each record one by one.
This Docker image can therefore load large amounts of data into Elasticsearch quickly and efficiently.
Prerequisites
To run this solution in a containerized environment and connect to an already running Elasticsearch instance to import the data, the following tools are essential:
- Docker
- Docker Compose
- A running Elasticsearch instance (reachable on a Docker network)
Setup
Follow the steps below to set up this solution.
1. Clone this repository to your local machine
git clone https://github.com/raowaqasakram/elasticsearch-bulk-loader.git
2. Update the configuration files
- Navigate to the configs folder.
- Open the index_mappings.json file and update it with your index mappings (a sample is sketched below).
- Open the index_settings.json file and update it with your index settings.
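For reference, a minimal index_mappings.json could look like the following. The field names here (user, text, timestamp) are hypothetical placeholders for illustration, not taken from the repository:

{
  "properties": {
    "user": { "type": "keyword" },
    "text": { "type": "text" },
    "timestamp": { "type": "date" }
  }
}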
3. Place JSON files
- Navigate to the jsonData folder.
- Put all the JSON files you want to load into Elasticsearch in the jsonData folder.
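Each file in jsonData is assumed to hold a single JSON document. A hypothetical example, with purely illustrative fields:

{
  "user": "elonmusk",
  "text": "Sample tweet text",
  "timestamp": "2023-01-15T10:30:00Z"
}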
4. Docker Compose configuration
Open the docker-compose.yml file. In the environment section, update the following line with the Elasticsearch index name where you want to load the data. For example:

- ES_INDEX_NAME=elon_data_index

Make sure to link this application with the network where Elasticsearch is already running:

networks:
  bulk_data_network:
    # Replace 'elasticsearch_existing_network' with your actual network name of Elasticsearch
    name: elasticsearch_existing_network
    external: true
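Putting the pieces together, a minimal docker-compose.yml for this setup might look roughly like the sketch below. The service name, image name, container paths, and the ES_HOST variable are assumptions for illustration; check the repository's actual file:

version: "3.8"
services:
  bulk-loader:
    image: your-dockerhub-user/elasticsearch-bulk-loader  # hypothetical image name
    environment:
      - ES_INDEX_NAME=elon_data_index
      - ES_HOST=http://elasticsearch:9200  # assumed variable; host is the ES container's name
    volumes:
      - ./configs:/app/configs    # assumed container paths
      - ./jsonData:/app/jsonData
    networks:
      - bulk_data_network
networks:
  bulk_data_network:
    # Replace 'elasticsearch_existing_network' with your actual network name of Elasticsearch
    name: elasticsearch_existing_network
    external: true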
Load Data to ES Index
Once the configuration above is complete, start the container with the following command.
sudo docker-compose up
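Under the hood, the container runs a Python script that streams the files through the Bulk API. The following is a minimal sketch of that idea, assuming the official elasticsearch Python client (8.x) and the hypothetical ES_HOST variable from above; the actual script in the repository may differ:

import json
import os

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

ES_HOST = os.environ.get("ES_HOST", "http://localhost:9200")  # assumed env var
INDEX_NAME = os.environ.get("ES_INDEX_NAME", "elon_data_index")
DATA_DIR = "jsonData"

es = Elasticsearch(ES_HOST)

# Create the index with the configured settings/mappings only if it is missing,
# so existing data in an already-created index is left untouched.
if not es.indices.exists(index=INDEX_NAME):
    with open("configs/index_settings.json") as f:
        settings = json.load(f)
    with open("configs/index_mappings.json") as f:
        mappings = json.load(f)
    es.indices.create(index=INDEX_NAME, settings=settings, mappings=mappings)

def generate_actions():
    # Yield one bulk action per JSON file in the data folder.
    for filename in sorted(os.listdir(DATA_DIR)):
        if filename.endswith(".json"):
            with open(os.path.join(DATA_DIR, filename)) as f:
                yield {"_index": INDEX_NAME, "_source": json.load(f)}

# helpers.bulk batches the actions into chunked Bulk API requests.
success, _ = bulk(es, generate_actions())
print(f"Indexed {success} documents into {INDEX_NAME}")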
Console Output
After running this command, the console displays the progress and results of the bulk load.
On my system (Apple M1, 8GB RAM), loading 100,000 JSON files into an Elasticsearch index took approximately 6 minutes and 26 seconds.
Important Point
One crucial point to highlight: if the index does not already exist in Elasticsearch, the process creates it and then loads the data accordingly.
If the index already exists, the process loads the provided JSON files from the jsonData folder into the index specified in the docker-compose.yml file, so your existing data in the index remains intact and unaffected.
Verifications
Index Mappings verification
Inspecting the index confirms that the mappings specified in the index_mappings.json file were applied when the Elasticsearch index was created.
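One way to check this yourself is Elasticsearch's mapping endpoint, assuming Elasticsearch is reachable on localhost:9200:

curl -X GET "http://localhost:9200/elon_data_index/_mapping?pretty"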
Bulk data load verification
Querying the index shows that elon_data_index contains over 100,000 successfully loaded documents.
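The document count can be verified with the count endpoint:

curl -X GET "http://localhost:9200/elon_data_index/_count?pretty"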
Conclusion
In this article, we discussed how to load bulk JSON data into Elasticsearch using the Bulk API and Docker. This approach lets you load large amounts of data quickly and efficiently.
We also covered how to update the configuration files and how to run the Docker image with Docker Compose. This solution can be used to automate loading data into Elasticsearch, saving time and resources.
References
- GitHub: the repository with a Python script to load the data using the Elasticsearch Bulk API (https://github.com/raowaqasakram/elasticsearch-bulk-loader).
- Docker Hub: the Docker image published for this solution.
- Elasticsearch Bulk API: official documentation.