Elasticsearch basic concepts and terminology

Data in Elasticsearch is organized into indices. Each index is made up of one or more shards.

A shard is a low-level worker unit that holds just a slice of all the data in the index. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch.

As data is written to a shard, it is periodically published into new immutable Lucene segments on disk, and it is at this time it becomes available for querying. This is referred to as a refresh.

As the number of segments grows, these are periodically consolidated into larger segments. This process is referred to as merging.

Sharding is a method of splitting and storing a single logical dataset in multiple databases. By distributing the data among multiple machines, a cluster of database systems can store larger dataset and handle additional requests.

Sharding is also referred to as horizontal partitioning. The distinction of horizontal vs vertical comes from the traditional tabular view of a database. A database can be split vertically — storing different tables & columns in a separate database, or horizontally — storing rows of the same table in multiple database nodes.

Shards are how Elasticsearch distributes data around your cluster. Think of shards as containers for data. Documents are stored in shards, and shards are allocated to nodes in your cluster. As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced.

A node is a running instance of Elasticsearch, while a cluster consists of one or more nodes with the same cluster.name that are working together to share their data and workload. When we start an instance of Elasticsearch, we are starting a node and we have a cluster with single node. As users, we can talk to any node in the cluster, including the master node. Every node knows where each document lives and can forward our request directly to the nodes that hold the data we are interested in.

Communication: Every node in the cluster knows about the other nodes within the cluster, they talk to others directly using the native Elasticsearch language over TCP. Besides that, the nodes are able to talk to the external world using JSON “language” over HTTP. Elasticsearch is a peer-to-peer based system, nodes communicate with others directly.

Each node in the cluster plays one or more roles; it can be a master node, datanode, client node, or tribe node.

The master node is responsible for creating, deleting indices, adding the nodes or remove the nodes from the cluster. Each time a cluster state is changed, the master node broadcasts the changes to other nodes in the cluster. There is only one master node in the cluster at a time.

The data node is responsible for holding the data in the shards and performing data related operations such as create, read, update, delete, search, and aggregations. We can have many data nodes in the cluster. If one of the data nodes stops, the cluster still operates and re-organizes the data on other nodes.

The client node is responsible for routing the cluster-related requests to the master node and the data-related requests to the data nodes, it acts as a “smart router”. The client node does not hold any data, it also cannot become the master node.

The tribe node is special type of client node that is able to talk to multiple clusters to perform search and other operations.

The ingest node is responsible for pre-processing documents before the actual indexing takes into account.

Previous Post

Public Key Cryptography Basics