Elasticsearch

Elasticsearch is a highly scalable open-source full-text search and analytics engine. Elasticsearch is distributed by nature: it knows how to manage multiple nodes to provide scale and high availability.

Installation:

Download the latest version of Elasticsearch (https://www.elastic.co/downloads) or use the curl command below to download it:

curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.4/elasticsearch-2.3.4.tar.gz

Then extract it as follows (Windows users should unzip the zip package):

tar -xvf elasticsearch-2.3.4.tar.gz

It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:

cd elasticsearch-2.3.4/bin

And now we are ready to start our node and single cluster (Windows users should run the elasticsearch.bat file):

./elasticsearch

If everything goes well, you should see a bunch of messages that look like the following:

./elasticsearch
[2014-03-13 13:42:17,218][INFO ][node           ] [New Goblin] version[2.3.4], pid[2085], build[5c03844/2014-02-25T15:52:53Z]
[2014-03-13 13:42:17,219][INFO ][node           ] [New Goblin] initializing ...
[2014-03-13 13:42:17,223][INFO ][plugins        ] [New Goblin] loaded [], sites []
[2014-03-13 13:42:19,831][INFO ][node           ] [New Goblin] initialized
[2014-03-13 13:42:19,832][INFO ][node           ] [New Goblin] starting ...
[2014-03-13 13:42:19,958][INFO ][transport      ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.8.112:9300]}
[2014-03-13 13:42:23,030][INFO ][cluster.service] [New Goblin] new_master [New Goblin][rWMtGj3dQouz2r6ZFL9v4g][mwubuntu1][inet[/192.168.8.112:9300]], reason: zen-disco-join (elected_as_master)
[2014-03-13 13:42:23,100][INFO ][discovery      ] [New Goblin] elasticsearch/rWMtGj3dQouz2r6ZFL9v4g
[2014-03-13 13:42:23,125][INFO ][http           ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.8.112:9200]}
[2014-03-13 13:42:23,629][INFO ][gateway        ] [New Goblin] recovered [1] indices into cluster_state
[2014-03-13 13:42:23,630][INFO ][node           ] [New Goblin] started

Life Inside a Cluster

If we start a single node, with no data and no indices, our cluster looks like this:

(Figure: an empty cluster with no data and no indices)

A node is a running instance of Elasticsearch, while a cluster consists of one or more nodes with the same cluster.name that are working together to share their data and workload. As nodes are added or removed from the cluster, the cluster reorganizes itself to spread the data evenly.

Master Node (manages cluster-wide changes):

  •      Creating or deleting an index
  •      Adding or removing a node from the cluster.

As users, we can talk to any node in the cluster, including the master node. Every node knows where each document lives and can forward our request directly to the nodes that hold the data we are interested in. Whichever node we talk to manages the process of gathering the response from the node or nodes holding the data and returning the final response to the client.

Storing Data
  • An index is just a “logical namespace” which points to one or more physical shards.
  • A shard is a low-level “worker unit” that holds just a slice of all the data in the index; it is a single instance of Lucene and a complete search engine in its own right. Our documents are stored and indexed in shards, but our applications don’t talk to them directly. Instead, they talk to an index.
  • Shards are how Elasticsearch distributes data around your cluster. Think of shards as containers for data. Documents are stored in shards, and shards are allocated to nodes in your cluster. As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced.
  • The number of primary shards in an index is fixed at the time an index is created, but the number of replica shards can be changed at any time. By default, indices are assigned five primary shards.
  • Any newly indexed document will first be stored on a primary shard, and then copied in parallel to the associated replica shard(s).
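
Elasticsearch decides which primary shard holds a document with the formula shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document's _id; this is also why the primary shard count cannot change after index creation. A minimal Python sketch of that idea follows — note that plain zlib.crc32 stands in for Elasticsearch's real murmur3 hash, purely for illustration:

```python
import zlib

def route_to_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Sketch of Elasticsearch document routing:
    shard = hash(routing) % number_of_primary_shards.
    zlib.crc32 stands in for the murmur3 hash ES actually uses."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards

# With 5 primary shards (the default), every document id maps
# deterministically to exactly one shard in the range 0..4.
shards = {route_to_shard(f"doc-{i}", 5) for i in range(100)}
print(sorted(shards))
```

Because the shard is derived by taking the hash modulo the primary-shard count, changing that count would re-map existing documents to different shards — which is why it is fixed at index-creation time.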

Elasticsearch Cluster

An empty cluster with no indices returns the following cluster health.

Cluster health:

{
   "cluster_name":          "elasticsearch",
   "status":                "green",
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 0,
   "active_shards":         0,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}

The status field provides an overall indication of how the cluster is functioning. The meanings of the three colors are provided here for reference:

green All primary and replica shards are active.

yellow All primary shards are active, but not all replica shards are active.

red Not all primary shards are active.
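
The three colors follow directly from shard state, which can be summed up in a few lines of Python (a hedged sketch; the function name is hypothetical, not an Elasticsearch API):

```python
def cluster_status(active_primaries: int, total_primaries: int,
                   active_replicas: int, total_replicas: int) -> str:
    """Sketch of the health-color rules described above:
    red    -> at least one primary shard is not active
    yellow -> all primaries active, but some replica is not
    green  -> every primary and replica shard is active"""
    if active_primaries < total_primaries:
        return "red"
    if active_replicas < total_replicas:
        return "yellow"
    return "green"

# Single node, 3 primaries, 3 replicas: no replica can be assigned,
# so the cluster reports yellow.
print(cluster_status(3, 3, 0, 3))  # yellow
```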

Now, create an index with 3 primary shards and one replica (one replica of every primary shard):

Index creation:

{
   "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
   }
}
(Figure: a single-node cluster with an index)
Cluster health:
{
   "cluster_name":          "elasticsearch",
   "status":                "yellow",
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 3,
   "active_shards":         3,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     3
}

Cluster status is yellow, with three unassigned shards: our three replica shards have not yet been allocated to a node.
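
The yellow status follows from one allocation rule: a replica shard is never placed on the same node as its primary. A toy allocator illustrates why a single-node cluster cannot assign its replicas (this is an illustration only, not Elasticsearch's real allocation code):

```python
def allocate(num_nodes: int, primaries: int, replicas_per_primary: int):
    """Toy shard allocator: primaries round-robin across nodes,
    replicas on any node EXCEPT the one holding their primary.
    Returns (node -> shard list, count of unassigned replicas)."""
    nodes = {n: [] for n in range(num_nodes)}
    primary_node = {}
    unassigned = 0
    for p in range(primaries):
        node = p % num_nodes
        nodes[node].append(f"P{p}")
        primary_node[p] = node
    for p in range(primaries):
        for _ in range(replicas_per_primary):
            candidates = [n for n in nodes if n != primary_node[p]]
            if candidates:
                # Pick the least-loaded eligible node, to stay balanced.
                target = min(candidates, key=lambda n: len(nodes[n]))
                nodes[target].append(f"R{p}")
            else:
                unassigned += 1
    return nodes, unassigned

# One node, 3 primaries, 1 replica each: all 3 replicas stay unassigned,
# matching the "unassigned_shards": 3 in the health output above.
print(allocate(1, 3, 1)[1])  # 3
```

Starting a second node gives every replica a legal home, which is exactly the failover step described next.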

With only one node there is no redundancy and the node is a single point of failure, so add failover by starting a second node.

(Figure: a two-node cluster; all primary and replica shards are allocated)

Cluster health is now green.

Cluster health:

{
   "cluster_name":          "elasticsearch",
   "status":                "green",
   "timed_out":             false,
   "number_of_nodes":       2,
   "number_of_data_nodes":  2,
   "active_primary_shards": 3,
   "active_shards":         6,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}
To scale horizontally, add another node to the cluster:
(Figure: a three-node cluster)

One shard each from Node 1 and Node 2 has moved to the new Node 3, leaving two shards per node instead of three. This means that the hardware resources (CPU, RAM, I/O) of each node are being shared among fewer shards, allowing each shard to perform better.

Increasing the number of replicas to 2:

(Figure: a three-node cluster with two replica shards per primary)

This improves search performance, since read requests can be handled by a primary or a replica shard. The furthest we can scale out is by adding 6 more nodes, making a cluster of 9 nodes in which each shard has access to 100% of its node’s resources.
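
The scaling limit above is simple arithmetic — total shards equal primaries times (1 + replicas), and each shard can claim a whole node's resources only when there is one shard per node:

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Every primary shard plus all of its replica copies."""
    return primaries * (1 + replicas_per_primary)

# 3 primaries, 1 replica each -> 6 shards (the two-node green cluster).
print(total_shards(3, 1))  # 6

# Raising replicas to 2 gives 9 shards, so the cluster can grow to
# 9 nodes: one shard per node, each with the full node's resources.
print(total_shards(3, 2))  # 9
```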

Coping with failure:

Cluster after killing one node:

(Figure: the cluster after killing Node 1)

Cluster health is red: primary shards 1 and 2 were lost when Node 1 was killed. Node 2 becomes master and promotes the replicas of these shards on Node 2 and Node 3 to primaries, bringing the cluster health back to yellow.
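
The promotion step can be sketched as follows — a toy simulation, not Elasticsearch's real recovery code; the data layout and function name are hypothetical:

```python
def promote_after_failure(shards, dead_node):
    """Toy failover: drop the dead node's shards, then promote a
    surviving replica of each lost primary. If a lost primary has
    no surviving copy the cluster is red; otherwise yellow, since
    the promoted shards now lack replicas. (Simplified.)"""
    survivors = [s for s in shards if s["node"] != dead_node]
    lost_primaries = {s["id"] for s in shards
                      if s["role"] == "primary" and s["node"] == dead_node}
    for s in survivors:
        if s["id"] in lost_primaries and s["role"] == "replica":
            s["role"] = "primary"          # replica promoted in place
            lost_primaries.discard(s["id"])
    status = "red" if lost_primaries else "yellow"
    return survivors, status

# 3 primaries with their replicas spread over nodes 1-3; kill Node 1,
# which held primaries 0 and 1.
shards = [
    {"id": 0, "role": "primary", "node": 1},
    {"id": 0, "role": "replica", "node": 2},
    {"id": 1, "role": "primary", "node": 1},
    {"id": 1, "role": "replica", "node": 3},
    {"id": 2, "role": "primary", "node": 2},
    {"id": 2, "role": "replica", "node": 3},
]
survivors, status = promote_after_failure(shards, dead_node=1)
print(status)  # yellow -- every primary recovered via a promoted replica
```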

Configuration:

discovery.zen.minimum_master_nodes: should be set to a quorum of (N/2) + 1, where N is the number of master-eligible nodes, to avoid the split-brain problem.

discovery.zen.ping.timeout: defaults to 3 seconds and determines how long a node will wait for a response from other nodes in the cluster before assuming that a node has failed. Slightly increase the default value on slower networks.
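
The quorum rule for minimum_master_nodes is just integer arithmetic: with N master-eligible nodes, requiring floor(N/2) + 1 votes guarantees that two partitioned halves of the cluster can never both elect a master.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum for discovery.zen.minimum_master_nodes:
    floor(N/2) + 1 of the master-eligible nodes."""
    return master_eligible // 2 + 1

# A 3-node cluster needs 2 votes, a 10-node cluster needs 6; two
# disjoint halves can never both reach that threshold.
for n in (1, 2, 3, 5, 10):
    print(n, minimum_master_nodes(n))
```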


Glossary of terms:

cluster

A cluster consists of one or more nodes which share the same cluster name. Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.

index

An index is like a database in a relational database system. It has a mapping which defines multiple types. An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.

node

A node is a running instance of Elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server for testing purposes, but usually you should have one node per server. At startup, a node will use unicast (or multicast, if specified) to discover an existing cluster with the same cluster name and will try to join that cluster.

primary shard

Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard. By default, an index has 5 primary shards. You can specify fewer or more primary shards to scale the number of documents that your index can handle. You cannot change the number of primary shards in an index once the index is created. See also routing.

replica shard

Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes:

  1. increase failover: a replica shard can be promoted to a primary shard if the primary fails
  2. increase performance: get and search requests can be handled by primary or replica shards. By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.

shard

A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by Elasticsearch. An index is a logical namespace which points to primary and replica shards. Other than defining the number of primary and replica shards that an index should have, you never need to refer to shards directly. Instead, your code should deal only with an index. Elasticsearch distributes shards amongst all nodes in the cluster, and can move shards automatically from one node to another in the case of node failure, or the addition of new nodes.

 

Reference:

https://www.elastic.co/guide/en/elasticsearch/guide/current/getting-started.html
