Elasticsearch dump

It is often necessary to take a dump of all the data stored in Elasticsearch. Although Elasticsearch provides the Snapshot and Restore feature to persist and restore data from either a shared filesystem or a remote repository, you may still want a local dump.

elasticdump is a Node.js module which can be used to dump an Elasticsearch index.

Step 1: Verify Node.js is installed
To get started, first ensure Node.js is installed in your local environment. If not, download and install it from https://nodejs.org/en/download/

M25:dump nmn-notes$ node -v
v0.10.24

Step 2: Install elasticdump globally
M25:dump nmn-notes$ sudo npm install elasticdump -g

Step 3: Take dump
M25:dump nmn-notes$ elasticdump --input=http://localhost:9200/my_index --output=/tmp/es_dump/dump.json --type=data

Step 3 fetches all the data from the my_index index into the /tmp/es_dump/dump.json file. Each line in dump.json is a JSON document.
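
For reference, each line of the dump is a self-contained JSON object wrapping the original document. A line might look roughly like the following (the document body here is made up, and the exact wrapper fields can vary between elasticdump versions):

{"_index":"my_index","_type":"logs","_id":"AVwq1","_score":1,"_source":{"message":"hello world","@timestamp":"2017-05-20T21:59:00Z"}}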

You can also write a simple script and configure a cron job to run it periodically. Below is a sample script which connects to Elasticsearch on localhost and dumps a requested index to an output directory.

es_dump.sh
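
(The script body is not reproduced here; the following is a minimal sketch of what es_dump.sh could look like, assuming elasticdump is on the PATH and Elasticsearch is listening on localhost:9200.)

#!/bin/bash
# Minimal sketch: dump a single index from the local Elasticsearch into a timestamped JSON file.
# Usage: ./es_dump.sh <index_name> <output_dir>

INDEX="$1"
OUTPUT_DIR="$2"

if [ -z "$INDEX" ] || [ -z "$OUTPUT_DIR" ]; then
  echo "Usage: $0 <index_name> <output_dir>"
  exit 1
fi

# Make sure the output directory exists
mkdir -p "$OUTPUT_DIR"

# Timestamp in the same format as the example file name (e.g. 2017-05-20:21-59-00)
TIMESTAMP=$(date "+%Y-%m-%d:%H-%M-%S")

elasticdump \
  --input="http://localhost:9200/$INDEX" \
  --output="$OUTPUT_DIR/$INDEX-$TIMESTAMP.json" \
  --type=data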

M25:dump nmn-notes$ ./es_dump.sh my_index /tmp/es_data

The above command takes a dump of the my_index index and copies all of its content to the /tmp/es_data/my_index-2017-05-20:21-59-00.json file.

Elasticsearch Node Type

An instance of Elasticsearch is a node, and a collection of nodes is called a cluster.

All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node. Besides that, each node serves one or more purposes:

  1. Master node
  2. Data node
  3. Client node
  4. Tribe node

1. Master node
Master node controls the cluster. Any master-eligible node (all nodes by default) may be elected to become the master node by the master election process. The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node.

Indexing and searching your data is CPU-, memory-, and I/O-intensive work which can put pressure on a node’s resources. To ensure that your master node is stable and not under pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-eligible nodes and dedicated data nodes.

While master nodes can also behave as coordinating nodes and route search and indexing requests from clients to data nodes, it is better not to use dedicated master nodes for this purpose. It is important for the stability of the cluster that master-eligible nodes do as little work as possible.

To create a standalone master-eligible node, set:
node.master: true (default)
node.data: false

2. Data node
Data nodes hold the shards that contain the documents you have indexed. They also perform data-related operations such as CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded.

To create a standalone data node, set:
node.master: false
node.data: true (default)

3. Client node
If you take away the ability to handle master duties and the ability to hold data, then you are left with a client node that can only route requests, handle the search reduce phase, and distribute bulk indexing. A client node neither holds data nor becomes the master node. It behaves as a “smart router” and forwards cluster-level requests to the master node and data-related requests (such as search) to the appropriate data nodes. Requests like search requests or bulk-indexing requests may involve data held on different data nodes.

A search request, for example, is executed in two phases which are coordinated by the node which receives the client request – the coordinating node:

  1. In the scatter phase, the coordinating node forwards the request to the data nodes which hold the data. Each data node executes the request locally and returns its results to the coordinating node.
  2. In the gather phase, the coordinating node reduces each data node’s results into a single global result set.

This means that a client node needs to have enough memory and CPU in order to deal with the gather phase.

Standalone client nodes can benefit large clusters by offloading the coordinating node role from data and master-eligible nodes. Client nodes join the cluster and receive the full cluster state, like every other node, and they use the cluster state to route requests directly to the appropriate place(s).

Warning
Adding too many client nodes to a cluster can increase the burden on the entire cluster because the elected master node must await acknowledgement of cluster state updates from every node! The benefit of client nodes should not be overstated — data nodes can happily serve the same purpose as client nodes.

To create a client node, set:
node.master: false
node.data: false

4. Tribe node
A tribe node, configured via the tribe.* settings, is a special type of client node that can connect to multiple clusters and perform search and other operations across all connected clusters.
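
As an illustrative sketch, a tribe node connecting two clusters could be configured in elasticsearch.yml along these lines (the tribe names t1/t2 and the cluster names are placeholders):

tribe:
    t1:
        cluster.name:   cluster_one
    t2:
        cluster.name:   cluster_two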

Avoiding split brain with minimum_master_nodes:

The minimum_master_nodes setting can be used to avoid split brain. Split brain occurs when the cluster is divided into two smaller clusters due to a network partition or another issue, and each smaller cluster elects its own master, resulting in more than one master at a time.

To elect a master, a quorum of master-eligible nodes is required: (master_eligible_nodes / 2) + 1

So, in a cluster with 3 master-eligible nodes, the minimum quorum required is (3 / 2) + 1 = 2.
Now, if a network partition results in two smaller clusters, with 2 master-eligible nodes in the first and the 3rd master-eligible node in the second, the second cluster does not have the required quorum of 2 and therefore cannot elect a new master.

Hence, if there are three master-eligible nodes, then minimum master nodes should be set to 2.
discovery.zen.minimum_master_nodes: 2

Suggested deployment for a small/medium-size cluster (12-20 nodes) deployed across 4 different clouds:
1. 3 medium computes as master nodes, one in each of 3 different clouds (this is to ensure high availability if an entire cloud goes down).
2. 2 large/extra-large computes as client nodes, one in each cloud, again for availability reasons.
3. The remaining large/extra-large computes (as data nodes), distributed equally among all clouds.

Reference:

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html

Elasticsearch

Elasticsearch is a highly scalable open-source full-text search and analytics engine. Elasticsearch is distributed by nature: it knows how to manage multiple nodes to provide scale and high availability.

Installation:

Download the latest version of Elasticsearch (https://www.elastic.co/downloads) or use the curl command to download it:

curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.4/elasticsearch-2.3.4.tar.gz

Then extract it as follows (Windows users should unzip the zip package):

tar -xvf elasticsearch-2.3.4.tar.gz

It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:

cd elasticsearch-2.3.4/bin

And now we are ready to start our node and single cluster (Windows users should run the elasticsearch.bat file):

./elasticsearch

If everything goes well, you should see a bunch of messages like the ones below:

./elasticsearch
[2014-03-13 13:42:17,218][INFO ][node           ] [New Goblin] version[2.3.4], pid[2085], build[5c03844/2014-02-25T15:52:53Z]
[2014-03-13 13:42:17,219][INFO ][node           ] [New Goblin] initializing ...
[2014-03-13 13:42:17,223][INFO ][plugins        ] [New Goblin] loaded [], sites []
[2014-03-13 13:42:19,831][INFO ][node           ] [New Goblin] initialized
[2014-03-13 13:42:19,832][INFO ][node           ] [New Goblin] starting ...
[2014-03-13 13:42:19,958][INFO ][transport      ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.8.112:9300]}
[2014-03-13 13:42:23,030][INFO ][cluster.service] [New Goblin] new_master [New Goblin][rWMtGj3dQouz2r6ZFL9v4g][mwubuntu1][inet[/192.168.8.112:9300]], reason: zen-disco-join (elected_as_master)
[2014-03-13 13:42:23,100][INFO ][discovery      ] [New Goblin] elasticsearch/rWMtGj3dQouz2r6ZFL9v4g
[2014-03-13 13:42:23,125][INFO ][http           ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.8.112:9200]}
[2014-03-13 13:42:23,629][INFO ][gateway        ] [New Goblin] recovered [1] indices into cluster_state
[2014-03-13 13:42:23,630][INFO ][node           ] [New Goblin] started

Life inside a Cluster

If we start a single node, with no data and no indices, our cluster looks like this:

[Figure: An empty cluster with no data]

A node is a running instance of Elasticsearch, while a cluster consists of one or more nodes with the same cluster.name that are working together to share their data and workload. As nodes are added or removed from the cluster, the cluster reorganizes itself to spread the data evenly.

Master Node (manages cluster-wide changes):

  •      Creating or deleting an index
  •      Adding or removing a node from the cluster.

As users, we can talk to any node in the cluster, including the master node. Every node knows where each document lives and can forward our request directly to the nodes that hold the data we are interested in. Whichever node we talk to manages the process of gathering the response from the node or nodes holding the data and returning the final response to the client.

Storing Data
  • An index is just a “logical namespace” which points to one or more physical shards.
  • A shard is a low-level “worker unit” which holds just a slice of all the data in the index; it is a single instance of Lucene and a complete search engine in its own right. Our documents are stored and indexed in shards, but our applications don’t talk to them directly. Instead, they talk to an index.
  • Shards are how Elasticsearch distributes data around your cluster. Think of shards as containers for data. Documents are stored in shards, and shards are allocated to nodes in your cluster. As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced (see the example after this list).
  • The number of primary shards in an index is fixed at the time an index is created, but the number of replica shards can be changed at any time. By default, indices are assigned five primary shards.
  • Any newly indexed document will first be stored on a primary shard, and then copied in parallel to the associated replica shard(s).
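
As a quick way to see this allocation in practice, the _cat API lists every shard and the node it lives on (assuming Elasticsearch is listening on localhost:9200):

curl -XGET 'http://localhost:9200/_cat/shards?v'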

Elasticsearch Cluster

An empty cluster with no indices will return the following cluster health.

Cluster health:

{
   "cluster_name":          "elasticsearch",
   "status":                "green",
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards"0,
   "active_shards":         0,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}
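
The cluster health shown above can be retrieved from the cluster health API, for example:

curl -XGET 'http://localhost:9200/_cluster/health?pretty'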

The status field provides an overall indication of how the cluster is functioning. The meanings of the three colors are provided here for reference:

green All primary and replica shards are active.

yellow All primary shards are active, but not all replica shards are active.

red Not all primary shards are active.

Now, create an index with 3 primary shards and one replica (one replica of every primary shard):

Index creation:

{
   "settings": {
      "number_of_shards"3,
      "number_of_replicas"1
   }
}'
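
These settings are sent in the body of the index-creation request; for example, assuming the index is to be named blogs (the name here is just an example):

curl -XPUT 'http://localhost:9200/blogs' -d '
{
   "settings": {
      "number_of_shards":    3,
      "number_of_replicas":  1
   }
}'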

[Figure: A single-node cluster with an index]
Cluster health:
{
   "cluster_name":          "elasticsearch",
   "status":                "yellow",
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards"3,
   "active_shards":         3,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     3
}

Cluster status is yellow, with three unassigned shards.

Our three replica shards have not been allocated to a node.

Add failover by starting another node, since with only one node there is no redundancy and it is a single point of failure.

[Figure: A two-node cluster]

Cluster health is green.

Cluster health:

{
   "cluster_name":          "elasticsearch",
   "status":                "green",
   "timed_out":             false,
   "number_of_nodes":       2,
   "number_of_data_nodes":  2,
   "active_primary_shards"3,
   "active_shards":         6,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}
To scale horizontally, add another node to the cluster:
[Figure: A three-node cluster]

One shard each from Node 1 and Node 2 has moved to the new Node 3, and we have two shards per node instead of three. This means that the hardware resources (CPU, RAM, I/O) of each node are being shared among fewer shards, allowing each shard to perform better.

Increasing the number of replicas to 2:

[Figure: A three-node cluster with two replica shards per primary]

This would improve search, since read requests can be handled by a primary or a replica shard. The maximum we can scale out to with this configuration is by adding 6 more nodes, making a cluster of 9 nodes, with each shard having access to 100% of its node’s resources.
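
Since the replica count can be changed on a live index, this can be done through the index settings API, for example (again assuming an index named blogs):

curl -XPUT 'http://localhost:9200/blogs/_settings' -d '
{
   "number_of_replicas": 2
}'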

Coping with failure:

Cluster after killing one node:

[Figure: The cluster after killing one node]

Cluster health is red. Primary shards 1 and 2 were lost when node 1 was killed. Node 2 becomes master and promotes the replicas of these shards on Node 2 and Node 3 to primaries, bringing the cluster health back to yellow.

Configuration:

discovery.zen.minimum_master_nodes: should be set to (N / 2) + 1 to avoid the split brain problem.

discovery.zen.ping.timeout: the default value is 3 seconds; it determines how long a node will wait for a response from other nodes in the cluster before assuming that the node has failed. Slightly increase the default value for slower networks.
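
Putting the two settings together, an elasticsearch.yml snippet for a cluster with three master-eligible nodes on a slower network might look like the following (the 10s timeout is only an illustrative value):

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 10s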

Glossary of terms:

cluster

A cluster consists of one or more nodes which share the same cluster name. Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.

index

An index is like a database in a relational database. It has a mapping which defines multiple types. An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.

node

A node is a running instance of elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server for testing purposes, but usually you should have one node per server. At startup, a node will use unicast (or multicast, if specified) to discover an existing cluster with the same cluster name and will try to join that cluster.

primary shard

Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard. By default, an index has 5 primary shards. You can specify fewer or more primary shards to scale the number of documents that your index can handle. You cannot change the number of primary shards in an index once the index is created. See also routing.

replica shard

Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes:

  1. increase failover: a replica shard can be promoted to a primary shard if the primary fails
  2. increase performance: get and search requests can be handled by primary or replica shards. By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.

shard

A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards. Other than defining the number of primary and replica shards that an index should have, you never need to refer to shards directly. Instead, your code should deal only with an index. Elasticsearch distributes shards amongst all nodes in the cluster, and can move shards automatically from one node to another in the case of node failure, or the addition of new nodes.

 

Reference:

https://www.elastic.co/guide/en/elasticsearch/guide/current/getting-started.html