Elasticsearch dump

It is often necessary to take a dump of all the data stored in Elasticsearch. Although Elasticsearch provides the Snapshot and Restore feature to persist and restore data from either a shared filesystem or a remote repository, you may still want a local dump.

elasticdump is a Node.js module which can be used to dump the contents of an Elasticsearch index.

Step 1: Verify node is installed
To get started, first ensure Node.js is installed on your local environment; if not, download and install it from https://nodejs.org/en/download/

M25:dump nmn-notes$ node -v
v0.10.24

Step 2: Install elasticdump globally
M25:dump nmn-notes$ sudo npm install elasticdump -g

Step 3: Take dump
M25:dump nmn-notes$ elasticdump --input=http://localhost:9200/my_index --output=/tmp/es_dump/dump.json --type=data

Step 3 will fetch all data from the my_index index into the /tmp/es_dump/dump.json file. Each line in dump.json will be a JSON document.

You can also write a simple script and configure a cron job to run it periodically. Below is a sample script which connects to a local Elasticsearch instance and takes a dump of a requested index into an output directory.

es_dump.sh
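A minimal sketch of what such a script might look like (hypothetical; it assumes elasticdump is on the PATH and Elasticsearch listens on localhost:9200):

#!/bin/bash
# Usage: ./es_dump.sh <index> <output_dir>
# Dumps <index> from the local Elasticsearch into <output_dir>/<index>-<timestamp>.json

INDEX="$1"
OUTPUT_DIR="$2"

if [ -z "$INDEX" ] || [ -z "$OUTPUT_DIR" ]; then
    echo "Usage: $0 <index> <output_dir>"
    exit 1
fi

mkdir -p "$OUTPUT_DIR"
TIMESTAMP=$(date +%Y-%m-%d:%H-%M-%S)

elasticdump \
    --input="http://localhost:9200/$INDEX" \
    --output="$OUTPUT_DIR/$INDEX-$TIMESTAMP.json" \
    --type=data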

M25:dump nmn-notes$ ./es_dump.sh my_index /tmp/es_data

The above command will take a dump of the my_index index and copy all its content to the /tmp/es_data/my_index-2017-05-20:21-59-00.json file.

Palindrome

A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward or forward, such as madam or kayak.

To check whether a given string is a palindrome, use the two-pointer approach: place the first pointer at the start and the second at the end, then compare the characters at those indexes while incrementing the first index and decrementing the second, until a mismatch is found or the pointers meet at the centre. If no mismatch is found, the string is a palindrome; otherwise it is not. The complexity of this algorithm is linear, O(n).

public static boolean isPalindrome(final String str) {
    int start = 0;
    int end = str.length() - 1;
    while (start < end) {
        if (str.charAt(start) != str.charAt(end)) {
            return false;
        }
        start++;
        end--;
    }
    return true;
}

[Figure: palindrome check with two pointers]

Now, let us modify the problem statement to check whether a given string, or any of its substrings, is a palindrome.
The hasPalindrome() method below returns the length of a substring which is a palindrome; in the worst case it returns 1 (remember, every single character is a palindrome).

public static int hasPalindrome(final String str) {
    for (int i = 0; i < str.length() - 1; i++) {
        for (int j = str.length() - 1; j > i; j--) {
            if (isPalindrome(str, i, j)) {
                return j - i + 1;
            }
        }
    }
    return 1;
}

Here we use an isPalindrome(str, start, end) subroutine to check whether the substring of str from index start to end is a palindrome.
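A minimal sketch of that overload, using the same two-pointer technique as the isPalindrome(String) method above:

public static boolean isPalindrome(final String str, int start, int end) {
    //Compare characters moving inwards from both ends.
    while (start < end) {
        if (str.charAt(start) != str.charAt(end)) {
            return false;
        }
        start++;
        end--;
    }
    return true;
}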

[Figure: substring palindrome]

If we update the above problem statement again to return the length of the longest substring which is a palindrome, the above algorithm will NOT work, since it returns the length of the first palindrome found, which is not necessarily the longest.

For example: if we prefix our original string madam with ‘a’, it becomes amadam.

The above algorithm returns length 3 for ‘amadam’ (for the palindrome ‘ama’), which is a valid palindrome but not the longest; the longest is ‘madam’ of length 5.

The reason the above algorithm does not work for the longest palindrome is that we terminate the moment we find the first palindrome. We could modify it to maintain a max-length variable and keep updating it whenever we find a longer palindrome, but that is nothing more than a brute-force approach in which we evaluate all substrings.

A slightly better approach is to evaluate longer substrings first: the moment we find a substring that is a palindrome, its length is guaranteed to be greater than that of any palindrome contained within it, so we do not have to evaluate any of its inner substrings further.

A naive recursive approach is to check whether the given string is a palindrome and, if not, find the maximum-length palindrome in its left and right substrings.

public static int maxPalindromeLength(final String str, int start, int end){
    //Condition to break recursion
    if (start >= end) {
        return 1;
    }

    if (isPalindrome(str, start, end)) {
        return end - start + 1; 
    }

    //return longest substring of left and right substring.
    return Math.max(
        //longest palindrome in left substring.
        maxPalindromeLength(str, start, end - 1),
        //longest palindrome in right substring.  
        maxPalindromeLength(str, start + 1, end)
    );
}

[Figure: recursive palindrome]

The complexity of this algorithm is exponential O(2^n), since with every increase in n, the total number of comparisons increases exponentially.

The below recursion tree shows total number of comparisons done for n=2, 3 and 4.

[Figure: recursion tree for the palindrome problem]

Here we compare the same substring "bc" twice, hence this problem has the overlapping-subproblems property of dynamic programming.

A bottom-up approach is to evaluate smaller substrings first and use those results in subsequent evaluations. In other words, a string is a palindrome provided its first and last characters match and the substring between them is also a palindrome. Hence a string S(0, n) is a palindrome provided the characters at 0 and n match and S(1, n-1) is also a palindrome. We can reuse the result of S(1, n-1) while evaluating S(0, n), so this problem has the optimal-substructure property of dynamic programming as well.

Re-computation of overlapping subproblems can be avoided by using a lookup table L[][].

The substring from index i to j is a palindrome if the characters at i and j match and the substring from i+1 to j-1 is also a palindrome; L[i][j] stores its length, or -1 if it is not a palindrome.

if (character at i and j matches && L[i+1][j-1] is a palindrome) {
    L[i][j] = L[i+1][j-1] + 2; // length of inner palindrome + 2
}
else {
    L[i][j] = -1; // not a palindrome
}

We start analyzing all substrings of length 2 and use those results while analyzing all substrings of length 3, and so on and so forth.

public static int longestPalindrome(final String str) {
    int length = str.length();
    int [][] L = new int[length][length];

    for (int i = 0; i < length; i ++) {
        for (int j = 0; j < length; j++) {
            //Every character is a palindrome of length 1.
            L[i][j] = (i == j) ? 1 : -1;
        }
    }

    int subStrLength = 2;
    while (subStrLength <= length) {
        for (int i = 0; i < length - subStrLength + 1; i++) {
            int j = i + subStrLength - 1; //end index of substring.
            if ((subStrLength == 2 || L[i+1][j-1] != -1) && 
                str.charAt(i) == str.charAt(j)) {
                //a 2-character palindrome has length 2; otherwise extend the inner palindrome by 2.
                L[i][j] = (subStrLength == 2) ? 2 : L[i+1][j-1] + 2;
            }
            else {
                L[i][j] = -1;
            }
        }
        subStrLength++;
    }
    return max(L);
}
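The max(L) call refers to a helper that scans the lookup table for its largest entry; a minimal sketch of such a helper:

private static int max(final int[][] table) {
    int max = 0;
    //Scan every cell; non-palindromes are -1, so they never win.
    for (int[] row : table) {
        for (int value : row) {
            max = Math.max(max, value);
        }
    }
    return max;
}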

The complexity of the above algorithm is O(n^2).

Git:
algorithms/src/main/java/org/nmn/notes/algorithms/dp/Palindrome.java

References:
https://en.wikipedia.org/wiki/Palindrome

Elasticsearch Node Type

An instance of Elasticsearch is a node, and a collection of nodes is called a cluster.

All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node. Besides that, each node serves one or more purposes:

  1. Master node
  2. Data node
  3. Client node
  4. Tribe node

1. Master node
Master node controls the cluster. Any master-eligible node (all nodes by default) may be elected to become the master node by the master election process. The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node.

Indexing and searching your data is CPU-, memory-, and I/O-intensive work which can put pressure on a node’s resources. To ensure that your master node is stable and not under pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-eligible nodes and dedicated data nodes.

While master nodes can also behave as coordinating nodes and route search and indexing requests from clients to data nodes, it is better not to use dedicated master nodes for this purpose. It is important for the stability of the cluster that master-eligible nodes do as little work as possible.

To create a standalone master-eligible node, set:
node.master: true (default)
node.data: false

2. Data node
Data nodes hold the shards that contain the documents you have indexed. They also perform data-related operations such as CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded.

To create a standalone data node, set:
node.master: false
node.data: true (default)

3. Client node
If you take away the ability to handle master duties and the ability to hold data, you are left with a client node that can only route requests, handle the search reduce phase, and distribute bulk indexing. Client nodes neither hold data nor can they become the master node. They behave as "smart routers" and are used to forward cluster-level requests to the master node and data-related requests (such as search) to the appropriate data nodes. Requests like search requests or bulk-indexing requests may involve data held on different data nodes.

A search request, for example, is executed in two phases which are coordinated by the node which receives the client request – the coordinating node:

  1. In the scatter phase, the coordinating node forwards the request to the data nodes which hold the data. Each data node executes the request locally and returns its results to the coordinating node.
  2. In the gather phase, the coordinating node reduces each data node’s results into a single global resultset.

This means that a client node needs to have enough memory and CPU in order to deal with the gather phase.

Standalone client nodes can benefit large clusters by offloading the coordinating node role from data and master-eligible nodes. Client nodes join the cluster and receive the full cluster state, like every other node, and they use the cluster state to route requests directly to the appropriate place(s).

Warning
Adding too many client nodes to a cluster can increase the burden on the entire cluster because the elected master node must await acknowledgement of cluster state updates from every node! The benefit of client nodes should not be overstated — data nodes can happily serve the same purpose as client nodes.

To create a client node, set:
node.master: false
node.data: false

4. Tribe node
A tribe node, configured via the tribe.* settings, is a special type of client node that can connect to multiple clusters and perform search and other operations across all connected clusters.

Avoiding split brain with minimum_master_nodes:

The minimum_master_nodes setting can be used to avoid split brain. Split brain occurs when the cluster is divided into two smaller clusters, due to a network partition or some other issue, and each of the smaller clusters elects its own master, so there is more than one master at a time.

To elect a master a quorum of master-eligible nodes is required: (master_eligible_nodes / 2) + 1

So, in a cluster with 3 master-eligible nodes, the minimum quorum required is (3 / 2) + 1 = 2.
Now, if a network partition splits the cluster into two, with 2 master-eligible nodes in the 1st part and the 3rd master-eligible node in the 2nd part, the 2nd part cannot elect a new master since it does not have the required quorum of 2.

Hence, if there are three master-eligible nodes, then minimum master nodes should be set to 2.
discovery.zen.minimum_master_nodes: 2

Suggested deployment for a small/medium-size cluster (12-20 nodes) deployed across 4 different clouds:
1. 3 medium computes as dedicated master-eligible nodes, one in each of 3 different clouds (this ensures high availability if an entire cloud goes down).
2. 2 large/extra-large computes as client nodes, one in each cloud, again for availability reasons.
3. The remaining large/extra-large computes as data nodes, distributed equally among all clouds.

Reference:

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html

Java Threads

JVM Threads:
1. Main Thread
2. Memory Management Thread
3. System Management

Benefits of Threads:
1. Threads are lightweight compared to processes; it takes less time and fewer resources to create a thread.
2. Threads share their parent process's data and code.
3. Context switching between threads is usually less expensive than between processes.
4. Inter-thread communication is relatively easier than inter-process communication.

What is the difference between a user thread and a daemon thread?

When we create a Thread in a Java program, it is known as a user thread. A daemon thread runs in the background and does not prevent the JVM from terminating; it usually has a low priority. When there are no user threads running, the JVM shuts down the program and exits. A child thread created from a daemon thread is also a daemon thread.
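A minimal sketch (the thread name and sleep interval are arbitrary):

public class DaemonThreadExample {
    public static void main(String[] args) throws InterruptedException {
        Thread daemon = new Thread(() -> {
            while (true) {
                //background housekeeping work would go here.
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "housekeeping-daemon");

        daemon.setDaemon(true); //must be set before start(), otherwise IllegalThreadStateException.
        daemon.start();

        Thread.sleep(100);
        //main (the only user thread) exits here; the JVM terminates even though the daemon is still running.
    }
}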

How do we create a Thread in Java?
1. Implement the java.lang.Runnable interface and create a Thread object with it.
2. Extend the java.lang.Thread class; note that if you extend Thread you cannot extend any other class. Both approaches are sketched below.
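A small sketch of both approaches (class and task names are arbitrary):

public class ThreadCreationExample {

    //2. Extending the Thread class.
    static class WorkerThread extends Thread {
        @Override
        public void run() {
            System.out.println("Running via Thread subclass: " + getName());
        }
    }

    public static void main(String[] args) {
        //1. Implementing Runnable and handing it to a Thread.
        Runnable task = () -> System.out.println(
            "Running via Runnable: " + Thread.currentThread().getName());
        new Thread(task).start();

        new WorkerThread().start();
    }
}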

When to extend Thread instead of Runnable?

We should almost never extend Thread; arguably the Thread class should have been final, and leaving it extensible is a design flaw.
Some people extend Thread to override getId(), but we should not do that: the concurrency packages use the native thread ID, which will be different and can cause arbitrary behaviour. Also, if we extend the Thread class, the object will be of type Thread.
One case for extending Thread is to get a singleton instance of some background daemon task, such as GC, shutdown, CPU monitor, the Catalina shutdown hook, or MongoCursorCleaner.

Auditing is another example: auditing can be a producer-consumer use case where all threads produce audit events and push them to a queue, and a single audit thread consumes those events from the queue. You could create an AuditEventThread by extending Thread, create a singleton instance, and start it by calling its start() method; the same behaviour, however, can be achieved by implementing the Runnable interface as well.

Another important point to note is that whenever a thread finishes execution, it calls notifyAll() on its own Thread object so that any thread waiting on that lock is notified; this is how join() is implemented: join() waits on the thread object's lock, and once the thread's execution is over, the waiting thread is notified. Now imagine the Thread object is also used as a lock somewhere else in the code; that lock will receive a spurious notification when the thread exits, which is another reason why it is a very bad idea to extend Thread.

What is context-switching in multi-threading?
Context switching is the process of storing and restoring CPU state so that thread execution can be resumed from the same point at a later time. Context switching is an essential feature of a multitasking operating system and of support for multi-threaded environments.

Why are the thread communication methods wait(), notify() and notifyAll() in the Object class?

In Java every object has a monitor, and the wait and notify methods are used to wait on an object's monitor or to notify other threads that the object's monitor is free. There is no monitor on threads themselves, and synchronization can be used with any object; that is why these methods are part of the Object class, so that every class in Java has these essential methods for inter-thread communication.
Threads do not know the status of other threads, or which other thread is waiting on a given lock.
Java is based on Hoare's monitors idea; in Java all objects have a monitor.

Threads wait on monitors, so to perform a wait we need two parameters:
– a thread
– a monitor (any object)

In the Java design, the thread cannot be specified; it is always the current thread running the code. However, we can specify the monitor (the object we wait on). This is a good design, because if we could make any other thread wait on a desired monitor, it would amount to an "intrusion", posing difficulties in designing and programming concurrent programs. Remember that in Java all operations that intrude on another thread's execution are deprecated (e.g. stop()).

What is the major advantage of the new Lock interface?
It allows separate locks for reading and writing (see the ReadWriteLock interface).

Why are the Thread sleep() and yield() methods static?
sleep() and yield() work on the currently executing thread, so there is no point in invoking them on some other thread that is in a wait state. That is why these methods are static: when called, they always act on the currently executing thread, which avoids confusing programmers into thinking they can invoke them on some non-running thread.

Why should I NOT call sleep()/yield() on any other thread?
Because it is error-prone: if one thread could put another thread to sleep, the sleeping thread would not release its locks, and if the thread that is supposed to resume it needs a lock held by the sleeping thread, a deadlock-like situation occurs (not a true deadlock, since sleep does not sleep indefinitely like suspend and does not need a notification to wake up). The same reasoning applies to suspend().

Why must wait(), notify() and notifyAll() be called from a synchronized block or method in Java?
To avoid a race condition: suppose T1 checks some condition and decides to wait, a context switch happens, and T2 then satisfies the condition and issues notify(); the notification goes unheard because T1 is not yet waiting, so T1 misses it and still goes into wait(). In short, the lock must be acquired before waiting, so that no other thread can notify before the waiter has actually started waiting, which guarantees the notification signal is not missed.
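A minimal sketch of the standard guarded-wait pattern (the queue here is just illustrative):

import java.util.LinkedList;
import java.util.Queue;

public class GuardedQueue {
    private final Queue<String> items = new LinkedList<>();

    public synchronized void put(String item) {
        items.add(item);
        notifyAll(); //wake up any consumer waiting for an item.
    }

    public synchronized String take() throws InterruptedException {
        //The condition is re-checked in a loop to guard against spurious wakeups.
        while (items.isEmpty()) {
            wait(); //releases the monitor while waiting.
        }
        return items.remove();
    }
}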

How can we achieve thread safety in Java?
1. Synchronization
2. Atomic Classes
3. Immutable Classes
4. Lock Interface
5. Volatile
6. Thread safe classes


Why is Thread.stop() deprecated?
Because it is inherently unsafe. Stopping a thread causes it to unlock all the monitors that it has locked. (The monitors are unlocked as the ThreadDeath exception propagates up the stack). If any of the objects previously protected by these monitors were in an inconsistent state, other threads may now view these objects in an inconsistent state. Such objects are said to be damaged. When threads operate on damaged objects, arbitrary behaviour can result. This behaviour may be subtle and difficult to detect, or it may be pronounced. Unlike other unchecked exceptions, ThreadDeath kills threads silently; thus, the user has no warning that his program may be corrupted. The corruption can manifest itself at any time after the actual damage occurs, even hours or days in the future.

Couldn’t I just catch the ThreadDeath exception and fix the damaged object?

In theory, perhaps, but it would greatly complicate the task of writing correct multithreaded code, for two reasons:
1. A thread can throw a ThreadDeath exception almost anywhere. All synchronized methods and blocks would have to be studied in great detail, with this in mind.
2. A thread can throw a second ThreadDeath exception while cleaning up from the first (in the catch or finally clause). Cleanup would have to be repeated till it succeeded. The code to ensure this would be quite complex.

Why are Thread.suspend and Thread.resume deprecated?
Thread.suspend is deadlock-prone. If a thread is suspended while in a critical section, it continues to hold the monitor, and no other thread can enter that critical section until the suspended thread is resumed; if the thread that would resume it wants the same monitor, a deadlock occurs. Such a condition is known as a "frozen" state. Use wait and notify instead.

What about Thread.destroy?
It is similar to Thread.suspend without resume.

What happens when an exception occurs in a thread?
If the exception is not caught, the thread dies. If an uncaught-exception handler is registered, it will get a callback: the JVM queries the thread for its handler using Thread.getUncaughtExceptionHandler() and invokes the handler's uncaughtException() method, passing the thread and the exception as arguments.

What is FutureTask?
A cancellable asynchronous computation. This class provides a base implementation of Future: you can start and cancel a computation, query whether the computation is complete, and retrieve the result of the computation. get() will block until the result is available.
An ExecutorService takes a Callable task and returns a Future.


Runnable {
    public abstract void run();
}

Callable<V> {
    V call() throws Exception;
}

Future<V> {
    boolean isCancelled();
    boolean cancel(boolean mayInterruptIfRunning);
    boolean isDone();
    V get();
}


RunnableFuture extends both Runnable and Future, and FutureTask implements RunnableFuture.
FutureTask therefore implements the Future interface and gives us a way to start and cancel the task.
It is better to let the ExecutorService create the FutureTask, rather than wrapping a Callable in a FutureTask ourselves and handing it to the ExecutorService, since in the latter case there will be multiple references to the task.
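A small sketch of the usual pattern, where the ExecutorService wraps the Callable for us and hands back a Future (the task and its result are arbitrary):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureExample {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Callable<Integer> task = () -> {
            Thread.sleep(500); //simulate some work.
            return 42;
        };

        Future<Integer> future = executor.submit(task); //the executor creates the FutureTask internally.
        System.out.println("Result: " + future.get());  //blocks until the result is available.

        executor.shutdown();
    }
}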

Difference between interrupted() and isInterrupted()?
The former clears the interrupt status while the latter does not.
Thread.interrupt() sets this flag.
The status is also cleared when an InterruptedException is thrown.

Difference between livelock and deadlock in Java?
Livelock is a special case of resource starvation. If one thread acts in response to another thread, and that thread's action is in turn a response to the first, a livelock may result. The threads are not blocked; they are simply too busy responding to each other to resume work.

How to check if a thread holds a lock on a particular object in Java?
1. Invoke object.wait(): it will throw IllegalMonitorStateException if the thread does not hold the lock on that object.
2. Thread.holdsLock(Object obj) returns true if the current thread holds the monitor lock on the given object.

Difference between ReentrantLock and synchronized?
1. ReentrantLock has tryLock with a timeout; a synchronized block waits indefinitely.
2. ReentrantLock supports fairness.
3. A thread waiting for a ReentrantLock can be interrupted (lockInterruptibly()).
4. You can query the threads waiting for a ReentrantLock (e.g. getQueueLength()).

Difference between volatile and atomic?
Volatile provides a happens-before guarantee: a write to a volatile variable happens-before any subsequent read of that variable.
Atomic variables additionally provide the option to perform compound operations (such as increment or compare-and-set) atomically.

Difference between CountDownLatch and CyclicBarrier?
CountDownLatch cannot be reused once its count reaches zero; it is good for one-time events like startup.
CyclicBarrier can be reused by calling its reset() method; it suits recurring events, like computing intermediate solutions of a big problem. Another example is a multiplayer game that waits for all players to join (see the CyclicBarrierExample further below).
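A minimal CountDownLatch sketch (the worker count and messages are arbitrary):

import java.util.concurrent.CountDownLatch;

public class CountDownLatchExample {
    public static void main(String[] args) throws InterruptedException {
        final int workers = 3;
        final CountDownLatch latch = new CountDownLatch(workers);

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                System.out.println(Thread.currentThread().getName() + " finished startup work");
                latch.countDown(); //decrement the latch once this worker is done.
            }).start();
        }

        latch.await(); //block until the count reaches zero.
        System.out.println("All workers are ready, continuing main thread");
    }
}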

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class CyclicBarrierExample {
    private static class Task implements Runnable {
        private final CyclicBarrier barrier;

        Task(CyclicBarrier barrier) {
            this.barrier = barrier;
        }

        @Override
        public void run() {
            try {
                System.out.println("Waiting for barrier to breach");
                barrier.await();
                System.out.println("Thread " + Thread.currentThread().getId() + " has crossed the barrier");
            } catch (InterruptedException | BrokenBarrierException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        final CyclicBarrier cb = new CyclicBarrier(3, new Runnable() {
            @Override
            public void run() {
                System.out.println("All parties arrived at the barrier, let's play");
            }
        });
        Thread t1 = new Thread(new Task(cb));
        Thread t2 = new Thread(new Task(cb));
        Thread t3 = new Thread(new Task(cb));
        t1.start();
        t2.start();
        t3.start();
    }
}

If another thread calls reset() on the barrier, all awaiting threads will terminate with a BrokenBarrierException.
Important points about CyclicBarrier:
1. CyclicBarrier can run a completion task once all threads reach the barrier; this task is specified while creating the barrier.
2. If a CyclicBarrier is created with 3 parties, 3 threads need to call await() to break the barrier.
3. A thread blocks on await() until all parties reach the barrier, another thread interrupts it, or the await times out.

Why is String immutable?
1. The hashcode can be cached and Strings can be used safely as HashMap keys.
2. Security: Strings are used as file names, class names to load, and parameters in many Java classes, and mutating them after validation would be dangerous.
3. Strings can be shared safely between threads.
4. The String pool.

What happens if a thread throws an exception inside a synchronized block?
The thread will relinquish the lock no matter what. This does not happen automatically with the Lock interface, hence lock.unlock() should always be called in a finally block.
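A minimal sketch of that idiom with ReentrantLock (the shared counter is just illustrative):

import java.util.concurrent.locks.ReentrantLock;

public class LockUnlockExample {
    private final ReentrantLock lock = new ReentrantLock();
    private int counter;

    public void increment() {
        lock.lock();
        try {
            counter++; //may throw in real code; the lock is still released.
        } finally {
            lock.unlock(); //always release the lock, even if an exception occurs.
        }
    }
}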

Multithreading best practices?
1. Give meaningful names to threads.
2. Minimize locking and the scope of synchronization.
3. Prefer concurrent collections over synchronized collections.
4. Always release locks (lock.unlock()) in a finally block.
5. Prefer synchronizers over wait and notify.

Fork-join: uses a work-stealing algorithm and takes advantage of multiprocessor capabilities.

What is Semaphore in Java?
java.util.concurrent.Semaphore maintains a set of permits; a thread calls acquire() to take a permit before accessing a resource and release() to return it, so a semaphore can limit how many threads access a resource concurrently.
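A minimal sketch limiting concurrent access with a Semaphore (the permit count and simulated work are arbitrary):

import java.util.concurrent.Semaphore;

public class SemaphoreExample {
    private static final Semaphore PERMITS = new Semaphore(2); //at most 2 threads at a time.

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            new Thread(() -> {
                try {
                    PERMITS.acquire(); //blocks until a permit is available.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    System.out.println(Thread.currentThread().getName() + " acquired a permit");
                    Thread.sleep(200); //simulate work on the shared resource.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    PERMITS.release(); //return the permit.
                }
            }).start();
        }
    }
}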

What is Mutex?
A mutex (mutual exclusion lock) allows only one thread at a time into a critical section; a synchronized block, or a Semaphore with a single permit, behaves as a mutex.

Difference between Mutex and Semaphore?
1. A mutex is essentially the same thing as a binary semaphore.
2. A mutex has an owner; a semaphore does not.
3. A mutex can cope with priority inversion: since it knows its owner, the owner's priority can be raised (priority inheritance).
4. A mutex provides deletion safety: the owner of a mutex cannot be deleted.

Difference between Mutex and Monitor?
Scope: a mutex operates at the system level, while a monitor operates at the application level.

What is ThreadLocal?
Java ThreadLocal is used to create thread-local variables. All threads of an object share its instance variables, so if a variable is not thread-safe we can use synchronization; but if we want to avoid synchronization altogether, we can use ThreadLocal variables, which give each thread its own copy.

private static final ThreadLocal<String> localString = new ThreadLocal<String>() {
    @Override
    protected String initialValue() {
        return "Hello World - ThreadLocal";
    }
};
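Each thread that calls get() sees its own copy; a small usage sketch using the declaration above:

Runnable task = () -> {
    //The first get() in each thread returns initialValue(); set() changes only this thread's copy.
    System.out.println(Thread.currentThread().getName() + " -> " + localString.get());
    localString.set("Updated by " + Thread.currentThread().getName());
    System.out.println(Thread.currentThread().getName() + " -> " + localString.get());
};
new Thread(task).start();
new Thread(task).start();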

What is Thread Group?
1. Used in thread pools while creating new threads.
2. Can list the active threads in a group.
3. Can handle uncaught exceptions for threads in the group (by overriding uncaughtException).
4. Can interrupt all threads in the group.
5. Provides a custom way of creating new threads, e.g. with a specific name.

Reference:
https://docs.oracle.com/javase/8/docs/technotes/guides/concurrency/threadPrimitiveDeprecation.html

Knapsack Problem

The knapsack problem or rucksack problem is a problem in combinatorial optimization: Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.

The brute force approach is to form all possible combinations of items and consider only those subsets whose total weight is less than or equal to the total knapsack weight W; out of the chosen subsets, pick the subset whose value is maximum.

Given two integer arrays wt: [1, 2, 3] and val: [20, 5, 30] representing weights and values, below are the possible combinations of items along with their weight and value:

{1} -> wt: 1, value: 20
{2} -> wt: 2, value: 5
{3} -> wt: 3, value: 30
{1, 2} -> wt: 1 + 2 = 3, value: 20 + 5 = 25
{1, 3} -> wt: 1 + 3 = 4, value: 20 + 30 = 50
{2, 3} -> wt: 2 + 3 = 5, value: 5 + 30 = 35
{1, 2, 3} -> wt: 1 + 2 + 3 = 6, value: 20 + 5 + 30 = 55

Now depending on the total Knapsack Capacity, we can pick appropriate subsets.
For a knapsack capacity of 1, only the first subset {1} can be chosen.
However, for a knapsack capacity of 4, there are 5 subsets whose total weight is less than or equal to 4: {1}, {2}, {3}, {1, 2}, {1, 3}, out of which {1, 3} has the maximum value of 50, hence we will pick that subset.
The brute force algorithm evaluates all subsets, and the total number of subsets is 2^numOfItems.

To find the optimal solution with the maximum value, we have to consider each item, and for every item there are only two possible cases:
case 1: the item is included.
case 2: the item is NOT included.

We have to pick the case which results in the higher value:
case 1: If the item is included, the total value increases by that item's value and the knapsack capacity reduces by that item's weight.
case 2: If the item is NOT included, the value and knapsack capacity do NOT change.

After considering that item, we move on to the next item. The recursive solution for this approach is:

public static int knapsack(int W, int[] wt, int[] val, int numOfWts) {
    //base case: no items left or no remaining capacity.
    if (numOfWts == 0 || W <= 0) {
        return 0;
    }
    //item is heavier than the remaining capacity, skip it.
    if (wt[numOfWts - 1] > W) {
        return knapsack(W, wt, val, numOfWts - 1);
    }
    return Math.max(
        //case 1: take the item.
        val[numOfWts - 1] + knapsack(W - wt[numOfWts - 1], wt, val, numOfWts - 1),
        //case 2: leave the item.
        knapsack(W, wt, val, numOfWts - 1)
    );
}

Let us draw a recursion tree for the above recursive approach for a knapsack problem K(n, W), where n is the number of items and W is the knapsack capacity:

wt: {1, 1, 1} val: {1, 1, 1}, W=2
                         K(3, 2) 
                     /            \ 
                   /                \               
             K(2,2)                   K(2,1)
            /      \                  /    \ 
          /          \              /        \
       K(1,2)      K(1,1)        K(1,1)     K(1,0)
       /  \         /   \          /   \
     /      \     /       \      /       \
K(0,2)  K(0,1)  K(0,1)  K(0,0)  K(0,1)   K(0,0)

We are evaluating K(1, 1) twice, which shows that the subproblems overlap. Since the knapsack problem has both optimal substructure and overlapping subproblems, it can be solved using the DP approach of a memoization/lookup table.

The lookup table K[n][W] can be constructed bottom UP where n is number of items and W is total knapsack capacity.

To build the table, we start with 1st item and compute the maximum value with increasing knapsack capacity. We introduce a new item and recalculate maximum values again.

public static int knapsackDP(int W, int[] wt, int[] val) {
    int[][] K = new int[wt.length + 1][W + 1];

    for (int i = 0; i <= wt.length; i++) {
        for (int w = 0; w <= W; w++) {
            //base case, either 0 items or no knapsack capacity.
            if (i == 0 || w == 0) {
                K[i][w] = 0;
            }
            //this item can be considered.
            else if (wt[i - 1] <= w) {
                K[i][w] = Math.max(
                    //case 1: value of this item + best of the remaining items with reduced capacity.
                    val[i - 1] + K[i - 1][w - wt[i - 1]],
                    //case 2: without this item, value and capacity remain the same.
                    K[i - 1][w]
                );
            }
            //weight of this item is higher than the current knapsack capacity.
            else {
                K[i][w] = K[i - 1][w];
            }
        }
    }
    return K[wt.length][W];
}
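As a quick check against the earlier example (wt [1, 2, 3], val [20, 5, 30] and capacity 4 should give 50), a small usage sketch:

public static void main(String[] args) {
    int[] wt = {1, 2, 3};
    int[] val = {20, 5, 30};
    int W = 4;
    //Expected output: 50, by picking the items with weights 1 and 3.
    System.out.println(knapsackDP(W, wt, val));
}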

Knapsack lookup table for wt: {1, 1, 1, 1 , 1} and value: {1, 1, 1, 1, 1} and W=5 is:
Knapsack table:
[0 0 0 0 0 0 ]
[0 1 1 1 1 1 ]
[0 1 2 2 2 2 ]
[0 1 2 3 3 3 ]
[0 1 2 3 4 4 ]
[0 1 2 3 4 5 ]

Where row is item and column is knapsack capacity. When item is 0 (row = 0) or knapsack capacity is 0 (column = 0), the total knapsack maximum value is 0.

Hence 1st row and 1st column values are all 0s.
2nd row: there is only 1 item to consider, hence increasing the knapsack capacity does not increase the total value, which stays constant at 1.
3rd row: now we are considering 2 items, each of weight 1; from the 3rd column, when the total knapsack capacity is 2, both items can be put in the knapsack and the total value increases from 1 to 2, and it stays at 2 after that since both items are already included, hence the 3rd row is: [0 1 2 2 2 2]
Finally, for the last row, when all 5 items are available: with increasing knapsack capacity (column), one more item can be added to the knapsack at each step, thereby increasing its value: [0 1 2 3 4 5]

The above DP approach has a time complexity of O(nW)

Git:
algorithms/src/main/java/org/nmn/notes/algorithms/dp/Knapsack.java

Reference:
https://en.wikipedia.org/wiki/Knapsack_problem

Metrics For Your Server

Capturing metrics for a server is essential for the following reasons:

  • Capacity planning
  • Traffic pattern
  • Server performance
  • Debugging

A valve can be used to intercept all incoming requests and gather the metrics required for the server.

The following metrics should be gathered for a server:

  • Total number of calls.
  • Mean response time.
  • Total number of error counts.
  • Important API request calls.

The below flow chart describes a typical server metrics setup:

[Figure: Tomcat metrics valve flow chart]

  1. All incoming requests are intercepted by metrics valve.
  2. Metrics valve will compute response time along with other metrics parameter.
  3. Metrics valve will forward the request to the server
  4. Metrics valve will intercept the response to check the response status and increment the error count in case of an error.
  5. At regular intervals, metrics valve will POST data to the Telemetry server.
  6. Telemetry server will in turn POST data to a time-series monitoring server like Graphite.
  7. At any time, view the monitoring dashboard to check server status.
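A minimal sketch of what such a valve might look like (the class name and counters are illustrative, and posting to the Telemetry server is omitted), assuming Tomcat's standard Valve API:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import javax.servlet.ServletException;

import org.apache.catalina.connector.Request;
import org.apache.catalina.connector.Response;
import org.apache.catalina.valves.ValveBase;

public class TomcatMetricsValve extends ValveBase {

    private final AtomicLong totalCalls = new AtomicLong();
    private final AtomicLong errorCount = new AtomicLong();
    private final AtomicLong totalResponseTimeMillis = new AtomicLong();

    @Override
    public void invoke(Request request, Response response) throws IOException, ServletException {
        long start = System.currentTimeMillis();
        try {
            //forward the request down the pipeline to the application.
            getNext().invoke(request, response);
        } finally {
            totalCalls.incrementAndGet();
            totalResponseTimeMillis.addAndGet(System.currentTimeMillis() - start);
            if (response.getStatus() >= 400) {
                errorCount.incrementAndGet(); //count 4xx/5xx responses as errors.
            }
            //a background task would periodically POST these counters to the Telemetry server.
        }
    }
}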

A typical Graphite dashboard:

[Figure: a typical Graphite dashboard]

Implementation:

Git link of sample tomcat metrics valve implementation: Tomcat metrics valve

Valve can be configured using tomcat’s context.xml:

<Context>
  ...
  <Valve className="org.nmn.notes.valve.TomcatMetricsValve"/>
  ...
</Context>

Summary:

Capturing metrics of a server is essential in a production environment. Without proper metrics we will never know about sudden spikes in the system or how our service responds to such spikes. To do proper capacity management we should know the total number of requests each compute of the service is currently serving and the mean response time of each request.

There are lots of other parameters that can be captured, like the total number of threads, the heap size, and the size of any cache used by the service, which give very good insight into the health of the system.

The answer to “how your service is doing” should be a metrics dashboard which can show all critical parameters of that service.


Elastic Search

Elasticsearch is a highly scalable open-source full-text search and analytics engine. Elasticsearch is distributed by nature: it knows how to manage multiple nodes to provide scale and high availability.

Installation:

Download latest version of Elasticsearch (https://www.elastic.co/downloads)  or use curl command to download.

curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.4/elasticsearch-2.3.4.tar.gz

Then extract it as follows (Windows users should unzip the zip package):

tar -xvf elasticsearch-2.3.4.tar.gz

It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:

cd elasticsearch-2.3.4/bin

And now we are ready to start our node and single cluster (Windows users should run the elasticsearch.bat file):

./elasticsearch

If everything goes well, you should see a bunch of messages that look like below:

./elasticsearch
[2014-03-13 13:42:17,218][INFO ][node           ] [New Goblin] version[2.3.4], pid[2085], build[5c03844/2014-02-25T15:52:53Z]
[2014-03-13 13:42:17,219][INFO ][node           ] [New Goblin] initializing ...
[2014-03-13 13:42:17,223][INFO ][plugins        ] [New Goblin] loaded [], sites []
[2014-03-13 13:42:19,831][INFO ][node           ] [New Goblin] initialized
[2014-03-13 13:42:19,832][INFO ][node           ] [New Goblin] starting ...
[2014-03-13 13:42:19,958][INFO ][transport      ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.8.112:9300]}
[2014-03-13 13:42:23,030][INFO ][cluster.service] [New Goblin] new_master [New Goblin][rWMtGj3dQouz2r6ZFL9v4g][mwubuntu1][inet[/192.168.8.112:9300]], reason: zen-disco-join (elected_as_master)
[2014-03-13 13:42:23,100][INFO ][discovery      ] [New Goblin] elasticsearch/rWMtGj3dQouz2r6ZFL9v4g
[2014-03-13 13:42:23,125][INFO ][http           ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.8.112:9200]}
[2014-03-13 13:42:23,629][INFO ][gateway        ] [New Goblin] recovered [1] indices into cluster_state
[2014-03-13 13:42:23,630][INFO ][node           ] [New Goblin] started

Life inside a Cluster

If we start a single node, with no data and no indices, our cluster looks like this:

[Figure: an empty cluster]

A node is a running instance of Elasticsearch, while a cluster consists of one or more nodes with the same cluster.name that are working together to share their data and workload. As nodes are added or removed from the cluster, the cluster reorganizes itself to spread the data evenly.

Master Node (manages cluster-wide changes):

  •      Creating or deleting an index
  •      Adding or removing a node from the cluster.

As users, we can talk to any node in the cluster, including the master node. Every node knows where each document lives and can forward our request directly to the nodes that hold the data we are interested in. Whichever node we talk to manages the process of gathering the response from the node or nodes holding the data and returning the final response to the client.

Storing Data
  • An index is just a “logical namespace” which points to one or more physical shards.
  • A shard is a low-level “worker unit” which holds just a slice of all the data in the index, it is a single instance of Lucene, and is a complete search engine in its own right. Our documents are stored and indexed in shards, but our applications don’t talk to them directly. Instead, they talk to an index.
  • Shards are how Elasticsearch distributes data around your cluster. Think of shards as containers for data. Documents are stored in shards, and shards are allocated to nodes in your cluster. As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced.
  • The number of primary shards in an index is fixed at the time an index is created, but the number of replica shards can be changed at any time. By default, indices are assigned five primary shards.
  • Any newly indexed document will first be stored on a primary shard, and then copied in parallel to the associated replica shard(s).

Elasticsearch Cluster

An empty cluster with no indices will return the following cluster health.
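The health can be retrieved with the cluster health API, for example:

curl -XGET 'http://localhost:9200/_cluster/health?pretty'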

Cluster health:

{
   "cluster_name":          "elasticsearch",
   "status":                "green",
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards"0,
   "active_shards":         0,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}

The status field provides an overall indication of how the cluster is functioning. The meanings of the three colors are provided here for reference:

green: All primary and replica shards are active.

yellow: All primary shards are active, but not all replica shards are active.

red: Not all primary shards are active.

Now, create an index with 3 primary shards and one replica (one replica of every primary shard):

Index creation:

{
   "settings": {
      "number_of_shards":   3,
      "number_of_replicas": 1
   }
}
A single-node cluster with an index
[Figure: a single-node cluster with an index]
Cluster health:
{
   "cluster_name":          "elasticsearch",
   "status":                "yellow",
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards"3,
   "active_shards":         3,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     3
}

Cluster status is yellow  with three unassigned shards.

Our three replica shards have not been allocated a node.

Add failover by starting another node, since with only one node there is no redundancy and it is a single point of failure.

[Figure: a two-node cluster]

Cluster health is green.

Cluster health:

{
   "cluster_name":          "elasticsearch",
   "status":                "green",
   "timed_out":             false,
   "number_of_nodes":       2,
   "number_of_data_nodes":  2,
   "active_primary_shards"3,
   "active_shards":         6,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}
To scale horizontally, add another node to the cluster:
[Figure: a three-node cluster]

One shard each from Node 1 and Node 2 have moved to the new Node 3, and we have two shards per node instead of 3. This means that the hardware resources (CPU, RAM, I/O) of each node are being shared among fewer shards, allowing each shard to perform better.

Increasing the number of replicas to 2:

[Figure: a three-node cluster with two replicas]

This would improve search performance, since read requests can be handled by a primary or a replica shard. The maximum we can scale out to is by adding 6 more nodes, making a cluster of 9 nodes, with each shard having access to 100% of its node's resources.

Coping with failure:

Cluster after killing one node:

[Figure: the cluster after killing one node]

Cluster health is red. Primary shards 1 and 2 were lost when Node 1 was killed. Node 2 becomes the master and promotes the replicas of these shards on Node 2 and Node 3 to primaries, bringing the cluster health back to yellow.

Configuration:

discovery.zen.minimum_master_nodes: should be set to N/2 + 1 to avoid split brain problem.

discovery.zen.ping.timeout: the default value is 3 seconds; it determines how long a node will wait for a response from other nodes in the cluster before assuming that the node has failed. Slightly increase the default value for slower networks.


Glossary of terms:

cluster

A cluster consists of one or more nodes which share the same cluster name. Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.

index

An index is like a database in a relational database. It has a mapping which defines multiple types. An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.

node

A node is a running instance of elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server for testing purposes, but usually you should have one node per server. At startup, a node will use unicast (or multicast, if specified) to discover an existing cluster with the same cluster name and will try to join that cluster.

primary shard

Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard. By default, an index has 5 primary shards. You can specify fewer or more primary shards to scale the number of documents that your index can handle. You cannot change the number of primary shards in an index, once the index is created. See also routing

replica shard

Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes:

  1. increase failover: a replica shard can be promoted to a primary shard if the primary fails
  2. increase performance: get and search requests can be handled by primary or replica shards. By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.

shard

A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by elasticsearch. An index is a logical namespace which points to primary and replica shards. Other than defining the number of primary and replica shards that an index should have, you never need to refer to shards directly. Instead, your code should deal only with an index. Elasticsearch distributes shards amongst all nodes in the cluster, and can move shards automatically from one node to another in the case of node failure, or the addition of new nodes.

 

Reference:

https://www.elastic.co/guide/en/elasticsearch/guide/current/getting-started.html