When running in Linux Containers (or even Docker), taking a node down puts the entire cluster into a bad state. For example, the "nodes" CLI command shows mutually conflicting node status: node1 thinks node2 is up, but node2 thinks node1 is down.
A quick look at the Java thread dumps on the running nodes showed several threads blocked waiting for the connect call (at the Netty level) to complete. The problem arises because we establish connections synchronously. If the connect call takes a while to complete (with a failure), we end up with several threads stuck trying to open connections to the down node.
A better solution is to establish connections asynchronously and not block the caller thread.
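
A minimal sketch of the idea follows. It is not the actual fix. To stay self-contained it uses the JDK's AsynchronousSocketChannel instead of Netty, and the names AsyncConnector and connectAsync are hypothetical. The caller gets a future back immediately; the outcome of the connect attempt, success or failure, is delivered later on a channel-group thread.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the original code base.
final class AsyncConnector {

    // Starts a connection attempt and returns right away. The returned
    // future is completed by the channel's completion handler, so the
    // calling thread never waits on the connect itself.
    static CompletableFuture<AsynchronousSocketChannel> connectAsync(InetSocketAddress peer) {
        CompletableFuture<AsynchronousSocketChannel> result = new CompletableFuture<>();
        final AsynchronousSocketChannel channel;
        try {
            channel = AsynchronousSocketChannel.open();
        } catch (IOException e) {
            result.completeExceptionally(e);
            return result;
        }
        channel.connect(peer, null, new CompletionHandler<Void, Void>() {
            @Override
            public void completed(Void unused, Void attachment) {
                result.complete(channel);
            }

            @Override
            public void failed(Throwable cause, Void attachment) {
                // Release the socket before reporting the failure.
                try {
                    channel.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if close fails here.
                }
                result.completeExceptionally(cause);
            }
        });
        return result;
    }

    public static void main(String[] args) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // A local listener stands in for a reachable peer node.
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            InetSocketAddress peer = new InetSocketAddress(loopback, server.getLocalPort());
            CompletableFuture<AsynchronousSocketChannel> pending = connectAsync(peer);
            // The caller is free to do other work here; this demo simply
            // waits for the result with an upper bound.
            AsynchronousSocketChannel ch = pending.get(5, TimeUnit.SECONDS);
            System.out.println("connected: " + ch.isOpen());
            ch.close();
        }
    }
}
```

With this shape, a slow failure against a down node ties up no caller threads: the failure arrives through the future, and callers can react to it in a callback (for example with whenComplete) instead of waiting on the connect.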