ONOS / ONOS-3733

Fix cluster communication issues noted during failure testing on LXC

    Details

    • Type: Story
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: None
    • Labels:
    • Story Points: 5
    • Epic Link:

      Description

      When ONOS runs in Linux Containers (or even Docker), taking a node down puts the entire cluster into a bad state. For example, running the "nodes" CLI command shows mutually conflicting node status: node1 reports node2 as up, while node2 reports node1 as down.

      A quick look at the Java thread dumps from the running nodes showed several threads blocked waiting for the connect call (at the Netty level) to complete. The problem arises because connections are established synchronously: if a connect call takes a long time to fail, several threads end up stuck trying to open connections to the down node.

      A better solution is to establish connections asynchronously and not block the caller thread.
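
The idea can be sketched as follows. This is a minimal illustration, not the ONOS fix: it uses the JDK's `AsynchronousSocketChannel` rather than Netty so that it stays self-contained, and the class name `AsyncConnector` and its `connect` method are hypothetical. The caller gets a `CompletableFuture` back immediately and is notified on success or failure instead of waiting inside the connect call.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper (not ONOS code): starts a connection attempt and
// returns immediately; the outcome is delivered through the future.
final class AsyncConnector {

    private AsyncConnector() {}

    // Note: InetSocketAddress resolves host names eagerly, so pass an IP
    // address here to avoid a blocking DNS lookup on the caller thread.
    static CompletableFuture<AsynchronousSocketChannel> connect(String host, int port) {
        CompletableFuture<AsynchronousSocketChannel> future = new CompletableFuture<>();
        final AsynchronousSocketChannel channel;
        try {
            channel = AsynchronousSocketChannel.open();
        } catch (IOException e) {
            future.completeExceptionally(e);
            return future;
        }
        try {
            // The handler runs on a channel-group thread once the attempt finishes.
            channel.connect(new InetSocketAddress(host, port), null,
                new CompletionHandler<Void, Void>() {
                    @Override
                    public void completed(Void result, Void attachment) {
                        future.complete(channel);
                    }

                    @Override
                    public void failed(Throwable cause, Void attachment) {
                        closeQuietly(channel);
                        future.completeExceptionally(cause);
                    }
                });
        } catch (RuntimeException e) {
            // For example, an unresolved address is rejected synchronously.
            closeQuietly(channel);
            future.completeExceptionally(e);
        }
        return future;
    }

    private static void closeQuietly(AsynchronousSocketChannel channel) {
        try {
            channel.close();
        } catch (IOException ignored) {
            // Nothing useful to do if close itself fails.
        }
    }
}
```

A caller would attach a callback, for example `AsyncConnector.connect("10.0.0.2", 9876).whenComplete((ch, err) -> { ... })` (address and port are placeholders), so a slow failure against a down node ties up no caller threads.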

            People

            Assignee: Aaron Kruglikov
            Reporter: Madan Jampani
            Votes: 0
            Watchers: 2