ONOS-5347

ONOS cluster unable to recover after killing one of the cluster members

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.7.0, 1.8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Environment:

      Test Env: 3 node cluster
      Nodes running on LXC containers
      ONOS ver: ONOS 1.8 (Master)

      Also tested the scenario with ONOS 1.7 release

    • Story Points:
      8
    • Epic Link:
    • Sprint:
      Ibis Sprint 2 - Platform, Ibis Sprint 3 - Platform, Junco Sprint #1 - Platform, Junco Sprint #2 - Platform

      Description

      Issue observed during cluster capability testing of ONOS.
      Test Env: 3 node cluster (running in LXC containers)
      ONOS ver: ONOS 1.8 (Master)
      Steps:

      • All three nodes are up and running; ONOS runs in an LXC container on each node.
      • Killed one of the instances.

      This resulted in failure of the entire cluster, i.e.:

      • ConsistentMapTimeout exceptions on the other two nodes.
      • The GUI is not working on any node.
      • On the ONOS CLI of the other nodes, commands fail as shown below:

      onos> masters
      Error executing command: org.onosproject.store.service.StorageException$Timeout

      Logs from all three nodes are attached.
      The 10.0.3.11 log corresponds to the killed instance (after restart).

      I see the following two issues:

      1. Killing/crashing one ONOS instance results in the entire cluster going down.
      2. (On the killed node) the app cannot be loaded from disk.

      Additional info: upon hitting the "org.onosproject.store.service.StorageException$Timeout" exception in other scenarios, the entire cluster appears to become non-functional.

      Detailed observations from testing with ONOS 1.7 & 1.8 were also shared with the dev community group:
      https://groups.google.com/a/onosproject.org/forum/#!topic/onos-dev/e6EtzPrB1Pw

      Attachments

      1. 0001-debug.patch
        25 kB
        Thiago Sousa Santos
      2. 0001-debug-logs-cluster.patch
        14 kB
        Thiago Sousa Santos
      3. 10_0_3_10_karaf.log
        1.66 MB
        Srinivas Bandi
      4. 10_0_3_11_karaf.log
        653 kB
        Srinivas Bandi
      5. 10_0_3_12_karaf.log
        4.08 MB
        Srinivas Bandi
      6. onos1.log.bz2
        830 kB
        Thiago Sousa Santos

        Issue Links


          Activity

          Jon Hall added a comment -

          Can you try this again with the following changes?

          • git clone both the atomix and copycat repos, which can be found here: https://github.com/atomix
          • Check out and build (using Maven) the latest master of Copycat, then Atomix. Note that Atomix depends on Copycat.
          • Create a symlink in the onos/lib folder to atomix/all/target/atomix-all-1.0.0-SNAPSHOT.jar
          • Then make the following change to lib/BUCK to use the new Atomix version:
            diff --git i/lib/BUCK w/lib/BUCK
            index 50c00fa..a554275 100644
            --- i/lib/BUCK
            +++ w/lib/BUCK
            @@ -131,12 +131,11 @@ remote_jar (
               visibility = [ 'PUBLIC' ],
             )
            
            -remote_jar (
            +
            +prebuilt_jar (
               name = 'atomix',
            -  out = 'atomix-all-1.0.0-rc9.jar',
            -  url = 'mvn:io.atomix:atomix-all:jar:1.0.0-rc9',
            -  sha1 = '05a5ff76bba888856416b3132e1c45626f0fcaaf',
            -  maven_coords = 'io.atomix:atomix-all:1.0.0-rc9',
            +  binary_jar = 'atomix-all-1.0.0-SNAPSHOT.jar',
            +  maven_coords = 'io.atomix:atomix-all:1.0.0-SNAPSHOT',
               visibility = [ 'PUBLIC' ],
             )
            
          • Then build ONOS using Buck and rerun your scenario.

          There are a couple of bug fixes that have been merged since the last Atomix/Copycat RCs. I think at least one of them is relevant to this situation, although I still see issues with node failures around 50% of the time when running with only one partition.

          Srinivas Bandi added a comment -

          The detailed analysis done so far for this issue has been shared with the dev community for further help.

          https://groups.google.com/a/onosproject.org/forum/#!topic/onos-dev/PhIloZCBYXs

          Thomas Vachuska added a comment -

          I am unable to reproduce this issue on a 3-node LXC cluster in our lab. I'm using the 'stc ha-single-node' scenario.

          Varun Sharma added a comment -

          Hi Thomas,

          We tried reproducing this issue last Friday and were able to reproduce it.
          On the server, the frequency is very low, but it did occur.
          On the laptop, we can reproduce the issue.

          Please let us know if you are looking for any specific logs; we can try recreating the issue.

          Harold Huang added a comment - edited

          Yes, I can also reproduce this issue on the server in my lab. It doesn't always crash when I kill one of the cluster members, but it does happen sometimes. And when I restore the crashed member to heal the cluster, it sometimes fails and all the nodes in the cluster become unavailable.

          Thiago Sousa Santos added a comment - edited

          This refers to what I see in the onos1 log I attached. This is a 3-node cluster. I'm also attaching the patches for Copycat and ONOS that I used to add more debugging where I thought it would be useful.

          12:53:28,600: A new `masters` query is started. At 12:53:28,602 the Leadership query is sent to different sessions, including session 10, which is currently handled by ONOS3 (172.17.0.4):

          40 - io.atomix.all - 1.0.0.SNAPSHOT | 10 - Sending 1537 QueryRequest[session=10, sequence=218, index=797, query=InstanceQue
          79 - org.onosproject.onos-core-primitives - 1.7.1 | Sending message: -5747432313632212825 /172.17.0.4:9876 onos-copycat-1

          The query goes to ONOS3, which bounces it back to the leader (ONOS1); it is properly answered and gets back to the client, as seen in the ONOS1 logs (at the bottom).

          At 12:53:28,687 the QueryResponse is received, but it is held in the ClientSequencer because it needs to wait for previous responses to complete:

          10 a860708c-a208-482d-b7a8-a56fd0a02a71 Adding sequence response 1537 -> 1525 QueryRe

          This means it is waiting at response sequence 1525, and it will eventually time out. Let's go back and check what happened to query 1526.

          It can be found at 12:53:23,705 and it isn't processed because it needs the log to be at index 799, but it is at index 789:

          10 a860708c-a208-482d-b7a8-a56fd0a02a71 Complete response no-callback 789 799

          Then, we need to check what happened with the Command that would take this session from 789 to 799. It can be found at 12:53:23,621:

          10 a860708c-a208-482d-b7a8-a56fd0a02a71 - COMMIT 799 799 io.atomix.copycat.server.sta

          but the ServerSessionContext had no connection, so it stores the PublishEvent to be sent later. Except that session 10 has no connection to any of the 3 nodes in the cluster, and no PublishEvent was generated for this commit.

          At this point I don't know exactly what to expect. Session 10 never seems to get connected again, or the wait for the commit-799 event is never cleared. Where should I continue looking? I can reproduce this frequently, so let me know if I should try to observe specific aspects of those cluster sessions/messages.

          Let me know if this description is enough. My scenario is that I keep constantly issuing the "masters" command in the CLI of ONOS1 while it is operating under a very heavy network load.
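
          To make the sequencing behaviour described above easier to follow, here is a small, self-contained Java sketch of sequence-ordered response delivery: responses that arrive out of order are buffered until every earlier sequence number has been delivered, so one missing earlier response blocks everything behind it. This is only an illustration of the mechanism, not Copycat's actual ClientSequencer code; all class and method names here are invented for the example.

          import java.util.SortedMap;
          import java.util.TreeMap;
          import java.util.function.Consumer;

          /**
           * Illustration only: delivers responses strictly in sequence order.
           * A response that arrives early (e.g. 1537 while 1526..1536 are still
           * outstanding) is buffered and only delivered once every earlier
           * sequence has been delivered. If an earlier response never arrives,
           * everything buffered behind it waits and eventually times out at a
           * higher layer, matching the behaviour described in this comment.
           */
          public class SequencedResponses<T> {

              private final SortedMap<Long, T> pending = new TreeMap<>();
              private final Consumer<T> deliver;
              private long nextSequence;

              public SequencedResponses(long firstSequence, Consumer<T> deliver) {
                  this.nextSequence = firstSequence;
                  this.deliver = deliver;
              }

              /** Called when a response with the given sequence number arrives. */
              public synchronized void onResponse(long sequence, T response) {
                  pending.put(sequence, response);
                  // Drain the buffer as long as the next expected sequence is present.
                  while (!pending.isEmpty() && pending.firstKey() == nextSequence) {
                      deliver.accept(pending.remove(nextSequence));
                      nextSequence++;
                  }
              }

              public static void main(String[] args) {
                  SequencedResponses<String> sequencer =
                      new SequencedResponses<>(1526, r -> System.out.println("delivered " + r));
                  sequencer.onResponse(1537, "QueryResponse#1537"); // buffered: 1526..1536 missing
                  sequencer.onResponse(1526, "QueryResponse#1526"); // delivered immediately
                  // 1537 stays buffered until 1527..1536 complete; if one of them
                  // never completes (the stuck session above), 1537 never gets out.
              }
          }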

          Thiago Sousa Santos added a comment -

          Patches used in ONOS and Copycat to increase debug detail.

          Thiago Sousa Santos added a comment -

          More information from another round of the same failure; this could be the root of the issue:

          From client 3's perspective:
          1) Client 3 tries to connect to ONOS3 (172.17.0.4 - Follower) and it times out:

          2016-12-09 00:59:00,216 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | e65c9467-bcda-4557-ac5a-3d3d8d470a1c - Connecting to /172.17.0.4:9876
          2016-12-09 00:59:00,218 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | Sending ConnectRequest[client=e65c9467-bcda-4557-ac5a-3d3d8d470a1c]
          2016-12-09 00:59:00,218 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Sending message: 5474277447177789744 /172.17.0.4:9876 onos-copycat-3
          2016-12-09 00:59:00,478 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Received response: 5474277447177789744 java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Timedout waiting for reply

          2) It tries the next in the list -> ONOS2 (172.17.0.3 - Leader) and it connects successfully:

          2016-12-09 00:59:00,478 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | e65c9467-bcda-4557-ac5a-3d3d8d470a1c - Connecting to /172.17.0.3:9876
          2016-12-09 00:59:00,478 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Sending message: -7630646592334371938 /172.17.0.3:9876 onos-copycat-3
          2016-12-09 00:59:00,570 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Received response: -7630646592334371938 GOOD
          2016-12-09 00:59:00,570 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | Received ConnectResponse[status=OK, leader=/172.17.0.3:9876, members=[/172.17.0.2:9876, /172.17.0.4:9876, /172.17.0.3:9876]]
          2016-12-09 00:59:00,571 | DEBUG | 9876-partition-2 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Received AppendRequest[term=3, leader=-1408161302, logIndex=1457, logTerm=3, entries=[0], commitIndex=1457, globalIndex=1454]
          2016-12-09 00:59:00,571 | DEBUG | 9876-partition-2 | ServerStateMachine | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Applying ConnectEntry[index=1457, term=3, client=32b4932b-73bd-4a54-9395-fcccd39aa1d8, address=/172.17.0.3:9876, timestamp=1481245140529]
          2016-12-09 00:59:00,571 | DEBUG | 9876-partition-2 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Sent AppendResponse[status=OK, term=3, succeeded=true, logIndex=1457]

          3) Sometime later this is received:

          2016-12-09 00:59:01,456 | DEBUG | 9876-partition-3 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Received AppendRequest[term=3, leader=-1408161302, logIndex=1307, logTerm=3, entries=[1], commitIndex=1307, globalIndex=1291]
          2016-12-09 00:59:01,456 | DEBUG | 9876-partition-3 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Appended ConnectEntry[index=1308, term=3, client=e65c9467-bcda-4557-ac5a-3d3d8d470a1c, address=/172.17.0.4:9876, timestamp=1481245141410] to log at index 1308
          2016-12-09 00:59:01,456 | DEBUG | 9876-partition-3 | ServerSessionManager | 40 - io.atomix.all - 1.0.0.SNAPSHOT | Register address e65c9467-bcda-4557-ac5a-3d3d8d470a1c /172.17.0.4:9876

          What is happening is:
          The first attempt to the follower ONOS3 is delayed for some reason, and the timeout on Client 3's side makes it open a new connection to ONOS2 (the leader), which gets fully set up even before ONOS3 sends the AcceptRequest to ONOS2. So, from ONOS2's perspective, the connection via ONOS3 was made after the one it received directly, and the log entries end up in the wrong order. Meanwhile, Client 3 is actually connected to ONOS2, but the log entries and the server-side information believe it is connected to ONOS3. So this client stops working.

          I confirmed this by looking at the ONOS2 and ONOS3 logs: indeed, the leader's logs show that it first replies to Client 3's connection and only later receives the AcceptRequest from ONOS3.
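
          To illustrate the ordering problem described above, here is a minimal, self-contained Java sketch. It assumes (as the quoted logs suggest) that client connect records are applied in log order with last-writer-wins semantics for a client's registered server address. The class, method names, and log indices are invented for the illustration; this is not Atomix/Copycat code.

          import java.util.HashMap;
          import java.util.Map;

          /**
           * Illustration only: a registry that applies client connect records in
           * log order, keeping the last server address seen per client. If a
           * delayed registration via the follower (ONOS3, 172.17.0.4) is committed
           * after the client has already re-connected to the leader (ONOS2,
           * 172.17.0.3), the stale address wins, and events for that client are
           * published to a server it is no longer connected to.
           */
          public class ConnectRegistry {

              // clientId -> server address the cluster believes the client is connected to
              private final Map<String, String> clientAddress = new HashMap<>();

              /** Apply a connect record in log order (last writer wins). */
              public void applyConnect(long logIndex, String clientId, String serverAddress) {
                  clientAddress.put(clientId, serverAddress);
                  System.out.printf("index %d: client %s registered at %s%n",
                          logIndex, clientId, serverAddress);
              }

              public String addressOf(String clientId) {
                  return clientAddress.get(clientId);
              }

              public static void main(String[] args) {
                  ConnectRegistry registry = new ConnectRegistry();
                  String client = "e65c9467"; // abbreviated client id from the logs above

                  // Illustrative log indices, not the real ones from the karaf logs.
                  // 1) The client's re-connection to the leader ONOS2 is committed first.
                  registry.applyConnect(100, client, "/172.17.0.3:9876"); // actual connection
                  // 2) The delayed registration forwarded by ONOS3 is committed later
                  //    and overwrites the correct address.
                  registry.applyConnect(101, client, "/172.17.0.4:9876"); // stale

                  // The cluster now believes the client is reachable via ONOS3, while
                  // it is really connected to ONOS2, so events never reach it.
                  System.out.println("cluster thinks client is at: " + registry.addressOf(client));
              }
          }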


            People

             • Assignee:
               Jon Hall
             • Reporter:
               Srinivas Bandi
             • Votes:
               0
             • Watchers:
               8

              Dates

               • Created:
               • Updated:
