ONOS / ONOS-5347

ONOS cluster not able to recover after killing one of the cluster members

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.7.0, 1.8.0
    • Fix Version/s: 1.9.0, 1.8.2
    • Component/s: None
    • Environment:

      Test Env: 3 node cluster
      Nodes running on LXC containers
      ONOS ver: ONOS 1.8 (Master)

      Also tested the scenario with ONOS 1.7 release

    • Story Points:
      8
    • Sprint:
      Ibis Sprint 2 - Platform, Ibis Sprint 3 - Platform, Junco Sprint #1 - Platform, Junco Sprint #2 - Platform, Junco Sprint #3 - Platform

      Description

      Issue during cluster capability testing of ONOS.
      Test Env: 3 node cluster (running in LXC containers)
      ONOS ver: ONOS 1.8 (Master)
      Steps:

      • All three nodes are up and running; ONOS is running in LXC containers.
      • Killed one of the instances.

      This resulted in failure of the entire cluster, i.e.:

      • ConsistentMapTimeout exceptions on the other two nodes.
      • The GUI is not working on any node.
      • On the ONOS CLI of the other nodes, the output is as below:

      onos> masters
      Error executing command: org.onosproject.store.service.StorageException$Timeout

      Logs from all three nodes are attached.
      The 10.0.3.11 log corresponds to the killed instance (after restart).

      I see the following two issues:

      1. Killing/crashing one instance of ONOS brings the entire cluster down.
      2. (On the killed node) unable to load the app from disk.

      Additional Info: Upon hitting the "org.onosproject.store.service.StorageException$Timeout" exception in other scenarios as well, the entire cluster appears to become non-functional.

      Detailed observations from testing with ONOS 1.7 & 1.8 were also shared in the dev community group:
      https://groups.google.com/a/onosproject.org/forum/#!topic/onos-dev/e6EtzPrB1Pw
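
      For reference, a minimal shell sketch of the reproduction described above (the container name is a hypothetical placeholder; assumes LXC 1.x tooling and the default ONOS Karaf SSH port 8101):

      # Stop one cluster member by killing its LXC container (name is a placeholder)
      sudo lxc-stop -n onos-node-11 --kill

      # From a surviving node, check cluster membership and mastership via the ONOS CLI
      ssh -p 8101 karaf@10.0.3.10 nodes
      ssh -p 8101 karaf@10.0.3.10 masters    # fails with StorageException$Timeout when the bug triggers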

      Attachments

      1. 0001-debug.patch
        25 kB
        Thiago Sousa Santos
      2. 0001-debug-logs-cluster.patch
        14 kB
        Thiago Sousa Santos
      3. 10_0_3_10_karaf.log
        1.66 MB
        Srinivas Bandi
      4. 10_0_3_11_karaf.log
        653 kB
        Srinivas Bandi
      5. 10_0_3_12_karaf.log
        4.08 MB
        Srinivas Bandi
      6. onos1.log.bz2
        830 kB
        Thiago Sousa Santos


          Activity

          Thiago Sousa Santos added a comment (edited)

          This refers to what I see in the onos1 log I attached. This is a 3 node cluster. I'm also attaching the patches for copycat and onos that I used to put more debugging where I thought it would be useful.

          12:53:28,600: A new `masters` query is started. At 12:53:28,602 the Leadership query is sent to different sessions, including session 10, which is currently handled by ONOS3 (172.17.0.4).

          40 - io.atomix.all - 1.0.0.SNAPSHOT | 10 - Sending 1537 QueryRequest[session=10, sequence=218, index=797, query=InstanceQue
          79 - org.onosproject.onos-core-primitives - 1.7.1 | Sending message: -5747432313632212825 /172.17.0.4:9876 onos-copycat-1

          The query goes to ONOS3, ONOS3 bounces it back to the leader (ONOS1), and it is properly answered; the response gets back to the client, as seen in the ONOS1 logs (at the bottom).

          At 12:53:28,687 the QueryResponse is received, but it is kept in the ClientSequencer because it needs to wait for previous responses to complete first.

          10 a860708c-a208-482d-b7a8-a56fd0a02a71 Adding sequence response 1537 -> 1525 QueryRe

          This means it is waiting at response sequence 1525, and then it will time out. Let's go back and check what happened to query 1526.

          It can be found at 12:53:23,705, and it isn't processed because it needs the log to be at index 799, but it is only at index 789:

          10 a860708c-a208-482d-b7a8-a56fd0a02a71 Complete response no-callback 789 799

          Then, we need to check what happened with the Command that would take this session from 789 to 799. It can be found at 12:53:23,621:

          10 a860708c-a208-482d-b7a8-a56fd0a02a71 - COMMIT 799 799 io.atomix.copycat.server.sta

          but the ServerSessionContext had no connection, so it stores the PublishEvent to be sent later. Except that session 10 has no connection to any of the 3 nodes in the cluster, and no PublishEvent was generated for this commit.

          At this point I don't know exactly what to expect: session 10 never seems to get connected again, or the wait for the commit-799 event is never cleared. Where should I continue looking? I can reproduce this frequently, so let me know if I should try to observe some specific aspects of those cluster sessions/messages.

          Let me know if this description is enough. My scenario is that I keep constantly issuing the "masters" command in the CLI of ONOS1 while it is working under a very heavy load on a network.
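
          For anyone retracing this analysis in the attached onos1.log.bz2, the entries for this session can be pulled out by filtering on the client UUID quoted above (adjust the path to wherever the attachment is extracted):

          # Follow session 10 by its client UUID through the attached log
          bzcat onos1.log.bz2 | grep 'a860708c-a208-482d-b7a8-a56fd0a02a71'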

          Thiago Sousa Santos added a comment

          Patches used in onos and copycat to increase debug detail.

          Thiago Sousa Santos added a comment

          More information from another round of the same failure; this could be the root of the issue:

          From client 3's perspective:
          1) Client 3 tries to connect to ONOS3 (172.17.0.4 - Follower) and it times out:

          2016-12-09 00:59:00,216 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | e65c9467-bcda-4557-ac5a-3d3d8d470a1c - Connecting to /172.17.0.4:9876
          2016-12-09 00:59:00,218 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | Sending ConnectRequest[client=e65c9467-bcda-4557-ac5a-3d3d8d470a1c]
          2016-12-09 00:59:00,218 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Sending message: 5474277447177789744 /172.17.0.4:9876 onos-copycat-3
          2016-12-09 00:59:00,478 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Received response: 5474277447177789744 java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Timedout waiting for reply

          2) It tries the next in the list -> ONOS2 (172.17.0.3 - Leader) and it connects successfully:

          2016-12-09 00:59:00,478 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | e65c9467-bcda-4557-ac5a-3d3d8d470a1c - Connecting to /172.17.0.3:9876
          2016-12-09 00:59:00,478 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Sending message: -7630646592334371938 /172.17.0.3:9876 onos-copycat-3
          2016-12-09 00:59:00,570 | DEBUG | ycat-client-io-1 | CopycatTransportConnection | 79 - org.onosproject.onos-core-primitives - 1.7.1 | Received response: -7630646592334371938 GOOD
          2016-12-09 00:59:00,570 | DEBUG | ycat-client-io-1 | ClientConnection | 40 - io.atomix.all - 1.0.0.SNAPSHOT | Received ConnectResponse[status=OK, leader=/172.17.0.3:9876, members=[/172.17.0.2:9876, /172.17.0.4:9876, /172.17.0.3:9876]]
          2016-12-09 00:59:00,571 | DEBUG | 9876-partition-2 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Received AppendRequest[term=3, leader=-1408161302, logIndex=1457, logTerm=3, entries=[0], commitIndex=1457, globalIndex=1454]
          2016-12-09 00:59:00,571 | DEBUG | 9876-partition-2 | ServerStateMachine | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Applying ConnectEntry[index=1457, term=3, client=32b4932b-73bd-4a54-9395-fcccd39aa1d8, address=/172.17.0.3:9876, timestamp=1481245140529]
          2016-12-09 00:59:00,571 | DEBUG | 9876-partition-2 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Sent AppendResponse[status=OK, term=3, succeeded=true, logIndex=1457]

          3) Sometime later this is received:

          2016-12-09 00:59:01,456 | DEBUG | 9876-partition-3 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Received AppendRequest[term=3, leader=-1408161302, logIndex=1307,
          logTerm=3, entries=[1], commitIndex=1307, globalIndex=1291]
          2016-12-09 00:59:01,456 | DEBUG | 9876-partition-3 | FollowerState | 40 - io.atomix.all - 1.0.0.SNAPSHOT | /172.17.0.2:9876 - Appended ConnectEntry[index=1308, term=3, client=e65c9467-bcda-4557-ac5a-3d3d8d470a1c, address=/172.17.0.4:9876, timestamp=1481245141410] to log at index 1308
          2016-12-09 00:59:01,456 | DEBUG | 9876-partition-3 | ServerSessionManager | 40 - io.atomix.all - 1.0.0.SNAPSHOT | Register address e65c9467-bcda-4557-ac5a-3d3d8d470a1c /172.17.0.4:9876

          What is happening is:
          The first connection attempt to the follower ONOS3 is delayed for some reason, and the timeout on Client 3's side makes it try a new connection to ONOS2 (the leader), which gets fully set up even before ONOS3 sends the AcceptRequest to ONOS2. So, from ONOS2's perspective, the connection to ONOS3 was made after the one it had already received, and the log entries are in the wrong order. Meanwhile Client 3 is connected to ONOS3, but the log entries and the server-side information believe it is connected to ONOS2. So this client stops working.

          I confirmed this by looking at the ONOS2 and ONOS3 logs: indeed, the leader's logs show that it first replies to Client 3's connection and only later receives the AcceptRequest from ONOS3.
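
          A quick way to cross-check that ordering in the server logs is to filter both nodes' karaf logs for Client 3's UUID and the connect-related entries (the ONOS2/ONOS3 log file names below are placeholders, as those logs are not attached here):

          # Compare the order of ConnectRequest / ConnectEntry / Register events for client 3 on leader and follower
          grep -E 'e65c9467-bcda-4557-ac5a-3d3d8d470a1c.*(ConnectRequest|ConnectEntry|Register address)' onos2-karaf.log onos3-karaf.log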

          Mao Jianwei added a comment

          It seems that we have been hitting the same problem from 1.3.0 up to now (1.10.0-SNAPSHOT).

          I also want to add that if the instances of a cluster are not started roughly simultaneously, it easily leads to the whole cluster being unable to start up and work.

          I think there may be some problem in copycat or atomix?

          Jon Hall added a comment

          Mao Jianwei, can you give some more information on your setup and, using the latest master, provide some debug logs?
          How many nodes are you using?
          How do you install ONOS? (You mentioned a delay in starting the nodes; how long is the gap between the first and second, etc.?* What commands are you using? ...)
          How do you kill the node?
          What are the exact symptoms you are seeing?

          Can you give us the karaf logs with debug turned on?
          You can enable atomix debug logging from startup by adding the following line to tools/test/bin/onos-config where there is a similar command:

          echo "log4j.logger.io.atomix= TRACE" \
                  >> $ONOS_INSTALL_DIR/$KARAF_DIST/etc/org.ops4j.pax.logging.cfg

          This will change the log configuration when ONOS is installed remotely using the installation scripts packaged with ONOS.
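
          If the nodes are already installed and running, roughly the same effect can usually be achieved at runtime from the Karaf console with the standard log:set command (a non-persistent alternative; the level reverts on restart):

          onos> log:set TRACE io.atomix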

          These seem like two separate issues, so we might move the startup issue to its own bug once we have more information.

          * This may be somewhat unavoidable; we need a quorum for the cluster to operate. We have tested installing 7-node clusters in sequence, and that does work, but I'm not sure what the maximum delay is between the first node starting and the last node of the quorum starting for ONOS to still be able to successfully form a cluster.


            People

            • Assignee: Jon Hall
            • Reporter: Srinivas Bandi
            • Votes: 0
            • Watchers: 9
