ONOS / ONOS-5347

ONOS cluster unable to recover after one cluster member is killed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.7.0, 1.8.0
    • Fix Version/s: 1.9.0, 1.8.2
    • Component/s: None
    • Labels:
    • Environment:

      Test Env: 3 node cluster
      Nodes running on LXC containers
      ONOS ver: ONOS 1.8 (Master)

      Also tested the scenario with ONOS 1.7 release

    • Story Points:
      8
    • Epic Link:
    • Sprint:
      Ibis Sprint 2 - Platform, Ibis Sprint 3 - Platform, Junco Sprint #1 - Platform, Junco Sprint #2 - Platform, Junco Sprint #3 - Platform

      Description

      Issue found during cluster capability testing of ONOS.
      Test Env: 3 node cluster (running in LXC containers)
      ONOS ver: ONOS 1.8 (Master)
      Steps:

      • All three nodes are up and running. ONOS is running in LXC containers.
      • Killed one of the instances.

      This resulted in failure of the entire cluster, i.e.:

      • ConsistentMapTimeout exceptions on the other two nodes.
      • The GUI is not working on any node.
      • ONOS CLI commands on the other nodes fail, with output as below:

      onos> masters
      Error executing command: org.onosproject.store.service.StorageException$Timeout

      Logs from all three nodes are attached.
      The 10.0.3.11 logs correspond to the killed instance (after restart).
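To compare how often the timeouts appear on each node, the attached logs can be scanned with a small script. The following is only a minimal sketch: the marker strings are copied from this report, and the exact wording in a real karaf.log may differ. The function names are hypothetical helpers, not part of ONOS.

```python
from collections import Counter

# Marker substrings taken from this report (assumption: the real log
# lines contain them verbatim; adjust if the exception names differ).
TIMEOUT_MARKERS = (
    "StorageException$Timeout",
    "ConsistentMapTimeout",
)


def count_timeout_markers(lines):
    """Count how many lines mention each timeout marker."""
    counts = Counter()
    for line in lines:
        for marker in TIMEOUT_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def scan_log(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return count_timeout_markers(handle)
```

For example, `scan_log("10_0_3_10_karaf.log")` would return the per-marker counts for the first attached log, which can then be compared with the counts from the other two nodes.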

      I see the following two issues:

      1. Killing or crashing one ONOS instance brings down the entire cluster.
      2. (On the killed node) Unable to load the app from disk.
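For context on why issue 1 is unexpected, here is a rough sketch of the usual majority-quorum arithmetic. It assumes the cluster's distributed stores depend on a majority of members being reachable, which this report does not state explicitly. Under that assumption a 3-node cluster should keep working with one member down. The function names are illustrative only.

```python
def majority(cluster_size):
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size, live_nodes):
    """True if the live members can still form a majority."""
    return live_nodes >= majority(cluster_size)


# Scenario from this report: 3 nodes, one killed, two still running.
print(has_quorum(3, 2))
```

With three members the majority is two, so losing a single instance should not by itself make the stores unavailable. This is why the observed cluster-wide timeouts are reported as a bug.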

      Additional info: whenever the "org.onosproject.store.service.StorageException$Timeout" exception is hit in other scenarios, the entire cluster also appears to become non-functional.

      I have also shared detailed observations from testing with ONOS 1.7 and 1.8 in the dev community group:
      https://groups.google.com/a/onosproject.org/forum/#!topic/onos-dev/e6EtzPrB1Pw

        Attachments

        1. 0001-debug.patch
          25 kB
        2. 0001-debug-logs-cluster.patch
          14 kB
        3. 10_0_3_10_karaf.log
          1.66 MB
        4. 10_0_3_11_karaf.log
          653 kB
        5. 10_0_3_12_karaf.log
          4.08 MB
        6. onos1.log.bz2
          830 kB


              People

              • Assignee:
                jhall Jon Hall
                Reporter:
                sbandi Srinivas Bandi
              • Votes:
                0
                Watchers:
                10

                Dates

                • Created:
                  Updated:
                  Resolved: