- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 1.7.0, 1.8.0
- Component/s: None
- Labels:
- Environment:
  Test Env: 3 node cluster
  Nodes running on LXC containers
  ONOS ver: ONOS 1.8 (Master). Also tested the scenario with the ONOS 1.7 release.
- Story Points: 8
- Epic Link:
- Sprint: Ibis Sprint 2 - Platform, Ibis Sprint 3 - Platform, Junco Sprint #1 - Platform, Junco Sprint #2 - Platform, Junco Sprint #3 - Platform
Issue found during cluster capability testing of ONOS.
Test Env: 3 node cluster (running in LXC containers)
ONOS ver: ONOS 1.8 (Master)
Steps:
- All three nodes are up and running. ONOS is running in LXC containers.
- Killed one of the instances.
This resulted in failure of the entire cluster, i.e.:
- ConsistentMapTimeout exceptions on the other two nodes.
- The GUI is not working on any node.
- ONOS CLI commands fail on the other nodes, with output as below:
onos> masters
Error executing command: org.onosproject.store.service.StorageException$Timeout
Logs from all three nodes are attached.
The 10.0.3.11 logs correspond to the killed instance (after restart).
I see the following two issues:
1. Killing/crashing one instance of ONOS brings the entire cluster down.
2. (On the killed node) Unable to load the app from disk.
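For context on why issue 1 looks like a bug rather than expected behaviour: assuming the distributed store relies on majority-quorum consensus (an assumption on my part, not something stated in this report), a 3 node cluster should keep a quorum after losing one node. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest majority of `cluster_size` voting members."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be positive")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Number of members that can fail while a majority still remains."""
    return cluster_size - quorum_size(cluster_size)


def has_quorum(cluster_size: int, live_nodes: int) -> bool:
    """True if the surviving nodes still form a majority."""
    return live_nodes >= quorum_size(cluster_size)


if __name__ == "__main__":
    # Scenario from this report: 3 nodes, one instance killed, 2 left.
    print(quorum_size(3), tolerated_failures(3), has_quorum(3, 2))
```

Under that assumption, the two surviving nodes should still be able to serve store operations, so the timeouts on them are unexpected.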
Additional Info: After hitting the "org.onosproject.store.service.StorageException$Timeout" exception in other scenarios as well, the entire cluster appears to be non-functional.
Detailed observations from testing with ONOS 1.7 and 1.8 are also shared in the dev community group:
https://groups.google.com/a/onosproject.org/forum/#!topic/onos-dev/e6EtzPrB1Pw