- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.15.0
- Fix Version/s: None
- Component/s: None
- Story Points: 5
We are seeing growing inter-cluster traffic in the CHO test (H-AGG topology with SegmentRouting), which seems to be causing increasing CPU load and traffic reroute latency.
We are running a 3-node cluster with onos-1.15. The attached traffic monitoring graph is from onos node1. We see the same issue on both Mininet and the hardware pod.
What we observed so far:
- Traffic grows linearly with the number of port up/down events we trigger in the test, by roughly 2 KB/s per port up/down event
- Traffic won't go down unless we kill and restart one node (see the rightmost part of the traffic graph)
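The linear relationship in the first observation can be checked with a least-squares fit of traffic against the cumulative event count. Below is a minimal sketch in Python; the function name `fit_line` and the sample numbers are made up for illustration and are not measurements from this test.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical samples (NOT from the attached graphs):
# cumulative port up/down events vs. inter-cluster traffic in KB/s.
events = [0, 10, 20, 30, 40]
traffic_kb_per_s = [100.0, 120.0, 140.0, 160.0, 180.0]

slope, intercept = fit_line(events, traffic_kb_per_s)
print(f"growth per event: {slope:.2f} KB/s, baseline: {intercept:.2f} KB/s")
```

With real samples, a slope close to 2 and small residuals would support the "roughly 2 KB/s per event" estimate.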
By analyzing traffic statistics, we confirmed that most of the growing traffic comes from sessions to/from port 5679 on the onos nodes. For example:
10.192.21.213:54148 <-> 10.192.21.215:5679
And these sessions belong to some netty-messaging threads. Next we'll investigate which message types are being sent the most.
Update: uploaded "metrics" output of onos node1 at the beginning of the test as well as 1 hour later.
Update on 2-25: the traffic growth doesn’t seem to come from the subjects collected by `metrics`. I’m not able to find any indication of an increasing message rate in the `metrics` output.
Using the Wireshark analysis tools, I took a deeper look at the following session, whose (unidirectional) traffic increased from 163 KB/s to 311 KB/s in 2 hours:
10.192.21.213:54632 -> 10.192.21.215:9876
The analysis shows that the average packet size stays the same while the packet rate increases.
Please see the attached graphs of IO statistics (60s intervals) as well as the pcap files for this session.
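The same kind of per-interval breakdown (packet rate versus average packet size) can be reproduced outside Wireshark. The sketch below assumes the packets of one direction of a session have already been exported as (timestamp in seconds, length in bytes) pairs; the function name `interval_stats` and the sample data are hypothetical and not taken from the attached pcap files.

```python
from collections import defaultdict


def interval_stats(packets, interval=60.0):
    """Bucket (timestamp_s, length_bytes) pairs into fixed intervals.

    Returns a time-sorted list of tuples:
    (bucket_start_s, packets_per_s, avg_packet_bytes, bytes_per_s).
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket start -> [packet count, total bytes]
    for ts, length in packets:
        start = int(ts // interval) * interval
        buckets[start][0] += 1
        buckets[start][1] += length
    rows = []
    for start in sorted(buckets):
        count, total = buckets[start]
        rows.append((start, count / interval, total / count, total / interval))
    return rows


# Hypothetical data: constant 1000-byte packets, rate doubling in the second minute.
sample = [(i * 1.0, 1000) for i in range(60)]
sample += [(60 + i * 0.5, 1000) for i in range(120)]

rows = interval_stats(sample)
for start, pps, avg_size, bps in rows:
    print(f"t={start:6.0f}s  {pps:5.2f} pkt/s  avg {avg_size:7.1f} B  {bps:8.1f} B/s")
```

A constant average size with a rising packets-per-second column matches the pattern described above: more messages are being sent, not larger ones.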
Update on 2-27: a large part of the traffic comes from atomix <-> onos communications. The TRACE logs of io.atomix show a high rate of requests and responses for `onos-group-store-keymap`. It turned out that the number of groups on leaf switches increases with the number of port-up events triggered in the test, which should be contributing to at least part of the traffic growth.
Filed https://jira.opencord.org/browse/CORD-3243 to track the group growth issue.