ONOS-7939

Inter-cluster traffic keeps growing in CHO test

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.15.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Story Points: 5
    • Epic Link:

      Description

      We are seeing growing inter-cluster traffic in the CHO test (with the H-AGG topology and SegmentRouting), which seems to be causing growing CPU load and traffic reroute latency.

      We are running a 3-node cluster with onos-1.15. The attached traffic monitoring graph is from onos node1. We see the same issue on both Mininet and the hardware pod.

      What we observed so far:

      • Traffic grows linearly with the number of port up/down events we trigger in the test, at roughly 2 KB/s per port up/down event
      • Traffic won't go down unless we kill and restart one node (see the rightmost part of the traffic graph)
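
      A rough sketch of what the first observation implies, assuming the growth stays linear at about 2 KB/s per port up/down event. The function name, the baseline value, and the event counts are hypothetical and only illustrate the arithmetic:

```python
def projected_traffic_kbytes_per_s(baseline, events, growth_per_event=2.0):
    """Linear model: each port up/down event adds a fixed amount of traffic.

    All values are in KB/s; growth_per_event defaults to the ~2 KB/s
    per event reported in this issue.
    """
    return baseline + events * growth_per_event


# Hypothetical baseline of 100 KB/s; this is not a measured value.
for events in (0, 50, 100):
    print(events, projected_traffic_kbytes_per_s(100.0, events))
```

      Under this model the traffic never levels off on its own, which matches the observation that it only drops after a node restart.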

      By analyzing the traffic statistics, we confirmed that most of the growing traffic comes from sessions to/from port 5679 on the onos nodes. For example:

      10.192.21.213:54148        <-> 10.192.21.215:5679
      

      And these sessions belong to some netty-messaging threads. Next we'll investigate which message types are being sent the most.

      Update: uploaded "metrics" output of onos node1 at the beginning of the test as well as 1 hour later.

      Update on 2-25: the traffic growth doesn’t seem to come from the subjects collected by `metrics`. I’m not able to find any indication of an increasing message rate in the `metrics` output.

      Using Wireshark's analysis tools, I took a deeper look at the following session, whose unidirectional traffic increased from 163 KB/s to 311 KB/s in 2 hours:

      10.192.21.213:54632    ->    10.192.21.215:9876
      

      The analysis shows that the average packet size stayed the same while the packet rate increased.
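
      The same comparison (packet rate versus average packet size per interval) can be sketched outside Wireshark. This is only an illustration, assuming the packets are available as (timestamp in seconds, length in bytes) records; the function name, record format, and sample values are made up and are not taken from the attached pcap:

```python
from collections import defaultdict


def io_stats(packets, interval_s=60):
    """Group (timestamp_s, length_bytes) records into fixed intervals.

    Returns {interval_index: (packets_per_s, avg_packet_bytes, bytes_per_s)}.
    """
    buckets = defaultdict(list)
    for ts, length in packets:
        buckets[int(ts // interval_s)].append(length)

    stats = {}
    for idx, lengths in sorted(buckets.items()):
        total = sum(lengths)
        stats[idx] = (
            len(lengths) / interval_s,  # packet rate
            total / len(lengths),       # average packet size
            total / interval_s,         # throughput
        )
    return stats


# Synthetic sample: same packet size, twice as many packets in the second minute.
sample = [(t, 200) for t in range(0, 60, 2)] + [(t, 200) for t in range(60, 120)]
for idx, (pps, avg_size, bytes_per_s) in io_stats(sample).items():
    print(idx, pps, avg_size, bytes_per_s)
```

      If throughput rises while the average size stays flat, as in the synthetic sample, the growth is due to more messages being sent rather than larger ones.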

      Please check the attached graphs of IO statistics (60 s intervals) as well as the pcap files for this session.

      Update on 2-27: a large part of the traffic comes from atomix <-> onos communications. TRACE logs of io.atomix show a high rate of requests and responses for `onos-group-store-keymap`. It turns out that the number of groups on leaf switches increases with the number of port-up events triggered in the test, which should be contributing to at least part of the traffic growth.

      Filed https://jira.opencord.org/browse/CORD-3243 to track the group growth issue
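
      For the TRACE log observation above, one simple way to see which names dominate a log is to count the lines that mention each one. A minimal sketch; the function name is hypothetical and the sample lines are invented placeholders, since the real io.atomix TRACE format is not reproduced in this issue:

```python
from collections import Counter


def count_mentions(lines, names):
    """Count how many log lines mention each of the given names."""
    counts = Counter()
    for line in lines:
        for name in names:
            if name in line:
                counts[name] += 1
    return counts


# Placeholder lines; real io.atomix TRACE output will look different.
lines = [
    "TRACE request primitive=onos-group-store-keymap",
    "TRACE response primitive=onos-group-store-keymap",
    "TRACE request primitive=some-other-primitive",
]
print(count_mentions(lines, ["onos-group-store-keymap", "some-other-primitive"]))
```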

        Attachments

        1. 18-00.pcap (9.83 MB)
        2. image.png (441 kB)
        3. onos1-metrics-12-35-26 (84 kB)
        4. onos1-metrics-13-34-16 (87 kB)
        5. onos-diags.tar.gz (1.04 MB)
        6. packet-IO-18-00.png (319 kB)
        7. packet-IO-20-00.png (464 kB)

            People

            Assignee: Unassigned
            Reporter: You You Wang
            Votes: 0
            Watchers: 1
