After looking at the logs for a diverse set of HA issues over the past few weeks, I've come to realize there's a fundamental flaw in the Copycat client that exacerbates these types of issues. When we see StorageService failures like timeouts, they often cascade across the entire ONOS process: when a ConsistentMap.put call in one application fails, we often see seemingly random timeouts elsewhere in the cluster. The reason is that all primitives share a single Copycat session for each partition, and that session performs sequencing for every primitive that interacts with the partition. As a result, a failure in one primitive can cascade to other primitives.
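To illustrate the failure mode, here is a minimal sketch of a shared sequencer, assuming a simplified model of what the Copycat session does: responses are released strictly in submission order, so one stalled operation holds up completions for every other primitive behind it. The class and method names (SharedSequencer, submit, onResponse) are illustrative, not Copycat's actual API.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Sketch: one sequencer shared by every primitive in a partition.
public class SharedSequencer {

    private static final class Op {
        final long id;
        final CompletableFuture<String> future = new CompletableFuture<>();
        Op(long id) { this.id = id; }
    }

    private long nextId;
    private final Queue<Op> pending = new ArrayDeque<>();
    private final Map<Long, String> arrived = new HashMap<>();

    // Every primitive's operation enters the same FIFO, regardless of
    // which primitive submitted it.
    CompletableFuture<String> submit(String primitive) {
        Op op = new Op(nextId++);
        pending.add(op);
        return op.future;
    }

    // A response arrived for operation `id`. Completions are released
    // strictly in submission order, so a response that arrives early sits
    // buffered until every earlier operation has also completed.
    void onResponse(long id, String result) {
        arrived.put(id, result);
        while (!pending.isEmpty() && arrived.containsKey(pending.peek().id)) {
            Op head = pending.poll();
            head.future.complete(arrived.remove(head.id));
        }
    }

    public static void main(String[] args) {
        SharedSequencer seq = new SharedSequencer();
        CompletableFuture<String> mapPut = seq.submit("my-consistent-map"); // id 0
        CompletableFuture<String> counterIncr = seq.submit("my-counter");   // id 1

        seq.onResponse(1, "counter-ok"); // counter's response arrives first...
        System.out.println(counterIncr.isDone()); // false: stuck behind the map op
        seq.onResponse(0, "map-ok");     // the slow map operation finally responds
        System.out.println(counterIncr.isDone()); // true
    }
}
```

In this model, a ConsistentMap.put that times out never produces a response for its id, so every operation queued after it, from any primitive, appears to time out as well.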
Up until now, all primitives have shared a Copycat client because it provided ordering guarantees across all primitives. But because we relaxed the primitive threading model in ONOS-6267, those guarantees no longer apply across primitives, only within a single primitive. So, the Copycat client should be refactored to support a separate logical session for each partition of each primitive. This should be a fairly straightforward task. Doing so will ensure that sequencing for one primitive occurs independently of sequencing for all other primitives, reducing the likelihood of cascading timeouts and significantly increasing concurrency all the way down to Netty.
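The proposed structure can be sketched roughly as follows, assuming a hypothetical per-partition client that hands each primitive its own logical session with an independent sequence counter. The names (PartitionClient, LogicalSession) are illustrative, not the actual Copycat or ONOS API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: one logical session per primitive, multiplexed over the shared
// partition connection, so sequencing is scoped to a single primitive.
public class PartitionClient {

    // One logical session per primitive name, instead of one shared session
    // for the entire partition.
    private final Map<String, LogicalSession> sessions = new ConcurrentHashMap<>();

    public LogicalSession sessionFor(String primitiveName) {
        return sessions.computeIfAbsent(primitiveName, LogicalSession::new);
    }

    public static final class LogicalSession {
        private final String primitiveName;
        // Sequencing is now per primitive: a timeout here stalls only this
        // counter, not the sequencing of any other primitive.
        private final AtomicLong sequence = new AtomicLong();

        LogicalSession(String primitiveName) {
            this.primitiveName = primitiveName;
        }

        public long nextSequence() {
            return sequence.incrementAndGet();
        }

        public String primitiveName() {
            return primitiveName;
        }
    }

    public static void main(String[] args) {
        PartitionClient client = new PartitionClient();
        LogicalSession mapSession = client.sessionFor("my-consistent-map");
        LogicalSession counterSession = client.sessionFor("my-counter");

        // Each primitive sequences independently of the others.
        System.out.println(mapSession.nextSequence());     // 1
        System.out.println(mapSession.nextSequence());     // 2
        System.out.println(counterSession.nextSequence()); // 1: unaffected
    }
}
```

With this shape, a stalled map operation can only hold up responses sequenced within its own logical session; the counter's session keeps draining independently.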