The failure of a coordinator during the two-phase-commit-based transaction protocol can leave locks held within a transactional state machine indefinitely. For example, if a coordinator prepares transactional changes within a single partition and then crashes, the locks acquired by that prepare will never be released.
But it's not sufficient to simply release transactional locks when a session times out. That is only a temporary solution that risks losing the atomicity guarantees of transactions when a failure occurs during the second phase. If a coordinator begins to commit a transaction and successfully commits to a subset of the partitions before crashing, releasing locks on remaining partitions would turn the transaction into a partial commit.
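The partial-commit hazard can be illustrated with a small, self-contained simulation. This is only a sketch: the Partition class and its methods are hypothetical stand-ins invented for illustration, not the ONOS API.

```java
import java.util.HashMap;
import java.util.Map;

class TimeoutReleaseSketch {

    /** Hypothetical partition: prepared writes stay staged (and locked) until resolved. */
    static class Partition {
        final Map<String, String> committed = new HashMap<>();
        final Map<String, String> staged = new HashMap<>();

        /** Phase one: stage the write and implicitly hold its lock. */
        void prepare(String key, String value) {
            staged.put(key, value);
        }

        /** Phase two: apply the staged writes and release the locks. */
        void commit() {
            committed.putAll(staged);
            staged.clear();
        }

        /** The naive fix: discard staged writes and their locks when the session expires. */
        void releaseOnSessionTimeout() {
            staged.clear();
        }
    }

    /** Simulates a coordinator that crashes after committing only the first partition. */
    static Partition[] crashDuringSecondPhase() {
        Partition a = new Partition();
        Partition b = new Partition();
        a.prepare("x", "1");
        b.prepare("y", "2");
        a.commit();                   // second phase reaches partition a
        // The coordinator crashes here, before committing partition b.
        b.releaseOnSessionTimeout();  // the session timeout drops b's staged write
        return new Partition[] {a, b};
    }

    public static void main(String[] args) {
        Partition[] result = crashDuringSecondPhase();
        System.out.println("a committed: " + result[0].committed);
        System.out.println("b committed: " + result[1].committed);
    }
}
```

After the simulated crash, partition a holds the committed write while partition b has lost its staged write, so the transaction is no longer atomic.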
The best solution here is to add a TransactionService that can manage transaction state and, more importantly, resolve transactions after failures. When a transaction coordinator fails, another node should take over for the coordinator and resolve the transaction: roll it back if the first (prepare) phase had not completed, or commit it if it had. Two-phase commit with coordinator recovery can still lose liveness when a coordinator and a participant fail at the same time, but because participants in ONOS are highly available, that risk is largely mitigated.
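The recovery rule described above can be sketched as follows. This assumes the coordinator records the transaction state durably before starting each phase, which the original text does not specify. The State enum, the Participant class, and the resolve method are hypothetical names used only for illustration; they are not the proposed TransactionService API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TransactionRecoverySketch {

    /** Assumed coordinator states, recorded durably before each phase begins. */
    enum State { PREPARING, COMMITTING }

    /** Hypothetical participant that tracks the outcome of each transaction it has seen. */
    static class Participant {
        private final Map<String, String> outcomes = new HashMap<>();

        void prepare(String txId) {
            outcomes.put(txId, "PREPARED");
        }

        /** Repeating a commit is harmless, so recovery can safely reissue it. */
        void commit(String txId) {
            outcomes.put(txId, "COMMITTED");
        }

        void rollback(String txId) {
            outcomes.put(txId, "ROLLED_BACK");
        }

        String outcome(String txId) {
            return outcomes.get(txId);
        }
    }

    /**
     * Run by the node that takes over for a failed coordinator.
     * If the recorded state is COMMITTING, the first phase completed, so the
     * commit is finished everywhere. Otherwise no participant can have
     * committed yet, so rolling back everywhere is safe.
     */
    static String resolve(String txId, State recorded, List<Participant> participants) {
        boolean commit = recorded == State.COMMITTING;
        for (Participant p : participants) {
            if (commit) {
                p.commit(txId);
            } else {
                p.rollback(txId);
            }
        }
        return commit ? "COMMITTED" : "ROLLED_BACK";
    }

    public static void main(String[] args) {
        Participant a = new Participant();
        Participant b = new Participant();
        a.prepare("tx-1");
        b.prepare("tx-1");
        a.commit("tx-1"); // the coordinator crashes after committing only a
        System.out.println(resolve("tx-1", State.COMMITTING, Arrays.asList(a, b)));
    }
}
```

Because the decision depends only on the recorded state, any node can complete the recovery, and a recovery that is itself interrupted can simply be rerun.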
The alternative is to use three-phase commit, but that would add overhead to the protocol that is unnecessary given the fault tolerance of participants in ONOS.