Distributed Transactional Contention Management as the Traveling Salesman Problem

Bo Zhang, Binoy Ravindran, Roberto Palmieri
Virginia Tech
SIROCCO 2014

Lock-based concurrency control has serious drawbacks
- Coarse-grained locking: simple, but no concurrency

SIROCCO 2014, July 23-25, 2014, Hida Takayama, Japan

Fine-grained locking is better, but
- Excellent performance
- Poor programmability
- Lock problems don't go away! Deadlocks, livelocks, lock-convoying, priority inversion, ...
- Most significant difficulty: composition

Transactional memory
- Like database transactions
- Easier to program
- Composable
- First HTM, then STM, now HyTM

Optimistic execution yields performance gains at the simplicity of coarse-grain, but no silver bullet
[Figure: plot with axes Time and Threads comparing coarse-grained locking, STM, and fine-grained locking. E.g., C/C++ Intel Run-Time System STM (B. Saha et al. (2006). McRT-STM: A High Performance Software Transactional Memory. ACM PPoPP).]
- High data dependencies
- Irrevocable operations
- Interaction between transactions and non-transactions
- Conditional waiting

Contention management: which transaction to abort?
Example: transactions T0 and T1 conflict on x (x = x + y; x = x / 25;), and the contention manager decides which one aborts.
- Can cause too many aborts, e.g., when a long-running transaction conflicts with shorter transactions
- An aborted transaction may wait too long

From Multiprocessor to Distributed Systems (from STM to DTM)
- Multiprocessor TM: built-in cache-coherence support
- Distributed TM: message-passing links; cache-coherence protocol needed
[Figure: caches connected by a bus to shared memory; labels: Intel SMP, MESI protocol [3].]

D-STM problem space
- Cache-coherence protocol
  - Locate and move objects in the network
  - Guarantee the consistency over multiple object copies
- Conflict resolution
  - Conservative approach
  - Non-conservative approach
  - Key property: guarantee progress
- Fault-tolerance
  - Network with node failures
  - Replication protocol: manage object replicas

Transaction execution models in DTM
- Control flow
  - Moving transactions, objects are held locally
  - Consistency: distributed commit protocol
  - Inherit the database transactional synchronization
- Data flow
  - Move the object to run all transactions locally
  - Synchronization: optimistic
  - Conflicts are resolved by a conflict resolution strategy
  - No need for a distributed commit protocol
  - Easier to exploit locality

DTM, how it works
[Figure: node A (Txn requesting object o) and node B (Txn holding object o) connected by the network; labels: req(o), TM Proxy, Local Cache, "not found", "in use", CC.locate(o), CC.move(o), cmp(a,b), CR(A,B), Conflict Resolution.]
- Processors (or nodes) connected by message-passing links
- Distributed cache-coherence protocol (CC)
  - Locating and moving objects in the network
  - Maintaining consistency among multiple copies of an object
- Conflict resolution module (CR)
  - Resolve conflicts among transactions
  - How to make the correct/optimal decision?

Design Goal
Input:
- The distributed system: nodes communicate via message-passing links
- T: a set of n transactions accessing s shared objects in a metric-space network of m nodes
- A: conflict resolution strategy
- C: cache-coherence protocol
Output:
- makespan(A,C): the total time needed to complete the set of transactions under (A,C)
Goal: maximize the throughput by minimizing makespan(A,C) over all possible combinations of input (A,C).

Measures of Quality
- Compare with an optimal clairvoyant off-line scheduling algorithm OPT.
  - OPT has knowledge of all transactions in advance.
  - Each transaction is scheduled exactly once under OPT.
- Competitive ratio: evaluates how close makespan(A,C) is to optimal
  $CR(A,C) = \max \frac{makespan(A,C)}{makespan(OPT)}$
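The competitive ratio can be illustrated with a small worked computation. This is only a sketch: it assumes the maximum is taken over a finite list of workloads, and the function name `competitive_ratio` and all makespan values are hypothetical, not taken from the talk.

```python
def competitive_ratio(makespans, opt_makespans):
    """Worst-case ratio makespan(A,C) / makespan(OPT) over a list of workloads.

    Assumption: the max in CR(A,C) ranges over the given workloads only.
    """
    if len(makespans) != len(opt_makespans):
        raise ValueError("need one OPT makespan per workload")
    return max(m / o for m, o in zip(makespans, opt_makespans))


# Hypothetical makespans for three workloads (arbitrary time units).
measured = [12.0, 9.0, 20.0]   # makespan under some (A, C)
optimal = [6.0, 9.0, 8.0]      # makespan under OPT for the same workloads
ratio = competitive_ratio(measured, optimal)
```

Here the third workload dominates, so the ratio is 20 / 8.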

Problem statement
- We can consider the transaction scheduling problem for multiprocessor STM as a subset of the transaction scheduling problem for DTM.
- The two problems are equivalent as long as the communication cost can be ignored compared with the local execution time.
- We model contention management as a non-clairvoyant scheduling problem.
- If all transactions conflict with each other, then a sequential schedule is the best solution.

Towards the optimal: Cost Graph
Cost graph:
- each node in the system is a vertex in the graph
- each edge $(v_i, v_j)$ represents the channel to move an object from node $v_i$ to node $v_j$
- each edge $(v_i, v_j)$ is weighted with the cost $d_{ij}$ for moving the object from node $v_i$ to node $v_j$
[Figure: complete graph on $V_1$, $V_2$, $V_3$, $V_4$ with edge weights $d_{12}$, $d_{13}$, $d_{14}$, $d_{23}$, $d_{24}$, $d_{34}$.]
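A cost graph of this shape can be held in a plain lookup table. The sketch below assumes symmetric costs (consistent with a metric-space network) and uses made-up weights; `make_cost_graph` is a hypothetical helper name.

```python
def make_cost_graph(nodes, weights):
    """Build a lookup d[(u, v)] from undirected edge weights.

    Assumption: moving an object costs the same in both directions.
    """
    d = {(u, u): 0 for u in nodes}
    for (u, v), w in weights.items():
        d[(u, v)] = w
        d[(v, u)] = w
    return d


# Four nodes and six edges, as in the slide's figure; the numbers are invented.
nodes = ["v1", "v2", "v3", "v4"]
weights = {
    ("v1", "v2"): 3, ("v1", "v3"): 5, ("v1", "v4"): 4,
    ("v2", "v3"): 2, ("v2", "v4"): 6, ("v3", "v4"): 1,
}
d = make_cost_graph(nodes, weights)
```

With this layout, `d[("v2", "v1")]` and `d[("v1", "v2")]` return the same cost.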

Towards the optimal: Conflict Graph
Conflict graph:
- each transaction is represented as a numbered node
- each edge is marked with the object which causes transactions to conflict
- we can construct a coloring of the conflict graph
- since transactions with the same color are not connected, every set $C_i$ forms an independent set and can be executed in parallel without facing any conflicts
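The coloring idea can be sketched with a simple greedy pass. This is an illustration only: greedy coloring does not generally use the minimum number of colors, and the conflict graph below is invented.

```python
def greedy_coloring(conflicts):
    """Assign each transaction the smallest color not used by its neighbors.

    conflicts maps a transaction id to the set of ids it conflicts with.
    """
    color = {}
    for t in sorted(conflicts):
        used = {color[u] for u in conflicts[t] if u in color}
        c = 0
        while c in used:
            c += 1
        color[t] = c
    return color


def color_classes(color):
    """Group transactions by color; each group is an independent set."""
    classes = {}
    for t, c in color.items():
        classes.setdefault(c, set()).add(t)
    return classes


# Hypothetical conflict graph: 1-2 and 2-3 conflict, 4 conflicts with nobody.
conflicts = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
color = greedy_coloring(conflicts)
classes = color_classes(color)
```

Transactions in the same class share no edge, so each class can run in parallel.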

Towards the optimal: Ordering Conflict Graph
- An optimal offline schedule Opt determines a k-coloring of the conflict graph and an execution order (the ordering conflict graph) such that for any two sets $C_i$ and $C_j$, where $i < j$, if transactions $T$ and $T'$ conflict, with $T$ in $C_i$ and $T'$ in $C_j$, then $T'$ is postponed until $T$ commits
- There are k! ordering conflict graphs
- The ordering conflict graph is weighted:
  - A node's weight is the transaction's execution time
  - An arc's weight is the cost for moving the object from the source node to the destination node

...but the optimal is too complex
- The commit time of a transaction T is determined by one of the weighted paths that end at T
- The makespan is the weight of the longest weighted path in the ordering conflict graph
- The optimum is reached by selecting the ordering conflict graph that minimizes the makespan
- Finding an optimal contention manager is (NP-)hard
  - If each node issues only one transaction and the cost of moving objects is negligible, then the problem is equivalent to finding the chromatic number of the conflict graph
  - If the number of shared objects is one, the problem is equivalent to finding the traveling salesman path (TSP) in the cost graph
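The makespan of one ordering conflict graph can be computed as a longest weighted path. The sketch below assumes the graph is acyclic, that node weights are execution times and arc weights are object-moving costs, and that a transaction starts once every predecessor has committed and its object has arrived. All names and numbers are hypothetical.

```python
from functools import lru_cache


def makespan(exec_time, arcs):
    """Weight of the longest weighted path in an ordering conflict graph.

    exec_time: transaction -> execution time (node weight)
    arcs: (source, destination) -> moving cost (arc weight)
    Assumption: arcs form a DAG.
    """
    preds = {t: [] for t in exec_time}
    for (u, v), w in arcs.items():
        preds[v].append((u, w))

    @lru_cache(maxsize=None)
    def commit(t):
        # A transaction starts when the last needed object has arrived.
        start = max((commit(u) + w for u, w in preds[t]), default=0)
        return start + exec_time[t]

    return max(commit(t) for t in exec_time)


# Invented example: T1 precedes T2 and T3, and T2 precedes T3.
exec_time = {"T1": 2, "T2": 1, "T3": 3}
arcs = {("T1", "T2"): 4, ("T1", "T3"): 1, ("T2", "T3"): 2}
total = makespan(exec_time, arcs)
```

In this example the longest path is T1, T2, T3: 2 + 4 + 1 + 2 + 3.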

Also
- When each node generates a sequence of transactions, it is not always optimal to schedule transactions according to the ordering conflict graph.
- Since the conflict graph evolves over time, an optimal schedule based on a static conflict graph may lose potential parallelism in the future.

Local optimality is not global optimality (2-coloring)
[Figure: conflict graph with object-labeled edges, its 2-coloring, and the resulting schedule.]
Makespan = 4d. Not optimal!

Local optimality is not global optimality (4-coloring)
[Figure: the same conflict graph with a 4-coloring and the resulting schedule.]
Makespan = 3d.

Lower Bounds
- For STM, any online deterministic CM is $\Omega(s)$-competitive, where s is the number of objects [Attiya et al. 06]
- For DTM, any online deterministic work-conserving CM is $\Omega(\max[s, s^2/\bar{D}])$-competitive, where $\bar{D}$ is the normalized network diameter
- When the normalized network diameter is bounded ($\bar{D}$ is a constant), it can only provide an $\Omega(s^2)$-competitive ratio.
Can we find an approximately optimal solution in reasonable time?

CUTTING
CUTTING is a randomized scheduling algorithm based on partitioning the cost graph.
Assumptions:
- Transaction $T_i$ knows its required set of objects after it starts
- The moving cost is bounded by D
Input:
- A set of transactions with their execution times
- The conflict graph
- The cost graph
- An approximate TSP algorithm (ATSP)
Output:
- A schedule for executing transactions

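Why TSP and why that assumption?
- $T_1$, invoked by $N_1$, writes objects $\{o_2, o_3\}$; $N_2$ stores $o_2$; $N_3$ stores $o_3$
[Figure: cost graph on $N_1$, $N_2$, $N_3$, $N_4$ with edge weights $d_{12}$, $d_{13}$, $d_{14}$, $d_{23}$, $d_{24}$, $d_{34}$.]
- Without the assumption: $T_1$ issues Fetch($o_2$) at cost $2 \times d_{12}$ and Fetch($o_3$) at cost $2 \times d_{13}$, for a total of $(2 \times d_{12}) + (2 \times d_{13}) \approx 4d$
- With the assumption: $T_1$ issues a single Fetch($o_2, o_3$) that visits $N_2$ and then $N_3$, for a total of $d_{12} + d_{23} + d_{13} \approx 3d$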
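The two totals on this slide can be checked numerically. The sketch assumes symmetric costs and sets every distance to the same value d = 1, so the results correspond to 4d and 3d; the helper names are hypothetical.

```python
def symmetric(weights):
    """Return a cost lookup that answers both (u, v) and (v, u)."""
    d = dict(weights)
    d.update({(v, u): w for (u, v), w in weights.items()})
    return d


def cost_separate_fetches(d, home, holders):
    """One round trip per object: request out, object back."""
    return sum(2 * d[(home, h)] for h in holders)


def cost_single_tour(d, home, holders):
    """One tour through all holders that ends back at the home node."""
    path = [home] + list(holders) + [home]
    return sum(d[(a, b)] for a, b in zip(path, path[1:]))


# All pairwise distances set to d = 1 for the comparison.
d = symmetric({("N1", "N2"): 1, ("N1", "N3"): 1, ("N2", "N3"): 1})
separate = cost_separate_fetches(d, "N1", ["N2", "N3"])
tour = cost_single_tour(d, "N1", ["N2", "N3"])
```

Knowing the object set up front turns two round trips into one tour, which is why the ordering of visits becomes a TSP-style question.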

Cutting: how it works
- The cost graph is partitioned into C partitions such that for any pair of nodes $(v_i, v_j)$ belonging to one partition, $d_{ij} \le \mathrm{ATSP}/C$
- Within each partition:
  - Nodes are numbered with an integer from 1 to the size of the partition
  - A binary tree is built following the node numbers
- Each transaction randomly selects an integer that is used for deciding which transaction to abort after a conflict
[Figure: partitions $P_1, P_2, \dots, P_C$, each with a binary tree on nodes $v_1, v_2, v_3$.]

Cutting: how it works
Handling conflicts between two transactions:
- Phase 1:
  - If both are within the same partition and one transaction is an ancestor of the other in the partition's binary tree, the node that precedes the other in the ATSP path aborts the other
  - Otherwise, the transaction with the lesser partition number aborts the other
- Phase 2:
  - Each transaction randomly selects an integer (π) when it starts or restarts. If one transaction is not an ancestor of the other, the transaction with the lower π proceeds and the other transaction aborts.
- Whenever a transaction is aborted by a remote transaction, the requested object is moved to the remote transaction immediately.
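One possible reading of these rules is sketched below. It is not the authors' implementation. It assumes that transactions in different partitions are ordered by partition number, that the binary tree uses heap-style numbering (the parent of node i is i // 2), and that a tie on π goes to the first argument. The `Txn` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    partition: int   # partition number of the invoking node
    tree_index: int  # 1-based number of the invoking node in its partition
    path_pos: int    # position of the invoking node on the ATSP path
    pi: int          # random integer drawn when the transaction (re)starts


def is_ancestor(a: int, b: int) -> bool:
    """True if tree node a is a proper ancestor of b (heap numbering assumed)."""
    if a >= b:
        return False
    while b > a:
        b //= 2
    return b == a


def winner(a: Txn, b: Txn) -> Txn:
    """Return the transaction that proceeds; the other one aborts."""
    if a.partition != b.partition:
        # Assumed reading of "otherwise": lesser partition number wins.
        return a if a.partition < b.partition else b
    if is_ancestor(a.tree_index, b.tree_index) or is_ancestor(b.tree_index, a.tree_index):
        # Phase 1: the one earlier on the ATSP path aborts the other.
        return a if a.path_pos < b.path_pos else b
    # Phase 2: lower random integer proceeds (tie goes to a, an assumption).
    return a if a.pi <= b.pi else b


# Invented transactions to exercise each rule.
ta = Txn("a", partition=1, tree_index=1, path_pos=0, pi=9)
tb = Txn("b", partition=1, tree_index=3, path_pos=2, pi=1)
tc = Txn("c", partition=2, tree_index=1, path_pos=5, pi=0)
td = Txn("d", partition=1, tree_index=2, path_pos=1, pi=4)
```

In a running system π would be redrawn on every restart, for example with `random.randint`, so a transaction that loses in Phase 2 is not starved by a fixed priority.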

Cutting: analysis
- The average-case competitive ratio of Cutting is $O(s \cdot \phi_A \cdot \log^2 m \log^2 n)$ for s objects shared by n transactions invoked by m nodes, where $\phi_A$ is the approximation ratio of the approximate TSP algorithm A
- This is close to the multiprocessor bound of $O(s)$

Analysis
[Slide of overlapping excerpts from the proofs; only the following statements are recoverable.]
- Cutting is studied through two average-case efficiency measures: the average response time (how long it takes for a transaction to commit, on average) and the average makespan.
- The expected number of trials a transaction needs, from the moment it is invoked until it commits, is bounded in terms of C, log m and log n.
- A transaction can only select a new random number after it is aborted (locally or remotely); this bounds the duration of a trial, with separate cases for a conflict inside the same partition and a conflict with a transaction in another partition.
- The proofs use a Chernoff bound for independent Poisson trials $X_1, X_2, \dots, X_n$ with $\Pr(X_i = 1) = p_i$.
- The average-case competitive ratio of Cutting is $O(s \cdot \phi_A \cdot \log^2 m \log^2 n)$.

Thanks! Questions?
Research project's web-site: www.hyflow.org