SELF-OPTIMIZING DATA GRIDS
Collaboration Meeting with Optimis, 20-21 Sept. 2011, Rome
Project Goals
Develop an open-source middleware for the Cloud:
1. Providing a simple and intuitive programming model: hide the complexity of distribution, elasticity, and fault-tolerance
2. Minimizing administration and monitoring costs: automate elastic provisioning based on QoS/cost constraints
3. Minimizing operational costs via self-tuning: adapting consistency mechanisms to maximize efficiency
Architecture Overview
[Figure: two-layer architecture. Data Platform: Data Platform APIs (Object Grid Mapper, Search API), Distributed Execution Framework, Reconfigurable Distributed Software Transactional Memory, Reconfigurable Storage System. Autonomic Manager: Workload & QoS Monitor, Workload Analyzer, Adaptation Manager (Data Platform Optimizer, Elastic Scaling Manager); QoS/cost specification API; data platform reconfiguration & tuning; provisioning & QoS negotiation.]
Methodologies explored so far
Analytical modeling: queuing theory, Markov processes, stochastic techniques
Machine learning:
off-line techniques: Decision Trees, Neural Networks, Support Vector Machines
on-line techniques (reinforcement learning): UCB algorithm
Analytical modeling
White box approach: requires detailed knowledge of internal dynamics
Good extrapolation power: allows forecasting system behavior in unexplored regions of its parameter space
Minimal learning time: basically parameter instantiation
Complex and expensive to design/validate
Subject to unavoidable approximation errors
Machine learning
Black box approach: observe inputs, context and outputs of a system; use statistical methods to identify patterns/rules
Good accuracy in already explored regions of the parameter space, but poor extrapolation power
Learning time grows exponentially with the number of features: but eventually outperforms analytical models (typically!)
Hybrid techniques
IDEA: get the best of the two worlds
Two alternative approaches so far:
1. Divide-and-conquer: AM for well-specified sub-components; ML for sub-components that are too complex to model explicitly, or whose internal dynamics are only partially specified
2. Use AM to initialize ML knowledge: reduce the learning time of ML techniques; correct the AM using feedback from the operational system
Self-tuning problems addressed so far
Dynamic selection and switching between replication protocols:
total order based replication protocols (Case study 1): purely based on Machine Learning techniques
Two phase commit vs primary backup (Case study 2): hybrid ML-AM solution, divide et impera
GCS optimization: tuning of batching in total order protocols (Case study 3): hybrid ML-AM, ML bootstrapped with AM knowledge
SELF-TUNING REPLICATION
The search for the holy grail: transactional replication protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order-based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
No one size fits all
Existing solutions are optimized for specific workload/scale scenarios
In dynamic environments where both:
1. the workload characteristics, and
2. the amount of used resources
vary over time, self-tuning is the only way to achieve optimal efficiency
Autonomic adaptation at play
[Figure: adaptation cycle across workload phases; legend: nodes processing read-only requests vs nodes processing read&write requests]
low traffic, read-dominated, low conflict: primary-backup with low #resources: low % writes, low load on the primary, minimum costs
hi traffic, read-dominated, low conflict: auto-scale up: new nodes hired for read-only requests; primary-backup: low % writes, the primary stands the load
hi traffic, write-dominated, low conflict: multi-master: hi % writes overwhelm the primary; higher scalability
low traffic, read-dominated, low conflict: auto-scale down: minimum costs; switch back to primary-backup
Self-optimizing replication
Entails devising solutions to two key issues:
1. Allow coexistence of, and efficient switching among, multiple replication protocols: avoid blocking transaction processing during transitions
2. Determine the optimal replication strategy given the current (or foreseen) workload characteristics: machine learning methods (black box), analytical models (white box), hybrid analytical/statistical approaches (gray box)
Two case studies
1. Certification schemes: NVC vs VC vs BFC [Figure: normalized throughput vs read set size for NVC, BFC, VC] (joint work with M. Couceiro and L. Rodrigues)
2. Single vs multi-master: 2PC vs PB [Figure: throughput (Tx/sec) vs # nodes (4-10) for PB and 2PC under low and high conflict] (joint work with D. Didona, S. Peluso and F. Quaglia)
2nd Workshop on Software Services, Timisoara, Romania, 6 June 2011
Maria Couceiro, Paolo Romano, Luis Rodrigues
PolyCert: Polymorphic Self-Optimizing Replication for In-Memory Transactional Grids
ACM/IFIP/USENIX 12th International Middleware Conference (Middleware 2011)
TOTAL ORDER BASED CERTIFICATION MECHANISMS
Where they fit in the picture
[Taxonomy: Single master (primary-backup) | Multi master: Total order-based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Certification (a.k.a. deferred update)
A transaction is executed entirely at a single replica: good scalability also in write-intensive workloads
No coordination during the transaction execution phase: minimizes traffic
If the transaction is ready to commit, coordination is required:
to ensure serializability
to propagate the updates
Certification
Two transactions may concurrently update the same data at different replicas. Coordination must detect this situation and abort at least one of these transactions.
Three alternatives: Non-voting algorithm, Voting algorithm, BFC
All rely on total order broadcast:
- ensure agreement on the transaction serialization order
- avoid deadlocks
- achieve fault-tolerance
Classic Replication Protocols
Focus on full replication protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Non-voting
The transaction executes locally. When the transaction is ready to commit, the read and write set are sent to all replicas using total order broadcast.
Transactions are certified in total order. A transaction may commit if its read set is still valid (i.e., no other transaction has updated the read set).
Non-voting
[Figure: R1 executes transaction T1, R2 executes transaction T2; TOB of T1's read & write set, TOB of T2's read & write set; R1, R2 and R3 each then run Validation&Commit of T1 and Validation&Abort of T2]
+ only validation executed at all replicas: high scalability with write intensive workloads
- need to send also the read set: often very large!
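The validation step above can be sketched in a few lines of Python. This is a minimal illustration, not PolyCert's actual code: the `Store` and `certify` names are invented, and per-key version counters stand in for whatever validity check the real system uses.

```python
class Store:
    def __init__(self):
        self.versions = {}  # key -> committed version counter

    def version(self, key):
        return self.versions.get(key, 0)

def certify(store, read_set, write_set):
    """Certify a transaction delivered in total order.
    read_set: {key: version observed during execution}; write_set: written keys.
    """
    # Validate: every key read must still be at the version the transaction saw.
    for key, seen in read_set.items():
        if store.version(key) != seen:
            return False  # another committed transaction updated it: abort
    # Apply: bump the version of every written key.
    for key in write_set:
        store.versions[key] = store.version(key) + 1
    return True
```

Since `certify` is deterministic and runs on the same totally ordered stream at every replica, all replicas reach the same commit/abort decision with no extra communication.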
Classic Replication Protocols
Focus on full replication protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Voting
The transaction executes locally at replica R.
When the transaction is ready to commit, only the write set is sent to all replicas using total order broadcast.
Commit requests are processed in total order.
A transaction may commit if its read set is still valid (i.e., no other transaction has updated the read set): only R can certify the transaction!
R sends the outcome of the transaction to all replicas.
Voting
[Figure: R1 executes transaction T1; TOB of T1's write set; T1's validation at R1; reliable broadcast of T1's "vote"; R2 waits for R1's vote]
+ sends only the write set (normally much smaller than the read set)
- additional communication phase to disseminate the decision (vote)
Classic Replication Protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Bloom Filter Certification (BFC)
Bloom filters: space-efficient data structure for membership queries
Probabilistic answer to "Is elem contained in BF?"
No false negatives: a "no" answer is always correct
False positives: a "yes" answer may be false
Compression is a function of a (tunable) false positive rate
Bloom Filter Certification (BFC)
Key idea: use a BF to encode the read set and detect intersections with the write sets of concurrent transactions:
False positives: additional (deterministic) aborts
Strongly reduced network traffic: with a 1% false positive rate, up to 30x compression
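A toy Python sketch of the idea. It is illustrative only: the filter size, hash scheme and function names are made up, and a real implementation would derive `m_bits` and `k_hashes` from the target false positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # bit array kept as a single integer

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def bfc_conflicts(readset_bf, write_set):
    # The writeset intersects the encoded readset if any written key
    # *might* be in the filter: false positives cause extra (safe) aborts,
    # false negatives are impossible, so no real conflict is ever missed.
    return any(readset_bf.might_contain(k) for k in write_set)
```

The certification message carries only the filter bits instead of the full read set, which is where the compression comes from.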
BFC vs Voting vs Non-Voting
[Figure: normalized throughput vs read set size (1 to 100,000) for NVC, BFC, VC]
PolyCert: Polymorphic Self-Optimizing Certification
Co-existence of the 3 certification schemes
Machine-learning techniques to determine the optimal certification strategy per transaction
Logic associated with the on-line choice of the replication strategy encapsulated into a generic oracle
Architecture
[Figure: replica i: Transactional Application; Polymorphic Replication Manager (NVC, BFC, VC); Replication Protocol Selector Oracle (Key Value Store, Decision Tree Regressor, SVM Learner)]
Protocol
When the transaction finishes local execution:
Ask the Oracle which protocol to use.
Build a message accordingly.
AB-cast the message.
Protocol
Upon delivery of an AB-cast message:
The message is inserted in a queue with the transactions to be certified.
NVC or BFC: no further processing is done until the message reaches the head of the queue.
VC: if the transaction is local and does not conflict with others in the queue: validate it and send the vote.
Protocol
Transactions are removed from the head of the queue and validated sequentially.
NVC or BFC: each node applies locally the corresponding certification algorithm, validating and applying/discarding the write set.
VC: if the vote has been received, act accordingly. Else:
Remote: wait for the vote.
Local: validate the transaction and send the vote.
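The head-of-queue handling in the last two slides can be condensed into a small dispatcher. This is a sketch with invented names; the real implementation interleaves these decisions with Infinispan's commit machinery.

```python
from collections import deque

def process_head(queue, is_local, vote_received):
    """Decide what to do with the certification message at the queue head.
    queue: deque of messages, each a dict with a 'scheme' key (NVC/BFC/VC).
    is_local / vote_received: predicates supplied by the replica.
    """
    msg = queue[0]
    if msg["scheme"] in ("NVC", "BFC"):
        queue.popleft()
        return "certify-locally"        # every replica runs the certification test
    # VC: only the originating replica can certify the transaction.
    if is_local(msg):
        queue.popleft()
        return "validate-and-send-vote"
    if vote_received(msg):
        queue.popleft()
        return "apply-or-discard"       # act according to the received vote
    return "wait-for-vote"              # remote VC transaction, vote not yet here
```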
Replication Protocol Selector Oracle
Two implementations:
Off-line Machine Learning Techniques
On-line Reinforcement Learning
Off-line Machine Learning Techniques
For each transaction:
Predict the size of the AB message m for the various certification schemes.
Forecast the AB latency for each message size.
We evaluated several ML approaches:
Regression decision trees (best results)
Neural networks
Support vector machines
Regression Decision Trees
Define a set of human-readable rules, where each rule:
identifies a region in the feature space
associates with the region a linear function of the features
Build a piecewise linear approximation of a function of the features
Choose the branching attribute such that the resulting split maximizes the normalized information gain
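To make the "rules + linear models" view concrete, here is what a (completely made-up) two-rule tree predicting AB-cast latency from message size might look like once flattened to Python; a trained tree would have many such regions and learned coefficients.

```python
# Each rule pairs a region of the feature space with a linear model.
# Coefficients below are illustrative, not learned from real data.
RULES = [
    (lambda size: size <= 10_000, lambda size: 0.5 + 0.0001 * size),
    (lambda size: size > 10_000,  lambda size: 1.0 + 0.0004 * size),
]

def predict_latency_ms(size_bytes):
    # Piecewise linear approximation: find the region, apply its model.
    for in_region, linear_model in RULES:
        if in_region(size_bytes):
            return linear_model(size_bytes)
```

The readability is the point: an operator can inspect exactly which feature ranges map to which latency estimates.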
Neural Networks
Inspired by the structure and functional aspects of biological neural networks
Define the weights of connections to minimize the average prediction error across all training data: back-propagation algorithm
Support Vector Machines
As a classifier: identifies the hyperplanes that have the largest distance to the nearest training data points of each class
As a function approximator: identifies the hyperplane that is as close as possible to the set of training data
Off-line Machine Learning Techniques
Use up to 53 monitored system attributes:
CPU
Memory
Network
Time-series
Require a computationally intensive training phase
On-line Reinforcement Learning
Each replica builds on-line expectations on the rewards of each protocol: no assumptions on reward distributions
Solves the exploration-exploitation dilemma: did I test this option sufficiently in this scenario?
UCB
Multi-armed bandit problem:
Each arm of a slot machine is associated with an unknown reward
Each round, one arm is played
Find the strategy that maximizes the average reward
Upper Confidence Bound (UCB):
lightweight and provably optimal solution to the bandit problem
computes the sample average and an upper confidence bound for each arm
the UCB captures the degree of uncertainty on the arm's actual reward
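A minimal sketch of the textbook UCB1 variant (not necessarily the deck's exact implementation): play each arm once, then always play the arm whose sample mean plus confidence bonus is highest.

```python
import math

class UCB1:
    """UCB1 bandit: play the arm maximizing mean + sqrt(2 ln t / n_arm)."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # plays per arm
        self.means = [0.0] * n_arms   # sample average reward per arm
        self.t = 0                    # total rounds played

    def select(self):
        self.t += 1
        for arm in range(len(self.counts)):
            if self.counts[arm] == 0:
                return arm            # play every arm once first
        # The bonus term shrinks as an arm is played more, so rarely
        # played arms are periodically re-explored.
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a]
                       + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.means[arm] += (reward - self.means[arm]) / n  # incremental mean
```

After a short warm-up, the better arm dominates the play counts while the worse arm still gets occasional exploratory plays.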
On-line Reinforcement Learning
Distinguishes workload scenarios solely based on the read set's size: exponential discretization intervals to minimize training time
Replicas exchange statistical information periodically to boost learning
Results - Bank Benchmark
[Figure: normalized throughput vs read set size (1 to 100,000) for NVC, BFC, VC, and the PolyCert variants]
Results - Bank Benchmark
Evolution of Throughput
[Figure: throughput (# commits/second) vs time (seconds) as PolyCert adapts]
Results - STMBench7
[Figure: normalized throughput on STMBench7 for the individual certification schemes and PolyCert]
Diego Didona, Sebastiano Peluso, Paolo Romano, Francesco Quaglia, Luis Rodrigues
CASE STUDY 2: PRIMARY-BACKUP <=> 2PC
Classic Replication Protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting; State machine replication), 2PC-based]
Single Master
Write transactions are executed entirely at a single replica (the primary).
If the transaction aborts, no coordination is required.
If the transaction is ready to commit, coordination is required to update all the other replicas (backups): reliable broadcast primitive.
Read transactions can be executed on backup replicas.
+ no distributed deadlocks
+ no distributed coordination during commit
- throughput of write txs doesn't scale up with the number of nodes
Classic Replication Protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting; State machine replication), 2PC-based]
2PC-based replication
Transactions attempt to atomically acquire locks at all nodes using two phase commit (2PC).
2PC materializes conflicts among concurrent remote transactions, generating DISTRIBUTED DEADLOCKS.
+ good scalability at low conflict
- thrashes at high conflict
Performance comparison
[Figure: throughput (Tx/sec, 0-16000) vs # nodes (4-10) for PB and 2PC under low and high conflict]
Goals
Autonomically select the best suited protocol to:
minimize transactions' service time
maximize achievable throughput
Elastically scale the system size:
scale up if the system needs more computational power
scale down if the system is oversized
Architectural Overview
An Autonomic Manager:
periodically receives statistics from nodes
aggregates statistics and queries an oracle
triggers protocol switches and system scaling
The manager resides on one of the system's nodes.
Communication through a dedicated ISPN cache: listeners on the cache for asynchronous communication.
Full Architecture - 2PC
Full Architecture - PB
Collected Statistics
Business logic cost (local transaction execution time)
Transaction arrival rate
Put operations per transaction
Percentage of write transactions
Transport layer latency
Conflict degree (more on this later)
Key Technical Problem
How to forecast:
the performance of protocol B while running protocol A?
the performance of the system with X nodes while running on Y nodes?
given that replication protocol/scale changes affect:
the transaction conflict probability
the transport layer latency
Methodology
Joint usage of analytical modelling and machine learning techniques:
analytical model of the replication algorithms' dynamics: lock contention, distributed deadlock probability, message exchange pattern
machine learning to forecast the performance of the group communication layer: RTT as a function of msg size, throughput, #nodes
Analytical Model
Average transaction service time estimated through the use of an analytical model
Captures detailed dynamics of the replication algorithms
It is possible to mathematically model them, as the algorithms' behaviour is fully known
Machine learning techniques
Transport layer latency predicted through the use of a statistical model
Resource virtualization makes mathematical modelling unfeasible:
no knowledge of the actual load
no knowledge of the actual physical resources
Analytical Model - Overview
Focused on modelling data contention and the replication protocols' dynamics
Key methodologies: Mean Value Analysis & Queuing theory
PRO: good trade-off between performance (of solvers!) and prediction accuracy
CON: unable to predict distributions, percentiles (useful for instance for SLAs)
Analytical Model - Contention Probability
Contention probability depends on:
Transaction arrival rate: read-only vs update transactions, deadlock induced restarts
Transaction duration: business logic (in absence of contention), waiting time to acquire locks, commit phase (which depends on the replication scheme)
Data access pattern => TRICKY!!! Which items are accessed in what order by transactions
Analytical Model - Data Access Pattern
Typical solutions:
assume a-priori knowledge of access patterns, normally assuming very simple probability distributions of accesses to data
require non-trivial analysis:
offline: unfeasible for evolving systems
online: costly and complex
Hard to deploy in practical settings
Application Contention Factor
Key methodological innovation: Application Contention Factor (ACF)
Captures the probability of intersection between the datasets accessed by two concurrent transactions, independently of:
the concurrency control/replication protocol in use
the number of nodes/threads active in the system
Derived from measurements of the operational system, not computed based on some hypothesized access pattern
ACF: General idea
Assume to be in config s1 (e.g. 2PC, 10 nodes, 3 threads x node) and want to predict performance in config s2 (e.g. PB, 3 nodes, 5 threads x node):
1. While in s1, measure:
a) lock duration, T_hold
b) lock request arrival rate, λ
c) conflict probability between 2 xacts, p_conflict
2. Derive ACF = p_conflict / (T_hold * λ) (hint: locks modelled as M/G/1 queues)
3. The ACF can now be used as input of an analytical model of the replication protocol used in scenario s2
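Numerically, the derivation is a one-liner in each direction. The sketch below uses invented numbers purely to show the arithmetic: the ACF is factored out of s1's measurements, then recombined with s2's lock statistics.

```python
def acf(p_conflict, t_hold, lock_arrival_rate):
    # Locks modelled as M/G/1 queues: p_conflict = ACF * T_hold * lambda,
    # so dividing out T_hold and lambda leaves a term that depends only on
    # the application's data access pattern.
    return p_conflict / (t_hold * lock_arrival_rate)

def predict_conflict(acf_value, t_hold_target, rate_target):
    # Recombine the workload-invariant ACF with the target configuration's
    # (protocol- and scale-dependent) lock duration and arrival rate.
    return acf_value * t_hold_target * rate_target
```

Example: measured in s1, p_conflict = 1%, T_hold = 2 ms, λ = 500 locks/s gives ACF = 0.01; plugging s2's predicted T_hold and λ back in yields s2's conflict probability.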
Abstracting over applications' data access patterns
ACF computed based on the lock probability using PB and 2PC across different workload scenarios
[Figure: Application Contention Factor (0 to 5e-05) vs number of nodes (2-10) for the 100KPR, 100KTPC, 50KPR, 50KTPC, 25KPR, 25KTPC workloads]
Statistical Model - Overview
Based on decision trees
Initial knowledge base gathered using an offline training phase
Possibility to periodically update the statistical model using data collected online
Takes as input a set of metrics gathered in the current system configuration
Outputs the forecast transport layer latency (RTT) for a target configuration
Set of Input Metrics
Number of nodes
RTT in the current configuration
Size of exchanged messages
Throughput in the target configuration: unknown!!! Guessed using the analytical model (more next)
Statistical Model Accuracy
Correlation between 0.96 and 0.98
Relative error between 0.19 and 0.22
Models Coupling
The analytical model forecasts the ISPN throughput taking as input the RTT in the target configuration.
The statistical model forecasts the RTT taking as input the ISPN throughput in the target configuration.
A fixed point solution is found using recursion.
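The coupling can be sketched as a plain fixed-point iteration. Both model functions below are stand-ins with invented coefficients: the real analytical model and decision-tree RTT predictor are far richer, but the alternation between them is exactly this.

```python
def analytical_throughput(rtt_ms):
    # Stand-in for the analytical model: throughput falls as RTT grows.
    return 20_000 / (1.0 + rtt_ms)

def statistical_rtt(throughput):
    # Stand-in for the decision-tree predictor: RTT grows with load.
    return 0.5 + throughput / 10_000

def solve_fixed_point(iters=100):
    rtt = 1.0  # initial guess for the target configuration's RTT
    for _ in range(iters):
        tput = analytical_throughput(rtt)  # AM: RTT -> throughput
        rtt = statistical_rtt(tput)        # ML: throughput -> RTT
    return tput, rtt                       # mutually consistent pair
```

With these toy functions the map is a contraction, so the alternation converges to the unique (throughput, RTT) pair on which both models agree.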
Global Model Accuracy
[Figure: estimated vs measured throughput (Tx/sec, 0-20000) vs # nodes (4-10) for 2PC and PB under low and high conflict]
...and now in action!
[Figure: throughput (tx/sec) over time (sec) as the workload shifts LOW CONFLICT -> HIGH CONFLICT -> LOW CONFLICT and the protocol is switched accordingly]
Future Work
Enhance the accuracy of the analytical model in high contention scenarios
Validate the model also against more complex workloads (TPC-C)
Assess the suitability of the presented techniques also for elastic scaling scenarios
Paolo Romano and Matteo Leonetti
Self-tuning Batching in Total Order Broadcast Protocols via Analytical Modelling and Reinforcement Learning
IEEE International Conference on Computing, Networking and Communications, Network Algorithm & Performance Evaluation Symposium (ICNC'12), Jan. 2012
CASE STUDY 3: BATCHING IN TOTAL ORDER BROADCAST PROTOCOLS
Sequencer based TOB (STOB)
Total order broadcast (TOB) algorithms rely on a special process, the sequencer, to ensure total order:
[Figure: a process TOBcasts message m (message diffusion); the sequencer assigns the total order and broadcasts seq (message ordering)]
Batching in STOB protocols
STOB has theoretically optimal latency: 2 communication steps, independently of the number of processes
But the sequencer becomes the bottleneck at high throughput
Batching at the sequencer process: wait for several msgs and order them altogether:
amortizes the sequencing cost across multiple messages
the optimal waiting time depends on the message arrival rate:
very effective at high throughput
very bad at low throughput!
Analytical model (i)
Model the sequencing process as an M/M/1 queue: simple queuing model, easily solvable
Key equation: time to sequence a batch of b msgs at arrival rate m:
T_seq(b, m) = T_1st + (b-1)/(2m) + T_add * (b-1)
T_1st: time to sequence a batch of size 1
(b-1)/(2m): avg. wait time to build the batch
T_add * (b-1): overhead for the remaining b-1 messages
Analytical model (ii)
Using queuing theory arguments we determine the optimal batching level b* as a function of the arrival rate m: b*(m) = 1 below a load threshold depending on T_add and the sequencing rate; above that threshold, b*(m) is given by a closed-form expression that grows with m (the full formula, in terms of T_add, σ and m, is derived in the ICNC'12 paper)
Determining model params
To use the model one needs to determine two parameters: T_1st & T_add
Determined using a simple benchmark:
1. find the peak throughput w/o batching: m*_{b=1}
2. find the peak throughput at the max batching level: m*_{b=max}
then set: T_1st = 1/m*_{b=1} and T_add = 2/m*_{b=max}
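The calibration and the resulting T_seq model fit in a few lines of Python (function names are mine, and the peak-throughput figures in the example are illustrative):

```python
def fit_params(peak_no_batching, peak_max_batching):
    # Peak throughput without batching reveals the per-batch fixed cost;
    # peak throughput at max batching reveals the marginal per-message cost.
    t_1st = 1.0 / peak_no_batching
    t_add = 2.0 / peak_max_batching
    return t_1st, t_add

def t_seq(b, m, t_1st, t_add):
    # Time to sequence a batch of b messages at arrival rate m:
    # fixed cost + average wait to fill the batch + marginal costs.
    return t_1st + (b - 1) / (2.0 * m) + t_add * (b - 1)
```

For b = 1 the wait and marginal terms vanish and T_seq reduces to T_1st, matching the model's definition.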
Model Accuracy
[Figure: optimal batching value (1-100, log scale) vs average msg arrival rate (0-14000 msgs/sec): exhaustive manual tuning vs analytical model]
The model underestimates the optimal batching value at medium load.
Problem: batching underestimation causes system instability!
Validation with real traffic
[Figure 4: traffic at the FenixEDU system (3 Sept. 2010): avg msg arrival rate (msgs/sec) vs hour of the day]
Peak period analysis
[Figure: msgs/sec and latency (msec) between 16h and 22h]
Ramp-up & ramp-down transition through the problematic areas:
- the ramp-up is sufficiently short: the system struggles, but recovers
- the ramp-down is longer
What about a pure ML approach?
Discretization of the function b* = f(m) that outputs the optimal batching b* given the current message arrival rate m:
m ∈ {10, 100, 1000, 2000, ..., 16000}
b ∈ {1, 2, 4, 8, ..., 256}
Use an instance of UCB for each arrival rate m, having an arm per batching value b: use UCB to determine the most rewarding arm
Pure ML approaches
Problem: ML techniques need to explore different solutions (batching values) to identify the optimal one:
low load: useless additional latency
medium-high load: insufficient batching values lead very rapidly to instability and thrashing
Combining the two approaches
1. Initialize the UCB rewards with the predictions of the analytical model: reduce the frequency of obviously wrong explorations
2. Let UCB update the initial reward values: correct the model's prediction errors exploiting feedback from the system
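The two steps above can be sketched as follows. The reward numbers are hypothetical, and `virtual_plays` is an assumed knob: it controls how strongly the model's prior resists early feedback from the running system.

```python
def bootstrap(counts, means, model_rewards, virtual_plays=5):
    # Step 1: seed each arm's estimate with the analytical model's predicted
    # reward, counted as a few "virtual" plays instead of starting from zero.
    for arm, predicted in enumerate(model_rewards):
        counts[arm] = virtual_plays
        means[arm] = predicted

def update(counts, means, arm, reward):
    # Step 2: standard incremental-mean update; real observations gradually
    # wash out the model's prediction errors.
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```

With 5 virtual plays and a predicted reward of 0.8, a single observed reward of 0.2 pulls the estimate to 0.7: the prior dampens, but does not block, the correction.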
Combining the two approaches
[Figure: latency (msec, log scale) and msgs/sec between 16h and 22h: Model vs Model+RL]
Future work
Focus on elastic scaling, taking into account data grid dynamics: consistency costs, transaction conflicts
Study the effects of self-tuning multiple, mutually dependent layers of the data grid
Better integration with QoS specification APIs