SELF-OPTIMIZING DATA GRIDS
Collaboration Meeting with Optimis, 20-21 Sept. 2011, Rome
Project Goals
Develop an open-source middleware for the Cloud:
1. Providing a simple and intuitive programming model: hide the complexity of distribution, elasticity, and fault-tolerance
2. Minimizing administration and monitoring costs: automate elastic provisioning based on QoS/cost constraints
3. Minimizing operational costs via self-tuning: adapting consistency mechanisms to maximize efficiency
Architecture Overview
[Figure: two-layer architecture. Data Platform: Data Platform APIs (Object Grid Mapper, Search API), Distributed Execution Framework, Reconfigurable Distributed Software Transactional Memory, Reconfigurable Storage System. Autonomic Manager: Workload & QoS Monitor, Workload Analyzer, Adaptation Manager (Data Platform Optimizer, Elastic Scaling Manager); QoS/cost specification API; data platform reconfiguration & tuning; provisioning & QoS negotiation.]
Methodologies explored so far
Analytical modeling: queuing theory, Markov processes, stochastic techniques
Machine learning:
off-line techniques: Decision Trees, Neural Networks, Support Vector Machines
on-line techniques (reinforcement learning): UCB algorithm
Analytical modeling
White box approach: requires detailed knowledge of internal dynamics
Good extrapolation power: allows forecasting system behavior in unexplored regions of its parameter space
Minimal learning time: basically parameter instantiation
Complex and expensive to design/validate
Subject to unavoidable approximation errors
Machine learning
Black box approach: observe inputs, context and outputs of a system; use statistical methods to identify patterns/rules
Good accuracy in already explored regions of the parameter space, but poor extrapolation power
Learning time grows exponentially with the number of features: but eventually outperforms analytical models (typically!)
Hybrid techniques
IDEA: get the best of the two worlds
Two alternative approaches so far:
1. Divide-and-conquer: AM for well-specified sub-components; ML for sub-components that are too complex to model explicitly, or whose internal dynamics are only partially specified
2. Use AM to initialize ML knowledge: reduce the learning time of ML techniques; correct the AM using feedback from the operational system
Self-tuning problems addressed so far
Dynamic selection and switching between replication protocols:
total order based replication protocols (Case study 1): purely based on Machine Learning techniques
Two phase commit vs primary backup (Case study 2): hybrid ML-AM solution, divide et impera
GCS optimization: tuning of batching in total order protocols (Case study 3): hybrid ML-AM, ML bootstrapped with AM knowledge
SELF-TUNING REPLICATION
The search for the holy grail: transactional replication protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order-based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
No one size fits all
Existing solutions are optimized for specific workload/scale scenarios
In dynamic environments where both:
1. the workload characteristics, and
2. the amount of used resources
vary over time, self-tuning is the only way to achieve optimal efficiency
Autonomic adaptation at play
[Figure: adaptation cycle across workload phases; legend: nodes processing read-only requests vs nodes processing read&write requests]
low traffic, read-dominated, low conflict: primary-backup with low #resources: low % writes, low load on the primary, minimum costs
hi traffic, read-dominated, low conflict: auto-scale up: new nodes hired for read-only requests; primary-backup: low % writes, the primary stands the load
hi traffic, write-dominated, low conflict: multi-master: hi % writes overwhelm the primary; higher scalability
low traffic, read-dominated, low conflict: auto-scale down: minimum costs; switch back to primary-backup
Self-optimizing replication
Entails devising solutions to two key issues:
1. Allow coexistence of, and efficient switching among, multiple replication protocols: avoid blocking transaction processing during transitions
2. Determine the optimal replication strategy given the current (or foreseen) workload characteristics: machine learning methods (black box), analytical models (white box), hybrid analytical/statistical approaches (gray box)
Two case studies
1. Certification schemes: NVC vs VC vs BFC [Figure: normalized throughput vs read set size for NVC, BFC, VC] (joint work with M. Couceiro and L. Rodrigues)
2. Single vs multi-master: 2PC vs PB [Figure: throughput (Tx/sec) vs # nodes (4-10) for PB and 2PC under low and high conflict] (joint work with D. Didona, S. Peluso and F. Quaglia)
2nd Workshop on Software Services, Timisoara, Romania, 6 June 2011
Maria Couceiro, Paolo Romano, Luis Rodrigues
PolyCert: Polymorphic Self-Optimizing Replication for In-Memory Transactional Grids
ACM/IFIP/USENIX 12th International Middleware Conference (Middleware 2011)
TOTAL ORDER BASED CERTIFICATION MECHANISMS
Where they fit in the picture
[Taxonomy: Single master (primary-backup) | Multi master: Total order-based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Certification (a.k.a. deferred update)
A transaction is executed entirely at a single replica: good scalability also in write-intensive workloads
No coordination during the transaction execution phase: minimizes traffic
If the transaction is ready to commit, coordination is required:
to ensure serializability
to propagate the updates
Certification
Two transactions may concurrently update the same data at different replicas. Coordination must detect this situation and abort at least one of these transactions.
Three alternatives: Non-voting algorithm, Voting algorithm, BFC
All rely on total order broadcast:
- ensure agreement on the transaction serialization order
- avoid deadlocks
- achieve fault-tolerance
Classic Replication Protocols
Focus on full replication protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Non-voting
The transaction executes locally. When the transaction is ready to commit, the read and write set are sent to all replicas using total order broadcast.
Transactions are certified in total order. A transaction may commit if its read set is still valid (i.e., no other transaction has updated the read set).
Non-voting
[Figure: R1 executes transaction T1, R2 executes transaction T2; TOB of T1's read & write set, TOB of T2's read & write set; R1, R2 and R3 each then run Validation&Commit of T1 and Validation&Abort of T2]
+ only validation executed at all replicas: high scalability with write intensive workloads
- need to send also the read set: often very large!
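The validation step above can be sketched in a few lines of Python. This is a minimal illustration, not PolyCert's actual code: the `Store` and `certify` names are invented, and per-key version counters stand in for whatever validity check the real system uses.

```python
class Store:
    def __init__(self):
        self.versions = {}  # key -> committed version counter

    def version(self, key):
        return self.versions.get(key, 0)

def certify(store, read_set, write_set):
    """Certify a transaction delivered in total order.
    read_set: {key: version observed during execution}; write_set: written keys.
    """
    # Validate: every key read must still be at the version the transaction saw.
    for key, seen in read_set.items():
        if store.version(key) != seen:
            return False  # another committed transaction updated it: abort
    # Apply: bump the version of every written key.
    for key in write_set:
        store.versions[key] = store.version(key) + 1
    return True
```

Since `certify` is deterministic and runs on the same totally ordered stream at every replica, all replicas reach the same commit/abort decision with no extra communication.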
Classic Replication Protocols
Focus on full replication protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Voting
The transaction executes locally at replica R.
When the transaction is ready to commit, only the write set is sent to all replicas using total order broadcast.
Commit requests are processed in total order.
A transaction may commit if its read set is still valid (i.e., no other transaction has updated the read set): only R can certify the transaction!
R sends the outcome of the transaction to all replicas.
Voting
[Figure: R1 executes transaction T1; TOB of T1's write set; T1's validation at R1; reliable broadcast of T1's "vote"; R2 waits for R1's vote]
+ sends only the write set (normally much smaller than the read set)
- additional communication phase to disseminate the decision (vote)
Classic Replication Protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting, BFC; State machine replication), 2PC-based]
Bloom Filter Certification (BFC)
Bloom filters: space-efficient data structure for membership queries
Probabilistic answer to "Is elem contained in BF?"
No false negatives: a "no" answer is always correct
False positives: a "yes" answer may be false
Compression is a function of a (tunable) false positive rate
Bloom Filter Certification (BFC)
Key idea: use a BF to encode the read set and detect intersections with the write sets of concurrent transactions:
False positives: additional (deterministic) aborts
Strongly reduced network traffic: with a 1% false positive rate, up to 30x compression
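A toy Python sketch of the idea. It is illustrative only: the filter size, hash scheme and function names are made up, and a real implementation would derive `m_bits` and `k_hashes` from the target false positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # bit array kept as a single integer

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def bfc_conflicts(readset_bf, write_set):
    # The writeset intersects the encoded readset if any written key
    # *might* be in the filter: false positives cause extra (safe) aborts,
    # false negatives are impossible, so no real conflict is ever missed.
    return any(readset_bf.might_contain(k) for k in write_set)
```

The certification message carries only the filter bits instead of the full read set, which is where the compression comes from.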
BFC vs Voting vs Non-Voting
[Figure: normalized throughput vs read set size (1 to 100,000) for NVC, BFC, VC]
PolyCert: Polymorphic Self-Optimizing Certification
Co-existence of the 3 certification schemes
Machine-learning techniques to determine the optimal certification strategy per transaction
Logic associated with the on-line choice of the replication strategy encapsulated into a generic oracle
Architecture
[Figure: replica i: Transactional Application; Polymorphic Replication Manager (NVC, BFC, VC); Replication Protocol Selector Oracle (Key Value Store, Decision Tree Regressor, SVM Learner)]
Protocol
When the transaction finishes local execution:
Ask the Oracle which protocol to use.
Build a message accordingly.
AB-cast the message.
Protocol
Upon delivery of an AB-cast message:
The message is inserted in a queue with the transactions to be certified.
NVC or BFC: no further processing is done until the message reaches the head of the queue.
VC: if the transaction is local and does not conflict with others in the queue: validate it and send the vote.
Protocol
Transactions are removed from the head of the queue and validated sequentially.
NVC or BFC: each node applies locally the corresponding certification algorithm, validating and applying/discarding the write set.
VC: if the vote has been received, act accordingly. Else:
Remote: wait for the vote.
Local: validate the transaction and send the vote.
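The head-of-queue handling in the last two slides can be condensed into a small dispatcher. This is a sketch with invented names; the real implementation interleaves these decisions with Infinispan's commit machinery.

```python
from collections import deque

def process_head(queue, is_local, vote_received):
    """Decide what to do with the certification message at the queue head.
    queue: deque of messages, each a dict with a 'scheme' key (NVC/BFC/VC).
    is_local / vote_received: predicates supplied by the replica.
    """
    msg = queue[0]
    if msg["scheme"] in ("NVC", "BFC"):
        queue.popleft()
        return "certify-locally"        # every replica runs the certification test
    # VC: only the originating replica can certify the transaction.
    if is_local(msg):
        queue.popleft()
        return "validate-and-send-vote"
    if vote_received(msg):
        queue.popleft()
        return "apply-or-discard"       # act according to the received vote
    return "wait-for-vote"              # remote VC transaction, vote not yet here
```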
Replication Protocol Selector Oracle
Two implementations:
Off-line Machine Learning Techniques
On-line Reinforcement Learning
Off-line Machine Learning Techniques
For each transaction:
Predict the size of the AB message m for the various certification schemes.
Forecast the AB latency for each message size.
We evaluated several ML approaches:
Regression decision trees (best results)
Neural networks
Support vector machines
Regression Decision Trees
Define a set of human-readable rules, where each rule:
identifies a region in the feature space
associates with the region a linear function of the features
Build a piecewise linear approximation of a function of the features
Choose the branching attribute such that the resulting split maximizes the normalized information gain
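To make the "rules + linear models" view concrete, here is what a (completely made-up) two-rule tree predicting AB-cast latency from message size might look like once flattened to Python; a trained tree would have many such regions and learned coefficients.

```python
# Each rule pairs a region of the feature space with a linear model.
# Coefficients below are illustrative, not learned from real data.
RULES = [
    (lambda size: size <= 10_000, lambda size: 0.5 + 0.0001 * size),
    (lambda size: size > 10_000,  lambda size: 1.0 + 0.0004 * size),
]

def predict_latency_ms(size_bytes):
    # Piecewise linear approximation: find the region, apply its model.
    for in_region, linear_model in RULES:
        if in_region(size_bytes):
            return linear_model(size_bytes)
```

The readability is the point: an operator can inspect exactly which feature ranges map to which latency estimates.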
Neural Networks
Inspired by the structure and functional aspects of biological neural networks
Define the weights of connections to minimize the average prediction error across all training data: back-propagation algorithm
Support Vector Machines
As a classifier: identifies the hyperplanes that have the largest distance to the nearest training data points of each class
As a function approximator: identifies the hyperplane that is as close as possible to the set of training data
Off-line Machine Learning Techniques
Use up to 53 monitored system attributes:
CPU
Memory
Network
Time-series
Require a computationally intensive training phase
On-line Reinforcement Learning
Each replica builds on-line expectations on the rewards of each protocol: no assumptions on reward distributions
Solves the exploration-exploitation dilemma: did I test this option sufficiently in this scenario?
UCB
Multi-armed bandit problem:
Each arm of a slot machine is associated with an unknown reward
Each round, one arm is played
Find the strategy that maximizes the average reward
Upper Confidence Bound (UCB):
lightweight and provably optimal solution to the bandit problem
computes the sample average and an upper confidence bound for each arm
the UCB captures the degree of uncertainty on the arm's actual reward
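A minimal sketch of the textbook UCB1 variant (not necessarily the deck's exact implementation): play each arm once, then always play the arm whose sample mean plus confidence bonus is highest.

```python
import math

class UCB1:
    """UCB1 bandit: play the arm maximizing mean + sqrt(2 ln t / n_arm)."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # plays per arm
        self.means = [0.0] * n_arms   # sample average reward per arm
        self.t = 0                    # total rounds played

    def select(self):
        self.t += 1
        for arm in range(len(self.counts)):
            if self.counts[arm] == 0:
                return arm            # play every arm once first
        # The bonus term shrinks as an arm is played more, so rarely
        # played arms are periodically re-explored.
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a]
                       + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.means[arm] += (reward - self.means[arm]) / n  # incremental mean
```

After a short warm-up, the better arm dominates the play counts while the worse arm still gets occasional exploratory plays.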
On-line Reinforcement Learning
Distinguishes workload scenarios solely based on the read set's size: exponential discretization intervals to minimize training time
Replicas exchange statistical information periodically to boost learning
Results - Bank Benchmark
[Figure: normalized throughput vs read set size (1 to 100,000) for NVC, BFC, VC, and the PolyCert variants]
Results - Bank Benchmark
Evolution of Throughput
[Figure: throughput (# commits/second) vs time (seconds) as PolyCert adapts]
Results - STMBench7
[Figure: normalized throughput on STMBench7 for the individual certification schemes and PolyCert]
Diego Didona, Sebastiano Peluso, Paolo Romano, Francesco Quaglia, Luis Rodrigues
CASE STUDY 2: PRIMARY-BACKUP <=> 2PC
Classic Replication Protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting; State machine replication), 2PC-based]
Single Master
Write transactions are executed entirely at a single replica (the primary).
If the transaction aborts, no coordination is required.
If the transaction is ready to commit, coordination is required to update all the other replicas (backups): reliable broadcast primitive.
Read transactions can be executed on backup replicas.
+ no distributed deadlocks
+ no distributed coordination during commit
- throughput of write txs doesn't scale up with the number of nodes
Classic Replication Protocols
[Taxonomy: Single master (primary-backup) | Multi master: Total order based (Certification: Non-voting, Voting; State machine replication), 2PC-based]
2PC-based replication
Transactions attempt to atomically acquire locks at all nodes using two phase commit (2PC).
2PC materializes conflicts among concurrent remote transactions, generating DISTRIBUTED DEADLOCKS.
+ good scalability at low conflict
- thrashes at high conflict
Performance comparison
[Figure: throughput (Tx/sec, 0-16000) vs # nodes (4-10) for PB and 2PC under low and high conflict]
Goals
Autonomically select the best suited protocol to:
minimize transactions' service time
maximize achievable throughput
Elastically scale the system size:
scale up if the system needs more computational power
scale down if the system is oversized
Architectural Overview
An Autonomic Manager:
periodically receives statistics from nodes
aggregates statistics and queries an oracle
triggers protocol switches and system scaling
The manager resides on one of the system's nodes.
Communication through a dedicated ISPN cache: listeners on the cache for asynchronous communication.
Full Architecture - 2PC
Full Architecture - PB
Collected Statistics
Business logic cost (local transaction execution time)
Transaction arrival rate
Put operations per transaction
Percentage of write transactions
Transport layer latency
Conflict degree (more on this later)
Key Technical Problem
How to forecast:
the performance of protocol B while running protocol A?
the performance of the system with X nodes while running on Y nodes?
given that replication protocol/scale changes affect:
the transaction conflict probability
the transport layer latency
Methodology
Joint usage of analytical modelling and machine learning techniques:
analytical model of the replication algorithms' dynamics: lock contention, distributed deadlock probability, message exchange pattern
machine learning to forecast the performance of the group communication layer: RTT as a function of msg size, throughput, #nodes
Analytical Model
Average transaction service time estimated through the use of an analytical model
Captures detailed dynamics of the replication algorithms
It is possible to mathematically model them, as the algorithms' behaviour is fully known
Machine learning techniques
Transport layer latency predicted through the use of a statistical model
Resource virtualization makes mathematical modelling unfeasible:
no knowledge of the actual load
no knowledge of the actual physical resources
Analytical Model - Overview
Focused on modelling data contention and the replication protocols' dynamics
Key methodologies: Mean Value Analysis & Queuing theory
PRO: good trade-off between performance (of solvers!) and prediction accuracy
CON: unable to predict distributions, percentiles (useful for instance for SLAs)
Analytical Model - Contention Probability
Contention probability depends on:
Transaction arrival rate: read-only vs update transactions, deadlock induced restarts
Transaction duration: business logic (in absence of contention), waiting time to acquire locks, commit phase (which depends on the replication scheme)
Data access pattern => TRICKY!!! Which items are accessed in what order by transactions
Analytical Model - Data Access Pattern
Typical solutions:
assume a-priori knowledge of access patterns, normally assuming very simple probability distributions of accesses to data
require non-trivial analysis:
offline: unfeasible for evolving systems
online: costly and complex
Hard to deploy in practical settings
Application Contention Factor
Key methodological innovation: Application Contention Factor (ACF)
Captures the probability of intersection between the datasets accessed by two concurrent transactions, independently of:
the concurrency control/replication protocol in use
the number of nodes/threads active in the system
Derived from measurements of the operational system, not computed based on some hypothesized access pattern
ACF: General idea
Assume to be in config s1 (e.g. 2PC, 10 nodes, 3 threads x node) and want to predict performance in config s2 (e.g. PB, 3 nodes, 5 threads x node):
1. While in s1, measure:
a) lock duration, T_hold
b) lock request arrival rate, λ
c) conflict probability between 2 xacts, p_conflict
2. Derive ACF = p_conflict / (T_hold * λ) (hint: locks modelled as M/G/1 queues)
3. The ACF can now be used as input of an analytical model of the replication protocol used in scenario s2
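Numerically, the derivation is a one-liner in each direction. The sketch below uses invented numbers purely to show the arithmetic: the ACF is factored out of s1's measurements, then recombined with s2's lock statistics.

```python
def acf(p_conflict, t_hold, lock_arrival_rate):
    # Locks modelled as M/G/1 queues: p_conflict = ACF * T_hold * lambda,
    # so dividing out T_hold and lambda leaves a term that depends only on
    # the application's data access pattern.
    return p_conflict / (t_hold * lock_arrival_rate)

def predict_conflict(acf_value, t_hold_target, rate_target):
    # Recombine the workload-invariant ACF with the target configuration's
    # (protocol- and scale-dependent) lock duration and arrival rate.
    return acf_value * t_hold_target * rate_target
```

Example: measured in s1, p_conflict = 1%, T_hold = 2 ms, λ = 500 locks/s gives ACF = 0.01; plugging s2's predicted T_hold and λ back in yields s2's conflict probability.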
Abstracting over applications' data access patterns
ACF computed based on the lock probability using PB and 2PC across different workload scenarios
[Figure: Application Contention Factor (0 to 5e-05) vs number of nodes (2-10) for the 100KPR, 100KTPC, 50KPR, 50KTPC, 25KPR, 25KTPC workloads]
Statistical Model - Overview
Based on decision trees
Initial knowledge base gathered using an offline training phase
Possibility to periodically update the statistical model using data collected online
Takes as input a set of metrics gathered in the current system configuration
Outputs the forecast transport layer latency (RTT) for a target configuration
Set of Input Metrics
Number of nodes
RTT in the current configuration
Size of exchanged messages
Throughput in the target configuration: unknown!!! Guessed using the analytical model (more next)
Statistical Model Accuracy
Correlation between 0.96 and 0.98
Relative error between 0.19 and 0.22
Models Coupling
The analytical model forecasts the ISPN throughput taking as input the RTT in the target configuration.
The statistical model forecasts the RTT taking as input the ISPN throughput in the target configuration.
A fixed point solution is found using recursion.
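The coupling can be sketched as a plain fixed-point iteration. Both model functions below are stand-ins with invented coefficients: the real analytical model and decision-tree RTT predictor are far richer, but the alternation between them is exactly this.

```python
def analytical_throughput(rtt_ms):
    # Stand-in for the analytical model: throughput falls as RTT grows.
    return 20_000 / (1.0 + rtt_ms)

def statistical_rtt(throughput):
    # Stand-in for the decision-tree predictor: RTT grows with load.
    return 0.5 + throughput / 10_000

def solve_fixed_point(iters=100):
    rtt = 1.0  # initial guess for the target configuration's RTT
    for _ in range(iters):
        tput = analytical_throughput(rtt)  # AM: RTT -> throughput
        rtt = statistical_rtt(tput)        # ML: throughput -> RTT
    return tput, rtt                       # mutually consistent pair
```

With these toy functions the map is a contraction, so the alternation converges to the unique (throughput, RTT) pair on which both models agree.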
Global Model Accuracy
[Figure: estimated vs measured throughput (Tx/sec, 0-20000) vs # nodes (4-10) for 2PC and PB under low and high conflict]
...and now in action!
[Figure: throughput (tx/sec) over time (sec) as the workload shifts LOW CONFLICT -> HIGH CONFLICT -> LOW CONFLICT and the protocol is switched accordingly]
Future Work
Enhance the accuracy of the analytical model in high contention scenarios
Validate the model also against more complex workloads (TPC-C)
Assess the suitability of the presented techniques also for elastic scaling scenarios
Paolo Romano and Matteo Leonetti
Self-tuning Batching in Total Order Broadcast Protocols via Analytical Modelling and Reinforcement Learning
IEEE International Conference on Computing, Networking and Communications, Network Algorithm & Performance Evaluation Symposium (ICNC'12), Jan. 2012
CASE STUDY 3: BATCHING IN TOTAL ORDER BROADCAST PROTOCOLS
Sequencer based TOB (STOB)
Total order broadcast (TOB) algorithms rely on a special process, the sequencer, to ensure total order:
[Figure: a process TOBcasts message m (message diffusion); the sequencer assigns the total order and broadcasts seq (message ordering)]
Batching in STOB protocols
STOB has theoretically optimal latency: 2 communication steps, independently of the number of processes
But the sequencer becomes the bottleneck at high throughput
Batching at the sequencer process: wait for several msgs and order them altogether:
amortizes the sequencing cost across multiple messages
the optimal waiting time depends on the message arrival rate:
very effective at high throughput
very bad at low throughput!
Analytical model (i)
Model the sequencing process as an M/M/1 queue: simple queuing model, easily solvable
Key equation: time to sequence a batch of b msgs at arrival rate m:
T_seq(b, m) = T_1st + (b-1)/(2m) + T_add * (b-1)
T_1st: time to sequence a batch of size 1
(b-1)/(2m): avg. wait time to build the batch
T_add * (b-1): overhead for the remaining b-1 messages
Analytical model (ii)
Using queuing theory arguments we determine the optimal batching level b* as a function of the arrival rate m: b*(m) = 1 below a load threshold depending on T_add and the sequencing rate; above that threshold, b*(m) is given by a closed-form expression that grows with m (the full formula, in terms of T_add, σ and m, is derived in the ICNC'12 paper)
Determining model params
To use the model one needs to determine two parameters: T_1st & T_add
Determined using a simple benchmark:
1. find the peak throughput w/o batching: m*_{b=1}
2. find the peak throughput at the max batching level: m*_{b=max}
then set: T_1st = 1/m*_{b=1} and T_add = 2/m*_{b=max}
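The calibration and the resulting T_seq model fit in a few lines of Python (function names are mine, and the peak-throughput figures in the example are illustrative):

```python
def fit_params(peak_no_batching, peak_max_batching):
    # Peak throughput without batching reveals the per-batch fixed cost;
    # peak throughput at max batching reveals the marginal per-message cost.
    t_1st = 1.0 / peak_no_batching
    t_add = 2.0 / peak_max_batching
    return t_1st, t_add

def t_seq(b, m, t_1st, t_add):
    # Time to sequence a batch of b messages at arrival rate m:
    # fixed cost + average wait to fill the batch + marginal costs.
    return t_1st + (b - 1) / (2.0 * m) + t_add * (b - 1)
```

For b = 1 the wait and marginal terms vanish and T_seq reduces to T_1st, matching the model's definition.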
Model Accuracy
[Figure: optimal batching value (1-100, log scale) vs average msg arrival rate (0-14000 msgs/sec): exhaustive manual tuning vs analytical model]
The model underestimates the optimal batching value at medium load.
Problem: batching underestimation causes system instability!
Validation with real traffic
[Figure 4: traffic at the FenixEDU system (3 Sept. 2010): avg msg arrival rate (msgs/sec) vs hour of the day]
Peak period analysis
[Figure: msgs/sec and latency (msec) between 16h and 22h]
Ramp-up & ramp-down transition through the problematic areas:
- the ramp-up is sufficiently short: the system struggles, but recovers
- the ramp-down is longer
What about a pure ML approach?
Discretization of the function b* = f(m) that outputs the optimal batching b* given the current message arrival rate m:
m ∈ {10, 100, 1000, 2000, ..., 16000}
b ∈ {1, 2, 4, 8, ..., 256}
Use an instance of UCB for each arrival rate m, having an arm per batching value b: use UCB to determine the most rewarding arm
Pure ML approaches
Problem: ML techniques need to explore different solutions (batching values) to identify the optimal one:
low load: useless additional latency
medium-high load: insufficient batching values lead very rapidly to instability and thrashing
Combining the two approaches
1. Initialize the UCB rewards with the predictions of the analytical model: reduce the frequency of obviously wrong explorations
2. Let UCB update the initial reward values: correct the model's prediction errors exploiting feedback from the system
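The two steps above can be sketched as follows. The reward numbers are hypothetical, and `virtual_plays` is an assumed knob: it controls how strongly the model's prior resists early feedback from the running system.

```python
def bootstrap(counts, means, model_rewards, virtual_plays=5):
    # Step 1: seed each arm's estimate with the analytical model's predicted
    # reward, counted as a few "virtual" plays instead of starting from zero.
    for arm, predicted in enumerate(model_rewards):
        counts[arm] = virtual_plays
        means[arm] = predicted

def update(counts, means, arm, reward):
    # Step 2: standard incremental-mean update; real observations gradually
    # wash out the model's prediction errors.
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```

With 5 virtual plays and a predicted reward of 0.8, a single observed reward of 0.2 pulls the estimate to 0.7: the prior dampens, but does not block, the correction.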
Combining the two approaches
[Figure: latency (msec, log scale) and msgs/sec between 16h and 22h: Model vs Model+RL]
Future work
Focus on elastic scaling, taking into account data grid dynamics: consistency costs, transaction conflicts
Study the effects of self-tuning multiple, mutually dependent layers of the data grid
Better integration with QoS specification APIs