Autonomic Thread Parallelism and Mapping Control for Software Transactional Memory


1 Autonomic Thread Parallelism and Mapping Control for Software Transactional Memory Presented by Naweiluo Zhou PhD Supervisors: Dr. Gwenaël Delaval, Dr. Éric Rutten, Prof. Jean-François Méhaut PhD Defence, October 21, 2016

2 The era of multi-core processors

3 Introduction Single-core processors used to be dominant... Single-core processors: the increase in clock frequency is approaching a physical limit. Multi-core processors: 1 include multiple cores on a single chip 2 give high performance with more cores

4 Introduction Parallel applications enable high performance. Tasks of an application are processed by multiple cores; execution accelerates as tasks are processed in parallel. Transactional memory provides a good platform to deal with tasks in parallel, and is therefore utilised in this thesis. [Figure: an overview of a multi-core processor; two groups of cores (C0-C3 and C4-C7), each sharing an L3 cache, connected to memory]

5 Introduction More cores, higher performance, but... issues of parallel programs. Concept: the parallelism degree is the number of active threads (tn). Synchronisation time, computing time, data access time: a high parallelism degree may reduce computing time but increase synchronisation time; more threads may mean more synchronisation. What is the trade-off between synchronisation and computation? Multi-core processors have a complex memory architecture: how to reduce data access time to memory? The behaviour of an application can vary during its execution: how to control a system at runtime?

6 Introduction Contributions This thesis contributes to: High performance computing Autonomic computing: techniques that can manage systems automatically [Figure: the thesis lies at the intersection of High Performance Computing and Autonomic Computing]

7 Introduction Overview 1 Background on TM and Autonomic Computing Transactional Memory Thread Mapping Autonomic Computing Techniques 2 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview Four Groups of Decision Functions 3 Performance Evaluation 4 Related Work 5 Conclusion and Future Work

8 Background on TM and Autonomic Computing Outline 1 Background on TM and Autonomic Computing Transactional Memory Thread Mapping Autonomic Computing Techniques 2 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview Four Groups of Decision Functions 3 Performance Evaluation 4 Related Work 5 Conclusion and Future Work

9 Background on TM and Autonomic Computing Transactional Memory How to address synchronisation issues: threads need to synchronise. Locks: a conventional way to synchronise, but: deadlocks, vulnerability to failures, faults...; difficult to detect deadlocks (e.g. not always reproducible); hard to analyse interaction among concurrent operations. Transactional Memory: lock-free, easy to implement concurrent operations.

10 Background on TM and Autonomic Computing Transactional Memory Concepts of transactional memory Shared data access is wrapped into transactions. A transaction is an atomic block. Concurrent access is performed inside transactions. Transactions are executed speculatively. Can be implemented in STM (e.g. TinySTM, SwissTM...), HTM and HyTM. [Figure: three threads executing read/write operations inside transactions; a read/write conflict occurs when two transactions access the same data and at least one writes]
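
The idea of an atomic block can be illustrated with a minimal, self-contained sketch. The thesis uses the TinySTM library; the example below instead uses GCC's -fgnu-tm language extension purely to show the concept of speculative, atomic execution of shared accesses, and is not the API used in the evaluated system.

```c
/* Minimal sketch of wrapping shared-data access in a transaction.
 * Uses GCC's transactional-memory language extension (compile with
 * gcc -fgnu-tm -pthread tm_example.c); this illustrates the concept,
 * it is not the TinySTM API used in the thesis. */
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* The atomic block is executed speculatively: on conflict the
         * runtime aborts it, rolls it back and re-executes it. */
        __transaction_atomic {
            shared_counter++;
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("shared_counter = %ld\n", shared_counter);
    return 0;
}
```

The atomic blocks are retried transparently on conflict, which is exactly the commit/abort/rollback behaviour described on the next slide.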

11 Background on TM and Autonomic Computing Transactional Memory Transactional memory: a transaction can either commit or abort, there is no intermediate status. Three concepts: 1 Commit: a transaction succeeds, changes are made to memory 2 Abort: a transaction has a conflict, changes are discarded 3 Rollback: re-execute the aborted transactions [Figure: transaction life cycle — a transaction starts, commits if there is no conflict, and aborts and rolls back on conflict]

12 Background on TM and Autonomic Computing Transactional Memory Transactional memory Example: three threads access data from different memory locations and execute transactions. Case1: thread1 reads object3 and thread2 reads object3 — two reads, no conflict. Case2: thread2 reads object7 and thread3 writes object7 — a read/write conflict. [Figure: three threads issuing read (r) and write (w) operations on shared objects]

13 Background on TM and Autonomic Computing Transactional Memory How to deal with conflicts in TM When conflicts happen, there are many designs to resolve them: abort the other transaction, abort itself... Resolving a conflict: one transaction aborts while the other transaction continues and makes progress. [Figure: two threads in conflict; one transaction aborts and the other proceeds]

14 Background on TM and Autonomic Computing Thread Mapping Multi-core processors also bring complexity. Data access latency differs: data access latency to different levels of memory differs, and data access latency differs from one core to another. How to place threads on cores to better use resources? [Figure: example of access latency — primary memory 1 GB, 60 ns; cache 256 KB, 4 ns; L1 cache 32 KB, 1 ns; CPU registers ~100 bytes, < 1 ns] [Figure: example of a multi-core processor with two L3 caches shared by cores C0-C3 and C4-C7]

15 Background on TM and Autonomic Computing Thread Mapping Thread mapping Concept: assign threads to specific cores. 1 Better use of interconnects: use intra-chip interconnects 2 Reduce invalidation misses: reduce misses caused by two private caches holding the same data and continuously invalidating each other 3 Reduce compulsory misses: caused by competition for the same cache.

16 Background on TM and Autonomic Computing Thread Mapping Thread mapping strategies Thread mapping: assign threads to specific cores. We map threads to different cores to gain better performance. 1 Compact: threads are physically placed on sibling cores 2 Scatter: threads are distributed across different processors 3 Round-Robin: threads share the higher level of cache (e.g. L3) but not the lower ones 4 Linux: default thread scheduling based on priorities [Figure: the four thread mapping strategies (Compact, Round-Robin, Scatter, Linux) on two processors with shared L3 caches; courtesy of Castro [1]]
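
To make the strategies concrete, the sketch below shows how a fixed placement such as Compact or Scatter can be enforced on Linux with pthread_setaffinity_np. The machine layout (8 cores, cores 0-3 on one socket and 4-7 on the other, as in the figure) and the helper names are assumptions used only for illustration.

```c
/* Sketch: pin each thread to one core to realise a mapping strategy.
 * Assumes 8 cores on two sockets, cores 0-3 on socket 0 and 4-7 on
 * socket 1 (as in the figure); Linux-specific, compile with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Compact: fill sibling cores first (0,1,2,3,...).
 * Scatter: alternate sockets (0,4,1,5,...). */
static int compact_core(int tid) { return tid % 8; }
static int scatter_core(int tid) { return (tid % 2) * 4 + (tid / 2) % 4; }

static int pin_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Usage inside each worker thread, e.g. for a Scatter placement:
 *   pin_to_core(pthread_self(), scatter_core(tid));                 */
```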

17 Background on TM and Autonomic Computing Autonomic Computing Techniques Restrictions of HPC systems. Restrictions from TM systems: TM systems are complicated and incorporate numerous tunable parameters (e.g. the parallelism degree); these parameters are usually set offline, and few actions can be taken to adapt them to runtime behaviour. Hardware complexity brings more tunable parameters. Autonomic computing techniques are capable of: monitoring behaviour dynamically and tuning parameters accordingly to improve performance.

18 Background on TM and Autonomic Computing Autonomic Computing Techniques Autonomic computing Autonomic computing techniques enable systems to be self-adaptive. The concept was first proposed by IBM in 2003. Autonomic control systems have at least one of the following properties: Self-optimisation: seek to improve performance & efficiency Self-configuration: a new component learns the system configurations Self-healing: recover from failures Self-protection: defend against attacks

19 Background on TM and Autonomic Computing Autonomic Computing Techniques Autonomic computing Elements of a feedback control loop: 1 Managed element: any software or hardware resource 2 Autonomic manager: a software component (monitor, analyse, plan, execute, knowledge) 3 Sensor: collect information 4 Effector: carry out changes [Figure: a feedback control loop — the autonomic manager (monitor, analyse, plan, execute around shared knowledge) observes the managed elements through sensors and acts on them through effectors]

20 Background on TM and Autonomic Computing Autonomic Computing Techniques Autonomic computing Components of the autonomic manager: 1 Monitor: sampling 2 Analyser 3 Knowledge 4 Plan: use the knowledge of the system to do computation 5 Execute: make changes [Figure: the feedback control loop — monitor, analyse, plan and execute around a knowledge base, connected to the managed elements via sensors and effectors]
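
As a schematic illustration of how these components fit together, here is a minimal MAPE-K loop skeleton in C. All names (knowledge, autonomic_manager, run_loop) are hypothetical; this is only the general shape of such a loop, not the controller implemented in the thesis.

```c
/* Schematic MAPE-K skeleton (hypothetical names): sensors fill the
 * knowledge base, the manager analyses and plans, effectors apply the
 * decisions. */
#include <stdbool.h>

struct knowledge {            /* shared state of the control loop       */
    long commits, aborts;     /* monitored counters (via sensors)       */
    double elapsed_s;         /* monitored time window                  */
    double cr, throughput;    /* analysed metrics                       */
    int tn;                   /* planned parallelism degree             */
    int mapping;              /* planned mapping strategy               */
};

struct autonomic_manager {
    void (*monitor)(struct knowledge *k);       /* sample commits/aborts */
    void (*analyse)(struct knowledge *k);       /* compute CR, throughput*/
    void (*plan)(struct knowledge *k);          /* decide tn and mapping */
    void (*execute)(const struct knowledge *k); /* apply via effectors   */
};

static void run_loop(struct autonomic_manager *m, struct knowledge *k,
                     volatile bool *stop)
{
    while (!*stop) {          /* one iteration per sampling period       */
        m->monitor(k);
        m->analyse(k);
        m->plan(k);
        m->execute(k);
    }
}
```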

21 Background on TM and Autonomic Computing Autonomic Computing Techniques What we want to address [Figure: both thread parallelism and thread mapping can impact, positively or negatively, application performance on multicore processors] Our work considers two factors that can impact performance. Alas... it is difficult to set a parallelism degree offline and difficult to decide thread locations offline, especially for applications with online behaviour variations. Autonomic computing techniques can tackle the above issues.

22 Autonomic Parallelism and Mapping Control Outline 1 Background on TM and Autonomic Computing Transactional Memory Thread Mapping Autonomic Computing Techniques 2 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview Four Groups of Decision Functions 3 Performance Evaluation 4 Related Work 5 Conclusion and Future Work

23 Autonomic Parallelism and Mapping Control Problems of controlling both parallelism and thread mapping 1 Is thread mapping necessary? e.g. when tn < core number, mapping can be performed 2 The order of decisions: parallelism or thread mapping first? 3 If the parallelism degree changes, does its mapping strategy differ? 4 How frequently should the thread mapping strategy change? Performance is lost to thread migration.

24 Autonomic Parallelism and Mapping Control The solutions 1 Predict the parallelism degree first 2 Predict the thread mapping strategy 3 Mapping is re-profiled when the parallelism degree varies and the parallelism degree ≤ half of the core number The reasons 1 The maximum parallelism degree requires no thread mapping (by Castro2012) 2 A high parallelism degree may give unstable behaviour regardless of mapping (based on experimental observation) 3 The influence of parallelism is more significant (based on experimental observation)

25 Autonomic Parallelism and Mapping Control Metrics What we measure 1 parameters: commits, aborts, time 2 commit ratio (CR) = commits/(commits+aborts), throughput = commits/time CR range: detect phase changes; CR fluctuates within the range during the same phase. A high CR means low contention; a high throughput means fast execution.
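
The two metrics are simple enough to state directly in code; the helpers below compute them exactly as defined above (variable names are illustrative, not taken from the thesis implementation).

```c
/* Metrics from the slide: CR = commits / (commits + aborts),
 * throughput = commits / time. */
static double commit_ratio(long commits, long aborts)
{
    long total = commits + aborts;
    return total > 0 ? (double)commits / (double)total : 1.0;
}

static double throughput_tx_per_s(long commits, double elapsed_seconds)
{
    return elapsed_seconds > 0.0 ? (double)commits / elapsed_seconds : 0.0;
}
```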

26 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview The feedback control loop [Figure: overview of the system architecture — sensors feed the parameters we measure to the autonomic manager, which processes the information, makes decisions, and applies control actions through effectors to the managed elements: transactional memory applications running on multicore hardware] [Figure: the structure of the autonomic manager drawn as an automaton — states include predict tn, verify, mapping & CR range profiling, and no thread control; profiling is triggered when ((CR>CR_UP or CR=1) and tn<tn_max) or (CR<CR_LOW and tn>tn_min), and mapping is re-profiled when tn changes and tn<=core_num/2]

27 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview The feedback control loop 1 Control objectives: maximise throughput and reduce application execution time 2 Inputs: commits, aborts, time 3 Outputs: inc/dec parallelism degree, set thread mapping strategy [Figure: the feedback control loop in MAPE-K shape — Monitor: commits, aborts, time; Analyse: CR, throughput, CR range; Plan: mapping, optimum tn; Execute: inc/dec tn, profile flag, mapping; Knowledge: online & offline system info; managed elements: TinySTM benchmarks on multicore hardware]

28 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Four groups of decision functions Decision functions of: 1 Parallelism 2 Thread mapping strategy 3 CR range (phase detection) 4 Thread profile [Figure: the structure of the autonomic manager as an automaton — profiling starts when ((CR>CR_UP or CR=1) and tn<tn_max) or (CR<CR_LOW and tn>tn_min); mapping & CR range are re-evaluated when tn changes and tn<=core_num/2]

29 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Parallelism decision functions: two models Simple model: no action unless a phase change is detected; keep inc/dec tn by one until the throughput decreases; inc tn when CR > cr_up, dec tn when CR < cr_low. Probabilistic model: directly compute the near-optimum parallelism degree.
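
A possible rendering of the simple model's per-period decision is sketched below. The exact bookkeeping (how the previous tn and throughput are remembered) is an assumption; only the inc/dec rules and the stop-when-throughput-drops behaviour come from the slide.

```c
/* Sketch of the simple model: after a phase change, step tn by one in
 * the direction suggested by CR until the throughput stops improving,
 * then fall back to the last good tn. */
static int simple_model_step(double cr, double cr_up, double cr_low,
                             double throughput, double prev_throughput,
                             int tn, int prev_tn, int tn_min, int tn_max)
{
    if (throughput < prev_throughput)
        return prev_tn;                  /* got worse: go back to the old tn */
    if ((cr > cr_up || cr >= 1.0) && tn < tn_max)
        return tn + 1;                   /* low contention: add one thread   */
    if (cr < cr_low && tn > tn_min)
        return tn - 1;                   /* high contention: remove a thread */
    return tn;                           /* inside the CR range: keep tn     */
}
```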

30 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Parallelism decision function: probabilistic model The number of transactions N executed during L_0: N = (L_0 / L) · n = α·n (1), within a fixed period L_0 (N = aborts + commits), assuming the average length of a transaction is L. The probability of a commit: Pr = (1 − p)^(N − N/n) = q^(N − N/n) (2), where p (independent of n) is the probability of a conflict between two txs and n is the parallelism degree.

31 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Parallelism decision function: probabilistic model Pr = (1 − p)^(N − N/n) = q^(N − N/n) (2) Equation (2) relies on two assumptions: the same number of transactions (txs) is executed by each active thread during a fixed period, and enough transactions are executed that the probability of a conflict between two txs approaches a constant.

32 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Parallelism decision function: probabilistic model P(X_i = 1) = q^(N − N/n) = q^(α(n−1)) (3), where X_i = 1 denotes a commit and X_i = 0 an abort. T is a random variable, T = Σ X_i, and T follows a binomial distribution B(N, q^(N − N/n)). Hence the expected value of T is E[T] = N·q^(N − N/n) = α·n·q^(α(n−1)) (4), giving T(n) = α·n·q^(α(n−1)) (5), where T is the throughput, α = L_0/L and CR is the commit ratio.

33 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Parallelism decision function: probabilistic model Setting the derivative of the throughput to zero: T'(n) = α·q^(α(n−1)) + α²·n·q^(α(n−1))·ln(q) = 0 (6). Hence, using q = CR^(1/(α(n−1))) (7), the optimum parallelism degree is n_opt = −(n − 1) / ln(CR) (8), where n_opt is the optimum parallelism degree, n is the active thread number and CR is the commit ratio.
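
Equation (8) only needs the current thread count and the measured commit ratio, so it maps directly to a few lines of code. The sketch below is an illustrative implementation; the guards for CR = 1 or CR = 0 and the clamping to [1, core count] are safeguards added here, not part of the slide's formula.

```c
/* Equation (8): n_opt = -(n - 1) / ln(CR), clamped to a sane range.
 * Link with -lm. */
#include <math.h>

static int optimum_parallelism(int n, double cr, int n_max)
{
    if (cr >= 1.0)            /* no aborts at all: use every core         */
        return n_max;
    if (cr <= 0.0)            /* degenerate sample: fall back to 1 thread */
        return 1;
    double n_opt = -(double)(n - 1) / log(cr);   /* ln(CR) < 0 here       */
    if (n_opt < 1.0)   n_opt = 1.0;
    if (n_opt > n_max) n_opt = (double)n_max;
    return (int)(n_opt + 0.5);                   /* round to nearest      */
}
```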

34 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Two CR range decision functions Simple phase detection algorithm: 1 obtain the optimum tn 2 obtain the CR values when the optimum tn is increased/decreased by one Advanced phase detection algorithm: 1 cr_up = cr_opt + cr_opt·δ 2 cr_low = cr_opt − cr_opt·δ where cr_opt is the CR generated by the optimum tn and mapping, and δ is increased by 1% when a false phase change happens.
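
The advanced range is a symmetric band of relative width δ around cr_opt; the sketch below writes that out (field and function names are illustrative).

```c
/* Advanced phase detection: the CR range is a band of width ±δ around
 * the CR measured with the optimum tn and mapping; δ is widened by one
 * percentage point whenever a false phase change is detected. */
struct cr_range { double up, low, delta; };

static void update_cr_range(struct cr_range *r, double cr_opt)
{
    r->up  = cr_opt + cr_opt * r->delta;
    r->low = cr_opt - cr_opt * r->delta;
}

static void on_false_phase_change(struct cr_range *r, double cr_opt)
{
    r->delta += 0.01;          /* widen the band to avoid re-triggering */
    update_cr_range(r, cr_opt);
}
```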

35 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Thread mapping decision function Conditions for changing the mapping strategy: tn ≤ half the core number and tn has changed. Obtain the optimum mapping: profile each strategy; the optimum strategy is the one that generates the highest throughput. [Figure: the structure of the autonomic manager as an automaton]
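
The mapping decision is essentially an exhaustive profiling pass over the candidate strategies. A minimal sketch, assuming a hypothetical measure_throughput() hook that runs one sampling period under a given strategy:

```c
/* Sketch of the mapping decision: when tn <= core_num/2 and tn has
 * changed, run one sampling period under each candidate strategy and
 * keep the one with the highest throughput.  measure_throughput() is
 * a hypothetical profiling hook, not a real TinySTM call. */
enum mapping { MAP_LINUX, MAP_COMPACT, MAP_SCATTER, MAP_ROUND_ROBIN, MAP_COUNT };

extern double measure_throughput(enum mapping m);   /* hypothetical hook */

static enum mapping best_mapping(int tn, int core_num, enum mapping current)
{
    if (tn > core_num / 2)        /* high parallelism: leave mapping alone */
        return current;

    enum mapping best = current;
    double best_tp = -1.0;
    for (int m = 0; m < MAP_COUNT; m++) {
        double tp = measure_throughput((enum mapping)m);
        if (tp > best_tp) { best_tp = tp; best = (enum mapping)m; }
    }
    return best;
}
```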

36 Autonomic Parallelism and Mapping Control Four Groups of Decision Functions Thread profile decision function Stop thread profiling: is the current throughput lower than the previous throughput? If so, set back the old tn (applied after the thread mapping and CR range decisions). Start thread profiling: does CR fall out of the CR range, i.e. ((CR>CR_UP or CR=1) and tn<tn_max) or (CR<CR_LOW and tn>tn_min)? [Figure: the structure of the autonomic manager as an automaton]
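
One way to express that start/stop logic is a small state holder, sketched below; how the thesis implementation actually stores the last good configuration is not shown on the slide, so this bookkeeping is an assumption.

```c
/* Sketch of the thread-profile decision: profiling starts when CR
 * falls outside the current CR range (a possible phase change) and
 * stops when the throughput no longer improves, reverting to the last
 * good tn. */
struct profile_state { int active; int last_good_tn; double last_throughput; };

static void profile_decision(struct profile_state *s, double cr,
                             double cr_low, double cr_up,
                             double throughput, int *tn)
{
    if (!s->active) {
        if (cr < cr_low || cr > cr_up) {     /* CR left the range: start     */
            s->active = 1;
            s->last_good_tn = *tn;
            s->last_throughput = throughput;
        }
        return;
    }
    if (throughput <= s->last_throughput) {  /* no improvement: stop, revert */
        *tn = s->last_good_tn;
        s->active = 0;
    } else {                                 /* keep the better configuration */
        s->last_good_tn = *tn;
        s->last_throughput = throughput;
    }
}
```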

37 Performance Evaluation Outline 1 Background on TM and Autonomic Computing Transactional Memory Thread Mapping Autonomic Computing Techniques 2 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview Four Groups of Decision Functions 3 Performance Evaluation 4 Related Work 5 Conclusion and Future Work

38 Performance Evaluation Experimental platforms Configuration of the UMA machine: number of cores 24; number of sockets 4; clock 2.66 GHz; cache capacity 3072 KB (each); L3 cache capacity 16 MB (each); DRAM capacity 64 GB. [Figure: the UMA machine — four sockets, each with its own L3 cache, cores C0-C23 connected through a shared memory bus] STM and benchmarks for evaluation STM: TinySTM Benchmarks: EigenBench: artificial but highly configurable; STAMP: generic but less configurable (includes yada, etc.)

39 Performance Evaluation Time comparison with static parallelism Average of 10 executions; each dot is the execution time for static parallelism. Our approaches outperform static parallelism. [Figure: execution time (seconds) versus thread number for 'para' and 'para & mapping', on EigenBench and on yada (from STAMP)]

40 Performance Evaluation Runtime thread number and mapping strategy variation Runtime behaviour of one execution; applications with multiple phases show the necessity of runtime parallelism & mapping adaptation. [Figure: thread number and mapping strategy (Compact, Scatter) over logical time for 'para' and 'para & mapping', on EigenBench (logical time in 1.4K-commit periods) and yada from STAMP (10K-commit periods)]

41 Performance Evaluation How to verify the adaptive approaches Throughputs indicate program progress. The throughputs achieved by the autonomic control approaches are compared with the maximum throughputs at each phase; they approach or exceed the maximum throughputs.

42 Performance Evaluation Online throughput variation Runtime behaviour of one execution; the throughputs from our methods approach or exceed those of static parallelism. [Figure: throughput (ktx/s) over logical time for several static thread counts (4, 8, 16 and 24 threads) versus 'para only' and 'para & mapping', on EigenBench (1.4K-commit periods) and yada from STAMP (10K-commit periods)]

43 Performance Evaluation Discussion Limitations 1 The probabilistic parallelism predictor is based on ideal assumptions; in practice the prediction can be sub-optimal. 2 The optimum mapping strategy can differ over several executions: throughputs are affected by thread migration, and some mapping strategies show similar performance.

44 Related Work Outline 1 Background on TM and Autonomic Computing Transactional Memory Thread Mapping Autonomic Computing Techniques 2 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview Four Groups of Decision Functions 3 Performance Evaluation 4 Related Work 5 Conclusion and Future Work

45 Related Work Related work on parallelism and mapping: static and dynamic adaptation 1 Parallelism adaptation using control techniques (Ansari2008, Rughetti2014): requires an offline training procedure to obtain the predictor; less sensitive to online behaviour changes 2 Thread mapping adaptation (Castro2012, Diener2010): dynamic mapping adaptation using machine learning on STM; offline search for the optimum mapping on non-TM platforms 3 Coordination of parallelism and mapping: no previous work adapts both parallelism & mapping dynamically for TM systems; some studies give insights into offline adaptation for non-TM platforms (Wang2009, Corbalan2000)

46 Conclusion and Future Work Outline 1 Background on TM and Autonomic Computing Transactional Memory Thread Mapping Autonomic Computing Techniques 2 Autonomic Parallelism and Mapping Control Feedback Control Loop Overview Four Groups of Decision Functions 3 Performance Evaluation 4 Related Work 5 Conclusion and Future Work

47 Conclusion and Future Work Conclusion 1 Introduce feedback control loops to dynamically control: the parallelism degree, thread mapping, and the coordination of the above 2 Propose: two models to predict the near-optimum parallelism degree and two phase detection algorithms 3 Compare the performance of static parallelism with: adjusting parallelism only, and adjusting parallelism and thread mapping together

48 Conclusion and Future Work Future Work 1 Thread mapping: new mapping strategies; prediction of mapping strategies with the assistance of compilers 2 From STM to HTM 3 Coordination of multiple feedback control loops 4 Feedback control loops for other parallel platforms (e.g. OpenMP)

49 Publications Publications 1 N. Zhou, G. Delaval, B. Robu, É. Rutten, and J.-F. Méhaut, "Autonomic Parallelism and Thread Mapping Control on Software Transactional Memory", in IEEE International Conference on Autonomic Computing (ICAC), Würzburg, Germany, July 2016. 2 N. Zhou, G. Delaval, B. Robu, É. Rutten, and J.-F. Méhaut, "Control of Autonomic Parallelism Adaptation on Software Transactional Memory", in International Conference on High Performance Computing & Simulation (HPCS), Innsbruck, Austria, July 2016. (Outstanding Paper Nomination) 3 N. Zhou, G. Delaval, B. Robu, É. Rutten, and J.-F. Méhaut, "Autonomic Parallelism Adaptation for Software Transactional Memory", in Conférence d'informatique en Parallélisme, Architecture et Système (COMPAS), Lorient, France, July 2016. 4 N. Zhou, G. Delaval, B. Robu, É. Rutten, and J.-F. Méhaut, "Autonomic Parallelism Adaptation on Software Transactional Memory", Research Report RR-8887, Univ. Grenoble Alpes; INRIA Grenoble, March 2016.
