Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory
|
|
- Tobias McKenzie
- 6 years ago
- Views:
Transcription
1 Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory Rei Odaira Takuya Nakaike IBM Research Tokyo
2 Thread-Level Speculation (TLS) [Franklin et al., 92] or Speculative Multithreading (SpMT) Speculatively parallelize a sequential program into a multithreaded program. What is parallelization? To find data-independent tasks from a program. Why speculation? Because a compiler cannot detect every data dependence. Sequential execution Task Task Task TLS execution w/ 3 threads 2
3 Runtime Requirements for TLS With TLS: Compiler finds probably data-independent tasks. Runtime guarantees data independence among tasks. (Minimum) runtime requirements for TLS Data dependence (= conflict) detection among tasks Execution rollback at a conflict Ordered commit of tasks TLS execution w/ 3 threads Conflict Ordered commit 3 Rollback
4 Hardware Transactional Memory (HTM) Coming into the Market Blue Gene/Q zec12 POWER8 4th Generation Core Processor (Haswell) HTM supports Conflict detection among transactions Execution rollback at a conflict HTM satisfies 2/3 of the runtime requirements for TLS! Task = transaction 4
5 Our Goal How well can TLS improve the performance on real HTM hardware? Used Intel 4th Generation Core Processor (Intel TSX). Manually modified and measured SPEC CPU
6 Our True Goal How poorly can TLS improve the performance on real HTM hardware? Because proposed TLS systems had advanced hardware support. E.g. ordered transactions, data forwarding, etc. Blue Gene/Q is the only real system supporting advanced hardware for TLS. Ordered transactions 6
7 Our True Goal How poorly can TLS improve the performance on real HTM hardware? What kind of hardware support should be implemented next in the off-the-shelf HTM? 7
8 Transactional Memory At programming/compile time Enclose critical sections with transaction begin/end operations. At execution time Memory operations within a transaction observed as one step by other threads. Multiple transactions executed in parallel as long as their memory operations do not conflict. xbegin(); a->count++; xend(); Thread X xbegin(); a->count++; xend(); xbegin(); a->count++; xend(); Thread Y xbegin(); a->count++; xend(); xbegin(); b->count++; xend(); 8
9 HTM IBM Research - Tokyo Instruction set (Intel TSX) : Begin a transaction : End a transaction XABORT, etc. Micro-architecture Read and write sets held in CPU caches Conflict detection using CPU cache coherence protocol Conflict detection by cache line granularity Rollback by discarding write set and restoring registers Abort reasons: Read set and write set conflict Read set and write set overflow External interruptions, etc. abort_handler abort_handler: 9
10 TLS for Loops We focus on frequently executed loops. Task = iteration(s) = transaction Why not parallelize function calls? Difficult to implement TLS for function calls on HTM. (Refer to the paper for the details.) Sequential execution Iteration 1 Iteration 2 Iteration 3 TLS execution w/ 3 threads Iteration 1 Iteration 2 Iteration 3 10
11 TLS on HTM Enclose each iteration with and. Re-execute iteration in case of abort. Iteration 1 Iteration 2 Conflict Iteration 3 Iteration 3 re-execution 11
12 Ordered Transactions Must commit in the same order as sequential execution. Because data independence can be guaranteed only after all of the preceding iterations have committed. Iteration 1 Iteration 2 Iteration 3 Commit order inversion 12
13 Ordered Transactions by Software Hardware support by proposed TLS systems Wait until the preceding iterations commit. Software implementation by checking commit order Use a global variable to indicate the next iteration to commit. Abort if cannot commit. Iteration 1 Can commit? Iteration 2 Can commit? Iteration 3 Can commit? Iteration 3 reexecution Can commit? 13
14 Ordered Transactions by Software Hardware support by proposed TLS systems Wait until the preceding iterations commit. Software implementation by checking commit order Use a global variable to indicate the next iteration to commit. Abort if cannot commit. Iteration 1 Can commit? Iteration 2 Why not spin-wait? Refer to our paper. Can commit? Iteration 3 Can commit? Iteration 3 reexecution Can commit? 14
15 Our Goal How poorly can TLS improve the performance on real HTM hardware? What kind of hardware support should be implemented next in the off-the-shelf HTM? Will hardware support for ordered transactions really help? 15
16 False Sharing due to Cache-Line Granularity Conflict Detection double array[]; for (int i = ; i < ; i++) { array[i] = ; } Writes by Thread 1 Writes by Thread 2 Writes by Thread 3 TLS array[] 16 Cache line = 64 bytes on x86
17 Transaction Coarsening to Avoid False Sharing Iteration 1 Iteration 2 Iteration 8 Iteration 9 Iteration 10 Iteration 16 Iteration 17 Iteration 18 Iteration 24 Writes by Thread 3 array[] Writes by Thread 1 Writes by Thread 2 17
18 Benchmarks and Methodology SPEC CPU benchmarks showing more than 1.5-fold speedups with 4 threads in a previous TLS study [Packirisamy et al., 2009] 429.mcf, 433.milc, 456.hmmer, 464.h264ref, 470.lbm, and 482.sphinx3 Manually modified frequently executed loops. Inserted,, and commit order checks. Transformed a target loop into a doubly-nested loop for transaction coarsening Experimental environment Core i processor (4 cores, 2-way SMT) 4-GB memory Linux / GCC
19 Normalized Throughput Results Throughput (1 = sequential) mcf 456.hmmer 470.lbm 433.milc 464.h264ref 482.sphinx3 Higher is better Number of software threads SMT enabled Up to 11% speedups with 2 or 4 threads. But mostly degraded the throughput. 19
20 433.milc Throughput (1 = sequential) Number of software threads Parallel program Loop coverage: 23% Abort ratio (%) Total Overflow Other 0 Order inversion Conflict Number of software threads Commit order inversion is a dominant abort reason. Hardware support for ordered transactions will help. 20
21 Abort Statistics (1/2) mcf milc Abort ratio (%) Abort ratio (%) 100 Total 80 Order inversion 60 Buffer overflow 40 Conflict 20 Other Total Order inv Buffer ov Conflict Other Number of software threads 456.hmmer Number of software threads Abort ratio (%) Number of software threads Total Order inversion Buffer overflow Conflict Other Conflicts were a dominant abort reason in all of the benchmarks except 433.milc. 21
22 Abort Statistics (2/2) h264ref lbm Abort ratio (%) Abort ratio (%) 100 Total 80 Order inversion 60 Buffer overflow 40 Conflict 20 Other Total Order inv Buffer ov Conflict Other sphinx3 Number of software threads Number of software threads Abort ratio (%) Number of software threads Total Order inversion Buffer overflow Conflict Other Conflicts were a dominant abort reason in all of the benchmarks except 433.milc. 22
23 Reasons for Conflicts and Possible Hardware Support Benchmark 429.mcf Conflict reason RAW dependence Possible hardware support Data forwarding 433.milc No 456.hmmer RAW dependence Data forwarding 464.h264ref WAR dependence Multi-version cache 470.lbm 482.sphinx3 WAW dependence (false sharing by prefetching) WAW dependence (false sharing) WAW dependence (false sharing by prefetching) (Fix in prefetcher) Word-level conflict detection (Fix in prefetcher) 23
24 Examples of Read-After-Write Data Dependence 429.mcf static int size; static DATA array[n]; func() { for () { if () { size++; array[size]->field = ; } } } 456.hmmer for (k = 1; k <= M; k++) { dc[k] = dc[k-1] + ; } Hardware support already proposed in TLS literatures. Data forwarding. 24
25 Example of Write-After-Read Data Dependence 464.h264ref for () { line = func(); = line[0]; } static DATA line[n]; DATA *func() { line[0] = ; return line; } Difficult to analyze by a compiler. WAR dependence across different functions in different source files. Multi-version caches needed. 25
26 Conflicts Precede Commit Order Inversion Commit order matters only when most of the transactions reach the committing points. With data dependence, most of the transactions cannot run to the end. Iteration 1 Iteration 2 Conflict Iteration 3 Commit order inversion 26
27 Conflicts due to Prefetching Even with transaction coarsening, conflicts still happened. 464.h264ref and 482.sphinx3. Prefetched adjacent cache lines caused conflicts. Writes by Thread 1 Prefetch Conflict 64 bytes 64 bytes 64 bytes Writes by Thread 2 Prefetch 27
28 Conclusion IBM Research - Tokyo How well can TLS improve the performance on real HTM hardware? Up to 11% speedups with 4 threads in SPEC CPU2006 on 4th Generation Core Processor. But degraded throughput in most cases. What kind of hardware support should be implemented next in the off-the-shelf HTM? Hardware support for ordered transactions will help in parallel programs. However, many programs contain data dependence. Not only ordered transactions, but also other hardware facilities to avoid conflicts should be implemented. (Intel should fix the adjacent cache line prefetcher!) 28
Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory
Thread-Level Speculation on Off-the-Shelf Transactional Memory Rei Odaira and Takuya Nakaike IBM Research - Tokyo Toyosu, Tokyo, Japan {odaira, nakaike}@jp.ibm.com Abstract Thread-level speculation can
More informationEliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory
Eliminating Global Interpreter Locks in Ruby through Hardware Transactional Memory Rei Odaira, Jose G. Castanos and Hisanobu Tomari IBM Research and University of Tokyo April 8, 2014 Rei Odaira, Jose G.
More informationHardware Transactional Memory on Haswell
Hardware Transactional Memory on Haswell Viktor Leis Technische Universität München 1 / 15 Introduction transactional memory is a very elegant programming model transaction { transaction { a = a 10; c
More informationTransactional Memory. How to do multiple things at once. Benjamin Engel Transactional Memory 1 / 28
Transactional Memory or How to do multiple things at once Benjamin Engel Transactional Memory 1 / 28 Transactional Memory: Architectural Support for Lock-Free Data Structures M. Herlihy, J. Eliot, and
More informationInvyswell: A HyTM for Haswell RTM. Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, Maurice Herlihy
Invyswell: A HyTM for Haswell RTM Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, Maurice Herlihy Multicore Performance Scaling u Problem: Locking u Solution: HTM? u IBM BG/Q, zec12,
More informationPerformance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing
Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing Richard Yoo, Christopher Hughes: Intel Labs Konrad Lai, Ravi Rajwar: Intel Architecture Group Agenda
More informationEfficient Architecture Support for Thread-Level Speculation
Efficient Architecture Support for Thread-Level Speculation A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Venkatesan Packirisamy IN PARTIAL FULFILLMENT OF THE
More informationVMM Emulation of Intel Hardware Transactional Memory
VMM Emulation of Intel Hardware Transactional Memory Maciej Swiech, Kyle Hale, Peter Dinda Northwestern University V3VEE Project www.v3vee.org Hobbes Project 1 What will we talk about? We added the capability
More informationSpeculative Synchronization
Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationWork Report: Lessons learned on RTM
Work Report: Lessons learned on RTM Sylvain Genevès IPADS September 5, 2013 Sylvain Genevès Transactionnal Memory in commodity hardware 1 / 25 Topic Context Intel launches Restricted Transactional Memory
More informationPerformance Improvement via Always-Abort HTM
1 Performance Improvement via Always-Abort HTM Joseph Izraelevitz* Lingxiang Xiang Michael L. Scott* *Department of Computer Science University of Rochester {jhi1,scott}@cs.rochester.edu Parallel Computing
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationShengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota
Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)
More informationA Fast Instruction Set Simulator for RISC-V
A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.
More informationEnhancing Real-Time Behaviour of Parallel Applications using Intel TSX
Enhancing Real-Time Behaviour of Parallel Applications using Intel TSX Florian Haas, Stefan Metzlaff, Sebastian Weis, and Theo Ungerer Department of Computer Science, University of Augsburg, Germany January
More informationExploring Speculative Parallelism in SPEC2006
Exploring Speculative Parallelism in SPEC2006 Venkatesan Packirisamy, Antonia Zhai, Wei-Chung Hsu, Pen-Chung Yew and Tin-Fook Ngai University of Minnesota, Minneapolis. Intel Corporation {packve,zhai,hsu,yew}@cs.umn.edu
More informationHTM in the wild. Konrad Lai June 2015
HTM in the wild Konrad Lai June 2015 Industrial Considerations for HTM Provide a clear benefit to customers Improve performance & scalability Ease programmability going forward Improve something common
More informationPerformance Improvement via Always-Abort HTM
1 Performance Improvement via Always-Abort HTM Joseph Izraelevitz* Lingxiang Xiang Michael L. Scott* *Department of Computer Science University of Rochester {jhi1,scott}@cs.rochester.edu Parallel Computing
More informationIntel Transactional Synchronization Extensions (Intel TSX) Linux update. Andi Kleen Intel OTC. Linux Plumbers Sep 2013
Intel Transactional Synchronization Extensions (Intel TSX) Linux update Andi Kleen Intel OTC Linux Plumbers Sep 2013 Elision Elision : the act or an instance of omitting something : omission On blocking
More informationTransactional Memory. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Transactional Memory Companion slides for The by Maurice Herlihy & Nir Shavit Our Vision for the Future In this course, we covered. Best practices New and clever ideas And common-sense observations. 2
More informationYuxi Chen, Shu Wang, Shan Lu, and Karthikeyan Sankaralingam *
Yuxi Chen, Shu Wang, Shan Lu, and Karthikeyan Sankaralingam * * 2 q Synchronization mistakes in multithreaded programs Thread 1 Thread 2 If(ptr){ tmp = *ptr; ptr = NULL; } Segfault q Common q Hard to diagnose
More informationTransactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech
Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationCost of Concurrency in Hybrid Transactional Memory. Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University)
Cost of Concurrency in Hybrid Transactional Memory Trevor Brown (University of Toronto) Srivatsan Ravi (Purdue University) 1 Transactional Memory: a history Hardware TM Software TM Hybrid TM 1993 1995-today
More informationHAFT Hardware-Assisted Fault Tolerance
HAFT Hardware-Assisted Fault Tolerance Dmitrii Kuvaiskii Rasha Faqeh Pramod Bhatotia Christof Fetzer Technische Universität Dresden Pascal Felber Université de Neuchâtel Hardware Errors in the Wild Online
More informationImplementing Transactional Memory in Kernel space
Implementing Transactional Memory in Kernel space Breno Leitão Linux Technology Center leitao@debian.org leitao@freebsd.org Problem Statement Mutual exclusion concurrency control (Lock) Locks type: Semaphore
More informationPOSH: A TLS Compiler that Exploits Program Structure
POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationbool Account::withdraw(int val) { atomic { if(balance > val) { balance = balance val; return true; } else return false; } }
Transac'onal Memory Acknowledgement: Slides in part adopted from: 1. a talk on Intel TSX from Intel Developer's Forum in 2012 2. the companion slides for the book "The Art of Mul'processor Programming"
More informationInstruction Level Parallelism (ILP)
1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationYukinori Sato (JAIST, JST CREST) Yasushi Inoguchi (JAIST) Tadao Nakamura (Keio University)
Whole Program Data Dependence Profiling to Unveil Parallel Regions in the Dynamic Execution Yukinori Sato (JAIST, JST CREST) Yasushi Inoguchi (JAIST) Tadao Nakamura (Keio University) 2012 IEEE International
More informationDMP Deterministic Shared Memory Multiprocessing
DMP Deterministic Shared Memory Multiprocessing University of Washington Joe Devietti, Brandon Lucia, Luis Ceze, Mark Oskin A multithreaded voting machine 2 thread 0 thread 1 while (more_votes) { load
More informationLIMITS OF ILP. B649 Parallel Architectures and Programming
LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationSpeculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois
Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow
More informationUnbounded Transactional Memory
(1) Unbounded Transactional Memory!"#$%&''#()*)+*),#-./'0#(/*)&1+2, #3.*4506#!"#-7/89*75,#!:*.50/#;"#
More informationT-SGX: Eradicating Controlled-Channel
T-SGX: Eradicating Controlled-Channel Attacks Against Enclave Programs Ming-Wei Shih Sangho Lee Taesoo Kim Marcus Peinado Georgia Institute of Technology Microsoft Research 2 3 Intel SGX aims to secure
More informationUnderstanding Hardware Transactional Memory
Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence
More informationSummary: Open Questions:
Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware
More informationFast, precise dynamic checking of types and bounds in C
Fast, precise dynamic checking of types and bounds in C Stephen Kell stephen.kell@cl.cam.ac.uk Computer Laboratory University of Cambridge p.1 Tool wanted if (obj >type == OBJ COMMIT) { if (process commit(walker,
More informationHeuristics for Profile-driven Method- level Speculative Parallelization
Heuristics for Profile-driven Method- level John Whaley and Christos Kozyrakis Stanford University Speculative Multithreading Speculatively parallelize an application Uses speculation to overcome ambiguous
More informationQuantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis
Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis June 27, 2017, ROSS @ Washington, DC Soramichi Akiyama, Takahiro Hirofuchi National Institute of Advanced Industrial Science
More informationMcRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime
McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime B. Saha, A-R. Adl- Tabatabai, R. Hudson, C.C. Minh, B. Hertzberg PPoPP 2006 Introductory TM Sales Pitch Two legs
More informationTopic 22: Multi-Processor Parallelism
Topic 22: Multi-Processor Parallelism COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Review: Parallelism Independent units of work can execute
More informationDynamic Performance Tuning for Speculative Threads
Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art
More informationDatabase Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:
Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive
More informationTopic 22: Multi-Processor Parallelism
Topic 22: Multi-Processor Parallelism COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Review: Parallelism Independent units of work can execute
More informationMultithreaded Value Prediction
Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationSimultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor
Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders Landin, Sherman Yip, Haakan
More informationPotential violations of Serializability: Example 1
CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same
More informationFall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012
18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi
More informationCOSC 243. Computer Architecture 2. Lecture 12 Computer Architecture 2. COSC 243 (Computer Architecture)
COSC 243 1 Overview This Lecture Architectural topics CISC RISC Multi-core processors Source: Chapter 15,16 and 17(10 th edition) 2 Moore s Law Gordon E. Moore Co-founder of Intel April 1965 The complexity
More informationSecurity-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat
Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance
More informationSuperscalar Processors
Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input
More informationTradeoffs in Transactional Memory Virtualization
Tradeoffs in Transactional Memory Virtualization JaeWoong Chung Chi Cao Minh, Austen McDonald, Travis Skare, Hassan Chafi,, Brian D. Carlstrom, Christos Kozyrakis, Kunle Olukotun Computer Systems Lab Stanford
More informationFault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support
Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support Florian Haas 1(B), Sebastian Weis 1, Theo Ungerer 1, Gilles Pokam 2, and Youfeng Wu 2 1 Department of Computer
More informationFootprint-based Locality Analysis
Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.
More informationSoftware-Controlled Multithreading Using Informing Memory Operations
Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University
More informationUniprocessors. HPC Fall 2012 Prof. Robert van Engelen
Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures
More informationOn Improving Transactional Memory: Optimistic Transactional Boosting, Remote Execution, and Hybrid Transactions
On Improving Transactional Memory: Optimistic Transactional Boosting, Remote Execution, and Hybrid Transactions Ahmed Hassan Preliminary Examination Proposal submitted to the Faculty of the Virginia Polytechnic
More informationUNIVERSITY OF MINNESOTA. Shengyue Wang
UNIVERSITY OF MINNESOTA This is to certify that I have examined this copy of a doctoral thesis by Shengyue Wang and have found that it is complete and satisfactory in all respects, and that any and all
More information2 TEST: A Tracer for Extracting Speculative Threads
EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath
More informationS = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)
1 Cache Design You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationEnhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization
Enhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization Anton Kuijsten Andrew S. Tanenbaum Vrije Universiteit Amsterdam 21st USENIX Security Symposium Bellevue,
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationNightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems
NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science
More informationUnderstanding The Effects of Wrong-path Memory References on Processor Performance
Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend
More informationScheduling the Intel Core i7
Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne
More informationDeAliaser: Alias Speculation Using Atomic Region Support
DeAliaser: Alias Speculation Using Atomic Region Support Wonsun Ahn*, Yuelu Duan, Josep Torrellas University of Illinois at Urbana Champaign http://iacoma.cs.illinois.edu Memory Aliasing Prevents Good
More informationCall Paths for Pin Tools
, Xu Liu, and John Mellor-Crummey Department of Computer Science Rice University CGO'14, Orlando, FL February 17, 2014 What is a Call Path? main() A() B() Foo() { x = *ptr;} Chain of function calls that
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationDeterministic Shared Memory Multiprocessing
Deterministic Shared Memory Multiprocessing Luis Ceze, University of Washington joint work with Owen Anderson, Tom Bergan, Joe Devietti, Brandon Lucia, Karin Strauss, Dan Grossman, Mark Oskin. Safe MultiProcessing
More informationA Concurrent Skip List Implementation with RTM and HLE
A Concurrent Skip List Implementation with RTM and HLE Fan Gao May 14, 2014 1 Background Semester Performed: Spring, 2014 Instructor: Maurice Herlihy The main idea of my project is to implement a skip
More informationContinuous Object Access Profiling and Optimizations to Overcome the Memory Wall and Bloat
Continuous Object Access Profiling and Optimizations to Overcome the Memory Wall and Bloat Rei Odaira, Toshio Nakatani IBM Research Tokyo ASPLOS 2012 March 5, 2012 Many Wasteful Objects Hurt Performance.
More informationEvaluation of a Speculative Multithreading Compiler by Characterizing Program Dependences
Evaluation of a Speculative Multithreading Compiler by Characterizing Program Dependences By Anasua Bhowmik Manoj Franklin Indian Institute of Science Univ of Maryland Supported by National Science Foundation,
More informationManaging Resource Limitation of Best-Effort HTM
Managing Resource Limitation of Best-Effort HTM Mohamed Mohamedin, Roberto Palmieri, Ahmed Hassan, Binoy Ravindran Abstract The first release of hardware transactional memory (HTM) as commodity processor
More informationAccelerating Irregular Computations with Hardware Transactional Memory and Active Messages
MACIEJ BESTA, TORSTEN HOEFLER spcl.inf.ethz.ch Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages LARGE-SCALE IRREGULAR GRAPH PROCESSING Becoming more important
More informationOutline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA
CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationGoing Under the Hood with Intel s Next Generation Microarchitecture Codename Haswell
Going Under the Hood with Intel s Next Generation Microarchitecture Codename Haswell Ravi Rajwar Intel Corporation QCon San Francisco Nov 9, 2012 1 What is Haswell? 45nm 32nm 22nm Nehalem Westmere Sandy
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationConcurrent programming: From theory to practice. Concurrent Algorithms 2015 Vasileios Trigonakis
oncurrent programming: From theory to practice oncurrent Algorithms 2015 Vasileios Trigonakis From theory to practice Theoretical (design) Practical (design) Practical (implementation) 2 From theory to
More informationDependence-Aware Transactional Memory for Increased Concurrency. Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel University of Texas, Austin
Dependence-Aware Transactional Memory for Increased Concurrency Hany E. Ramadan, Christopher J. Rossbach, Emmett Witchel University of Texas, Austin Concurrency Conundrum Challenge: CMP ubiquity Parallel
More informationBias Scheduling in Heterogeneous Multi-core Architectures
Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that
More informationMore on Conjunctive Selection Condition and Branch Prediction
More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationMutex Locking versus Hardware Transactional Memory: An Experimental Evaluation
Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation Thesis Defense Master of Science Sean Moore Advisor: Binoy Ravindran Systems Software Research Group Virginia Tech Multiprocessing
More informationCOMP3151/9151 Foundations of Concurrency Lecture 8
1 COMP3151/9151 Foundations of Concurrency Lecture 8 Transactional Memory Liam O Connor CSE, UNSW (and data61) 8 Sept 2017 2 The Problem with Locks Problem Write a procedure to transfer money from one
More informationCS 351 Final Exam Solutions
CS 351 Final Exam Solutions Notes: You must explain your answers to receive partial credit. You will lose points for incorrect extraneous information, even if the answer is otherwise correct. Question
More informationTransactional Memory for C/C++ IBM XL C/C++ for Blue Gene/Q, V12.0 (technology preview)
Transactional Memory for C/C++ IBM XL C/C++ for Blue Gene/Q, V12.0 (technology preiew) ii Transactional Memory for C/C++ Contents Chapter 1. Transactional memory.... 1 Chapter 2. Compiler option reference..
More informationThe Design Complexity of Program Undo Support in a General Purpose Processor. Radu Teodorescu and Josep Torrellas
The Design Complexity of Program Undo Support in a General Purpose Processor Radu Teodorescu and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Processor with program
More information