Dynamic Performance Tuning for Speculative Threads
|
|
- Myles Parks
- 5 years ago
- Views:
Transcription
1 Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of Electronic Computer Engineering University of Minnesota Twin Cities
2 Motivation Multicore processors brought forth large potentials of computing power Intel Core i7 die photo Exploiting these potentials demands thread-level parallelism 2
3 Time Time Time Exploiting Thread-Level Parallelism Sequential Traditional Parallelization Store *p Load *q Store *p p!= q?? Load *q Compiler gives up Thread-Level Speculation (TLS) p!= q p == q Load 20 Load 88 3 Store 88 Parallel execution Store 88 Speculation Failure Potentially more parallelism with speculation But Unpredictable Load 88
4 Addressing Unpredictability State-of-the-art approach: Profiling-directed optimization Compiler analysis A typical compilation framework Profiling run Source code Region selection Profiling info Parallel code generation Speculative code optimization Machine code 4
5 Time Problem #1 0 spawning thread allow commit signal wait load imbalance inter-thread synchronization 3 3 speculation failure Hard to model TLS overhead statically, even with profiling 5
6 Problem #2 Training Input Actual Input Profiling Behavior Actual Behavior Profile-based decision may not be appropriate for actual input 6
7 Metric Problem #3 phase1 phase2 phase3 Time Behavior of the same code changes over time 7
8 Load 0x22 Extra misses Problem #4 Sequential Parallel P P P L1$ L1$ L1$ miss Load 0x22 miss Load 0x22 miss Load 0x22 hit Prefetching Pollution P L1$ P L1$ P L1$ P L1$ Load *p=0x88 Load *p =0x88 Load *p=0x80 Load *p =0x08 How important are cache impacts? 8
9 Normalized Execution Time Impact on Cache Performance A major loop from benchmark art (scanner.c:594) Cache Squash Other load *p=0x X load *p =0x % Cache: p == p Prefetched (+) 0 SEQ TLS-4core Cache impact is difficult to model statically 9
10 Our proposal Proposed Performance Tuning Framework Compiler Analysis Source code Region selection Accurate Timing & Cache Impacts Runtime Information Runtime Analysis Performance Evaluation Adaptations for inputs and phase changes Profiling info Parallel code generation Speculative code optimization Dynamic Execution Performance Profile Table Machine code Compiler Annotation Tuning Policy 10
11 Parallelization Decisions at Runtime Thread spawning End of a parallel execution Performance Profile Table parallelize serialize spawn spawn spawn spawn Parallel Execution Parallel execution Sequential execution Evaluate performance New execution model facilitates dynamic tuning 11
12 Normalized Execution Time Evaluating Performance Impact Runtime Analysis Runtime Information Performance Evaluation Dynamic Execution Performance Profile Table Cache Squash Other 0.4 Tuning Policy 0.2 Simple metric: cost of speculative failure 0 SEQ TLS-4core Comprehensive evaluation: sequential performance prediction 12
13 Sequential Performance Prediction Hardware Counters Busy ifetch ExeStall Cache Performance Impact Cache Programmable Cycle Classifier P Information from ROB P Speculative Parallel TLS overhead Predict ExeStall ifetch Busy Predicted sequential 13
14 Cache Behavior: Scenario #1 TLS T0 load p (unmodified) miss Squash SEQ T0 load p miss Predicted T0 load p hit T0 load p hit Count the miss in squashed execution 14
15 Cache Behavior: Scenario #2 TLS SEQ Predicted load p store p invalidate p T0 load p T0 T0 T0 miss miss load p store p miss load p store p miss*2 Not count a miss if tag matched invalid status 15
16 Tune the performance Runtime Analysis Runtime Information Performance Evaluation Dynamic Execution Performance Profile Table Tuning Policy Exhaustive search Prioritized search 16
17 Performance Summary metric Speedup or not? Improve. in cycles How to prioritize the search Search order: Inside-Out Compiler suggestion Outside-In Performance summary metric: Speedup or not? Improvement in cycles Suboptimal Quantitative Quantitative +StaticHint Simple Inside-Out Search order Compiler suggestion 17
18 Evaluation Infrastructure Benchmarks SPEC2000 written in C, -O3 optimization Underlying architecture 4-core, chip-multiprocessor (CMP) P P P P speculation supported by coherence C C C C Simulator Superscalar with detailed memory model simulates communication latency models bandwidth and contention Interconnect C Detailed, cycle-accurate simulation 18
19 Speedup wrt. SEQ Comparing Tuning Policies Simple Quantitative Quantitative+StaticHint 1.37x 1.23x x Parallel Code Overhead 19
20 Speedup wrt. SEQ Profile-Based vs. Dynamic ProfileBased ProfileBased/ProfileBasedSEQ Quant+StaticHint (Quant+StaticHint)/dynamicSEQ 1.27x 1.25x 1.46x Dynamic outperformed static by 10% (1.37x/1.25x) Potentially go up to 15% (1.46x/1.27x) 20
21 Related works: Region Selection Statically Compiler Heuristics [Vijaykumar Micro 98] Balanced Min-Cut [Johnson PLDI 04] Simple Profiling Pass [Liu PPoPP 06] (POSH) Extensive Profiling [Wang LCPC 05] Dynamically Runtime Loop Detection [Tubella HPCA 98] [Marcuello ICS 98] Saving Power from Useless Respawning [Renau ICS 05] Profile-Based Search [Johnson PPoPP 07] Dynamic Performance Tuning [Luo ISCA 09] (this work) Lack of adaptability Lack of high-level information 21
22 Conclusions Deciding how to extract speculative threads at runtime Quantitatively summarize performance profile Compiler hints avoid suboptimal decisions Compiler and hardware work together. 37% speedup over sequential execution 10% speedup over profile-based decisions Configurable HPMs enable efficient dynamic optimizations Dynamic performance tuning can improve TLS efficiency 22
Supporting Speculative Multithreading on Simultaneous Multithreaded Processors
Supporting Speculative Multithreading on Simultaneous Multithreaded Processors Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu, and Pen-Chung Yew Department of Computer Science, University
More informationPOSH: A TLS Compiler that Exploits Program Structure
POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More informationExploiting Parallelism in Multicore Processors through Dynamic Optimizations
Exploiting Parallelism in Multicore Processors through Dynamic Optimizations A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Yangchun Luo IN PARTIAL FULFILLMENT
More informationUNIVERSITY OF MINNESOTA. Shengyue Wang
UNIVERSITY OF MINNESOTA This is to certify that I have examined this copy of a doctoral thesis by Shengyue Wang and have found that it is complete and satisfactory in all respects, and that any and all
More informationShengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota
Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)
More informationExploring Speculative Parallelism in SPEC2006
Exploring Speculative Parallelism in SPEC2006 Venkatesan Packirisamy, Antonia Zhai, Wei-Chung Hsu, Pen-Chung Yew and Tin-Fook Ngai University of Minnesota, Minneapolis. Intel Corporation {packve,zhai,hsu,yew}@cs.umn.edu
More informationThe Design and Implementation of Heterogeneous Multicore Systems for Energy-efficient Speculative Thread Execution
The Design and Implementation of Heterogeneous Multicore Systems for Energy-efficient Speculative Thread Execution YANGCHUN LUO, Advanced Micro Devices Inc. WEI-CHUNG HSU, National Chiao Tung University,
More informationSimultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor
Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders Landin, Sherman Yip, Haakan
More informationEfficient Architecture Support for Thread-Level Speculation
Efficient Architecture Support for Thread-Level Speculation A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Venkatesan Packirisamy IN PARTIAL FULFILLMENT OF THE
More informationEfficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective
Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective Venkatesan Packirisamy, Yangchun Luo, Wei-Lung Hung, Antonia Zhai, Pen-Chung Yew and Tin-Fook
More information2 TEST: A Tracer for Extracting Speculative Threads
EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath
More informationLecture 13: March 25
CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging
More informationConstructing Virtual Architectures on Tiled Processors. David Wentzlaff Anant Agarwal MIT
Constructing Virtual Architectures on Tiled Processors David Wentzlaff Anant Agarwal MIT 1 Emulators and JITs for Multi-Core David Wentzlaff Anant Agarwal MIT 2 Why Multi-Core? 3 Why Multi-Core? Future
More informationEfficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective. Technical Report
Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective Technical Report Department of Computer Science and Engineering University of Minnesota
More informationLoop Selection for Thread-Level Speculation
Loop Selection for Thread-Level Speculation Shengyue Wang, Xiaoru Dai, Kiran S Yellajyosula Antonia Zhai, Pen-Chung Yew Department of Computer Science University of Minnesota {shengyue, dai, kiran, zhai,
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationA Comparison of Capacity Management Schemes for Shared CMP Caches
A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationOutline. Exploiting Program Parallelism. The Hydra Approach. Data Speculation Support for a Chip Multiprocessor (Hydra CMP) HYDRA
CS 258 Parallel Computer Architecture Data Speculation Support for a Chip Multiprocessor (Hydra CMP) Lance Hammond, Mark Willey and Kunle Olukotun Presented: May 7 th, 2008 Ankit Jain Outline The Hydra
More informationCS533: Speculative Parallelization (Thread-Level Speculation)
CS533: Speculative Parallelization (Thread-Level Speculation) Josep Torrellas University of Illinois in Urbana-Champaign March 5, 2015 Josep Torrellas (UIUC) CS533: Lecture 14 March 5, 2015 1 / 21 Concepts
More informationUsing Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology
Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore
More informationData Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun
Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu A Chip Multiprocessor Implementation
More informationComputer Architecture Spring 2016
omputer Architecture Spring 2016 Lecture 09: Prefetching Shuai Wang Department of omputer Science and Technology Nanjing University Prefetching(1/3) Fetch block ahead of demand Target compulsory, capacity,
More informationLoop Selection for Thread-Level Speculation
Loop Selection for Thread-Level Speculation Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science University of Minnesota {shengyue, dai, kiran, zhai,
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art
More informationSpeculative Parallelization in Decoupled Look-ahead
Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer Engineering University of Rochester, Rochester, NY Motivation Single-thread
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationFall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012
18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationEliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors
Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors Marcelo Cintra and Josep Torrellas University of Edinburgh http://www.dcs.ed.ac.uk/home/mc
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationSEVERAL studies have proposed methods to exploit more
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationPOSH: A TLS Compiler that Exploits Program Structure
POSH: A TLS Compiler that Exploits Program Structure Wei Liu James Tuck Luis Ceze Wonsun Ahn Karin Strauss Jose Renau Josep Torrellas Department of Computer Science Computer Engineering Department University
More informationAchieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation
Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation
More informationChí Cao Minh 28 May 2008
Chí Cao Minh 28 May 2008 Uniprocessor systems hitting limits Design complexity overwhelming Power consumption increasing dramatically Instruction-level parallelism exhausted Solution is multiprocessor
More informationIncreasing the Energy Efficiency of TLS Systems Using Intermediate Checkpointing
Increasing the Energy Efficiency of TLS Systems Using Intermediate Checkpointing Salman Khan School of Computer Science University of Manchester salman.khan@cs.man.ac.uk Nikolas Ioannou School of Informatics
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.
More informationTransactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech
Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:
More informationDual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush
Dual Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush Fredrik Warg and Per Stenstrom Chalmers University of Technology 2006 IEEE. Personal use of this material is permitted. However,
More informationMultiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor
Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Seon Wook Kim, Chong-Liang Ooi, Il Park, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar School
More informationOptimizing Replication, Communication, and Capacity Allocation in CMPs
Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important
More informationCore Fusion: Accommodating Software Diversity in Chip Multiprocessors
Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Authors: Engin Ipek, Meyrem Kırman, Nevin Kırman, and Jose F. Martinez Navreet Virk Dept of Computer & Information Sciences University
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationPerformance, Power, Die Yield. CS301 Prof Szajda
Performance, Power, Die Yield CS301 Prof Szajda Administrative HW #1 assigned w Due Wednesday, 9/3 at 5:00 pm Performance Metrics (How do we compare two machines?) What to Measure? Which airplane has the
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationMulticore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.
CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous
More informationSpeculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois
Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow
More informationPageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on. Blaise-Pascal Tine Sudhakar Yalamanchili
PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on Blaise-Pascal Tine Sudhakar Yalamanchili Outline Background: Memory Security Motivation Proposed Solution Implementation Evaluation Conclusion
More informationComputer Architecture Lecture 24: Memory Scheduling
18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM
More informationEXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System
EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System By Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick
More informationSpeculative Synchronization
Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More information6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models
Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance
More informationDo we need a crystal ball for task migration?
Do we need a crystal ball for task migration? Brandon {Myers,Holt} University of Washington bdmyers@cs.washington.edu 1 Large data sets Data 2 Spread data Data.1 Data.2 Data.3 Data.4 Data.0 Data.1 Data.2
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationEfficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero
Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationMulti-threaded processors. Hung-Wei Tseng x Dean Tullsen
Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to
More informationEXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu
Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points
More informationMultithreaded Value Prediction
Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single
More informationThe Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor
IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More information15-740/ Computer Architecture
15-740/18-740 Computer Architecture Lecture 19: Caching II Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/31/2011 Announcements Milestone II Due November 4, Friday Please talk with us if you
More informationCoherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures
Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures Lluc Alvarez Lluís Vilanova Miquel Moreto Marc Casas Marc Gonzàlez Xavier Martorell Nacho Navarro
More informationThread-Level Speculation on Off-the-Shelf Hardware Transactional Memory
Thread-Level Speculation on Off-the-Shelf Hardware Transactional Memory Rei Odaira Takuya Nakaike IBM Research Tokyo Thread-Level Speculation (TLS) [Franklin et al., 92] or Speculative Multithreading (SpMT)
More information(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing.
(big idea): starting with a multi-core design, we're going to blur the line between multi-threaded and multi-core processing. Intro: CMP with MT cores e.g. POWER5, Niagara 1 & 2, Nehalem Off-chip miss
More informationMultiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor
Purdue University Purdue e-pubs ECE Technical Reports Electrical and Computer Engineering 10-1-2000 Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Seon
More informationComplexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors
Complexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors Yoshimitsu Yanagawa, Luong Dinh Hung, Chitaka Iwama, Niko Demus Barli, Shuichi Sakai and Hidehiko Tanaka Although
More informationPOSH: A Profiler-Enhanced TLS Compiler that Leverages Program Structure
POSH: A Profiler-Enhanced TLS Compiler that Leverages Program Structure Wei Liu, James Tuck, Luis Ceze, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois
More informationReconfigurable and Self-optimizing Multicore Architectures. Presented by: Naveen Sundarraj
Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012 OUTLINE Introduction Motivation Reconfiguration Performance evaluation Reconfiguration Self-optimization
More informationThread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
Thread Tailor Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark Motivation Hardware Trends Put more cores
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationHeuristics for Profile-driven Method- level Speculative Parallelization
Heuristics for Profile-driven Method- level John Whaley and Christos Kozyrakis Stanford University Speculative Multithreading Speculatively parallelize an application Uses speculation to overcome ambiguous
More informationComputer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 9: Multithreading
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015
EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationPortland State University ECE 588/688. IBM Power4 System Microarchitecture
Portland State University ECE 588/688 IBM Power4 System Microarchitecture Copyright by Alaa Alameldeen 2018 IBM Power4 Design Principles SMP optimization Designed for high-throughput multi-tasking environments
More informationBest-Offset Hardware Prefetching
Best-Offset Hardware Prefetching Pierre Michaud March 2016 2 BOP: yet another data prefetcher Contribution: offset prefetcher with new mechanism for setting the prefetch offset dynamically - Improvement
More informationMultiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor
Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Chong-Liang Ooi, Seon Wook Kim, Il Park, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar School
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationWhy Parallel Architecture
Why Parallel Architecture and Programming? Todd C. Mowry 15-418 January 11, 2011 What is Parallel Programming? Software with multiple threads? Multiple threads for: convenience: concurrent programming
More informationPortland State University ECE 587/687. Caches and Memory-Level Parallelism
Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each
More informationLecture 1: Parallel Architecture Intro
Lecture 1: Parallel Architecture Intro Course organization: ~13 lectures based on textbook ~10 lectures on recent papers ~5 lectures on parallel algorithms and multi-thread programming New topics: interconnection
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More information15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11
15-740/18-740 Computer Architecture Lecture 28: Prefetching III and Control Flow Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 Announcements for This Week December 2: Midterm II Comprehensive
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationHandout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
More information