Revisiting the Sequential Programming Model for Multi-Core
1 Revisiting the Sequential Programming Model for Multi-Core. Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas Jablin, and David I. August. The Liberty Research Group, Princeton University.
3 [Figure. Source: Intel/Wikipedia]
4 [Figure: App and OS layers over an Intel Core2 Duo die photo. Source: Intel]
5 [Figures: App and OS layers over die photos of the SUN Niagara 2 (source: SUN), the AMD Phenom (source: AMD), and the Terascale 80-core chip (source: Intel), with a growing run of question marks alongside each]
6 Parallel Programming Languages vs. Automatic Thread Extraction, compared on five questions: Is it easily debuggable, maintainable, etc.? Is performance retargetable? Programmer-managed speculation? Parallelism hard to extract? Legacy application?
7 CCSP, TLS, DSWP, SpecDSWP. What prevents the automatic extraction of parallelism? Lack of an aggressive compilation framework.
8 Scientific programs vs. general-purpose programs. [Timing diagrams across four cores: Iter 1 to Iter 4 for scientific programs; Iter 1, Iter 2, Iter 3, ... for general-purpose programs.] Iteration-level parallelism is prevented by loop-carried dependences.
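To make the contrast on slide 8 concrete, here is a minimal sketch of the two kinds of loop. The function names, the `node` type, and the choice of a linked-list traversal as the general-purpose example are assumptions made for illustration; they do not come from the slides.

```c
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
struct node {
    int value;
    struct node *next;
};

/* No loop-carried dependence: iteration i reads a[i] and writes out[i],
   so the iterations could be handed to different cores in any order. */
void scale_all(const int *a, int *out, size_t n, int k) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k;
}

/* Loop-carried dependence: each iteration needs the `node` pointer
   produced by the previous iteration before it can begin. */
int sum_list(const struct node *node) {
    int sum = 0;
    while (node != NULL) {
        sum += node->value;
        node = node->next;   /* value carried into the next iteration */
    }
    return sum;
}
```

In `scale_all` every iteration is independent, which is the situation the scientific-programs diagram depicts. In `sum_list` the pointer chase serializes the start of each iteration, so parallelism has to be found inside the loop body instead.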
9 Loop body: A: X++; B: Work(); printf(); C: if (rare) break; D: printf(); [Timing diagram of stages A to D for iterations 1 to 3.] An aggressive compilation framework must parallelize inside of the loop body.
10 Loop body: A: X++; B: Work(); printf(); C: if (rare) misspec; D: printf(); [Timing diagram of stages A to D for iterations 1 to 3.] An aggressive compilation framework must speculate rare or predictable dependences.
11 Loop body: A: X++; B: Work(); C: if (rare) misspec; D: printf(); printf(); [Timing diagram of stages A to D for iterations 1 to 3.] An aggressive compilation framework must schedule dependences to reduce synchronization.
12 Loop body: A: X++; read(); B: Work(); C: if (rare) misspec; D: printf(); printf(); [Timing diagram of stages A to D for iterations 1 to 3.] An aggressive compilation framework must be able to optimize deep into the call tree.
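The step from `break` to `misspec` on slide 10 can be illustrated with a single-threaded simulation. This is only a sketch of the general validate-then-commit idea, not the authors' implementation: the function names, the `chunk` parameter, and the use of a running sum to stand in for Work() are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sequential reference: stop after the first iteration flagged as rare. */
long sum_until_rare(const int *work, const bool *rare, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += work[i];   /* stands in for Work() */
        if (rare[i])
            break;          /* the rare exit, statement C on slide 9 */
    }
    return total;
}

/* Speculative variant, simulated in one thread. Each chunk of iterations is
   executed assuming the rare exit is not taken, then validated. A chunk that
   contains the exit is discarded and re-executed sequentially up to the exit. */
long sum_until_rare_spec(const int *work, const bool *rare, size_t n, size_t chunk) {
    long committed = 0;
    size_t start = 0;
    if (chunk == 0)
        chunk = 1;
    while (start < n) {
        size_t end = (n - start < chunk) ? n : start + chunk;

        /* Speculative phase: these iterations no longer wait on C. */
        long speculative = 0;
        for (size_t i = start; i < end; i++)
            speculative += work[i];

        /* Validation phase: look for a misspeculated exit in the chunk. */
        size_t m = start;
        while (m < end && !rare[m])
            m++;

        if (m < end) {
            /* Misspeculation: drop the speculative result and redo the
               iterations up to and including the one that exits. */
            for (size_t i = start; i <= m; i++)
                committed += work[i];
            return committed;
        }
        committed += speculative;   /* speculation held: commit the chunk */
        start = end;
    }
    return committed;
}
```

Both functions return the same value for any input, which is the point: speculation changes when the work happens, not what the program computes. The cost of a misspeculated chunk is wasted work, so the approach pays off only when the exit really is rare.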
13 DSWP/SpecDSWP vs. DOACROSS/TLS. [Timing diagrams across four cores showing stages A to D of iterations 1 to 4 under each scheme.] An aggressive compilation framework must be able to parallelize loops to efficiently utilize the available cores.
14 DSWP/SpecDSWP vs. DOACROSS/TLS with stalls. [Timing diagrams across four cores with stages C and D merged into CD.] DSWP/SpecDSWP: pipeline fill, then 6 iterations completed. DOACROSS/TLS: repeated stalls, 5 iterations completed.
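The difference between the two schedules on slides 13 and 14 can be explored with a small timing model. The sketch below is a deliberately simplified assumption, not a reproduction of the slides: it uses two cores instead of four, splits each iteration into a stage A that carries the loop dependence (cost `a`) and a stage B that does not (cost `b`), and charges a fixed latency `lat` whenever a value crosses between cores. All names are hypothetical.

```c
/* DOACROSS-style schedule: whole iterations alternate between two cores.
   The dependence from A(i-1) to A(i) crosses cores on every iteration,
   so the latency sits on the critical path each time. */
long doacross_finish_time(int n, long a, long b, long lat) {
    long core_free[2] = {0, 0};   /* time at which each core becomes idle */
    long prev_a_done = 0;         /* finish time of stage A, previous iteration */
    long finish = 0;
    for (int i = 0; i < n; i++) {
        int core = i % 2;
        long start = core_free[core];
        if (i > 0 && prev_a_done + lat > start)
            start = prev_a_done + lat;   /* wait for the forwarded value */
        prev_a_done = start + a;
        core_free[core] = prev_a_done + b;
        finish = core_free[core];
    }
    return finish;
}

/* DSWP-style schedule: core 0 runs every stage A back to back, core 1 runs
   every stage B. Values flow one way, so the latency delays only the start
   of the pipeline rather than every iteration. */
long dswp_finish_time(int n, long a, long b, long lat) {
    long a_done = 0;   /* finish time of the latest stage A on core 0 */
    long b_done = 0;   /* finish time of the latest stage B on core 1 */
    for (int i = 0; i < n; i++) {
        a_done += a;
        long start = a_done + lat;
        if (b_done > start)
            start = b_done;   /* core 1 is still busy with the previous B */
        b_done = start + b;
    }
    return b_done;
}
```

With `a = 1`, `b = 2` and six iterations, this model finishes DOACROSS first when `lat` is 0 and DSWP first when `lat` is 2. That matches the qualitative message of slide 14, where the pipelined schedule pays the delay once as pipeline fill; the absolute numbers are artifacts of the model and do not correspond to the slide's diagram.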
15 Performance Potential. Trace-based simulation with regions of the trace summarized by a single-threaded run on native hardware. What prevents the automatic extraction of parallelism? Lack of an aggressive compilation framework. Sequential programming model.
16 High-level view vs. low-level reality. [Timing diagrams across four cores: Iter 1 to Iter 4 in the high-level view; Iter 1 to Iter 3 in the low-level reality.]
17 Low-level reality. char *memory; void *alloc(int size); void *alloc(int size) { void *ptr = memory; memory = memory + size; return ptr; } [Timing diagram: alloc 1 to alloc 6 executed one after another.] Can't speculate the dependence.
18 Low-level reality, same alloc code. [Timing diagram: alloc 1 to alloc 6.] Can't speculate the dependence. Can't schedule the dependence.
19 Low-level reality, same alloc code. [Timing diagram: alloc 1 to alloc 6.] Can't speculate the dependence. Can't schedule the dependence. Can reorder the dependence.
20 Low-level reality, same alloc code. [Timing diagram: the alloc calls execute out of their original order.] The compiler does not preserve the existing sequential order, but does guarantee the existence of a sequential ordering.
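Slide 20's point is that callers of alloc do not care which call ran first, only that the calls behave as if they ran in some order. One way to see this is a bump allocator whose update is a single atomic step. The sketch below is an assumption made for illustration and is not the mechanism the authors propose: `pool`, `next_free`, and `alloc_atomic` are hypothetical names, and it uses C11 `atomic_fetch_add`.

```c
#include <stdatomic.h>
#include <stddef.h>

static char pool[1024];             /* hypothetical fixed-size backing store */
static atomic_size_t next_free;     /* offset of the first unused byte, starts at 0 */

/* Claim `size` bytes in one atomic step. Calls may run in any order or at the
   same time; every successful call receives a block that overlaps no other.
   Only the pairing of calls to addresses depends on the order. */
void *alloc_atomic(size_t size) {
    size_t offset = atomic_fetch_add(&next_free, size);
    if (offset > sizeof pool || size > sizeof pool - offset)
        return NULL;                /* pool exhausted */
    return pool + offset;
}
```

Swapping two calls changes which address each caller gets, but the program stays correct as long as it never depends on specific addresses. That is the sense in which the dependence through `memory` can be reordered even though it can be neither speculated nor scheduled away. As a simplification, a failed call still advances `next_free`, so this sketch never recovers once the pool is exhausted.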
21 Performance Potential. What prevents the automatic extraction of parallelism? Lack of an aggressive compilation framework. Sequential programming model.
23 The Liberty Research Group