Breaking Cyclic-Multithreading Parallelization with XML Parsing. Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu-Yeon Wei, David Brooks
Transcription
1 Breaking Cyclic-Multithreading Parallelization with XML Parsing. Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu-Yeon Wei, David Brooks
5 Scope: Today's commodity platforms include multiple cores. Use multiple cores for a single program. Distribute loop iterations among cores, a.k.a. Cyclic-Multithreading (CMT).
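A minimal sketch of the cyclic distribution named on this slide, assuming the loop iterations are independent of each other. The function name run_cyclic and its parameters are hypothetical, and Python threads are used only to show which worker gets which iteration, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cyclic(body, num_iterations, num_cores):
    """Run body(i) for every i, giving iteration i to worker i % num_cores."""

    def worker(core_id):
        # Each worker walks its own residue class:
        # core_id, core_id + num_cores, core_id + 2 * num_cores, ...
        return [(i, body(i)) for i in range(core_id, num_iterations, num_cores)]

    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        per_core = list(pool.map(worker, range(num_cores)))

    # Put the results back in iteration order.
    results = [None] * num_iterations
    for chunk in per_core:
        for i, value in chunk:
            results[i] = value
    return results


squares = run_cyclic(lambda i: i * i, num_iterations=10, num_cores=4)
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

With 4 workers, worker 0 runs iterations 0, 4 and 8, worker 1 runs 1, 5 and 9, and so on.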
8 Cyclic-Multithreading (CMT): This talk is about the limits of CMT. HELIX is a re-evaluation of CMT for today's multicores.
9 Outline: The HELIX research project. The XML library. Limits of Cyclic-Multithreading parallelizations.
11 Team of the HELIX Project
12 Project Goal
20 HELIX Performance. Columns: Benchmark, LOC, HELIX-RC Speedup. CFP: mesa 42, art 1, equake 1, ammp 9,805. CINT: gzip 5, vpr 11, mcf 1, parser 7, bzip2 3, twolf 17,875. libxml2 170,893.
29 The HELIX Execution Model: Iterations grouped on modular value. Cores organized as a ring. TLP extracted between loop iterations.
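The ring organization can be sketched as follows, under an assumption the slide does not spell out: the loop body is split into a part that is independent across iterations and a part that carries a dependence from one iteration to the next. The names run_ring, parallel_part and sequential_part are hypothetical, and semaphores stand in for whatever core-to-core signalling the real system uses.

```python
import threading


def run_ring(num_iterations, num_cores, parallel_part, sequential_part):
    """Core c runs iterations i with i % num_cores == c; cores form a ring."""
    # tokens[c] is released when core c may enter the loop-carried part.
    tokens = [threading.Semaphore(0) for _ in range(num_cores)]
    tokens[0].release()  # iteration 0 runs on core 0 and may start at once

    def core(core_id):
        successor = (core_id + 1) % num_cores
        for i in range(core_id, num_iterations, num_cores):
            local = parallel_part(i)     # may overlap with other cores
            tokens[core_id].acquire()    # wait until iteration i - 1 is done
            sequential_part(i, local)    # executes in iteration order
            tokens[successor].release()  # let iteration i + 1 proceed

    threads = [threading.Thread(target=core, args=(c,)) for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


order = []
total = [0]


def accumulate(i, local):
    # Loop-carried state: a running sum and the order of execution.
    total[0] += local
    order.append(i)


run_ring(8, 4, lambda i: i * i, accumulate)
print(order)     # [0, 1, 2, 3, 4, 5, 6, 7]
print(total[0])  # 140
```

Only the dependent part is serialized. The independent part of iteration i + 1 can run while iteration i is still inside its dependent part, which is where the thread-level parallelism between iterations comes from in this sketch.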
37 Core-to-Core Communication. Status of HELIX: Static code generation. Number of cores decided at compile time. Communication: the main obstacle. Today's multicore: Intel Hyper-Threading [CGO 2012, IEEE Micro 2012, DAC 2012]. Enhanced multicore: [ISCA 2014].
40 Algorithm
41 Algorithm: Nested Tree Nodes
42 Algorithm: Single Element Analysis
43 Algorithm: CMT Opportunity
48 Evaluation. Architecture: Conventional multicore. Ring cache [ISCA 2014]. 4 cores (Intel Atom-like). Compiler: HELIX compiler HCCv3 [ISCA 2014]. Simulator: IRSim, IR-based simulator [ISCA 2014].
49 Limits of HELIX
51 Limits of HELIX (2)
52 Limits of CMT. Oracle: Control and data dependences. Invariant variables. Function pointers.
59 Multiple CMT: Beyond the Single Loop Parallelism. Goal: Is there any hope in looking at parallelism among loops? Execution Model: Any iteration of any loop can be executed by any core. Data and control dependences properly satisfied. Constraint: No parallelism for recursive loops. Idealization: No communication cost. No dispatching cost. No cost to switch loop iteration.
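One way to reason about such an idealized model, offered as an illustration and not as the methodology used in the talk: if communication, dispatching and switching are free, and enough cores are available, the best possible speedup is total work divided by the longest chain of dependent work. The function ideal_speedup, the cost and deps dictionaries, and the example graphs below are all hypothetical.

```python
from functools import lru_cache


def ideal_speedup(cost, deps):
    """Upper bound on speedup: total work / longest dependence chain.

    cost maps each unit of work (e.g. one loop iteration) to its duration.
    deps maps a unit to the units that must finish before it can start.
    """

    @lru_cache(maxsize=None)
    def finish(node):
        # Earliest finish time when every dependence is satisfied at no cost.
        return cost[node] + max((finish(d) for d in deps.get(node, ())), default=0)

    work = sum(cost.values())
    span = max(finish(node) for node in cost)
    return work / span


cost = {"setup": 1, "a": 1, "b": 1, "c": 1, "d": 1}

# Four iterations that depend only on the setup step.
independent = {n: ["setup"] for n in "abcd"}
print(ideal_speedup(cost, independent))  # 2.5

# The same work, but each iteration depends on the previous one.
chained = {"a": ["setup"], "b": ["a"], "c": ["b"], "d": ["c"]}
print(ideal_speedup(cost, chained))  # 1.0
```

The bound ignores the number of cores, so it is optimistic for a small machine, and it depends entirely on which dependences actually occur at run time.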
65 Opportunity of MCMT. Static DDG: no hope. Dynamic DDG: great potential for parsing flat trees. Nested trees: require parallelism among same-loop invocations.
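The flat versus nested distinction can be made concrete with a small sketch. It assumes a hypothetical recursive parser whose per-node loop over children is the same loop at every depth, and it measures only the shape of the input with Python's standard xml.etree.ElementTree module. It does not model libxml2's actual loops, and the helper name top_level_share is made up for illustration.

```python
import xml.etree.ElementTree as ET


def top_level_share(xml_text):
    """Return (children of the root, all descendants of the root)."""
    root = ET.fromstring(xml_text)
    total = sum(1 for _ in root.iter()) - 1  # iter() also yields the root
    top = len(root)                          # direct children only
    return top, total


flat = "<r><a/><b/><c/><d/></r>"
nested = "<r><a><b><c><d/></c></b></a></r>"

print(top_level_share(flat))    # (4, 4)
print(top_level_share(nested))  # (1, 4)
```

In the flat document every element is an iteration of the outermost child loop. In the nested document the outermost loop has a single iteration, and the remaining elements are reached only through recursive invocations of that same loop.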
66 Algorithm: CMT Opportunity
69 Conclusion. CMT: High performance if most of the time is spent in natural loops. Libxml highlights limits of CMT-based approaches. Multiple CMT: There is parallelism among multiple loops. Dynamic analyses and/or code transformations are necessary. References: HELIX project
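The first CMT point can be read through Amdahl's law, added here as a standard reference formula and not as something stated on the slide. Let f be the fraction of sequential execution time spent in the parallelized loops and n the number of cores, and assume those loops scale perfectly, which is an idealization. The values f = 0.9 and f = 0.5 below are illustrative and are not measurements from the talk.

```latex
S(n) = \frac{1}{(1 - f) + \frac{f}{n}}
\qquad
S(4)\big|_{f = 0.9} = \frac{1}{0.1 + 0.225} \approx 3.08
\qquad
S(4)\big|_{f = 0.5} = \frac{1}{0.5 + 0.125} = 1.6
```

On 4 cores, a program that spends half of its time outside the parallelized loops cannot exceed a 1.6x speedup under this model, however well the loops themselves scale.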
70 Thanks for your attention! Questions?
Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Jian Chen, Nidhi Nayyar and Lizy K. John Department of Electrical and Computer Engineering The
More informationProceedings of the 2nd Workshop on Industrial Experiences with Systems Software
USENIX Association Proceedings of the 2nd Workshop on Industrial Experiences with Systems Software Boston, Massachusetts, USA December 8, 2002 THE ADVANCED COMPUTING SYSTEMS ASSOCIATION 2002 by The USENIX
More informationCache Optimization by Fully-Replacement Policy
American Journal of Embedded Systems and Applications 2016; 4(1): 7-14 http://www.sciencepublishinggroup.com/j/ajesa doi: 10.11648/j.ajesa.20160401.12 ISSN: 2376-6069 (Print); ISSN: 2376-6085 (Online)
More informationFINE-GRAIN STATE PROCESSORS PENG ZHOU A DISSERTATION. Submitted in partial fulfillment of the requirements. for the degree of DOCTOR OF PHILOSOPHY
FINE-GRAIN STATE PROCESSORS By PENG ZHOU A DISSERTATION Submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY (Computer Science) MICHIGAN TECHNOLOGICAL UNIVERSITY
More informationReducing Latencies of Pipelined Cache Accesses Through Set Prediction
Reducing Latencies of Pipelined Cache Accesses Through Set Prediction Aneesh Aggarwal Electrical and Computer Engineering Binghamton University Binghamton, NY 1392 aneesh@binghamton.edu Abstract With the
More informationWish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution
Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department
More informationTOLERATING CACHE-MISS LATENCY WITH MULTIPASS PIPELINES
TOLERATING CACHE-MISS LATENCY WITH MULTIPASS PIPELINES MULTIPASS PIPELINING USES PERSISTENT ADVANCE EXECUTION TO ACHIEVE MEMORY-LATENCY TOLERANCE WHILE MAINTAINING THE SIMPLICITY OF AN IN-ORDER DESIGN.
More information1.6 Computer Performance
1.6 Computer Performance Performance How do we measure performance? Define Metrics Benchmarking Choose programs to evaluate performance Performance summary Fallacies and Pitfalls How to avoid getting fooled
More informationComputer System. Performance
Computer System Performance Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/
More informationEliminating Microarchitectural Dependency from Architectural Vulnerability
Eliminating Microarchitectural Dependency from Architectural Vulnerability Vilas Sridharan and David R. Kaeli Department of Electrical and Computer Engineering Northeastern University {vilas, kaeli}@ece.neu.edu
More information