Breaking Cyclic-Multithreading Parallelization with XML Parsing
Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu-Yeon Wei, David Brooks

Scope
Today's commodity platforms include multiple cores.
Use multiple cores for a single program.
Distribute loop iterations among cores, a.k.a. Cyclic-Multithreading (CMT).

Cyclic-Multithreading (CMT)
This talk is about limits of CMT.
HELIX is a re-evaluation of CMT for today's multicore.

Outline
The HELIX research project
The XML library
Limits of Cyclic-Multithreading parallelizations

Team of the HELIX Project

Project Goal

HELIX Performance
Columns: Benchmark, LOC, HELIX-RC Speedup (the speedup values and part of the LOC figures did not survive transcription)
CFP
  mesa     42,
  art      1,
  equake   1,
  ammp     9,805
CINT
  gzip     5,
  vpr      11,
  mcf      1,
  parser   7,
  bzip2    3,
  twolf    17,875
libxml2    170,893

The HELIX Execution Model
Iterations grouped on modular value
Cores organized as a ring
TLP extracted between loop iterations
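
The mapping on this slide can be illustrated with a minimal sketch. It assumes that "grouped on modular value" means iteration i runs on core i mod N, and that each core hands off to its successor on the ring. The function names are hypothetical and are not taken from the HELIX compiler.

```python
from collections import defaultdict


def assign_iterations(num_iterations, num_cores):
    """Group iteration indices by their value modulo the number of cores."""
    schedule = defaultdict(list)
    for i in range(num_iterations):
        schedule[i % num_cores].append(i)
    return dict(schedule)


def ring_successor(core, num_cores):
    """Core that runs the next iteration; the last core wraps around to core 0."""
    return (core + 1) % num_cores


# Illustrative numbers only: 10 iterations on 4 cores.
schedule = assign_iterations(10, 4)
for core, iterations in sorted(schedule.items()):
    print(f"core {core}: iterations {iterations} -> next core {ring_successor(core, 4)}")
```

Because consecutive iterations land on neighbouring cores, a value produced by iteration i only has to travel one hop along the ring to reach iteration i + 1.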

Core-to-Core Communication
Status of HELIX:
Static code generation
Number of cores decided at compile time
Communication: the main obstacle
Today's multicore: Intel Hyper-Threading [CGO 2012, IEEE Micro 2012, DAC 2012]
Enhanced multicore: [ISCA 2014]

Algorithm

Algorithm: Nested Tree Nodes

Algorithm: Single Element Analysis

Algorithm: CMT Opportunity

Evaluation
Architecture: conventional multicore; ring cache [ISCA 2014]; 4 cores (Intel Atom-like)
Compiler: HELIX compiler HCCv3 [ISCA 2014]
Simulator: IRSim, an IR-based simulator [ISCA 2014]

Limits of HELIX

Limits of HELIX (2)

Limits of CMT
Oracle:
Control and data dependences
Invariant variables
Function pointers

Multiple CMT: Beyond the Single Loop Parallelism
Goal: Is there any hope in looking at parallelism among loops?
Execution model: any iteration of any loop can be executed by any core; data and control dependences are properly satisfied
Constraint: no parallelism for recursive loops
Idealization: no communication cost; no dispatching cost; no cost to switch loop iteration
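
One way to reason about the idealized model above is as a bound on a dependence graph of loop iterations. The sketch below is an assumption-laden illustration, not the method used in the talk: it treats each iteration as a task with a fixed duration, assumes the dependence graph is acyclic, and applies the standard work/critical-path bound (speedup is at most the number of cores and at most total work divided by the longest dependence chain). All names and numbers are hypothetical.

```python
def ideal_speedup_bound(durations, deps, num_cores):
    """Upper bound on speedup with zero communication, dispatch, and switch cost.

    durations: dict mapping task -> execution time
    deps:      dict mapping task -> list of tasks that must finish first
               (assumed acyclic)
    """
    finish = {}

    def earliest_finish(task):
        # A task may start once all of its predecessors have finished.
        if task not in finish:
            start = max((earliest_finish(p) for p in deps.get(task, [])), default=0)
            finish[task] = start + durations[task]
        return finish[task]

    critical_path = max(earliest_finish(t) for t in durations)
    total_work = sum(durations.values())
    return min(num_cores, total_work / critical_path)


# Two loops, A and B, each with a loop-carried dependence between its own
# iterations but no dependence between the loops (illustrative data only).
durations = {("A", 0): 1, ("A", 1): 1, ("B", 0): 1, ("B", 1): 1}
deps = {("A", 1): [("A", 0)], ("B", 1): [("B", 0)]}
bound = ideal_speedup_bound(durations, deps, num_cores=4)
print(bound)
```

In this toy input neither loop has parallelism on its own, yet running the two loops side by side halves the critical path, which is the kind of cross-loop opportunity the slide asks about.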

Opportunity of MCMT
Static DDG: no hope
Dynamic DDG: great potential for parsing flat trees
Nested trees: require parallelism among same-loop invocations

Algorithm: CMT Opportunity

Conclusion
CMT: high performance if most of the time is spent in natural loops; Libxml highlights limits of CMT-based approaches
Multiple CMT: there is parallelism among multiple loops; dynamic analyses and/or code transformations are necessary
References: HELIX project
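
The first conclusion follows from Amdahl's law. As a rough sketch, assume the time spent in parallelized loops scales perfectly with the number of cores and everything else stays sequential. The fractions below are illustrative and are not measurements from the talk.

```python
def amdahl_speedup(loop_fraction, num_cores):
    """Whole-program speedup when only `loop_fraction` of the time is parallelized."""
    sequential = 1.0 - loop_fraction
    return 1.0 / (sequential + loop_fraction / num_cores)


# On 4 cores, the achievable speedup depends strongly on loop coverage.
for fraction in (0.5, 0.9, 0.99):
    print(f"{fraction:.0%} of time in loops -> {amdahl_speedup(fraction, 4):.2f}x")
```

With half of the execution time outside parallelized loops, four cores cannot deliver more than 1.6x even under these optimistic assumptions.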

Thanks for your attention! Questions?

Automatically Accelerating Non-Numerical Programs By Architecture-Compiler Co-Design

Automatically Accelerating Non-Numerical Programs By Architecture-Compiler Co-Design Automatically Accelerating Non-Numerical Programs By Architecture-Compiler Co-Design Simone Campanoni * Kevin Brownell Svilen Kanev Timothy M. Jones + Harvard University Northwestern University * University

More information

HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs

HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs Simone Campanoni Kevin Brownell Svilen Kanev Timothy M. Jones + Gu-Yeon Wei David Brooks Harvard University

More information

research highlights DOI: /

research highlights DOI: / research highlights Automatically Accelerating Non-Numerical Programs by Architecture-Compiler Co-Design By Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon Wei, and David Brooks

More information

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)

More information

APPENDIX Summary of Benchmarks

APPENDIX Summary of Benchmarks 158 APPENDIX Summary of Benchmarks The experimental results presented throughout this thesis use programs from four benchmark suites: Cyclone benchmarks (available from [Cyc]): programs used to evaluate

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Design of Experiments - Terminology

Design of Experiments - Terminology Design of Experiments - Terminology Response variable Measured output value E.g. total execution time Factors Input variables that can be changed E.g. cache size, clock rate, bytes transmitted Levels Specific

More information

Impact of Cache Coherence Protocols on the Processing of Network Traffic

Impact of Cache Coherence Protocols on the Processing of Network Traffic Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background

More information

Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

More information

Outline. Speculative Register Promotion Using Advanced Load Address Table (ALAT) Motivation. Motivation Example. Motivation

Outline. Speculative Register Promotion Using Advanced Load Address Table (ALAT) Motivation. Motivation Example. Motivation Speculative Register Promotion Using Advanced Load Address Table (ALAT Jin Lin, Tong Chen, Wei-Chung Hsu, Pen-Chung Yew http://www.cs.umn.edu/agassiz Motivation Outline Scheme of speculative register promotion

More information

EECS 583 Class 16 Research Topic 1 Automatic Parallelization

EECS 583 Class 16 Research Topic 1 Automatic Parallelization EECS 583 Class 16 Research Topic 1 Automatic Parallelization University of Michigan November 7, 2012 Announcements + Reading Material Midterm exam: Mon Nov 19 in class (Next next Monday)» I will post 2

More information

POSH: A TLS Compiler that Exploits Program Structure

POSH: A TLS Compiler that Exploits Program Structure POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

More information

Memory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012

Memory Prefetching for the GreenDroid Microprocessor. David Curran December 10, 2012 Memory Prefetching for the GreenDroid Microprocessor David Curran December 10, 2012 Outline Memory Prefetchers Overview GreenDroid Overview Problem Description Design Placement Prediction Logic Simulation

More information

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache Department of Informatics, Japan Science and Technology Agency ICECS'06 1 Background (1/2) Trusted Program Malicious Program Branch

More information

Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics

Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics Masayo Haneda Peter M.W. Knijnenburg Harry A.G. Wijshoff LIACS, Leiden University Motivation An optimal compiler optimization

More information

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn

A Cross-Architectural Interface for Code Cache Manipulation. Kim Hazelwood and Robert Cohn A Cross-Architectural Interface for Code Cache Manipulation Kim Hazelwood and Robert Cohn Software-Managed Code Caches Software-managed code caches store transformed code at run time to amortize overhead

More information

COS 320. Compiling Techniques

COS 320. Compiling Techniques Topic 14: Parallelism COS 320 Compiling Techniques Princeton University Spring 2015 Prof. David August 1 Final Exam! Friday May 22 at 1:30PM in FRIEND 006 Closed book One Front/Back 8.5x11 2 Moore s Law

More information

IMPROVING SYSTEM PERFORMANCE INCREASINGLY DEPENDS ON EXPLOITING MICRO- WITHOUT REQUIRING ANY SPECIAL HARDWARE; AVOIDS SLOWING DOWN COMPILED

IMPROVING SYSTEM PERFORMANCE INCREASINGLY DEPENDS ON EXPLOITING MICRO- WITHOUT REQUIRING ANY SPECIAL HARDWARE; AVOIDS SLOWING DOWN COMPILED [3-9] mmi000003.3d 5/7/0 6: Page... HEIX: MKING THE EXTRTION OF THRED-EVE PREISM MINSTREM... IMPROVING SYSTEM PERFORMNE INRESINGY DEPENDS ON EXPOITING MIRO- PROESSOR PREISM, YET MINSTREM OMPIERS STI DON

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J Lilja lilja@eceumnedu Acknowledgements! Graduate students

More information

Many Cores, One Thread: Dean Tullsen University of California, San Diego

Many Cores, One Thread: Dean Tullsen University of California, San Diego Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and

More information

Architecture Cloning For PowerPC Processors. Edwin Chan, Raul Silvera, Roch Archambault IBM Toronto Lab Oct 17 th, 2005

Architecture Cloning For PowerPC Processors. Edwin Chan, Raul Silvera, Roch Archambault IBM Toronto Lab Oct 17 th, 2005 Architecture Cloning For PowerPC Processors Edwin Chan, Raul Silvera, Roch Archambault edwinc@ca.ibm.com IBM Toronto Lab Oct 17 th, 2005 Outline Motivation Implementation Details Results Scenario Previously,

More information

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1

Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL. Inserting Prefetches IA-32 Execution Layer - 1 I Inserting Data Prefetches into Loops in Dynamically Translated Code in IA-32EL Inserting Prefetches IA-32 Execution Layer - 1 Agenda IA-32EL Brief Overview Prefetching in Loops IA-32EL Prefetching in

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Efficient Architecture Support for Thread-Level Speculation

Efficient Architecture Support for Thread-Level Speculation Efficient Architecture Support for Thread-Level Speculation A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Venkatesan Packirisamy IN PARTIAL FULFILLMENT OF THE

More information

Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring

Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Lei Jin Sangyeun Cho Department of Computer Science University of Pittsburgh jinlei,cho@cs.pitt.edu Abstract Private

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Simple and Efficient Construction of Static Single Assignment Form

Simple and Efficient Construction of Static Single Assignment Form Simple and Efficient Construction of Static Single Assignment Form saarland university Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon and Andreas Zwinkau computer science

More information

Improvements to Linear Scan register allocation

Improvements to Linear Scan register allocation Improvements to Linear Scan register allocation Alkis Evlogimenos (alkis) April 1, 2004 1 Abstract Linear scan register allocation is a fast global register allocation first presented in [PS99] as an alternative

More information

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More

More information

Good old days ended in Nov. 2002

Good old days ended in Nov. 2002 WaveScalar! Good old days 2 Good old days ended in Nov. 2002 Complexity Clock scaling Area scaling 3 Chip MultiProcessors Low complexity Scalable Fast 4 CMP Problems Hard to program Not practical to scale

More information

HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing

HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing Simone Campanoni Harvard University Cambridge, USA xan@eecs.harvard.edu Vijay Janapa Reddi The University of Texas at Austin

More information

Continuous Adaptive Object-Code Re-optimization Framework

Continuous Adaptive Object-Code Re-optimization Framework Continuous Adaptive Object-Code Re-optimization Framework Howard Chen, Jiwei Lu, Wei-Chung Hsu, and Pen-Chung Yew University of Minnesota, Department of Computer Science Minneapolis, MN 55414, USA {chenh,

More information

HELIX-UP: Relaxing Program Semantics to Unleash Parallelization

HELIX-UP: Relaxing Program Semantics to Unleash Parallelization HELIX-: Relaxing Program Semantics to Unleash Parallelization Simone Campanoni Glenn Holloway Gu-Yeon Wei David Brooks Harvard University {xan,holloway,guyeon,dbrooks}@eecs.harvard.edu Abstract Automatic

More information

Efficient Locality Approximation from Time

Efficient Locality Approximation from Time Efficient Locality Approximation from Time Xipeng Shen The College of William and Mary Joint work with Jonathan Shaw at Shaw Technologies Inc., Tualatin, OR Locality is Important Traditional reasons: memory

More information

Detecting Global Stride Locality in Value Streams

Detecting Global Stride Locality in Value Streams Detecting Global Stride Locality in Value Streams Huiyang Zhou, Jill Flanagan, Thomas M. Conte TINKER Research Group Department of Electrical & Computer Engineering North Carolina State University 1 Introduction

More information

Instruction Based Memory Distance Analysis and its Application to Optimization

Instruction Based Memory Distance Analysis and its Application to Optimization Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang cfang@mtu.edu Steve Carr carr@mtu.edu Soner Önder soner@mtu.edu Department of Computer Science Michigan Technological

More information

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX

Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Aries: Transparent Execution of PA-RISC/HP-UX Applications on IPF/HP-UX Keerthi Bhushan Rajesh K Chaurasia Hewlett-Packard India Software Operations 29, Cunningham Road Bangalore 560 052 India +91-80-2251554

More information

Dynamic Trace Analysis with Zero-Suppressed BDDs

Dynamic Trace Analysis with Zero-Suppressed BDDs University of Colorado, Boulder CU Scholar Electrical, Computer & Energy Engineering Graduate Theses & Dissertations Electrical, Computer & Energy Engineering Spring 4-1-2011 Dynamic Trace Analysis with

More information

TRIPS: Extending the Range of Programmable Processors

TRIPS: Extending the Range of Programmable Processors TRIPS: Extending the Range of Programmable Processors Stephen W. Keckler Doug Burger and Chuck oore Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/cart

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

RoboBees + Aladdin + HELIX Approximate Accelerator Architectures

RoboBees + Aladdin + HELIX Approximate Accelerator Architectures RoboBees + Aladdin + HELIX Approximate Accelerator Architectures Gu-Yeon Wei School of Engineering and Applied Sciences Harvard University CMOS scaling is running out Technological Fallow Period 2 Power

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization

Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Topic 22: Multi-Processor Parallelism

Topic 22: Multi-Processor Parallelism Topic 22: Multi-Processor Parallelism COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Review: Parallelism Independent units of work can execute

More information

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer

More information

SVF: Static Value-Flow Analysis in LLVM

SVF: Static Value-Flow Analysis in LLVM SVF: Static Value-Flow Analysis in LLVM Yulei Sui, Peng Di, Ding Ye, Hua Yan and Jingling Xue School of Computer Science and Engineering The University of New South Wales 2052 Sydney Australia March 18,

More information

Topic 22: Multi-Processor Parallelism

Topic 22: Multi-Processor Parallelism Topic 22: Multi-Processor Parallelism COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Review: Parallelism Independent units of work can execute

More information

Lecture 20 CIS 341: COMPILERS

Lecture 20 CIS 341: COMPILERS Lecture 20 CIS 341: COMPILERS Announcements HW5: OAT v. 2.0 records, function pointers, type checking, array-bounds checks, etc. Due: TOMORROW Wednesday, April 11 th Zdancewic CIS 341: Compilers 2 A high-level

More information

TraceBack: First Fault Diagnosis by Reconstruction of Distributed Control Flow

TraceBack: First Fault Diagnosis by Reconstruction of Distributed Control Flow TraceBack: First Fault Diagnosis by Reconstruction of Distributed Control Flow Andrew Ayers Chris Metcalf Junghwan Rhee Richard Schooler VERITAS Emmett Witchel Microsoft Anant Agarwal UT Austin MIT Software

More information

COL862 Programming Assignment-1

COL862 Programming Assignment-1 Submitted By: Rajesh Kedia (214CSZ8383) COL862 Programming Assignment-1 Objective: Understand the power and energy behavior of various benchmarks on different types of x86 based systems. We explore a laptop,

More information

HDFI: Hardware-Assisted Data-flow Isolation

HDFI: Hardware-Assisted Data-flow Isolation HDFI: Hardware-Assisted Data-flow Isolation Presented by Ben Schreiber Chengyu Song 1, Hyungon Moon 2, Monjur Alam 1, Insu Yun 1, Byoungyoung Lee 1, Taesoo Kim 1, Wenke Lee 1, Yunheung Paek 2 1 Georgia

More information

Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective

Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective Venkatesan Packirisamy, Yangchun Luo, Wei-Lung Hung, Antonia Zhai, Pen-Chung Yew and Tin-Fook

More information

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST A Cost Effective Spatial Redundancy with Data-Path Partitioning Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST 1 Outline Introduction Data-path Partitioning for a dependable

More information

Min-Cut Program Decomposition for Thread-Level Speculation

Min-Cut Program Decomposition for Thread-Level Speculation Min-Cut Program Decomposition for Thread-Level Speculation Troy A. Johnson, Rudolf Eigenmann, T. N. Vijaykumar {troyj, eigenman, vijay}@ecn.purdue.edu School of Electrical and Computer Engineering Purdue

More information

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications

Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Base Vectors: A Potential Technique for Micro-architectural Classification of Applications Dan Doucette School of Computing Science Simon Fraser University Email: ddoucett@cs.sfu.ca Alexandra Fedorova

More information

REVISITING THE SEQUENTIAL PROGRAMMING MODEL FOR THE MULTICORE ERA

REVISITING THE SEQUENTIAL PROGRAMMING MODEL FOR THE MULTICORE ERA ... REVISITING THE SEQUENTIAL PROGRAMMING MODEL FOR THE MULTICORE ERA... AUTOMATIC PARALLELIZATION HAS THUS FAR NOT BEEN SUCCESSFUL AT EXTRACTING SCALABLE PARALLELISM FROM GENERAL PROGRAMS. AN AGGRESSIVE

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

Data-Triggered Threads: Eliminating Redundant Computation

Data-Triggered Threads: Eliminating Redundant Computation In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA 2011) Data-Triggered Threads: Eliminating Redundant Computation Hung-Wei Tseng and Dean M. Tullsen Department

More information

Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective. Technical Report

Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective. Technical Report Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective Technical Report Department of Computer Science and Engineering University of Minnesota

More information

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Huiyang Zhou, School of Computer Science, University of Central Florida, zhou@cs.ucf.edu. Abstract: Current integration trends

Simone Campanoni Loop transformations

Simone Campanoni, simonec@eecs.northwestern.edu. Outline: Simple loop transformations; Loop invariants; Induction variables; Complex loop transformations. Simple loop transformations: Simple

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami, Dept. of Informatics, Kyushu

Workloads, Scalability and QoS Considerations in CMP Platforms

Presenter: Don Newell, Sr. Principal Engineer, Intel Corporation. 2007 Intel Corporation. Agenda: Trends and research context; Evolving Workload

Exploring Wakeup-Free Instruction Scheduling

Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin, Microsystems Design Lab, The Pennsylvania State University. Outline: Motivation; Case study: Cyclone; Towards high-performance

ATOS introduction ST/Linaro Collaboration Context

Presenter: Christian Bertin. Development team: Rémi Duraffort, Christophe Guillon, François de Ferrière, Hervé Knochel, Antoine Moynault. Consumer Product

Implicitly-Multithreaded Processors

Appears in the Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA). School of Electrical & Computer Engineering, Purdue University

Red Fox: An Execution Environment for Relational Query Processing on GPUs

Haicheng Wu¹, Gregory Diamos², Tim Sheard³, Molham Aref⁴, Sean Baxter², Michael Garland², Sudhakar Yalamanchili¹. ¹ Georgia

Using Thread-Level Speculation to Simplify Manual Parallelization

Manohar K. Prabhu, Stanford University Computer Systems Laboratory, Stanford, California 94305, mkprabhu@stanford.edu. Kunle Olukotun, Stanford

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Instruction-Level Parallelism (ILP): What we've seen so far. Wrap-up on multiple issue machines. Beyond ILP. Multithreading. Advanced

Loop Selection for Thread-Level Speculation

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew, Department of Computer Science, University of Minnesota, {shengyue, dai, kiran, zhai,

Phi-Predication for Light-Weight If-Conversion

Weihaw Chuang, Brad Calder, Jeanne Ferrante. Benefits of If-Conversion: Eliminates hard to predict branches; Important for deep pipelines. How? Executes all paths

A DYNAMIC PERIODICITY DETECTOR: APPLICATION TO SPEEDUP COMPUTATION

Felix Freitag, Julita Corbalan, Jesus Labarta, Departament d'Arquitectura de Computadors (DAC), Universitat Politècnica

An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors

ACM IEEE 37th International Symposium on Computer Architecture. Elastic Cooperative Caching: An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors. Enric Herrero¹, José González²,

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem, Department of Computer Science, University of Pittsburgh. HiPEAC

ECE404 Term Project Sentinel Thread

Alok Garg, Department of Electrical and Computer Engineering, University of Rochester. 1 Introduction: Performance degrading events like branch mispredictions and cache

The Predictability of Computations that Produce Unpredictable Outcomes

This is an update of the paper that appears in the Proceedings of the 5th Workshop on Multithreaded Execution, Architecture, and Compilation, pages 23-34, Austin TX, December, 2001. It includes minor text

High Performance Memory Requests Scheduling Technique for Multicore Processors

Walid El-Reedy, Electronics and Comm. Engineering, Cairo University, Cairo, Egypt, walid.elreedy@gmail.com. Ali A. El-Moursy, Electrical

WaveScalar. Winter 2006 CSE WaveScalar 1

Dataflow machine; good at exploiting ILP; dataflow parallelism; traditional coarser-grain parallelism; cheap thread management; memory ordering enforced through wave-ordered memory

Implicitly-Multithreaded Processors

School of Electrical & Computer Engineering, Purdue University, {parki,vijay}@ecn.purdue.edu, http://min.ecn.purdue.edu/~parki, http://www.ece.purdue.edu/~vijay. Abstract

Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors

Behnam Robatmili, Sibi Govindan, Doug Burger, Stephen W. Keckler. beroy@cs.utexas.edu, sibi@cs.utexas.edu, dburger@microsoft.com, skeckler@nvidia.com

Dune: Safe User-level Access to Privileged CPU Features

Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis, Stanford University. A quick review of Virtualization

CS61C : Machine Structures

inst.eecs.berkeley.edu/~cs61c. Lecture 41: Performance II. Lecturer PSOE Dan Garcia, www.cs.berkeley.edu/~ddgarcia. UWB Ultra Wide Band! The FCC moved

A Software-Hardware Hybrid Steering Mechanism for Clustered Microarchitectures

Qiong Cai, Josep M. Codina, José González, Antonio González, Intel Barcelona Research Centers, Intel-UPC, {qiongx.cai, josep.m.codina,

Inlining Java Native Calls at Runtime

(CASCON 2005, 4th Workshop on Compiler Driven Performance) Levon Stepanian, Angela Demke Brown, Computer Systems Group, Department of Computer Science, University of

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics

Jian Chen, Nidhi Nayyar and Lizy K. John, Department of Electrical and Computer Engineering, The

Proceedings of the 2nd Workshop on Industrial Experiences with Systems Software

USENIX Association. Boston, Massachusetts, USA, December 8, 2002. THE ADVANCED COMPUTING SYSTEMS ASSOCIATION. 2002 by The USENIX

Cache Optimization by Fully-Replacement Policy

American Journal of Embedded Systems and Applications 2016; 4(1): 7-14, http://www.sciencepublishinggroup.com/j/ajesa, doi: 10.11648/j.ajesa.20160401.12, ISSN: 2376-6069 (Print); ISSN: 2376-6085 (Online)

FINE-GRAIN STATE PROCESSORS PENG ZHOU A DISSERTATION. Submitted in partial fulfillment of the requirements. for the degree of DOCTOR OF PHILOSOPHY

FINE-GRAIN STATE PROCESSORS. By PENG ZHOU. A DISSERTATION submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY (Computer Science), MICHIGAN TECHNOLOGICAL UNIVERSITY

Reducing Latencies of Pipelined Cache Accesses Through Set Prediction

Aneesh Aggarwal, Electrical and Computer Engineering, Binghamton University, Binghamton, NY 1392, aneesh@binghamton.edu. Abstract: With the

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Hyesoon Kim, Onur Mutlu, Jared Stark, David N. Armstrong, Yale N. Patt, High Performance Systems Group, Department

TOLERATING CACHE-MISS LATENCY WITH MULTIPASS PIPELINES

MULTIPASS PIPELINING USES PERSISTENT ADVANCE EXECUTION TO ACHIEVE MEMORY-LATENCY TOLERANCE WHILE MAINTAINING THE SIMPLICITY OF AN IN-ORDER DESIGN.

1.6 Computer Performance

Performance: How do we measure performance? Define Metrics. Benchmarking: Choose programs to evaluate performance. Performance summary. Fallacies and Pitfalls: How to avoid getting fooled

Computer System. Performance

Virendra Singh, Associate Professor, Computer Architecture and Dependable Systems Lab, Department of Electrical Engineering, Indian Institute of Technology Bombay, http://www.ee.iitb.ac.in/~viren/

Eliminating Microarchitectural Dependency from Architectural Vulnerability

Vilas Sridharan and David R. Kaeli, Department of Electrical and Computer Engineering, Northeastern University, {vilas, kaeli}@ece.neu.edu