A Parallelizing Compiler for Multicore Systems

1 A Parallelizing Compiler for Multicore Systems
José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño
17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014)
June 10-11, 2014, Schloss Rheinfels, Sankt Goar, Germany

2 Outline
Motivation: The Parallel Challenge
KIR: A dikernel-based IR
Automatic Partitioning driven by the KIR
Experimental Evaluation
Conclusions

4 The Parallel Challenge
[Figure: processor performance relative to the VAX-11/780 over time, from the VAX-11/780 (5 MHz) up to multicore Intel Xeon processors (3.3 GHz, boost to 3.6 GHz), annotated with growth rates in %/year.]
Source: David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Elsevier, 2014.

5 The Parallel Challenge
libraries
compiler directives
programming languages
parallelizing compilers

8 dikernel: Domain-Independent Computational Kernel
DOMAIN-SPECIFIC CONCEPT LEVEL (problem solving methods and application domain)
DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice)
SEMANTIC LEVEL (control flow and data dependence graphs)
SYNTACTIC LEVEL (abstract syntax tree)
TEXT LEVEL (ASCII code)
Characterizes the computations carried out in a program without being affected by how they are coded.
Exposes multiple levels of parallelism.
M. Arenaz et al. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems, 30(6), 2008.

9 Standard statement-based IR
Source code (the listing that later slides cite as Fig. 1):
    1. for (i = 0; i < n; i++) {
    2.   t = 0;
    3.   for (j = 0; j < m; j++) {
    4.     t = t + A[i][j] * x[j];
    5.   }
    6.   y[i] = t;
    7. }
Control flow graph (basic blocks):
    BB0: i = 0;
    BB1: t = 0; j = 0;
    BB2: t = t + A[i][j] * x[j]; j++;
    BB3: if (j < m)
    BB4: y[i] = t; i++;
    BB5: if (i < n) (T / F)
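The loop nest on this slide computes a dense matrix-vector product, y = A * x. As a minimal sketch, it can be wrapped in a C function. The function name `matvec_seq`, the `double` element type and the flat row-major layout of `A` are assumptions made for illustration; the slides show none of them.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential matrix-vector product y = A * x, mirroring the slide's loop nest.
 * Assumptions (not from the slides): elements are doubles and A is stored
 * row-major in a flat array, so A[i][j] on the slide is A[i * m + j] here. */
void matvec_seq(size_t n, size_t m, const double *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        double t = 0.0;                   /* slide line 2: t = 0 */
        for (size_t j = 0; j < m; j++) {
            t = t + A[i * m + j] * x[j];  /* slide line 4: accumulate row i */
        }
        y[i] = t;                         /* slide line 6: store the result */
    }
}
```

Calling `matvec_seq(2, 3, A, x, y)` with a 2 x 3 matrix fills `y` with one dot product per row of `A`.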

10 Building the KIR (I)
Control flow graph with one node < variable block > per statement:
    BB0: i = 0;  < i BB0 >
    BB1: t = 0;  < t BB1 >
         j = 0;  < j BB1 >
    BB2: t = t + A[i][j] * x[j];  < t BB2 >
         j++;  < j BB2 >
    BB3: if (j < m)
    BB4: y[i] = t;  < y BB4 >
         i++;  < i BB4 >
    BB5: if (i < n) (T / F)
Example: i = 0 dominates i++, with DEF(i, i = 0) and USE(i, i++).

11 Building the KIR (II)
    ROOT EXECUTION SCOPE
      ES_for i (Fig. 1, lines 1-7)
        < t BB1 > scalar assignment
        ES_for j (Fig. 1, lines 3-5)
          < t BB2 > scalar reduction
        < y BB4 > regular assignment
Loop index nodes: < i BB0 >, < i BB4 > (loop i); < j BB1 >, < j BB2 > (loop j).

12 Building the KIR (and III)
Source code (Fig. 1):
    1. for (i = 0; i < n; i++) {
    2.   t = 0;
    3.   for (j = 0; j < m; j++) {
    4.     t = t + A[i][j] * x[j];
    5.   }
    6.   y[i] = t;
    7. }
Resulting KIR:
    ROOT EXECUTION SCOPE
      ES_for i (Fig. 1, lines 1-7)
        < t BB1 > scalar assignment
        ES_for j (Fig. 1, lines 3-5)
          < t BB2 > scalar reduction
        < y BB4 > regular assignment
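The hierarchy on this slide (execution scopes containing dikernels) can be pictured as a small tree. The sketch below is only an illustration of that shape: every type and function name in it (`kir_node`, `dikernel_kind`, `kir_example`, `kir_count_dikernels`, `kir_count_kind`) is hypothetical, and the slides do not show how the compiler actually stores the KIR.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; they only mirror the labels on the slide. */
typedef enum { SCALAR_ASSIGNMENT, SCALAR_REDUCTION, REGULAR_ASSIGNMENT } dikernel_kind;
typedef enum { NODE_SCOPE, NODE_DIKERNEL } node_type;

typedef struct kir_node {
    node_type type;
    const char *label;                       /* e.g. "ES_for i" or "<t BB1>" */
    dikernel_kind kind;                      /* meaningful only for dikernels */
    const struct kir_node *const *children;  /* nested scopes and dikernels */
    size_t n_children;
} kir_node;

/* The three dikernels of the example. */
static const kir_node t_bb1 = { .type = NODE_DIKERNEL, .label = "<t BB1>", .kind = SCALAR_ASSIGNMENT };
static const kir_node t_bb2 = { .type = NODE_DIKERNEL, .label = "<t BB2>", .kind = SCALAR_REDUCTION };
static const kir_node y_bb4 = { .type = NODE_DIKERNEL, .label = "<y BB4>", .kind = REGULAR_ASSIGNMENT };

/* ES_for j holds the reduction; ES_for i holds everything else. */
static const kir_node *const for_j_children[] = { &t_bb2 };
static const kir_node es_for_j = { .type = NODE_SCOPE, .label = "ES_for j",
                                   .children = for_j_children, .n_children = 1 };
static const kir_node *const for_i_children[] = { &t_bb1, &es_for_j, &y_bb4 };
static const kir_node es_for_i = { .type = NODE_SCOPE, .label = "ES_for i",
                                   .children = for_i_children, .n_children = 3 };
static const kir_node *const root_children[] = { &es_for_i };
static const kir_node root = { .type = NODE_SCOPE, .label = "ROOT EXECUTION SCOPE",
                               .children = root_children, .n_children = 1 };

/* Returns the root of the example tree. */
const kir_node *kir_example(void) { return &root; }

/* Counts the dikernels in the subtree rooted at node. */
size_t kir_count_dikernels(const kir_node *node)
{
    assert(node != NULL);
    if (node->type == NODE_DIKERNEL)
        return 1;
    size_t total = 0;
    for (size_t c = 0; c < node->n_children; c++)
        total += kir_count_dikernels(node->children[c]);
    return total;
}

/* Counts the dikernels of one kind in the subtree rooted at node. */
size_t kir_count_kind(const kir_node *node, dikernel_kind kind)
{
    assert(node != NULL);
    if (node->type == NODE_DIKERNEL)
        return node->kind == kind ? 1 : 0;
    size_t total = 0;
    for (size_t c = 0; c < node->n_children; c++)
        total += kir_count_kind(node->children[c], kind);
    return total;
}
```

A tree was chosen here because the slide draws execution scopes as nested boxes; the dikernel-level dependences between nodes are left out of this sketch.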

15 Automatic Partitioning driven by the KIR (I)
Same source code and KIR as slide 12.
t is a privatizable scalar variable: every iteration of the i loop assigns t (line 2) before reading it.

16 Automatic Partitioning driven by the KIR (II)
Same source code and KIR as slide 12.
Annotation: spurious dikernel-level dependence

17 Automatic Partitioning driven by the KIR (III)
Same source code and KIR as slide 12.
Annotation: critical path

18 Automatic Partitioning driven by the KIR (and IV)
KIR with the critical path marked:
    ROOT EXECUTION SCOPE
      ES_for i (Fig. 1, lines 1-7)
        < t BB1 > scalar assignment
        ES_for j (Fig. 1, lines 3-5)
          < t BB2 > scalar reduction
        < y BB4 > regular assignment
Generated OpenMP code:
    #pragma omp parallel for shared(A,x,y) private(t,i,j)
    for (i = 0; i < n; i++) {
      t = 0;
      for (j = 0; j < m; j++) {
        t = t + A[i][j] * x[j];
      }
      y[i] = t;
    }
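The OpenMP listing on this slide can be tried out as a complete function. The following is a minimal sketch, not the compiler's actual output: the name `matvec_omp`, the `double` element type and the flat row-major layout of `A` are assumptions. The directive keeps the slide's clauses: the arrays are shared, while `t` and the loop indices are private to each thread.

```c
#include <assert.h>
#include <stddef.h>

/* Parallel matrix-vector product y = A * x using the slide's OpenMP clauses.
 * Assumptions (not from the slides): elements are doubles and A is stored
 * row-major in a flat array, so A[i][j] on the slide is A[i * m + j] here. */
void matvec_omp(int n, int m, const double *A, const double *x, double *y)
{
    int i, j;
    double t;

    assert(A != NULL && x != NULL && y != NULL);

    /* Iterations of the outer loop are independent once t is private,
     * so they can be distributed among the threads. */
    #pragma omp parallel for shared(A, x, y) private(t, i, j)
    for (i = 0; i < n; i++) {
        t = 0.0;
        for (j = 0; j < m; j++) {
            t = t + A[(size_t)i * (size_t)m + (size_t)j] * x[j];
        }
        y[i] = t;  /* each iteration writes a distinct element of y */
    }
}
```

With GCC, OpenMP is enabled by the `-fopenmp` flag; without it the pragma is ignored and the function runs sequentially with the same result.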

21 Experimental Evaluation
Built on top of GCC
Case study: EQUAKE from SPEC CPU2000 on 2 Intel Xeon E5520 quad-core processors
The Intel compiler is unable to parallelize this case study properly while our approach reduces the execution time
[Figures: execution time (s) and speedup for KIR/ICC versus ICC with workloads WL x 1, WL x 2 and WL x 3; legend entries: Remaining, Overhead, Irregular.]
More results in J.M. Andión et al. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9).

24 Conclusions
1. The KIR: a dikernel-based IR
    dikernels
    dikernel-level dependences
    execution scopes
2. Automatic Partitioning Technique
    coarse-grain parallelism
    global OpenMP parallelization strategy

25 Future Work
Locality exploitation techniques
Fine-grain parallelism
Many-core architectures such as GPUs
J.M. Andión et al. Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives. HLPP 2014 & International Journal of Parallel Programming (to appear)


More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Hyper-Threading Influence on CPU Performance

Hyper-Threading Influence on CPU Performance João Martins* Jorge Gomes* Mario David* Gonçalo Borges* * LIP Laboratório de Instrumentação e Física Experimental de Particulas HePiX Spring

More information

NoC Simulation in Heterogeneous Architectures for PGAS Programming Model

NoC Simulation in Heterogeneous Architectures for PGAS Programming Model NoC Simulation in Heterogeneous Architectures for PGAS Programming Model Sascha Roloff, Andreas Weichslgartner, Frank Hannig, Jürgen Teich University of Erlangen-Nuremberg, Germany Jan Heißwolf Karlsruhe

More information

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems J.C. Sáez, A. Pousa, F. Castro, D. Chaver y M. Prieto Complutense University of Madrid, Universidad Nacional de la Plata-LIDI

More information

NUMA-aware OpenMP Programming

NUMA-aware OpenMP Programming NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Modelling and optimisation of scientific software for multicore platforms

Modelling and optimisation of scientific software for multicore platforms Modelling and optimisation of scientific software for multicore platforms Domingo Giménez... and the list of collaborators within the presentation Group page: http://www.um.es/pcgum Presentations: http://dis.um.es/%7edomingo/investigacion.html

More information

Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Spring 2 Parallelization Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Outline Why Parallelism Parallel Execution Parallelizing Compilers

More information

Introduction to GPU computing

Introduction to GPU computing Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU

More information

Parallelized Progressive Network Coding with Hardware Acceleration

Parallelized Progressive Network Coding with Hardware Acceleration Parallelized Progressive Network Coding with Hardware Acceleration Hassan Shojania, Baochun Li Department of Electrical and Computer Engineering University of Toronto Network coding Information is coded

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

URL: Offered by: Should already know: Will learn: 01 1 EE 4720 Computer Architecture

URL:   Offered by: Should already know: Will learn: 01 1 EE 4720 Computer Architecture 01 1 EE 4720 Computer Architecture 01 1 URL: https://www.ece.lsu.edu/ee4720/ RSS: https://www.ece.lsu.edu/ee4720/rss home.xml Offered by: David M. Koppelman 3316R P. F. Taylor Hall, 578-5482, koppel@ece.lsu.edu,

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

CS 316: Multicore/GPUs

CS 316: Multicore/GPUs CS 316: Multicore/GPUs Kavita Bala Fall 2007 Computer Science Cornell University Announcements Core Wars will be out in the next couple of days Aim at having fun! Number of points allocated to it is small

More information

Introductory OpenMP June 2008

Introductory OpenMP June 2008 5: http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture5.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12 June 2008 Introduction

More information

Parallel Programming

Parallel Programming Parallel Programming Introduction Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Acknowledgements Prof. Felix Wolf, TU Darmstadt Prof. Matthias

More information

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization X86 operating systems are designed to run directly on the bare-metal hardware,

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Programming Strategies for Contextual Runtime Specialization

Programming Strategies for Contextual Runtime Specialization June nd, 15 Schloss Rheinfels, Sankt Goar, Germany Programming Strategies for Contextual Runtime Specialization Tiago Carvalho t.carvalho@fe.up.pt Pedro Pinto p.pinto@fe.up.pt João M. P. Cardoso jmpc@acm.org

More information

Towards Automatic Code Generation for GPUs

Towards Automatic Code Generation for GPUs Towards Automatic Code Generation for GPUs Javier Setoain 1, Christian Tenllado 1, Jose Ignacio Gómez 1, Manuel Arenaz 2, Manuel Prieto 1, and Juan Touriño 2 1 ArTeCS Group 2 Computer Architecture Group

More information

IT 252 Computer Organization and Architecture. Introduction. Chia-Chi Teng

IT 252 Computer Organization and Architecture. Introduction. Chia-Chi Teng IT 252 Computer Organization and Architecture Introduction Chia-Chi Teng What is computer architecture about? Computer architecture is the study of building computer systems. IT 252 is roughly split into

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Computer Architecture!

Computer Architecture! Informatics 3 Computer Architecture! Dr. Boris Grot and Dr. Vijay Nagarajan!! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors

More information

Chap. 6 Part 3. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1

Chap. 6 Part 3. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1 Chap. 6 Part 3 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 OpenMP popular for decade Compiler-based technique Start with plain old C, C++, or Fortran Insert #pragmas into source file You

More information