Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming
Eva Burrows, Bergen Language Design Laboratory (BLDL), Department of Informatics, University of Bergen
Bergen, March 17, 2010

Outline: The Roadmap of Multicores; The Challenge of Parallel Thinking; Multicore Languages and Compilers; Summary

The Roadmap of Multicores

At the beginning... 1965: "the number of transistors placed inexpensively on an integrated circuit will double approximately every two years" (Gordon E. Moore, co-founder and Chairman Emeritus of Intel). An innocent observation became an industry goal: Moore's Law.

Moore's Law illustrated on Intel chips (drawing: Herb Sutter)

Multicore Architectures: a processing system with 2 or more independent cores integrated on the same chip. Number and type of cores: multicore or manycore; heterogeneous or homogeneous. Memory architecture: shared, distributed, or a mixture. Interconnection network.

Shared Memory Multicore Architectures (diagram: all cores attached to a single shared memory)

Distributed Memory Multicore Architectures (diagram: each core with its own memory)

(Future) Manycore Architectures (diagram: several groups of cores, each group with its own shared memory)

The Babel of Multicores

Overwhelming Multicores

Intel's Nehalem Architecture: shared memory; multi-threading; up to 8 cores, 2 threads/core; private L1 and L2 caches; 24MB shared L3 cache

Sun's Niagara 2 Architecture: shared memory; multi-threading; up to 8 cores, 8 threads/core; 4MB shared L2 cache

Cell BE Heterogeneous Architecture: 1 PPU for the OS and program control; 8 SPEs capable of vector processing, each with its own local store memory (256 KB each); high-performance interconnection bus

NVIDIA GPUs with CUDA

Field Programmable Gate Arrays (FPGAs): a device with a matrix of reconfigurable gate-array logic circuitry. When configured, it works as a hardware implementation of a software application. The user can create task-specific cores that all run like parallel circuits inside one FPGA chip.

The Challenge of Parallel Thinking

Think Parallel or Perish. 2009: "...the 'not parallel' era will appear to be a very primitive time in the history of computers when people look back in a hundred years... in less than a decade, a programmer who does not 'Think Parallel' first will not be a programmer." James Reinders, Chief Evangelist, Intel's Software Products Division

Multicore is a Challenge. Primarily in software development: performance speed-up depends on how well the parallel source code is multi-threaded. Parallel code ought to be: correct, efficient, scalable, future-proof. Portability of code across platforms is a major issue.

Start to Think Parallel. What hardware do we have? A multithreaded system architecture, a GPU, a heterogeneous multicore (e.g. Cell BE), an FPGA, etc. The language: what data structures and operations are supported? Identify the parallelism: embarrassingly parallel? Functional decomposition gives task parallelism; data decomposition gives data parallelism (see the sketch below).
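
A minimal C++/OpenMP sketch of the two decompositions; the worker functions are hypothetical stand-ins for real work:

    #include <cstdio>

    // Hypothetical independent activities, stand-ins for real work.
    void process_audio() { std::puts("audio"); }
    void process_video() { std::puts("video"); }

    // Functional decomposition (task parallelism):
    // different activities run concurrently.
    void functional_decomposition() {
        #pragma omp parallel sections
        {
            #pragma omp section
            process_audio();
            #pragma omp section
            process_video();
        }
    }

    // Data decomposition (data parallelism):
    // the same operation is applied to disjoint parts of the data.
    void data_decomposition(float *a, const float *b, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] += b[i];
    }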

Threading Methods. Explicit threading (rather low level): manually write all the code that manages threads through a specific library's interface. Library-based: best for functional decomposition; the user creates and synchronizes threads explicitly (e.g. Pthreads). Compiler-based: best for data parallelization; the user annotates the code with pragmas (e.g. OpenMP; Intel TBB reaches a similar level of abstraction via a library instead of pragmas). Both styles are sketched below.
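
A minimal sketch contrasting the two styles (compile with -pthread -fopenmp; the thread count and loop bound are arbitrary):

    #include <pthread.h>
    #include <stdio.h>

    // Explicit / library-based: the programmer creates and joins threads.
    static void *worker(void *arg) {
        printf("explicit thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (long i = 0; i < 4; ++i)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; ++i)
            pthread_join(t[i], NULL);

        // Compiler-based: one pragma, thread management left to OpenMP.
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; ++i)
            sum += i;
        printf("sum = %.0f\n", sum);   /* 499500 */
        return 0;
    }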

Golden Rule: Use Abstraction Where Possible. It future-proofs applications: express the parallelism without thinking much about thread/core management. Libraries, OpenMP and Intel TBB are good examples of this. Best to avoid raw native threads like Pthreads; native threads and MPI are like the assembly language of parallelism. Think in tasks, not threads (see the sketch below).
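
For instance, with Intel TBB the programmer states only what may run in parallel; the runtime decides how many threads to use and how to balance them. A minimal sketch:

    #include <tbb/parallel_for.h>
    #include <cstddef>
    #include <vector>

    int main() {
        std::vector<float> v(1 << 20, 1.0f);
        // A task per chunk of the range; TBB's scheduler maps tasks
        // onto a thread pool and load-balances via work stealing.
        tbb::parallel_for(std::size_t(0), v.size(), [&](std::size_t i) {
            v[i] *= 2.0f;
        });
        return 0;
    }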

Load balancing: keep all threads busy all the time.

Fine or Coarse Grain: Which is Best? It depends on the algorithm and the hardware. Fine grain: good for load balancing, but too much communication overhead. Coarse grain: more opportunity to increase performance, but not as good for load balancing. With OpenMP the trade-off can be steered, as sketched below.
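
A sketch of steering granularity with OpenMP's schedule clause; work() is a hypothetical per-iteration job with uneven cost:

    // Hypothetical per-iteration job with uneven cost.
    void work(int i) {
        volatile double x = 0;
        for (int j = 0; j < i % 100; ++j) x += j;
    }

    // Fine grain: one-iteration chunks balance load well, but the
    // scheduling overhead is paid on every single iteration.
    void fine_grained(int n) {
        #pragma omp parallel for schedule(dynamic, 1)
        for (int i = 0; i < n; ++i) work(i);
    }

    // Coarse grain: a few large chunks minimize overhead, but one
    // slow chunk can leave the other threads idle.
    void coarse_grained(int n) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i) work(i);
    }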

Lock-based Synchronization is Error-prone. Locks may cause blocking: deadlocks, where threads wait for each other to release a resource; livelocks, where threads continuously change their state without doing any useful work. The alternative is lock-free programming, e.g. transactional memory.
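
A classic deadlock pattern, and a lock-free alternative for a simple shared counter, sketched in C++11:

    #include <atomic>
    #include <mutex>

    std::mutex m1, m2;

    void thread_a() {
        std::lock_guard<std::mutex> a(m1);   // A holds m1...
        std::lock_guard<std::mutex> b(m2);   // ...then wants m2.
    }

    void thread_b() {
        std::lock_guard<std::mutex> b(m2);   // B holds m2...
        std::lock_guard<std::mutex> a(m1);   // ...then wants m1: deadlock if interleaved.
    }

    // Lock-free alternative: an atomic update cannot deadlock or
    // livelock because no thread ever holds a lock.
    std::atomic<long> counter{0};

    void increment() { counter.fetch_add(1); }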

Exercise: Parallelizing a Baking Process. We are making a birthday cake: mix ingredients, 20 minutes; bake, 30 minutes. Can we parallelize it? How many cooks? Does each cook have his own spoon? What if I make cupcakes instead?

A Note on Performance Gain. Amdahl's Law, the pessimistic: a program's serial portion puts a practical upper bound on the speedup its parallel portion can deliver. Baking a cake (the 20 minutes of mixing can be shared, the 30 minutes of baking cannot):

    Number of cooks   Time           Speedup
    1                 30 + 20 = 50   1.0x
    2                 30 + 10 = 40   1.2x
    4                 30 + 5  = 35   1.4x
    infinite          30 + 0  = 30   1.6x

Overall parallel performance is still limited by the baking time. So are massively parallel systems hopeless...?
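
Stated as a formula (standard Amdahl, with P the parallelizable fraction of the work and N the number of workers):

    S(N) = \frac{1}{(1 - P) + P/N}, \qquad S(\infty) = \frac{1}{1 - P}

For the cake, P = 20/50 = 0.4, so no number of cooks can push the speedup past 1/0.6, about 1.67x.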

A Second Note on Performance Gain. Gustafson's Law, the scaled speedup measurement, the optimistic: what if we want to bake 100 cakes?

    Number of cooks   Time                  Speedup
    1                 30 + 20*100 = 2030    1.0x
    2                 30 + 20*50  = 1030    1.9x
    4                 30 + 20*25  = 530     3.8x
    infinite          30 + 20*0   = 30      67x

Certain problems gain performance by increasing the problem size; the problem size scales with the number of processors. Speedup should therefore be measured by scaling the problem size with the number of processors, not by fixing it.
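
Stated as a formula (standard Gustafson, with s the serial fraction of the time measured on the parallel system):

    S(N) = s + (1 - s)\,N = N - s\,(N - 1)

For 2 cooks, s = 30/1030, roughly 0.03, so S is about 2 - 0.03 = 1.97: the 1.9x in the table.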

Parallelizing is Difficult. Writing correct, efficient parallel programs has always been challenging (e.g. in HPC), and this applies to multicore programming too. Higher abstraction levels help: we cannot start from scratch whenever a new multicore hardware turns up.

Ongoing Research. Goal: to lift the abstraction level even higher; to free the user from low-level hardware details; to bridge the gap between programming different types of multicores and/or HPC facilities. Many high-level programming models have been proposed: functional approaches: Haskell, SAC, Crystal, etc.; data-parallel languages: NESL, DPH, SAC, Fortran95, Sisal, etc.; implicit parallelism: HPF, Id, NESL, Sisal, ZPL; the PGAS model: UPC, Co-Array Fortran, Fortress, Chapel, X10, etc.

Languages and Compilers for Multicores

Intel's Tools for Multicore CPUs. Intel compilers support OpenMP; Intel launched its own MPI library; performance analysis tools and debuggers; Intel TBB adds parallelism to C++; Intel's Ct technology brings nested data parallelism to C++; Intel Parallel Studio is an all-in-one support toolbox. Higher-level models: Intel Concurrent Collections for C++; the Intel Cilk++ Software Development Kit.

Intel's Parallel Studio: a toolbox for Microsoft Visual Studio C/C++ developers; interoperable with OpenMP and Intel's TBB library; helps the programmer throughout the parallelization process (identify, create, debug and tune).

Others. Java: Java threads, the java.util.concurrent package. Microsoft .NET: the Task Parallel Library (TPL). Haskell: thread programming and data parallelism. Etc...

Programming GPGPUs. NVIDIA's CUDA model gives access to the enormous computing power of NVIDIA GPUs through C/C++, Fortran, Python, .NET and standards like OpenCL. OpenCL has also been adopted by other GPU vendors (e.g. AMD).

OpenCL (Open Computing Language): a new open standard for programming heterogeneous systems, supported by most hardware vendors; a uniform programming environment to write efficient, portable code for both multicore CPUs and GPUs. http://www.khronos.org/opencl/ A vector-add example is sketched below.
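
A minimal vector-add sketch of the OpenCL pattern (C host code, error handling omitted for brevity); the same host program can target a CPU or a GPU simply by choosing a different device:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Kernel source: OpenCL C, compiled at run time for the chosen device. */
    static const char *src =
        "__kernel void vadd(__global const float *a, __global const float *b,"
        "                   __global float *c) {"
        "    int i = get_global_id(0);"
        "    c[i] = a[i] + b[i];"
        "}";

    int main(void) {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);  /* CPU or GPU */
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", NULL);

        /* Device buffers; the inputs are copied from host memory. */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
        clSetKernelArg(k, 0, sizeof da, &da);
        clSetKernelArg(k, 1, sizeof db, &db);
        clSetKernelArg(k, 2, sizeof dc, &dc);

        size_t global = N;   /* one work-item per array element */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

        printf("c[1] = %f\n", c[1]);   /* expect 3.0 */
        return 0;
    }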

Summary. Multicores (hardware) are a reality: they are overwhelming, and many more complex, more heterogeneous ones are yet to appear. Multicores (software): writing parallel code is challenging (it always has been); programming models are versatile and confusing; portability across various platforms is a major issue. A unified high-level parallel programming model is still open research.

Staying Tuned? http://software.intel.com/en-us/parallel/ http://www.upcrc.illinois.edu/ http://www.multicoreinfo.com/ http://www.drdobbs.com/go-parallel