MANY-CORE COMPUTING. 7-Oct-2013. Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, eScience Center


MANY-CORE COMPUTING
7-Oct-2013, Ana Lucia Varbanescu, UvA
Original slides: Rob van Nieuwpoort, eScience Center

Schedule
1. Introduction, performance metrics & analysis
2. Programming: basics (10-10-2013)
3. Programming: advanced (14-10-2013)
4. Case study: the LOFAR telescope with many-cores, by Rob van Nieuwpoort (17-10-2013)

What are many-cores?
From Wikipedia: "A many-core processor is a multicore processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely because of issues with congestion in supplying instructions and data to the many processors."
In this course: multi-core/many-core CPUs and (GP)GPUs.

What are many-cores?
How many is many? Several tens of cores.
How are they different from multi-core CPUs?
- Non-uniform memory access (NUMA)
- Private memories
- Network-on-chip
Examples:
- Multi-core CPUs (48-core AMD Magny-Cours)
- Graphics Processing Units (GPUs); GPGPU = general-purpose programming on GPUs
- Server processors (Sun Niagara)
- HPC processors: Cell B.E. (PlayStation 3), Intel Xeon Phi (aka Intel MIC, formerly Larrabee)

Today's Topics
- Why do many-cores exist? History
- Hardware introduction
- Performance model: arithmetic intensity and the Roofline

Why many-cores?
- Moore's law
- Many-cores in real life

Moore's Law
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every year:
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase..." (Electronics Magazine, 1965)

Transistor Counts (Intel) [chart]

Impact of device shrinking
Assume transistor size shrinks by a factor of x:
- Transistors per unit area: up by x * x
- Die size? Assume the same
- Clock rate? May go up by x, because wires are shorter
- Raw computing power? Programs could go x * x * x times faster
In reality? Power consumption, memory, and parallelism impose stricter bounds!
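The idealized scaling argument above can be written down as a toy model (a sketch of my own; the function name is illustrative):

```python
def shrink_effects(x):
    """Idealized effects of shrinking transistor size by a factor x,
    assuming the die size stays the same (as on the slide)."""
    transistors = x * x              # transistors per unit area: up by x^2
    clock = x                        # shorter wires: clock may go up by x
    raw_speed = transistors * clock  # ideal program speedup: x^3
    return transistors, clock, raw_speed

# Halving the feature size (x = 2): 4x the transistors, ideally 8x the speed.
# In reality, power, memory, and parallelism impose stricter bounds.
print(shrink_effects(2))  # (4, 2, 8)
```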

Revolution in Processors
Chip density is continuing to increase, about 2x every 2 years. BUT:
- Clock speed is not
- ILP is not
- Power is not

New ways to use transistors
Parallelism on-chip: multi-core processors. The multicore revolution: every machine will soon be a parallel machine.
What about performance?
- Can applications use this parallelism?
- Do they have to be rewritten from scratch?
- Will all programmers have to be parallel programmers?
New programming models are needed that try to hide complexity from most programmers.

Top500 [1/4]
State of the art in HPC (top500.org); a trial for all new HPC architectures. [annotated top-10 list: several systems accelerated, one with 195 cores/node]

Top500 [2/4]
Performance is dominated by multi-/many-cores: multi-core CPUs and accelerators.

Top500 [3/4]
Accelerators? Relatively low numbers, but high performance impact.

China's Tianhe-1A
#10 in the Top500 list of June 2013 (#1 in November 2010)
- 4.701 PFLOPS peak, 2.566 PFLOPS max
- 14,336 Xeon X5670 processors
- 7,168 NVIDIA Tesla M2050 GPUs x 448 cores = 3,211,264 cores

China's Tianhe-2
#1 in the Top500 of June 2013
- 54.902 PFLOPS peak, 33.862 PFLOPS max
- 16,000 nodes x (2 x Xeon IvyBridge + 3 x Xeon Phi) = 3,120,000 cores (=> 195 cores/node)

Top500: prediction [chart]

GPUs vs. Top500 [chart]

Why do we need many-cores? [chart: GPU performance (NV30, NV40, G70, G80, GT200, T12) vs. CPU performance (3 GHz dual-core Pentium 4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon) over time]

Why do we need many-cores? [chart]

Power efficiency [chart]

Graphics in 1980 [image]

Graphics in 2000 [image]

Realism of modern GPUs
http://www.youtube.com/watch?v=bjdeipvpjgq&feature=player_embedded#t=49s
Courtesy techradar.com

Why do we need many-cores?
- Performance: large-scale parallelism
- Power efficiency: use transistors more efficiently
- Price (GPUs): the game market is huge, bigger than Hollywood; mass production and economy of scale ("spotty teenagers pay for our HPC needs!")
- Prestige: reach ExaFLOP by 2019

History

Multi-core @ Intel

GPGPU History
NVIDIA transistor counts, 1995-2010:
- RIVA 128 (1997): 3M transistors
- GeForce 256 (1999): 23M transistors
- GeForce 3 (2001): 60M transistors
- GeForce FX (2003): 125M transistors
- GeForce 8800 (2006): 681M transistors
- Fermi (2010): 3B transistors
Current generation: NVIDIA Kepler, 7.1B transistors. More cores, more parallelism, more performance.

GPGPU History
Use graphics primitives for HPC:
- Ikonas [England 1978]
- Pixel Machine [Potmesil & Hoffert 1989]
- Pixel-Planes 5 [Rhoades et al. 1992]
Programmable shaders, around 1998 (DirectX / OpenGL): map the application onto the graphics domain!
GPGPU: Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...

CUDA C/C++: Continuous Innovation
- July 07: CUDA Toolkit 1.0 (C compiler, C extensions, single-precision BLAS and FFT, SDK with 40 examples, Win XP 64)
- Nov 07: CUDA Toolkit 1.1 (atomics support, multi-GPU support, cuda-gdb HW debugger)
- Aug 08: CUDA Toolkit 2.0 (double precision, compiler optimizations, Vista 32/64, Mac OSX)
- April 09: CUDA Visual Profiler 2.2
- July 09: CUDA Toolkit 2.3 (DP FFT, 16-32 conversion intrinsics, performance enhancements)
- Nov 09: Parallel Nsight Beta
- Mar 10: CUDA Toolkit 3.0 (C++ inheritance, Fermi support, tools updates, driver/RT interop, 3D textures, HW interpolation)

CUDA Tools
- Parallel Nsight (for Visual Studio)
- Visual Profiler (for Linux)
- cuda-gdb (for Linux)

Another GPGPU history

GPUs @ AMD

Multi-core @ AMD

Multi-core @ AMD

GPU @ ARM

Many-core hardware

Choices
- Core type(s): fat or slim? Vectorized (SIMD)? Homogeneous or heterogeneous?
- Number of cores: few or many?
- Memory: shared-memory or distributed-memory?
- Parallelism: instruction-level parallelism, threads, vectors, ...

A taxonomy
Based on field of origin:
- General-purpose: Intel, AMD
- Graphics Processing Units (GPUs): NVIDIA, ATI
- Gaming/entertainment: Sony/Toshiba/IBM
- Embedded systems: Philips/NXP, ARM
- Servers: Oracle, IBM, Intel
- High Performance Computing: Intel, IBM, ...

General-Purpose Processors
Architecture: few fat cores; vectorization (SSE, AVX); homogeneous; stand-alone.
Memory: shared, multi-layered; per-core cache and shared cache.
Programming: processes (OS scheduler), message passing, multi-threading; coarse-grained parallelism.

Server-side
General-purpose-like, with more hardware threads: lower performance per thread, but high throughput.
Examples:
- Sun Niagara II: 8 cores x 8 threads
- IBM POWER7: 8 cores x 4 threads
- Intel SCC: 48 cores, each can run its own OS

Graphics Processing Units
Architecture: hundreds/thousands of slim cores; homogeneous; accelerator.
Memory: very complex hierarchy; both shared and per-core.
Programming: off-load model; many fine-grained symmetrical threads; hardware scheduler.

Cell/B.E.
Architecture: heterogeneous; 8 vector processors (SPEs) + 1 trimmed PowerPC (PPE).
Memory: per-core memory, network-on-chip.
Programming: user-controlled scheduling; 6 levels of parallelism, all under user control; fine- and coarse-grained parallelism.

Xeon Phi
Architecture: ~60 homogeneous cores, 4 threads per core; x86 architecture.
Memory: per-core caches (L1, L2) with coherence; UMA [?].
Programming: SPMD/MPMD; fine- and coarse-grained parallelism (vector processing and threads, respectively).

Take-home message
A variety of platforms:
- Core types & counts
- Memory architecture & sizes
- Parallelism layers & types
- Scheduling
Open questions: Why so many? How many platforms do we need? Can any application run on any platform?

Hardware performance metrics

Hardware performance metrics
- Clock frequency [GHz]: absolute hardware speed (memories, CPUs, interconnects)
- Operational speed [GFLOPS]: operations per cycle
- Memory bandwidth [GB/s]: differs a lot between different memories on chip
- Power [Watt]
- Derived metrics: FLOP/Byte, FLOP/Watt

Theoretical peak performance
Peak = chips * cores * vector width * FLOPs/cycle * clock frequency
Examples from DAS-4:
- Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
- NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
- ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
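The peak formula is simple enough to check in a few lines. A minimal sketch, using the DAS-4 device parameters from the slide:

```python
def peak_gflops(chips, cores, vector_width, flops_per_cycle, clock_ghz):
    """Theoretical peak = chips * cores * vector width * FLOPs/cycle * clock."""
    return chips * cores * vector_width * flops_per_cycle * clock_ghz

# DAS-4 examples from the slide:
core_i7 = peak_gflops(2, 4, 4, 2, 2.4)          # ~154 GFLOPs
gtx_580 = peak_gflops(1, 16 * 32, 1, 2, 1.544)  # 16 SMs x 32 cores: ~1581 GFLOPs
hd_6970 = peak_gflops(1, 24 * 16, 4, 2, 0.880)  # ~2703 GFLOPs
```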

DRAM memory bandwidth
Throughput = memory bus frequency * bits per cycle * bus width
Note: the memory clock != the CPU clock! The result is in bits; divide by 8 for GB/s.
Examples:
- Intel Core i7 (DDR3): 1.333 * 2 * 64 / 8 = 21 GB/s
- NVIDIA GTX 580 (GDDR5): 1.002 * 4 * 384 / 8 = 192 GB/s
- ATI HD 6970 (GDDR5): 1.375 * 4 * 256 / 8 = 176 GB/s
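The bandwidth formula can be sketched the same way (the function name is mine; the parameters are the slide's examples):

```python
def dram_bandwidth_gbs(bus_clock_ghz, transfers_per_cycle, bus_width_bits):
    """Throughput = memory bus frequency * bits per cycle * bus width,
    divided by 8 to convert bits to bytes (GB/s)."""
    return bus_clock_ghz * transfers_per_cycle * bus_width_bits / 8

ddr3  = dram_bandwidth_gbs(1.333, 2, 64)   # Core i7 DDR3: ~21 GB/s
gddr5 = dram_bandwidth_gbs(1.002, 4, 384)  # GTX 580 GDDR5: ~192 GB/s
```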

Memory bandwidths
On-chip memory can be orders of magnitude faster: registers, shared memory, caches, ... E.g., the AMD HD 7970's L1 cache achieves 2 TB/s.
Other memories depend on the interconnect:
- Intel's technology: QPI (Quick Path Interconnect), 25.6 GB/s
- AMD's technology: HT3 (Hyper Transport 3), 19.2 GB/s
- Accelerators: PCI-e 2.0, 8 GB/s

Power
Chip manufacturers specify the Thermal Design Power (TDP). We can measure dissipated power for the whole system; it is typically (much) lower than TDP.
Power efficiency = FLOPS / Watt. Examples (with theoretical peak and TDP):
- Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
- NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
- ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
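Power efficiency is just the peak divided by the TDP; a quick sketch with the slide's numbers:

```python
def gflops_per_watt(peak_gflops, tdp_watt):
    """Power efficiency: theoretical peak divided by TDP."""
    return peak_gflops / tdp_watt

core_i7 = gflops_per_watt(154, 160)   # ~1.0 GFLOPs/W
hd_6970 = gflops_per_watt(2703, 250)  # ~10.8 GFLOPs/W
```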

Summary

Platform           Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Sun Niagara 2          8            64    11.2                76         0.1
IBM BG/P               4             8    13.6              13.6         1.0
IBM POWER7             8            32     265                68         3.9
Intel Core i7          4            16      85              25.6         3.3
AMD Barcelona          4             8      37              21.4         1.7
AMD Istanbul           6             6    62.4              25.6         2.4
AMD Magny-Cours       12            12     125              25.6         4.9
Cell/B.E.              8             8     205              25.6         8.0
NVIDIA GTX 580        16           512    1581               192         8.2
NVIDIA GTX 680         8          1536    3090               192        16.1
AMD HD 6970          384          1536    2703               176        15.4
AMD HD 7970           32          2048    3789               264        14.4

Absolute hardware performance
Only achieved under optimal conditions:
- Processing units 100% used
- All parallelism 100% exploited
- All data transfers at maximum bandwidth
In real life, no application is like this. Can we reason about real performance?

Performance analysis: Operational Intensity and the Roofline model
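The Roofline model bounds attainable performance by the minimum of the compute peak and bandwidth times operational intensity. A minimal sketch, plugging in the GTX 580 numbers from the earlier slides:

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, operational_intensity):
    """Attainable performance under the Roofline model: bounded by the
    compute peak and by bandwidth * operational intensity (FLOPs/Byte)."""
    return min(peak_gflops, bandwidth_gbs * operational_intensity)

# GTX 580 (peak ~1581 GFLOPs, ~192 GB/s):
low_ai  = roofline_gflops(1581, 192, 1.0)   # 1 FLOP/Byte: memory-bound, 192 GFLOPs
high_ai = roofline_gflops(1581, 192, 16.0)  # 16 FLOPs/Byte: compute-bound, 1581 GFLOPs
```

The crossover (the "ridge point") is at peak / bandwidth FLOPs/Byte; kernels below it are limited by memory, kernels above it by compute.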

An Example
I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want to hire you to use many-cores to improve the performance. Metrics I will judge candidates by:
- How fast can the application be? Execution time => what the users are interested in!
- How many times faster can you make it? Speed-up => use the best possible sequential performance.
- How do I know I should choose you? Achievable performance => reason about how far off the performance is; depends on application, hardware, and dataset!
- Is this architecture a good one to use? Utilization => did I really need this hardware?
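The metrics in this example are simple ratios. A sketch with hypothetical numbers (the 0.5-hour parallel time below is invented for illustration, not given on the slide):

```python
def speedup(t_sequential, t_parallel):
    """Speed-up relative to the best sequential execution time."""
    return t_sequential / t_parallel

def utilization(achieved_gflops, peak_gflops):
    """Fraction of the theoretical peak actually achieved."""
    return achieved_gflops / peak_gflops

# If the CEO's 2.5-hour application were brought down to 0.5 hours
# (a hypothetical result), the speed-up would be 5x:
print(speedup(2.5, 0.5))  # 5.0
```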

Questions? Comments?
For questions, comments, suggestions: A.L.Varbanescu@uva.nl