It's a Multicore World. John Urbanic, Pittsburgh Supercomputing Center

Waiting for Moore's Law to save your serial code started getting bleak in 2004. Source: published SPECint data.

Moore's Law is not at all dead. Intel process technology capabilities (high-volume manufacturing):

    Year                                              2004   2006   2008   2010   2012   2014   2016   2018
    Feature size                                      90nm   65nm   45nm   32nm   22nm   16nm   11nm   8nm
    Integration capacity (billions of transistors)       2      4      8     16     32     64    128    256

[Images: a 50nm transistor from the 90nm process (Source: Intel) and an influenza virus (Source: CDC).]

That power and clock inflection point in 2004 didn't get better. Source: Kogge and Shalf, IEEE CiSE. Courtesy Horst Simon, LBNL.

Not a new problem, just a new scale. [Figure: CPU power (W) over time; photo of the Cray-2 with its cooling tower in the foreground, circa 1985.]

How do we get the same number of transistors to give us more performance without cranking up the power? [Figure: one big core plus cache (Power = 4, Performance = 2) compared with four small cores C1-C4 plus cache in the same area, each small core at Power = 1/4 and Performance = 1/2 of the big core. The key is that single-core performance grows only about as the square root of core area.]
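
A quick reading of those figure numbers, assuming the relationship the slide alludes to (Pollack's rule: single-core performance grows roughly as the square root of core area, while power grows roughly with area):

    \text{Big core: area } 4 \;\Rightarrow\; \text{Power} \approx 4,\quad \text{Perf} \approx \sqrt{4} = 2
    \text{Four small cores: area } 4 \times 1 \;\Rightarrow\; \text{Power} \approx 4,\quad \text{Perf} \approx 4\sqrt{1} = 4

Same transistor and power budget, roughly twice the aggregate throughput, provided the work can be spread across all four cores.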

And how to get more performance from more transistors with the same power. Rule of thumb: a 15% reduction in voltage yields roughly a 15% frequency reduction, a 45% power reduction, and a 10% performance reduction.

    Single core:  Area = 1,  Voltage = 1,     Freq = 1,     Power = 1,  Perf = 1
    Dual core:    Area = 2,  Voltage = 0.85,  Freq = 0.85,  Power = 1,  Perf = ~1.8
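
Where the dual-core line comes from, taking the rule of thumb at face value (a back-of-the-envelope sketch, not an exact derivation): CMOS dynamic power scales as P \propto C\,V^{2}\,f, which is why a modest voltage cut buys a large power saving.

    \text{Per core at } 0.85\,V,\ 0.85\,f:\quad P \approx 1 - 0.45 = 0.55,\qquad \text{Perf} \approx 1 - 0.10 = 0.90
    \text{Two such cores:}\quad P \approx 2 \times 0.55 = 1.1 \approx 1,\qquad \text{Perf} \approx 2 \times 0.90 = 1.8

So two slower, lower-voltage cores land back near the original power budget while delivering nearly twice the performance, provided the workload is parallel.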

Cores, Nodes, Processors, PEs? The most unambiguous way to refer to the smallest useful computing device is as a Processing Element, or PE. This is usually the same as a single core. Processors usually contain more than one core. "Node" most commonly refers to an actual physical unit, typically a circuit board or blade with a network connection; these often carry multiple processors. I will try to use the term PE consistently here, but I may slip up myself. Get used to it, as you will quite often hear all of the above terms used interchangeably where they shouldn't be.

Multi-socket Motherboards. Dual- and quad-socket boards are very common in the enterprise and HPC world, but less desirable in the consumer world.

Shared-Memory Processing at Extreme Scale. Programming: OpenMP, Pthreads, SHMEM. Examples: all multi-socket motherboards, and the SGI UV (Blacklight!): Intel Xeon 8-core processors linked by the UV interconnect, 4096 cores sharing 32 TB of memory. As big as it gets right now.
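
To give a flavor of the shared-memory programming models listed above, here is a minimal OpenMP sketch in C (an illustrative example of ours, not code from the slides): all the threads on a node update different pieces of the same array in shared memory.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N];

        /* Loop iterations are divided among the threads on this node; */
        /* every thread reads and writes the same shared array 'a'.    */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * i;
        }

        printf("Filled %d elements using up to %d threads\n", N, omp_get_max_threads());
        return 0;
    }

Built with an OpenMP-capable compiler (e.g. gcc -fopenmp), with the thread count controlled by OMP_NUM_THREADS.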

MPPs (Massively Parallel Processors): distributed memory at the largest scale, often shared memory at the lower levels.

Titan (ORNL): AMD Opteron 6274 (Interlagos) processors, 560,640 cores, Gemini interconnect (3-D torus), accelerated node design using NVIDIA multi-core accelerators, 20+ PFlops peak system performance.

Sequoia (LLNL): IBM Blue Gene/Q, 98,304 compute nodes, 1.6 million processor cores, 1.6 PB of memory, 16.32475 petaflops Rmax and 20.13266 petaflops Rpeak.
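
Machines like these are programmed with message passing across their distributed memory; a minimal MPI sketch in C (our illustrative example, not code from the slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this PE's ID within the job    */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of PEs in the job */

        /* Each PE has its own private memory; data is shared only by */
        /* explicitly sending and receiving messages.                 */
        printf("Hello from PE %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Typically compiled with mpicc and launched with mpirun (or a site launcher such as aprun on Cray systems) across many nodes.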

GPU Architecture. Kepler Streaming Multiprocessor (SMX):
- 192 SP CUDA cores per SMX: 192 fp32 ops/clock and 192 int32 ops/clock
- 64 DP CUDA cores per SMX: 64 fp64 ops/clock
- 4 warp schedulers, up to 2048 threads concurrently
- 32 special-function units
- 64 KB shared memory + L1 cache
- 48 KB read-only data cache
- 64K 32-bit registers

Top 10 Systems as of November 2015:

1. Tianhe-2 (MilkyWay), NUDT. National Super Computer Center in Guangzhou, China. Intel Xeon E5-2692 2.2 GHz, TH Express-2 interconnect, Intel Xeon Phi 31S1P accelerators. Cores: 3,120,000. Rmax: 33,862 Tflops. Rpeak: 54,902 Tflops. Power: 17.8 MW.
2. Titan (Cray XK7), Cray. DOE/SC/Oak Ridge National Laboratory, United States. Opteron 6274 2.2 GHz, Gemini interconnect, NVIDIA K20x accelerators. Cores: 560,640. Rmax: 17,590 Tflops. Rpeak: 27,112 Tflops. Power: 8.2 MW.
3. Sequoia (BlueGene/Q), IBM. DOE/NNSA/LLNL, United States. Power BQC 1.6 GHz, custom interconnect. Cores: 1,572,864. Rmax: 17,173 Tflops. Rpeak: 20,132 Tflops. Power: 7.8 MW.
4. K Computer, Fujitsu. RIKEN Advanced Institute for Computational Science (AICS), Japan. SPARC64 VIIIfx 2.0 GHz, Tofu interconnect. Cores: 705,024. Rmax: 10,510 Tflops. Rpeak: 11,280 Tflops. Power: 12.6 MW.
5. Mira (BlueGene/Q), IBM. DOE/SC/Argonne National Laboratory, United States. Power BQC 1.6 GHz, custom interconnect. Cores: 786,432. Rmax: 8,586 Tflops. Rpeak: 10,066 Tflops. Power: 3.9 MW.
6. Trinity (Cray XC40), Cray. DOE/NNSA/LANL/SNL, United States. Xeon E5-2698v3 2.3 GHz, Aries interconnect. Cores: 301,056. Rmax: 8,100 Tflops. Rpeak: 11,078 Tflops.
7. Piz Daint (Cray XC30), Cray. Swiss National Supercomputing Centre (CSCS), Switzerland. Xeon E5-2670 2.6 GHz, Aries interconnect, NVIDIA K20x accelerators. Cores: 115,984. Rmax: 6,271 Tflops. Rpeak: 7,788 Tflops. Power: 2.3 MW.
8. Hazel Hen (Cray XC40), Cray. HLRS, Germany. Xeon E5-2680 2.5 GHz, Aries interconnect. Cores: 185,088. Rmax: 5,640 Tflops. Rpeak: 7,403 Tflops.
9. Shaheen II (Cray XC40), Cray. King Abdullah University of Science and Technology, Saudi Arabia. Xeon E5-2698v3 2.3 GHz, Aries interconnect. Cores: 196,608. Rmax: 5,537 Tflops. Rpeak: 7,235 Tflops. Power: 2.8 MW.
10. Stampede, Dell. Texas Advanced Computing Center, United States. Xeon E5-2680 2.7 GHz. Cores: 462,462. Rmax: 5,168 Tflops. Rpeak: 8,520 Tflops. Power: 4.5 MW.

Projected Performance Development. [Figure: TOP500 projected performance, 1994-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, showing the Sum, N=1, and N=500 trend lines. Data from the TOP500 list, November 2012. Courtesy Horst Simon, LBNL.]

Trends with ends. Source: Kogge and Shalf, IEEE CiSE. Courtesy Horst Simon, LBNL.

Amdahl's Law. If there is an x% serial component, speedup cannot be better than 100/x. If you decompose a problem into many parts, the parallel time cannot be less than the largest of the parts. If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
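
In formula form (the standard statement of Amdahl's Law, with s the serial fraction of the work and N the number of PEs):

    S(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,} \;\le\; \frac{1}{s}

For example, a code that is 5% serial (s = 0.05) can never run more than 20 times faster, no matter how many PEs you use.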

Why you should be (extra) motivated. This parallel computing thing is no fad. The laws of physics are drawing this roadmap. If you get on board (the right bus), you can ride this trend for a long, exciting trip. Let's learn how to use multicore machines!