High Performance Computing An introduction talk. Fengguang Song

1 High Performance Computing: An introduction talk. Fengguang Song

2 Content
What is HPC
History of supercomputing
Current supercomputers (Top 500)
Common programming models, tools, and languages
Simple example of matrix multiplication
Software challenges

3 What is HPC
A branch of computer science that uses many high-end computing resources to solve large-scale problems
It involves many technologies: hardware, architecture, OS, software/tools, energy/heat, performance analysis and measurement, algorithms, compilers, runtimes, and so on
A yardstick for measuring a country's technological level
It is used to improve our lives and competitiveness and to enable innovation; it is the most important weapon for remaining a world leader
It is used in many different areas: medicine to consumer products, energy to aerospace, HIV to auto collisions, ...

4 Video links (truncated in the source): v=s9ypcptpsuy&list=flbujrsspiu9ltoznbu6tvla&index= and v=tgsrvv9u32m
Used everywhere
Very important to the economy

5 Understanding Flop/s
How to measure computing power?
Flop/s: floating point operations per second
A regular desktop: 10 Gflop/s, i.e., 10 billion operations per second
Teraflop/s = 1000 Gflop/s
Petaflop/s = 1000 Tflop/s
Exaflop/s = 1000 Petaflop/s
Top One (in June 2014): Tianhe-2, 33.8 Petaflop/s
If Tianhe-2 calculates for 1 second and each person holds one calculator, how long will it take all 7 billion people on earth to do the same work? Two months!
6 seconds -> 1 year!
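The calculator comparison on this slide can be checked with a few lines of arithmetic. The sketch below assumes that each person performs one operation per second (the slide does not state a rate), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Days needed for `people` humans, each doing `ops_per_person` operations
   per second (an assumed rate), to match `machine_seconds` seconds of work
   on a machine sustaining `flops` floating point operations per second. */
double human_days(double flops, double machine_seconds,
                  double people, double ops_per_person)
{
    assert(people > 0.0 && ops_per_person > 0.0);
    double total_ops = flops * machine_seconds;
    double seconds = total_ops / (people * ops_per_person);
    return seconds / 86400.0;   /* 86400 seconds per day */
}

/* Print the two cases quoted on the slide. */
void print_tianhe2_comparison(void)
{
    const double tianhe2 = 33.8e15;   /* 33.8 Petaflop/s */
    const double people = 7e9;        /* 7 billion people */
    printf("1 s of Tianhe-2: %.1f days\n",
           human_days(tianhe2, 1.0, people, 1.0));
    printf("6 s of Tianhe-2: %.1f days\n",
           human_days(tianhe2, 6.0, people, 1.0));
}
```

Under that assumption, one second of Tianhe-2 corresponds to roughly 56 days of work for everyone on earth, and six seconds to roughly 335 days, which matches the slide's "two months" and "1 year" figures.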

6 Exascale Computing
Climate change:
  Sea level rise
  Severe weather
  Regional climate change
  Geologic carbon sequestration
Energy related:
  Reducing time and cost of reactor design and deployment
  Improving the efficiency of combustion energy sources
National nuclear security:
  Stockpile certification
  Predictive scientific challenges
  Real-time evaluation of urban nuclear detonation

7 Simulation Enables Fundamental Advances in Science
Nuclear physics
  Quark-gluon plasma & nucleon structure
  Fission and fusion reactions
Facility and experimental design
  Design of accelerators
  Probes of dark energy and dark matter
  ITER shot planning and device control
Materials / Chemistry
  Predictive multi-scale materials modeling
  Effective, commercial, renewable energy technologies, catalysts and batteries
Life Science
  Better biofuels
  Sequence to structure to function
[Figures: ITER, ILC, structure of nucleons]

8 Exascale to Arrive in 2020
How powerful could it be? 30 Tianhe-2 systems

9 Challenge of Energy Cost

10 Proposed Timeline for Exascale Computing (DOE)

11 Four Paradigms and HPC
Historically, the two dominant paradigms for scientific discovery are:
  Theory
  Experiment
With progress in computer technology:
  Vacuum tube -> transistors -> IC -> VLSI
  Moore's Law
  Supercomputers are more and more powerful
Simulation emerges as the 3rd paradigm
Today, Big Data emerges as the 4th paradigm
  Data -> knowledge

12 The listing of the 500 most powerful computers in the world
Yardstick: Rmax from LINPACK
  Ax = b, dense problem
Updated twice every year:
  SC ## in USA in November (next week in New Orleans)
  ISC conference in Germany in June
All data available from www.top500.
The following Top500-related slides are from Dr. Jack Dongarra at the University of Tennessee, Knoxville
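To make the "Ax = b, dense problem" yardstick concrete, here is a minimal sketch of a dense solve by Gaussian elimination with partial pivoting, plus a rough Flop/s estimate. It is an illustration only, not the benchmark code: the function names are hypothetical, and the estimate uses only the leading-order operation count of about 2n^3/3.

```c
#include <assert.h>

static double absval(double v) { return v < 0.0 ? -v : v; }

/* Solve the dense system A x = b in place by Gaussian elimination with
   partial pivoting. A is n-by-n, row-major; on return b holds x.
   Returns 0 on success, -1 if a zero pivot is found. */
int dense_solve(int n, double *A, double *b)
{
    assert(n > 0);
    for (int k = 0; k < n; k++) {
        /* Pick the row with the largest entry in column k as the pivot. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (absval(A[i * n + k]) > absval(A[p * n + k]))
                p = i;
        if (A[p * n + k] == 0.0)
            return -1;
        if (p != k) {
            for (int j = 0; j < n; j++) {
                double t = A[k * n + j];
                A[k * n + j] = A[p * n + j];
                A[p * n + j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        /* Eliminate column k from the rows below the pivot. */
        for (int i = k + 1; i < n; i++) {
            double m = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; j++)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution on the upper-triangular system. */
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++)
            s -= A[i * n + j] * b[j];
        b[i] = s / A[i * n + i];
    }
    return 0;
}

/* Rough Flop/s estimate from the leading-order count 2n^3/3. */
double approx_flops(int n, double seconds)
{
    assert(seconds > 0.0);
    return (2.0 / 3.0) * (double)n * n * n / seconds;
}
```

Timing `dense_solve` for a large n and passing the elapsed time to `approx_flops` gives a number in the same spirit as Rmax, although production benchmark runs use heavily tuned, distributed implementations.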

13 Top One in the Past 20 Years

14 November 2013: TOP 10

15 Vendors Share

16 CPUs Share

17 Accelerators (53 Systems)

18 Countries Share

19 Performance Trend in Top500

20 Entering the Multicore Era
Why do we need multicore exactly?
The free lunch of increasing frequencies has ended.
$\text{Power} \propto \text{Voltage}^2 \times \text{Frequency}$
$\text{Frequency} \propto \text{Voltage}$ -> $\text{Power} \propto \text{Frequency}^3$
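A short worked example shows why the cubic relation favors more cores over higher clock rates. It is an idealized sketch: it assumes perfect parallel speedup and ignores static power. Here $P$ is power, $V$ is voltage, and $f$ is frequency.

```latex
\begin{align*}
P &\propto V^2 f, \qquad f \propto V \;\Rightarrow\; P \propto f^3 \\
P_{\text{two cores at } f/2} &= 2 \left(\tfrac{1}{2}\right)^{3} P_{\text{one core at } f}
  = \tfrac{1}{4}\, P_{\text{one core at } f} \\
\text{Throughput}_{\text{two cores at } f/2} &= 2 \cdot \tfrac{1}{2}\, \text{Throughput}_{\text{one core at } f}
  = \text{Throughput}_{\text{one core at } f}
\end{align*}
```

Under these assumptions, two cores at half the frequency deliver the same throughput for about a quarter of the power.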

21 499 of the 500 systems are multicore

22 GPU Gaining Popularity
Why do we need GPUs?
Peak FLOPS rates are significantly higher than those of CPUs
Higher memory bandwidth (177 GB/s vs 11 GB/s)
Better performance per unit energy (0.2 vs 2 nJ/instruction)
Higher computational density
The GPU is optimized for throughput
[Figure: CPU vs. GPU]

23 Parallel Programming Models
On shared-memory systems: multithreaded programming
  Pthread, OpenMP
On distributed-memory systems:
  MPI
On GPU accelerators:
  Nvidia: CUDA
  AMD: OpenCL
If the machine is distributed and has both multicore CPUs and GPUs, then what to do?
  Mix all three programming models: MPI/Pthread/CUDA!
[Figure: multicore host system. Labels: DDR3 RAMs, CPU, QPI, I/O Hub, PCIe x8, PCIe x16, Infiniband, GPU, device memory]

24 Shared-memory Parallel Programming: OpenMP
A portable API that supports shared-memory parallel programming on many platforms
A set of compiler directives and an API for C, C++, and Fortran
You need to identify parallel regions: blocks of code that can run in parallel
You can modify sequential code easily.
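As an illustration of how little a sequential loop has to change, the sketch below parallelizes a dot product with a single directive. The function name `dot` is chosen for this example only.

```c
#include <assert.h>

/* Dot product of x and y. The pragma marks the loop as a parallel region;
   reduction(+:sum) gives each thread a private partial sum and combines
   the partial sums at the end. Without OpenMP support the pragma is
   ignored and the loop runs sequentially. */
double dot(int n, const double *x, const double *y)
{
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

With GCC, OpenMP is enabled by compiling with `-fopenmp`; the same source still builds and gives the same answer for this input style without the flag.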

25 Shared-memory Parallel Programming: Pthread
POSIX standard for thread creation and synchronization
It is a specification, not an implementation: it could be a user-level or kernel-level library
Available on most Unix OSes (Linux, Mac OS, Solaris)

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*thrd_routine)(void *), void *arg);

Creates a new thread in the calling process
The new thread will execute the thrd_routine function
You can also pass a void* arg argument to the thread.
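The following sketch shows `pthread_create` and `pthread_join` used together to sum an array with a fixed number of threads. The struct, the function names, and the thread count are assumptions made for this example.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4   /* assumed thread count for this sketch */

/* Argument block passed to each thread through the void* parameter. */
struct slice {
    const double *data;
    int begin, end;      /* half-open range [begin, end) */
    double partial;      /* result written by the thread */
};

/* Thread routine: sum one slice of the array. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    s->partial = 0.0;
    for (int i = s->begin; i < s->end; i++)
        s->partial += s->data[i];
    return NULL;
}

/* Sum n values using NUM_THREADS threads; returns the total. */
double parallel_sum(const double *data, int n)
{
    pthread_t tid[NUM_THREADS];
    struct slice s[NUM_THREADS];
    int chunk = (n + NUM_THREADS - 1) / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        s[t].data = data;
        s[t].begin = t * chunk < n ? t * chunk : n;
        s[t].end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        int rc = pthread_create(&tid[t], NULL, sum_slice, &s[t]);
        assert(rc == 0);
        (void)rc;
    }
    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);   /* wait before reading the result */
        total += s[t].partial;
    }
    return total;
}
```

Each thread writes only to its own `partial` field, so no lock is needed; the join acts as the synchronization point. On most systems the program is built with the `-pthread` compiler flag.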

26 MPI Programming Model
Message-passing interface (MPI)
A standard (specification)
Many implementations: MPICH, Open MPI, Intel MPI
  It would be OK even if each vendor provided its own implementation
Supports two types of operations:
  Point-to-point
  Collective: e.g., broadcast, all_gather
New features: remote memory, parallel I/O, dynamic processes, threads, etc.

27 An Example of Point-to-point MPI
Send/Receive (from P0 to P1), with buffers A(10) and B(20):

    MPI_Send( A, 10, MPI_DOUBLE, 1, ... )
    MPI_Recv( B, 20, MPI_DOUBLE, 0, ... )

Datatype
  Basic for heterogeneity
  Derived for non-contiguous
Contexts
  Message safety for libraries
Buffering
  Robustness and correctness

28 CUDA (Compute Unified Device Architecture)
Architecture and programming model, introduced by NVIDIA in 2007
Enables GPUs to execute programs written in C.
Within C programs, call kernel routines that are executed on the GPU.
A CUDA syntax extension to C identifies a routine as a kernel.
Easy to learn, although getting the highest performance requires understanding the hardware architecture

29 CUDA Programming Paradigm
There are 3 key abstractions:
  A hierarchy of thread groups
  Shared memories
  Barrier synchronization
Kernel code: a sequential function for one thread, designed to be executed by many threads
Thread block: a set of concurrent threads that execute the same kernel
Grid: a set of thread blocks, which execute in parallel (kernel A -> kernel B -> kernel C)
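The kernel, thread block, and grid hierarchy can be sketched in plain C. This is an emulation for illustration, not CUDA code: the two loops stand in for the blocks and threads that a GPU would run concurrently, and all names are hypothetical. In CUDA the three index arguments would come from `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```c
#include <assert.h>

/* Body of a hypothetical "kernel": the sequential work of ONE thread. */
static void add_one_kernel(int block_idx, int block_dim, int thread_idx,
                           int n, float *data)
{
    int i = block_idx * block_dim + thread_idx;   /* global thread index */
    if (i < n)            /* the last block may have surplus threads */
        data[i] += 1.0f;
}

/* Emulate launching a 1D grid: every thread of every block runs the kernel.
   A GPU runs these concurrently; here they run one after another. */
void launch_add_one(int block_dim, int n, float *data)
{
    assert(block_dim > 0 && n >= 0);
    int grid_dim = (n + block_dim - 1) / block_dim;   /* blocks needed */
    for (int b = 0; b < grid_dim; b++)
        for (int t = 0; t < block_dim; t++)
            add_one_kernel(b, block_dim, t, n, data);
}
```

For 5 elements and a block size of 4, the grid has 2 blocks and 8 threads; the bounds check lets the 3 surplus threads do nothing.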

30 [Figure: CUDA thread hierarchy. The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 of the device. Grid 1 contains Blocks (0,0), (1,0), (0,1), and (1,1); Block (1,1) is expanded to show its threads, labeled Thread (0,0,0) through Thread (3,1,0), with (0,0,1) through (3,0,1) behind them.]
Courtesy: NVIDIA

31 Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory

32 Hello World! with Device Code

    #include <stdio.h>

    /* Empty kernel: runs on the device, launched from the host. */
    __global__ void mykernel(void) { }

    int main(void) {
        mykernel<<<1,1>>>();        /* kernel launch: 1 block of 1 thread */
        printf("Hello World!\n");
        return 0;
    }

The CUDA C/C++ keyword __global__ indicates a function that runs on the device and is called from host code
mykernel<<<1,1>>>(): triple angle brackets mark a call from host to device code, called a kernel launch

33 Software Optimizations (Take DGEMM as Example)
Matrix multiplication C = A x B
DGEMM: Double-precision General Matrix Multiplication

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop access pattern: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed
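In row-major C arrays, the column-wise walk through B in the ijk inner loop touches memory with stride n. One common alternative, shown here as a sketch rather than as the method used in the talk, reorders the loops to i-k-j so that the inner loop walks B and C row-wise. The function name and the flat-array layout are assumptions of this example.

```c
#include <assert.h>

/* C = A * B for n-by-n row-major matrices, loops ordered i-k-j.
   The inner loop walks row k of B and row i of C with stride 1,
   while the value A[i][k] stays fixed in a local variable. */
void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            double r = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += r * B[k * n + j];
        }
    }
}
```

Both orderings perform the same arithmetic; any speed difference comes from how well the access pattern uses the cache, so the gain depends on the machine and the matrix size.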

34 Optimized Matrix Multiply: Unrolled C code

    #include <x86intrin.h>
    #define UNROLL (4)

    /* C += A * B for n-by-n column-major matrices, using AVX.
       Assumes n is a multiple of UNROLL*4 and 32-byte-aligned arrays. */
    void dgemm (int n, double* A, double* B, double* C)
    {
      for ( int i = 0; i < n; i+=UNROLL*4 )
        for ( int j = 0; j < n; j++ ) {
          __m256d c[4];   /* 4 vectors = 16 elements of column j of C */
          for ( int x = 0; x < UNROLL; x++ )   // can do it manually!
            c[x] = _mm256_load_pd(C+i+x*4+j*n);

          for( int k = 0; k < n; k++ )
          {
            /* copy B[k][j] into all 4 lanes */
            __m256d b = _mm256_broadcast_sd(B+k+j*n);
            for (int x = 0; x < UNROLL; x++)
              c[x] = _mm256_add_pd(c[x],
                       _mm256_mul_pd(_mm256_load_pd(A+n*k+x*4+i), b));
          }

          for ( int x = 0; x < UNROLL; x++ )
            _mm256_store_pd(C+i+x*4+j*n, c[x]);
        }
    }

35 DGEMM
Combining cache blocking and subword parallelism: 15x speedup
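Cache blocking can be sketched without any vector intrinsics. The version below is an illustration under stated assumptions: row-major storage, a tile size of 32 picked arbitrarily, and hypothetical function names. It does not reproduce the code behind the 15x figure.

```c
#include <assert.h>

#define BLOCK 32   /* assumed tile size; tune for the target cache */

static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B for n-by-n row-major matrices, computed tile by tile so that
   the data touched by the three inner loops is small enough to stay in
   cache while it is being reused. */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    assert(n > 0);
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < min_int(ii + BLOCK, n); i++)
                    for (int k = kk; k < min_int(kk + BLOCK, n); k++) {
                        double r = A[i * n + k];
                        for (int j = jj; j < min_int(jj + BLOCK, n); j++)
                            C[i * n + j] += r * B[k * n + j];
                    }
}
```

The routine accumulates into C, so the caller zeroes C first when a plain product is wanted. The `min_int` bounds let n be any positive size, not just a multiple of the tile size.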

36 New Software Challenges
Another disruptive technology
Barriers
  Design did not anticipate exponential growth in parallelism
  Number of components and MTBF change the game
Technical focus areas
  System hardware scalability
  System software scalability
  Application scalability
Technical gaps to close
  1000x improvement in system software scaling
  100x improvement in system software reliability
New wisdom
  Data movement is expensive
  Flop/s are cheap

37 Critical Issues in Future New Software
Synchronization-reducing algorithms
  Break the fork-join model
Communication-reducing algorithms
  Use methods that attain the lower bound on communication
Mixed-precision methods
  2x speed of ops and 2x speed for data movement
Autotuning
  Today's machines are too complicated; build smarts into software to adapt to the hardware
Fault-resilient algorithms
  Implement algorithms that can recover from failures/bit flips
Reproducibility of results
  Today we can't guarantee this. We understand the issues, but some of our colleagues have a hard time with this.
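One root of the reproducibility issue is that floating-point addition is not associative, so a parallel reduction that combines partial sums in a different order can return a different result. The sketch below shows the effect with two sequential summation orders; the function names are hypothetical.

```c
#include <assert.h>

/* Sum left to right: ((x0 + x1) + x2) + ... */
double sum_forward(const double *x, int n)
{
    assert(n >= 0);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sum right to left, as a different thread schedule might. */
double sum_backward(const double *x, int n)
{
    assert(n >= 0);
    double s = 0.0;
    for (int i = n - 1; i >= 0; i--)
        s += x[i];
    return s;
}
```

For the input {1.0, 1e16, -1e16} in IEEE 754 double precision, the forward sum loses the 1.0 when it is added to 1e16 and returns 0, while the backward sum cancels the large terms first and returns 1.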

38 HPC Lab at IUPUI
Ongoing research projects
  1) Parallel matrix problem solvers: CPU bound
  2) Scalable CFD LBM-IB methods for life sciences: memory bound and I/O intensive
  3) Integrating Big Compute with Big Data with new enabling technologies, to couple computation-intensive applications with data-intensive analysis efficiently
Big Data? We tackle them as extreme-scale applications of HPC
If you have the need for speed and the desire to solve the largest problems in the world, HPC may be your future research area
An interesting CFD simulation (10/30/14):
