Current Trends in High Performance Computing


Chokchai "Box" Leangsuksun, PhD
SWEPCO Endowed Professor*, Computer Science
Director, High Performance Computing Initiative, Louisiana Tech University
12 December 2011
*The SWEPCO endowed professorship is made possible by the LA Board of Regents.

Outline
- What is HPC?
- Current trends
- More on PS3 and GPU computing
- Conclusion


Mainstream CPUs
- CPU clock speed has plateaued at 3-4 GHz
- More cores on a single chip
  - Dual- and quad-core chips are here now
  - Many-core (GPGPU)
- Traditional applications won't get a free ride
  - Conversion to parallel computing (HPC, multithreading) is needed
[Figure: CPU clock speed trend showing the 3-4 GHz cap. The diagram is from the "no free lunch" article in DDJ.]

New trends in computing
- Old and current: SMP, cluster
- Multicore computers: Intel Core 2 Duo, AMD 2x 64
- Many-core accelerators: GPGPU, FPGA, Cell
- More (many) brains in one computer, rather than a higher CPU frequency
- Harness many computers: cluster computing

What is HPC?
- High Performance Computing: parallel computing, supercomputing
- Achieves the fastest possible computing outcome
- Subdivides a very large job into many pieces
- Enabled by multiple high-speed CPUs, networking, software, and programming paradigms
- Technologies that help solve non-trivial tasks in science, engineering, medicine, business, entertainment, and more
- Time to insight, time to discovery, time to market

Parallel Programming Concepts
- Conventional serial execution: the problem is represented as a series of instructions that are executed by one CPU.
- Parallel execution: the problem is partitioned into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.
- Parallel computing takes advantage of concurrency to:
  - Solve larger problems in less time
  - Save wall-clock time
  - Overcome memory constraints
  - Utilize non-local resources
[Figure: serial execution (problem, instructions, one CPU) versus parallel execution (problem split into tasks, each task's instructions running on its own CPU)]
Source: Thomas Sterling's intro to HPC
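The partitioning idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`partition`, `partial_sum`, `parallel_sum`) and the summation workload are hypothetical and not from the slides. Note also that threads in standard CPython do not run Python bytecode on several cores at once, so the sketch shows the structure of a partitioned job (mutually exclusive, collectively exhaustive slices) rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(n_items, n_tasks):
    """Split range(n_items) into n_tasks contiguous slices.

    The slices never overlap (mutually exclusive) and together
    cover every index (collectively exhaustive).
    """
    base, extra = divmod(n_items, n_tasks)
    bounds, start = [], 0
    for task in range(n_tasks):
        # The first `extra` tasks take one additional item each.
        stop = start + base + (1 if task < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partial_sum(data, start, stop):
    """Work done by one task: sum only its own slice."""
    return sum(data[start:stop])


def parallel_sum(data, n_tasks=4):
    """Run one task per slice, then combine the partial results."""
    bounds = partition(len(data), n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        futures = [pool.submit(partial_sum, data, a, b) for a, b in bounds]
        return sum(f.result() for f in futures)


data = list(range(1, 1001))           # hypothetical workload
total = parallel_sum(data, n_tasks=4)
print(total)
```

On a real HPC system the same decomposition would typically be expressed with MPI processes or OpenMP threads instead of a Python thread pool.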


HPC Applications and Major Industries
- Finite element modeling: auto/aero
- Fluid dynamics: auto/aero, consumer packaged goods manufacturers, process manufacturing, disaster preparedness (tsunami)
- Imaging: seismic and medical
- Finance and business: banks, brokerage houses (regression analysis, risk, options pricing, what-if analysis, ...); Wal-Mart's use of HPC in its operations
- Molecular modeling: biotech and pharmaceuticals
- Common traits: complex problems, large datasets, long runs
This slide is from the Intel presentation "Technologies for Delivering Peak Performance on HPC and Grid Applications".

HPC Drives Knowledge Economy


Life Science Problem: an example of protein folding
- A molecular dynamics simulation of a protein folding problem takes a computing year in serial mode
- Petaflop = a thousand trillion floating point operations per second
Excerpted from David Klepacki (IBM), "The future of HPC"

Disaster Preparedness: examples
- Project LEAD: severe weather (tornado) prediction, led by OU; HPC and dynamic adaptation to weather forecasts
- Professor Seidel's LSU CCT: hurricane route prediction
  - Emergency preparedness
  - Accuracy of prediction: 1 square mile = $1 M

HPC accelerates a product
- FE analysis on 1 CPU: 1,000,000 elements
- Numerical processing for 1 element = 0.1 secs
- One computer will take 100,000 secs = 27.7 hrs
- With, say, 100 CPUs: 0.277 hr, about 16.7 mins (assuming the work divides evenly with no parallel overhead)

Avian Flu Pandemic Modeled on a Supercomputer
- MIDAS (Models of Infectious Disease Agent Study) program
- The large-scale, stochastic simulation model examines the nationwide spread of a pandemic influenza virus strain
- A simulation starts with 2 passengers infected with avian flu arriving at LAX
- The simulation rolls out a city-to-city and census-tract-level picture of the spread of infection through a synthetic population of 281 million people over the course of 180 days
- It is a very large-scale and complex multivariate problem
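The finite element estimate from the "HPC accelerates a product" slide can be reproduced as a short calculation. The sketch below assumes ideal scaling, meaning the elements divide evenly across CPUs and there is no communication or synchronization cost. That assumption is an idealization: real codes rarely scale perfectly. The function names are hypothetical.

```python
def serial_time_secs(n_elements, secs_per_element):
    """Total time when one CPU processes every element in turn."""
    return n_elements * secs_per_element


def ideal_parallel_time_secs(serial_secs, n_cpus):
    """Time on n_cpus under ideal (linear) scaling with zero overhead."""
    return serial_secs / n_cpus


serial = serial_time_secs(1_000_000, 0.1)          # about 100,000 secs
parallel = ideal_parallel_time_secs(serial, 100)   # about 1,000 secs

serial_hours = serial / 3600        # roughly 27.8 hours
parallel_minutes = parallel / 60    # roughly 16.7 minutes

print(f"1 CPU:    {serial_hours:.1f} hrs")
print(f"100 CPUs: {parallel_minutes:.1f} mins")
```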


Avian Flu Pandemic (90 days)
Timothy C. Germann, Kai Kadau, Catherine A. Macken (Los Alamos National Laboratory); Ira M. Longini Jr. (Emory University)
Source: www.lanl. [...]

Avian Flu Pandemic (II)
- The results show that advance preparation of a modestly effective vaccine in large quantities appears preferable to waiting for a well-matched vaccine that may arrive too late.
- The simulation models a synthetic population that matches U.S. census demographics and worker mobility data by randomly assigning simulated individuals to households, workplaces, schools, and the like.
- The models serve as virtual laboratories to study how infectious diseases spread and which intervention strategies are more effective.
- Run on the Los Alamos supercomputer known as Pink, a 1,024-node (2,048-processor) LinuxBIOS/BProc cluster with 2 GB per node.

Significant indicators: why HPC now?
- Mainstream computers now have multiple cores (Intel or AMD)
  - In the past 1-2 years, CPU speed has flattened at 3+ GHz
  - More CPUs in one chip: dual-core and multi-core chips
  - Traditional software won't take advantage of these new processors
  - Personal/desktop supercomputing
- Many real problems are highly computationally intensive
  - NSA uses supercomputing for data mining
  - DOE: fusion, plasma, energy-related work (including weaponry)
  - Helps solve problems in many other important areas (nanotech, life science, etc.)
  - Product design, ERM/inventory management
- Giants have recently turned their attention to HPC
  - Bush's State of the Union speech: three main S&T focuses, one of which is supercomputing
  - Bill Gates' keynote speech at SC05: Microsoft goes after HPC
  - Google search engine: 100,000 nodes
  - PlayStation 3 is a personal supercomputing platform
  - Hollywood (entertainment) is HPC-bound (Pixar uses more than 3,000 CPUs to render animation)

HPC preparedness
- Build a workforce that understands the HPC paradigm and its applications
  - HPC/Grid curriculum in IT/CS/CE/ICT
  - Offer HPC-enabling tracks to other disciplines (engineering, life science, physics, computational chemistry, business, etc.)
  - Train the business community
- Bring awareness to the public
- National and strategic policies
- Improve infrastructure

Pause here
- Switch to a tour of the machine rooms: the clusters and our lab, to show students what they will be using
- Collect student information on the signup sheet for accounts on our clusters (azul, quadcore, GPU, and PS3)
- Intro to Linux
- Then continue with HPC101

HPC

How to Run Applications Faster?
There are 3 ways to improve performance, each with a computer analogy:
- Work harder: use faster hardware
- Work smarter: use optimized algorithms and techniques to solve computational tasks
- Get more help: use multiple computers to solve a particular task

Parallel Programming Concepts
[Figure: a problem partitioned into tasks, with each task's instructions running on its own CPU]
Source: Thomas Sterling's intro to HPC

HPC objective
- High Performance Computing: parallel computing, supercomputing
- Achieves the fastest possible computing outcome
- Subdivides a very large job into many pieces
- Enabled by multiple high-speed CPUs, networking, software, and programming paradigms
- Technologies that help solve non-trivial tasks in science, engineering, medicine, business, entertainment, and more

Flynn's Taxonomy of Computer Architectures
- SISD: Single Instruction/Single Data
- SIMD: Single Instruction/Multiple Data
- MISD: Multiple Instruction/Single Data
- MIMD: Multiple Instruction/Multiple Data

Single Instruction/Single Data
- PU: processing unit
- Your desktop, before the spread of dual-core CPUs
Slide source: Wikipedia, Flynn's Taxonomy

Flavors of SISD
[Figure labeled "Instructions"]

More on pipelining

Single Instruction/Multiple Data
- Processors that execute the same instruction on multiple pieces of data: NVIDIA GPUs
Slide source: Wikipedia, Flynn's Taxonomy


Single Instruction/Multiple Data
- Each core runs the same set of instructions on different data
- Example: a GPGPU processes the pixels of an image in parallel
Slide source: Klimovitski & Macri, Intel

SISD versus SIMD
- Writing a compiler for SIMD architectures is VERY difficult (inter-thread communication complicates the picture)
Slide source: Ars Technica, PeakStream article

Multiple Instruction/Single Data
- Pipeline example: the CMU Warp machine
Slide source: Wikipedia, Flynn's Taxonomy

Multiple Instruction/Multiple Data
- Example: multicore systems are based on a MIMD architecture plus a programming paradigm such as OpenMP or multithreading
Slide source: Wikipedia, Flynn's Taxonomy


Multiple Instruction/Multiple Data
- The sky is the limit: each PU is free to do as it pleases
- Can be of either the shared memory or the distributed memory category
[Figure labeled "Instructions"]

Current HPC Hardware
- Traditionally, HPC has adopted expensive parallel hardware:
  - Massively Parallel Processors (MPP)
  - Symmetric Multi-Processors (SMP)
  - Cluster computers
- Recent trends in HPC:
  - Multicore systems
  - Heterogeneous computing with accelerator boards (GPGPU, FPGA)


HPC cluster
- Login
- Compile
- Submit job
- At least 2 connections
- Run tasks

Parallel Programming Environments and Tools
- Threads (PCs, SMPs, NOW, ...): POSIX Threads, Java Threads
- MPI: Linux, NT, and many supercomputers
- OpenMP (predominantly on SMP)
- PVM (old)
- UPC, Co-array Fortran
- CUDA, Brook+, OpenCL
- Software DSMs (Shmem)
- Compilers
- RAD (rapid application development) tools
- Debuggers
- Performance analysis tools
- Visualization tools

Recent Trends in HPC Hardware
- Multicore and many-core are here now
  - Multiple CPUs in a single die
  - Better power consumption
  - Tightly coupled and better for multithreading
- GPGPU
- Used as building blocks for much larger systems
- New Top 500 HPC systems: clusters of multicore CPUs and GPGPUs

What are HPC systems

Current top 5 systems

Shared vs Distributed Memory

Shared memory
- Global memory space, accessible by all processors
- Processors may have local memory to hold copies of some global memory
- Consistency of copies is usually maintained by hardware (cache coherency)

Two typical classes of SM
- Uniform Memory Access (UMA): equal access times and identical processors; typically represented by Symmetric Multi-Processor (SMP) machines or multicores
- Non-Uniform Memory Access (NUMA): memory access times are not uniform, and memory access across a link is slower; often made by physically linking two or more SMPs, or by heterogeneous computing

Advantage & Disadvantage
Advantages:
- Global address space is user-friendly
- Data sharing between tasks is fast
Disadvantages:
- System may suffer from lack of scalability: adding CPUs increases traffic on the shared memory-to-CPU path. This is especially true for cache-coherent systems
- Programmer is responsible for correct synchronization
- Systems larger than an SMP need some special-purpose components

Distributed Memory
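The shared-memory point that the programmer is responsible for correct synchronization can be illustrated with a small sketch. It assumes Python's standard `threading` module as a stand-in for a shared-memory system; the class and function names (`SharedCounter`, `run_tasks`) are hypothetical. Every thread updates the same variable, and the lock makes each read-modify-write step atomic so that no update is lost.

```python
import threading


class SharedCounter:
    """A value in shared memory, protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute the read-modify-write.
        with self._lock:
            self.value += 1


def run_tasks(n_threads, increments_per_thread):
    """Start n_threads that all update the same counter, then wait for them."""
    counter = SharedCounter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


result = run_tasks(n_threads=4, increments_per_thread=10_000)
print(result)
```

Without the lock, the final count would no longer be guaranteed, which is the kind of error the slide warns about. The same discipline applies to POSIX threads (mutexes) and OpenMP (critical sections or atomics).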

Multicores
Three multicore classifications:
- Homogeneous
- Heterogeneous
- Hybrid

Multicores (I)
- Homogeneous cores (a main CPU)
- All cores are identical
- A traditional multicore with few cores
- Good for a few jumbo tasks
- Not as many tasks/threads as accelerators or GPUs
- E.g. Intel Core 2 Duo, i3, i5, i7; AMD
- Programming: multithreading/OpenMP

Multicores (II)
- Homogeneous cores as an accelerator or compute device
- Needs a main CPU system; the cores act as attached processing units
- All cores are identical and numerous
- Good for many SIMD tasks/threads
- E.g. NVIDIA GPGPU, ClearSpeed, FPGA
- Programming: library calls from a main program, or a new language extension, e.g. CUDA

Multicores (III)
- Heterogeneous cores
- All cores are NOT identical
- All in one die
- Programming is more difficult
- See more in the PS3 presentation


Multicores (IV)
- Hybrid system: a mix of host cores and accelerator cores
- A typical host can be anything from a desktop to a server system, e.g. Intel or AMD
- Accelerator: NVIDIA, ATI Stream, or FPGA
- Programming model is more complex
- Issue: memory bandwidth between host and devices

Introduction to Cell BE (PS3) Programming
HPCI: High Performance Computing Initiative


PS3: an awesome HPC system
- IBM Cell processor
- Affordable
- But currently not many tools

Cell BE Architecture
- PowerPC Processor Element (PPE)
  - Main processor, 64-bit
  - Also supports Vector/SIMD
  - Runs the OS and manages the SPEs
- Synergistic Processor Element (SPE)
  - 128-bit RISC, SIMD processor
  - 256 KB local storage memory
  - Uses DMA to transfer data between local storage and main memory

Cell Programming
- IBM Cell SDK
- Main process runs on the PPE
- Threads run on the SPEs
- PPE-centric programming paradigm
[Figure: one PPE process with three SPE threads]

GPGPU
General Purpose Graphics Processing Unit
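The PPE-centric paradigm from the Cell Programming slide (one main process that starts worker threads, hands them work, and collects the results) can be sketched as an analogy in Python. This is not the IBM Cell SDK API: the names `ppe_main` and `spe_worker`, the queue-based hand-off, and the squaring workload are all hypothetical, and ordinary Python threads stand in for SPE threads.

```python
import queue
import threading

N_SPE_THREADS = 3  # assumption: matches the three SPE threads in the slide's figure


def spe_worker(work_queue, results):
    """Stand-in for an SPE thread: process blocks until told to stop."""
    while True:
        item = work_queue.get()
        if item is None:          # sentinel from the main process: no more work
            break
        index, block = item
        # Hypothetical compute kernel: square every value in the block.
        results[index] = [x * x for x in block]


def ppe_main(blocks):
    """Stand-in for the PPE main process: start workers, dispatch, collect."""
    work_queue = queue.Queue()
    results = [None] * len(blocks)   # each worker writes only its own slots
    workers = [
        threading.Thread(target=spe_worker, args=(work_queue, results))
        for _ in range(N_SPE_THREADS)
    ]
    for w in workers:
        w.start()
    for index, block in enumerate(blocks):
        work_queue.put((index, block))
    for _ in workers:
        work_queue.put(None)         # one stop sentinel per worker
    for w in workers:
        w.join()
    return results


output = ppe_main([[1, 2], [3, 4], [5, 6], [7, 8]])
print(output)
```

On real Cell hardware each block would also have to fit in the SPE's local storage and be moved by DMA, which this sketch does not model.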

Two major players

Parallel Computing on a GPU
- NVIDIA GPU computing architecture
  - Via a hardware device interface
  - In laptops, desktops, workstations, servers
- 8-series GPUs deliver 50 to 500 GFLOPS on compiled parallel C applications
- Tesla T: from 1-4 TFLOPS
- GPU parallelism is outpacing Moore's law, more than doubling every year
- A GPGPU is a GPU that allows the user to run both graphics and non-graphics applications
[Figure: Tesla D870 and GeForce 8800]
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign


NVIDIA GeForce 8800 (G80)
- The eighth generation of NVIDIA's GeForce graphics cards
- High-performance CUDA-enabled GPGPU
- 128 cores
- Memory: [...] MB, or 1.5 GB in Tesla
- High-speed memory bandwidth
- Supports Scalable Link Interface (SLI)

NVIDIA Tesla(TM) Features
- GPU computing for HPC
- No display ports
- Dedicated to computation
- For massively multithreaded computing
- Supercomputing performance


NVIDIA Tesla Card
- C-Series (card) = 1 GPU with 1.5 GB
- D-Series (deskside unit) = 2 GPUs
- S-Series (1U server) = 4 GPUs
Note: 1 G80 GPU = 128 cores = ~500 GFLOPS; 1 T10 = 240 cores = 1 TFLOPS
[Figure: NVIDIA G80]
This slide is from the NVIDIA CUDA tutorial (David Kirk/NVIDIA and Wen-mei W. Hwu).


GPGPU Programming with CUDA
- CUDA (Compute Unified Device Architecture) is an SDK and API that allows a programmer to write C and Fortran programs that execute on a GPGPU
- Works with NVIDIA G80 or later and Tesla
- The GPGPU is viewed as a compute device

ATI Stream (1)

ATI

ATI 4870 X2

Architecture of ATI Radeon 4000 series
This slide is from an ATI presentation.

This slide is from an ATI presentation.

Introduction to OpenCL
Toward a new approach in computing
Moayad Almohaishi

Introduction to OpenCL
- OpenCL stands for Open Computing Language
- It comes from a consortium effort involving Apple, NVIDIA, AMD, and others
- It is managed by the Khronos Group, which is also responsible for OpenGL
- The specification took 6 months to produce

OpenCL
1. Royalty-free.
2. Supports both task-parallel and data-parallel programming modes.
3. Works on GPGPUs in a vendor-agnostic way.
4. Also works on multicore CPUs.
5. Works on Cell processors.
6. Supports handhelds and mobile devices.
7. Based on the C language (C99).

OpenCL
- Can query the available devices and build a context from them
- Programmers can program more freely for any kind of device
- Applications are more reusable even if the hardware changes in the future

OpenCL Platform Model
CPUs + GPU platforms

Performance of GPGPU
Note: a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS
David Kirk/NVIDIA and Wen-mei W. Hwu
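The ~336 GFLOPS figure quoted for the Xeon cluster is consistent with a simple peak-performance formula: nodes x CPUs per node x clock rate x floating point operations per cycle. The slide does not state the operations-per-cycle value; the sketch below assumes 2 per CPU per cycle, which is the value that reproduces the quoted number. The function name `peak_gflops` is hypothetical, and the G80 figure of ~500 GFLOPS is taken from the Tesla note earlier in the deck.

```python
def peak_gflops(nodes, cpus_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: every CPU retires flops_per_cycle each cycle."""
    return nodes * cpus_per_node * clock_ghz * flops_per_cycle


cluster_peak = peak_gflops(
    nodes=30,
    cpus_per_node=2,      # "dual Xeon"
    clock_ghz=2.8,
    flops_per_cycle=2,    # ASSUMPTION: not given on the slide
)

g80_peak = 500.0          # approximate single-GPU peak quoted in the deck
ratio = g80_peak / cluster_peak

print(f"Cluster peak: {cluster_peak:.0f} GFLOPS")
print(f"One G80 vs. the 30-node cluster: {ratio:.2f}x")
```

Peak numbers like these are upper bounds; sustained application performance on either platform is normally lower.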

Last words!
- An HPC or supercomputing system is not necessarily a gigantic installation in a big machine room; it is accessible to Thais and may now be sitting next to your desk
- Computing is a necessity, and fast computing provides a competitive edge, especially in the knowledge economy
- New trends in HPC include GPGPU and various multicore architectures
- Prepare ourselves and strengthen our S&T, industry, and business community for this phenomenon (HPC goes mainstream) before it is too late

Back up slides


Cancer Gene-mining
- Unsuccessful on a uni-processor
- Our approach: novel parallel gene-mining algorithms
  - Input from microarray
  - Retains accuracy
  - Significant speedup (superlinear)
- IBM P5 supercomputer (128-node PPC)
[Figure: time taken (in secs) to run the algorithm versus number of processors, keeping the number of nodes fixed]
[Figure: OvaMarker-based selection versus GeneSetMine-based selection across cancer types: bladder, mesothelioma, breast, renal, leukemia, prostate, lung, pancreas, colorectal, ovary, lymphoma, melanoma]

Drug Delivery
By Wu & Palmer, Louisiana Tech U, assisted by HPCI
- A study of microcapsules for drug delivery
- Computational fluid dynamics methodology to model the generation of droplets or cores (using alginate and oil)
- Goal: better understand the process parameters needed to generate cores of homogeneous size for the manufacturing of microcapsules


Droplet Generation: Experimental Procedure

Droplet Generation: Example Results
Case 1:
- Olive oil: density 930 kg/m3, viscosity 0.03 kg/m-s
- Alginate: density 1012 kg/m3, viscosity [...] kg/m-s
Case 2:
- Phase 1: density 918 kg/m3, viscosity [...] kg/m-s
- Phase 2: density [...] kg/m3, viscosity [...] kg/m-s
Source: Wu's thesis
