Introduction to Parallel Scientific Computing

1 Introduction to Parallel Scientific Computing
TDT Lect. 3 & 4
Anne C. Elster
Dept. of Computer & Info. Sci. (IDI)
Norwegian Univ. of Science and Technology
Trondheim, Norway
8-11 Aug, Anne C. Elster, NTNU/IDI

2 KDS IDI 2011
Study programmes vs. sections
Complex Computer Systems (Komplekse Datasystemer)

3 TDT 4200 Fall 2011
- Instructor: Dr. Anne C. Elster
- Support staff:
  - Research assistant (vit.ass.): TBA
  - Teaching assistant (und.ass.): Ruben Spaans
- Web page:
- Lectures:
  - Wednesdays 10:15-12:00 in F3 (may move due to class size?)
  - Thursdays 13:15-14:00 in F3
- Recitation (øvingstimer): Thursdays 14:15-16:00 in F3
- It's Learning!

4 Courses Taught by Dr. Elster (Computational Science / Beregningsvitenskap):
- TDT4200 Parallel Computing (parallel programming with MPI & threads)
- TDT24 Parallel Environments & Numerical Computing
  - 2-day IBM CellBE course (Fall 2007)
  - GPU & thread programming
- TDT 4205 Compilers
- DTD 8117 Grid and Heterogeneous Computing

5 HPC History: Personal perspective
- 1980s: Concurrent and Parallel Pascal
- 1986: Intel iPSC Hypercube, CMI (Bergen) and Cornell (Cray at NTNU)
- 1987: Cluster of 4 IBM 3090s
- Intel hypercubes; some on BBN
- KSR (MPI1 & 2)
- SGI systems (some IBM SP)
- 2001-current: Clusters
- 2006: IBM NTNU (Njord, 7+ TFLOPS, proprietary switch); GPU programming (Cg)
- 2008: Quad-core supercomputer at UiTø (Stallo); HPC-LAB at IDI/NTNU opens with several NVIDIA donations; several quad-core machines (1-2 donated by Schlumberger)
- 2009: More NVIDIA donations: NVIDIA Tesla s1070 and two Quadro FX 5800 cards (Jan. 09)

6 Hypercubes
- Distributed processor systems with a log n processor connection pattern
  - Intel iPSC
  - Connection Machine
- Maps tree structures well using Gray codes (each dimension in the cube maps to a level in the tree)
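
The connection pattern and the Gray-code idea can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function names are made up for the example, and it shows the simpler ring embedding (consecutive Gray codes are always hypercube neighbours) rather than the tree mapping mentioned on the slide.

```python
def gray_code(i):
    """Return the i-th binary-reflected Gray code."""
    return i ^ (i >> 1)


def hypercube_neighbors(node, dim):
    """Nodes directly linked to `node` in a dim-dimensional hypercube.

    Flipping one address bit gives one neighbour, so each of the
    n = 2**dim nodes has dim = log2(n) links.
    """
    return [node ^ (1 << bit) for bit in range(dim)]


def gray_ring(dim):
    """Order all 2**dim nodes so that consecutive nodes are neighbours."""
    return [gray_code(i) for i in range(1 << dim)]


if __name__ == "__main__":
    print(gray_ring(3))               # [0, 1, 3, 2, 6, 7, 5, 4]
    print(hypercube_neighbors(5, 3))  # [4, 7, 1]
```

Each step along the ring changes exactly one address bit, so every hop uses a single physical hypercube link.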

7 Shared vs. Distributed Memory
[Figure: processors attached to a shared memory vs. processors with distributed memory]

8 Replicated vs. Distributed grids
Replicated grids
- Pros:
  - Easier to implement
  - Load balanced
- Con:
  - Does not scale well (e.g. need to sum grids)
Distributed grids
- Pros:
  - Scales much better, especially for large problems
- Cons:
  - More complicated message passing
  - May need load-balancing of particles

9 COTS-based SUPERCOMPUTER HARDWARE TRENDS:
- Intel iPSC (mid-1980s)
  - The first iPSC had no separate communication processor...
  - Specialized OS nodes
- Today's PC clusters
  - Fast Ethernet or better (more expensive interconnect)
  - Linux OS
  - 32-bit cheapest, but many 64-bit cluster vendors
- Top500 supercomputers
  - Today's GPU farms entering Top500 list!!
Anne C. Elster, TDT 4200 Parallel Computing intro, Lect 3-4, IDI/NTNU, Aug 2011

10 HPC Hardware Trends at IDI
Clustis3 (quad-core cluster), installed Spring
- (9) ProLiant DL160GS servers, each with:
  - 2 E5405 2.0 GHz quad-core processors
  - 9GB FB-DIMM memory
  - 160GB SATA disk
  - 2 GbE network cards

11 HPC Hardware Trends at IDI
- NVIDIA 280 Tesla card
- Unpacking NVIDIA s1070 and Quadro FX 5800 cards

12 New architectural features to consider:
- Register usage
- On-chip memory
- Cache / stack manipulation
- Precision effects
- Multiplication time addition time
Note: Recursion benefits from on-chip registers available for the stack

13 Memory Hierarchy (fastest first, slower further down)
- Registers
- Cache (Level 1-3)
- On-chip RAM
- RAM
- Disks: SSD
- Disks: HD
- Tapes (robot storage)

14 Memory Access
- Floating-point optimization: factor 2 (in-cache)
- Memory access optimizations: factor 10 or more!! (out-of-cache data)
- Much more for RAM vs. disk!!

15 Main HW/SW Challenges:
- Slow interconnects (improving, but at a cost...)
- Slow protocols (TCP/IP --> VIA/new technologies)
- MEMORY BANDWIDTH!!!

16 Multi-level Caching
- Accessing 2D-3D cells of a large grid in the same cache line can give large performance improvements
- E.g. a 128-byte cache line (holds a 4x4 grid of floating-point numbers)
- Traditional: 2*16 cache hits = 32
- Cell-caching: ((3*3)*1) + ((3*3)*2) + 4 = 25 cache hits
- 25% improvement!
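
The effect of storing neighbouring cells in the same cache line can be explored with a small counting model. The sketch below is only an illustration under stated assumptions: 8-byte values (so 16 per 128-byte line), a 3x3 neighbourhood access pattern, a grid whose side is a multiple of 16, and made-up function names. It does not reproduce the hit counts on the slide; it only compares a row-major layout with a layout that packs each 4x4 tile into one line.

```python
LINE_BYTES = 128                      # cache-line size from the slide
VALUE_BYTES = 8                       # assumption: 64-bit floating-point values
PER_LINE = LINE_BYTES // VALUE_BYTES  # 16 values per cache line
TILE = 4                              # a 4x4 tile of values fills one line


def line_row_major(i, j, n):
    """Cache-line index of cell (i, j) when rows are stored contiguously."""
    return (i * n + j) // PER_LINE


def line_tiled(i, j, n):
    """Cache-line index of cell (i, j) when each 4x4 tile fills one line."""
    tiles_per_row = n // TILE
    return (i // TILE) * tiles_per_row + (j // TILE)


def avg_lines_touched(line_of, n):
    """Average number of distinct lines read by a 3x3 neighbourhood."""
    total = 0
    count = 0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lines = {line_of(i + di, j + dj, n)
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)}
            total += len(lines)
            count += 1
    return total / count


if __name__ == "__main__":
    n = 64  # hypothetical grid size
    print("row-major:", avg_lines_touched(line_row_major, n))
    print("4x4 tiles:", avg_lines_touched(line_tiled, n))
```

In this model the tiled layout touches fewer distinct lines per neighbourhood on average, which is the kind of saving the slide refers to.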

17 Cluster technologies for HPC
- Advantage: Very cost-effective hardware since it uses COTS (Commercial Off-The-Shelf) parts
- BUT: Typically much slower processor interconnects than traditional HPC systems
- What about usability?

18 MESSAGE PASSING CAVEAT:
Global operations have a more severe impact on cluster performance than on traditional supercomputers, since communication between processors takes relatively more of the total execution time
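
To see why global operations are communication-heavy, it helps to count the communication rounds in one. The sketch below simulates a global sum using recursive doubling in plain Python (no MPI involved). It assumes a power-of-two number of ranks, and the function name is hypothetical; real MPI libraries pick their own algorithms.

```python
def allreduce_sum(values):
    """Simulate a recursive-doubling global sum over len(values) ranks.

    values[r] is the local contribution of rank r.
    Returns (results, rounds): every rank's final value and the number
    of pairwise exchange rounds that were needed.
    """
    p = len(values)
    if p == 0 or p & (p - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two rank count")
    current = list(values)
    rounds = 0
    step = 1
    while step < p:
        # In each round, rank r exchanges its partial sum with rank r ^ step.
        current = [current[r] + current[r ^ step] for r in range(p)]
        step *= 2
        rounds += 1
    return current, rounds


if __name__ == "__main__":
    results, rounds = allreduce_sum([1, 2, 3, 4])
    print(results, rounds)  # [10, 10, 10, 10] 2
```

Even in this favourable scheme every rank must complete log2(p) exchanges before it can continue, so a slow interconnect is paid for log2(p) times per global operation.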

19 HARDWARE TRENDS CONTIN.:
- 32-bit --> 64-bit architectures
- 1 CPU --> multiple CPUs (2-4)
- THE WAL-MART EFFECT:
  - game stations (e.g. Playstation-2 farm at UIUC)
  - graphics cards
- Low-power COTS devices??

20 The Ideal Cluster -- Hardware
- High-bandwidth network
- Low-latency network
- Low operating system overhead (TCP causes slow start)
- Great floating-point performance (64-bit processors or more?)

21 The Ideal Cluster -- Software
- Compiler that is:
- Portable
- Optimizing
- Does extra work to save communication
- Self-tuning / load-balanced
- Automatic selection of best algorithm
- One-sided communication support?
- Optimized middleware

22 The Wal-Mart Effect (PARA02)
- Wal-Mart:
  - bigger than Sears, K-mart and JC Penney's combined
  - predicted to influence $40 billion of IT investments (MIT Review)
  - has much more impact than Microsoft and Cisco could ever hope for
- Not driven by latest technology, but by business model: bad news for HPC?
- Game market --> HPC market
- Future high-performance chips and systems --> NVIDIA Tesla!

25 MPI (Message Passing Interface)
- Communication routines standard developed for multiprocessor systems and clusters of workstations
- Originally targeted Fortran and C
- Now also C++
- Newer strains: OpenMPI and MPI-Java

26 What is MPI? -- continued
- Message passing model
- Standard (specification)
- Many implementations
  - MPICH was at first the most widely used
  - OpenMPI currently the most used impl.?
- Two phases:
  - MPI 1: Traditional message-passing
  - MPI 2: Remote memory (one-sided communications), parallel I/O, and dynamic processes

27 Notes on black board re. MPI basics
From A User's Guide to MPI by Peter Pacheco
1. Intro
2. Greetings!
3. Collective Communication
4. Grouping Data for Communication
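
The message-passing pattern behind a "Greetings!"-style program (every process except rank 0 sends a message, and rank 0 receives and collects them in rank order) can be imitated without MPI. The sketch below is a plain-Python simulation using threads and queues, not MPI code: `send`, `recv` and `run_greetings` are hypothetical stand-ins chosen for this example, and the greeting text is illustrative.

```python
import queue
import threading


def run_greetings(num_ranks):
    """Run num_ranks threads; every rank except 0 sends a greeting to rank 0."""
    # One mailbox per source rank, so rank 0 can receive in rank order.
    inbox = {src: queue.Queue() for src in range(1, num_ranks)}
    received = []

    def send(src, message):   # hypothetical stand-in for a send to rank 0
        inbox[src].put(message)

    def recv(src):            # hypothetical stand-in for a blocking receive
        return inbox[src].get()

    def worker(rank):
        if rank != 0:
            send(rank, f"Greetings from process {rank}!")
        else:
            for src in range(1, num_ranks):
                received.append(recv(src))

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    for line in run_greetings(4):
        print(line)
```

Because rank 0 receives from sources 1, 2, 3, ... in turn, the output order is deterministic even though the senders run concurrently.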

28 LIBRARIES:
- MPICH (ANL): public domain (working with LBNL on VIA version)
- MPI LAM (more MPI-2 features): public domain
- MPI-FM (UIUC/UCSD): public domain; MPICH built on top of Fast Messages
- MPI/Pro (MPI Technologies, Inc): commercial (working on VIA version)
- PaTENT MPI 4.0 (Genias GmbH): commercial MPI for Windows NT (PaTENT = Tool Environment for NT)
- SCALI, Norway: commercial
- MPI from MESH Technologies (Brian Vinter): commercial
- Threaded MPI (Penti Hutnanen, others)
- OpenMP for clusters (B. Chapman), Hybrid OpenMP/MPI

29 GPUs: Graphics Processing Units
HISTORY:
- Late 70s / early 80s: Graphics drawing calculations on CPUs
  - Xerox Alto computer: first special bit block transfer instruction
  - Commodore Amiga: first mass-market video accelerator able to draw filled shapes & animations in HW. Graphics sub-system w/ several chips, incl. one dedicated to bit block transfer
- Early 90s: 2D acceleration
- Ca. 1995: VIDEO GAMES! --> 3D GPUs

30 GPU History continued:
- 3D rasterization (converting simple 3D geometric primitives, e.g. lines, triangles, rectangles, to 2D screen pixels)
- Texture mapping (mapping a 2D texture image to a planar 3D surface)
- 3D translation, rotation & scaling
- Towards 2000: GPUs more configurable
- 2001 and beyond: programmable (ability to change individual pixels)

31 Limitations
- Branching usually not a good idea
- GPU cache is different from CPU cache
  - Optimized for 2D locality
  - Random memory access problematic
- Floating point precision
- No integers or booleans (also currently no bit-wise operators, but Cg reserved symbols for these)

32 GPU: general programming view
- Programmable MIMD processor: the vertex processor (one vector & one scalar op per clock cycle)
- Rasterizer: pass through or interpolate values (e.g. passing 4 coordinates to draw a rectangle leads to interpolation of pixel coordinates of vertices)
- Programmable SIMD processor: the fragment processor (up to 32 ops/cycle)
- Simple blending unit (serial): z-compares and sends to memory

33 GPU -- Outside view
[Figure: blocks labelled Memory, Programmable MIMD processor, Rasterization, Programmable SIMD processor, Blend, output]

34 GPU Internal Structure

35 General programming on GPUs
- Rendering = executing
- GPU textures = CPU arrays
- Fragment shader programs = inner loops
- Rendering to texture memory = feedback
- Vertex coordinates = computational range
- Texture coordinates = computational domain
- Now have NVIDIA's CUDA library! (BLAS & FFT)
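
The correspondences above can be made concrete with a CPU-side analogy. The sketch below is not GPU code: it treats a nested list as the "texture", a small function as the "fragment program", and a loop over all positions as one "rendering pass". The names `render` and `blur_x` are invented for this illustration.

```python
def render(fragment_program, texture):
    """Apply fragment_program once per texel, like one rendering pass.

    texture is a list of rows (the CPU array standing in for a texture);
    the returned array plays the role of the render target.
    """
    height, width = len(texture), len(texture[0])
    return [[fragment_program(texture, x, y) for x in range(width)]
            for y in range(height)]


def blur_x(texture, x, y):
    """Example 'inner loop' body: average a texel with its horizontal
    neighbours, clamping at the edges."""
    row = texture[y]
    left = row[max(x - 1, 0)]
    right = row[min(x + 1, len(row) - 1)]
    return (left + row[x] + right) / 3.0


if __name__ == "__main__":
    tex = [[0.0, 3.0, 0.0]]
    once = render(blur_x, tex)
    twice = render(blur_x, once)   # feeding the output back in = feedback
    print(once, twice)
```

Feeding the output of one pass into the next corresponds to "rendering to texture memory = feedback" in the list above.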

37 SNOW SIMULATION DEMO!
Robin Eidissen (Teaching Assistant)

39 Modularizing Large Codes
- Split large codes into separate independent modules (e.g. initializer, solvers, trackers, etc.)
- Easier to maintain and debug
- Allows use of external packages (BLAS, LAPACK, PETSc)
- Can use code as a test-bed for parts of future codes
