Introduction to Parallel and Distributed Computing
Linh B. Ngo
CPSC 3620
Overview: What is Parallel Computing?
- A computation is run using multiple processors.
- A problem is broken into discrete parts that can be solved concurrently.
- Each part is further broken down into a series of instructions.
- Instructions from each part execute simultaneously on different processors.
- An overall control/coordination mechanism is employed.
Overview: What is Parallel Computing?
- The computational problem should be able to:
  - be broken apart into discrete pieces of work that can be solved simultaneously;
  - execute multiple program instructions at any moment in time;
  - be solved in less time with multiple compute resources than with a single compute resource.
- The compute resources might be:
  - a single computer with multiple processors;
  - an arbitrary number of computers connected by a network;
  - a combination of both;
  - a special computational component inside a single computer, separate from the main processors (e.g., a GPU).
Measuring Speed
- FLOPS: floating-point operations per second
- MFLOPS (10^6), GFLOPS (10^9), TFLOPS (10^12), PFLOPS (10^15)
The Demand for Computation Speed
- Suppose the whole global atmosphere is divided into cells of size 1 mile x 1 mile x 1 mile, to a height of 10 miles (10 cells high): about 5 x 10^8 cells.
- Suppose you want to predict the global weather for a week.
- If each cell calculation requires 200 floating-point operations (a low estimate; it can be as high as 4000), about 10^11 floating-point operations are necessary in one time step.
The Demand for Computation Speed
- To forecast one scenario of the weather for one week using 1-minute intervals, we need 60 (min/hr) * 24 (hr/day) * 7 (days) = 10,080 (~10^4) time steps.
- The week-long forecast would therefore require 10^11 * 10^4 = 10^15 floating-point operations.
- Modern weather prediction uses multi-model ensemble forecasting, which calculates numerous different scenarios with different probabilities and provides an estimated forecast.
The Demand for Computation Speed
- The maximum number of floating-point operations per clock tick for the x86_64 CPU architecture is 4.
- A single-core CPU with a 2.5 GHz clock speed therefore peaks at 2.5 x 10^9 * 4 = 10^10 FLOPS.
- Single-core simulation: 10^15 / 10^10 = 10^5 seconds, i.e., 10^5 / (3600 * 24) ~ 1.15 days.
The Demand for Computation Speed
- To do this within five minutes: 10^15 / 300 ~ 3.35 x 10^12 FLOPS ~ 3.4 TFLOPS.
- For comparison, the PS4 has a theoretical peak performance of 1.84 TFLOPS using 8 cores.
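This back-of-the-envelope calculation is easy to script. The sketch below (plain C; the constants are the cell count, FLOP estimate, and clock figures from the slides above) reproduces the arithmetic:

```c
#include <stdio.h>

int main(void) {
    double cells         = 5e8;           /* ~5 x 10^8 atmospheric cells      */
    double flop_per_cell = 200.0;         /* low estimate of FLOPs per cell   */
    double steps         = 60.0 * 24 * 7; /* 1-minute steps for a week        */

    double total_flop = cells * flop_per_cell * steps;  /* ~10^15 FLOPs       */
    double cpu_flops  = 2.5e9 * 4;                      /* 10^10 FLOPS        */

    printf("Total work:       %.2e FLOPs\n", total_flop);
    printf("Single-core time: %.2f days\n",
           total_flop / cpu_flops / (3600 * 24));
    printf("FLOPS for 5 min:  %.2e (%.1f TFLOPS)\n",
           total_flop / 300, total_flop / 300 / 1e12);
    return 0;
}
```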
Who is using parallel computing?
Parallel Speedup
- Speedup factor, with p the number of processors:
  S(p) = (execution time using one processor) / (execution time using p processors) = t_s / t_p
- Ideally: S(p) = p (linear speedup).
- Super-linear speedup (S(p) > p) can occur because of:
  (1) a poor sequential reference implementation,
  (2) memory caching effects,
  (3) I/O blocking.
Parallel Efficiency
- Efficiency: E = S(p) / p * 100%
- Limiting factors: (1) non-parallelizable code, (2) repeated computation, (3) communication overhead.
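A minimal sketch of how speedup and efficiency fall out of two wall-clock measurements (the timing values and core count here are hypothetical placeholders):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical measured wall-clock times, in seconds. */
    double t_seq = 120.0;   /* sequential run           */
    double t_par = 18.0;    /* parallel run on p cores  */
    int    p     = 8;

    double speedup    = t_seq / t_par;        /* S(p) = t_s / t_p  */
    double efficiency = speedup / p * 100.0;  /* E = S(p)/p * 100% */

    printf("S(%d) = %.2f, E = %.1f%%\n", p, speedup, efficiency);
    return 0;
}
```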
Amdahl's Law
- Theoretical maximum speedup: let f be the fraction of the program that is not parallelizable. Assuming no communication overhead, the parallel execution time is:
  t_p = f * t_s + (1 - f) * t_s / p
- Replacing this result in the speedup equation:
  S(p) = t_s / (f * t_s + (1 - f) * t_s / p) = p / (1 + (p - 1) * f)
- As p grows without bound, S(p) approaches 1/f. This is known as Amdahl's Law.
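A short loop makes the bound concrete. This sketch tabulates S(p) = p / (1 + (p - 1) * f) for an assumed serial fraction of f = 0.05 (so the limit is 1/f = 20):

```c
#include <stdio.h>

int main(void) {
    double f = 0.05;  /* assumed: 5% of the program is not parallelizable */

    /* Amdahl's Law: S(p) = p / (1 + (p - 1) * f); no p reaches 1/f = 20. */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  S(p) = %6.2f\n", p, p / (1.0 + (p - 1) * f));
    return 0;
}
```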
Types of Parallel Computers
Flynn's Taxonomy:
- Single Instruction, Single Data Stream (SISD)
- Single Instruction, Multiple Data Streams (SIMD)
- Multiple Instructions, Single Data Stream (MISD)
- Multiple Instructions, Multiple Data Streams (MIMD)
Types of Parallel Computers
- Streaming SIMD extensions for the x86 architecture
- Shared Memory
- Distributed Shared Memory
- Heterogeneous Computing (Accelerators)
- Message Passing
Streaming SIMD
- Intel's SSE (Streaming SIMD Extensions)
- AMD's 3DNow!
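For a taste of what these extensions look like from C, the sketch below uses SSE intrinsics (from xmmintrin.h, available on any x86 compiler) to add four floats with a single instruction; the array values are illustrative only:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* One SIMD instruction adds all four float lanes at once. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```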
Shared Memory
- One process, multiple threads.
- All threads have read/write access to the same memory.
- Programming models:
  (1) Threads (pthreads): the programmer manages all parallelism.
  (2) OpenMP: compiler extensions handle parallelization through in-code markers.
  (3) Vendor libraries (e.g., the Intel Math Kernel Library).
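To illustrate the OpenMP style of in-code markers, here is a minimal sketch (compile with -fopenmp on gcc/clang; the array size and contents are arbitrary):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* One marker is enough: the compiler splits the loop across threads
       and combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %.1f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```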
Heterogeneous Computing (Accelerators)
- Cell (PS3 processor)
  - Developed for the PS3.
  - Used in the world's first supercomputer to reach 1 PFLOPS.
  - Difficult to program.
- FPGA (Field-Programmable Gate Array)
  - Dynamically reconfigurable circuit board.
  - Expensive and difficult to program.
  - Power efficient; low heat.
Heterogeneous Computing (Accelerators)
- GPU (Graphics Processing Unit)
  - Processor unit on graphics cards designed to support graphics rendering (numerical manipulation).
  - Significant advantage for certain classes of scientific problems.
- Programming models (an OpenACC-style sketch follows below):
  - CUDA: library developed by NVIDIA for their GPUs.
  - OpenACC: standard devised by NVIDIA, Cray, and The Portland Group (PGI); similar to OpenMP in its use of in-code markers to specify portions to be accelerated.
  - C++ AMP: extensions to Visual C++ (Microsoft) to direct computation to the GPU.
  - OpenCL: set of standards by the Khronos Group (the group behind OpenGL).
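As a hedged illustration of the OpenACC marker style (this assumes an OpenACC-capable compiler such as nvc from the NVIDIA HPC SDK; sizes and data are arbitrary):

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The pragma asks the compiler to offload this loop to an
       accelerator (e.g., a GPU), copying data in and out as needed. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %.1f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```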
Message Passing
- Processes manage their own memory; data is passed between processes via messages.
  - Scales well.
  - Commodity parts.
  - Expandable.
  - Heterogeneous.
- Programming models (a minimal MPI sketch follows below):
  - MPI: standardized message-passing library.
  - MPI + OpenMP (hybrid model).
  - MapReduce programming model.
  - Specialized languages (not gaining traction in the scientific computing community).
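A minimal MPI sketch showing explicit message passing between two processes (compile with mpicc, run with mpirun -np 2; the payload value is arbitrary):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```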
Benchmarking
- LINPACK (Linear Algebra Package): dense matrix solver.
- HPCC (High-Performance Computing Challenge):
  - HPL (LINPACK to solve a linear system of equations)
  - DGEMM (Double-precision General Matrix Multiply)
  - STREAM (memory bandwidth; see the triad sketch below)
  - PTRANS (parallel matrix transpose, to measure inter-processor communication)
  - RandomAccess (random memory updates)
  - FFT (double-precision complex discrete Fourier transform)
  - Communication bandwidth and latency
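For a flavor of what such kernels measure, below is a hedged sketch of the STREAM "triad" loop (array size and scalar chosen arbitrarily; the real benchmark adds careful timing, repetition, and multi-threading):

```c
#include <stdio.h>
#include <time.h>

#define N 20000000   /* arbitrary size; should exceed cache capacity */

static double a[N], b[N], c[N];

int main(void) {
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    /* STREAM triad: two reads and one write per element. */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Three arrays of 8-byte doubles are touched per iteration. */
    printf("approx bandwidth: %.2f GB/s\n", 3.0 * 8.0 * N / secs / 1e9);
    return 0;
}
```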
Benchmarking
- SHOC (Scalable Heterogeneous Computing Benchmark): benchmarks non-traditional systems (GPUs).
- TestDFSIO: benchmarks the I/O performance of MapReduce/HDFS.
Ranking
- TOP500: ranks supercomputers based on their LINPACK score.
- GREEN500: ranks supercomputers with an emphasis on energy usage (LINPACK score / power consumption).
- GRAPH500: benchmark for data-intensive computing.