Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620


Overview: What is Parallel Computing
- To be run using multiple processors, a problem is broken into discrete parts that can be solved concurrently
- Each part is further broken down into a series of instructions
- Instructions from each part execute simultaneously on different processors
- An overall control/coordination mechanism is employed

Overview: What is Parallel Computing
The computational problem should be able to:
- Be broken apart into discrete pieces of work that can be solved simultaneously
- Execute multiple program instructions at any moment in time
- Be solved in less time with multiple compute resources than with a single compute resource
The compute resources might be:
- A single computer with multiple processors
- An arbitrary number of computers connected by a network
- A combination of both
- A special computational component inside a single computer, separate from the main processors (e.g., a GPU)

Measuring Speed
FLOPS: floating-point operations per second. Common scales: MFLOPS (10^6), GFLOPS (10^9), TFLOPS (10^12), PFLOPS (10^15).

The Demand for Computation Speed
Suppose the whole global atmosphere is divided into cells of size 1 mile x 1 mile x 1 mile, to a height of 10 miles (10 cells high): about 5 x 10^8 cells. Suppose you want to predict the global weather for a week. If each calculation requires 200 floating-point operations (a low estimate; it can be as high as 4000), about 200 x 5 x 10^8 = 10^11 floating-point operations are necessary in one time step.

The Demand for Computation Speed
To forecast one scenario of the weather for one week using 1-minute intervals, we need 60 (min/hr) x 24 (hr/day) x 7 (days) = 10,080, or roughly 10^4, time steps. The week-long forecast would therefore require about 10^11 x 10^4 = 10^15 floating-point operations. Modern weather prediction uses multi-model ensemble forecasting, which calculates numerous different scenarios with different probabilities and provides an estimated forecast.

The Demand for Computation Speed
The maximum number of floating-point operations per clock tick for the x86_64 CPU architecture is 4. A single-core CPU with a 2.5 GHz clock speed can therefore sustain at most 2.5 x 10^9 x 4 = 10^10 FLOPS. A single-core simulation would take 10^15 / 10^10 = 10^5 seconds, or about 10^5 / (3600 x 24) ≈ 1.16 days.

The Demand for Computation Speed
To do this within five minutes (300 seconds) would require 10^15 / 300 ≈ 3.35 x 10^12 FLOPS, about 3.4 TFLOPS. For comparison: the PS4 has a theoretical peak performance of 1.84 TFLOPS using 8 cores.
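As a sanity check, the arithmetic above fits in a few lines of C; every constant below is one of the slides' estimates, not a measured value:

```c
#include <stdio.h>

int main(void) {
    double cells      = 5e8;   /* ~5 x 10^8 atmospheric cells           */
    double ops_cell   = 200.0; /* FLOPs per cell per time step (slide)  */
    double steps      = 1e4;   /* ~10^4 one-minute steps in a week      */
    double core_flops = 1e10;  /* 2.5 GHz x 4 FLOPs/cycle, single core  */

    double total = cells * ops_cell * steps;   /* ~10^15 FLOPs */
    printf("total work    : %.2e FLOPs\n", total);
    printf("one core      : %.2f days\n", total / core_flops / 86400.0);
    printf("for 5 minutes : %.2e FLOPS needed\n", total / 300.0);
    return 0;
}
```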

Who is using parallel computing?

Parallel Speedup
Speedup factor, with p the number of processors:
S(p) = (execution time using one processor) / (execution time using p processors) = t_s / t_p
Ideally: S(p) = p (linear speedup).
Super-linear speedup, S(p) > p, can occur due to: (1) a poor sequential reference implementation, (2) memory caching, (3) I/O blocking.

Parallel Efficiency
Efficiency: E(p) = S(p) / p x 100%.
Limiting factors: (1) non-parallelizable code, (2) repetition of work, (3) communication overhead.
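A minimal C sketch of both definitions; the timings t_s and t_p here are hypothetical placeholders standing in for real measurements:

```c
#include <stdio.h>

int main(void) {
    double t_s = 64.0;  /* sequential run time (s), hypothetical      */
    double t_p = 9.2;   /* run time (s) on p processors, hypothetical */
    int    p   = 8;

    double speedup    = t_s / t_p;            /* S(p) = t_s / t_p */
    double efficiency = speedup / p * 100.0;  /* E(p) = S(p) / p  */

    printf("S(%d) = %.2f (ideal: %d)\n", p, speedup, p);
    printf("E(%d) = %.1f%%\n", p, efficiency);
    return 0;
}
```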

Amdahl's Law
Let f be the fraction of the program that is not parallelizable. Assuming no communication overhead, the parallel execution time is:
t_p = f * t_s + (1 - f) * t_s / p
Replacing this result in the speedup equation:
S(p) = t_s / (f * t_s + (1 - f) * t_s / p) = p / (1 + (p - 1) * f)
As p grows, S(p) approaches the theoretical maximum 1/f. This is known as Amdahl's Law.
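A short C sketch tabulating Amdahl's bound for an assumed serial fraction f = 0.05 makes the ceiling visible: no processor count pushes the speedup past 1/f = 20.

```c
#include <stdio.h>

/* Amdahl's law: S(p) = p / (1 + (p - 1) * f),
   where f is the non-parallelizable fraction of the program. */
int main(void) {
    double f = 0.05;  /* example serial fraction; an assumption */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  S(p) = %6.2f\n", p, p / (1.0 + (p - 1) * f));
    /* As p grows, S(p) flattens toward 1/f = 20. */
    return 0;
}
```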

Types of Parallel Computers: Flynn's Taxonomy
- Single Instruction, Single Data Stream (SISD)
- Single Instruction, Multiple Data Streams (SIMD)
- Multiple Instructions, Single Data Stream (MISD)
- Multiple Instructions, Multiple Data Streams (MIMD)

Types of Parallel Computers

Types of Parallel Computers
- Streaming SIMD extensions for the x86 architecture
- Shared Memory
- Distributed Shared Memory
- Heterogeneous Computing (Accelerators)
- Message Passing

Streaming SIMD
Intel's SSE, AMD's 3DNow!
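To make the SIMD idea concrete, here is a minimal C sketch using SSE intrinsics from xmmintrin.h (x86 only): a single instruction adds four packed single-precision floats at once.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86) */

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

    __m128 va = _mm_loadu_ps(a);          /* load 4 floats           */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* 4 adds, one instruction */

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```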

Shared Memory
One processor, multiple threads. All threads have read/write access to the same memory. Programming models:
(1) Threads (pthreads): the programmer manages all parallelism
(2) OpenMP: compiler extensions handle parallelization through in-code markers (see the sketch below)
(3) Vendor libraries (e.g., the Intel Math Kernel Library)
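A minimal OpenMP sketch in C, assuming an OpenMP-capable compiler (e.g., gcc -fopenmp): the pragma is the in-code marker, and all threads read and write the same shared arrays.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N];

    /* The compiler divides the loop iterations among threads;
       every thread works on the same arrays in shared memory. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("a[0] = %g, max threads = %d\n", a[0], omp_get_max_threads());
    return 0;
}
```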

Heterogeneous Computing (Accelerators)
Cell (the PS3 processor):
- Developed for the PS3
- Used in the world's first supercomputer to reach 1 PFLOPS (IBM Roadrunner)
- Difficult to program
FPGA (field-programmable gate array):
- Dynamically reconfigurable circuit
- Expensive, difficult to program
- Power efficient, low heat

Heterogeneous Computing (Accelerators)
GPU (Graphics Processing Unit): the processing unit on graphics cards, designed to support graphics rendering (numerical manipulation). Significant advantage for certain classes of scientific problems. Programming models:
- CUDA: library developed by NVIDIA for their GPUs
- OpenACC: standard developed by NVIDIA, Cray, and the Portland Group (PGI); similar to OpenMP in its use of in-code markers to specify portions to be accelerated (see the sketch below)
- C++ AMP: extensions to Visual C++ (Microsoft) to direct computation to the GPU
- OpenCL: set of standards by the Khronos Group (the group behind OpenGL)
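Of these models, OpenACC reads most like the OpenMP example above. A minimal sketch, assuming an OpenACC-capable compiler such as nvc with -acc:

```c
#include <stdio.h>

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The in-code marker asks the compiler to move this loop and
       its data to the accelerator (e.g., a GPU). */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %g (expect 4)\n", y[0]);
    return 0;
}
```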

Message Passing
Processes manage their own memory; data is passed between processes via messages.
- Scales well
- Commodity parts
- Expandable
- Heterogeneous
Programming models:
- MPI: standardized message-passing library (see the sketch below)
- MPI + OpenMP (hybrid model)
- MapReduce programming model
- Specialized languages (not gaining traction in the scientific computing community)
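A minimal MPI sketch in C: two processes with separate address spaces exchange a single integer via an explicit message (build with mpicc, launch with mpirun -np 2):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes   */

    if (rank == 0) {
        int msg = 42;  /* data moves only via explicit send/receive */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 of %d received %d\n", size, msg);
    }
    MPI_Finalize();
    return 0;
}
```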

Benchmarking
LINPACK (Linear Algebra Package): dense matrix solver
HPCC (High-Performance Computing Challenge) suite:
- HPL (LINPACK, solving a linear system of equations)
- DGEMM (double-precision general matrix multiply)
- STREAM (memory bandwidth)
- PTRANS (parallel matrix transpose, to measure inter-processor communication)
- RandomAccess (random memory updates)
- FFT (double-precision complex discrete Fourier transform)
- Communication bandwidth and latency
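For a feel of what one of these kernels measures, here is a rough, single-threaded C sketch of STREAM's triad loop; the real benchmark repeats the kernel, uses wall-clock timers, and controls for cache effects:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t n = 20 * 1000 * 1000;  /* 3 arrays x 160 MB each */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)      /* the "triad" kernel */
        a[i] = b[i] + 3.0 * c[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* 24 bytes of traffic per element: read b, read c, write a */
    printf("~%.2f GB/s\n", 24.0 * n / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```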

Benchmarking
- SHOC (Scalable Heterogeneous Computing benchmark): benchmarks non-traditional systems (GPUs)
- TestDFSIO: benchmarks the I/O performance of MapReduce/HDFS

Ranking
- TOP500: ranks supercomputers by their LINPACK score
- GREEN500: ranks supercomputers with emphasis on energy usage (LINPACK score / power consumption)
- GRAPH500: benchmark for data-intensive computing