CSC2/458 Parallel and Distributed Systems Machines and Models

CSC2/458 Parallel and Distributed Systems Machines and Models Sreepathi Pai January 23, 2018 URCS

Outline Recap Scalability Taxonomy of Parallel Machines Performance Metrics

Goals What is the goal of parallel programming?

Scalability Why is scalability important?

Outline Recap Scalability Taxonomy of Parallel Machines Performance Metrics

Speedup T_1 is the time on one processor; T_n is the time on n processors. Speedup(n) = T_1 / T_n

Amdahl's Law Let T_1 = T_serial + T_parallelizable. Then T_n = T_serial + T_parallelizable / n, assuming perfect scalability. Dividing both T_1 and T_n by T_1 gives the serial and parallelizable ratios, r_serial and r_parallelizable. Speedup(n) = 1 / (r_serial + r_parallelizable / n)

Amdahl's Law In the limit: Speedup(∞) = 1 / r_serial. This is also known as strong scalability: the work is fixed and the number of processors is varied. What are the implications of this?

Scalability Limits Assuming infinite processors, what is the speedup if: the serial ratio r_serial is 0.5 (i.e. 50%)? the serial ratio is 0.1 (i.e. 10%)? the serial ratio is 0.01 (i.e. 1%)?
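
By Amdahl's Law these limits are 1/0.5 = 2x, 1/0.1 = 10x, and 1/0.01 = 100x. Here is a minimal C sketch that checks this (the helper name amdahl_speedup is mine, not from the lecture); compile with, e.g., gcc -O2 amdahl.c:

```c
#include <stdio.h>

/* Amdahl's Law: Speedup(n) = 1 / (r_serial + r_parallelizable / n),
   with r_parallelizable = 1 - r_serial. As n -> infinity, this
   approaches 1 / r_serial. */
double amdahl_speedup(double r_serial, double n) {
    return 1.0 / (r_serial + (1.0 - r_serial) / n);
}

int main(void) {
    double ratios[] = { 0.5, 0.1, 0.01 };
    for (int i = 0; i < 3; i++) {
        double r = ratios[i];
        printf("r_serial = %.2f: speedup on 1024 procs = %6.2f, limit = %6.2f\n",
               r, amdahl_speedup(r, 1024.0), 1.0 / r);
    }
    return 0;
}
```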

Current Top 5 supercomputers Sunway TaihuLight (10.6M cores), Tianhe-2 (3.1M cores), Piz Daint (361K cores), Gyoukou (19.8M cores), Titan (560K cores). Source: Top 500

Weak Scalability Work increases as the number of processors increases; the parallel work should increase linearly with processors. Work: W = αw + (1 − α)w, where α is the serial fraction of the work. Scaled work: W' = αw + n(1 − α)w. An empirical observation, usually referred to as Gustafson's Law. Source: http://www.johngustafson.net/pubs/pub13/amdahl.htm
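
Dividing the scaled work W' by W gives the scaled (weak-scaling) speedup α + n(1 − α), the usual statement of Gustafson's Law. A minimal sketch, with my own helper name gustafson_speedup:

```c
#include <stdio.h>

/* Gustafson's Law: scaling the work to W' = a*w + n*(1-a)*w while the
   parallel runtime stays fixed gives a scaled speedup of
   W'/W = a + n*(1-a). */
double gustafson_speedup(double a, double n) {
    return a + n * (1.0 - a);
}

int main(void) {
    /* Serial fraction a = 0.1, as in the Amdahl quiz above: under weak
       scaling the speedup keeps growing with n instead of capping at 10. */
    for (double n = 1; n <= 1024; n *= 4)
        printf("n = %4.0f procs: scaled speedup = %7.1f\n",
               n, gustafson_speedup(0.1, n));
    return 0;
}
```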

Outline Recap Scalability Taxonomy of Parallel Machines Performance Metrics

Organization of Parallel Computers Components of parallel machines: Processing Elements, Memories, Interconnect (how processors and memories are connected to each other)

Flynn's Taxonomy Based on the notion of streams: an instruction stream and a data stream. Taxonomy based on the number of each type of stream: Single Instruction - Single Data (SISD), Single Instruction - Multiple Data (SIMD), Multiple Instruction - Single Data (MISD), Multiple Instruction - Multiple Data (MIMD). Flynn, M. J. (1966), Very High Speed Computing Systems, Proceedings of the IEEE. http://ieeexplore.ieee.org/document/1447203/

SIMD Implementations: Vector Machines The Cray-1 (circa 1977): eight Vx vector registers (V0-V7), 64 elements each, 64 bits per element; a vector length register (Vlen); a vector mask register. Richard Russell, The Cray-1 Computer System, Comm. ACM 21, 1 (Jan 1978), 63-72

Vector Instructions Vertical For 0 ≤ i < Vlen: dst[i] = src1[i] + src2[i]. Example: [1 2 3 4] + [5 6 2 3] = [6 8 5 7]. Most arithmetic instructions are of this form.
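
Cray-style vector hardware is long gone from desktops, but the same vertical operation survives in the short-vector SIMD extensions mentioned later in this deck. A minimal sketch of the slide's example using x86 SSE intrinsics (Vlen is fixed at 4 here; compile with gcc -msse, which is on by default for 64-bit targets):

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void) {
    /* The slide's example, [1 2 3 4] + [5 6 2 3] = [6 8 5 7], computed
       lane-by-lane by a single vector instruction (addps). */
    __m128 src1 = _mm_setr_ps(1, 2, 3, 4);
    __m128 src2 = _mm_setr_ps(5, 6, 2, 3);
    __m128 dst  = _mm_add_ps(src1, src2);

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```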

Vector Instructions Horizontal For 0 ≤ i < Vlen: dst = min(src1[i], dst). Example: 1 = min([1 2 3 4]). Note that dst is a scalar. Mostly reductions (min, max, sum, etc.). Not well supported; the Cray-1 did not have this.
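
The "not well supported" point still holds for plain SSE, which has no single horizontal-min instruction for floats: the reduction has to be synthesized from shuffles and vertical mins. A sketch of that pattern:

```c
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 v = _mm_setr_ps(1, 2, 3, 4);

    /* log2(4) = 2 rounds of shuffle + vertical min: first pair up lanes
       {0,1} and {2,3}, then combine the two pairwise minima so every
       lane holds the overall minimum. */
    __m128 t = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    t = _mm_min_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));

    printf("min = %g\n", _mm_cvtss_f32(t));  /* prints 1 */
    return 0;
}
```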

Vector Instructions Shuffle/Permute dst = shuffle(src, mask), i.e. dst[i] = src[mask[i]]. Example: src = [1 2 3 4], mask = [0 3 1 1], dst = [1 4 2 2]. Poor support on older implementations; reasonably well-supported on recent implementations.
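
One way to express a shuffle with a runtime mask on x86 is AVX's vpermilps, exposed as the _mm_permutevar_ps intrinsic. A sketch of the slide's exact example (compile with -mavx):

```c
#include <stdio.h>
#include <immintrin.h>  /* AVX for _mm_permutevar_ps (vpermilps) */

int main(void) {
    /* The slide's example: dst[i] = src[mask[i]]. */
    __m128  src  = _mm_setr_ps(1, 2, 3, 4);
    __m128i mask = _mm_setr_epi32(0, 3, 1, 1);
    __m128  dst  = _mm_permutevar_ps(src, mask);

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 1 4 2 2 */
    return 0;
}
```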

Masking/Predication g5mask = gt(src1, 5); dst = mul(src1, src2, g5mask). Example: src1 = [6 5 7 2] gives g5mask = [1 0 1 0]; then src1 * src2 = [6 5 7 2] * [1 4 2 2] = dst = [6 ? 14 ?], where ? marks lanes masked off by g5mask.
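
SSE comparisons produce all-ones/all-zero lane masks but there is no masked multiply, so predication has to be emulated; hardware masked execution appears in, e.g., the Cray-1's vector mask register or AVX-512's k-registers. A sketch where the slide's "?" lanes come out as zero:

```c
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 src1 = _mm_setr_ps(6, 5, 7, 2);
    __m128 src2 = _mm_setr_ps(1, 4, 2, 2);

    /* g5mask = gt(src1, 5): all-ones lanes where src1 > 5, zero elsewhere. */
    __m128 g5mask = _mm_cmpgt_ps(src1, _mm_set1_ps(5.0f));

    /* Emulated predication: multiply every lane, then AND with the mask.
       The slide's "?" lanes become 0 here; true predication would leave
       them unmodified instead. */
    __m128 dst = _mm_and_ps(_mm_mul_ps(src1, src2), g5mask);

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 6 0 14 0 */
    return 0;
}
```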

MISD - ? Flynn, M. J. (1966), Very High Speed Computing Systems, Proceedings of the IEEE. http://ieeexplore.ieee.org/document/1447203/

What type of machine is this? Hyperthreaded Core Different colours in RAM indicate different instruction streams. Source: https://en.wikipedia.org/wiki/hyper-threading

What type of machine is this? GPU Each instruction is 32-wide. Source: https://devblogs.nvidia.com/inside-pascal/

What type of machine is this? TPU Matrix Multiply Unit Source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

TPU Overview Source: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

Modern Multicores Multiple cores (MIMD) plus (short) vector instruction sets (SIMD): MMX, SSE, AVX (Intel); 3DNow! (AMD); NEON (ARM)

Outline Recap Scalability Taxonomy of Parallel Machines Performance Metrics

Metrics we care about Latency: time to complete a task (lower is better). Throughput: rate of completing tasks (higher is better). Utilization: fraction of time a worker (processor, unit) is busy (higher is better). Speedup: higher is better.

Reducing Latency Use cheap operations. Which of these operations are expensive: Bitshift, Integer Divide, Integer Multiply? Latency is fundamentally bounded by physics.

Increasing Throughput Parallelize! Lots of techniques; the focus of this class. Add more processors, though you need lots of parallel work to benefit.

Speedup Measure speedup w.r.t. the fastest serial code, not the parallel program on 1 processor. Always report runtime, never speedup alone. Are superlinear speedups possible?
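
As one concrete way to follow the "always report runtime" advice, here is a minimal OpenMP timing sketch (my own example, not from the lecture; compile with gcc -O2 -fopenmp). Note the serial baseline here is just the same loop on one thread, which is not necessarily the fastest serial code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long n = 1L << 25;  /* ~33M doubles, ~268 MB */
    double *a = malloc(n * sizeof(double));
    if (!a) return 1;
    for (long i = 0; i < n; i++) a[i] = 1.0;

    double t0 = omp_get_wtime();
    double ssum = 0.0;
    for (long i = 0; i < n; i++) ssum += a[i];
    double t1 = omp_get_wtime();

    double psum = 0.0;
    #pragma omp parallel for reduction(+:psum)
    for (long i = 0; i < n; i++) psum += a[i];
    double t2 = omp_get_wtime();

    /* Print the sums too, so the compiler cannot eliminate either loop. */
    printf("sums %g %g | serial %.3fs parallel %.3fs speedup %.2fx\n",
           ssum, psum, t1 - t0, t2 - t1, (t1 - t0) / (t2 - t1));
    free(a);
    return 0;
}
```

A memory-bound sum like this often scales poorly past a few cores, which is exactly why the slide insists on reporting the raw runtimes rather than speedup alone.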