CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 starts today; it is already posted on Canvas (under Assignments). Let's look at it.

Fengguang Song, Department of Computer Science, IUPUI

Lab 1 starts today. It is already posted on Canvas (under Assignments); let's take a look at it.

Introduction to Parallel Processing

A multiprocessor machine is a computer system with at least two processors (as opposed to a uniprocessor). The goal is to connect multiple computers to get higher performance and to improve scalability, availability, and power efficiency (the multicore era). There are two types of parallel workloads:
Type 1: high throughput for independent jobs.
Type 2: a single program that runs on multiple processors (more difficult).

Cluster: a set of computers connected over a local area network. Clusters can serve as search engines, web servers, databases, etc.

Multicore microprocessor: a CPU containing multiple cores in a single chip/die/socket.

Today's status: all CPUs today are multicore, and the number of cores is expected to keep increasing; we expect to see roughly two additional cores per chip every two years. All machines are SMPs (shared-memory processors). Any programmer who cares about performance must become a parallel programmer. Before 2004 you didn't have to; now, sequential programs are slow. Unfortunately, no easy software or language exists for writing parallel programs that are both correct and fast.

Parallel Programming

The difficulty of parallelism is not the hardware; parallel software is the problem. It is difficult to use multiple processors to complete one task faster, and you hope to get a significant performance improvement; otherwise, simply use a faster uniprocessor, since that is easy. The difficulties are: partitioning the problem (there are too many ways to partition it), coordination, communication overhead, load balancing, and data locality.

What Is Speedup?

Let the number of cores be p, the serial run time be T_serial, and the parallel run time be T_parallel. The speedup is

    S = T_serial / T_parallel.

In the ideal (linearly scaling) case, T_parallel = T_serial / p and S = p.

What Is the Parallel Efficiency of a Program?

    E = S / p = (T_serial / T_parallel) / p = T_serial / (p * T_parallel)

An Example of Speedup and Efficiency (table on the slide)
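A minimal worked sketch of speedup and efficiency, using the add-ten-scalars plus 10x10-matrix workload that appears later in these slides (times are in units of t_add; the program itself is illustrative, not from the slides):

    #include <stdio.h>

    int main(void)
    {
        double t_serial = 110.0;                      /* 10 scalar adds + 100 matrix adds */
        int    p[]      = {10, 100};
        for (int k = 0; k < 2; k++) {
            double t_parallel = 10.0 + 100.0 / p[k];  /* scalar part stays sequential     */
            double S = t_serial / t_parallel;         /* speedup                          */
            double E = S / p[k];                      /* efficiency                       */
            printf("p = %3d:  T_parallel = %5.1f  S = %5.2f  E = %3.0f%%\n",
                   p[k], t_parallel, S, 100.0 * E);
        }
        return 0;
    }

This prints S = 5.5, E = 55% for p = 10 and S = 10, E = 10% for p = 100.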

Efficiencies of a parallel program on different problem sizes (figure on the slide)

Amdahl's Law:

    S = 1 / (F_s + F_p / P)  ->  1 / F_s   as P -> infinity

The sequential part of your program limits the speedup of your program on parallel computers.

Question: with 100 processors, how do we get a speedup of 90?

    T_old = T_parallelizable + T_sequential
    T_new = T_parallelizable / P + T_sequential
    Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable / P) = 90,  with P = 100

Solving gives F_parallelizable = 99.9%, so the sequential part must be <= 0.1% of the original time. Yes, there are such applications with plenty of parallelism.
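A small sketch of the same Amdahl's-law calculation in C (the helper function amdahl() is invented here for illustration, not something from the slides):

    #include <stdio.h>

    /* Amdahl's law: S(P) = 1 / ((1 - F_par) + F_par / P) */
    static double amdahl(double f_par, double p)
    {
        return 1.0 / ((1.0 - f_par) + f_par / p);
    }

    int main(void)
    {
        printf("F_par = 99.9%%, P = 100      ->  S = %.1f\n", amdahl(0.999, 100.0));  /* about 91   */
        printf("F_par = 99.0%%, P = 100      ->  S = %.1f\n", amdahl(0.990, 100.0));  /* about 50   */
        printf("F_par = 99.9%%, P very large ->  S = %.1f\n", amdahl(0.999, 1e12));   /* about 1000 = 1/F_s */
        return 0;
    }

The first line confirms the slide's answer: a 99.9% parallelizable program reaches roughly a 90x speedup on 100 processors, while even a 1% sequential fraction caps the speedup near 50x.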

Another Example of Amdahl's Law

Example workload: add 10 scalars, then compute the sum of two 10x10 matrices (100 element additions). Assume the scalar additions cannot benefit from parallelism, but the matrix sum can. Question: what is the speedup on 10 and on 100 processors?

    Single processor:  Time = (10 + 100) t_add = 110 t_add
    10 processors:     Time = 10 t_add + 100/10 t_add  = 20 t_add;  Speedup = 110/20 = 5.5  (efficiency = 55%)
    100 processors:    Time = 10 t_add + 100/100 t_add = 11 t_add;  Speedup = 110/11 = 10   (efficiency = 10%)

Strong Scaling vs. Weak Scaling

Strong scaling: the problem size is fixed, as in the example above. Weak scaling: the problem size grows in proportion to the number of processors.

    10 processors, 10x10 matrix (the original size, 10 elements per processor):
        Time = 10 t_add + 100/10 t_add = 20 t_add
    100 processors, 1000-element matrix (still 10 elements per processor; roughly a 32x32 matrix, since sqrt(1000) is about 31.6):
        Time = 10 t_add + 1000/100 t_add = 20 t_add

The execution time stays constant in this example. Most often, people solve bigger problems on bigger computers.
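A short sketch that evaluates both scaling regimes from this slide (the time model time(n, p) = 10 + n/p, in units of t_add, is the one used above; the code itself is illustrative):

    #include <stdio.h>

    /* serial scalar part (10 adds) plus the matrix part divided over p processors */
    static double time_model(double n, double p) { return 10.0 + n / p; }

    int main(void)
    {
        /* strong scaling: problem size fixed at 100 elements */
        printf("strong: p=10 time=%.0f   p=100 time=%.0f\n",
               time_model(100, 10), time_model(100, 100));    /* 20 and 11 t_add */

        /* weak scaling: 10 elements per processor */
        printf("weak:   p=10 time=%.0f   p=100 time=%.0f\n",
               time_model(100, 10), time_model(1000, 100));   /* 20 and 20 t_add */
        return 0;
    }

Under weak scaling the execution time stays at 20 t_add, matching the constant-time observation above.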

Load Balancing

In the previous examples, we assumed that the workload was perfectly balanced. Example: a 100x100 matrix sum (10,000 additions) plus 10 scalar additions on 100 processors. With a perfect balance,

    Time = 10 t + 100*100/100 t = 110 t,  so  Speedup = 10010/110 = 91x on 100 processors.

However, if one processor gets 5% of the workload (5% of 10000 t = 500 t) while the other 99 processors share the remaining 95% (9500 t),

    Time = max(500 t, 9500 t / 99) + 10 t = 510 t,  so  Speedup = 10010/510 = only about 20x on 100 processors.

Flynn's Taxonomy

    SISD: single instruction stream, single data stream
    SIMD: single instruction stream, multiple data streams
    MISD: multiple instruction streams, single data stream
    MIMD: multiple instruction streams, multiple data streams

SIMD

Parallelism is achieved by dividing data among the compute units; the same instruction is applied to multiple data items. This is also called data parallelism.

SIMD Example (figure: one control unit broadcasting to n ALUs, one per data item x[1] ... x[n])

    for (i = 0; i < n; i++)
        x[i] += val;
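The same loop can be written so that one instruction really does operate on several data items at once. Below is a minimal sketch using AVX intrinsics (this assumes an x86 CPU with AVX and single-precision data; the slides themselves do not show intrinsics):

    #include <immintrin.h>

    void add_val(float *x, int n, float val)
    {
        __m256 v = _mm256_set1_ps(val);               /* broadcast val into 8 SP lanes      */
        int i = 0;
        for (; i + 8 <= n; i += 8) {                  /* one add instruction, 8 data items  */
            __m256 xv = _mm256_loadu_ps(&x[i]);
            _mm256_storeu_ps(&x[i], _mm256_add_ps(xv, v));
        }
        for (; i < n; i++)                            /* scalar remainder when n % 8 != 0   */
            x[i] += val;
    }

Compilers will often produce similar code automatically when the plain loop above is compiled with vectorization enabled (e.g., -O3 -mavx).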

SIMD: what if we don't have as many ALUs as data items? Divide the work and process it iteratively. Example: m = 4 ALUs and n = 15 data items.

    Round   ALU 1    ALU 2    ALU 3    ALU 4
    1       x[0]     x[1]     x[2]     x[3]
    2       x[4]     x[5]     x[6]     x[7]
    3       x[8]     x[9]     x[10]    x[11]
    4       x[12]    x[13]    x[14]    (idle)

SIMD Drawbacks

All ALUs must execute the same instruction or remain idle, and they must operate synchronously. SIMD is efficient for large data-parallel problems, but not for other, more complex kinds of parallel problems.

Hardware Multithreading

Hardware multithreading concerns a single core (and its instruction-level parallelism), in contrast to MIMD, which creates n threads running on n processors in parallel. Its purpose is to increase resource utilization on a single core by performing multiple threads of execution in parallel; the core has replicated registers and program counters and supports fast switching between threads. There are three versions:

Fine-grained multithreading: switch threads after each cycle, interleaving instruction execution (normally round-robin). If one thread stalls, the others are executed. Con: an individual thread is delayed by the other threads' instructions.

Coarse-grained multithreading: switch only on a long pipeline stall (e.g., an L2-cache miss). This simplifies the hardware but does not hide short stalls (e.g., data hazards).

Simultaneous Multithreading (SMT): in a modern multiple-issue, dynamically scheduled processor, instructions from multiple threads can be scheduled together, with no thread switching on every cycle. Instructions from independent threads execute whenever functional units are available; within a thread, dependencies are handled by scheduling and register renaming. Example: Intel Pentium 4 HT runs two threads with duplicated registers but shared functional units and caches.

A Hardware Multithreading Example (figure on the slide)

MIMD

MIMD supports multiple simultaneous instruction streams operating on multiple data streams. An MIMD system typically consists of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU.
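As a concrete sketch of MIMD-style execution on a multicore (POSIX threads assumed; the task functions are invented for illustration), two threads run different instruction streams on different data at the same time. Compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    static void *sum_task(void *arg)            /* instruction stream 1: numeric work */
    {
        long n = *(long *)arg, s = 0;
        for (long i = 1; i <= n; i++) s += i;
        printf("sum thread:   %ld\n", s);
        return NULL;
    }

    static void *count_task(void *arg)          /* instruction stream 2: string work  */
    {
        const char *s = arg;
        int n = 0;
        while (s[n]) n++;
        printf("count thread: %d characters\n", n);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        long n = 1000;
        pthread_create(&t1, NULL, sum_task,   &n);
        pthread_create(&t2, NULL, count_task, "hello MIMD");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }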

Shared Memory System

A collection of autonomous processors is connected to a memory system via an interconnection network. Each processor can access each memory location, and the processors usually communicate implicitly by accessing shared data structures. (Figure 2.3 shows a shared memory system.)
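A minimal shared-memory sketch (assuming OpenMP, which the slides do not name explicitly): every thread can read and write the shared array a[], so the threads communicate implicitly through shared data, exactly as described above. Compile with -fopenmp:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        enum { N = 1000000 };
        static double a[N];                 /* shared: visible to every thread       */
        double sum = 0.0;

        #pragma omp parallel for            /* each thread fills part of the array   */
        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];                    /* implicit communication via shared a[] */

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }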

UMA Multicore System

The time to access any memory location is the same for all of the cores (Figure 2.5).

NUMA Multicore System

A memory location that a core is directly connected to can be accessed faster than a memory location that must be reached through another chip (Figure 2.6).
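On a NUMA node, where memory is placed matters. Below is a hedged sketch using the Linux libnuma API (numa_available(), numa_alloc_onnode(), numa_max_node(), and numa_free() are real libnuma calls, but whether explicit placement helps depends on the machine; the slides do not prescribe this). Link with -lnuma:

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {                 /* not a NUMA system or no libnuma */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }
        size_t bytes = 1 << 20;
        /* allocate the buffer in memory local to NUMA node 0 */
        double *buf = numa_alloc_onnode(bytes, 0);
        if (buf == NULL) return 1;

        printf("highest NUMA node: %d\n", numa_max_node());
        /* ... threads pinned near node 0 would see local (faster) accesses to buf ... */
        numa_free(buf, bytes);
        return 0;
    }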

Distributed Memory System

Clusters are the most popular distributed memory systems: a collection of commodity systems connected by a commodity interconnection network. The nodes of a cluster are individual computation units joined by a communication network. (Figure 2.4 shows a distributed memory system.)
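On a distributed-memory cluster, processes on different nodes have no shared address space and must communicate explicitly. Here is a minimal sketch assuming MPI (the slides do not prescribe MPI at this point); compile with mpicc and run with mpirun -np 2:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double x = 3.14;
        if (rank == 0 && size > 1) {
            /* explicit communication: send one value over the network to rank 1 */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %.2f from rank 0\n", x);
        }
        MPI_Finalize();
        return 0;
    }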


Intel CPUs

In Intel's tick-tock model, each "tick" moves to an improved process technology (a smaller nm feature size, and thus lower power and a faster clock), while each "tock" introduces new features and improved architectural performance.

How to Determine a Computer's Peak Performance?

Table III: theoretical per-cycle peak for Haswell (AVX 2.0)

    Instruction mix       flop/operation   operations/instruction   instructions/cycle   = flop/cycle
    SSE (scalar)          1                1                        2                    2
    SSE (DP)              1                2                        2                    4
    SSE (SP)              1                4                        2                    8
    AVX+FMA (scalar)      2                1                        2                    4
    AVX-128+FMA (DP)      2                2                        2                    8
    AVX-128+FMA (SP)      2                4                        2                    16
    AVX+FMA (DP)          2                4                        2                    16
    AVX+FMA (SP)          2                8                        2                    32

Table IV: theoretical per-node peak for the Xeon E5-2695 v3 (Haswell-EP, $2424 as of Nov 2016), with a 2.3 GHz clock rate, 14 cores per socket, and 2 sockets per node

    Instruction mix       flop/cycle   = flop/s per node
    SSE (scalar)          2            128.8 G
    SSE (DP)              4            257.6 G
    SSE (SP)              8            515.2 G
    AVX+FMA (scalar)      4            257.6 G
    AVX-128+FMA (DP)      8            515.2 G
    AVX-128+FMA (SP)      16           1030.4 G
    AVX+FMA (DP)          16           1030.4 G
    AVX+FMA (SP)          32           2060.8 G

Paper to read: http://cs.iupui.edu/~fgsong/cs590hpc/how2decide_peak.pdf
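A small sketch of the per-node peak arithmetic behind Table IV (the numbers are taken from the table; the formula is flop/cycle x clock rate x cores/socket x sockets/node):

    #include <stdio.h>

    int main(void)
    {
        double clock_ghz = 2.3;     /* E5-2695 v3 clock rate */
        int    cores     = 14;      /* cores per socket      */
        int    sockets   = 2;       /* sockets per node      */
        int    flop_per_cycle[] = { 2, 4, 8, 4, 8, 16, 16, 32 };
        const char *mix[] = { "SSE (scalar)", "SSE (DP)", "SSE (SP)", "AVX+FMA (scalar)",
                              "AVX-128+FMA (DP)", "AVX-128+FMA (SP)", "AVX+FMA (DP)", "AVX+FMA (SP)" };

        for (int i = 0; i < 8; i++) {
            double gflops = flop_per_cycle[i] * clock_ghz * cores * sockets;
            printf("%-18s %7.1f Gflop/s per node\n", mix[i], gflops);
        }
        return 0;
    }

The last line reproduces the 2060.8 Gflop/s single-precision peak: 32 flop/cycle x 2.3 GHz x 14 cores x 2 sockets.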

Intel Xeon vs. Intel Xeon Phi vs. NVIDIA GPUs vs. IBM POWER: see the benchmark comparison at https://www.xcelerit.com/computing-benchmarks/processors/