Chap. 2 Part 1. CIS*3090 Parallel Programming, Fall 2016


Provocative question (p30)
- How much do we need to know about the HW to write good parallel programs?
- This chapter gives the HW background that will allow us to understand the features of modern processor architectures, and to be aware of what our parallel programs need to take into account in order to achieve speedup.
- My theme: do we want to get good results by accident or by intention?

Authors' strategy
- First, go down to the HW architectural details using six representative case studies.
- Then, go back up to create abstractions that preserve the most important characteristics while hiding the details; this yields two high-level models of processor architecture.
- Armed with a good model, we can reason about parallel SW, especially parallel algorithms.

Learn Flynn's Taxonomy
- SISD = the traditional sequential computer.
- SIMD has two uses: at the instruction level, it refers to a single instruction that operates on multiple variables/array elements; at the program level it becomes SPMD (p49), which is how MPI operates on the SHARCNET cluster (see the sketch below).
- MIMD (or MPMD) = multiple cores/hosts running individual threads/programs.
- MISD is rare; it appears in HW redundancy (the Space Shuttle computers).
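
A minimal SPMD sketch in C using MPI: every process executes the same program, and rank tests select different behaviour per process. (Assumes an MPI installation; compile with mpicc and launch with mpirun.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);                /* every process runs this same program... */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* ...but each learns its own identity */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            printf("master of %d processes\n", size);  /* single program, multiple data paths */
        else
            printf("worker %d reporting\n", rank);

        MPI_Finalize();
        return 0;
    }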

Six architectures
- Grouped as chip multiprocessors, SMPs, heterogeneous designs, clusters, and supercomputers.
- What's the difference between a chip multiprocessor (e.g. Intel Core Duo) and an SMP? Between homogeneous and heterogeneous designs?
- We pick out some highlights below.

MESI cache coherency protocol (p32)
- Great online simulation: https://www.cs.tcd.ie/jeremy.jones/vivio/caches/mesihelp.htm
- What is its purpose? How does it work? What's snooping?
- The simulator shows how bus congestion can result, and why SMP isn't scalable.
- For SMP, how can bus congestion be reduced? (One source of coherence traffic is illustrated below.)
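
One place MESI traffic bites programmers is false sharing: two threads updating distinct variables that happen to share a cache line force that line to ping-pong between the cores' caches. A minimal pthreads sketch in C; the 64-byte line size and the struct layout are illustrative assumptions.

    #include <pthread.h>
    #include <stdio.h>

    /* Two counters: without 'pad' they would likely share a cache line,
       so each write invalidates the other core's copy (MESI ping-pong). */
    struct counters {
        volatile long a;
        char pad[64];        /* push 'b' onto its own (assumed 64-byte) line */
        volatile long b;
    };

    static struct counters c;

    static void *bump_a(void *arg) {
        for (long i = 0; i < 100000000; i++) c.a++;
        return NULL;
    }
    static void *bump_b(void *arg) {
        for (long i = 0; i < 100000000; i++) c.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);  /* try deleting 'pad' and re-timing */
        return 0;
    }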

Figure 2.3

HW vs. SW multithreading (p35)
- SW: the OS (or a user-level scheduler) loads the single set of registers from the thread's saved context in RAM, upon interrupts and every time it decides to reschedule.
- HW, a.k.a. hyper-threading (Intel): the OS loads two sets of registers from two runnable threads, and the processor switches back and forth between them whenever the executing thread pauses (even just for a memory access).

Is hyper-threading a gimmick?
- No: it gives far faster context switching, effectively running 2 threads simultaneously (though with a single core).
- The OS sees 2 virtual processors.
- It's cheaper than 2 full-fledged cores and puts less load on the memory bus.
- It still overlaps computation with memory bus transactions, so it increases true concurrency.
- The technique is used in Intel Nehalem (our MacPro Core i7), which looks like 16 cores to the OS and SW.
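
You can observe the "virtual processors" claim from C: on Linux and macOS, sysconf reports the number of online logical processors, which on a hyper-threaded machine counts each HW thread rather than each physical core. (A small sketch; _SC_NPROCESSORS_ONLN is widely supported but not required by POSIX.)

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Logical processors the OS can schedule on; with hyper-threading
           enabled this is twice the number of physical cores. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("OS sees %ld logical processors\n", n);
        return 0;
    }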

Crossbar switch (Sun Fire E25K) (Fig 2.5, pp. 36-37)
- Directly connects any pair of the 18 origins and 18 destinations by opening/closing switchpoints.
- Since it doesn't require routing over a common bus, and multiple connections are possible at once, it's the fastest (and most expensive in HW) kind of interconnect.
- Not scalable, due to its n^2 HW resources ("18x18 is pushing the limit"); see the toy model below.
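
A toy model in C of why crossbars don't scale (all names are illustrative): the switchpoints form an n x n matrix, so HW cost grows as n^2, while any set of connections using distinct origins and distinct destinations can be closed concurrently.

    #include <stdio.h>

    #define N 18                  /* E25K scale: 18 x 18 = 324 switchpoints */

    static int sw[N][N];          /* sw[i][j] closed => origin i wired to destination j */
    static int out_busy[N], in_busy[N];

    /* Close a switchpoint unless the origin or destination is already in use;
       conflict-free pairs communicate simultaneously, with no shared bus. */
    static int crossbar_connect(int origin, int dest)
    {
        if (out_busy[origin] || in_busy[dest])
            return 0;
        sw[origin][dest] = out_busy[origin] = in_busy[dest] = 1;
        return 1;
    }

    int main(void)
    {
        int made = 0;
        for (int i = 0; i < N; i++)
            made += crossbar_connect(i, (i + 1) % N);
        printf("%d simultaneous connections via %d switchpoints (n^2)\n",
               made, N * N);
        return 0;
    }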

Figure 2.5 Crossbar switch connecting four nodes. Notice the output and input channels; crossing wires do not connect unless a connection is shown. Each pair of nodes is directly connected by setting one of the open circles.

Shocking feature of Cell BE (p38)
- Each SPE has only 256 KB of local RAM, which must hold its program, stack, heap (if any), and data.
- Its solution to memory bus congestion: DON'T share memory; instead give each SPE a local memory and explicitly move data under program control.
- Consequently: don't use recursion, cut down on library use, and get the most mileage out of the code by using SIMD (vector) instructions (illustrated below).
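
The SPEs have their own vector instruction set; as a portable stand-in, here is the same idea in C using x86 SSE intrinsics: one instruction operates on four floats at once. (A sketch; requires an SSE-capable x86 compiler.)

    #include <xmmintrin.h>   /* SSE intrinsics, standing in for SPE vector insts */
    #include <stdio.h>

    int main(void)
    {
        /* Four-wide single-precision vectors: one add processes 4 elements.
           Note _mm_set_ps lists elements from highest lane to lowest. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);   /* SIMD: single instruction, multiple data */

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
        return 0;
    }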

Figure 2.6 Architecture of the Cell processor. The architecture is designed to move data: the high-speed I/O controllers have a capacity of 76.8 GB/s; each of the two channels to RAM runs at 12.8 GB/s; the EIB is theoretically capable of 204.8 GB/s.

Cell BE pros and cons
- Strength: several levels of parallelism. Lowest: SIMD instructions; then multiple SPEs (8 or 16); then multiple PPE cores/threads.
- Weakness: PPE/SPE communication is very complex.
- A grad student created a solution based on the Pilot process/channel approach, CellPilot, which is also good for heterogeneous clusters. (A Pilot-style sketch follows.)
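
For flavour, a minimal sketch of the process/channel style in C using the Pilot API (PI_Configure, PI_CreateProcess, PI_CreateChannel, PI_StartAll, PI_Read/PI_Write, PI_StopMain); treat the exact calls as an assumption and check the Pilot documentation. The master sends a worker an int over one channel and reads the square back over another.

    #include <pilot.h>
    #include <stdio.h>

    PI_PROCESS *worker;
    PI_CHANNEL *to_worker, *from_worker;

    /* Worker process: read an int, send back its square. */
    int work(int index, void *arg)
    {
        int n;
        PI_Read(to_worker, "%d", &n);
        PI_Write(from_worker, "%d", n * n);
        return 0;
    }

    int main(int argc, char *argv[])
    {
        PI_Configure(&argc, &argv);        /* declare processes/channels before StartAll */
        worker      = PI_CreateProcess(work, 0, NULL);
        to_worker   = PI_CreateChannel(PI_MAIN, worker);
        from_worker = PI_CreateChannel(worker, PI_MAIN);
        PI_StartAll();                     /* worker runs work(); master continues here */

        int result;
        PI_Write(to_worker, "%d", 7);
        PI_Read(from_worker, "%d", &result);
        printf("7 squared = %d\n", result);

        PI_StopMain(0);
        return 0;
    }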

SHARCNET's specialty clusters
- Nodes furnished with heterogeneous processors: GPUs, FPGAs, and (new) Intel Xeon Phi; SHARCNET gave up on the Cell BE.
- What's an FPGA? What use is it alongside a general-purpose CPU?
- All of these fall into the category of general-purpose CPU + accelerator.

BlueGene's 3 networks (Fig 2.8)
- How many processors in total?
- The torus network is for general-purpose message passing among processors.
- The collective network is optimized for reduction operations (see the sketch below).
- The barrier network is for efficient synchronization.
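
A reduction combines a value from every process into one result; this is the communication pattern BlueGene's collective network accelerates in hardware. In MPI (C) the whole operation is one call:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process contributes one value; the sum arrives at rank 0.
           On BlueGene this call can map onto the dedicated collective network. */
        int mine = rank + 1, total = 0;
        MPI_Reduce(&mine, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of 1..%d = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }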

Figure 2.8 BlueGene/L communication networks: (a) 3D torus for standard interprocessor data transfer; (b) collective network for fast evaluation of reductions.

Summary of HW insights
- The ideal of shared memory plus a large number of processors does not exist! It is too hard to keep the memories coherent without blowing up memory reference time.
- Designs therefore strike different balances and tradeoffs:
- Intel's concept chip, the Single-Chip Cloud Computer (SCC): 48 cores and 64 GB of (shareable) RAM, but no cache coherency.
- Renesas IMAPCAR2: 128 cores executing SIMD instructions, or reconfigurable as 32 independent PEs without cache coherency (an automotive "AutoPilot" chip).