Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

1 Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

2 Introduction. Goal: connect multiple computers to get higher performance. Multiprocessors offer scalability, availability, and power efficiency. Job-level (process-level) parallelism gives high throughput for independent jobs. A parallel processing program is a single program run on multiple processors. Multicore microprocessors are chips with multiple processors (cores).

3 Questions to Address 1. How do the parallel processors share data? 2. How do the parallel processors coordinate their computing schedules? 3. How many processors should be used? 4. What is the minimum speedup S(N) acceptable for N processors? What are the factors that drive this decision?

4 Question: How to Get Great Computing Power? There are two obvious options. 1. Build a single, very powerful CPU. 2. Construct a computer from multiple cooperating processing units. The early choice was for computing systems with only a few (1 to 16) processing units, a choice based on what appeared to be very solid theoretical grounds.

5 Linear Speed-Up The cost of a parallel processing system with N processors is about N times the cost of a single processor; the cost scales linearly. The goal is to get N times the performance of a single processor system for an N-processor system. This is linear speedup. For linear speedup, the cost per unit of computing power is approximately constant.
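
To pin down the terms used on the next few slides: write T(1) for the run time on one processor and T(N) for the run time on N processors. The speedup factor is then S(N) = T(1) / T(N), and linear speedup means S(N) = N. Cost efficiency is S(N) / N, which equals 1 under linear speedup. For example, if one processor takes 100 seconds and 8 processors take 12.5 seconds, then S(8) = 100 / 12.5 = 8: linear speedup.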

6 The Cray-1

7 Supercomputers vs. Multiprocessor Clusters. "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" (Seymour Cray) Here are two opinions from a 1984 article. "The speedup factor of using an n-processor system over a uniprocessor system has been theoretically estimated to be within the range (log2 n, n / log2 n)." For n = 1024, that range runs from 10 up to about 102. "By the late 1980s, we may expect systems of 8 to 16 processors. Unless the technology changes drastically, we will not anticipate massive multiprocessor systems until the 90s." The drastic technology change is called VLSI.

8 The Speedup Factor: S(N)

9 Cost Efficiency: S(N) / N

10 Harold Stone on Linear Speedup Harold Stone wrote in 1990 on what he called peak performance. When a multiprocessor is operating at peak performance, 1. All processors are engaged in useful work. 2. No processor is idle, and no processor is executing an instruction that would not be executed if the same algorithm were executing on a single processor. 3. In this state of peak performance, all N processors are contributing to effective performance, and the processing rate is increased by a factor of N. 4. Peak performance is a very special state that is rarely achievable.

11 The Problem with the Early Theory The early work focused on the problem of general computation. Not all problems can be solved by an algorithm that can be mapped onto a set of parallel processors. However, many very important problems can be solved by parallel algorithms.

12 Hardware and Software. Hardware may be serial (e.g., the Pentium 4) or parallel (e.g., the quad-core Xeon e5345). Software may be sequential (e.g., matrix multiplication) or concurrent (e.g., an operating system). Sequential or concurrent software can run on serial or parallel hardware; the challenge is making effective use of parallel hardware.

13 Cooperation Among Processes. Parallel execution on a multi-core CPU is not inherently a difficult problem; the problems arise when the processes need to cooperate. Example of the easy case: a quad-core running 4 independent programs that do not communicate. One measure of the complexity of parallel execution is the amount of communication required among the processes: more communication means more complexity.

14 Parallel Programming. Parallel software is the problem. We need to get a significant performance improvement; otherwise, just use a faster uniprocessor, since it's easier! The difficulties: partitioning, coordination, and communications overhead.

15 Amdahl's Law. The sequential part of a program can limit speedup. Example: 100 processors; can we get a speedup of 90? T_new = T_parallelizable / 100 + T_sequential. Speedup = 1 / ((1 - F_parallelizable) + F_parallelizable / 100) = 90. Solving gives F_parallelizable = 0.999: the sequential part must be only 0.1% of the original time.
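
A minimal sketch in C that evaluates the speedup formula on this slide (the helper name and the second data point are illustrative additions):

    #include <stdio.h>

    /* Amdahl's Law: speedup on n processors when a fraction f of the
       original run time is parallelizable. */
    static double amdahl_speedup(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        /* The slide's example: with f = 0.999 (sequential part 0.1%),
           100 processors give roughly a 90-fold speedup. */
        printf("f = 0.999, N = 100: speedup = %.1f\n", amdahl_speedup(0.999, 100));
        /* Even a 1% sequential part cuts 100 processors to about 50-fold. */
        printf("f = 0.990, N = 100: speedup = %.1f\n", amdahl_speedup(0.990, 100));
        return 0;
    }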

16 Some Results Due to Amdahl's Law

17 Characterizing Problems. One result of Amdahl's Law is that only problems with very small necessarily sequential parts can benefit from massively parallel processing. Fortunately, there are many such problems: 1. Weather forecasting. 2. Nuclear weapons simulation. 3. Protein folding and related issues in drug design.

18 Scaling Example. Workload: the sum of 10 scalars plus a 10 × 10 matrix sum (100 additions). Speedup from 10 to 100 processors? Single processor: Time = (10 + 100) t_add = 110 t_add. 10 processors: Time = 10 t_add + (100/10) t_add = 20 t_add; Speedup = 110/20 = 5.5 (55% of potential). 100 processors: Time = 10 t_add + (100/100) t_add = 11 t_add; Speedup = 110/11 = 10 (10% of potential). This assumes the load can be balanced across the processors.

19 Scaling Example (cont.). What if the matrix is 100 × 100 (10,000 additions)? Single processor: Time = (10 + 10000) t_add = 10010 t_add. 10 processors: Time = 10 t_add + (10000/10) t_add = 1010 t_add; Speedup = 10010/1010 = 9.9 (99% of potential). 100 processors: Time = 10 t_add + (10000/100) t_add = 110 t_add; Speedup = 10010/110 = 91 (91% of potential). Again assuming the load is balanced.

20 Strong vs. Weak Scaling. Strong scaling: the problem size is fixed, as in the example above. Weak scaling: the problem size grows in proportion to the number of processors. 10 processors, 10 × 10 matrix: Time = 20 t_add. 100 processors, 32 × 32 matrix (about 1000 additions): Time = 10 t_add + (1000/100) t_add = 20 t_add. Constant performance in this example.
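
The arithmetic on these three slides follows one simple time model; here is a small C sketch of it (the function name is invented, and perfect load balance is assumed, as on the slides):

    #include <stdio.h>

    /* Time model from the slides: 10 sequential scalar adds, plus a
       matrix sum of 'elements' additions split across n processors. */
    static double time_units(int elements, int n) {
        return 10.0 + (double)elements / n;
    }

    int main(void) {
        /* Strong scaling: fixed 10 x 10 matrix (100 elements). */
        printf("10 procs:  speedup = %.1f\n", time_units(100, 1) / time_units(100, 10));
        printf("100 procs: speedup = %.1f\n", time_units(100, 1) / time_units(100, 100));
        /* Weak scaling: 10 matrix elements per processor, constant time. */
        printf("10 procs, 100 elements:   %.0f t_add\n", time_units(100, 10));
        printf("100 procs, 1000 elements: %.0f t_add\n", time_units(1000, 100));
        return 0;
    }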

21 Synchronization. Two processors share an area of memory: P1 writes, then P2 reads. There is a data race if P1 and P2 don't synchronize, since the result depends on the order of the accesses. Hardware support is required: an atomic read/write memory operation, with no other access to the location allowed between the read and the write. It could be a single instruction, e.g., an atomic swap of register and memory, or an atomic pair of instructions.
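
In C11, the atomic swap described here is exposed as atomic_exchange; below is a minimal sketch (my own illustration, not from the lecture) of a test-and-set spin lock built on it:

    #include <stdatomic.h>

    static atomic_int lock = 0;   /* 0 = free, 1 = held */

    void acquire(void) {
        /* Atomic swap: write 1 and get the old value back, with no other
           access to the location between the read and the write. */
        while (atomic_exchange(&lock, 1) == 1)
            ;   /* old value was 1: someone else holds the lock, so retry */
    }

    void release(void) {
        atomic_store(&lock, 0);
    }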

22 The Necessity for Synchronization. "In a multiprocessing system, it is essential to have a way in which two or more processors working on a common task can each execute programs without corrupting the other's subtasks. Synchronization, an operation that guarantees an orderly access to shared memory, must be implemented for a properly functioning multiprocessing system." (Chun & Latif, MIPS Technologies Inc.)

23 Synchronization in Uniprocessors. The synchronization issue posits two processes sharing an area of memory. The processes can be on different processors, or on a single shared processor. Most issues in operating system design are best imagined in the context of multiple processors, even if there is only one processor being time-shared.

24 The Lost Update Problem Here is a synchronization problem straight out of database theory. Two travel agents book a flight with one seat remaining. A1 reads seat count. One remaining. A2 reads seat count. One remaining. A1 books the seat. Now there are no more seats. A2, working with old data, also books the seat. Now we have at least one unhappy customer.

25 I've Got It; You Can't Have It. What is needed is a way to put a lock on the seat count until one of the travel agents completes the booking; the other agent must then begin with the new seat count. Database engines use record locking as one way to prevent lost updates. Another database technique is the atomic transaction, here a 2-step (read, then update) transaction.

26 Atomic Transactions. We do not mean the type of "atomic" pictured on the original slide. An atomic read-and-modify must proceed without any interruption: no other process can access the shared memory location between the read and the write back to that location.
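
A sketch of the travel-agent scenario in C11 atomics (the function name and one-seat starting count are illustrative): compare-and-swap makes the read-check-decrement an atomic transaction, so an agent working from stale data retries instead of producing a lost update:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int seats_left = 1;   /* one seat remaining */

    /* Returns true if this agent booked the seat. */
    bool book_seat(void) {
        int observed = atomic_load(&seats_left);
        while (observed > 0) {
            /* Succeeds only if seats_left still equals 'observed'; on
               failure, 'observed' is refreshed and the loop retries. */
            if (atomic_compare_exchange_weak(&seats_left, &observed, observed - 1))
                return true;
        }
        return false;   /* sold out: no double booking */
    }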

27 Synchronization in MIPS. Load linked: ll rt, offset(rs). Store conditional: sc rt, offset(rs). The sc succeeds if the location has not changed since the ll, returning 1 in rt; it fails, returning 0 in rt, if the location has changed. Example: atomic swap, used to test and set a lock variable:

    try: add $t0, $zero, $s4    ; copy the exchange value
         ll  $t1, 0($s1)        ; load linked: read the shared word
         sc  $t0, 0($s1)        ; store conditional: try to write the new value
         beq $t0, $zero, try    ; if the store failed, try again
         add $s4, $zero, $t1    ; put the loaded (old) value in $s4

28 Details on LL and SC. These commands work with the cache memory system at the cache-line level. Each cache line has an LL bit, which is set by the Load Linked command. The LL bit is cleared if another process writes to that specific cache line. The SC command succeeds only if the LL bit remains set; otherwise it fails.

29 Multithreading. Performing multiple threads of execution in parallel: replicate the registers, PC, etc., and switch quickly between threads. Fine-grain multithreading switches threads after each cycle, interleaving instruction execution; if one thread stalls, the others are executed. Coarse-grain multithreading switches only on a long stall (e.g., an L2-cache miss); this simplifies the hardware, but doesn't hide short stalls (e.g., data hazards).

30 Simultaneous Multithreading. In a multiple-issue, dynamically scheduled processor, schedule instructions from multiple threads: instructions from independent threads execute whenever function units are available, while dependencies within a thread are handled by scheduling and register renaming. Example: the Intel Pentium 4 with Hyper-Threading runs two threads with duplicated registers but shared function units and caches.

31 Multithreading Example

32 Future of Multithreading. Will it survive, and in what form? Power considerations favor simplified microarchitectures, and so simpler forms of multithreading. For tolerating cache-miss latency, a thread switch may be the most effective approach. Multiple simple cores might share resources more effectively.

33 Instruction and Data Streams. An alternate classification, by instruction and data streams:

                              Instruction Streams
                              Single                     Multiple
    Data Streams   Single     SISD: Intel Pentium 4      MISD: no examples today
                   Multiple   SIMD: SSE instructions     MIMD: Intel Xeon e5345
                                     of x86

SPMD (Single Program, Multiple Data): a parallel program on a MIMD computer, using conditional code to give different processors different roles; see the sketch below.
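
A minimal SPMD sketch using MPI (assuming an MPI installation; the coordinator/worker split is invented for illustration). Every processor runs the same program, and conditional code on the rank selects its role:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which processor am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many are there? */

        if (rank == 0)
            printf("coordinator: distributing work to %d workers\n", nprocs - 1);
        else
            printf("worker %d: processing my share of the data\n", rank);

        MPI_Finalize();
        return 0;
    }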

34 SIMD. Operate elementwise on vectors of data, e.g., the MMX and SSE instructions in x86, which pack multiple data elements into 128-bit-wide registers. All processors execute the same instruction at the same time, each with a different data address, etc. This simplifies synchronization and reduces instruction-control hardware. SIMD works best for highly data-parallel applications.
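
As a concrete taste of the SSE instructions named above, a C sketch using x86 intrinsics (it assumes n is a multiple of 4; the function itself is my own illustration):

    #include <immintrin.h>

    /* Elementwise c[i] = a[i] + b[i], four floats per SIMD instruction. */
    void add_vec(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one vector add */
        }
    }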

35 Vector Processors. Highly pipelined function units stream data between vector registers and the units: data is collected from memory into registers, and results are stored from registers back to memory. Example: a vector extension to MIPS with 32 × 64-element registers (64-bit elements) and vector instructions lv and sv (load/store vector), addv.d (add two vectors of double), and addvs.d (add a scalar to each element of a vector of double). This significantly reduces instruction-fetch bandwidth.
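
For contrast, here is the scalar loop that a short vector sequence (lv, addvs.d, sv per 64-element chunk) would replace; the C rendering is my own illustration:

    /* y[i] = y[i] + s over a vector of doubles. A scalar pipeline fetches a
       load, an add, a store, and loop-control instructions for every element;
       a vector machine fetches one lv / addvs.d / sv group per 64 elements. */
    void add_scalar_to_vector(double *y, double s, int n) {
        for (int i = 0; i < n; i++)
            y[i] = y[i] + s;
    }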

37 Vector vs. Scalar. Vector architectures and compilers simplify data-parallel programming: the explicit statement that there are no loop-carried dependences means reduced checking in hardware; regular access patterns benefit from interleaved and burst memory; and avoiding loops avoids control hazards. Vector instructions are more general than ad-hoc media extensions (such as MMX and SSE) and a better match for compiler technology.
