Multicore Programming

Size: px
Start display at page:

Download "Multicore Programming"

Transcription

1 Multi Programming Parallel Hardware and Performance 8 Nov 00 (Part ) Peter Sewell Jaroslav Ševčík Tim Harris

2 Merge sort 6MB input (-bit integers) Recurse(left) ~98% execution time Recurse(right) Merge to scratch array ~% execution time Copy back to input array

3 Merge sort, dual 6MB input (-bit integers) Recurse(left) Recurse(right) Merge to scratch array Copy back to input array

4 Throughput per / elements per ms T700 dual- laptop Wall-clock execution time / ms

5 Throughput per / elements per ms AMD Phenom Wall-clock execution time / ms

6 The power wall Moore s law: area of transistor is about 0% smaller each generation A given chip area accommodates twice as many transistors P dyn = α f V Shrinking process technology (constant field scaling) allows reducing V to partially counteract increasing f V cannot be reduced arbitrarily Halving V more than halves max f (for a given transistor) Physical limits P leak = V(I sub + I ox ) Reduce I sub (sub-threshold leakage): turn off component, increase threshold volatage (reduces max f) Reduce I ox (gate-oxide leakage): increase oxide thickness (but it needs to decrease with process scale)

7 Operations / µs The memory wall 0000 CPU MHz CPU clock, 00ns to access memory in 98 Memory

8 Operations / µs The memory wall CPU Cycles per memory access: Memory

9 The ILP wall ILP = Instruction level parallelism Implicit parallelism between instructions in a single thread Identified by the hardware Speculate past memory accesses Speculate past control transfer Diminishing returns

10 Power wall + ILP wall + memory wall = brick wall Power wall means we can t just clock processors faster any longer Memory wall means that many workload s perf is dominated by memory access times ILP wall means we can t find extra work to keep functional units busy while waiting for memory accesses

11 Why parallelism? Amdahl s law Why asymmetric performance? Parallel algorithms Multi-processing hardware

12 Amdahl s law Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi- hardware. On a machine with n s, how many s do you need to use to get a x speed-up on the overall algorithm?

13 Speedup Amdahl s law, f=70% Desired x speedup Speedup achieved (perfect scaling on 70%) #s

14 Amdahl s law, f=70% MaxSpeedup(f, c) = f + f c f = fraction of code speedup applies to c = number of s used

15 Speedup Amdahl s law, f=70% Desired x speedup Speedup achieved (perfect scaling on 70%) Limit as c = /(-f) = #s

16 Speedup Amdahl s law, f=0% Speedup achieved with perfect scaling Amdahl s law limit, just.x #s

17 Speedup Amdahl s law, f=98% #s

18 Amdahl s law & multi- Suppose that the same h/w budget (space or power) can make us:

19 Core perf (relative to big Perf of big & small s..0 Assumption: perf = α resource Total perf: 6 * / = Total perf: * = /6 /8 / / Resources dedicated to

20 Perf (relative to big ) Amdahl s law, f=98% medium 6 small. big #Cores

21 Perf (relative to big ) Amdahl s law, f=7%..0 big medium small #Cores

22 Perf (relative to big ) Amdahl s law, f=% big medium 0. 6 small #Cores

23 Why parallelism? Amdahl s law Why asymmetric performance? Parallel algorithms Multi-processing hardware

24 Asymmetric chips

25 Perf (relative to big ) Amdahl s law, f=7% medium big 6 small #Cores

26 Perf (relative to big ) Amdahl s law, f=% big medium small #Cores

27 Perf (relative to big ) Amdahl s law, f=98%.. medium + 6 small. big #Cores

28 Speedup (relative to big ) Amdahl s law, f=98% small #Cores

29 Speedup (relative to big ) Amdahl s law, f=98% small Leave larger idle in parallel section #Cores

30 Why parallelism? Amdahl s law Why asymmetric performance? Parallel algorithms Multi-processing hardware

31 Merge sort (-)

32 Merge sort (-)

33 Merge sort (8-)

34 T (span): critical path length

35 T (work): time to run sequentially

36 T serial : optimized sequential code

37 A good multi- parallel algorithm T /T serial is low What we lose on sequential performance we must make up through parallelism Resource availability may limit the ability to do that T grows slowly with the problem size We tackle bigger problems by using more s, not by running for longer

38 Quick-sort on good input Recurse T = O(n logn) T = O(n) Sequential partition Sequential append Recurse

39 Quick-sort on good input T = O(n logn) T = O(n) Parallel partition Recurse Recurse Parallel append

40 Quick-sort on good input Recurse T = O(n log n) T = O(log n) Parallel partition Parallel append Recurse

41 Execution time Scheduling from a DAG 0 T = 0 T p T / P 8 6 T p T T = # s

42 Speedup (T p / T ) Scheduling from a DAG T p T 0 T p T / P # s

43 Speedup (T p / T ) In CILK: T p T / p + c T T p T / P Given abundant parallelism (T << T ): Work first principle Keep T close to T serial Put overhead on T to keep T low T p T c= # s

44 Speedup (T p / T ) In CILK: T p T / p + c T 0 8 Given abundant parallelism (T << T ): Work first principle Keep T close to T serial Put overhead on T to keep T low 6 T p T / P T p T c= c= # s

45 Why parallelism? Amdahl s law Why asymmetric performance? Parallel algorithms Multi-processing hardware

46 Multi-threaded h/w ALU ALU Multi-threaded single L cache (6KB) L cache (MB) Multiple threads in a workload with: Poor spatial locality Frequent memory accesses Main memory

47 Multi-threaded h/w ALU ALU Multi-threaded single L cache (6KB) L cache (MB) Multiple threads with synergistic resource needs Main memory

48 Multi- h/w common L Core Core ALU ALU ALU ALU L cache L cache L cache Main memory

49 Multi- h/w common L Read L cache L cache L cache Main memory

50 Multi- h/w common L Read L cache L cache L cache S Main memory

51 Multi- h/w common L Read L S cache L cache L cache S Main memory

52 Multi- h/w common L Read L S cache L cache L cache S Main memory

53 Multi- h/w common L Read L S cache L S cache L cache S Main memory

54 Multi- h/w common L Write L S cache L S cache L cache S Main memory

55 Multi- h/w common L Write L cache L S cache L cache S Main memory

56 Multi- h/w common L Write L cache L S cache L cache Main memory

57 Multi- h/w common L Write L cache L I cache L cache Main memory

58 Multi- h/w common L Write L M cache L I cache L cache E Main memory

59 Multi- h/w common L Write L M cache L I cache L cache E Main memory

60 Multi- h/w common L Write L M cache L cache L cache E Main memory

61 Multi- h/w common L Write L M cache L cache L cache Main memory

62 Multi- h/w common L Write L I cache L cache L M cache Main memory

63 Multi- h/w common L Write L I cache L E cache L M cache Main memory

64 Multi- h/w separate L L cache L cache L cache L cache Main memory

65 Multi- h/w additional L L cache L cache L cache L cache L cache Main memory

66 Multi-threaded multi- h/w L cache L cache L cache L cache L cache Main memory

67 SMP multiprocessor L cache L cache L cache L cache Main memory

68 NUMA multiprocessor L cache L cache L cache L cache L cache L cache L cache L cache Memory & directory Memory & directory Interconnect L cache L cache L cache L cache L cache L cache L cache L cache Memory & directory Memory & directory

69 Three kinds of parallel hardware Multi-threaded s Increase utilization of a or memory b/w Peak ops/cycle fixed Multiple s Increase ops/cycle Don t necessarily scale caches and off-chip resources proportionately Multi-processor machines Increase ops/cycle Often scale cache & memory capacities and b/w proportionately

70 Throughput per / elements per ms AMD Phenom Wall-clock execution time / ms

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Technology. Giorgio Richelli

Technology. Giorgio Richelli Technology Giorgio Richelli What Comes out of the Fab Transistor Abstractions in Logic Design In physical world Voltages, Currents Electron flow In logical world - abstraction V < V lo 0 = FALSE V > V

More information

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101 18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture

More information

Chapter 18 - Multicore Computers

Chapter 18 - Multicore Computers Chapter 18 - Multicore Computers Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 18 - Multicore Computers 1 / 28 Table of Contents I 1 2 Where to focus your study Luis Tarrataca

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS Many-Core Computing Era and New Challenges Nikos Hardavellas, EECS Moore s Law Is Alive And Well 90nm 90nm transistor (Intel, 2005) Swine Flu A/H1N1 (CDC) 65nm 2007 45nm 2010 32nm 2013 22nm 2016 16nm 2019

More information

27. Parallel Programming I

27. Parallel Programming I 771 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

27. Parallel Programming I

27. Parallel Programming I 760 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Performance Metrics. 1 cycle. 1 cycle. Computer B performs more instructions per second, thus it is the fastest for this program.

Performance Metrics. 1 cycle. 1 cycle. Computer B performs more instructions per second, thus it is the fastest for this program. Parallel Programming WS6 HOMEWORK (with solutions) Performance Metrics Basic concepts. Performance. Suppose we have two computers A and B. Computer A has a clock cycle of ns and performs on average 2 instructions

More information

Quiz for Chapter 1 Computer Abstractions and Technology

Quiz for Chapter 1 Computer Abstractions and Technology Date: Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

Imperative Data Parallelism (Performance)

Imperative Data Parallelism (Performance) Imperative Data Parallelism (Performance) Unit 1.a 1 Acknowledgments Authored by Thomas Ball, MSR Redmond This slide deck contains material courtesy of Tim Harris, MSR Cambridge 9/4/2010 2 Data Intensive

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

1.3 Data processing; data storage; data movement; and control.

1.3 Data processing; data storage; data movement; and control. CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

Performance, Power, Die Yield. CS301 Prof Szajda

Performance, Power, Die Yield. CS301 Prof Szajda Performance, Power, Die Yield CS301 Prof Szajda Administrative HW #1 assigned w Due Wednesday, 9/3 at 5:00 pm Performance Metrics (How do we compare two machines?) What to Measure? Which airplane has the

More information

Lec 25: Parallel Processors. Announcements

Lec 25: Parallel Processors. Announcements Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Parallel Functional Programming Lecture 1. John Hughes

Parallel Functional Programming Lecture 1. John Hughes Parallel Functional Programming Lecture 1 John Hughes Moore s Law (1965) The number of transistors per chip increases by a factor of two every year two years (1975) Number of transistors What shall we

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

27. Parallel Programming I

27. Parallel Programming I The Free Lunch 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism,

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: 1. [15 points] Consider two different implementations, M1 and

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs Dan Grossman Last Updated: January 2016 For more information, see http://www.cs.washington.edu/homes/djg/teachingmaterials/

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) 1 MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) Chapter 5 Appendix F Appendix I OUTLINE Introduction (5.1) Multiprocessor Architecture Challenges in Parallel Processing Centralized Shared Memory

More information

Performance of computer systems

Performance of computer systems Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Computer Architecture Crash course

Computer Architecture Crash course Computer Architecture Crash course Frédéric Haziza Department of Computer Systems Uppsala University Summer 2008 Conclusions The multicore era is already here cost of parallelism is dropping

More information

Multicore Hardware and Parallelism

Multicore Hardware and Parallelism Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3

More information

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014

NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY. Tim Harris, 14 November 2014 NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 14 November 2014 Lecture 6 Introduction Amdahl s law Basic spin-locks Queue-based locks Hierarchical locks Reader-writer locks Reading

More information

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2 Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era 11/16/2011 Many-Core Computing 2 Gene M. Amdahl, Validity of the Single-Processor Approach to Achieving

More information

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors

EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors EN164: Design of Computing Systems Lecture 34: Misc Multi-cores and Multi-processors Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Giorgio Buttazzo. Scuola Superiore Sant Anna, Pisa. The transition

Giorgio Buttazzo. Scuola Superiore Sant Anna, Pisa. The transition Giorgio Buttazzo Scuola Superiore Sant Anna, Pisa The transition On May 7 th, 2004, Intel, the world s largest chip maker, canceled the development of the Tejas processor, the successor of the Pentium4-style

More information

Why do we care about parallel?

Why do we care about parallel? Threads 11/15/16 CS31 teaches you How a computer runs a program. How the hardware performs computations How the compiler translates your code How the operating system connects hardware and software The

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group

How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group avi.mendelson@intel.com 1 Disclaimer No Intel proprietary information is disclosed. Every future estimate

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

CSE 332: Data Structures & Parallelism Lecture 15: Analysis of Fork-Join Parallel Programs. Ruth Anderson Autumn 2018

CSE 332: Data Structures & Parallelism Lecture 15: Analysis of Fork-Join Parallel Programs. Ruth Anderson Autumn 2018 CSE 332: Data Structures & Parallelism Lecture 15: Analysis of Fork-Join Parallel Programs Ruth Anderson Autumn 2018 Outline Done: How to use fork and join to write a parallel algorithm Why using divide-and-conquer

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects CS 378: Programming for Performance Administration Instructors: Keshav Pingali (Professor, CS department & ICES) 4.126 ACES Email: pingali@cs.utexas.edu TA: Hao Wu (Grad student, CS department) Email:

More information

CS654 Advanced Computer Architecture. Lec 3 - Introduction

CS654 Advanced Computer Architecture. Lec 3 - Introduction CS654 Advanced Computer Architecture Lec 3 - Introduction Peter Kemper Adapted from the slides of EECS 252 by Prof. David Patterson Electrical Engineering and Computer Sciences University of California,

More information

Parallelism, Multicore, and Synchronization

Parallelism, Multicore, and Synchronization Parallelism, Multicore, and Synchronization Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer, Roth, Martin] xkcd/619 3 Big Picture: Multicore

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Multicore and Parallel Processing

Multicore and Parallel Processing Multicore and Parallel Processing Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 4.10 11, 7.1 6 xkcd/619 2 Pitfall: Amdahl s Law Execution time after improvement

More information

Designing for Performance. Patrick Happ Raul Feitosa

Designing for Performance. Patrick Happ Raul Feitosa Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art

More information

PERFORMANCE MEASUREMENT

PERFORMANCE MEASUREMENT Administrivia CMSC 411 Computer Systems Architecture Lecture 3 Performance Measurement and Reliability Homework problems for Unit 1 posted today due next Thursday, 2/12 Start reading Appendix C Basic Pipelining

More information

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation Shared Memory and Distributed Multiprocessing Bhanu Kapoor, Ph.D. The Saylor Foundation 1 Issue with Parallelism Parallel software is the problem Need to get significant performance improvement Otherwise,

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

CDA3101 Recitation Section 13

CDA3101 Recitation Section 13 CDA3101 Recitation Section 13 Storage + Bus + Multicore and some exam tips Hard Disks Traditional disk performance is limited by the moving parts. Some disk terms Disk Performance Platters - the surfaces

More information

Computer Architecture

Computer Architecture Computer Architecture Architecture The art and science of designing and constructing buildings A style and method of design and construction Design, the way components fit together Computer Architecture

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Transistors and Wires

Transistors and Wires Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis Part II These slides are based on the slides provided by the publisher. The slides

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1 MySQL UC 2010 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul MySQL UC 2010 How Fractal Trees Work 2 More Information You can download this talk and others at http://tokutek.com/technology

More information

CS 110 Computer Architecture

CS 110 Computer Architecture CS 110 Computer Architecture Performance and Floating Point Arithmetic Instructor: Sören Schwertfeger http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

Lecture 23: Parallelism. To Use Library. Getting Good Results. CS 62 Fall 2016 Kim Bruce & Peter Mawhorter

Lecture 23: Parallelism. To Use Library. Getting Good Results. CS 62 Fall 2016 Kim Bruce & Peter Mawhorter Lecture 23: Parallelism CS 62 Fall 2016 Kim Bruce & Peter Mawhorter Some slides based on those from Dan Grossman, U. of Washington To Use Library Getting Good Results Create a ForkJoinPool Instead of subclass

More information

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 2 Analysis of Fork-Join Parallel Programs Steve Wolfman, based on work by Dan Grossman (with small tweaks by Alan Hu) Learning

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Fall 2009 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2009 1 When and Where? When:

More information

Memory Hierarchy Technology. The Big Picture: Where are We Now? The Five Classic Components of a Computer

Memory Hierarchy Technology. The Big Picture: Where are We Now? The Five Classic Components of a Computer The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Datapath Today s Topics: technologies Technology trends Impact on performance Hierarchy The principle of locality

More information

LECTURE 1. Introduction

LECTURE 1. Introduction LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Department of Computer Science Duke University Ph.D. Qualifying Exam. Computer Architecture 180 minutes

Department of Computer Science Duke University Ph.D. Qualifying Exam. Computer Architecture 180 minutes Department of Computer Science Duke University Ph.D. Qualifying Exam Name: Computer Architecture 180 minutes Not all questions are equally difficult. Look at entire exam. Budget time carefully. Please

More information

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University 18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?

More information

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

Performance. CS 3410 Computer System Organization & Programming. [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance CS 3410 Computer System Organization & Programming [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon] Performance Complex question How fast is the processor? How fast your application runs?

More information

Course web site: teaching/courses/car. Piazza discussion forum:

Course web site:   teaching/courses/car. Piazza discussion forum: Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start

More information

BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List) BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside

More information

Instructor Information

Instructor Information CS 203A Advanced Computer Architecture Lecture 1 1 Instructor Information Rajiv Gupta Office: Engg.II Room 408 E-mail: gupta@cs.ucr.edu Tel: (951) 827-2558 Office Times: T, Th 1-2 pm 2 1 Course Syllabus

More information

The Implications of Multi-core

The Implications of Multi-core The Implications of Multi- What I want to do today Given that everyone is heralding Multi- Is it really the Holy Grail? Will it cure cancer? A lot of misinformation has surfaced What multi- is and what

More information

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Why do I need four computing cores on my phone?! Why do I need eight computing

More information

Advanced Computer Architecture (CS620)

Advanced Computer Architecture (CS620) Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).

More information

CSE373: Data Structures & Algorithms Lecture 22: Parallel Reductions, Maps, and Algorithm Analysis. Kevin Quinn Fall 2015

CSE373: Data Structures & Algorithms Lecture 22: Parallel Reductions, Maps, and Algorithm Analysis. Kevin Quinn Fall 2015 CSE373: Data Structures & Algorithms Lecture 22: Parallel Reductions, Maps, and Algorithm Analysis Kevin Quinn Fall 2015 Outline Done: How to write a parallel algorithm with fork and join Why using divide-and-conquer

More information

B649 Graduate Computer Architecture. Lec 1 - Introduction

B649 Graduate Computer Architecture. Lec 1 - Introduction B649 Graduate Computer Architecture Lec 1 - Introduction http://www.cs.indiana.edu/~achauhan/teaching/ B649/2009-Spring/ 1/12/09 b649, Lec 01-intro 2 Outline Computer Science at a Crossroads Computer Architecture

More information

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD) Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009

More information

High Performance Computing (HPC) Introduction

High Performance Computing (HPC) Introduction High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing

More information

Computer Architecture and Structured Parallel Programming James Reinders, Intel

Computer Architecture and Structured Parallel Programming James Reinders, Intel Computer Architecture and Structured Parallel Programming James Reinders, Intel Parallel Computing CIS 410/510 Department of Computer and Information Science Lecture 17 Manycore Computing and GPUs Computer

More information