Why Parallel Computing?
1 Why Parallel Computing?
Jinkyu Jeong (jinkyu@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
SSE3054: Multicore Systems, Spring 2017
2 What is Parallel Computing?
- Software with multiple threads
- Parallel vs. concurrent:
  - Parallel computing executes multiple threads at the same time on multiple processors (goal: performance)
  - Concurrent computing can either alternate multiple threads on a single processor or execute them in parallel on multiple processors (goal: convenient design)
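The distinction can be made concrete with a small sketch using Python's standard threading module (the work function and thread count are invented for illustration). Note that threads in CPython are concurrent but, for CPU-bound work, not truly parallel because of the global interpreter lock, which itself illustrates the difference between the two terms:

```python
import threading

def work(results, i):
    # Each thread computes an independent partial sum.
    results[i] = sum(range(i * 1000, (i + 1) * 1000))

# Concurrent design: the program is expressed as multiple threads.
# Whether they actually run in parallel depends on how many
# processors the runtime schedules them onto.
results = [0] * 4
threads = [threading.Thread(target=work, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(results)
```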
3 What is Parallel Architecture?
- Machines with multiple processors
  - Intel Quad Core i7 (Nehalem), Sony PlayStation 3/4, IBM Blue Gene
- Parallel computing vs. distributed computing
  - e.g., Google data center locations, Facebook data centers
4 What is Parallel Architecture?
- One definition: a parallel computer is a collection of processing elements that cooperate to solve large problems fast
- Some general issues:
  - Resource allocation
  - Data access, communication, and synchronization
  - Performance and scalability
5 Why Study Parallel Computing?
- 10 years ago
  - Some important applications demanded high performance
  - Even with CPU clock frequency scaling, performance was not enough
  - Parallel computing provided higher performance
- Today
  - Many interesting applications demand high performance
  - CPU clock rates are no longer increasing!
  - Instruction-level parallelism is not increasing either!
  - Parallel computing is the only way to achieve higher performance in the foreseeable future
6 Why Study Parallel Architecture?
- Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost
- Parallelism
  - Provides an alternative to a faster clock for performance
  - Applies at all levels of system design: bits, instructions, loops, threads, ...
7 Why Parallel Computing?
- Application demands
- Architectural trends
- The impact of technology and power
8 Application Trends
- A positive feedback cycle between delivered performance and application demand: more performance enables new applications, which demand still more performance
- Example application domains:
  - Scientific computing: computational fluid dynamics, biology, meteorology, ...
  - General-purpose computing: video, graphics, ...
  - Commercial computing: databases, ...
9 Speedup
- Goal of applications in using parallel machines:

  Speedup(p processors) = Performance(p processors) / Performance(1 processor)

- For a fixed problem size (input data set), performance = 1/time:

  Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
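As a sanity check on the definition, the fixed-problem speedup can be computed directly (the timings are hypothetical):

```python
def speedup(time_1_processor, time_p_processors):
    # Fixed-problem-size speedup: Time(1 processor) / Time(p processors)
    return time_1_processor / time_p_processors

# Hypothetical measurements: a fixed job takes 120 s on one processor
# and 20 s on eight processors.
s = speedup(120.0, 20.0)  # 6x speedup on 8 processors (not perfectly linear)
```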
10 Scientific Computing Demand
11 Example: N-body Simulation
- Simulation of a dynamical system of particles
  - N particles (position, mass, velocity)
  - Influence of physical forces (e.g., gravity)
- Applications: physics, astronomy, ...
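A minimal sketch of one naive O(N^2) gravitational step, in one dimension for brevity; the particle layout, softening term, and time step are invented for illustration. The independent outer loop over particles is what makes the problem a natural fit for parallel execution:

```python
G = 6.674e-11  # gravitational constant

def nbody_step(pos, vel, mass, dt):
    # One naive O(N^2) step: every particle feels every other particle.
    n = len(pos)
    acc = [0.0] * n
    for i in range(n):          # independent per-particle work:
        for j in range(n):      # the natural target for parallelization
            if i != j:
                d = pos[j] - pos[i]
                # Softening term avoids division by zero at tiny separations.
                acc[i] += G * mass[j] * d / (abs(d) ** 3 + 1e-9)
    new_vel = [v + a * dt for v, a in zip(vel, acc)]
    new_pos = [p + v * dt for p, v in zip(pos, new_vel)]
    return new_pos, new_vel
```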
12 Example: Graph Computation
- Google's PageRank algorithm
  - Determines the importance of a web page by counting the number and quality of links to the page
  - Computes the relative importance of a web page u; the update is repeated until the ranks stabilize
  - Trillions of pages on the Web
(Credit: Wikipedia)
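The iteration described above is, in its standard form, PR(u) = (1-d)/N + d * sum over pages v linking to u of PR(v)/outdegree(v). A minimal serial power-iteration sketch, using a hypothetical 3-page web and the conventional damping factor d = 0.85:

```python
def pagerank(links, damping=0.85, iters=50):
    # links: dict mapping each page to the list of pages it links to.
    # Standard simplified update:
    #   PR(u) = (1 - d)/N + d * sum(PR(v)/outdegree(v) for v linking to u)
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):  # repeat until (approximately) stabilized
        new = {p: (1.0 - damping) / n for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

# Hypothetical 3-page web: a -> b, b -> c, c -> a (a simple cycle).
r = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

In the symmetric cycle every page ends up with equal rank; at web scale, the inner loops over pages and links are the parallel work.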
13 Engineering Computing
- Large parallel machines are mainstays in many industries:
  - Petroleum: reservoir analysis
  - Automotive: crash simulation, drag analysis, combustion efficiency
  - Aeronautics: airflow analysis, engine efficiency, structural mechanics, electromagnetism
  - Computer-aided design: VLSI layout
  - Pharmaceuticals: molecular modeling
  - Visualization: all of the above, entertainment, architecture
  - Financial modeling: yield and derivative analysis
14 Commercial Computing
- Relies on parallelism at the high end
  - Computational power determines the scale of business that can be handled
  - e.g., databases, online transaction processing, decision support, data mining, data warehousing, ...
- TPC benchmarks
  - TPC-C (order entry), TPC-D (decision support)
  - Explicit scaling criteria provided: the size of the enterprise scales with the size of the system
  - The problem size is no longer fixed as p increases, so throughput is used as the performance measure: transactions per minute (tpm)
15 Commercial Computing
- Hardware threads in the top-10 TPC-C machines
16 Why Parallel Computing?
- Application demands
- Architectural trends
- The impact of technology and power
17 Architectural Trends
- Architecture translates technology's gifts into performance and capability
- Four generations of architectural history: tube, transistor, IC, VLSI
- The greatest delineation within the VLSI generation has been in the type of parallelism exploited
18 Moore's Law
- Gordon Moore, co-founder of Intel
- 1965: the number of transistors per square inch doubles every 1.5 years
(Credit: shigeru23, CC-BY-SA-3.0)
19 Evolution of Intel Microprocessors
- Intel 8088: single core, no cache, 29K transistors
- Intel Nehalem-EX: 8 cores, 24MB cache, 2.3B transistors
- Intel Knights Landing, 2016: 72 cores, 16GB on-package memory used as LLC(?), 7.1B transistors
(Figure credits: ExtremeTech; "The Future of Microprocessors," Communications of the ACM, vol. 54, no. 5)
20 Transistors & Clock Frequency
(Credit: "The Future of Computing Performance: Game Over or Next Level?", The National Academies Press, 2011)
21 Impact of Device Shrinkage
- What happens when the transistor size shrinks by a factor of x?
- Clock rate increases by ~x because wires are shorter
  - Actually less than x, because of power consumption
- Transistors per unit area increase by x^2
- Die size also tends to increase, typically by another factor of ~x
- So the raw computing power of the chip goes up by ~x^4!
- Typically the x^3 growth in transistor count is devoted to on-chip:
  - Parallelism: hidden parallelism such as ILP
  - Locality: caches
- So most programs run ~x^3 times faster, without changing them
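The x^4 arithmetic above can be checked with a toy calculation (this deliberately ignores the power-consumption caveat noted on the slide):

```python
def raw_compute_gain(x):
    # Shrink transistor size by a factor of x (optimistic scaling):
    clock_gain = x          # shorter wires -> clock rate up by ~x
    density_gain = x ** 2   # transistors per unit area up by x^2
    die_gain = x            # die area itself tends to grow by ~x
    # Raw computing power ~ clock * transistor count = x * (x^2 * x) = x^4
    return clock_gain * density_gain * die_gain

g = raw_compute_gain(2)  # a 2x shrink -> ~16x raw capability
```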
22 Architectural Trends
- The greatest trend across VLSI generations has been the increase in parallelism
- Up to 1985: bit-level parallelism
  - 4-bit -> 8-bit -> 16-bit; slows after 32-bit
  - Adoption of 64-bit now under way; 128-bit still far off (not a performance issue)
- Mid-80s to mid-90s: instruction-level parallelism
  - Pipelining and simple instruction sets + compiler advances (RISC)
  - On-chip caches and functional units -> superscalar execution
  - Greater sophistication: out-of-order execution, speculation, prediction
    - To deal with control-transfer and latency problems
- Now: thread-level parallelism
  - Multicore processors
23 Phases in VLSI Generation
[Chart: transistor counts (10,000 up to 100,000,000) over time, from the i4004/i8008/i8080/i8086 era through the i80286, i80386, R2000/R3000, Pentium, and R10000; annotated with the phases of bit-level parallelism (4 -> 8 -> 16 -> 32 bits), instruction-level parallelism (pipelining, superscalar, branch prediction, out-of-order execution, VLIW), and thread-level parallelism (multicore/manycore, heterogeneous multicore, GPGPU)]
24 Superscalar Execution
- Exploits instruction-level parallelism
- With N pipelines, N instructions can be issued (executed) in parallel
- Example instruction stream for a 2-way superscalar execution pipeline:
  1. Load
  2. Load
  3. Add
  4. Add
  5. Add R1, R2
  6. Store
25 Out-of-order Execution
- Exploits instruction-level parallelism
- With N pipelines, N instructions can be issued (executed) in parallel
- Reorders instructions, respecting dependencies, on a 2-way superscalar execution pipeline:
  1. Load
  2. Load
  3. Add
  4. Add
  5. Add R1, R2  (dependency)
  6. Store
26 Architecture-driven Advances
(Credit: "The Future of Microprocessors," Communications of the ACM, 2011)
27 Limitation in ILP
- [Charts: fraction of total cycles (%) vs. number of instructions issued; speedup vs. instructions issued per cycle (issue width)]
- Speedup begins to saturate beyond an issue width of 4
- Assumes an ideal machine: infinite resources and fetch bandwidth, perfect branch prediction and renaming
- But with realistic memory: real caches and non-zero miss latencies
28 Instruction-level Parallelism Concerns
- Wasted issue cycles (4-way issue slots; full vs. empty slots)
  - Horizontal waste = 9 slots; vertical waste = 12 slots
- Contributing factors:
  - Instruction dependencies
  - Long-latency operations within a thread
(Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995)
29 Some Sources of Wasted Issue Slots
- Memory hierarchy: TLB miss, instruction cache miss, data cache miss, load delays (L1 hit latency), memory conflict
- Control flow: branch misprediction
- Instruction stream: instruction dependency
30 Simulation of an 8-issue Superscalar Processor
- Processor cycles are highly underutilized in most SPEC92 applications
- On average < 1.5 IPC (19% of issue slots used)
- The dominant source of waste differs by application
(Credit: D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995)
31 Single-Thread Performance Curve
- [Chart: single-thread performance over time, split into architecture-driven and technology-driven phases]
- The rate of single-thread performance improvement has decreased
[Computer Architecture, Hennessy & Patterson]
32 Single-chip Multiprocessor (Multicore)
- Limited instruction-level parallelism -> exploits thread-level parallelism instead
- [Figure: die-area comparison (30% area) of a 6-way superscalar processor vs. a 4 x 2-way multiprocessor]
(Credit: K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, "The Case for a Single-Chip Multiprocessor," ASPLOS 1996)
33 Why Parallel Computing?
- Application demands
- Architectural trends
- The impact of technology and power
34 Desperate Cooling?
35 Cooling
- Power delivery, packaging, and cooling costs
  - Increased cost of thermal packaging: ~$1 per watt for CPUs above 35W [Tiwari, DAC '98]
  - At the high end, 1W of cooling for every 1W of power
- Cooling approaches
  - Physical solutions: heatsink, thermal paste, fan
  - APM (Advanced Power Management): BIOS manages clock speed, triggered by thermal diodes
  - ACPI (Advanced Configuration and Power Interface): OS manages power for all devices (idle, nap, sleep modes)
36 Power Consumption Trend
- [Chart: power density (W/cm^2) vs. year, from the 8086 through the Pentium, with reference levels for hot plate, nuclear reactor, rocket nozzle, and the sun's surface. Source: Patrick Gelsinger, Shekhar Borkar, Intel]
- Reasons for power trends: smaller feature size, increased compaction, high clock frequency, more complicated processors
- Projection (source: Intel, 2002):

  Process   Transistors   Supply voltage   Frequency   Power
  (?)       0.2B          (?)              1GHz        100W
  90nm      1.2B          (?)              3GHz        175W
  55nm      7.8B          0.7V             10GHz       300W
  3nm       50B           0.5V             30GHz       540W
37 Power Density Limits Serial Performance
- Single-core processor
  - Dynamic power = V^2 * f * C (V = supply voltage, f = frequency, C = capacitance)
  - Increasing frequency (f) also requires increasing supply voltage (V) -> cubic effect
- Multicore processor
  - Dynamic power = V^2 * f * C
  - Increasing the number of cores increases capacitance (C) only linearly
  - More transistors by Moore's law -> more cores for higher performance instead, or the same performance at reduced power
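A toy illustration of the V^2 * f * C relation above; all coefficients are normalized and invented. It shows why scaling frequency costs roughly cubically while adding cores costs only linearly:

```python
def dynamic_power(v, f, c):
    # Dynamic power = V^2 * f * C
    return v ** 2 * f * c

base = dynamic_power(1.0, 1.0, 1.0)

# Doubling frequency also requires roughly doubling supply voltage
# (toy assumption), so power grows ~cubically with clock speed:
freq_scaled = dynamic_power(2.0, 2.0, 1.0)   # 2x clock -> ~8x power

# Doubling cores instead doubles capacitance at fixed V and f:
core_scaled = dynamic_power(1.0, 1.0, 2.0)   # 2x cores -> ~2x power
```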
38 Power Consumption
[Chart: power consumption (watts) of Core 2 Quad 3.0GHz, Core 2 Duo 3.0GHz, Core 2 Duo 2.4GHz (mobile), and Core 2 Duo 1.2GHz (mobile). Source: MIT IAP]
39 Evolution of Microprocessors
- Chip density continues to increase: 2x every 2 years
- Clock speed does not
- The number of cores increases instead
- Power no longer increases
40 Recent Multicore Processors
- 2H'15: Intel Knights Landing: 60+ cores; 4-way SMT; 16GB (on pkg)
- 2015: Oracle SPARC M7: 32 cores; 8-way fine-grain MT; 64MB cache
- Fall '14: Intel Haswell: 18 cores; 2-way SMT; 45MB cache
- June '14: IBM Power8: 12 cores; 8-way SMT; 96MB cache
- Sept '13: SPARC M6: 12 cores; 8-way fine-grain MT; 48MB cache
- Q2'13: Intel Knights Corner (coprocessor): 61 cores; 2-way SMT; 16MB cache
- May '12: AMD Trinity: 4 CPU cores; 384 graphics cores
- Feb '12: IBM Blue Gene/Q: 16 cores; 4-way SMT; 32MB cache
(Credit: "The IBM Blue Gene/Q Compute Chip," IEEE Micro 2012)
41 Simultaneous Multithreading (SMT)
- Permits several independent threads to issue to multiple functional units each cycle
(Credit: R. Kalla, B. Sinharoy, J. Tendler, "Simultaneous Multi-threading Implementation in POWER5," Hot Chips 2003)
42 Simultaneous Multithreading (SMT)
- SMT is easily added to a superscalar architecture
  - Two sets of PCs and registers
  - The PCs share the I-fetch unit
  - Register renaming maps the second set of registers
(Credit: R. Kalla, B. Sinharoy, J. Tendler, "Simultaneous Multi-threading Implementation in POWER5," Hot Chips 2003)
43 Large-scale Parallel Hardware
- Top 7 sites in the Top500 supercomputer list (Nov. 2014)
- 148th, 169th, 170th: Korea Meteorological Administration
(Credit: top500.org)
44 Cores/Socket & Cores in Top500
[Charts: cores per socket and total core counts (4k-8k up to 128k+) across Top500 systems]
(Credit: top500.org; CS267 @ UC Berkeley by J. Demmel)
45 Blue Gene/Q Packaging Hierarchy
1. Chip: 16 cores
2. Module: single chip
3. Compute card: one single-chip module, 16GB DDR3 memory
4. Node card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes; 1, 2, or 4 I/O drawers
7. System: 20 PF/s
(Credit: "The Blue Gene/Q Compute Chip," Hot Chips 2011)
46 Summary: Why Parallel Computing?
- If you can't exploit parallelism, performance becomes a zero-sum game
  - If you want to add a new feature and maintain performance, you will need to remove something else
- The future is multicore (i.e., parallel) according to all major processor vendors
  - We expect more and more cores to be delivered with each generation
- Starting now, everyone will need to know how to write parallel programs
  - It is no longer a niche area: it is mainstream
47 References
Some slides are adopted and adapted from:
- CS194: Parallel Programming, Prof. Katherine Yelick, UC Berkeley
- CS267: Applications of Parallel Computers, Prof. Jim Demmel, UC Berkeley
- CS258: Parallel Computer Architecture, Prof. David E. Culler, UC Berkeley
- COMP422: Parallel Computing, Prof. John Mellor-Crummey, Rice University
More informationAdvanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationEECS4201 Computer Architecture
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationComputer Architecture
Informatics 3 Computer Architecture Dr. Vijay Nagarajan Institute for Computing Systems Architecture, School of Informatics University of Edinburgh (thanks to Prof. Nigel Topham) General Information Instructor
More informationSupercomputing and Mass Market Desktops
Supercomputing and Mass Market Desktops John Manferdelli Microsoft Corporation This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationLec 25: Parallel Processors. Announcements
Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza
More informationLecture 21: Parallelism ILP to Multicores. Parallel Processing 101
18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationChapter 1: Fundamentals of Quantitative Design and Analysis
1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationTransistors and Wires
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis Part II These slides are based on the slides provided by the publisher. The slides
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationCSC 447: Parallel Programming for Multi- Core and Cluster Systems. Lectures TTh, 11:00-12:15 from January 16, 2018 until 25, 2018 Prerequisites
CSC 447: Parallel Programming for Multi- Core and Cluster Systems Introduction and A dministrivia Haidar M. Harmanani Spring 2018 Course Introduction Lectures TTh, 11:00-12:15 from January 16, 2018 until
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationAdministration. Course material. Prerequisites. CS 395T: Topics in Multicore Programming. Instructors: TA: Course in computer architecture
CS 395T: Topics in Multicore Programming Administration Instructors: Keshav Pingali (CS,ICES) 4.26A ACES Email: pingali@cs.utexas.edu TA: Xin Sui Email: xin@cs.utexas.edu University of Texas, Austin Fall
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationLecture 1: CS/ECE 3810 Introduction
Lecture 1: CS/ECE 3810 Introduction Today s topics: Why computer organization is important Logistics Modern trends 1 Why Computer Organization 2 Image credits: uber, extremetech, anandtech Why Computer
More informationAdministration. Prerequisites. Website. CSE 392/CS 378: High-performance Computing: Principles and Practice
CSE 392/CS 378: High-performance Computing: Principles and Practice Administration Professors: Keshav Pingali 4.126 ACES Email: pingali@cs.utexas.edu Jim Browne Email: browne@cs.utexas.edu Robert van de
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationECE 588/688 Advanced Computer Architecture II
ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Winter 2018 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2018 1 When and Where? When:
More informationMoore s Law. CS 6534: Tech Trends / Intro. Good Ol Days: Frequency Scaling. The Power Wall. Charles Reiss. 24 August 2016
Moore s Law CS 6534: Tech Trends / Intro Microprocessor Transistor Counts 1971-211 & Moore's Law 2,6,, 1,,, Six-Core Core i7 Six-Core Xeon 74 Dual-Core Itanium 2 AMD K1 Itanium 2 with 9MB cache POWER6
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationAdvanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University
Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000
More informationHyperthreading Technology
Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationCS 6534: Tech Trends / Intro
1 CS 6534: Tech Trends / Intro Charles Reiss 24 August 2016 Moore s Law Microprocessor Transistor Counts 1971-2011 & Moore's Law 16-Core SPARC T3 2,600,000,000 1,000,000,000 Six-Core Core i7 Six-Core Xeon
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationComputer Architecture!
Informatics 3 Computer Architecture! Dr. Vijay Nagarajan and Prof. Nigel Topham! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors
More informationECE 486/586. Computer Architecture. Lecture # 2
ECE 486/586 Computer Architecture Lecture # 2 Spring 2015 Portland State University Recap of Last Lecture Old view of computer architecture: Instruction Set Architecture (ISA) design Real computer architecture:
More information