Multithreaded Programming
1 Multithreaded Programming. Arch D. Robison, Intel Corporation. Copyright 2005 Intel Corporation.
2 Overall Goal. Acquire a mental model of how parallel computers work. Learn basic guidelines on designing parallel programs. Know what you need to look up later. Emphasis is on multithreaded shared-memory programs, but most of the high-level concepts apply to any parallel program.
3 Outline. Episode I: Parallelism 101. Episode II: Hardware. Episode III: Decomposition. Episode IV: Specification of parallelism and synchronization. Episode V: Correctness. Episode VI: Practical issues.
4 Episode I. Reasons to write parallel programs. Kinds of parallelism. Speedup.
5 Reasons to Write Parallel Programs. Functionality. Throughput. Turn-around. Capacity.
6 Functionality. Sometimes the parallel version is easier to write. Often the case in stream processing, e.g. the pipeline expand | sed 's/ */ /' | fold -s. Responsiveness of human interface: decouple the human interface from the computing portions (compute in background); decouple portions of the user interface, so that one window does not block because another window's dialog box is open.
7 Throughput. Throughput = useful work per unit time. A parallel program can improve throughput even on a single CPU. Example: hide I/O latency.
8 Turn-Around. Solve the problem faster, typically by applying more resources. The parallel solution is sometimes a better algorithm! Parallel branch-and-bound may tighten the bound quicker, resulting in less wasted work. Examples: practical weather prediction (a 24-hour forecast that takes 24 hours to compute is not useful); real time (e.g. video games, cybernetics), which must meet a deadline.
9 Capacity. Big parallel processors have huge memory capacity. Can solve the problem in core without having to resort to slow out-of-core techniques. RAM on the Top 5 supercomputers in June 2005: Blue Gene/L (DOE/LLNL) 32 Terabytes?; BGW (IBM Thomas J. Watson Research Center) 20 Terabytes?; Columbia (NASA/Ames) 20 Terabytes; Earth Simulator (Japan) 10 Terabytes; MareNostrum (Barcelona) 9.6 Terabytes.
10 Basic Parallel Programming Models. Single Instruction Multiple Data (SIMD): single thread operating on long vectors. Multiple Instruction Multiple Data (MIMD): either message passing (each thread has its own address space; threads communicate by sending and receiving messages) or shared memory (threads share a common address space and communicate by writing and reading shared memory locations). Hybrid: networks of shared-memory machines; virtual shared memory across a cluster. Totally implicit: e.g., applicative programming languages.
11 Parallelism vs. Concurrency. Concurrency = multiple threads in progress at the same time; achievable on a single processor by interleaving. Parallelism = multiple threads executing simultaneously; requires multiple execution units.
12 Reasons for Concurrency. Throughput: hide I/O latency. Example: let task A compute while task B waits to read the disk. Functionality: e.g. listen to a music file while surfing the Web.
13 Reasons for Parallelism. Functionality: a multithreaded CPU avoids annoying pauses when a task hogs a physical thread. Throughput: achieve a good frame rate in a video game; the CPU simulates the universe while the graphics processor renders. Turn-around: almost all champion chess programs. Capacity: most large-scale scientific problems.
14 Speedup. Speedup measures improvement from parallelization: speedup(p) = (time for best serial version) / (time for version with p physical threads). Efficiency is relative to ideal speedup: efficiency(p) = speedup(p) / p.
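To make the definitions concrete, here is a minimal C sketch (not from the talk; the timings below are invented placeholders) that applies both formulas to measured wall-clock times:

    #include <stdio.h>

    /* speedup(p) = T_best_serial / T_parallel(p); efficiency(p) = speedup(p)/p.
     * The timings are made-up placeholders, not measurements from the talk. */
    int main(void) {
        double t_best_serial = 10.0;                   /* seconds */
        int    threads[]     = {1, 2, 4, 8};
        double t_parallel[]  = {10.5, 5.6, 3.1, 1.9};  /* seconds at p = 1,2,4,8 */

        for (int i = 0; i < 4; i++) {
            double speedup = t_best_serial / t_parallel[i];
            printf("p=%d  speedup=%.2f  efficiency=%.2f\n",
                   threads[i], speedup, speedup / threads[i]);
        }
        return 0;
    }

Because the baseline is the best serial version, speedup(1) for the parallel code can legitimately come out below 1.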
15 Typical Speedup Chart. [Chart: speedup of parallel Foo relative to the serial version vs. number of physical threads; geomean of 10 runs; blocksize = 16x16 pixels; HT-enabled Xeon with 8 TPUs, 512K cache; series IDEAL, NEW AND IMPROVED!, and BRAND X.] Why does this chart violate our definition of speedup?
16 Use Best Serial Code As Baseline. Speedup(1) > 1 in the example indicates that the original serial code was not the best serial code. This commonly happens: while tuning for parallelism, serial improvements are found. Example: no-alias information useful for parallelizing can also help the compiler generate better serial code. Example: reblocking loops to help parallel decomposition sometimes helps the serial code's reuse of cache. Reference: Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers.
17 Amdahl's "Law". Suppose the program takes 1 unit of time to execute serially, and part of the program is inherently serial (unparallelizable). Let s = fraction of time spent in the serial portion, so 1-s is the remaining parallelizable time. [Diagram: time on a single processor is s + (1-s); time on a parallel processor, in an ideal world, is s + (1-s)/p.]
18 Amdahl's Formulae. speedup(p) ≤ 1 / (s + (1-s)/p), i.e., serial execution time divided by the best possible parallel execution time. Speedup is limited by the serial portion: speedup(∞) ≤ 1/s. The situation is worse in the real world, because the parallel portion usually has inefficiencies.
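A minimal sketch of Amdahl's bound in C (the function name is ours); with a 20% serial portion it reproduces the numbers on the next slide:

    #include <stdio.h>

    /* Amdahl's law: upper bound on speedup for serial fraction s
     * and p physical threads. */
    static double amdahl_speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        for (int p = 1; p <= 32; p *= 2)   /* s = 0.2: 20% serial portion */
            printf("p=%2d  speedup <= %.2f\n", p, amdahl_speedup(0.2, p));
        return 0;
    }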
19 Diminishing Returns. Suppose the program has a 20% serial portion. speedup(2) ≈ 1.67 and speedup(4) = 2.5: doubling the number of processors from 2 to 4 got us only 50% improvement! speedup(∞) = 5. This is depressing. Is there a way to evade Amdahl's law?
20 Gustafson/Barsis's Argument. Scale up problem size N as the number of processors increases. Assume that work in the parallel portion grows as N does (e.g., finer mesh and more particles for a gorier video game); the parallelizable portion now requires time N(1-s) to execute serially. Keep the serial portion independent of problem size. This is often achievable, though it requires care: e.g., never loop over N in the serial portion; e.g., if the amount of I/O depends on N, the I/O must be parallel too. The net effect is to shrink the serial fraction as the number of processors grows.
21 Gustafson/Barsis's Formulae. Revised speedup: speedup(p) = (s + N(1-s)) / (s + N(1-s)/p), i.e., serial execution time divided by the best possible parallel execution time. Now set N = p: speedup(p) = s + p(1-s).
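The same style of sketch for the Gustafson/Barsis scaled speedup (again, the function name is ours); with s = 0.2 it reproduces the numbers on the next slide:

    #include <stdio.h>

    /* Gustafson/Barsis: scale problem size with p (N = p), so the
     * scaled speedup is s + p*(1-s). */
    static double gustafson_speedup(double s, int p) {
        return s + p * (1.0 - s);
    }

    int main(void) {
        for (int p = 1; p <= 32; p *= 2)   /* s = 0.2: 20% serial portion */
            printf("p=%2d  scaled speedup = %.1f\n", p, gustafson_speedup(0.2, p));
        return 0;
    }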
22 Growing Returns. Revisit the program with a 20% serial portion, assuming the parallel work scales up with N. speedup(2) = 0.2 + 2(1-0.2) = 1.8; speedup(4) = 0.2 + 4(1-0.2) = 3.4. Doubling the number of processors and doubling the problem size now almost doubles speedup. speedup(∞) = 0.2 + ∞(1-0.2) = ∞. Reference: John Gustafson, Reevaluating Amdahl's Law, CACM, May 1988, Volume 31.
23 Summary of Episode I. Reasons to write parallel programs: functionality, throughput, turn-around, capacity. Two flavors: shared memory and message passing. Two laws: Amdahl (speedup limited by serial portion) and Gustafson/Barsis (evade the limit by scaling up the problem).
24 Episode II. Approaches to single-chip parallelism. Structure of processors and memory.
25 Simple Programmer's View. [Diagram: "Mr. CPU" with fast access to a handful of registers/variables, and memory drawn as equally accessible boxes.]
26 Pentium 4 Processor. [Processor image.] Pentium 4 is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
27 Latency vs. Throughput. Latency = how long it takes to execute an instruction. Throughput = instructions per unit time. Example: Pentium 4 integer multiply has a latency of many clock cycles but a throughput of one result per 5 cycles. Reference: Appendix C of the IA-32 Intel Architecture Optimization Reference Manual.
28 Instruction Level Parallelism. SIMD: an instruction operates on a vector of data; 4-element vectors have become very popular (a SIMD-friendly loop is sketched below). Superscalar: execute more than one instruction at once. No clear winner in the in-order vs. out-of-order debate. In-order: instructions executed in program order; smaller, simpler, less power, but suffers from cache misses. Out-of-order: instructions executed out of order so that instructions waiting for data do not delay subsequent independent instructions; bigger, more complex, almost quadratic power demand, but more tolerant of cache misses.
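As a concrete illustration of the SIMD style (our example, not from the slides), a loop like the following maps naturally onto 4-element vector instructions; a vectorizing compiler can replace four scalar iterations with one vector add:

    /* Element-wise add: each iteration is independent, so the compiler
     * can execute 4 iterations at a time with one 4-wide SIMD add. */
    void add_arrays(float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }

In practice the compiler may need help proving that a and b do not alias (e.g. C99 restrict), which echoes the earlier point that alias information helps serial code too.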
29 Multithreaded CPUs. Like an expert chess player playing against multiple opponents: each board is the state of a thread; move on to the next board while waiting for a response (hide latency). Susan Polgar set a record of 326 concurrent games (309 wins!). The idea has been around since the 1950s. Simultaneous Multithreading (SMT): feed the execution units from multiple instruction streams; a good fit for an out-of-order machine. Switch On Event Multithreading (SOEMT): switch to another instruction stream while waiting on an outer-level cache miss; a good fit for an in-order machine.
30 Multicore Chips. Multiple CPUs in the same socket or die. Multiple cores on the same die have both advantages and disadvantages: they enable communication at chip speed instead of slower board speed, but yield per wafer is lower. Assuming a random distribution of failures, if a fraction x of CPUs are good, then only a fraction x² of CPU-pairs are good.
31 Future Speed Improvements Will Mostly Be From Multithreading. Improvements in the past have been for a single thread. Examples: higher clock rate; bigger functional units (e.g. full array multipliers); pipelining and out-of-order execution. These all consume superlinearly more transistors or power (approximately quadratic), and cooling capacity is limited (cannot put a toaster in your lap). Improvements in the future are for multiple threads: multiple cores and more threads per core. As long as the program scales significantly better than √p, we are ahead.
32 Memory. Memory also has latency and throughput, and both are slower than CPU speeds: latency is hundreds of clocks, and bandwidth can be 20x slower than the CPU (see the STREAM benchmark; a STREAM-style kernel is sketched below). Memory cannot be large, fast, and cheap; hence the cache hierarchy: some small, fast, expensive memory at the top and lots of big, slow, cheap memory at the bottom.
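For a feel of what the STREAM benchmark measures, here is a simplified sketch of its "triad" kernel in C (our simplification, not the official code); its speed is limited by memory bandwidth, not arithmetic:

    /* STREAM-style triad: 2 reads and 1 write per iteration but almost
     * no arithmetic, so sustained memory bandwidth dominates the time. */
    void triad(double *a, const double *b, const double *c,
               double scalar, long n) {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }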
33 Example: One Core of Pentium D Processor. [Diagram of the memory hierarchy: register set; 16 KB Level 1 data cache; 12 KB Level 1 instruction cache; 1 MB Level 2 cache; main memory; bandwidth shown by arrow width. Main memory would be 5x21 feet if 1 GB were drawn at the same scale as the paper.]
34 Off-Chip Memory Is Slow! [Diagram: two hardware threads (HT0, HT1) share an L1 cache and L2 cache on chip, with memory off chip; latency shown by arrow length, bandwidth by arrow width.] Design your parallel programs to minimize off-chip communication. Intrachip locality is good; interchip locality is bad.
35 Shared Bus. [Diagram: four CPUs and several memories attached to a single shared bus.] Uniform Memory Access (UMA): any processor can reach any memory in the same time. All devices on the bus share the bus bandwidth: uniformly slow for everyone.
36 Point to Point. [Diagram: CPUs and memories connected by point-to-point links.] All sorts of topologies are possible: meshes, toruses, hypercubes, fat trees, ... Non-Uniform Memory Access (NUMA): processors can reach near memory faster than far memory. Aggregate bandwidth grows as nodes are added; nodes do not have to share bandwidth.
37 Cluster. Each node in the cluster is a single- or multiprocessor box. Nodes are interconnected by a bus (e.g. antique Ethernet*), point-to-point links (e.g. Scalable Coherent Interconnect*), or a switched fabric (e.g. Infiniband*). *Other names and brands may be claimed as the property of others.
38 Some Interconnect Metrics. Bisection bandwidth: an adversary cuts the machine in half; measure the bandwidth between the halves. Latency: near vs. far reference. For message passing, software adds an order of magnitude (or worse) to the hardware latency.
39 Shared Memory Is Really Message Passing. The messages are cache lines/sectors, typically around 64-128 bytes per line. The Pentium 4 processor has sectored lines: 64-byte sectors in a 128-byte line. Writes invalidate sectors; reads pull in lines. For a cache without sectors, the sector is the same as the line. When a processor writes to a cache sector, the other processors have to be notified of the change.
40 Example: False Sharing. Consider two processors without a shared cache. If the two processors access different locations that lie on the same cache sector, and at least one access is a write, then the cache sector must move between the processors.
41 Cache Line Ping Pong. [Diagram: Processor #0 repeatedly executes x[0]++, Processor #1 executes x[1]++, Processor #2 executes x[2]++; the elements share a cache sector, which bounces among the processors.]
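A minimal POSIX-threads sketch of the ping-pong (our example; the iteration count and the 64-byte sector size are assumptions): two threads increment neighboring elements of x, which lie on the same cache sector, so the sector bounces between processors even though no datum is actually shared. Padding each counter to its own sector removes the effect. Compile with -pthread.

    #include <pthread.h>

    enum { N_THREADS = 2, ITERS = 100000000 };

    /* False sharing: x[0] and x[1] lie on the same cache sector, so the
     * two threads' writes force the sector back and forth between cores. */
    int x[N_THREADS];

    /* The fix: pad each counter to a full sector (64 bytes assumed). */
    struct padded { int value; char pad[60]; };
    struct padded y[N_THREADS];

    static void *bump(void *arg) {
        int i = *(int *)arg;
        for (long k = 0; k < ITERS; k++)
            x[i]++;                    /* change to y[i].value++ to fix */
        return NULL;
    }

    int main(void) {
        pthread_t t[N_THREADS];
        int id[N_THREADS] = {0, 1};
        for (int i = 0; i < N_THREADS; i++)
            pthread_create(&t[i], NULL, bump, &id[i]);
        for (int i = 0; i < N_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }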
42 8 Threads on 4-Socket HT Machine. [Chart: thread i does "x[i*stride]++"; normalized time (log scale, geomean of 3 runs) vs. stride.] Pop quiz: the cache sector is 64 bytes. What is sizeof(x[0])?
43 Summary of Episode II. Approaches to single-chip parallelism: SIMD, superscalar, multithreaded CPU, multicore. Most future improvement in chip performance will come from parallelism. Off-chip memory is slow; current machines depend on caches. Uniform vs. non-uniform memory access. The cache line/sector is the unit of information interchange.