Parallel Computer Architecture Concepts
Outline

Parallel Computer Architecture Concepts
TDDD93 Lecture 1
Christoph Kessler, PELAB / IDA, Linköping University, Sweden, 2015

Lecture 1: Parallel Computer Architecture Concepts
- Parallel computer, multiprocessor, multicomputer
- SIMD vs. MIMD execution
- Shared memory vs. distributed memory architecture
- Interconnection networks
- Parallel architecture design concepts: instruction-level parallelism, hardware multithreading, multi-core and many-core, accelerators and heterogeneous systems, clusters
- Implications for programming and algorithm design

Traditional Use of Parallel Computing: HPC

Example: Weather Forecast (very simplified)
- 3D space discretization (cells), time discretization (steps)
- Start from current observations (sent from weather stations)
- Simulation step by evaluating the weather model equations
- Per cell: air pressure, temperature, humidity, sun radiation, wind direction, wind velocity

Parallel Computing for HPC Applications
- High-performance computing: much computational work (in FLOPs, floating-point operations), often with large data sets
- E.g. climate simulations, particle physics, engineering, sequence matching or protein docking in bioinformatics, ...
- Single-CPU computers, and even today's multicore processors, cannot provide such massive computation power
- Aggregate LOTS of computers: clusters
- Need scalable algorithms
- Need to exploit multiple levels of parallelism
- Example: NSC Triolith
Parallel Computer Architecture Concepts

Classification of parallel computer architectures:
- by control structure (SIMD vs. MIMD)
- by memory organization, in particular distributed memory vs. shared memory
- by interconnection network topology

Interconnection Networks
- Simple topologies, e.g. fully connected
- Hypercube, with an inductive definition: a d-dimensional hypercube consists of two (d-1)-dimensional hypercubes, with corresponding nodes connected by an extra link
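The inductive definition above fixes the basic properties of a d-dimensional hypercube; a small sketch (my own illustration, not from the slides) makes them explicit:

```python
# Properties of a d-dimensional hypercube network, following the inductive
# definition (two (d-1)-cubes with corresponding nodes linked).

def hypercube(d):
    nodes = 2 ** d                              # doubles with each dimension
    degree = d                                  # one link per dimension at each node
    diameter = d                                # longest shortest path: flip all d address bits
    links = d * 2 ** (d - 1) if d > 0 else 0    # each node has d links, each shared by 2 nodes
    return nodes, degree, diameter, links

print(hypercube(3))  # (8, 3, 3, 12): the familiar 3D cube
```

Node degree and diameter grow only logarithmically in the number of nodes, which is one reason the hypercube scales better than, say, a fully connected network.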
More about Interconnection Networks
- Fat-tree, butterfly, ... (see TDDC78)
- Switching and routing algorithms
- Discussion of interconnection network properties:
  - Cost (#switches, #links)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (latency)
  - Accumulated bandwidth
  - Fault tolerance (node or switch failure)

Instruction-Level Parallelism (1): Pipelined Execution Units
- SIMD computing with pipelined vector units, e.g. vector supercomputers: Cray (1970s, 1980s), Fujitsu, ...

Instruction-Level Parallelism (2): VLIW and Superscalar
- Multiple functional units in parallel; 2 main paradigms:
  - VLIW (very long instruction word) architecture: parallelism is explicit, programmer/compiler-managed (hard)
  - Superscalar architecture: sequential instruction stream, hardware-managed dispatch; power + area overhead
- ILP in applications is limited: typically only a few instructions can be issued simultaneously, due to control and data dependences in applications (e.g., data dependences)
- Solution: multithread the application and the processor (hardware multithreading)

SIMD Instructions
- Single Instruction stream, Multiple Data streams: a single thread of control flow, a restricted form of data parallelism
- Apply the same primitive operation (a single instruction) in parallel to multiple data elements stored contiguously
- SIMD units use long vector registers, each holding multiple data elements
- Common today: MMX, SSE, SSE2, SSE3, Altivec, VMX, SPU, ...
- Performance boost for operations on shorter data types
- Area- and energy-efficient
- Code must be rewritten (SIMDized) by the programmer or compiler
- Does not help (much) for memory bandwidth
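To make the SIMD idea concrete, here is a conceptual sketch (my own illustration; the 4-lane width is an assumption, real widths depend on the ISA) of one instruction operating on a whole vector register of elements:

```python
# Conceptual model of SIMD execution: one "instruction" applies the same
# operation to all lanes of a (hypothetical) 4-lane vector register.

VECTOR_WIDTH = 4  # lanes per vector register (illustrative)

def simd_add(xs, ys):
    """Add two arrays, processing VECTOR_WIDTH elements per 'instruction'."""
    assert len(xs) == len(ys)
    result = []
    for i in range(0, len(xs), VECTOR_WIDTH):
        # One SIMD instruction: element-wise add over a whole register.
        result.extend(a + b for a, b in zip(xs[i:i + VECTOR_WIDTH],
                                            ys[i:i + VECTOR_WIDTH]))
    return result

print(simd_add([1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 80]))
# A scalar loop would need 8 add instructions here; the 4-lane unit needs 2.
```

Note that the data elements must lie contiguously for this to work, which is exactly why code usually has to be rewritten (SIMDized) to benefit.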
The Memory Wall
- Performance gap between CPU and memory
- Memory hierarchy; increasing cache sizes shows diminishing returns, and costs power and chip area
- GPUs spend the area instead on many simple cores with little memory; this relies on good data locality in the application
- What if there is no / little data locality? Irregular applications, e.g. sorting, searching, optimization...
- Solution: spread out / overlap the memory access delay
  - Programmer/compiler: prefetching, on-chip pipelining, SW-managed on-chip buffers
  - Generally: hardware multithreading, again!

Moore's Law (since 1965)
- Exponential increase in transistor density
  (Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović)

The Power Issue
- Power = static (leakage) power + dynamic (switching) power
- Dynamic power ~ Voltage^2 * Clock frequency, where clock frequency is approximately ~ voltage
  => Dynamic power ~ Frequency^3
- Total power ~ #processors

Moore's Law vs. Clock Frequency
- #Transistors / mm^2 still growing exponentially according to Moore's Law
- Clock speed flattening out at ~3 GHz
- More transistors + limited frequency => more cores

Conclusion: Moore's Law Continues, But...
- Single-processor performance scaling has flattened out: the gains from pipelining and from RISC/CISC ILP have hit their limits (clock rate, ILP); throughput once grew ~55%/year, while device speed now improves only ~17%/year
  (Chart: log2 speedup vs. process generation, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm. Source: Doug Burger, UT Austin 2005)
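The power relation above is the core of the multicore argument: since dynamic power scales roughly with frequency cubed, two slower cores can match one fast core's throughput at a fraction of the power. A quick model (illustrative numbers only):

```python
# Model of the dynamic-power argument for multicore:
# P_dynamic ~ V^2 * f, and f ~ V, hence P_dynamic ~ f^3.

def dynamic_power(freq, cores=1):
    """Relative dynamic power of `cores` cores at relative frequency `freq`."""
    return cores * freq ** 3

single_fast = dynamic_power(1.0, cores=1)   # one core at full clock
dual_slow   = dynamic_power(0.5, cores=2)   # two cores at half clock
# Same aggregate instruction throughput (2 * 0.5 = 1.0 of the fast core), but:
print(dual_slow / single_fast)  # 0.25, i.e. a quarter of the dynamic power
```

This ignores static (leakage) power and assumes the workload parallelizes perfectly, but it captures why "more transistors + limited frequency => more cores".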
Solution for CPU Design: Multicore + Multithreading
- Single-thread performance does not improve any more: ILP wall, memory wall, power wall
- But we can put more cores on a chip, and hardware-multithread the cores to hide memory latency
- All major chip manufacturers produce multicore CPUs today

Main features of a multicore system
- There are multiple computational cores on the same chip
- The cores might have (small) private on-chip memory modules and/or access to on-chip memory shared by several cores
- The cores have access to a common off-chip main memory
- There is a way by which these cores communicate with each other and/or with the environment

Standard CPU Multicore Designs
- Standard desktop/server CPUs have a few cores with shared off-chip main memory
- On-chip cache (typically 2 levels): the L1 cache is mostly core-private, the L2 cache is often shared by groups of cores
- Memory access interface shared by all cores or by groups of cores
- Caching creates multiple copies of the same data item; writing to one copy (only) causes inconsistency
- Shared-memory coherence mechanism to enforce automatic updating or invalidation of all copies

Some early dual-core CPUs (2004/2005): IBM Power5 (2004, with SMT), AMD Opteron dual-core (2005), Intel Xeon dual-core (2005)
($ = cache, L1$ = level-1 instruction cache, D1$ = level-1 data cache, L2$ = level-2 cache; each with memory controller and shared main memory)
More about shared-memory architecture, caches, data locality, consistency issues and coherence protocols in TDDC78 / TDDD...

SUN / Oracle SPARC-T (Niagara)
- Niagara T1 (2005): 8 cores, 32 HW threads
- Niagara T2 (2008): 8 cores, 64 HW threads
- Niagara T3 (2010): 16 cores, 128 HW threads
- SPARC T5 (2012): 28 nm process, 16 cores x 8 HW threads = 128 HW threads, L3 cache on-chip, on-die accelerators for common encryption algorithms
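How does hardware multithreading (as in the Niagara designs above) hide memory latency? A toy model of my own, not from the slides: if a thread computes for c cycles and then stalls for m cycles on memory, one thread keeps the core busy only c/(c+m) of the time; with k threads the core can switch to a ready thread on every stall.

```python
# Toy model of latency hiding by hardware multithreading.

def core_utilization(c, m, k):
    """Fraction of cycles the core does useful work with k HW threads,
    each alternating c compute cycles with m memory-stall cycles."""
    return min(1.0, k * c / (c + m))

# E.g. 10 compute cycles per 90-cycle memory stall:
for k in (1, 2, 5, 10):
    print(k, core_utilization(c=10, m=90, k=k))
# Utilization grows linearly in k until the threads' compute bursts cover
# the stall time (k = (c + m) / c = 10 here), then saturates at 1.0.
```

This is exactly why throughput-oriented chips (Niagara: 4-8 threads per core; GPUs: far more) multiply threads rather than caches.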
Scaling Up: Network-On-Chip
- Cache-coherent shared memory (hardware-controlled) does not scale well to many cores:
  - power- and area-hungry
  - signal latency across the whole chip
  - not well predictable access times
- Idea: NCC-NUMA, non-cache-coherent, non-uniform memory access
  - Physically distributed on-chip [cache] memory, with an on-chip network connecting PEs or coherent tiles of PEs
  - Global shared address space, but software is responsible for maintaining coherence
- Examples: STI Cell/B.E., Tilera TILE64, Intel SCC, Kalray MPPA

Example: Cell/B.E. (IBM/Sony/Toshiba, 2006)
- An on-chip network (four parallel unidirectional rings) interconnects the master core, the slave cores and the main memory interface
- LS = local on-chip store, PPE = master, SPE = slave

Towards Many-Core CPUs...
- Intel SCC ("Single-chip Cloud Computer", 2010): second-generation many-core research processor, 48 in-order x86 cores
- No longer fully cache-coherent over the entire chip
- MPI-like message passing over a 2D mesh network on chip (Source: Intel)

Towards Many-Core Architectures
- Tilera TILE64 (2007): 64 cores, 8x8 2D-mesh on-chip network; 1 tile = VLIW processor + cache + router, plus memory controllers and I/O at the edges (image simplified)
- Kalray MPPA: tiles with 16 VLIW compute cores each, plus 1 control core per tile; message-passing network on chip; virtually unlimited array extension by clustering several chips; 28 nm CMOS technology; low power dissipation, typically 5 W

Intel Xeon Phi (since late 2012)
- Up to 61 cores, 244 HW threads, 1.2 Tflops peak performance
- Simpler x86 (Pentium-like) cores (x 4 HW threads each), with 512-bit wide SIMD vector registers
- Can also be used as a coprocessor, instead of a GPU
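The quoted 1.2 Tflops figure can be sanity-checked with back-of-envelope arithmetic. The ~1.2 GHz clock and the one-fused-multiply-add-per-lane-per-cycle assumption are mine, not from the slides:

```python
# Back-of-envelope peak floating-point performance estimate.

def peak_gflops(cores, clock_ghz, simd_lanes, flops_per_lane=2):
    # flops_per_lane=2 assumes one fused multiply-add (2 flops) per lane per cycle.
    return cores * clock_ghz * simd_lanes * flops_per_lane

# Xeon Phi-like chip: 61 cores, 512-bit vectors = 8 double-precision lanes,
# assumed ~1.2 GHz clock:
print(peak_gflops(61, 1.2, 8))  # ~1171 GFLOPs, roughly the quoted 1.2 Tflops
```

The same formula explains GPU peak numbers: many more (simpler) cores at a lower clock.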
General-Purpose GPUs
- Main GPU providers for laptop/desktop: Nvidia, AMD (ATI), Intel
- Example: NVIDIA's 10-series GPU (Tesla, 2008) has 240 cores
- Each core has a floating-point / integer unit, a logic unit, a move/compare unit, and a branch unit
- Cores are managed by a thread manager, which can spawn and manage 30,000+ threads, with zero-overhead thread switching
- Nvidia Tesla C1060: 933 GFlops
- Nvidia Fermi (2010): 512 cores; a Fermi C2050 GPU consists of shared-memory multiprocessors (SMs) sharing an L2 cache; 1 SM = 32 streaming processors (SPs, the cores, each with FPU and integer unit), load/store units, special function units, I-cache, scheduler, dispatch, register file, and 64K configurable L1 cache / shared memory

GPU Architecture Paradigm
- Optimized for high throughput: in theory, ~10x to ~100x higher throughput than a CPU is possible
- Massive hardware multithreading hides memory access latency
- Massive parallelism
- GPUs are good at data-parallel computations: multiple threads executing the same instruction on different data, preferably located adjacently in memory

The future will be heterogeneous!
Need 2 kinds of cores, often on the same chip:
- For non-parallelizable code: parallelism only from running several serial applications simultaneously on different cores (e.g. on the desktop: word processor, virus scanner, not much more). Few (ca. 4-8) fat cores (power-hungry, area-costly, caches, out-of-order issue, ...) for high single-thread performance
- For well-parallelizable code (or on cloud servers): hundreds of simple cores (power- and area-efficient, GPU-/SCC-like)

Heterogeneous / Hybrid Multi-/Manycore Systems
- Key concept: master-slave parallelism, offloading
- A general-purpose CPU (master) processor controls the execution of slave processors by submitting tasks to them and transferring operand data to the slaves' local memory
- The master offloads computation to the slaves
- Slaves are often optimized for heavy throughput computing
- The master could do something else while waiting for the result, or switch to a power-saving mode
- Master and slave cores might reside on the same chip (e.g., Cell/B.E.) or on different chips (e.g., most GPU-based systems today)
- Slaves might have access to off-chip main memory (e.g., Cell) or not (e.g., today's GPUs)
- GPU-based system: CPU with main memory offloads heavy computation, with data transfer to/from the GPU's device memory
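Whether offloading pays off depends on the data-transfer cost mentioned above. A toy model of my own (not from the slides) for the offload decision:

```python
# Toy offload model: offloading is worthwhile only if the slave's speedup
# outweighs the cost of transferring operand data to/from its local memory.

def offload_time(t_cpu, slave_speedup, t_transfer):
    """Total time when offloading work that takes t_cpu on the master."""
    return t_transfer + t_cpu / slave_speedup

t_cpu = 100.0  # ms on the master CPU (illustrative)
print(offload_time(t_cpu, slave_speedup=10, t_transfer=5))   # 15.0: worth offloading
print(offload_time(t_cpu, slave_speedup=10, t_transfer=95))  # 105.0: keep it on the CPU
```

The model also shows why the master overlapping transfer with its own work, or keeping data resident on the device across kernels, matters so much in practice.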
Multi-GPU Systems
- Connect one or a few general-purpose (CPU) multicore processors, with shared off-chip memory, to several GPUs
- Increasingly popular in high-performance computing
- Cost- and (quite) energy-effective if the offloaded computation fits the GPU architecture well

Reconfigurable Computing Units
- FPGA: Field-Programmable Gate Array

Example: Beowulf-class PC clusters, built with off-the-shelf CPUs (Xeon, Opteron, ...)

Cluster Example: Triolith (NSC, 2012/2013)
- Capability cluster (fast network for parallel applications)
- Final configuration: 1200 HP SL230 compute nodes, each equipped with 2 Intel 2.2 GHz Sandy Bridge processors with 8 cores each (19,200 cores in total)
- Theoretical peak performance of 338 Tflops/s
- Mellanox Infiniband network (Source: NSC)

The Challenge
- Today, basically all computers are parallel computers!
  - Single-thread performance stagnating
  - Dozens, hundreds of cores; hundreds, thousands of hardware threads
  - Heterogeneous (core types, accelerators)
  - Data locality matters
- Clusters for HPC require message passing
- Utilizing more than one CPU core requires thread-level parallelism
- One of the biggest software challenges: exploiting parallelism
  - Every programmer will eventually have to deal with it
  - All application areas, not only traditional HPC: general-purpose, graphics, games, embedded, DSP, ...
  - Affects HW/SW system architecture, programming languages, algorithms, data structures
  - Parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies), and thus more expensive and time-consuming

Can't the compiler fix it for us? Automatic parallelization?
- At compile time:
  - Requires static analysis: not effective for pointer-based languages
  - Needs programmer hints / rewriting...
  - OK for a few benign special cases: (Fortran) loop SIMDization, extraction of instruction-level parallelism, ...
- At run time (e.g. speculative multithreading):
  - High overheads, not scalable
- More about parallelizing compilers in TDDD56 + TDDC...
And worse yet...
- A lot of variations/choices in hardware; many will have performance implications
- No standard parallel programming model: a portability issue
- Understanding the hardware will make it easier to make programs get high performance
- Performance-aware programming gets more important, also for single-threaded code
- Adaptation leads to the portability issue again: how to write future-proof parallel programs?

The Challenge
- Bad news 1: Many programmers (also less skilled ones) need to use parallel programming in the future
- Bad news 2: There will be no single uniform parallel programming model, as we were used to in the old sequential times
  - Several competing general-purpose and domain-specific languages and their concepts will co-exist

What we learned in the past...
- The sequential von Neumann model: programming, algorithms, data structures, complexity
- Sequential / few-threaded languages: C/C++, Java, ...; not designed for exploiting massive parallelism
- E.g. sorting time as a function of problem size: T(n) = O(n log n)

...and what we need now
- Parallel programming!
- Parallel algorithms and data structures
- Analysis / cost model: parallel time, work, cost; scalability
- Performance-awareness: data locality, load balancing, communication
- Time as a function of problem size n and the number p of processing units used: T(n,p) = O((n log n)/p + log p)

Questions?
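The cost model on the last slide can be evaluated directly. A minimal sketch (constants dropped, base-2 logarithms assumed) of the speedup it predicts for a parallel sort:

```python
# Evaluating T(n, p) = (n log n)/p + log p against the sequential
# T(n) = n log n from the previous slide.

from math import log2

def t_seq(n):
    return n * log2(n)

def t_par(n, p):
    return n * log2(n) / p + log2(p)

n = 1 << 20  # about a million elements
for p in (1, 4, 16, 64):
    print(p, round(t_seq(n) / t_par(n, p), 1))  # predicted speedup
# For p << n the log p term is negligible, so speedup is roughly p;
# only when p approaches n does the log p overhead start to dominate.
```

This is exactly the kind of scalability analysis ("parallel time, work, cost") the course develops.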
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationParallelism in Hardware
Parallelism in Hardware Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3 Moore s Law
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationECE/CS 250 Computer Architecture. Summer 2016
ECE/CS 250 Computer Architecture Summer 2016 Multicore Dan Sorin and Tyler Bletsch Duke University Multicore and Multithreaded Processors Why multicore? Thread-level parallelism Multithreaded cores Multiprocessors
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationLecture 8: RISC & Parallel Computers. Parallel computers
Lecture 8: RISC & Parallel Computers RISC vs CISC computers Parallel computers Final remarks Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation in computer
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationGPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP
GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationToday. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationCCS HPC. Interconnection Network. PC MPP (Massively Parallel Processor) MPP IBM
CCS HC taisuke@cs.tsukuba.ac.jp 1 2 CU memoryi/o 2 2 4single chipmulti-core CU 10 C CM (Massively arallel rocessor) M IBM BlueGene/L 65536 Interconnection Network 3 4 (distributed memory system) (shared
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology
More informationLecture 21: Parallelism ILP to Multicores. Parallel Processing 101
18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationIssues in Multiprocessors
Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationChapter 1: Perspectives
Chapter 1: Perspectives Copyright @ 2005-2008 Yan Solihin Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical,
More informationFundamentals of Computer Design
CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationLecture 1: Gentle Introduction to GPUs
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationPARALLEL COMPUTER ARCHITECTURES
8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different
More informationIntroduction. CSCI 4850/5850 High-Performance Computing Spring 2018
Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel
More informationChapter 2 Parallel Computer Architecture
Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationExperts in Application Acceleration Synective Labs AB
Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More information