INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática
Architectures for Embedded Computing MEIC-A, MEIC-T, MERC
Lecture Slides - English
Lecture 12
Title: Multiprocessors - Synchronization
Summary: Synchronization; Multi-Processor Systems; Examples - Cell and GPUs.
2010/2011 Nuno.Roma@ist.utl.pt
Architectures for Embedded Computing - Multiprocessors: Synchronization
Prof. Nuno Roma ACE 2010/11 - DEI-IST

Previous Class
In the previous class:
Multiprocessor classification;
MIMD architectures: shared memory; distributed memory (distributed shared memory; multi-computers);
Memory coherency and consistency.
Road Map

Summary
Today: Synchronization; Multi-Processor Systems (examples): Cell (STI - Sony, Toshiba, IBM); GPUs (NVidia, ATI).
Bibliography: Computer Architecture: A Quantitative Approach, Chapter 4.
Synchronization
Software synchronization mechanisms are usually built upon hardware primitives. Usually, the processors include read-modify-write instructions, which allow the implementation of atomic read and write operations to a given memory position:
Without interruptions;
Without losing the bus ownership.
Examples: TAS (test-and-set); Fetch-and-increment; Exchange Ri,M[semaphore].
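The read-modify-write primitives listed above can be sketched with C11 atomics (a minimal sketch; the function names are illustrative, not a real ISA or library API):

```c
/* Sketch of the three hardware primitives above, expressed with C11
 * atomics; the function names are illustrative, not a real ISA. */
#include <stdatomic.h>

/* test-and-set (TAS): atomically set the flag, returning its old value */
int test_and_set(atomic_int *flag) {
    return atomic_exchange(flag, 1);
}

/* fetch-and-increment: atomically add 1, returning the old value */
int fetch_and_increment(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
}

/* exchange (the EXCH Ri,M[semaphore] idiom): swap a value with memory */
int exchange(atomic_int *mem, int reg) {
    return atomic_exchange(mem, reg);
}
```

Each of these reads the old value and writes the new one as a single indivisible step, which is exactly what makes them usable as building blocks for locks.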
Example
Atomic exchange of the values between Ri and M[sem]. Semaphore sem controls the access to a mutual exclusion zone: 0 = free, 1 = busy.
Two processes simultaneously try to get into the exclusion zone; both processes try to set Ri to 1 and execute the instruction: Exchange Ri,M[sem].
Only one process will get the zero value in Ri.

Synchronization Routines
Spin locks: provide exclusive access to a shared resource;
Barriers: provide synchronization in the execution of a set of processes.
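The race just described can be sketched with two POSIX threads (function names such as try_enter and winners are illustrative): both perform an atomic exchange on the semaphore, and exactly one of them reads back 0.

```c
/* Sketch of the race above: two threads both execute an atomic exchange
 * on the semaphore; exactly one reads back 0 (free) and wins entry to
 * the exclusion zone. */
#include <pthread.h>
#include <stdatomic.h>

static atomic_int sem;               /* 0 = free, 1 = busy */

static void *try_enter(void *arg) {
    /* Exchange Ri,M[sem], with Ri previously set to 1 */
    *(int *)arg = atomic_exchange(&sem, 1);
    return NULL;
}

/* Returns how many of the two contenders observed sem == 0 (won). */
int winners(void) {
    pthread_t t1, t2;
    int r1 = -1, r2 = -1;
    atomic_store(&sem, 0);           /* exclusion zone starts free */
    pthread_create(&t1, NULL, try_enter, &r1);
    pthread_create(&t2, NULL, try_enter, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return (r1 == 0) + (r2 == 0);
}
```

Whatever the interleaving, the second exchange always finds the value 1 left by the first, so exactly one thread wins.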
Spin Locks
Spin locks are implemented by polling a given resource within a waiting loop:
It is expected that the waiting time is short;
The main objective is to minimize the latency when accessing the resource.

      DADDUI R2,R0,#1
lock: EXCH   R2,0(R1)    ; atomic exchange
      BNEZ   R2,lock     ; locked?
      ...                ; critical section
      EXCH   R2,0(R1)    ; unlock

What about caching the lock variable?
Advantages:
Avoids using the bus/network to read/write the main memory;
Great probability that the lock will be used again by the same processor in the near future.

Cached Spin Locks
Problem: with read-modify-write operations, each test implies a write operation. Critical if more than one processor is in the waiting loop!
Solution: stay in a reading loop, and only use the atomic exchange operation when the resource is free:

lock: LD     R2,0(R1)    ; non-atomic read
      BNEZ   R2,lock
      DADDUI R2,R0,#1
      EXCH   R2,0(R1)    ; atomic exchange
      BNEZ   R2,lock     ; locked?
      ...                ; critical section
      EXCH   R2,0(R1)    ; unlock

It still leads to a significant amount of traffic after the unlock operation...

Barriers
The processes wait until all processes have reached the barrier; only then are the processes (all) released.
The implementation uses two locking mechanisms:
One semaphore to protect the increment operation over the counter that registers the number of processes that were already blocked;
Another semaphore to block the processes until the counter reaches the total number of running processes.
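The counter-based barrier described above can be sketched in C with POSIX threads (a single-use barrier; a pthread mutex and condition variable stand in for the two semaphores, and all names are illustrative):

```c
/* Sketch of the counter-based barrier above: a mutex protects the
 * arrival counter, and a condition variable blocks the processes until
 * the counter reaches the total (single-use barrier). */
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    pthread_mutex_t lock;    /* protects the increment of the counter */
    pthread_cond_t  all_in;  /* blocks processes until everyone arrived */
    int count;               /* processes already blocked at the barrier */
    int total;               /* total number of running processes */
} barrier_t;

void barrier_init(barrier_t *b, int total) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_in, NULL);
    b->count = 0;
    b->total = total;
}

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    if (++b->count == b->total)
        pthread_cond_broadcast(&b->all_in);      /* last one releases all */
    else
        while (b->count < b->total)              /* wait for the others */
            pthread_cond_wait(&b->all_in, &b->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Demo: each worker registers its arrival, crosses the barrier, and
 * reports how many arrivals it observed afterwards (must be all). */
static barrier_t bar;
static atomic_int arrived;

static void *worker(void *arg) {
    (void)arg;
    atomic_fetch_add(&arrived, 1);
    barrier_wait(&bar);
    return (void *)(long)atomic_load(&arrived);
}

int barrier_demo(int n) {
    pthread_t t[16];
    int ok = 0;
    atomic_store(&arrived, 0);
    barrier_init(&bar, n);
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) {
        void *r;
        pthread_join(t[i], &r);
        ok += ((long)r == n);                    /* saw all n arrivals? */
    }
    return ok;
}
```

Because no thread passes the barrier before the counter reaches the total, every worker necessarily observes all n arrivals after it is released.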
Multiprocessor Classes - Flynn's Taxonomy
SISD (Single Instruction, Single Data): uniprocessor case;
SIMD (Single Instruction, Multiple Data): the same instruction is executed in the several processors, but each processor operates on an independent data set: vector architectures;
MISD (Multiple Instruction, Single Data): each processor executes a different instruction, but all process the same data set: there is no commercial solution of this type;
MIMD (Multiple Instruction, Multiple Data): each processor executes independent instructions over an independent data set.

SIMD Systems
Objective: simultaneous execution of a given operation over a significant number of operands;
Particularly suited to the parallel processing of significant amounts of data.
Examples: Graphics Processing Units (GPUs).
Classification of Multi-Processor Systems

General Purpose Processors (GPP):
Architecture: homogeneous;
Memory management: hardware;
Examples: Intel, AMD, IBM Power, SUN, etc. multi-core families.

Dedicated Processors / Accelerators:
Architecture: heterogeneous;
Memory management: miscellaneous (hardware + software);
Examples: Cell (PS3); GPUs (NVidia); FPGA/ASIC dedicated accelerators.
Cell Broadband Engine Architecture (CBEA)
Proposed by Sony, Toshiba and IBM (STI), in 2000;
In the market since 2006 (Sony Playstation 3).

Characteristics of the Cell Architecture
9-core multi-processor: 1 PowerPC Processing Element (PPE); 8 Synergistic Processing Elements (SPE);
Processing elements interconnected by a high performance bus - the Element Interconnect Bus (EIB);
Maximum operating frequency: 4 GHz;
Maximum performance greater than 256 GFLOPs.
Cell Architecture (figure)

PowerPC Processing Element (PPE)
Function:
Multi-processor resources management;
Partitioning of the data under processing into smaller blocks, to be subsequently distributed to the SPEs using the EIB;
SPE management and allocation;
SPE synchronization and scheduling.
Characteristics:
General Purpose Processor (GPP); 64 bits; dual-threading;
2-issue in-order (2 simultaneously executed instructions);
Reduced Instruction Set Computer (RISC);
Cache: L1 - 32 kB; L2 - 512 kB.
Synergistic Processing Element (SPE)
Characteristics:
SIMD architecture (vector processor);
128 registers, each one 128-bit wide;
1 Synergistic Processing Unit (SPU): 4 single-precision units; 1 double-precision unit;
2-issue;
Does NOT have a cache; instead, a Local Store (LS): 256 kB of private memory;
Load-store instructions operate over the private LS;
Access to the external memory is only accomplished with DMA, using the EIB.

Cell Architecture (figure)
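The Local Store model above can be sketched in plain C (a host-side simulation under stated assumptions: memcpy stands in for the DMA transfers, and names such as dma_get/dma_put and LS_TILE are illustrative, not the Cell SDK API):

```c
/* Sketch of SPE-style processing: the kernel may only touch a small
 * private buffer (the "Local Store"); "DMA" transfers, simulated here
 * by memcpy, move tiles between main memory and that buffer. */
#include <string.h>

#define LS_TILE 64                        /* tile that fits in the LS */

static void dma_get(float *ls, const float *main_mem, int n) {
    memcpy(ls, main_mem, n * sizeof(float));   /* main memory -> LS */
}
static void dma_put(float *main_mem, const float *ls, int n) {
    memcpy(main_mem, ls, n * sizeof(float));   /* LS -> main memory */
}

/* Scale a large array in main memory, tile by tile, through the LS. */
void spe_scale(float *data, int n, float k) {
    float ls[LS_TILE];                    /* private Local Store buffer */
    for (int base = 0; base < n; base += LS_TILE) {
        int len = (n - base < LS_TILE) ? n - base : LS_TILE;
        dma_get(ls, data + base, len);    /* fetch one tile */
        for (int i = 0; i < len; i++)     /* compute only on the LS */
            ls[i] *= k;
        dma_put(data + base, ls, len);    /* write the tile back */
    }
}
```

The explicit get/compute/put structure is the point: since the SPE has no cache, the programmer (not the hardware) decides when each block of data moves.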
Element Interface Bus (EIB)
Characteristics:
Responsible for the transfer of all data inside the processor;
Four 128-bit channels: 2 clockwise communication rings; 2 anti-clockwise communication rings;
Implements the interface with the main memory and the IO bus (FlexIO).

Programming Model
The PPE executes applications compiled with the Power (32-bit or 64-bit) or PowerPC compilers, without any modification needed;
To achieve satisfactory performance and efficiency levels, it is essential to use the SPEs.
Problems:
Different programming model (vectorial);
Scarce local memory (Local Store) within each SPE (256 kB);
DMA transfer mechanisms are mandatory, in order to transfer the data between the main memory and each Local Store;
Latency imposed on the accesses to main memory and IO ports.
These problems impose important restrictions which significantly limit the global system performance, and impose a (much) greater complexity on the programmer.
Performance
Example: matrix multiplication (figure).
SOURCE: Henrique Costa, Multiprocessor Platforms for Natural Language Processing, MSc Thesis, IST.
GPUs
Despite being originally developed for graphical applications, GPUs have gradually been adopted for general purpose processing (GPGPU - General Purpose Graphical Processing Unit);
Combination of massively parallel and vectorial architectures.
Specialized architecture, particularly targeted to the implementation of highly complex processing of 3D graphics:
Real-time rendering: millions of pixels per second;
The processing of each pixel requires hundreds of operations;
Parallel processing structures.
Advantages: speed; low cost; low power consumption.
Disadvantages: very specialized; complex programming model; bandwidth problems; volatile market.

Comparison of CPU vs. GPU
With GPUs, most resources are allocated to data processing;
By sharing the same control logic, more resources can be allocated to ALUs: SIMD!!!
GPU Architecture
Characteristics:
Massive exploitation of parallelism among the operations;
Application of the SIMD paradigm: at the thread level; at the data level.

Characteristics (e.g.: NVidia):
Several multi-processors;
Each multi-processor is an aggregate of several 32-bit SIMD processing elements;
In each clock cycle, each multi-processor executes the same instruction on a group of threads (warp);
Absence of explicit communication between the processing elements;
Absence of any cache coherency mechanisms;
256 kB of cache, shared by all processing elements.
GPU Architecture
Example: NVidia GeForce 8800
Several multi-processors, with 8 vector processors in each multi-processor;
330 GFLOPs.

Each Streaming Multi-processor (SM) offers:
8 Streaming Processors (SP);
Performance of about GHz;
bits registers;
Shared program and data caches;
16 kB of shared memory.
GPU Architecture
Example: ATI Radeon HD 2900XT (figure)

Programming Model
1. A program is defined as a set of similar threads;
2. The code that is executed by each thread is programmed using a Single-Program Multiple-Data (SPMD) paradigm;
3. The result of each thread is obtained as a combination of several mathematical operations and read/write accesses to the main memory;
4. Data stored in the main memory may subsequently be used as input to other threads.
Programming Languages:
Graphics APIs: OpenGL, DirectX;
Generic APIs: CUDA, OpenCL, etc.
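The SPMD model above can be sketched in C (a sequential emulation of the thread grid; the CUDA-style index computation and all names are illustrative assumptions):

```c
/* Sketch of SPMD execution: every thread runs the same kernel and uses
 * its own (block, thread) identifiers to pick the element it processes. */
typedef struct { int block_dim; int block_id; int thread_id; } thread_ctx;

/* The kernel each thread executes: one output element per thread. */
static void vec_add_kernel(thread_ctx t, const float *a, const float *b,
                           float *c, int n) {
    int i = t.block_id * t.block_dim + t.thread_id;
    if (i < n)                     /* guard threads past the array end */
        c[i] = a[i] + b[i];
}

/* "Launch": iterate over all (block, thread) pairs of the grid.
 * On a real GPU these iterations run in parallel, warp by warp. */
void vec_add(const float *a, const float *b, float *c, int n, int block_dim) {
    int blocks = (n + block_dim - 1) / block_dim;
    for (int bid = 0; bid < blocks; bid++)
        for (int tid = 0; tid < block_dim; tid++)
            vec_add_kernel((thread_ctx){block_dim, bid, tid}, a, b, c, n);
}
```

Note that the kernel contains no loop over the data: the loop is replaced by the grid of threads, which is what makes the same program scale across many processing elements.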
Programming Model
Example: vector addition (figure)

Difficulties
Non-regular execution patterns (e.g.: branches):
Threads with the same program are grouped in warps;
To keep a regular execution flow, all possible ramifications of the execution pattern are simultaneously considered (both directions of the branch instruction: taken and not-taken);
Subsequently, some threads become inactive, as a consequence of incorrect branch ramifications.
High latency:
Significant input and output data latency between the GPU and the CPU;
Masked by massively exploiting multi-threading.
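The branch handling just described can be sketched in C (an emulation of one warp: both directions of the branch are executed for all threads, and a per-thread mask selects the valid result; the names and warp size are illustrative):

```c
/* Sketch of warp-level branch handling: the whole warp executes both
 * the taken and the not-taken path, and a per-thread mask deactivates
 * the threads on the wrong side of the branch. */
#define WARP 4

/* Each thread computes: x > 0 ? x * 2 : -x. */
void warp_branch(const int *x, int *out) {
    int taken[WARP];
    for (int t = 0; t < WARP; t++)        /* evaluate the condition */
        taken[t] = (x[t] > 0);

    int then_val[WARP], else_val[WARP];
    for (int t = 0; t < WARP; t++)        /* all threads run the taken path */
        then_val[t] = x[t] * 2;
    for (int t = 0; t < WARP; t++)        /* ...and the not-taken path */
        else_val[t] = -x[t];

    for (int t = 0; t < WARP; t++)        /* the mask selects the result */
        out[t] = taken[t] ? then_val[t] : else_val[t];
}
```

This is why divergent branches are costly: with the threads split between the two directions, the warp pays for both paths while half of its results are discarded.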
Performance
GPUs quickly surpassed the performance offered by general purpose CPUs.

Comparison: CPU, Cell and GPU
Example: matrix multiplication (figure).
SOURCE: Henrique Costa, Multiprocessor Platforms for Natural Language Processing, MSc Thesis, IST.
Future
Hybrid multi-core architectures, incorporating general purpose cores (CPU) and graphical accelerators (GPU). Examples:
Fusion (AMD & ATI - first half of 2011);
Larrabee (Intel - postponed...).
Next Class
Memory systems;
Program access patterns;
Cache memories: operation principles; internal organization; cache management policies.
More informationComp. Org II, Spring
Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel
More informationIntroduction II. Overview
Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and
More informationChapter 18 Parallel Processing
Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationM4 Parallelism. Implementation of Locks Cache Coherence
M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationComp. Org II, Spring
Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationProcessor Architectures
ECPE 170 Jeff Shafer University of the Pacific Processor Architectures 2 Schedule Exam 3 Tuesday, December 6 th Caches Virtual Memory Input / Output OperaKng Systems Compilers & Assemblers Processor Architecture
More informationIntroduction to parallel computing
Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/
More informationUsing Graphics Chips for General Purpose Computation
White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationRoadrunner. By Diana Lleva Julissa Campos Justina Tandar
Roadrunner By Diana Lleva Julissa Campos Justina Tandar Overview Roadrunner background On-Chip Interconnect Number of Cores Memory Hierarchy Pipeline Organization Multithreading Organization Roadrunner
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationSMD149 - Operating Systems - Multiprocessing
SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction
More informationOverview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy
Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationNon-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
More informationComputer Organization. Chapter 16
William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data
More informationParallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence
Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly
More informationCS 220: Introduction to Parallel Computing. Introduction to CUDA. Lecture 28
CS 220: Introduction to Parallel Computing Introduction to CUDA Lecture 28 Today s Schedule Project 4 Read-Write Locks Introduction to CUDA 5/2/18 CS 220: Parallel Computing 2 Today s Schedule Project
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationWhat Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others
What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University * slides thanks to Kavita Bala & many others Final Project Demo Sign-Up: Will be posted outside my office after lecture today.
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More information