INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Departamento de Engenharia Informática

Architectures for Embedded Computing
MEIC-A, MEIC-T, MERC
Lecture Slides Version English
Lecture 12
Title: Multiprocessors - Synchronization and [...]
Summary: Synchronization and Multi-Processor Systems; [...]; Examples - Cell and GPUs.
2010/2011
Nuno.Roma@ist.utl.pt

Architectures for Embedded Computing
Multiprocessors: Synchronization and [...]
Prof. Nuno Roma ACE 2010/11 - DEI-IST 1 / 43

Previous Class
In the previous class...
- Multiprocessor classification;
- MIMD architectures:
  - Shared memory;
  - Distributed memory:
    - Distributed shared memory;
    - Multi-computers;
- Memory coherency and consistency.

Road Map

Summary
Today:
- Synchronization and Multi-Processor Systems;
- [...] (examples):
  - Cell (STI - Sony, Toshiba, IBM);
  - GPUs (NVidia, ATI).
Bibliography:
- Computer Architecture: a Quantitative Approach, Chapter 4

Software synchronization mechanisms are usually built on top of hardware primitives.
Processors usually include read-modify-write instructions, which allow a given memory position to be read and written atomically:
- Without interruptions;
- Without losing bus ownership.
Examples:
- TAS (test-and-set);
- Fetch-and-increment;
- Exchange Ri,M[semaphore].

Example
Atomic exchange of the values between Ri and M[sem].
Example: semaphore sem controls the access to a mutual exclusion zone: 0 = free, 1 = busy.
- Two processes simultaneously try to enter the exclusion zone;
- Both processes set Ri to 1 and execute the instruction Exchange Ri,M[sem], which writes 1 into the semaphore and returns its previous value in Ri;
- Only one process will get the value zero in Ri. That process enters the zone; the other one reads 1 and must wait.

Routines
- Spin locks: provide exclusive access to a shared resource.
- Barriers: provide synchronization in the execution of a set of processes.

Spin Locks
Spin locks are executed by polling a given resource within a waiting loop:
- The waiting time is expected to be short;
- The main objective is to minimize the latency when accessing the resource.

        DADDUI R2,R0,#1
  lock: EXCH   R2,0(R1)    ; atomic exchange
        BNEZ   R2,lock     ; locked?
        ...
        EXCH   R2,0(R1)    ; unlock
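The exchange-based loop above can be sketched in portable C11. This is a minimal illustration, not the lecture's code: the names `xchg_lock`, `xchg_lock_try`, `xchg_lock_acquire` and `xchg_lock_release` are hypothetical, `atomic_exchange_explicit` plays the role of the EXCH instruction, and the unlock is written as a plain atomic store of 0 instead of a second exchange.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Lock word convention taken from the semaphore example: 0 = free, 1 = busy. */
typedef struct {
    atomic_int word;
} xchg_lock;

void xchg_lock_init(xchg_lock *l) {
    atomic_init(&l->word, 0);
}

/* One exchange attempt: write 1 and look at the previous value.
 * Returns true only for the caller that observed 0 (the lock was free). */
bool xchg_lock_try(xchg_lock *l) {
    return atomic_exchange_explicit(&l->word, 1, memory_order_acquire) == 0;
}

/* Spin until the exchange returns 0. Every iteration performs an atomic
 * write to the lock word, even while the lock is busy. */
void xchg_lock_acquire(xchg_lock *l) {
    while (!xchg_lock_try(l)) {
        /* busy-wait */
    }
}

/* Unlock: store 0 so that the next exchange observes a free lock. */
void xchg_lock_release(xchg_lock *l) {
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

In a real program several threads would call `xchg_lock_acquire` and `xchg_lock_release` around the code that touches the shared resource.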

What about caching the lock variable?
Advantages:
- Avoids using the bus/network to read/write the main memory;
- High probability that the lock will be used again by the same processor in the near future.

Cached Spin Locks
Problem: with read-modify-write operations, each test implies a write operation.
- Each write invalidates the copies of the lock held in the other processors' caches;
- Critical if more than one processor is in the waiting loop!

Solution: stay in a read-only loop, and only use the atomic exchange operation when the resource appears to be free:

  lock: LD     R2,0(R1)    ; non-atomic read
        BNEZ   R2,lock     ; still locked: keep reading
        DADDUI R2,R0,#1
        EXCH   R2,0(R1)    ; atomic exchange
        BNEZ   R2,lock     ; locked?
        ...
        EXCH   R2,0(R1)    ; unlock

This still leads to a significant amount of traffic right after the unlock operation...

Barriers
The processes wait until all of them have reached the barrier; only then are they all released.
The implementation uses two locking mechanisms:
- One semaphore protects the increment of the counter that records how many processes have already reached the barrier;
- Another semaphore blocks the processes until the counter reaches the total number of running processes.
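The read-then-exchange loop and the counter-based barrier can be sketched together in C11, since the barrier needs a lock to protect its counter. The sketch rests on several assumptions that are not in the slides: all names (`ttas_lock`, `sr_barrier`, `barrier_wait`, ...) are hypothetical, the second "semaphore" is modelled as a flag the waiting threads spin on, and a sense-reversal flag per thread is added so that the barrier can be reused in a loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int word;               /* 0 = free, 1 = busy */
} ttas_lock;

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* Read-only wait: while the lock is busy this loop only reads. */
        while (atomic_load_explicit(&l->word, memory_order_relaxed) != 0) {
            /* busy-wait */
        }
        /* The lock looked free: only now attempt the atomic exchange. */
        if (atomic_exchange_explicit(&l->word, 1, memory_order_acquire) == 0) {
            return;                /* old value was 0: lock acquired */
        }
    }
}

void ttas_release(ttas_lock *l) {
    atomic_store_explicit(&l->word, 0, memory_order_release);
}

typedef struct {
    ttas_lock   counter_lock;      /* protects the arrival counter */
    int         count;             /* processes that have already arrived */
    int         total;             /* processes expected at the barrier */
    atomic_bool release;           /* flag the waiting processes spin on */
} sr_barrier;

void barrier_init(sr_barrier *b, int total) {
    atomic_init(&b->counter_lock.word, 0);
    b->count = 0;
    b->total = total;
    atomic_init(&b->release, false);
}

/* Each thread keeps its own local_sense variable, initially false. */
void barrier_wait(sr_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;  /* value that will mean "released" */

    ttas_acquire(&b->counter_lock);
    bool last = (++b->count == b->total);
    if (last) {
        b->count = 0;              /* reset for the next use of the barrier */
    }
    ttas_release(&b->counter_lock);

    if (last) {
        /* Last arrival releases everybody. */
        atomic_store_explicit(&b->release, *local_sense, memory_order_release);
    } else {
        /* Everyone else waits until the release flag matches their sense. */
        while (atomic_load_explicit(&b->release, memory_order_acquire) != *local_sense) {
            /* busy-wait */
        }
    }
}
```

Each participating thread would call `barrier_wait` with the same `sr_barrier` and its own `local_sense`; with `total` set to 1 the call returns immediately, which is a convenient way to check the counter logic without creating threads.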

Multiprocessor Classes
Flynn's Taxonomy
- SISD (Single Instruction, Single Data): uniprocessor case;
- SIMD (Single Instruction, Multiple Data): the same instruction is executed in several processors, but each processor operates on an independent data set:
  - Vector architectures;
- MISD (Multiple Instruction, Single Data): each processor executes a different instruction, but all of them process the same data set:
  - There is no commercial solution of this type;
- MIMD (Multiple Instruction, Multiple Data): each processor executes independent instructions over an independent data set.

SIMD Systems
- Objective: simultaneous execution of a given operation over a large number of operands;
- Particularly suited to the parallel processing of large amounts of data.
Examples:
- Engine Units (GPUs)

Classification of Multi-Processor Systems

Type | Architecture | Memory Management | Examples
General Purpose Processor (GPP) | Homogeneous | Hardware | Intel, AMD, IBM Power, SUN, etc. multi-core families
Dedicated Processors / Accelerators | Heterogeneous | Misc. Hardware + Software | Cell (PS3); GPUs (NVidia); FPGA/ASIC dedicated accelerators

Cell Broadband Engine Architecture (CBEA)
- Proposed by Sony, Toshiba and IBM (STI), in 2000;
- On the market since 2006 (Sony PlayStation 3).

Characteristics of the Cell architecture
- 9-core multiprocessor:
  - 1 PowerPC Processing Element (PPE);
  - 8 Synergistic Processing Elements (SPE);
- Processing elements interconnected by a high-performance bus, the Element Interconnect Bus (EIB);
- Maximum operating frequency: 4 GHz;
- Maximum performance greater than 256 GFLOPS;
- [...] transistors.

Cell Architecture

PowerPC Processing Element (PPE)
Function:
- Management of the multiprocessor resources;
- Partitioning of the data being processed into smaller blocks, which are subsequently distributed to the SPEs over the EIB;
- SPE management and allocation;
- SPE synchronization and scheduling.
Characteristics:
- General Purpose Processor (GPP);
- 64-bit;
- Dual-threaded;
- 2-issue in-order (2 instructions executed simultaneously);
- Reduced Instruction Set Computing (RISC);
- Cache: L1 - 32 kB; L2 - 512 kB.

Synergistic Processing Element (SPE)
Characteristics:
- SIMD architecture (vector processor);
- 128 registers, each 128 bits wide;
- 1 Synergistic Processing Unit (SPU):
  - 4 single-precision units;
  - 1 double-precision unit;
- 2-issue;
- [...] transistors;
- Does NOT have a cache:
  - Local Store (LS): 256 kB of private memory;
  - Load/store instructions operate on the private LS;
  - Access to the external memory is only possible through DMA, over the EIB.
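As a rough cross-check of the peak figure quoted for the Cell, the single-precision throughput of the 8 SPEs can be estimated from the numbers on the slides. The estimate below assumes that each of the 4 single-precision units completes one fused multiply-add per cycle and that this counts as 2 floating-point operations; that assumption is not stated in the slides, and the contribution of the PPE is ignored.

```latex
\[
P_{\text{peak}} \approx
\underbrace{8}_{\text{SPEs}} \times
\underbrace{4}_{\text{SP units per SPE}} \times
\underbrace{2}_{\text{FLOP per multiply-add}} \times
\underbrace{4\ \text{GHz}}_{\text{clock}}
= 256\ \text{GFLOPS}
\]
```

Under these assumptions the SPEs alone account for 256 GFLOPS at 4 GHz, which is consistent with a quoted maximum "greater than 256 GFLOPS" once the PPE is added.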

Element Interconnect Bus (EIB)
Characteristics:
- Responsible for all data transfers inside the processor;
- Four 128-bit channels:
  - 2 clockwise communication rings;
  - 2 counter-clockwise communication rings;
- Implements the interface with the main memory and the I/O bus (FlexIO).

Programming Model
- The PPE executes applications compiled with the Power (32-bit or 64-bit) or PowerPC compilers, without any modification;
- To achieve satisfactory performance and efficiency levels, it is essential to use the SPEs.
Problems:
- Different programming model (vector);
- Scarce local memory (Local Store) within each SPE (256 kB);
- DMA transfers are mandatory to move data between the main memory and each Local Store;
- Latency of the accesses to main memory and I/O ports;
- Imposes important restrictions, which significantly limit the global system performance;
- Imposes a (much) greater complexity on the programmer.
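The Local Store limit and the mandatory DMA transfers shape SPE code roughly as in the following C sketch. It is not Cell SDK code: `dma_get` and `dma_put` are hypothetical helpers that use `memcpy` as a synchronous stand-in for asynchronous DMA, and `CHUNK`, `local_store` and `scale_blocked` are names chosen for illustration. The two buffers show the usual double-buffering idea, where the next block is fetched while the current one is being processed.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 1024   /* floats per block; illustrative, must fit in the Local Store */

/* Hypothetical DMA stand-ins. On a real SPE these would start asynchronous
 * transfers, and the code would wait for completion before using a buffer. */
static void dma_get(float *local, const float *main_mem, size_t n) {
    memcpy(local, main_mem, n * sizeof *local);
}

static void dma_put(float *main_mem, const float *local, size_t n) {
    memcpy(main_mem, local, n * sizeof *local);
}

/* Multiply every element of a large array in "main memory" by k,
 * one block at a time, using two buffers in the "Local Store". */
void scale_blocked(float *data, size_t n, float k) {
    static float local_store[2][CHUNK];
    size_t nblocks = (n + CHUNK - 1) / CHUNK;
    if (nblocks == 0) {
        return;
    }

    /* Fetch the first block before entering the loop. */
    dma_get(local_store[0], data, n < CHUNK ? n : CHUNK);

    for (size_t b = 0; b < nblocks; b++) {
        size_t off = b * CHUNK;
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        int cur = (int)(b & 1u);

        /* Start fetching the next block into the other buffer. */
        if (b + 1 < nblocks) {
            size_t noff = off + CHUNK;
            size_t nlen = (n - noff < CHUNK) ? n - noff : CHUNK;
            dma_get(local_store[cur ^ 1], data + noff, nlen);
        }

        /* Compute only on data that is already in the Local Store. */
        for (size_t i = 0; i < len; i++) {
            local_store[cur][i] *= k;
        }

        /* Write the processed block back to main memory. */
        dma_put(data + off, local_store[cur], len);
    }
}
```

Even for this trivial kernel, the blocking, buffer management and transfer bookkeeping dominate the code, which illustrates the extra complexity the programming model places on the programmer.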

Performance
Example: matrix multiplication
SOURCE: Henrique Costa, Multiprocessor Platforms for Natural Language Processing, MSc Thesis, IST, May

Although GPUs were originally developed for graphics applications, they have gradually been adopted for general-purpose processing (GPGPU - General-Purpose Graphics Processing Unit).
- Combination of massively parallel and vector architectures.

Specialized architecture, particularly targeted at the highly complex processing of 3D graphics:
- Real-time rendering: millions of pixels per second;
- The processing of each pixel requires hundreds of operations;
- Parallel processing structures.
Advantages:
- Speed;
- Low cost;
- Low power consumption.
Disadvantages:
- Very specialized;
- Complex programming model;
- Bandwidth problems;
- Volatile market.

Comparison of CPU vs. GPU
- With GPUs, most resources are allocated to data processing;
- By sharing the same control logic, more resources can be allocated to ALUs: SIMD!

GPU Architecture
Characteristics:
- Massive exploitation of parallelism among the operations;
- Application of the SIMD paradigm:
  - At the thread level;
  - At the data level.
Characteristics (e.g., NVIDIA):
- Several multiprocessors;
- Each multiprocessor is an aggregate of several 32-bit SIMD processing elements;
- In each clock cycle, each multiprocessor executes the same instruction in a group of threads (warp);
- No explicit communication between the processing elements;
- No cache coherency mechanism;
- 256 kB of cache, shared by all processing elements.

GPU Architecture
Example: NVIDIA GeForce [...]
- [...] multiprocessors;
- 8 vector processors in each multiprocessor;
- 330 GFLOPS.

Example: NVIDIA GeForce 8800
Each Streaming Multiprocessor (SM) offers:
- 8 Streaming Processors (SP);
- Performance of about [...] GHz;
- [...]-bit registers;
- Shared program and data caches;
- 16 kB of shared memory.

GPU Architecture
Example: Radeon HD 2900XT

Programming Model
1. The program is defined as a set of similar threads;
2. The code executed by each thread is written according to the Single-Program Multiple-Data (SPMD) paradigm;
3. The result of each thread is obtained as a combination of several mathematical operations and read/write accesses to the main memory;
4. Data stored in the main memory may subsequently be used as input to other threads.
Programming languages:
- Graphics APIs: OpenGL, DirectX;
- Generic APIs: CUDA, OpenCL, etc.
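The SPMD idea in the steps above can be illustrated with a small C emulation: one function is the program of a single thread, and every thread runs that same function on the element selected by its own index. This is a sketch under stated assumptions, not GPU code: `vec_add_thread` and `launch_vec_add` are hypothetical names, and the launcher runs the thread instances one after another, whereas a GPU would run them in parallel.

```c
#include <stddef.h>

/* Program of ONE thread: add the pair of elements selected by its index.
 * The bounds check lets the launcher start more threads than elements. */
static void vec_add_thread(size_t tid, const float *a, const float *b,
                           float *c, size_t n) {
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}

/* Hypothetical launcher: runs num_threads instances of the same program.
 * Serial here; the instances are independent, so they could run in parallel. */
void launch_vec_add(size_t num_threads, const float *a, const float *b,
                    float *c, size_t n) {
    for (size_t tid = 0; tid < num_threads; tid++) {
        vec_add_thread(tid, a, b, c, n);
    }
}
```

Because no thread reads what another thread writes, the result does not depend on the order in which the instances run, which is what makes this kind of kernel a good fit for the model.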

Programming Model
Example: vector addition

Difficulties
Non-regular execution patterns (e.g., branches):
- Threads running the same program are grouped in warps;
- To keep a regular execution flow, all possible paths of the execution pattern are considered, i.e., both directions of the branch instruction (taken and not taken);
- Consequently, some threads become inactive while the path they did not take is being executed.
High latency:
- Significant latency in the transfer of input and output data between the GPU and the CPU;
- Masked by massively exploiting multi-threading.
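The cost of a divergent branch can be made concrete with a small C emulation of one warp running in lock-step. Everything here is an assumption made for illustration: `WARP_SIZE` is set to 8 for brevity (it is not a hardware value), `warp_abs_or_double` is a hypothetical name, and the per-lane work is the arbitrary statement `if (x < 0) x = -x; else x = 2 * x;`. The function returns how many lane slots were idle.

```c
#include <stdint.h>

#define WARP_SIZE 8   /* illustrative warp width */

/* Runs "if (x < 0) x = -x; else x = 2 * x;" for all lanes of one warp.
 * Each branch direction is executed as a separate pass over the warp,
 * with the lanes that did not choose it masked off (idle). */
int warp_abs_or_double(int x[WARP_SIZE]) {
    uint32_t all = (1u << WARP_SIZE) - 1u;
    uint32_t taken = 0;
    int idle = 0;

    /* Evaluate the branch condition in every lane. */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (x[lane] < 0) {
            taken |= 1u << lane;
        }
    }
    uint32_t not_taken = ~taken & all;

    /* Pass 1: taken path. Skipped entirely if no lane needs it. */
    if (taken != 0) {
        for (int lane = 0; lane < WARP_SIZE; lane++) {
            if ((taken >> lane) & 1u) {
                x[lane] = -x[lane];
            } else {
                idle++;            /* lane waits during this pass */
            }
        }
    }

    /* Pass 2: not-taken path. Skipped entirely if no lane needs it. */
    if (not_taken != 0) {
        for (int lane = 0; lane < WARP_SIZE; lane++) {
            if ((not_taken >> lane) & 1u) {
                x[lane] = 2 * x[lane];
            } else {
                idle++;            /* lane waits during this pass */
            }
        }
    }
    return idle;
}
```

When all lanes agree on the branch direction only one pass runs and no lane is idle; when they disagree, both passes run and every lane sits out one of them.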

Performance
GPUs quickly surpassed the performance offered by general-purpose CPUs.

Comparison: CPU, Cell and GPU
Example: matrix multiplication
SOURCE: Henrique Costa, Multiprocessor Platforms for Natural Language Processing, MSc Thesis, IST, May

Future
Hybrid multi-core architectures, incorporating general-purpose cores (CPU) and graphics accelerators (GPU).
Examples:
- Fusion (AMD & ATI - first half of 2011);
- Larrabee (Intel - postponed...).

- Memory systems;
- Program access patterns;
- Cache memories:
  - Operation principles;
  - Internal organization;
  - Cache management policies.


More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 21

More information

Parallel Computing Platforms

Parallel Computing Platforms Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

CMSC 611: Advanced. Parallel Systems

CMSC 611: Advanced. Parallel Systems CMSC 611: Advanced Computer Architecture Parallel Systems Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems

More information

Processor Performance and Parallelism Y. K. Malaiya

Processor Performance and Parallelism Y. K. Malaiya Processor Performance and Parallelism Y. K. Malaiya Processor Execution time The time taken by a program to execute is the product of n Number of machine instructions executed n Number of clock cycles

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Comp. Org II, Spring

Comp. Org II, Spring Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel

More information

Introduction II. Overview

Introduction II. Overview Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Parallel Processing & Multicore computers

Parallel Processing & Multicore computers Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)

More information

M4 Parallelism. Implementation of Locks Cache Coherence

M4 Parallelism. Implementation of Locks Cache Coherence M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

Comp. Org II, Spring

Comp. Org II, Spring Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer

More information

Computer Architecture

Computer Architecture Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading

More information

Processor Architectures

Processor Architectures ECPE 170 Jeff Shafer University of the Pacific Processor Architectures 2 Schedule Exam 3 Tuesday, December 6 th Caches Virtual Memory Input / Output OperaKng Systems Compilers & Assemblers Processor Architecture

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/

More information

Using Graphics Chips for General Purpose Computation

Using Graphics Chips for General Purpose Computation White Paper Using Graphics Chips for General Purpose Computation Document Version 0.1 May 12, 2010 442 Northlake Blvd. Altamonte Springs, FL 32701 (407) 262-7100 TABLE OF CONTENTS 1. INTRODUCTION....1

More information

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture

More information

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar

Roadrunner. By Diana Lleva Julissa Campos Justina Tandar Roadrunner By Diana Lleva Julissa Campos Justina Tandar Overview Roadrunner background On-Chip Interconnect Number of Cores Memory Hierarchy Pipeline Organization Multithreading Organization Roadrunner

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Computer Organization. Chapter 16

Computer Organization. Chapter 16 William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data

More information

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence

Parallel Computer Architecture Spring Distributed Shared Memory Architectures & Directory-Based Memory Coherence Parallel Computer Architecture Spring 2018 Distributed Shared Memory Architectures & Directory-Based Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly

More information

CS 220: Introduction to Parallel Computing. Introduction to CUDA. Lecture 28

CS 220: Introduction to Parallel Computing. Introduction to CUDA. Lecture 28 CS 220: Introduction to Parallel Computing Introduction to CUDA Lecture 28 Today s Schedule Project 4 Read-Write Locks Introduction to CUDA 5/2/18 CS 220: Parallel Computing 2 Today s Schedule Project

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University * slides thanks to Kavita Bala & many others Final Project Demo Sign-Up: Will be posted outside my office after lecture today.

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information