Computer Architecture

Computer Architecture Slide Sets, WS 2013/2014
Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting
Part 10: Thread and Task Level Parallelism

Basic concepts
Thread: Threads are lightweight processes consisting of a sequence of instructions. All threads of a task share a common (virtual) address space and can communicate via this common address space.
Task: Tasks are heavyweight processes. Each task has its own address space. Tasks can only communicate via inter-task communication channels such as shared memory segments, pipes, message queues or sockets. A task can contain several threads.
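The slides define this distinction abstractly; the following minimal C sketch (my own illustration, not from the slides) makes it concrete. Two POSIX threads increment a counter in the common address space, while a forked child task receives its own copy of that address space, so its update stays invisible to the parent. Compile with cc -pthread.

```c
/* Sketch illustrating the thread/task distinction from the slides:
 * threads share one address space; a forked task gets its own copy. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;                 /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);          /* threads must synchronize accesses */
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("after two threads: counter = %d\n", counter);   /* prints 2 */

    if (fork() == 0) {                  /* child task: private copy of counter */
        counter += 100;                 /* invisible to the parent */
        exit(0);
    }
    wait(NULL);
    printf("after child task:  counter = %d\n", counter);   /* still 2 */
    return 0;
}
```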

Basic concepts
Instruction-level parallelism is limited. To exploit parallel processing further, thread-level or task-level parallelism can be used. Two major architectures are known:
- Multithreaded processors exploit thread-level parallelism
- Chip multiprocessors (multi-core processors, many-core processors) exploit task-level parallelism
Both concepts are also used in combination.

Basic concepts
In a multithreaded processor, instructions of several threads of the program are candidates for concurrent issue. This can be done in a classical scalar pipeline to hide the latencies of memory accesses: instructions from several threads are then processed in the different pipeline stages. It can also be combined with a superscalar pipeline to raise the level of exploitable parallelism from the intra-thread level to the inter-thread level; this is called SMT (Simultaneous Multithreading).

Basic concepts
Chip multiprocessors combine multiple processor cores on a single chip; these processors are therefore also called multi-core processors. Today's multi-core processors integrate 2-8 cores on a chip. As the number of cores grows in the future (e.g. > 100), the term many-core processors is used. The cores can execute several tasks in parallel and can be homogeneous or heterogeneous. With multithreaded cores, multithreading and chip multiprocessing can be combined.

Multithreaded Architectures
Multithreaded processor:
- Supports the execution of multiple threads in hardware
- Stores the context information of several threads in separate register sets and can execute instructions of different threads at the same time in the processor pipeline
- Different stages of the processor pipeline can contain instructions from different threads
- This exploits thread-level parallelism on the basis of parallelism in time (pipelining)

Multithreaded Architectures
Goal: reduction of latencies caused by memory accesses or dependencies. Such latencies can be bridged by switching to another thread: during the latency, instructions from other threads are fed into the pipeline. Processor utilization is thereby raised, and the throughput of a load consisting of multiple threads increases (while the throughput of a single thread remains the same).
- Explicit multithreaded processors: each thread is a real thread of the application program
- Implicit multithreaded processors: speculative parallel threads are created dynamically out of a sequential program

Basic multithreading techniques
[Figure: pipeline occupancy over time (processor cycles) for (a) a single-threaded processor, (b) the cycle-by-cycle interleaving technique (fine-grained multithreading), where the context is switched every clock cycle, and (c) the block interleaving technique (coarse-grained multithreading), where instructions of a thread are executed until an event causes a latency and the context is then switched.]
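The figure is easiest to internalize by simulating it. The toy C program below (my own illustration, not from the slides) models which thread occupies the pipeline in each cycle: in fine-grained mode the processor rotates round-robin every cycle, in coarse-grained mode it stays on one thread until that thread hits a stall, using a made-up miss pattern. The switch-on-stall cycle itself is simplified away.

```c
/* Toy model of fine-grained vs. coarse-grained multithreading:
 * each cycle, one thread id occupies the pipeline. */
#include <stdio.h>

#define THREADS 4
#define CYCLES  16

/* Hypothetical miss pattern: thread t stalls every (t + 3) cycles. */
static int stalls(int t, int cycle) { return cycle % (t + 3) == 0; }

static void run(int fine_grained) {
    int current = 0;
    printf(fine_grained ? "fine-grained:   " : "coarse-grained: ");
    for (int cycle = 1; cycle <= CYCLES; cycle++) {
        if (fine_grained)
            current = cycle % THREADS;         /* switch every cycle */
        else if (stalls(current, cycle))
            current = (current + 1) % THREADS; /* switch only on a stall */
        printf("T%d ", current);
    }
    printf("\n");
}

int main(void) {
    run(1);   /* (b) cycle-by-cycle interleaving */
    run(0);   /* (c) block interleaving */
    return 0;
}
```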

Comparing multithreading to superscalar and VLIW
[Figure: issue-slot occupancy over time (processor cycles) for (a) a four-way superscalar processor, (b) a four-way VLIW processor, (c) a four-way superscalar processor with cycle-by-cycle interleaving, and (d) a four-way VLIW processor with cycle-by-cycle interleaving; context switches occur between cycles in (c) and (d).]

Classification of block interleaving techniques
Block interleaving techniques are classified as:
- static
  - explicit-switch
  - implicit-switch (switch-on-load, switch-on-store, switch-on-branch, ...)
- dynamic
  - conditional-switch
  - switch-on-signal
  - switch-on-cache-miss
  - switch-on-use

Simultaneous multithreading (SMT)
A simultaneous multithreaded processor is able to issue instructions of multiple threads to multiple execution units in a single clock cycle. This exploits thread-level and instruction-level parallelism in time and space.
[Figure: SMT pipeline with instruction fetch, instruction decode and rename, instruction window, issue ports 1-4 feeding reservation stations, execution units 1-4, and retire/write-back.]

Comparing SMT to chip multiprocessing
[Figure: issue-slot occupancy over time (processor cycles) for (a) simultaneous multithreading, where instructions of several threads share the issue slots of one wide core within the same cycle, and (b) chip multiprocessing, where the issue slots of each core are filled by a single thread.]

Other applications of multithreading
The ability to switch contexts quickly opens up further application fields for multithreading:
- Reduction of energy consumption: mispredictions in superscalar processors cost energy; multithreaded processors can execute instructions from other threads instead
- Event handling: helper threads handle special events (e.g. garbage collection)
- Real-time processing: allows efficient real-time scheduling policies like LLF (least laxity first) or GP (guaranteed percentage)

Chip multiprocessing architectures
A chip multiprocessor (CMP) combines several processors on a single chip. Instead of chip multiprocessor, such a device is today also called a multi-core processor, where a core denotes a single processor on the multi-core processor chip.
- Each core can have the complexity of today's microprocessors and holds its own primary cache for instructions and data
- Usually, the cores are organized as memory-coupled multiprocessors with a shared address space
- Furthermore, a secondary cache is contained on the chip
- For future multi-core processors containing a large number of cores (> 100), the term many-core processor is used

Possible multi-core configurations
[Figure: (a) shared main memory, where each of four processors has its own primary and secondary cache and all share the global memory; (b) shared secondary cache, where each of four processors has its own primary cache and all share one on-chip secondary cache in front of the global memory.]

Possible multi-core configurations (2)
[Figure: (c) shared primary cache, where all processors share a single primary cache, backed by a secondary cache and the global memory.]

Chip multiprocessor / multi-core
Simulations show the shared-secondary-cache architecture to be superior to shared primary cache and shared main memory. Therefore, mostly a large shared secondary cache is implemented on the processor chip. Cache coherence protocols known from symmetric multiprocessor architectures (e.g. the MESI protocol) guarantee correct access to the shared memory cells from inside and outside the processor chip. Today, chip multiprocessing is often combined with simultaneous multithreading: each core is then an SMT core, giving the advantages of both approaches.
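The slides only name MESI. As a reminder of what the protocol does, here is a deliberately simplified sketch (my own illustration) of the per-line state transitions a MESI controller performs in response to local reads/writes and snooped bus traffic; real controllers additionally handle write-back, intervention and ownership transfer.

```c
/* Simplified MESI state machine for a single cache line (illustration
 * only; write-back and data supply on snoop hits are omitted). */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

static mesi_t next_state(mesi_t s, event_t e, int others_have_copy) {
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                   /* read miss: fetch the line  */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                           /* read hits keep the state   */
    case LOCAL_WRITE:
        return MODIFIED;                    /* invalidates remote copies  */
    case BUS_READ:                          /* another core reads the line */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE:                         /* another core writes it     */
        return INVALID;
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = next_state(s, LOCAL_READ, 0);   /* -> EXCLUSIVE               */
    s = next_state(s, LOCAL_WRITE, 0);  /* -> MODIFIED                */
    s = next_state(s, BUS_READ, 0);     /* -> SHARED (data supplied)  */
    printf("final state: %d\n", s);     /* prints 1 (SHARED)          */
    return 0;
}
```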

An early single-chip multiprocessor proposal: Hydra
[Figure: the Hydra single-chip multiprocessor. Four CPUs, each with its own primary I-cache, primary D-cache and memory controller, connect via centralized bus arbitration mechanisms to an on-chip secondary cache. Off chip are an L3 interface with a cache SRAM array, a Rambus memory interface to the DRAM main memory, and a DMA/I-O bus interface to the I/O devices.]

Multi-Core examples: IBM Power5
- Symmetric multi-core processor with two 64-bit, two-way SMT cores, each having a 64 KByte instruction cache and a 32 KByte data cache
- Both cores share a 1.875 MByte on-chip secondary cache
- The controller for the third-level cache is on chip as well
- Four Power5 chips and four L3 cache chips are combined in a multi-chip module

Multi-Core examples: IBM Power6
- Similar to the Power5, but superscalar in-order execution
- Level-1 cache size raised to 64 KBytes for instructions and for data on each core
- 65 nm process
- 5 GHz clock frequency

Multi-Core examples: IBM Power7
- 4, 6 or 8 cores
- Turbo mode deactivates 4 out of 8 cores but gives the remaining 4 cores access to all memory controllers => improves single-core performance
- Each core supports four-way SMT
- 45 nm process
- 4 GHz clock frequency

Multi-Core examples: Intel Core 2 Duo (Wolfdale)
- 2 processor cores of the Intel Core 2 architecture
- 32 KBytes data and 32 KBytes instruction cache for each core
- 6 MBytes L2 cache, shared by both cores
- 45 nm process
- 3 GHz clock frequency
[Figure: die layout with Core 1 and Core 2 beside the shared L2 cache.]

Multi-Core examples
[Figure: microarchitecture of the Intel Core 2 family (a single core). Source: c't 16/2006]

Multi-Core examples: Intel Core 2 Quad (Yorkfield)
- 2 Wolfdale dies in a multi-chip module => 4 processor cores of the Intel Core 2 architecture
- 32 KBytes data and 32 KBytes instruction cache for each core
- 6 MBytes L2 cache per die
- 45 nm process
- 3 GHz clock frequency

Multi-Core examples: Intel Core i7-3930K (Sandy Bridge E)
- 6-core processor (hexa-core)
- 32 KBytes data and 32 KBytes instruction cache for each core
- 256 KBytes L2 cache for each core
- 15 MBytes shared L3 cache
- 32 nm process
- 3.3 GHz clock frequency

Heterogeneous multi-cores
While homogeneous multi-core processors are commonly used for general-purpose computing, heterogeneous multi-core processors are seen as a future trend for embedded systems. A first member of this technology is the IBM Cell processor, containing one Power processor (Power Processor Element, PPE) and 8 dependent processors (Synergistic Processing Elements, SPE):
- PPE: based on the Power architecture, two-way SMT, controls the 8 SPEs
- SPE: contains a RISC processor with 128-bit SIMD (multimedia) instructions, a memory flow controller and a bus controller
Originally designed for the Sony PlayStation 3, the Cell processor is now used in various application domains.
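The slides show no SPE code. Purely as an illustration of what a 128-bit SIMD instruction does, the sketch below uses GCC's portable vector extensions rather than the Cell's actual spu_* intrinsics: four packed 32-bit floats are added in a single operation.

```c
/* Illustration of 128-bit SIMD semantics using GCC/clang vector
 * extensions; not Cell-specific code. */
#include <stdio.h>

typedef float v4sf __attribute__((vector_size(16)));  /* 4 x 32-bit float */

int main(void) {
    v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4sf b = {10.0f, 20.0f, 30.0f, 40.0f};
    v4sf c = a + b;            /* one SIMD add: four lanes in parallel */
    for (int i = 0; i < 4; i++)
        printf("%g ", c[i]);   /* prints 11 22 33 44 */
    printf("\n");
    return 0;
}
```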

Cell Processor Die
[Figure: die photo of the Cell processor.]

GPUs
Heterogeneous many-cores: 1000 and more streaming processor cores for shading.
- First generation: special-purpose hardware for the various shading tasks
- Second generation: programmable streaming processors for pixel shading, vertex shading, ...
- Third generation: unified shaders. Example: Radeon R600 GPU

GPUs
Another example: NVIDIA GF100
- 4 Graphics Processing Clusters (GPC)
- 768 KBytes L2 cache
- 6 memory controllers

GPUs
A GPC consists of:
- a raster engine (triangle setup, rasterization, Z-management)
- a polymorph engine (vertex attribute fetch, tessellation)
- 4 streaming multiprocessors (SM) performing unified shading: vertex, geometry, raster, texture and pixel processing

GPUs
An SM consists of:
- 32 CUDA cores (Compute Unified Device Architecture)
- 16 load/store units
- 4 special function units (sin, cos, square root calculation, etc.)
=> GF100 overall:
- 4 x 4 x 32 = 512 CUDA cores
- 4 x 4 x 16 = 256 load/store units
- 4 x 4 x 4 = 64 special function units
- 4 x 4 = 16 polymorph engines
- 4 raster engines

Multi-Core discussion: performance
Thanks to multithreading in PC and server operating systems, two to four cores already increase processor throughput significantly. Exploiting eight or more cores requires parallel application programs; software development is therefore challenged to deliver the necessary number of parallel threads, either through parallelizing compilers or through parallel applications. Experience from multiprocessors shows that a moderate number of parallel threads yields a high performance improvement, but that this does not scale to higher degrees of parallelism: beginning at 4 to 8 threads, the additional improvement drops off sharply. With 8 cores, except for very compute-intensive applications (signal processing, graphics processing), some cores will be temporarily idle. Furthermore, memory bandwidth can become a bottleneck.
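The slides state this scaling limit only qualitatively. Amdahl's law (not cited on the slide, but the standard quantitative form of the argument) makes it precise: if a fraction $p$ of a program is parallelizable and $n$ cores are available, the achievable speedup is

$$ S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}. $$

With $p = 0.9$, for example, even arbitrarily many cores give at most a tenfold speedup, consistent with the observation above that gains flatten out beyond 4 to 8 threads.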

Multi-Core discussion: hardware
- While current multi-core processors use cache-coupled interconnects, future processors might rely on grid structures (networks on chip) to improve performance
- Adaptive and reconfigurable MPSoCs (Multi-Processor Systems-on-Chip) will gain importance for embedded systems and general-purpose computing
- Heterogeneous many-core GPUs are state of the art
- Reconfigurable cache memories might allow variable connections to different cores
- Available input/output bandwidth is still an open problem for throughput-oriented programs

Multi-Core discussion: hardware
For data access, transactional memory might be a model for future multi-core processors. Similar to database systems, a memory access sequence is organized as a transaction that is executed either completely or not at all; hardware support for checkpointing and rollback is necessary. As an advantage, concurrent access is simplified (no locks). Furthermore, fault-tolerance and dependability techniques will become more important, as the error probability increases with decreasing transistor dimensions. On-chip power management will keep the importance it already has today.
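The slides describe transactional memory abstractly. As a software analogy (my own sketch, not hardware TM and not from the slides), the C11 compare-and-swap retry loop below shows the all-or-nothing pattern: take a snapshot, compute a new value, and commit only if no other thread changed the data in between, otherwise roll back and re-execute.

```c
/* Optimistic "transaction" on one word via C11 atomics: the update
 * commits atomically or retries; a software analogy to hardware TM. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long account = 100;

static void deposit(long amount) {
    long snapshot, updated;
    do {
        snapshot = atomic_load(&account);    /* begin: take a checkpoint */
        updated  = snapshot + amount;        /* compute inside the txn   */
    } while (!atomic_compare_exchange_weak(&account, &snapshot, updated));
    /* The commit succeeds only if no other thread wrote 'account' in
     * the meantime; otherwise the loop "rolls back" and retries. */
}

int main(void) {
    deposit(42);
    printf("balance = %ld\n", atomic_load(&account));  /* prints 142 */
    return 0;
}
```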

Multi-Core discussion: software
Currently, operating system concepts known from memory-coupled multiprocessor systems are used: the operating system scheduler assigns independent processes to the available processors. In contrast to those systems, the closer coupling of the cores in a multi-core processor changes the computation-to-synchronization ratio, allowing more fine-grained parallelism to be used. Parallel computing will become the standard programming model of the future. Most currently existing software is sequential and can thus run on only one core. Programming languages and tools to exploit the fine-grained parallelism of multi-core processors need to be developed (see the sketch below). Furthermore, software engineering techniques are needed to allow the development of safe parallel programs.
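As one existing example of such a tool (my choice of illustration; the slides name no specific language), OpenMP lets the compiler turn an annotated sequential loop into parallel work spread across the cores:

```c
/* Sequential loop parallelized across cores with one OpenMP pragma.
 * Compile with: cc -fopenmp dot.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {        /* set up some data */
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* The runtime splits the iterations among the available cores;
     * the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.1f (threads available: %d)\n",
           sum, omp_get_max_threads());
    return 0;
}
```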

Multi-Core discussion: software
The development of applications for multi-core processors will become one of the main future markets for computer scientists. Today's applications have to be reworked with the goal of exploiting parallelism, gaining performance and increasing comfort. New applications that are currently not realizable due to a lack of processor performance will arise; these are hard to predict, but they must have a need for high computational performance that is reachable through parallelism. Such applications might come from speech recognition, image recognition, data mining, learning technologies or hardware synthesis.