COSC 6385 Computer Architecture - Thread Level Parallelism (I)

Edgar Gabriel, Spring 2014

Long-term trend in the number of transistors per integrated circuit
- The number of transistors doubles every ~18 months (Moore's Law)
- Source: http://en.wikipedia.org/wki/images:moores_law.svg

What do we do with that many transistors?
Optimizing the execution of a single instruction stream through:
- Pipelining: overlap the execution of multiple instructions. Example: all RISC architectures; Intel x86 under the hood.
- Out-of-order execution: allow instructions to overtake each other in accordance with code dependencies (RAW, WAW, WAR). Example: all commercial processors (Intel, AMD, IBM, SUN).
- Branch prediction and speculative execution: reduce the number of stall cycles due to unresolved branches. Example: (nearly) all commercial processors.

What do we do with that many transistors? (II)
- Multi-issue processors: allow multiple instructions to start execution per clock cycle. Superscalar (Intel x86, AMD, ...) vs. VLIW architectures.
- VLIW/EPIC architectures: allow compilers to indicate independent instructions per issue packet. Example: Intel Itanium.
- Vector units: allow for the efficient expression and execution of vector operations. Example: SSE-SSE4 and AVX instructions.
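Vector units are usually reached through auto-vectorization or compiler intrinsics. As a minimal sketch (not from the slides; it assumes an x86-64 compiler such as gcc or clang, where SSE is available by default):

```c
/* Minimal sketch of using 128-bit SSE vector instructions via compiler
 * intrinsics; each _mm_add_ps processes four floats at once.
 * Illustrative only. */
#include <immintrin.h>

void add_arrays(const float *a, const float *b, float *c, int n)
{
    int i;
    /* Vectorized part: four elements per instruction. */
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    /* Scalar remainder loop for the last few elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```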

Limitations of optimizing a single instruction stream (II)
Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously, due to
- data dependencies
- limitations of speculative execution across multiple branches
- difficulties in detecting memory dependencies among instructions (alias analysis)
Consequence: a significant number of functional units are idling at any given time.
Question: can we instead execute instructions from another instruction stream?
- Another thread?
- Another process?

Thread-level parallelism
Problems for executing instructions from multiple threads at the same time:
- The instructions in each thread might use the same register names
- Each thread has its own program counter
- Virtual memory management allows for the execution of multiple threads and sharing of the main memory
When to switch between different threads:
- Fine-grain multithreading: switches between threads on every instruction
- Coarse-grain multithreading: switches only on costly stalls (e.g. level-2 cache misses)
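To make the notion of separate instruction streams concrete, here is a minimal POSIX threads sketch (not from the slides; compile with -pthread): each thread has its own program counter and architectural register state, so a multithreaded processor can draw ready instructions from either stream.

```c
/* Two threads executing independent instruction streams; the work done
 * in worker() is arbitrary and only serves as an illustration. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    long id  = (long)arg;
    long sum = 0;
    for (long i = 0; i < 1000000; i++)   /* independent work per thread */
        sum += i % (id + 2);
    printf("thread %ld: sum = %ld\n", id, sum);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```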

Simultaneous Multi-Threading (SMT)
Converts thread-level parallelism to instruction-level parallelism.
(Diagram in the original slides: issue-slot utilization compared for a superscalar processor, coarse-grained MT, fine-grained MT, and SMT.)

Simultaneous multi-threading (II)
- Dynamically scheduled processors already have most hardware mechanisms in place to support SMT (e.g. register renaming)
- Required additional hardware:
  - Register file per thread
  - Program counter per thread
- Operating system view: if a processor supports n simultaneous threads, the operating system views them as n processors
- The OS distributes the most time-consuming threads fairly across the n processors that it sees
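As a small illustration of that operating-system view, the POSIX call sysconf(_SC_NPROCESSORS_ONLN) (available on Linux and most Unix systems) reports the number of online processors, which on an SMT-capable chip includes the hardware thread contexts. A tiny sketch, not from the slides:

```c
/* Query how many (logical) processors the OS exposes; on an SMT system
 * this count includes the hardware threads, not just physical cores. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("OS reports %ld online processors\n", n);
    return 0;
}
```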

Example for SMT architectures (I)
Intel Hyper-Threading:
- First released for the Intel Xeon processor family in 2002
- Supports two architectural state sets per physical processor; each architectural set has its own
  - General purpose registers
  - Control registers
  - Interrupt control registers
  - Machine state registers
- Adds less than 5% to the relative chip size
Reference: D.T. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, 6(1), 2002, pp. 4-15. ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf

Example for SMT architectures (II)
IBM POWER5:
- Same pipeline as the IBM POWER4 processor, but with SMT support
- Further improvements:
  - Increased associativity of the L1 instruction cache
  - Increased size of the L2 and L3 caches
  - Separate instruction prefetch and buffering units for each SMT thread
  - Increased size of the issue queues
  - Increased number of virtual registers used internally by the processor

Simultaneous Multi-Threading
Works well if:
- The number of compute-intensive threads does not exceed the number of threads supported by SMT
- Threads have highly different characteristics (e.g. one thread doing mostly integer operations, another mainly doing floating-point operations)
Does not work well if:
- Threads try to utilize the same functional units
- Assignment problems occur: e.g. on a dual-processor system where each processor supports 2 threads simultaneously (the OS thinks there are 4 processors), 2 compute-intensive application processes might end up on the same processor instead of on different processors (the OS does not see the difference between SMT and real processors!)

Classification of Parallel Architectures
Flynn's Taxonomy:
- SISD: Single instruction, single data (the classical von Neumann architecture)
- SIMD: Single instruction, multiple data
- MISD: Multiple instructions, single data (non-existent, just listed for completeness)
- MIMD: Multiple instructions, multiple data (the most common and general parallel machine)

Single Instruction Multiple Data
- Also known as array processors
- A single instruction stream is broadcast to multiple processors, each having its own data stream
- Still used in some graphics cards today
(Diagram in the original slides: a control unit broadcasts one instruction stream to several processors, each working on its own data.)

Multiple Instructions Multiple Data (I)
- Each processor has its own instruction stream and input data
- Very general case: every other scenario can be mapped to MIMD
- Further breakdown of MIMD is usually based on the memory organization:
  - Shared memory systems
  - Distributed memory systems

Shared memory systems (I)
- All processes have access to the same address space (e.g. a PC with more than one processor)
- Data exchange between processes by writing/reading shared variables
- Shared memory systems are easy to program
- Current standard in scientific programming: OpenMP (see the OpenMP sketch below)
- Two kinds of shared memory systems are available today:
  - Centralized shared architectures
  - Distributed shared architectures

Centralized Shared Architecture
- Also referred to as Symmetric Multi-Processors (SMP)
- All processors share the same physical main memory
- Memory bandwidth per processor is the limiting factor for this type of architecture
- Typical size: 2-32 processors
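Since OpenMP is named above as the current standard for shared-memory programming, a minimal sketch may help (not part of the original slides; compile with gcc or clang using -fopenmp):

```c
/* All threads operate on the same address space: the array 'a' is
 * shared, and the reduction clause combines per-thread partial sums
 * into the shared variable 'sum'. Illustrative only. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    static double a[1000000];
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }
    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```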

Distributed Shared Architectures
- Also referred to as Non-Uniform Memory Access (NUMA) architectures
- Some memory is closer to a certain processor than other memory
- The whole memory is still addressable from all processors
- Depending on which data item a processor retrieves, the access time can vary strongly

NUMA architectures (II)
- Reduces the memory bottleneck compared to SMPs
- More difficult to program efficiently, e.g. first-touch policy: a data item will be located in the memory of the processor which uses it first
- To reduce the effects of non-uniform memory access, caches are often used: ccNUMA (cache-coherent non-uniform memory access) architectures
- Largest example as of today: SGI Origin with 512 processors
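To make the first-touch policy concrete, here is a hedged OpenMP sketch (not from the slides; compile with -fopenmp): the array is initialized with the same static loop schedule that the compute loop later uses, so each page is first touched, and therefore placed, in memory close to the processor that keeps using it.

```c
/* First-touch placement on a ccNUMA machine: pages end up in the
 * memory node of the thread that writes them first, so the parallel
 * initialization loop mirrors the decomposition of the compute loop. */
#include <stdlib.h>

#define N 10000000

int main(void)
{
    double *a = malloc(N * sizeof(double));

    /* Parallel first touch: each thread touches "its" part of the array. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loop with the same static schedule mostly hits local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] + 1.0;

    free(a);
    return 0;
}
```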

Distributed memory machines (I)
- Each processor has its own address space
- Communication between processes by explicit data exchange:
  - Sockets
  - Message passing
  - Remote procedure call / remote method invocation
(Diagram in the original slides: several processors with local memories connected by a network interconnect.)

Performance Metrics (I)
- Speedup: how much faster does a problem run on p processors compared to 1 processor?
  S(p) = T_total(1) / T_total(p)
  Optimal: S(p) = p (linear speedup)
- Parallel efficiency: speedup normalized by the number of processors
  E(p) = S(p) / p
  Optimal: E(p) = 1.0
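As a small illustration of these metrics (the timing numbers below are invented for the example, not measurements from the slides):

```c
/* Compute speedup S(p) = T_total(1)/T_total(p) and efficiency
 * E(p) = S(p)/p from two (hypothetical) run times. */
#include <stdio.h>

int main(void)
{
    double t1 = 100.0;   /* hypothetical run time on 1 processor  */
    double tp = 14.0;    /* hypothetical run time on p processors */
    int    p  = 8;

    double speedup    = t1 / tp;       /* ~7.1x                     */
    double efficiency = speedup / p;   /* ~0.89, i.e. 89% efficient */

    printf("S(%d) = %.2f, E(%d) = %.2f\n", p, speedup, p, efficiency);
    return 0;
}
```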

Amdahl's Law (I)
Most applications have a (small) sequential fraction, which limits the speedup:

  T_total = T_sequential + T_parallel = f * T_total(1) + (1 - f) * T_total(1)

where f is the fraction of the code which can only be executed sequentially. On p processors only the parallel part shrinks, T_total(p) = f * T_total(1) + ((1 - f)/p) * T_total(1), so

  S(p) = T_total(1) / T_total(p) = 1 / (f + (1 - f)/p)

Example for Amdahl's Law
(Plot in the original slides: speedup over 1 to 64 processors for f = 0, 0.05, 0.1, and 0.2.)
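For illustration with numbers chosen here (not from the slides): with f = 0.05 and p = 16, S(16) = 1 / (0.05 + 0.95/16) ≈ 9.1; and even with arbitrarily many processors the speedup cannot exceed 1/f = 20.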

Amdahl's Law (II)
- Amdahl's Law assumes that the problem size is constant
- In most applications, the sequential part is independent of the problem size, while the part which can be executed in parallel is not

Cache Coherence
- Real-world shared memory systems have caches between the processors and main memory
- Copies of a single data item can exist in multiple caches
- Modification of a shared data item by one processor leads to outdated copies in the cache of another processor
(Diagram in the original slides: the original data item in main memory, with copies in the caches of processor 0 and processor 1.)

Cache coherence (II)
Typical solution:
- Caches keep track of whether a data item is shared between multiple processes
- Upon modification of a shared data item, the other caches have to be notified
- The other caches will have to reload the shared data item into their cache on the next access
Cache coherence is only an issue when multiple tasks access the same item:
- Multiple threads
- Multiple processes that have a joint shared memory segment
- A process being migrated from one processor to another

Cache Coherence Protocols
Snooping protocols (a simplified MSI sketch follows below):
- Send all requests for data to all processors
- Processors snoop on a bus to see if they have a copy and respond accordingly
- Requires broadcast, since caching information is kept at the processors
- Works well with a bus (natural broadcast medium)
- Dominates for centralized shared memory machines
Directory-based protocols:
- Keep track of what is being shared in a centralized location
- Distributed memory => distributed directory for scalability (avoids bottlenecks)
- Send point-to-point requests to processors via the network
- Scales better than snooping
- Commonly used for distributed shared memory machines
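The slides describe snooping only at the protocol level; as an illustrative sketch, an invalidation-based MSI state machine for a single cache block might look as follows (state and event names are chosen for this example and do not describe any particular processor's protocol):

```c
/* Simplified MSI snooping protocol for one cache block. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum {
    LOCAL_READ, LOCAL_WRITE,   /* requests from the local processor   */
    BUS_READ,   BUS_WRITE      /* requests snooped from other caches  */
} msi_event_t;

static msi_state_t next_state(msi_state_t s, msi_event_t e)
{
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;    /* miss: fetch block, mark shared        */
        if (e == LOCAL_WRITE) return MODIFIED;  /* miss: fetch block, invalidate others  */
        return INVALID;                         /* bus traffic for an uncached block     */
    case SHARED:
        if (e == LOCAL_WRITE) return MODIFIED;  /* broadcast invalidate to other caches  */
        if (e == BUS_WRITE)   return INVALID;   /* another processor writes the block    */
        return SHARED;                          /* reads keep the block shared           */
    case MODIFIED:
        if (e == BUS_READ)    return SHARED;    /* write back the data, then share       */
        if (e == BUS_WRITE)   return INVALID;   /* write back the data, then invalidate  */
        return MODIFIED;                        /* local accesses hit in the cache       */
    }
    return INVALID;
}

int main(void)
{
    msi_state_t s = INVALID;
    s = next_state(s, LOCAL_WRITE);   /* INVALID -> MODIFIED                   */
    s = next_state(s, BUS_READ);      /* MODIFIED -> SHARED (after write back) */
    printf("final state: %d\n", s);   /* prints 1 (SHARED)                     */
    return 0;
}
```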

Up to now: categories of cache misses
- Compulsory misses: the first access to a block cannot be in the cache (cold start misses)
- Capacity misses: the cache cannot contain all blocks required for the execution
- Conflict misses: a cache block has to be discarded because of the block replacement strategy
In multi-processor systems:
- Coherence misses: a cache block has to be discarded because another processor modified its content
  - True sharing miss: another processor modified the content of the requested element
  - False sharing miss: another processor invalidated the block, although the actual item of interest is unchanged (see the false-sharing sketch below)
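A false sharing miss can be provoked deliberately; the following hedged OpenMP sketch (not from the slides; compile with -fopenmp, cache line size assumed to be 64 bytes) lets two threads update distinct counters that share one cache block, and shows the usual padding fix:

```c
/* False sharing: the two 'packed' counters likely fall into the same
 * cache block, so each write invalidates the other processor's copy
 * even though the threads never touch the same data.  The 'padded'
 * variant places each counter in its own 64-byte block. */
#include <omp.h>

struct packed_counter { volatile long count; };
struct padded_counter { volatile long count; char pad[64 - sizeof(long)]; };

static struct packed_counter packed[2];
static struct padded_counter padded[2];

int main(void)
{
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 100000000; i++) {
            packed[id].count++;    /* false sharing: frequent coherence misses */
            /* padded[id].count++;    padding avoids the shared cache block    */
        }
    }
    return 0;
}
```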