
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com

Review

Processor Architecture and Technology trends Processor chips are the key components of computers. Processor chips consist of transistors, and the transistor count can be used as a rough estimate of a chip's complexity and performance. Moore's law is an empirical observation which states that the number of transistors on a typical processor chip doubles every 18-24 months.
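To put the doubling rate in perspective (a back-of-the-envelope figure, not from the slides): at a doubling time of two years, the transistor count grows by a factor of about 2^5 = 32 over a decade.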

Processor Architecture and Technology trends

Processor Architecture and Technology trends Four phases of microprocessor design trends:
1. Parallelism at bit level
2. Parallelism by pipelining
3. Parallelism by multiple functional units
4. Parallelism at processor or thread level

Parallelism at bit level Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes exceed the word length. (For example, consider a case where an 8-bit processor must add two 16-bit integers. The processor must first add the 8 lower-order bits of each integer and then the 8 higher-order bits, requiring two instructions to complete a single operation. A 16-bit processor can complete the operation with a single instruction.)
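To make the example concrete, here is a minimal C sketch (an illustration, not from the slides; the function names are hypothetical) of how a 16-bit addition is composed from two 8-bit additions with an explicit carry, versus the single native addition a wider processor can perform:

    #include <stdint.h>

    /* Add two 16-bit integers using only 8-bit arithmetic, as an 8-bit CPU would:
       first the low bytes, then the high bytes plus the carry out of the low half. */
    uint16_t add16_via_8bit(uint16_t a, uint16_t b) {
        uint8_t lo    = (uint8_t)(a & 0xFF) + (uint8_t)(b & 0xFF);     /* low-order add (wraps mod 256) */
        uint8_t carry = (lo < (uint8_t)(a & 0xFF)) ? 1 : 0;            /* detect the carry-out          */
        uint8_t hi    = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry; /* high-order add                */
        return (uint16_t)(((uint16_t)hi << 8) | lo);
    }

    /* On a 16-bit (or wider) processor the same operation is a single add instruction. */
    uint16_t add16_native(uint16_t a, uint16_t b) {
        return (uint16_t)(a + b);
    }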

Parallelism by pipelining A typical pipeline is partitioned into four stages: a) fetch, b) decode, c) execute, d) write-back.

Parallelism by pipelining ILP (instruction-level parallelism) processors are processors that use pipelining to execute instructions. Processors with a relatively large number of pipeline stages are sometimes called superpipelined.
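As a rough, idealized model (not from the slides): with a k-stage pipeline and n independent instructions, execution takes about k + (n - 1) cycles instead of the n * k cycles needed without pipelining, so the speedup approaches k for large n. For example, with k = 4 stages and n = 100 instructions, that is 103 cycles instead of 400, a speedup of about 3.9.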

Parallelism by multiple functional units Many processors are multiple-issue processors. They use multiple, independent functional units like ALUs (arithmetic logical units), FPUs (floating-point units), load/store units, or branch units. These units can work in parallel, i.e., different independent instructions can be executed in parallel by different functional units.

Parallelism by multiple functional units Multiple-issue processors can be classified as superscalar processors or VLIW (very long instruction word) processors. However, using even more functional units provides little additional gain because of dependencies between instructions and branching of the control flow.
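A small, hypothetical C fragment illustrates both points made above: the first three statements are independent and can, in principle, be dispatched to different functional units (two ALU adds and an FPU multiply) in the same cycle, while the later statements depend on earlier results and must wait.

    /* Hypothetical example of independent vs. dependent instructions. */
    double mixed_work(int a, int b, int c, int d, double p, double q) {
        int    x = a + b;   /* ALU add      -- independent              */
        int    y = c + d;   /* ALU add      -- independent              */
        double z = p * q;   /* FPU multiply -- independent              */
        int    w = x + y;   /* depends on x and y: cannot issue earlier */
        return z + w;       /* depends on z and w                       */
    }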

Parallelism at processor or thread level The degree of parallelism obtained by pipelining and multiple functional units is limited. For typical processors, this limit has already been reached for some time. However, more and more transistors are available per processor chip according to Moore's law. These can be used to integrate larger caches on the chip, but cache sizes cannot be increased arbitrarily.

Parallelism at processor or thread level An alternative approach to use the increasing number of transistors on a chip is to put multiple, independent processor cores onto a single processor chip. This approach has been used for typical desktop processors since 2005. The resulting processor chips are called multicore processors. Each of the cores of a multicore processor must obtain a separate flow of control, i.e., parallel programming techniques must be used. The cores of a processor chip access the same memory and may even share caches. Therefore, memory accesses of the cores must be coordinated.
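A minimal POSIX-threads sketch (an illustration, not from the slides; compile with -pthread) shows both requirements: each core is given its own flow of control in the form of a thread, and the threads' accesses to the shared counter are coordinated with a mutex.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                        /* shared by all threads/cores        */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {                /* one flow of control per thread     */
        (void)arg;                                  /* unused                             */
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);              /* coordinate access to shared memory */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;                           /* e.g., one thread per core          */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);         /* 2000000 with correct coordination  */
        return 0;
    }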

Flynn's Taxonomy of Parallel Architectures Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966. The classification system has stuck and has been used as a tool in the design of modern processors and their functionalities. Since the rise of multiprocessing CPUs, a multiprogramming context has evolved as an extension of the classification system. Source: Wikipedia

Flynn's Taxonomy of Parallel Architectures The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture:
Single Instruction, Single Data stream (SISD)
Single Instruction, Multiple Data streams (SIMD)
Multiple Instruction, Single Data stream (MISD)
Multiple Instruction, Multiple Data streams (MIMD)

Flynn's Taxonomy of Parallel Architectures Single Instruction, Single Data stream (SISD) A sequential computer which exploits no parallelism in either the instruction or data streams. (In the accompanying diagram, "PU" denotes a central processing unit.)

Flynn's Taxonomy of Parallel Architectures Single Instruction, Multiple Data streams (SIMD) A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized. For example, an array processor or GPU.
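As a concrete, hypothetical sketch (x86 SSE intrinsics; assumes an SSE-capable CPU and is not from the slides), one SIMD instruction adds four pairs of floats at once, i.e., a single instruction stream applied to multiple data elements:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add four float pairs with one SIMD addition. */
    void add4(const float *a, const float *b, float *out) {
        __m128 va = _mm_loadu_ps(a);      /* load 4 floats from a         */
        __m128 vb = _mm_loadu_ps(b);      /* load 4 floats from b         */
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, one instruction */
        _mm_storeu_ps(out, vc);           /* store the 4 results          */
    }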

Flynn's Taxonomy of Parallel Architectures Multiple Instruction, Single Data stream (MISD) Multiple instructions operate on a single data stream. This is an uncommon architecture which is generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer.

Flynn's Taxonomy of Parallel Architectures Multiple Instruction, Multiple Data streams (MIMD) Multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space. A multi-core superscalar processor is an MIMD processor.

MIMD computer systems

Memory Organization of Parallel Computers
Computers with Distributed Memory Organization
Computers with Shared Memory Organization
Reducing Memory Access Times

Computers with Distributed Memory Organization In computer science, distributed memory refers to a multiple-processor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors.
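A minimal MPI sketch (an illustration, not from the slides; it assumes an MPI installation and at least two processes, e.g. mpirun -np 2) shows the consequence: each process owns its private memory, so remote data has to be transferred by explicit messages.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                    /* process 0 owns the value ...            */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {             /* ... process 1 must receive its own copy */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }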

Computers with Distributed Memory Organization

Computers with Shared Memory Organization In computing, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, for example among its multiple threads, is also referred to as shared memory.
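As a hypothetical POSIX sketch (the object name "/demo_shm" is made up; link with -lrt on some systems; error handling is omitted for brevity), two separate programs that open and map the same shared-memory object see the same bytes and can communicate through them without copying:

    #include <fcntl.h>      /* O_CREAT, O_RDWR */
    #include <sys/mman.h>   /* shm_open, mmap  */
    #include <unistd.h>     /* ftruncate       */

    /* Create (or open) a named shared-memory object and map one int into this
       process's address space; another process mapping "/demo_shm" sees the
       same memory. */
    int *map_shared_counter(void) {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(int));
        return (int *)mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    }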

Thread A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system.

Computers with Shared Memory Organization

Reducing Memory Access Times Memory access time has a large influence on program performance. This can also be observed for computer systems with a shared address space. Processor performance has been improving rapidly for many years, and a significant contribution to these improvements comes from a reduction in processor cycle time. At the same time, the capacity of the DRAM chips that are used for building main memory has been increasing by about 60% per year.

Reducing Memory Access Times In contrast, the access time of DRAM chips has only been decreasing by about 25% per year. Thus, memory access time does not keep pace with processor performance improvement, and there is an increasing gap between processor cycle time and memory access time. A suitable organization of memory access becomes more and more important to get good performance results at program level.

Reducing Memory Access Times This is also true for parallel programs, in particular if a shared address space is used. Reducing the average latency observed by a processor when accessing memory can increase the resulting program performance significantly. Two important approaches have been considered to reduce the average latency for memory access: 1. The simulation of virtual processors by each physical processor (multithreading). 2. the use of local caches to store data values that are accessed often.

Multithreading In computer architecture, multithreading is the ability of a central processing unit, or a single core in a multicore processor, to execute multiple processes or threads concurrently, appropriately supported by the operating system. The idea of interleaved multithreading is to hide the latency of memory accesses by simulating a fixed number of virtual processors for each physical processor. In fine-grained multithreading, a switch is performed after each instruction. In coarse-grained multithreading, the processor switches between virtual processors only on costly stalls.

Multithreading There are two drawbacks of fine-grained multithreading:
1. The programming must be based on a large number of virtual processors. Therefore, the algorithm used must have a sufficiently large potential of parallelism to employ all virtual processors.
2. The physical processors must be specially designed for the simulation of virtual processors. A software-based simulation using standard microprocessors is too slow.

Caches In computing, a cache (pronounced "cash") is a component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a duplicate of data stored elsewhere. A cache is a small but fast memory between the processor and main memory.

Caches A cache can be used to store data that is often accessed by the processor, thus avoiding expensive main memory access. The data stored in a cache is always a subset of the data in the main memory, and the management of the data elements in the cache is done by hardware.
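A standard textbook formula (not from the slides) makes the benefit concrete: average memory access time = hit time + miss rate * miss penalty. With a hypothetical 1 ns cache hit time, a 100 ns main-memory miss penalty, and a 5% miss rate, the average access time is 1 + 0.05 * 100 = 6 ns, far closer to the cache's speed than to main memory's.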