
Computer and Hardware Architecture II. Benny Thörnberg, Associate Professor in Electronics

Parallelism: Microscopic vs Macroscopic. Microscopic parallelism - hardware solutions inside system components that provide parallel computation without being visible to the user, e.g. registers, memory, parallel buses and the instruction pipeline. Macroscopic parallelism - duplicated large-scale components that provide parallelism at the system level, e.g. dual- or quad-core processors, vector or graphics processors, co-processors and I/O processors.

Parallelism: Symmetric vs Asymmetric. Symmetric parallelism uses replications of identical processing elements that can operate in parallel, e.g. multicore processors. Asymmetric parallelism uses a set of processing elements that operate in parallel but differ, e.g. a PC with a CPU, graphics processor, math processor and I/O processor.

Parallelism: Fine-grain vs Coarse-grain. Fine-grain parallelism - computers providing parallel computation at the level of instructions or data items, e.g. vector processors and digital signal processors with special SIMD instructions. Coarse-grain parallelism - computers providing parallelism at the level of programs or larger data structures, e.g. dual- or quad-core processors.

Parallelism: Explicit vs Implicit. Explicit parallelism - the programmer needs to control how the available parallelism is exploited in the code, e.g. through partitioning into parallel processes, constraints and special instructions. Implicit parallelism - the hardware can exploit parallelism in the executed code without constraints or any special instructions defined by the programmer.
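As an illustration (a minimal sketch, not from the original slides), the C program below makes the parallelism explicit: the programmer partitions a summation into two POSIX threads. Implicit parallelism would instead mean leaving a plain sequential loop to the compiler and hardware, which may exploit pipelining or superscalar issue on their own. All names and the array size are arbitrary.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];
    static double partial[2];

    /* Explicit parallelism: the programmer partitions the work into
       two parallel threads, each summing one half of the array. */
    static void *sum_half(void *arg) {
        long id = (long)arg;
        double s = 0.0;
        for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            s += data[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) data[i] = 1.0;

        pthread_t t0, t1;
        pthread_create(&t0, NULL, sum_half, (void *)0L);
        pthread_create(&t1, NULL, sum_half, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Implicit parallelism would instead leave a plain sequential loop
           to the compiler and the hardware (pipeline, superscalar issue). */
        printf("sum = %f\n", partial[0] + partial[1]);
        return 0;
    }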

Flynn's taxonomy. In 1966 Michael J. Flynn proposed a classification of computers based on the number of instruction streams (one or many) and the number of data streams (one or many): SISD - single instruction stream, single data stream; SIMD - single instruction stream, multiple data streams; MISD - multiple instruction streams, single data stream; MIMD - multiple instruction streams, multiple data streams.

Flynn's taxonomy - SISD. A single processor capable of executing single instructions operating on a single data stream, e.g. the conventional von Neumann architecture.

Flynn's taxonomy - SIMD. A single instruction stream drives several processing elements: the same instruction is executed on all processing elements, each operating on a different data stream, e.g. vector processors.
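As a small illustration (not part of the original slides), the loop below is a typical SIMD candidate: the same addition is applied element-wise to independent data, so a vectorizing compiler can map several iterations onto one SIMD instruction. Array names and sizes are arbitrary.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        float x[N], y[N], z[N];

        for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

        /* The same operation on every element: a vectorizing compiler
           (e.g. at -O3) can execute several iterations per SIMD instruction. */
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];

        printf("z[100] = %f\n", z[100]);
        return 0;
    }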

Flynn's taxonomy - MISD. Executes different instructions on each processing element, all operating on the same data stream. (Useful only for a limited set of applications.)

Flynn's taxonomy - MIMD. Executes multiple instruction streams on multiple data streams, e.g. multiprocessors.

System Bus Architectures (Reference). Multi-master, point-to-point communication over a single system bus requires bus arbitration. Processors, co-processors and DMA controllers typically operate as bus masters.
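A minimal sketch (illustrating one common policy, not taken from the lecture) of round-robin arbitration: given the masters' request lines and the previously granted master, the arbiter grants the bus to the next requesting master in circular order. The number of masters is arbitrary.

    #include <stdio.h>

    #define NUM_MASTERS 4

    /* Round-robin bus arbiter: grant the next requesting master after
       'last_grant' in circular order; return -1 if nobody requests. */
    static int arbitrate(unsigned request_bits, int last_grant) {
        for (int offset = 1; offset <= NUM_MASTERS; offset++) {
            int m = (last_grant + offset) % NUM_MASTERS;
            if (request_bits & (1u << m))
                return m;
        }
        return -1;
    }

    int main(void) {
        int grant = 0;
        /* Masters 1 and 3 (e.g. a co-processor and a DMA controller) request the bus. */
        unsigned requests = (1u << 1) | (1u << 3);

        for (int cycle = 0; cycle < 4; cycle++) {
            grant = arbitrate(requests, grant);
            printf("cycle %d: bus granted to master %d\n", cycle, grant);
        }
        return 0;
    }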

System Bus Architectures (Reference). Time-multiplexing of data and addresses on common lines gives lower cost but lower performance.

System Bus Architectures (Reference). A computer can be designed to use multiple buses for different purposes. A cheaper solution is to include a bridge between the buses, typically used for e.g. USB or Ethernet.

System Bus Architectures: the Fetch and Store paradigm (Reference).

System Bus Architectures. Conclusions: a system bus can only perform one transfer at a time; it is thus a limited resource for communication, and more than one master (processors, co-processors and DMA controllers) can compete for access to it. How can the limitations on communication over a system bus be mitigated?

Switching fabrics. Significantly more expensive than a system bus.

AXI4 channel switch. Reference: Xilinx User Guide 1037. The Xilinx AXI4 bus is a derivative of the ARM AMBA bus, developed for SoC applications. The picture shows a switch for AXI4: it connects one or more similar AXI memory-mapped masters to one or more similar memory-mapped slaves.

AXI4 and AXI4-Lite bus. A master takes the initiative to a data transfer; the slave responds. The bus consists of five channels: read address channel, write address channel, read data channel, write data channel and write response channel. Data can move simultaneously in both directions. AXI4 allows bursts of up to 256 data transfers using only one address; AXI4-Lite allows only single data transactions.
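A minimal C sketch (an illustration, not Xilinx or Arm code) of how the five channels can be modelled. Each channel uses the VALID/READY handshake of the AMBA AXI specification, and a transfer takes place in a cycle where both signals are high; field widths are simplified.

    #include <stdint.h>
    #include <stdio.h>

    /* Each AXI channel uses a VALID/READY handshake: the source asserts VALID,
       the destination asserts READY, and the payload moves in the cycle where
       both are high. Field names follow the AMBA AXI signal names. */
    typedef struct { uint32_t awaddr; int awvalid, awready; } axi_write_addr_t;
    typedef struct { uint32_t wdata;  uint8_t wstrb; int wvalid, wready; } axi_write_data_t;
    typedef struct { uint8_t  bresp;  int bvalid, bready; } axi_write_resp_t;
    typedef struct { uint32_t araddr; int arvalid, arready; } axi_read_addr_t;
    typedef struct { uint32_t rdata;  uint8_t rresp; int rvalid, rready; } axi_read_data_t;

    /* The five channels of an AXI4-Lite interface. */
    typedef struct {
        axi_write_addr_t aw;  /* write address channel  */
        axi_write_data_t w;   /* write data channel     */
        axi_write_resp_t b;   /* write response channel */
        axi_read_addr_t  ar;  /* read address channel   */
        axi_read_data_t  r;   /* read data channel      */
    } axi4_lite_if_t;

    int main(void) {
        axi4_lite_if_t bus = {0};

        /* Master drives a single write: address phase and data phase. */
        bus.aw.awaddr = 0x40000000u; bus.aw.awvalid = 1;
        bus.w.wdata   = 0x12345678u; bus.w.wstrb = 0xF; bus.w.wvalid = 1;

        /* Slave is ready, so both handshakes complete in this cycle. */
        bus.aw.awready = 1; bus.w.wready = 1;
        if (bus.aw.awvalid && bus.aw.awready && bus.w.wvalid && bus.w.wready)
            printf("write of 0x%08X to 0x%08X accepted\n",
                   (unsigned)bus.w.wdata, (unsigned)bus.aw.awaddr);
        return 0;
    }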

AXI4 bus read operation Reference: Xilinx User Guide 1037

AXI4 bus write operation Reference: Xilinx User Guide 1037

AXI4-Stream. Unidirectional streaming of data from a master to a slave.

AXI4-Stream implementation. Reference: Xilinx User Guide 1037. Used for high-speed, data-centric streaming applications, e.g. video. TLAST indicates packet boundaries, TVALID indicates valid data.

AXI4-Stream Interconnect Reference: Xilinx User Guide 1037 Parallel routing of traffic between N masters and M slaves

Multiprocessor architectures (Reference). Challenges for multiprocessor architectures: communication, coordination and contention.

Challenges. Communication - must be scalable to handle communication between a large number of processors. Coordination - a strategy for how to distribute tasks among all processors is required. Contention - situations where two or more processors try to access a resource at the same time; this problem grows rapidly with an increasing number of processors, and problems occur in particular with memory accesses. Caching can mitigate this but introduces another problem, cache coherence: how to guarantee that the cache memories local to each processor carry the same data for common memory locations?
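As a small software illustration (not from the slides) of contention, the sketch below lets two threads update a shared counter. Without the mutex the concurrent read-modify-write sequences would interfere; the lock serializes access, much as bus arbitration serializes processors competing for shared memory.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 1000000

    static long shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Both threads contend for the same resource (the counter); the mutex
       serializes the accesses so the read-modify-write is not corrupted. */
    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
            pthread_mutex_lock(&lock);
            shared_counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, NULL);
        pthread_create(&t1, NULL, worker, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("counter = %ld (expected %d)\n", shared_counter, 2 * ITERATIONS);
        return 0;
    }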

Using Peripheral Processors (Reference).

Performance of Multi-Processor architectures (Reference).

Data Pipelining. Input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → Output data stream. A pipeline divides a larger computational task into a series of smaller tasks. Benefits: smaller tasks are less complex to describe, it allows for reuse of code modules, and it reveals coarse-grained parallelism that can be mapped onto a multi-processor architecture for increased throughput.

Data Pipelining. Input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → Output data stream. Necessary conditions: a partitionable problem, low communication overhead, and processor speed equivalent to that of the single-processor case.

Data Pipelining. Input data stream → Stage 1 → Stage 2 → Stage 3 → Stage 4 → Stage 5 → Output data stream. With stage processing times T1 ... T5, the throughput becomes 1/max(T1, ..., T5) [data items per time unit] and the latency becomes T1 + T2 + T3 + T4 + T5 [time units].
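A short C sketch (the stage times are made-up values) that evaluates these two expressions for a five-stage pipeline:

    #include <stdio.h>

    #define STAGES 5

    int main(void) {
        /* Assumed processing time per stage, in arbitrary time units. */
        double t[STAGES] = {2.0, 3.0, 5.0, 3.0, 2.0};

        double latency = 0.0, slowest = 0.0;
        for (int i = 0; i < STAGES; i++) {
            latency += t[i];                       /* latency = sum of stage times  */
            if (t[i] > slowest) slowest = t[i];    /* throughput set by the slowest */
        }

        printf("latency    = %.1f time units\n", latency);
        printf("throughput = %.2f items per time unit\n", 1.0 / slowest);
        return 0;
    }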

Data Flow Graph (figure: a graph with four actors between the input and output data streams). A data flow graph describes computations without including any information on how the computation is going to be carried out; hence only data flow, and no control flow, is described. This programming paradigm is supported by functional languages such as DFL, is suitable for digital signal processing systems and is also ideal for capturing pipelined computations. Imperative languages such as C and C++ model both control and data flow and are not well suited for capturing parallelism.

Data Pipelining on FPGA logic (figure: input data stream → combinatorial network CN → D-Q register → output data stream, clocked by Clk). A large combinatorial network drives an output register. If the propagation delay time of CN is t_pd, the maximum frequency for the clock signal Clk becomes f_max = 1/t_pd.

Data Pipelining on FPGA logic. Assume that CN is partitionable into M smaller combinatorial networks CN1 ... CNM and that registers are inserted between all the combinatorial nets. With stage propagation delays t_pd,1 ... t_pd,M, the maximum clock frequency becomes f_max = 1/max(t_pd,1, ..., t_pd,M) and the latency becomes M clock cycles.
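An illustrative calculation (the delay numbers are assumptions, not from the lecture): splitting a combinatorial network with 20 ns total delay into M = 4 stages raises the maximum clock frequency, while the latency grows to M clock cycles.

    #include <stdio.h>

    #define M 4

    int main(void) {
        /* Assumed propagation delays in nanoseconds. */
        double t_pd_total = 20.0;                 /* unpipelined network        */
        double t_pd[M] = {6.0, 5.0, 5.0, 4.0};    /* after splitting into M CNs */

        double slowest = 0.0;
        for (int i = 0; i < M; i++)
            if (t_pd[i] > slowest) slowest = t_pd[i];

        printf("unpipelined: f_max = %.0f MHz, latency = 1 cycle\n",
               1000.0 / t_pd_total);
        printf("pipelined:   f_max = %.0f MHz, latency = %d cycles\n",
               1000.0 / slowest, M);
        return 0;
    }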

Power in computational logic. The dynamic energy consumed when changing the state of a CMOS logic output is E = (1/2)·C_L·V_DD², where C_L is the total capacitive load of the output and V_DD is the supply voltage. The average dynamic power then becomes P = E·f = (1/2)·C_L·V_DD²·f. We can conclude that power dissipation is proportional to the clock frequency and proportional to the square of the supply voltage. Trying to increase the speed of a processor simply by increasing the clock frequency, at the same time as physical scaling of the technology increases the integration density, can only be done until the power wall is reached; with current technology this is roughly P = 100 W.
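A small numeric illustration (the component values are assumptions) of the two proportionalities: doubling the frequency doubles the dynamic power, while raising the supply voltage by 20 % increases it by 44 %.

    #include <stdio.h>

    /* Average dynamic power P = 1/2 * C_L * V_DD^2 * f */
    static double dyn_power(double c_load, double v_dd, double freq) {
        return 0.5 * c_load * v_dd * v_dd * freq;
    }

    int main(void) {
        double c_load = 1e-9;   /* assumed switched capacitance: 1 nF */
        double v_dd   = 1.0;    /* assumed supply voltage: 1.0 V      */
        double freq   = 1e9;    /* assumed clock frequency: 1 GHz     */

        printf("baseline:      %.2f W\n", dyn_power(c_load, v_dd, freq));
        printf("2x frequency:  %.2f W\n", dyn_power(c_load, v_dd, 2 * freq));
        printf("1.2x voltage:  %.2f W\n", dyn_power(c_load, 1.2 * v_dd, freq));
        return 0;
    }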

Power in computational logic. The delay time for a gate can be approximated as t_d = k·V_DD/(V_DD − V_T)^α, where V_T is the CMOS threshold voltage and k and α are technology-dependent constants. For larger supply voltages the delay depends mostly on k and V_DD, but the delay increases dramatically when V_DD is decreased close to V_T. Dynamic voltage and frequency scaling means that both the supply voltage and the clock frequency are adjusted so that a processor delivers just enough speed. A reduction of both frequency and supply voltage results in a dramatic reduction of dynamic power consumption.
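An illustrative DVFS calculation (k, α, V_T and the operating points are assumed values): if the workload only needs half the clock frequency, the supply voltage can be lowered as well, as long as the increased gate delay still fits within the longer clock period, and since P is proportional to f·V_DD² the power drops by far more than half.

    #include <stdio.h>
    #include <math.h>

    /* Gate delay according to the approximation above:
       t_d = k * V_DD / (V_DD - V_T)^alpha  (k, alpha, V_T are assumed values). */
    static double gate_delay(double v_dd) {
        const double k = 1.0, v_t = 0.3, alpha = 1.5;
        return k * v_dd / pow(v_dd - v_t, alpha);
    }

    /* Relative dynamic power, P ~ f * V_DD^2 (capacitance factored out). */
    static double rel_power(double freq, double v_dd) {
        return freq * v_dd * v_dd;
    }

    int main(void) {
        double v_nom = 1.0, f_nom = 1.0;          /* nominal operating point */
        double v_low = 0.7, f_low = 0.5;          /* scaled operating point  */

        printf("delay grows by factor %.2f at the lower voltage\n",
               gate_delay(v_low) / gate_delay(v_nom));
        printf("relative power: nominal %.2f, scaled %.2f (%.0f%% saved)\n",
               rel_power(f_nom, v_nom), rel_power(f_low, v_low),
               100.0 * (1.0 - rel_power(f_low, v_low) / rel_power(f_nom, v_nom)));
        return 0;
    }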

Using sleep mode to control energy consumption (Reference). Let E_sd be the energy consumed during shutdown and E_wu the energy consumed during wakeup. The energy consumed when running the processor for a time t is E_run = P_run·t, while the energy consumed when going to sleep for the time t is E_sleep = E_sd + P_sleep·t + E_wu. Energy can be saved when E_sleep < E_run.
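A short C sketch (all power and energy figures are assumptions) that evaluates this inequality and solves it for the break-even idle time t > (E_sd + E_wu)/(P_run − P_sleep):

    #include <stdio.h>

    int main(void) {
        /* Assumed parameters for a small embedded processor. */
        double p_run   = 0.200;   /* running power, W   */
        double p_sleep = 0.001;   /* sleep power, W     */
        double e_sd    = 0.010;   /* shutdown energy, J */
        double e_wu    = 0.020;   /* wakeup energy, J   */

        /* Sleeping pays off when E_sd + P_sleep*t + E_wu < P_run*t. */
        double t_break_even = (e_sd + e_wu) / (p_run - p_sleep);
        printf("break-even idle time: %.2f s\n", t_break_even);

        double t = 1.0;           /* candidate idle interval, s */
        double e_run   = p_run * t;
        double e_sleep = e_sd + p_sleep * t + e_wu;
        printf("idle %.1f s: run %.3f J, sleep %.3f J -> %s\n",
               t, e_run, e_sleep, e_sleep < e_run ? "sleep" : "stay awake");
        return 0;
    }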

Example: Battery-powered oil detector for wastewater. A smart sensor can detect petroleum contamination in wastewater. Numerous sensors are installed at selected checkpoints, which allows tracing of the sources of contamination. The task of the sensor is to measure the wastewater every 15 minutes and to send alarm data over a radio link whenever a contamination is detected. This task finishes in milliseconds, while the rest of the 15-minute cycle is spent sleeping.
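As a rough illustration (only the 15-minute cycle and the millisecond-scale active time come from the example; all power figures are assumptions), the sketch below compares the average power of the duty-cycled sensor with one that never sleeps:

    #include <stdio.h>

    int main(void) {
        /* Cycle from the example: measure every 15 minutes, active a few ms. */
        double t_cycle  = 15.0 * 60.0;   /* s                    */
        double t_active = 0.005;         /* assumed: 5 ms active */

        /* Assumed power levels for the sensor node. */
        double p_active = 0.150;         /* W while measuring/transmitting */
        double p_sleep  = 20e-6;         /* W in sleep mode                */

        double p_avg = (p_active * t_active + p_sleep * (t_cycle - t_active)) / t_cycle;

        printf("average power, duty-cycled: %.1f uW\n", p_avg * 1e6);
        printf("average power, always on:   %.0f uW\n", p_active * 1e6);
        printf("battery life improvement:   ~%.0fx\n", p_active / p_avg);
        return 0;
    }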