Lecture 12: EIT090 Computer Architecture


Anders Ardö
EIT Electrical and Information Technology, Lund University
December 1, 2009

Taxonomy

Flynn (1966)
- SISD (Single Instruction stream, Single Data stream): traditional uniprocessor
- SIMD (Single Instruction stream, Multiple Data stream): vector processors
- MISD (Multiple Instruction stream, Single Data stream): no commercial examples
- MIMD (Multiple Instruction stream, Multiple Data stream): multiprocessor

Small-scale MIMD designs

- Symmetric shared-memory MultiProcessors (SMP) with Uniform Memory Access time (UMA) and bus interconnect
- Often limited to 20-30 processors

Distributed memory machines

- Uses an interconnection network to connect processor-memory nodes => NUMA
- Scalable to a large number of nodes
- Can be either shared or private address space

Shared memory vs. message passing

Message passing:
- The programmer must explicitly distribute data
- No execution overhead between explicit communication
Shared memory:
- The same data structures as in the sequential program can be used
- Shared access can lead to high communication overhead

The cache coherence problem

- A read operation from address X must see the latest value produced by a write to address X
- With several copies of X, this may be a problem

Techniques:
- Hardware-based protocols: transparent to the software system, but increase the complexity of the machine
- Software-based protocols: require the user/compiler to detect when it is safe to cache, but do not require sophisticated hardware. Hard to do => limited use

Policies:
- Write-invalidate: remove (invalidate) other processors' copies of a data item when it is written
- Write-update: update other processors' copies of a data item when it is written

Cache coherence protocols

Snooping
- Status for a block is stored in every cache that has a copy of the block
- Caches monitor (snoop) the shared bus to update status and take actions
- Popular with a single shared memory
Directory based
- Status for a block is stored in one location (the directory)
- Messages are used to update status
- Popular with distributed shared memory
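The write-invalidate policy combined with snooping can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the slides: the class names `Bus` and `Cache` are hypothetical, each cache line holds a single value, and writes go straight through to memory so that no dirty state has to be tracked.

```python
class Bus:
    """Shared bus: every attached cache observes every write (snooping)."""

    def __init__(self):
        self.memory = {}   # addr -> value, the single shared memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, writer, addr):
        # All caches except the writer drop their copy of addr.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)


class Cache:
    """Private cache of one processor; only valid copies are stored."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}    # addr -> value
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(self, addr)  # write-invalidate policy
        self.lines[addr] = value
        self.bus.memory[addr] = value              # write-through (assumption)

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)


bus = Bus()
p0, p1 = Cache(bus), Cache(bus)
X = 0x100

p0.write(X, 1)
first = p1.read(X)    # p1 now holds its own copy of X
p0.write(X, 2)        # p1's copy is invalidated over the bus
second = p1.read(X)   # miss, so p1 re-fetches the latest value
```

Without the invalidation step, the second read by `p1` would hit its stale copy, which is exactly the coherence problem described above. A write-update variant would push the new value into the other caches instead of removing their copies.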

Synchronization

Why synchronize?
- We need to know when it is safe for different processes to use shared data
Issues for synchronization:
- How do we implement the LOCK operation?
- Uninterruptible instruction to fetch and update (atomic operation)
- User-level synchronization operation using this primitive
- For large-scale multiprocessors, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization are needed
- Atomic exchange, test-and-set, fetch-and-add

Consistency models

Sequential consistency
- Serializing
- Write operations must stall until performed!
Relaxed consistency
- A relaxed consistency model allows operations to be observed out of order between synchronization operations
- Possible to obtain significant performance advantages

TLP: Thread Level Parallelism

Allow multiple threads to share functional units of a processor.
- Coarse multithreading: thread switch on costly stalls
- Fine multithreading: thread switch each instruction issue slot
- Simultaneous multithreading (SMT): several threads can issue instructions simultaneously (combines ILP and TLP)

Clusters

- Loosely coupled desktop machines
- No shared memory
- High-bandwidth, switch-based LAN
- Standard off-the-shelf components => cheap
- Easy to scale
- High availability
- High administration cost
- Major problem is power (servers and cooling)
- Supercomputers
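Returning to the LOCK operation from the synchronization slide, a user-level spin lock can be built on top of an atomic test-and-set primitive. The sketch below models this in Python. The names `TestAndSetCell` and `SpinLock` are hypothetical, and the internal `threading.Lock` only stands in for the hardware guarantee that the fetch-and-update cannot be interrupted; it is not how a real machine implements the instruction.

```python
import threading
import time


class TestAndSetCell:
    """One memory word with an atomic test-and-set operation (modelled)."""

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()  # stand-in for hardware atomicity

    def test_and_set(self):
        # Atomically return the old value and set the word to 1.
        with self._atomic:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._atomic:
            self._value = 0


class SpinLock:
    """User-level lock built only from the test-and-set primitive."""

    def __init__(self):
        self._cell = TestAndSetCell()

    def lock(self):
        # Spin until the old value was 0, meaning we acquired the lock.
        while self._cell.test_and_set() == 1:
            time.sleep(0)  # yield so the current holder can make progress

    def unlock(self):
        self._cell.clear()


counter = 0
spin = SpinLock()


def worker(iterations):
    global counter
    for _ in range(iterations):
        spin.lock()
        counter += 1       # critical section on shared data
        spin.unlock()


threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every waiting thread repeatedly executes the atomic operation on the same word, which hints at why synchronization becomes a bottleneck on large-scale multiprocessors and why techniques for reducing contention are needed.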

Lecture 12 agenda

Appendix D in "Computer Architecture"

Embedded processors

- A device that includes a (programmable) computer
- But is not itself a general-purpose computer
- Fastest growing segment
- Washing machines, cars, cell phones, TVs, ...
- Wide range: low-end 8-bit to full-size 32-bit
- Price: key factor
- Performance, power, real time
- Applications
- Types: ASIC, SoC, DSP

Embedded systems overview

Embedded computing systems
- Computing systems embedded within electronic devices
- Hard to define. Nearly any computing system other than a desktop computer
- Billions of units produced yearly, versus millions of desktop units
- Perhaps 50 per household and per automobile

[Figure captions: "Computers are in here... and here... and even here..."; "Lots more of these, though they cost a lot less each."]

[Figure: TA-150 Computer Controlled Stereo Receiver]

A short list of embedded systems

Anti-lock brakes, auto-focus cameras, automatic teller machines, automatic toll systems, automatic transmission, avionic systems, battery chargers, camcorders, cell phones, cell-phone base stations, cordless phones, cruise control, curbside check-in systems, digital cameras, disk drives, electronic card readers, electronic instruments, electronic toys/games, factory control, fax machines, fingerprint identifiers, home security systems, life-support systems, medical testing systems, modems, MPEG decoders, network cards, network switches/routers, on-board navigation, pagers, photocopiers, point-of-sale systems, portable video games, printers, satellite phones, scanners, smart ovens/dishwashers, speech recognizers, stereo systems, teleconferencing systems, televisions, temperature controllers, theft tracking systems, TV set-top boxes, VCRs, DVD players, video game consoles, video phones, washers and dryers.

And the list goes on and on.

Some common characteristics of embedded systems

- Single-functioned: executes a single program, repeatedly
- Tightly-constrained: low cost, low power, small, fast, etc.
- Reactive and real-time: continually reacts to changes in the system's environment; must compute certain results in real-time without delay

Embedded system

[Figure: an embedded real time system and its environment; labels: Sensors, Input, Control Output, Actuators]

An embedded system example -- a digital camera

[Block diagram: lens and CCD in front of the digital camera chip; blocks: A2D, CCD preprocessor, Pixel coprocessor, D2A, JPEG codec, Microcontroller, Multiplier/Accum, DMA controller, Display ctrl, Memory controller, ISA bus interface, UART, LCD ctrl]

- Single-functioned -- always a digital camera
- Tightly-constrained -- low cost, low power, small, fast
- Reactive and real-time -- only to a small extent

Case study: Axis Etrax

From "Computer Architecture in Industry" by Kenny Ranerup

Computer Architecture in Industry (Kenny Ranerup '03)

ETRAX And Other Processors At Axis

- ASICs and processors have been developed at Axis Communications for many years.
- The first generation, CGA, was a special processor designed for parsing the IBM mainframe communication protocol.
- The second generation was a complete System on Chip ASIC for the same IBM mainframe market. The processor was a 6809 compatible design developed internally.
- ETRAX was the 3rd generation of ASICs developed at Axis Communications. This SoC was targeted for the Print Server market and contained a new CPU architecture called CRIS.
- The fourth generation, ETRAX100, broadened the ETRAX platform to other applications and increased performance both on network interface and processor.
- Other special purpose processors have been developed, e.g. for controlling a camera ASIC and a programmable I/O processor.

The CRIS CPU Architecture

- 32-bit data and addresses.
- 16-bit instruction width with some variable size instructions.
- RISC inspired instruction set but with complex addressing modes.
- 16 general purpose 32-bit registers.
- Condition code register for compare and branch instructions.

Data Organization in Memory

- CRIS is a little endian CPU.
- Data has no alignment restrictions, but there is a performance penalty for unaligned data accesses.
- Instructions must be word aligned.

Instruction Format

- Basic instruction format is 16 bits and must be word aligned.
- Two register operands.
- Byte, word, dword operand size.
- Addressing mode.

Bits    15-12       11-10   9-6      5-4    3-0
Field   operand 2   mode    opcode   size   operand 1

ETRAX 100 Block Diagram
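The basic instruction format above can be made concrete with a small field decoder. This is a sketch only: the bit positions are taken from the format shown on the slide, but the names `Fields`, `decode` and `fetch_little_endian` are hypothetical, and the slides do not say what the individual mode, opcode or size values mean, so the example does not interpret them.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    operand2: int  # bits 15..12
    mode: int      # bits 11..10
    opcode: int    # bits 9..6
    size: int      # bits 5..4
    operand1: int  # bits 3..0


def decode(word: int) -> Fields:
    """Split a 16-bit instruction word into its five fields."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    return Fields(
        operand2=(word >> 12) & 0xF,
        mode=(word >> 10) & 0x3,
        opcode=(word >> 6) & 0xF,
        size=(word >> 4) & 0x3,
        operand1=word & 0xF,
    )


def fetch_little_endian(memory: bytes, addr: int) -> int:
    """Read one 16-bit word, least significant byte first."""
    return int.from_bytes(memory[addr:addr + 2], "little")


# Two bytes as they would sit in memory on a little endian CPU.
memory = bytes([0xC3, 0xA5])
word = fetch_little_endian(memory, 0)
fields = decode(word)
```

With four bits per register operand, the format can name exactly the 16 general purpose registers listed for the CRIS architecture.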

Computer Architecture in Industry: Architectural Experiments

- Measurement of instruction and address traces on running product.
- Trace driven cache simulator to determine cache configuration and algorithms.
- Effects of expanding datapath from 16 to 32 bits.
- Analysis of instruction traces and static code to find possible instruction set improvements.
- Code analysis to find the effects of C++ on instruction mix.
- Sketches of changes to CPU pipelining.
- Gate-level remapping of CPU to new technology to estimate cycle time and pipelining.
- Sketches of a zero-copy DMA architecture for network and peripherals.

Axis Etrax FS

Axis Network Camera

Design challenge -- optimizing design metrics

Common metrics
- Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost
- NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system
- Size: the physical space required by the system
- Performance: the execution time or throughput of the system
- Power: the amount of power consumed by the system
- Flexibility: the ability to change the functionality of the system without incurring heavy NRE cost
- Time-to-prototype: the time needed to build a working version of the system
- Time-to-market: the time required to develop a system to the point that it can be released and sold to customers
- Maintainability: the ability to modify the system after its initial release
- Correctness, safety, many more

Design metric competition -- improving one may worsen others

[Figure: competing metrics (performance, power, size, NRE cost) shown next to the digital camera chip block diagram]

- Expertise with both software and hardware is needed to optimize design metrics
- Not just a hardware or software expert, as is common
- A designer must be comfortable with various technologies in order to choose the best for a given application and constraints

Design methodologies

- Heterogeneous systems: hardware (digital, analog), software
- Heterogeneous components: SoC, CPU, DSP, ASIC, bus, ...
- Heterogeneous requirements: performance, cost, power, ...
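The trade-off between unit cost and NRE cost defined above can be worked through numerically. The sketch below assumes the usual relation total cost = NRE cost + unit cost x number of units, which follows from the two definitions. The function names and all cost figures are made up for illustration and do not come from the slides.

```python
def total_cost(nre_cost: float, unit_cost: float, units: int) -> float:
    """One-time design cost plus manufacturing cost of all copies."""
    return nre_cost + unit_cost * units


def per_product_cost(nre_cost: float, unit_cost: float, units: int) -> float:
    """Total cost spread over every unit produced."""
    return total_cost(nre_cost, unit_cost, units) / units


def break_even_units(nre_a, unit_a, nre_b, unit_b) -> float:
    """Volume at which two technologies have the same total cost."""
    return (nre_b - nre_a) / (unit_a - unit_b)


# Hypothetical technologies: A is cheap to design but expensive per copy,
# B needs a large one-time investment but is cheap per copy.
tech_a = {"nre_cost": 2_000, "unit_cost": 100}
tech_b = {"nre_cost": 30_000, "unit_cost": 30}

low_volume = (per_product_cost(**tech_a, units=100),
              per_product_cost(**tech_b, units=100))
high_volume = (per_product_cost(**tech_a, units=1_000),
               per_product_cost(**tech_b, units=1_000))
crossover = break_even_units(2_000, 100, 30_000, 30)
```

With these invented numbers, technology A is cheaper per product at 100 units and technology B is cheaper at 1,000 units, which illustrates why improving one metric (unit cost) may worsen another (NRE cost) and why the expected volume matters for the choice.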

Hardware vs software

Hardware
- performance
- power
- cost
Software
- flexibility
- reconfigurability
- cost

Real time

Real time performance
- React to external environment
- Permanent interaction
- Endless execution
- External timing requirements
Special application areas
- video
- process control
- medical applications
- airplane control - JAS
Hard vs soft real time requirements
Analyses
- WCET - Worst Case Execution Time

Processor technology

- The architecture of the computation engine used to implement a system's desired functionality
- Processor does not have to be programmable
- "Processor" not equal to general-purpose processor

[Figure: three processor organizations for the same computation (total = 0, for i = 1 to ...)
  General-purpose ("software"): controller (control logic and state register, IR, PC), datapath (register file, general ALU), program memory holding the assembly code, data memory
  Application-specific: controller (control logic and state register, IR, PC), datapath (registers, custom ALU), program memory holding the assembly code, data memory
  Single-purpose ("hardware"): controller (control logic, state register), datapath (index, total, +), data memory]

Processor technology

- Processors vary in their customization for the problem at hand

Desired functionality:
  total = 0
  for i = 1 to N loop
    total += M[i]
  end loop

[Figure: the desired functionality mapped onto a general-purpose processor, an application-specific processor and a single-purpose processor]

General-purpose processors

- Programmable device used in a variety of applications
  - Also known as "microprocessor"
- Features
  - Program memory
  - General datapath with large register file and general ALU
- User benefits
  - Low time-to-market and NRE costs
  - High flexibility
- "Pentium" the most well-known, but there are hundreds of others
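The summation loop used as the "desired functionality" can be written both as ordinary software and as a model of the single-purpose organization, where a small controller steps a datapath that contains only the `index` and `total` registers and an adder. This is a rough sketch: the state names `INIT`, `TEST` and `ADD` are assumptions, one loop iteration of the model is not meant to match a real clock cycle count, and Python indexes `M` from 0 where the slide counts from 1.

```python
def sum_software(M):
    """General-purpose view: the loop is a program run by a generic CPU."""
    total = 0
    for i in range(len(M)):
        total += M[i]
    return total


def sum_single_purpose(M):
    """Single-purpose view: a fixed controller FSM drives a tiny datapath.

    The only storage is the state register plus the datapath registers
    index and total; there is no program memory.
    """
    state = "INIT"
    index = 0
    total = 0
    steps = 0
    while state != "DONE":
        steps += 1
        if state == "INIT":
            index, total = 0, 0            # clear datapath registers
            state = "TEST"
        elif state == "TEST":
            state = "ADD" if index < len(M) else "DONE"
        elif state == "ADD":
            total = total + M[index]       # the adder in the datapath
            index = index + 1
            state = "TEST"
    return total, steps


data = [3, 1, 4, 1, 5]
software_result = sum_software(data)
hardware_result, controller_steps = sum_single_purpose(data)
```

Both versions compute the same total. The difference lies in what surrounds the computation: the software version relies on a general datapath and a stored program, while the single-purpose model hard-wires the control flow into its states.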

Single-purpose processors

- Digital circuit designed to execute exactly one program
  - a.k.a. coprocessor, accelerator or peripheral
- Features
  - Contains only the components needed to execute a single program
  - No program memory
- Benefits
  - Fast
  - Low power
  - Small size

[Figure: controller (control logic, state register), datapath (index, total, +), data memory]

Application-specific processors

- Programmable processor optimized for a particular class of applications having common characteristics
  - Compromise between general-purpose and single-purpose processors
- Features
  - Program memory
  - Optimized datapath
  - Special functional units
- Benefits
  - Some flexibility, good performance, size and power

[Figure: controller (control logic and state register, IR, PC), datapath (registers, custom ALU), program memory holding the assembly code, data memory]

Summary

- Important, found everywhere, high volume
- Hardware + software design
- Cover several areas: microelectronics, real time, software + hardware, SoC
- General purpose, application specific, single purpose