Embedded processors
Timo Töyry
Department of Computer Science and Engineering
Aalto University, School of Science
timo.toyry(at)aalto.fi

Comparing processors
- Evaluating processors
- Taxonomy of processors
- Embedded vs. general-purpose processors
- RISC
- DSPs

Evaluating processors
- Performance: average, peak, worst-case, best-case
- Cost
- Energy and power consumption
- Other criteria: predictability, security

Taxonomy of processors
- Flynn's taxonomy of processors:
  - Single-instruction, single-data (SISD)
  - Single-instruction, multiple-data (SIMD)
  - Multiple-instruction, multiple-data (MIMD)
  - Multiple-instruction, single-data (MISD)
- Instruction set style: RISC or CISC
- Instruction issuing: single issue vs. multiple issue, static scheduling vs. dynamic scheduling
- Parallelism support: vector processing, multithreading

Embedded vs. general-purpose processors
- General-purpose processors are designed to perform well across many different tasks
- Embedded processors usually specialize in some task
- The market for embedded processors is much larger than the market for general-purpose processors
- Embedded processors commonly use a RISC architecture

RISC processors
- Often single-issue processors using the von Neumann architecture
- Pipelined execution: instructions are simple enough that each pipeline stage completes in one clock cycle
Figure: The ARM11 pipeline, (C) Wayne Wolf

DSPs
- Instruction set is usually specialized for signal processing
- The Harvard architecture is common in DSPs
- DSPs usually have a single-cycle multiply-accumulate (MAC) instruction (see the sketch below)
Figure: A block diagram of a DSP, (C) Wayne Wolf
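
To make the MAC point concrete, here is a minimal FIR-filter inner loop in C (the names and types are illustrative, not from the slides). On a DSP with a single-cycle MAC, each iteration of the loop body maps to one multiply-accumulate instruction.

```c
#include <stdint.h>

/* Minimal FIR-filter inner loop: the multiply-and-add in the loop body is
 * exactly the pattern a DSP's single-cycle MAC instruction executes. */
int32_t fir(const int16_t *x, const int16_t *h, int taps) {
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (int32_t)x[i] * h[i];   /* one multiply-accumulate per tap */
    return acc;
}
```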

Parallel execution mechanisms
- Very long instruction word (VLIW) processors
- Superscalar processors
- SIMD and vector processors
- Thread-level parallelism
- Processor resource utilization

Very long instruction word processors
- Initially designed as general-purpose processors, but have found more use in embedded systems
- Instructions are grouped into packets
- A packet contains multiple instructions, which are executed in parallel by different functional units of the processor
- The execution unit for an instruction is determined by its position in the packet
- The compiler is responsible for ordering the instructions into packets (see the sketch below)
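
A hedged sketch of what the compiler's job looks like: the two statements below have no data dependence, so a VLIW compiler could place them in the same packet and dispatch them to different functional units in one cycle. The slot names in the comments are hypothetical.

```c
/* Two independent operations that a VLIW compiler could bundle into one
 * packet; which unit executes each is fixed by its position in the packet. */
void kernel(int *a, const int *b, const int *c) {
    int sum  = b[0] + c[0];   /* could occupy, e.g., an ALU slot      */
    int prod = b[1] * c[1];   /* could occupy a multiplier slot       */
    a[0] = sum;               /* no dependence between the two,       */
    a[1] = prod;              /* so both can issue in the same cycle  */
}
```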

Superscalar processors
- The dominant architecture in desktop and server machines
- More than one instruction can be issued per clock cycle
- Possible resource conflicts between instructions are checked on the fly
- Not used as much in embedded systems as in desktops and servers

SIMD and vector processors
- Exploit data parallelism
- Operand data sizes: if operands are small, a technique called subword parallelism can be used
- When exploiting subword parallelism, the CPU's ALU is split into smaller ALUs (see the sketch below)
- Vector processing
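
A minimal sketch of subword parallelism (sometimes called SWAR), assuming 8-bit operands packed into a 32-bit word: the masking emulates in software what the split ALUs do in hardware, keeping a carry in one byte from spilling into its neighbour.

```c
#include <stdint.h>

/* Four 8-bit additions (with per-byte wrap-around) in one 32-bit operation. */
uint32_t add4x8(uint32_t a, uint32_t b) {
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* add low 7 bits of each byte */
    uint32_t msb  = (a ^ b) & 0x80808080u;                 /* top bits, added without carry */
    return low7 ^ msb;
}
```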

Thread-level parallelism
- Exploits task-level parallelism
- Architectures featuring hardware multithreading require larger register banks, since a separate set of registers must be allocated to each running thread
- In simultaneous multithreading (SMT), multiple threads are executed in parallel

Processor resource utilization
- The characteristics of the intended workloads should be studied and taken into account when selecting a processor for a system
- In many cases, such as multimedia, parallelism must be exploited at multiple levels

Variable-performance CPU architectures
- Power wall
- Dynamic voltage and frequency scaling
- Clock tree nightmare
- Better-than-worst-case design

Dynamic voltage and frequency scaling (DVFS)
- A popular technique for controlling the power consumption of the processor
- Takes advantage of the wide operating voltage range of CMOS chips
- The maximum possible clock frequency is an almost linear function of the supply voltage
- Energy consumption is proportional to the square of the supply voltage (see the worked example below)
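
These two proportionalities can be made concrete with the usual CMOS dynamic-power model, P ≈ C_eff · V² · f. The capacitance and operating points below are made-up illustration values, not data for any real chip.

```c
#include <stdio.h>

/* Dynamic power of a CMOS circuit: P = C_eff * V^2 * f. Since the attainable
 * clock frequency scales roughly with V, running slower at a lower voltage
 * reduces the energy *per cycle* quadratically in V. */
static double power(double c_eff, double volt, double freq) {
    return c_eff * volt * volt * freq;
}

int main(void) {
    double c = 1e-9;                       /* effective capacitance [F], illustrative */
    double p_hi = power(c, 1.2, 1.00e9);   /* 1.2 V @ 1.0 GHz */
    double p_lo = power(c, 0.9, 0.75e9);   /* 0.9 V @ 750 MHz */
    printf("power:        %.3f W vs %.3f W\n", p_hi, p_lo);
    printf("energy/cycle: %.2fx\n",
           (p_hi / 1.00e9) / (p_lo / 0.75e9));  /* (1.2/0.9)^2, about 1.78x */
    return 0;
}
```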

Better-than-worst-case design
- Traditionally, all digital systems are governed by a system clock
- The clock speed must be chosen carefully so that computation always finishes within a clock cycle
- On average, the system will therefore be idle for some time during each clock cycle
- Better-than-worst-case design assumes that operations can usually be completed faster than the worst case
- Instead of waiting for the rare worst case in every clock cycle, use a shorter clock cycle and detect the errors the worst cases cause (see the sketch below)
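
A toy model of the idea, in the spirit of Razor-style error-detecting pipelines; the 1% error rate and the replay penalty are made-up illustration values.

```c
#include <stdlib.h>

/* Count cycles for n operations on an optimistic clock: each op normally
 * takes 1 fast cycle; with 1% probability it hits the worst-case path, the
 * timing error is detected, and the op is replayed at a 2-cycle penalty.
 * Average cost is about 1.02 fast cycles per op, instead of every op paying
 * for the worst case under a conservative clock. */
long cycles_for(long n) {
    long cycles = 0;
    for (long i = 0; i < n; i++) {
        cycles += 1;
        if (rand() % 100 == 0)  /* rare worst case detected */
            cycles += 2;        /* replay penalty           */
    }
    return cycles;
}
```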

CPU memory hierarchy
- Memory component models
- Register files
- Caches
- Scratch pad memories

Memory component models
- Needed for evaluating memory design methods
- They model the physical properties of memories: area, delay, energy

Register files
- The memory nearest to the CPU core
- The size of the register file is a key parameter of CPU design
- Too small a register file leads to inefficient operation, since programs must spill the contents of some registers to main memory
- Too large a register file unnecessarily increases the manufacturing cost of the chip and its energy consumption

Caches
- Fast memory between the register file and main memory
- Cache size has trade-offs similar to register file size
- Set associativity: higher associativity lets more of the independent memory locations that map to the same cache location be resident at once
- Line size: longer lines give more prefetching bandwidth
- Configurable caches allow set associativity and line size to be changed at runtime (see the sketch below)
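
A minimal sketch of how these parameters show up in address decoding, assuming (as an illustration) 32-byte lines and 256 sets; with, say, 4-way associativity, four lines whose addresses share a set index can be resident at once.

```c
#include <stdint.h>

#define LINE_SIZE 32u    /* bytes per cache line (assumed) */
#define NUM_SETS  256u   /* number of sets (assumed)       */

/* The line address splits into a set index (which set to look in) and a
 * tag (compared against all ways of that set to detect a hit). */
uint32_t set_index(uint32_t addr) { return (addr / LINE_SIZE) % NUM_SETS; }
uint32_t tag_bits(uint32_t addr)  { return (addr / LINE_SIZE) / NUM_SETS; }
```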

Scratch pad memories
- A small memory near the processor, like a cache, but without automatic management hardware
- Contents are managed by software, usually through a combination of compile-time and run-time decisions
- The main advantage of a scratch pad memory over a cache is the predictability of its access time (see the sketch below)
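
A minimal sketch of software-managed scratch pad use; SPM_BASE is a hypothetical address, and on a real part it would come from the chip's memory map or the linker script. Nothing is fetched automatically: the program copies hot data in and out, which is exactly why the access time is predictable.

```c
#include <string.h>
#include <stdint.h>

#define SPM_BASE ((uint8_t *)0x20000000u)  /* hypothetical scratch pad address */
#define SPM_SIZE 4096u                     /* hypothetical scratch pad size    */

void process(uint8_t *buf, size_t n);      /* hot loop, now hitting fast memory */

void run_on_spm(uint8_t *data, size_t n) {
    if (n > SPM_SIZE) n = SPM_SIZE;
    memcpy(SPM_BASE, data, n);   /* explicit, software-managed fill   */
    process(SPM_BASE, n);
    memcpy(data, SPM_BASE, n);   /* write results back to main memory */
}
```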

Code compression
- One method to reduce program size
- Can also decrease energy consumption and improve performance
- Implementing a system that runs compressed code is quite straightforward and does not require big changes to the processor or the compiler
- Branches cause some trouble: branch destination addresses and offsets are different in compressed code, and branches waste some resources
- Some processors have a compressed instruction set as an extension of the basic instruction set, for example the ARM 16-bit Thumb instruction set

Code and data compression
- Inspired by the success of code compression
- Maintaining a compressed data set in memory may incur more overhead than compressed program code
- There is a trade-off between data size and the performance and energy consumption overhead
- May save significant amounts of energy

Low-power bus encoding
- The memory buses that connect the CPU to the cache and main memory are responsible for a significant fraction of the total energy consumption of the CPU
- Bus energy consumption is proportional to the number of state changes on the bus lines
- Bus encoding algorithms try to minimize the number of state changes (see the sketch below)
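
One classic algorithm of this kind is bus-invert coding, sketched below for a 32-bit bus with one extra invert line: if sending the new word would toggle more than half of the data lines, its complement is sent instead and the invert line is asserted.

```c
#include <stdint.h>

typedef struct { uint32_t lines; int invert; } bus_word_t;

static int popcount32(uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }   /* clear lowest set bit */
    return n;
}

/* Choose between the value and its complement, whichever causes fewer
 * transitions relative to what is currently on the bus lines. */
bus_word_t encode(uint32_t prev_lines, uint32_t value) {
    if (popcount32(prev_lines ^ value) > 16)
        return (bus_word_t){ ~value, 1 };  /* inverted word toggles fewer lines */
    return (bus_word_t){ value, 0 };
}
```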

Security
- Embedded systems face attacks similar to those on desktop and server systems
- There are also new ways to attack embedded systems, such as side-channel attacks
- In a side-channel attack, information leaking from the processor is used to figure out what the processor is doing
- For example, it has been shown that a processor's power consumption can be used to determine the encryption key used in the system
- DVFS can be used as a protection

CPU simulation
- Performance
- Energy/power
- Temporal accuracy
- Trace vs. execution
- Simulation vs. direct execution

Trace-based analysis
- Trace-based systems do not directly collect information about the program's performance; instead, a trace is recorded
- The trace is analysed offline after the execution of the program
- A trace can be generated by instrumentation or by sampling (see the sketch below)
- Common traces are control flow and memory accesses
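
A minimal instrumentation sketch (the record layout and function names are illustrative assumptions): a compiler pass or binary rewriter inserts a call like trace_mem() at every load and store, and the resulting file is analysed offline, for example by a cache simulator.

```c
#include <stdio.h>
#include <stdint.h>

typedef struct { uint64_t pc; uint64_t addr; uint8_t is_write; } mem_record_t;

static FILE *trace_file;

void trace_open(const char *path) { trace_file = fopen(path, "wb"); }

/* Called from instrumented code at every memory access; appends one record. */
void trace_mem(uint64_t pc, uint64_t addr, uint8_t is_write) {
    mem_record_t r = { pc, addr, is_write };
    fwrite(&r, sizeof r, 1, trace_file);
}
```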

Direct execution
- Uses the host CPU to compute the state of the target CPU
- Primarily used for functional and cache simulation
- The state of the host CPU can stand in for suitable parts of the target CPU's state; the remaining parts of the target state must be simulated
- The simulation runs mostly as native code on the host CPU, so it can be very fast

Microarchitecture-modeling simulators
- The accuracy of the simulator depends on the level of detail of the model
- Can be used to simulate a system's performance and energy usage, if the model of the system is detailed enough
- There is a trade-off between simulation speed and accuracy

Automated CPU design
- Application-specific instruction-set processors (ASIPs)
- Configurable processors: the instruction set, memory hierarchy, buses, and peripherals can be modified to fit a specific task
- Done with a set of tools that analyze the intended workloads, create a custom processor configuration based on the analysis, and generate a tool chain (compiler) for the custom architecture