Embedded processors
Timo Töyry
Department of Computer Science and Engineering
Aalto University, School of Science
timo.toyry(at)aalto.fi
Comparing processors
- Evaluating processors
- Taxonomy of processors
- Embedded vs. general-purpose processors
- RISC
- DSPs
2/30
Evaluating processors
- Performance
  - Average
  - Peak
  - Worst-case
  - Best-case
- Cost
- Energy and power consumption
- Other criteria
  - Predictability
  - Security
3/30
Taxonomy of processors
- Flynn's taxonomy of processors
  - Single-instruction, single-data (SISD)
  - Single-instruction, multiple-data (SIMD)
  - Multiple-instruction, multiple-data (MIMD)
  - Multiple-instruction, single-data (MISD)
- Instruction set style
  - RISC
  - CISC
- Instruction issuing
  - Single issue vs. multiple issue
  - Static scheduling vs. dynamic scheduling
- Parallelism support
  - Vector processing
  - Multithreading
4/30
Embedded vs. general-purpose processors
- General-purpose processors are designed to perform many different tasks
- Embedded processors usually specialize in some task
- The market for embedded processors is much bigger than for general-purpose processors
- Embedded processors commonly use the RISC architecture
5/30
RISC processors
- Often single-issue processors that use the von Neumann architecture
- Pipelined execution
  - Instructions are simple enough that each pipeline stage completes in one clock cycle
Figure: The ARM11 pipeline, (C) Wayne Wolf
6/30
DSPs
- Instruction set is usually specialized for signal processing
- The Harvard architecture is common in DSPs
- DSPs usually have a single-cycle multiply-accumulate (MAC) instruction
Figure: A block diagram of a DSP, (C) Wayne Wolf
7/30
Parallel execution mechanisms
- Very long instruction word processors
- Superscalar processors
- SIMD and vector processors
- Thread-level parallelism
- Processor resource utilization
8/30
Very long instruction word processors
- Initially designed as general-purpose processors, but have found more use in embedded systems
- Instructions are grouped into packets
  - A packet contains multiple instructions, which are executed in parallel by different functional units of the processor
  - The execution unit for an instruction is determined by its position in the packet
- The compiler is responsible for ordering the instructions into a packet
9/30
Superscalar processors
- Dominant architecture in desktop and server machines
- More than one instruction is issued per clock cycle
- Possible resource conflicts between instructions are checked on the fly
- Not used as much in embedded systems as in desktops and servers
10/30
SIMD and vector processors
- Exploit data parallelism
- Operand data sizes
  - If the operand size is small, a technique called subword parallelism can be used
  - When exploiting subword parallelism, the CPU's ALU is split into smaller ALUs
- Vector processing
11/30
Thread-level parallelism
- Exploits task-level parallelism
- Architectures featuring hardware multithreading require larger register banks, since a separate set of registers must be allocated to each running thread
- Simultaneous multithreading (SMT): multiple threads are executed in parallel
12/30
Processor resource utilization
- The characteristics of the intended workloads should be studied and taken into account when selecting a processor for a system
- In many cases, as in multimedia, parallelism must be exploited at multiple levels
13/30
Variable-performance CPU architectures
- Power wall
- Dynamic voltage and frequency scaling
- Clock tree nightmare
- Better-than-worst-case design
14/30
Dynamic voltage and frequency scaling (DVFS)
- A popular technique for controlling the power consumption of the processor
- Takes advantage of the wide operating voltage range of CMOS chips
  - The maximum possible clock frequency is an almost linear function of the supply voltage
  - Energy consumption is proportional to the square of the supply voltage
15/30
Better-than-worst-case design
- Traditionally all digital systems are governed by a system clock
  - The clock speed must be selected carefully so that computation always finishes during a clock cycle
  - On average, the system will be idle for some time during each clock cycle
- Better-than-worst-case design assumes that operations can usually be done faster than the worst case
  - Instead of waiting for the rare worst case in every clock cycle, use a shorter clock cycle and detect the errors the worst cases cause
16/30
CPU memory hierarchy
- Memory component models
- Register files
- Caches
- Scratch pad memories
17/30
Memory component models
- Needed for evaluating memory design methods
- They model the physical properties of memories:
  - Area
  - Delay
  - Energy
18/30
Register files
- The memory nearest to the CPU core
- The size of the register file is a key parameter in CPU design
  - Too small a register file leads to inefficient operation, since programs need to spill the contents of some registers to main memory
  - Too large a register file unnecessarily increases the manufacturing cost of the chip and its energy consumption
19/30
Caches
- Fast memory between the register file and main memory
- Cache size has trade-offs similar to register file size
- Set associativity
  - Higher associativity lets more addresses that map to the same set reside in the cache simultaneously
- Line size
  - Longer lines give more prefetching bandwidth
- Configurable caches, whose set associativity and line size can be changed at runtime
20/30
Scratch pad memories
- A small memory near the processor, like a cache, but without automatic management hardware
- Contents are managed by software, usually with a combination of compile-time and runtime decision making
- The main advantage of a scratch pad memory over a cache is the predictability of its access time
21/30
Code compression
- One method to reduce program size
  - Can also decrease energy consumption and improve performance
- Implementing a system that runs compressed code is quite straightforward and does not require big changes to the processor or compiler
- Branches cause some trouble
  - Branch destination addresses and offsets are different in compressed code
  - Branches waste some resources
- Some processors have a compressed instruction set as an extension to the basic instruction set, for example the ARM 16-bit Thumb instruction set
22/30
Code and data compression
- Inspired by the success of code compression
- Maintaining a compressed data set in memory may induce more overhead than compressed program code
  - There is a trade-off between data size on one hand and performance and energy consumption overhead on the other
- May save significant amounts of energy
23/30
Low-power bus encoding
- The memory buses that connect the CPU to the cache and main memory are responsible for a significant fraction of the CPU's total energy consumption
- Bus energy consumption is proportional to the number of state changes on the bus lines
- Bus encoding algorithms try to minimize the number of state changes
24/30
Security
- Embedded systems can face similar attacks as desktop and server systems
- There are also new ways to attack embedded systems, such as side channel attacks
  - In a side channel attack, information leaking from the processor is used to figure out what the processor is doing
  - For example, it has been shown that a processor's power consumption can be used to determine the encryption key used in the system
  - DVFS can be used as protection
25/30
CPU simulation
- Performance
- Energy/power
- Temporal accuracy
- Trace vs. execution
- Simulation vs. direct execution
26/30
Trace-based analysis
- Trace-based systems do not directly collect information about the program's performance; instead a trace is recorded
- The trace is analysed offline after the execution of the program
- A trace can be generated by
  - Instrumentation
  - Sampling
- Common traces are control flow and memory accesses
27/30
Direct execution
- Uses the host CPU to compute the state of the target CPU
- Primarily used for functional and cache simulation
- The state of the host CPU can be used for suitable parts of the target CPU; other parts of the target CPU's state need to be simulated
- The simulation runs mostly as native code on the host CPU, so it can be very fast
28/30
Microarchitecture-modeling simulators
- The accuracy of the simulator depends on the level of detail of the model
- Can be used to simulate the system's performance and energy usage, if the model of the system is detailed enough
- There is a trade-off between simulation speed and accuracy
29/30
Automated CPU design
- Application-specific instruction processors (ASIPs)
- Configurable processors
  - The instruction set, memory hierarchy, buses and peripherals can be modified to fit a specific task
  - Done with a set of tools that analyze the intended workloads, create a custom processor configuration based on the analysis, and generate a tool chain (compiler) for the custom architecture
30/30