Information Processing. Peter Marwedel Informatik 12 Univ. Dortmund Germany


1 Information Processing Peter Marwedel Informatik 12 Univ. Dortmund Germany

2 Embedded System Hardware. Embedded system hardware is frequently used in a loop ("hardware in a loop"). [Figure: loop diagram; only the label "actuators" survives.]

3 Processing units. Need for efficiency (power + energy): why worry about energy and power? Power is considered the most important constraint in embedded systems [in: L. Eggermont (ed): Embedded Systems Roadmap 2002, STW]. Current UMTS phones can hardly be operated for more than an hour if data is being transmitted. [from a report of the Financial Times, Germany, on an analysis by Credit Suisse First Boston;

4 Power and energy are related to each other: $E = \int P \, dt$. In many cases, faster execution also means less energy, but the opposite may be true if power has to be increased to allow faster execution. [Figure: power P plotted over time t, with energies E and E'.]
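The relation between power and energy can be sketched numerically. The following is a minimal sketch, assuming a piecewise-constant power trace; the function name `energy_joules` and all numbers are hypothetical and chosen only for illustration.

```python
def energy_joules(power_trace):
    """Approximate E = integral of P dt for a piecewise-constant power trace.

    power_trace: list of (power_in_watts, duration_in_seconds) segments.
    """
    return sum(p * dt for p, dt in power_trace)


# Hypothetical schedules for the same task (illustrative numbers only).
slow = [(1.0, 4.0)]             # 1 W for 4 s
fast_same_power = [(1.0, 2.0)]  # finishes in half the time at the same power
fast_more_power = [(3.0, 2.0)]  # finishes in half the time, but power tripled

e_slow = energy_joules(slow)
e_fast_same = energy_joules(fast_same_power)
e_fast_more = energy_joules(fast_more_power)

print(e_slow, e_fast_same, e_fast_more)
```

With these assumed numbers, the faster run at equal power needs less energy than the slow run, while the faster run at tripled power needs more.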

5 Low Power vs. Low Energy Consumption. Minimizing the power consumption is important for: the design of the power supply; the design of voltage regulators; the dimensioning of interconnect; short-term cooling. Minimizing the energy consumption is important due to: restricted availability of energy (mobile systems); limited battery capacities (only slowly improving); very high costs of energy (solar panels, in space); cooling (high costs, limited space); dependability (long lifetimes, low temperatures).

6 Key requirements for processors: energy/power-efficiency

7 Power density continues to get worse. [Figure: power density trend, with "Nuclear reactor" as a reference level.] Prescott: 90 W/cm², 90 nm [c't 4/2004]

8 Dynamic power management (DPM). Example: STRONGARM SA1100. RUN: operational (400 mW). IDLE: a software routine may stop the CPU when not in use, while monitoring interrupts (50 mW). SLEEP: shutdown of on-chip activity (160 µW). State transitions: RUN to IDLE 10 µs, IDLE to RUN 10 µs; RUN to SLEEP 90 µs and IDLE to SLEEP 90 µs (each on a power fault signal); SLEEP to RUN 160 ms.
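A rough way to see when SLEEP pays off compared with IDLE is to compare the energy spent over an idle gap. This is a sketch under stated assumptions, not a model of the real chip: it assumes the 10 µs figures apply to RUN/IDLE transitions, 90 µs to entering SLEEP and 160 ms to returning from SLEEP to RUN, and it charges transition time at RUN power (an assumption not given on the slide). All function names are hypothetical.

```python
# Power per state, taken from the slide (in watts).
RUN_W, IDLE_W, SLEEP_W = 0.400, 0.050, 160e-6

# Transition times in seconds; the mapping to arrows is an assumption.
T_RUN_IDLE = 10e-6
T_IDLE_RUN = 10e-6
T_RUN_SLEEP = 90e-6
T_SLEEP_RUN = 160e-3


def gap_energy(gap_s, low_power_w, t_enter, t_leave):
    """Energy spent over an idle gap of gap_s seconds in one low-power state.

    Assumption: transition time is charged at RUN power.
    Returns None if the gap is too short to enter and leave the state.
    """
    overhead = t_enter + t_leave
    if gap_s < overhead:
        return None
    return RUN_W * overhead + low_power_w * (gap_s - overhead)


def energy_idle(gap_s):
    return gap_energy(gap_s, IDLE_W, T_RUN_IDLE, T_IDLE_RUN)


def energy_sleep(gap_s):
    return gap_energy(gap_s, SLEEP_W, T_RUN_SLEEP, T_SLEEP_RUN)


for gap in (0.1, 0.5, 2.0):
    print(gap, energy_idle(gap), energy_sleep(gap))
```

Under these assumptions, short gaps favor IDLE because the 160 ms wake-up dominates, and only gaps longer than roughly a second favor SLEEP.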

9 Fundamentals of dynamic voltage scaling (DVS). Power consumption of CMOS circuits (ignoring leakage): $P = \alpha \, C_L \, V_{dd}^2 \, f$, with $\alpha$: switching activity, $C_L$: load capacitance, $V_{dd}$: supply voltage, $f$: clock frequency. Delay for CMOS circuits: $\tau = k \, C_L \, \frac{V_{dd}}{(V_{dd} - V_t)^2}$, with $V_t$: threshold voltage ($V_t$ substantially less than $V_{dd}$). Decreasing $V_{dd}$ reduces $P$ quadratically, while the run-time of algorithms is only linearly increased; $E = P \times t$ decreases linearly (ignoring the effects of the memory system and $V_t$).
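The two formulas can be evaluated directly to see the quadratic effect on power and the roughly linear effect on delay. This is a minimal sketch; the function names and all parameter values (`ALPHA`, `C_LOAD`, `K`, `F_CLK`, the voltages) are hypothetical, and the clock frequency is held fixed when comparing power.

```python
def dynamic_power(alpha, c_load, v_dd, f):
    """P = alpha * C_L * V_dd^2 * f (leakage ignored)."""
    return alpha * c_load * v_dd ** 2 * f


def gate_delay(k, c_load, v_dd, v_t):
    """tau = k * C_L * V_dd / (V_dd - V_t)^2."""
    return k * c_load * v_dd / (v_dd - v_t) ** 2


# Hypothetical parameter values, for illustration only.
ALPHA, C_LOAD, K, F_CLK = 0.5, 1e-9, 1.0, 100e6

# Halving V_dd at a fixed clock frequency.
power_ratio = dynamic_power(ALPHA, C_LOAD, 1.0, F_CLK) / dynamic_power(ALPHA, C_LOAD, 2.0, F_CLK)

# Delay growth when V_t is ignored (V_t = 0) and when it is not.
delay_ratio_ideal = gate_delay(K, C_LOAD, 1.0, 0.0) / gate_delay(K, C_LOAD, 2.0, 0.0)
delay_ratio_vt = gate_delay(K, C_LOAD, 1.0, 0.3) / gate_delay(K, C_LOAD, 2.0, 0.3)

print(power_ratio, delay_ratio_ideal, delay_ratio_vt)
```

With a non-zero threshold voltage the delay grows faster than linearly, which is why the slide's statement is qualified by "ignoring ... $V_t$".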

10 Voltage scaling: Example [Courtesy: Yasuura, 2000]. [Figure: plot over $V_{dd}$.] Exploitation is discussed in the codesign chapter.

11 Key requirement #2: Code-size efficiency. CISC machines: RISC machines designed for run-time-, not for code-size-efficiency. Compression techniques: key idea

12 Code-size efficiency. Compression techniques (continued): 2nd instruction set, e.g. ARM Thumb instruction set. [Figure: the 16-bit Thumb instruction ADD Rd #constant (fields: major opcode, minor opcode, Rd as source = destination, constant) is expanded into a 32-bit instruction in which Rd appears twice and the constant is zero extended.] Dynamically decoded at run-time. Reduction to % of original code size. 130% of ARM performance with 8/16 bit memory; 85% of ARM performance with 32-bit memory [ARM, R. Gupta]. Same approach for LSI TinyRisc. Requires support by compiler, assembler etc.
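The expansion of a 16-bit Thumb instruction into its 32-bit ARM equivalent can be sketched for the ADD Rd, #constant case. This is an illustration under assumptions: the bit layout used here (Thumb prefix 00110, ARM condition 1110, data-processing immediate ADD with the S bit set) follows my reading of the ARM architecture documentation and should be checked against it; the function name is hypothetical.

```python
def expand_thumb_add_imm8(thumb_instr):
    """Expand Thumb 'ADD Rd, #imm8' into a 32-bit ARM 'ADDS Rd, Rd, #imm8'.

    Assumed encodings (verify against the ARM reference manual):
      Thumb: 001 10 Rd(3) imm8
      ARM:   1110 001 0100 1 Rn(4) Rd(4) 0000 imm8, with Rn = Rd
    """
    if not 0 <= thumb_instr <= 0xFFFF:
        raise ValueError("Thumb instructions are 16 bits wide")
    if (thumb_instr >> 11) != 0b00110:
        raise ValueError("not a Thumb ADD Rd, #imm8 instruction")

    rd = (thumb_instr >> 8) & 0x7   # 3-bit register, source = destination
    imm8 = thumb_instr & 0xFF       # constant, zero extended below

    return (
        (0b1110 << 28)    # condition: always
        | (0b001 << 25)   # data processing, immediate operand
        | (0b0100 << 21)  # opcode ADD
        | (1 << 20)       # S bit: update flags
        | (rd << 16)      # Rn: 3-bit Rd padded with a leading 0
        | (rd << 12)      # Rd: 3-bit Rd padded with a leading 0
        | imm8            # rotate field 0000, constant zero extended
    )


arm_word = expand_thumb_add_imm8(0x3105)  # assumed encoding of ADD R1, #5
print(hex(arm_word))
```

The sketch shows why only a subset of registers and small constants are reachable from the short encoding: three register bits and eight constant bits are all that fit.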

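13 Key requirement #3: Run-time efficiency - Domain-oriented architectures. Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$, computed as $\forall i: 0 \le i \le n-1: \; y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$. Architecture: Example: Data path ADSP210x. [Figure: address registers A0, A1, A2, .. in the address generation unit (AGU), supplying addresses i+1 and j-i+1 to memories D (x) and P (a); registers AX, AY, AR, AF around the +,-,.. unit; registers MX, MY, MR, MF around the multiplier (*) and the +,- unit. Annotations: x[j-i] in MX, a[i] in MY, product x[j-i]*a[i], accumulated value y_{i-1}[j] in MR.] Application maps nicely onto architecture: MR:=0; A1:=1; A2:=n-2; MX:=x[n-1]; MY:=a[0]; for ( j:=1 to n) {MR:=MR+MX*MY; MY:=a[A1]; MX:=x[A2]; A1++; A2--}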
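The filter equation and its accumulation form can be written down directly. This is a minimal sketch of the arithmetic only, not of the ADSP210x; the function name `fir_output` and the sample values are hypothetical.

```python
def fir_output(x, a, j):
    """Compute y[j] = sum over i in 0..n-1 of x[j-i] * a[i].

    The loop follows the recurrence y_i[j] = y_{i-1}[j] + x[j-i] * a[i],
    starting from an accumulator value of 0.
    """
    n = len(a)
    if j < n - 1 or j >= len(x):
        raise ValueError("need n-1 <= j < len(x) so that all x[j-i] exist")
    y = 0
    for i in range(n):
        y = y + x[j - i] * a[i]
    return y


# Illustrative input samples and coefficients.
samples = [1, 2, 3, 4]
coeffs = [1, 0, -1]
y3 = fir_output(samples, coeffs, 3)
print(y3)
```

Each loop iteration is one multiplication and one addition into the same accumulator, which is the pattern the data path above is built around.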

14 DSP-Processors: multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions. MR:=0; A1:=1; A2:=n-2; MX:=x[n-1]; MY:=a[0]; for ( j:=1 to n) {MR:=MR+MX*MY; MY:=a[A1]; MX:=x[A2]; A1++; A2--} The loop body is a multiply/accumulate (MAC) instruction; the loop header is a zero-overhead loop (ZOL) instruction preceding the MAC instruction. Loop testing is done in parallel to MAC operations.
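The register-level loop can be emulated step by step to check that it computes the filter sum. This is a sketch: the variable names mirror the registers on the slide, but the guard around the final prefetch is an addition of this sketch, needed because Python would otherwise read a[n] on the last iteration (the slide does not show how the hardware handles that last load).

```python
def fir_mac_loop(x, a):
    """Emulate the slide's MAC loop; returns sum of x[n-1-i] * a[i]."""
    n = len(a)
    if n == 0 or len(x) < n:
        raise ValueError("need at least n samples and n >= 1")

    mr = 0                    # MR := 0 (accumulator)
    a1, a2 = 1, n - 2         # A1 := 1; A2 := n-2 (address registers)
    mx, my = x[n - 1], a[0]   # MX := x[n-1]; MY := a[0]

    for _ in range(n):        # ZOL: loop control costs no extra instruction
        mr = mr + mx * my     # MAC: multiply and accumulate
        if a1 < n:            # guard added for this sketch only
            my = a[a1]        # prefetch next coefficient
            mx = x[a2]        # prefetch next sample
        a1 += 1               # A1++
        a2 -= 1               # A2--
    return mr


result = fir_mac_loop([1, 2, 3], [1, 0, -1])
print(result)
```

On the DSP, the multiply/accumulate, both loads and both address updates are meant to proceed in parallel, so each iteration corresponds to a single instruction.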

15 Heterogeneous registers. Example (ADSP 210x): [Figure: data path with memories D and P; address registers A0, A1, A2, .. in the address generation unit (AGU); registers AX, AY, AR, AF around the +,-,.. unit; registers MX, MY, MF, MR around the multiplier (*) and the +,- unit.] Different functionality of registers An, AX, AY, AF, MX, MY, MF, MR.


16 Separate address generation units (AGUs). Example (ADSP 210x): Data memory can only be fetched with an address contained in A, but this can be done in parallel with the operation in the main data path (takes effectively 0 time). A := A ± 1 also takes 0 time, same for A := A ± M; A := <immediate in instruction> requires an extra instruction. Minimize load immediates. Optimization in codesign chapter.
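The cost rule for address updates can be turned into a small counting function. This is a simplified sketch with a single address register and a single modify value M; the function name is hypothetical, and treating a repeated address (step 0) as free is an assumption of the sketch.

```python
def count_immediate_loads(addresses, m):
    """Count 'A := <immediate>' instructions needed for an access sequence.

    Steps of +1, -1, +m and -m are free (done in the AGU in parallel);
    a step of 0 is assumed to be free as well. Every other step, and the
    initial setup of A, costs one extra instruction.
    """
    if not addresses:
        return 0
    loads = 1  # initial A := <immediate>
    free_steps = {0, 1, -1, m, -m}
    for previous, current in zip(addresses, addresses[1:]):
        if current - previous not in free_steps:
            loads += 1
    return loads


sequential = count_immediate_loads([0, 1, 2, 10, 9], m=8)
scattered = count_immediate_loads([0, 5, 6], m=2)
print(sequential, scattered)
```

A memory layout that makes consecutive accesses differ by ±1 or ±M keeps the count at the initial load, which is what "minimize load immediates" asks for.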

17 Saturating arithmetic. Returns the largest/smallest number in case of over/underflows. Example:
  a = 0111, b = 1001
  a+b: standard wrap-around arithmetic (1)0000; saturating arithmetic 1111
  (a+b)/2: correct 1000; wrap-around arithmetic 0000; saturating arithmetic + shifted 0111 (almost correct)
Appropriate for DSP/multimedia applications: no timeliness of results if interrupts are generated for overflows; precise values are less important; wrap-around arithmetic would be worse.
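The 4-bit example can be reproduced with a few lines of integer arithmetic. This is a minimal sketch for unsigned 4-bit values; the function names are hypothetical, and b = 1001 is the value implied by a + b = (1)0000 with a = 0111.

```python
BITS = 4
MAX_VALUE = (1 << BITS) - 1  # 1111 for unsigned 4-bit numbers


def wrap_add(a, b):
    """Standard wrap-around addition: the carry out of bit 3 is dropped."""
    return (a + b) & MAX_VALUE


def sat_add(a, b):
    """Saturating addition: clamp to the largest representable value."""
    return min(a + b, MAX_VALUE)


a, b = 0b0111, 0b1001

wrapped = wrap_add(a, b)          # sum with the carry lost
saturated = sat_add(a, b)         # sum clamped to 1111
exact_mean = (a + b) // 2         # (a+b)/2 computed without overflow
wrapped_mean = wrapped >> 1       # wrap-around result, then shifted
saturated_mean = saturated >> 1   # saturated result, then shifted

for name, value in [("wrapped", wrapped), ("saturated", saturated),
                    ("exact_mean", exact_mean), ("wrapped_mean", wrapped_mean),
                    ("saturated_mean", saturated_mean)]:
    print(name, format(value, "04b"))
```

The saturated mean is off by one from the exact mean, while the wrap-around mean is off by eight, which is the sense in which saturation is "almost correct".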

18 Fixed-point arithmetic. Shifting is required after multiplications and divisions in order to maintain the binary point.
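The need for shifting can be seen in a small binary fixed-point sketch. The names `FWL`, `to_fixed`, `fixed_mul` and `fixed_div` are hypothetical, and the choice of 3 fractional bits is an assumption made for illustration.

```python
FWL = 3  # assumed number of fractional bits


def to_fixed(value):
    """Convert a real number to fixed point by scaling with 2**FWL."""
    return int(value * (1 << FWL))


def to_real(fixed):
    """Convert a fixed-point integer back to a real number."""
    return fixed / (1 << FWL)


def fixed_mul(p, q):
    """The raw product carries 2*FWL fractional bits; shift right by FWL."""
    return (p * q) >> FWL


def fixed_div(p, q):
    """Pre-shift the dividend left by FWL so the quotient keeps FWL bits."""
    return (p << FWL) // q


half = to_fixed(0.5)      # stored as the integer 4
quarter = to_fixed(0.25)  # stored as the integer 2

product = fixed_mul(half, quarter)
quotient = fixed_div(half, quarter)
print(to_real(product), to_real(quotient))
```

Without the shifts, the product would be scaled by 2**(2*FWL) and the quotient by 1, so the binary point would move with every operation.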

19 Properties of fixed-point arithmetic. Automatic scaling is a key advantage for multiplications. Example: x = 0.5 × [remaining operands and results missing]. For iwl=1 and fwl=3 decimal digits, the less significant digits are automatically chopped off: x = [value missing]. Like a floating-point system with numbers in [0..1), with no stored exponent (bits used to increase precision). Appropriate for DSP/multimedia applications (well-known value ranges).
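The chopping of less significant digits can be illustrated with exact rational arithmetic. This is a sketch with hypothetical operand values (not necessarily those of the original example) and a hypothetical helper `chop`, using fwl=3 decimal digits as on the slide.

```python
from fractions import Fraction
from math import floor

FWL_DECIMAL = 3  # number of fractional decimal digits kept


def chop(value, fwl=FWL_DECIMAL):
    """Drop all decimal digits beyond the fwl-th fractional digit."""
    scale = 10 ** fwl
    return Fraction(floor(value * scale), scale)


# Hypothetical operands, all within [0..1).
exact = Fraction(1, 2) * Fraction(1, 8) + Fraction(1, 4) * Fraction(1, 8)
chopped = chop(exact)

print(float(exact), float(chopped))
```

Because all operands lie in [0..1), products stay in the same range, so the result can be chopped to the fixed word length without any explicit rescaling step.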

20 Real-time capability. Timing behavior has to be predictable. Features that cause problems: unpredictable access to shared resources; caches with difficult-to-predict replacement strategies; unified caches (conflicts between instructions and data); pipelines with difficult-to-predict stall cycles ("bubbles"); unpredictable communication times for multiprocessors; branch prediction, speculative execution; interrupts that are possible at any time; memory refreshes that are possible at any time; instructions that have data-dependent execution times. The aim is to avoid as many of these as possible. [Dagstuhl workshop on predictability, Nov , 2003]
