Efficient Computing in Cyber-Physical Systems

1 Efficient Computing in Cyber-Physical Systems. Peter Marwedel, TU Dortmund (Germany), Informatik 12. Springer, /06/20

2 Cyber-physical systems and embedded systems. Embedded systems (ES): information processing embedded into a larger product [Peter Marwedel, 2003]. Cyber-physical systems (CPS): integrations of computation with physical processes [Edward Lee, 2006]. CPS = ES + physical environment. Figure labels: information processing, embedded systems ("small computers"), physics.

3 Outline
Scope, opportunities & challenges
(Partial) solutions to challenges
Timing predictability
- WCET-aware compilation: unrolling, locking
- register allocation, scratch pads
Energy efficiency
- scratch pads: (non)-overlaying, compile/run-time allocation
Tradeoffs between multiple objectives
- PAMONO bio sensor
- automatic parallelization: energy efficiency vs. performance
- approximate computations: timeliness vs. exactness
Summary

4 Opportunities: transportation (automotive, avionics, rail); medical (hospitals, support of handicapped, patient information, support for the elderly). Graphics: Microsoft Cliparts + P. Marwedel, 2011.

5 Opportunities (2): (urban) living, telecommunication, consumer electronics, smart home, logistics, smart grid, public safety, structural health monitoring, robotics, military systems. Graphics: P. Marwedel, 2011, Microsoft.

6 Challenge (1): timing predictability. Timing: the system must react within a time interval dictated by the environment [adopted from Bergé]. Consequences frequently underestimated. Always! Not subject to statistical arguments (except possibly extremely small probabilities). Graphics: Microsoft Cliparts.

7 Challenge (2): energy/power efficiency. Need for energy/power-efficient processing ("sweet corner"). Graphics: P. Marwedel, H. De Man, Microsoft.

8 Challenge (3): many more objectives: exactness of results, QoS, average case delay, safety, security, thermal behavior, reliability, EMC characteristics, radiation hardness, cost, size, weight, environmental friendliness, extendability. Threatened by increased openness. Graphics: P. Marwedel, 2011; Microsoft.

9 Challenge (4): analog/digital interface (quantization, ...). Consequences of networked embedded systems, e.g. maintain level of local control. Figure labels: digital-analog signal.

11 Let's look at timing! Definition of worst-case execution time (WCET). Figure labels: distribution of possible execution times over time; BCET_EST, BCET, WCET, WCET_EST, time constraint; worst-case performance, worst-case guarantee, timing predictability. WCET_EST must be 1. safe (i.e. $WCET_{EST} \geq WCET$) and 2. tight ($WCET_{EST} - WCET \ll WCET_{EST}$). Graphics: adopted from R. Wilhelm + Microsoft Cliparts.

12 Current Trial-and-Error Based Development
1. Specification of CPS/ES software
2. Generation of code (ANSI-C or similar)
3. Compilation for given target processor
4. Execution and/or simulation of machine code, using a (e.g. random) set of input data
5. Measurement-based computation of estimated worst-case execution time (WCET_meas)
6. Adding a safety margin (e.g. 20%) on top of WCET_meas and calling the result WCET_hypo
7. If WCET_hypo > real-time constraint: change some detail, go back to 1 or
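Steps 5 to 7 of this trial-and-error flow can be sketched in a few lines of Python. This is only an illustration: the function names, the sample measurements and the deadline are hypothetical, and the 20% margin is the example value from the slide.

```python
def hypothesized_wcet(measured_times, safety_margin=0.20):
    """Steps 5 and 6: largest measured time plus a safety margin."""
    wcet_meas = max(measured_times)           # WCET_meas
    return wcet_meas * (1.0 + safety_margin)  # WCET_hypo


def needs_another_iteration(measured_times, deadline, safety_margin=0.20):
    """Step 7: True if WCET_hypo exceeds the real-time constraint."""
    return hypothesized_wcet(measured_times, safety_margin) > deadline


# Hypothetical measurements in microseconds, not data from the talk.
samples_us = [410.0, 395.0, 432.0, 401.0]
wcet_hypo = hypothesized_wcet(samples_us)
retry = needs_another_iteration(samples_us, deadline=500.0)
```

Nothing in this procedure guarantees that WCET_hypo is at least as large as the true WCET: an input that was never measured may take longer than the margin allows.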

13 Structure of WCC (WCET-aware C compiler). Figure labels: CRL2LLIR, LLIR2CRL.

14 Challenges for WCET_EST minimization. Worst-case execution path (WCEP): WCET_EST of a program = length of the longest execution path (WCEP) in that program. WCET_EST minimization: reduction of the longest path. Optimizations applied to other paths do not reduce WCET_EST. Optimizations need to know the WCEP.
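The longest-path view can be illustrated with a small Python sketch. It assumes an acyclic control flow graph with one fixed cost per basic block (loops already folded into block costs); the graph, the costs and the function name are hypothetical, and real WCET analyzers additionally model loop bounds and hardware state.

```python
from functools import lru_cache


def worst_case_path(cfg, cost, entry):
    """Return (length, path) of the longest path from entry in an acyclic CFG."""
    @lru_cache(maxsize=None)
    def longest(block):
        best_len, best_path = 0, ()
        for succ in cfg.get(block, ()):
            length, path = longest(succ)
            if length > best_len:
                best_len, best_path = length, path
        return cost[block] + best_len, (block,) + best_path

    length, path = longest(entry)
    return length, list(path)


# Hypothetical if/else diamond with per-block cycle counts.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
cost = {"entry": 5, "then": 40, "else": 10, "exit": 5}

wcet_est, wcep = worst_case_path(cfg, cost, "entry")

# Speeding up a block that is not on the WCEP leaves the estimate unchanged.
off_path, _ = worst_case_path(cfg, dict(cost, **{"else": 2}), "entry")

# Speeding up a block on the WCEP helps, and the WCEP may switch to another path.
on_path, new_wcep = worst_case_path(cfg, dict(cost, then=8), "entry")
```

In this toy graph the WCEP runs through `then`; after that block is optimized, the path through `else` becomes the longest one, which is why an optimizer has to recompute the WCEP as it goes.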

15 WCET-oriented optimizations: extended loop analysis (CGO 09); instruction cache locking (CODES/ISSS 07, CGO 12); cache partitioning (WCET-Workshop 09); procedure cloning (WCET-WS 07, CODES 07, SCOPES 08); procedure/code positioning (ECRTS 08, CASES 11 (2x)); function inlining (SMART 09, SMART 10); loop unswitching/invariant paths (SCOPES 09); loop unrolling (ECRTS 09); register allocation (DAC 09, ECRTS 11); scratchpad optimization (DAC 09); extension towards multi-objective optimization (RTSS 08); superblock-based optimizations (ICESS 10); surveys (Springer 10, Springer 12).

16 WCET-aware loop unrolling via back-annotation. WCET information is available at assembly level, but unrolling is to be applied at the internal representation of the source code. Solution: back-annotation. The experimental WCET-aware compiler WCC allows copying information from assembly code to source code: WCET data, assembly code size, amount of spill code. Figure labels: high-level IR (source code), code generation, low-level IR (assembly code), back-annotation, mem. spec.; memory architecture info available.

17 Results for unrolling: WCET. Figure: relative WCET [%] over I-cache size (up to 16 kbyte); 100% = average WCET for all benchmarks with O3 and no unrolling. WCET reduction between 10.2% and 15.4%. WCET-driven unrolling outperforms standard unrolling by 13.7%.

18 Relative WCET_EST with I-cache locking. 5 benchmarks / ARM920T / postpass optimization. Figure: relative WCET_EST [%] over cache size [bytes] for ADPCM, G723, Statemate, Compress and MPEG2 (ARM920T) [S. Plazar et al.]. H. Falk, S. Plazar, H. Theiling: Compile Time Decided Instruction Cache Locking Using Worst-Case Execution Paths, Intern. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS),

19 Register allocation. H. Falk: WCET-aware Register Allocation based on Graph Coloring, 46th Design Automation Conference (DAC). Figure: relative WCET_EST [%] at optimization level -O3 per benchmark; 100% = WCET_EST using standard graph coloring (highest degree). Benchmarks: adpcm_verify, cjpeg_transupp, compress, crc, dijkstra, duff, edge_detect, edn, epic, expint, fdct, fft_1024, fft_256, fir, fir2dim, gsm, gsm_encode, h263, h264dec_block, h264dec_macro, iir_4_64, iir_biquad_n, jfdctint, latnrm_32_64, lmsfir_8_1, lmsfir_32_64, lpc, ludcmp, matmult, matrix2_fixed, matrix2_float, md5, minver, mult_10_10, mult_4_4, ndes, prime, qmf_receive, qmf_transmit, qurt, rijndael_enc, select, sha, spectral, startup, v32_benc, average.

20 Scratch pad memories (SPM): fast, energy-efficient, timing-predictable. SPMs are small, physically separate memories mapped into the address space; selection is by an appropriate address decoder (simple!). Small; no tag memory. Example: scratch pad memory in ARM7TDMI cores, well known for low power consumption. Figure labels: address space from 0 to FFF.., SPM, select.
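Selection by address decoding can be sketched as a plain range check. The base address and size below are hypothetical values chosen for illustration, not taken from the slide or from any particular ARM7TDMI system.

```python
SPM_BASE = 0x0000  # hypothetical start of the scratch pad in the address space
SPM_SIZE = 0x1000  # hypothetical capacity: 4 KiB


def select_memory(address):
    """Model of the address decoder: a fixed range check, no tag comparison."""
    if SPM_BASE <= address < SPM_BASE + SPM_SIZE:
        return "SPM"
    return "MAIN"


inside = select_memory(0x0FFF)   # last byte of the scratch pad
outside = select_memory(0x1000)  # first byte beyond it
```

Because the decision depends only on the address, every access to the scratch pad range is served by the scratch pad; there is no hit/miss outcome as in a cache, which is what makes the access time predictable.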

21 WCET_EST-aware scratch pad allocation. Setup: Bosch Democar, runnable IgnitionSWCSync. WCET_EST-aware scratch pad allocation of program code by WCC, including fully automated WCET_EST analyses using aiT. WCET_EST reduced to about 50%, compared to gcc. P. Marwedel, 2011. Property of the PREDATOR consortium. Confidential

23 Why care about energy efficiency? Reasons: global warming; cost of energy; increasing performance; problems with cooling, avoiding hot spots; avoiding high currents & metal migration; reliability; energy a very scarce resource. Relevant during use? Compared across execution platforms: plugged (e.g. factory), uncharged periods (e.g. car), unplugged (e.g. sensor). Graphics: P. Marwedel,

24 Where is the energy consumed? Consumer portable systems, according to the International Technology Roadmap for Semiconductors (ITRS), 2010 update. Violation of W constraint for small mobiles. It is not just the logic; don't ignore memory! ITRS,

25 Energy consumption of memories. CACTI: scratchpad (SRAM) vs. DRAM (DDR2). Figure: access time (ns) and read energy (nJ) over memory size (up to 2G) for SPM and DDR, both high performance with 16 banks. 16 bit read; size in bytes; 65 nm for SRAM, 80 nm for DRAM. Source: Olivera Jovanovic, TU Dortmund,

26 Migration of data & instructions, global optimization model (TU Dortmund). Which memory object (array, loop, etc.) is to be stored in the SPM? Non-overlaying ("static") allocation: gain g_k and size s_k for each object k. Maximise gain $G = \sum_k g_k$, respecting the size of the SPM: $SSP \geq \sum_k s_k$ (both sums over the objects k placed in the SPM). Solution: knapsack algorithm. Overlaying ("dynamic") allocation: moving objects back and forth. Figure labels: example program (for i .. { }, for j .. { }, while ..., repeat, call ..., array ..., int ...), main memory, scratch pad memory of capacity SSP, processor.
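The non-overlaying case is a 0/1 knapsack problem, which the following Python sketch solves by dynamic programming over the SPM capacity. The object names, gains and sizes are hypothetical; the slide only names the knapsack formulation, so this is one possible way to solve it rather than the authors' implementation.

```python
def allocate_spm(objects, capacity):
    """0/1 knapsack over memory objects.

    objects:  list of (name, gain, size) with integer sizes in bytes
    capacity: SPM size SSP in bytes
    Returns (total_gain, names_of_objects_placed_in_spm).
    """
    # best[c] = (gain, names) of the best selection using at most c bytes
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in objects:
        # Iterate capacities downwards so each object is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [name])
    return best[capacity]


# Hypothetical memory objects: (name, gain g_k, size s_k in bytes).
objects = [("array_a", 60, 512), ("loop_i", 100, 768), ("func_f", 120, 1024)]
total_gain, in_spm = allocate_spm(objects, capacity=1536)
```

With these numbers `loop_i` and `func_f` together do not fit, so the best selection is `array_a` plus `func_f` with a total gain of 180, even though `loop_i` has a better gain per byte than `array_a`. A greedy choice by gain density would miss this.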

27 Reduction in energy and average run-time for migrating functions and variables. Figure: cycles [x100] and energy [µJ] for Multi_sort (mix of sort algorithms). Measured processor / external memory energy + CACTI values for SPM (combined model). Numbers will change with technology, algorithms remain unchanged.

28 Non-overlaying allocation is problematic for multiple hot spots; hence overlaying allocation. It effectively results in a kind of compiler-controlled overlays for the SPM. Address assignment within the SPM is required. Figure labels: CPU, memory, SPM.

29 Overlaying allocation by Verma et al. (1). Based on the control flow graph. Figure: control flow graph with basic blocks B1 to B8, annotated with DEF A, MOD A, USE A and USE C. [M. Verma, P. Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004]

30 Overlaying allocation by Verma et al. (2). Figure: the control flow graph with basic blocks B1 to B8, extended by blocks B9 and B10 that hold the copy operations SPILL_STORE(A); SPILL_LOAD(C); and SPILL_LOAD(A);. A global set of ILP equations reflects cost/benefit relations of potential copy points. Code is handled like data.

31 Runtime/energy reduction with respect to non-overlaying ("static") allocation

32 Dynamic set of multiple applications. Compile-time partitioning of the SPM is not feasible. Introduction of an SPM manager: runtime decisions, but compile-time supported. Figure labels: CPU, MEM, SPM, SPM manager; address space with App. 1, App. 2, ..., App. n; SPM contents (App. 1, App. 2, App. 3) over time t. [R. Pyka, Ch. Faßbach, M. Verma, H. Falk, P. Marwedel: Operating system integrated energy aware scratchpad allocation strategies for multi-process applications, SCOPES, 2007]

33 Comparison of SPMM to caches for SORT. Baseline: main memory only. SPMM peak energy reduction by 83% at 4k bytes scratchpad. Table (SPM size, Δ 4-way): ,81% ,35% ,39% ,64% ,73%

35 Energy consumption of (bio-) virus detection. Detection of nano objects using the plasmon effect (optical effect of objects smaller than the wavelength of light). Figure: sensor setup (laser, prism, nano object, CCD, image sequence) and plot of reflectivity and signal (ΔI/I) over the incidence angle in degrees.

36 Timeliness in (bio-) virus detection (2). Target: 30 frames per second at 1024x256 pixels. Figure: frame rate (fps) over image size; series: CPU: i; CPU: i7-620m (mobile); GPU: 3100M (mobile); GPU: GTX; GPU: GTX 560Ti. The low-end mobile GPU meets the real-time constraint up to 1024x256 pixels. Faster GPUs give room for a larger usable sensor area.
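The real-time target translates into a per-frame time budget and a required pixel throughput. The short Python sketch below works this out from the two numbers on the slide (30 fps, 1024x256 pixels); the variable and function names are illustrative only.

```python
TARGET_FPS = 30            # target frame rate from the slide
WIDTH, HEIGHT = 1024, 256  # target image size from the slide

# Time available to process one frame, in milliseconds.
frame_budget_ms = 1000.0 / TARGET_FPS

# Pixels that must be processed per second to sustain the target.
pixels_per_second = WIDTH * HEIGHT * TARGET_FPS


def meets_realtime(measured_fps, target_fps=TARGET_FPS):
    """True if a measured frame rate satisfies the real-time constraint."""
    return measured_fps >= target_fps
```

A platform therefore has roughly 33 ms per frame and has to sustain about 7.9 million pixels per second to meet the constraint at this image size.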

37 Timeliness in (bio-) virus detection (3). Energy per frame: CPU 3.26 J, 5.84 J, 10.52 J; reduced to GPU 0.93 J, 1.56 J, 2.76 J (avg 27%). Figure label: current clamp. C. Timm, A. Gelenberg, P. Marwedel, F. Weichert: Energy Considerations within the Integration of General Purpose GPUs in Embedded Systems. Intern. Conf. on Advances in Distributed and Parallel Computing, 2010. C. Timm, F. Weichert, P. Marwedel, H. Müller: Design Space Exploration Towards a Realtime and Energy-Aware GPGPU-based Analysis of Biosensor Data. Computer Science - Research and Development, ENA-HPC,
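The quoted average of 27% can be checked from the per-frame energies. The sketch below assumes that the CPU and GPU values correspond position by position (for example, to the same image sizes), which the extracted slide text does not state explicitly; the CPU value 10.52 J is the one given in the source.

```python
# Energy per frame in joules, as given on the slide.
cpu_joules = [3.26, 5.84, 10.52]
gpu_joules = [0.93, 1.56, 2.76]

# Assumption: the i-th CPU and GPU values belong to the same configuration.
ratios = [gpu / cpu for gpu, cpu in zip(gpu_joules, cpu_joules)]
average_ratio = sum(ratios) / len(ratios)

for cpu, gpu, ratio in zip(cpu_joules, gpu_joules, ratios):
    print(f"CPU {cpu:5.2f} J -> GPU {gpu:4.2f} J ({100 * ratio:.1f}% of CPU energy)")
print(f"average: {100 * average_ratio:.1f}%")
```

Under that pairing the GPU needs between roughly 26% and 29% of the CPU energy per frame, and the mean of the three ratios rounds to the 27% stated on the slide.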

38 Multi-objective based parallelization of sequential programs D. Cordes, P. Marwedel: Multi-Objective Aware Extraction of Task-Level Parallelism Using Genetic Algorithms, DATE

39 Tradeoff between timeliness and exactness. Traditional error correction requires resources, including time. This conflicts with the timing requirements of real-time systems.

40 Tradeoff between timeliness and exactness (2). In some cases, errors have no effect. In some cases, timing requirements are dominating & error correction may be compromised. In other cases, error correction requirements are dominating & timing requirements may be compromised. Figure: registers 0 to 3, holding unused entries and the variables i and j; possible consequences of an error: no effect, imprecise output, system crash. Classification requires application knowledge.

41 Analysis of H.264 video decoder (~3500 lines of C code). Fraction of memory space by resolution (memory size of reliable data / memory size of unreliable data): 176 x: kb (55%) / 74 kb (45%); 352 x: kb (43%) / 297 kb (57%); 1280 x: kb (37%) / 2700 kb (63%). Impact of uncorrected errors, with injection into unreliable memory at 240 faults/sec and 80 faults/sec: average PSNR of frames with errors dB dB. Deteriorated exactness, timeliness maintained!

42 Summary
Scope, opportunities & challenges
(Partial) solutions to challenges
Timing predictability
- WCET-aware compilation: unrolling, locking
- register allocation, scratch pads
Energy efficiency
- scratch pads: (non)-overlaying, compile/run-time allocation
Tradeoffs between multiple objectives
- PAMONO bio sensor
- automatic parallelization: energy efficiency vs. performance
- approximate computations: timeliness vs. exactness
Summary
