PREcision Timed (PRET) Architecture

Size: px

Start display at page:

Download "PREcision Timed (PRET) Architecture"

Quentin Thompson
5 years ago
Views:

1 PREcision Timed (PRET) Architecture Isaac Liu Advisor Edward A. Lee Dissertation Talk April 24, 2012 Berkeley, CA

2 Acknowledgements Many people were involved in this project: Edward A. Lee UC Berkeley David Broman UC Berkeley Ben Lickly UC Berkeley Hiren Patel University of Waterloo Jan Reineke Saarland University Stephen Edwards Columbia University Sungjun Kim Columbia University Matt Viele Drivven Inc. Gerald Wang National Instruments Hugo Andrade National Instruments And many more 2/38

3 Cyber Physical Systems Two key characteristics of physical processes Inherently Concurrency Uncontrollable passage of time E-Corner, Siemens Avionics: Instrumentation (Soleil Synchrotron) Key Challenges [Sangiovanni-Vincentelli, 07]: Composability Military systems: Timing Predictability Dependability Concurrency Automotive: Passage of Time Daimler-Chrysler Courtesy of Doug Schmidt! 3/38

+,-./,(& $%&'()* Composability Electronics and the Car!"# 8+5%4.&!9>#+0#:/!?789@! -&/-<%8''4/'.'3&)&%.,&)'&/G"%</1&/,@',7&-&'.%&' 3<$%',<'</&8'' )&G&2<*&)'.-',7%&&'-&*.%.,&'$/",-'+<//&+,&)':=' 6/'456'.

%&)' %&-<$%+&-8''97&':&/&3",-',<'456'.%&'%<<,&)' +<1*$,"/#'%&-<$%+&-'$-&-'2&--'*7=-"+.2'%&-<$%+&-' <*,"1">.,"</'+.*.:"2","&-8' ;7&/'+<1*.%&)',<',7&'3&)&%.,&)'-=-,&1',7.,'7<-,-' www.4wings.

cd-e'"-'%&)$+&)'3%<1' 6/)&7'(28$9+,/750+,$,7%&&',<'</&8''97&'+<11$/"+.,"</'"/,&%3.+&-'.%&'!More than 30% of the cost of a car is now in Electron %&)$+&)'3%<1'3"G&',<'3<$%8''!"/.22=@',7&'/$1:&%'<3' M",7"/'.

$/+,"</-',7%<$#7',7&'$-&'<3'+</3"#$%.,"</ 97&-&'-7.%&)'%&-<$%+&-'"/+2$)&',7&'+<1*$, *%<+&--<%A-E@'+<11</'+<11$/"+.,"</-'/&,./)'+<11</'4OP'$/",A-E8''Q$%"/#',7&'.22<+. *%<+&--@',7&'-=-,&1'"/,&#%.,<%'1.

F"/',<',7&'-*.%&'.22<+.,"</'*%<+&3&)&%.,&)'&/G"%</1&/,8''456'.))-',7&'.))" +.*.:"2",=',<'%&-&%G&'.'-*.%&'%&-<$%+&'*<<2'.:2&',<':&'.22<+.,&)',<'./='N<-,&)'!$/+,"</ [ CB. Watkins, 07] [CB. Watkins, 07]! -7.

4 +,-./,(& $%&'()* Composability Electronics and the Car!"# 3<$%',<'</&8'' )&G&2<*&)'.-',7%&&'-&*.%.,&'$/",-'+<//&+,&)':=' 6/'456'.%+7",&+,$%&'.22<;-',7&'-=-,&1 )&)"+.,&)'+<11$/"+.,"</'+7.//&2-8''6-'-7<;/'"/' %&-<$%+&-8''97&':&/&3",-',<'456'.%&'%<<,&)' +<1*$,"/#'%&-<$%+&-'$-&-'2&--'*7=-"+.2'%&-<$%+&-' <*,"1">.,"</'+.*.:"2","&-8' ;7&/'+<1*.%&)',<',7&'3&)&%.,&)'-=-,&1',7.,'7<-,-' 6/)&7'(28$9+,/750+,$,7%&&',<'</&8''97&'+<11$/"+.,"</'"/,&%3.+&-'.%&'!More than 30% of the cost of a car is now in Electron %&)$+&)'3%<1'3"G&',<'3<$%8''!"/.22=@',7&'/$1:&%'<3' M",7"/'./'456'.%+7",&+,$%&@',7&'-7.%&! 9 0% of all innovations will be based on electronic sy *7=-"+.2'+<11$/"+.,"</'+7.//&2-'"-'%&)$+&)'3%<1' +<1*$,"/#'%&-<$%+&-'.%&'.22<+.,&)',<',7&'N 3<$%',<'</&8''!$/+,"</-',7%<$#7',7&'$-&'<3'+</3"#$%.,"</ 97&-&'-7.%&)'%&-<$%+&-'"/+2$)&',7&'+<1*$, *%<+&--<%A-E@'+<11</'+<11$/"+.,"</-'/&,./)'+<11</'4OP'$/",A-E8''Q$%"/#',7&'.22<+. *%<+&--@',7&'-=-,&1'"/,&#%.,<%'1."/,."/-',7& 32&0":"2",=',<')=/.1"+.22='1./.#&'-*.%&'%&,7%<$#7',7&'1./"*$2.,"</'<3',7&'+</3"#$%.,",.:2&-8''97&'-=-,&1'"/,&#%.,<%'+<$2)'.22<+., %&-<$%+&-',<'&.+7'"/)"G")$.2'N<-,&)'!$/+," ;7"+7'"-'.F"/',<',7&'-*.%&'.22<+.,"</'*%<+&3&)&%.,&)'&/G"%</1&/,8''456'.))-',7&'.))" +.*.:"2",=',<'%&-&%G&'.'-*.%&'%&-<$%+&'*<<2'.:2&',<':&'.22<+.,&)',<'./='N<-,&)'!$/+,"</ [ CB. Watkins, 07] [CB. Watkins, 07]! -7.%"/#',7&'%&-<$%+&8''97"-'#"G&-',7&'-=-,&1 "/,&#%.,<%',7&')=/.1"+'.:"2",=',<'"/+%&.-&'< [Sangiovanni-Vincentelli, ee249 lecture 1] "#$%&'!()!*+,-.&#/+0!+1!.0!23.,-4'!"'5'&.6'5! )&+%&.-&',7&'%&-<$%+&'.22<+.,"</'3<%'.'#"G&/.05!789!9&:;#6':6%&'! Dissertation Talk, Apr. 24, /38!$/+,"</'"/',7&'3$,$%&@'<%',<'.))'.'/&;'N<+,-./,(& $%&'()*!"# +,-./,(&,-./,(& $%&'()* %&'()*!"#!"# IMA Integrated Modular Avionics

5 Timing Predictability How long does it take to execute the following code? Cache Hit? Miss? Data Dependency for (i = 1; i < n; i++) if ( a[i] > b[i] ) else Let s assume we know n 10 c[i] = c[i-1] + a[i]; c[i] = c[i-1] + b[i]; Branch predicted correctly? Out of order execution? Multithreading? Assume branch mispredict, cache miss? 5/38

6 Timing Anomalies [Lundqvist et al., 99] [Wenzel et al., 05] [Reineke et al., 06] [Engblom, 03] 6/38

7 Challenges in WCET Analysis However, both the precision of the results and the efficiency of the analysis methods are highly dependent on the predictability of the execution platform. In fact, the architecture determines whether a static timing analysis is practically feasible at all and whether the most precise obtainable results are precise enough. (Emphasis added) [Wilhelm, 03] Heckmann et al., The influence of processor architecture on the design and the results of wcet tools, IEEE 03 7/38

8 Contribution Propose an architecture that allows for timing predictability and composable resource sharing without sacrificing performance. 8/38

Time WCET Superscalar Out of Order Multicore inst1 inst2 inst3 IF! ID! EX! M! WB! IF! ID! EX! M! WB! IF! ID! EX! M! WB! inst4 IF!

9 Architecture Improvements 10 cycle Cache 1 cycle $ Mem Avg. Time WCET Pipelines inst1 IF! ID! EX! M! WB! inst2: if x>0 IF! ID! EX! M! WB! inst3 1 cycle IF! ID! EX! M! WB! inst3 3 cycle IF! ID! EX! M! WB! Avg. Time WCET Superscalar Out of Order Multicore inst1 inst2 inst3 IF! ID! EX! M! WB! IF! ID! EX! M! WB! IF! ID! EX! M! WB! inst4 IF! ID! EX! M! WB! inst5 IF! ID! EX! M! WB! inst6 IF! ID! EX! M! WB! Avg. Time WCET Shared resources Avg. Time WCET [Courtesy of Sami Yehia, Thales] 9/38

10 Execution Time Variance [Wilhelm et al., 08] Future applications, including safety-critical and active-safety ones, need shorter latencies and time determinism - reduced jitter - to increase performance. [Sangiovanni-Vincentelli, 07] 10/38

11 Related Work Modifying Modern Processors Superscalar [Rochange et al., 05], [Whitham et al., 08] VLIW [Yan et al., 08] Multithreading [Kreuzinger et al., 00], [El-Haj-Mahmoud et al., 05] SMT [Barre et al., 08], [Mische et al., 08], [Metzlaff et al., 08] WCET Analysis Pipeline Analysis [Schneider et al., 99], [Ferdinand et al., 01], [Lagenbach et al., 02], [Kirner et al. 09] Cache Analysis [Heckmann et al., 03], [Reineke et al., 07] Stack Based Architecture Java Optimized Processor [Schoeberl, 06] 11/38

12 Precision Timed Architecture Summary of architectural features: Traditional Deep out-of-order pipelines (Instructional level parallelism) Caches (Hardware replacement policy) Best effort DRAM Controller PRET Thread-interleaved pipelines (Thread level parallelism) Scratchpads (Software controlled replacement) Predictable DRAM Controller See S. Edwards and E. A. Lee, "The Case for the Precision Timed (PRET) Machine," in the Wild and Crazy Ideas Track of the Design Automation Conference (DAC), June /38

13 Pipelining LD R1, 45(r2) DADD R5, R1, R7 BE R5, R3, R0 ST R5, 48(R2) Unpipelined The Dream F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W The Reality F D E M W Memory Hazard F D E M W Data Hazard F D E M W Branch Hazard Edwards, RePP 09 13/38

14 Interleaved Pipeline t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 F D X M W F D D D F F F D X M W F D D D F F F D X M W F D D D D PC PC PC PC IR GPR1 GPR1 X Y D$ Denelcor, HEP (1981), Lee and Messerschmitt, DSP (1987), CDC 6000 (1961) t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 F D X M W F D X M W F D X M W F D X M W F D X M W Remove Data Dependencies!! [Asonavic, CS252 lecture F07] Also called Fine Grained Multithreading! 14/38

15 Thread Interleaved Execution cycle gcd: cmp r0, r1 beq end blt less sub r0, r0, r1 b gcd less: sub r1, r1, r0 b gcd end: add r1, r1, r0 mov r3, r1 add r0, r1, r2 sub r1, r0, r1 ldr r2, [r1] sub r0, r2, r1 cmp r0, r3 cmp r0, r1 F D E M W Thread 0 add r0, r1, r2 F D E M W Thread 2 blt less F D E M W ldr r2, [r1] F D E M W Thread 4 b gcd F D E M W beq end F D E M W sub r1, r0, r1 F D E M W sub r1, r1, r0 F D E M W ldr r2, [r1] F D E M W cmp r0, r1 F D E M W cycle cycle blt less F D E M W Thread ldr r2, [r1] F D E M W cmp Thread r0, r13 cmp b gcd add r0, r1, r2 add F D E M Memory Access sub r0, r2, r1 F D E beq end beq sub r1, r0, r0 sub beq end F D blt less blt ldr r2, [r1] ldr sub r0, r0, r1 F sub r0, r0, r1 sub sub r0, r2, r1 sub b gcd cmp b cmp r0, r3 Thread 0: GCD with conditional branches Thread 1: Data dependent code 15/38

16 Interleaved Pipeline Trade-offs: Need enough concurrency to utilize processor Favor throughput over latency However Simpler WCET analysis (Timing Predictability) Interference free multiple context execution (Composability) Simple pipeline design (Energy, Cost ) Improved throughput and clock rate (Performance) 16/38

17 Memory Hierarchy CPU Register File L1 Cache CPU L2 Cache Register File Main Memory Scratchpad Memory Main Memory Use Scratchpads instead of Caches! 17/38

18 Scratchpads Trade-offs: Need explicit management from the software (compiler/programmer) However Simpler WCET analysis (Timing Predictability) Customize to workload (Performance) Simple circuit design (Energy, Cost ) 18/38

19 Main Memory DRAMs: Two key problems: Bank Conflicts DRAM Refresh Variable Access Times 19/38

20 PRET DRAM Controller Rank 0: Bank 0 Bank 1 Bank 2 Bank 3 Provides four independent and predictable resources Rank 1: Bank 0 Bank 1 Bank 2 Bank 3 Allows for predictable refreshes [Reineke, CODES 11] 20/38

21 Main Memory Trade-Offs: Shared memory on scratchpad Longer average memory latencies However Predictable access latencies (Timing Predictability) Better throughput and latency when fully utilized (Performance) Reineke et al., PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation, CODES 11 21/38

22 PTARM PTARM Thread-Interleaved Pipeline Addr. Mux Scratchpads BootROM DRAM Controller I/O Bus UART UART Gateway DVI Controller LED Registers Integrated Logic Analyzer xcvlx110t RS232 DVI Transmitter On Board LEDs DDR2 DRAM Memory Module Download at 22/38

23 Pipeline Performance 23/38

24 DRAM Performance Varying Interference Varying Bandwidth Reineke et al., PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation, CODES 11 24/38

25 Contribution Propose an architecture that allows for timing predictability and composable resource sharing without sacrificing performance. Use architectural techniques that provides composability and timing predictability Expose time in the Instruction Set Architecture. 25/38

26 Current Methods WCET Fly-by-wire aircraft controlled by software. They have to purchase and store microprocessors for at least 50 years production and maintenance 26/38

27 Levels of Abstraction [Lee, 08] 27/38

28 ISA with time Extend Instruction Set with timing instructions that specify and control timing behaviors of code blocks. Assume a platform clock synchronous with the execution of instructions Timing instructions use platform clock to control execution time Deadline of Task A) Finish the task, and detect at the end if deadline was missed B) Immediately handle a a missed deadline C) Continue as long as execution time does not exceed D) Ensure execution does not continue until specified time Task Next Task Stall Miss Handler Interrupted Code 28/38

Timing Control New instruction get time (gt) gt r1, r2 ; get time (ns) -- Code block -- adds r2, r2, #500 ; add 500 ns adc r1, r1, #0 ; add with carry (time in 2 32-bit reg) du r1, r2 ; delay until

29 Timing Control New instruction get time (gt) gt r1, r2 ; get time (ns) -- Code block -- adds r2, r2, #500 ; add 500 ns adc r1, r1, #0 ; add with carry (time in 2 32-bit reg) du r1, r2 ; delay until 500ns have elapsed New instruction delay until (du) Processor frequency Task (execution time in clock cycles) Padding using delay until Where could this be useful? - Finishing early is not always better: - Scheduling Anomalies (Graham s anomalies) - Communication protocols and External synchronization 29/38

deactivate exception New instruction deactivate exception (de) Processor frequency Task (execution time in clock

30 Timing Exceptions New instruction exception on expire (ee) gt r1, r2 ; get time (ns) adds r2, r2, #500 ; add 500 ns adc r1, r1, #0 ; add with carry (time in 2 32-bit reg) ee r1, r2 ; register timer exception -- Code block -- de ; deactivate exception New instruction deactivate exception (de) Processor frequency Task (execution time in clock cycles) Exception handler Where could this be useful? - Immediate deadline miss detection Hardware exception thrown 30/38

31 ISA with time Traditional Approach Programming Model Timing Dependent on the Hardware Platform Our Objective Programming Model Make time an engineering abstraction within the programming model Timing is independent of the hardware platform (within certain constraints) A Timing Requirements-Aware Scratchpad Memory Allocation Scheme for a Precision Timed Architecture [Patel et al. 08] 31/38

32 Contribution Propose an architecture that allows for timing predictability and composable resource sharing without sacrificing performance. Use architectural techniques that provides composability and timing predictability Expose time in the Instruction Set Architecture. ISA extensions to specify temporal properties 32/38

33 Real-Time Engine Fuel Rail Simulation PRET Cores 1D CFD Simulation Network of Pipes Real-Time requirements: 5.33us Common Fuel Rail: 234 nodes Implemented on Xilinx V6 FPGA 33/38

34 Timing Side-Channel Attacks Timing exploits: Algorithms Caches Branch Predictors Pipelines Root cause: uncontrollable timing side effects! Execution Time 34/38

35 Summary Problem Statement: Conventional methods are limiting the scaling of Cyber Physical Systems design because of its lack of precise timing control and analysis Solution: To rethink the design of the bottom layers of abstraction, with emphasis on temporal predictability for Cyber Physical Systems Outcome of Research: Precision To propose Timed changes Architecture in the abstraction (PRET) for layer timing to expose predictability time throughout and composability layers, and propose with ISA a computer extensions for architecture exposing temporal that focuses properties on timing predictability and composability for Cyber Physical Systems. 35/38

36 Publications Liu, Viele, Wang, Lee, Andrade, A Heterogeneous Architecture for Evaluating Real-Time One Dimensional Computational Fluid Dynamics, FCCM 12 Reineke, Liu, Patel, Kim, Lee, PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation, CODES 11 Bui, Lee, Liu, Patel, Reineke, Temporal Isolation on Multiprocessing Architectures, DAC 11 Liu, Reineke, Lee, A PRET Architecture Supporting Concurrent Programs with Composable Timing Properties, ACSSC 10 Edwards, Kim, Lee, Liu, Patel, Schoeberl, A Disruptive Computer Design Idea: Architectures with Repeatable Timing, ICCD 09 Liu, Lickly, Patel, Lee, Poster Abstract: Timing Instructions - ISA Extensions for Timing Guarantees, RTAS 09 Liu and McGrogan. Elimination of Side Channel Attacks on a Precision Timed Architecture, Technical Report, UCB 2009 Lickly, Liu, Kim, Patel, Edwards, Lee, Predictable Programming on a Precision Timed Architecture, CASES 08 36/38

37 Thank You Please visit Questions?

38 BACKUP SLIDES 38/25

39 Thank You Qual Committee Edward A. Lee - Berkeley Hiren Patel Univ. of Waterloo Martin Schoeberl Univ. of Denmark Stephen A. Edwards Columbia Univ. Ben Lickly, Sungjun Kim John Eidson, Marc Geilen, Sami Yehia (Thales), Maarten Wiggers, Jan Reineke, Slobodon Matic, Jia Zou Christopher Brooks, Mary Stewart My Family 39/25

40 Research Efforts In All Fronts 40/25

41 Definitions Predictability The ability to analyze the execution time Repeatability The ability to repeat the execution given the same inputs Composability The functional and temporal behavior of an application is the same, irrespective of the presence or absence of other applications Robust Small changes in input leads to small changes in output 41/25

42 WCET Analysis 42/25

43 Computer Architecture Typical metrics in processor design for Embedded Systems Performance (Average Case) Power Area (Size) Compiler Support (Developmental Effort) Cost Multiple Context Analyzability 43/25

44 Branch Prediction Anomaly Experiment for{k=1; k<32; k++) { starttimer(); for(n=0; n < ; n++) // OUTER LOOP { for(i=0; i < k; i++) // INNER LOOP { nop(); // Some compiler-dependent way to get a nop } } stoptimer(); recordtime(); } 44/25

Richard s Anomalies Increasing the number of processors C 1 = 3 C 2 = 2 1 2 9 8 C 9 = 9 C 8 = 4 9 tasks with precedences and the shown execution times, where lower numbered tasks have higher

45 Richard s Anomalies Increasing the number of processors C 1 = 3 C 2 = C 9 = 9 C 8 = 4 9 tasks with precedences and the shown execution times, where lower numbered tasks have higher priority than higher numbered tasks. Optimal 3 processor schedule: C 3 = C 7 = 4 C 4 = C 6 = 4 5 C 5 = 4 The optimal schedule with four processors has a longer execution time. 45/25

Richard s Anomalies Reducing all execution times by 1 Reducing computation times C 1 = 2 C 2 = 1 1 2 9 8 C 9 = 8 C 8 = 3 9 tasks with precedences and the shown execution times, where lower numbered

46 Richard s Anomalies Reducing all execution times by 1 Reducing computation times C 1 = 2 C 2 = C 9 = 8 C 8 = 3 9 tasks with precedences and the shown execution times, where lower numbered tasks have higher priority than higher numbered tasks. Optimal 3 processor schedule: C 3 = C 7 = 3 C 4 = C 6 = 3 5 C 5 = 3 Reducing the computation times by 1 also results in a longer execution time. 46/25

Richard s Anomalies Removing precedence constraints Weakening the precedence constraints C 1 = 3 C 2 = 2 1 2 9 8 C 9 = 9 C 8 = 4 9 tasks with precedences and the shown execution times, where lower

47 Richard s Anomalies Removing precedence constraints Weakening the precedence constraints C 1 = 3 C 2 = C 9 = 9 C 8 = 4 9 tasks with precedences and the shown execution times, where lower numbered tasks have higher priority than higher numbered tasks. Optimal 3 processor schedule: C 3 = C 7 = 4 C 4 = C 6 = 4 5 C 5 = 4 Weakening precedence constraints can also result in a longer schedule. 47/25

48 Progress Work Completed: SPARC instruction set simulator C++ cycle accurate simulator PTARM architecture Synthesizable VHDL ARM core VGA controller and Serial Communication Work in Progress: WCET analysis tool (~2 weeks) Benchmarking the pipeline (~ 1 semester) Scratchpad allocation with timed programming models (~1 semester) Proof of concept workflow (~ 1 semesters) 48/25

49 Contribution Expose time in the abstraction layers. ISA extensions to specify temporal properties Propose an architecture that allows for timing predictability and composable resource sharing. 49/25

Precision Timed (PRET) Machines

Precision Timed (PRET) Machines Edward A. Lee Robert S. Pepper Distinguished Professor UC Berkeley BWRC Open House, Berkeley, CA February, 2012 Key Collaborators on work shown here: Steven Edwards Jeff