Statically Calculating Secondary Thread Performance in ASTI Systems
1 Statically Calculating Secondary Thread Performance in ASTI Systems Siddhartha Shivshankar, Sunil Vangara and Alex Guimarães Dean Center for Embedded Systems Research Department of Electrical and Computer Engineering North Carolina State University 1
2 Overview ASTI: Asynchronous Software Thread Integration Register File Partitioning Experiments Conclusions 2
3 Basic Idea of ASTI Goal: recover fine-grain idle time for use by other threads. Examine the program to find a function f with significant internal idle time. Idle time is imposed by instruction-level timing requirements (e.g. for input and output instructions). If an idle-time piece n is coarse-grain (T_Idle(f,n) >> 2*T_ContextSwitch), then we can recover it efficiently with context switching. If it is fine-grain (T_Idle(f,n) not >> 2*T_ContextSwitch), then apply ASTI (Asynchronous Software Thread Integration). Details of ASTI in LCTES 2004, CASES
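The coarse/fine decision above can be sketched as a small C helper. The factor of 4 standing in for "much greater than" is an assumption (the slide only writes T_Idle >> 2*T_ContextSwitch), and all cycle counts are illustrative:

```c
#include <assert.h>

typedef enum { RECOVER_CONTEXT_SWITCH, RECOVER_ASTI } recovery_t;

/* Classify one idle-time fragment of t_idle cycles, given a one-way
 * context-switch cost of t_cs cycles.  A switch into and back out of the
 * secondary thread costs 2*t_cs; the 4x margin approximating "much
 * greater than" is an assumption, not the paper's threshold. */
static recovery_t classify_idle(unsigned t_idle, unsigned t_cs)
{
    if (t_idle >= 4u * (2u * t_cs))  /* coarse grain: switch overhead is noise */
        return RECOVER_CONTEXT_SWITCH;
    return RECOVER_ASTI;             /* fine grain: integrate instead */
}
```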
4 ASTI Applied to Communication Protocols. Primary thread: the executive calls ReceiveMessage (prepare message buffer; check for errors, save bit, update CRC), which makes subroutine calls to ReceiveBit (read bit from bus 3 times and vote; sample bus for resynchronization), with idle time before each return. With a plain coroutine secondary thread, idle time too short for a cocall is wasted. The integrated secondary thread needs only the first and last coroutine calls, so it recovers even idle time too short for a cocall. 4
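The triple-sample-and-vote inside ReceiveBit amounts to a two-out-of-three majority. A minimal sketch, with plain ints standing in for the real bus-read instruction:

```c
#include <assert.h>

/* Two-out-of-three majority vote over three bus samples (0 or 1), as the
 * ReceiveBit routine reads the bit from the bus 3 times and votes. */
static int majority3(int a, int b, int c)
{
    return (a & b) | (b & c) | (a & c);  /* 1 iff at least two samples are 1 */
}
```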
5 Protocol Controller Options. Each option pairs a system MCU with analog & digital I/O, a physical-layer transceiver and the communication network: (1) discrete protocol controller with MCU (plus optional I/O expander); (2) MCU with on-board protocol controller; (3) generic MCU with ASTI software protocol controller (plus I/O expander). 5
6 But what about Caches? Deep instruction pipelines? Branch prediction? Superscalar instruction execution? Speculative execution? The reorder buffer? Page faults? Forwarding paths? Load queues? Data prefetching? Predicated execution? Branch delay slots? Instruction prefetching? Store forwarding? R-ops? Dynamic optimization? The phase of the moon? Wind direction? Et cetera, et cetera. 6
7 Register File Partitioning Single register file must support primary and secondary threads Three ways to use a register For primary thread exclusively For secondary thread exclusively Shared between the two, swapped on coroutine calls Register file may not be homogeneous Pointer/address registers Immediate-operand capable registers... so need to pick best partition for each type. How? 7
8 Primary and Secondary Thread Performance Impact of register file partitioning. More registers for the primary thread: less spilling and filling -> primary code takes fewer cycles -> more idle time -> more cycles for the secondary thread; but fewer registers for the secondary thread -> more spilling and filling -> the secondary thread requires more cycles, and response time rises. More registers for the secondary thread: similar case, reversed. More registers swapped: both threads require fewer cycles to execute, but the coroutine call takes longer -> more cycles wasted switching between threads -> the coroutine call no longer fits into the shorter idle-time fragments, reducing cycles available for the secondary thread. How do we find the best register file partitioning? Too complex to compute everything analytically. Instead, compile and analyze iteratively to perform design space exploration. 8
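The iterative exploration can be sketched as an exhaustive loop over candidate partitions with a toy cost model standing in for a real compile-and-measure run. Every name and coefficient here is an illustrative assumption, not the paper's actual model:

```c
#include <assert.h>

#define NREGS 10  /* registers of one class; count is illustrative */

typedef struct { int prim, sec, swap; } partition_t;

/* Toy stand-in for one compile-and-analyze run: fewer registers left to
 * the secondary thread means more spill cycles, and each swapped register
 * adds coroutine-call overhead.  Coefficients are made up. */
static int secondary_cycles(partition_t p)
{
    int spill  = (NREGS - p.sec - p.swap) * 40; /* spill/fill in secondary */
    int cocall = p.swap * 12;                   /* per-register swap cost  */
    return 1000 + spill + cocall;
}

/* A partition is feasible only if each thread keeps some registers. */
static int feasible(partition_t p)
{
    return p.prim >= 2 && p.sec >= 2;
}

/* Enumerate all (primary, secondary, swapped) splits of one register
 * class and keep the feasible one with the fewest secondary cycles. */
static partition_t best_partition(void)
{
    partition_t best = {0, 0, 0};
    int best_cost = -1;
    for (int prim = 0; prim <= NREGS; prim++)
        for (int swap = 0; prim + swap <= NREGS; swap++) {
            partition_t p = { prim, NREGS - prim - swap, swap };
            if (!feasible(p))
                continue;
            int c = secondary_cycles(p);
            if (best_cost < 0 || c < best_cost) { best_cost = c; best = p; }
        }
    return best;
}
```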
9 Thread Integrating Compiler Back-End: Thrint. Pipeline: foo.s -> control-flow analysis -> data-flow analysis -> static timing analysis (with GProf) -> integration analysis -> integration -> foo.int.s and foo.id (visualized with XVCG and GnuPlot). We have enhanced Thrint to integrate threads using ASTI methods (it previously handled only STI) and to measure best- and worst-case performance for the secondary thread. 9
10 Iterative Partition Analysis Toolchain. Register file partitioning decisions feed both threads. Primary thread (s_m.c, s_b.c) -> gcc -> Thrint: ICTA -> T_SegmentIdle. Secondary thread (r_m.c, r_b.c) -> gcc -> Thrint -> T_Sec (original performance of the secondary thread) and T_Sec-Seg-Part (performance of the segmented, partitioned secondary thread). Performance comparison: slowdown vs. dedicated MCU. 10
11 Experiments. Atmel AVR: 8-bit load/store architecture for microcontrollers; register file (32) = pointer + immediate (6), immediate (10), other (16). Protocol controllers in C: CAN at 62.5 kbps, MIL-STD-1553 at 1 Mbps. Secondary threads in C: network-RS232 bridge, PID controller. Compiled with AVR-GCC, -O3. Testbed diagram: a bus bridge MCU running the ASTI software (primary thread: J1850; message queues; secondary thread: interface) connected via digital I/O and UART to the system MCU. 11
12 Performance Evaluation Measure slowdown of integrated secondary thread (worst-case execution path) with partitioned register file, compared with original full-register file performance Need to evaluate and schedule for worst-case to ensure system always meets its deadlines How much performance do we give up by partitioning the register file? Not all partitions are schedulable Not enough time for coroutine call Not enough time for primary thread to meet its I/O instruction deadlines 12
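The slowdown metric above is just the ratio of partitioned to original worst-case cycle counts, minus one. A minimal sketch, with illustrative cycle counts (1015 vs. 1000 cycles gives the 1.5% figure that appears in the best-case results):

```c
#include <assert.h>

/* Slowdown of the integrated, partitioned secondary thread relative to
 * its original full-register-file worst case: t_part / t_orig - 1. */
static double slowdown(unsigned long t_part, unsigned long t_orig)
{
    return (double)t_part / (double)t_orig - 1.0;
}
```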
13 Results I: Average Performance. [Chart: average slowdown (0% to 60%) of the Bridge (Host Interface) and PID Controller secondary threads for the 1553Send, 1553Receive, CANSend and CANReceive primaries.] Average performance for all feasible partitioning approaches. 13
14 Results II: Best Performance. [Chart: best-case slowdown (-0.5% to 2.0%) of the Bridge (Host Interface) and PID Controller for 1553Send, 1553Receive, CANSend and CANReceive.] Find the best (least slowdown) of all feasible partitionings. The AVR register file is adequate to handle register pressure for both threads, or the idle time is adequate for coroutine calls. 14
15 Detailed Analysis Example. Primary: 1553 send; secondary: PID controller; immediate registers: the secondary is sensitive to the number of immediate registers. Primary: CAN send; secondary: RS232-CAN bridge; other registers: the cocall must be brief for schedulability. Best is 1.5% slowdown: (10,6) to (14,2) with no swapped registers. 15
16 Conclusions and Future Work. Performance varies significantly for the AVR architecture: the average case is bad, but the best case is close to the non-partitioned register file. Future work: derive and evaluate heuristics to search efficiently through the partitioning design space; replace the coroutine call with a dispatcher to support multiple secondary threads. 16
17 Questions? Have you applied this to SPEC? No, that's not representative of embedded software-implemented communication protocols. Don't caches break the timing predictability you need? The processors we use run at under 50 MHz, so we don't have a memory wall. Why not use a multithreaded processor? They're too expensive, too rare, and businesses prefer familiar processors. Why not just design an ASIC to do it? Too expensive to get the first one. Why not program an FPGA to do it? Too expensive to get the rest of them. 17
18 Appendices 18
19 Why Network Communication Protocol Controllers? Multiple threads must be able to make progress, even with a fully-loaded bus, and idle time is very fine grain (under one bit time). Each application domain customizes its protocols: wireless sensor networks tweak the medium access control, etc. for minimal energy use; automotive optimizes for guaranteed (hard real-time) delivery. Chicken-and-egg problem: a protocol controller chip won't appear until an adequate market is anticipated, and chip costs remain high until volumes amortize design costs, so there is a delay until the protocol controller appears as a peripheral on cheap MCUs. MCUs are a good fit for many embedded protocols, if concurrency is cheap: 10 to 200 cycles of processing needed per bit at 1 kbps to 1 Mbps bus speed. Temporally predictable MCUs are cheap and flexible: 1 MHz for $…, … MHz for $5-$10 (but you pay in increased energy use and other issues). 19
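The cycles-per-bit figure above is just the clock frequency divided by the bus bit rate. A quick sketch; the 8 MHz AVR clock is an illustrative assumption, while 62.5 kbps is the CAN rate used in the experiments:

```c
#include <assert.h>

/* Cycles of processing available per bit = f_clk / bit rate.
 * E.g. an 8 MHz MCU (assumed) on a 62.5 kbps CAN bus has 128 cycles
 * per bit, comfortably inside the 10-200 cycles the slides cite. */
static unsigned long cycles_per_bit(unsigned long f_clk_hz, unsigned long bps)
{
    return f_clk_hz / bps;
}
```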
20 Assumptions - Small Embedded Systems. Processors: not practical to design a custom processor; not practical to use a fast processor (e.g. raise clock speed by 10x or more); can handle some code explosion (e.g. up to 3x); using a generic microcontroller (e.g. 4, 8 or 16 bit) without memory protection, virtual memory or caches. [Chart: MCU market share by word size: 8-bit 57%, 16-bit 12%, 4-bit 12%, DSP 11%, 32-bit 8%.] Workload: at most a few threads need to make asynchronous progress, others can wait; one hard-real-time thread with tight deadlines; other threads may have deadlines which are significantly longer; interrupts are delayed or handled with polling servers (CASES 2003); subroutine calls are cloned or inlined (CASES 2004). 20
21 Control-Flow View of ASTI. Idle time appears repeatedly along the primary function's control flow. Break the secondary thread into segments lasting approximately the total idle time; integrate the intervening primary code into each segment; insert coroutine calls at the start of idle time and at the end of each segment. 21
22 Big Picture How do we efficiently allocate N concurrent (potentially real-time) tasks onto fewer than N processors? Compilation and scheduling for concurrent/parallel/distributed systems Real-time systems Hardware/software cosynthesis Bottlenecks Scheduling each context switch Performing each context switch We focus on 1 processor, and that processor is generic (low-cost) with no special features for accelerating context switch bottlenecks Note: threads must be able to make independent (asynchronous) progress 22
23 Steps in STI: Source Code Preparation Structure program (C) to accumulate work to perform in integrated functions Write functions (C) to be integrated Compile to assembly code, partitioning register file for functions to be integrated (-ffixed) 23
24 Thread Representation: Control Dependence Graph. The CDG's hierarchical structure (procedure, code, conditional and loop nodes) simplifies integration: vertical = conditional nesting, horizontal = execution order, with summary information at each level. Our Thrint back-end compiler operates on the CDGs of host and guest threads; it annotates the host with execution time predictions and moves guest code into the host, enforcing control/data/time dependencies (find a gap, or else descend into a subgraph). We have code transformations to handle conditionals & loops. 24
25 Thrint Overview. Front end: parse asm; form CFG/CDG; read integration directives; static timing analysis; node labelling. STI path: pad timing jitter (pad excess timing jitter); plan integration; for each guest, for each host: do host loop transformations, pad excess timing jitter, clone and insert guest node(s), and if a fused loop, add fused-loop control test code; delete original guests. ASTI path: pad timing jitter in the message-level function and in the bit-level function; plan integration in the secondary thread; pad jitter in predicate nodes and blocking I/O loops; integrate cocalls within the secondary thread; integrate intervening guest code at appropriate locations; delete original guests. Shared analyses and back end: idle time analysis, temporal determinism analysis, data-flow analysis, register virtualization, integration, register reallocation, static timing analysis, timing verification, code regeneration. 25
26 Steps in STI: Analysis and Integration Planning. Parse assembly code to form the CFG and then the CDG. Perform tree-based static timing analysis. Pad away timing variations from conditionals with nops or nop loops (example). Perform basic data-flow analysis to identify loop-control variables and possibly iteration counts. Compare the duration of primary functions with the maximum allowed latency for ISRs and other short-laxity tasks; create polling servers to handle these as needed. Compare the duration of secondary functions with the amount of idle time in primary functions, considering the minimum period for the primary function. Break long secondary functions into segments which fit into the primary function's idle time minus polling servers minus two context-switch times; also end segments when reaching a loop with an unknown iteration count. Define target times for regions in primary code which are time-critical. 26
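The segment-sizing rule above can be sketched directly: the usable budget is the idle time minus polling-server time minus two context switches. All cycle counts are illustrative assumptions:

```c
#include <assert.h>

/* Cycle budget for one secondary-thread segment, per the planning step:
 * idle time minus polling-server time minus two context-switch times.
 * Returns 0 when the idle fragment is unusable. */
static unsigned segment_budget(unsigned t_idle, unsigned t_poll, unsigned t_cs)
{
    unsigned overhead = t_poll + 2u * t_cs;
    return (t_idle > overhead) ? (t_idle - overhead) : 0u;
}
```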
27 Steps in STI: Integration. Note: conditionals have been padded away previously. Single primary events: move primary code to execute at the proper times within secondary code; replicate primary code into conditionals; split and peel loops and insert primary code; guard primary code within a loop to trigger on a given iteration. Looping primary events: peel off primary-function loop iterations which don't overlap with secondary loops and integrate them as single primary events; fuse loop iterations which do overlap (fuse the loop control tests); unroll the loop to match idle time in the primary loop to work in the secondary loop; create clean-up loops to perform the remaining iterations. Redo static timing analysis and verify correct timing. Recreate the assembly file. Compile, link, download and run! 27
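The "guard primary code within a loop" transformation above can be sketched in C: the primary event is cloned into the loop body but fires only on one iteration. The helper names, the counters and the firing iteration are all illustrative, not the paper's code:

```c
#include <assert.h>

/* Call counters let us check the transformation's behavior below. */
static int sec_calls, pri_calls;
static void secondary_work(int i) { (void)i; sec_calls++; }
static void primary_event(void)   { pri_calls++; }

/* Fused loop: secondary work runs every iteration; the guarded primary
 * event is triggered exactly once, on iteration fire_iter. */
static void fused_loop(int n, int fire_iter)
{
    for (int i = 0; i < n; i++) {
        secondary_work(i);
        if (i == fire_iter)       /* guard: primary event on one iteration */
            primary_event();
    }
}
```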
28 Protocol Software Structure. protocol_executive() dispatches among send, idle and receive; send_message() calls send_bit(), and receive_message() calls receive_bit(). Most idle time is located in these bit-level functions. 28
29 What about Interrupts? What about Frequent Primaries and Long Secondaries? Interrupts: STI disables interrupts while integrated threads run (STIGLitz: interrupts disabled for one field of video ( ms)). Frequent primaries and long secondaries: the primary thread needs to run again before the integrated version would finish. Solutions: use polling servers to service each non-deferrable thread (e.g. UART); break the secondary into segments and integrate the primary in multiple times. Timeline: the laxity for the secondary thread (maximum latency allowed) must cover the worst-case execution time of the integrated thread; each period contributes the minimum primary-thread period minus the maximum primary-thread work toward the secondary thread's WCET. 29
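A simplified reading of the timeline above is a per-period schedulability check: the integrated segment's WCET must fit in the budget one primary period leaves over, i.e. the minimum primary-thread period minus the maximum primary-thread work. This sketch and its numbers are illustrative assumptions:

```c
#include <assert.h>

/* Does one integrated segment fit in the idle budget of one primary
 * period?  budget = min primary period - max primary work (clamped at 0). */
static int segment_fits(unsigned wcet_integrated,
                        unsigned min_primary_period,
                        unsigned max_primary_work)
{
    unsigned budget = (min_primary_period > max_primary_work)
                          ? (min_primary_period - max_primary_work)
                          : 0u;
    return wcet_integrated <= budget;
}
```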
30 Detail: Register File Partitioning vs. Performance. Problem: STI requires that integrated threads share the register file. Trade-off: code compiled to fit into fewer registers switches contexts faster (the dispatcher switches contexts roughly every 900 cycles; two context switches for one register take 12 cycles), but runs slower (more variables must remain in memory). Goal: squeeze pre-integrated threads into as few registers as practical. Method: determine the sensitivity of the host threads' execution time to the number of registers available. Divide the AVR registers into three classes: pointer registers (r26-r31), immediate-operand capable registers (r16-r25), other registers (r0-r15). Analyze the DrawSprite, DrawLine and DrawCircle functions. Limit the registers available to the register allocator through gcc's -ffixed option. Measure execution time using an on-chip timer/counter. 30
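The per-register switching cost above yields a simple linear cost model for a pair of context switches: 12 cycles per swapped register (push and pop in each direction, per the slide's figure). Any fixed dispatcher overhead is omitted here as an unknown:

```c
#include <assert.h>

/* Cycles spent on one pair of context switches as a function of the
 * number of swapped registers, using the slide's 12 cycles per register
 * for two switches.  Fixed dispatcher overhead is deliberately omitted. */
static unsigned cocall_pair_cycles(unsigned swapped_regs)
{
    return swapped_regs * 12u;
}
```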
31 Measurements Results. DrawLine and DrawCircle are not very sensitive; DrawSprite is very sensitive, with a strange speed-up when excluding one pointer register. Design decisions: for DrawLine and DrawCircle, exclude eight "other" registers and two pointer registers (use 22 registers; each context switch: 132 cycles); for DrawSprite, exclude only one "other" register and two pointer registers (use 29 registers; each context switch: 174 cycles). [Charts: normalized run time vs. total registers excluded for DrawCircle, DrawLine and DrawSprite, broken out by register class (immediate, pointer, other).] 31
32 To Do Remove intervening code from primary code - animate 32
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationComputer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can
More informationMemory Management. Dr. Yingwu Zhu
Memory Management Dr. Yingwu Zhu Big picture Main memory is a resource A process/thread is being executing, the instructions & data must be in memory Assumption: Main memory is infinite Allocation of memory
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationCS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07
CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as
More informationInstruction Level Parallelism (ILP)
1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into
More informationRapidly Developing Embedded Systems Using Configurable Processors
Class 413 Rapidly Developing Embedded Systems Using Configurable Processors Steven Knapp (sknapp@triscend.com) (Booth 160) Triscend Corporation www.triscend.com Copyright 1998-99, Triscend Corporation.
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationLecture 13 - VLIW Machines and Statically Scheduled ILP
CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationSimultaneous Multithreading: a Platform for Next Generation Processors
Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt
More informationCS 614 COMPUTER ARCHITECTURE II FALL 2005
CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationLecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationHardware/Software Co-design
Hardware/Software Co-design Zebo Peng, Department of Computer and Information Science (IDA) Linköping University Course page: http://www.ida.liu.se/~petel/codesign/ 1 of 52 Lecture 1/2: Outline : an Introduction
More informationEXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu
Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are
More informationDepartment of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.
Department of Computer Science, Institute for System Architecture, Operating Systems Group Real-Time Systems '08 / '09 Hardware Marcus Völp Outlook Hardware is Source of Unpredictability Caches Pipeline
More informationStatic vs. Dynamic Scheduling
Static vs. Dynamic Scheduling Dynamic Scheduling Fast Requires complex hardware More power consumption May result in a slower clock Static Scheduling Done in S/W (compiler) Maybe not as fast Simpler processor
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationControl Hazards. Branch Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationUnder the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.
Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.
More informationI/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13
I/O Handling ECE 650 Systems Programming & Engineering Duke University, Spring 2018 Based on Operating Systems Concepts, Silberschatz Chapter 13 Input/Output (I/O) Typical application flow consists of
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationMemory: Overview. CS439: Principles of Computer Systems February 26, 2018
Memory: Overview CS439: Principles of Computer Systems February 26, 2018 Where We Are In the Course Just finished: Processes & Threads CPU Scheduling Synchronization Next: Memory Management Virtual Memory
More informationAnalyzing Real-Time Systems
Analyzing Real-Time Systems Reference: Burns and Wellings, Real-Time Systems and Programming Languages 17-654/17-754: Analysis of Software Artifacts Jonathan Aldrich Real-Time Systems Definition Any system
More informationIntroduction to Embedded Systems
Stefan Kowalewski, 4. November 25 Introduction to Embedded Systems Part 2: Microcontrollers. Basics 2. Structure/elements 3. Digital I/O 4. Interrupts 5. Timers/Counters Introduction to Embedded Systems
More informationCSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)
CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that
More information