Complementing Software Pipelining with Software Thread Integration

LCTES '05, June 16, 2005
Won So and Alexander G. Dean
Center for Embedded System Research, Dept. of ECE, North Carolina State University
wso@ncsu.edu, alex_dean@ncsu.edu

Motivation
- Multimedia applications
  - Demand high performance from embedded systems
- DSPs with VLIW or EPIC architectures
  - Maximize processing bandwidth
  - Deliver predictable, repeatable performance
  - E.g. Philips Trimedia, TI VelociTI, StarCore
- However, speed is limited
  - It is difficult to find independent instructions
  - Software pipelining (SWP) can suffer or fail because of:
    - Complex control flow
    - Excessive register pressure
    - Tight loop-carried dependences

Software Thread Integration (STI)
- Software technique that interleaves multiple threads into a single, implicitly multithreaded one
- Idea: STI for high ILP
  - Merge multiple procedures into one
  - Increase the compiler's scope
  - Code transformations to reconcile control flow differences
  - Enable arbitrary alignment of code regions
  - Move code to use available resources
- Method: procedure cloning and integration
  - Integrate parallel procedure calls in the application
  - Create procedures with better efficiency
  - Convert procedure-level parallelism into ILP

STI for High-ILP
- Assumption: parallel procedures are identified a priori
  - Manually identified or automatically extracted
  - Exploit stream programming languages (e.g. StreamIt)
- Goal: code transformations to improve ILP
  - Apply transformations based on schedule analysis
  - Complement the key optimization (i.e. SWP)
- Contribution: improve the DSP application development process
  - Efficient compilation of C or C-like languages with additional dataflow information
  - Reduce the need for extensive manual C and assembly code optimization and tuning

Related Work
- SWP for complex control flow (MS = modulo scheduling)
  - Hierarchical reduction [Lam88]
  - Enhanced MS [Water92]
  - MS for multiple exits [Lavery92]
  - Multiple-II MS [Water-Perez95]
  - All-path pipelining [Stoodley96]
  - STI complements existing SWP methods.
- Loop optimizations
  - Loop jamming
  - Loop unrolling
  - Unroll-and-jam [Carr96]
  - STI jams whole procedures.
- Interprocedural optimizations
  - Procedure cloning [Cooper93]
  - Procedure inlining
  - Integrated procedures do the work of multiple ones. STI transforms control flow.
- Stream programming
  - StreamIt
  - STI exploits coarse-grain parallelism.

Classification of Loops
[Figure: IPC of loops from the TI Image/DSP library, before and after SWP, grouped into loops with speedup < 2 and loops with speedup >= 2]
- SWP-Good: speedup >= 2, high IPC
- SWP-Poor: speedup < 2, low IPC, dependence bounded
- SWP-Fail: calls, conditionals, lack of registers, or no valid schedule
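The three classes amount to a small decision rule. The following is a minimal sketch in C, assuming the outcome of SWP has been reduced to a success flag and a measured speedup. The type and function names are hypothetical, and the IPC and dependence criteria from the slide are not modeled.

```c
#include <stdbool.h>

/* Hypothetical names: the slides define the classes, not this interface. */
typedef enum { SWP_GOOD, SWP_POOR, SWP_FAIL } swp_class;

/*
 * Classify a loop from the outcome of software pipelining.
 *   pipelined: false if the compiler produced no pipelined schedule
 *              (calls, conditionals, lack of registers, no valid schedule).
 *   speedup:   cycles before SWP divided by cycles after SWP.
 */
swp_class classify_loop(bool pipelined, double speedup)
{
    if (!pipelined)
        return SWP_FAIL;
    return (speedup >= 2.0) ? SWP_GOOD : SWP_POOR;
}
```

The threshold of 2 is the one shown in the figure legend.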

STI Overview
- STI transformations
  - Reconcile control flow differences: enable arbitrary alignment
  - Use the control dependence graph (CDG): apply hierarchically and repeatedly
  - Conditionals: duplication
  - Loops: jamming, peeling, unrolling and splitting
- Two levels of integration
  - Assembly
  - HLL (we pursue this)
- Side effects
  - Code size increase
  - Increased register pressure
  - Additional data memory traffic

STI for Loops
[Figure: Proc a (loop a, N_a iterations) and Proc b (loop b, N_b iterations) combined into Integrated Proc ab through loop peeling, loop jamming and loop splitting]
1) Loop jamming + splitting
2) Loop unrolling + jamming + splitting
3) Loop peeling + jamming + splitting
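The first combination, loop jamming + splitting, can be sketched in C as follows. This is an illustration under stated assumptions, not the authors' code: `scale` and `offset` are hypothetical stand-ins for Proc a and Proc b, and the two arrays are assumed not to overlap.

```c
#include <stddef.h>

/* Hypothetical Proc a: its loop runs na times. */
void scale(int *x, size_t na, int k)
{
    for (size_t i = 0; i < na; i++)
        x[i] *= k;
}

/* Hypothetical Proc b: its loop runs nb times. */
void offset(int *y, size_t nb, int d)
{
    for (size_t i = 0; i < nb; i++)
        y[i] += d;
}

/* Integrated Proc ab: jam the common iterations, split off the rest. */
void scale_offset(int *x, size_t na, int k, int *y, size_t nb, int d)
{
    size_t common = (na < nb) ? na : nb;
    size_t i;

    /* Jammed loop: one body does the work of both original bodies. */
    for (i = 0; i < common; i++) {
        x[i] *= k;
        y[i] += d;
    }
    /* Split loops: at most one of these runs, for the longer procedure. */
    for (size_t j = i; j < na; j++)
        x[j] *= k;
    for (size_t j = i; j < nb; j++)
        y[j] += d;
}
```

A call such as `scale_offset(x, 5, 2, y, 3, 10)` then replaces the pair `scale(x, 5, 2); offset(y, 3, 10);`. The jammed body gives the scheduler instructions from both threads in one basic block.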

STI for Loops (contd.)
- Conditional
  - Duplicate code into all conditionals
  - Increases the number of instructions in basic blocks
- Call
  - Treat calls as regular statements
  - Find more instructions to fill delay slots
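Duplication into conditionals can be sketched in C as follows. The two threads are hypothetical: thread a saturates or increments each element (a loop with a conditional), and thread b sums a second array. The integrated version copies thread b's statement into both paths of thread a's conditional, so each basic block holds more instructions.

```c
#include <stddef.h>

/* Hypothetical thread a: loop with a conditional. */
void saturate(int *x, size_t n, int limit)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] > limit)
            x[i] = limit;
        else
            x[i] += 1;
    }
}

/* Hypothetical thread b: straight-line loop body. */
int sum(const int *y, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += y[i];
    return s;
}

/* Integrated version: thread b's statement is duplicated into every path. */
int saturate_and_sum(int *x, const int *y, size_t n, int limit)
{
    int s = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] > limit) {
            x[i] = limit;   /* thread a, then-path */
            s += y[i];      /* thread b, copy 1 */
        } else {
            x[i] += 1;      /* thread a, else-path */
            s += y[i];      /* thread b, copy 2 */
        }
    }
    return s;
}
```

The duplicated statement is the source of the code size increase listed among the side effects of STI.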

STI Transformations
STI transformations for code regions A and B:
- SWP-Good loop + SWP-Good loop: do not apply STI
- SWP-Good loop + SWP-Poor loop: jam (+ unroll one of the loops, A or B; the experiments unroll the SWP-Good loop, e.g. firu8_iir)
- SWP-Poor loop + SWP-Poor loop: jam (+ unroll the loop with the smaller II)
- SWP-Fail loop + SWP-Good or SWP-Poor loop: do not apply STI
- SWP-Fail loop + SWP-Fail loop: jam (+ unroll, + duplicate)
- Loop + acyclic region: loop peeling
- Acyclic region + acyclic region: code motion

Platform and Tools
- Target architecture: TI TMS320C64x
  - Fixed-point DSP, VelociTI.2
  - Clustered: (4 FUs + 32 registers) x 2
  - ISA: predication, SIMD, 1 to 5 delay slots
  - On-chip: 16 KB L1P/L1D caches, 1024 KB SRAM
- Compiler and tools
  - TI C6x compiler
    - Option -o2: all optimizations except interprocedural ones
    - Option -mt: aggressive memory anti-aliasing
  - C64x simulator in Code Composer Studio (CCS) 2.20
    - stall.xpath: stalls due to cross-path communication
    - stall.mem: stalls due to memory bank conflicts
    - stall.l1p: stalls due to program cache misses
    - stall.l1d: stalls due to data cache misses
    - exe.cycles: cycles other than stalls

Experiments
Benchmarks:
- SWP-Poor: iir, fft, hist, errdif
- SWP-Good: fir, fdct, idct, corr
- SWP-Fail: s1cond, s1call, s2cond, s2call
Integration:
- SWP-Poor + SWP-Poor: iir_sti2, fft_sti2, hist_sti2, errdif_sti2
- SWP-Good + SWP-Good: fir_sti2, fdct_sti2, idct_sti2
- SWP-Poor + SWP-Good: fir_iir, firu8_iir, corr_errdif, corru8_errdif
- SWP-Fail + SWP-Fail: s1cond_sti2, s1call_sti2, s1condcall, s2cond_sti2, s2call_sti2, s2condcall

SWP-Poor + SWP-Poor
[Figure: speedup of iir, fft, hist, errdif (SWP-Poor) and fir, fdct, idct (SWP-Good) with increasing number of input items]
[Figure: % speedup by cycle category (stall.mem, stall.xpath, stall.l1d, stall.l1p, exe.cycles) for iir (32, 64, 128 items), fft (64, 256, 1024 items), hist (64, 256, 1024 items) and errdif (64, 256, 1024 items)]
- SWP-Poor + SWP-Poor: speedup > 1
- SWP-Good + SWP-Good: speedup < 1
- Source of speedup: exe.cycles
- Source of slowdown: stall.mem, stall.l1d, stall.l1p

SWP-Poor + SWP-Good
[Figure: speedup of fir_iir, firu2_iir, firu4_iir, firu8_iir, corr_errdif, corru2_errdif, corru4_errdif, corru8_errdif with increasing number of input items]
[Figure: % speedup by cycle category (stall.mem, stall.xpath, stall.l1d, stall.l1p, exe.cycles) for fir_iir and firu8_iir (32, 64, 128 items), corr_errdif and corru8_diff (8, 16, 32 items)]
- SWP-Poor + SWP-Good: speedup > 1
- Increasing the unroll factor increases the speedup
- Source of speedup: exe.cycles
- Impact of stalls is not consistent

SWP-Fail + SWP-Fail
[Figure: speedup of s1cond, s1call, s1condcall, s2cond, s2call, s2condcall with increasing number of input items]
[Figure: % speedup by cycle category (stall.mem, stall.xpath, stall.l1d, stall.l1p, exe.cycles) for each benchmark at 8, 32 and 128 input items]
- SWP-Fail + SWP-Fail: speedup > 1
- Source of speedup: exe.cycles
- Source of slowdown: stall.mem, stall.l1d, stall.l1p

Conclusions
- STI transformations for high ILP
  - Determined by control structure and utilization
- STI complements SWP (HM = harmonic mean)
  - SWP-Poor + SWP-Poor: 26% speedup (HM)
  - SWP-Poor + SWP-Good: 55% speedup (HM)
  - SWP-Fail + SWP-Fail: 16% speedup (HM)
- Future work: automatic integration
  - Algorithm for code transformation
    - Heuristics to match code regions (loop and acyclic)
    - Reconcile more complex control flow differences
    - Estimate the impact of dynamic events
  - Develop a tool chain for automatic integration
  - Support STI for StreamIt programs
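The speedups above are reported as harmonic means (HM). For reference, a harmonic mean of n speedups is n divided by the sum of their reciprocals. The sketch below shows the computation in C; the function name is hypothetical and no per-benchmark values from the talk are reproduced.

```c
#include <stddef.h>

/* Harmonic mean of n positive speedups: n / (1/s_0 + ... + 1/s_{n-1}). */
double harmonic_mean(const double *speedup, size_t n)
{
    double inv_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        inv_sum += 1.0 / speedup[i];
    return (double)n / inv_sum;
}
```

The harmonic mean is never larger than the arithmetic mean, so a single large speedup does not dominate the summary figure.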

Any questions? Thanks.

Code Size
[Figure: code size in bytes, original versus sti2, for iir, fft, hist, errdif (SWP-Poor), fir, fdct, idct (SWP-Good) and s1cond, s1call, s2cond, s2call (SWP-Fail), annotated with the size ratio of each integrated version]
- Code size increases after STI
- The increase is significant in s1cond and s2cond because of conditionals
[Figure: code size in bytes of f1, f2, their sum, and the integrated versions f1_f2, f1u2_f2, f1u4_f2, f1u8_f2 for s1cond+s1call and s2cond+s2call (SWP-Fail + SWP-Fail) and for fir+iir and corr+errdif (SWP-Poor + SWP-Good)]
- SWP-Fail + SWP-Fail: code size grows because of conditionals
- SWP-Poor + SWP-Good: code size is less than the sum and increases after unrolling

Back-up
- STI Overview
- Procedure Cloning and Integration
  - Method
  - Results
- Automatic Integration
  - Overview of Algorithm
  - Tool Chain

STI Overview
[Figure: design spaces and steps for STI]

Procedure Cloning and Integration
STEP 1: Identify candidate procedures
- Find procedures dominating performance by profiling or run-time estimation
STEP 2: Examine parallelism
- Find independent procedure-level data parallelism
- Each procedure call handles its own data set
- Data sets are independent of each other
STEP 3: Perform procedure integration
- Design the control flow of the integrated procedure
- Simple techniques for identical procedures:
  1) Loops with the same loop counts: loop jamming
  2) Loops with conditionals: duplicate conditionals
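For identical procedures with the same loop count, STEP 3 reduces to jamming the two loops. The sketch below is an illustration, not code from the talk: `iir1` is a hypothetical first-order recursive filter whose loop carries a dependence from one iteration to the next, and `iir1_sti2` is its integrated clone working on two independent data sets.

```c
/* Hypothetical original procedure: each output depends on the previous one. */
void iir1(const int *x, int *y, int n, int c)
{
    int prev = 0;
    for (int i = 0; i < n; i++) {
        prev = x[i] + (c * prev) / 256;
        y[i] = prev;
    }
}

/* Integrated clone: two calls, same loop count, loops jammed. */
void iir1_sti2(const int *x0, int *y0,
               const int *x1, int *y1, int n, int c)
{
    int prev0 = 0, prev1 = 0;
    for (int i = 0; i < n; i++) {
        prev0 = x0[i] + (c * prev0) / 256;  /* work of call 0 */
        prev1 = x1[i] + (c * prev1) / 256;  /* work of call 1, independent of call 0 */
        y0[i] = prev0;
        y1[i] = prev1;
    }
}
```

Each dependence chain is as long as before, but the loop body now contains two chains that do not depend on each other, which is the procedure-level parallelism being converted into ILP.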

PCI (contd.)
STEP 4: Optimize the application
- Dynamic approach
  - The RTOS chooses the most efficient version at run-time
- Static approach
  - Replace original procedure calls with calls to integrated procedures
  - Select the combination of most efficient versions
[Figure: dynamic approach (the application reaches integrated procedure clones and other procedures through the RTOS) versus static approach (the application calls integrated procedure clones and other procedures directly)]
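The static approach can be sketched as a source-level rewrite of the call sites. In the sketch below all names and the block size are hypothetical: `negate` stands for an original procedure, `negate_sti2` for its integrated clone, and `process_all` for the application code after the calls have been replaced.

```c
#define BLOCK 4  /* hypothetical block size */

/* Original procedure: handles one block. */
void negate(int *blk)
{
    for (int i = 0; i < BLOCK; i++)
        blk[i] = -blk[i];
}

/* Integrated clone: handles two independent blocks per call. */
void negate_sti2(int *b0, int *b1)
{
    for (int i = 0; i < BLOCK; i++) {
        b0[i] = -b0[i];
        b1[i] = -b1[i];
    }
}

/* Application after the static rewrite: pairs of calls go to the clone. */
void process_all(int blocks[][BLOCK], int nblocks)
{
    int k = 0;
    for (; k + 1 < nblocks; k += 2)
        negate_sti2(blocks[k], blocks[k + 1]);
    if (k < nblocks)            /* odd block left over: use the original */
        negate(blocks[k]);
}
```

The original procedure stays in the program to handle the leftover call, which is one reason cloning rather than in-place modification is used.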

Preliminary Results
- Application: JPEG
  - cjpeg and djpeg with lena.ppm (512x512x24bit)
- Target architecture: Itanium
  - EPIC + predication + speculation
  - 16 KB L1 D$/I$, 96 KB unified L2, 2048 KB unified L3
- Software tools
  - Compilers: GCC, Pro64, ORCC and Intel compilers
  - OS: Linux for IA-64
  - pfmon: Performance Monitoring Unit (PMU) library and tool
- Integration method
  - Integrated 3 procedures: FDCT and Encode in cjpeg, IDCT in djpeg
  - Self-integrated 2 or 3 procedure calls
  - Manually integrated at the C source level and statically executed

Performance of Procedures
[Figures: normalized performance of NOSTI, STI2 and STI3 for FDCT/CJPEG, EOB/CJPEG and IDCT/DJPEG across the compilers GCC-O2, Pro64-O2, ORCC-O2, ORCC-O3, Intel-O2, Intel-O3 and Intel-O2-u0]
- FDCT/CJPEG: STI speeds up the best compiler (Intel-O2-u0) by 17%
- EOB/CJPEG: 13% speedup
- IDCT/DJPEG: slowdown in many cases because of I$ misses
- The sweet spot varies between one, two and three threads
- Application speedup with the best compiler: cjpeg 11%, djpeg 4%

Overview of Algorithm
1. Form a CDG for each procedure.
2. Annotate the CDG based on analysis of the ASM code.
   - Utilization: number of instructions, number of cycles
   - SWP info: SWP-Good/Poor/Fail
   - Code size, register use, memory traffic, working set size, etc.
3. Rank order the code regions in terms of idle resources and other factors.
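Step 3 can be sketched as a sort over annotated regions. The sketch below assumes one possible ranking key, the number of empty issue slots computed from the instruction and cycle counts of step 2. The struct, the function names and the choice of key are assumptions; the slides also mention "other factors" that are not modeled here.

```c
#include <stdlib.h>

#define ISSUE_WIDTH 8  /* assumed from the platform slide: (4 FUs) x 2 clusters */

/* Hypothetical annotation attached to one CDG code region. */
struct region {
    const char *name;
    long instructions;  /* instructions in the scheduled region */
    long cycles;        /* cycles the schedule takes */
};

/* Issue slots left empty by the schedule. */
static long idle_slots(const struct region *r)
{
    return r->cycles * ISSUE_WIDTH - r->instructions;
}

/* qsort comparator: regions with more idle slots come first. */
static int by_idle_desc(const void *pa, const void *pb)
{
    long a = idle_slots(pa);
    long b = idle_slots(pb);
    return (a < b) - (a > b);
}

/* Rank regions in place, most idle resources first. */
void rank_regions(struct region *regions, size_t n)
{
    qsort(regions, n, sizeof regions[0], by_idle_desc);
}
```

Regions at the front of the ranking have the most room to absorb instructions from another thread, so they are natural first candidates when choosing combinations.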

Overview of Algorithm (contd.)
4. Choose the best combination of code regions.
   - Align loop regions first, then handle the rest of the code
   - Avoid unbeneficial combinations (e.g. SWP-Fail + SWP-Good)
5. Try integration.
   - Overlap loop iterations by loop jamming and unrolling
   - Find opportunities for loop peeling and code motion
6. Generate C code for the integrated procedure.
7. Compile it.
8. Analyze the performance.
   - Decide whether it is worth trying other transformations

Automation of Code Transformations
- Plan to build a C-to-C translator
  - Now a Perl script
  - Will be developed
- TI C6x compiler
  - Can use open compilers, e.g. gcc, orcc