Complementing Software Pipelining with Software Thread Integration

Complementing Software Pipelining with Software Thread Integration
LCTES '05, June 16, 2005
Won So and Alexander G. Dean
Center for Embedded System Research, Dept. of ECE, North Carolina State University
wso@ncsu.edu, alex_dean@ncsu.edu

Motivation
Multimedia applications demand high performance from embedded systems. DSPs with VLIW or EPIC architectures maximize processing bandwidth and deliver predictable, repeatable performance (e.g. Philips TriMedia, TI VelociTI, StarCore). However, speed is limited: it is difficult to find independent instructions, and software pipelining (SWP) can suffer or fail under complex control flow, excessive register pressure, or tight loop-carried dependences.

Software Thread Integration (STI)
A software technique that interleaves multiple threads into a single, implicitly multithreaded one.
Idea: STI for high ILP. Merge multiple procedures into one to increase the compiler's scope. Code transformations reconcile control-flow differences, enable arbitrary alignment of code regions, and move code to use available resources.
Method: procedure cloning and integration. Integrate parallel procedure calls in the application, create procedures with better efficiency, and convert procedure-level parallelism into ILP.

STI for High ILP
Assumption: parallel procedures are identified a priori, either manually or automatically extracted, e.g. by exploiting stream programming languages such as StreamIt.
Goal: code transformations that improve ILP, applied based on schedule analysis and complementing the key optimization (i.e. SWP).
Contribution: improve the DSP application development process. Efficient compilation of C or C-like languages with additional dataflow information reduces the need for extensive manual C and assembly code optimization and tuning.

Related Work
SWP for complex control flow: hierarchical reduction [Lam88], enhanced MS [Warter92], MS for multiple exits [Lavery92], multiple-II MS [Warter-Perez95], all-path pipelining [Stoodley96]. STI complements existing SWP methods.
Loop optimizations: loop jamming, loop unrolling, unroll-and-jam [Carr96]. STI jams whole procedures.
Procedure cloning [Cooper93], procedure inlining, interprocedural optimizations: integrated procedures do the work of multiple ones; STI transforms control flow.
StreamIt and stream programming: STI exploits coarse-grain parallelism.

Classification of Loops
(Chart: IPCs of loops from the TI Image/DSP library, before and after SWP.)
SWP-Good: speedup >= 2, high IPC.
SWP-Poor: speedup < 2, low IPC, dependence-bounded.
SWP-Fail: calls, conditionals, lack of registers, or no valid schedule.

STI Overview
STI transformations reconcile control-flow differences and enable arbitrary alignment. They are applied hierarchically and repeatedly using the CDG: conditionals are handled by duplication; loops by jamming, peeling, unrolling, and splitting.
Integration can be performed at two levels, assembly or HLL; we pursue the HLL level.
Side effects: code size increase, increased register pressure, and additional data memory traffic.

STI for Loops
(Figure, garbled in transcription: ProcA with a loop of Na iterations and ProcB with a loop of Nb iterations are combined into an integrated ProcAB using loop peeling, jamming, and splitting, under an assumed relation between Na and Nb.)
1) Loop jamming + splitting
2) Loop unrolling + jamming + splitting
3) Loop peeling + jamming + splitting

STI for Loops (contd.)
Conditionals: duplicate code into all conditional paths, increasing the instructions in each basic block.
Calls: treat them as regular statements, finding more instructions to fill delay slots.

STI Transformations
STI transformations for code regions A and B, chosen by region type:
- SWP-Good loop + SWP-Good loop: do not apply STI.
- SWP-Good loop + SWP-Poor loop: STI: jam (+ unroll the SWP-Good loop).
- SWP-Poor loop + SWP-Poor loop: STI: jam (+ unroll the loop with the smaller II).
- SWP-Fail loop + SWP-Good or SWP-Poor loop: do not apply STI.
- SWP-Fail loop + SWP-Fail loop: STI: jam (+ unroll, + duplicate).
- Loop + acyclic region: STI: loop peeling.
- Acyclic + acyclic: STI: code motion.

Platform and Tools
Target architecture: TI TMS320C64x, a fixed-point DSP (VelociTI.2). Clustered: (4 FUs + 32 registers) x 2. ISA: predication, SIMD, 1-5 delay slots. On-chip: 16KB L1P/L1D caches, 1024KB SRAM.
Compiler and tools: TI C6x compiler with option -o2 (all optimizations except interprocedural ones) and -mt (aggressive memory anti-aliasing); C64x simulator in Code Composer Studio (CCS) 2.20.
Cycle categories: stall.xpath (stalls due to cross-path communication), stall.mem (memory bank conflicts), stall.l1p (program cache misses), stall.l1d (data cache misses), exe.cycles (cycles other than stalls).

Experiments
Benchmarks: SWP-Poor (iir, fft, hist, errdif); SWP-Good (fir, fdct, idct, corr); SWP-Fail (s1cond, s1call, s2cond, s2call).
Integrations:
- SWP-Poor + SWP-Poor: iir_sti2, fft_sti2, hist_sti2, errdif_sti2
- SWP-Good + SWP-Good: fir_sti2, fdct_sti2, idct_sti2
- SWP-Poor + SWP-Good: fir_iir, firu8_iir, corr_errdif, corru8_errdif
- SWP-Fail + SWP-Fail: s1cond_sti2, s1call_sti2, s1condcall, s2cond_sti2, s2call_sti2, s2condcall

SWP-Poor + SWP-Poor
(Charts: speedup vs. increasing number of input items for iir, fft, hist, and errdif, plus a cycle-category breakdown.)
(SWP-Poor + SWP-Poor) speedup > 1, while (SWP-Good + SWP-Good) speedup < 1.
Source of speedup: exe.cycles. Source of slowdown: stall.mem, stall.l1d, stall.l1p.

SWP-Poor + SWP-Good
(Charts: speedup vs. increasing number of input items for fir_iir through firu8_iir and corr_errdif through corru8_errdif, plus a cycle-category breakdown.)
(SWP-Poor + SWP-Good) speedup > 1; increasing the unroll factor increases speedup.
Source of speedup: exe.cycles. The impact of stalls is not consistent.

SWP-Fail + SWP-Fail
(Charts: speedup vs. increasing number of input items for s1cond, s1call, s1condcall, s2cond, s2call, and s2condcall, plus a cycle-category breakdown.)
(SWP-Fail + SWP-Fail) speedup > 1.
Source of speedup: exe.cycles. Source of slowdown: stall.mem, stall.l1d, stall.l1p.

Conclusions
STI transformations for high ILP are determined by control structure and utilization. STI complements SWP: SWP-Poor + SWP-Poor gives a 26% speedup (harmonic mean), SWP-Poor + SWP-Good a 55% speedup (HM), and SWP-Fail + SWP-Fail a 16% speedup (HM).
Future work: automatic integration. An algorithm for code transformation; heuristics to match code regions (loop and acyclic); reconciling more complex control-flow differences; estimating the impact of dynamic events; a tool chain for automatic integration; STI support for StreamIt programs.

Any questions? Thanks.

Code Size
(Charts: code size in bytes for original and integrated versions of each benchmark; numbers above the bars are growth factors.)
Code size increases after STI; the increase is significant in s1cond and s2cond because of conditionals.
For SWP-Fail + SWP-Fail, growth is due to conditional duplication. For SWP-Poor + SWP-Good, the integrated code is smaller than the sum of the originals but grows after unrolling.

Back-up
STI Overview
Procedure Cloning and Integration: method, results
Automatic Integration: overview of algorithm, tool chain

STI Overview
(Figure: design spaces and steps for STI.)

Procedure Cloning and Integration
STEP 1: Identify candidate procedures. Find procedures dominating performance by profiling or run-time estimation.
STEP 2: Examine parallelism. Find independent procedure-level data parallelism: each procedure call handles its own data set, and the data sets are independent of each other.
STEP 3: Perform procedure integration. Design the control flow of the integrated procedure. Simple techniques suffice for identical procedures: 1) loops with the same loop counts: loop jamming; 2) loops with conditionals: duplicate conditionals.

PCI (contd.)
STEP 4: Optimize the application.
Dynamic approach: the RTOS chooses the most efficient version at run time.
Static approach: replace the original procedure calls with calls to integrated procedures, selecting the combination of most efficient versions.
(Figure: application structure with the RTOS or direct calls dispatching into integrated procedure clones and other procedures.)

Preliminary Results: JPEG
Application: cjpeg and djpeg with lena.ppm (512x512, 24-bit).
Target architecture: Itanium (EPIC + predication + speculation); 16KB L1 D$/I$, 96KB unified L2, 2048KB unified L3.
Software tools: GCC, Pro64, ORCC, and Intel compilers; Linux for IA-64; pfmon, a Performance Monitoring Unit (PMU) library and tool.
Integration method: integrated three procedures (FDCT and Encode in cjpeg, IDCT in djpeg), self-integrating 2 or 3 procedure calls; manually integrated at the C source level and statically executed.

Performance of Procedures
(Charts: normalized performance of NOSTI, STI2, and STI3 versions across compilers: GCC-O2, Pro64-O2, ORCC-O2, ORCC-O3, Intel-O2, Intel-O3, Intel-O2-u0.)
FDCT (cjpeg): STI speeds up the best compiler (Intel-O2-u0) by 17%. EOB (cjpeg): 13% speedup. IDCT (djpeg): slowdown in many cases because of I-cache misses.
The sweet spot varies between one, two, and three threads. Application speedup with the best compiler: cjpeg 11%, djpeg 4%.

Overview of Algorithm
1. Form the CDG for each procedure.
2. Annotate the CDG based on analysis of the ASM code: utilization (number of instructions, number of cycles), SWP info (SWP-Good/Poor/Fail), code size, register use, memory traffic, working-set size, etc.
3. Rank-order code regions in terms of idle resources and other factors.

Overview of Algorithm (contd.)
4. Choose the best combination of code regions: align loop regions first, then the rest of the code; avoid unbeneficial combinations (e.g. SWP-Fail + SWP-Good).
5. Try integration: overlap loop iterations by loop jamming and unrolling; find opportunities for loop peeling and code motion.
6. Generate C code for the integrated procedure.
7. Compile it.
8. Analyze the performance and decide whether it is worth trying other transformations.

Automation of Code Transformations
Plan: build a C-to-C translator (currently a Perl script; a full version will be developed) that feeds the TI C6x compiler. Open compilers (e.g. gcc, ORCC) can also be used.