Complementing Software Pipelining with Software Thread Integration


1 Complementing Software Pipelining with Software Thread Integration
LCTES '05, June 16, 2005
Won So and Alexander G. Dean
Center for Embedded Systems Research, Dept. of ECE, North Carolina State University
wso@ncsu.edu, alex_dean@ncsu.edu

2 Motivation
Multimedia applications
  Demand high performance from embedded systems
DSPs with VLIW or EPIC architectures
  Maximize processing bandwidth
  Deliver predictable, repeatable performance
  E.g. Philips TriMedia, TI VelociTI, StarCore
However, speed is limited
  It is difficult to find independent instructions.
  Software pipelining (SWP) can suffer or fail because of:
    Complex control flow
    Excessive register pressure
    Tight loop-carried dependences

3 Software Thread Integration (STI)
A software technique that interleaves multiple threads into a single, implicitly multithreaded one
Idea: STI for high ILP
  Merge multiple procedures into one
    Increases the compiler's scope
  Code transformations reconcile control flow differences
    Enable arbitrary alignment of code regions
    Move code to use available resources
Method: procedure cloning and integration
  Integrate parallel procedure calls in the application
  Create procedures with better efficiency
  Convert procedure-level parallelism into ILP
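To make "convert procedure-level parallelism into ILP" concrete, here is a minimal sketch in C. The procedure `sum_sq` and its integrated clone `sum_sq_sti2` are hypothetical names chosen for illustration and do not come from the talk. The sketch assumes two independent calls on separate data sets with equal loop counts, so plain loop jamming is enough.

```c
#include <stddef.h>

/* Original procedure: one accumulation chain over one data set. */
int sum_sq(const int *x, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}

/* Hypothetical integrated clone: does the work of two independent calls.
 * The two accumulation chains do not depend on each other, so a VLIW
 * scheduler is free to overlap them inside the single jammed loop. */
void sum_sq_sti2(const int *xa, const int *xb, size_t n,
                 int *out_a, int *out_b)
{
    int acc_a = 0, acc_b = 0;
    for (size_t i = 0; i < n; i++) {
        acc_a += xa[i] * xa[i];   /* work of call A */
        acc_b += xb[i] * xb[i];   /* work of call B */
    }
    *out_a = acc_a;
    *out_b = acc_b;
}
```

A caller would replace `ra = sum_sq(a, n); rb = sum_sq(b, n);` with `sum_sq_sti2(a, b, n, &ra, &rb);`. Whether this pays off depends on the schedule the compiler finds for the jammed loop.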

4 STI for High ILP
Assumption: parallel procedures are identified a priori
  Manually identified or automatically extracted
  Exploit stream programming languages (e.g. StreamIt)
Goal: code transformations to improve ILP
  Apply transformations based on schedule analysis
  Complement the key optimization (i.e. SWP)
Contribution: improve the DSP application development process
  Efficient compilation of C or C-like languages with additional dataflow information
  Reduce the need for extensive manual C and assembly code optimization and tuning

5 Related Work
SWP for complex control flow
  Hierarchical reduction [Lam88]
  Enhanced MS [Water92]
  MS for multiple exits [Lavery92]
  Multiple-II MS [Water-Perez95]
  All-path pipelining [Stoodley96]
  STI complements existing SWP methods.
Loop optimizations
  Loop jamming
  Loop unrolling
  Unroll-and-jam [Carr96]
  STI jams whole procedures.
Interprocedural optimizations
  Procedure cloning [Cooper93]
  Procedure inlining
  Integrated procedures do the work of multiple ones.
  STI transforms control flow.
Stream programming
  StreamIt
  STI exploits coarse-grain parallelism.

6 Classification of Loops
[Figure: IPCs of loops from the TI Image/DSP library, before and after SWP, grouped by speedup < 2 and speedup >= 2. Axes: loops, IPC.]
SWP-Good: speedup >= 2, high IPC
SWP-Poor: speedup < 2, low IPC, dependence bounded
SWP-Fail: calls, conditionals, lack of registers, no valid schedule
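The three classes can be expressed as a small decision function. This is only a sketch: the function name, the parameters, and the use of a single measured speedup as input are assumptions made for illustration. The threshold of 2 is the one shown in the figure legend.

```c
typedef enum { SWP_GOOD, SWP_POOR, SWP_FAIL } swp_class;

/* swp_speedup: measured speedup of the loop from software pipelining.
 * pipelined:   nonzero if the compiler produced a valid software-pipelined
 *              schedule (zero for loops with calls, conditionals, too few
 *              registers, or no valid schedule). */
swp_class classify_loop(double swp_speedup, int pipelined)
{
    if (!pipelined)
        return SWP_FAIL;
    return (swp_speedup >= 2.0) ? SWP_GOOD : SWP_POOR;
}
```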

7 STI Overview
STI transformations
  Reconcile control flow differences to enable arbitrary alignment of code regions
  Use the control dependence graph (CDG): apply transformations hierarchically and repeatedly
    Conditionals: duplication
    Loops: jamming, peeling, unrolling and splitting
Two levels of integration
  Assembly
  HLL (we pursue this)
Side effects
  Code size increase
  Increased register pressure
  Additional data memory traffic

8 STI for Loops
[Figure: Proc a (labels a, P_a, N_a) and Proc b (labels b, P_b, N_b) are merged into Integrated Proc ab by loop peeling, loop jamming and loop splitting, under an assumed relation between N_a and N_b.]
1) Loop jamming + splitting
2) Loop unrolling + jamming + splitting
3) Loop peeling + jamming + splitting
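Case 1 (loop jamming + splitting) can be sketched as follows for two loops with different trip counts. The procedures `proc_a`, `proc_b` and `proc_ab` and their loop bodies are hypothetical stand-ins, not code from the talk. The jammed loop runs for the shorter trip count, and the remaining iterations of the longer loop are split off into a separate loop.

```c
#include <stddef.h>

/* Thread A: hypothetical loop with n_a iterations. */
void proc_a(int *a, size_t n_a)
{
    for (size_t i = 0; i < n_a; i++)
        a[i] *= 2;
}

/* Thread B: hypothetical loop with n_b iterations. */
void proc_b(int *b, size_t n_b)
{
    for (size_t i = 0; i < n_b; i++)
        b[i] += 1;
}

/* Integrated procedure: jam the common iterations, split off the rest. */
void proc_ab(int *a, size_t n_a, int *b, size_t n_b)
{
    size_t n_min = (n_a < n_b) ? n_a : n_b;

    /* Jammed loop: both bodies execute in every iteration. */
    for (size_t i = 0; i < n_min; i++) {
        a[i] *= 2;
        b[i] += 1;
    }
    /* Split-off remainders: at most one of these loops does any work. */
    for (size_t i = n_min; i < n_a; i++)
        a[i] *= 2;
    for (size_t i = n_min; i < n_b; i++)
        b[i] += 1;
}
```

Unrolling one body before jamming (case 2) or peeling iterations to shift the alignment (case 3) would change only how iterations are paired in the jammed loop. The remainder handling stays the same.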

9 STI for Loops (contd.)
Conditional
  Duplicate code into all conditionals
  Increase instructions in basic blocks (BBs)
Call
  Treat calls as regular statements
  Find more instructions to fill delay slots
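Duplication into conditionals can be illustrated with a loop whose body branches (thread A) integrated with a straight-line loop body (thread B). The names and loop bodies below are hypothetical. The point is only that B's statement is copied into both arms of A's conditional, so each basic block holds more independent instructions.

```c
#include <stddef.h>

/* Thread A: loop with a conditional (clamp or increment). */
void proc_a_cond(int *a, size_t n, int limit)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] > limit)
            a[i] = limit;
        else
            a[i] += 1;
    }
}

/* Thread B: straight-line loop body. */
void proc_b_plain(int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] *= 3;
}

/* Integrated procedure: B's body is duplicated into both arms of A's
 * conditional, so each arm has work from both threads to schedule. */
void proc_ab_cond(int *a, int *b, size_t n, int limit)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] > limit) {
            a[i] = limit;   /* A, then-arm */
            b[i] *= 3;      /* B, duplicated */
        } else {
            a[i] += 1;      /* A, else-arm */
            b[i] *= 3;      /* B, duplicated */
        }
    }
}
```

The duplicated statement is the source of the code size growth that the talk lists as a side effect of STI.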

10 STI Transformations
STI transformations for code regions A and B

A \ B      SWP-Good (loop)        SWP-Poor (loop)                           SWP-Fail (loop)                   Acyclic
SWP-Good   Do not apply STI       STI: Jam (+Unroll A)                      Do not apply STI                  STI: Loop peeling
SWP-Poor   STI: Jam (+Unroll B)   STI: Jam (+Unroll loop with smaller II)   Do not apply STI                  STI: Loop peeling
SWP-Fail   Do not apply STI       Do not apply STI                          STI: Jam (+Unroll, +Duplicate)    STI: Loop peeling
Acyclic    STI: Loop peeling      STI: Loop peeling                         STI: Loop peeling                 STI: Code motion
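The choice of transformation for a pair of code regions amounts to a table lookup. The sketch below encodes one reading of the slide's table, which is itself reconstructed from flattened text, so the cell assignments are an assumption. The type and function names are hypothetical.

```c
typedef enum { SWP_GOOD, SWP_POOR, SWP_FAIL, ACYCLIC } region_kind;

typedef enum {
    NO_STI,                 /* do not apply STI */
    JAM_UNROLL_A,           /* jam, optionally unrolling region A */
    JAM_UNROLL_B,           /* jam, optionally unrolling region B */
    JAM_UNROLL_SMALLER_II,  /* jam, optionally unrolling the loop with smaller II */
    JAM_UNROLL_DUPLICATE,   /* jam, optionally unrolling and duplicating */
    LOOP_PEELING,
    CODE_MOTION
} sti_action;

/* Rows: kind of region A. Columns: kind of region B. */
sti_action choose_sti(region_kind a, region_kind b)
{
    static const sti_action table[4][4] = {
        /* A = SWP_GOOD */ { NO_STI,       JAM_UNROLL_A,          NO_STI,               LOOP_PEELING },
        /* A = SWP_POOR */ { JAM_UNROLL_B, JAM_UNROLL_SMALLER_II, NO_STI,               LOOP_PEELING },
        /* A = SWP_FAIL */ { NO_STI,       NO_STI,                JAM_UNROLL_DUPLICATE, LOOP_PEELING },
        /* A = ACYCLIC  */ { LOOP_PEELING, LOOP_PEELING,          LOOP_PEELING,         CODE_MOTION  },
    };
    return table[a][b];
}
```

In both mixed Good/Poor cases the loop that gets unrolled is the SWP-Good one, which matches the `firu8_iir` and `corru8_errdif` experiments where the SWP-Good loop is unrolled before jamming.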

11 Platform and Tools
Target architecture: TI TMS320C64x
  Fixed-point DSP, VelociTI.2
  Clustered: (4 FUs + 32 registers) x 2
  ISA: predication, SIMD, 1 to 5 delay slots
  On-chip: 16KB L1P and L1D caches, 1024KB SRAM
Compiler and tools
  TI C6x compiler
    Option -o2: all optimizations except interprocedural ones
    Option -mt: aggressive memory anti-aliasing
  C64x simulator in Code Composer Studio (CCS) 2.20
    stall.xpath: stalls due to cross-path communication
    stall.mem: stalls due to memory bank conflicts
    stall.l1p: stalls due to program cache misses
    stall.l1d: stalls due to data cache misses
    exe.cycles: cycles other than stalls

12 Experiments
Original procedures
  SWP-Poor: iir, fft, hist, errdif
  SWP-Good: fir, fdct, idct, corr
  SWP-Fail: s1cond, s1call, s2cond, s2call
Integration
  SWP-Poor + SWP-Poor: iir_sti2, fft_sti2, hist_sti2, errdif_sti2
  SWP-Good + SWP-Good: fir_sti2, fdct_sti2, idct_sti2
  SWP-Poor + SWP-Good: fir_iir, firu8_iir, corr_errdif, corru8_errdif
  SWP-Fail + SWP-Fail: s1cond_sti2, s1call_sti2, s1condcall, s2cond_sti2, s2call_sti2, s2condcall

13 SWP-Poor + SWP-Poor
[Figure: speedup of iir, fft, hist, errdif (SWP-Poor) and fir, fdct, idct (SWP-Good) after self-integration, for an increasing number of input items (iir: 32, 64, 128; fft, hist, errdif: 64, 256, 1024).]
[Figure: % speedup by cycle category (stall.mem, stall.xpath, stall.l1d, stall.l1p, exe.cycles).]
(SWP-Poor+SWP-Poor) speedup > 1
(SWP-Good+SWP-Good) speedup < 1
Source of speedup: exe.cycles
Source of slowdown: stall.mem, stall.l1d, stall.l1p

14 SWP-Poor + SWP-Good
[Figure: speedup of fir_iir, firu2_iir, firu4_iir, firu8_iir, corr_errdif, corru2_errdif, corru4_errdif, corru8_errdif for an increasing number of input items (fir/iir: 32, 64, 128; corr/errdif: 8, 16, 32).]
[Figure: % speedup by cycle category (stall.mem, stall.xpath, stall.l1d, stall.l1p, exe.cycles).]
(SWP-Poor+SWP-Good) speedup > 1
Increasing the unroll factor increases the speedup
Source of speedup: exe.cycles
Impact of stalls is not consistent

15 SWP-Fail + SWP-Fail
[Figure: speedup of s1cond, s1call, s1condcall, s2cond, s2call, s2condcall for an increasing number of input items (8, 32, 128).]
[Figure: % speedup by cycle category (stall.mem, stall.xpath, stall.l1d, stall.l1p, exe.cycles).]
(SWP-Fail+SWP-Fail) speedup > 1
Source of speedup: exe.cycles
Source of slowdown: stall.mem, stall.l1d, stall.l1p

16 Conclusions
STI transformations for high ILP
  Determined by control structure and utilization
STI complements SWP
  SWP-Poor+SWP-Poor: 26% speedup (harmonic mean, HM)
  SWP-Poor+SWP-Good: 55% speedup (HM)
  SWP-Fail+SWP-Fail: 16% speedup (HM)
Future work: automatic integration
  Algorithm for code transformation
    Heuristics to match code regions (loop and acyclic)
    Reconcile more complex control flow differences
    Estimate the impact of dynamic events
  Develop a tool chain for automatic integration
  Support STI for StreamIt programs
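The summary figures are harmonic means (HM) over per-benchmark speedups. As a reminder of how such a summary is computed, here is a minimal sketch. The function name is hypothetical, and no input values are implied, because the per-benchmark numbers behind the slide's figures are not listed in the talk.

```c
#include <stddef.h>

/* Harmonic mean of n positive speedup ratios: n / sum(1 / s_i).
 * Returns 0.0 for an empty input. */
double harmonic_mean(const double *speedup, size_t n)
{
    double inv_sum = 0.0;
    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++)
        inv_sum += 1.0 / speedup[i];
    return (double)n / inv_sum;
}
```

The harmonic mean weights slow cases more heavily than the arithmetic mean does, so a single large speedup cannot dominate the summary.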

17 Any questions? Thanks.

18 Code Size
[Figure: code size in bytes of original and integrated (sti2) procedures for SWP-Poor (iir, fft, hist, errdif), SWP-Good (fir, fdct, idct) and SWP-Fail (s1cond, s1call, s2cond, s2call).]
[Figure: code size in bytes of f1, f2, their sum, and the integrated versions f1_f2, f1u2_f2, f1u4_f2, f1u8_f2 for s1cond+s1call, s2cond+s2call (SWP-Fail+SWP-Fail) and fir+iir, corr+errdif (SWP-Poor+SWP-Good).]
Code size increases after STI
  Significant in s1cond and s2cond because of conditionals
  (SWP-Fail+SWP-Fail): increase because of conditionals
  (SWP-Poor+SWP-Good): code size is less than the sum, but increases after unrolling

19 Back-up
STI Overview
Procedure Cloning and Integration
  Method
  Results
Automatic Integration
  Overview of Algorithm
  Tool Chain

20 STI Overview
Design spaces and steps for STI

21 Procedure Cloning and Integration
STEP 1: Identify candidate procedures
  Find procedures dominating performance by profiling or run-time estimation
STEP 2: Examine parallelism
  Find independent procedure-level data parallelism
    Each procedure call handles its own data sets
    Data sets are independent of each other
STEP 3: Perform procedure integration
  Design the control flow of the integrated procedure
  Simple techniques for identical procedures
    1) Loops with the same loop counts: loop jamming
    2) Loops with conditionals: duplicate conditionals

22 PCI (contd.)
STEP 4: Optimize the application
  Dynamic approach
    RTOS chooses the most efficient version at run time
  Static approach
    Replace original procedure calls with calls to integrated procedures
    Select the combination of most efficient versions
[Figure: two configurations. With the RTOS, the application reaches integrated procedure clones and other procedures through the RTOS. With direct calls, the application calls integrated procedure clones and other procedures itself.]

23 Preliminary Results
Application: JPEG
  cjpeg and djpeg with lena.ppm (512x512, 24-bit)
Target architecture: Itanium
  EPIC + predication + speculation
  16KB L1 D$/I$, 96KB unified L2, 2048KB unified L3
Software tools
  Compilers: GCC, Pro64, ORCC and Intel compilers
  OS: Linux for IA-64
  pfmon: Performance Monitoring Unit (PMU) library and tool
Integration method
  Integrated 3 procedures: FDCT and Encode in cjpeg, IDCT in djpeg
  Self-integrate 2 or 3 procedure calls
  Manually integrated at C source level and statically executed

24 Performance of Procedures
[Figure: three charts of normalized performance for NOSTI, STI2 and STI3 across compilers (GCC-O2, Pro64-O2, ORCC-O2, ORCC-O3, Intel-O2, Intel-O3, Intel-O2-u0): FDCT/CJPEG, EOB/CJPEG and IDCT/DJPEG.]
STI speeds up the best compiler (Intel-O2-u0) by 17% [FDCT/CJPEG]
13% speedup [EOB/CJPEG]
Slowdown in many cases because of I$ misses [IDCT/DJPEG]
Sweet spot varies between one, two and three threads
Application speedup with the best compiler: cjpeg 11%, djpeg 4%

25 Overview of Algorithm
1. Form the CDG for each procedure.
2. Annotate the CDG based on analysis of the ASM code.
   Utilization: # instructions, # cycles
   SWP info: SWP-Good/Poor/Fail
   Code size, register use, memory traffic, working set size, etc.
3. Rank-order code regions.
   In terms of idle resources and other factors

26 Overview of Algorithm (contd.)
4. Choose the best combination of code regions.
   Align loop regions first, then handle the rest of the code
   Avoid unbeneficial combinations (e.g. SWP-Fail + SWP-Good)
5. Try integration.
   Overlap loop iterations by loop jamming and unrolling
   Find opportunities for loop peeling and code motion
6. Generate C code for the integrated procedure.
7. Compile it.
8. Analyze the performance.
   Decide whether it is worth trying other transformations

27 Automation of Code Transformations
Plan to build a C-to-C translator
[Figure: tool chain. Labels: "Now a Perl script", "TI C6x compiler", "Will be developed", "Can use open compilers, e.g. gcc, orcc".]


More information

TDT 4260 TDT ILP Chap 2, App. C

TDT 4260 TDT ILP Chap 2, App. C TDT 4260 ILP Chap 2, App. C Intro Ian Bratt (ianbra@idi.ntnu.no) ntnu no) Instruction level parallelism (ILP) A program is sequence of instructions typically written to be executed one after the other

More information

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

COSC 6385 Computer Architecture - Memory Hierarchy Design (III) COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses

More information

ECE 4750 Computer Architecture, Fall 2015 T16 Advanced Processors: VLIW Processors

ECE 4750 Computer Architecture, Fall 2015 T16 Advanced Processors: VLIW Processors ECE 4750 Computer Architecture, Fall 2015 T16 Advanced Processors: VLIW Processors School of Electrical and Computer Engineering Cornell University revision: 2015-11-30-13-42 1 Motivating VLIW Processors

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer

More information

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor

More information

Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.

Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.

More information

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers Administration CS 380C: Advanced Topics in Compilers Instructor: eshav Pingali Professor (CS, ICES) Office: POB 4.126A Email: pingali@cs.utexas.edu TA: TBD Graduate student (CS) Office: Email: Meeting

More information

Two hours. No special instructions. UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date. Time

Two hours. No special instructions. UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date. Time Two hours No special instructions. UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE System Architecture Date Time Please answer any THREE Questions from the FOUR questions provided Use a SEPARATE answerbook

More information

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering

Administration CS 412/413. Instruction ordering issues. Simplified architecture model. Examples. Impact of instruction ordering dministration CS 1/13 Introduction to Compilers and Translators ndrew Myers Cornell University P due in 1 week Optional reading: Muchnick 17 Lecture 30: Instruction scheduling 1 pril 00 1 Impact of instruction

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information

EECS 583 Advanced Compilers Course Overview, Introduction to Control Flow Analysis

EECS 583 Advanced Compilers Course Overview, Introduction to Control Flow Analysis EECS 583 Advanced Compilers Course Overview, Introduction to Control Flow Analysis Fall 2011, University of Michigan September 7, 2011 About Me Mahlke = mall key» But just call me Scott 10 years here at

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Achieving Out-of-Order Performance with Almost In-Order Complexity

Achieving Out-of-Order Performance with Almost In-Order Complexity Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost

More information

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in

More information

MULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT

MULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT MULTI-CORE PROGRAMMING Dongrui She December 9, 2010 ASSIGNMENT Goal of the Assignment 1 The purpose of this assignment is to Have in-depth understanding of the architectures of real-world multi-core CPUs

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are

More information

CS553 Lecture Profile-Guided Optimizations 3

CS553 Lecture Profile-Guided Optimizations 3 Profile-Guided Optimizations Last time Instruction scheduling Register renaming alanced Load Scheduling Loop unrolling Software pipelining Today More instruction scheduling Profiling Trace scheduling CS553

More information

Instruction Scheduling

Instruction Scheduling Instruction Scheduling Michael O Boyle February, 2014 1 Course Structure Introduction and Recap Course Work Scalar optimisation and dataflow L5 Code generation L6 Instruction scheduling Next register allocation

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Instruction-Level Parallelism Instruction Scheduling Opportunities for Loop Optimization Software Pipelining Modulo

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Lecture 13 - VLIW Machines and Statically Scheduled ILP CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw

More information

IA-64 Compiler Technology

IA-64 Compiler Technology IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1 Introduction IA-32 compiler optimizations Profile Guidance (PGOPTI)

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information
