Complementing Software Pipelining with Software Thread Integration
Slide 1: Complementing Software Pipelining with Software Thread Integration
LCTES'05, June 16, 2005
Won So and Alexander G. Dean
Center for Embedded System Research, Dept. of ECE, North Carolina State University
wso@ncsu.edu, alex_dean@ncsu.edu
Slide 2: Motivation

- Multimedia applications demand high performance from embedded systems.
- DSPs with VLIW or EPIC architectures maximize processing bandwidth and deliver predictable, repeatable performance (e.g. Philips TriMedia, TI VelociTI, StarCore).
- However, speed is limited: it is difficult to find enough independent instructions, so software pipelining (SWP) can suffer or fail because of:
  - Complex control flow
  - Excessive register pressure
  - Tight loop-carried dependences
Slide 3: Software Thread Integration (STI)

- A software technique that interleaves multiple threads into a single, implicitly multithreaded one.
- Idea: STI for high ILP
  - Merge multiple procedures into one to increase the compiler's scope.
  - Apply code transformations to reconcile control-flow differences and enable arbitrary alignment of code regions.
  - Move code to use available resources.
- Method: procedure cloning and integration
  - Integrate parallel procedure calls in the application.
  - Create procedures with better efficiency.
  - Convert procedure-level parallelism into ILP.
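The idea on this slide can be illustrated with a minimal C sketch. The procedure, its data, and the integrated clone below are hypothetical (not the paper's benchmarks): two independent calls to a procedure are replaced by one integrated clone whose jammed loop body carries instructions from both calls, giving the compiler's scheduler more independent work per iteration.

```c
#include <stddef.h>

/* Original procedure: one call processes one data set. */
void scale(int *x, size_t n, int k) {
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

/* Integrated clone: does the work of two scale() calls in a single jammed
 * loop.  The two statements are independent, so a VLIW scheduler can issue
 * them in parallel, converting procedure-level parallelism into ILP. */
void scale_sti2(int *x, int *y, size_t n, int k) {
    for (size_t i = 0; i < n; i++) {
        x[i] *= k;   /* body from call A */
        y[i] *= k;   /* body from call B */
    }
}
```

The call sites `scale(a, n, k); scale(b, n, k);` would then be replaced by the single call `scale_sti2(a, b, n, k);`.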
Slide 4: STI for High ILP

- Assumption: parallel procedures are identified a priori, either manually or automatically extracted (e.g. by exploiting stream programming languages such as StreamIt).
- Goal: code transformations that improve ILP
  - Apply transformations based on schedule analysis.
  - Complement the key optimization (i.e. SWP).
- Contribution: improve the DSP application development process
  - Efficient compilation of C or C-like languages with additional dataflow information.
  - Reduce the need for extensive manual C and assembly code optimization and tuning.
Slide 5: Related Work

- SWP for complex control flow: hierarchical reduction [Lam88], enhanced modulo scheduling [Warter92], MS for multiple exits [Lavery92], multiple-II MS [Warter-Perez95], all-path pipelining [Stoodley96]. STI complements existing SWP methods.
- Loop optimizations: loop jamming, loop unrolling, unroll-and-jam [Carr96]. STI jams whole procedures.
- Procedure cloning [Cooper93], procedure inlining, interprocedural optimizations: integrated procedures do the work of multiple ones, and STI transforms control flow.
- StreamIt / stream programming: STI exploits coarse-grain parallelism.
Slide 6: Classification of Loops

[Chart: IPCs of loops from the TI Image/DSP library, before and after SWP, split at speedup 2]

- SWP-Good: speedup >= 2, high IPC
- SWP-Poor: speedup < 2, low IPC, dependence-bounded
- SWP-Fail: SWP fails because of calls, conditionals, lack of registers, or no valid schedule
Slide 7: STI Overview

- STI transformations
  - Reconcile control-flow differences and enable arbitrary alignment.
  - Use the CDG: apply hierarchically and repeatedly.
  - Conditionals: duplication. Loops: jamming, peeling, unrolling and splitting.
- Two levels of integration: assembly or HLL. We pursue HLL.
- Side effects: code size increase, increased register pressure, additional data memory traffic.
Slide 8: STI for Loops

[Figure: loops from Proc A and Proc B combined into an integrated procedure AB by loop peeling, jamming and splitting, assuming Na > Nb]

Three integration strategies:
1) Loop jamming + splitting
2) Loop unrolling + jamming + splitting
3) Loop peeling + jamming + splitting
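Strategy 1 (loop jamming + splitting) can be sketched in C as follows, assuming two hypothetical per-iteration bodies and trip counts na >= nb: the jammed loop runs both bodies for nb iterations, and the split remainder finishes the longer loop alone. Names and bodies are illustrative, not taken from the paper.

```c
/* Hypothetical per-iteration bodies of Proc A and Proc B. */
static void body_a(int *a, int i) { a[i] += 1; }
static void body_b(int *b, int i) { b[i] *= 2; }

/* Integrated procedure: loop jamming + splitting, assuming na >= nb. */
void proc_ab(int *a, int na, int *b, int nb) {
    int i;
    for (i = 0; i < nb; i++) {  /* jammed loop: both bodies per iteration */
        body_a(a, i);
        body_b(b, i);
    }
    for (; i < na; i++)         /* split loop: remaining iterations of A */
        body_a(a, i);
}
```

When the trip counts differ by a larger factor, strategy 2 would additionally unroll the shorter loop's partner before jamming; strategy 3 peels iterations to align loop boundaries first.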
Slide 9: STI for Loops (contd.)

- Conditionals: duplicate code into all conditionals, increasing the number of instructions in each basic block.
- Calls: treat them as regular statements; find more instructions to fill delay slots.
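A minimal sketch of conditional duplication, with hypothetical loop bodies: thread B's statement is copied into both arms of thread A's conditional, so every basic block of the integrated loop carries independent instructions that can fill idle issue slots and delay slots.

```c
/* Jammed loop where thread A's body contains a conditional.  Thread B's
 * statement (b[i] += 1) is duplicated into both arms so that each basic
 * block holds instructions from both threads. */
void abs_and_count(int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] < 0) {
            a[i] = -a[i];    /* thread A, then-arm */
            b[i] += 1;       /* thread B, duplicated */
        } else {
            a[i] = 2 * a[i]; /* thread A, else-arm */
            b[i] += 1;       /* thread B, duplicated */
        }
    }
}
```

The cost is visible here too: duplicating B into every arm is exactly what drives the code-size growth reported later for the conditional-heavy SWP-Fail benchmarks.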
Slide 10: STI Transformations

STI transformations for a pair of code regions A and B, chosen by each region's class:

- Loop SWP-Good + loop SWP-Good: do not apply STI
- Loop SWP-Poor + loop SWP-Good: STI: jam (+unroll the SWP-Good loop)
- Loop SWP-Poor + loop SWP-Poor: STI: jam (+unroll the loop with the smaller II)
- Loop SWP-Fail + loop SWP-Good or SWP-Poor: do not apply STI
- Loop SWP-Fail + loop SWP-Fail: STI: jam (+unroll, +duplicate)
- Loop + acyclic region: STI: loop peeling
- Acyclic + acyclic: STI: code motion
Slide 11: Platform and Tools

- Target architecture: TI TMS320C64x
  - Fixed-point DSP, VelociTI.2
  - Clustered: (4 FUs + 32 registers) x 2
  - ISA: predication, SIMD, 1-5 delay slots
  - On-chip: 16KB L1P/L1D caches, 1024KB SRAM
- Compiler and tools
  - TI C6x compiler; option -o2: all optimizations except interprocedural ones; option -mt: aggressive memory anti-aliasing
  - C64x simulator in Code Composer Studio (CCS) 2.20
- Cycle categories reported by the simulator:
  - stall.xpath: stalls due to cross-path communication
  - stall.mem: stalls due to memory bank conflicts
  - stall.l1p: stalls due to program cache misses
  - stall.l1d: stalls due to data cache misses
  - exe.cycles: cycles other than stalls
Slide 12: Experiments

Benchmarks by class:
- SWP-Poor: iir, fft, hist, errdif
- SWP-Good: fir, fdct, idct, corr
- SWP-Fail: s1cond, s1call, s2cond, s2call

Integrations:
- SWP-Poor + SWP-Poor: iir_sti2, fft_sti2, hist_sti2, errdif_sti2
- SWP-Good + SWP-Good: fir_sti2, fdct_sti2, idct_sti2
- SWP-Poor + SWP-Good: fir_iir, firu8_iir, corr_errdif, corru8_errdif
- SWP-Fail + SWP-Fail: s1cond_sti2, s1call_sti2, s1condcall, s2cond_sti2, s2call_sti2, s2condcall
Slide 13: SWP-Poor + SWP-Poor Results

[Charts: speedup vs. number of input items for iir, fft, hist and errdif; cycle-category breakdown (stall.mem, stall.xpath, stall.l1d, stall.l1p, exe.cycles)]

- (SWP-Poor + SWP-Poor) speedup > 1; (SWP-Good + SWP-Good) speedup < 1
- Source of speedup: exe.cycles
- Sources of slowdown: stall.mem, stall.l1d, stall.l1p
Slide 14: SWP-Poor + SWP-Good Results

[Charts: speedup vs. number of input items for fir_iir, firu2/u4/u8_iir, corr_errdif and corru2/u4/u8_errdif; cycle-category breakdown]

- (SWP-Poor + SWP-Good) speedup > 1
- Increasing the unroll factor increases the speedup
- Source of speedup: exe.cycles
- The impact of stalls is not consistent
Slide 15: SWP-Fail + SWP-Fail Results

[Charts: speedup vs. number of input items for s1cond, s1call, s1condcall, s2cond, s2call and s2condcall; cycle-category breakdown]

- (SWP-Fail + SWP-Fail) speedup > 1
- Source of speedup: exe.cycles
- Sources of slowdown: stall.mem, stall.l1d, stall.l1p
Slide 16: Conclusions

- STI transformations for high ILP are determined by control structure and utilization.
- STI complements SWP:
  - SWP-Poor + SWP-Poor: 26% speedup (harmonic mean)
  - SWP-Poor + SWP-Good: 55% speedup (harmonic mean)
  - SWP-Fail + SWP-Fail: 16% speedup (harmonic mean)
- Future work: automatic integration
  - Algorithm for code transformation: heuristics to match code regions (loop and acyclic), reconciling more complex control-flow differences, estimating the impact of dynamic events.
  - Develop a tool chain for automatic integration; support STI for StreamIt programs.
Slide 17: Any questions? Thanks.
Slide 18: Code Size

[Charts: code size in bytes for original vs. sti2 versions of each benchmark, and for integrated pairs (f1_f2, f1u2_f2, f1u4_f2, f1u8_f2) vs. the sum of the originals]

- Code size increases after STI; the increase is significant in s1cond and s2cond (SWP-Fail + SWP-Fail) because of conditional duplication.
- For SWP-Poor + SWP-Good, the integrated code size is less than the sum of the originals, but it increases with the unroll factor.
Slide 19: Back-up

- STI Overview
- Procedure Cloning and Integration: method and results
- Automatic Integration: overview of algorithm, tool chain
Slide 20: STI Overview

[Figure: design spaces and steps for STI]
Slide 21: Procedure Cloning and Integration

- STEP 1: Identify candidate procedures. Find procedures that dominate performance by profiling or run-time estimation.
- STEP 2: Examine parallelism. Find independent procedure-level data parallelism: each procedure call handles its own data set, and the data sets are independent of each other.
- STEP 3: Perform procedure integration. Design the control flow of the integrated procedure. Simple techniques for identical procedures:
  1) Loops with the same loop counts: loop jamming
  2) Loops with conditionals: duplicate conditionals
Slide 22: Procedure Cloning and Integration (contd.)

- STEP 4: Optimize the application.
  - Dynamic approach: the RTOS chooses the most efficient version at run time.
  - Static approach: replace original procedure calls with calls to integrated procedures; select the combinations of the most efficient versions.

[Figure: application structure with either RTOS-dispatched or direct calls to integrated procedure clones and other procedures]
Slide 23: Preliminary Results (Application: JPEG)

- cjpeg and djpeg with lena.ppm (512x512, 24-bit)
- Target architecture: Itanium (EPIC + predication + speculation); 16KB L1 D$/I$, 96KB unified L2, 2048KB unified L3
- Software tools: GCC, Pro64, ORCC and Intel compilers; Linux for IA-64; pfmon (Performance Monitoring Unit library and tool)
- Integration method: integrated 3 procedures (FDCT and Encode in cjpeg, IDCT in djpeg); self-integrate 2 or 3 procedure calls; manually integrated at the C source level and statically executed
Slide 24: Performance of Procedures

[Charts: normalized performance of NOSTI, STI2 and STI3 versions across compilers (GCC-O2, Pro64-O2, ORCC-O2, ORCC-O3, Intel-O2, Intel-O3, Intel-O2-u0) for FDCT/cjpeg, EOB/cjpeg and IDCT/djpeg]

- FDCT/cjpeg: STI speeds up the best compiler (Intel-O2-u0) by 17%
- EOB/cjpeg: 13% speedup
- IDCT/djpeg: slowdown in many cases because of I-cache misses
- The sweet spot varies between one, two and three threads
- Application speedup with the best compiler: cjpeg 11%, djpeg 4%
Slide 25: Overview of Algorithm

1. Form the CDG for each procedure.
2. Annotate the CDG based on analysis of the assembly code: utilizations (number of instructions, number of cycles), SWP information (SWP-Good/Poor/Fail), code size, register use, memory traffic, working set size, etc.
3. Rank-order code regions in terms of idle resources and other factors.
Slide 26: Overview of Algorithm (contd.)

4. Choose the best combination of code regions: align loop regions first, then the rest of the code; avoid unbeneficial combinations (e.g. SWP-Fail + SWP-Good).
5. Try integration: overlap loop iterations by loop jamming and unrolling; find opportunities for loop peeling and code motion.
6. Generate C code for the integrated procedure.
7. Compile it.
8. Analyze the performance and decide whether it is worth trying other transformations.
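Step 3 above (rank-ordering code regions by idle resources) could be sketched as follows. The struct fields and the ranking key, empty issue slots on an 8-wide C64x schedule, are assumptions for illustration, not the paper's exact cost model.

```c
#include <stdlib.h>

/* Per-region metrics gathered from assembly analysis (fields assumed). */
struct region {
    const char *name;
    int cycles;        /* schedule length in cycles */
    int instructions;  /* instructions scheduled in those cycles */
};

/* Idle issue slots, assuming an 8-wide VLIW like the C64x. */
static int idle_slots(const struct region *r) {
    return 8 * r->cycles - r->instructions;
}

/* qsort comparator: most idle slots first. */
static int by_idle_desc(const void *p, const void *q) {
    return idle_slots((const struct region *)q)
         - idle_slots((const struct region *)p);
}

/* Sort regions so the emptiest schedules (best STI candidates) come first. */
void rank_regions(struct region *r, int n) {
    qsort(r, (size_t)n, sizeof *r, by_idle_desc);
}
```

A ranking like this makes step 4's matching greedy and local: pair the region with the most idle slots against the unmatched region whose class the slide-10 rules allow.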
Slide 27: Automation of Code Transformations

- Plan to build a C-to-C translator; it is currently a Perl script and will be developed further.
- The translator feeds the TI C6x compiler; open compilers (e.g. gcc, orcc) can also be used.
More informationEECS 583 Advanced Compilers Course Overview, Introduction to Control Flow Analysis
EECS 583 Advanced Compilers Course Overview, Introduction to Control Flow Analysis Fall 2011, University of Michigan September 7, 2011 About Me Mahlke = mall key» But just call me Scott 10 years here at
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationAchieving Out-of-Order Performance with Almost In-Order Complexity
Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost
More informationCOMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in
More informationMULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT
MULTI-CORE PROGRAMMING Dongrui She December 9, 2010 ASSIGNMENT Goal of the Assignment 1 The purpose of this assignment is to Have in-depth understanding of the architectures of real-world multi-core CPUs
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are
More informationCS553 Lecture Profile-Guided Optimizations 3
Profile-Guided Optimizations Last time Instruction scheduling Register renaming alanced Load Scheduling Loop unrolling Software pipelining Today More instruction scheduling Profiling Trace scheduling CS553
More informationInstruction Scheduling
Instruction Scheduling Michael O Boyle February, 2014 1 Course Structure Introduction and Recap Course Work Scalar optimisation and dataflow L5 Code generation L6 Instruction scheduling Next register allocation
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationSimultaneous Multithreading: a Platform for Next Generation Processors
Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt
More informationSoftware Pipelining by Modulo Scheduling. Philip Sweany University of North Texas
Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Instruction-Level Parallelism Instruction Scheduling Opportunities for Loop Optimization Software Pipelining Modulo
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationLecture 13 - VLIW Machines and Statically Scheduled ILP
CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationIA-64 Compiler Technology
IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1 Introduction IA-32 compiler optimizations Profile Guidance (PGOPTI)
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA
CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationCompiling for Fine-Grain Concurrency: Planning and Performing Software Thread Integration
Compiling for Fine-Grain Concurrency: Planning and Performing Software Thread Integration RTSS 2002 -- December 3-5, Austin, Texas Alex Dean Center for Embedded Systems Research Dept. of ECE, NC State
More informationINSTRUCTION LEVEL PARALLELISM
INSTRUCTION LEVEL PARALLELISM Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix H, John L. Hennessy and David A. Patterson,
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More information