Loops. Announcements. Loop fusion. Loop unrolling. Code motion. Array. Good targets for optimization. Basic loop optimizations:
- Bernadette Shepherd
Transcription
Chenyang Lu, CSE 467S

Announcements
- HW1 is available online.
- Next class: Liang will give a tutorial on TinyOS/motes. Very useful!
- Classroom: EADS Hall 116, this Wed ONLY.
- Proposal is due at 5pm, Wed. Email me your proposal.

Loops
- Good targets for optimization.
- Basic loop optimizations: unrolling; fusion; code motion; induction-variable elimination; strength reduction.

Loop unrolling
- Reduces loop overhead.

    for (i=0; i<4; i++)
        a[i] = b[i] * c[i];

  becomes

    for (i=0; i<4; i+=2) {
        a[i] = b[i] * c[i];
        a[i+1] = b[i+1] * c[i+1];
    }

- Unnecessary on SHARC.

Loop fusion
- Combines multiple loops into 1:

    for (i=0; i<N; i++)
        a[i] = b[i] * 5;
    for (j=0; j<N; j++)
        w[j] = c[j] * d[j];

  becomes

    for (i=0; i<N; i++) {
        a[i] = b[i] * 5;
        w[i] = c[i] * d[i];
    }

- Necessary conditions:
  - The loops share the same index range.
  - No dependencies between the two loops.

Code motion

    for (i=0; i<N*M; i++)
        z[i] = a[i] + b[i];

[Figure: flowcharts before and after code motion. Before: i=0; test i<N*M; body z[i] = a[i] + b[i]; i = i+1. After: i=0; X = N*M; test i<X; same body.]

Array
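As a recap of the loop transformations on the preceding slides (unrolling, fusion, code motion), here is a minimal C sketch. The function names, the `size_t` parameters, and the use of flat `int` arrays are assumptions made for illustration and do not come from the slides. The fused loop also assumes the arrays do not overlap, which is one way to satisfy the "no dependencies" condition.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one loop test and one index update per element. */
void multiply_rolled(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Unrolled by two: half as many loop tests and index updates.
   Assumes n is even; an odd n would need one cleanup iteration. */
void multiply_unrolled2(int *a, const int *b, const int *c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
    }
}

/* Fusion: two loops over the same range, with no dependencies
   between them, share a single loop header. */
void scale_and_multiply_fused(int *a, const int *b,
                              int *w, const int *c, const int *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * 5;
        w[i] = c[i] * d[i];
    }
}

/* Code motion: rows * cols does not change inside the loop,
   so it is computed once before the loop instead of in every test. */
void add_hoisted(int *z, const int *a, const int *b, size_t rows, size_t cols)
{
    const size_t limit = rows * cols;
    for (size_t i = 0; i < limit; i++)
        z[i] = a[i] + b[i];
}
```

An optimizing compiler will often perform code motion and unrolling on its own, so the hand-written forms matter mostly when the compiler cannot prove the transformation is safe.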
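One-dimensional arrays
- C array name points to 0th element:

    a[i] = *(a + i)

[Figure: a points to a[0], followed in memory by a[1], a[2], ...]

Two-dimensional arrays
- Row-major layout (N rows of M elements each):

    a[i][j] = *(a + i*M + j)

[Figure: rows stored one after another in memory: a[0,0], a[0,1], ..., then a[1,0], a[1,1], ...]

Original loop:

    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            z[i][j] = b[i][j];

With explicit address arithmetic:

    zptr = z; bptr = b;
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            zind = i*M+j; bind = i*M+j;
            *(zptr+zind) = *(bptr+bind);
        }

After induction variable elimination (zind and bind are always equal, so one variable is enough):

    zptr = z; bptr = b;
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            zbind = i*M+j;
            *(zptr+zbind) = *(bptr+zbind);
        }

After strength reduction (the multiply is replaced by an increment; the increment follows the copy so that element 0 is included):

    zptr = z; bptr = b; zbind = 0;
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            *(zptr+zbind) = *(bptr+zbind);
            zbind++;
        }

Cache analysis
- Because loops use large quantities of data (arrays), cache conflicts are common.

Direct-mapped cache

[Figure: direct-mapped cache. Each cache block holds a valid bit, a tag, and data bytes. The address is split into tag, index, and offset; the stored tag is compared with the address tag to produce the hit signal, and the offset selects the byte value.]

Array conflicts in cache

    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            a[i][j] = a[i][j] + b[i][j];

[Figure: a[0,0] and b[0,0] in main memory map to the same location in the cache.]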
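The array-copy example above (induction variable elimination followed by strength reduction) can be sketched as compilable C. This is a sketch under assumptions: the arrays are passed as flat `int` buffers of `rows * cols` elements, and the function and parameter names are illustrative rather than taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Indexed version: recomputes i * cols + j, one multiply per element. */
void copy_indexed(int *z, const int *b, size_t rows, size_t cols)
{
    assert(z != NULL && b != NULL);
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            *(z + i * cols + j) = *(b + i * cols + j);
}

/* Strength-reduced version: a single shared offset replaces both
   index computations, and the multiply becomes an increment. */
void copy_strength_reduced(int *z, const int *b, size_t rows, size_t cols)
{
    assert(z != NULL && b != NULL);
    size_t offset = 0;
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            *(z + offset) = *(b + offset);
            offset++;  /* increment after the copy so element 0 is included */
        }
    }
}
```

Both functions visit elements in row-major order, so the running offset always equals `i * cols + j` at the moment it is used.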
Array conflicts, cont'd.
- Array elements conflict when they map to the same cache line, even if they do not map to the same location within it.
- Solutions: move one array; pad array.

Static cache locking
- Lock instructions in cache before execution.
- Predictable execution time.
- Similarly, you may lock code and data in main memory to avoid paging.

Register allocation
- Reduce the number of used registers.
- Fit all frequently used variables in registers.
- Load once, use many times.
- Reduce the number of cache/memory accesses.

Register lifetime graph

    1. w = a + b;
    2. x = c + w;
    3. y = c + d;
    4. z = a - b;

[Figure: lifetime graph for a, b, c, d, w, x, y, z. The number of needed registers is missing from the transcript.]

After rescheduling

    1. w = a + b;
    2. z = a - b;
    3. x = c + w;
    4. y = c + d;

[Figure: lifetime graph for a, b, c, d, w, x, y, z after rescheduling. The number of needed registers is missing from the transcript.]

- Note: must make sure no dependencies among instructions are changed.

Performance optimization hints
- Use registers efficiently.
- Optimize loops.
- Optimize function calls.
- Optimize cache behavior:
  - instruction conflicts can be handled by rewriting code, rescheduling;
  - conflicting scalar data can easily be moved;
  - conflicting array data can be moved, padded.
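The "pad array" remedy for conflicting array data can be illustrated with a small model of a direct-mapped cache. The cache geometry below (16-byte lines, 64 lines) and all names are hypothetical values chosen for the example, not figures from the slides or from any particular processor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical direct-mapped cache: 64 lines of 16 bytes (1 KiB). */
enum { LINE_BYTES = 16, NUM_LINES = 64 };

/* Index of the cache line that a byte address maps to. */
unsigned cache_line_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
}

/* Base address of a second array placed after a first array of
   a_bytes bytes plus pad_bytes of padding. Padding is assumed to be
   a whole number of cache lines to keep the reasoning simple. */
uintptr_t second_array_base(uintptr_t a_base, uintptr_t a_bytes,
                            uintptr_t pad_bytes)
{
    assert(pad_bytes % LINE_BYTES == 0);
    return a_base + a_bytes + pad_bytes;
}

/* Returns 1 if the bytes at the same offset in both arrays
   map to the same cache line, 0 otherwise. */
int elements_conflict(uintptr_t a_base, uintptr_t b_base, uintptr_t offset)
{
    return cache_line_index(a_base + offset) ==
           cache_line_index(b_base + offset);
}
```

In this model, a first array whose size is exactly the cache size makes every a[k] and b[k] pair collide when the arrays are adjacent; one line of padding shifts the second array so the pairs fall in neighboring lines instead.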
Execution Time Analysis

Motivation
- Embedded systems must meet deadlines.
- Need to analyze execution time.

Performance analysis
- Execution time is affected by both program path and instruction timing.
- Path depends on input data values.
- Instruction timing depends on:
  - pipelining;
  - cache behavior: memory access can be 10 times slower than cache!
- Accurate execution time is unknown a priori.

Program paths

    for (i=0, f=0; i<N; i++)
        f = f + c[i]*x[i];

- Loop initiation block executed once.
- Loop test executed N+1 times.
- Loop body and variable update executed N times.

[Figure: flowchart. i=0; f=0; then test i<N; on Y execute f = f + c[i]*x[i]; i = i+1; and return to the test; on N exit.]

Execution time metrics
- Average-case:
  - For typical data values, whatever they are.
  - Soft real-time.
- Worst-case:
  - For any possible input set.
  - Hard real-time.
  - Longest program path may NOT lead to the longest execution time.
- Best-case:
  - For any possible input set.

Approaches
- Analysis: compile-time tools.
- Measurement.
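The execution counts stated on the "Program paths" slide can be checked by instrumenting the loop. The sketch below rewrites the `for` loop so that the initiation block, the test, and the body are separate statements, each with its own counter. The `path_counts` struct and the function name are assumptions introduced for this example.

```c
#include <assert.h>

/* How many times each block of the loop was executed. */
struct path_counts {
    unsigned init;  /* loop initiation block */
    unsigned test;  /* loop test */
    unsigned body;  /* loop body plus variable update */
};

/* Dot-product loop from the slide with one counter per block. */
int dot_product_counted(const int *c, const int *x, int n,
                        struct path_counts *counts)
{
    assert(n >= 0);
    int i, f;
    counts->init = counts->test = counts->body = 0;

    i = 0; f = 0;                 /* initiation: runs once */
    counts->init++;
    for (;;) {
        counts->test++;           /* test: runs n + 1 times */
        if (!(i < n))
            break;
        f = f + c[i] * x[i];      /* body and update: run n times */
        i = i + 1;
        counts->body++;
    }
    return f;
}
```

The extra execution of the test is the one that fails and exits the loop, which is why the test count is always one more than the body count.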
Analyze execution time
- Analyze optimized assembly/binary code, not high-level language code:
  - non-obvious translations of HLL statements into instructions;
  - e.g., heap operations: new(obj);
- Challenges:
  - Program path depends on input data.
  - Modern processors: pipelining and cache effects are hard to predict.
  - Analysis tends to be pessimistic.

Measure execution time
- CPU simulator:
  - I/O may be hard.
  - May not be totally accurate.
- Time stamping:
  - Requires an instrumented program.
  - Timer granularity:
    - gettimeofday on UNIX/Linux: 10 ms.
    - gethrtime on Pentium: reads a 64-bit clock cycle counter and returns the number of clock cycles since the CPU was powered up or reset.
- Logic analyzer:
  - Limited logic analyzer memory depth.

Example: Output from a Logic Analyzer

[Figure: timing diagram of event propagation on a mote. Granularity: 50 microseconds.]

Trace-driven analysis
- Trace: a record of the program path of a program.
- Helps study cache behavior.
- A useful trace:
  - requires proper input values;
  - is large (gigabytes).

Trace generation
- Hardware capture:
  - logic analyzer:
    - limited buffer space;
    - cannot observe on-chip cache;
  - hardware assist in CPU:
    - Pentium supports automatic tracing of branches.
- Software:
  - PC sampling;
  - instrumentation instructions.
- Simulation.

Optimizing for program size
- Goal:
  - reduce hardware cost of memory;
  - reduce power consumption of memory units.
- Two opportunities:
  - data;
  - instructions.
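The time-stamping approach listed under "Measure execution time" can be sketched with the POSIX `gettimeofday` call that the slide mentions. The helper names and the choice of a dot-product loop as the code under measurement are assumptions for illustration. `gettimeofday` reports wall-clock time, so a measurement taken this way can be disturbed by clock adjustments and by other activity on the system.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Difference between two time stamps in microseconds. */
long long elapsed_us(struct timeval start, struct timeval end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000LL
         + (long long)(end.tv_usec - start.tv_usec);
}

/* Instrumented program: time stamps taken before and after the loop.
   Stores the dot product in *result and returns the elapsed
   wall-clock time in microseconds. */
long long time_dot_product_us(const int *c, const int *x, int n, int *result)
{
    struct timeval start, end;
    int f = 0;

    assert(result != NULL);
    gettimeofday(&start, NULL);
    for (int i = 0; i < n; i++)
        f += c[i] * x[i];
    gettimeofday(&end, NULL);

    *result = f;
    return elapsed_us(start, end);
}
```

When the measured code is shorter than the timer granularity, a common workaround is to run it many times between the two time stamps and divide the elapsed time by the repetition count.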
Data size minimization
- Reuse constants, variables, and data buffers in different parts of the code.
  - E.g., buffering in TinyOS.
  - E.g., pack multiple flags in one byte.
  - Requires careful verification of correctness.
- Generate data using instructions.

Reducing code size
- Avoid loop unrolling.
- Inlining?
- Choose a CPU with compact instructions.
- Some CPUs support a dense instruction set: ARM Thumb, MIPS-16.

Effects of inlining on TinyOS
- Inlining reduces code size AND improves performance!
- Can you guess why?

Code compression
- Use statistical compression to reduce code size, decompress on-the-fly.

[Figure: main memory feeds a decompressor (with table), which feeds the cache and then the CPU; example instruction: LDR r0,[r4].]

Reading
- Textbook 5.6, 5.7, 5.8.
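The "pack multiple flags in one byte" idea from the data size minimization slide can be sketched as follows. The flag names are hypothetical and are not taken from TinyOS; the point is only that three booleans share one `uint8_t` instead of occupying one variable each.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags, one bit each, all stored in a single byte. */
enum {
    FLAG_RADIO_ON     = 1 << 0,
    FLAG_SENSOR_READY = 1 << 1,
    FLAG_BUFFER_FULL  = 1 << 2
};

/* Returns flags with the bits in mask set. */
uint8_t flags_set(uint8_t flags, uint8_t mask)
{
    assert(mask != 0);
    return (uint8_t)(flags | mask);
}

/* Returns flags with the bits in mask cleared. */
uint8_t flags_clear(uint8_t flags, uint8_t mask)
{
    assert(mask != 0);
    return (uint8_t)(flags & ~mask);
}

/* Returns 1 if every bit in mask is set in flags, 0 otherwise. */
int flags_test(uint8_t flags, uint8_t mask)
{
    assert(mask != 0);
    return (flags & mask) == mask;
}
```

The saving in data memory comes at the cost of a few extra instructions per access, and updates to a shared byte need care if an interrupt handler can modify the same flags.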
Program Op*miza*on and Analysis. Chenyang Lu CSE 467S
Program Op*miza*on and Analysis Chenyang Lu CSE 467S 1 Program Transforma*on op#mize Analyze HLL compile assembly assemble Physical Address Rela5ve Address assembly object load executable link Absolute
More informationProgram design and analysis
Program design and analysis Optimizing for execution time. Optimizing for energy/power. Optimizing for program size. Motivation Embedded systems must often meet deadlines. Faster may not be fast enough.
More informationCPUs. Caching: The Basic Idea. Cache : MainMemory :: Window : Caches. Memory management. CPU performance. 1. Door 2. Bigger Door 3. The Great Outdoors
CPUs Caches. Memory management. CPU performance. Cache : MainMemory :: Window : 1. Door 2. Bigger Door 3. The Great Outdoors 4. Horizontal Blinds 18% 9% 64% 9% Door Bigger Door The Great Outdoors Horizontal
More informationUSC 227 Office hours: 3-4 Monday and Wednesday CS553 Lecture 1 Introduction 4
CS553 Compiler Construction Instructor: URL: Michelle Strout mstrout@cs.colostate.edu USC 227 Office hours: 3-4 Monday and Wednesday http://www.cs.colostate.edu/~cs553 CS553 Lecture 1 Introduction 3 Plan
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationOffice Hours: Mon/Wed 3:30-4:30 GDC Office Hours: Tue 3:30-4:30 Thu 3:30-4:30 GDC 5.
CS380C Compilers Instructor: TA: lin@cs.utexas.edu Office Hours: Mon/Wed 3:30-4:30 GDC 5.512 Jia Chen jchen@cs.utexas.edu Office Hours: Tue 3:30-4:30 Thu 3:30-4:30 GDC 5.440 January 21, 2015 Introduction
More informationSE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan Memory Hierarchy
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan govind@serc Memory Hierarchy 2 1 Memory Organization Memory hierarchy CPU registers few in number (typically 16/32/128) subcycle access
More informationCACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás
CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationECE 486/586. Computer Architecture. Lecture # 7
ECE 486/586 Computer Architecture Lecture # 7 Spring 2015 Portland State University Lecture Topics Instruction Set Principles Instruction Encoding Role of Compilers The MIPS Architecture Reference: Appendix
More informationMemories. CPE480/CS480/EE480, Spring Hank Dietz.
Memories CPE480/CS480/EE480, Spring 2018 Hank Dietz http://aggregate.org/ee480 What we want, what we have What we want: Unlimited memory space Fast, constant, access time (UMA: Uniform Memory Access) What
More informationAddressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science CPUtime = IC CPI Execution + Memory accesses Instruction
More informationPERFORMANCE OPTIMISATION
PERFORMANCE OPTIMISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Hardware design Image from Colfax training material Pipeline Simple five stage pipeline: 1. Instruction fetch get instruction
More informationWrite only as much as necessary. Be brief!
1 CIS371 Computer Organization and Design Midterm Exam Prof. Martin Thursday, March 15th, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached
More informationAutotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT
Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic
More informationMemory management units
Memory management units Memory management unit (MMU) translates addresses: CPU logical address memory management unit physical address main memory Computers as Components 1 Access time comparison Media
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationMemory Hierarchy Basics
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases
More informationCOSC 6385 Computer Architecture - Memory Hierarchy Design (III)
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses
More informationLoop Transformations! Part II!
Lecture 9! Loop Transformations! Part II! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! Loop Unswitching Hoist invariant control-flow
More informationCS553 Lecture Profile-Guided Optimizations 3
Profile-Guided Optimizations Last time Instruction scheduling Register renaming alanced Load Scheduling Loop unrolling Software pipelining Today More instruction scheduling Profiling Trace scheduling CS553
More informationCS 701. Class Meets. Instructor. Teaching Assistant. Key Dates. Charles N. Fischer. Fall Tuesdays & Thursdays, 11:00 12: Engineering Hall
CS 701 Charles N. Fischer Class Meets Tuesdays & Thursdays, 11:00 12:15 2321 Engineering Hall Fall 2003 Instructor http://www.cs.wisc.edu/~fischer/cs703.html Charles N. Fischer 5397 Computer Sciences Telephone:
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationMemory. From Chapter 3 of High Performance Computing. c R. Leduc
Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor
More informationLecture 10: Static ILP Basics. Topics: loop unrolling, static branch prediction, VLIW (Sections )
Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 4.4) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationFinal Exam Review Questions
EECS 665 Final Exam Review Questions 1. Give the three-address code (like the quadruples in the csem assignment) that could be emitted to translate the following assignment statement. However, you may
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence
More informationEnergy Awareness for Embedded Systems OPTIMIZING EMBEDDED SOFTWARE FOR POWER
Energy Awareness for Embedded Systems OPTIMIZING EMBEDDED SOFTWARE FOR POWER Introduction Review of Power Consumption Understanding Power for Embedded Systems Software and Hardware Optimizations Review
More informationBasic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1
Basic Memory Management Program must be brought into memory and placed within a process for it to be run Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester Mono-programming
More informationMemory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple
Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases capacity and conflict misses, increases miss penalty Larger total cache capacity to reduce miss
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationReview. Topics. Lecture 3. Advanced Programming Topics. Review: How executable files are generated. Defining macros through compilation flags
Review Dynamic memory allocation Look a-like dynamic 2D array Simulated 2D array How cache memory / cache line works Lecture 3 Command line arguments Pre-processor directives #define #ifdef #else #endif
More informationLecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.SP96 1 Review: Evaluating Branch Alternatives Two part solution: Determine
More informationLecture notes for CS Chapter 2, part 1 10/23/18
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 08: Caches III Shuai Wang Department of Computer Science and Technology Nanjing University Improve Cache Performance Average memory access time (AMAT): AMAT =
More informationComputer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can
More informationECE 587 Advanced Computer Architecture I
ECE 587 Advanced Computer Architecture I Instructor: Alaa Alameldeen alaa@ece.pdx.edu Spring 2015 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2015 1 When and Where? When:
More informationChapter 5. Topics in Memory Hierachy. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 5 Tien-Fu Chen National Chung Cheng Univ. Chap5-0 Topics in Memory Hierachy! Memory Hierachy Features: temporal & spatial locality Common: Faster -> more expensive -> smaller!
More informationCOSC 6385 Computer Architecture. - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationName: 1. Caches a) The average memory access time (AMAT) can be modeled using the following formula: AMAT = Hit time + Miss rate * Miss penalty
1. Caches a) The average memory access time (AMAT) can be modeled using the following formula: ( 3 Pts) AMAT Hit time + Miss rate * Miss penalty Name and explain (briefly) one technique for each of the
More informationPrinciples in Computer Architecture I CSE 240A (Section ) CSE 240A Homework Three. November 18, 2008
Principles in Computer Architecture I CSE 240A (Section 631684) CSE 240A Homework Three November 18, 2008 Only Problem Set Two will be graded. Turn in only Problem Set Two before December 4, 2008, 11:00am.
More informationCaches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016
Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss
More informationL2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary
HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40
More informationHigh Performance Computing and Programming, Lecture 3
High Performance Computing and Programming, Lecture 3 Memory usage and some other things Ali Dorostkar Division of Scientific Computing, Department of Information Technology, Uppsala University, Sweden
More informationAdvanced Memory Organizations
CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques
More informationOperating Systems. Operating Systems Sina Meraji U of T
Operating Systems Operating Systems Sina Meraji U of T Recap Last time we looked at memory management techniques Fixed partitioning Dynamic partitioning Paging Example Address Translation Suppose addresses
More informationComputer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James
Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving
More informationUNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.
UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known
More informationHiroaki Kobayashi 12/21/2004
Hiroaki Kobayashi 12/21/2004 1 Loop Unrolling Static Branch Prediction Static Multiple Issue: The VLIW Approach Software Pipelining Global Code Scheduling Trace Scheduling Superblock Scheduling Conditional
More informationBackground. Virtual Memory (2/2) Demand Paging Example. First-In-First-Out (FIFO) Algorithm. Page Replacement Algorithms. Performance of Demand Paging
Virtual Memory (/) Background Page Replacement Allocation of Frames Thrashing Background Virtual memory separation of user logical memory from physical memory. Only part of the program needs to be in memory
More informationBranch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken
Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of
More informationOptimisation p.1/22. Optimisation
Performance Tuning Optimisation p.1/22 Optimisation Optimisation p.2/22 Constant Elimination do i=1,n a(i) = 2*b*c(i) enddo What is wrong with this loop? Compilers can move simple instances of constant
More informationPerformance of serial C programs. Performance of serial C programs p. 1
Performance of serial C programs Performance of serial C programs p. 1 Motivations In essence, parallel computations consist of serial computations (executed on multiple computing units) and the needed
More informationEmbedded processors. Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.
Embedded processors Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.fi Comparing processors Evaluating processors Taxonomy of processors
More informationCOSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University
COSC4201 Chapter 5 Memory Hierarchy Design Prof. Mokhtar Aboelaze York University 1 Memory Hierarchy The gap between CPU performance and main memory has been widening with higher performance CPUs creating
More informationCMSC411 Fall 2013 Midterm 2 Solutions
CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has
More informationregisters data 1 registers MEMORY ADDRESS on-chip cache off-chip cache main memory: real address space part of virtual addr. sp.
Cache associativity Cache and performance 12 1 CMPE110 Spring 2005 A. Di Blas 110 Spring 2005 CMPE Cache Direct-mapped cache Reads and writes Textbook Edition: 7.1 to 7.3 Second Third Edition: 7.1 to 7.3
More informationCS252 Graduate Computer Architecture Midterm 1 Solutions
CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate
More informationECE 2300 Digital Logic & Computer Organization. More Caches
ECE 23 Digital Logic & Computer Organization Spring 218 More Caches 1 Announcements Prelim 2 stats High: 79.5 (out of 8), Mean: 65.9, Median: 68 Prelab 5(C) deadline extended to Saturday 3pm No further
More informationCS422 Computer Architecture
CS422 Computer Architecture Spring 2004 Lecture 19, 04 Mar 2004 Bhaskaran Raman Department of CSE IIT Kanpur http://web.cse.iitk.ac.in/~cs422/index.html Topics for Today Cache Performance Cache Misses:
More informationPhoto David Wright STEVEN R. BAGLEY PIPELINES AND ILP
Photo David Wright https://www.flickr.com/photos/dhwright/3312563248 STEVEN R. BAGLEY PIPELINES AND ILP INTRODUCTION Been considering what makes the CPU run at a particular speed Spent the last two weeks
More informationOperating Systems. Designed and Presented by Dr. Ayman Elshenawy Elsefy
Operating Systems Designed and Presented by Dr. Ayman Elshenawy Elsefy Dept. of Systems & Computer Eng.. AL-AZHAR University Website : eaymanelshenawy.wordpress.com Email : eaymanelshenawy@yahoo.com Reference
More informationCache Designs and Tricks. Kyle Eli, Chun-Lung Lim
Cache Designs and Tricks Kyle Eli, Chun-Lung Lim Why is cache important? CPUs already perform computations on data faster than the data can be retrieved from main memory and microprocessor execution speeds
More informationOutline. 1 Reiteration. 2 Cache performance optimization. 3 Bandwidth increase. 4 Reduce hit time. 5 Reduce miss penalty. 6 Reduce miss rate
Outline Lecture 7: EITF20 Computer Architecture Anders Ardö EIT Electrical and Information Technology, Lund University November 21, 2012 A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21,
More informationCaches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017
Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the
More informationWhy memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho
Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide
More informationCOSC 6385 Computer Architecture - Review for the 2 nd Quiz
COSC 6385 Computer Architecture - Review for the 2 nd Quiz Fall 2006 Covered topic area End of section 3 Multiple issue Speculative execution Limitations of hardware ILP Section 4 Vector Processors (Appendix
More informationImproving Cache Performance. Dr. Yitzhak Birk Electrical Engineering Department, Technion
Improving Cache Performance Dr. Yitzhak Birk Electrical Engineering Department, Technion 1 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationCS4961 Parallel Programming. Lecture 2: Introduction to Parallel Algorithms 8/31/10. Mary Hall August 26, Homework 1, cont.
Parallel Programming Lecture 2: Introduction to Parallel Algorithms Mary Hall August 26, 2010 1 Homework 1 Due 10:00 PM, Wed., Sept. 1 To submit your homework: - Submit a PDF file - Use the handin program
More informationTour of common optimizations
Tour of common optimizations Simple example foo(z) { x := 3 + 6; y := x 5 return z * y } Simple example foo(z) { x := 3 + 6; y := x 5; return z * y } x:=9; Applying Constant Folding Simple example foo(z)
More informationPipelined processors and Hazards
Pipelined processors and Hazards Two options Processor HLL Compiler ALU LU Output Program Control unit 1. Either the control unit can be smart, i,e. it can delay instruction phases to avoid hazards. Processor