A Preliminary Assessment of the ACRI 1 Fortran Compiler


Joan M. Parcerisa, Antonio González, Josep Llosa, Toni Jerez
Computer Architecture Department
Universitat Politècnica de Catalunya

Report No. UPC-DAC (also available as UPC-CEPBA-94-17)

1. INTRODUCTION

UPC is currently developing a series of tests intended to validate the performance advantages of the ACRI architecture and to assess the difference between the compiler-generated version and the hand coded version of various kernels. The subject of the following study belongs to the second of these issues. The compiler, however, is still under development, so this work has to be regarded only as a preliminary approach. Our aim is to offer some early diagnostics as soon as possible, even though they risk being superseded by later versions of the compiler.

This work explores how the current version of the af90 Fortran compiler (v0.3.0) deals with a particular set of routines that we have studied well in the past: the axpy, dot product, matrix by vector, and matrix by matrix products. Other routines and kernels will be used in the future. First, we present an outline of each of the tests being used. For each algorithm, some of the most useful transformations are described, intended either to obtain the best utilization of the resources of the architecture or to reduce memory traffic. Next, we present the performance obtained by simulation with the compiled and the hand coded version of each test, and we try to identify the reasons for the differences. The simulator that has been used is the asim (v ) architecture simulator.

2. AN OVERVIEW OF THE TESTS

In this section we review some of the algorithm transformations proposed by UPC for several linear algebra routines [1]. They are intended to exploit data locality, thus reducing the memory traffic, and also to reduce the negative effect of data dependencies, in order to achieve the maximum utilization of the functional units in the Stream Units, avoiding stalls or idle cycles. All of the following routines take advantage of the efficient use of the guard register and the shifting register file when applying software pipelining to the loops. The guard register is useful to eliminate the code corresponding to the prologue and the epilogue of the pipelined loop, by executing each instruction conditionally on its guard (which is derived from the pipestage where it is scheduled). The shifting register file is useful for holding temporary values in a pipelined loop, as it provides a means of renaming registers, thus avoiding having to unroll the loop to prevent data dependencies.
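To make the role of the guard register more concrete, the following is a minimal Fortran sketch (ours, not ACRI code, and assuming N >= 1) of a two-stage software-pipelined loop written without guard support, i.e. with an explicit prologue and epilogue; the scalar T stands for the register that carries the product between pipestages.

      T = ALPHA * X(1)             ! prologue: product of the first iteration
      DO I = 2, N
         Y(I-1) = Y(I-1) + T       ! addition of iteration I-1 ...
         T = ALPHA * X(I)          ! ... overlapped with the product of iteration I
      ENDDO
      Y(N) = Y(N) + T              ! epilogue: the last addition

With the guard register, this prologue and epilogue code is unnecessary: each kernel instruction is simply executed conditionally on its guard during the filling and draining pipestages.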

2.1. AXPY

The following is the basic algorithm, in Fortran:

DO I=1, N
   Y(I) = Y(I) + ALPHA * X(I)
ENDDO

It has no recurrences, and it can be coded efficiently by applying software pipelining, using the guard register and the shifting register file. The product must be held in a register for several cycles until it is read by the addition. Since the loop is pipelined, this register must be protected from being overwritten by the products issued in the following pipestages before the addition takes place. One method would be to unroll the loop as many iterations as needed to ensure that each product and its corresponding addition are issued in the same pipestage, so that each unrolled product can be coded to write to a different register. The second method consists of using the shifting register file as explained in the previous section. This is more efficient, since the code obtained is shorter and, unlike the unrolled loop, it can deal with any value of the loop counter. With these two features, the loop body of the DU code looks like this:

.bflags 4, 0, 5, 5
.block 2
g1 mult $3, $lq, $s0; g6 addt $s5, $rq, $sq
.block

2.2. The DOT product

The Fortran code for the basic algorithm is:

DO I=1, N
   DOT = DOT + X(I) * Y(I)
ENDDO

Here the loop has a recurrence with a distance of one iteration. Only one packet (two instructions) has to be issued per iteration, but the addition uses the same register as source and destination operand. Since the latency of the addition is 3 cycles, the scoreboarding will stall the DU for 2 additional cycles at every iteration, that is, it triples the execution time.

Transformation: To avoid the effect of the recurrence, we split the variable DOT into three instances and unroll three iterations of the loop, so that each one performs the addition with a different variable. Obviously, at the end of the loop it is necessary to reduce the split variable by adding the three values. The following pseudo-code illustrates this idea:

dot1 = DOT
dot2 = 0
dot3 = 0
for i=1 to n step 3 do
   dot1 = dot1 + X(i)  * Y(i)
   dot2 = dot2 + X(i+1)* Y(i+1)
   dot3 = dot3 + X(i+2)* Y(i+2)
endfor
DOT = dot1 + dot2 + dot3

Again, as in the AXPY routine, the shifting register feature will split the variable into different registers automatically by renaming it, instead of unrolling the loop. The loop body of the DU code will be:

.bflags 4, 0, 5, 5
.block 2
g1 mult $lq, $rq, $s0; g6 addt $s5, $s9, $s6
.block

Again, there is an additional benefit of using the register shifting: it works with any value of the loop counter, while the 3-iteration unrolled loop only deals with sizes that are multiples of 3 and thus needs some additional code. On the other hand, the drawback of this transformation is an additional cost of 9 cycles due to the reduction (3 additions).

2.3. Matrix by Vector

The Fortran code for the basic algorithm is:

DO I=1, M
   DO J=1, N
      Y(I) = Y(I) + A(I,J) * X(J)
   ENDDO
ENDDO

Here there are 4 references to memory per iteration, but if we hold Y(I) in a register during the calculation of the innermost loop, the number of references is reduced to 2, and only one packet (two instructions) must be issued per iteration. But, just like in the DOT routine, it will take 3 cycles per iteration due to the one-iteration recurrence.
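As an illustration of keeping Y(I) in a register across the innermost loop, here is a minimal Fortran sketch; the scalar S is ours (it stands for the register holding Y(I)) and does not appear in the original routine.

      DO I=1, M
         S = Y(I)                   ! Y(I) held in a register during the inner loop
         DO J=1, N
            S = S + A(I,J) * X(J)   ! only A(I,J) and X(J) are read from memory
         ENDDO
         Y(I) = S                   ! written back once per row
      ENDDO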

Transformation 1: If we hold Y(I) in a register during the innermost loop calculation, this loop is exactly the same as that of the DOT product, so the same transformation can be applied to it, and the code generated for the innermost loop body will also be the same as that of the DOT.

Transformation 2: The recurrence can be avoided by applying strip-mine and interchange plus loop unrolling. That is, for example, we apply strip mining to the outer loop so that it processes the matrix in strips of 3 rows (each strip is then processed in a new nested loop which iterates only 3 times). Next, we interchange this new loop with the innermost loop. And finally, we completely unroll the new innermost loop (3 iterations), so that we again have 2 loops:

for i=1 to M step 3 do
   for j=1 to N do
      Y(i)   = Y(i)   + A(i,j)   * X(j)
      Y(i+1) = Y(i+1) + A(i+1,j) * X(j)
      Y(i+2) = Y(i+2) + A(i+2,j) * X(j)
   endfor
endfor

And the DU code for the body of the innermost loop will look like this:

.bflags 4, 0, 5, 2
.block 2
g1 mult $lq, $rq, $s0; g3 addt $s4, $r4, $r4
g1 mult $lq, $rq, $s2; g3 addt $s7, $r5, $r5
g1 mult $lq, $rq, $s5; g2 addt $s1, $r3, $r3
.block

After the transformation, the innermost loop calculates in parallel the dot products associated with 3 consecutive rows. The recurrence still has a distance of 1 iteration, but now we execute 3 packets per iteration, so that when an addition takes place the previous one has already written its result, and there is no stall due to the dependence. Unlike Transformation 1, this technique does not imply any additional cost due to the reduction, but it requires 3 registers to hold the values of Y(i), Y(i+1) and Y(i+2). Furthermore, there is an additional benefit of this transformation: the value of X(J) can be reused 3 times by keeping it in a register. This idea is easily extended to minimize the memory references to X by increasing the width of the strips (thus increasing the unrolling degree of the innermost loop) as much as possible. Although 57 iterations is the maximum unrolling length due to the available number of registers, 56 is a handier unrolling factor. Thus, the number of references to X is reduced from MN to MN/56, and the total number of references becomes MN(1+1/56) + 2M. As the cycle count is about MN cycles, the memory traffic generated is near 1 reference/cycle, that is, it is almost halved.
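The reuse of X(J) mentioned above can be sketched in Fortran as follows for the 3-row strip; the scalar XJ is ours and stands for the register that holds the reused element.

      DO I=1, M, 3
         DO J=1, N
            XJ = X(J)                        ! loaded once, reused by the 3 products
            Y(I)   = Y(I)   + A(I,J)   * XJ
            Y(I+1) = Y(I+1) + A(I+1,J) * XJ
            Y(I+2) = Y(I+2) + A(I+2,J) * XJ
         ENDDO
      ENDDO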

On the other hand, moving X(J) to a register takes 3 additional cycles per iteration (because of the dependence with the first product), so the cycle count for the loop body increases from 56 to 59 cycles. There is an even more sophisticated schedule which unrolls the loop only 52 times but reduces this overhead to a single cycle per iteration, that is, 53 cycles; it is so irregular, however, that we do not consider it a reference for our compiler diagnostics.

2.4. Matrix by Matrix

The Fortran code for the basic algorithm is:

DO J=1, N
   DO I=1, M
      DO K=1, P
         C(I,J) = C(I,J) + A(I,K) * B(K,J)
      ENDDO
   ENDDO
ENDDO

Again, the 4 memory references per iteration can be reduced to 2 by allocating C(I,J) to a register, and a single packet would be enough to code the innermost loop in each stream unit. It would also take 3 cycles per iteration in the DU due to the one-iteration recurrence.

Transformation 1: Assuming that we hold C(I,J) in a register, the innermost loop has the same structure as DOT, so the same transformation applied to DOT can be used here, and the code generated for the innermost loop body can also be the same. The memory references are reduced from 4MNP to (2+2N)MP, that is, nearly half the memory traffic.

Transformation 2: The two innermost loops are identical to the Matrix by Vector algorithm, so the transformation of the previous section, based on strip mining, loop interchange and unrolling, can be applied to them, and the code generated for the innermost loop body will also be the same. The memory references are reduced to (2+57N/56)MP, that is, nearly half that of Transformation 1.

Transformation 3: Here, strip mining, loop interchange and unrolling can be extended by adding an additional dimension. Now we update the C matrix by blocks instead of strips, exploiting more of the locality of this algorithm by using the register file as much as possible. That is, we perform the strip mining on both the outermost and the middle loops, with widths a and b. Next, we interchange loops so that the two new loops become the innermost loops, and finally

we unroll them, so that we again get only three loops: J, I and K. But now the K loop body has a·b lines of code. On the current architecture, the optimal values to minimize the memory traffic are a=7 and b=7. The algorithm will look like this:

DO J=1, N, b
   DO I=1, M, a
      DO K=1, P
         C(I, J) = C(I, J) + A(I, K) * B(K, J)
         C(I+1, J) = C(I+1, J) + A(I+1, K) * B(K, J)
         C(I+2, J) = C(I+2, J) + A(I+2, K) * B(K, J)
         C(I+3, J) = C(I+3, J) + A(I+3, K) * B(K, J)
         ...
         C(I+a-1,J) = C(I+a-1,J) + A(I+a-1,K) * B(K, J)
         C(I, J+1) = C(I, J+1) + A(I, K) * B(K, J+1)
         C(I+1, J+1) = C(I+1, J+1) + A(I+1, K) * B(K, J+1)
         C(I+2, J+1) = C(I+2, J+1) + A(I+2, K) * B(K, J+1)
         C(I+3, J+1) = C(I+3, J+1) + A(I+3, K) * B(K, J+1)
         ...
         C(I+a-1,J+1) = C(I+a-1,J+1) + A(I+a-1,K) * B(K, J+1)
         ...
         C(I, J+b-1) = C(I, J+b-1) + A(I, K) * B(K, J+b-1)
         C(I+1, J+b-1) = C(I+1, J+b-1) + A(I+1, K) * B(K, J+b-1)
         C(I+2, J+b-1) = C(I+2, J+b-1) + A(I+2, K) * B(K, J+b-1)
         C(I+3, J+b-1) = C(I+3, J+b-1) + A(I+3, K) * B(K, J+b-1)
         ...
         C(I+a-1,J+b-1) = C(I+a-1,J+b-1) + A(I+a-1,K) * B(K, J+b-1)
      ENDDO
   ENDDO
ENDDO

The reutilization is done by holding A(I,K) to A(I+a-1,K) and B(K,J) to B(K,J+b-1) in registers during the calculation. Moving those values to registers has an additional cost of 8 cycles per iteration, so the innermost loop body cycle count increases from 49 to 57 cycles. But the number of memory references is reduced to (2+N+N/56)MP, which is nearly half that of the previous transformation.
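To make the register reuse of Transformation 3 concrete without listing the full 7x7 block, here is a hedged Fortran sketch with small block widths a=2 and b=2 (the report uses a=b=7); the scalars A0, A1, B0 and B1 are ours and stand for the registers holding the reused A and B elements, and the C(I,J) terms are likewise assumed to be kept in registers across the K loop.

      DO J=1, N, 2
         DO I=1, M, 2
            DO K=1, P
               A0 = A(I,K)                       ! a = 2 elements of A, each reused b times
               A1 = A(I+1,K)
               B0 = B(K,J)                       ! b = 2 elements of B, each reused a times
               B1 = B(K,J+1)
               C(I,  J)   = C(I,  J)   + A0 * B0
               C(I+1,J)   = C(I+1,J)   + A1 * B0
               C(I,  J+1) = C(I,  J+1) + A0 * B1
               C(I+1,J+1) = C(I+1,J+1) + A1 * B1
            ENDDO
         ENDDO
      ENDDO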

3. SIMULATION RESULTS

Table 1 compares the performance of the compiled axpy routine to the hand coded version. The differences are not very important, except for some additional overhead: the compiler generates a kernel loop code similar to the hand coded version.

Table 1: Cycle count with the axpy routine. Columns: N, hand coded, compiled.

Table 2 includes results for 2 versions of the dot routine. In the first one, the successive products are accumulated on the parameter of the routine. The second version first copies this parameter to a local variable, then uses it to do the calculations, and finally copies the result back to the parameter. In the first case, the compiler generates inefficient code which issues a dispatch at every iteration. But in the second case, the generated code has a kernel loop with a body similar to that of the hand coded version, i.e. the compiler performs the transformation explained in the previous section. Again, the differences are in the start-up code of the loop.

Table 2: Cycle count with the DOT routine. Columns: N, hand coded, compiled 1, compiled 2 (with local var).

Table 3 describes the results of 2 versions of the Matrix by Vector routine. First, we have compiled the original Fortran routine; the code generated by the compiler follows Transformation 1 explained in the previous section. Secondly, we have applied Transformation 2 (unrolling 3 iterations) directly to the Fortran source program, and then the code generated for the body of the kernel loop is well optimized, just as we would do by hand. However, there is a significant loss of

performance in the compiled routine with respect to the hand coded version. An effort has been made to find out why it loses so much time, and it is described in the following section.

Table 3: Cycle count (and Mflops) with the MxV routine. Columns: N (rows), M (cols), hand coded (56 unrolled), compiled 1 (not unrolled), compiled 2 (3 unrolled).

Table 4 shows the results of the Matrix by Matrix routine. As with the MxV, the compiler by itself is only capable of applying Transformation 1 explained in the previous section, and the compiler results are considerably worse than those of the hand coded version. Here the differences are even bigger than in the previous algorithm, but the reasons for them are the same (refer to the next section).

Table 4: Cycle count (and Mflops) on the MxM routine. Columns: M, N, P, hand coded (49 unrolled), compiled (not unrolled). Note: The cells marked (*) denote unavailable results due to simulator failure.

4. SOME EARLY DIAGNOSTICS

From the tests analyzed above, we can see that the current version of the compiler (version 0.3.0) can efficiently apply loop pipelining and some loop transformations, such as that of the DOT routine. Other transformations, like strip-mine plus interchange and unroll, which are oriented towards block algorithms that exploit temporal locality more efficiently, are not currently performed. They can be programmed directly in the Fortran code but, if the number of unrolled iterations is too large, the compiler generates code with a separate dispatch for each iteration.

A significant loss of performance has been detected when testing the MxV routine (and also with MxM). Table 5 illustrates those lost cycles and shows how closely they are related to the number of dispatches being issued. As a near-constant overhead seems to be added to each dispatch, we have focused our attention on the setup sections (stream units) of the innermost loop, and we have identified in the AU code, in the setup section of the innermost loop, some sequences of instructions that produce Loss of Decoupling.

Table 5: Loss of cycles of the MxV (3 unrolled) compiled routine with respect to a hand coded version (56 unrolled). Columns: N (rows), M (cols), dispatches, lost cycles, lost cycles/dispatch.

Table 6: Loss of cycles of the MxM (not unrolled) compiled routine with respect to a hand coded version (49 unrolled). Columns: N, M, P, dispatches, lost cycles, lost cycles/dispatch.

Typically, these sequences look like the following:

ldq aq, address     ; begins an access to memory
(a few instructions)
mov aq, $57         ; aq is still empty => AU stalls waiting for memory

The AU stalls for approximately one memory latency period. Another variant of this sequence has been found, preceded by a store-queue instruction:

stq address         ; puts the address into the SAQ
ldq aq, address     ; a hit in the SAQ => bypass
(a few instructions)
mov aq, $57         ; aq is still empty => AU stalls waiting for the DU

Here, as the data comes from the DU (via a bypass), the AU must wait until it gets synchronized with the DU. If the same sequence appears later, its impact will then be smaller. But if, instead, the AU issues a ldq for a DU queue and the DU immediately requires that data element, the DU will be stalled until memory delivers the data. That is to say, these sequences produce, by themselves or in combination, a severe degradation of performance. The stalls have more impact when the loop count of the dispatches is small, because they are placed in the setup section of the loop. We believe that many of these problems could be alleviated by a higher reutilization of data at the AU, by simplifying the communication protocol between the stream units, or by improving the instruction scheduling on the AU.

REFERENCES

[1] J. Cortadella et al., "Linear Algebra Routines and FFT on the ACRI Architecture", SHIPS P Deliverable, June 92 - May 93, Univ. Politècnica de Catalunya.
