Cache oriented implementation for numerical codes


Martin Schulz, IWRMM, University of Karlsruhe

As widely known, naively written numerical software may use only a small part of the possible performance of the underlying machine; it is much less known how to actually achieve it. Therefore the characteristics and potential bottlenecks of modern computers are studied in detail with respect to numerical simulation software, with the emphasis on today's PC hardware running the Linux operating system. The expected performance turns out to be limited by the data access (with loads and stores weighted differently); therefore the data reuse from the processor cache is crucial and is discussed from both theoretical and practical aspects. A basic finite volume scheme is chosen for the discussion of different memory access patterns, which are decisive for the overall performance of the code.

Introduction

Once a numerical scheme is chosen, the further processing seems straightforward. Most mathematicians stop here and move on to other interesting problems. At first sight, the implementation of a given numerical algorithm seems trivial, but there are indeed lots of issues to be considered before actually obtaining a reasonable code to run on a computer. Simply counting floating point operations is no longer sufficient to create fast programs. Computers of today are highly complex systems of many different components with given interactions that can hardly be investigated in all detail and completeness. Therefore high level programming languages deploy an abstract and rather simplistic (from the hardware point of view) programming model. It consists of data, operations and control structures to describe the execution order of operations, but does not take system specific issues such as execution times, memory bandwidths and delays, caches or memory access patterns into account.

In fact, numerical software may use only a small fraction of the possible performance of the underlying machine. It can be somewhat depressing to see an up-to-date computer delivering only the performance that was possible already 3-5 years ago. Much more exciting is the perspective that, by taking care while coding, it is possible to achieve today the performance other (careless) programmers have to wait another 3-5 years for. This paper focuses on the current high performing and still cost-efficient systems based on PC technology and their use for scientific computing. After a discussion of caching issues and a brief introduction to the 80x87 FPU, measurements are presented that allow a rule of thumb for realistically expectable performance figures of numerical codes. The paper also tries to give hints on how to actually achieve this performance, with data prefetch commands and data access patterns being the most important points.

1 Considerations on the underlying computing system

1.1 Memory demand

Since the processing power of PCs, workstations and supercomputers has come closer together nowadays, one main criterion for the architecture of choice is, besides the potential for vectorization or parallelization, the demand of the numerical scheme on the memory system. This demand splits mainly into the size of available memory and the system properties of bandwidth and latency. To note some figures, a current PC is effectively limited to 768 MB (AMD 750), 1 GB (Intel BX) or 1.5 GB (VIA KT133) today. The current Linux kernel on 32-bit systems is able to support a maximum of 2 GB for a single process. There are some special designs that allow more memory (such as the Serverworks chipsets), but these are rare and will not get into the mainstream market as they do not allow a simple addressing scheme on 32-bit systems. If more memory is needed, a way out is to use a 64-bit based architecture, such as SUN Sparc (max 4 GB on the E450), SGI Mips, IBM Power3, or the DEC Alpha processor line (bought by Compaq, recently bought by Intel), or to move on to the upper range of supercomputers. You could as well wait another year or two for the upcoming Intel Itanium and AMD Sledgehammer 64-bit processors. If the dataset of your problem fits into 1 GB of RAM, it seems natural to choose the currently most cost efficient solution by using an Intel Pentium 2/3 or AMD Athlon based PC architecture. We will go on to investigate the further properties of these systems. Since computer architecture is similar among different microprocessor lines, the basic concepts and techniques from section 3 can be deployed for other machine types as well.

1.2 Cache organization

As is already well known, today's main processors (Central Processing Unit, CPU) can execute their operations much faster than the main memory system can effectively handle the corresponding data, see [9]. It has been observed that processors in general work on a rather small subset of the main memory at a time, so processor manufacturers have introduced the concept of so-called caches, which are small but very fast memories. Caches store recently manipulated data and provide it in case the processor needs it again. As the processor advances and moves on to work on other parts of the address space, this data eventually gets written back into main memory. It is not sufficient for a cache to hold only the copied data itself. Information about its corresponding address in main memory and certain flags (to note whether the data is modified, exclusive to the CPU, shared or invalid, the so-called MESI state) need to be stored as well. Since this logic is expensive to implement in hardware, chip designers reduced the generality of the address mapping and introduced the notion of cache associativity.

First of all, the cache does not store single bytes or words, but always complete cachelines, typically 32 or 64 bytes of contiguous data. The mapping of a memory address to its cacheline is given by truncating the last five digits of the binary address representation:

    cachelinenumber = address div 32

A cache is called fully associative if every cacheline can get stored at any location in the cache. Fully associative caches are rare today; more often, 4-way or 2-way associative caches can be found (details for Intel/AMD below), where each cacheline can be stored at only 4 or 2 places of the cache. This place is determined by the lower bits of the cacheline number (see figure 1):

    cachelineposition = cachelinenumber mod 512

These numbers suppose a 64 kB 4-way associative L1 cache with cachelines of 32 bytes, as it can be found on Pentium 2 processors. As a consequence, nine address bits (bits 5-13) are used to identify the possible cacheline locations for any given address. The 18 remaining high bits (from bit 14 on) have to be stored along with the cached data to identify the actual memory position of the cached data. Reducing the number of these bits saves some silicon, but leads to smaller cacheable areas, such as the one of the formerly well sold Intel TX Pentium chipset. If a cacheline can only be stored at a single position, the cache is called direct mapped, which sounds much better than 1-way associative. This is particularly simple to implement, since the last bits of any memory address represent its sole possible location in the cache. SUN uses this concept for the Enterprise 450 server, combined with a large L2 cache.

Figure 1: cache position

Figure 2: schematic cache organization (main memory, drawn as small boxes of 32 B cachelines, mapped onto a 4-way processor cache)

The set of cachelines is partitioned into a number of subsets within which the cachelines compete for actual storage space in the cache. This concept supports contiguous memory access, but can lead to rather disastrous results if the memory is sparsely accessed with an unfavorable stride. Examples of this are given in figure 14 and in the Elch Test by Stefan Turek [10]. Processor cache details differ from system to system; here are some gathered specifications:

                       level-1 cache                 level-2 cache
                       size     assoc.   cacheline   size      assoc.   cacheline
    Pentium 4          8 kB     4-way    64 B        256 kB    8-way    64 B
    Pentium 2,3        32 kB    4-way    32 B        512 kB    4-way    32 B
    Athlon             64 kB    2-way    64 B        256 kB    16-way   64 B
    SUN E450           ... kB   direct   32 B        4096 kB   direct   64 B

As can be seen, these processors do not have a single cache, but a hierarchy of two caches. Not taken into account here are the further CPU caches such as the instruction cache, the write buffer, the translation look-aside buffer and others.
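The mapping described above can be spelled out in a few lines of C. The following is a minimal illustrative sketch (not part of the original text), assuming the example geometry used above: 32-byte cachelines and 512 sets of 4 lines each. Real caches index on physical addresses, which a user program cannot see (section 1.4), so this only shows the arithmetic.

    #include <stdio.h>
    #include <stdint.h>

    enum { LINE_SIZE = 32, NSETS = 512 };   /* 4-way, 64 kB example cache from the text */

    int main(void)
    {
        double x;
        uintptr_t addr = (uintptr_t)&x;

        uintptr_t line = addr / LINE_SIZE;            /* cachelinenumber = address div 32  */
        uintptr_t set  = line % NSETS;                /* cachelineposition = line mod 512  */
        uintptr_t tag  = addr / (LINE_SIZE * NSETS);  /* high bits stored alongside the data */

        printf("address %#lx -> cacheline %lu, set %lu, tag %#lx\n",
               (unsigned long)addr, (unsigned long)line,
               (unsigned long)set, (unsigned long)tag);
        return 0;
    }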

1.3 Bandwidth and Latency

As long as all data of a numerical model fits into the processor's cache, the processor can run at full internal speed. But if the dataset gets larger, the data has to be moved from the main memory to the CPU and back again. This is known to be much slower than the actual data processing itself [9, 10]. Consequently, the impact of a larger dataset is twofold: first, the amount of data gets larger; second, the processing of that data gets much slower. Communication between the CPU and the memory system takes place via the front side bus (FSB), which is 64 bits wide for both Pentium 2/3 and Athlon; 8 bytes are called a quadword. The FSB runs at speeds of 100 MHz (e.g. Intel BX, AMD 750) or 133 MHz (Via KT133, Via Apollo 133A); Intel Celeron processors ran at 66 MHz for a while. The theoretical bandwidth is about 800 MB/s or 1064 MB/s respectively. Due to the cache structure, always whole cachelines are transferred at a time (see section 1.2), therefore it is most effective to employ a dense (contiguous) memory scheme such that all transferred data might actually be used.

1.3.1 Prefetch

Both Intel and AMD introduced special prefetch operations in the command set of their processors to enable the programmer or compiler to indicate data which will be worked on in the near future. That way, the processor can request the corresponding cacheline from memory in advance in order to have the data available when it is needed. As a consequence, the latency of the main memory (the inherent delay before the data is actually provided to the CPU) is hidden behind doing useful work; waiting for data is avoided. The disadvantage is that there are additional commands to execute and the prefetches need to be issued at appropriate places in the code. In case of misprediction of the next memory access, the effective data rate may go down because of the misguided and therefore useless data transfers. For details, see the discussion in the Intel optimization guide [5] and the AMD optimization guide [1]. Since the GNU compiler does not issue these commands by itself, I wrote some small macros to provide them by means of inline assembly statements. As shown in the later sections, it is possible to get a certain speedup by careful use of them, but be aware that time measurements are always necessary, since it is easily possible to actually slow down the code by using them inappropriately. On a Pentium 2/3 or Athlon with MMX extension, the following macro can be used to prefetch the cacheline of the variable var into the L1 cache:

    #define PREFETCH(var) asm ("prefetchnta (%0) \n\t" : : "r" (&var))

To fetch 4 cachelines or 4 * 32 = 128 = 0x80 bytes ahead, use the following:

    #define PREFETCH4(var) asm ("prefetchnta 0x80(%0) \n\t" : : "r" (&var))
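As an illustrative sketch of how such a macro might be used (this example is not taken from the paper, and its benefit must be verified by timing, as stressed below), the following loop sums a vector that is far larger than the caches and issues one prefetch per 32-byte cacheline, i.e. every fourth double:

    /* assumes the PREFETCH4 macro from above */
    double sum_with_prefetch(const double *v, long n)
    {
        double s = 0.0;
        long i;

        for (i = 0; i + 4 <= n; i += 4) {
            PREFETCH4(v[i]);              /* request the line 4 cachelines (0x80 bytes) ahead */
            s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
        }
        for (; i < n; i++)                /* remainder */
            s += v[i];
        return s;
    }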

These prefetchnta (non-temporal access) commands indicate that the data should be brought into the L1 cache only, without eating up space in the L2 cache. That way, the so-called cache pollution by data that is used only once can be avoided. The kni memcpy routine in the file /usr/src/linux/arch/i386/lib/simd.c of the Linux kernel source tree sheds some light on the business of memory prefetch and other processor specific optimizations, where Doug Ledford writes:

    /*
     * Note: Intermixing the prefetch at *exactly* this point
     * in time has been shown to be the fastest possible.
     * Timing these prefetch instructions is a complete black
     * art with nothing but trial and error showing the way.
     * To that extent, this optimum version was found by using
     * a userland version of this routine that we clocked for
     * lots of runs. We then fiddled with ordering until we
     * settled on our highest speed routines. So, the long
     * and short of this is, don't mess with instruction ordering
     * here or suffer performance penalties you will.
     */

So the programmer has to decide how far he will go down that road.

1.3.2 Preload

Another common way to hide memory latencies is the one called preload. It is much like prefetch, but it actually loads the data into a register; since the other commands do not depend on that data, execution can go on based on the out-of-order execution features of modern processors. A nice thing about the preload is that it can be implemented by machine independent assembler programming: in fact, an empty inline asm statement with an input suffices to make gcc load that input data into a register:

    #define PRELOAD(var) asm volatile ("" : : "r" (var))

This preload can be done even on machines without a prefetch operation, but it has the disadvantage of blocking a register for other uses (the compiler has less space to store intermediate values), it eats up out-of-order capacity for other operations, and it therefore did not yield any enhancements in the measurements of the following sections. If you can, use prefetch instead of preload.

1.4 Virtual address translation

The above sketch of the working of the CPU cache is still too simple in that it just talks about addresses. In fact there are two kinds of addresses in use here: physical addresses and virtual addresses.

While in protected mode, the CPU provides to each process (instance of a running program) its own address space, completely separated from the others. These virtual addresses are translated into physical addresses by the CPU under the control of the operating system. This mechanism is called paging and has two main purposes: it supports swapping and memory protection. Swapping is a technique to use the same physical memory for several processes by moving temporarily unused data to an external data storage such as a hard disk and back again when needed. This is done in pages of 4 kB size, so the main memory could be interpreted as a fully associative cache for the swap space using 4 kB cache lines. If a program references a (virtual) address, the CPU looks that address up in the address translation table. If the corresponding page table entry has the present bit set, the sought-after page is located in memory and the page table entry provides the physical address of the memory page containing the referenced byte. If the present bit is not set, it can mean two things: either the page is located on disk; this is called a page fault and forces the page on disk to be loaded into RAM again. Or this address is not valid for the process at all, and a segmentation fault occurs.

The entries of the address translation table are generated by the operating system and may even change (due to swapping) during the lifetime of a process. In fact it defines an injective mapping from virtual to physical addresses, which may have jumps at 4 kB boundaries but is monotone and contiguous inside the 4 kB pages. This issue is also known as page coloring. See also section 3.8. The CPU does its caching based on the physical addresses, over which user space programs have no influence. The address mapping on a machine with enough memory seems fairly randomized after some uptime, resulting in difficulties to deliberately reproduce the worst case scenarios mentioned above. Best figures (in the sense of worst case scenarios) were obtained on freshly rebooted machines. A randomized mapping alleviates the effect of cache thrashing, because fewer cachelines are then competing for the same space in the cache. It favors local (inside 4 kB pages) cache effects but renders the deliberate use of the full cache size and structure nearly impossible, as the predictability of mid-scale cache behavior gradually decreases.

TLB bottleneck

The lookup in the address translation table involves yet another type of cache, not discussed up to here. Each used page table entry gets stored in the TLB (translation look-aside buffer) to avoid reloading the translation table from memory for a page used not too long ago. The TLB has 64 entries on the Pentium processor and has its worst impact when referencing only little data on a large number of 4 kB pages. It is not of great importance for numerical simulations since we aim at contiguous memory access patterns anyway.

Quadword alignment

The AMD optimization guide [1] stresses the importance of correctly aligned memory access. To maintain the best possible speed, memory accesses should be quadword aligned; in other words, the address should be divisible by 8. The GNU compiler issues .align directives to the underlying assembler by itself, and quadword alignment inside the 4 kB pages implies quadword alignment in terms of physical addresses, so this seems not to be a great issue here.

Memory bank conflicts

The Intel Architecture Reference Manual [4] mentions memory bank conflicts as a further possible cause of memory access delays. Accesses to different pages of the same DRAM bank introduce a delay due to DRAM page opening and closing. Since the operating system handles the physical address allocation at run time, both programmers and compilers have little control over this effect.

2 Pentium processors from a scientific computing viewpoint

The early PC CPUs, such as the 8088 and 8086 and the later 80286 and 80386 processors, did not have any floating point hardware. An additional floating point coprocessor chip (8087, 80287 or 80387 respectively) was available and had to be plugged into a separate socket. It had separate registers and separate commands for floating point operations but needed to be controlled from the CPU. From the 80486 on (the 80486SX being an exception) the floating point unit (FPU) was integrated into the main processor, but the basic structure remained the same. Later additions to the Pentium processor line include MMX, ISSE and SSE2, all aiming at parallel data processing for further speedups:

MMX defines new operations to handle 64-bit packed integer data types (byte, word (2 bytes) and doubleword (4 bytes), signed and unsigned each). These operations work on eight new 64-bit wide so-called MMX registers. As the numerical simulation in scientific computing does not require heavy work in integer arithmetic, these additions are of little use for the applications discussed here, with one exception: the prefetch commands discussed in section 1.3.1 were introduced with MMX.

ISSE (also known as KNI, Katmai New Instructions) introduced eight new registers (XMM) of 128 bit length and new operations on packed single precision floating point data. This allows the same operation to be executed on 4 single precision numbers at once, using the SIMD (single instruction multiple data) paradigm. Since scientific numerical computations are usually done in double precision, this is of little use as well.

SSE2 (Streaming SIMD Extensions 2) was the next enhancement by Intel, introduced with the Pentium 4. Further commands to use the XMM registers for 2 double precision numbers (instead of 4 single precision values with ISSE) and to work on them in parallel were added. The new clflush command was introduced as well, which serves to tell the processor that a certain piece of data will not be used again in the near future by invalidating its cache line. This way, the cache can be kept clean from data used only once and can therefore hold more data for effective reuse. The boundary data in section 3.3 would be a good candidate for its application.

The GNU compiler currently only uses the 80x87 FPU when using double precision numbers. The gas (GNU assembler) does not yet support the SSE2 instruction set, not even for manual use. Therefore, only the 80x87 FPU is discussed here.

2.1 Structure of the Intel 80x87 FPU

The 80x87 FPU provides 8 registers of 80 bit length, each capable of holding a number in extended precision. These registers, named %st(0) through %st(7), are organized as a register stack. Operations always work on the top of the stack if not explicitly stated otherwise. As an example, take the evaluation of a linked triad as in section 2.4:

    d_i = a_i + b * c_i

Consider the addresses of a_i, b, c_i and d_i to be stored in the general purpose registers %eax, %ebx, %ecx and %edx respectively. Then the above operation can be achieved by:

           fldl  (%ebx)    # load b
    label: fldl  (%ecx)    # load c_i
           fmul  %st(1)    # b*c_i
           fldl  (%eax)    # load a_i
           faddp           # a_i + b*c_i
           fstpl (%edx)    # store d_i

First the two numbers b and c_i are pushed onto the FPU register stack by the fldl commands (the l suffix indicates that double values (long) are to be loaded). Then the fmul command multiplies them and stores the result in %st(0), replacing the value of c_i. The subsequent fldl command pushes a_i on top of the stack. faddp computes the sum, pops the stack and writes the result into %st(0) again. The fstpl command at the end takes this value, stores it to the memory location of d_i and pops the stack again. At time 6, the value of b is on top of the stack and can be used in the next turn after incrementing the addresses by 8 and a jump to label:, skipping the first command.

             time 1   time 2   time 3   time 4   time 5        time 6
    st(0)    b        c_i      b*c_i    a_i      a_i + b*c_i   b
    st(1)             b        b        b*c_i    b
    st(2)                               b

For a more detailed discussion of the 80x87 FPU and its precise floating point arithmetic properties, see [8]. The CPU also provides an fxch command to rename the floating point registers. This is useful to issue further independent commands while the result of the previous operation is not yet available. The feature allows compilers to use several processing units in parallel and thus helps to minimize the CPU cycles. This, however, works only for data from cache. In the context of memory bandwidth bounded performance, the fxch operation does not help.

2.1.1 FPU performance

Some of the arithmetic floating point operations provided by the 80x87 are listed in the table below, along with measured CPU cycles on Pentium 2/3 and Athlon systems.

                     Description    Pentium 2/3   Athlon
    fmul, fmulp      multiply            3           2
    fadd, faddp      addition            1           2
    fdiv             division           ...         ...
    fsqrt            square root        ...         ...

It is stated in the literature that the newer 80x87 need two cycles for a floating point multiplication, and that the processor has two multiplication units that can work in parallel but cannot be started in the same cycle. Nevertheless I measured 3 CPU cycles for consecutive multiplications on the Pentium 3. A more extensive list of floating point operations and their timing properties is found in [2].
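As an illustration of how such cycle counts can be obtained, the following is a minimal sketch (not the measurement code used for the figures above) that times a dependent chain of multiplications with the rdtsc instruction. The loop count is arbitrary, loop overhead largely overlaps with the floating point latency, and a constant core clock is assumed.

    #include <stdio.h>

    /* Read the CPU's time stamp counter (available on Pentium-class CPUs and later). */
    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        double x = 1.0000001, p = 1.0;
        const int n = 1000000;
        unsigned long long t0, t1;
        int i;

        t0 = rdtsc();
        for (i = 0; i < n; i++)
            p *= x;            /* dependent chain of multiplications, data stays in registers */
        t1 = rdtsc();

        printf("%.2f cycles per multiplication (result %g)\n",
               (double)(t1 - t0) / n, p);
        return 0;
    }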

Note that the CPU cycles relate to the clock speed of the processor core, which is usually much higher than the clock speed of its interface to the system bus; the latter is referred to as FSB cycles in other parts of the paper. The ratio of FSB to CPU cycle times is somewhere in the range of 4 (Pentium 400, BX) to 10 (Athlon 1 GHz, AMD 750).

2.2 Measurements of memory transfers

To obtain information about the actually achievable fraction of the theoretical peak performance, I did some measurements with hand coded assembly loops accessing contiguous data in simple patterns. They can be found as inline assembly statements in the unified bandwidth estimation program bandwidth.cc, which can be downloaded from the website stated at the end of the paper. For a discussion of the assembly programming language, see [6, 7]. The use of gcc inline assembly is discussed in [3, 11]. A selection of these loops is discussed below. Each of the loops works on 24 bytes per iteration, the space needed for 3 double precision numbers. While the first column of the tables below lists the respective machine type, the second column contains the operation frequency, i.e. the execution speed of the issued numerical operations. The next column then states the corresponding amount of transferred cacheline data. Note that the cacheline data is larger than the real data (i.e. the amount of data actually operated on) in the case where the data is not accessed contiguously. The last column then gives the respective FSB cycles necessary per quadword of transferred cacheline data.

2.2.1 Reading contiguous double precision data

A simulation of the data flow reading all components of a 3-vector is done by loading three double precision values and incrementing the address by 8 bytes after each load in a loop. Since the data is accessed contiguously, all data that gets transferred is actually used. The caches are irrelevant here since all data is used only once and the underlying vector is too long to fit into cache.

Figure 3: load contiguous quadwords
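For readers who prefer to start from C, the following is a rough plain-C analogue of this read loop (the results below were obtained with the hand-coded inline assembly from bandwidth.cc, not with this sketch); the vector length and the gettimeofday-based timing are arbitrary choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void)
    {
        const long n = 4000000;            /* 3*n doubles = 96 MB, far larger than any cache */
        double *v = malloc(3 * n * sizeof(double));
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, t0, t1;
        long i;

        for (i = 0; i < 3 * n; i++)        /* touch all pages once beforehand */
            v[i] = 1.0;

        t0 = seconds();
        for (i = 0; i < 3 * n; i += 3) {   /* read 24 bytes (3 quadwords) per iteration */
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
        }
        t1 = seconds();

        printf("%.1f MB/s read (checksum %g)\n",
               3 * n * sizeof(double) / (t1 - t0) / 1e6, s0 + s1 + s2);
        free(v);
        return 0;
    }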

We get the following results:

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

2.2.2 Reading contiguous double precision data using prefetch

This can be done faster by using prefetch commands, which request the data before it is actually used (see section 1.3.1). At the beginning of the loop, a non-temporal prefetch command (prefetchnta) is inserted that addresses 4 cachelines ahead. Non-temporal data is the Intel nomenclature for data that is used only once and is therefore not worth being stored in the L2 cache.

Figure 4: load contiguous quadwords with prefetch

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

Some of the chipsets come reasonably close to their theoretical peak, whereas the VIA Apollo is rather disappointing. As can be seen, the effect of the prefetch is much larger for the Athlon systems than for the Pentiums; though they are slower without prefetch, they are faster with it.

2.2.3 Reading sparse integer data

To see whether loading into the floating point registers is the bottleneck, the loop is modified to load only the first doubleword of each triple of quadwords into a general purpose integer register (32 bit), using the same prefetch.

Since the address stride of the loop is smaller than the length of a cacheline, all cachelines are loaded, even though only a small fraction of the data is actually loaded into a register.

Figure 5: load sparse dw with prefetch

                                     dw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

Due to the skip of 20 of the 24 bytes in each loop, the real data rate is only 1/6 of the cacheline data rate. Also note that the cacheline transfer rate is smaller than before on some systems. The Via Apollo is better here, but still beaten by the Via KT133.

2.2.4 Writing contiguous double precision data

Up to now, we were loading data from memory. Storing data seems to have completely different characteristics. The following table lists the performance of storing 3 double precision values per loop from floating point registers to memory; in some sense this is the inverse of section 2.2.1.

Figure 6: store contiguous quadwords

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

2.2.5 Writing contiguous double precision data using prefetch

To see the effect of prefetch on the store operations, the same prefetch command is inserted as in section 2.2.2. It can be seen that the prefetch 4 lines ahead improves only the performance of the slower Pentium. Generally, there is only little effect.

Figure 7: store contiguous quadwords with prefetch

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

2.2.6 Writing integer data using prefetch

To show that the difference of store versus load performance is not induced by the use of floating point registers, an alternative loop was timed. It contains 6 doubleword stores from a general purpose 32-bit register. The results vary only slightly from those of the previous section.

                                     dw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

Figure 8: store contiguous doublewords with prefetch
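The store side can likewise be sketched in plain C (again only a rough analogue of the assembly loops actually measured; the seconds() helper is the one from the read sketch in section 2.2.1):

    /* writes 3*n doubles (24 bytes per iteration) and reports the store bandwidth */
    void store_bandwidth(double *v, long n)
    {
        double t0, t1;
        long i;

        t0 = seconds();
        for (i = 0; i < 3 * n; i += 3) {
            v[i]     = 1.0;
            v[i + 1] = 2.0;
            v[i + 2] = 3.0;
        }
        t1 = seconds();
        printf("%.1f MB/s stored\n", 3 * n * sizeof(double) / (t1 - t0) / 1e6);
    }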

2.3 Rule of thumb

The measurements of the previous sections show the possible store and load performance, obtained by straightforward assembler code without any compiler interference. Obviously, we cannot expect a real program to have higher load or store rates, so we should be happy if we come close to the figures measured above. As a rule of thumb for the maximal expected performance on different machines, count both the quadword loads and stores (including non-used cacheline data) and multiply them by the FSB cycle numbers from sections 2.2.2 and 2.2.5:

                                     FSB cycles per qw load   FSB cycles per qw store
    Laptop Pentium                            ...                      ...
    Pentium2 400MHz+BX                        ...                      ...
    Pentium2 600MHz+BX                        ...                      ...
    Pentium 866+Via Apollo 133A               ...                      ...
    Athlon 1GHz+AMD 750                       ...                      ...
    Athlon 1GHz+Via KT133                     ...                      ...

The resulting sum is the number of FSB cycles a good implementation is expected to spend on memory transfers. If your code reaches this performance, you should be satisfied. The kni memcpy routine mentioned above (section 1.3.1) actually beats these numbers by about 30% on Pentium 2/3; it does not run on the Athlon due to the use of the Katmai instruction set extension. So it is indeed possible to achieve higher throughput than stated by the above rule of thumb, by taking explicit care of alignment, the use of special purpose registers, non-temporal stores that do not pollute the caches and so on. However, there seem to be no general rules for how to use these instruments, other than trial and error.

We can only speculate why the stores are so much slower than the loads. There are some possible reasons:

Before writing to a cacheline, it has to be read into cache and gets written back afterwards. As a consequence, the cacheline gets transferred twice, instead of once.

As any cacheline that gets written to is first fetched from main memory if necessary, an update operation is considered the same as a mere store for our purposes. There seem to be no instructions for a userland program to notify the processor that a certain part of the memory is write-only for some time. The memory type range registers (MTRR) provide a similar functionality but are targeted to be used by the operating system, not userspace programs.

DRAM chip characteristics: writing may be slower than reading, since it has to be verified that the DRAM cell really has stored the correct value. ECC RAM seems to be slower than non-ECC RAM for stores, but not for loads.

Another explanation could be the bus address snooping for the coherence test of dual-capable processors. This is necessary because the cached data must be exclusive to the CPU when writing to it (see the MESI state, section 1.2).

The fast loads may indicate that a most-significant-word-first technique might be used, whereas a write operation is safely written only after the whole cacheline got transferred.

2.4 Linked triad

Schönauer [9] states the linked triad of the form

    d_i = a_i + b * c_i

(also known as the daxpy operation) to be one of the most often used operations in scientific supercomputing and takes it as a benchmark for his classification of supercomputers. To verify the rule of thumb, I coded this operation in assembler as well. The basic operation is discussed as the example in section 2.1. To increase the performance, the loop is threefold unrolled and interspersed with prefetch commands for the three memory lines.

                                     eval MHz   MFlop/s   FSB cycles/eval   expected
    Laptop Pentium                      ...        ...          ...            ...
    Pentium2 400MHz+BX                  ...        ...          ...            ...
    Pentium2 600MHz+BX                  ...        ...          ...            ...
    Pentium 866+Via Apollo 133A         ...        ...          ...            ...
    Athlon 1GHz+AMD 750                 ...        ...          ...            ...
    Athlon 1GHz+Via KT133               ...        ...          ...            ...

The table lists the evaluation frequency in its first column. The loop contains two loads and one store for each index, hence column two is double the first column. The third column relates the evaluation frequency to the FSB. These numbers correlate fairly well with the expected values from the rule of thumb above (last column).
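Written out as a formula (a restatement of the rule of thumb, not an equation from the original text):

    expected FSB cycles per iteration = (qw loads)  * (FSB cycles per qw load)
                                      + (qw stores) * (FSB cycles per qw store)

For the linked triad with its two loads and one store per evaluation, the expected column above is therefore just twice the per-quadword load figure plus once the store figure from the table in section 2.3.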

2.5 Compiler influences

Coding small things in assembler may sometimes be real fun, but implementing and debugging complex numerical algorithms is another issue. FORTRAN or C/C++ are the usual choices of programming language for that purpose. The code written in these higher level languages is used to automatically generate the assembly code, which is then further translated into machine code. That process of automatic translation usually involves a certain loss in execution speed. In some cases, the compiler may even generate better code than an assembler programmer would do. However, considering problems limited by the speed of the data from/to memory, the memory access structure is of crucial importance. Unfortunately, it is not well recognized by compilers, which are tuned to issue the commands such that as many processor parts as possible are kept busy. Usually, the memory bandwidth and latency are not taken into account, hence the automatically generated code is most effective on data that is already in cache.

The linked triad is used as a test example as above. The code for a threefold unrolled loop in C reads:

    for(int i=0; i<3*num; i+=3) {
      y1[i  ]=y2[i  ]+fac*y3[i  ];
      y1[i+1]=y2[i+1]+fac*y3[i+1];
      y1[i+2]=y2[i+2]+fac*y3[i+2];
    }

To see what can be done using the compiler and additional manually inserted prefetch commands, a similar, but enhanced loop is used:

    for(int i=0; i<3*num; i+=3) {
      PREFETCH3(y1[i]);
      y1[i  ]=y2[i  ]+fac*y3[i  ];
      PREFETCH3(y2[i]);
      y1[i+1]=y2[i+1]+fac*y3[i+1];
      PREFETCH3(y3[i]);
      y1[i+2]=y2[i+2]+fac*y3[i+2];
    }
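The PREFETCH3 macro is not defined in the paper; presumably it is built like the PREFETCH and PREFETCH4 macros of section 1.3.1 and prefetches three cachelines (3 * 32 = 0x60 bytes) ahead. A definition along those lines would be:

    #define PREFETCH3(var) asm ("prefetchnta 0x60(%0) \n\t" : : "r" (&var))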

                                     gcc     gcc + prefetch
    Laptop Pentium                   ...          ...
    Pentium2 400MHz+BX               ...          ...
    Pentium2 600MHz+BX               ...          ...
    Pentium 866+Via Apollo 133A      ...          ...
    Athlon 1GHz+AMD 750              ...          ...
    Athlon 1GHz+Via KT133            ...          ...
    SUN Enterprise 450               ...          ...

The table shows the evaluation frequency in MHz, to be compared to the first column of the table in section 2.4. As we have seen before, the Athlon systems profit more from the prefetch commands and nearly reach the assembler loop from section 2.4. For the Pentium systems, the gain from the additional prefetches is smaller. For comparison, the gcc figure is given here as well for a SUN Enterprise 450. All experiments with different compiler options for the SUN Workshop compiler resulted in lower rates than the GNU compiler using the -O2 optimization option.

2.6 Indirect addressing

When working in finite volume contexts, topological neighborhoods come into play. Roughly speaking, each cell boundary needs references to the left and right neighbor cell. Consider a loop over all these cell boundaries, which computes all fluxes and updates the corresponding cell variables. Suppose this neighbor information is stored together with some given flux in a data structure like this:

    struct boundary_struct {
      double flux;
      int left;
      int right;
    };

Consider a field boundary of such structures and another field cellvalue of double precision values. The following basic codes update the cellvalues by the flux given by boundary[n].flux. The third version is the usual form. For demonstration purposes, the second assignment is left out in the first and second version. Versions two and four are enhanced by prefetch commands. The first, reduced version does only half of the work:

    for( int n=0; n<num; n++) {
      cellvalue[boundary[n].left ] -= boundary[n].flux;
    };

As can be seen in the table near the end of this section, it is important for the subscript to be known early enough; otherwise, the execution will stall and wait for the value of boundary[n].left. This is accomplished by an additional prefetch command in the following:

    for( int n=0; n<num; n++) {
      PREFETCH4(boundary[n]);
      cellvalue[boundary[n].left ] -= boundary[n].flux;
    };

A naive implementation of the full cellvalue update would be

    for( int n=0; n<num; n++) {
      cellvalue[boundary[n].left ] -= boundary[n].flux;
      cellvalue[boundary[n].right] += boundary[n].flux;
    };

And with the same prefetch applied as before:

    for( int n=0; n<num; n++) {
      PREFETCH4(boundary[n]);
      cellvalue[boundary[n].left ] -= boundary[n].flux;
      cellvalue[boundary[n].right] += boundary[n].flux;
    };

The following measurements were all done with the GNU compiler and high optimization (-O6). The boundary struct is 2 quadwords wide, hence the reduced loops contain 2 loads and 1 store, whereas the full loops contain 2 loads and 2 quadword stores. In the table below, the measured FSB cycles are compared to expected ones according to the rule of thumb (see section 2.3). Depending on the offset between the left and right cell numbers of the boundaries, one cellvalue may still be in cache by the time the CPU does the second update to that cell. If this is the case, there are only two quadword loads and one store going over the FSB. The two cases are listed separately.

                              reduced           full, from memory        full, from cache
                              1 st    1 st+p    2 st   2 st+p   exp.     2 st   2 st+p   exp.
    Laptop Pentium             ...      ...      ...     ...     ...      ...     ...     ...
    Pentium2 400MHz            ...      ...      ...     ...     ...      ...     ...     ...
    Pentium2 600MHz            ...      ...      ...     ...     ...      ...     ...     ...
    Pentium 866                ...      ...      ...     ...     ...      ...     ...     ...
    Athlon 1GHz+AMD 750        ...      ...      ...     ...     ...      ...     ...     ...
    Athlon 1GHz+Via KT133      ...      ...      ...     ...     ...      ...     ...     ...

The table lists the number of necessary FSB cycles for data completely from memory and partially from cache. Note that these cycles do not show the raw performance but represent the ratio between the time needed for the execution and the FSB cycle time (which differs from system to system). Therefore, these numbers show how well the code fits a particular machine, or vice versa.

If the cell offset is 1, the same cell is used twice in immediately following operations. This is called a store-to-load dependency: the data from the first operation gets stored and read in again immediately. Though this write only goes to cache, it forces the CPU to wait for the first operation and the write to complete. AMD has implemented a store-to-load optimization, therefore this effect is barely noticeable on the Athlon, whereas losses of about 10% can be measured on a Pentium system. When working with longer vectors, this effect will probably become less important, since several components have to be stored before the first one gets loaded again.

2.7 SMP versus serial program

As long as the memory subsystem forms the bottleneck for the computation, it will not be of any use to parallelize the program to run on an SMP (Symmetric Multi-Processing) system such as the popular dual Pentium 2 systems. The upcoming dual-Athlon systems based on the AMD 760 MP chipset may behave somewhat differently, since they use independent point-to-point connections between the processors and the interface to the chipset, the so-called northbridge, in contrast to the Intel systems, which use a shared bus for both processors. However, in the light of the above results, it is not clear whether this will help much, since the memory subsystem needs to get strengthened along with the growing CPU power as well.

3 Consequences for the implementation of a numerical scheme

All of the above considered the processing of data from memory, not data from cache. This is particularly important for two reasons:

Decent compilers have much knowledge about the characteristics of modern processors and go to great lengths, using dependency graphs and other tools, to reduce CPU cycles. This works well for data from cache; there is little gain to expect from manual interaction here. The same compilers, however, usually have very little knowledge about memory access characteristics and the access patterns used by the actual program in question. At this point, the programmer can use his knowledge about the memory access patterns to help the compiler produce better performing code.

If the dataset is large enough, it won't fit into the CPU cache anymore. So we inevitably end up in the data-from-memory situation anyway.

However, not everything is lost. First, we have shown above how to effectively pull the data through the cache to keep the data rate high. Second, by modifications to the sequence order of the data (see section 3.3) as well as by choices for the control flow structure of the program (see section 3.5), we can influence how often the data is pulled through the cache, so the total amount of transferred data can be reduced. Manipulation of the processing sequence is known in numerical linear algebra as strip-mining and cache-blocking. The next sections will discuss the possibilities in a finite volume context.

3.1 Finite volume example

Finite volume discretizations are popular in the domain of computational fluid dynamics. They work by partitioning the computational domain into a large number of small so-called (computational) cells, on which the solution is approximated by simple functions such as (cellwise) constant or linear ones. The evolution in time of that numerical solution is modeled in terms of fluxes, which represent the boundary integral of the transport of certain quantities over the cell boundary. These quantities usually are physically conserved ones such as mass, momentum or energy. As these quantities do not change in total quantity (if they do, they do so by additional source terms), a cell wins what another cell loses. This is a highly desired property for physically conserved quantities.

Figure 9: finite volume basics: fluxes over the cell boundary (a cell and its neighbors N, W, E, S)

3.2 Striking the balance

To obtain the time evolution of the solution for given fluxes, it is necessary to strike a balance of in- and outflow for every cell, see figure 9. By iterating over all cells, every cell gets worked on once, whereas the cell boundaries get loaded twice. By looping over the cell boundaries, these get loaded once, but the cell data structures get loaded twice. There are roughly half as many cells as boundaries, so the first possibility seems favorable. However, it suffers from the following shortcomings: the data structure for boundaries is likely to be larger than the one for the cells; the cell-focused loop results in an unfavorable access pattern; and the single-pass technique (see section 3.5) basically enforces the iteration by boundaries. Therefore the boundary-focused loop is discussed here.

3.3 Processing sequence

The next point to choose is the sequence in which to iterate over the boundaries, which implies a favorable sequence in which to store them. For the fluxes in west-east direction, it is best to work by rows (and to store the data that way as well):

    for(j=0; j<nycells; j++)
      for(i=0; i<nxcells-1; i++) {
        boundary[nbound].left =i  +nxcells*j;
        boundary[nbound].right=i+1+nxcells*j;
        nbound++;
      };

So each west-east boundary and each cell gets loaded once, since the cell values can stay in cache for reuse in the next step. Having chosen such a layout, working on the south-north boundaries is harder now; the sequence of these is still to be chosen. Working by columns would imply accessing the cell data in a non-contiguous manner, which is generally not to be recommended. You will have to make sure that the stride by which you access the data is such that you do not suffer from cache stumbling: if the distance of the cells associated to a boundary is a multiple of the length of the cache, all the cellvalue accesses compete for the same places in the cache. For direct mapped caches, this is disastrous; for higher associative caches, it is slightly better. Prefetches will further aggravate the situation as they introduce additional cachelines into the competition.
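This criterion can be checked with a few lines of arithmetic. The following illustrative helper is not from the paper; it uses one common formulation of the conflict condition, and the cache parameters in the usage comment are the figures from section 1.2.

    /* A stride is critical if it is a multiple of the size of one cache "way"
     * (cache size divided by associativity): then successive accesses all map
     * to the same sets, and only n_ways lines can be kept at a time. */
    int stride_is_critical(long stride_bytes, long cache_bytes, int n_ways)
    {
        long way_size = cache_bytes / n_ways;
        return stride_bytes % way_size == 0;
    }

    /* e.g. for cellvalue accesses with a cell offset of `offset' doubles:
     *   stride_is_critical(offset * 8, 512 * 1024, 4)   -- Pentium 2/3 L2 cache
     *   stride_is_critical(offset * 8,  64 * 1024, 2)   -- Athlon L1 cache      */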

Figure 10: work by row

The separation of the selection and the processing of the data, as proposed by Schönauer [9], does not look very promising either, since the data does not get reused many times, so the cost of data moving will outweigh the gain. Therefore a reasonable way is to process the south-north boundaries by rows as well:

Figure 11: work by row: south-north boundaries

    for(j=0; j<nycells-1; j++)
      for(i=0; i<nxcells; i++) {
        boundary[nbound].left =i+nxcells*j;
        boundary[nbound].right=i+nxcells*(j+1);
        nbound++;
      };

The consequences for the data reuse from cache depend on the grid size. For a Pentium 3, the 4-way associative 512 kB cache may be interpreted as 4 layers of 128 kB length; that corresponds to roughly 5400 cells if each cell has to store 3 double precision numbers. Therefore, if the rows of the grid contain less than 5400 cells, the processing of the next row should find the cell values from the previous run in cache, so effectively each cell value will get loaded once in this second step as well. We encountered exactly this situation already in section 2.6, where the measurements for the data partially from cache approximated the reduced loop with only one store (not perfectly though). If you use this scheme, choose the shorter edge of the grid to correspond to the rows.

Figure 12: work by blocks of rows: south-north boundaries

As the number of cells in a row grows, the cellvalues will drop out of cache before they can get used again and therefore need to get loaded twice for the south-north boundaries. In this case, it is better to switch the strategy and to work on several rows at once (see figure 12). If we were working with physical addresses (see section 1.4), the optimal number of rows could be deduced by:

    nrows = n_associative * sizeof(struct boundary) / sizeof(struct celldata) - 1

    for(j=0; j<nycells-1; j+=nrows)
      for(i=0; i<nxcells; i++) {
        if (j+nrows>nycells-1) nrows=nycells-1-j;
        for(jj=j; jj<j+nrows; jj++) {
          boundary[nbound].left =i+nxcells*jj;
          boundary[nbound].right=i+nxcells*(jj+1);
          nbound++;
        };
      };

When calculating the expected performance, consider the cell data from the beginning of a row to be pushed out of the cache when the CPU starts working on the next bunch of rows, and assume the absence of cache thrashing. In each sweep, the CPU will encounter (nrow + 1) * nxcells stores while working on nrow * nxcells boundaries; this gives a ratio of

    (nrow + 1) * nxcells / (nrow * nxcells) = 1 + 1/nrow

stores per boundary. Due to the size of 16 bytes per boundary, there are two loads as well.
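Combined with the rule of thumb of section 2.3 (a restatement, not a formula from the original text), this scheme is therefore expected to cost roughly

    (1 + 1/nrow) stores + 2 loads

per boundary, each count weighted with the respective FSB cycle figures.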

This still is not ideal, since every cell still gets loaded at least twice, due to the 2-way passing of first working on the west-east boundaries, then on the south-north boundaries. We can do better, the previous paragraph leading the way: while we have the data in cache for the south-north work, we can intersperse the west-east work as well, thereby avoiding loading most of the cells twice (see figure 13).

Figure 13: work by blocks of rows

    for(j=0; j<nycells-1; j+=nrows) {
      for(i=0; i<nxcells-1; i++) {
        if (j+nrows>nycells-1) nrows=nycells-1-j;
        for(jj=j; jj<j+nrows; jj++) {
          // west-east
          boundary[nbound].left =i  +nxcells*jj;
          boundary[nbound].right=i+1+nxcells*jj;
          nbound++;
          // south-north
          boundary[nbound].left =i+nxcells*jj;
          boundary[nbound].right=i+nxcells*(jj+1);
          nbound++;
        };
      };
      // now i=nxcells-1
      for(jj=j; jj<j+nrows; jj++) {
        // south-north
        boundary[nbound].left =i+nxcells*jj;
        boundary[nbound].right=i+nxcells*(jj+1);
        nbound++;
      };
    };
    // now j=nycells-1
    for(i=0; i<nxcells-1; i++) {
      // west-east
      boundary[nbound].left =i  +nxcells*j;
      boundary[nbound].right=i+1+nxcells*j;
      nbound++;
    };

In fact, we get the data for the west-east boundaries for free, since it is in cache for the south-north boundaries anyway. In each sweep, the CPU will encounter (nrow + 1) * nxcells stores (the same number as above) while working on the double number of 2 * nrow * nxcells boundaries; this gives a ratio of

    (nrow + 1) * nxcells / (2 * nrow * nxcells) = 1/2 + 1/(2*nrow)

stores per boundary. While the number of stores remains constant, the number of loads doubles due to the double number of boundaries worked on. Therefore the expected cycles are determined by

    (1/2 + 1/(2*nrow)) stores + 2 loads                                    (1)

per boundary. In an ideal world, each boundary and each cell value would get loaded once in each run over the whole grid; this is exactly the asymptotic limit of the above formula for large values of nrow.

3.4 Performance figures

Taking a 3000x3000 finite volume grid, a row of the grid consists of 24 kB of data for a scalar equation. This should nicely fit into the cache of Pentium and Athlon processors, therefore the assumption about data from the previous row being pushed out of cache does not hold. Consequently, no speedup can be expected from varying the value of nrow. The nrow technique seems, however, to work in favor of heavily loaded machines with long uptime, where chances are that data from the previous row may be partially pushed out of the cache through occasional cacheline conflicts due to a randomized address translation. On freshly rebooted machines, nrow does not have a noteworthy effect. The table lists the minimal obtained FSB cycles needed per boundary for small values of nrow, and compares them to the ideal world numbers of one store per cell and two loads per boundary. The sample implementation works fairly well on most machines. Note that the figures for specific values of nrow are not firmly reproducible on a day-by-day basis, due to the strong dependency on the address translation table.

                                     expected (ideal)   measured   loss
    Laptop Pentium                         ...             ...      ... %
    Pentium2 400MHz+BX                     ...             ...      ... %
    Pentium2 600MHz+BX                     ...             ...      ... %
    Pentium 866+Via Apollo 133A            ...             ...      ... %
    Athlon 1GHz+AMD 750                    ...             ...      ... %
    Athlon 1GHz+Via KT133                  ...             ...      ... %


More information

CS 240 Stage 3 Abstractions for Practical Systems

CS 240 Stage 3 Abstractions for Practical Systems CS 240 Stage 3 Abstractions for Practical Systems Caching and the memory hierarchy Operating systems and the process model Virtual memory Dynamic memory allocation Victory lap Memory Hierarchy: Cache Memory

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

Tools and techniques for optimization and debugging. Fabio Affinito October 2015

Tools and techniques for optimization and debugging. Fabio Affinito October 2015 Tools and techniques for optimization and debugging Fabio Affinito October 2015 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object, made of different

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

The Processor Memory Hierarchy

The Processor Memory Hierarchy Corrected COMP 506 Rice University Spring 2018 The Processor Memory Hierarchy source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved.

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Memory Design. Cache Memory. Processor operates much faster than the main memory can.

Memory Design. Cache Memory. Processor operates much faster than the main memory can. Memory Design Cache Memory Processor operates much faster than the main memory can. To ameliorate the sitution, a high speed memory called a cache memory placed between the processor and main memory. Barry

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

Computer Labs: Profiling and Optimization

Computer Labs: Profiling and Optimization Computer Labs: Profiling and Optimization 2 o MIEIC Pedro F. Souto (pfs@fe.up.pt) December 15, 2010 Optimization Speed matters, and it depends mostly on the right choice of Data structures Algorithms If

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Adaptive Scientific Software Libraries

Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

Various optimization and performance tips for processors

Various optimization and performance tips for processors Various optimization and performance tips for processors Kazushige Goto Texas Advanced Computing Center 2006/12/7 Kazushige Goto (TACC) 1 Contents Introducing myself Merit/demerit

More information

05/12/11 10:39:08 linux-processor-caches-linux-journal-2004.txt 1

05/12/11 10:39:08 linux-processor-caches-linux-journal-2004.txt 1 10:39:08 linux-processor-caches-linux-journal-2004.txt 1 [7105aa.png] Feature Understanding Caching Architectures that support Linux differ in how they handle caching at the hardware level. by James Bottomley

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 32 The Memory Systems Part III Welcome back. (Refer Slide

More information

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS TECHNOLOGY BRIEF March 1999 Compaq Computer Corporation ISSD Technology Communications CONTENTS Executive Overview1 Notice2 Introduction 3 8-Way Architecture Overview 3 Processor and I/O Bus Design 4 Processor

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information