Cache oriented implementation for numerical codes


Martin Schulz, IWRMM, University of Karlsruhe

As widely known, naively written numerical software may use only a small part of the possible performance of the underlying machine; it is much less known how to actually achieve it. Therefore the characteristics and potential bottlenecks of modern computers are studied in detail with respect to numerical simulation software, with the emphasis on today's PC hardware running the Linux operating system. The expected performance turns out to be limited by the data access (with loads and stores weighted differently); therefore the data reuse from the processor cache is crucial and is discussed from both theoretical and practical aspects. A basic finite volume scheme is chosen for the discussion of different memory access patterns, which are decisive for the overall performance of the code.

Introduction

Once a numerical scheme is chosen, the further processing seems straightforward. Most mathematicians stop here and move on to other interesting problems. At first sight, the implementation of a given numerical algorithm seems trivial, but there are indeed lots of issues to be considered before actually obtaining a reasonable code to run on a computer. Simply counting floating point operations is no longer sufficient to create fast programs. Computers of today are highly complex systems of many different components with given interactions that can hardly be investigated in all detail and completeness. Therefore high level programming languages deploy an abstract and rather simplistic (from the hardware point of view) programming model. It consists of data, operations and control structures to describe the execution order of operations, but does not take system specific issues such as execution times, memory bandwidths and delays, caches or memory access patterns into account.

In fact, numerical software may use only a small fraction of the possible performance of the underlying machine. It can be somewhat depressing to see an up-to-date computer delivering only the performance that was possible already 3-5 years ago. Much more exciting is the perspective that, by taking care while coding, it is possible to achieve today the performance other (careless) programmers have to wait another 3-5 years for. This paper focuses on the current high performing and still cost-efficient systems based on PC technology and their use for scientific computing. After a discussion of caching issues and a brief introduction to the 80x87 FPU, measurements are presented that allow a rule of thumb for realistically expectable performance figures of numerical codes. The paper also tries to give hints on how to actually achieve this performance, with data prefetch commands and data access patterns being the most important points.

1 Considerations on the underlying computing system

1.1 Memory demand

Since the processing power of PCs, workstations and supercomputers has come closer together nowadays, one main criterion for the architecture of choice is, besides the potential for vectorization or parallelization, the demand of the numerical scheme on the memory system. This demand splits mainly into the size of available memory and the system properties of bandwidth and latency. To note some figures, a current PC is effectively limited to 768 MB (AMD 750), 1 GB (Intel BX) or 1.5 GB (VIA KT133) today. The current Linux kernel on 32-bit systems is able to support a maximum of 2 GB for a single process. There are some special designs that allow more memory (such as the Serverworks chipsets), but these are rare and will not get into the mainstream market as they do not allow a simple addressing scheme on 32-bit systems. If more memory is needed, a way out is to use a 64-bit based architecture, such as SUN Sparc (max 4 GB on the E450), SGI Mips, IBM Power3, or the DEC Alpha processor line (bought by Compaq, recently bought by Intel), or to move on to the upper range of supercomputers. You could as well wait another year or two for the upcoming Intel Itanium and AMD Sledgehammer 64-bit processors. If the dataset of your problem fits into 1 GB of RAM, it seems natural to choose the currently most cost efficient solution by using an Intel Pentium 2/3 or AMD Athlon based PC architecture. We will go on to investigate the further properties of these systems. Since computer architecture is similar among different microprocessor lines, the basic concepts and techniques from section 3 can be deployed for other machine types as well.

1.2 Cache organization

As is already well known, today's main processors (Central Processing Unit, CPU) can execute their operations much faster than the main memory system can effectively handle the corresponding data, see [9]. It has been observed that processors in general work on a rather small subset of the main memory at a time, so processor manufacturers have introduced the concept of so-called caches, which are small but very fast memories. Caches store recently manipulated data and provide it in case the processor needs it again. As the processor advances and moves on to work on other parts of the address space, this data eventually gets written back into main memory. It is not sufficient for a cache to hold only the copied data itself. Information about its corresponding address in main memory and certain flags (to note whether the data is modified, exclusive to the CPU, shared or invalid, the so-called MESI state) need to be stored as well. Since this logic is expensive to implement in hardware, chip designers reduced the generality of the address mapping and introduced the notion of cache associativity.

First of all, the cache does not store single bytes or words, but always complete cachelines, typically 32 or 64 bytes of contiguous data. The mapping of a memory address to its cacheline is given by truncating the last five digits of the binary address representation:

    cachelinenumber = address div 32

A cache is called fully associative if every cacheline can get stored at any location in the cache. Fully associative caches are rare today; more often, 4-way or 2-way associative caches can be found (details for Intel/AMD below), where each cacheline can be stored at only 4 or 2 places of the cache. This place is determined by the lower bits of the cacheline number (see figure 1):

    cachelineposition = cachelinenumber mod 512

These numbers suppose a 64 kB 4-way associative L1 cache with cachelines of 32 bytes, as it can be found on Pentium 2 processors. As a consequence, nine address bits (bits 5-13) are used to identify the possible cacheline locations for any given address. The 18 remaining high bits (from bit 14 on) have to be stored along with the cached data to identify the actual memory position of the cached data. Reducing the number of these bits saves some silicon, but leads to smaller cacheable areas, such as the one of the formerly well sold Intel TX Pentium chipset. If a cacheline can only be stored at a single position, the cache is called direct mapped, which sounds much better than 1-way associative. This is particularly simple to implement, since the last bits of any memory address represent its sole possible location in the cache. SUN uses this concept for the Enterprise 450 server, combined with a large L2 cache.

Figure 1: cache position

Figure 2: schematic cache organization (main memory, drawn as small boxes of 32 B cachelines, mapped onto a 4-way processor cache)

The set of cachelines is partitioned into a number of subsets within which the cachelines compete for actual storage space in the cache. This concept supports contiguous memory access, but can lead to rather disastrous results if the memory is sparsely accessed with an unfavorable stride. Examples of this are given in figure 14 and in the Elch Test by Stefan Turek [10]. Processor cache details differ from system to system; here are some gathered specifications:

                       level-1 cache                 level-2 cache
                       size     assoc.   cacheline   size      assoc.   cacheline
    Pentium 4          8 kB     4-way    64 B        256 kB    8-way    64 B
    Pentium 2,3        32 kB    4-way    32 B        512 kB    4-way    32 B
    Athlon             64 kB    2-way    64 B        256 kB    16-way   64 B
    SUN E450           ... kB   direct   32 B        4096 kB   direct   64 B

As can be seen, these processors do not have a single cache, but a hierarchy of two caches. Not taken into account here are the further CPU caches such as the instruction cache, the write buffer, the translation look-aside buffer and others.
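The mapping described above can be spelled out in a few lines of C. The following is a minimal illustrative sketch (not part of the original text), assuming the example geometry used above: 32-byte cachelines and 512 sets of 4 lines each. Real caches index on physical addresses, which a user program cannot see (section 1.4), so this only shows the arithmetic.

    #include <stdio.h>
    #include <stdint.h>

    enum { LINE_SIZE = 32, NSETS = 512 };   /* 4-way, 64 kB example cache from the text */

    int main(void)
    {
        double x;
        uintptr_t addr = (uintptr_t)&x;

        uintptr_t line = addr / LINE_SIZE;            /* cachelinenumber = address div 32  */
        uintptr_t set  = line % NSETS;                /* cachelineposition = line mod 512  */
        uintptr_t tag  = addr / (LINE_SIZE * NSETS);  /* high bits stored alongside the data */

        printf("address %#lx -> cacheline %lu, set %lu, tag %#lx\n",
               (unsigned long)addr, (unsigned long)line,
               (unsigned long)set, (unsigned long)tag);
        return 0;
    }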

1.3 Bandwidth and Latency

As long as all data of a numerical model fits into the processor's cache, the processor can run at full internal speed. But if the dataset gets larger, the data has to be moved from the main memory to the CPU and back again. This is known to be much slower than the actual data processing itself [9, 10]. Consequently, the impact of a larger dataset is twofold: first, the amount of data gets larger; second, the processing of that data gets much slower. Communication between the CPU and the memory system takes place via the front side bus (FSB), which is 64 bits wide for both Pentium 2/3 and Athlon; 8 bytes are called a quadword. The FSB runs at speeds of 100 MHz (e.g. Intel BX, AMD 750) or 133 MHz (Via KT133, Via Apollo 133A); Intel Celeron processors ran at 66 MHz for a while. The theoretical bandwidth is about 800 MB/s or 1064 MB/s respectively. Due to the cache structure, always whole cachelines are transferred at a time (see section 1.2), therefore it is most effective to employ a dense (contiguous) memory scheme such that all transferred data might actually be used.

1.3.1 Prefetch

Both Intel and AMD introduced special prefetch operations in the command set of their processors to enable the programmer or compiler to indicate data which will be worked on in the near future. That way, the processor can request the corresponding cacheline from memory in advance in order to have the data available when it is needed. As a consequence, the latency of the main memory (the inherent delay before the data is actually provided to the CPU) is hidden behind doing useful work; waiting for data is avoided. The disadvantage is that there are additional commands to execute and the prefetches need to be issued at appropriate places in the code. In case of misprediction of the next memory access, the effective data rate may go down because of the misguided and therefore useless data transfers. For details, see the discussion in the Intel optimization guide [5] and the AMD optimization guide [1]. Since the GNU compiler does not issue these commands by itself, I wrote some small macros to provide them by means of inline assembly statements. As shown in the later sections, it is possible to get a certain speedup by careful use of them, but be aware that time measurements are always necessary, since it is easily possible to actually slow down the code by using them inappropriately. On a Pentium 2/3 or Athlon with MMX extension, the following macro can be used to prefetch the cacheline of the variable var into the L1 cache:

    #define PREFETCH(var) asm ("prefetchnta (%0) \n\t" : : "r" (&var))

To fetch 4 cachelines or 4 * 32 = 128 = 0x80 bytes ahead, use the following:

    #define PREFETCH4(var) asm ("prefetchnta 0x80(%0) \n\t" : : "r" (&var))
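As an illustrative sketch of how such a macro might be used (this example is not taken from the paper, and its benefit must be verified by timing, as stressed below), the following loop sums a vector that is far larger than the caches and issues one prefetch per 32-byte cacheline, i.e. every fourth double:

    /* assumes the PREFETCH4 macro from above */
    double sum_with_prefetch(const double *v, long n)
    {
        double s = 0.0;
        long i;

        for (i = 0; i + 4 <= n; i += 4) {
            PREFETCH4(v[i]);              /* request the line 4 cachelines (0x80 bytes) ahead */
            s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
        }
        for (; i < n; i++)                /* remainder */
            s += v[i];
        return s;
    }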

These prefetchnta (non-temporal access) commands indicate that the data should be brought into the L1 cache only, without eating up space in the L2 cache. That way, the so-called cache pollution by data that is used only once can be avoided. The kni memcpy routine in the file /usr/src/linux/arch/i386/lib/simd.c of the Linux kernel source tree sheds some light on the business of memory prefetch and other processor specific optimizations, where Doug Ledford writes:

    /*
     * Note: Intermixing the prefetch at *exactly* this point
     * in time has been shown to be the fastest possible.
     * Timing these prefetch instructions is a complete black
     * art with nothing but trial and error showing the way.
     * To that extent, this optimum version was found by using
     * a userland version of this routine that we clocked for
     * lots of runs. We then fiddled with ordering until we
     * settled on our highest speed routines. So, the long
     * and short of this is, don't mess with instruction ordering
     * here or suffer performance penalties you will.
     */

So the programmer has to decide how far he will go down that road.

1.3.2 Preload

Another common way to hide memory latencies is the one called preload. It is much like prefetch, but it actually loads the data into a register; since the other commands do not depend on that data, execution can go on based on the out-of-order execution features of modern processors. A nice thing about the preload is that it can be implemented by machine independent assembler programming: in fact, an empty inline asm statement with an input suffices to make gcc load that input data into a register:

    #define PRELOAD(var) asm volatile ("" : : "r" (var))

This preload can be done even on machines without a prefetch operation, but it has the disadvantage of blocking a register for other uses (the compiler has less space to store intermediate values), it eats up out-of-order capacity for other operations, and it therefore did not yield any enhancements in the measurements of the following sections. If you can, use prefetch instead of preload.

1.4 Virtual address translation

The above sketch of the working of the CPU cache is still too simple in that it just talks about addresses. In fact there are two kinds of addresses in use here: physical addresses and virtual addresses.

While in protected mode, the CPU provides to each process (instance of a running program) its own address space, completely separated from the others. These virtual addresses are translated into physical addresses by the CPU under the control of the operating system. This mechanism is called paging and has two main purposes: it supports swapping and memory protection. Swapping is a technique to use the same physical memory for several processes by moving temporarily unused data to an external data storage such as a hard disk and back again when needed. This is done in pages of 4 kB size, so the main memory could be interpreted as a fully associative cache for the swap space using 4 kB cache lines. If a program references a (virtual) address, the CPU looks that address up in the address translation table. If the corresponding page table entry has the present bit set, the sought-after page is located in memory and the page table entry provides the physical address of the memory page containing the referenced byte. If the present bit is not set, it can mean two things: either the page is located on disk; this is called a page fault and forces the page on disk to be loaded into RAM again. Or this address is not valid for the process at all, and a segmentation fault occurs.

The entries of the address translation table are generated by the operating system and may even change (due to swapping) during the lifetime of a process. In fact it defines an injective mapping from virtual to physical addresses, which may have jumps at 4 kB boundaries but is monotone and contiguous inside the 4 kB pages. This issue is also known as page coloring. See also section 3.8. The CPU does its caching based on the physical addresses, over which user space programs have no influence. The address mapping on a machine with enough memory seems fairly randomized after some uptime, resulting in difficulties to deliberately reproduce the worst case scenarios mentioned above. Best figures (in the sense of worst case scenarios) were obtained on freshly rebooted machines. A randomized mapping alleviates the effect of cache thrashing, because fewer cachelines are then competing for the same space in the cache. It favors local (inside 4 kB pages) cache effects but renders the deliberate use of the full cache size and structure nearly impossible, as the predictability of mid-scale cache behavior gradually decreases.

TLB bottleneck

The lookup in the address translation table involves yet another type of cache, not discussed up to here. Each used page table entry gets stored in the TLB (translation look-aside buffer) to avoid reloading the translation table from memory for a page used not too long ago. The TLB has 64 entries on the Pentium processor and has its worst impact when referencing only little data on a large number of 4 kB pages. It is not of great importance for numerical simulations since we aim at contiguous memory access patterns anyway.

Quadword alignment

The AMD optimization guide [1] stresses the importance of correctly aligned memory access. To maintain the best possible speed, memory accesses should be quadword aligned; in other words, the address should be divisible by 8. The GNU compiler issues .align directives to the underlying assembler by itself, and quadword alignment inside the 4 kB pages implies quadword alignment in terms of physical addresses, so this seems not to be a great issue here.

Memory bank conflicts

The Intel Architecture Reference Manual [4] mentions memory bank conflicts as a further possible cause of memory access delays. Accesses to different pages of the same DRAM bank introduce a delay due to DRAM page opening and closing. Since the operating system handles the physical address allocation at run time, both programmers and compilers have little control over this effect.

2 Pentium processors from a scientific computing viewpoint

The early PC CPUs, such as the 8088 and 8086 and the later 80286 and 80386 processors, did not have any floating point hardware. An additional floating point coprocessor chip (8087, 80287 or 80387 respectively) was available and had to be plugged into a separate socket. It had separate registers and separate commands for floating point operations but needed to be controlled from the CPU. From the 80486 on (the 80486SX being an exception) the floating point unit (FPU) was integrated into the main processor, but the basic structure remained the same. Later additions to the Pentium processor line include MMX, ISSE and SSE2, all aiming at parallel data processing for further speedups:

MMX defines new operations to handle 64-bit packed integer data types (byte, word (2 bytes) and doubleword (4 bytes), signed and unsigned each). These operations work on eight new 64-bit wide so-called MMX registers. As the numerical simulation in scientific computing does not require heavy work in integer arithmetic, these additions are of little use for the applications discussed here, with one exception: the prefetch commands discussed in section 1.3.1 were introduced with MMX.

ISSE (also known as KNI, Katmai New Instructions) introduced eight new registers (XMM) of 128 bit length and new operations on packed single precision floating point data. This allows the same operation to be executed on 4 single precision numbers at once, using the SIMD (single instruction multiple data) paradigm. Since scientific numerical computations are usually done in double precision, this is of little use as well.

SSE2 (Streaming SIMD Extensions 2) was the next enhancement by Intel, introduced with the Pentium 4. Further commands to use the XMM registers for 2 double precision numbers (instead of 4 single precision values with ISSE) and to work on them in parallel were added. The new clflush command was introduced as well, which serves to tell the processor that a certain piece of data will not be used again in the near future by invalidating its cache line. This way, the cache can be kept clean from data used only once and can therefore hold more data for effective reuse. The boundary data in section 3.3 would be a good candidate for its application.

The GNU compiler currently only uses the 80x87 FPU when using double precision numbers. The gas (GNU assembler) does not yet support the SSE2 instruction set, not even for manual use. Therefore, only the 80x87 FPU is discussed here.

2.1 Structure of the Intel 80x87 FPU

The 80x87 FPU provides 8 registers of 80 bit length, each capable of holding a number in extended precision. These registers, named %st(0) through %st(7), are organized as a register stack. Operations always work on the top of the stack if not explicitly stated otherwise. As an example, take the evaluation of a linked triad as in section 2.4:

    d_i = a_i + b * c_i

Consider the addresses of a_i, b, c_i and d_i to be stored in the general purpose registers %eax, %ebx, %ecx and %edx respectively. Then the above operation can be achieved by:

           fldl  (%ebx)    # load b
    label: fldl  (%ecx)    # load c_i
           fmul  %st(1)    # b*c_i
           fldl  (%eax)    # load a_i
           faddp           # a_i + b*c_i
           fstpl (%edx)    # store d_i

First the two numbers b and c_i are pushed onto the FPU register stack by the fldl commands (the l suffix indicates that double values (long) are to be loaded). Then the fmul command multiplies them and stores the result in %st(0), replacing the value of c_i. The subsequent fldl command pushes a_i on top of the stack. faddp computes the sum, pops the stack and writes the result into %st(0) again. The fstpl command at the end takes this value, stores it to the memory location of d_i and pops the stack again. At time 6, the value of b is on top of the stack and can be used in the next turn after incrementing the addresses by 8 and a jump to label:, skipping the first command.

             time 1   time 2   time 3   time 4   time 5        time 6
    st(0)    b        c_i      b*c_i    a_i      a_i + b*c_i   b
    st(1)             b        b        b*c_i    b
    st(2)                               b

For a more detailed discussion of the 80x87 FPU and its precise floating point arithmetic properties, see [8]. The CPU also provides an fxch command to rename the floating point registers. This is useful to issue further independent commands while the result of the previous operation is not yet available. The feature allows compilers to use several processing units in parallel and thus helps to minimize the CPU cycles. This, however, works only for data from cache. In the context of memory bandwidth bounded performance, the fxch operation does not help.

2.1.1 FPU performance

Some of the arithmetic floating point operations provided by the 80x87 are listed in the table below, along with measured CPU cycles on Pentium 2/3 and Athlon systems.

                     Description    Pentium 2/3   Athlon
    fmul, fmulp      multiply            3           2
    fadd, faddp      addition            1           2
    fdiv             division           ...         ...
    fsqrt            square root        ...         ...

It is stated in the literature that the newer 80x87 need two cycles for a floating point multiplication, and that the processor has two multiplication units that can work in parallel but cannot be started in the same cycle. Nevertheless I measured 3 CPU cycles for consecutive multiplications on the Pentium 3. A more extensive list of floating point operations and their timing properties is found in [2].
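As an illustration of how such cycle counts can be obtained, the following is a minimal sketch (not the measurement code used for the figures above) that times a dependent chain of multiplications with the rdtsc instruction. The loop count is arbitrary, loop overhead largely overlaps with the floating point latency, and a constant core clock is assumed.

    #include <stdio.h>

    /* Read the CPU's time stamp counter (available on Pentium-class CPUs and later). */
    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        double x = 1.0000001, p = 1.0;
        const int n = 1000000;
        unsigned long long t0, t1;
        int i;

        t0 = rdtsc();
        for (i = 0; i < n; i++)
            p *= x;            /* dependent chain of multiplications, data stays in registers */
        t1 = rdtsc();

        printf("%.2f cycles per multiplication (result %g)\n",
               (double)(t1 - t0) / n, p);
        return 0;
    }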

Note that the CPU cycles relate to the clock speed of the processor core, which is usually much higher than the clock speed of its interface to the system bus; the latter is referred to as FSB cycles in other parts of the paper. The ratio of FSB to CPU cycle times is somewhere in the range of 4 (Pentium 400, BX) to 10 (Athlon 1 GHz, AMD 750).

2.2 Measurements of memory transfers

To obtain information about the actually achievable fraction of the theoretical peak performance, I did some measurements with hand coded assembly loops accessing contiguous data in simple patterns. They can be found as inline assembly statements in the unified bandwidth estimation program bandwidth.cc, which can be downloaded from the website stated at the end of the paper. For a discussion of the assembly programming language, see [6, 7]. The use of gcc inline assembly is discussed in [3, 11]. A selection of these loops is discussed below. Each of the loops works on 24 bytes per iteration, the space needed for 3 double precision numbers. While the first column of the tables below lists the respective machine type, the second column contains the operation frequency, i.e. the execution speed of the issued numerical operations. The next column then states the corresponding amount of transferred cacheline data. Note that the cacheline data is larger than the real data (i.e. the amount of data actually operated on) in the case where the data is not accessed contiguously. The last column then gives the respective FSB cycles necessary per quadword of transferred cacheline data.

2.2.1 Reading contiguous double precision data

A simulation of the data flow reading all components of a 3-vector is done by loading three double precision values and incrementing the address by 8 bytes after each load in a loop. Since the data is accessed contiguously, all data that gets transferred is actually used. The caches are irrelevant here since all data is used only once and the underlying vector is too long to fit into cache.

Figure 3: load contiguous quadwords
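For readers who prefer to start from C, the following is a rough plain-C analogue of this read loop (the results below were obtained with the hand-coded inline assembly from bandwidth.cc, not with this sketch); the vector length and the gettimeofday-based timing are arbitrary choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void)
    {
        const long n = 4000000;            /* 3*n doubles = 96 MB, far larger than any cache */
        double *v = malloc(3 * n * sizeof(double));
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, t0, t1;
        long i;

        for (i = 0; i < 3 * n; i++)        /* touch all pages once beforehand */
            v[i] = 1.0;

        t0 = seconds();
        for (i = 0; i < 3 * n; i += 3) {   /* read 24 bytes (3 quadwords) per iteration */
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
        }
        t1 = seconds();

        printf("%.1f MB/s read (checksum %g)\n",
               3 * n * sizeof(double) / (t1 - t0) / 1e6, s0 + s1 + s2);
        free(v);
        return 0;
    }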

We get the following results:

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

2.2.2 Reading contiguous double precision data using prefetch

This can be done faster by using prefetch commands, which request the data before it is actually used (see section 1.3.1). At the beginning of the loop, a non-temporal prefetch command (prefetchnta) is inserted that addresses 4 cachelines ahead. Non-temporal data is the Intel nomenclature for data that is used only once and is therefore not worth being stored in the L2 cache.

Figure 4: load contiguous quadwords with prefetch

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

Some of the chipsets come reasonably close to their theoretical peak, whereas the VIA Apollo is rather disappointing. As can be seen, the effect of the prefetch is much larger for the Athlon systems than for the Pentiums; though they are slower without prefetch, they are faster with it.

2.2.3 Reading sparse integer data

To see whether loading into the floating point registers is the bottleneck, the loop is modified to load only the first doubleword of each triple of quadwords into a general purpose integer register (32 bit), using the same prefetch.

Since the address stride of the loop is smaller than the length of a cacheline, all cachelines are loaded, even though only a small fraction of the data is actually loaded into a register.

Figure 5: load sparse dw with prefetch

                                     dw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

Due to the skip of 20 of the 24 bytes in each loop, the real data rate is only 1/6 of the cacheline data rate. Also note that the cacheline transfer rate is smaller than before on some systems. The Via Apollo is better here, but still beaten by the Via KT133.

2.2.4 Writing contiguous double precision data

Up to now, we were loading data from memory. Storing data seems to have completely different characteristics. The following table lists the performance of storing 3 double precision values per loop from floating point registers to memory; in some sense this is the inverse of section 2.2.1.

Figure 6: store contiguous quadwords

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

2.2.5 Writing contiguous double precision data using prefetch

To see the effect of prefetch on the store operations, the same prefetch command is inserted as in section 2.2.2. It can be seen that the prefetch 4 lines ahead improves only the performance of the slower Pentium. Generally, there is only little effect.

Figure 7: store contiguous quadwords with prefetch

                                     qw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

2.2.6 Writing integer data using prefetch

To show that the difference of store versus load performance is not induced by the use of floating point registers, an alternative loop was timed. It contains 6 doubleword stores from a general purpose 32-bit register. The results vary only slightly from those of the previous section.

                                     dw MHz   cacheline MB/s   FSB cycles per qw
    Laptop Pentium                     ...         ...               ...
    Pentium2 400MHz+BX                 ...         ...               ...
    Pentium2 600MHz+BX                 ...         ...               ...
    Pentium 866+Via Apollo 133A        ...         ...               ...
    Athlon 1GHz+AMD 750                ...         ...               ...
    Athlon 1GHz+Via KT133              ...         ...               ...

Figure 8: store contiguous doublewords with prefetch
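The store side can likewise be sketched in plain C (again only a rough analogue of the assembly loops actually measured; the seconds() helper is the one from the read sketch in section 2.2.1):

    /* writes 3*n doubles (24 bytes per iteration) and reports the store bandwidth */
    void store_bandwidth(double *v, long n)
    {
        double t0, t1;
        long i;

        t0 = seconds();
        for (i = 0; i < 3 * n; i += 3) {
            v[i]     = 1.0;
            v[i + 1] = 2.0;
            v[i + 2] = 3.0;
        }
        t1 = seconds();
        printf("%.1f MB/s stored\n", 3 * n * sizeof(double) / (t1 - t0) / 1e6);
    }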

2.3 Rule of thumb

The measurements of the previous sections show the possible store and load performance, obtained by straightforward assembler code without any compiler interference. Obviously, we cannot expect a real program to have higher load or store rates, so we should be happy if we come close to the figures measured above. As a rule of thumb for the maximal expected performance on different machines, count both the quadword loads and stores (including non-used cacheline data) and multiply them by the FSB cycle numbers from sections 2.2.2 and 2.2.5:

                                     FSB cycles per qw load   FSB cycles per qw store
    Laptop Pentium                            ...                      ...
    Pentium2 400MHz+BX                        ...                      ...
    Pentium2 600MHz+BX                        ...                      ...
    Pentium 866+Via Apollo 133A               ...                      ...
    Athlon 1GHz+AMD 750                       ...                      ...
    Athlon 1GHz+Via KT133                     ...                      ...

The resulting sum is the number of FSB cycles a good implementation is expected to spend on memory transfers. If your code reaches this performance, you should be satisfied. The kni memcpy routine mentioned above (section 1.3.1) actually beats these numbers by about 30% on Pentium 2/3; it does not run on the Athlon due to the use of the Katmai instruction set extension. So it is indeed possible to achieve higher throughput than stated by the above rule of thumb, by taking explicit care of alignment, the use of special purpose registers, non-temporal stores that do not pollute the caches and so on. However, there seem to be no general rules for how to use these instruments, other than trial and error.

We can only speculate why the stores are so much slower than the loads. There are some possible reasons:

Before writing to a cacheline, it has to be read into cache and gets written back afterwards. As a consequence, the cacheline gets transferred twice, instead of once.

As any cacheline that gets written to is first fetched from main memory if necessary, an update operation is considered the same as a mere store for our purposes. There seem to be no instructions for a userland program to notify the processor that a certain part of the memory is write-only for some time. The memory type range registers (MTRR) provide a similar functionality but are targeted to be used by the operating system, not userspace programs.

DRAM chip characteristics: writing may be slower than reading, since it has to be verified that the DRAM cell really has stored the correct value. ECC RAM seems to be slower than non-ECC RAM for stores, but not for loads.

Another explanation could be the bus address snooping for the coherence test of dual-capable processors. This is necessary because the cached data must be exclusive to the CPU when writing to it (see the MESI state, section 1.2).

The fast loads may indicate that a most-significant-word-first technique might be used, whereas a write operation is safely written only after the whole cacheline got transferred.

2.4 Linked triad

Schönauer [9] states the linked triad of the form

    d_i = a_i + b * c_i

(also known as the daxpy operation) to be one of the most often used operations in scientific supercomputing and takes it as a benchmark for his classification of supercomputers. To verify the rule of thumb, I coded this operation in assembler as well. The basic operation is discussed as the example in section 2.1. To increase the performance, the loop is threefold unrolled and interspersed with prefetch commands for the three memory lines.

                                     eval MHz   MFlop/s   FSB cycles/eval   expected
    Laptop Pentium                      ...        ...          ...            ...
    Pentium2 400MHz+BX                  ...        ...          ...            ...
    Pentium2 600MHz+BX                  ...        ...          ...            ...
    Pentium 866+Via Apollo 133A         ...        ...          ...            ...
    Athlon 1GHz+AMD 750                 ...        ...          ...            ...
    Athlon 1GHz+Via KT133               ...        ...          ...            ...

The table lists the evaluation frequency in its first column. The loop contains two loads and one store for each index, hence column two is double the first column. The third column relates the evaluation frequency to the FSB. These numbers correlate fairly well with the expected values from the rule of thumb above (last column).
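Written out as a formula (a restatement of the rule of thumb, not an equation from the original text):

    expected FSB cycles per iteration = (qw loads)  * (FSB cycles per qw load)
                                      + (qw stores) * (FSB cycles per qw store)

For the linked triad with its two loads and one store per evaluation, the expected column above is therefore just twice the per-quadword load figure plus once the store figure from the table in section 2.3.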

2.5 Compiler influences

Coding small things in assembler may sometimes be real fun, but implementing and debugging complex numerical algorithms is another issue. FORTRAN or C/C++ are the usual choices of programming language for that purpose. The code written in these higher level languages is used to automatically generate the assembly code, which is then further translated into machine code. That process of automatic translation usually involves a certain loss in execution speed. In some cases, the compiler may even generate better code than an assembler programmer would do. However, considering problems limited by the speed of the data from/to memory, the memory access structure is of crucial importance. Unfortunately, it is not well recognized by compilers, which are tuned to issue the commands such that as many processor parts as possible are kept busy. Usually, the memory bandwidth and latency are not taken into account, hence the automatically generated code is most effective on data that is already in cache.

The linked triad is used as a test example as above. The code for a threefold unrolled loop in C reads:

    for(int i=0; i<3*num; i+=3) {
      y1[i  ]=y2[i  ]+fac*y3[i  ];
      y1[i+1]=y2[i+1]+fac*y3[i+1];
      y1[i+2]=y2[i+2]+fac*y3[i+2];
    }

To see what can be done using the compiler and additional manually inserted prefetch commands, a similar, but enhanced loop is used:

    for(int i=0; i<3*num; i+=3) {
      PREFETCH3(y1[i]);
      y1[i  ]=y2[i  ]+fac*y3[i  ];
      PREFETCH3(y2[i]);
      y1[i+1]=y2[i+1]+fac*y3[i+1];
      PREFETCH3(y3[i]);
      y1[i+2]=y2[i+2]+fac*y3[i+2];
    }
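The PREFETCH3 macro is not defined in the paper; presumably it is built like the PREFETCH and PREFETCH4 macros of section 1.3.1 and prefetches three cachelines (3 * 32 = 0x60 bytes) ahead. A definition along those lines would be:

    #define PREFETCH3(var) asm ("prefetchnta 0x60(%0) \n\t" : : "r" (&var))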

                                     gcc     gcc + prefetch
    Laptop Pentium                   ...          ...
    Pentium2 400MHz+BX               ...          ...
    Pentium2 600MHz+BX               ...          ...
    Pentium 866+Via Apollo 133A      ...          ...
    Athlon 1GHz+AMD 750              ...          ...
    Athlon 1GHz+Via KT133            ...          ...
    SUN Enterprise 450               ...          ...

The table shows the evaluation frequency in MHz, to be compared to the first column of the table in section 2.4. As we have seen before, the Athlon systems profit more from the prefetch commands and nearly reach the assembler loop from section 2.4. For the Pentium systems, the gain from the additional prefetches is smaller. For comparison, the gcc figure is given here as well for a SUN Enterprise 450. All experiments with different compiler options for the SUN Workshop compiler resulted in lower rates than the GNU compiler using the -O2 optimization option.

2.6 Indirect addressing

When working in finite volume contexts, topological neighborhoods come into play. Roughly speaking, each cell boundary needs references to the left and right neighbor cell. Consider a loop over all these cell boundaries, which computes all fluxes and updates the corresponding cell variables. Suppose this neighbor information is stored together with some given flux in a data structure like this:

    struct boundary_struct {
      double flux;
      int left;
      int right;
    };

Consider a field boundary of such structures and another field cellvalue of double precision values. The following basic codes update the cellvalues by the flux given by boundary[n].flux. The third version is the usual form. For demonstration purposes, the second assignment is left out in the first and second version. Versions two and four are enhanced by prefetch commands. The first, reduced version does only half of the work:

    for( int n=0; n<num; n++) {
      cellvalue[boundary[n].left ] -= boundary[n].flux;
    };

As can be seen in the table near the end of this section, it is important for the subscript to be known early enough; otherwise, the execution will stall and wait for the value of boundary[n].left. This is accomplished by an additional prefetch command in the following:

    for( int n=0; n<num; n++) {
      PREFETCH4(boundary[n]);
      cellvalue[boundary[n].left ] -= boundary[n].flux;
    };

A naive implementation of the full cellvalue update would be

    for( int n=0; n<num; n++) {
      cellvalue[boundary[n].left ] -= boundary[n].flux;
      cellvalue[boundary[n].right] += boundary[n].flux;
    };

And with the same prefetch applied as before:

    for( int n=0; n<num; n++) {
      PREFETCH4(boundary[n]);
      cellvalue[boundary[n].left ] -= boundary[n].flux;
      cellvalue[boundary[n].right] += boundary[n].flux;
    };

The following measurements were all done with the GNU compiler and high optimization (-O6). The boundary struct is 2 quadwords wide, hence the reduced loops contain 2 loads and 1 store, whereas the full loops contain 2 loads and 2 quadword stores. In the table below, the measured FSB cycles are compared to expected ones according to the rule of thumb (see section 2.3). Depending on the offset between the left and right cell numbers of the boundaries, one cellvalue may still be in cache by the time the CPU does the second update to that cell. If this is the case, there are only two quadword loads and one store going over the FSB. The two cases are listed separately.

                              reduced           full, from memory        full, from cache
                              1 st    1 st+p    2 st   2 st+p   exp.     2 st   2 st+p   exp.
    Laptop Pentium             ...      ...      ...     ...     ...      ...     ...     ...
    Pentium2 400MHz            ...      ...      ...     ...     ...      ...     ...     ...
    Pentium2 600MHz            ...      ...      ...     ...     ...      ...     ...     ...
    Pentium 866                ...      ...      ...     ...     ...      ...     ...     ...
    Athlon 1GHz+AMD 750        ...      ...      ...     ...     ...      ...     ...     ...
    Athlon 1GHz+Via KT133      ...      ...      ...     ...     ...      ...     ...     ...

The table lists the number of necessary FSB cycles for data completely from memory and partially from cache. Note that these cycles do not show the raw performance but represent the ratio between the time needed for the execution and the FSB cycle time (which differs from system to system). Therefore, these numbers show how well the code fits a particular machine, or vice versa.

If the cell offset is 1, the same cell is used twice in immediately following operations. This is called a store-to-load dependency: the data from the first operation gets stored and read in again immediately. Though this write only goes to cache, it forces the CPU to wait for the first operation and the write to complete. AMD has implemented a store-to-load optimization, therefore this effect is barely noticeable on the Athlon, whereas losses of about 10% can be measured on a Pentium system. When working with longer vectors, this effect will probably become less important, since several components have to be stored before the first one gets loaded again.

2.7 SMP versus serial program

As long as the memory subsystem forms the bottleneck for the computation, it will not be of any use to parallelize the program to run on an SMP (Symmetric Multi-Processing) system such as the popular dual Pentium 2 systems. The upcoming dual-Athlon systems based on the AMD 760 MP chipset may behave somewhat differently, since they use independent point-to-point connections between the processors and the interface to the chipset, the so-called northbridge, in contrast to the Intel systems, which use a shared bus for both processors. However, in the light of the above results, it is not clear whether this will help much, since the memory subsystem needs to get strengthened along with the growing CPU power as well.

3 Consequences for the implementation of a numerical scheme

All of the above considered the processing of data from memory, not data from cache. This is particularly important for two reasons:

Decent compilers have much knowledge about the characteristics of modern processors and go to great lengths, using dependency graphs and other tools, to reduce CPU cycles. This works well for data from cache; there is little gain to expect from manual interaction here. The same compilers, however, usually have very little knowledge about memory access characteristics and the access patterns used by the actual program in question. At this point, the programmer can use his knowledge about the memory access patterns to help the compiler produce better performing code.

If the dataset is large enough, it won't fit into the CPU cache anymore. So we inevitably end up in the data-from-memory situation anyway.

However, not everything is lost. First, we have shown above how to effectively pull the data through the cache to keep the data rate high. Second, by modifications to the sequence order of the data (see section 3.3) as well as by choices for the control flow structure of the program (see section 3.5), we can influence how often the data is pulled through the cache, so the total amount of transferred data can be reduced. Manipulation of the processing sequence is known in numerical linear algebra as strip-mining and cache-blocking. The next sections will discuss the possibilities in a finite volume context.

3.1 Finite volume example

Finite volume discretizations are popular in the domain of computational fluid dynamics. They work by partitioning the computational domain into a large number of small so-called (computational) cells, on which the solution is approximated by simple functions such as (cellwise) constant or linear ones. The evolution in time of that numerical solution is modeled in terms of fluxes, which represent the boundary integral of the transport of certain quantities over the cell boundary. These quantities usually are physically conserved ones such as mass, momentum or energy. As these quantities do not change in total quantity (if they do, they do so by additional source terms), a cell wins what another cell loses. This is a highly desired property for physically conserved quantities.

Figure 9: finite volume basics: fluxes over the cell boundary (a cell and its neighbors N, W, E, S)

3.2 Striking the balance

To obtain the time evolution of the solution for given fluxes, it is necessary to strike a balance of in- and outflow for every cell, see figure 9. By iterating over all cells, every cell gets worked on once, whereas the cell boundaries get loaded twice. By looping over the cell boundaries, these get loaded once, but the cell data structures get loaded twice. There are roughly half as many cells as boundaries, so the first possibility seems favorable. However, it suffers from the following shortcomings: the data structure for boundaries is likely to be larger than the one for the cells; the cell-focused loop results in an unfavorable access pattern; and the single-pass technique (see section 3.5) basically enforces the iteration by boundaries. Therefore the boundary-focused loop is discussed here.

3.3 Processing sequence

The next point to choose is the sequence in which to iterate over the boundaries, which implies a favorable sequence in which to store them. For the fluxes in west-east direction, it is best to work by rows (and to store the data that way as well):

    for(j=0; j<nycells; j++)
      for(i=0; i<nxcells-1; i++) {
        boundary[nbound].left =i  +nxcells*j;
        boundary[nbound].right=i+1+nxcells*j;
        nbound++;
      };

So each west-east boundary and each cell gets loaded once, since the cell values can stay in cache for reuse in the next step. Having chosen such a layout, working on the south-north boundaries is harder now; the sequence of these is still to be chosen. Working by columns would imply accessing the cell data in a non-contiguous manner, which is generally not to be recommended. You will have to make sure that the stride by which you access the data is such that you do not suffer from cache stumbling: if the distance of the cells associated to a boundary is a multiple of the length of the cache, all the cellvalue accesses compete for the same places in the cache. For direct mapped caches, this is disastrous; for higher associative caches, it is slightly better. Prefetches will further aggravate the situation as they introduce additional cachelines into the competition.
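This criterion can be checked with a few lines of arithmetic. The following illustrative helper is not from the paper; it uses one common formulation of the conflict condition, and the cache parameters in the usage comment are the figures from section 1.2.

    /* A stride is critical if it is a multiple of the size of one cache "way"
     * (cache size divided by associativity): then successive accesses all map
     * to the same sets, and only n_ways lines can be kept at a time. */
    int stride_is_critical(long stride_bytes, long cache_bytes, int n_ways)
    {
        long way_size = cache_bytes / n_ways;
        return stride_bytes % way_size == 0;
    }

    /* e.g. for cellvalue accesses with a cell offset of `offset' doubles:
     *   stride_is_critical(offset * 8, 512 * 1024, 4)   -- Pentium 2/3 L2 cache
     *   stride_is_critical(offset * 8,  64 * 1024, 2)   -- Athlon L1 cache      */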

Figure 10: work by row

The separation of the selection and the processing of the data, as proposed by Schönauer [9], does not look very promising either, since the data does not get reused many times, so the cost of data moving will outweigh the gain. Therefore a reasonable way is to process the south-north boundaries by rows as well:

Figure 11: work by row: south-north boundaries

    for(j=0; j<nycells-1; j++)
      for(i=0; i<nxcells; i++) {
        boundary[nbound].left =i+nxcells*j;
        boundary[nbound].right=i+nxcells*(j+1);
        nbound++;
      };

The consequences for the data reuse from cache depend on the grid size. For a Pentium 3, the 4-way associative 512 kB cache may be interpreted as 4 layers of 128 kB length; that corresponds to roughly 5400 cells if each cell has to store 3 double precision numbers. Therefore, if the rows of the grid contain less than 5400 cells, the processing of the next row should find the cell values from the previous run in cache, so effectively each cell value will get loaded once in this second step as well. We encountered exactly this situation already in section 2.6, where the measurements for the data partially from cache approximated the reduced loop with only one store (not perfectly though). If you use this scheme, choose the shorter edge of the grid to correspond to the rows.

Figure 12: work by blocks of rows: south-north boundaries

As the number of cells in a row grows, the cellvalues will drop out of cache before they can get used again and therefore need to get loaded twice for the south-north boundaries. In this case, it is better to switch the strategy and to work on several rows at once (see figure 12). If we were working with physical addresses (see section 1.4), the optimal number of rows could be deduced by:

    nrows = n_associative * sizeof(struct boundary) / sizeof(struct celldata) - 1

    for(j=0; j<nycells-1; j+=nrows)
      for(i=0; i<nxcells; i++) {
        if (j+nrows>nycells-1) nrows=nycells-1-j;
        for(jj=j; jj<j+nrows; jj++) {
          boundary[nbound].left =i+nxcells*jj;
          boundary[nbound].right=i+nxcells*(jj+1);
          nbound++;
        };
      };

When calculating the expected performance, consider the cell data from the beginning of a row to be pushed out of the cache when the CPU starts working on the next bunch of rows, and assume the absence of cache thrashing. In each sweep, the CPU will encounter (nrow + 1) * nxcells stores while working on nrow * nxcells boundaries; this gives a ratio of

    (nrow + 1) * nxcells / (nrow * nxcells) = 1 + 1/nrow

stores per boundary. Due to the size of 16 bytes per boundary, there are two loads as well.
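Combined with the rule of thumb of section 2.3 (a restatement, not a formula from the original text), this scheme is therefore expected to cost roughly

    (1 + 1/nrow) stores + 2 loads

per boundary, each count weighted with the respective FSB cycle figures.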

This still is not ideal, since every cell still gets loaded at least twice, due to the 2-way passing of first working on the west-east boundaries, then on the south-north boundaries. We can do better, the previous paragraph leading the way: while we have the data in cache for the south-north work, we can intersperse the west-east work as well, thereby avoiding loading most of the cells twice (see figure 13).

Figure 13: work by blocks of rows

    for(j=0; j<nycells-1; j+=nrows) {
      for(i=0; i<nxcells-1; i++) {
        if (j+nrows>nycells-1) nrows=nycells-1-j;
        for(jj=j; jj<j+nrows; jj++) {
          // west-east
          boundary[nbound].left =i  +nxcells*jj;
          boundary[nbound].right=i+1+nxcells*jj;
          nbound++;
          // south-north
          boundary[nbound].left =i+nxcells*jj;
          boundary[nbound].right=i+nxcells*(jj+1);
          nbound++;
        };
      };
      // now i=nxcells-1
      for(jj=j; jj<j+nrows; jj++) {
        // south-north
        boundary[nbound].left =i+nxcells*jj;
        boundary[nbound].right=i+nxcells*(jj+1);
        nbound++;
      };
    };
    // now j=nycells-1
    for(i=0; i<nxcells-1; i++) {
      // west-east
      boundary[nbound].left =i  +nxcells*j;
      boundary[nbound].right=i+1+nxcells*j;
      nbound++;
    };

In fact, we get the data for the west-east boundaries for free, since it is in cache for the south-north boundaries anyway. In each sweep, the CPU will encounter (nrow + 1) * nxcells stores (the same number as above) while working on the double number of 2 * nrow * nxcells boundaries; this gives a ratio of

    (nrow + 1) * nxcells / (2 * nrow * nxcells) = 1/2 + 1/(2*nrow)

stores per boundary. While the number of stores remains constant, the number of loads doubles due to the double number of boundaries worked on. Therefore the expected cycles are determined by

    (1/2 + 1/(2*nrow)) stores + 2 loads                                    (1)

per boundary. In an ideal world, each boundary and each cell value would get loaded once in each run over the whole grid; this is exactly the asymptotic limit of the above formula for large values of nrow.

3.4 Performance figures

Taking a 3000x3000 finite volume grid, a row of the grid consists of 24 kB of data for a scalar equation. This should nicely fit into the cache of Pentium and Athlon processors, therefore the assumption about data from the previous row being pushed out of cache does not hold. Consequently, no speedup can be expected from varying the value of nrow. The nrow technique seems, however, to work in favor of heavily loaded machines with long uptime, where chances are that data from the previous row may be partially pushed out of the cache through occasional cacheline conflicts due to a randomized address translation. On freshly rebooted machines, nrow does not have a noteworthy effect. The table lists the minimal obtained FSB cycles needed per boundary for small values of nrow, and compares them to the ideal world numbers of one store per cell and two loads per boundary. The sample implementation works fairly well on most machines. Note that the figures for specific values of nrow are not firmly reproducible on a day-by-day basis, due to the strong dependency on the address translation table.

                                     expected (ideal)   measured   loss
    Laptop Pentium                         ...             ...      ... %
    Pentium2 400MHz+BX                     ...             ...      ... %
    Pentium2 600MHz+BX                     ...             ...      ... %
    Pentium 866+Via Apollo 133A            ...             ...      ... %
    Athlon 1GHz+AMD 750                    ...             ...      ... %
    Athlon 1GHz+Via KT133                  ...             ...      ... %


More information

CS 240 Stage 3 Abstractions for Practical Systems

CS 240 Stage 3 Abstractions for Practical Systems CS 240 Stage 3 Abstractions for Practical Systems Caching and the memory hierarchy Operating systems and the process model Virtual memory Dynamic memory allocation Victory lap Memory Hierarchy: Cache Memory

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

Tools and techniques for optimization and debugging. Fabio Affinito October 2015

Tools and techniques for optimization and debugging. Fabio Affinito October 2015 Tools and techniques for optimization and debugging Fabio Affinito October 2015 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object, made of different

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho

Why memory hierarchy? Memory hierarchy. Memory hierarchy goals. CS2410: Computer Architecture. L1 cache design. Sangyeun Cho Why memory hierarchy? L1 cache design Sangyeun Cho Computer Science Department Memory hierarchy Memory hierarchy goals Smaller Faster More expensive per byte CPU Regs L1 cache L2 cache SRAM SRAM To provide

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

The Processor Memory Hierarchy

The Processor Memory Hierarchy Corrected COMP 506 Rice University Spring 2018 The Processor Memory Hierarchy source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved.

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Memory Design. Cache Memory. Processor operates much faster than the main memory can.

Memory Design. Cache Memory. Processor operates much faster than the main memory can. Memory Design Cache Memory Processor operates much faster than the main memory can. To ameliorate the sitution, a high speed memory called a cache memory placed between the processor and main memory. Barry

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

Computer Labs: Profiling and Optimization

Computer Labs: Profiling and Optimization Computer Labs: Profiling and Optimization 2 o MIEIC Pedro F. Souto (pfs@fe.up.pt) December 15, 2010 Optimization Speed matters, and it depends mostly on the right choice of Data structures Algorithms If

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Adaptive Scientific Software Libraries

Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

Various optimization and performance tips for processors

Various optimization and performance tips for processors Various optimization and performance tips for processors Kazushige Goto Texas Advanced Computing Center 2006/12/7 Kazushige Goto (TACC) 1 Contents Introducing myself Merit/demerit

More information

05/12/11 10:39:08 linux-processor-caches-linux-journal-2004.txt 1

05/12/11 10:39:08 linux-processor-caches-linux-journal-2004.txt 1 10:39:08 linux-processor-caches-linux-journal-2004.txt 1 [7105aa.png] Feature Understanding Caching Architectures that support Linux differ in how they handle caching at the hardware level. by James Bottomley

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi

Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Computer Architecture Prof. Smruthi Ranjan Sarangi Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 32 The Memory Systems Part III Welcome back. (Refer Slide

More information

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS TECHNOLOGY BRIEF March 1999 Compaq Computer Corporation ISSD Technology Communications CONTENTS Executive Overview1 Notice2 Introduction 3 8-Way Architecture Overview 3 Processor and I/O Bus Design 4 Processor

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information