Algorithm Analysis Techniques for Single Chip Computer Systems

Matthew Frank
MIT LCS
Cambridge, MA
December 2, 1998

Abstract

Circuit fabrication techniques have advanced to the point where it is possible to put an entire computer system, including processor, cache, and memory, on a single chip. This complete integration dictates changes in the basic assumptions that can be made about system latencies. In particular, in single chip systems wire delays dominate all other costs, so memory access times increase as memory size grows. The result is that, to achieve the best possible performance, an algorithm design needs to account for the geometry of data layout. This paper provides a case study for algorithm analysis where memory latency grows as the square root of memory size, consistent with the real limitations found in the 2-dimensional VLSI implementation of a single chip computer system. We study divide-and-conquer sorting algorithms, and find that while a traditional implementation would require asymptotic time $O(N^{1.5} \log N)$, caching techniques can be used to reduce this cost to $O(N^{1.5})$. A similar analysis of a tiled matrix multiplication algorithm shows that an uncached implementation would require time $O(N^2)$ while caching reduces the cost to $O(N^{1.75})$.

1 Memory Access Costs

Before 1980 computers were constructed from thousands of chips, each chip containing just a few logic gates. Since the delay through one of these chips was greater than the propagation delay of 10 meters of wire (the size of the room containing the computer), a reasonable engineering approximation was to assume that the distance between components was irrelevant.

The situation will be reversed in the next generation of computer systems, which will fit entirely on a single integrated circuit. In systems being built today, the wire delay across 2 cm of silicon is greater than 10 gate delays. In five to ten years a 2 cm wire delay will be in the range of hundreds to thousands of gate delays. A reasonable engineering approximation is to assume that gate delays are irrelevant and that the distances between various system components are all that matter. Geometry dominates.

This paper is a first attempt at analyzing algorithm behavior for systems like single chip computers, where wire delays are dominant. We begin with the assumption that a memory of size $N$ has access time $\sqrt{N}$. We show for two algorithms, divide-and-conquer sort and matrix multiply, that while caching techniques help, they cannot completely hide the cost of accessing memory. For sorting, we demonstrate a caching scheme that reduces average memory latency to $O(\sqrt{N}/\log N)$. For matrix multiply, we show a caching scheme that reduces average memory latency from $O(\sqrt{N})$ to $O(\sqrt[4]{N})$.

In the next section we discuss the basic caching model. Section 3 provides an analysis of divide-and-conquer sorting. Section 4 presents the analysis of matrix multiply. Section 5 discusses some of the broader consequences of memory access costs that grow as memory size increases.

2 Caching

A cache is a small memory that is used as a scratchpad during computation. Since the cache is smaller than the main memory its access time is smaller. The hope is that commonly used data elements can be copied into the smaller memory, and then accessed multiple times at the smaller cost.

Suppose we have a cache of size $C = \alpha N$, where $\alpha$, $0 < \alpha < 1$, represents the cache size as a fraction of the main memory size. Then the cost of accessing the cache is $\sqrt{\alpha N}$. Suppose also that some fraction, $m(N, \alpha)$, of memory accesses miss in the cache and must be satisfied from main memory at a cost $\sqrt{N}$, while the remaining $1 - m(N, \alpha)$ accesses hit in the cache at the lower cost. Then the average memory latency, $L(N, \alpha)$, is given by:

$$L(N, \alpha) = \sqrt{\alpha N} + m(N, \alpha)\sqrt{N} \qquad (1)$$

Note that this equation implies a tradeoff between cache miss rate and cache access time. As the cache grows, the miss rate decreases but the access time grows. We can minimize the average memory latency by taking the derivative of $L(N, \alpha)$ with respect to $\alpha$ and setting it equal to 0.

$$\frac{\partial L(N, \alpha)}{\partial \alpha} = \frac{\sqrt{N}}{2\sqrt{\alpha}} + \sqrt{N}\,\frac{\partial m(N, \alpha)}{\partial \alpha} = 0 \qquad (2)$$

3 Divide-and-Conquer Sorting

A basic divide-and-conquer sort of an $N$ element array performs $\log N$ steps, each of which touches all $N$ array elements. If memory costs $\sqrt{N}$ then such an algorithm requires $O(N^{1.5} \log N)$ time.

Suppose, however, that we are provided with a cache of size $C = \alpha N$ where $0 < \alpha < 1$. Then the cost of a cache access will be $\sqrt{\alpha N}$. The sorting algorithm can leverage this faster memory by dividing the array into $N/C = 1/\alpha$ chunks. Each chunk is copied into the cache, sorted in the cache with the smaller memory cost, and then copied back to main memory. Finally the sorted chunks in main memory are merged together, unfortunately incurring the higher memory cost.

Given a cache of size $C$, the number of accesses that can be performed in the cache is $N \log C$, and the number of accesses to main memory is $N \log N - N \log(\alpha N) = -N \log \alpha$. The miss rate for sorting, $m_{sort}(N, \alpha)$, is then the number of accesses in main memory divided by the total number of accesses:

$$m_{sort}(N, \alpha) = \frac{N \log N - N \log(\alpha N)}{N \log N} = -\frac{\log \alpha}{\log N} \qquad (3)$$

Now we can combine the cache access cost and the miss rate to calculate the average memory latency, $L_{sort}(N, \alpha)$, for an $N$ element sort with cache of size $\alpha N$.

$$L_{sort}(N, \alpha) = \sqrt{\alpha N} + m_{sort}(N, \alpha)\sqrt{N} = \sqrt{\alpha N} - \frac{\log \alpha}{\log N}\sqrt{N} \qquad (4)$$

Now we find the minimal value for $L_{sort}$ given $N$ by taking the derivative with respect to $\alpha$ and setting it equal to 0 (taking $\log$ as the natural logarithm; another base changes only the constant).

$$\frac{\partial L_{sort}(N, \alpha)}{\partial \alpha} = \frac{\sqrt{N}}{2\sqrt{\alpha}} - \frac{\sqrt{N}}{\alpha \log N} = 0 \qquad (5)$$

Solving this equation gives the optimal value for $\alpha$:

$$\alpha_{opt} = \frac{4}{(\log N)^2} \qquad (6)$$

Finally, we can plug $\alpha_{opt}$ back into $L_{sort}$ to get the optimal average memory access time.
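The optimization in equations (4) through (6) can be checked numerically. The following is a minimal sketch, not code from the paper: the function names are hypothetical, `alpha` is this sketch's name for the cache size as a fraction of main memory, and natural logarithms are assumed throughout. It compares a brute-force search over `alpha` with the closed form $4/(\log N)^2$.

```python
import math

def sort_latency(n, alpha):
    """Average latency from equation (4): sqrt(alpha*n) - (ln alpha / ln n) * sqrt(n)."""
    return math.sqrt(alpha * n) - (math.log(alpha) / math.log(n)) * math.sqrt(n)

def closed_form_alpha(n):
    """Optimal cache fraction from equation (6), assuming natural logarithms."""
    return 4.0 / math.log(n) ** 2

def grid_search_alpha(n, steps=100_000):
    """Brute-force the alpha in (0, 1] that minimizes sort_latency (illustrative helper)."""
    candidates = (k / steps for k in range(1, steps + 1))
    return min(candidates, key=lambda a: sort_latency(n, a))

if __name__ == "__main__":
    n = 2 ** 30
    a_closed = closed_form_alpha(n)
    a_grid = grid_search_alpha(n)
    print(f"closed form alpha = {a_closed:.5f}, grid search alpha = {a_grid:.5f}")
    print(f"uncached latency  = {math.sqrt(n):.1f}")
    print(f"cached latency    = {sort_latency(n, a_closed):.1f}")
```

Under these assumptions the two values of `alpha` should agree to within the grid spacing, and the cached latency should come out well below $\sqrt{N}$.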

$$L_{sort}(N, \alpha_{opt}) = \frac{2\sqrt{N}}{\log N} + \frac{2\log(\log N) - \log 4}{\log N}\sqrt{N} = O\!\left(\frac{\sqrt{N}\,\log(\log N)}{\log N}\right) \qquad (7)$$

Since the entire algorithm requires $N \log N$ memory accesses, the total running time of divide-and-conquer sorting is:

$$N \log N \cdot L_{sort}(N, \alpha_{opt}) = O(N^{1.5} \log(\log N)) \qquad (8)$$

The extra factor of $\log(\log N)$ can be eliminated by using a multilevel cache hierarchy. For example if we provide $\log_4 N$ memories, each 4 times the size of the previous, then the same number of references, proportional to $N$, will be satisfied from each memory. Each memory has an access cost of 2 times the previous. The total running time is then:

$$O\!\left(N \sum_{i=1}^{\log_4 N} 2^i\right) = O(N^{1.5}) \qquad (9)$$

4 Tiled Matrix Multiply

The technique for analyzing tiled matrix multiplication is similar to the technique we used in the previous section. The algorithm we examine is as follows:

    for i = 1 to M by T
      for j = 1 to M by T
        for k = 1 to M by T
          for ii = i to i+T-1
            for jj = j to j+T-1
              c = C[ii,jj]
              for kk = k to k+T-1
                c = c + A[ii,kk] * B[kk,jj]
              C[ii,jj] = c

This algorithm uses a tiling factor $T$. Each $T \times T$ submatrix of the $M \times M$ matrices A and B is brought into the cache. Each element is accessed from the cache $T$ times before being replaced. Since the main memory size is $N = M^2$ the memory access time is $M$. Since the cache size is $T^2$ the cache access time is $T$. We can then calculate the average memory latency, $L_{mm}(M, T)$, for matrix multiply.

$$L_{mm}(M, T) = T + \frac{M}{T} \qquad (10)$$

To find the optimal tiling factor, $T$, given $M$, we must take the derivative and set it equal to 0.

$$\frac{\partial L_{mm}(M, T)}{\partial T} = 1 - \frac{M}{T^2} = 0 \qquad (11)$$

The solution to this equation yields the optimal value for $T$.

$$T_{opt} = \sqrt{M} \qquad (12)$$

When we insert $T_{opt}$ into $L_{mm}(M, T)$ we get the optimal average memory access time:

$$L_{mm}(M, T_{opt}) = \sqrt{M} + \frac{M}{\sqrt{M}} = 2\sqrt{M} \qquad (13)$$

Thus, matrix multiplication, which would be an $O(M^4) = O(N^2)$ algorithm without caching, is improved to $O(M^{3.5}) = O(N^{1.75})$ with a tile of size $\sqrt{M} \times \sqrt{M}$. This is a factor of $\sqrt{M}$ greater than would be found in an analysis assuming memory costs of $O(1)$.

5 Implications

The results of this paper strongly indicate that single chip computer systems, even those with just a single processor, should be treated as distributed systems. This is excellent news, since there is a large body of established techniques for dealing with latency problems in distributed systems. The most promising of these are using prefetching to leverage the large available communication bandwidth to overlap multiple latencies, and distributing computation by putting some processing resources near each portion of memory so that the data doesn't need to be moved at all.

On the flip side, these results call into question the efficacy of traditional area-time tradeoffs. In single chip computer systems, distance and time are equivalent so adding area adds time. The problem with this tradeoff becomes even more apparent in the energy domain. While this paper has focused on application speed, it could have just as well focused on energy consumption. In single chip computer systems, the energy consumed is also proportional to the sum of distances that signals need to travel. While prefetching can trade off increased bandwidth requirements to overlap high latency costs, it does not reduce the application energy costs. Only geometric optimizations that reduce the signal propagation distance can improve energy consumption.

Finally, these results suggest that parallel applications may not be as inefficient as is traditionally believed. The extra latency factors that we observe in the memory analyses in Sections 3 and 4 seem similar to the extra factors that are often observed in applications parallelized onto mesh based communication networks. The results in this paper indicate that these additional factors are not overheads from parallelization, but may actually represent a fundamental cost of computing in finite dimensional space.
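As an illustration of the Section 4 analysis, the tiled loop nest and the tile-size model can be sketched in Python. This is a sketch under stated assumptions rather than the paper's implementation: the function names are hypothetical, indices are 0-based, matrices are plain lists of lists, and the `min()` bounds (not in the original pseudocode) handle tile sizes that do not divide `M` evenly. `mm_latency` encodes the cost model of equation (10), cache cost `T` plus miss rate `1/T` times memory cost `M`.

```python
import math

def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b (lists of lists) using tile x tile blocks."""
    m = len(a)
    c = [[0.0] * m for _ in range(m)]
    for i in range(0, m, tile):
        for j in range(0, m, tile):
            for k in range(0, m, tile):
                # min() guards the edges when tile does not divide m evenly.
                for ii in range(i, min(i + tile, m)):
                    for jj in range(j, min(j + tile, m)):
                        acc = c[ii][jj]
                        for kk in range(k, min(k + tile, m)):
                            acc += a[ii][kk] * b[kk][jj]
                        c[ii][jj] = acc
    return c

def mm_latency(m, tile):
    """Average latency model of equation (10): T + M / T."""
    return tile + m / tile

def best_tile(m):
    """Integer tile size in 1..m that minimizes mm_latency (brute force)."""
    return min(range(1, m + 1), key=lambda t: mm_latency(m, t))

if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(tiled_matmul(a, identity, tile=2) == a)
    m = 1024
    print(best_tile(m), math.isqrt(m), mm_latency(m, best_tile(m)))
```

The brute-force `best_tile` is only there to cross-check the closed form $T = \sqrt{M}$ from equation (12); in practice one would compute the square root directly.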