The Supercomputer Industry in Light of the Top500 Data


Dror G. Feitelson
School of Computer Science and Engineering
The Hebrew University of Jerusalem
Jerusalem, Israel

Abstract

The Top500 list ranks the 500 most powerful computers installed worldwide, and has been updated semiannually for the last 10 years. Analyzing this data enables an impartial assessment of the current state and the development trends of the supercomputer industry, and sheds some light on the challenges it faces.

Introduction

High-performance computing is now widely considered a strategic asset. The supercomputer industry, as the creator of this capability, is therefore very important, and is often also a source of national pride. At the same time, this industry has traditionally been a leader in the development of computer technology. It is therefore interesting to analyze the state of this industry and its development trends. Luckily, data for such an analysis is available in the form of the Top500 list, which now covers 10 years. By identifying invariants and trends in this data, we can predict growth rates, comment on technology, and identify limiting factors.

The Top500 List

In 1986, Hans Meuer started publishing lists of system counts for different supercomputer vendors. Even earlier, in 1979, Jack Dongarra formulated the Linpack benchmark, and started using it to gauge the capabilities of various models and configurations of computers manufactured by different vendors [4].

In 1993 these two efforts led to the establishment of the Top500 list, which ranks the 500 most powerful computer systems installed around the world. The emphasis on actual installations is crucial: it added a dimension of market success, and removed (or at least significantly reduced) the clutter caused by the numerous computer models and configurations that were offered but not used. The list of the top-ranking 500 installations worldwide has been updated twice a year, in June and November, ever since [5]. Over the years the list has gained in reputation, with national labs vying to host machines that contend for the top spot. It has been used to gauge the standing of different vendors and different countries in terms of the production and use of supercomputers, and also to comment on the state of the industry as a whole [7, 2, 12].

However, the list is not without its shortcomings. Chief among them is the use of the Linpack benchmark to rank the machines. This benchmark focuses on dense matrix operations, and tends to report optimistic performance figures that are closer to peak performance than those typically observed in practice with production applications. Moreover, the discrepancy between the Linpack measurements and the performance measured by other applications may differ for different classes of computers, so the relative ranking could change when using other benchmarks. In addition, Linpack is not a throughput benchmark, and there is no regard for the efficiency of the work done or for the loss of resources to fragmentation when multiple jobs are executed concurrently. Finally, it does not take cost into account [3, 9, 11, 15, 10]. Nevertheless, Linpack has the advantage that it is easily ported and measured, making the Top500 list the only list of its kind. And the accumulated data from 10 years allows for some interesting insights that are at least partly independent of the actual mechanism used for the ranking. In the following analysis we use the November lists, and denote the maximal computation rate achieved on the Linpack benchmark by Rmax.

Ranking and Performance Predictions

It has been widely observed that the performance of machines at a certain rank, and of the list as a whole, grows exponentially with time. This can be seen by tabulating the log of the Rmax values as a function of year for a given rank, which leads to a straight line (Figure 1). The slope, however, depends on rank: performing a linear regression on data from 1997 to 2003, Rmax doubles every 13.5 months for the rank 500 machines, but only about every 17 months for the rank 16 machines (the data at the very highest ranks is too noisy for an accurate estimate). In other words, the bottom of the list is improving faster than the top.

[Figure 1: Increase in computational power with time, as measured using Linpack.]

[Figure 2: Distribution of computing power across the lists.]

In any case, the rate across the list is somewhat lower than the doubling every year reported in the past [7], but it is still faster than the doubling every 18 months predicted by Moore's Law.
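The doubling times quoted above come from a linear regression of log Rmax against year at a fixed rank. A minimal sketch of that calculation follows; the (year, Rmax) values are illustrative placeholders rather than actual Top500 entries, so only the procedure, not the numbers, reflects the paper.

```python
# Sketch of the doubling-time regression described above. The data points are
# hypothetical stand-ins for the Rmax of one fixed rank across the November
# lists; real use would substitute the actual Top500 values for that rank.
import numpy as np

years = np.array([1997, 1998, 1999, 2000, 2001, 2002, 2003], dtype=float)
rmax = np.array([9.5e3, 17e3, 33e3, 55e3, 96e3, 195e3, 403e3])  # illustrative values only

# Fit log2(Rmax) against year; the slope is the number of doublings per year.
slope, _ = np.polyfit(years, np.log2(rmax), 1)
print(f"Rmax doubles roughly every {12.0 / slope:.1f} months")
```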

By tabulating the Rmax values as a function of rank on log-log axes, we find that the computational power drops off polynomially with rank (Figure 2). To measure this, we perform a least-squares regression starting from position 32 to the end of the list (the data near the top of the list tends to be much noisier). The results indicate that the slope of the line is becoming slightly smaller with time (except in 2000, when it grew). Again, this means that the bottom of the list is actually growing at a slightly faster rate than the top. The current value of the slope is about 0.7, which means that as we double the rank, the computational power drops to about 63% of what it was at the original rank.

Based on these regularities, we can actually make crude predictions of the Rmax values at different ranks in future years. We know that Rmax grows exponentially with time. Assuming an average time constant of about 1.15 years, we can write

    Rmax(r, t) = Rmax(r, t0) · 2^((t − t0)/1.15)

where Rmax(r, t) is the Linpack rating at rank r at time t, Rmax(r, t0) is the rating at this rank at time t0, and t and t0 are measured in years. We also know that log(Rmax) drops linearly with log(rank), with a slope of about 0.7,

    log(Rmax(r)) = C − 0.7 log(r)

which leads to

    Rmax(r) = C' / r^0.7

Using this to calculate the Rmax value at a rank that is a factor of α higher, we get

    Rmax(αr) = C' / (αr)^0.7 = Rmax(r) / α^0.7

implying that Rmax changes by a factor of 1/α^0.7. Putting all the above together, we get the approximate expression

    Rmax(r, t) = Rmax(r0, t0) · (r0/r)^0.7 · 2^((t − t0)/1.15)

As an example, let us use the Rmax of the rank 500 machine in 1997, which is 9513, to estimate the Rmax of the rank 100 machine in 2003. The formula gives 9513 · (500/100)^0.7 · 2^((2003−1997)/1.15) = 1,091,848. The true value is 1,142,000, so the error is less than 5%.

If the slope of log(Rmax) by log(rank) does indeed decrease, the consequence is that machines will tend to fall off the list faster, and the list will end up dominated by new machines.
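The prediction formula above is simple enough to capture in a few lines of code. The sketch below merely restates Rmax(r, t) = Rmax(r0, t0) · (r0/r)^0.7 · 2^((t − t0)/1.15) as a Python function and reproduces the worked example from the text; the function name and default arguments are mine, not part of the paper.

```python
# Sketch of the extrapolation formula derived above; the 0.7 exponent and the
# 1.15-year doubling constant are the values quoted in the text.

def predict_rmax(rmax_ref, rank_ref, year_ref, rank, year,
                 exponent=0.7, doubling_years=1.15):
    """Extrapolate the Linpack Rmax at (rank, year) from a reference point."""
    return (rmax_ref * (rank_ref / rank) ** exponent
                     * 2.0 ** ((year - year_ref) / doubling_years))

# Worked example from the text: rank 500 in 1997 (Rmax = 9513) used to
# estimate rank 100 in 2003; the value actually listed was 1,142,000.
estimate = predict_rmax(9513, 500, 1997, 100, 2003)
print(f"estimate = {estimate:,.0f}")                       # roughly 1.09 million
print(f"error = {abs(estimate - 1142000) / 1142000:.1%}")  # under 5%
```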

[Table 1: The number of new entries in the list each year.]

To a degree, this is already the case, and has been for a long time: the number of machines in the list drops off exponentially with age [7], and more than half the list is new each year. However, the number of new machines has been relatively static at around 300 at least since 1997 (Table 1). If it grows, it may make sense to increase the length of the list to some number higher than 500. This is also desirable since the prevalence of certain machine models and configurations leads to long stretches of constant values in the current lists.
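The counts in Table 1 come from comparing each list with the one from the year before. A minimal sketch of such a comparison is given below; it assumes each installation can be tracked by a stable identifier, which is an assumption of the sketch (matching real Top500 entries requires combining site, system, and installation fields), and the identifiers shown are hypothetical.

```python
# Sketch: counting new entries between two consecutive November lists, as in
# Table 1. Assumes a stable per-installation identifier; the IDs below are
# hypothetical examples, not actual Top500 records.

def count_new_entries(previous_ids, current_ids):
    """Number of systems in the current list that were not in the previous one."""
    return len(set(current_ids) - set(previous_ids))

prev_list = ["site-a/system-1", "site-b/system-2", "site-c/system-3"]
curr_list = ["site-b/system-2", "site-d/system-4", "site-e/system-5"]
print(count_new_entries(prev_list, curr_list))  # -> 2
```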

Vectors and Micros

Much publicity was given in 2002 to the capture of the top spot by Japan's Earth Simulator [6], after several years of dominance by American ASCI machines. This has also re-ignited the controversy regarding the use of proprietary vector processors as opposed to commodity microprocessors. NEC, the creator of the Earth Simulator, has pursued the vector path originally developed by Cray Research, as have other Japanese firms. Most American companies have preferred to use commodity microprocessors, including Cray itself in its T3D and T3E models. The Top500 list has been used repeatedly to show that vector machines are losing ground to those based on commodity microprocessors (e.g. [2]). But will the Earth Simulator and the recent introduction of the X1 line of vector machines by Cray herald a new era of vector processing? The truth of the matter is somewhat complex.

[Figure 3: Share of machines, processors, and cycles held by vector machines.]

Figure 3 shows the share of machines, processors, and cycles held by vector machines over the last 10 years. The number of vector machines has indeed dropped precipitously. But the share of cycles, as expressed by the Rmax performance values, has stabilized at about 12-15% of the total since 1998, and surpassed 18% when the Earth Simulator was introduced in 2002 (and, if we accept the claim that Linpack over-estimates the capability of microprocessor-based machines on various important applications, these numbers should actually be larger). Interestingly, the share of installed processors has remained relatively constant throughout the 10-year period. As vector processors are generally much more powerful, there were always many fewer of them, even when vector machines dominated the list. In the mid-1990s the other processors were part of massively parallel SIMD machines, and now they are the commodity microprocessors installed in large parallel machines and clusters.

Other interesting data is provided by focusing on the bottom end of the list, and tracking the minimal number of processors needed to make the list. The results, shown in Figure 4, indicate that for microprocessor-based machines this number doubles every three years (as predicted based on much less data in [7]). This explains the finding that the power of installed supercomputers grows faster than that of desktop machines: it is the product of improvements due to Moore's law and the use of increasing numbers of processors [13] (note also that even vector machines have required multiple processors to make the list since 1997).
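The "product of two improvements" argument can be checked with a one-line calculation, since the growth rates of a product of exponentials add. The sketch below uses the 18-month Moore's-law figure and the 3-year doubling of processor counts cited above; the resulting roughly 12-month combined doubling is only a rough consistency check against the 13 to 17 month doubling measured from the lists, not a result from the paper.

```python
# Sketch: combined doubling time of a product of two exponentially growing
# quantities (per-processor speed and processor count). The input periods are
# the figures quoted in the text, not fitted values.

def combined_doubling_months(chip_months, count_months):
    """If speed doubles every chip_months and counts every count_months, their product doubles every returned number of months."""
    return 1.0 / (1.0 / chip_months + 1.0 / count_months)

print(combined_doubling_months(18.0, 36.0))  # -> 12.0 months
```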

This product of two improvements applies across the list, as witnessed by the fact that the ratio between the Rmax value of the rank 100 machine and that of the last machine on the list has been remarkably stable at around 3 for several years.

[Figure 4: Maximal and minimal degrees of parallelism represented in the list, compared to the minimal number of microprocessors needed and to the parallelism at the top rank.]

Comparing the minimal size of microprocessor-based machines with the minimal parallelism in the list, we find that the latter grows faster than the former (Figure 4). As the minimal parallelism is invariably achieved using vector processors, the higher slope indicates that vector processors are improving at a slower rate than microprocessors. If the two types continue at their current rates, the two lines are expected to cross around 2009. Alternatively, the two types may converge as innovations initially developed for vector processors are incorporated in microprocessor designs.

Limiting Factors

The capabilities of top-ranked machines are the product of how many processors they have and what each one can do. Figure 4 also shows the biggest and top-ranking machines in the Top500 list. Since 1997, it seems that there is a sort of glass ceiling that prevents systems from growing beyond about 10,000 processors.

The significance of this bound is that it limits potential performance improvements by microprocessor-based machines at the top of the list to the doubling every 18 months due to Moore's law; and even this rate is being questioned, as chip feature sizes approach the atomic scale and manufacturing costs run into the billions of dollars [14]. The bottom of the list, meanwhile, is growing faster by increasing the number of processors used.

There are three main factors that limit the number of processors used in large-scale machines. One possible factor is cost. Large-scale machines obviously cost a lot of money, with Japan's Earth Simulator coming in at about 350 million dollars. The cost issue is the main motivation for the growth of clusters as an alternative to vendor machines [2]. A few years ago clusters were small-scale endeavors constructed by individual research groups. In the 2003 Top500 list, they occupy 7 of the top 10 slots, and account for more than 40% of the listed systems. Since such large-scale clusters are much less expensive than previous vendor machines, cost cannot explain the 10,000-processor limit.

The second limiting factor is management. This includes two components. One is general management of the machine by its operators, including configuration management, identifying and replacing faulty parts, etc. The other is on-line management by a combination of daemons and operating system services, including runtime support and scheduling. It seems that these management issues may be the main factor that imposes the 10,000-processor limit. Improved management, and especially improved fault tolerance, are needed in order to surpass this number. An example of progress in this direction is the STORM project from LANL [8]. This is a resource management framework that leverages the capabilities of the Quadrics interconnect in order to implement extremely scalable control functions. The idea is to construct the management software itself as a parallel application, rather than as a distributed one, as has typically been done up to now.

A third limiting factor is effective usage by applications. The problem is basically one of the programming model employed. Current programming practices are too rigid, and only scale effectively for very regular problems. There is a need for more dynamic and flexible mechanisms that will support massive resources and the reallocations needed to handle imbalance and failures (which may be seen as an extreme case of imbalance: a failed component has zero capacity, equivalent to infinite load).

To maintain progress, and in particular to outperform the Earth Simulator using commodity microprocessors, the 10,000-processor bound will have to be broken.

A current effort to do so is the BlueGene/L project [1], which has a design point of 65,536 processing nodes. One component of the design is to adopt a SIMD-like style, in which large blocks of processors are used en masse. In addition, nodes will not run a full operating system, but rather a simplified kernel. Interestingly, these design choices are re-incarnations of ideas that were common some 10 years ago, and have since fallen into disuse.

Table 2: Invariants and predictions from analysis of the Top500 list.

Invariants:
- Rmax grows exponentially, doubling every 14 to 17 months depending on rank
- computational power drops polynomially with rank, with an exponent of about 0.7
- minimal parallelism grows exponentially, doubling every 3 years for microprocessor-based machines and every 1.5 years for vector machines
- the maximum usable parallelism is 10,000
- about 15% of the total Rmax is due to vector processors
- age drops off exponentially in the list, with around 300 new machines each year
- about 50% of machines are used by industry

Predictions:
- rank 500 will achieve 1 teraflop in 2005, and rank 1 will achieve 1 petaflop in 2008 or 2009
- a machine at rank r will remain on the list about 7.2 − log r years
- in 2009 microprocessors will have the same power as vector processors, and about 512 will be needed to make the list
- the 10,000-processor limit will drive progress towards better management, programming, and reliability, and it will be broken
- vectors might make a modest comeback
- the current growth rate meets current needs; new markets are needed to expand the usage of parallel supercomputers

Conclusions

The above results (and updates to [7]) are summarized in Table 2, together with predictions based on them. In summary, a large number of parameters have remained relatively static over several years.

This does not mean that the supercomputing industry is stagnating, as some of the constants are exponential growth rates. But their stability over time makes it possible to make predictions of how things will change in the future.

Apart from allowing predictions, the analysis also allows us to identify major challenges. An increasingly important challenge is how to produce more efficient management and scheduling mechanisms that approach the high utilization that was typical of vector machines in the past, and at the same time use more than 10,000 processors effectively [10, 11]. This is partly a problem of designing better runtime systems, and partly a problem of devising more flexible programming models. These issues may be more important for long-term progress than the architectural innovations required to achieve a petaflop.

Another challenge is to increase the use of parallel machines outside of the research community. Investigating the fraction of machines installed in industry locations shows that the industry share of machines has doubled since the mid-1990s, peaking at over 52%. This is typically taken as a good sign, showing the maturity of the supercomputing field. However, in the last two years the industry share has dropped by 10 percentage points. Moreover, the industry share of Rmax has grown at a slower rate than its share of machines, and now stands at 27%. Overall, these findings indicate that new markets have to be found to further increase the use of parallel machines.

Acknowledgements

Many thanks to the reviewers who helped remove various inaccuracies and improve the paper.

References

[1] N. R. Adiga et al., An overview of the BlueGene/L supercomputer. In SC2002, Nov 2002.

[2] G. Bell and J. Gray, What's next in high-performance computing? Comm. ACM 45(2), Feb 2002.

[3] G. Cybenko and D. J. Kuck, Revolution or evolution? IEEE Spectrum 29(9), Sep 1992.

[4] J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK benchmark: past, present and future. Concurrency & Computation: Pract. & Exp. 15(9), Aug 2003.

[5] J. J. Dongarra, H. W. Meuer, H. D. Simon, and E. Strohmaier, Top500 supercomputer sites. URL www.top500.org (updated every 6 months).

[6] Earth Simulator. URL.

[7] D. G. Feitelson, On the interpretation of Top500 data. Intl. J. High Performance Comput. Appl. 13(2), Summer 1999.

[8] E. Frachtenberg, F. Petrini, J. Fernandez, S. Pakin, and S. Coll, STORM: lightning-fast resource management. In SC2002, Nov 2002.

[9] J. Gustafson, Teraflops and other false goals. IEEE Parallel & Distributed Technology 2(2), pp. 5-6, Summer 1994.

[10] J. P. Jones and B. Nitzberg, Scheduling for parallel supercomputing: a historical perspective of achievable utilization. In Job Scheduling Strategies for Parallel Processing, pp. 1-16, Springer-Verlag, Lect. Notes Comput. Sci.

[11] L. Rudolph and P. Smith, Valuation of ultra-scale computing systems. In Job Scheduling Strategies for Parallel Processing, Springer-Verlag, Lect. Notes Comput. Sci.

[12] E. Strohmaier, Who's who in high performance computing. Scientific Computing & Instrumentation, Aug. URL.

[13] E. Strohmaier, J. J. Dongarra, H. W. Meuer, and H. D. Simon, The marketplace of high-performance computing. Parallel Comput. 25(13-14), Dec 1999.

[14] T. N. Theis, Beyond the silicon transistor: personal observations. Comput. in Sci. & Eng. 5(1), Jan/Feb 2003.

[15] A. Wong, L. Oliker, W. Kramer, T. Kaltz, and D. Bailey, System utilization benchmark on the Cray T3E and IBM SP2. In Job Scheduling Strategies for Parallel Processing, Springer-Verlag, Lect. Notes Comput. Sci.
