Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing

Size: px

Start display at page:

Download "Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing"

Rolf Wilson
5 years ago
Views:

1 KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, Jul Copyright 2016 KSII Reevaluating the overhead of data preparation for asymmetri multiore system on graphis proessing Songwen Pei 1,2, Junge Zhang 1, Linhua Jiang 1, Myoung-Seo Kim 2 and Jean-Lu Gaudiot 2 1 Shanghai Key Lab of Modern Optial Systems, University of Shanghai for Siene and Tehnology, Shanghai , China [ {swpei, lhjiang}@usst.edu.n] 2 Parallel Systems and Computer Arhiteture Lab,University of California Irvine, CA 92697, USA [ {myoungseo.kim, gaudiot}@ui.edu] *Corresponding author: Songwen Pei, Myoung-Seo Kim Reeived May 10, 2016; revised May 24, 2016; aepted June 5, 2016; published July 31, 2016 Abstrat As proessor design has been transiting from homogeneous multiore proessor to heterogeneous multiore proessor, traditional Amdahl s law annot meet the new hallenges for asymmetri multiore system. In order to further investigate the impat fators related to the Overhead of Data Preparation (ODP) for Asymmetri multiore systems, we evaluate an asymmetri multiore system built with CPU-GPU by measuring the overheads of memory transfer, omputing kernel, ahe missing and synhronization. This paper demonstrates that dereasing the overhead of data preparation is a promising approah to improve the whole performane of heterogeneous system. Keywords: Overhead of data preparation; heterogeneous multiore; Amdahl s law; power onsumption A preliminary version of this paper appeared in The 1 st International Conferene on Eletronis, Eletrial Engineering, Computer Siene: Innovation and Convergene (EEECS 2016), January 20-22, Phuket, Thailand. This version inludes a onrete analysis and supporting implementation results on asymmetri mulitore system on graphis proessing. This researh was supported by a researh grant from the Shanghai Muniipal Natural Siene Foundation (15ZR ), Shanghai Pujiang Program(16PJ ), the program for Professor of Speial Appointment (Eastern Sholar) at Shanghai Institutions of Higher Learning, USST inubation projet(15hjpy-ms02), and the National Siene Foundation (XPS ). ISSN :

2 3232 Pei et al.: Reevaluating the Overhead of Data Preparation 1. Introdution As we enter the multiore era, asymmetri multiore system[1] is beoming the mainstream in the field of future proessor design. Reent examples inlude Intel s MIC[2], AMD s Kabini[3], and NVIDIA s Projet Denver[4], et. However, the memory wall and ommuniation issues will ontinue inreasing the gap between the performane of an ideal proessor and that of a pratial proessor, beause the overhead of data preparation beomes an unavoidable key problem. Amdahl s Law was a state-of-the-art analytial model that guided software developers to evaluate the atual speedup that ould be ahieved by using parallel programs, or hardware designers to draw muh more elaborate miroarhiteture and omponents. Gene Amdahl[5] defined his law for the speial ase of using n proessors (ores) in parallel when he argued for the single-proessor approah s validity for ahieving large-sale omputing apabilities. He used a limit argument to assume that a fration f of a program s exeution time was infinitely parallelizable with no sheduling overhead and other overheads to generate ready data to be used by omputing units, while the remaining fration, 1 f, was totally sequential. He noted that the speedup on n proessors is governed by equation (1): 1 Speedup(f, n) = f (1- f)+ (1) n This equation Amdahl s law on basis of omputing-entri system whih never takes into aount the potential ost of data preparation. It is only orret as long as three key assumptions are verified: (1) the programs to be exeuted are of fixed-datasize and the fration of the programs that is parallelizable remains onstant as well; (2) there exists an infinite memory spae without extra overhead of swithing memory bloks and disk bloks, et. (3) the overhead of preparing the data to be used by omputing units, whih inludes memory aess, ommuniation on-hips or off-hips and synhronization among ores, an be negleted. Currently, some CPU-GPU memory arhitetures are using separate memory address spaes for the CPU and the GPU, while the programmer is responsible for ommuniation between them by using a relatively slow ommuniation hannel,suh as PCIe bus. This ommuniation bottlenek limits the peak throughput that an be obtained from these systems. Otherwise, the power onsumption has beome an important issue that impats GPU appliations. Therefore, traditional Amdahl s Law is not fit for heterogeneous omputer system, we will take onsideration of the overhead of data preparation in asymmetri multiore system. Despite some ritiisms, Amdahl s Law is still relevant as we enter a heterogeneous multi-ore omputing era. However, the future relevane of the law requires its extension by the inlusion of onstraints and arhitetural trends demanded by modern multiproessor hips. There are a lot of ahievable researhes in theory to extend Amdahl s Law. Woo and Lee[6] extended Amdahl s law for energy-effiient omputing of many-ore, who lassified

3 KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, July many-ore design styles as three type: symmetri supersalar proessor tagged with P*, symmetri smaller power-effiient ore tagged with *, and asymmetri many-ore proessor with supersalar proessor and many smaller ores tagged with P + *. The researh results show that heterogeneous arhiteture is better than symmetri system to save power. Similarly, Marowka[7] extended Amdahl s law for heterogeneous omputing, and investigated how energy effiieny and salability are affeted by the power onstraints for three kind of heterogeneous omputer system, i.e., symmetri, asymmetri and simultaneous asymmetri. The analysis shows learly that greater parallelism is the most important fator affeting power onsumption. Karanikolaou, et al evaluated experimental energy onsumption based on both of performane/power and performane/energy ratio metris[8]. Kim et al foused on the energy effiieny of the sequential part aeleration, and how to determine the optimal frequeny boosting ratio whih maximize energy effiieny[9]. Londono et al modeled the potential dynami energy improvement, optimal frequeny and voltage alloation of a multiore system in terms of extending Amdahl s law[10]. EyerMan et al [11] introdued a model that aounts for ritial setions of te parallelizable fration of a program. Ge et al proposed a power-aware speedup model to predit the saled exeution time of power-aware lusters by isolating the performane effets of hanging proessor frequenies and the number of nodes[12]. In this paper we analyse the extended Amdahl s law metioned in the papers[16-19] and demonstrate that data preparation is another promising approah to improve performane of heterogeneous system. The ontributions in this paper are: (1) Rethinking and analysing the general model of Amdahl s law; (2) Proposing the overhead of data preparation in the extended Amdahl s law; (3) Reevaluating an asymmetri multiore system built with CPU-GPU by measuring the overheads of ommuniation, omputing kernel, ahe missing and synhronization. This paper is organized as follows. Setion II revisits the extended Amdahl s law of asymmetri multiore system by onsidering the overhead of data preparation[13]. Setion III analyses the affeting fators of data preparation. Setion IV proves the affeting fators of data preparation should not be negleted through Rodinia benhmarks. Setion VI onludes our work and future missions. 2. Revisiting the extended Amdahl s law by onsidering the overhead of data preparation The graphis proessing unit (GPU) has made signifiant strides as an aelerator in parallel omputing. However, GPU has resided out on PCIe as a disrete devie, the performane of GPU appliations an be bottleneked by data transfers between the CPU and GPU over PCIe. So the overhead of preparing the data and aessing the memory, of transmitting data on-hip and off-hip, of transferring data between CPU memories and GPU memories for heterogeneous system, of synhronizing proesses, et, beome signifiant that it annot be ignored any longer. However, traditional Amdahl s law only onsiders the ost of instrution exeution. We will thus now assume that whole ost of exeuting a program an be split into two independent parts: ( p ) and (1 p ), as shown in the Fig. 1, one (1 p ) is preparing the data exeution and the other one ( p ) is running instrution when the required data are ready.

4 3234 Pei et al.: Reevaluating the Overhead of Data Preparation Therefore, the Overhead of Data Preparation (ODP) inludes the whole ost of preparing data for exeution. As proposed by [13], ODP an be onsidered to produe a new speedup equation will all it the Extended Amdahl s law as expressed in equation (2): S ( f,, p ) EA 1 f ((1- f ) ) p (1- p ) where p denotes the omputation portion, 1 p denotes the Data Preparation portion normalized to the omputation portion: sine the lok frequenies and the ISAs of CPUs, GPUs, off-hip bus and memory arbitrator would likely be different, we should normalize the performane of Data Preparation instrutions to that of omputing instrutions. The f is the parallelizable omputation portion, 1 f is the sequential omputation portion. (2) 1 ore P 1-P 1-f f 1-fh (1-α) fh α fh ores 1-f f/ 1-fh k(1-α) fh (1-k)(1-α) fh α fh Fig. 1. Illustration of Extended Amdahl's law As shown in Fig. 1, however, equation (2) does not make allowanes for the introdution of tehniques that would derease or eliminate the overhead of data preparation. We further divide the portion of data preparation into three sub-parts: 1 f, f and (1 ) f h h h. f h denotes the portion of the program that an be overlapped with omputing instrutions in theory, where 0< f <1. But, it annot be ompletely overlapped in pratial due to the h unpredited run-time negative fators from the limits of hardware resoures. In other words, the exeution of this portion is masked behind the exeution of the omputation instrutions. In general, memory aess instrutions an be exeuted in parallel with independent omputing instrutions. However, it will not be possible to exeute all of the data preparation instrutions simultaneously with omputing instrutions. For example, if a load instrution is theoretially independent of the next omputing instrutions, it an be theoretially issued with the next omputing instrutions. However, it would not be issued beause the queue of issuing load instrutions is full. Therefore, we introdue the parameter to denote the perentage of data preparation instrutions whih are atually exeuted simultaneously with omputing instrutions before using advaned tehnologies, where 0< <1. Thus, f denotes the portion of real parallelized instrutions for data preparation and (1 ) f h denotes the portion of data preparation instrutions whih annot be overlapped with omputing instrutions before adopting advaned tehnologies. Furthermore, with the help of advaned tehnologies suh as data prefething, speulative exeution, universal memory, no-opy data transfer, 3D NoC, et., the fration of data preparation instrutions whih annot be overlapped on one proessor will sharply derease as h

5 Transfer Time(ms) KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, July the number of ores inreases. In other words, (1 ) f h ould be dereased signifiantly. Furthermore, we introdue a variable k to model how muh perentage of data preparation that annot be overlapped on a ores system even adopting advaned tehnologies, where 0< k <1. After normalization to the omputation, the fration of data preparation instrutions whih annot be overlapped on a ore system beomes fud (1 fh) k (1 ) f. On the h opposite, the fration of data preparation instrutions whih an be overlapped on a ore system beomes f f (1 k ) (1 ) f, where f f 1. While, the pd h h ud pd 1 f denotes the fration of data preparation whih is losely dependent on omputation. h Therefore, we an extend Amdahl's law to be a new equation alled Enhaned Amdahl's law. The performane speedup is governed by: ' 1 S EA( f,, p, fud ) (3) f ((1 f ) ) p fud (1 p ) 3. Analysis the impat fators of data preparation Some multiore performane models tend to ignore the impat of data synhronization, CPU-GPU ommuniation, ahe missing rate and so on. In this setion, we take aount of some possible impat fators suh as kernel overhead, memory overhead, data transfer overhead and ahe missing rate whih ontribute to the overhead of data preparation.amdal s Law on basis never takes aount of those overhead. 3.1 Memory Transfer Overhead In CPU-GPU heterogeneous omputing, one overhead is aused by the disjoint address spaes of the CPU and GPU and the need to expliitly transfer data between the two memory spaes. Before exeuting a kernel on the GPU, all of the data used by kernel needs to be transferred from the CPU memory to the GPU memory. After exeution, the data produed by the kernel mostly needs to be transferred bak to the CPU memory. The funtion to aomplish this memory transferring is usually udamempy to devie pinned to devie from devie pinned from devie E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09 Bytes per transfer Fig. 2. Time per memory transfer for different size of transferred data

6 Time per all(us) 3236 Pei et al.: Reevaluating the Overhead of Data Preparation Fig. 2 shows the average transfer time by using udamempy strides a wide range as the size of transferring data inrease. Four different onfigurations are measured, and their results are all different whether the data are transferred to or from the GPU memory, and whether we use pinned or non-pinned memory. Pinned memory is the memory whih is alloated by using the udamallohost funtion, whih prevents the memory from being swapped out and provides improved transfer speeds. However, the non-pinned memory is the memory whih is alloated by using the mallo funtion. For the small size of transfer data, there is a onstant overhead of transferring data whih is around 10 µs. As the size of transferring data inreasing about 8 KB, the transfer time starts to inrease linearly. Besides, the overhead of transferring data in pinned memory is smaller than that in non-pinned memory. 3.2 Kernel Overhead Kernel overhead inludes the overhead of synhronization and launhing a kernel. Synhronization overhead an be a barrier to ahieving good performane for appliations utilizing fine-grained synhronization. The overhead of launhing a kernel an severely impat the performane of a CUDA appliation if the kernel is invoked many times. To evaluate the overhead of kernel exeution, we measured the average time required to launh an empty kernel over a large number of kernel invoations. Fig. 3 shows the time of alling eah kernel on four different GPUs. The asynhronization bar represents the ase where the kernel is invoked repeatedly without synhronization between alls. The synhronization bar represents the ase where the udathreadsynhronize funtion is alled after alling eah kernel, and fores the CPU to wait for the exeution of urrent kernel to be finished before alling the next kernel. The y-axis is the time for alling a kernel, whih mostly results from the overhead of aessing the GPU driver API and Runtime API. As shown in the Fig. 3, the average time of alling a kernel with synhronization is about 3.5 times than that of asynhronous mode. Therefore, the synhronization overhead is muh greater than that of asynhronous Synhronization Asynhronization GTX 280 GTX GTS 8800 GTX Fig. 3. Time of alling an empty kernel 3.3 Cahe Missing Rate Another overhead of data preparation is aused by the ahe missing. The appliation with

7 KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, July large problem sizes and irregular aessing memory would ause serious Cahe missing in the run-time beause of unonventional memory aess patterns are poorly handled by pre-fething. CPUs are designed to redue the effetive memory aess lateny through extensive multi ahe level and prefething tehnologies. Thus, a slightly irregular memory aess pattern an be suessfully aptured by the CPU ahes. However, the same aess pattern may be irregular enough to prevent effiient utilization of the GPU memory bandwidth, due to the restritions on aess patterns that must be met in order to ahieve good memory performane. The restritions of aess pattern an be partially relaxed by taking advantage of the GPU speial-purpose address spaes. Both onstant and texture memory provide small on-hip ahes that allow threads to take advantage of fine-grained spatial and temporal loality. In addition, texture memory relaxes the alignment requirements that must be met in order for multiple memory aesses from within the same warp to be oalesed. Zero opy improves performane by eliminating these redundant data opies. Another effetive approah is to use the software-ontrolled shared memory as an expliitly managed ahe, whih an signifiantly improve performane when data elements are frequently reused. 4. Experimental Evaluation In this setion, to evaluate the overhead of the data preparation, we performed different appliations from Rodinia benhmark. In order to implement GPU programs, the Rodinia suite uses CUDA [14, 20], an extension to C for GPUs. CUDA is a parallel omputing platform and programming model invented by NVIDIA. It enables dramati inreases in omputing performane by harnessing the power of the graphis proessing unit (GPU). Rodinia is a benhmark suite for heterogeneous omputing. To help arhitets study emerging platforms suh as GPUs (Graphis Proessing Units), Rodinia inludes appliations and kernels whih target on multi-ore CPU and GPU platforms. 4.1 Platform and Appliations The benhmarks used for experiment are mentioned in Table 1. The Rodinia[15] benhmark suite is designed to provide parallel programs for the study of heterogeneous systems. We use seven benhmarks from Rodinia, whih are k-means (KM), bak propagation (BP), breadth-first searh(bfs), hotspot(hs), Needleman-Wunsh(NW), spekle reduing anisotropi diffusion and stream luster(srad). Table 1. Seleted appliations from Rodinia Benhmark Appliation Desription Input dataset BFS Breadth First Searh Graph1MW_6.txt NW Needleman-Wunsh (Dynami Programming) 2048*2048 HS Hot spot (Strutured Grid) 500*500 BP Bak propagation KM K-means Clustering SC Stream luster SRAD Spekle reduing anisotropi diffusion 2048*2048

8 perentage(%) 3238 Pei et al.: Reevaluating the Overhead of Data Preparation The KM is a lustering algorithm used extensively in the field of data mining. This identifies related points by assoiating eah data point with its nearest luster, omputing new luster entroids, and iterating until onvergene. BP is a mahine-learning algorithm for layered neural network. This appliation is omprised of two phases: the forward phase and bakward phase. HS is a thermal simulation tool used for estimating proessor temperature based on an arhitetural floor plan and simulated power measurements. NW is an optimization method for DNA sequene alignments, and BFS traverses all the onneted omponents in a graph. SRAD is a diffusion algorithm based on partial differential equations and used for removing the spekles in an image without sarifiing important image features. SRAD is widely used in ultrasoni and radar imaging appliations. The inputs to the program are ultrasound images and the value of eah point in the omputation domain depends on its four neighbors. Those seven benhmarks an be divided into two ategories: ompute-intensive and memory-intensive appliations. SRAD and HotSpot are relatively ompute-intensive, while Needleman-Wunsh, Breadth-First Searh, Kmeans, and Stream Cluster are memory-intensive appliation. 4.2 Experimental Results The GPU in our experimental platform for heterogeneous system is NVIDIA GTX 750. It has 512 streaming multi-proessors (SMs), and 2 GB devie memory. We investigate the overhead of data preparation by omparing against the total exeution time of the appliations from Rodinia benhmark. As mentioned in setion III, We mainly fous on overhead of memory transfer, synhronization and ahe missing rate. The omputing performane of appliations is mostly affeted by those overheads, whih we will disuss in the following setions. As shown in Fig. 4, the overhead of different fators are analyzed, suh as kernel exeution, CPU-GPU ommuniation, CPU exeution and synhronization. And to different appliations, the influene of those fators is not the same. For example, the CPU-GPU ommuation overhead of SRAD and BP are relatively high beause of they have a large dataset. In addition, as for NW, HS and BP, synhronization overhead makes up a high proportion of data preparation for the data dependent. 60 Kernel Exeution CPU Exeution CPU-GPU Communiation Synhronization KM NW HS BP SRAD BFS SC Fig. 4. Perentage of total exeution time

9 perentage(%) KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, July Communiation Overhead The overhead of ommuniation between GPUs and CPUs often beomes a major fator to omputing performane. For large dataset, the overhead of data preparation is mainly determined by the bandwidth of PCIe bus. Slowing data transfer between CPU and GPU exaerbates the overhead of data preparation. As shown in the Fig. 5, SRAD and BP require signifiant CPU-GPU ommuniation time beause they need a lot of data to be inputted. The average of ommuniation time for SRAD and BP is almost one-third of the whole runtime. As for NW and HS, all of the omputation is done on the GPU and the results are expliitly transferred bak only after all GPU work has been ompleted. In these appliations, the memory transfer overhead is hard to be further redued. In onlusion, ommuniation overhead should be onsidered in the Extended Amdahl s Law. Understanding the appliation s memory aess patterns on the GPU is ruial to ahieve good performane. If the neighboring threads aess neighboring rows in an array, alloating the array in olumn-major order will allow threads within the same warp to aess ontiguous elements. So warp sheduling an hide the memory lateny and derease the ommuniation overhead to some extent. 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% CPU-GPU Communiation Other Overhead KM NW HS BP SRAD BFS SC Synhronization Overhead Fig. 5. CPU-GPU ommuniation overhead As shown in the Fig. 6, the overhead of synhronization is a barrier to ahieve high performane of appliations by utilizing fine-grained synhronization. SRAD and SC show relatively low synhronization overhead beause the majority of their omputations are independent. However, NW, HS, BP and BFS have great overhead of synhronization. The synhronization perentage of those appliations is between 30% and 40%, beause the majority of their omputations are interdependent. As show in equation (3), if parallelism is low, ignore the effets of data synhronization overhead is aeptable. But as parallelism growing, this effet beomes more predominant. Che et al. [18] also showed the importane of using the ahed, read-only onstant memory and texture memory spaes for frequently reused

10 perentage(%) perentage(%) 3240 Pei et al.: Reevaluating the Overhead of Data Preparation data within a thread blok. These methods are espeially helpful in reduing the overhead of GPU s data synhronization. 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Synhronization Other Overhead KM NW HS BP SRAD BFS SC Fig. 6. Synhronization overhead At all, the overhead of CPU-GPU ommuniation and synhronization has great effet on data preparation. As show in Fig. 7, we an alulate the total perentage of ommuniation and synhronization during the running time. The rate of NW and BP are already above the half of the total time (NW is 52.4% and BP is 55.1%).The average rate of those benhmarks is 37.83%.Aording to Fig. 7, we find that data synhronization and ommuniation affet Amdahl s Law in the hetergeneous arhiteture. It is apparent that the overhead of data preparation (CPU-GPU ommuniation and data synronization) takes up a great proportion of the total exeution time. 60 CPU-GPU Communiation Synhronization KM NW HS BP SRAD BFS SC Average Fig. 7. Overhead of Communiation and Synhronization

11 L2 ahe missing rate(%) KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, July The Overhead of Cahe Missing Cahe missing rate is another fator whih ontributes to the overhead of data preparation. As shown in the Fig. 8, irregular data aess is poorly handled by pre-fething. NW exhibits a high L2 ahe missing rate of 41.0% beause of its irregular memory aess patterns (diagonal strips). BFS exhibits L2 ahe missing rate of 20.8% and KM exhibits 27.1%, whih exhibit regular behavior, present ahe miss rates that are lower but still high enough to be of interest. The L2 ahe miss rates of other appliations range from 2.0% to 12.0%. In terms of those experimental data, we onlude that the appliations with memory-intensive and irregular data aessing would inur high ahe missing rate KM NW HS BP SRAD BFS SC Fig. 8. L2 ahe missing rate 5. Conlusion Aording to the experiment results, alling a kernel would inurs high overhead of performane and programming to deal with the CPU-GPU ommuniation (as in BP and SRAD). High synhronization overheads are aused by omputations interdependent suh as HS and BP. Therefore, the overhead of data preparation is a major fator whih we should not be negleted in heterogeneous system. Traditional Amdahl s Law ignores those overheads, so for tasks of high synhronization, parallel exeution may result in signifiantly lower speedup than as predited by traditional Amdahl s Law. When we take into aount the overhead of data preparation, the speedup will not grow linearly as the number of ores inreases. Therefore, the speedup obtained is always less than the ideal in terms of traditional Amdahl's law. Therefore, we onsider the overhead of data preparation is an important fator by extending traditional Amdahl s Law. The effets of ommuniation, synhronization and ahe missing rate an be overome by proper arhitetural and software design. If a multiore arhiteture has no private memory, then there would be no need in synhronization and ommuniation: the only one memory is shared by all of the ores and data aess at the same spae. Therefore suh memory arhiteture would be unaffeted by those overheads of data preparation. But this may inur long memory lateny. In this paper, we have presented an Extended Amdahl s Law taking into aount the effets

12 3242 Pei et al.: Reevaluating the Overhead of Data Preparation of data preparation. By experiment, the heterogeneous multiore performane is fundamentally limited by data synhronization, CPU-GPU ommuniation and ahe missing rate. The maximum performane speedup ahieved by heterogeneous multiore an be lower than predited by Amdahl s Law. In the future work, we will keep on investigating the performane gain within limited energy (or power) budget while onsidering the overhead of data preparation. Referenes [1] P.Rogers, Heterogeneous System Arhiteture Overview, in Pro. of Hot Chips Sym., Aug Artile (CrossRef Link) [2] A.Duran and M.Klemm, The Intel Many Integrated Core Arhiteture, in Pro. of the International Conferene on High Performane Computing and Simulation(HPCS), pp , July Artile (CrossRef Link) [3] D.Bouvier, B.Cohen, W.Fry, S.Godey and M.Mantor, Kabini: An AMD Aelerated Proessing Unit System on A Chip, MICRO, vol.34, no.2, pp.22-33, Marh Artile (CrossRef Link) [4] Nvidia Corporation. Nvidia Projet Denver, (available in Feb. 2016). [5] Amdahl G M., Validity of the single proessor approah to ahieving large sale omputing apabilities [J], in Pro. of the Amerian Federation of Information Proessing Soieties (AFIPS), , April Artile (CrossRef Link) [6] D.H.Woo and H.S.Lee, Extending Amdahl s Law for Energy-Effiient Computing in the Many-Core Era, Computer, vol.41, no.12, pp.24-31, De Artile (CrossRef Link) [7] A. Marowka, Extending Amdahl s Law for Heterogeneous Computing, in Pro. of the 2012 International Symposium on Parallel and Distributed Proessing with Appliations(ISPDPA), pp , July Artile (CrossRef Link) [8] E.M.Karanikolaou, E.I.Milovanovi, I.Z ˇ.Milovanovi,M.P.Bekakos, Peformane Salability and Energy Consumpution on Distributed and Many-ore Platforms, Journal of Superomputing, vol.70, no.1, pp ,ot Artile (CrossRef Link) [9] S.H.Kim, D.Kim, C.Lee, W.S.Jeong, W.W.Ro, and J.L.Gaudiot, A Performane-Energy Model to EvaluateSingle Thread Exeution Aeleration, Computer Arhiteture Letters, vol.14, no.2, pp , Nov Artile (CrossRef Link) [10] S.M.London o, and J.P.Gyvez, Extending Amdahl s Law for Energy-Effiieny, in Pro. of International Conferene on Energy Aware Computing(ICEAC), pp.1-4, De Artile (CrossRef Link) [11] Eyerman S, Eekhout L., Modeling ritial setions in Amdahl's law and its impliations for multiore design [J]., in Pro. of the 37 th Annual International Symposium on Computer Arhiteture, ACM, New York, US, 38(3): , Artile (CrossRef Link) [12] R.Ge and K.W.Cameron, Power-Aware Speedup, in Pro. of International Parallel and Distributed Proessing Symposium (IPDPS), Marh 2007, pp Artile (CrossRef Link) [13] S.W.Pei, M.S.Kim, J.L.Gaudiot, Extending Amdahl s Law for Heterogeneous Multiore Proessor with Consideration of the Overhead of Data Preparation, IEEE Embedded Systems Letters, vol. 8, no.1, pp.26-29, Marh Artile (CrossRef Link) [14] J. Nikolls, I. Buk, M. Garland, and K. Skadron, Salable parallel programming with CUDA, ACM Queue, vol.6, no.2, pp.40 53, Artile (CrossRef Link) [15] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer,S.-H. Lee, and K. Skadron, Rodinia: A benhmark suite for heterogeneous omputing, in Pro. of IEEE International Symposium on Workload Charaterization, pp Artile (CrossRef Link) [16] Hill, Mark D, Marty, Mihael R., Amdahl's Law in the Multiore Era [J], Computer, 2008, 41(7): Artile (CrossRef Link) [17] Erlin Yao, Yungang Bao, Guangming Tan, et al., Extending Amdahl s Law in the Multiore Era[J]., ACM SIGMETRICS Performane Evaluation Review, 37(2):24-26, Artile (CrossRef Link)

Artile (CrossRef Link) [19] Andreou A G. Beyond Amdahl's Law: An Objetive Funtion That Links Multi-proessor Performane Gains to Delay and Energy [J].

Skadron, A performane study of general purpose appliations on graphis proessors using CUDA, Journal of Parallel and Distributed Computing, 68(10):1370 1380, 2008.

13 KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, July [18] Xian-He Sun, Yong Chen, Reevaluating Amdahl s law in the multiore era [J], Journal of Parallel and distributed Computing, 70(2): , Artile (CrossRef Link) [19] Andreou A G. Beyond Amdahl's Law: An Objetive Funtion That Links Multi-proessor Performane Gains to Delay and Energy [J]. IEEE Transations on Computers, 61(8): , Artile (CrossRef Link) [20] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, A performane study of general purpose appliations on graphis proessors using CUDA, Journal of Parallel and Distributed Computing, 68(10): , Artile (CrossRef Link) Songwen Pei reeived the B.S. in the Shool of Computer from National University of Defene and Tehnology, Changsha, China in 2003, the M.S. in the Shool of Information Engineering from Guizhou University, Guiyang, China in 2006, and the Ph.D. in the Shool of Computer Siene and Tehnology from Fudan University, Shanghai, China in He is urrently an Assoiate Professor of the Computer Siene and Engineering Department at the University of Shanghai for Siene and Tehnology, and he urrently works as a Guest Researher at the Institute of Computing Tehnology, Chinese Aademy of Sienes (2011-), and a Researh Sientist at the Eletrial Engineering and Computer Siene, University of California( ). His researh interests inlude heterogeneous multi-ore proessors, parallel and distributed omputing, loud omputing, big data, and fault-tolerant omputation, et. He is a member of the IEEE, ACM and CCF in China, and he is also a board member of CCF-TCCET, CCF-TCARCH and CCF-YOCSEF Shanghai respetively. He is a talent award-winner of Shanghai Pujiang Program Junge Zhang reeived the B.S. in the omputer siene from Luoyang Normal University in She is urrently a graduate student in omputer arhiteture at the University of Shanghai for Siene and Tehnology. Her urrent researh interests inlude omputer arhiteture, Multi-ore heterogeneous systems, data prefething and GPU power model. Linhua Jiang reeived M.S. degree and Ph.D. degree from Catholi University of Leuven and Leiden University respetively. He is urrently a Professor in the Shool of Optial-Eletrial and Computer Engineering, University of Shanghai for Siene and Tehnology, and is the diretor of the Optial-Computer Siene Researh Group. He was a visiting Faulty of Stanford University. His urrent researh fields inlude Optial-Information and Image Proessing, Computer Network and Cyber-Physial System, Parallel Computing, et. Prof. JIANG published more than one hundred sientifi papers and onferene artiles.

3244 Pei et al.: Reevaluating the Overhead of Data Preparation Myoung-Seo Kim reeived both the B.S. degree in Computer Siene and the B.S. degree in Eletrial and Eletronis Engineering and the M.S. degree in Computer Siene from Yonsei University, Seoul, Korea in 2003 and 2005, respetively.

, Cupertino, California (2008-2009). He reeived the Ph.D. degree in Computer Siene from University of California, Irvine in 2016.

14 3244 Pei et al.: Reevaluating the Overhead of Data Preparation Myoung-Seo Kim reeived both the B.S. degree in Computer Siene and the B.S. degree in Eletrial and Eletronis Engineering and the M.S. degree in Computer Siene from Yonsei University, Seoul, Korea in 2003 and 2005, respetively. He has worked for 5 years performing design and verifiation of portable multimedia system-on-a-hip (SoC) at Samsung Eletronis, Yongin, Korea ( ) and at Apple In., Cupertino, California ( ). He reeived the Ph.D. degree in Computer Siene from University of California, Irvine in He is urrently a prinipal researher in the Parallel Systems and Computer Arhiteture Lab at University of California, Irvine. He was the reipient of the 2011 Yonsei International Foundation Sholarship and the 2012 SK Hynix Study Abroad Sholarship. In addition, he has published more than 15 papers in journals and international onferenes. His researh interests inlude multi/many-ore system-on-a-hip, omputer arhiteture and design, low-power arhiteture, embedded systems, VLSI iruit design and analysis, and design automation. He is a member of the IEEE and the ACM. Jean-Lu Gaudiot reeived the Diplôme d Ingénieur from ESIEE, Paris, Frane in 1976 and the M.S. and Ph.D. degrees in Computer Siene from the University of California, Los Angeles in 1977 and 1982, respetively. He is urrently a Professor and Chair of the Eletrial and Computer Engineering Department at the University of California, Irvine. Prior to joining UCI in January 2002, he was a Professor of Eletrial Engineering at the University of Southern California sine 1982, where he served as and Diretor of the Computer Engineering Division for three years. He has also done miroproessor systems design at Teledyne Controls, Santa Monia, California ( ) and researh in innovative arhitetures at the TRW Tehnology Researh Center, El Segundo, California ( ). He onsults for a number of ompanies involved in the design of high-performane omputer arhitetures. His researh interests inlude multithreaded arhitetures, fault-tolerant multi-proessors, and implementation of reonfigurable arhitetures. He has published over 200 journal and onferene papers. His researh has been sponsored by NSF, DoE, and DARPA, as well as a number of industrial organizations. In January 2006, he beame the first Editor-in-Chief of IEEE Computer Arhiteture Letters, a new publiation of the IEEE Computer Soiety, whih he helped found to the end of failitating short, fast turnaround of fundamental ideas in the Computer Arhiteture domain. From 1999 to 2002, he was the Editor-in-Chief of the IEEE Transations on Computers. He has served on the IEEE Computer Soiety Board of Governors for two terms ( ), was VP of Eduational Ativities (2013), VP of Publiations ( ) and is the 2017 IEEE Computer Soiety President. In 1999, he beame a Fellow of the IEEE. He was elevated to the rank of AAAS Fellow in 2007.

Pipelined Multipliers for Reconfigurable Hardware

Pipelined Multipliers for Reconfigurable Hardware Pipelined Multipliers for Reonfigurable Hardware Mithell J. Myjak and José G. Delgado-Frias Shool of Eletrial Engineering and Computer Siene, Washington State University Pullman, WA 99164-2752 USA {mmyjak,