Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs


Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs

José A. Joao (ECE Department, The University of Texas at Austin, Austin, TX, USA), M. Aater Suleman (Flux7 Consulting, Austin, TX, USA), Onur Mutlu (Computer Architecture Laboratory, Carnegie Mellon University, Pittsburgh, PA, USA), Yale N. Patt (ECE Department, The University of Texas at Austin, Austin, TX, USA)

ABSTRACT

Asymmetric Chip Multiprocessors (ACMPs) are becoming a reality. ACMPs can speed up parallel applications if they can identify and accelerate code segments that are critical for performance. Proposals already exist for using coarse-grained thread scheduling and fine-grained bottleneck acceleration. Unfortunately, there have been no proposals offered thus far to decide which code segments to accelerate in cases where both coarse-grained thread scheduling and fine-grained bottleneck acceleration could have value. This paper proposes Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs (UBA), a cooperative software/hardware mechanism for identifying and accelerating the most likely critical code segments from a set of multithreaded applications running on an ACMP. The key idea is a new Utility of Acceleration metric that quantifies the performance benefit of accelerating a bottleneck or a thread by taking into account both the criticality and the expected speedup. UBA outperforms the best of two state-of-the-art mechanisms by 11% for single-application workloads and by 7% for two-application workloads on an ACMP with 52 small cores and 3 large cores.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)

General Terms: Design, Performance

Keywords: Multithreaded applications, critical sections, barriers, multicore, asymmetric CMPs, heterogeneous CMPs

1. INTRODUCTION

Parallel applications are partitioned into threads that can execute concurrently on multiple cores. Speedup is often limited when some threads are prevented from doing useful work concurrently because they have to wait for other code segments to finish.
Asymmetric Chip Multi-Processors (ACMPs) with one or few large, fast cores and many small, energy-efficient cores have been proposed for accelerating the most performance-critical code segments, which can lead to significant performance gains. However, this approach has heretofore had at least two fundamental limitations:

1. The problem of accelerating only one type of code segment. There are two types of code segments that can become performance limiters: (1) threads that take longer to execute than other threads because of load imbalance or microarchitectural mishaps such as cache misses, and (2) code segments, like contended critical sections, that make other threads wait. We call threads of the first type lagging threads. They increase execution time since the program cannot complete until all its threads have finished execution. Code segments of the second type reduce parallelism and can potentially become the critical path of the application. Joao et al. [] call these code segments bottlenecks. Prior work accelerates either lagging threads [6, 5, 13] or bottlenecks [24, ], but not both. Thus, these proposals benefit only the applications whose performance is limited by the type of code segments that they are designed to accelerate. Note that there is overlap between lagging threads and bottlenecks: lagging threads, if left alone, can eventually make other threads wait and become bottlenecks. However, the goal of the proposals that accelerate lagging threads is to try to prevent them from becoming bottlenecks. Real applications often have both bottlenecks and lagging threads. Previous acceleration mechanisms prove suboptimal in this case.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA '13, Tel-Aviv, Israel. Copyright 2013 ACM.
Bottleneck Identification and Scheduling (BIS) [] does not identify lagging threads early enough and as a result it does not always accelerate the program's critical execution path. Similarly, lagging thread acceleration mechanisms do not accelerate consecutive instances of the same critical section that execute on different threads and as a result can miss the opportunity to accelerate the program's critical execution path. Combining bottleneck and lagging thread acceleration mechanisms is non-trivial because the combined mechanism must predict the relative benefit of accelerating bottlenecks and lagging threads. Note that this benefit depends on the input set and program phase, as well as the underlying machine. Thus a static solution would likely not work well. While the existing acceleration mechanisms are dynamic, they use different metrics to identify good candidates for acceleration; thus, their outputs cannot be compared directly to decide which code segments to accelerate.

2. The problem of not handling multiple multithreaded applications. In practice, an ACMP can be expected to run multiple multithreaded applications. Each application will have lagging threads and bottlenecks that benefit differently from acceleration. The previous work on bottleneck acceleration [24, ] and lagging thread acceleration [13] does not deal with multiple multithreaded applications, making their use limited in practical systems.

To make ACMPs more effective, we propose Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs (UBA). UBA is a general cooperative software/hardware mechanism to identify the most important code segments from one or multiple applications to accelerate on an ACMP to improve system performance. UBA introduces a new Utility of Acceleration metric for each code segment, either from a lagging thread or a bottleneck, which is used to decide which code segments to run on the large cores of the ACMP. The key idea of the utility metric is to consider both the acceleration expected from running on a large core and the criticality of the code segment for its application as a whole. Therefore, this metric is effective in making acceleration decisions for both single- and multiple-application cases. UBA also builds on and extends previous proposals to identify potential bottlenecks [] and lagging threads [13]. This paper makes three main contributions:

1. It introduces a new Utility of Acceleration metric that combines a measure of the acceleration that each code segment achieves with a measure of the criticality of each code segment. This metric enables meaningful comparisons to decide which code segments to accelerate regardless of the segment type. We implement the metric in the context of an ACMP where acceleration is performed with large cores, but the metric is general enough to be used with other acceleration mechanisms, e.g., frequency scaling.

2. It provides the first mechanism that can accelerate both bottlenecks and lagging threads from a single multithreaded application, using faster cores. It can also leverage ACMPs with any number of large cores.

3. It is the first work that accelerates bottlenecks in addition to lagging threads from multiple multithreaded applications.
We evaluate UBA on single- and multiple-application scenarios on a variety of ACMP configurations, running a set of workloads that includes both bottleneck-intensive and non-bottleneck-intensive applications. For example, on a 52-small-core, 3-large-core ACMP, UBA improves average performance of 9 multithreaded applications by 11% over the best of previous proposals that accelerate only lagging threads [13] or only bottlenecks []. On the same ACMP configuration, UBA improves average harmonic speedup of 2-application workloads by 7% over our aggressive extensions of previous proposals to accelerate multiple applications. Overall, we find that UBA significantly improves performance over previous work and its performance benefit generally increases with larger area budgets and additional large cores.

2. MOTIVATION

2.1 Bottlenecks

Joao et al. [] defined bottleneck as any code segment that makes other threads wait. Bottlenecks reduce the amount of thread-level parallelism (TLP); therefore, a program running with significant bottlenecks can lose some or even all of the potential speedup from parallelization. Inter-thread synchronization mechanisms can create bottlenecks, e.g., contended critical sections, the last thread arriving at a barrier, and the slowest stage of a pipeline-parallel program. Figure 1 shows four threads executing non-critical-section segments (Non-CS) and a critical section CS (in gray). A critical section enforces mutual exclusion: only one thread can execute the critical section at a given time, making any other threads wanting to execute the same critical section wait, which reduces the amount of useful work that can be done in parallel.

Figure 1: Example of a critical section.

The state-of-the-art in bottleneck acceleration on an ACMP is BIS [], which consists of software-based annotation of potential bottlenecks and hardware-based tracking of thread waiting cycles, i.e., the number of cycles threads waited for each bottleneck. Then, BIS accelerates the bottlenecks that are responsible for the most thread waiting cycles.
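For illustration only, BIS-style ranking by thread waiting cycles can be sketched as a small software model. This is a hypothetical sketch, not the hardware implementation; all names are invented:

```python
# Toy software model of BIS-style bottleneck ranking (illustrative only).
# Every cycle a thread spends waiting on a bottleneck charges one waiting
# cycle to that bottleneck; the top-ranked bottleneck is accelerated.

class BottleneckTable:
    def __init__(self):
        self.waiting_cycles = {}  # bottleneck id -> accumulated thread waiting cycles

    def record_wait(self, bottleneck_id, num_waiters, cycles):
        # num_waiters threads each waited `cycles` cycles on this bottleneck
        self.waiting_cycles[bottleneck_id] = (
            self.waiting_cycles.get(bottleneck_id, 0) + num_waiters * cycles)

    def most_critical(self):
        # the bottleneck responsible for the most thread waiting cycles
        return max(self.waiting_cycles, key=self.waiting_cycles.get)

bt = BottleneckTable()
bt.record_wait("lock_A", num_waiters=3, cycles=1000)  # three waiters at once
bt.record_wait("lock_B", num_waiters=1, cycles=2000)
print(bt.most_critical())  # -> lock_A
```

Note that this metric only grows after waiting has already occurred, which is exactly the limitation discussed next.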
BIS is effective in accelerating critical sections that limit performance at different times. However, it accelerates threads arriving last at a barrier and slow stages of a pipeline-parallel program only after they have started making other threads wait, i.e., after accumulating a minimum number of thread waiting cycles. If BIS could start accelerating such lagging threads earlier, it could remove more thread waiting and further reduce execution time.

2.2 Lagging Threads

A parallel application is composed of groups of threads that split work and eventually either synchronize at a barrier, or finish and join. The thread that takes the most time to execute in a thread group determines the execution time of the entire group, and we call that thread a lagging thread. Thread imbalance can appear at run time for multiple reasons, e.g., different memory behavior that makes some threads suffer from higher average memory latency, and different contention for critical sections that makes some threads wait longer. Figure 2 shows execution of four threads over time. Thread T2 becomes a lagging thread as soon as it starts making slower progress than the other threads towards reaching the barrier at t2. Note that at t1, T2 becomes the last thread running for the barrier and becomes a bottleneck. Therefore, lagging threads are potential future bottlenecks, i.e., they become bottlenecks if thread imbalance is not corrected in time. Also note that if there are multiple threads with approximately as much remaining work to do as the most lagging thread, all of them need to be accelerated to actually reduce total execution time. Therefore, all those threads have to be considered lagging threads.

Figure 2: Example of a lagging thread (T2).

The state-of-the-art in acceleration of lagging threads are proposals that identify a lagging thread by tracking either thread progress [6, 13] or reasons for a thread to get delayed [5]. Meeting Points [6] tracks the threads that are lagging in reaching a barrier by counting the number of loop iterations that have been completed. Thread Criticality Predictors [5] predict that the threads that suffer from more cache misses will be delayed and will become critical. Age-based Scheduling [13] accelerates the thread that is predicted or profiled to have more remaining work until the next barrier or the program exit, measured in terms of committed instructions. Once a lagging thread is identified, it can be accelerated on a large core of an ACMP.

2.3 Applications have both Lagging Threads and Bottlenecks

Joao et al. [] showed that different bottlenecks can limit performance at different times. In particular, contention for different critical sections can be very dynamic. It is not evident upfront whether accelerating a critical section or a lagging thread leads to better performance. Therefore, it is fundamentally important to dynamically identify the code segments, either bottlenecks or lagging threads, that have to be accelerated at any given time.

2.4 Multiple Applications

Figure 3(a) shows two 4-thread applications running on small cores of an ACMP with a single large core. Let's assume that at time t1 the system has to decide which thread to accelerate on the large core to maximize system performance. With knowledge of the progress each thread has made towards reaching the next barrier, the system can determine that App1 has one lagging thread, which has significantly more remaining work to do than App1's other threads, and that App2 has two lagging threads, one of them T2, both of which have significantly more work to do than App2's other threads. Accelerating the lagging thread from App1 would directly reduce App1's execution time by some Δt, while accelerating either lagging thread from App2 alone would not significantly reduce App2's execution time. It is necessary to accelerate one of them during one quantum and the other during another quantum to reduce App2's execution time by a similar Δt, assuming the speedups for all threads on the large core are similar. Therefore, system performance will increase more by accelerating the lagging thread from App1.
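The progress-based reasoning in this example, which UBA's Lagging Thread Identification unit formalizes in Section 3.2, can be sketched as follows; `delta_p` stands in for the ΔP progress gap defined there, and the thread names and numbers are illustrative:

```python
def find_lagging_threads(progress, delta_p):
    """Return the ids of all lagging threads.

    progress: committed-instruction count per thread since the last barrier.
    delta_p:  extra progress an accelerated thread is expected to make in one
              quantum; any thread within delta_p of the minimum-progress
              thread would also need acceleration, so it is lagging too.
    """
    min_p = min(progress.values())
    return {tid for tid, p in progress.items() if p - min_p < delta_p}

# Four threads racing toward a barrier: T1 is furthest behind, and T2/T3
# are close enough behind that they would become the new laggards if only
# T1 were accelerated.
progress = {"T1": 1000, "T2": 1400, "T3": 1600, "T4": 5000}
print(find_lagging_threads(progress, delta_p=1000))  # T1, T2 and T3 are lagging
```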
Figure 3: Examples of lagging threads and critical sections: (a) lagging threads; (b) lagging threads and critical sections.

Figure 3(b) shows the same App1 from the previous example and an App3 with a strongly-contended critical section (in gray). Every thread from App3 has to wait to execute the critical section at some time, and the critical section is clearly on the critical path of execution (there is always one thread executing the critical section and at least one thread waiting for it). At t1, App1 has a single lagging thread, which is App1's critical path. Therefore, every cycle saved by accelerating App1's lagging thread would directly reduce App1's execution time. Similarly, every cycle saved by accelerating instances of the critical section from App3 on any of its threads would directly reduce App3's execution time. Ideally, the system should dynamically accelerate the code segment that gets a higher speedup from the large core, either a segment of the lagging thread from App1 or a sequence of instances of the critical section from App3. These two examples illustrate that acceleration decisions need to consider both the criticality of code segments and how much speedup they get from the large core. Our goal is to design a mechanism that decides which code segments, either from lagging threads or bottlenecks, to run on the available large cores of an ACMP, to improve system performance in the presence of a single multithreaded application or multiple multithreaded applications.

3. UTILITY-BASED ACCELERATION (UBA)

The core of UBA is a new Utility of Acceleration metric that is used to decide which code segments to accelerate at any given time. Utility combines an estimation of the acceleration that each code segment can achieve on a large core, and an estimation of the criticality of the code segment. Figure 4 shows the three main components of UBA: the Lagging Thread Identification unit, the Bottleneck Identification unit, and the Acceleration Coordination unit.
Every scheduling quantum (1M cycles in our experiments), the Lagging Thread Identification (LTI) unit produces the set of Highest-Utility Lagging Threads (HULT), one for each large core, and the Bottleneck Acceleration Utility Threshold (BAUT). Meanwhile, the Bottleneck Identification (BI) unit computes the Utility of accelerating each of the most important bottlenecks and identifies those with Utility greater than the BAUT, which we call Highest-Utility Bottlenecks (HUB). Only these bottlenecks are enabled for acceleration. Finally, the Acceleration Coordination (AC) unit decides which code segments to run on each large core, either a thread from the HULT set, or bottlenecks from the HUB set.

Figure 4: Block diagram of UBA.

3.1 Utility of Acceleration

We define the Utility of Accelerating a code segment c as the reduction in the application's execution time due to acceleration of c, relative to the application's execution time before acceleration. Formally,

U_c = ΔT / T

where ΔT is the reduction in the entire application's execution time and T is the original execution time of the entire application. If code segment c of length t cycles is accelerated by Δt, then after multiplying and dividing by Δt and t, the Utility of accelerating c can be rewritten as:

U_c = ΔT / T = (Δt / t) · (t / T) · (ΔT / Δt) = L · R · G

L: The first factor is the Local Acceleration, which is the reduction in the execution time of solely the code segment c due to running on a large core, divided by the original execution time of c on the small core:

L = Δt / t

L depends on the net speedup that running on a large core can provide for the code segment, which is a necessary condition to improve the application's performance: if L is close to zero or negative, running on the large core is not useful and can be harmful.

R: The second factor is the Relevance of Acceleration, which measures how important code segment c is for the application as a whole: R is the execution time (in cycles) of c on the small core divided by the application's execution time (in cycles) before acceleration:

R = t / T

R limits the overall speedup that can be obtained by accelerating a single code segment. For example, let's assume two equally long serial bottlenecks from two different applications start at the same time and can be accelerated with the same L factor. One runs for 50% of its application's execution time, while the other runs for only 1%. Obviously, accelerating the first one is a much more effective use of the large core to improve system performance.

G: The third factor is the Global Effect of Acceleration, which represents how much of the code segment's acceleration Δt translates into a reduction in execution time ΔT:

G = ΔT / Δt

G depends on the criticality of code segment c: if c is on the critical path, G = 1; otherwise G = 0. In reasonably symmetric applications, multiple threads may arrive at the next barrier at about the same time and all of them must be accelerated to reduce the application's execution time, which makes each of the threads partially critical (0 < G < 1). We explain how we estimate the factors L, R and G in Sections 3.5.1, 3.5.2 and 3.5.3, respectively.

3.2 Lagging Thread Identification

The set of Highest-Utility Lagging Threads (HULT) is produced every scheduling quantum (i.e., Q cycles) by the LTI unit with the following steps:

1. Identify lagging threads.
We use the same notion of progress between consecutive synchronization points as in [13]; i.e., we assume approximately the same number of instructions are expected to be committed by each thread between consecutive barriers or synchronization points, and use instruction count as a metric of thread progress.¹ A committed instruction counter progress is kept as part of each hardware context. After thread creation or when restarting the threads after a barrier, progress is reset with a simple command ResetProgress, implemented as a store to a reserved memory location.

¹ Note that we are not arguing for instruction count as the best progress metric. In general, the best progress metric is application dependent and we envision a mechanism that lets the software define which progress metric to use for each application, including [6, 13, 5], or even each application periodically reporting how much progress each thread is making. UBA can be easily extended to use any progress metric.

Figure 5: Lagging thread identification.

Figure 5 shows the progress of several threads from the same application that are running on small cores. Thread T1 has made the smallest progress minP and is a lagging thread. Let's assume that if thread T1 is accelerated during the next scheduling quantum, it will make ΔP more progress than the other non-accelerated threads. Therefore, T1 will leave behind threads T2 and T3, which then will also have to be accelerated to fully take advantage of the acceleration of T1. Therefore, we consider the initial set of lagging threads to be {T1, T2, T3}. We estimate ΔP as AvgDeltaIPC × Q, where AvgDeltaIPC is the difference between average IPC on the large cores and average IPC on the small cores across all threads in the application, measured over five quanta.

2. Compute the Utility of Acceleration for each lagging thread. We explain how UBA estimates the factors L, R and G in Sections 3.5.1, 3.5.2 and 3.5.3, respectively.

3. Find the Highest-Utility Lagging Thread (HULT) set. The HULT set consists of the lagging threads with the highest Utility.
The size of the set is equal to the number of large cores. Lagging threads from the same application whose Utilities fall within a small range of each other² are considered to have the same Utility and are sorted by their progress instruction count (highest rank for lower progress) to improve fairness and reduce the impact of inaccuracies in Utility computations.

4. Determine the Bottleneck Acceleration Utility Threshold (BAUT). As long as no bottleneck has higher Utility than any lagging thread in the HULT set, no bottleneck should be accelerated. Therefore, the BAUT is simply the smallest Utility among the threads in the HULT set.

To keep track of the relevant characteristics of each thread, the LTI unit includes a Thread Table with as many entries as hardware contexts, indexed by software thread ID (tid).

3.3 Bottleneck Identification

The Highest-Utility Bottleneck set is continuously produced by the Bottleneck Identification (BI) unit. The BI unit is implemented similarly to BIS [] with one fundamental change: instead of using thread waiting cycles as the metric to classify bottlenecks, the BI unit uses our Utility of Acceleration metric.³

Software support. The programmer, compiler or library delimits potential bottlenecks using BottleneckCall and BottleneckReturn instructions, and replaces the code that waits for bottlenecks with a BottleneckWait instruction. The purpose of the BottleneckWait instruction is threefold: 1) it implements waiting for the value in a memory location to change, 2) it allows the hardware to keep track of which threads are waiting for each bottleneck, and 3) it makes instruction count a more accurate measure of thread progress by removing the spinning loops that wait for synchronization and execute instructions that do not make progress.

Hardware support: Bottleneck Table. The hardware tracks which threads are executing or waiting for each bottleneck and identifies the critical bottlenecks with low overhead in hardware, using a Bottleneck Table (BT) in the BI unit.
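As a software-level analogy of the annotation just described (hypothetical, since BottleneckCall, BottleneckReturn and BottleneckWait are ISA instructions, not library calls), a critical section delimited for the hardware could be modeled like this:

```python
import threading

class AnnotatedLock:
    """Toy model of an annotated critical section: acquire() plays the role
    of BottleneckWait (the wait is visible for bookkeeping, not hidden in a
    spin loop), and entering/leaving the with-block plays BottleneckCall/
    BottleneckReturn. `table` mimics shared Bottleneck Table bookkeeping."""
    def __init__(self, bottleneck_id, table):
        self.bottleneck_id = bottleneck_id
        self.table = table
        self.lock = threading.Lock()
    def __enter__(self):
        self.table.setdefault(self.bottleneck_id, 0)
        self.lock.acquire()                   # BottleneckWait
        self.table[self.bottleneck_id] += 1   # BottleneckCall: instance begins
        return self
    def __exit__(self, *exc):
        self.lock.release()                   # BottleneckReturn: instance ends
        return False

table = {}
counter_lock = AnnotatedLock("counter_lock", table)
total = 0

def worker():
    global total
    for _ in range(1000):
        with counter_lock:       # annotated critical section
            total += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(total, table["counter_lock"])  # -> 4000 4000
```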
Each BT entry corresponds to a bottleneck and collects all the data required to compute the Utility of its acceleration.

² We find a range of 2% works well in our experiments.

³ Our experiments (not shown due to space limitations) show that using Utility outperforms using thread waiting cycles by 1.5% on average across our bottleneck-intensive applications.

Utility of accelerating bottlenecks. The Bottleneck Table computes the L factor once every quantum, as explained in Section 3.5.1. It recomputes the Utility of accelerating each bottleneck whenever its R (Section 3.5.2) or G (Section 3.5.3) factor changes. Therefore, Utility can change at any time, but it does not change very frequently because of how R and G are computed, as explained later.

Highest-Utility Bottleneck (HUB) set. Bottlenecks with Utility above the Bottleneck Acceleration Utility Threshold (BAUT) are enabled for acceleration, i.e., they are part of the HUB set.

3.4 Acceleration Coordination

The candidate code segments for acceleration are the lagging threads in the HULT set, one per large core, provided by the LTI unit, and the bottlenecks in the HUB set, whose acceleration has been enabled by the BI unit.

Lagging thread acceleration. Each lagging thread in the HULT set is assigned to run on a large core at the beginning of each quantum. The assignment is based on affinity to preserve cache locality, i.e., if a lagging thread will continue to be accelerated, it stays on the same large core, and threads newly added to the HULT set try to be assigned to a core that was running another thread from the same application.

Bottleneck acceleration. When a small core executes a BottleneckCall instruction, it checks whether or not the bottleneck is enabled for acceleration. To avoid accessing the global BT on every BottleneckCall, each small core includes a local Acceleration Index Table (AIT) that caches the bottleneck ID (bid), acceleration enable bit and assigned large core for each bottleneck.⁴ ⁵ If acceleration is disabled, the small core executes the bottleneck locally. If acceleration is enabled, the small core sends a bottleneck execution request to the assigned large core and stalls waiting for a response. The large core enqueues the request into a Scheduling Buffer (SB), which is a priority queue based on Utility.
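The Scheduling Buffer's ordering (highest Utility first, oldest instance first among equals) can be sketched with a binary heap; the field names here are invented for illustration:

```python
import heapq

class SchedulingBuffer:
    """Toy model of a large core's Scheduling Buffer: a priority queue of
    bottleneck execution requests, highest Utility first, FIFO (oldest
    instance first) among requests with equal Utility."""
    def __init__(self):
        self.heap = []
        self.arrival = 0  # monotonically increasing arrival order
    def enqueue(self, utility, bottleneck_id, small_core_id):
        # negate utility because heapq is a min-heap; arrival order breaks ties
        heapq.heappush(self.heap,
                       (-utility, self.arrival, bottleneck_id, small_core_id))
        self.arrival += 1
    def dequeue(self):
        _, _, bottleneck_id, small_core_id = heapq.heappop(self.heap)
        return bottleneck_id, small_core_id

sb = SchedulingBuffer()
sb.enqueue(0.10, "lock_A", small_core_id=3)
sb.enqueue(0.40, "lock_B", small_core_id=7)
sb.enqueue(0.40, "lock_B", small_core_id=9)
print(sb.dequeue())  # -> ('lock_B', 7): highest Utility, oldest instance first
```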
The oldest instance of the bottleneck with the highest Utility is executed by the large core until the BottleneckReturn instruction, at which point the large core sends a BottleneckDone signal to the small core. On receiving the BottleneckDone signal, the small core continues executing the instruction after the BottleneckCall.

The key idea of Algorithm 1, which controls each large core, is that each large core executes the assigned lagging thread as long as no bottleneck is migrated to it to be accelerated. Only bottlenecks with higher Utility than the BAUT (the smallest Utility among all accelerated lagging threads, i.e., the HULT set) are enabled to be accelerated. Therefore, it makes sense for those bottlenecks to preempt a lagging thread with lower Utility. The large core executes in one of two modes: accelerating the assigned lagging thread, or accelerating bottlenecks from its Scheduling Buffer (SB) when they show up. After no bottleneck shows up for 50K cycles, the assigned lagging thread is migrated back to the large core. This delay reduces the number of spurious lagging thread migrations, since bottlenecks like contended critical sections usually occur in bursts. The reasons to avoid finer-grained interleaving of lagging threads and bottlenecks are: 1) to reduce the impact of frequent migrations on cache locality and 2) to avoid excessive migration overhead. Both effects can significantly reduce or eliminate the benefit of acceleration.

⁴ Initially all bottlenecks are assigned to the large core running the lagging thread with minimum Utility (equal to the BAUT threshold for bottleneck acceleration), but they can be reassigned as we will explain in Section 3.5.4.

⁵ When a bottleneck is included in or excluded from the HUB set, the BT broadcasts the update to the AITs on all small cores.

Algorithm 1 Acceleration Coordination
while true do
    // execute a lagging thread
    migrate assigned lagging thread from small core
    while no bottleneck in SB do
        run assigned lagging thread
    end while
    migrate assigned lagging thread back to small core
    // execute bottlenecks until no bottleneck shows up for 50K cycles
    done_with_bottlenecks = false
    while not done_with_bottlenecks do
        while bottleneck in SB do
            dequeue from SB and run a bottleneck
        end while
        delay = 0
        while no bottleneck in SB and (delay < 50K cycles) do
            wait while incrementing delay
        end while
        done_with_bottlenecks = no bottleneck in SB
    end while
end while

3.5 Implementation Details

3.5.1 Estimation of L.

L is related to the speedup S due to running on a large core by:

L = Δt / t = (t − t/S) / t = 1 − 1/S

Any existing or future technique that estimates performance on a large core based on information collected while running on a small core can be used to estimate S. We use Performance Impact Estimation (PIE) [27], the latest of such techniques. PIE requires measuring total cycles per instruction (CPI), CPI due to memory accesses, and misses per instruction (MPI) while running on a small core. For code segments that are running on a large core, PIE also provides an estimate of the slowdown of running on a small core based on measurements on the large core. This estimation requires measuring CPI, MPI, the average dependency distance between a last-level cache miss and its consumer, and the fraction of instructions that are dependent on the previous instruction (because they would force execution of only one instruction per cycle in the 2-wide in-order small core). Instead of immediately using this estimation of performance on the small core while running on the large core, our implementation remembers the estimated speedup from the last time the code segment ran on a small core, because it is more effective to compare speedups obtained with the same technique.
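The relationship between L and the estimated speedup S reduces to a one-liner; the numbers below are illustrative, not measured:

```python
def local_acceleration(S):
    """L = delta_t / t = (t - t/S) / t = 1 - 1/S, where S is the estimated
    speedup of the code segment on the large core (e.g., a PIE-style estimate)."""
    return 1.0 - 1.0 / S

# A segment expected to run 2x faster on the large core saves half its cycles.
assert local_acceleration(2.0) == 0.5
# S < 1 (the large core would be slower, e.g., after losing cache locality)
# makes L negative: accelerating this segment would be harmful.
assert local_acceleration(0.8) < 0
```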
After five quanta we consider the old data to be stale and we switch to estimating the slowdown on a small core based on measurements on the large core. Each core collects data to compute L for its current thread and for up to two current bottlenecks, to allow tracking up to one level of nested bottlenecks. When a bottleneck finishes and executes a BottleneckReturn instruction, a message is sent to the Bottleneck Table in the regular BIS implementation. We include the data required to compute L for the bottleneck in this message without adding any extra overhead. Data required to compute L for lagging threads is sent to the Thread Table at the end of the scheduling quantum.

3.5.2 Estimation of R.

Since acceleration decisions are made at least once every scheduling quantum, the objective of each decision is to maximize Utility for one quantum at a time. Therefore, we estimate R (and Utility) only for the next quantum instead of for the whole run of the application, i.e., we use T = Q, the quantum length in cycles. During each quantum the hardware collects the number of active cycles, t_lastQ, for each thread and for each bottleneck, to use as an estimate of the code segment length t for the next quantum. To that end, each Bottleneck Table (BT) entry and each Thread Table (TT) entry include an active bit and a timestamp_active. In the BT, the active bit is set between BottleneckCall and BottleneckReturn instructions. In the TT, the active bit is only reset while executing a BottleneckWait instruction, i.e., while the thread is waiting. When the active bit is set, timestamp_active is set to the current time. When the active bit is reset, the active cycle count t_lastQ is incremented by the difference between the current time and timestamp_active. Lagging thread activity can be easily inferred: a running thread is always active, except while running a BottleneckWait instruction.

R_estimated = t_lastQ / Q    for lagging threads

Bottleneck activity is already reported to the Bottleneck Table for bookkeeping, off the critical path, after executing BottleneckCall and BottleneckReturn instructions. Therefore the active bit is set by BottleneckCall and reset by BottleneckReturn. Given that bottlenecks can suddenly become important, the Bottleneck Table also keeps an active cycle counter t_lastSubQ for the last subquantum, an interval equal to 1/8 of the quantum. Therefore,

R_estimated = max(t_lastQ / Q, t_lastSubQ / (Q/8))    for bottlenecks

3.5.3 Estimation of G.

The G factor measures the criticality of the code segment, i.e., how much of its acceleration is expected to reduce total execution time. Consequently, we estimate G for each type of code segment as follows:

Lagging threads. The criticality of lagging threads depends on the number of lagging threads in the application. If there are M lagging threads, all of them have to be evenly accelerated to remove the thread waiting they would cause before the next barrier or joining point.
That is, all M lagging threads are potential critical paths with similar lengths. Therefore, G_estimated = 1/M for each of the lagging threads. Amdahl's serial segments are part of the only thread that exists, i.e., a special case of lagging threads with M = 1. Therefore, each serial segment is on the critical path and has G = 1. Similarly, the last thread running for a barrier and the slowest stage of a pipelined program are also identified as single lagging threads and, therefore, have G = 1.

Critical sections. Not every contended critical section is on the critical path, and high contention is not a necessary condition for being on the critical path. Let's consider two cases. Figure 6(a) shows a critical section that is on the critical path (dashed line), even though there is never more than one thread waiting for it. All threads have to wait at some time for the critical section, which makes the critical path jump from thread to thread following the critical section segments. We consider strongly contended critical sections those that have been making all threads wait in the recent past, and estimate G = 1 for them. To identify strongly contended critical sections, each Bottleneck Table entry includes a recent_waiters bit vector with one bit per hardware context. This bit is set on executing BottleneckWait for the corresponding bottleneck. Each Bottleneck Table entry also keeps a moving average of the bottleneck length, avg_len. If there are N active threads in the application, recent_waiters is evaluated every N × avg_len cycles: if N bits are set, indicating that all active threads had to wait, the critical section is assumed to be strongly contended. Then, the number of ones in recent_waiters is stored in past_waiters (see the next paragraph) and recent_waiters is reset.

Figure 6: Types of critical sections: (a) strongly contended critical section (all threads wait for it); (b) weakly contended critical section (T2 never waits for it).

We call the critical sections that have not made all threads wait in the recent past weakly contended critical sections (see Figure 6(b)).
Accelerating an instance of a critical section accelerates not only the thread that is executing it but also every thread that is waiting for it. If we assume each thread has the same probability of being on the critical path, the probability of accelerating the critical path by accelerating the critical section is the fraction of threads that get accelerated, i.e., G = (W + 1)/N, when there are W waiting threads and a total of N threads. Since the current number of waiters W is very dynamic, we combine it with history (past_waiters from the previous paragraph). Therefore, we estimate G for weakly contended critical sections as

G_estimated = ( max(W, past_waiters) + 1 ) / N

False Serialization and Using Multiple Large Cores for Bottleneck Acceleration. Instances of different bottlenecks from the same or from different applications may be accelerated on the same large core. Therefore, a bottleneck may get falsely serialized, i.e., it may have to wait for too long on the Scheduling Buffer for another bottleneck with higher Utility. Bottlenecks that suffer false serialization can be reassigned to a different large core, as long as their Utility is higher than that of the lagging thread assigned to run on that large core. Otherwise, a bottleneck that is ready to run but does not have a large core to run on is sent back to its small core to avoid false serialization and potential starvation, as in [].

Reducing Large Core Waiting. While a lagging thread is executing on a large core it may start waiting for several reasons. First, if the thread starts waiting for a barrier or is about to exit or be de-scheduled, the thread is migrated back to its small core and is replaced with the lagging thread with the highest Utility that is not running on a large core. Second, if the lagging thread starts waiting for a critical section that is not being accelerated, there is a situation where a large core is waiting for a small core, which is inefficient. Instead, we save the context of the waiting thread on a shadow register alias table (RAT) and migrate the thread that is currently running the critical section from its small core to finish on the large core. Third, if the accelerated lagging thread wants to enter a critical section that is being accelerated on a different large core, it is migrated to the large core assigned to accelerate that critical section, to preserve shared data locality.

- Thread Table (TT). Purpose: to track threads, identify lagging threads and compute their Utility of Acceleration. Location and entry structure (field sizes in bits in parentheses): LTI unit, one entry per HW thread (52 in this example); each entry has 98 bits: tid(16), pid(16), is_lagging_thread(1), num_threads(8), timestamp_active(24), active(1), t_lastQ(16), Utility(16). Cost: 637 B.
- Bottleneck Table (BT). Purpose: to track bottlenecks [] and compute their Utility of Acceleration. Location and entry structure: one 32-entry table on the BI unit; each entry has 452 bits: bid(64), pid(16), executers(6), executer_vec(64), waiters(6), waiters_sb(6), large_core_id(2), PIE_data(7), timestamp_active(24), active(1), t_lastQ(16), t_lastSubQ(13), outg(24), avg_len(18), recent_waiters(64), past_waiters(6), Utility(16). Cost: 1808 B.
- Acceleration Index Tables (AIT). Purpose: to avoid accessing the BT to find whether a bottleneck is enabled for acceleration. Location and entry structure: one 32-entry table per small core; each entry has 67 bits: bid(64), enabled(1), large_core_id(2); each AIT has 268 bytes. Cost: 13.6 KB.
- Scheduling Buffers (SB). Purpose: to store and prioritize bottleneck execution requests on each large core. Location and entry structure: one 52-entry buffer per large core; each entry has 214 bits: bid(64), small_core_ID(6), target_PC(64), stack_pointer(64), Utility(16); each SB has 1391 bytes. Cost: 4.1 KB.
- Total: 20.1 KB.

Table 1: Hardware structures for UBA and their storage cost on an ACMP with 52 small cores and 3 large cores.

- Small core: 2-wide, 5-stage in-order, 4 GHz; 32 KB write-through, 1-cycle, 8-way, separate I and D L1 caches; 256 KB write-back, 6-cycle, 8-way, private unified L2 cache.
- Large core: 4-wide, 12-stage out-of-order, 128-entry ROB, 4 GHz; 32 KB write-through, 1-cycle, 8-way, separate I and D L1 caches; 1 MB write-back, 8-cycle, 8-way, private unified L2 cache.
- Cache coherence: MESI protocol, on-chip distributed directory, L2-to-L2 cache transfers allowed, 8K entries/bank, one bank per core.
- L3 cache: shared 8 MB, write-back, 16-way, 20-cycle.
- On-chip interconnect: bidirectional ring, 64-bit wide, 2-cycle hop latency.
- Off-chip memory bus: 64-bit wide, split-transaction, 40-cycle, pipelined bus at 1/4 of CPU frequency.
- Memory: 32-bank DRAM, modeling all queues and delays; row buffer hit/miss/conflict latencies = 25/50/75 ns.
- CMP configurations with area equivalent to N small cores: LC large cores, SC = N - 4*LC small cores.
  - ACMP [15, 16]: a large core always runs any single-threaded code. Max number of threads is SC + LC.
  - AGETS [13]: in each quantum, the large cores run the threads with more expected work to do. Max number of threads is SC + LC.
  - BIS []: the large cores run any single-threaded code and bottleneck code segments as proposed in []: 32-entry Bottleneck Table, each large core has an SC-entry Scheduling Buffer, each small core has a 32-entry Acceleration Index Table. Max number of threads is SC.
  - UBA: the large cores run the code segments with the highest Utility of Acceleration: BIS structures plus an SC-entry Thread Table. Max number of threads is SC.
Fourth, if acceleration of a critical section is enabled and there are threads waiting to enter that critical section on small cores, they are migrated to execute the critical section on the assigned large core. All these mechanisms are implemented as extensions of the behavior of the BottleneckCall and BottleneckWait instructions and use the information that is already in the Bottleneck Table.

Hardware Structures and Cost. Table 1 describes the hardware structures required by UBA and their storage cost for a 52-small-core, 3-large-core ACMP, which is only 20.1 KB. UBA does not substantially increase storage cost over BIS, since it only adds the Thread Table and requires minor changes to the Bottleneck Table.

Support for Software-based Scheduling. Software can directly specify lagging threads if it has better information than what is used by our hardware-based progress tracking. Software can also modify the quantum length Q depending on application characteristics (a larger Q means less migration overhead, but also less opportunity to accelerate many lagging threads from the same application between consecutive barriers). Finally, software must be able to specify priorities for different applications, which would become just an additional factor in the Utility metric. Our evaluation does not include these features, and exploring them is part of our future work.

Table 2: Baseline processor configuration.

4. EXPERIMENTAL METHODOLOGY

We use an x86 cycle-level simulator that models asymmetric CMPs with small in-order cores modeled after the Intel Pentium processor and large out-of-order cores modeled after the Intel Core 2 processor. Our simulator faithfully models all latencies and core-to-core communication, including those due to execution migration. Configuration details are shown in Table 2. We compare UBA to previous work, summarized in Table 3. Our comparison points for thread scheduling are based on two state-of-the-art proposals: Age-based Scheduling [13] (AGETS) and PIE [27]. We chose these baselines because we use similar metrics for progress and speedup estimation.
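As a compact summary of the mechanism described in Section 3, the following sketch ranks candidate code segments (lagging threads and bottlenecks) by an estimated Utility and assigns the large cores greedily. It is a hypothetical software model: the Utility combination shown (acceleration factor times R times G, scaled by an optional per-application priority as suggested above) is a simplified stand-in for the paper's exact metric.

```python
# Hypothetical sketch of utility-ranked large-core assignment; a software
# model, not the paper's hardware. The Utility product below is a
# simplified stand-in for the paper's exact metric.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    r: float               # R: expected active fraction of the next quantum
    g: float               # G: criticality factor (Section 3)
    accel: float           # stand-in for the speedup-dependent factor
    priority: float = 1.0  # optional software-specified application priority

    @property
    def utility(self) -> float:
        # Simplified combination of the speedup-dependent factor, R and G.
        return self.priority * self.accel * self.r * self.g

def assign_large_cores(candidates, num_large_cores):
    """Greedily run the highest-utility segments on the large cores;
    everything else stays on (or is sent back to) its small core,
    avoiding false serialization behind higher-utility bottlenecks."""
    ranked = sorted(candidates, key=lambda c: c.utility, reverse=True)
    return [c.name for c in ranked[:num_large_cores]]
```

A strongly contended critical section (G = 1) with high R naturally outranks one of many lagging threads (G = 1/M), which matches the intuition behind the G estimates above.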
Note that our baselines for multiple applications are aggressive extensions of previous proposals: AGETS combined with PIE to accelerate lagging threads, and an extension of BIS that dynamically shares all large cores among applications to accelerate any bottleneck based on relative thread waiting cycles. We evaluate 9 multithreaded workloads with a wide range of performance impact from bottlenecks, as shown in Table 4. Our 2-application workloads are composed of all combinations from the 10-application set including the 9 multithreaded applications plus the compute-intensive ft_nasp, which is run with one thread to have a mix of single-threaded and multithreaded applications. Our 4-application workloads are 50 randomly picked combinations of the same applications. We run all applications to completion. In the multiple-application experiments we run until the longest application finishes and, meanwhile, we restart any application that finishes early to continue producing interference and contention for all resources, including large cores. We measure execution time during the first run of each application. We run each application with the optimal number of threads found when running alone. When the sum of the optimal number of threads for all applications is greater than

the maximum number of threads, we reduce the number of threads for the application(s) whose performance is less sensitive to the number of threads.

Table 3: Experimental configurations.
- ACMP: the serial portion runs on a large core, the parallel portion runs on all cores [3, 15, 16].
- AGETS: the Age-based Scheduling algorithm for a single multithreaded application, as described in [13].
- AGETS+PIE: to compare to a reasonable baseline for thread scheduling of multiple applications, we use AGETS [13] to find the most lagging thread within each application. Then, we use PIE [27] to pick for each large core the thread that would get the largest speedup among the lagging threads from each application.
- BIS: the serial portion and bottlenecks run on the large cores, the parallel portion runs on small cores [].
- MA-BIS: to compare to a reasonable baseline, we extend BIS to multiple applications by sharing the large cores to accelerate bottlenecks from any application. To follow the key insights from BIS, we prioritize bottlenecks by thread waiting cycles normalized to the number of threads of each application, regardless of which application they belong to.
- UBA: our proposal.

The evaluated workloads (see Table 4) are:
- blacksch: BlackScholes option pricing [18]; input: 1M options; 1 bottleneck: final barrier after omp parallel.
- hist_ph: histogram of RGB components, Phoenix [19]; input: S (small); 1 bottleneck: critical sections (CS) on the map-reduce scheduler.
- iplookup: IP packet routing [28]; input: 2.5K queries; # bottlenecks = # threads: CS on routing tables.
- is_nasp: integer sort, NAS suite [4]; input: n = 64K; 1 bottleneck: CS on buffer of keys.
- mysql: MySQL server [1], SysBench [2]; input: OLTP-nontrx; 18 bottlenecks: CS on metadata, tables.
- pca_ph: principal components analysis, Phoenix [19]; input: S (small); 1 bottleneck: CS on the map-reduce scheduler.
- specjbb: Java business benchmark [22]; input: 5 seconds; 39 bottlenecks: CS on counters, warehouse data.
- tsp: traveling salesman [12]; input: 8 cities; 2 bottlenecks: CS on termination condition, solution.
- webcache: cooperative web cache [26]; input: 0K queries; 33 bottlenecks: CS on replacement policy.
- ft_nasp: FFT computation, NAS suite [4]; input: size = 32x32x32; run as a single-threaded application.
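The construction of the multi-application workload lists described above can be reproduced schematically. This is an illustrative sketch: the pairing policy (unordered combinations without repetition) and the fixed random seed are assumptions, not details confirmed by the text.

```python
# Hypothetical reconstruction of the multi-application workload lists.
# The 10-application pool = 9 multithreaded benchmarks + single-threaded
# ft_nasp; the pairing policy and seed are assumptions for illustration.
import itertools
import random

APPS = ["blacksch", "hist_ph", "iplookup", "is_nasp", "mysql",
        "pca_ph", "specjbb", "tsp", "webcache", "ft_nasp"]

# 2-application workloads: all unordered pairs of distinct applications.
two_app_workloads = list(itertools.combinations(APPS, 2))

# 4-application workloads: 50 randomly picked 4-application combinations.
rng = random.Random(0)  # fixed seed so the pick is reproducible
four_app_workloads = [tuple(rng.sample(APPS, 4)) for _ in range(50)]
```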
Unless otherwise indicated, we use the harmonic mean to compute all the averages in our evaluation. To measure system performance with multiple applications [9] we use the Harmonic mean of Speedups (Hspeedup) [14] and the Weighted Speedup (Wspeedup) [21], defined below for N applications. T_alone_i is the execution time when application i runs alone in the system and T_shared_i is the execution time measured when all applications are running. We also report Unfairness [17], as defined below.

Hspeedup = N / ( sum_{i=0..N-1} T_shared_i / T_alone_i )

Wspeedup = sum_{i=0..N-1} T_alone_i / T_shared_i

Unfairness = max_i ( T_alone_i / T_shared_i ) / min_i ( T_alone_i / T_shared_i )

Table 4: Evaluated workloads.

5. EVALUATION

5.1 Single Application

We carefully choose the number of threads to run each application with, because that number significantly affects the performance of multithreaded applications. We evaluate two situations: (1) a number of threads equal to the number of available hardware contexts, i.e., the maximum number of threads, which is a common practice for running non-I/O-intensive applications; and (2) the optimal number of threads, i.e., the number of threads that minimizes execution time, which we find with an exhaustive search for each application on each configuration.

Table 5 shows the average speedups of UBA over the other mechanisms for different ACMP configurations. UBA performs better than the other mechanisms on every configuration, except for multiple large cores on a 16-core area budget. BIS and UBA dedicate the large cores to accelerating code segments, unlike ACMP and AGETS. Therefore, the maximum number of threads that can run under BIS and UBA is smaller. With an area budget of 16, BIS and UBA cannot overcome the loss of parallel throughput due to running significantly fewer threads. For example, with 3 large cores, AGETS and ACMP can execute applications with up to 7 threads (4 on small cores and 3 on large cores), while BIS and UBA can execute a maximum of 4 threads. Overall, the benefit of UBA increases with the area budget and the number of large cores.

Table 5: Average speedup (%) of UBA over ACMP, AGETS and BIS, for the optimal and the maximum number of threads at each area budget and large-core count.

5.1.1 Single-Large-Core ACMP

Figure 7 shows the speedup of AGETS, BIS and UBA over ACMP, which accelerates only the Amdahl's serial bottleneck. Each application runs with its optimal number of threads for each configuration. We show results for 16, 32 and 64-small-core area budgets and a single large core. On average, our proposal improves performance over ACMP by 8%/15%/16%, over AGETS by 0.2%/7.5%/7.3% and over BIS by 9%/8%/7% for area budgets of 16/32/64 small cores. We make three observations.

First, as the number of cores increases, AGETS, BIS and UBA provide higher average performance improvement over ACMP. The performance improvement of UBA over AGETS increases with the number of cores because, unlike AGETS, UBA can accelerate bottlenecks, which have an increasingly larger impact on performance as the number of cores increases (as long as the number of threads increases). However, the performance improvement of UBA over BIS slightly decreases with a higher number of cores. The reason is that the benefit of accelerating lagging threads in addition to bottlenecks gets smaller for some benchmarks as the number of threads increases, depending on the actual amount of thread imbalance that UBA can eliminate. Since BIS and UBA dedicate the large core to accelerating code segments, they can run one fewer thread than ACMP and AGETS. With an area budget of 16, the impact of running one fewer thread is significant for BIS and UBA, but UBA is able to overcome

that disadvantage with respect to ACMP and AGETS by accelerating both bottlenecks and lagging threads.

Second, as the number of threads increases, iplookup, is_nasp, mysql, tsp and webcache become limited by contended critical sections and significantly benefit from UBA. UBA improves performance more than AGETS and BIS because it is able to accelerate both lagging threads and bottlenecks. Hist_ph and pca_ph are MapReduce applications with no significant contention for critical sections, where UBA improves performance over AGETS because its shorter scheduling quantum and lower-overhead hardware-managed thread migration accelerate all three parallel portions (map, reduce and merge) more efficiently.

Third, blacksch is a very scalable workload with neither significant bottlenecks nor significant thread imbalance, which is the worst-case scenario for all the evaluated mechanisms. Therefore, AGETS produces the best performance because it accelerates all threads in round-robin order and it can run one more thread than BIS or UBA, which dedicate the large core to acceleration. However, the performance benefit for blacksch decreases as the number of cores (and threads) increases because the large core is multiplexed among all threads, resulting in less acceleration of each thread and a smaller impact on performance. Note that in a set of symmetric threads, execution time is reduced only by the minimum amount of time that is saved from any thread, which requires accelerating all threads evenly. UBA efficiently accelerates all threads, similarly to AGETS, but is penalized by having to run with one fewer thread.

Figure 7: Speedup for the optimal number of threads, normalized to ACMP (%). (a) Area budget = 16 small cores; (b) area budget = 32 small cores; (c) area budget = 64 small cores. Per-benchmark bars (blacksch, hist_ph, iplookup, is_nasp, pca_ph, mysql, specjbb, tsp, webcache) with harmonic mean (hmean).
We conclude that UBA improves the performance of applications that have lagging threads, bottlenecks or both by a larger amount than AGETS, a previous proposal that accelerates only lagging threads, and BIS, a previous proposal that accelerates only bottlenecks.

5.1.2 Multiple-Large-Core ACMP

Figure 8 shows the average speedups across all workloads on the different configurations with the same area budgets, running with their optimal number of threads. The main observation is that replacing small cores with large cores on a small area budget (16 cores, Figure 8(a)) has a very large negative impact on performance due to the loss of parallel throughput. With an area budget of 32 (Figure 8(b)), UBA performs about the same with 1, 2 or 3 large cores, but still provides the best overall performance of all three mechanisms. With an area budget of 64 (Figure 8(c)) there is no loss of parallel throughput, except for blacksch. Therefore, both BIS and UBA can take advantage of more large cores. However, the average performance of UBA improves more significantly with additional large cores. According to per-benchmark data not shown due to space limitations, the main reason for this improvement is that iplookup, mysql and webcache benefit from additional large cores because UBA is able to concurrently accelerate the most important critical sections and lagging threads. We conclude that as the area budget increases, UBA becomes more effective than ACMP, AGETS and BIS in taking advantage of additional large cores.

Figure 8: Average speedups with multiple large cores (1LC, 2LC, 3LC), normalized to ACMP with 1 large core. (a) Area = 16; (b) Area = 32; (c) Area = 64.
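Before turning to multiple applications, the Hspeedup, Wspeedup and Unfairness metrics defined in Section 4 can be transcribed directly; T_alone[i] and T_shared[i] are the execution times of application i running alone and with co-runners, respectively.

```python
# Plain-Python transcription of the multi-application metrics from Section 4.

def hspeedup(t_alone, t_shared):
    """Harmonic mean of per-application speedups: N / sum(T_shared/T_alone)."""
    n = len(t_alone)
    return n / sum(ts / ta for ta, ts in zip(t_alone, t_shared))

def wspeedup(t_alone, t_shared):
    """Weighted speedup: sum of per-application speedups T_alone/T_shared."""
    return sum(ta / ts for ta, ts in zip(t_alone, t_shared))

def unfairness(t_alone, t_shared):
    """Ratio of the largest to the smallest per-application speedup."""
    speedups = [ta / ts for ta, ts in zip(t_alone, t_shared)]
    return max(speedups) / min(speedups)
```

For example, two applications slowed down by 2x and 4x under sharing have Wspeedup 0.75, Hspeedup 1/3 and Unfairness 2.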
5.2 Multiple Applications

Figure 9 shows the sorted harmonic speedups of the application workloads with each mechanism: the extensions of previous proposals to multiple applications (AGETS+PIE and MA-BIS) and UBA, normalized to the harmonic speedup of ACMP, with an area budget of 64 small cores. Performance is generally better for UBA than for the other mechanisms, and the difference increases with additional large cores. Results for 2-application and 4-application workloads with an area budget of 128 small cores show a similar pattern and are not shown in detail due to space limitations, but we show the averages. Tables 6 and 7 show the improvement in average weighted speedup, harmonic speedup and unfairness with UBA over the other mechanisms. Average Wspeedup and Hspeedup for UBA are better than for the other mechanisms on all configurations. Unfairness is also reduced, except for three cases with 4 applications. UBA's unfairness measured as maximum slowdown [7] is even lower than with the reported metric (by an average -4.3% for 2 ap-


More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems:

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems: Speed/RAP/CODA Presented by Octav Chpara Real-tme Systems Many wreless sensor network applcatons requre real-tme support Survellance and trackng Border patrol Fre fghtng Real-tme systems: Hard real-tme:

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping Mxed-Crtcalty Schedulng on Multprocessors usng Task Groupng Jankang Ren Lnh Th Xuan Phan School of Software Technology, Dalan Unversty of Technology, Chna Computer and Informaton Scence Department, Unversty

More information

Space-Optimal, Wait-Free Real-Time Synchronization

Space-Optimal, Wait-Free Real-Time Synchronization 1 Space-Optmal, Wat-Free Real-Tme Synchronzaton Hyeonjoong Cho, Bnoy Ravndran ECE Dept., Vrgna Tech Blacksburg, VA 24061, USA {hjcho,bnoy}@vt.edu E. Douglas Jensen The MITRE Corporaton Bedford, MA 01730,

More information

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems S. J and D. Shn: An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems 2355 An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems Seunggu J and Dongkun Shn, Member,

More information

Maintaining temporal validity of real-time data on non-continuously executing resources

Maintaining temporal validity of real-time data on non-continuously executing resources Mantanng temporal valdty of real-tme data on non-contnuously executng resources Tan Ba, Hong Lu and Juan Yang Hunan Insttute of Scence and Technology, College of Computer Scence, 44, Yueyang, Chna Wuhan

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals Agenda & Readng COMPSCI 8 SC Applcatons Programmng Programmng Fundamentals Control Flow Agenda: Decsonmakng statements: Smple If, Ifelse, nested felse, Select Case s Whle, DoWhle/Untl, For, For Each, Nested

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

CS 268: Lecture 8 Router Support for Congestion Control

CS 268: Lecture 8 Router Support for Congestion Control CS 268: Lecture 8 Router Support for Congeston Control Ion Stoca Computer Scence Dvson Department of Electrcal Engneerng and Computer Scences Unversty of Calforna, Berkeley Berkeley, CA 9472-1776 Router

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing EDA222/DIT161 Real-Tme Systems, Chalmers/GU, 2014/2015 Lecture #8 Real-Tme Systems Real-Tme Systems Lecture #8 Specfcaton Professor Jan Jonsson Implementaton System models Executon-tme analyss Department

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Burst Round Robin as a Proportional-Share Scheduling Algorithm

Burst Round Robin as a Proportional-Share Scheduling Algorithm Burst Round Robn as a Proportonal-Share Schedulng Algorthm Tarek Helmy * Abdelkader Dekdouk ** * College of Computer Scence & Engneerng, Kng Fahd Unversty of Petroleum and Mnerals, Dhahran 31261, Saud

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Performance Evaluation

Performance Evaluation Performance Evaluaton [Ch. ] What s performance? of a car? of a car wash? of a TV? How should we measure the performance of a computer? The response tme (or wall-clock tme) t takes to complete a task?

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies Appears n the Proceedngs of the 51st Annual IEEE/ACM Internatonal Symposum on Mcroarchtecture (MICRO), 218 Adaptve Schedulng for Systems wth Asymmetrc Memory Herarches Po-An Tsa, Changpng Chen, Danel Sanchez

More information

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management.

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management. //7 Prnceton Unversty Computer Scence 7: Introducton to Programmng Systems Goals of ths Lecture Storage Management Help you learn about: Localty and cachng Typcal storage herarchy Vrtual memory How the

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Multitasking and Real-time Scheduling

Multitasking and Real-time Scheduling Multtaskng and Real-tme Schedulng EE8205: Embedded Computer Systems http://www.ee.ryerson.ca/~courses/ee8205/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrcal and Computer Engneerng Ryerson Unversty

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Scheduling and queue management. DigiComm II

Scheduling and queue management. DigiComm II Schedulng and queue management Tradtonal queung behavour n routers Data transfer: datagrams: ndvdual packets no recognton of flows connectonless: no sgnallng Forwardng: based on per-datagram forwardng

More information

VFH*: Local Obstacle Avoidance with Look-Ahead Verification

VFH*: Local Obstacle Avoidance with Look-Ahead Verification 2000 IEEE Internatonal Conference on Robotcs and Automaton, San Francsco, CA, Aprl 24-28, 2000, pp. 2505-25 VFH*: Local Obstacle Avodance wth Look-Ahead Verfcaton Iwan Ulrch and Johann Borensten The Unversty

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

A Holistic View of Stream Partitioning Costs

A Holistic View of Stream Partitioning Costs A Holstc Vew of Stream Parttonng Costs Nkos R. Katspoulaks, Alexandros Labrnds, Panos K. Chrysanths Unversty of Pttsburgh Pttsburgh, Pennsylvana, USA {katsp, labrnd, panos}@cs.ptt.edu ABSTRACT Stream processng

More information

WITH rapid improvements of wireless technologies,

WITH rapid improvements of wireless technologies, JOURNAL OF SYSTEMS ARCHITECTURE, SPECIAL ISSUE: HIGHLY-RELIABLE CPS, VOL. 00, NO. 0, MONTH 013 1 Adaptve GTS Allocaton n IEEE 80.15.4 for Real-Tme Wreless Sensor Networks Feng Xa, Ruonan Hao, Je L, Naxue

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden Optmzng for Speed Er Hagersten Uppsala Unversty, Sweden eh@t.uu.se What s the potental gan? Latency dfference L$ and mem: ~5x Bandwdth dfference L$ and mem: ~x Repeated TLB msses adds a factor ~-3x Execute

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

Shared Running Buffer Based Proxy Caching of Streaming Sessions

Shared Running Buffer Based Proxy Caching of Streaming Sessions Shared Runnng Buffer Based Proxy Cachng of Streamng Sessons Songqng Chen, Bo Shen, Yong Yan, Sujoy Basu Moble and Meda Systems Laboratory HP Laboratores Palo Alto HPL-23-47 March th, 23* E-mal: sqchen@cs.wm.edu,

More information

THE low-density parity-check (LDPC) code is getting

THE low-density parity-check (LDPC) code is getting Implementng the NASA Deep Space LDPC Codes for Defense Applcatons Wley H. Zhao, Jeffrey P. Long 1 Abstract Selected codes from, and extended from, the NASA s deep space low-densty party-check (LDPC) codes

More information

Self-tuning Histograms: Building Histograms Without Looking at Data

Self-tuning Histograms: Building Histograms Without Looking at Data Self-tunng Hstograms: Buldng Hstograms Wthout Lookng at Data Ashraf Aboulnaga Computer Scences Department Unversty of Wsconsn - Madson ashraf@cs.wsc.edu Surajt Chaudhur Mcrosoft Research surajtc@mcrosoft.com

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms On Achevng Farness n the Jont Allocaton of Buffer and Bandwdth Resources: Prncples and Algorthms Yunka Zhou and Harsh Sethu (correspondng author) Abstract Farness n network traffc management can mprove

More information

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation Loop Transformatons for Parallelsm & Localty Last week Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Scalar expanson for removng false dependences Loop nterchange Loop

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan, Praveen Yalagandula, Mke Dahln, Yn Zhang Mcrosoft Research HP Labs The Unversty of Texas at Austn Abstract We present, a self-tunng,

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks www.hurray.sep.pp.pt Techncal Report -GAME: An Implct GTS Allocaton Mechansm n IEEE 802.15.4 for Tme- Senstve Wreless Sensor etworks Ans Koubaa Máro Alves Eduardo Tovar TR-060706 Verson: 1.0 Date: Jul

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information