Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs


Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs

José A. Joao (ECE Department, The University of Texas at Austin, Austin, TX, USA), M. Aater Suleman (Flux7 Consulting, Austin, TX, USA), Onur Mutlu (Computer Architecture Laboratory, Carnegie Mellon University, Pittsburgh, PA, USA), Yale N. Patt (ECE Department, The University of Texas at Austin, Austin, TX, USA)

ABSTRACT

Asymmetric Chip Multiprocessors (ACMPs) are becoming a reality. ACMPs can speed up parallel applications if they can identify and accelerate code segments that are critical for performance. Proposals already exist for using coarse-grained thread scheduling and fine-grained bottleneck acceleration. Unfortunately, there have been no proposals offered thus far to decide which code segments to accelerate in cases where both coarse-grained thread scheduling and fine-grained bottleneck acceleration could have value. This paper proposes Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs (UBA), a cooperative software/hardware mechanism for identifying and accelerating the most likely critical code segments from a set of multithreaded applications running on an ACMP. The key idea is a new Utility of Acceleration metric that quantifies the performance benefit of accelerating a bottleneck or a thread by taking into account both the criticality and the expected speedup. UBA outperforms the best of two state-of-the-art mechanisms by 11% for single-application workloads and by 7% for two-application workloads on an ACMP with 52 small cores and 3 large cores.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)

General Terms: Design, Performance

Keywords: Multithreaded applications, critical sections, barriers, multicore, asymmetric CMPs, heterogeneous CMPs

1. INTRODUCTION

Parallel applications are partitioned into threads that can execute concurrently on multiple cores. Speedup is often limited when some threads are prevented from doing useful work concurrently because they have to wait for other code segments to finish.
Asymmetric Chip Multi-Processors (ACMPs) with one or few large, fast cores and many small, energy-efficient cores have been proposed for accelerating the most performance-critical code segments, which can lead to significant performance gains. However, this approach has heretofore had at least two fundamental limitations:

1. The problem of accelerating only one type of code segment. There are two types of code segments that can become performance limiters: (1) threads that take longer to execute than other threads because of load imbalance or microarchitectural mishaps such as cache misses, and (2) code segments, like contended critical sections, that make other threads wait. We call threads of the first type lagging threads. They increase execution time since the program cannot complete until all its threads have finished execution. Code segments of the second type reduce parallelism and can potentially become the critical path of the application. Joao et al. [] call these code segments bottlenecks. Prior work accelerates either lagging threads [6, 5, 13] or bottlenecks [24, ], but not both. Thus, these proposals benefit only the applications whose performance is limited by the type of code segments that they are designed to accelerate. Note that there is overlap between lagging threads and bottlenecks: lagging threads, if left alone, can eventually make other threads wait and become bottlenecks. However, the goal of the proposals that accelerate lagging threads is to try to prevent them from becoming bottlenecks. Real applications often have both bottlenecks and lagging threads. Previous acceleration mechanisms prove suboptimal in this case.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISCA '13, Tel-Aviv, Israel. Copyright 2013 ACM.
Bottleneck Identification and Scheduling (BIS) [] does not identify lagging threads early enough and as a result it does not always accelerate the program's critical execution path. Similarly, lagging thread acceleration mechanisms do not accelerate consecutive instances of the same critical section that execute on different threads and as a result can miss the opportunity to accelerate the program's critical execution path. Combining bottleneck and lagging thread acceleration mechanisms is non-trivial because the combined mechanism must predict the relative benefit of accelerating bottlenecks and lagging threads. Note that this benefit depends on the input set and program phase, as well as the underlying machine. Thus a static solution would likely not work well. While the existing acceleration mechanisms are dynamic, they use different metrics to identify good candidates for acceleration; thus, their outputs cannot be compared directly to decide which code segments to accelerate.

2. The problem of not handling multiple multithreaded applications. In practice, an ACMP can be expected to run multiple multithreaded applications. Each application will have lagging threads and bottlenecks that benefit differently from acceleration. The previous work on bottleneck acceleration [24, ] and lagging thread acceleration [13] does not deal with multiple multithreaded applications, making their use limited in practical systems.

To make ACMPs more effective, we propose Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs (UBA). UBA is a general cooperative software/hardware mechanism to identify the most important code segments from one or multiple applications to accelerate on an ACMP to improve system performance. UBA introduces a new Utility of Acceleration metric for each code segment, either from a lagging thread or a bottleneck, which is used to decide which code segments to run on the large cores of the ACMP. The key idea of the utility metric is to consider both the acceleration expected from running on a large core and the criticality of the code segment for its application as a whole. Therefore, this metric is effective in making acceleration decisions for both single- and multiple-application cases. UBA also builds on and extends previous proposals to identify potential bottlenecks [] and lagging threads [13]. This paper makes three main contributions:

1. It introduces a new Utility of Acceleration metric that combines a measure of the acceleration that each code segment achieves with a measure of the criticality of each code segment. This metric enables meaningful comparisons to decide which code segments to accelerate regardless of the segment type. We implement the metric in the context of an ACMP where acceleration is performed with large cores, but the metric is general enough to be used with other acceleration mechanisms, e.g., frequency scaling.

2. It provides the first mechanism that can accelerate both bottlenecks and lagging threads from a single multithreaded application, using faster cores. It can also leverage ACMPs with any number of large cores.

3. It is the first work that accelerates bottlenecks in addition to lagging threads from multiple multithreaded applications.
We evaluate UBA on single- and multiple-application scenarios on a variety of ACMP configurations, running a set of workloads that includes both bottleneck-intensive and non-bottleneck-intensive applications. For example, on a 52-small-core, 3-large-core ACMP, UBA improves average performance of 9 multithreaded applications by 11% over the best of previous proposals that accelerate only lagging threads [13] or only bottlenecks []. On the same ACMP configuration, UBA improves average harmonic speedup of 2-application workloads by 7% over our aggressive extensions of previous proposals to accelerate multiple applications. Overall, we find that UBA significantly improves performance over previous work and its performance benefit generally increases with larger area budgets and additional large cores.

2. MOTIVATION

2.1 Bottlenecks

Joao et al. [] defined bottleneck as any code segment that makes other threads wait. Bottlenecks reduce the amount of thread-level parallelism (TLP); therefore, a program running with significant bottlenecks can lose some or even all of the potential speedup from parallelization. Inter-thread synchronization mechanisms can create bottlenecks, e.g., contended critical sections, the last thread arriving at a barrier, and the slowest stage of a pipeline-parallel program. Figure 1 shows four threads executing non-critical-section segments (Non-CS) and a critical section CS (in gray). A critical section enforces mutual exclusion: only one thread can execute the critical section at a given time, making any other threads wanting to execute the same critical section wait, which reduces the amount of useful work that can be done in parallel.

Figure 1: Example of a critical section.

The state-of-the-art in bottleneck acceleration on an ACMP is BIS [], which consists of software-based annotation of potential bottlenecks and hardware-based tracking of thread waiting cycles, i.e., the number of cycles threads waited for each bottleneck. Then, BIS accelerates the bottlenecks that are responsible for the most thread waiting cycles.
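For illustration only, BIS-style ranking by thread waiting cycles can be sketched as a small software model. This is a hypothetical sketch, not the hardware implementation; all names are invented:

```python
# Toy software model of BIS-style bottleneck ranking (illustrative only).
# Every cycle a thread spends waiting on a bottleneck charges one waiting
# cycle to that bottleneck; the top-ranked bottleneck is accelerated.

class BottleneckTable:
    def __init__(self):
        self.waiting_cycles = {}  # bottleneck id -> accumulated thread waiting cycles

    def record_wait(self, bottleneck_id, num_waiters, cycles):
        # num_waiters threads each waited `cycles` cycles on this bottleneck
        self.waiting_cycles[bottleneck_id] = (
            self.waiting_cycles.get(bottleneck_id, 0) + num_waiters * cycles)

    def most_critical(self):
        # the bottleneck responsible for the most thread waiting cycles
        return max(self.waiting_cycles, key=self.waiting_cycles.get)

bt = BottleneckTable()
bt.record_wait("lock_A", num_waiters=3, cycles=1000)  # three waiters at once
bt.record_wait("lock_B", num_waiters=1, cycles=2000)
print(bt.most_critical())  # -> lock_A
```

Note that this metric only grows after waiting has already occurred, which is exactly the limitation discussed next.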
BIS is effective in accelerating critical sections that limit performance at different times. However, it accelerates threads arriving last at a barrier and slow stages of a pipeline-parallel program only after they have started making other threads wait, i.e., after accumulating a minimum number of thread waiting cycles. If BIS could start accelerating such lagging threads earlier, it could remove more thread waiting and further reduce execution time.

2.2 Lagging Threads

A parallel application is composed of groups of threads that split work and eventually either synchronize at a barrier, or finish and join. The thread that takes the most time to execute in a thread group determines the execution time of the entire group, and we call that thread a lagging thread. Thread imbalance can appear at run time for multiple reasons, e.g., different memory behavior that makes some threads suffer from higher average memory latency, and different contention for critical sections that makes some threads wait longer. Figure 2 shows execution of four threads over time. Thread T2 becomes a lagging thread as soon as it starts making slower progress than the other threads towards reaching the barrier at t2. Note that at t1, T2 becomes the last thread running for the barrier and becomes a bottleneck. Therefore, lagging threads are potential future bottlenecks, i.e., they become bottlenecks if thread imbalance is not corrected in time. Also note that if there are multiple threads with approximately as much remaining work to do as the most lagging thread, all of them need to be accelerated to actually reduce total execution time. Therefore, all those threads have to be considered lagging threads.

Figure 2: Example of a lagging thread (T2).

The state-of-the-art in acceleration of lagging threads are proposals that identify a lagging thread by tracking either thread progress [6, 13] or reasons for a thread to get delayed [5]. Meeting Points [6] tracks the threads that are lagging in reaching a barrier by counting the number of loop iterations that have been completed. Thread Criticality Predictors [5] predict that the threads that suffer from more cache misses will be delayed and will become critical. Age-based Scheduling [13] accelerates the thread that is predicted or profiled to have more remaining work until the next barrier or the program exit, measured in terms of committed instructions. Once a lagging thread is identified, it can be accelerated on a large core of an ACMP.

2.3 Applications have both Lagging Threads and Bottlenecks

Joao et al. [] showed that different bottlenecks can limit performance at different times. In particular, contention for different critical sections can be very dynamic. It is not evident upfront whether accelerating a critical section or a lagging thread leads to better performance. Therefore, it is fundamentally important to dynamically identify the code segments, either bottlenecks or lagging threads, that have to be accelerated at any given time.

2.4 Multiple Applications

Figure 3(a) shows two 4-thread applications running on small cores of an ACMP with a single large core. Let's assume that at time t1 the system has to decide which thread to accelerate on the large core to maximize system performance. With knowledge of the progress each thread has made towards reaching the next barrier, the system can determine that App1 has one lagging thread, which has significantly more remaining work to do than App1's other threads, and that App2 has two lagging threads, one of them T2, both of which have significantly more work to do than App2's other threads. Accelerating the lagging thread from App1 would directly reduce App1's execution time by some Δt, while accelerating either lagging thread from App2 alone would not significantly reduce App2's execution time. It is necessary to accelerate one of them during one quantum and the other during another quantum to reduce App2's execution time by a similar Δt, assuming the speedups for all threads on the large core are similar. Therefore, system performance will increase more by accelerating the lagging thread from App1.
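The progress-based reasoning in this example, which UBA's Lagging Thread Identification unit formalizes in Section 3.2, can be sketched as follows; `delta_p` stands in for the ΔP progress gap defined there, and the thread names and numbers are illustrative:

```python
def find_lagging_threads(progress, delta_p):
    """Return the ids of all lagging threads.

    progress: committed-instruction count per thread since the last barrier.
    delta_p:  extra progress an accelerated thread is expected to make in one
              quantum; any thread within delta_p of the minimum-progress
              thread would also need acceleration, so it is lagging too.
    """
    min_p = min(progress.values())
    return {tid for tid, p in progress.items() if p - min_p < delta_p}

# Four threads racing toward a barrier: T1 is furthest behind, and T2/T3
# are close enough behind that they would become the new laggards if only
# T1 were accelerated.
progress = {"T1": 1000, "T2": 1400, "T3": 1600, "T4": 5000}
print(find_lagging_threads(progress, delta_p=1000))  # T1, T2 and T3 are lagging
```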
Figure 3: Examples of lagging threads and critical sections: (a) lagging threads; (b) lagging threads and critical sections.

Figure 3(b) shows the same App1 from the previous example and an App3 with a strongly-contended critical section (in gray). Every thread from App3 has to wait to execute the critical section at some time, and the critical section is clearly on the critical path of execution (there is always one thread executing the critical section and at least one thread waiting for it). At t1, App1 has a single lagging thread, which is App1's critical path. Therefore, every cycle saved by accelerating App1's lagging thread would directly reduce App1's execution time. Similarly, every cycle saved by accelerating instances of the critical section from App3 on any of its threads would directly reduce App3's execution time. Ideally, the system should dynamically accelerate the code segment that gets a higher speedup from the large core, either a segment of the lagging thread from App1 or a sequence of instances of the critical section from App3. These two examples illustrate that acceleration decisions need to consider both the criticality of code segments and how much speedup they get from the large core. Our goal is to design a mechanism that decides which code segments, either from lagging threads or bottlenecks, to run on the available large cores of an ACMP, to improve system performance in the presence of a single multithreaded application or multiple multithreaded applications.

3. UTILITY-BASED ACCELERATION (UBA)

The core of UBA is a new Utility of Acceleration metric that is used to decide which code segments to accelerate at any given time. Utility combines an estimation of the acceleration that each code segment can achieve on a large core, and an estimation of the criticality of the code segment. Figure 4 shows the three main components of UBA: the Lagging Thread Identification unit, the Bottleneck Identification unit, and the Acceleration Coordination unit.
Every scheduling quantum (1M cycles in our experiments), the Lagging Thread Identification (LTI) unit produces the set of Highest-Utility Lagging Threads (HULT), one for each large core, and the Bottleneck Acceleration Utility Threshold (BAUT). Meanwhile, the Bottleneck Identification (BI) unit computes the Utility of accelerating each of the most important bottlenecks and identifies those with Utility greater than the BAUT, which we call Highest-Utility Bottlenecks (HUB). Only these bottlenecks are enabled for acceleration. Finally, the Acceleration Coordination (AC) unit decides which code segments to run on each large core, either a thread from the HULT set, or bottlenecks from the HUB set.

Figure 4: Block diagram of UBA.

3.1 Utility of Acceleration

We define the Utility of Accelerating a code segment c as the reduction in the application's execution time due to acceleration of c, relative to the application's execution time before acceleration. Formally,

U_c = ΔT / T

where ΔT is the reduction in the entire application's execution time and T is the original execution time of the entire application. If code segment c of length t cycles is accelerated by Δt, then after multiplying and dividing by Δt and t, the Utility of accelerating c can be rewritten as:

U_c = ΔT / T = (Δt / t) · (t / T) · (ΔT / Δt) = L · R · G

L: The first factor is the Local Acceleration, which is the reduction in the execution time of solely the code segment c due to running on a large core, divided by the original execution time of c on the small core:

L = Δt / t

L depends on the net speedup that running on a large core can provide for the code segment, which is a necessary condition to improve the application's performance: if L is close to zero or negative, running on the large core is not useful and can be harmful.

R: The second factor is the Relevance of Acceleration, which measures how important code segment c is for the application as a whole: R is the execution time (in cycles) of c on the small core divided by the application's execution time (in cycles) before acceleration:

R = t / T

R limits the overall speedup that can be obtained by accelerating a single code segment. For example, let's assume two equally long serial bottlenecks from two different applications start at the same time and can be accelerated with the same L factor. One runs for 50% of its application's execution time, while the other runs for only 1%. Obviously, accelerating the first one is a much more effective use of the large core to improve system performance.

G: The third factor is the Global Effect of Acceleration, which represents how much of the code segment's acceleration Δt translates into a reduction in execution time ΔT:

G = ΔT / Δt

G depends on the criticality of code segment c: if c is on the critical path, G = 1; otherwise G = 0. In reasonably symmetric applications, multiple threads may arrive at the next barrier at about the same time and all of them must be accelerated to reduce the application's execution time, which makes each of the threads partially critical (0 < G < 1). We explain how we estimate the factors L, R and G in Sections 3.5.1, 3.5.2 and 3.5.3, respectively.

3.2 Lagging Thread Identification

The set of Highest-Utility Lagging Threads (HULT) is produced every scheduling quantum (i.e., Q cycles) by the LTI unit with the following steps:

1. Identify lagging threads.
We use the same notion of progress between consecutive synchronization points as in [13]; i.e., we assume approximately the same number of instructions are expected to be committed by each thread between consecutive barriers or synchronization points, and use instruction count as a metric of thread progress.¹ A committed instruction counter progress is kept as part of each hardware context. After thread creation or when restarting the threads after a barrier, progress is reset with a simple command ResetProgress, implemented as a store to a reserved memory location.

¹ Note that we are not arguing for instruction count as the best progress metric. In general, the best progress metric is application dependent and we envision a mechanism that lets the software define which progress metric to use for each application, including [6, 13, 5], or even each application periodically reporting how much progress each thread is making. UBA can be easily extended to use any progress metric.

Figure 5: Lagging thread identification.

Figure 5 shows the progress of several threads from the same application that are running on small cores. Thread T1 has made the smallest progress minP and is a lagging thread. Let's assume that if thread T1 is accelerated during the next scheduling quantum, it will make ΔP more progress than the other non-accelerated threads. Therefore, T1 will leave behind threads T2 and T3, which then will also have to be accelerated to fully take advantage of the acceleration of T1. Therefore, we consider the initial set of lagging threads to be {T1, T2, T3}. We estimate ΔP as AvgDeltaIPC × Q, where AvgDeltaIPC is the difference between average IPC on the large cores and average IPC on the small cores across all threads in the application, measured over five quanta.

2. Compute the Utility of Acceleration for each lagging thread. We explain how UBA estimates the factors L, R and G in Sections 3.5.1, 3.5.2 and 3.5.3, respectively.

3. Find the Highest-Utility Lagging Thread (HULT) set. The HULT set consists of the lagging threads with the highest Utility.
The size of the set is equal to the number of large cores. Lagging threads from the same application whose Utilities fall within a small range of each other² are considered to have the same Utility and are sorted by their progress instruction count (highest rank for lower progress) to improve fairness and reduce the impact of inaccuracies in Utility computations.

4. Determine the Bottleneck Acceleration Utility Threshold (BAUT). As long as no bottleneck has higher Utility than any lagging thread in the HULT set, no bottleneck should be accelerated. Therefore, the BAUT is simply the smallest Utility among the threads in the HULT set.

To keep track of the relevant characteristics of each thread, the LTI unit includes a Thread Table with as many entries as hardware contexts, indexed by software thread ID (tid).

3.3 Bottleneck Identification

The Highest-Utility Bottleneck set is continuously produced by the Bottleneck Identification (BI) unit. The BI unit is implemented similarly to BIS [] with one fundamental change: instead of using thread waiting cycles as the metric to classify bottlenecks, the BI unit uses our Utility of Acceleration metric.³

Software support. The programmer, compiler or library delimits potential bottlenecks using BottleneckCall and BottleneckReturn instructions, and replaces the code that waits for bottlenecks with a BottleneckWait instruction. The purpose of the BottleneckWait instruction is threefold: 1) it implements waiting for the value in a memory location to change, 2) it allows the hardware to keep track of which threads are waiting for each bottleneck, and 3) it makes instruction count a more accurate measure of thread progress by removing the spinning loops that wait for synchronization and execute instructions that do not make progress.

Hardware support: Bottleneck Table. The hardware tracks which threads are executing or waiting for each bottleneck and identifies the critical bottlenecks with low overhead in hardware, using a Bottleneck Table (BT) in the BI unit.
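As a software-level analogy of the annotation just described (hypothetical, since BottleneckCall, BottleneckReturn and BottleneckWait are ISA instructions, not library calls), a critical section delimited for the hardware could be modeled like this:

```python
import threading

class AnnotatedLock:
    """Toy model of an annotated critical section: acquire() plays the role
    of BottleneckWait (the wait is visible for bookkeeping, not hidden in a
    spin loop), and entering/leaving the with-block plays BottleneckCall/
    BottleneckReturn. `table` mimics shared Bottleneck Table bookkeeping."""
    def __init__(self, bottleneck_id, table):
        self.bottleneck_id = bottleneck_id
        self.table = table
        self.lock = threading.Lock()
    def __enter__(self):
        self.table.setdefault(self.bottleneck_id, 0)
        self.lock.acquire()                   # BottleneckWait
        self.table[self.bottleneck_id] += 1   # BottleneckCall: instance begins
        return self
    def __exit__(self, *exc):
        self.lock.release()                   # BottleneckReturn: instance ends
        return False

table = {}
counter_lock = AnnotatedLock("counter_lock", table)
total = 0

def worker():
    global total
    for _ in range(1000):
        with counter_lock:       # annotated critical section
            total += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(total, table["counter_lock"])  # -> 4000 4000
```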
Each BT entry corresponds to a bottleneck and collects all the data required to compute the Utility of its acceleration.

² We find a range of 2% works well in our experiments.

³ Our experiments (not shown due to space limitations) show that using Utility outperforms using thread waiting cycles by 1.5% on average across our bottleneck-intensive applications.

Utility of accelerating bottlenecks. The Bottleneck Table computes the L factor once every quantum, as explained in Section 3.5.1. It recomputes the Utility of accelerating each bottleneck whenever its R (Section 3.5.2) or G (Section 3.5.3) factor changes. Therefore, Utility can change at any time, but it does not change very frequently because of how R and G are computed, as explained later.

Highest-Utility Bottleneck (HUB) set. Bottlenecks with Utility above the Bottleneck Acceleration Utility Threshold (BAUT) are enabled for acceleration, i.e., they are part of the HUB set.

3.4 Acceleration Coordination

The candidate code segments for acceleration are the lagging threads in the HULT set, one per large core, provided by the LTI unit, and the bottlenecks in the HUB set, whose acceleration has been enabled by the BI unit.

Lagging thread acceleration. Each lagging thread in the HULT set is assigned to run on a large core at the beginning of each quantum. The assignment is based on affinity to preserve cache locality, i.e., if a lagging thread will continue to be accelerated, it stays on the same large core, and threads newly added to the HULT set try to be assigned to a core that was running another thread from the same application.

Bottleneck acceleration. When a small core executes a BottleneckCall instruction, it checks whether or not the bottleneck is enabled for acceleration. To avoid accessing the global BT on every BottleneckCall, each small core includes a local Acceleration Index Table (AIT) that caches the bottleneck ID (bid), acceleration enable bit and assigned large core for each bottleneck.⁴ ⁵ If acceleration is disabled, the small core executes the bottleneck locally. If acceleration is enabled, the small core sends a bottleneck execution request to the assigned large core and stalls waiting for a response. The large core enqueues the request into a Scheduling Buffer (SB), which is a priority queue based on Utility.
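The Scheduling Buffer's ordering (highest Utility first, oldest instance first among equals) can be sketched with a binary heap; the field names here are invented for illustration:

```python
import heapq

class SchedulingBuffer:
    """Toy model of a large core's Scheduling Buffer: a priority queue of
    bottleneck execution requests, highest Utility first, FIFO (oldest
    instance first) among requests with equal Utility."""
    def __init__(self):
        self.heap = []
        self.arrival = 0  # monotonically increasing arrival order
    def enqueue(self, utility, bottleneck_id, small_core_id):
        # negate utility because heapq is a min-heap; arrival order breaks ties
        heapq.heappush(self.heap,
                       (-utility, self.arrival, bottleneck_id, small_core_id))
        self.arrival += 1
    def dequeue(self):
        _, _, bottleneck_id, small_core_id = heapq.heappop(self.heap)
        return bottleneck_id, small_core_id

sb = SchedulingBuffer()
sb.enqueue(0.10, "lock_A", small_core_id=3)
sb.enqueue(0.40, "lock_B", small_core_id=7)
sb.enqueue(0.40, "lock_B", small_core_id=9)
print(sb.dequeue())  # -> ('lock_B', 7): highest Utility, oldest instance first
```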
The oldest instance of the bottleneck with the highest Utility is executed by the large core until the BottleneckReturn instruction, at which point the large core sends a BottleneckDone signal to the small core. On receiving the BottleneckDone signal, the small core continues executing the instruction after the BottleneckCall.

The key idea of Algorithm 1, which controls each large core, is that each large core executes the assigned lagging thread as long as no bottleneck is migrated to it to be accelerated. Only bottlenecks with higher Utility than the BAUT (the smallest Utility among all accelerated lagging threads, i.e., the HULT set) are enabled to be accelerated. Therefore, it makes sense for those bottlenecks to preempt a lagging thread with lower Utility. The large core executes in one of two modes: accelerating the assigned lagging thread, or accelerating bottlenecks from its Scheduling Buffer (SB) when they show up. After no bottleneck shows up for 50K cycles, the assigned lagging thread is migrated back to the large core. This delay reduces the number of spurious lagging thread migrations, since bottlenecks like contended critical sections usually occur in bursts. The reasons to avoid finer-grained interleaving of lagging threads and bottlenecks are: 1) to reduce the impact of frequent migrations on cache locality and 2) to avoid excessive migration overhead. Both effects can significantly reduce or eliminate the benefit of acceleration.

⁴ Initially all bottlenecks are assigned to the large core running the lagging thread with minimum Utility (equal to the BAUT threshold for bottleneck acceleration), but they can be reassigned as we will explain in Section 3.5.4.

⁵ When a bottleneck is included in or excluded from the HUB set, the BT broadcasts the update to the AITs on all small cores.

Algorithm 1 Acceleration Coordination
while true do
    // execute a lagging thread
    migrate assigned lagging thread from small core
    while no bottleneck in SB do
        run assigned lagging thread
    end while
    migrate assigned lagging thread back to small core
    // execute bottlenecks until no bottleneck shows up for 50K cycles
    done_with_bottlenecks = false
    while not done_with_bottlenecks do
        while bottleneck in SB do
            dequeue from SB and run a bottleneck
        end while
        delay = 0
        while no bottleneck in SB and (delay < 50K cycles) do
            wait while incrementing delay
        end while
        done_with_bottlenecks = no bottleneck in SB
    end while
end while

3.5 Implementation Details

3.5.1 Estimation of L.

L is related to the speedup S due to running on a large core by:

L = Δt / t = (t − t/S) / t = 1 − 1/S

Any existing or future technique that estimates performance on a large core based on information collected while running on a small core can be used to estimate S. We use Performance Impact Estimation (PIE) [27], the latest of such techniques. PIE requires measuring total cycles per instruction (CPI), CPI due to memory accesses, and misses per instruction (MPI) while running on a small core. For code segments that are running on a large core, PIE also provides an estimate of the slowdown of running on a small core based on measurements on the large core. This estimation requires measuring CPI, MPI, the average dependency distance between a last-level cache miss and its consumer, and the fraction of instructions that are dependent on the previous instruction (because they would force execution of only one instruction per cycle in the 2-wide in-order small core). Instead of immediately using this estimation of performance on the small core while running on the large core, our implementation remembers the estimated speedup from the last time the code segment ran on a small core, because it is more effective to compare speedups obtained with the same technique.
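The relationship between L and the estimated speedup S reduces to a one-liner; the numbers below are illustrative, not measured:

```python
def local_acceleration(S):
    """L = delta_t / t = (t - t/S) / t = 1 - 1/S, where S is the estimated
    speedup of the code segment on the large core (e.g., a PIE-style estimate)."""
    return 1.0 - 1.0 / S

# A segment expected to run 2x faster on the large core saves half its cycles.
assert local_acceleration(2.0) == 0.5
# S < 1 (the large core would be slower, e.g., after losing cache locality)
# makes L negative: accelerating this segment would be harmful.
assert local_acceleration(0.8) < 0
```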
After five quanta we consider the old data to be stale and we switch to estimating the slowdown on a small core based on measurements on the large core. Each core collects data to compute L for its current thread and for up to two current bottlenecks, to allow tracking up to one level of nested bottlenecks. When a bottleneck finishes and executes a BottleneckReturn instruction, a message is sent to the Bottleneck Table in the regular BIS implementation. We include the data required to compute L for the bottleneck in this message without adding any extra overhead. Data required to compute L for lagging threads is sent to the Thread Table at the end of the scheduling quantum.

3.5.2 Estimation of R.

Since acceleration decisions are made at least once every scheduling quantum, the objective of each decision is to maximize Utility for one quantum at a time. Therefore, we estimate R (and Utility) only for the next quantum instead of for the whole run of the application, i.e., we use T = Q, the quantum length in cycles. During each quantum the hardware collects the number of active cycles, t_lastQ, for each thread and for each bottleneck, to use as an estimate of the code segment length t for the next quantum. To that end, each Bottleneck Table (BT) entry and each Thread Table (TT) entry include an active bit and a timestamp_active. In the BT, the active bit is set between BottleneckCall and BottleneckReturn instructions. In the TT, the active bit is only reset while executing a BottleneckWait instruction, i.e., while the thread is waiting. When the active bit is set, timestamp_active is set to the current time. When the active bit is reset, the active cycle count t_lastQ is incremented by the difference between the current time and timestamp_active. Lagging thread activity can be easily inferred: a running thread is always active, except while running a BottleneckWait instruction.

R_estimated = t_lastQ / Q    for lagging threads

Bottleneck activity is already reported to the Bottleneck Table for bookkeeping, off the critical path, after executing BottleneckCall and BottleneckReturn instructions. Therefore the active bit is set by BottleneckCall and reset by BottleneckReturn. Given that bottlenecks can suddenly become important, the Bottleneck Table also keeps an active cycle counter t_lastSubQ for the last subquantum, an interval equal to 1/8 of the quantum. Therefore,

R_estimated = max(t_lastQ / Q, t_lastSubQ / (Q/8))    for bottlenecks

3.5.3 Estimation of G.

The G factor measures the criticality of the code segment, i.e., how much of its acceleration is expected to reduce total execution time. Consequently, we estimate G for each type of code segment as follows:

Lagging threads. The criticality of lagging threads depends on the number of lagging threads in the application. If there are M lagging threads, all of them have to be evenly accelerated to remove the thread waiting they would cause before the next barrier or joining point.
That is, all M lagging threads are potential critical paths with similar lengths. Therefore, G_estimated = 1/M for each of the lagging threads. Amdahl's serial segments are part of the only thread that exists, i.e., a special case of lagging threads with M = 1. Therefore, each serial segment is on the critical path and has G = 1. Similarly, the last thread running for a barrier and the slowest stage of a pipelined program are also identified as single lagging threads and, therefore, have G = 1.

Critical sections. Not every contended critical section is on the critical path, and high contention is not a necessary condition for being on the critical path. Let's consider two cases. Figure 6(a) shows a critical section that is on the critical path (dashed line), even though there is never more than one thread waiting for it. All threads have to wait at some time for the critical section, which makes the critical path jump from thread to thread following the critical section segments. We consider strongly contended critical sections those that have been making all threads wait in the recent past, and estimate G = 1 for them. To identify strongly contended critical sections, each Bottleneck Table entry includes a recent_waiters bit vector with one bit per hardware context. This bit is set on executing BottleneckWait for the corresponding bottleneck. Each Bottleneck Table entry also keeps a moving average of the bottleneck length, avg_len. If there are N active threads in the application, recent_waiters is evaluated every N × avg_len cycles: if N bits are set, indicating that all active threads had to wait, the critical section is assumed to be strongly contended. Then, the number of ones in recent_waiters is stored in past_waiters (see the next paragraph) and recent_waiters is reset.

Figure 6: Types of critical sections: (a) strongly contended critical section (all threads wait for it); (b) weakly contended critical section (T2 never waits for it).

We call the critical sections that have not made all threads wait in the recent past weakly contended critical sections (see Figure 6(b)).
Accelerating an instance of a critical section accelerates not only the thread that is executing it but also every thread that is waiting for it. If we assume each thread has the same probability of being on the critical path, the probability of accelerating the critical path by accelerating the critical section is the fraction of threads that get accelerated, i.e., G = (W + 1)/N, when there are W waiting threads and a total of N threads. Since the current number of waiters W is very dynamic, we combine it with history (past_waiters from the previous paragraph). Therefore, we estimate G for weakly contended critical sections as

G_estimated = ( max(W, past_waiters) + 1 ) / N

False Serialization and Using Multiple Large Cores for Bottleneck Acceleration. Instances of different bottlenecks from the same or from different applications may be accelerated on the same large core. Therefore, a bottleneck may get falsely serialized, i.e., it may have to wait for too long on the Scheduling Buffer for another bottleneck with higher Utility. Bottlenecks that suffer false serialization can be reassigned to a different large core, as long as their Utility is higher than that of the lagging thread assigned to run on that large core. Otherwise, a bottleneck that is ready to run but does not have a large core to run on is sent back to its small core to avoid false serialization and potential starvation, as in [].

Reducing Large Core Waiting. While a lagging thread is executing on a large core it may start waiting for several reasons. First, if the thread starts waiting for a barrier or is about to exit or be de-scheduled, the thread is migrated back to its small core and is replaced with the lagging thread with the highest Utility that is not running on a large core. Second, if the lagging thread starts waiting for a critical section that is not being accelerated, there is a situation where a large core is waiting for a small core, which is inefficient. Instead, we save the context of the waiting thread on a shadow register alias table (RAT) and migrate the thread that is currently running the critical section from its small core to finish on the large core. Third, if the accelerated lagging thread wants to enter a critical section that is being accelerated on a different large core, it is migrated to the large core assigned to accelerate that critical section, to preserve shared data locality.

- Thread Table (TT). Purpose: to track threads, identify lagging threads and compute their Utility of Acceleration. Location and entry structure (field sizes in bits in parentheses): LTI unit, one entry per HW thread (52 in this example); each entry has 98 bits: tid(16), pid(16), is_lagging_thread(1), num_threads(8), timestamp_active(24), active(1), t_lastQ(16), Utility(16). Cost: 637 B.
- Bottleneck Table (BT). Purpose: to track bottlenecks [] and compute their Utility of Acceleration. Location and entry structure: one 32-entry table on the BI unit; each entry has 452 bits: bid(64), pid(16), executers(6), executer_vec(64), waiters(6), waiters_sb(6), large_core_id(2), PIE_data(7), timestamp_active(24), active(1), t_lastQ(16), t_lastSubQ(13), outg(24), avg_len(18), recent_waiters(64), past_waiters(6), Utility(16). Cost: 1808 B.
- Acceleration Index Tables (AIT). Purpose: to avoid accessing the BT to find whether a bottleneck is enabled for acceleration. Location and entry structure: one 32-entry table per small core; each entry has 67 bits: bid(64), enabled(1), large_core_id(2); each AIT has 268 bytes. Cost: 13.6 KB.
- Scheduling Buffers (SB). Purpose: to store and prioritize bottleneck execution requests on each large core. Location and entry structure: one 52-entry buffer per large core; each entry has 214 bits: bid(64), small_core_ID(6), target_PC(64), stack_pointer(64), Utility(16); each SB has 1391 bytes. Cost: 4.1 KB.
- Total: 20.1 KB.

Table 1: Hardware structures for UBA and their storage cost on an ACMP with 52 small cores and 3 large cores.

- Small core: 2-wide, 5-stage in-order, 4 GHz; 32 KB write-through, 1-cycle, 8-way, separate I and D L1 caches; 256 KB write-back, 6-cycle, 8-way, private unified L2 cache.
- Large core: 4-wide, 12-stage out-of-order, 128-entry ROB, 4 GHz; 32 KB write-through, 1-cycle, 8-way, separate I and D L1 caches; 1 MB write-back, 8-cycle, 8-way, private unified L2 cache.
- Cache coherence: MESI protocol, on-chip distributed directory, L2-to-L2 cache transfers allowed, 8K entries/bank, one bank per core.
- L3 cache: shared 8 MB, write-back, 16-way, 20-cycle.
- On-chip interconnect: bidirectional ring, 64-bit wide, 2-cycle hop latency.
- Off-chip memory bus: 64-bit wide, split-transaction, 40-cycle, pipelined bus at 1/4 of CPU frequency.
- Memory: 32-bank DRAM, modeling all queues and delays; row buffer hit/miss/conflict latencies = 25/50/75 ns.
- CMP configurations with area equivalent to N small cores: LC large cores, SC = N - 4*LC small cores.
  - ACMP [15, 16]: a large core always runs any single-threaded code. Max number of threads is SC + LC.
  - AGETS [13]: in each quantum, the large cores run the threads with more expected work to do. Max number of threads is SC + LC.
  - BIS []: the large cores run any single-threaded code and bottleneck code segments as proposed in []: 32-entry Bottleneck Table, each large core has an SC-entry Scheduling Buffer, each small core has a 32-entry Acceleration Index Table. Max number of threads is SC.
  - UBA: the large cores run the code segments with the highest Utility of Acceleration: BIS structures plus an SC-entry Thread Table. Max number of threads is SC.
Fourth, if acceleration of a critical section is enabled and there are threads waiting to enter that critical section on small cores, they are migrated to execute the critical section on the assigned large core. All these mechanisms are implemented as extensions of the behavior of the BottleneckCall and BottleneckWait instructions and use the information that is already in the Bottleneck Table.

Hardware Structures and Cost. Table 1 describes the hardware structures required by UBA and their storage cost for a 52-small-core, 3-large-core ACMP, which is only 20.1 KB. UBA does not substantially increase storage cost over BIS, since it only adds the Thread Table and requires minor changes to the Bottleneck Table.

Support for Software-based Scheduling. Software can directly specify lagging threads if it has better information than what is used by our hardware-based progress tracking. Software can also modify the quantum length Q depending on application characteristics (a larger Q means less migration overhead, but also less opportunity to accelerate many lagging threads from the same application between consecutive barriers). Finally, software must be able to specify priorities for different applications, which would become just an additional factor in the Utility metric. Our evaluation does not include these features, and exploring them is part of our future work.

Table 2: Baseline processor configuration.

4. EXPERIMENTAL METHODOLOGY

We use an x86 cycle-level simulator that models asymmetric CMPs with small in-order cores modeled after the Intel Pentium processor and large out-of-order cores modeled after the Intel Core 2 processor. Our simulator faithfully models all latencies and core-to-core communication, including those due to execution migration. Configuration details are shown in Table 2. We compare UBA to previous work, summarized in Table 3. Our comparison points for thread scheduling are based on two state-of-the-art proposals: Age-based Scheduling [13] (AGETS) and PIE [27]. We chose these baselines because we use similar metrics for progress and speedup estimation.
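As a compact summary of the mechanism described in Section 3, the following sketch ranks candidate code segments (lagging threads and bottlenecks) by an estimated Utility and assigns the large cores greedily. It is a hypothetical software model: the Utility combination shown (acceleration factor times R times G, scaled by an optional per-application priority as suggested above) is a simplified stand-in for the paper's exact metric.

```python
# Hypothetical sketch of utility-ranked large-core assignment; a software
# model, not the paper's hardware. The Utility product below is a
# simplified stand-in for the paper's exact metric.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    r: float               # R: expected active fraction of the next quantum
    g: float               # G: criticality factor (Section 3)
    accel: float           # stand-in for the speedup-dependent factor
    priority: float = 1.0  # optional software-specified application priority

    @property
    def utility(self) -> float:
        # Simplified combination of the speedup-dependent factor, R and G.
        return self.priority * self.accel * self.r * self.g

def assign_large_cores(candidates, num_large_cores):
    """Greedily run the highest-utility segments on the large cores;
    everything else stays on (or is sent back to) its small core,
    avoiding false serialization behind higher-utility bottlenecks."""
    ranked = sorted(candidates, key=lambda c: c.utility, reverse=True)
    return [c.name for c in ranked[:num_large_cores]]
```

A strongly contended critical section (G = 1) with high R naturally outranks one of many lagging threads (G = 1/M), which matches the intuition behind the G estimates above.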
Note that our baselines for multiple applications are aggressive extensions of previous proposals: AGETS combined with PIE to accelerate lagging threads, and an extension of BIS that dynamically shares all large cores among applications to accelerate any bottleneck based on relative thread waiting cycles. We evaluate 9 multithreaded workloads with a wide range of performance impact from bottlenecks, as shown in Table 4. Our 2-application workloads are composed of all combinations from the 10-application set including the 9 multithreaded applications plus the compute-intensive ft_nasp, which is run with one thread to have a mix of single-threaded and multithreaded applications. Our 4-application workloads are 50 randomly picked combinations of the same applications. We run all applications to completion. In the multiple-application experiments we run until the longest application finishes and, meanwhile, we restart any application that finishes early to continue producing interference and contention for all resources, including large cores. We measure execution time during the first run of each application. We run each application with the optimal number of threads found when running alone. When the sum of the optimal number of threads for all applications is greater than

the maximum number of threads, we reduce the number of threads for the application(s) whose performance is less sensitive to the number of threads.

Table 3: Experimental configurations.
- ACMP: the serial portion runs on a large core, the parallel portion runs on all cores [3, 15, 16].
- AGETS: the Age-based Scheduling algorithm for a single multithreaded application, as described in [13].
- AGETS+PIE: to compare to a reasonable baseline for thread scheduling of multiple applications, we use AGETS [13] to find the most lagging thread within each application. Then, we use PIE [27] to pick for each large core the thread that would get the largest speedup among the lagging threads from each application.
- BIS: the serial portion and bottlenecks run on the large cores, the parallel portion runs on small cores [].
- MA-BIS: to compare to a reasonable baseline, we extend BIS to multiple applications by sharing the large cores to accelerate bottlenecks from any application. To follow the key insights from BIS, we prioritize bottlenecks by thread waiting cycles normalized to the number of threads of each application, regardless of which application they belong to.
- UBA: our proposal.

The evaluated workloads (see Table 4) are:
- blacksch: BlackScholes option pricing [18]; input: 1M options; 1 bottleneck: final barrier after omp parallel.
- hist_ph: histogram of RGB components, Phoenix [19]; input: S (small); 1 bottleneck: critical sections (CS) on the map-reduce scheduler.
- iplookup: IP packet routing [28]; input: 2.5K queries; # bottlenecks = # threads: CS on routing tables.
- is_nasp: integer sort, NAS suite [4]; input: n = 64K; 1 bottleneck: CS on buffer of keys.
- mysql: MySQL server [1], SysBench [2]; input: OLTP-nontrx; 18 bottlenecks: CS on metadata, tables.
- pca_ph: principal components analysis, Phoenix [19]; input: S (small); 1 bottleneck: CS on the map-reduce scheduler.
- specjbb: Java business benchmark [22]; input: 5 seconds; 39 bottlenecks: CS on counters, warehouse data.
- tsp: traveling salesman [12]; input: 8 cities; 2 bottlenecks: CS on termination condition, solution.
- webcache: cooperative web cache [26]; input: 0K queries; 33 bottlenecks: CS on replacement policy.
- ft_nasp: FFT computation, NAS suite [4]; input: size = 32x32x32; run as a single-threaded application.
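The construction of the multi-application workload lists described above can be reproduced schematically. This is an illustrative sketch: the pairing policy (unordered combinations without repetition) and the fixed random seed are assumptions, not details confirmed by the text.

```python
# Hypothetical reconstruction of the multi-application workload lists.
# The 10-application pool = 9 multithreaded benchmarks + single-threaded
# ft_nasp; the pairing policy and seed are assumptions for illustration.
import itertools
import random

APPS = ["blacksch", "hist_ph", "iplookup", "is_nasp", "mysql",
        "pca_ph", "specjbb", "tsp", "webcache", "ft_nasp"]

# 2-application workloads: all unordered pairs of distinct applications.
two_app_workloads = list(itertools.combinations(APPS, 2))

# 4-application workloads: 50 randomly picked 4-application combinations.
rng = random.Random(0)  # fixed seed so the pick is reproducible
four_app_workloads = [tuple(rng.sample(APPS, 4)) for _ in range(50)]
```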
Unless otherwise indicated, we use the harmonic mean to compute all the averages in our evaluation. To measure system performance with multiple applications [9] we use the Harmonic mean of Speedups (Hspeedup) [14] and the Weighted Speedup (Wspeedup) [21], defined below for N applications. T_alone_i is the execution time when application i runs alone in the system and T_shared_i is the execution time measured when all applications are running. We also report Unfairness [17], as defined below.

Hspeedup = N / ( sum_{i=0..N-1} T_shared_i / T_alone_i )

Wspeedup = sum_{i=0..N-1} T_alone_i / T_shared_i

Unfairness = max_i ( T_alone_i / T_shared_i ) / min_i ( T_alone_i / T_shared_i )

Table 4: Evaluated workloads.

5. EVALUATION

5.1 Single Application

We carefully choose the number of threads to run each application with, because that number significantly affects the performance of multithreaded applications. We evaluate two situations: (1) a number of threads equal to the number of available hardware contexts, i.e., the maximum number of threads, which is a common practice for running non-I/O-intensive applications; and (2) the optimal number of threads, i.e., the number of threads that minimizes execution time, which we find with an exhaustive search for each application on each configuration.

Table 5 shows the average speedups of UBA over the other mechanisms for different ACMP configurations. UBA performs better than the other mechanisms on every configuration, except for multiple large cores on a 16-core area budget. BIS and UBA dedicate the large cores to accelerating code segments, unlike ACMP and AGETS. Therefore, the maximum number of threads that can run under BIS and UBA is smaller. With an area budget of 16, BIS and UBA cannot overcome the loss of parallel throughput due to running significantly fewer threads. For example, with 3 large cores, AGETS and ACMP can execute applications with up to 7 threads (4 on small cores and 3 on large cores), while BIS and UBA can execute a maximum of 4 threads. Overall, the benefit of UBA increases with the area budget and the number of large cores.

Table 5: Average speedup (%) of UBA over ACMP, AGETS and BIS, for the optimal and the maximum number of threads at each area budget and large-core count.

5.1.1 Single-Large-Core ACMP

Figure 7 shows the speedup of AGETS, BIS and UBA over ACMP, which accelerates only the Amdahl's serial bottleneck. Each application runs with its optimal number of threads for each configuration. We show results for 16, 32 and 64-small-core area budgets and a single large core. On average, our proposal improves performance over ACMP by 8%/15%/16%, over AGETS by 0.2%/7.5%/7.3% and over BIS by 9%/8%/7% for area budgets of 16/32/64 small cores. We make three observations.

First, as the number of cores increases, AGETS, BIS and UBA provide higher average performance improvement over ACMP. The performance improvement of UBA over AGETS increases with the number of cores because, unlike AGETS, UBA can accelerate bottlenecks, which have an increasingly larger impact on performance as the number of cores increases (as long as the number of threads increases). However, the performance improvement of UBA over BIS slightly decreases with a higher number of cores. The reason is that the benefit of accelerating lagging threads in addition to bottlenecks gets smaller for some benchmarks as the number of threads increases, depending on the actual amount of thread imbalance that UBA can eliminate. Since BIS and UBA dedicate the large core to accelerating code segments, they can run one fewer thread than ACMP and AGETS. With an area budget of 16, the impact of running one fewer thread is significant for BIS and UBA, but UBA is able to overcome

that disadvantage with respect to ACMP and AGETS by accelerating both bottlenecks and lagging threads.

Second, as the number of threads increases, iplookup, is_nasp, mysql, tsp and webcache become limited by contended critical sections and significantly benefit from UBA. UBA improves performance more than AGETS and BIS because it is able to accelerate both lagging threads and bottlenecks. Hist_ph and pca_ph are MapReduce applications with no significant contention for critical sections, where UBA improves performance over AGETS because its shorter scheduling quantum and lower-overhead hardware-managed thread migration accelerate all three parallel portions (map, reduce and merge) more efficiently.

Third, blacksch is a very scalable workload with neither significant bottlenecks nor significant thread imbalance, which is the worst-case scenario for all the evaluated mechanisms. Therefore, AGETS produces the best performance because it accelerates all threads in round-robin order and it can run one more thread than BIS or UBA, which dedicate the large core to acceleration. However, the performance benefit for blacksch decreases as the number of cores (and threads) increases because the large core is multiplexed among all threads, resulting in less acceleration of each thread and a smaller impact on performance. Note that in a set of symmetric threads, execution time is reduced only by the minimum amount of time that is saved from any thread, which requires accelerating all threads evenly. UBA efficiently accelerates all threads, similarly to AGETS, but is penalized by having to run with one fewer thread.

Figure 7: Speedup for the optimal number of threads, normalized to ACMP (%). (a) Area budget = 16 small cores; (b) area budget = 32 small cores; (c) area budget = 64 small cores. Per-benchmark bars (blacksch, hist_ph, iplookup, is_nasp, pca_ph, mysql, specjbb, tsp, webcache) with harmonic mean (hmean).
We conclude that UBA improves the performance of applications that have lagging threads, bottlenecks or both by a larger amount than AGETS, a previous proposal that accelerates only lagging threads, and BIS, a previous proposal that accelerates only bottlenecks.

5.1.2 Multiple-Large-Core ACMP

Figure 8 shows the average speedups across all workloads on the different configurations with the same area budgets, running with their optimal number of threads. The main observation is that replacing small cores with large cores on a small area budget (16 cores, Figure 8(a)) has a very large negative impact on performance due to the loss of parallel throughput. With an area budget of 32 (Figure 8(b)), UBA performs about the same with 1, 2 or 3 large cores, but still provides the best overall performance of all three mechanisms. With an area budget of 64 (Figure 8(c)) there is no loss of parallel throughput, except for blacksch. Therefore, both BIS and UBA can take advantage of more large cores. However, the average performance of UBA improves more significantly with additional large cores. According to per-benchmark data not shown due to space limitations, the main reason for this improvement is that iplookup, mysql and webcache benefit from additional large cores because UBA is able to concurrently accelerate the most important critical sections and lagging threads. We conclude that as the area budget increases, UBA becomes more effective than ACMP, AGETS and BIS in taking advantage of additional large cores.

Figure 8: Average speedups with multiple large cores (1LC, 2LC, 3LC), normalized to ACMP with 1 large core. (a) Area = 16; (b) Area = 32; (c) Area = 64.
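Before turning to multiple applications, the Hspeedup, Wspeedup and Unfairness metrics defined in Section 4 can be transcribed directly; T_alone[i] and T_shared[i] are the execution times of application i running alone and with co-runners, respectively.

```python
# Plain-Python transcription of the multi-application metrics from Section 4.

def hspeedup(t_alone, t_shared):
    """Harmonic mean of per-application speedups: N / sum(T_shared/T_alone)."""
    n = len(t_alone)
    return n / sum(ts / ta for ta, ts in zip(t_alone, t_shared))

def wspeedup(t_alone, t_shared):
    """Weighted speedup: sum of per-application speedups T_alone/T_shared."""
    return sum(ta / ts for ta, ts in zip(t_alone, t_shared))

def unfairness(t_alone, t_shared):
    """Ratio of the largest to the smallest per-application speedup."""
    speedups = [ta / ts for ta, ts in zip(t_alone, t_shared)]
    return max(speedups) / min(speedups)
```

For example, two applications slowed down by 2x and 4x under sharing have Wspeedup 0.75, Hspeedup 1/3 and Unfairness 2.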
5.2 Multiple Applications

Figure 9 shows the sorted harmonic speedups of the application workloads with each mechanism: the extensions of previous proposals to multiple applications (AGETS+PIE and MA-BIS) and UBA, normalized to the harmonic speedup of ACMP, with an area budget of 64 small cores. Performance is generally better for UBA than for the other mechanisms, and the difference increases with additional large cores. Results for 2-application and 4-application workloads with an area budget of 128 small cores show a similar pattern and are not shown in detail due to space limitations, but we show the averages. Tables 6 and 7 show the improvement in average weighted speedup, harmonic speedup and unfairness with UBA over the other mechanisms. Average Wspeedup and Hspeedup for UBA are better than for the other mechanisms on all configurations. Unfairness is also reduced, except for three cases with 4 applications. UBA's unfairness measured as maximum slowdown [7] is even lower than with the reported metric (by an average -4.3% for 2 ap-


More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems:

RAP. Speed/RAP/CODA. Real-time Systems. Modeling the sensor networks. Real-time Systems. Modeling the sensor networks. Real-time systems: Speed/RAP/CODA Presented by Octav Chpara Real-tme Systems Many wreless sensor network applcatons requre real-tme support Survellance and trackng Border patrol Fre fghtng Real-tme systems: Hard real-tme:

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping

Mixed-Criticality Scheduling on Multiprocessors using Task Grouping Mxed-Crtcalty Schedulng on Multprocessors usng Task Groupng Jankang Ren Lnh Th Xuan Phan School of Software Technology, Dalan Unversty of Technology, Chna Computer and Informaton Scence Department, Unversty

More information

Space-Optimal, Wait-Free Real-Time Synchronization

Space-Optimal, Wait-Free Real-Time Synchronization 1 Space-Optmal, Wat-Free Real-Tme Synchronzaton Hyeonjoong Cho, Bnoy Ravndran ECE Dept., Vrgna Tech Blacksburg, VA 24061, USA {hjcho,bnoy}@vt.edu E. Douglas Jensen The MITRE Corporaton Bedford, MA 01730,

More information

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems

An Efficient Garbage Collection for Flash Memory-Based Virtual Memory Systems S. J and D. Shn: An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems 2355 An Effcent Garbage Collecton for Flash Memory-Based Vrtual Memory Systems Seunggu J and Dongkun Shn, Member,

More information

Maintaining temporal validity of real-time data on non-continuously executing resources

Maintaining temporal validity of real-time data on non-continuously executing resources Mantanng temporal valdty of real-tme data on non-contnuously executng resources Tan Ba, Hong Lu and Juan Yang Hunan Insttute of Scence and Technology, College of Computer Scence, 44, Yueyang, Chna Wuhan

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals Agenda & Readng COMPSCI 8 SC Applcatons Programmng Programmng Fundamentals Control Flow Agenda: Decsonmakng statements: Smple If, Ifelse, nested felse, Select Case s Whle, DoWhle/Untl, For, For Each, Nested

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Learning-Based Top-N Selection Query Evaluation over Relational Databases

Learning-Based Top-N Selection Query Evaluation over Relational Databases Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **

More information

CS 268: Lecture 8 Router Support for Congestion Control

CS 268: Lecture 8 Router Support for Congestion Control CS 268: Lecture 8 Router Support for Congeston Control Ion Stoca Computer Scence Dvson Department of Electrcal Engneerng and Computer Scences Unversty of Calforna, Berkeley Berkeley, CA 9472-1776 Router

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing EDA222/DIT161 Real-Tme Systems, Chalmers/GU, 2014/2015 Lecture #8 Real-Tme Systems Real-Tme Systems Lecture #8 Specfcaton Professor Jan Jonsson Implementaton System models Executon-tme analyss Department

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Burst Round Robin as a Proportional-Share Scheduling Algorithm

Burst Round Robin as a Proportional-Share Scheduling Algorithm Burst Round Robn as a Proportonal-Share Schedulng Algorthm Tarek Helmy * Abdelkader Dekdouk ** * College of Computer Scence & Engneerng, Kng Fahd Unversty of Petroleum and Mnerals, Dhahran 31261, Saud

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Performance Evaluation

Performance Evaluation Performance Evaluaton [Ch. ] What s performance? of a car? of a car wash? of a TV? How should we measure the performance of a computer? The response tme (or wall-clock tme) t takes to complete a task?

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies Appears n the Proceedngs of the 51st Annual IEEE/ACM Internatonal Symposum on Mcroarchtecture (MICRO), 218 Adaptve Schedulng for Systems wth Asymmetrc Memory Herarches Po-An Tsa, Changpng Chen, Danel Sanchez

More information

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management.

4/11/17. Agenda. Princeton University Computer Science 217: Introduction to Programming Systems. Goals of this Lecture. Storage Management. //7 Prnceton Unversty Computer Scence 7: Introducton to Programmng Systems Goals of ths Lecture Storage Management Help you learn about: Localty and cachng Typcal storage herarchy Vrtual memory How the

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

Multitasking and Real-time Scheduling

Multitasking and Real-time Scheduling Multtaskng and Real-tme Schedulng EE8205: Embedded Computer Systems http://www.ee.ryerson.ca/~courses/ee8205/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrcal and Computer Engneerng Ryerson Unversty

More information

UB at GeoCLEF Department of Geography Abstract

UB at GeoCLEF Department of Geography   Abstract UB at GeoCLEF 2006 Mguel E. Ruz (1), Stuart Shapro (2), June Abbas (1), Slva B. Southwck (1) and Davd Mark (3) State Unversty of New York at Buffalo (1) Department of Lbrary and Informaton Studes (2) Department

More information

Scheduling and queue management. DigiComm II

Scheduling and queue management. DigiComm II Schedulng and queue management Tradtonal queung behavour n routers Data transfer: datagrams: ndvdual packets no recognton of flows connectonless: no sgnallng Forwardng: based on per-datagram forwardng

More information

VFH*: Local Obstacle Avoidance with Look-Ahead Verification

VFH*: Local Obstacle Avoidance with Look-Ahead Verification 2000 IEEE Internatonal Conference on Robotcs and Automaton, San Francsco, CA, Aprl 24-28, 2000, pp. 2505-25 VFH*: Local Obstacle Avodance wth Look-Ahead Verfcaton Iwan Ulrch and Johann Borensten The Unversty

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

A Holistic View of Stream Partitioning Costs

A Holistic View of Stream Partitioning Costs A Holstc Vew of Stream Parttonng Costs Nkos R. Katspoulaks, Alexandros Labrnds, Panos K. Chrysanths Unversty of Pttsburgh Pttsburgh, Pennsylvana, USA {katsp, labrnd, panos}@cs.ptt.edu ABSTRACT Stream processng

More information

WITH rapid improvements of wireless technologies,

WITH rapid improvements of wireless technologies, JOURNAL OF SYSTEMS ARCHITECTURE, SPECIAL ISSUE: HIGHLY-RELIABLE CPS, VOL. 00, NO. 0, MONTH 013 1 Adaptve GTS Allocaton n IEEE 80.15.4 for Real-Tme Wreless Sensor Networks Feng Xa, Ruonan Hao, Je L, Naxue

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden Optmzng for Speed Er Hagersten Uppsala Unversty, Sweden eh@t.uu.se What s the potental gan? Latency dfference L$ and mem: ~5x Bandwdth dfference L$ and mem: ~x Repeated TLB msses adds a factor ~-3x Execute

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Virtual Machine Migration based on Trust Measurement of Computer Node

Virtual Machine Migration based on Trust Measurement of Computer Node Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on

More information

Shared Running Buffer Based Proxy Caching of Streaming Sessions

Shared Running Buffer Based Proxy Caching of Streaming Sessions Shared Runnng Buffer Based Proxy Cachng of Streamng Sessons Songqng Chen, Bo Shen, Yong Yan, Sujoy Basu Moble and Meda Systems Laboratory HP Laboratores Palo Alto HPL-23-47 March th, 23* E-mal: sqchen@cs.wm.edu,

More information

THE low-density parity-check (LDPC) code is getting

THE low-density parity-check (LDPC) code is getting Implementng the NASA Deep Space LDPC Codes for Defense Applcatons Wley H. Zhao, Jeffrey P. Long 1 Abstract Selected codes from, and extended from, the NASA s deep space low-densty party-check (LDPC) codes

More information

Self-tuning Histograms: Building Histograms Without Looking at Data

Self-tuning Histograms: Building Histograms Without Looking at Data Self-tunng Hstograms: Buldng Hstograms Wthout Lookng at Data Ashraf Aboulnaga Computer Scences Department Unversty of Wsconsn - Madson ashraf@cs.wsc.edu Surajt Chaudhur Mcrosoft Research surajtc@mcrosoft.com

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms

On Achieving Fairness in the Joint Allocation of Buffer and Bandwidth Resources: Principles and Algorithms On Achevng Farness n the Jont Allocaton of Buffer and Bandwdth Resources: Prncples and Algorthms Yunka Zhou and Harsh Sethu (correspondng author) Abstract Farness n network traffc management can mprove

More information

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation Loop Transformatons for Parallelsm & Localty Last week Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Scalar expanson for removng false dependences Loop nterchange Loop

More information

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams

Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams Self-Tunng, Bandwdth-Aware Montorng for Dynamc Data Streams Navendu Jan, Praveen Yalagandula, Mke Dahln, Yn Zhang Mcrosoft Research HP Labs The Unversty of Texas at Austn Abstract We present, a self-tunng,

More information

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks

Technical Report. i-game: An Implicit GTS Allocation Mechanism in IEEE for Time- Sensitive Wireless Sensor Networks www.hurray.sep.pp.pt Techncal Report -GAME: An Implct GTS Allocaton Mechansm n IEEE 802.15.4 for Tme- Senstve Wreless Sensor etworks Ans Koubaa Máro Alves Eduardo Tovar TR-060706 Verson: 1.0 Date: Jul

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information