Uncorrected Proof. Thread-Level Speculation


Encyclopedia of Parallel Computing

Thread-Level Speculation
Josep Torrellas
University of Illinois at Urbana-Champaign, 4231 Siebel Center, M/C-258, Urbana, IL, USA

Synonyms
Speculative multithreading (SM); Speculative parallelization; Speculative run-time parallelization; Speculative threading; Speculative thread-level parallelization; Thread-level data speculation (TLDS); TLS

Definition
Thread-Level Speculation (TLS) refers to an environment where execution threads operate speculatively, performing potentially unsafe operations, and temporarily buffering the state they generate in a buffer or cache. At a certain point, the operations of a thread are declared to be correct or incorrect. If they are correct, the thread commits, merging the state it generated with the correct state of the program; if they are incorrect, the thread is squashed and typically restarted from its beginning. The term TLS is most often associated with a scenario where the purpose is to execute a sequential application in parallel. In this case, the compiler or the hardware breaks down the application into speculative threads that execute in parallel. However, strictly speaking, TLS can be applied to any environment where threads are executed speculatively and can be squashed and restarted.

Discussion

Basic Concepts in Thread-Level Speculation
In its most common use, Thread-Level Speculation (TLS) consists of extracting units of work (i.e., tasks) from a sequential application and executing them on different threads in parallel, hoping not to violate sequential semantics. The control flow in the sequential code imposes a relative ordering between the tasks, which is expressed in terms of predecessor and successor tasks. The sequential code also induces a data dependence relation on the memory accesses issued by the different tasks that parallel execution cannot violate.

A task is Speculative when it may perform or may have performed operations that violate data or control dependences with its predecessor tasks. Otherwise, the task is nonspeculative.
The memory accesses issued by speculative tasks are called speculative memory accesses.

When a nonspeculative task finishes execution, it is ready to Commit. The role of commit is to inform the rest of the system that the data generated by the task is now part of the safe, nonspeculative program state. Among other operations, committing always involves passing the Commit Token to the immediate successor task. This is because maintaining correct sequential semantics in the parallel execution requires that tasks commit in order from predecessor to successor. If a task reaches its end and is still speculative, it cannot commit until it acquires nonspeculative status and all its predecessors have committed.

Figure 1 shows an example of several tasks running on four processors. In this example, when task T3 executing on processor 4 finishes the execution, it cannot commit until its predecessor tasks T0, T1, and T2 also finish and commit. In the meantime, depending on the hardware support, processor 4 may have to stall or may be able to start executing speculative task T7. The example also shows how the nonspeculative task status changes as tasks finish and commit, and the passing of the commit token.

Thread-Level Speculation. Fig. 1 A set of tasks executing on four processors. The figure shows the nonspeculative task timeline and the transfer of the commit token

David Padua (ed.), Encyclopedia of Parallel Computing, Springer Science+Business Media LLC 2011

Memory accesses issued by a speculative task must be handled carefully. Stores generate Speculative Versions of data that cannot simply be merged with the nonspeculative state of the program. The reason is that they may be incorrect. Consequently, these versions are stored in a Speculative Buffer local to the processor running the task, e.g., the first-level cache. Only when the task becomes nonspeculative are its versions safe.

Loads issued by a speculative task try to find the requested datum in the local speculative buffer. If they miss, they fetch the correct version from the memory subsystem, i.e., the closest predecessor version from the speculative buffers of other tasks. If no such version exists, they fetch the datum from memory.

As tasks execute in parallel, the system must identify any violations of cross-task data dependences. Typically, this is done with special hardware or software support that tracks, for each individual task, the data that the task wrote and the data that the task read without first writing it. A data-dependence violation is flagged when a task modifies a datum that has been read earlier by a successor task. At this point, the consumer task is squashed and all the data versions that it has produced are discarded. Then, the task is re-executed.

Figure 2 shows an example of a data-dependence violation. In the example, each iteration of a loop is a task. Each iteration issues two accesses to an array, through an un-analyzable subscripted subscript. At run-time, iteration J writes A[5] after its successor iteration J+2 reads A[5]. This is a Read After Write (RAW) dependence that gets violated due to the parallel execution. Consequently, iteration J+2 is squashed and restarted. Ordinarily, all the successor tasks of iteration J+2 are also squashed at this time because they may have consumed versions generated by the squashed task.
While it is possible to selectively squash only tasks that used incorrect data, it would involve extra complexity. Finally, as iteration J+2 re-executes, it will re-read A[5]. However, at this time, the value read will be the version generated by iteration J.

Note that WAR and WAW dependence violations do not need to induce task squashes. The successor task has prematurely written the datum, but the datum remains buffered in its speculative buffer. A subsequent read from a predecessor task (in a WAR violation) will get a correct version, while a subsequent write from a predecessor task (in a WAW violation) will generate a version that will be merged with main memory before the one from the successor task.

However, many proposed TLS schemes, to reduce hardware complexity, induce squashes in a variety of situations. For instance, if the system has no support to keep different versions of the same datum in different speculative buffers in the machine, cross-task WAR and WAW dependence violations induce squashes. Moreover, if the system only tracks accesses on a per-line basis, it cannot disambiguate accesses to different words in the same memory line. In this case, false sharing of a cache line by two different processors can appear as a data-dependence violation and also trigger a squash.
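The detection rule described above (a write by a task squashes any successor that has already performed an exposed read of the same datum, together with that successor's own successors) can be illustrated with a small software model. This is only a sketch under stated assumptions: the names ViolationTracker, words_per_unit, replay_figure2, and false_sharing_demo are hypothetical, task IDs stand in for sequential order, and real TLS systems keep this information in cache bits or compiler-generated structures rather than in Python sets.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One speculative task; a lower task_id means an earlier (predecessor) task."""
    task_id: int
    written: set = field(default_factory=set)        # tracking units written
    exposed_reads: set = field(default_factory=set)  # units read before being written
    squashes: int = 0


class ViolationTracker:
    """Hypothetical software model of cross-task dependence tracking."""

    def __init__(self, num_tasks, words_per_unit=1):
        # words_per_unit=1 models per-word tracking; a larger value models
        # per-line tracking, where distinct words share one tracking unit.
        self.words_per_unit = words_per_unit
        self.tasks = [Task(i) for i in range(num_tasks)]

    def _unit(self, addr):
        return addr // self.words_per_unit

    def read(self, task_id, addr):
        task = self.tasks[task_id]
        unit = self._unit(addr)
        if unit not in task.written:
            task.exposed_reads.add(unit)  # exposed read: no earlier local write

    def write(self, task_id, addr):
        """Record a write and return the IDs of the tasks it squashes."""
        unit = self._unit(addr)
        self.tasks[task_id].written.add(unit)
        squashed = []
        for succ in self.tasks[task_id + 1:]:
            # Squash the first successor with an exposed read of this unit,
            # and every task after it.
            if squashed or unit in succ.exposed_reads:
                succ.written.clear()
                succ.exposed_reads.clear()
                succ.squashes += 1
                squashed.append(succ.task_id)
        return squashed


def replay_figure2(words_per_unit=1):
    """Replay the Fig. 2 accesses; tasks 0, 1, 2 stand for iterations J, J+1, J+2."""
    t = ViolationTracker(3, words_per_unit)
    t.read(2, 5)           # iteration J+2 reads A[5] too early
    t.write(2, 6)
    t.read(1, 2)
    t.write(1, 2)
    t.read(0, 4)
    return t.write(0, 5)   # iteration J writes A[5]: RAW violation


def false_sharing_demo(words_per_unit):
    """A successor reads word 6, then its predecessor writes word 5."""
    t = ViolationTracker(2, words_per_unit)
    t.read(1, 6)
    return t.write(0, 5)
```

In this model, replay_figure2() returns [2], meaning that only iteration J+2 is squashed. With per-word tracking, false_sharing_demo(1) returns an empty list, while false_sharing_demo(4), which places words 5 and 6 in the same four-word unit, returns [1]: the false-sharing squash described in the text. A write by a successor never squashes anything in this model, which matches the observation that WAR and WAW violations need not cause squashes when versions are buffered separately.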

Thread-Level Speculation. Fig. 2 Example of a data-dependence violation. The loop shown is for (i=0; i<N; i++) { ... = A[L[i]] + ...; A[K[i]] = ...; }. Iteration J reads A[4] and writes A[5], iteration J+1 reads A[2] and writes A[2], and iteration J+2 reads A[5] and writes A[6]; the RAW violation is on A[5]

Finally, while TLS can be applied to various code structures, it is most often applied to loops. In this case, tasks are typically formed by a set of consecutive iterations.

The rest of this article is organized as follows: First, the article briefly classifies TLS schemes. Then, it describes the two major problems that any TLS scheme has to solve, namely, buffering and managing speculative state, and detecting and handling dependence violations. Next, it describes the initial efforts in TLS, other uses of TLS, and machines that use TLS.

Classification of Thread-Level Speculation Schemes
There have been many proposals of TLS schemes. They can be broadly classified depending on the emphasis on hardware versus software, and the type of target machine.

The majority of the proposed schemes use hardware support to detect cross-task dependence violations that result in task squashes (e.g., [1, 4, 6, 8, 11, 12, 14, 16, 18, 20, 23, 27, 28, 31, 32, 36]). Typically, this is attained by using the hardware cache coherence protocol, which sends coherence messages between the caches when multiple processors access the same memory line. Among all these hardware-based schemes, the majority rely on a compiler or a software layer to identify and prepare the tasks that should be executed in parallel. Consequently, there have been several proposals for TLS compilers (e.g., [9, 19, 33, 34]). Very few schemes rely on the hardware to identify the tasks (e.g., [1]).

Several schemes, especially in the early stages of TLS research, proposed software-only approaches to TLS (e.g., [7, 13, 25, 26]). In this case, the compiler typically generates code that causes each task to keep shadow locations and, after the parallel execution, checks if multiple tasks have updated a common location.
If they have, the original state is restored.

Most proposed TLS schemes target small shared-memory machines of about two to eight processors (e.g., [14, 18, 27, 29]). It is in this range of parallelism that TLS is most cost effective. Some TLS proposals have focused on smaller machines and have extended a superscalar core with some hardware units that execute threads speculatively [1, 20]. Finally, some TLS proposals have targeted scalable multiprocessors [4, 23, 28]. This is a more challenging environment, given the longer communication latencies involved. It requires applications that have significant parallelism that cannot be analyzed statically by the compiler.

Buffering and Managing Speculative State
The state produced by speculative tasks is unsafe, since such tasks may be squashed. Therefore, any TLS scheme must be able to identify such state and, when necessary, separate it from the rest of the memory state. For this, TLS systems use structures such as caches [4, 6, 12, 18, 28], special buffers [8, 14, 23, 32], or undo logs [7, 11, 36]. This section outlines the challenges in buffering and managing speculative state. A more detailed analysis and a taxonomy are presented by Garzaran et al. [10].

Multiple Versions of the Same Variable in the System
Every time that a task writes for the first time to a variable, a new version of the variable appears in the system. Thus, two speculative tasks running on different processors may create two different versions of the same variable [4, 12]. These versions need to be buffered separately, and special actions may need to be taken so that a reader task can find the correct version out of the several coexisting in the system. Such a version will be the version created by the producer task that is the closest predecessor of the reader task.

A task has at most a single version of any given variable, even if it writes to the variable multiple times.
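The version-selection rule above (a reader uses its own version if it has one, otherwise the version of its closest predecessor, otherwise main memory) can be sketched as follows. This is a minimal illustration and not a description of any particular hardware: the function name find_version, the dictionary-based buffers, and the use of integer task IDs for sequential order are all assumptions made for the example.

```python
def find_version(reader_id, addr, spec_buffers, main_memory):
    """Return (producer_id, value) for a load issued by task reader_id.

    spec_buffers maps task_id -> {addr: value} for every task that still
    holds buffered speculative state; a lower task_id is an earlier task.
    producer_id is None when the value comes from main memory.
    """
    # Search the reader's own buffer first, then ever-older predecessors.
    candidates = sorted((t for t in spec_buffers if t <= reader_id), reverse=True)
    for task_id in candidates:
        if addr in spec_buffers[task_id]:
            return task_id, spec_buffers[task_id][addr]
    # No buffered version exists: use the nonspeculative memory state.
    return None, main_memory[addr]


# Hypothetical state: tasks 1 and 3 each hold a version of "x".
buffers = {1: {"x": 10}, 3: {"x": 30}, 4: {"y": 7}}
memory = {"x": 0, "y": 0, "z": 5}
```

With this state, task 2 reading "x" obtains the version produced by task 1, task 4 reading "x" obtains the version produced by task 3, and task 0 reading "x" falls through to main memory. Versions created by successors of the reader are never visible to it, which is why each buffer holds one entry per variable per task.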

A single version per task suffices because, on a dependence violation, the whole task is undone. Therefore, there is no need to keep intermediate values of the variable.

Multiple Speculative Tasks per Processor
When a processor finishes executing a task, the task may still be speculative. If the TLS buffering support is such that the processor can only hold state from a single speculative task, the processor stalls until the task commits. However, to better tolerate task load imbalance, the local buffer may have been designed to buffer state from several speculative tasks, enabling the processor to execute another speculative task. In this case, the state of each task must be tagged with the ID of the task.

Multiple Versions of the Same Variable in a Single Processor
When a processor buffers state from multiple speculative tasks, it is possible that two such tasks create two versions of the same variable. This occurs in load-imbalanced applications that exhibit private data patterns (i.e., WAW dependences between tasks). In this case, the buffer will have to hold multiple versions of the same variable. Each version will be tagged with a different task ID. This support introduces complication to the buffer or cache. Indeed, on an external request, extra comparisons will need to be done if the cache has two versions of the same variable.

Merging of Task State
The state produced by speculative tasks is typically merged with main memory at task commit time; however, it can instead be merged as it is being generated. The first approach is called Architectural Main Memory (AMM) or Lazy Version Management; the second one is called Future Main Memory (FMM) or Eager Version Management. These schemes differ on whether the main memory contains only safe data (AMM) or can also contain speculative data (FMM).

In AMM systems, all speculative versions remain in caches or buffers that are kept separate from the coherent memory state. Only when a task becomes nonspeculative can its buffered state be merged with main memory.
In a straightforward implementation, when a task commits, all the buffered dirty cache lines are merged with main memory, either by writing back the lines to memory [4] or by requesting ownership for them to obtain coherence with main memory [28].

In FMM systems, versions from speculative tasks are merged with the coherent memory when they are generated. However, to enable recovery from task squashes, when a task generates a speculative version of a variable, the previous version of the variable is saved in a log.

Note that, in both approaches, the coherent memory state can temporarily reside in caches, which function in their traditional role of extensions of main memory.

Detecting and Handling Dependence Violations

Basic Concepts
The second aspect of TLS involves detecting and handling dependence violations. Most TLS proposals focus on data dependences, rather than control dependences. To detect (cross-task) data-dependence violations, most TLS schemes use the same approach. Specifically, when a speculative task writes a datum, the hardware sets a Speculative Write bit associated with the datum in the cache; when a speculative task reads a datum before it writes to it (an event called Exposed Read), the hardware sets an Exposed Read bit. Depending on the TLS scheme supported, these accesses also cause a tag associated with the datum to be set to the ID of the task.

In addition, when a task writes a datum, the cache coherence protocol transaction that sends invalidations to other caches checks these bits. If a successor task has its Exposed Read bit set for the datum, the successor task has prematurely read the datum (i.e., this is a RAW dependence violation), and is squashed [18].

If the Speculative Write and Exposed Read bits are kept on a per-word basis, only dependences on the same word can cause squashes.
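The difference between the two version-management policies described earlier in this section, AMM (lazy) and FMM (eager), comes down to where a speculative version lives and what a squash has to undo. The sketch below models a single task under each policy. The class names LazyTask and EagerTask, the dictionary used as main memory, and the _ABSENT sentinel are assumptions made for illustration; actual schemes implement these policies in caches, buffers, and logs, and must also coordinate many tasks at once.

```python
_ABSENT = object()  # marks an address that had no previous version


class LazyTask:
    """AMM-style task: versions stay in a private buffer until commit."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}

    def write(self, addr, value):
        self.buffer[addr] = value  # main memory is left untouched

    def commit(self):
        self.memory.update(self.buffer)  # merge at commit time
        self.buffer.clear()

    def squash(self):
        self.buffer.clear()  # discarding the buffer is sufficient


class EagerTask:
    """FMM-style task: versions reach memory at once; old values are logged."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []

    def write(self, addr, value):
        # Save the previous version before overwriting it.
        self.undo_log.append((addr, self.memory.get(addr, _ABSENT)))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()  # memory already holds the new versions

    def squash(self):
        # Restore previous versions in reverse order of the writes.
        for addr, old in reversed(self.undo_log):
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()
```

Under the lazy policy a commit does the merging work and a squash is cheap, whereas under the eager policy a commit is cheap and a squash must replay the log backwards. This trade-off is one reason the choice between AMM and FMM depends on how often squashes are expected.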
However, keeping and maintaining such bits on a per-word basis in caches, network messages, and perhaps directory modules is costly in hardware. Moreover, it does not come naturally to the coherence protocol of multiprocessors, which operates at the granularity of memory lines.

Keeping these bits on a per-line basis is cheaper and compatible with mainstream cache coherence protocols. However, the hardware cannot then disambiguate accesses at word level. Furthermore, it cannot combine different versions of a line that have been updated in different words. Consequently, cross-task RAW and WAW violations, on both the same word and different words of a line (i.e., false sharing), cause squashes.

Task squash is a very costly operation. The cost is threefold: overhead of the squash operation itself, loss of whatever correct work has already been performed by the offending task and its successors, and cache misses in the offending task and its successors needed to reload state when restarting. The last overhead appears because, as part of the squash operation, the speculative state in the cache is invalidated. Figure 3a shows an example of a RAW violation across tasks i and i+j+1. The consumer task and its successors are squashed.

Techniques to Avoid Squashes
Since squashes are so expensive, there are techniques to avoid them. If the compiler can conclude that a certain pair of accesses will frequently cause a data-dependence violation, it can statically insert a synchronization operation that forces the correct task ordering at runtime.

Alternatively, the machine can have hardware support that records, at runtime, where dependence violations occur. Such hardware may record the program counter of the read or writes involved, or the address of the memory location being accessed. Based on this information, when these program counters are reached or the memory location is accessed, the hardware can try one of several techniques to avoid the violation. This section outlines some of the techniques that can be used. A more complete description of the choices is presented by Cintra and Torrellas [5]. Without loss of generality, a RAW violation is assumed.

Based on past history, the predictor may predict that the pair of conflicting accesses are engaged in false sharing. In this case, it can simply allow the read to proceed and then the subsequent write to execute silently, without sending invalidations.
Later, before the consumer task is allowed to commit, it is necessary to check whether the sections of the line read by the consumer overlap with the sections of the line written by the producer. This can be easily done if the caches have per-word access bits. If there is no overlap, it was false sharing and the squash is avoided. Figure 3b shows the resulting time line.

When there is a true data dependence between tasks, a squash can be avoided with effective use of value prediction. Specifically, the predictor can predict the value that the producer will produce, speculatively provide it to the consumer's read, and let the consumer proceed.

Thread-Level Speculation. Fig. 3 RAW data-dependence violation that results in a squash (a) or that does not cause a squash due to false sharing or value prediction (b), or consumer stall (c and d). Legend: useful work; wasted correct work; possibly incorrect work; squash overhead; checking overhead; stall overhead
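The commit-time check for predicted false sharing described above reduces to testing whether two sets of per-word access bits intersect. The sketch below shows that test with integer bitmasks, one bit per word of a line. The function names word_mask and consumer_may_commit are hypothetical, and the encoding of access bits as a Python integer is an assumption made for illustration.

```python
def word_mask(word_indices):
    """Build a bitmask with one bit set per accessed word of a memory line."""
    mask = 0
    for w in word_indices:
        mask |= 1 << w
    return mask


def consumer_may_commit(consumer_read_mask, producer_write_mask):
    """Return True if the two tasks touched disjoint words of the line."""
    return (consumer_read_mask & producer_write_mask) == 0
```

If the consumer read words 0 and 1 of a line and the producer wrote word 3, the masks do not intersect, so the prediction of false sharing was right and the consumer can commit. If the producer wrote word 1 instead, the masks intersect, the dependence was real, and the consumer has to be squashed after all.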

As with predicted false sharing, before the consumer is allowed to commit, it is necessary to check that the predicted value provided to it was correct. The timeline is also shown in Fig. 3b.

In cases where the predictor is unable to predict the value, it can avoid the squash by stalling the consumer task at the time of the read. This case can use two possible approaches. An aggressive approach is to release the consumer task and let it read the current value as soon as the predicted producer task commits. The time line is shown in Fig. 3c. In this case, if an intervening task between the first producer and the consumer later writes the line, the consumer will be squashed. A more conservative approach is not to release the consumer task until it becomes nonspeculative. In this case, the presence of multiple predecessor writers will not squash the consumer. The time line is shown in Fig. 3d.

Initial Efforts in Thread-Level Speculation
An early proposal for hardware support for a form of speculative parallelization was made by Knight [16] in the context of functional languages. Later, the Multiscalar processor [27] was the first proposal to use a form of TLS within a single-chip multithreaded architecture. A software-only form of TLS was proposed in the LRPD test [25]. Early proposals of hardware-based TLS include the work of several authors [14, 17, 21, 29, 35].

Other Uses of Thread-Level Speculation
TLS concepts have been used in environments that have goals other than trying to parallelize sequential programs. For example, they have been used to speed up explicitly parallel programs through Speculative Synchronization [22], or for parallel program debugging [24] or program monitoring [37]. Similar concepts to TLS have been used in systems supporting hardware transactional memory [15] and continuous atomic-block operation [30].

Machines that Use Thread-Level Speculation
Several machines built by computer manufacturers have hardware support for some form of TLS, although the specific implementation details are typically not disclosed.
Such machines include systems designed for Java applications such as Sun Microsystems' MAJC chip [31] and Azul Systems' Vega processor [2]. The most high-profile system with hardware support for speculative threads is Sun Microsystems' ROCK processor [3]. Other manufacturers are rumored to be developing prototypes with similar hardware.

Related Entries
Instruction-Level Speculation
Speculative Synchronization
Transactional Memory

Bibliography
1. Akkary H, Driscoll M (1998) A dynamic multithreading processor. In: International symposium on microarchitecture, Dallas, November
2. Azul Systems. Vega 3 Processor products/vega/processor
3. Chaudhry S, Cypher R, Ekman M, Karlsson M, Landin A, Yip S, Zeffer H, Tremblay M (2009) Simultaneous speculative threading: a novel pipeline architecture implemented in Sun's ROCK Processor. In: International symposium on computer architecture, Austin, June
4. Cintra M, Martínez JF, Torrellas J (2000) Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In: International symposium on computer architecture, Vancouver, June 2000, pp
5. Cintra M, Torrellas J (2002) Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. In: Proceedings of the 8th High-Performance computer architecture conference, Boston, Feb
6. Figueiredo R, Fortes J (2001) Hardware support for extracting coarse-grain speculative parallelism in distributed shared-memory multiprocessors. In: Proceedings of the international conference on parallel processing, Valencia, Spain, September
7. Frank M, Lee W, Amarasinghe S (2001) A software framework for supporting general purpose applications on raw computation fabrics. Technical report, MIT/LCS Technical Memo MIT-LCS-TM-619, July
8. Franklin M, Sohi G (1996) ARB: a hardware mechanism for dynamic reordering of memory references. IEEE Trans Comput (5):
9. Garcia C, Madriles C, Sanchez J, Marcuello P, Gonzalez A, Tullsen D (2005) Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices. In: Conference on programming language design and implementation, Chicago, Illinois, June
10. Garzarán M, Prvulovic M, Llabería J, Viñals V, Rauchwerger L, Torrellas J (2005) Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. ACM Trans Archit Code Optim
11. Garzaran MJ, Prvulovic M, Llabería JM, Viñals V, Rauchwerger L, Torrellas J (2003) Using software logging to support multi-version buffering in thread-level speculation. In: International conference on parallel architectures and compilation techniques, New Orleans, Sept
12. Gopal S, Vijaykumar T, Smith J, Sohi G (1998) Speculative versioning cache. In: International symposium on high-performance computer architecture, Las Vegas, Feb
13. Gupta M, Nim R (1998) Techniques for speculative run-time parallelization of loops. In: Proceedings of supercomputing 1998, ACM Press, Melbourne, Australia, Nov
14. Hammond L, Willey M, Olukotun K (1998) Data speculation support for a chip multiprocessor. In: International conference on architectural support for programming languages and operating systems, San Jose, California, Oct 1998, pp
15. Herlihy M, Moss E (1993) Transactional memory: architectural support for lock-free data structures. In: International symposium on computer architecture, IEEE Computer Society Press, San Diego, May
16. Knight T (1986) An architecture for mostly functional languages. In: ACM lisp and functional programming conference, ACM Press, New York, Aug 1986, pp
17. Krishnan V, Torrellas J (1998) Hardware and software support for speculative execution of sequential binaries on a chip-multiprocessor. In: International conference on supercomputing, Melbourne, Australia, July
18. Krishnan V, Torrellas J (1999) A chip-multiprocessor architecture with speculative multithreading. IEEE Trans Comput 48(9):
19. Liu W, Tuck J, Ceze L, Ahn W, Strauss K, Renau J, Torrellas J (2006) POSH: A TLS compiler that exploits program structure. In: International symposium on principles and practice of parallel programming, San Diego, Mar
20. Marcuello P, Gonzalez A (1999) Clustered speculative multithreaded processors. In: International conference on supercomputing, Rhodes, Island, June 1999, pp
21. Marcuello P, Gonzalez A, Tubella J (1998) Speculative multithreaded processors. In: International conference on supercomputing, ACM, Melbourne, Australia, July
22. Martinez J, Torrellas J (2002) Speculative synchronization: applying thread-level speculation to explicitly parallel applications. In: International conference on architectural support for programming languages and operating systems, San Jose, Oct
23. Prvulovic M, Garzaran MJ, Rauchwerger L, Torrellas J (2001) Removing architectural bottlenecks to the scalability of speculative parallelization. In: Proceedings of the 28th international symposium on computer architecture (ISCA 01), New York, June 2001, pp
24. Prvulovic M, Torrellas J (2003) ReEnact: using thread-level speculation to debug data races in multithreaded codes. In: International symposium on computer architecture, San Diego, June
25. Rauchwerger L, Padua D (1995) The LRPD test: speculative runtime parallelization of loops with privatization and reduction parallelization. In: Conference on programming language design and implementation, La Jolla, California, June
26. Rundberg P, Stenstrom P (2000) Low-cost thread-level data dependence speculation on multiprocessors. In: Fourth workshop on multithreaded execution, architecture and compilation, Monterrey, Dec
27. Sohi G, Breach S, Vijaykumar T (1995) Multiscalar processors. In: International Symposium on computer architecture, ACM Press, New York, June
28. Steffan G, Colohan C, Zhai A, Mowry T (2000) A scalable approach to thread-level speculation. In: Proceedings of the 27th Annual International symposium on computer architecture, Vancouver, June 2000, pp
29. Steffan G, Mowry TC (1998) The potential for using thread-level data speculation to facilitate automatic parallelization. In: International symposium on high-performance computer architecture, Las Vegas, Feb
30. Torrellas J, Ceze L, Tuck J, Cascaval C, Montesinos P, Ahn W, Prvulovic M (2009) The bulk multicore architecture for improved programmability. Communications of the ACM, New York
31. Tremblay M (1999) MAJC: microprocessor architecture for java computing. Hot Chips, Palo Alto, Aug
32. Tsai J, Huang J, Amlo C, Lilja D, Yew P (1999) The superthreaded processor architecture. IEEE Trans Comput 48(9):
33. Vijaykumar T, Sohi G (1998) Task selection for a multiscalar processor. In: International symposium on microarchitecture, Dallas, Nov 1998, pp
34. Zhai A, Colohan C, Steffan G, Mowry T (2002) Compiler optimization of scalar value communication between speculative threads. In: International conference on architectural support for programming languages and operating systems, San Jose, Oct
35. Zhang Y, Rauchwerger L, Torrellas J (1998) Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors. In: Proceedings of the 4th International symposium on high-performance computer architecture (HPCA), Phoenix, Feb 1998, pp
36. Zhang Y, Rauchwerger L, Torrellas J (1999) Hardware for speculative parallelization of partially-parallel loops in DSM multiprocessors. In: Proceedings of the 5th international symposium on high-performance computer architecture, Orlando, Jan 1999, pp
37. Zhou P, Qin F, Liu W, Zhou Y, Torrellas J (2004) iWatcher: efficient architectural support for software debugging. In: International symposium on computer architecture, IEEE Computer society, München, June
