CS61C: Great Ideas in Computer Architecture
Miki Lustig

Review of Last Lecture:
- Multiprocessor systems use shared memory (single address space)
- Cache coherence implements shared memory even with multiple copies in multiple caches
- Tracks state of blocks relative to other caches (e.g. MOESI protocol)
- False sharing a concern

Great Idea #4: Parallelism (Garcia)
- Parallel Requests: assigned to computer, e.g. search
- Parallel Threads: assigned to core, e.g. lookup, ads
- Parallel Instructions: >1 instruction @ one time, e.g. 5 pipelined instructions
- Parallel Data: >1 data item @ one time, e.g. add of 4 pairs of words
- Hardware descriptions: all gates functioning in parallel at same time
- Leverage parallelism & achieve high performance: software (synchronization, OpenMP) and hardware
(Figure: hierarchy from warehouse scale computer down to computer, core, and logic gates; a core contains instruction unit(s), functional unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3, cache memory; a computer adds memory and input/output.)

Agenda:
- Synchronization: A Crash Course
- Administrivia
- OpenMP Introduction
- OpenMP Directives: Workshare, Synchronization
- Bonus: Common OpenMP Pitfalls

Data Races and Synchronization:
- Two memory accesses form a data race if different threads access the same location, at least one is a write, and they occur one after another
- Means that the result of a program can vary depending on chance (which thread ran first?)
- Avoid data races by synchronizing writing and reading to get deterministic behavior
- Synchronization done by user-level routines that rely on hardware synchronization instructions

Analogy: Buying Milk
- Your fridge has no milk. You and your roommate will return from classes at some point and check the fridge
- Whoever gets home first will check the fridge, go and buy milk, and return
- What if the other person gets back while the first person is buying milk? You've just bought twice as much milk as you need!
- It would've helped to have left a note
Lock Synchronization (1/2):
- Use a "Lock" to grant access to a region (critical section) so that only one thread can operate at a time
- Need all processors to be able to access the lock, so use a location in shared memory as the lock
- Processors read the lock and either wait (if locked) or set the lock and go into the critical section
- 0 means lock is free/open/unlocked (lock off); 1 means lock is set/closed/locked (lock on)

Lock Synchronization (2/2):
Pseudocode:
    Check lock          <- can loop/idle here if locked
    Set the lock
    Critical section (e.g. change shared variables)
    Unset the lock

Possible Lock Implementation:
Lock (a.k.a. busy wait):
    Get_lock:              # $s0 -> addr of lock
          addiu $t1,$zero,1     # t1 = Locked value
    Loop: lw    $t0,0($s0)      # load lock
          bne   $t0,$zero,Loop  # loop if locked
    Lock: sw    $t1,0($s0)      # Unlocked, so lock
Unlock:
    Unlock: sw $zero,0($s0)
Any problems with this?

Possible Lock Problem:
Thread 1 and Thread 2 each execute, interleaved in time:
    addiu $t1,$zero,1
    Loop: lw  $t0,0($s0)
          bne $t0,$zero,Loop
    Lock: sw  $t1,0($s0)
If both threads load the lock as 0 before either stores, both threads think they have set the lock. Exclusive access not guaranteed!

Hardware Synchronization:
- Hardware support required to prevent an interloper (another thread) from changing the value
- Atomic read/write memory operation: no other access to the location allowed between the read and write
- How best to implement in software? Single instruction (atomic swap of register and memory)? Pair of instructions (one for read, one for write)?

Synchronization in MIPS:
- Load linked:        ll rt,off(rs)
- Store conditional:  sc rt,off(rs)
- sc returns 1 (success) if location has not changed since the ll; returns 0 (failure) if location has changed
- Note that sc clobbers the register value being stored (rt): need to have a copy elsewhere if you plan on repeating on failure or using the value later
Synchronization in MIPS Example:
Atomic swap (to test/set lock variable). Exchange contents of register and memory: $s4 <-> Mem($s1)
    try: add $t0,$zero,$s4   # copy value
         ll  $t1,0($s1)      # load linked
         sc  $t0,0($s1)      # store conditional
         beq $t0,$zero,try   # loop if sc fails
         add $s4,$zero,$t1   # load value in $s4
sc would fail if another thread executes an sc here (between the ll and this sc)

Test-and-Set:
In a single atomic operation:
- Test to see if a memory location is set (contains a 1)
- Set it (to 1) if it isn't (it contained a zero when tested); otherwise indicate that the Set failed, so the program can try again
- While accessing, no other instruction can modify the memory location, including other test-and-set instructions
- Useful for implementing lock operations

Test-and-Set in MIPS:
Example: MIPS sequence for implementing a T&S at ($s1):
    Try:    addiu $t0,$zero,1
            ll    $t1,0($s1)
            bne   $t1,$zero,Try
            sc    $t0,0($s1)
            beq   $t0,$zero,Try
    Locked: # critical section
    Unlock: sw $zero,0($s1)

Question: Consider the following code when executed concurrently by two threads. What possible values can result in *($s0)?
    # *($s0) = 100
    lw   $t0,0($s0)
    addi $t0,$t0,1
    sw   $t0,0($s0)
- 101 or 102
- 100, 101, or 102
- 100 or 101

Administrivia:
- Project 2: MapReduce. Part 1 due Oct 27th, Part 2 due Nov 2nd
- We received a grant from Amazon totalling >$60k for Project 2
What is OpenMP?
- API used for multi-threaded, shared-memory parallelism: compiler directives, runtime library routines, environment variables
- Portable, standardized
- Resources: http://www.openmp.org/ and http://computing.llnl.gov/tutorials/openmp/

OpenMP Specification (diagram slide)

Shared Memory Model with Explicit Thread-based Parallelism:
- Multiple threads in a shared memory environment; explicit programming model with full programmer control over parallelization
- Pros: takes advantage of shared memory, programmer need not worry (that much) about data placement; compiler directives are simple and easy to use; legacy serial code does not need to be rewritten
- Cons: code can only be run in shared memory environments; compiler must support OpenMP (e.g. gcc 4.2)

OpenMP in CS61C:
- OpenMP is built on top of C, so you don't have to learn a whole new programming language
- Make sure to add #include <omp.h>
- Compile with flag: gcc -fopenmp
- Mostly just a few lines of code to learn
- You will NOT become experts at OpenMP; use slides as reference, will learn to use in lab
- Key ideas: shared vs. private variables; OpenMP directives for parallelization, work sharing, synchronization

OpenMP Programming Model:
Fork-Join Model:
- OpenMP programs begin as a single process (master thread) and execute sequentially until the first parallel region construct is encountered
- FORK: master thread then creates a team of parallel threads; statements in the program that are enclosed by the parallel region construct are executed in parallel among the various threads
- JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread
OpenMP Extends C with Pragmas:
- Pragmas are a preprocessor mechanism C provides for language extensions
- Commonly implemented pragmas: structure packing, symbol aliasing, floating point exception modes (not covered in 61C)
- Good mechanism for OpenMP because compilers that don't recognize a pragma are supposed to ignore it
- Runs on a sequential computer even with embedded pragmas

parallel Pragma and Scope:
Basic OpenMP construct for parallelization:
    #pragma omp parallel
    {
        /* code goes here */
    }
- Each thread runs a copy of the code within the block
- Thread scheduling is non-deterministic
- OpenMP default is shared variables; to make one private, declare it with the pragma: #pragma omp parallel private(x)
- This is annoying, but the curly brace MUST go on a separate line from the #pragma

Thread Creation:
- How many threads will OpenMP create? Defined by the OMP_NUM_THREADS environment variable (or a code procedure call)
- Set this variable to the maximum number of threads you want OpenMP to use
- Usually equals the number of cores in the underlying hardware on which the program is run

OMP_NUM_THREADS:
- OpenMP intrinsic to set number of threads:  omp_set_num_threads(x);
- OpenMP intrinsic to get number of threads:  num_th = omp_get_num_threads();
- OpenMP intrinsic to get thread ID number:   th_id = omp_get_thread_num();

Parallel Hello World:
    #include <stdio.h>
    #include <omp.h>
    int main () {
        int nthreads, tid;
        /* Fork team of threads with private var tid */
        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();  /* get thread id */
            printf("Hello World from thread = %d\n", tid);
            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }  /* All threads join master and terminate */
        return 0;
    }
OpenMP Directives (Work-Sharing):
These are defined within a parallel section:
- for: shares iterations of a loop across the threads
- sections: each section is executed by a separate thread
- single: serializes the execution of a thread

Parallel Statement Shorthand:
    #pragma omp parallel
    {
        #pragma omp for
        for(i=0;i<len;i++) { ... }
    }
can be shortened to:
    #pragma omp parallel for
    for(i=0;i<len;i++) { ... }
(when this is the only directive in the parallel section). Also works for sections.

Building Block: for loop
Parallel for pragma:
    #pragma omp parallel for
    for (i=0; i<max; i++)
        zero[i] = 0;
- Breaks for-loop into chunks, and allocates each to a separate thread; e.g. if max = 100 with 2 threads: assign 0-49 to thread 0, and 50-99 to thread 1
- Loop must have a relatively simple "shape" for an OpenMP-aware compiler to be able to parallelize it; necessary for the run-time system to be able to determine how many of the loop iterations to assign to each thread
- No premature exits from the loop allowed, i.e. no break, return, exit, goto statements. In general, don't jump outside of any pragma block

More on parallel for:
- Master thread creates additional threads, each with a separate execution context
- All variables declared outside the for loop are shared by default, except for the loop index, which is private per thread (why?)
- Implicit synchronization at end of for loop
- Divide index regions sequentially per thread: thread 0 gets 0, 1, ..., (max/n)-1; thread 1 gets max/n, max/n + 1, ..., 2*(max/n)-1. Why?

OpenMP Timing:
- Elapsed wall clock time: double omp_get_wtime(void); returns elapsed wall clock time in seconds
- Time is measured per thread; no guarantee can be made that two distinct threads measure the same time
- Time is measured from "some time in the past," so subtract results of two calls to omp_get_wtime to get elapsed time

Matrix Multiply in OpenMP:
    start_time = omp_get_wtime();
    #pragma omp parallel for private(tmp, i, j, k)
    for (i=0; i<Mdim; i++){
        for (j=0; j<Ndim; j++){
            tmp = 0.0;
            for( k=0; k<Pdim; k++){
                /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
                tmp += *(A+(i*Pdim+k)) * *(B+(k*Ndim+j));
            }
            *(C+(i*Ndim+j)) = tmp;
        }
    }
    run_time = omp_get_wtime() - start_time;
Outer loop spread across N threads; inner loops run inside a single thread.
Notes on Matrix Multiply Example:
More performance optimizations available:
- Higher compiler optimization (-O2, -O3) to reduce number of instructions executed
- Cache blocking to improve memory performance
- Using SIMD SSE instructions to raise floating point computation rate (DLP)

OpenMP Directives (Synchronization):
These are defined within a parallel section:
- master: code block executed only by the master thread (all other threads skip)
- critical: code block executed by only one thread at a time
- atomic: specific memory location must be updated atomically (like a mini-critical section for writing to memory); applies to a single statement, not a code block

OpenMP Reduction:
- Reduction specifies that one or more private variables are the subject of a reduction operation at the end of the parallel region
- Clause: reduction(operation:var)
  - Operation: operator to perform on the variables at the end of the parallel region
  - Var: one or more variables on which to perform scalar reduction
    #pragma omp for reduction(+:nSum)
    for (i = START ; i <= END ; i++)
        nSum += i;

Summary:
- Data races lead to subtle parallel bugs
- Synchronization via hardware primitives: MIPS does it with Load Linked + Store Conditional
- OpenMP as simple parallel extension to C
- During parallel fork, be aware of which variables should be shared vs. private among threads
- Work sharing accomplished with for/sections
- Synchronization accomplished with critical/atomic/reduction

Bonus Slides:
You are responsible for the material contained on the following slides, though we may not have enough time to get to them in lecture. They have been prepared in a way that should be easily readable, and the material will be touched upon in the following lecture.
- Bonus: Common OpenMP Pitfalls
OpenMP Pitfall #1: Data Dependencies
Consider the following code:
    a[0] = 1;
    for(i=1; i<5000; i++)
        a[i] = i + a[i-1];
- There are dependencies between loop iterations
- Splitting this loop between threads does not guarantee in-order execution
- Out-of-order loop execution will result in undefined behavior (i.e. likely wrong result)

OpenMP Pitfall #2: Sharing Issues
Consider the following loop:
    #pragma omp parallel for
    for(i=0; i<n; i++){
        temp = 2.0*a[i];
        a[i] = temp;
        b[i] = c[i]/temp;
    }
temp is a shared variable! Fix by making it private:
    #pragma omp parallel for private(temp)
    for(i=0; i<n; i++){
        temp = 2.0*a[i];
        a[i] = temp;
        b[i] = c[i]/temp;
    }

OpenMP Pitfall #3: Updating Shared Variables Simultaneously
Now consider a global sum:
    for(i=0; i<n; i++)
        sum = sum + a[i];
This can be done by surrounding the summation by a critical/atomic section or a reduction clause:
    #pragma omp parallel for reduction(+:sum)
    for(i=0; i<n; i++)
        sum = sum + a[i];
Compiler can generate highly efficient code for reduction.

OpenMP Pitfall #4: Parallel Overhead
- Spawning and releasing threads results in significant overhead
- Better to have fewer but larger parallel regions
- Parallelize over the largest loop that you can (even though it will involve more work to declare all of the private variables and eliminate dependencies)

OpenMP Pitfall #4: Parallel Overhead (example)
    start_time = omp_get_wtime();
    for (i=0; i<Ndim; i++){
        for (j=0; j<Mdim; j++){
            tmp = 0.0;
            #pragma omp parallel for reduction(+:tmp)
            for( k=0; k<Pdim; k++){
                /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
                tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
            }
            *(C+(i*Ndim+j)) = tmp;
        }
    }
    run_time = omp_get_wtime() - start_time;
Too much overhead in thread generation to have this statement run this frequently. Poor choice of loop to parallelize.