Today's Outline
Architectures, Programming and Synchronization

Discuss paper on Cosmic Cube (message passing)

Message passing review
Cosmic Cube discussion
  > Message passing machine
Shared memory model
  > Communication
  > Synchronization
Ultracomputer/RP3 discussion
  > Shared memory machine
Shared memory programming
Fine grain versus coarse grain parallelism
How do caches change things
  > Improve and complicate!
  > Beehive

6.173 Fall 2010 L07 Agarwal

Review: Message Passing Parallel Programming Model

[Figure: Process A and Process B, each with a private memory, exchanging a message]
E.g., MPI
Communication: via messages
Synchronization: via messages

How to Receive a Message

[Figure: Core 0 and Core 1 with local caches and a message passing between them. Labels: P, stio, ldio, ld]
Beehive uses polling
Wait in loop if no msg
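
The polling receive above ("wait in loop if no msg") can be sketched in C. This is a simulation and not Beehive code: `mailbox_t`, `msg_send`, `msg_poll`, and `msg_receive` are hypothetical names, and a one-slot mailbox in ordinary memory stands in for the hardware message queue. The receive gives up after a bounded number of polls so the sketch cannot hang.

```c
#include <stdbool.h>

/* Hypothetical one-slot mailbox standing in for a core's message queue.
   On a real multicore the flag would need to be volatile or atomic. */
typedef struct {
    bool full;     /* true while an unread message is waiting */
    int  payload;  /* the message body (one word, for simplicity) */
} mailbox_t;

/* Sender side: deposit a message. Returns false if the slot is occupied. */
bool msg_send(mailbox_t *mb, int payload)
{
    if (mb->full)
        return false;
    mb->payload = payload;
    mb->full = true;
    return true;
}

/* One polling step: if a message is waiting, copy it to *out and
   return true; otherwise return false immediately. */
bool msg_poll(mailbox_t *mb, int *out)
{
    if (!mb->full)
        return false;
    *out = mb->payload;
    mb->full = false;
    return true;
}

/* Receive by polling in a loop. Returns the number of polls used,
   or -1 if no message arrived within max_polls attempts. */
int msg_receive(mailbox_t *mb, int *out, int max_polls)
{
    for (int polls = 1; polls <= max_polls; polls++) {
        if (msg_poll(mb, out))
            return polls;
    }
    return -1;
}
```

The cost of this style is visible in `msg_receive`: every failed poll is processor time spent doing nothing useful.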

The Cosmic Cube: The Earliest Message Passing Machine

[Figure: nodes 0, 1, 2, 3, 4, ... 7, each with a processor (P), a memory (M), and a message interface. Process A is waiting on a reply from process B. Labels: Dist=1, Dist=3, Dist=7, "Switch process to hide latency"]

64 nodes (remember, multicores on a single chip arrived circa 2000)
Direct network: hypercube (details later in course)
Private memories
Message sends by calling into OS
Routing in software
Sequential programming on each processor & message send/receive (much like Beehive)
Hide comm latency by switching processes
Simple hardware

Discuss paper
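
To make "direct network: hypercube" and "routing in software" concrete, here is a minimal sketch of hop counting and next-hop selection on a hypercube. In a hypercube, two nodes are neighbors when their addresses differ in exactly one bit, so 64 nodes use 6-bit addresses and any two nodes are at most 6 hops apart. The function names are hypothetical, and correcting the lowest differing bit first is an assumed routing order, not something the slides state about the Cosmic Cube.

```c
#include <stdint.h>

/* Number of hops between two hypercube nodes: the number of address
   bits in which they differ (the Hamming distance). */
int hypercube_distance(uint32_t src, uint32_t dst)
{
    uint32_t diff = src ^ dst;
    int hops = 0;
    while (diff != 0) {
        hops += (int)(diff & 1u);
        diff >>= 1;
    }
    return hops;
}

/* Assumed routing rule: move along the lowest dimension in which the
   current node still differs from the destination. Returns cur itself
   when the message has already arrived. */
uint32_t hypercube_next_hop(uint32_t cur, uint32_t dst)
{
    uint32_t diff = cur ^ dst;
    if (diff == 0)
        return cur;
    uint32_t lowest = diff & (~diff + 1u);  /* isolate the lowest set bit */
    return cur ^ lowest;
}
```

For example, a message from node 0 to node 6 would visit node 2 and then node 6 under this rule, since the addresses differ in bits 1 and 2.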

Next, Recall: Parallel Programming Model

[Figure labels: Blackboard captures state; Designers; Threads; Shared memory; memory lock]
E.g., pthreads
Communication: via shared memory
Synchronization: shared memory locks

Ultracomputer Design

[Figure: processors connected to memories M0, M1, ... through a multistage network. This is a butterfly network variant of the Omega network.]
Indirect network: Omega network (details later in course)
Shared memory machine
Communication/synchronization through shared memory
Hardware routing of memory requests
No latency hiding: wait for memory request
Concept built as IBM RP3 machine (we will see this later)

Popularized SPMD Programming (Single-program multiple-data)

Annotate sequential programs
[Figure: a program split into DO sections labeled P_A, R_A, P_B, S_A, P_C, P_D, separated by barrier synchronization, with example statements Glob_C=5 and Glob_Z=Glob_Z+1. Legend: Parallel section, Replicate section, Serial section]
You will do this in lab 4

Quick Detour: Barrier Synchronization

[Figure: processors P running add, sub, or, xor over time; each reaches the barrier and waits, then all get "OK to proceed"]
Barrier synchronization applies to a set of processes
A process that executes a barrier must wait until all other processes have executed their barrier
Discuss how to do barrier on Beehive using message passing
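
The barrier rule above (a process that reaches the barrier waits until all others have reached it) can be sketched for a shared memory machine with pthreads. This is not the Beehive message-passing barrier left for discussion; it is a counting barrier built from a mutex and a condition variable, and every name in it (`barrier_t`, `barrier_wait`, `barrier_demo`, `NPROC`) is hypothetical.

```c
#include <pthread.h>

/* Hypothetical reusable counting barrier for a fixed set of n threads. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int      n;        /* number of participants */
    int      waiting;  /* arrivals so far in the current round */
    unsigned round;    /* bumped each time the barrier opens */
} barrier_t;

void barrier_init(barrier_t *b, int n)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->n = n;
    b->waiting = 0;
    b->round = 0;
}

void barrier_wait(barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    unsigned my_round = b->round;
    if (++b->waiting == b->n) {
        /* Last arrival: reset for reuse and release everyone. */
        b->waiting = 0;
        b->round++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        /* Wait until the round number changes (guards against
           spurious wakeups). */
        while (b->round == my_round)
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

#define NPROC 4
static barrier_t demo_barrier;
static int ids[NPROC], phase1_done[NPROC], saw_all[NPROC];

static void *demo_worker(void *arg)
{
    int id = *(int *)arg;
    phase1_done[id] = 1;              /* work before the barrier */
    barrier_wait(&demo_barrier);
    int count = 0;                    /* after it, all phase-1 work is visible */
    for (int p = 0; p < NPROC; p++)
        count += phase1_done[p];
    saw_all[id] = (count == NPROC);
    return NULL;
}

/* Returns how many threads saw every thread's phase-1 result. */
int barrier_demo(void)
{
    pthread_t t[NPROC];
    barrier_init(&demo_barrier, NPROC);
    for (int p = 0; p < NPROC; p++) {
        ids[p] = p;
        phase1_done[p] = 0;
        saw_all[p] = 0;
        pthread_create(&t[p], NULL, demo_worker, &ids[p]);
    }
    int ok = 0;
    for (int p = 0; p < NPROC; p++) {
        pthread_join(t[p], NULL);
        ok += saw_all[p];
    }
    return ok;
}
```

The round counter is what makes the barrier safe to reuse across iterations, such as one barrier per sweep of an iterative computation.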

SPMD Programming Approach: Single Program Multiple Data

You should learn this!
Most parallel programs written for commodity multicores use this style (all commodity multicores happen to be shared memory machines!)*
All processors run a copy of the same program (commonly a slightly modified version of the sequential program)
Processor-specific behavior created using unique processor IDs
Also need to introduce synchronization as necessary
Let's do a simple example to build intuition

*Note that, in general, the SPMD style of programming can be applied to either shared memory or message passing machines

Adding a Pair of Vectors: A Sequential Program

    #define LENGTH 1000000
    int a[LENGTH], b[LENGTH], c[LENGTH];
    int i = 0;

    main()
    {
        /* Initializations */
        ...
        /* read in the two vectors */
        ...
        i = 0;
        while (i < LENGTH) {
            c[i] = a[i] + b[i];
            i = i + 1;
        }
        /* output the answer */
        ...
    }

Sequential addition of two vectors

Parallel SPMD Version

Assume Ultracomputer model. Assume no caches, single word memory access.

    #define LENGTH 1000000
    int a[LENGTH], b[LENGTH], c[LENGTH];
    int i = 0;      /* shared: next index to hand out */
    int L = 0;      /* shared: lock variable */

    main()
    {
        /* create parallel processes */
        ...
        /* Assume each process runs the rest of the same program */

        /* Initializations: only process 0 runs this */
        if (mypid == 0) ...
        /* read in the two vectors: only process 0 runs this */
        if (mypid == 0) ...

        int myi;
        myi = getwork();            /* Get an index on which to work. */
        while (myi < LENGTH) {      /* test the private copy, not shared i */
            c[myi] = a[myi] + b[myi];
            myi = getwork();
        }

        /* output the answer: only process 0 runs this */
        if (mypid == 0) ...
    }

    int getwork()
    {
        int mine;
        getlock();
        mine = i;       /* claim the current index while holding the lock */
        i = i + 1;      /* increment is atomic */
        releaselock();
        return(mine);
    }

Example of self scheduling

[Figure labels: Pure; No caches, single word reads/writes; i=4]
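
A runnable approximation of the self-scheduled vector add, using pthreads on an ordinary shared memory machine, might look like the sketch below. It makes several assumptions that are not in the slides: a pthread mutex stands in for `getlock()`/`releaselock()`, `LENGTH` is reduced to keep the run short, `NPROC` is fixed at 4, the vectors are filled with made-up test values, and the helper names `spmd_body`, `spmd_vector_add`, and `spmd_total_work` are hypothetical.

```c
#include <pthread.h>

#define LENGTH 1000   /* reduced from the slide's 1000000 for a quick run */
#define NPROC  4      /* assumed number of processes (threads here) */

static int a[LENGTH], b[LENGTH], c[LENGTH];
static int i = 0;                                   /* shared next index */
static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static int work_count[NPROC];                       /* items done per process */

/* Hand out the next unclaimed index under the lock. */
static int getwork(void)
{
    pthread_mutex_lock(&L);
    int mine = i;
    i = i + 1;
    pthread_mutex_unlock(&L);
    return mine;
}

/* The part of the program that every process runs. */
static void *spmd_body(void *arg)
{
    int mypid = *(int *)arg;
    int myi = getwork();
    while (myi < LENGTH) {
        c[myi] = a[myi] + b[myi];
        work_count[mypid]++;
        myi = getwork();
    }
    return NULL;
}

/* Plays the role of process 0: initialize, start the others, wait for
   them, and check the answer. Returns the number of wrong elements. */
int spmd_vector_add(void)
{
    pthread_t t[NPROC];
    int pid[NPROC];

    i = 0;
    for (int k = 0; k < LENGTH; k++) {
        a[k] = k;          /* made-up input values */
        b[k] = 2 * k;
        c[k] = 0;
    }
    for (int p = 0; p < NPROC; p++) {
        pid[p] = p;
        work_count[p] = 0;
        pthread_create(&t[p], NULL, spmd_body, &pid[p]);
    }
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);

    int errors = 0;
    for (int k = 0; k < LENGTH; k++)
        if (c[k] != 3 * k)
            errors++;
    return errors;
}

/* Total items processed across all processes; should equal LENGTH. */
int spmd_total_work(void)
{
    int total = 0;
    for (int p = 0; p < NPROC; p++)
        total += work_count[p];
    return total;
}
```

How the work splits among the threads varies from run to run, which is the point of self scheduling: a faster process simply comes back for more indices.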

Lock: Using Test-and-Set Instruction

[Figure labels: Pure; i=4]
Example of spin lock

    void getlock()
    {
        while (T&S(L) == 1)
            ;               /* loop till you get the lock */
    }

    void releaselock()
    {
        L = 0;              /* release the lock */
    }

T&S(L): atomic read-write [Return old value; Write 1]
How to implement T&S in HW? In SW?

Test-and-Set Instruction Implementation

[Figure labels: Pure; i=4; old val; wrt 1]
(getlock() and releaselock() as on the previous slide)
T&S(L): atomic read-write [Return old value; Write 1]
How to implement T&S in SW? Dekker's Alg.
Problem: Lock is held for load-store cycle! Locks out even the lock releaser.
Can we do better? Ideas?
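
The T&S(L) notation above is not C, so here is a minimal sketch of the same spin lock in standard C11. It assumes `atomic_exchange` from `<stdatomic.h>` as the test-and-set primitive (it atomically writes a new value and returns the old one); the function name `test_and_set` is hypothetical.

```c
#include <stdatomic.h>

atomic_int L = 0;   /* lock variable: 0 = free, 1 = held */

/* T&S(L): atomically return the old value and write 1. */
int test_and_set(atomic_int *lock)
{
    return atomic_exchange(lock, 1);
}

void getlock(void)
{
    while (test_and_set(&L) == 1)
        ;           /* loop till you get the lock */
}

void releaselock(void)
{
    atomic_store(&L, 0);   /* release the lock */
}
```

Every trip around the loop in `getlock` is an atomic read-write to the shared location, which is the traffic problem the slide raises.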

Test & Test & Set

[Figure labels: Pure; i=4]

    void getlock()
    {
        do {
            while (L == 1)
                ;               /* test: spin on ordinary reads while held */
        } while (T&S(L) == 1);  /* looked free: try to take it, retry if we lost */
    }

    void releaselock()
    {
        L = 0;                  /* release the lock */
    }

T&S(L): atomic read-write [Read old value; Write 1]
Any other problems?

Backoff concept

[Figure labels: Pure; i=4]

    void getlock()
    {
        while (T&S(L) == 1) {   /* loop till you get the lock */
            while (L == 1)
                ;               /* introduce backoff here */
        }
    }

    void releaselock()
    {
        L = 0;                  /* release the lock */
    }

T&S(L): atomic read-write [Read old value; Write 1]
Can do exponential backoff, quadratic backoff, random backoff, etc.
We engineers love to optimize!
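
Combining test & test & set with exponential backoff could look like the following C11 sketch. The delay bounds `BACKOFF_MIN` and `BACKOFF_MAX`, the busy-wait `spin_delay`, and the helper `next_backoff` are all assumptions chosen for illustration; the slides do not give specific values or a delay mechanism.

```c
#include <stdatomic.h>

#define BACKOFF_MIN 1u      /* assumed initial delay, in loop iterations */
#define BACKOFF_MAX 1024u   /* assumed cap on the delay */

atomic_int L = 0;           /* lock variable: 0 = free, 1 = held */

/* Exponential backoff: double the delay, up to the cap. */
unsigned next_backoff(unsigned delay)
{
    unsigned doubled = delay * 2u;
    return doubled > BACKOFF_MAX ? BACKOFF_MAX : doubled;
}

/* Crude delay: count to `iterations` without touching shared memory. */
static void spin_delay(unsigned iterations)
{
    for (volatile unsigned k = 0; k < iterations; k++)
        ;
}

void getlock(void)
{
    unsigned delay = BACKOFF_MIN;
    for (;;) {
        while (atomic_load(&L) == 1)
            ;                           /* test: plain reads only */
        if (atomic_exchange(&L, 1) == 0)
            return;                     /* test-and-set succeeded */
        spin_delay(delay);              /* lost the race: back off */
        delay = next_backoff(delay);
    }
}

void releaselock(void)
{
    atomic_store(&L, 0);
}
```

Swapping `next_backoff` for a quadratic or randomized schedule changes only that one function.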

So, getting a work item is not so cheap after all, is it? Any ideas?

[Figure labels: Pure; i; L; i=4,5,6,7]
Coarse grain parallelism (versus fine grain parallelism): Get a block of 4 or 16 or more indices each time to amortize the overhead of locking

Jacobi, Same Basic Concept

Getwork() grabs an index to a row (e.g.)
Synchronization as before
Loads and stores to shared array
Finish row. How do I know when to start the next Jacobi iteration?
Use a barrier after you finish your row
Lots of communication over the network
And very energy inefficient
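
The coarse grain idea (grab a block of indices per lock acquisition) can be sketched as a variant of getwork(). The names `getwork_block`, `add_vectors_coarse`, `BLOCK`, and `lock_acquisitions` are hypothetical, `LENGTH` is reduced for a quick run, and the lock is a C11 `atomic_exchange` spin lock chosen for this sketch. The counter exists only to show how much locking is saved.

```c
#include <stdatomic.h>

#define LENGTH 1000   /* reduced from the slide's 1000000 */
#define BLOCK  16     /* indices claimed per lock acquisition */

int a[LENGTH], b[LENGTH], c[LENGTH];
int i = 0;                    /* shared: next unclaimed index */
atomic_int L = 0;             /* lock variable: 0 = free, 1 = held */
long lock_acquisitions = 0;   /* instrumentation, updated under the lock */

void getlock(void)     { while (atomic_exchange(&L, 1) == 1) ; }
void releaselock(void) { atomic_store(&L, 0); }

/* Claim up to BLOCK consecutive indices starting at *first.
   Returns how many were claimed; 0 means no work is left. */
int getwork_block(int *first)
{
    getlock();
    lock_acquisitions++;
    int start = i;
    int count = LENGTH - start;
    if (count > BLOCK)
        count = BLOCK;
    if (count < 0)
        count = 0;
    i = start + count;
    releaselock();
    *first = start;
    return count;
}

/* The loop each process would run. */
void add_vectors_coarse(void)
{
    int first, count;
    while ((count = getwork_block(&first)) > 0) {
        for (int k = first; k < first + count; k++)
            c[k] = a[k] + b[k];
    }
}
```

With these values a single process takes the lock 64 times (63 blocks of work plus one call that finds none), compared with roughly one acquisition per element when indices are handed out one at a time. The tradeoff is load balance: larger blocks mean a slow process can hold up more work at the end.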

32-bit energy costs in 40nm

[Figure label: Pure]
DRAM read: ~1000pJ
Send 1mm distance: ~10pJ
Cache read (small L1): ~10pJ
Register read: ~1pJ
Add: ~1pJ

Ideas? Caches!
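
A rough worked example shows why caches are the suggested fix. The sketch below estimates the energy of computing c[i] = a[i] + b[i] using the slide's approximate per-operation costs. It is a deliberately simplified model under stated assumptions: it counts only the two operand reads and the add, ignores the store, network traversal, and instruction fetch, and in the cached case assumes every read hits in a small L1. The function name is hypothetical.

```c
/* Approximate 32-bit energy costs from the slide, in picojoules (40nm). */
#define E_DRAM_READ_PJ  1000.0
#define E_CACHE_READ_PJ   10.0
#define E_ADD_PJ           1.0

/* Estimated energy in pJ to add n pairs of elements, given the cost of
   one operand read. Counts two reads and one add per element only. */
double vector_add_energy_pj(long n, double read_pj)
{
    return (double)n * (2.0 * read_pj + E_ADD_PJ);
}
```

Under this model one element costs about 2001 pJ when both operands come from DRAM and about 21 pJ when both hit in the L1, a ratio of roughly 95. Nearly all of the energy goes to moving data rather than to the add itself.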