Advanced MPI Programming
1 Advanced MPI Programming
Latest slides and code examples are available at
- Pavan Balaji, Argonne National Laboratory
- James Dinan, Intel Corp.
- Torsten Hoefler, ETH Zurich, Web: http://htor.inf.ethz.ch/
- Rajeev Thakur, Argonne National Laboratory
2 Hybrid Programming with Threads, Shared Memory, and GPUs
3 MPI and Threads
- MPI describes parallelism between processes (with separate address spaces)
- Thread parallelism provides a shared-memory model within a process
- OpenMP and Pthreads are common models
  - OpenMP provides convenient features for loop-level parallelism. Threads are created and managed by the compiler, based on user directives.
  - Pthreads provides more complex and dynamic approaches. Threads are created and managed explicitly by the user.
Advanced MPI, SC13 (11/17/2013)
4 Programming for Multicore
- Almost all chips are multicore these days
- Today's clusters often comprise multiple CPUs per node sharing memory, and the nodes themselves are connected by a network
- Common options for programming such clusters:
  - All MPI: MPI between processes both within a node and across nodes; MPI internally uses shared memory to communicate within a node
  - MPI + OpenMP: use OpenMP within a node and MPI across nodes
  - MPI + Pthreads: use Pthreads within a node and MPI across nodes
- The latter two approaches are known as "hybrid programming"
5 Hybrid Programming with MPI+Threads
[Figure: MPI-only programming, with Rank 0 and Rank 1 each running a single thread, vs. MPI+threads hybrid programming, with Rank 0 and Rank 1 each running multiple threads]
- In MPI-only programming, each MPI process has a single program counter
- In MPI+threads hybrid programming, there can be multiple threads executing simultaneously
  - All threads share all MPI objects (communicators, requests)
  - The MPI implementation might need to take precautions to make sure the state of the MPI stack is consistent
6 MPI's Four Levels of Thread Safety
- MPI defines four levels of thread safety; these are commitments the application makes to the MPI implementation:
  - MPI_THREAD_SINGLE: only one thread exists in the application
  - MPI_THREAD_FUNNELED: multithreaded, but only the main thread makes MPI calls (the one that called MPI_Init_thread)
  - MPI_THREAD_SERIALIZED: multithreaded, but only one thread at a time makes MPI calls
  - MPI_THREAD_MULTIPLE: multithreaded, and any thread can make MPI calls at any time (with some restrictions to avoid races; see next slide)
- Thread levels are in increasing order: if an application works in FUNNELED mode, it can work in SERIALIZED
- MPI defines an alternative to MPI_Init: MPI_Init_thread(requested, provided). The application gives the level it needs; the MPI implementation gives the level it supports.
7 MPI_THREAD_SINGLE
- There are no threads in the system
- E.g., there are no OpenMP parallel regions

    int main(int argc, char ** argv)
    {
        int i, rank, buf[100];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < 100; i++)
            compute(buf[i]);
        /* Do MPI stuff */
        MPI_Finalize();
        return 0;
    }
8 MPI_THREAD_FUNNELED
- All MPI calls are made by the master thread
  - Outside the OpenMP parallel regions
  - In OpenMP master regions

    int main(int argc, char ** argv)
    {
        int i, rank, buf[100], provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel for
        for (i = 0; i < 100; i++)
            compute(buf[i]);
        /* Do MPI stuff */
        MPI_Finalize();
        return 0;
    }
9 MPI_THREAD_SERIALIZED
- Only one thread can make MPI calls at a time
- Protected by OpenMP critical regions

    int main(int argc, char ** argv)
    {
        int i, rank, buf[100], provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel for
        for (i = 0; i < 100; i++) {
            compute(buf[i]);
    #pragma omp critical
            {
                /* Do MPI stuff */
            }
        }
        MPI_Finalize();
        return 0;
    }
10 MPI_THREAD_MULTIPLE
- Any thread can make MPI calls any time (restrictions apply)

    int main(int argc, char ** argv)
    {
        int i, rank, buf[100], provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel for
        for (i = 0; i < 100; i++) {
            compute(buf[i]);
            /* Do MPI stuff */
        }
        MPI_Finalize();
        return 0;
    }
11 Threads and MPI
- An implementation is not required to support levels higher than MPI_THREAD_SINGLE; that is, an implementation is not required to be thread safe
- A fully thread-safe implementation will support MPI_THREAD_MULTIPLE
- A program that calls MPI_Init (instead of MPI_Init_thread) should assume that only MPI_THREAD_SINGLE is supported
- A threaded MPI program that does not call MPI_Init_thread is an incorrect program (common user error we see)
12 Specification of MPI_THREAD_MULTIPLE
- Ordering: when multiple threads make MPI calls concurrently, the outcome will be as if the calls executed sequentially in some (any) order
  - Ordering is maintained within each thread
  - User must ensure that collective operations on the same communicator, window, or file handle are correctly ordered among threads
    - E.g., cannot call a broadcast on one thread and a reduce on another thread on the same communicator
  - It is the user's responsibility to prevent races when threads in the same application post conflicting MPI calls
    - E.g., accessing an info object from one thread and freeing it from another thread
- Blocking: blocking MPI calls will block only the calling thread and will not prevent other threads from running or executing MPI functions
13 Ordering in MPI_THREAD_MULTIPLE: Incorrect Example with Collectives

                 Process 0            Process 1
    Thread 1     MPI_Bcast(comm)      MPI_Bcast(comm)
    Thread 2     MPI_Barrier(comm)    MPI_Barrier(comm)

- P0 and P1 can have different orderings of Bcast and Barrier
- Here the user must use some kind of synchronization to ensure that either thread 1 or thread 2 gets scheduled first on both processes
- Otherwise a broadcast may get matched with a barrier on the same communicator, which is not allowed in MPI
14 Ordering in MPI_THREAD_MULTIPLE: Incorrect Example with RMA

    int main(int argc, char ** argv)
    {
        int i;
        /* Initialize MPI and RMA window */
    #pragma omp parallel for
        for (i = 0; i < 100; i++) {
            int target = rand();
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
            MPI_Put(..., win);
            MPI_Win_unlock(target, win);
        }
        /* Free MPI and RMA window */
        return 0;
    }

- Different threads can lock the same process, causing multiple locks to the same target before the first lock is unlocked
15 Ordering in MPI_THREAD_MULTIPLE: Incorrect Example with Object Management

                 Process 0             Process 1
    Thread 1     MPI_Bcast(comm)       MPI_Bcast(comm)
    Thread 2     MPI_Comm_free(comm)   MPI_Comm_free(comm)

- The user has to make sure that one thread is not using an object while another thread is freeing it
  - This is essentially an ordering issue; the object might get freed before it is used
16 Blocking Calls in MPI_THREAD_MULTIPLE: Correct Example

                 Process 0            Process 1
    Thread 1     MPI_Recv(src=1)      MPI_Recv(src=0)
    Thread 2     MPI_Send(dst=1)      MPI_Send(dst=0)

- An implementation must ensure that the above example never deadlocks for any ordering of thread execution
- That means the implementation cannot simply acquire a thread lock and block within an MPI function. It must release the lock to allow other threads to make progress.
17 Performance with MPI_THREAD_MULTIPLE
- Thread safety does not come for free
- The implementation must protect certain data structures or parts of code with mutexes or critical sections
- To measure the performance impact, we ran tests to measure communication performance when using multiple threads versus multiple processes
- For results, see the Thakur/Gropp paper: "Test Suite for Evaluating Performance of Multithreaded MPI Communication," Parallel Computing, 2009
18 Why is it hard to optimize MPI_THREAD_MULTIPLE
- MPI internally maintains several resources
- Because of MPI semantics, it is required that all threads have access to some of the data structures
  - E.g., thread 1 can post an Irecv, and thread 2 can wait for its completion; thus the request queue has to be shared between both threads
  - Since multiple threads are accessing this shared queue, it needs to be locked, which adds a lot of overhead
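The shared request queue above can be sketched in plain C with Pthreads (purely illustrative, not MPICH internals); every post and every completion must take the same lock, which is exactly the overhead the slide refers to:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative request queue shared by all threads in a process. */
typedef struct request { int id; struct request *next; } request_t;

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static request_t *queue_head = NULL;

/* e.g. thread 1 posting an Irecv */
static void post_request(request_t *req)
{
    pthread_mutex_lock(&queue_lock);   /* every post pays for the lock */
    req->next = queue_head;
    queue_head = req;
    pthread_mutex_unlock(&queue_lock);
}

/* e.g. thread 2 waiting on (completing) that request */
static request_t *complete_request(int id)
{
    pthread_mutex_lock(&queue_lock);   /* ...and so does every completion */
    request_t **p = &queue_head, *req = NULL;
    while (*p && (*p)->id != id)
        p = &(*p)->next;
    if (*p) { req = *p; *p = req->next; }
    pthread_mutex_unlock(&queue_lock);
    return req;
}
```

Because the lock is taken on every operation, it serializes all threads even when their requests are unrelated; this is why thread-optimized implementations invest in finer-grained or lock-free designs.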
19 Hybrid Programming: Correctness Requirements
- Hybrid programming with MPI+threads does not do much to reduce the complexity of thread programming
  - Your application still has to be a correct multithreaded application
  - On top of that, you also need to make sure you are correctly following MPI semantics
- Many commercial debuggers offer support for debugging hybrid MPI+threads applications (mostly for MPI+Pthreads and MPI+OpenMP)
20 Example of Problem with Threads
- Ptolemy is a framework for modeling, simulation, and design of concurrent, real-time, embedded systems
  - Developed at UC Berkeley (PI: Ed Lee)
- It is a rigorously tested, widely used piece of software
  - Ptolemy II was first released in 2000
- Yet, on April 26, 2004, four years after it was first released, the code deadlocked!
  - The bug was lurking for 4 years of widespread use and testing!
  - A faster machine or something that changed the timing caught the bug
- See "The Problem with Threads" by Ed Lee, IEEE Computer, 2006
21 An Example we encountered recently
- We received a bug report about a very simple multithreaded MPI program that hangs
  - Run with 2 processes
  - Each process has 2 threads
  - Both threads communicate with threads on the other process as shown in the next slide
- We spent several hours trying to debug MPICH before discovering that the bug is actually in the user's program
22 2 Processes, 2 Threads, Each Thread Executes this Code

    for (j = 0; j < 2; j++) {
        if (rank == 1) {
            for (i = 0; i < 2; i++)
                MPI_Send(NULL, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            for (i = 0; i < 2; i++)
                MPI_Recv(NULL, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        }
        else {  /* rank == 0 */
            for (i = 0; i < 2; i++)
                MPI_Recv(NULL, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &stat);
            for (i = 0; i < 2; i++)
                MPI_Send(NULL, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        }
    }
23 Intended Ordering of Operations

    Rank 0            Rank 1
    2 recvs (T1)      2 sends (T1)
    2 sends (T1)      2 recvs (T1)
    2 recvs (T1)      2 sends (T1)
    2 sends (T1)      2 recvs (T1)
    2 recvs (T2)      2 sends (T2)
    2 sends (T2)      2 recvs (T2)
    2 recvs (T2)      2 sends (T2)
    2 sends (T2)      2 recvs (T2)

- Every send matches a receive on the other rank
24 What Happened

                 Rank 0                                Rank 1
    Thread 1     2 recvs, 2 sends, 2 recvs, 2 sends    2 sends, 2 recvs, 2 sends, 2 recvs
    Thread 2     2 recvs, 2 sends, 2 recvs, 2 sends    2 sends, 2 recvs, 2 sends, 2 recvs

- All 4 threads stuck in receives because the sends from one iteration got matched with receives from the next iteration
25 Possible Ordering of Operations in Practice

    Rank 0            Rank 1
    2 recvs (T1)      2 sends (T1)
    2 sends (T1)      1 recv (T1)
    1 recv (T1)       1 recv (T1)
    1 recv (T1)       2 sends (T1)
    2 sends (T1)      2 recvs (T1)
    1 recv (T2)       2 sends (T2)
    1 recv (T2)       1 recv (T2)
    2 sends (T2)      1 recv (T2)
    2 recvs (T2)      2 sends (T2)
    2 sends (T2)      2 recvs (T2)

- Because the MPI operations can be issued in an arbitrary order across threads, all threads could block in a recv call
26 Hybrid Programming with Shared Memory
- MPI-3 allows different processes to allocate shared memory through MPI
  - MPI_Win_allocate_shared
- Uses many of the concepts of one-sided communication
- Applications can do hybrid programming using MPI or load/store accesses on the shared memory window
- Other MPI functions can be used to synchronize access to shared memory regions
- Can be simpler to program than threads
27 Creating Shared Memory Regions in MPI
[Figure: MPI_COMM_WORLD is split with MPI_Comm_split_type(MPI_COMM_TYPE_SHARED) into shared-memory communicators; MPI_Win_allocate_shared then creates a shared-memory window on each communicator]
28 Regular RMA windows vs. Shared memory windows
[Figure: in traditional RMA windows, P0 and P1 each access their own local memory with load/store and the other's with PUT/GET; in shared memory windows, P0 and P1 access all of the window memory with load/store]
- Shared memory windows allow application processes to directly perform load/store accesses on all of the window memory
  - E.g., x[100] = 10
- All of the existing RMA functions can also be used on such memory for more advanced semantics such as atomic operations
- Can be very useful when processes want to use threads only to get access to all of the memory on the node
  - You can create a shared memory window and put your shared data in it
29 Memory allocation and placement
- Shared memory allocation does not need to be uniform across processes
  - Processes can allocate a different amount of memory (even zero)
- The MPI standard does not specify where the memory would be placed (e.g., which physical memory it will be pinned to)
  - Implementations can choose their own strategies, though it is expected that an implementation will try to place shared memory allocated by a process "close to it"
- The total allocated shared memory on a communicator is contiguous by default
  - Users can pass the info hint alloc_shared_noncontig, which allows the MPI implementation to align memory allocations from each process to appropriate boundaries to assist with placement
30 Shared Arrays with Shared memory windows

    int main(int argc, char ** argv)
    {
        int rank, buf[100];
        MPI_Comm comm;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_split_type(..., MPI_COMM_TYPE_SHARED, ..., &comm);
        MPI_Win_allocate_shared(..., comm, ..., &win);
        MPI_Comm_rank(comm, &rank);

        MPI_Win_lock_all(0, win);
        /* copy data to local part of shared memory */
        MPI_Barrier(comm);
        /* use shared memory */
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }
33 Walkthrough of 2D Stencil Code with Shared Memory Windows
- Code can be downloaded from
Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS Teacher: Jan Kwiatkowski, Office 201/15, D-2 COMMUNICATION For questions, email to jan.kwiatkowski@pwr.edu.pl with 'Subject=your name.
More informationDebugging on Blue Waters
Debugging on Blue Waters Debugging tools and techniques for Blue Waters are described here with example sessions, output, and pointers to small test codes. For tutorial purposes, this material will work
More informationParallel Computing. Lecture 17: OpenMP Last Touch
CSCI-UA.0480-003 Parallel Computing Lecture 17: OpenMP Last Touch Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Some slides from here are adopted from: Yun (Helen) He and Chris Ding
More informationMark Pagel
Mark Pagel pags@cray.com New features in XT MPT 3.1 and MPT 3.2 Features as a result of scaling to 150K MPI ranks MPI-IO Improvements(MPT 3.1 and MPT 3.2) SMP aware collective improvements(mpt 3.2) Misc
More informationPARALLEL AND DISTRIBUTED COMPUTING
PARALLEL AND DISTRIBUTED COMPUTING 2010/2011 1 st Semester Recovery Exam February 2, 2011 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc. - Give your answers
More informationIntroduction to MPI. SHARCNET MPI Lecture Series: Part I of II. Paul Preney, OCT, M.Sc., B.Ed., B.Sc.
Introduction to MPI SHARCNET MPI Lecture Series: Part I of II Paul Preney, OCT, M.Sc., B.Ed., B.Sc. preney@sharcnet.ca School of Computer Science University of Windsor Windsor, Ontario, Canada Copyright
More informationFirst day. Basics of parallel programming. RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS
First day Basics of parallel programming RIKEN CCS HPC Summer School Hiroya Matsuba, RIKEN CCS Today s schedule: Basics of parallel programming 7/22 AM: Lecture Goals Understand the design of typical parallel
More informationMPI and CUDA. Filippo Spiga, HPCS, University of Cambridge.
MPI and CUDA Filippo Spiga, HPCS, University of Cambridge Outline Basic principle of MPI Mixing MPI and CUDA 1 st example : parallel GPU detect 2 nd example: heat2d CUDA- aware MPI, how
More informationSimple examples how to run MPI program via PBS on Taurus HPC
Simple examples how to run MPI program via PBS on Taurus HPC MPI setup There's a number of MPI implementations install on the cluster. You can list them all issuing the following command: module avail/load/list/unload
More informationMessage-Passing Computing
Chapter 2 Slide 41þþ Message-Passing Computing Slide 42þþ Basics of Message-Passing Programming using userlevel message passing libraries Two primary mechanisms needed: 1. A method of creating separate
More informationParallel Programming
Parallel Programming MPI Part 1 Prof. Paolo Bientinesi pauldj@aices.rwth-aachen.de WS17/18 Preliminaries Distributed-memory architecture Paolo Bientinesi MPI 2 Preliminaries Distributed-memory architecture
More informationHybrid Programming with MPI and OpenMP
Hybrid Programming with and OpenMP Fernando Silva and Ricardo Rocha Computer Science Department Faculty of Sciences University of Porto Parallel Computing 2017/2018 F. Silva and R. Rocha (DCC-FCUP) Programming
More informationCS4961 Parallel Programming. Lecture 19: Message Passing, cont. 11/5/10. Programming Assignment #3: Simple CUDA Due Thursday, November 18, 11:59 PM
Parallel Programming Lecture 19: Message Passing, cont. Mary Hall November 4, 2010 Programming Assignment #3: Simple CUDA Due Thursday, November 18, 11:59 PM Today we will cover Successive Over Relaxation.
More informationmith College Computer Science CSC352 Week #7 Spring 2017 Introduction to MPI Dominique Thiébaut
mith College CSC352 Week #7 Spring 2017 Introduction to MPI Dominique Thiébaut dthiebaut@smith.edu Introduction to MPI D. Thiebaut Inspiration Reference MPI by Blaise Barney, Lawrence Livermore National
More informationHybrid MPI/ OpenMP. Presentation. Martin Čuma Center for High Performance Computing University of Utah
Presentation Hybrid MPI/ OpenMP Martin Čuma Center for High Performance Computing University of Utah mcuma@chpc.utah.edu http://www.chpc.utah.edu Slide 1 Overview Single and multilevel parallelism. Example
More information30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy
Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How
More informationCS4961 Parallel Programming. Lecture 18: Introduction to Message Passing 11/3/10. Final Project Purpose: Mary Hall November 2, 2010.
Parallel Programming Lecture 18: Introduction to Message Passing Mary Hall November 2, 2010 Final Project Purpose: - A chance to dig in deeper into a parallel programming model and explore concepts. -
More informationIntroducing Task-Containers as an Alternative to Runtime Stacking
Introducing Task-Containers as an Alternative to Runtime Stacking EuroMPI, Edinburgh, UK September 2016 Jean-Baptiste BESNARD jbbesnard@paratools.fr Julien ADAM, Sameer SHENDE, Allen MALONY (ParaTools)
More informationProgramming with MPI. Pedro Velho
Programming with MPI Pedro Velho Science Research Challenges Some applications require tremendous computing power - Stress the limits of computing power and storage - Who might be interested in those applications?
More informationPoint-to-Point Communication. Reference:
Point-to-Point Communication Reference: http://foxtrot.ncsa.uiuc.edu:8900/public/mpi/ Introduction Point-to-point communication is the fundamental communication facility provided by the MPI library. Point-to-point
More informationFaculty of Electrical and Computer Engineering Department of Electrical and Computer Engineering Program: Computer Engineering
Faculty of Electrical and Computer Engineering Department of Electrical and Computer Engineering Program: Computer Engineering Course Number EE 8218 011 Section Number 01 Course Title Parallel Computing
More informationCS420: Operating Systems
Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing
More informationLittle Motivation Outline Introduction OpenMP Architecture Working with OpenMP Future of OpenMP End. OpenMP. Amasis Brauch German University in Cairo
OpenMP Amasis Brauch German University in Cairo May 4, 2010 Simple Algorithm 1 void i n c r e m e n t e r ( short a r r a y ) 2 { 3 long i ; 4 5 for ( i = 0 ; i < 1000000; i ++) 6 { 7 a r r a y [ i ]++;
More informationMPI MESSAGE PASSING INTERFACE
MPI MESSAGE PASSING INTERFACE David COLIGNON, ULiège CÉCI - Consortium des Équipements de Calcul Intensif http://www.ceci-hpc.be Outline Introduction From serial source code to parallel execution MPI functions
More informationBest Practice Guide for Writing MPI + OmpSs Interoperable Programs. Version 1.0, 3 rd April 2017
Best Practice Guide for Writing MPI + OmpSs Interoperable Programs Version 1.0, 3 rd April 2017 Copyright INTERTWinE Consortium 2017 Table of Contents 1 INTRODUCTION... 1 1.1 PURPOSE... 1 1.2 GLOSSARY
More informationParallel Programming Overview
Parallel Programming Overview Introduction to High Performance Computing 2019 Dr Christian Terboven 1 Agenda n Our Support Offerings n Programming concepts and models for Cluster Node Core Accelerator
More informationMPI Message Passing Interface
MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information
More information