CS 61C: Great Ideas in Computer Architecture. Synchronization, OpenMP


CS 61C: Great Ideas in Computer Architecture
Synchronization, OpenMP
Instructor: Miki Lustig

Review of Last Lecture
- Multiprocessor systems use shared memory (single address space)
- Cache coherence implements shared memory even with multiple copies in multiple caches
- Tracks state of blocks relative to other caches (e.g. MOESI protocol)
- False sharing a concern

Great Idea #4: Parallelism
- Leverage Parallelism & Achieve High Performance (Software and Hardware); today's lecture: Synchronization, OpenMP
- Parallel Requests: assigned to computer, e.g. search "Garcia" (Warehouse Scale Computer)
- Parallel Threads: assigned to core, e.g. lookup, ads (Computer)
- Parallel Instructions: >1 instruction @ one time, e.g. 5 pipelined instructions (Core)
- Parallel Data: >1 data item @ one time, e.g. add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
- Hardware descriptions: all gates functioning in parallel at same time (Logic Gates)
- (Diagram: Warehouse Scale Computer > Computer > Core, with Instruction Unit(s), Functional Unit(s), Cache Memory, Memory, Input/Output, Smart Phone, Logic Gates)

Agenda
- Synchronization - A Crash Course
- Administrivia
- OpenMP Introduction
- OpenMP Directives: Workshare, Synchronization
- Bonus: Common OpenMP Pitfalls

Data Races and Synchronization
- Two memory accesses form a data race if different threads access the same location, at least one is a write, and they occur one after another
- Means that the result of a program can vary depending on chance (which thread ran first?)
- Avoid data races by synchronizing writing and reading to get deterministic behavior
- Synchronization done by user-level routines that rely on hardware synchronization instructions

Analogy: Buying Milk
- Your fridge has no milk. You and your roommate will return from classes at some point and check the fridge
- Whoever gets home first will check the fridge, go and buy milk, and return
- What if the other person gets back while the first person is buying milk? You've just bought twice as much milk as you need
- It would've helped to have left a note
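To make the data-race definition above concrete, here is a minimal sketch (not from the slides, and it uses the OpenMP pragmas introduced later in this lecture; compile with gcc -fopenmp): two threads increment the same shared counter without synchronization, so the final value varies from run to run.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int counter = 0;                        /* shared by both threads */
        #pragma omp parallel num_threads(2)
        {
            for (int i = 0; i < 100000; i++)
                counter++;                      /* unsynchronized read-modify-write */
        }
        /* usually prints less than 200000, and a different value each run */
        printf("counter = %d\n", counter);
        return 0;
    }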

Lock Synchronization (1/2)
- Use a "Lock" to grant access to a region (critical section) so that only one thread can operate at a time
- Need all processors to be able to access the lock, so use a location in shared memory as the lock
- Processors read the lock and either wait (if locked) or set the lock and go into the critical section
- 0 means lock is free/open/unlocked/lock off
- 1 means lock is set/closed/locked/lock on

Lock Synchronization (2/2)
- Pseudocode:
    Check lock            <- can loop/idle here if locked
    Set the lock
    Critical section (e.g. change shared variables)
    Unset the lock

Possible Lock Implementation
- Lock (a.k.a. busy wait):
    Get_lock:                      # $s0 -> addr of lock
          addiu $t1,$zero,1        # t1 = Locked value
    Loop: lw    $t0,0($s0)         # load lock
          bne   $t0,$zero,Loop     # loop if locked
    Lock: sw    $t1,0($s0)         # Unlocked, so lock
- Unlock:
    Unlock: sw  $zero,0($s0)
- Any problems with this?

Possible Lock Problem
- Thread 1 and Thread 2 both execute:
    addiu $t1,$zero,1
    Loop: lw  $t0,0($s0)
          bne $t0,$zero,Loop
    Lock: sw  $t1,0($s0)
- If both threads load the lock as free before either stores to it, both threads think they have set the lock
- Exclusive access not guaranteed!

Hardware Synchronization
- Hardware support required to prevent an interloper (another thread) from changing the value
- Atomic read/write memory operation
- No other access to the location allowed between the read and write
- How best to implement in software?
  - Single instruction? Atomic swap of register and memory
  - Pair of instructions? One for read, one for write

Synchronization in MIPS
- Load linked:       ll rt,off(rs)
- Store conditional: sc rt,off(rs)
  - Returns 1 (success) if location has not changed since the ll
  - Returns 0 (failure) if location has changed
- Note that sc clobbers the register value being stored (rt)
  - Need to have a copy elsewhere if you plan on repeating on failure or using the value later
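For comparison with the MIPS sequences that follow, here is a hedged C-level sketch of the same busy-wait lock, using GCC's __sync_lock_test_and_set builtin (an atomic test-and-set) rather than hand-written assembly. This is an illustration, not the course's reference implementation.

    static volatile int lock = 0;                 /* 0 = free, 1 = set */

    void acquire(void) {
        /* atomically write 1 to the lock and get back the old value;
           if the old value was already 1, someone else holds it, so spin */
        while (__sync_lock_test_and_set(&lock, 1))
            ;                                     /* busy-wait */
    }

    void release(void) {
        __sync_lock_release(&lock);               /* atomically clear to 0 */
    }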

Synchronization in MIPS Example
- Atomic swap (to test/set lock variable)
- Exchange contents of register and memory: $s4 <-> Mem($s1)
    try: add $t0,$zero,$s4     # copy value
         ll  $t1,0($s1)        # load linked
         sc  $t0,0($s1)        # store conditional
         beq $t0,$zero,try     # loop if sc fails
         add $s4,$zero,$t1     # load value in $s4
- sc would fail if another thread executes sc here (between the ll and this sc)

Test-and-Set
- In a single atomic operation:
  - Test to see if a memory location is set (contains a 1)
  - Set it (to 1) if it isn't (it contained a zero when tested)
  - Otherwise indicate that the Set failed, so the program can try again
- While accessing, no other instruction can modify the memory location, including other Test-and-Set instructions
- Useful for implementing lock operations

Test-and-Set in MIPS
- Example: MIPS sequence for implementing a T&S at ($s1)
    Try: addiu $t0,$zero,1
         ll    $t1,0($s1)
         bne   $t1,$zero,Try
         sc    $t0,0($s1)
         beq   $t0,$zero,Try
    Locked:
         # critical section
    Unlock:
         sw $zero,0($s1)

Question: Consider the following code when executed concurrently by two threads. What possible values can result in *($s0)?
    # *($s0) = 100
    lw   $t0,0($s0)
    addi $t0,$t0,1
    sw   $t0,0($s0)
  (A) 101 or 102
  (B) 100, 101, or 102
  (C) 100 or 101

Agenda
- Synchronization - A Crash Course
- Administrivia
- OpenMP Introduction
- OpenMP Directives: Workshare, Synchronization
- Bonus: Common OpenMP Pitfalls

Administrivia
- Project 2: MapReduce
  - Part 1: Due Oct 27th
  - Part 2: Due Nov 2nd
- We received a grant from Amazon totalling >$60k for Project 2
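Returning to the synchronization slides above: the ll/sc retry pattern maps naturally onto a C11 compare-exchange loop. This is a hedged sketch, not from the slides; atomic_compare_exchange_weak may fail spuriously, just as sc can, so the code retries, like the beq $t0,$zero,try above.

    #include <stdatomic.h>

    /* atomically add delta to *loc, retrying until the update succeeds */
    void atomic_add(atomic_int *loc, int delta) {
        int old = atomic_load(loc);
        /* succeeds only if *loc still holds old; on failure, old is
           refreshed with the current value and we try again */
        while (!atomic_compare_exchange_weak(loc, &old, old + delta))
            ;
    }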

Agenda
- Synchronization - A Crash Course
- Administrivia
- OpenMP Introduction
- OpenMP Directives: Workshare, Synchronization
- Bonus: Common OpenMP Pitfalls

What is OpenMP?
- API used for multi-threaded, shared memory parallelism
  - Compiler Directives
  - Runtime Library Routines
  - Environment Variables
- Portable
- Standardized
- Resources: http://www.openmp.org/ and http://computing.llnl.gov/tutorials/openmp/

OpenMP Specification

Shared Memory Model with Explicit Thread-based Parallelism
- Multiple threads in a shared memory environment; explicit programming model with full programmer control over parallelization
- Pros:
  - Takes advantage of shared memory; programmer need not worry (that much) about data placement
  - Compiler directives are simple and easy to use
  - Legacy serial code does not need to be rewritten
- Cons:
  - Code can only be run in shared memory environments
  - Compiler must support OpenMP (e.g. gcc 4.2)

OpenMP in CS 61C
- OpenMP is built on top of C, so you don't have to learn a whole new programming language
  - Make sure to add #include <omp.h>
  - Compile with flag: gcc -fopenmp
  - Mostly just a few lines of code to learn
- You will NOT become experts at OpenMP
  - Use slides as reference, will learn to use in lab
- Key ideas:
  - Shared vs. Private variables
  - OpenMP directives for parallelization, work sharing, synchronization

OpenMP Programming Model
- Fork-Join Model:
- OpenMP programs begin as a single process (master thread) and execute sequentially until the first parallel region construct is encountered
- FORK: Master thread then creates a team of parallel threads
  - Statements in the program that are enclosed by the parallel region construct are executed in parallel among the various threads
- JOIN: When the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread

OpenMP Extends C with Pragmas
- Pragmas are a preprocessor mechanism C provides for language extensions
- Commonly implemented pragmas: structure packing, symbol aliasing, floating point exception modes (not covered in 61C)
- Good mechanism for OpenMP because compilers that don't recognize a pragma are supposed to ignore them
  - Runs on a sequential computer even with embedded pragmas

parallel Pragma and Scope
- Basic OpenMP construct for parallelization:
    #pragma omp parallel
    {
        /* code goes here */
    }
  - Each thread runs a copy of the code within the block
  - Thread scheduling is non-deterministic
- OpenMP default is shared variables
  - To make a variable private, need to declare it with the pragma:
    #pragma omp parallel private (x)
- This is annoying, but the curly brace MUST go on a separate line from #pragma

Thread Creation
- How many threads will OpenMP create?
- Defined by the OMP_NUM_THREADS environment variable (or code procedure call)
  - Set this variable to the maximum number of threads you want OpenMP to use
  - Usually equals the number of cores in the underlying hardware on which the program is run

OMP_NUM_THREADS
- OpenMP intrinsic to set number of threads:
    omp_set_num_threads(x);
- OpenMP intrinsic to get number of threads:
    num_th = omp_get_num_threads();
- OpenMP intrinsic to get Thread ID number:
    th_id = omp_get_thread_num();

Parallel Hello World
    #include <stdio.h>
    #include <omp.h>

    int main() {
        int nthreads, tid;
        /* Fork team of threads with private var tid */
        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();   /* get thread id */
            printf("Hello World from thread = %d\n", tid);
            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }   /* All threads join master and terminate */
        return 0;
    }

Agenda
- Synchronization - A Crash Course
- Administrivia
- OpenMP Introduction
- OpenMP Directives: Workshare, Synchronization
- Bonus: Common OpenMP Pitfalls
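A short usage sketch (not from the slides) showing the intrinsics above used together; setting OMP_NUM_THREADS=4 in the environment before running would have the same effect as the omp_set_num_threads call, and the output order is non-deterministic.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);                 /* request 4 threads */
        #pragma omp parallel
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(),        /* this thread's ID, 0..3 */
                   omp_get_num_threads());      /* team size inside the region */
        }
        return 0;
    }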

OpenMP Directives (Work-Sharing)
- These are defined within a parallel section
- for: shares iterations of a loop across the threads
- sections: each section is executed by a separate thread
- single: serializes the execution of a thread

Parallel Statement Shorthand
    #pragma omp parallel
    {
        #pragma omp for
        for(i=0;i<len;i++) { ... }
    }
  can be shortened to:
    #pragma omp parallel for
    for(i=0;i<len;i++) { ... }
  (when the for loop is the only directive in the parallel section)
- Also works for sections

Building Block: for loop
- Parallel for pragma:
    #pragma omp parallel for
    for (i=0; i<max; i++) zero[i] = 0;
- Breaks the for loop into chunks, and allocates each to a separate thread
  - e.g. if max = 100 with 2 threads: assign 0-49 to thread 0, and 50-99 to thread 1
- Must have a relatively simple "shape" for an OpenMP-aware compiler to be able to parallelize it
  - Necessary for the run-time system to be able to determine how many of the loop iterations to assign to each thread
- No premature exits from the loop allowed, i.e. no break, return, exit, goto statements
  - In general, don't jump outside of any pragma block

#pragma omp parallel for
    for (i=0; i<max; i++) zero[i] = 0;
- Master thread creates additional threads, each with a separate execution context
- All variables declared outside the for loop are shared by default, except for the loop index, which is private per thread (Why?)
- Implicit synchronization at end of for loop
- Divide index regions sequentially per thread
  - Thread 0 gets 0, 1, ..., (max/n)-1; Thread 1 gets max/n, max/n+1, ..., 2*(max/n)-1
  - Why?

OpenMP Timing
- Elapsed wall clock time:
    double omp_get_wtime(void);
- Returns elapsed wall clock time in seconds
- Time is measured per thread; no guarantee can be made that two distinct threads measure the same time
- Time is measured from "some time in the past", so subtract the results of two calls to omp_get_wtime to get elapsed time

Matrix Multiply in OpenMP
    start_time = omp_get_wtime();
    #pragma omp parallel for private(tmp, i, j, k)
    for (i=0; i<Mdim; i++){            /* outer loop spread across N threads */
        for (j=0; j<Ndim; j++){        /* inner loops run inside a single thread */
            tmp = 0.0;
            for( k=0; k<Pdim; k++){
                /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
                tmp += *(A+(i*Pdim+k)) * *(B+(k*Ndim+j));
            }
            *(C+(i*Ndim+j)) = tmp;
        }
    }
    run_time = omp_get_wtime() - start_time;
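As a hedged illustration of the chunking described above (not from the slides): printing the thread ID for each iteration makes the division of work visible. The exact split is up to the implementation, but with 2 threads and max = 8, thread 0 typically handles iterations 0-3 and thread 1 handles 4-7.

    #include <stdio.h>
    #include <omp.h>
    #define MAX 8

    int main(void) {
        omp_set_num_threads(2);
        #pragma omp parallel for             /* loop index i is private per thread */
        for (int i = 0; i < MAX; i++)
            printf("iteration %d run by thread %d\n", i, omp_get_thread_num());
        return 0;
    }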

Notes on Matrix Multiply Example
- More performance optimizations available:
  - Higher compiler optimization (-O2, -O3) to reduce number of instructions executed
  - Cache blocking to improve memory performance
  - Using SIMD SSE instructions to raise floating point computation rate (DLP)

OpenMP Directives (Synchronization)
- These are defined within a parallel section
- master: code block executed only by the master thread (all other threads skip)
- critical: code block executed by only one thread at a time
- atomic: specific memory location must be updated atomically (like a mini-critical section for writing to memory)
  - Applies to a single statement, not a code block

OpenMP Reduction
- Reduction specifies that one or more private variables are the subject of a reduction operation at the end of the parallel region
- Clause: reduction(operation:var)
  - Operation: operator to perform on the variables at the end of the parallel region
  - Var: one or more variables on which to perform scalar reduction
    #pragma omp for reduction(+:nsum)
    for (i = START ; i <= END ; i++)
        nsum += i;

Summary
- Data races lead to subtle parallel bugs
- Synchronization via hardware primitives:
  - MIPS does it with Load Linked + Store Conditional
- OpenMP as a simple parallel extension to C
  - During parallel fork, be aware of which variables should be shared vs. private among threads
  - Work-sharing accomplished with for/sections
  - Synchronization accomplished with critical/atomic/reduction

Agenda
- You are responsible for the material contained on the following slides, though we may not have enough time to get to them in lecture. They have been prepared in a way that should be easily readable, and the material will be touched upon in the following lecture.
- Synchronization - A Crash Course
- Administrivia
- OpenMP Introduction
- OpenMP Directives: Workshare, Synchronization
- Bonus: Common OpenMP Pitfalls
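A hedged sketch (not from the slides) contrasting the synchronization directives above on the same global sum: the reduction version gives each thread a private partial sum that is combined at the join, while the atomic version serializes every update of the shared variable and is usually much slower.

    #include <stdio.h>
    #include <omp.h>
    #define N 1000000

    int main(void) {
        static double a[N];                     /* assume a[] gets filled in */
        double sum = 0.0, sum2 = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];                        /* per-thread partial sums */

        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            #pragma omp atomic                  /* one location updated atomically */
            sum2 += a[i];
        }

        printf("%f %f\n", sum, sum2);
        return 0;
    }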

OpenMP Pitfall #1: Data Dependencies
- Consider the following code:
    a[0] = 1;
    for(i=1; i<5000; i++)
        a[i] = i + a[i-1];
- There are dependencies between loop iterations
- Splitting this loop between threads does not guarantee in-order execution
- Out-of-order loop execution will result in undefined behavior (i.e. likely wrong result)

OpenMP Pitfall #2: Sharing Issues
- Consider the following loop:
    #pragma omp parallel for
    for(i=0; i<n; i++){
        temp = 2.0*a[i];
        a[i] = temp;
        b[i] = c[i]/temp;
    }
- temp is a shared variable! Fix: declare temp private so each thread gets its own copy:
    #pragma omp parallel for private(temp)
    for(i=0; i<n; i++){
        temp = 2.0*a[i];
        a[i] = temp;
        b[i] = c[i]/temp;
    }

OpenMP Pitfall #3: Updating Shared Variables Simultaneously
- Now consider a global sum:
    for(i=0; i<n; i++)
        sum = sum + a[i];
- This can be done by surrounding the summation by a critical/atomic section or a reduction clause:
    #pragma omp parallel for reduction(+:sum)
    for(i=0; i<n; i++)
        sum = sum + a[i];
- Compiler can generate highly efficient code for reduction

OpenMP Pitfall #4: Parallel Overhead
- Spawning and releasing threads results in significant overhead
- Better to have fewer but larger parallel regions
  - Parallelize over the largest loop that you can (even though it will involve more work to declare all of the private variables and eliminate dependencies)

OpenMP Pitfall #4: Parallel Overhead (continued)
    start_time = omp_get_wtime();
    for (i=0; i<Ndim; i++){
        for (j=0; j<Mdim; j++){
            tmp = 0.0;
            /* Poor choice of loop to parallelize: too much overhead in thread
               generation to have this statement run this frequently */
            #pragma omp parallel for reduction(+:tmp)
            for( k=0; k<Pdim; k++){
                /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
                tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
            }
            *(C+(i*Ndim+j)) = tmp;
        }
    }
    run_time = omp_get_wtime() - start_time;
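The fix suggested above, as a hedged sketch: move the fork to the outermost loop (essentially the matrix-multiply slide from earlier), so threads are created once instead of once per (i, j) pair. The fragment reuses the declarations and dimension names from the pitfall code above.

    start_time = omp_get_wtime();
    #pragma omp parallel for private(tmp, i, j, k)   /* fork once, around the i loop */
    for (i=0; i<Ndim; i++){
        for (j=0; j<Mdim; j++){
            tmp = 0.0;
            for( k=0; k<Pdim; k++){
                /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
                tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
            }
            *(C+(i*Ndim+j)) = tmp;
        }
    }
    run_time = omp_get_wtime() - start_time;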