Parallel and Distributed Computing

1 Parallel and Distributed Computing
NUMA; OpenCL; MapReduce
José Monteiro
MSc in Information Systems and Computer Engineering
DEA in Computational Engineering
Department of Computer Science and Engineering (DEI)
Instituto Superior Técnico
November 26, 2008
José Monteiro (DEI / IST) Parallel and Distributed Computing / 26

2 Outline
- Cache Coherent NUMA
- AMD Opteron
- IBM Cell Broadband Engine
- programming NUMA systems
- OpenCL
- MapReduce

3 Shared-Memory Systems
- also known as Uniform Memory Access (UMA) architecture
- Symmetric Shared-Memory Multiprocessors (SMP)
[Diagram: processors (P) connected to a single shared main memory and I/O]

4 Distributed-Memory Systems
- or Non-Uniform Memory Access (NUMA) architecture
- Multicomputers
[Diagram: four nodes, each with a processor (P), cache, main memory and I/O, connected by an interconnection network]

8 Cache-Coherent NUMA
- Limitations of UMA / SMP: limited scalability (8 to 12 cores), due to contention in accessing memory.
- Limitations of multicomputers: high communication overhead.
- Intermediate solution: Cache-Coherent NUMA, or ccNUMA (also known as Distributed Shared Memory, DSM).
- Examples: IBM Cell, AMD Opteron

9 Cache-Coherent NUMA
[Diagram: four nodes, each with a processor (P), cache, main memory and I/O, connected by an interconnection network]

10 Cache-Coherent NUMA
[Diagram: four processors (P), each with its own cache and main memory, plus I/O]
- highly scalable
- memory bandwidth grows with computational power
- cache coherence possible due to shared global bus

11 AMD Opteron
- each AMD Opteron chip has its own memory controller, allowing for easy system extension
- each node may be single- or multi-core
- each node has L1 and L2 caches

12 IBM Cell Broadband Engine
Heterogeneous multiprocessor:
- Power Processing Element (PPE): master processor
- 8 Synergistic Processing Elements (SPE): fully functional RISC processors
- local storage size per SPE: 256 kB
- SPEs can only access their own local memory

13 NUMA Aware Systems
For optimal performance on NUMA systems:
- processes should be located on processors that are as close as possible to the memory they access
- allocate all memory for a process in the memory of the same processor
- the OS should use a multi-queue scheduler, with one run queue per processor
- dispatch all child processes on the same processor throughout the life of the parent process
Linux and Windows are NUMA ready.

16 NUMA Aware Systems
On Linux, numactl defines scheduling and/or memory placement policy:
- numactl --interleave=all bigdatabase
  runs bigdatabase with its memory interleaved across all nodes.
- numactl --cpubind=0 --membind=0,1 process
  runs process on node 0, with memory allocated on nodes 0 and 1.
- numactl --show
  shows the NUMA policy settings of the current process.

17 Programming NUMA Systems
On Linux, libnuma (distributed with the numactl package) provides a simple programming interface to NUMA systems.
  #include <numa.h>
  gcc ... -lnuma
Defines policies for:
- thread binding
- memory allocation
Before any other routine is used, int numa_available() must be called. If it returns -1, all other functions in this library are undefined.

19 Programming NUMA Systems
Querying the system:
- int numa_max_node()
  returns the highest node number in the system (with nodes numbered contiguously from 0, the number of nodes is this value plus one)
- long numa_node_size(int node, long *freep)
  returns the memory size of node node, and stores the currently free memory in freep
- int numa_distance(int node1, int node2)
  reports the distance in the machine topology between two nodes. The factors are a multiple of 10. It returns 0 when the distance cannot be determined. A node has distance 10 to itself.
Thread binding:
- int numa_run_on_node(int node)
  binds the current thread and its children to node node (for a set of nodes, a nodemask can be specified)
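
The distance values described above can be used to decide where to place work or memory. The following is a minimal sketch in plain C, assuming the distances have already been read (for instance with numa_distance()) into an array; it does not call libnuma itself. The names node_count_from_max, relative_cost and closest_remote_node are hypothetical helpers introduced for illustration, and reading a distance of 20 as "about twice the local cost" is an interpretation of the 10-based scale, not a guarantee.

```c
/* Number of nodes implied by numa_max_node(), which reports the highest
 * node number. Assumes nodes are numbered contiguously 0..max_node. */
int node_count_from_max(int max_node) {
    return max_node + 1;
}

/* Relative cost implied by a distance value: 10 is the local baseline,
 * so 20 reads as roughly twice the local cost.
 * Returns 0.0 when the distance is unknown (reported as 0). */
double relative_cost(int distance) {
    return distance > 0 ? distance / 10.0 : 0.0;
}

/* Pick the closest node other than `self` from one row of distances.
 * dist[j] is assumed to hold what numa_distance(self, j) would report.
 * Entries equal to 0 (distance unknown) are skipped.
 * Returns -1 if no other node has a known distance. */
int closest_remote_node(const int *dist, int node_count, int self) {
    int best = -1;
    for (int j = 0; j < node_count; j++) {
        if (j == self || dist[j] <= 0) continue;
        if (best < 0 || dist[j] < dist[best]) best = j;
    }
    return best;
}
```

A caller could fill dist by looping over j and calling numa_distance(self, j), then pass the chosen node to numa_run_on_node() or to an allocation routine.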

21 Programming NUMA Systems
Memory allocation:
- void *numa_alloc_onnode(size_t size, int node)
  allocates size bytes of memory on a specific node node
- void *numa_alloc_local(size_t size)
  allocates size bytes of memory on the local node
- void *numa_alloc_interleaved(size_t size)
  allocates size bytes of memory, page interleaved on all nodes
- void *numa_alloc(size_t size)
  allocates size bytes of memory with the current NUMA policy
- void numa_free(void *start, size_t size)
  frees size bytes of memory starting at start, allocated by the numa_alloc_* functions above
Node masks:
- define a subset of nodes to which thread binding and memory allocation apply
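
To make node masks and page interleaving concrete, here is a small model in plain C. It is a sketch under stated assumptions, not libnuma's implementation: node_mask, mask_add, mask_has, mask_count and interleaved_node are hypothetical names, the mask is limited to 64 nodes, and pages are assumed to be spread round-robin over the nodes in the mask starting from the lowest one (the real starting node and order are up to the operating system).

```c
#include <stdint.h>

/* Hypothetical node mask: bit n set means node n belongs to the set.
 * libnuma has its own mask types; this only models the idea. */
typedef uint64_t node_mask;

node_mask mask_add(node_mask m, int node) { return m | ((uint64_t)1 << node); }
int mask_has(node_mask m, int node) { return (int)((m >> node) & 1u); }

/* Count the nodes in the mask by clearing the lowest set bit each step. */
int mask_count(node_mask m) {
    int n = 0;
    for (; m != 0; m &= m - 1) n++;
    return n;
}

/* Round-robin page placement over the nodes in the mask:
 * page 0 goes to the lowest node in the mask, page 1 to the next, and so on.
 * Returns -1 for an empty mask. */
int interleaved_node(node_mask m, unsigned long page_index) {
    int members = mask_count(m);
    if (members == 0) return -1;
    unsigned long target = page_index % (unsigned long)members;
    for (int node = 0; node < 64; node++) {
        if (mask_has(m, node)) {
            if (target == 0) return node;
            target--;
        }
    }
    return -1; /* not reached for a non-empty mask */
}
```

With a mask holding nodes 0 and 2, consecutive pages alternate between node 0 and node 2, which is the effect interleaved allocation aims for: memory bandwidth is spread over several nodes instead of concentrated on one.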

24 Future Architecture Trends
- CPUs with increased number of SMP cores
  Examples: Intel Core 2 Quad; AMD Bulldozer; Sun SPARC Rock.
- GPUs with increased number of SIMD cores
  Example: NVIDIA GTX 280 / GTX 260
- CPU / GPU convergence
  Examples: Intel Larrabee; AMD / ATI Fusion
Future trend is:
- many simple cores
- each core with vector (SIMD) capabilities (of growing length)

27 Parallel Programming Models
- OpenMP, PThreads for SMP systems
- libnuma for ccNUMA systems
- CUDA for NVIDIA's GPUs; CTM (Close To Metal) for ATI's GPUs
- Message Passing Interface, MPI
This disparate set of models makes it challenging to target algorithms so that they optimally exploit the available computational power.
Parallel algorithms need to address combinations of these models!

28 OpenCL
Many issues are common to all parallel programming models!
OpenCL is a cross-platform language recently proposed for data (and task) parallel programming on both GPUs and CPUs.
OpenCL was created by Apple in cooperation with others, and will be an open standard administered by the Khronos Group.

34 OpenCL
- Enable use of all computational resources in a system
  - program GPUs, CPUs, Cell, DSP and other processors as peers
  - support both data- and task-parallel compute models
- Low-level, high-performance abstraction, but with silicon portability
  - approachable, but primarily targeted at expert developers
  - ecosystem foundation: no middleware or convenience functions
- Efficient C-based parallel programming model
  - familiar language for rapid adoption
- Close integration with OpenGL and other 3D APIs
  - for advanced visualization applications and innovation
- Enable embedded and handheld devices
  - through an embedded profile in the specification
- Drive future hardware requirements
  - e.g. floating point precision requirements

35 MapReduce Paradigm
MapReduce: a simple programming model, proposed by Google, motivated by large-scale data processing and applicable to many computing problems.
MapReduce provides:
- automatic parallelization and distribution
- fault tolerance
- I/O scheduling
- status and monitoring

36 MapReduce Usage
Programmer specifies two functions:
- map (in_key, in_value) -> list(out_key, intermediate_value)
  - processes an input key/value pair
  - produces a set of intermediate pairs
- reduce (out_key, list(intermediate_value)) -> list(out_value)
  - combines all intermediate values for a particular key
  - produces a set of merged output values (usually just one)

37 Example Map and Reduce Functions
  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
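
The word-count pseudocode above can be tried out on a single machine. The following is a minimal in-memory sketch in C, not the MapReduce library's API: the types pair and pair_list, the fixed capacity limits, and the functions map_words, reduce_counts and count_of are all assumptions made for illustration. Counts are kept as integers rather than strings, and the grouping of values by key (done by the framework in a real run) is emulated by sorting the intermediate pairs.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 256   /* illustrative capacity limits */
#define MAX_WORD 32

/* One (key, value) pair, as produced by map or by reduce. */
typedef struct { char key[MAX_WORD]; int value; } pair;
typedef struct { pair items[MAX_PAIRS]; int count; } pair_list;

static void emit(pair_list *out, const char *key, int value) {
    if (out->count >= MAX_PAIRS) return;          /* sketch: drop on overflow */
    strncpy(out->items[out->count].key, key, MAX_WORD - 1);
    out->items[out->count].key[MAX_WORD - 1] = '\0';
    out->items[out->count].value = value;
    out->count++;
}

/* map: emit (word, 1) for every whitespace-separated word in the document. */
void map_words(const char *contents, pair_list *intermediate) {
    char word[MAX_WORD];
    int len = 0;
    for (const char *p = contents; ; p++) {
        if (*p != '\0' && !isspace((unsigned char)*p)) {
            if (len < MAX_WORD - 1) word[len++] = *p;   /* truncate long words */
        } else {
            if (len > 0) { word[len] = '\0'; emit(intermediate, word, 1); len = 0; }
            if (*p == '\0') break;
        }
    }
}

static int by_key(const void *a, const void *b) {
    return strcmp(((const pair *)a)->key, ((const pair *)b)->key);
}

/* Group + reduce: sort so equal keys are adjacent, then sum each run of values. */
void reduce_counts(pair_list *intermediate, pair_list *output) {
    qsort(intermediate->items, (size_t)intermediate->count, sizeof(pair), by_key);
    int i = 0;
    while (i < intermediate->count) {
        int j = i, total = 0;
        while (j < intermediate->count &&
               strcmp(intermediate->items[j].key, intermediate->items[i].key) == 0) {
            total += intermediate->items[j].value;
            j++;
        }
        emit(output, intermediate->items[i].key, total);
        i = j;
    }
}

/* Look up the count for one word in the reduced output (0 if absent). */
int count_of(const pair_list *output, const char *word) {
    for (int i = 0; i < output->count; i++)
        if (strcmp(output->items[i].key, word) == 0) return output->items[i].value;
    return 0;
}
```

Both lists must start with count set to 0. In a distributed run, map_words would execute on many input splits in parallel and each reduce worker would receive only the keys assigned to it; the sort here stands in for that grouping step.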

38 MapReduce Execution Overview
[Figure: MapReduce execution overview]

42 MapReduce Examples
- Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
- Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
- Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.
- Inverted Index: The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
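
As an illustration of the first example, distributed grep can be sketched in a few lines of C. This is a single-process model under stated assumptions: line_list, grep_map, grep_reduce and distributed_grep are hypothetical names, the "pattern" is treated as a plain substring (via strstr) rather than a regular expression, and the loop over input lines stands in for map tasks that would run on different workers.

```c
#include <string.h>

#define MAX_LINES 64   /* illustrative capacity limit */

/* Pointers to emitted lines, in emission order. */
typedef struct { const char *lines[MAX_LINES]; int count; } line_list;

/* map: emit the line if it contains the pattern (substring match here). */
void grep_map(const char *line, const char *pattern, line_list *intermediate) {
    if (strstr(line, pattern) != NULL && intermediate->count < MAX_LINES)
        intermediate->lines[intermediate->count++] = line;
}

/* reduce: identity, copies the intermediate data to the output unchanged. */
void grep_reduce(const line_list *intermediate, line_list *output) {
    for (int i = 0; i < intermediate->count && output->count < MAX_LINES; i++)
        output->lines[output->count++] = intermediate->lines[i];
}

/* Driver: maps every input line, then reduces. Returns the number of matches.
 * In a real run each input split would be mapped by a different worker. */
int distributed_grep(const char *input[], int n, const char *pattern,
                     line_list *output) {
    line_list intermediate;
    intermediate.count = 0;
    output->count = 0;
    for (int i = 0; i < n; i++) grep_map(input[i], pattern, &intermediate);
    grep_reduce(&intermediate, output);
    return output->count;
}
```

The other examples follow the same shape; only the pair emitted by map and the way reduce combines the values for one key change.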

43 MapReduce Fault Tolerance
On worker failure:
- detect failure via periodic heartbeats
- re-execute completed and in-progress map tasks
- re-execute in-progress reduce tasks
- task completion committed through master
Master failure:
- could handle, but don't yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine
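
The re-execution rule on worker failure can be written down as a small piece of master-side bookkeeping. The sketch below is a hedged model in C, not Google's implementation: the task struct, its enums and reschedule_after_failure are hypothetical names. The asymmetry it encodes follows Google's published description of MapReduce: completed map output is stored on the failed worker's local disk and is lost with it, while completed reduce output is already stored in the global file system.

```c
typedef enum { TASK_MAP, TASK_REDUCE } task_kind;
typedef enum { IDLE, IN_PROGRESS, COMPLETED } task_state;

typedef struct {
    task_kind kind;
    task_state state;
    int worker;      /* worker the task is or was assigned to; -1 if none */
} task;

/* Called by the master when `failed_worker` stops answering heartbeats.
 * Resets every task that must run again and returns how many were reset. */
int reschedule_after_failure(task *tasks, int n, int failed_worker) {
    int reset = 0;
    for (int i = 0; i < n; i++) {
        task *t = &tasks[i];
        if (t->worker != failed_worker || t->state == IDLE) continue;
        /* Map tasks rerun even if completed; reduce tasks only if in progress. */
        if (t->kind == TASK_MAP || t->state == IN_PROGRESS) {
            t->state = IDLE;     /* eligible for scheduling on another worker */
            t->worker = -1;
            reset++;
        }
    }
    return reset;
}
```

Tasks reset to IDLE would then be handed to healthy workers by the master's normal scheduling loop, which is outside the scope of this sketch.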

44 Review
- Cache Coherent NUMA
- AMD Opteron
- IBM Cell Broadband Engine
- programming NUMA systems
- OpenCL
- MapReduce

45 Next Classes
Efficient parallelization of common algorithms:
- sorting
- search
- numerical algorithms
